J.P. Morgan interview question

How do you deal with imbalanced data in classification?

Interview Answer

Anonymous

Aug 8, 2020

From the data perspective, oversampling and undersampling are the techniques that can be used. If the majority class has a lot of data (say 10 million samples), undersampling can be used, but it generally poses a risk of losing information. It is therefore often preferable to use an oversampling algorithm such as SMOTE, which increases the number of minority-class samples by generating synthetic ones. From the algorithm perspective, one should be careful with Random Forest and neural network techniques, which tend to favor the majority class unless class weights or resampling are applied, and stick to techniques like SVM. If the data is extremely imbalanced, with a class ratio of say 1:100, consider anomaly detection techniques such as one-class SVM.
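To make the two resampling ideas above concrete, here is a minimal sketch in plain Python. It is a simplified, SMOTE-style interpolation and not the reference SMOTE implementation; the function names `random_undersample` and `smote_like_oversample`, the toy 2-D data, and the choice of `k=3` are all assumptions made for illustration. In practice one would more likely reach for a library such as imbalanced-learn.

```python
import math
import random


def random_undersample(majority, target_size, rng):
    """Keep a random subset of the majority class (the rest is discarded)."""
    return rng.sample(majority, target_size)


def smote_like_oversample(minority, n_new, k, rng):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy data (hypothetical): 1000 majority points vs 10 minority points.
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
minority = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(10)]

# Shrink the majority class and grow the minority class to 100 samples each.
majority_small = random_undersample(majority, 100, rng)
minority_big = minority + smote_like_oversample(minority, 90, k=3, rng=rng)

print(len(majority_small), len(minority_big))  # 100 100
```

Undersampling here throws away 900 of the 1000 majority points, which is the information loss mentioned above, while the oversampling step only adds points between existing minority samples, so it cannot create information that the minority class does not already contain.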