Imagine if you could get all the tips and tricks you need to tackle a binary classification problem on Kaggle or anywhere else. I have gone through 10 Kaggle competitions, including:
- Toxic Comment Classification Challenge $35,000
- TalkingData AdTracking Fraud Detection Challenge $25,000
- IEEE-CIS Fraud Detection $20,000
- Jigsaw Multilingual Toxic Comment Classification $50,000
- RSNA Intracranial Hemorrhage Detection $25,000
- SIIM-ACR Pneumothorax Segmentation $30,000
- Jigsaw Unintended Bias in Toxicity Classification $65,000
- Santander Customer Transaction Prediction $65,000
- Microsoft Malware Prediction $25,000
- Humpback Whale Identification $25,000
– and pulled out that information for you.
Dive in.
Modeling
- Use two BiGru layers feeding into two final Dense layers
- Pick hyperparameters by taking the best of 250 runs with Bayesian optimization
- Use a 2-level bidirectional GRU followed by max-pooling and 2 fully-connected layers (see the sketch after this list)
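To make the architecture tips concrete, here is a minimal Keras sketch of a 2-level bidirectional GRU with max-pooling and two fully-connected layers. The vocabulary size, sequence length, and unit counts are illustrative assumptions, not values from any winning solution:

```python
# Minimal sketch: 2-level BiGRU -> global max-pooling -> 2 dense layers.
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 50_000, 200, 128  # placeholder assumptions

inputs = keras.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)  # level 1
x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)  # level 2
x = layers.GlobalMaxPooling1D()(x)           # strongest activation per feature
x = layers.Dense(64, activation="relu")(x)   # fully-connected layer 1
x = layers.Dense(32, activation="relu")(x)   # fully-connected layer 2
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC()])
```

Max-pooling over the sequence keeps the strongest signal per feature wherever it occurs, which tends to suit classification better than using only the final hidden state.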
Dealing with imbalance problems
- Check this extensive notebook on handling imbalanced classes (a minimal class-weighting sketch follows this list)
- Class balancing of one-shot classes: compute top-1 frequencies per class and replace the “new whale” class with one-shot classes that never appear at top-1
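Before reaching for competition-specific tricks like the one above, the usual baseline is to re-weight classes in the loss. A minimal scikit-learn sketch; the synthetic data and the 99:1 imbalance are illustrative assumptions:

```python
# Minimal sketch: weight the rare positive class more heavily in the loss.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# synthetic 99:1 imbalanced data, for illustration only
X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=42)

weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # the minority class gets a far larger weight

# class_weight="balanced" applies the same re-weighting inside the model
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```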
Metrics
- Global AUC
- ROC AUC score for fraud detection explained (+alternatives)
- Good Old Accuracy
- Mean Average Precision (MAP)
- Binary Log Loss
- F-beta score with beta = 0.5 (see the sketch after this list)
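All of the metrics above are one-liners in scikit-learn. A quick sketch on dummy labels and scores; note that average_precision_score is only the binary analogue of MAP, while competitions like Humpback Whale score MAP@5 over ranked class predictions:

```python
# Quick sketch: the listed metrics via scikit-learn, on dummy data.
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             fbeta_score, log_loss, roc_auc_score)

y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.65, 0.3, 0.9])  # predicted probabilities
y_pred = (y_prob > 0.5).astype(int)                 # thresholded hard labels

print("ROC AUC:     ", roc_auc_score(y_true, y_prob))
print("Accuracy:    ", accuracy_score(y_true, y_pred))
print("AP (binary): ", average_precision_score(y_true, y_prob))
print("Log loss:    ", log_loss(y_true, y_prob))
print("F-beta (0.5):", fbeta_score(y_true, y_pred, beta=0.5))
```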
Loss
BCE and Dice Based (see the sketch at the end of this section)
Focal Loss Based
Custom Losses
Others
- Lovasz Loss
- Weighted sigmoid cross-entropy loss
- Hard triplet loss
- Center Loss
- Additive Angular Margin Loss
- Margin loss
- CosFace Loss
- KL-Div loss with soft label
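As an example from the first family, here is a minimal PyTorch sketch of a combined BCE + Dice loss of the kind used in segmentation solutions. The 50/50 weighting and the smoothing constant are illustrative assumptions that competitors tune:

```python
# Minimal sketch of a combined BCE + Dice loss (PyTorch).
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, targets, bce_weight=0.5, smooth=1.0):
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    probs = torch.sigmoid(logits)
    intersection = (probs * targets).sum()
    dice = (2.0 * intersection + smooth) / (probs.sum() + targets.sum() + smooth)
    return bce_weight * bce + (1.0 - bce_weight) * (1.0 - dice)

# usage on a dummy batch of four single-channel 64x64 masks
logits = torch.randn(4, 1, 64, 64)
targets = torch.randint(0, 2, (4, 1, 64, 64)).float()
print(bce_dice_loss(logits, targets))
```

BCE optimizes per-pixel probabilities while the Dice term optimizes overlap directly, so the combination tends to be more robust when the positive region is tiny.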
Cross-validation + proper evaluation
- Use adversarial validation (see the sketch after this list)
- Apply GroupKFold cross-validation
- Use a simple time split, with roughly the last 100k records as the validation set
- Generate predictions using unshuffled KFold
- Use stratified 5 fold without early stopping for predicting test data
- Implement LightGBM on 10 KFolds with no shuffle
- If using pseudo labeling, don’t validate on the pseudo labels to avoid overfitting
- Use standard 10-fold stratified cross-validation with multiple seeds for the final blend
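Adversarial validation, the first tip in this list, trains a classifier to distinguish train rows from test rows: an AUC near 0.5 means the distributions match, while an AUC near 1.0 warns that a random validation split will not reflect the test set. A minimal LightGBM sketch, assuming train_df and test_df are pandas DataFrames with identical feature columns:

```python
# Minimal adversarial validation sketch: can a model tell train from test?
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

def adversarial_auc(train_df: pd.DataFrame, test_df: pd.DataFrame) -> float:
    """AUC of a train-vs-test classifier; ~0.5 means the sets look alike."""
    X = pd.concat([train_df, test_df], axis=0, ignore_index=True)
    y = np.r_[np.zeros(len(train_df)), np.ones(len(test_df))]  # 0=train, 1=test
    clf = lgb.LGBMClassifier(n_estimators=200)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```

If the AUC is high, the feature importances of that classifier also point at exactly which features drift between train and test.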
Post-processing
- Use the history of submissions to tweak test set predictions
- Select a random 30% of CV, optimize thresholds on that 30%, apply them to the remaining 70%, and check how far off they are from the 70%’s own optimal thresholds (see the sketch after this list)
- Re-scale predictions above 0.8 and below 0.01 with small probabilistic random noise that acts as a penalty
- Scale up the predicted probability of comments that contain curse words in different languages
- Label the test samples using the best-performing ensemble, add them to the train set, and train to convergence
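Here is a minimal sketch of that 30/70 threshold check; the synthetic y_true and oof_prob arrays stand in for real out-of-fold labels and predictions:

```python
# Minimal sketch: tune a decision threshold on 30% of the OOF predictions
# and sanity-check it against the optimal threshold of the other 70%.
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def best_threshold(y, p, grid=np.linspace(0.05, 0.95, 91)):
    return max(grid, key=lambda t: f1_score(y, p > t))

rng = np.random.default_rng(0)                       # synthetic stand-ins
y_true = rng.integers(0, 2, 10_000)
oof_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, 10_000), 0, 1)

y30, y70, p30, p70 = train_test_split(y_true, oof_prob, train_size=0.3,
                                      random_state=0)
t30, t70 = best_threshold(y30, p30), best_threshold(y70, p70)
print(f"threshold from 30%: {t30:.2f} vs optimal on 70%: {t70:.2f}")
print(f"F1 on 70% with the 30% threshold: {f1_score(y70, p70 > t30):.4f}")
```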
Ensembling
Averaging
Averaging over multiple seeds
- Average 10 out-of-fold predictions
- Average multiple seeds
- Add model diversity by seed averaging and bagging models with different folds
Geometric mean (see the sketch below)
Average different models
- An average ensemble of XLM-R models
- Average predictions for 7 language-specific models
- An ensemble of XLM-R models
- An ensemble of CatBoost, XGBoost, and LightGBM
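Both flavors of averaging fit in a few lines. In this sketch preds is an assumed (n_models, n_samples) array of predicted probabilities, one row per seed, fold, or model; note how the geometric mean is dragged down whenever a single model is confidently low:

```python
# Minimal sketch: arithmetic vs. geometric mean of model probabilities.
import numpy as np

# assumed shape (n_models, n_samples): one row per seed / fold / model
preds = np.array([[0.90, 0.40, 0.05],
                  [0.80, 0.55, 0.10],
                  [0.95, 0.35, 0.02]])

arithmetic = preds.mean(axis=0)
# geometric mean computed in log-space for numerical stability
geometric = np.exp(np.log(np.clip(preds, 1e-7, 1.0)).mean(axis=0))

print(arithmetic)  # [0.8833 0.4333 0.0567]
print(geometric)   # lower wherever the models disagree or are confidently low
```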
Stacking
- Stack Bi-LSTM, BERT-Large-Uncased with whole-word masking (WWM), and XLNet, with ExtraTreesClassifier as the meta-model (see the sketch after this list)
- LightGBM Stacking
- Stack LightGBM models with heavy Bayesian optimization
- Stack models using pystacknet and mlxtend
- An ensemble of RNN, CNN, LightGBM, and NBSVM
- Use XGBoost bagged 5 times
- CV scores with heavy Bayesian optimization
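A minimal stacking sketch with scikit-learn's StackingClassifier: out-of-fold predictions from the base models become the features of a meta-model. The base models and synthetic data here are illustrative stand-ins for the competition-specific models listed above, with ExtraTreesClassifier in the meta-model role from the first tip:

```python
# Minimal stacking sketch: OOF predictions of base models feed a meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2_000, random_state=42)  # synthetic data

stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("gbm", GradientBoostingClassifier())],
    final_estimator=ExtraTreesClassifier(n_estimators=200),  # meta-model
    cv=5,                          # base models predict out-of-fold
    stack_method="predict_proba",  # meta-model sees probabilities
)
print(cross_val_score(stack, X, y, cv=3, scoring="roc_auc").mean())
```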
Blending
- Use power blending (average predictions raised to a power)
- Blend using Hyperopt on out-of-fold (OOF) predictions to find optimal weights (see the sketch after this list)
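A minimal sketch of weight search with Hyperopt on out-of-fold predictions; the synthetic oof_a, oof_b, and y_true arrays stand in for two real models' OOF outputs:

```python
# Minimal sketch: search for blend weights on OOF predictions with Hyperopt.
import numpy as np
from hyperopt import fmin, hp, tpe
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)                    # synthetic stand-ins
y_true = rng.integers(0, 2, 5_000)
oof_a = np.clip(y_true * 0.5 + rng.normal(0.25, 0.2, 5_000), 0, 1)
oof_b = np.clip(y_true * 0.4 + rng.normal(0.30, 0.2, 5_000), 0, 1)

def objective(w):
    blend = w * oof_a + (1.0 - w) * oof_b
    return -roc_auc_score(y_true, blend)          # minimize negative AUC

best = fmin(objective, space=hp.uniform("w", 0, 1),
            algo=tpe.suggest, max_evals=100)
print(best)  # e.g. {'w': 0.6...}
```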
Others
- Implement hill-climb ensembling (see the sketch after this list)
- Apply LightGBM bagged 10 times with different training data samples
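Hill-climb ensembling greedily adds, with repetition allowed, whichever model most improves the blended out-of-fold score, and stops when no model helps. A minimal sketch, where oof is an assumed (n_models, n_samples) array of OOF predictions:

```python
# Minimal hill-climb ensembling sketch over OOF predictions.
import numpy as np
from sklearn.metrics import roc_auc_score

def hill_climb(oof, y_true, max_iters=50):
    """Greedy weight search; returns per-model weights and the blend's AUC."""
    counts = np.zeros(len(oof))            # times each model has been picked
    blend = np.zeros(oof.shape[1])
    best_score = 0.0
    for _ in range(max_iters):
        scores = [roc_auc_score(y_true, (blend + p) / (counts.sum() + 1))
                  for p in oof]            # try adding each model once more
        i = int(np.argmax(scores))
        if scores[i] <= best_score:        # nothing improves the blend: stop
            break
        best_score = scores[i]
        blend += oof[i]
        counts[i] += 1
    return counts / counts.sum(), best_score

# usage: weights, auc = hill_climb(oof, y_true)
```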
Repositories and open solutions
Repos with open source solutions
Image-based solutions
- Humpback Whale Identification 1st Place Code
- Data Science Bowl 2nd Place Solution
- Forecasting Lung Cancer Diagnoses with Deep Learning
- Kaggle data science bowl 2017
- RSNA Intracranial Hemorrhage Detection 1st Place Solution
- 2nd Place Solution — RSNA Intracranial Hemorrhage Detection
- 3rd place solution RSNA Intracranial Hemorrhage Detection
- 4th Place Solution with code RSNA Intracranial Hemorrhage Detection
- 5th place solution for RSNA Intracranial Hemorrhage Detection
- RSNA Intracranial Hemorrhage Detection Entrypoint for the 5th-place-solution
- SIIM-ACR Pneumothorax Segmentation 1st Place Solution
- SIIM-ACR Pneumothorax Segmentation 3rd Place Solution
- 5th place solution SIIM-ACR Pneumothorax Segmentation
- Humpback Whale Identification 5th Place Solution
- 4th Place Solution Humpback Whale Identification
- Kaggle Humpback Whale Identification Challenge 2019 2nd place code
- 2nd place solution for the 2017 National Data Science Bowl
- Code for 3rd place solution in Kaggle Humpback Whale Identification Challenge
Tabular data solutions
- How to implement LibFM in Keras and how it was used in the TalkingData competition on Kaggle
- XGB Fraud Detection Solution
- Fraud Detection Feature Engineering
- 2nd Place Solution Santander Customer Transaction Prediction
- Santander Customer Transaction Prediction 5th Place Solution
- Solution to the Kaggle Santander Customer Transaction Prediction competition
- 2nd place solution to the Microsoft Malware Prediction Challenge on Kaggle
Text classification solutions
- Toxic Comment Classification Challenge, 12th place solution
- Code and write-up for the Kaggle Toxic Comment Classification Challenge
- Jigsaw Unintended Bias in Toxicity Classification 4th Place Solution
- An open solution to the Toxic Comment Classification Challenge
- TalkingData AdTracking Fraud Detection Challenge 4th Place Solution
- Bronze medal Jigsaw Solution
- Jigsaw Unintended Bias in Toxicity Classification 10th Place Solution
Final thoughts
Hopefully, this article gave you some background on binary classification tips and tricks, as well as some tools and frameworks that you can use to start competing.
We’ve covered tips on:
- architectures,
- losses,
- post-processing,
- ensembling,
- tools and frameworks.
If you want to go deeper, simply follow the links and see how the best binary classification models are built.