Image Classification: Tips and Tricks From 13 Kaggle Competitions (+ Tons of References)
Success in any field can be distilled into a set of small rules and fundamentals that produce great results when coupled together.
Machine learning and image classification are no different, and engineers can showcase best practices by taking part in competitions like those hosted on Kaggle.
In this article, I’m going to give you a lot of resources to learn from, focusing on the best Kaggle kernels from 13 Kaggle competitions – with the most prominent competitions being:
- Intel Image Classification
- Recursion Cellular Image Classification
- SIIM-ISIC Melanoma Classification
- APTOS 2019 Blindness Detection
- Diabetic Retinopathy Detection
- ML Project — Image Classification
- Cdiscount’s Image Classification Challenge
- Plant seedlings classifications
- Aesthetic Visual Analysis
We’ll go through three main areas of tweaking a Deep learning solution:
- Data
- Model
- Loss function
…and there will be a lot of example projects (and references) for you to check out along the way.
Data
Image pre-processing + EDA

Every machine learning/deep learning solution starts with raw data. There are two essential steps in the data processing pipeline.
The first step is Exploratory Data Analysis (EDA). It helps us analyse the entire dataset and summarise its main characteristics, like class distribution, size distribution, and so on. Visual methods are often used to display the results of this analysis.
The second step is image pre-processing, where the aim is to take the raw images and improve the image data (also known as image features) by suppressing unwanted distortions, resizing, and/or enhancing important features, so the data is better suited to the model and performance improves.
You can dig into these Kaggle notebooks to check out a few examples of image pre-processing and EDA techniques (a short code sketch follows the list):
- Visualisation
- Dealing with Class imbalance
- Fill missing values (labels, features and, etc.)
- Normalisation
- Pre-processing
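For illustration, here is a minimal sketch of what these first steps can look like in Python. The file name train.csv and the columns image_path and label are hypothetical; adapt them (and the target size) to your dataset.

```python
# Minimal EDA + pre-processing sketch. "train.csv", "image_path" and "label"
# are hypothetical names; adapt them to your dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image

df = pd.read_csv("train.csv")

# EDA: class distribution (spot class imbalance early).
df["label"].value_counts().plot(kind="bar", title="Class distribution")
plt.show()

# EDA: image size distribution on a small sample (datasets often mix resolutions).
sizes = [Image.open(p).size for p in df["image_path"].sample(100, random_state=0)]
print(pd.Series(sizes).value_counts().head())

# Pre-processing: fixed size, RGB, pixels scaled to [0, 1].
def preprocess(path, target_size=(224, 224)):
    img = Image.open(path).convert("RGB").resize(target_size)
    return np.asarray(img, dtype=np.float32) / 255.0

x = preprocess(df["image_path"].iloc[0])
print(x.shape, x.min(), x.max())  # e.g. (224, 224, 3) 0.0 0.98
```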
Data augmentation
Data augmentation can expand our dataset by generating more training data from existing training samples. New samples are generated via a number of random transformations that not only yield believable-looking images but also reflect real-life scenarios—more on this later.
This technique is widely used, and not just when there are too few training samples. With too little data, the model starts to memorize the training set but is unable to generalize (it performs poorly on data it has never seen).
Usually, when a model performs great on training data but poorly on validation data, we call this condition overfitting. To solve this problem, we usually try to get new data, and if new data isn’t available, data augmentation comes to the rescue.
Note: a general rule of thumb is to always use data augmentation, even with a large dataset, because it exposes the model to more variation and helps it generalize better. It does come at the cost of slower training, since augmentations are usually applied on the fly (that is, during training).
Plus, for each task or dataset, we have to use augmentation techniques that reflect possible real-life scenarios (e.g. for a cat/dog detector we can use horizontal flips, crops, and brightness and contrast changes, because these augmentations match the differences in how photos are actually taken).
Here are a few Kaggle competition notebooks where you can check out popular data augmentation techniques in practice (a short torchvision sketch follows the list):
- Horizontal Flip
- Random Rotate and Random Dihedral
- Hue, Saturation, Contrast, Brightness, Crop
- Colour jitter
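As a concrete (and non-exhaustive) example, here is a sketch of an augmentation pipeline using torchvision transforms; the specific transforms and magnitudes are illustrative, not a recipe.

```python
# Data augmentation sketch with torchvision; pick transforms that match
# how your photos actually vary.
from torchvision import transforms

train_tfms = transforms.Compose([
    transforms.RandomResizedCrop(224),                     # random crop + resize
    transforms.RandomHorizontalFlip(p=0.5),                # horizontal flip
    transforms.RandomRotation(degrees=15),                 # random rotation
    transforms.ColorJitter(brightness=0.2, contrast=0.2,   # brightness / contrast /
                           saturation=0.2, hue=0.05),      # saturation / hue jitter
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],       # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

# Validation/test data only gets deterministic resizing and normalisation.
valid_tfms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```

It is worth visualising a batch of augmented images before training to confirm they still look like plausible photos for your task.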
Model
Develop a baseline (example project)
Here we create a basic model with a very simple architecture, without any regularization or dropout layers, and see if we can beat a baseline score of 50% accuracy. We can't always get there, and if we can't beat the baseline after trying several reasonable architectures, the input data may simply not hold the information our model needs to make a prediction.
In the wise and paraphrased words of Jeremy Howard:
“You should be able to quickly test whether you are going in a promising direction, within 15 minutes and using 50% or less of the dataset; if not, you have to rethink everything.”
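As a rough illustration, a baseline in PyTorch can be as small as the sketch below: a two-block CNN with no dropout or other regularization, just enough to check that the pipeline works and the trivial score can be beaten. The class count and input size are placeholders.

```python
# A deliberately simple baseline: a small CNN with no regularization,
# used only to sanity-check the pipeline and the trivial score.
import torch
import torch.nn as nn

class BaselineCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = BaselineCNN(num_classes=2)
print(model(torch.randn(4, 3, 224, 224)).shape)  # torch.Size([4, 2])
```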
Develop a model large enough that it overfits (example project)
Once our baseline model has enough capacity to beat the baseline score, we can increase its capacity until it overfits the dataset, and then move on to applying regularization. We can increase model capacity by:
- Adding more layers
- Using a better architecture
- Better training procedures
Architecture
According to the literature, the architecture refinements below improve model capacity while barely changing the computational complexity. They're still pretty interesting if you want to dig into the linked examples (a short backbone-swapping sketch follows the list):
- Residual Networks
- Wide Residual Networks
- Inception
- EfficientNet
- Swish activation
- Residual Attention Network
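As a sketch of how cheap it is to try different backbones, torchvision.models ships pretrained versions of several of the architectures above. The weights="IMAGENET1K_V1" argument assumes torchvision 0.13 or newer, and the attribute holding the classification head differs between model families.

```python
# Trying a different backbone is often a one-line change with torchvision.models.
import torch.nn as nn
from torchvision import models

num_classes = 5  # set this for your dataset

backbone = models.resnet50(weights="IMAGENET1K_V1")            # Residual Network
# backbone = models.wide_resnet50_2(weights="IMAGENET1K_V1")   # Wide ResNet
# backbone = models.inception_v3(weights="IMAGENET1K_V1")      # Inception
# backbone = models.efficientnet_b0(weights="IMAGENET1K_V1")   # EfficientNet
#   (note: EfficientNet's head lives under backbone.classifier, not backbone.fc)

# Replace the ImageNet classification head with one sized for our classes.
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
```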
Most of the time, model capacity and accuracy are positively correlated to each other – as the capacity increases, the accuracy increases too, and vice-versa.
Training procedures
Here are some training procedures you can use to tweak your model, with example projects to see how they work (a combined sketch follows the list):
- Mixed-Precision Training
- Large Batch-Size Training
- Cross-Validation Set
- Weight Initialization
- Self-Supervised Training (Knowledge Distillation)
- Learning Rate Scheduler
- Learning Rate Warmup
- Early Stopping
- Differential Learning Rates
- Ensemble
- Transfer Learning
- Fine-Tuning
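To show how several of these procedures fit together, here is a sketch combining transfer learning, differential learning rates, a cosine learning rate schedule, mixed-precision training, and early stopping. The data loaders below are synthetic stand-ins, and the class count and patience values are placeholders.

```python
# Sketch: transfer learning + differential LRs + cosine schedule +
# mixed precision + early stopping. Swap the synthetic data for your own.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"
num_classes = 5  # hypothetical class count

# Transfer learning: start from ImageNet weights and replace the head.
model = models.resnet34(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, num_classes)
model.to(device)

criterion = nn.CrossEntropyLoss()

# Differential learning rates: small LR for the pretrained backbone,
# larger LR for the freshly initialised head.
optimizer = torch.optim.AdamW([
    {"params": [p for n, p in model.named_parameters() if not n.startswith("fc.")], "lr": 1e-4},
    {"params": model.fc.parameters(), "lr": 1e-3},
])
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # mixed precision

# Synthetic stand-in data so the sketch runs end to end.
train_loader = DataLoader(TensorDataset(torch.randn(32, 3, 224, 224),
                                        torch.randint(0, num_classes, (32,))), batch_size=8)
val_loader = DataLoader(TensorDataset(torch.randn(16, 3, 224, 224),
                                      torch.randint(0, num_classes, (16,))), batch_size=8)

@torch.no_grad()
def evaluate(model, loader):
    model.eval()
    losses = [criterion(model(x.to(device)), y.to(device)).item() for x, y in loader]
    return sum(losses) / len(losses)

best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(5):
    model.train()
    for images, targets in train_loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=(device == "cuda")):  # mixed precision
            loss = criterion(model(images), targets)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()

    val_loss = evaluate(model, val_loader)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                 # early stopping
            break
```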
Hyperparameter tuning
Unlike parameters, hyperparameters are specified by you when you configure the model (e.g. learning rate, number of epochs, number of hidden units, batch size, etc.).
Instead of trying different model configurations manually, you can automate the process with hyperparameter tuning libraries like scikit-learn's grid search, Keras Tuner, and others, which will try the hyperparameter combinations within the ranges you specify and return the best-performing model.
The more hyperparameters you need to tune, the slower the process, so it’s good to select a minimum subset of model hyperparameters to tune.
Not all model hyperparameters are equally important. Some hyperparameters have an outsized effect on the behaviour, and in turn the performance, of a machine learning algorithm. You should carefully pick the ones that impact your model’s performance the most, and tune them for maximum performance.
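As a schematic example of the pattern (not tied to any particular competition), scikit-learn's GridSearchCV exhaustively tries every combination in a parameter grid. Here the estimator is a plain linear classifier on synthetic "flattened image features", purely to keep the sketch short; the grid values are illustrative.

```python
# Schematic grid search with scikit-learn's GridSearchCV.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for flattened image features; replace with your own data.
X_train, y_train = make_classification(n_samples=300, n_features=64,
                                       n_informative=10, n_classes=3,
                                       random_state=0)

param_grid = {
    "alpha": [1e-4, 1e-3, 1e-2],              # regularization strength
    "learning_rate": ["optimal", "adaptive"],
    "eta0": [0.01, 0.1],                      # initial LR (used by "adaptive")
}
search = GridSearchCV(SGDClassifier(max_iter=1000, random_state=0),
                      param_grid, scoring="accuracy", cv=3, n_jobs=-1)
search.fit(X_train, y_train)                  # tries every combination
print(search.best_params_, search.best_score_)
```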
Regularization
This method forces the model to learn a meaningful and generalizable representation of the data by penalizing memorization and overfitting, making the model more robust when dealing with data it has never seen before.
One simple method to solve the problems stated above is to get more training data because a model trained on more data will naturally generalize better.
Here are some techniques you can try to mitigate overfitting and underfitting, with example project links for you to dig into (a short sketch follows the list):
- Adding Dropout
- Adding or changing the position of Batch Norm
- Data augmentation
- Mixup
- Weight regularization
- Gradient clipping
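The sketch below shows how a few of these levers look in PyTorch: dropout and batch norm in a classification head, weight decay (L2 regularization) in the optimizer, and gradient clipping in the training step. The layer sizes and hyperparameter values are illustrative, and the 512-d input features are a placeholder.

```python
# A few regularization levers in PyTorch; sizes and values are illustrative.
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Linear(512, 256),
    nn.BatchNorm1d(256),   # batch norm
    nn.ReLU(),
    nn.Dropout(p=0.5),     # dropout
    nn.Linear(256, 5),
)

# Weight regularization via weight decay.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss()

def train_step(features, targets):
    """One step on pre-extracted 512-d features (placeholder inputs)."""
    optimizer.zero_grad()
    loss = criterion(head(features), targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(head.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    return loss.item()

print(train_step(torch.randn(8, 512), torch.randint(0, 5, (8,))))
```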
Loss function
Also known as the cost function or objective function, the loss function measures the difference between the model's output and the target output, and helps the model minimize the distance between them.
Here are some of the most popular loss functions, with project examples where you'll find tricks to improve your model's performance (two short examples follow the list):
- Label smoothing
- Focal loss
- SparseMax loss and Weighted cross-entropy
- BCE loss, BCE with logits loss and Categorical cross-entropy loss
- Additive Angular Margin Loss for Deep Face Recognition
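For example, label smoothing is built into PyTorch's CrossEntropyLoss (from version 1.10), and a basic multi-class focal loss only takes a few lines; gamma=2.0 below is the common default from the focal loss paper, not a tuned value.

```python
# Label smoothing via nn.CrossEntropyLoss (PyTorch >= 1.10) and a small
# multi-class focal loss sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

smoothed_ce = nn.CrossEntropyLoss(label_smoothing=0.1)

class FocalLoss(nn.Module):
    """Down-weights easy, well-classified examples (useful for class imbalance)."""
    def __init__(self, gamma=2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, reduction="none")
        pt = torch.exp(-ce)                        # probability of the true class
        return ((1 - pt) ** self.gamma * ce).mean()

logits, targets = torch.randn(8, 5), torch.randint(0, 5, (8,))
print(smoothed_ce(logits, targets).item(), FocalLoss()(logits, targets).item())
```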
Evaluation + error analysis
Here, we do an ablation study and analyse our experiment results. We identify our model's weaknesses and strengths, and pinpoint areas to improve in the future; a confusion matrix and per-class metrics (sketched below) are a common starting point, and the linked examples show more advanced techniques in practice.
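For instance, a confusion matrix and a per-class classification report quickly reveal which classes the model confuses most. The y_true/y_pred arrays below are placeholders for your validation labels and predictions.

```python
# Error-analysis sketch: confusion matrix and per-class metrics.
from sklearn.metrics import classification_report, confusion_matrix

# Placeholders; use your validation labels and model predictions.
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=3))
```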
There are many experiment tracking and management tools that require minimal setup and save all of this data for you automatically, which makes ablation studies easier – Neptune.ai does a great job here.
Closing thoughts
There are many ways to tweak your models, and new ideas come out all the time. Deep learning is a fast-moving field, and there are no silver-bullet methods. We have to experiment a lot, and enough trial and error can lead to breakthroughs. This article already contains a lot of links, but for the most knowledge-hungry readers, I also added a long reference section below for you to read more and run some notebooks.
Further research
- Distributed Training
- 3D Image classification
- Converting data from other domains into images
- PK-GCN: Prior Knowledge Assisted Image Classification using Graph Convolution Networks
References
Papers:
- Wide Residual Networks
- mixup: Beyond Empirical Risk Minimization
- ArcFace: Additive Angular Margin Loss for Deep Face Recognition
- EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
- Searching for Activation Functions
- Residual Attention Network for Image Classification
- Mixed Precision Training
- Self-training with Noisy Student improves ImageNet classification
- When Does Label Smoothing Help?
- Grad-CAM: Why did you say that? Visual Explanations from Deep Networks…
- A Comparative Study of Deep Learning Loss Functions for Multi-Label Remote Sensing Image Classification
Blogs:
- Wide Residual Nets: “Why deeper isn’t always better…”
- Tune Hyperparameters for Classification Machine Learning Algorithms
- Image Pre-Processing
- Understanding Categorical Cross-Entropy Loss, Binary Cross-Entropy Loss…
- Noisy student
- Overfitting and Underfitting With Machine Learning Algorithms
- Developing AI projects under pressure
- Understanding Neural Networks
- Kaggle competitions
Books:
- Deep Learning with Python by F. Chollet
- Deep Learning with PyTorch by V. Subramanian
- Evaluating Machine Learning Models
- Fastbook
Kaggle Competitions:
- Intel Image Classification
- Recursion Cellular Image Classification
- SIIM-ISIC Melanoma Classification
- APTOS 2019 Blindness Detection
- Diabetic Retinopathy Detection
- ML Project — Image Classification
- Cdiscount’s Image Classification Challenge
- Plant seedlings classifications
- Aesthetic Visual Analysis
- Data Science Bowl 2017
- Plant Pathology 2020 – FGVC7
- Lyft Motion Prediction for Autonomous Vehicles
- Humpback Whale Identification
- Distributed Training
- 3D Image classification
Kaggle notebooks:
- Ultimate Image Classification Guide 2020
- Protein Atlas – Exploration and Baseline
- Intel Image Classification (CNN – Keras)
- APTOS : Eye Preprocessing in Diabetic Retinopathy
- Lyft Level5: EDA + Training + Inference
- [BEG][TUT]Intel Image Classification[93.76% Accur]
- pretrained ResNet34 with RGBY (0.460 public LB)
- Fold1h4r3 ArcENetB4/2 256px RCIC
- Quick Visualization + EDA
- Analysis of Melanoma Metadata and EffNet Ensemble
- Triple Stratified KFold with TFRecords
- Melanoma. Pytorch starter. EfficientNet
- InceptionV3 for Retinopathy (GPU-HR)
- Fastai tutorial for image classification
- Google image classification v4
- Chest X-Ray Image Classification – TF Hub ResNet50