Image Classification: Tips and Tricks From 13 Kaggle Competitions (+ Tons of References)

25th August, 2023

Success in any field can be distilled into a set of small rules and fundamentals that produce great results when coupled together. 

Machine learning and image classification are no different, and engineers can showcase best practices by taking part in competitions like those on Kaggle.

In this article, I’m going to give you a lot of resources to learn from, focusing on the best Kaggle kernels from 13 Kaggle competitions  – with the most prominent competitions being:

We’ll go through three main areas of tweaking a deep learning solution:

  • Data
  • Model 
  • Loss function 

…and there will be a lot of example projects (and references) for you to check out along the way. 


Image pre-processing + EDA 

Kaggle data

Every machine learning or deep learning solution starts with raw data. There are two essential steps in the data processing pipeline.

The first step is Exploratory Data Analysis (EDA). It helps us analyse the entire dataset and summarise its main characteristics, like class distribution, size distribution, and so on. Visual methods are often used to display the results of this analysis. 
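As a minimal sketch of the class-distribution check mentioned above, the snippet below counts how many images belong to each class and what share of the dataset they make up. The `labels` list and the `class_distribution` helper are hypothetical, made up for illustration; in a real competition the labels would come from the training CSV or the folder structure.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: (count, fraction_of_dataset)} for a list of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

# Hypothetical labels, for illustration only.
labels = ["cat", "dog", "cat", "cat", "dog", "bird"]
for label, (n, frac) in sorted(class_distribution(labels).items()):
    print(f"{label:>5}: {n} images ({frac:.0%})")
```

A heavily skewed distribution here is an early hint that plain accuracy may be a misleading metric and that class weighting or resampling is worth considering.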

The second step is Image Pre-Processing, where the aim is to take the raw image and improve image data (also known as image features) by suppressing unwanted distortions, resizing and/or enhancing important features, making the data more suited to the model and improving performance. 
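To make the resizing and rescaling idea concrete, here is a small sketch that works on a grayscale image stored as a nested list of 0-255 pixel values. Both helpers (`resize_nearest`, `scale_to_unit`) are illustrative assumptions, not code from any of the referenced notebooks; real pipelines would normally rely on an image library instead.

```python
def resize_nearest(image, new_h, new_w):
    """Resize a 2D image (list of rows) with nearest-neighbour sampling."""
    h, w = len(image), len(image[0])
    return [
        [image[r * h // new_h][c * w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]

def scale_to_unit(image):
    """Map 0-255 pixel intensities to floats in [0, 1]."""
    return [[p / 255.0 for p in row] for row in image]

# Hypothetical 4x4 grayscale image, for illustration only.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
small = resize_nearest(image, 2, 2)
print(small)
print(scale_to_unit([[0, 255]]))
```

Bringing every image to the same size and value range is what lets us batch them together and keeps the input scale consistent for the model.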

You can dig into these Kaggle notebooks to check out a few examples of Image Pre-Processing and EDA techniques:

Data augmentation

Data augmentation can expand our dataset by generating more training data from existing training samples. New samples are generated via a number of random transformations that not only yield believable-looking images but also reflect real-life scenarios—more on this later.

This technique is widely used, and not only when there are too few samples to train the model. With too few samples, the model starts to memorize the training set and fails to generalize (it performs poorly on data it has never seen).

Usually, when a model performs great on training data but poorly on validation data, we call this condition overfitting. To solve this problem, we usually try to get new data, and if new data isn’t available, data augmentation comes to the rescue.
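The symptom described above can be spotted from the validation loss curve. The sketch below is one simple way to do it, under the assumption that we log one validation loss per epoch: it reports the epoch with the best validation loss once that loss has failed to improve for `patience` epochs. The function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the best epoch once validation loss stalls for `patience` epochs.

    Returns None if validation loss never stalls for that long.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch
    return None

# Hypothetical validation losses: improving at first, then getting worse.
val_losses = [1.0, 0.8, 0.7, 0.75, 0.9, 1.1]
print(early_stopping_epoch(val_losses))
```

If the training loss keeps falling after the reported epoch while the validation loss climbs, the model is most likely memorizing the training set.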

Note: As a general rule of thumb, always use data augmentation, even with a large dataset, because it exposes the model to more variation and helps it generalize. The cost is slower training, because augmentations are applied on the fly (that is, during training).

Plus, for each task or dataset, we have to use augmentation techniques that reflect possible real-life scenarios (e.g. for a cat/dog detector we can use horizontal flip, crop, brightness and contrast, because these augmentations match differences in how photos are taken).
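Here is a rough sketch of two of the augmentations mentioned above, horizontal flip and brightness shift, applied at random to a grayscale image stored as a nested list. The helper names, `flip_prob` and `max_delta` are assumptions chosen for illustration; competition notebooks typically use a dedicated augmentation library rather than hand-written code like this.

```python
import random

def horizontal_flip(image):
    """Mirror each row of a 2D image left to right."""
    return [row[::-1] for row in image]

def adjust_brightness(image, delta):
    """Add `delta` to every pixel, clamping the result to 0-255."""
    return [[min(255, max(0, p + delta)) for p in row] for row in image]

def augment(image, rng, flip_prob=0.5, max_delta=30):
    """Randomly flip the image and shift its brightness; the input is not modified."""
    if rng.random() < flip_prob:
        image = horizontal_flip(image)
    delta = rng.randint(-max_delta, max_delta)
    return adjust_brightness(image, delta)

# Hypothetical 2x3 image; a seeded generator keeps the run reproducible.
image = [[0, 100, 200], [50, 150, 250]]
rng = random.Random(0)
for _ in range(3):
    print(augment(image, rng))
```

Because a fresh random variant is produced every time the image is requested, the model rarely sees exactly the same pixels twice, which is the on-the-fly behaviour described in the note above.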

Here are a few Kaggle competition notebooks for you to check out popular data augmentation techniques in practice:


Develop a baseline (example project)

Here we create a basic model using a very simple architecture, without any regularization or dropout layers, and see if we can beat the baseline score of 50% accuracy. Although we can’t always get there, if we can’t beat the baseline after trying multiple reasonable architectures, maybe the input data doesn’t hold the information required for our model to make a prediction. 
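A quick way to know what score we actually have to beat is to compute the accuracy of always predicting the most frequent class. The sketch below assumes we only have a list of training labels; on a balanced two-class problem it gives the 50% figure, but on imbalanced data the bar is higher. The helper name and labels are hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical label sets, for illustration only.
balanced = [0, 1, 0, 1]
imbalanced = [0, 0, 0, 1]
print(majority_baseline_accuracy(balanced))
print(majority_baseline_accuracy(imbalanced))
```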

In the wise and paraphrased words of Jeremy Howard:

“You should be able to quickly test if you are going into a promising direction, in 15 minutes using 50% or less of the dataset, if not you have to rethink everything.”
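One possible way to follow this advice is to run quick experiments on a stratified subset, so that every class keeps roughly the same share as in the full dataset. The sketch below is an assumption about how that could be done, not something taken from the quote; `stratified_subsample` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict

def stratified_subsample(labels, fraction=0.5, seed=0):
    """Return sorted sample indices that keep about `fraction` of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    keep = []
    for indices in by_class.values():
        # Keep at least one example per class so no class disappears.
        k = max(1, round(len(indices) * fraction))
        keep.extend(rng.sample(indices, k))
    return sorted(keep)

# Hypothetical labels: 10 images of class 0 and 4 images of class 1.
labels = [0] * 10 + [1] * 4
subset = stratified_subsample(labels, fraction=0.5)
print(subset)
```

The returned indices can then be used to pick the matching images for a fast first training run.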

Develop a model large enough that it overfits (example project)

Once our baseline model has enough capacity to beat the baseline score, we can increase its capacity until it overfits the dataset, and then move on to applying regularization. We can increase model capacity by:

  • Adding more layers
  • Using a better architecture 
  • Better training procedures


According to the literature, the architecture refinements below improve model capacity, but barely change the computational complexity. They’re still pretty interesting if you want to dig into the linked examples:

Most of the time, model capacity and accuracy are positively correlated to each other – as the capacity increases, the accuracy increases too, and vice-versa.

Training procedures

Here are some training procedures you can use to tweak your model, with example projects to see how they work:

Hyperparameter tuning

Unlike parameters, hyperparameters are specified by you when you configure the model (e.g. learning rate, number of epochs, number of hidden units, batch size, etc).

Instead of trying different model configurations manually, you can automate the process with hyperparameter tuning tools such as scikit-learn's grid search, Keras Tuner, and others. They try hyperparameter combinations within the ranges you specify and return the best-performing configuration.

The more hyperparameters you need to tune, the slower the process, so it’s good to select a minimum subset of model hyperparameters to tune.

Not all model hyperparameters are equally important. Some hyperparameters have an outsized effect on the behaviour, and in turn the performance, of a machine learning algorithm. You should carefully pick the ones that impact your model’s performance the most, and tune them for maximum performance.
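To show what an exhaustive search over a small grid looks like, here is a library-free sketch. The `grid_search` helper and the `toy_validation_score` function are hypothetical: the toy score simply stands in for "train a model with these settings and return its validation score", and this is not how any particular tuning library is implemented.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in `param_grid` and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_validation_score(params):
    """Hypothetical stand-in for a real train-and-validate run (higher is better)."""
    lr_term = abs(params["learning_rate"] - 1e-3) * 100
    batch_term = abs(params["batch_size"] - 32) / 100
    return -(lr_term + batch_term)

param_grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [16, 32, 64],
}
best_params, best_score = grid_search(param_grid, toy_validation_score)
print(best_params)
```

With three values for each of two hyperparameters the search already needs nine training runs, and every extra hyperparameter multiplies that number, which is why it pays to tune only the few that matter most.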


Regularization forces the model to learn a meaningful and generalizable representation of the data by penalizing memorization (overfitting), making the model more robust when dealing with data it has never seen before.
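One common way to penalize memorization is an L2 penalty (weight decay) added to the training loss. The original text does not name a specific technique at this point, so treat the sketch below as one illustrative option; the function names and the `weight_decay` value are assumptions.

```python
def l2_penalty(weights, weight_decay):
    """Sum of squared weights, scaled by the regularization strength."""
    return weight_decay * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, weight_decay=1e-4):
    """Total loss = how badly we fit the data + a penalty for large weights."""
    return data_loss + l2_penalty(weights, weight_decay)

# Hypothetical numbers: the same data loss, but small versus large weights.
print(regularized_loss(0.30, [0.1, -0.2, 0.05], weight_decay=0.01))
print(regularized_loss(0.30, [3.0, -4.0, 2.5], weight_decay=0.01))
```

The second configuration fits the data equally well but pays a larger penalty, so the optimizer is nudged toward the simpler solution with smaller weights.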

One simple method to solve the problems stated above is to get more training data because a model trained on more data will naturally generalize better.

Here are some techniques you can try to mitigate overfitting and underfitting, with example project links for you to dig into:

Loss function

Also known as the cost function or objective function, the loss function measures the difference between the model's output and the target output; training then minimizes that difference.
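As a concrete example, cross-entropy is a standard loss for classification. The sketch below computes it for a single sample from raw model scores (logits); it is a plain-Python illustration under that assumption, and deep learning frameworks ship their own optimized versions.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability that the model assigns to the correct class."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical logits for a 3-class problem where class 0 is the correct answer.
print(cross_entropy([4.0, 1.0, 0.5], target_index=0))  # confident and right: small loss
print(cross_entropy([0.5, 1.0, 4.0], target_index=0))  # confident and wrong: large loss
```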

Here are some of the most popular loss functions, with project examples where you’ll find tricks to improve your model capacity:

Evaluation + error analysis

Here, we do an ablation study and analyse our experiment results. We identify our model’s weaknesses and strengths, and the areas to improve in the future. You can use the techniques below at this stage, and see how they’re implemented in the linked examples:
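A simple starting point for error analysis is to count which classes get confused with which. The sketch below assumes we have the true and predicted labels for a validation set; the helper names and the labels are hypothetical.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count every (true_label, predicted_label) pair."""
    return Counter(zip(y_true, y_pred))

def per_class_recall(y_true, y_pred):
    """Fraction of each class's examples that were predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

def most_confused_pairs(y_true, y_pred, top=3):
    """Most frequent (true_label, predicted_label) mistakes."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(top)

# Hypothetical validation labels and predictions, for illustration only.
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "bird"]
print(per_class_recall(y_true, y_pred))
print(most_confused_pairs(y_true, y_pred))
```

Looking at the images behind the most frequent mistakes often reveals whether the problem lies in the labels, the pre-processing, or the model itself.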

There are many experiment tracking and management tools that need only minimal setup to save all of this data for you automatically, which makes the ablation study easier – does a great job here.

Closing thoughts

There are many ways to tweak your models, and new ideas come out all the time. Deep learning is a fast-moving field and there are no silver bullets. We have to experiment a lot, and enough trial and error eventually leads to breakthroughs. This article already contains a lot of links, but for the most knowledge-hungry readers, I also added a long reference section below for you to read more and run some notebooks.

Further research





Kaggle Competitions:

Kaggle notebooks:
