Imagine if you could get all the tips and tricks you need to hammer a Kaggle competition. I have gone over 39 Kaggle competitions including

- Data Science Bowl 2017 – $1,000,000
- Intel & MobileODT Cervical Cancer Screening – $100,000
- 2018 Data Science Bowl – $100,000
- Airbus Ship Detection Challenge – $60,000
- Planet: Understanding the Amazon from Space – $60,000
- APTOS 2019 Blindness Detection – $50,000
- Human Protein Atlas Image Classification – $37,000
- SIIM-ACR Pneumothorax Segmentation – $30,000
- Inclusive Images Challenge – $25,000

– and extracted that knowledge for you. Dig in.

## Contents

**External data**

- Use of the LUng Nodule Analysis (LUNA) Grand Challenge data because it contains detailed annotations from radiologists
- Use of the LIDC-IDRI data because it had radiologist descriptions of each tumor that they found
- Use Flickr CC and Wikimedia Commons datasets
- Use Human Protein Atlas Dataset
- Use IDRiD dataset

**Data exploration and gaining insights**

- Clustering of 3d segmentation with the 0.5 threshold
- Identify if there is a substantial difference in train/test label distributions

**Preprocessing**

- Perform blob detection using the Difference of Gaussian (DoG) method, as implemented in the skimage package
- Use of patch-based inputs for training in order to reduce the time of training
- Use cudf for loading data instead of Pandas because it has a faster reader
- Ensure that all the images have the same orientation
- Apply contrast limited adaptive histogram equalization
- Use OpenCV for all general image preprocessing
- Employ automatic active learning and add manual annotations
- Resize all images to the same resolution in order to apply the same model to scans of different thicknesses
- Convert scan images into normalized 3D numpy arrays
- Apply single Image Haze Removal using Dark Channel Prior
- Convert all data to Hounsfield units
- Find duplicate images using pair-wise correlation on RGBY
- Make labels more balanced by developing a sampler
- Apply pseudo labeling to test data in order to improve score
- Scale down images/masks to 320×480
- Histogram equalization (CLAHE) with kernel size 32×32
- Convert DCM to PNG
- Calculate the md5 hash for each image when there are duplicate images
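
The md5 tip above can be sketched in a few lines. This is only an illustration, not any competitor's code: the helper names and the in-memory `images` dictionary are hypothetical, and in practice you would read each file's bytes from disk.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def md5_of_bytes(data: bytes) -> str:
    """Return the hex MD5 digest of raw image bytes."""
    return hashlib.md5(data).hexdigest()


def find_exact_duplicates(images: Dict[str, bytes]) -> List[List[str]]:
    """Group image names whose raw bytes hash to the same MD5 digest."""
    groups = defaultdict(list)
    for name, data in images.items():
        groups[md5_of_bytes(data)].append(name)
    # Keep only digests shared by more than one image.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical stand-ins for files; normally: open(path, "rb").read()
images = {
    "a.png": b"\x89PNG-one",
    "b.png": b"\x89PNG-two",
    "c.png": b"\x89PNG-one",  # byte-identical copy of a.png
}
duplicates = find_exact_duplicates(images)
```

A hash only catches byte-identical files. Re-encoded or resized copies need a similarity measure such as the pair-wise correlation mentioned in the list.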

**Data augmentations**

- Use albumentations package for augmentations
- Apply random rotation by 90 degrees
- Use horizontal, vertical or both flips
- Attempt heavy geometric transformations: Elastic Transform, Perspective Transform, Piecewise Affine transforms, pincushion distortion
- Apply random HSV
- Use of loss-less augmentation for generalization to prevent loss of useful image information
- Apply channel shuffling
- Do data augmentation based on class frequency
- Apply gaussian noise
- Use lossless permutations of 3D images for data augmentation
- Rotate by a random angle from 0 to 45 degrees
- Scale by a random factor from 0.8 to 1.2
- Brightness changing
- Randomly change hue, saturation and value
- Apply D4 augmentations
- Contrast limited adaptive histogram equalization
- Use the AutoAugment augmentation strategy
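
"D4 augmentations" refers to the eight symmetries of a square: four rotations by 90 degrees, each with or without a flip. The sketch below generates them for a tiny grid in plain Python so the idea is visible without a framework. The function names are made up for this example; with real images you would more likely use an augmentation library.

```python
def rot90(img):
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def hflip(img):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in img]


def d4_variants(img):
    """Return the 8 grids produced by the symmetry group of the square."""
    variants = []
    current = [list(row) for row in img]
    for _ in range(4):
        variants.append(current)         # rotation only
        variants.append(hflip(current))  # rotation followed by a flip
        current = rot90(current)
    return variants


tile = [[1, 2],
        [3, 4]]
variants = d4_variants(tile)
```

Every variant is a pure rearrangement of the original pixels, which is why these transforms count as lossless augmentations.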

**Modeling**

**Architectures**

- Use of a U-net based architecture. Adopted the concepts and applied them to 3D input tensors
- Employing automatic active learning and adding manual annotations
- The inception-ResNet v2 architecture for training features with different receptive fields
- Siamese networks with adversarial training
- ResNet50, Xception, Inception ResNet v2 x 5 with Dense (FC) layer as the final layer
- Use of a global max-pooling layer which returns a fixed-length output no matter the input size
- Use of stacked dilated convolutions
- VoxelNet
- Replace plus sign in LinkNet skip connections with concat and conv1x1
- Generalized mean pooling
- Keras NASNetLarge to train the model from scratch using 224x224x3
- Use of the 3D convnet to slide over the images
- Imagenet-pre-trained ResNet152 as the feature extractor
- Replace the final fully-connected layers of ResNet by 3 fully connected layers with dropout
- Use ConvTranspose in the decoder
- Applying the VGG baseline architecture
- Implementing the C3D network with adjusted receptive fields and a 64 unit bottleneck layer on the end of the network
- Use of UNet type architectures with pre-trained weights to improve convergence and performance of binary segmentation on 8-bit RGB input images
- LinkNet since it’s fast and memory efficient
- MASKRCNN
- BN-Inception
- Fast Point R-CNN
- Seresnext
- UNet and Deeplabv3
- Faster RCNN
- SENet154
- ResNet152
- NASNet-A-Large
- EfficientNetB4
- ResNet101
- GAPNet
- PNASNet-5-Large
- Densenet121
- AC-GAN
- XceptionNet (96), XceptionNet (299), Inception v3 (139), InceptionResNet v2 (299), DenseNet121 (224)
- AlbuNet (resnet34) from ternausnets
- SpaceNet
- Resnet50 from selim_sef SpaceNet 4
- SCSEUnet (seresnext50) from selim_sef SpaceNet 4
- A custom Unet and Linknet architecture
- FPNetResNet50 (5 folds)
- FPNetResNet101 (5 folds)
- FPNetResNet101 (7 folds with different seeds)
- PANetDilatedResNet34 (4 folds)
- PANetResNet50 (4 folds)
- EMANetResNet101 (2 folds)
- RetinaNet
- Deformable R-FCN
- Deformable Relation Networks
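
Of the items above, generalized mean (GeM) pooling is easy to show in isolation. It raises each activation to a power `p`, averages, and takes the `p`-th root, so `p = 1` gives average pooling and large `p` approaches max pooling. The sketch below works on a flat list for one channel. The default `p = 3.0` and the clamping constant `eps` are assumptions for illustration, not values taken from any of the listed solutions.

```python
def gem_pool(values, p=3.0, eps=1e-6):
    """Generalized mean pooling over one channel's activations."""
    # Clamp to a small positive value so fractional powers stay well defined.
    clamped = [max(v, eps) for v in values]
    mean_pow = sum(v ** p for v in clamped) / len(clamped)
    return mean_pow ** (1.0 / p)


activations = [0.0, 1.0, 2.0, 3.0]
avg_like = gem_pool(activations, p=1.0)   # close to the arithmetic mean
max_like = gem_pool(activations, p=50.0)  # close to the maximum
```

In a network, `p` is often treated as a learnable parameter, which this sketch does not attempt to show.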

**Hardware setups**

- Use of the AWS GPU instance p2.xlarge with a NVIDIA K80 GPU
- Pascal Titan-X GPU
- Use of 8 TITAN X GPUs
- 6 GPUs: 2×1080Ti + 4×1080
- Server with 8×NVIDIA Tesla P40, 256 GB RAM and 28 CPU cores
- Intel Core i7 5930k, 2×1080, 64 GB of RAM, 2x512GB SSD, 3TB HDD
- GCP 1x P100, 8x CPU, 15 GB RAM, SSD or 2x P100, 16x CPU, 30 GB RAM
- NVIDIA Tesla P100 GPU with 16GB of RAM
- 980Ti GPU, 2600k CPU, and 14GB RAM

**Loss functions**

- Dice Coefficient because it works well with imbalanced data
- Weighted boundary loss whose aim is to reduce the distance between the predicted segmentation and the ground truth
- MultiLabelSoftMarginLoss that creates a criterion that optimizes a multi-label one-versus-all loss based on max-entropy, between input and target
- Balanced cross entropy (BCE) with logit loss that involves weighing the positive and negative examples by a certain coefficient
- Lovasz that performs direct optimization of the mean intersection-over-union loss in neural networks based on the convex Lovasz extension of sub-modular losses
- FocalLoss + Lovasz obtained by summing the Focal and Lovasz losses
- Arc margin loss that incorporates margin in order to maximise face class separability
- Npairs loss that computes the npairs loss between y_true and y_pred.
- A combination of BCE and Dice loss functions
- LSEP – a pairwise ranking loss that is smooth everywhere and thus is easier to optimize
- Center loss that simultaneously learns a center for deep features of each class and penalizes the distances between the deep features and their corresponding class centers
- Ring Loss that augments standard loss functions such as Softmax
- Hard triplet loss that trains a network to embed features of the same class at the same time maximizing the embedding distance of different classes
- 1 + BCE – Dice, which subtracts the Dice loss from the BCE loss and then adds 1
- Binary cross-entropy – log(dice), that is, the binary cross-entropy minus the log of the dice loss
- Combinations of BCE, dice and focal
- Lovasz Loss that performs direct optimization of the mean intersection-over-union loss
- BCE + DICE, where the Dice loss is obtained by calculating the smooth Dice coefficient function
- Focal loss with Gamma 2 that is an improvement to the standard cross-entropy criterion
- BCE + DICE + Focal – this is basically a summation of the three loss functions
- Active Contour Loss that incorporates the area and size information and integrates the information in a dense deep learning model
- 1024 * BCE(results, masks) + BCE(cls, cls_target)
- Focal + kappa – Kappa is a loss function for multi-class classification of ordinal data in deep learning. In this case we sum it and the focal loss
- ArcFaceLoss — Additive Angular Margin Loss for Deep Face Recognition
- Soft Dice trained on positives only – Soft Dice uses predicted probabilities
- 2.7 * BCE(pred_mask, gt_mask) + 0.9 * DICE(pred_mask, gt_mask) + 0.1 * BCE(pred_empty, gt_empty) which is a custom loss used by the Kaggler
- nn.SmoothL1Loss()
- Use of the Mean Squared Error objective function in scenarios where it seems to work better than the binary cross-entropy objective function
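
Several of the entries above are weighted sums of BCE, Dice and focal terms. The following framework-free sketch shows how those three pieces fit together on flattened probabilities. It is an illustration under assumptions, not the code used in any of the competitions: the function names, the `smooth` constant and the default weights are all chosen here for the example.

```python
import math


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def soft_dice_loss(probs, targets, smooth=1.0):
    """1 minus the smooth Dice coefficient, using probabilities directly."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2.0 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)
    return 1.0 - dice


def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss; gamma=0 reduces to plain BCE."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        p_t = p if t == 1 else 1.0 - p  # probability given to the true class
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def bce_dice_focal(probs, targets, w_bce=1.0, w_dice=1.0, w_focal=1.0):
    """Weighted sum of the three terms; the weights are hypothetical defaults."""
    return (w_bce * bce(probs, targets)
            + w_dice * soft_dice_loss(probs, targets)
            + w_focal * focal_loss(probs, targets))


probs = [0.9, 0.2, 0.7, 0.1]
targets = [1, 0, 1, 0]
combined = bce_dice_focal(probs, targets)
```

Setting one weight to zero recovers the two-term variants such as BCE + Dice, and changing the weights gives custom mixes like the weighted formula quoted in the list.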

**Training tips**

- Try different learning rates
- Try different batch sizes
- Use SGD with momentum with manual rate scheduling
- Too much augmentation will reduce the accuracy
- Train on image crops and predict on full images
- Use Keras’s ReduceLROnPlateau() to reduce the learning rate
- Train without augmentation until plateau then apply soft and hard augmentation to some epochs
- Freeze all layers except the last one and use 1000 images from Stage1 for tuning
- Make labels more balanced by developing a sampler
- Use of class aware sampling
- Use dropout and augmentation while tuning the last layer
- Pseudo Labeling to improve score
- Use Adam reducing LR on plateau with patience 2–4
- Use Cyclic LR with SGD
- Reduce the learning rate by a factor of two if validation loss does not improve for two consecutive epochs
- Repeat the worst batch out of 10 batches
- Train with default UNET
- Overlap tiles so that each edge pixel is covered twice
- Hyperparameter tuning: learning rate on training, non-maximum suppression and score threshold on inference
- Remove bounding boxes with a low confidence score
- Train different convolutional neural networks then build an ensemble
- Stop training when the F1 score is decreasing
- Differential learning rate with gradual reducing
- Train ANNs in a stacking way using 5 folds and 30 repeats
- Keep track of your experiments using Neptune
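
The rule "reduce the learning rate by a factor of two if validation loss does not improve for two consecutive epochs" can be written as a small stateful helper. This is a minimal plain-Python sketch of that rule only. The class name and the `min_lr` floor are assumptions, and library schedulers offer more options than shown here.

```python
class HalveOnPlateau:
    """Halve the learning rate after `patience` epochs without improvement."""

    def __init__(self, lr, patience=2, factor=0.5, min_lr=1e-6):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.min_lr = min_lr  # hypothetical floor, not from the article
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


scheduler = HalveOnPlateau(lr=0.01)
val_losses = [1.0, 0.8, 0.85, 0.9, 0.7, 0.75, 0.72]
lr_history = [scheduler.step(loss) for loss in val_losses]
```

In this made-up loss sequence the rate is halved twice: once after the two epochs following 0.8, and once after the two epochs following 0.7.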

**Evaluation and cross-validation**

- Use a non-uniform split stratified by class
- Avoid overfitting by applying cross-validation while tuning the last layer
- 10-fold CV ensemble for classification
- Combination of 5 10-fold CV ensembles for detection
- Sklearn’s stratified K fold function
- 5-fold cross-validation
- Adversarial Validation & Weighting
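
Stratified K-fold keeps the class ratio roughly the same in every fold. In practice you would normally reach for scikit-learn's `StratifiedKFold`. The hand-rolled sketch below only shows the underlying idea, and its function name, seed handling and round-robin dealing are choices made for this example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign sample indices to folds so class ratios stay similar per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal this class's samples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Hypothetical imbalanced labels: 20 negatives and 10 positives.
labels = [0] * 20 + [1] * 10
folds = stratified_folds(labels, n_splits=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

Each fold serves once as the validation set while the remaining folds form the training set. With many small classes this simple dealing scheme can leave the first folds slightly larger than the rest.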

**Ensembling methods**

- Use simple majority voting for ensemble
- XGBoost on the max malignancy at 3 zoom levels, the z-location and the amount of strange tissue
- LightGBM for models with too many classes. This was done for raw data features only.
- CatBoost for a second-layer model
- Training with 7 features for the gradient boosting classifier
- Use ‘curriculum learning’ to speed up model training. In this technique, models are first trained on simple samples and then progressively move on to harder ones.
- Ensemble with ResNet50, InceptionV3, and InceptionResNetV2
- Ensemble method for object detection
- An ensemble of Mask RCNN, YOLOv3, and Faster RCNN architectures with a classification network (DenseNet-121 architecture)
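
Simple majority voting, the first item in this list, needs nothing more than a counter per sample. The sketch below assumes each model has already produced one class label per sample. The model names and labels are invented for the example.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model class predictions sample by sample."""
    n_samples = len(predictions_per_model[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(model_preds[i] for model_preds in predictions_per_model)
        # Take the most frequent label; how ties are broken depends on
        # Counter's ordering, so use an odd number of models to avoid them.
        voted.append(votes.most_common(1)[0][0])
    return voted


# Hypothetical predictions from three models on three samples.
model_a = ["cat", "dog", "dog"]
model_b = ["cat", "cat", "dog"]
model_c = ["dog", "dog", "dog"]
ensemble = majority_vote([model_a, model_b, model_c])
```

Voting discards each model's confidence. When probabilities are available, averaging them is a common alternative.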

**Post processing**

- Apply test time augmentation: present an image to a model several times with different random transformations and average the predictions you get
- Equalize test prediction probabilities instead of only using predicted classes
- Apply geometric mean to the predictions
- Overlap tiles during inference so that each edge pixel is covered at least thrice, because UNET tends to have bad predictions around edge areas
- Non-maximum suppression and bounding box shrinkage
- Watershed post processing to detach objects in instance segmentation problems.
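
Non-maximum suppression, together with the score threshold mentioned under training tips, can be sketched as follows. This is the plain greedy variant written without any framework. The box format `(x1, y1, x2, y2)`, the function names and the threshold values are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5, score_threshold=0.0):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    # Drop low-confidence boxes, then visit the rest by descending score.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    order = sorted(candidates, key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two overlapping boxes, one separate, one weak.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.05]
kept = nms(boxes, scores, iou_threshold=0.5, score_threshold=0.1)
```

Both thresholds are inference-time hyperparameters, so they are worth tuning on validation data rather than fixing in advance.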

**Final thoughts**

Hopefully, this article has given you some background on image segmentation tips and tricks, along with some tools and frameworks that you can use to start competing.

We’ve covered tips on:

- architectures
- training tricks
- losses
- pre-processing
- post-processing
- ensembling
- tools and frameworks

If you want to go deeper down the rabbit hole, simply follow the links and see how the best image segmentation models are built.

Happy segmenting!

**Additional resource**

Understanding Categorical Cross-Entropy Loss, Binary Cross-Entropy Loss, Softmax Loss, Logistic Loss, Focal Loss and all those confusing names
