Imagine if you could get all the tips and tricks you need to hammer a Kaggle competition. I have gone over 39 Kaggle competitions including

- Data Science Bowl 2017 – $1,000,000
- Intel & MobileODT Cervical Cancer Screening – $100,000
- 2018 Data Science Bowl – $100,000
- Airbus Ship Detection Challenge – $60,000
- Planet: Understanding the Amazon from Space – $60,000
- APTOS 2019 Blindness Detection – $50,000
- Human Protein Atlas Image Classification – $37,000
- SIIM-ACR Pneumothorax Segmentation – $30,000
- Inclusive Images Challenge – $25,000

– and extracted that knowledge for you. Dig in.

## Contents

**External Data**

- Use of the LUng Nodule Analysis (LUNA) Grand Challenge data because it contains detailed annotations from radiologists
- Use of the LIDC-IDRI data because it had radiologist descriptions of each tumor that they found
- Use Flickr CC, Wikipedia Commons datasets
- Use Human Protein Atlas Dataset
- Use IDRiD dataset

**Data Exploration and Gaining insights**

- Clustering of 3d segmentation with the 0.5 threshold
- Identify if there is a substantial difference in train/test label distributions
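
One way to quantify a train/test label shift is to compare class frequencies directly. The sketch below is a minimal illustration, not a method taken from any of the competitions: the helper names are hypothetical, and because test labels are usually hidden, the second list is assumed to come from a hold-out set or from model predictions on the test data.

```python
from collections import Counter


def label_distribution(labels):
    """Return the relative frequency of each class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute frequency differences (0 = identical, 1 = disjoint)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in labels)


# Toy labels, purely illustrative.
train_labels = [0, 0, 0, 1, 1, 2]
test_labels = [0, 1, 1, 1, 2, 2]

shift = total_variation_distance(
    label_distribution(train_labels), label_distribution(test_labels)
)
print(f"total variation distance: {shift:.3f}")
```

A value close to 0 suggests similar distributions; a large value is a hint that validation splits or sampling weights may need adjusting.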

**Preprocessing**

- Perform blob detection using the Difference of Gaussians (DoG) method, with the implementation available in the skimage package
- Use of patch-based inputs for training in order to reduce the time of training
- Use cudf for loading data instead of Pandas because it has a faster reader
- Ensure that all the images have the same orientation
- Apply contrast limited adaptive histogram equalization
- Use OpenCV for all general image preprocessing
- Employ automatic active learning and adding manual annotations
- Resize all images to the same resolution in order to apply the same model to scans of different thicknesses
- Convert scan images into normalized 3D numpy arrays
- Apply single Image Haze Removal using Dark Channel Prior
- Convert all data to Hounsfield units
- Find duplicate images using pair-wise correlation on RGBY
- Make labels more balanced by developing a sampler
- Apply pseudo labeling to test data in order to improve score
- Scale down images/masks to 320×480
- Histogram equalization (CLAHE) with kernel size 32×32
- Convert DCM to PNG
- Calculate the md5 hash of each image to detect duplicate images
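
The md5 tip can be sketched with the standard library alone. This is an illustrative example under simple assumptions: images are already loaded as raw bytes, and the function name and file names are made up. Hashing only catches byte-identical files; near-duplicates need something like the pair-wise correlation mentioned above.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(images):
    """Group image names whose raw bytes share the same md5 digest.

    images: dict mapping a file name to the file's bytes.
    Returns a list of name groups, each containing two or more names.
    """
    groups = defaultdict(list)
    for name, data in images.items():
        groups[hashlib.md5(data).hexdigest()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Toy byte strings standing in for real image files.
images = {"a.png": b"\x00\x01", "b.png": b"\x00\x01", "c.png": b"\x02"}
duplicates = find_exact_duplicates(images)
print(duplicates)
```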

**Data Augmentations**

- Use albumentations package for augmentations
- Apply random rotation by 90 degrees
- Use horizontal, vertical or both flips
- Attempt heavy geometric transformations: Elastic Transform, PerspectiveTransform, Piecewise Affine transforms, pincushion distortion
- Apply random HSV
- Use of loss-less augmentation for generalization to prevent loss of useful image information
- Apply channel shuffling
- Do data augmentation based on class frequency
- Apply gaussian noise
- Use lossless permutations of 3D images for data augmentation
- Rotate by a random angle from 0 to 45 degrees
- Scale by a random factor from 0.8 to 1.2
- Brightness changing
- Randomly change hue, saturation and value
- Apply D4 augmentations
- Contrast limited adaptive histogram equalization
- Use the AutoAugment augmentation strategy
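
D4 augmentations are the eight symmetries of a square: four rotations by multiples of 90 degrees, each with or without a flip. Libraries such as albumentations cover these, but a plain NumPy sketch makes the idea concrete. The function name is hypothetical and the code assumes square images with the spatial axes first.

```python
import numpy as np


def d4_transforms(image):
    """Return the 8 dihedral (D4) views of an image with shape (H, W) or (H, W, C)."""
    views = []
    for k in range(4):
        rotated = np.rot90(image, k, axes=(0, 1))  # rotate by k * 90 degrees
        views.append(rotated)
        views.append(np.flip(rotated, axis=1))     # same rotation, mirrored
    return views


image = np.arange(9).reshape(3, 3)
views = d4_transforms(image)
print(len(views))
```

These transforms only rearrange pixels, so they are lossless, and the same transform should be applied to the mask as to the image.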

**Modeling**

**Architectures**

- Use of a U-net based architecture. Adopted the concepts and applied them to 3D input tensors
- Employing automatic active learning and adding manual annotations
- The inception-ResNet v2 architecture for training features with different receptive fields
- Siamese networks with adversarial training
- ResNet50, Xception, Inception ResNet v2 x 5 with Dense (FC) layer as the final layer
- Use of a global max-pooling layer which returns a fixed-length output no matter the input size
- Use of stacked dilated convolutions
- VoxelNet
- Replace plus sign in LinkNet skip connections with concat and conv1x1
- Generalized mean pooling
- Keras NASNetLarge to train the model from scratch using 224x224x3
- Use of the 3D convnet to slide over the images
- Imagenet-pre-trained ResNet152 as the feature extractor
- Replace the final fully-connected layers of ResNet by 3 fully connected layers with dropout
- Use ConvTranspose in the decoder
- Applying the VGG baseline architecture
- Implementing the C3D network with adjusted receptive fields and a 64 unit bottleneck layer on the end of the network
- Use of UNet type architectures with pre-trained weights to improve convergence and performance of binary segmentation on 8-bit RGB input images
- LinkNet since it’s fast and memory efficient
- MASKRCNN
- BN-Inception
- Fast Point R-CNN
- Seresnext
- UNet and Deeplabv3
- Faster RCNN
- SENet154
- ResNet152
- NASNet-A-Large
- EfficientNetB4
- ResNet101
- GAPNet
- PNASNet-5-Large
- Densenet121
- AC-GAN
- XceptionNet (96), XceptionNet (299), Inception v3 (139), InceptionResNet v2 (299), DenseNet121 (224)
- AlbuNet (resnet34) from ternausnets
- SpaceNet
- Resnet50 from selim_sef SpaceNet 4
- SCSEUnet (seresnext50) from selim_sef SpaceNet 4
- A custom Unet and Linknet architecture
- FPNetResNet50 (5 folds)
- FPNetResNet101 (5 folds)
- FPNetResNet101 (7 folds with different seeds)
- PANetDilatedResNet34 (4 folds)
- PANetResNet50 (4 folds)
- EMANetResNet101 (2 folds)
- RetinaNet
- Deformable R-FCN
- Deformable Relation Networks
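
Generalized mean (GeM) pooling, listed above, sits between average and max pooling: with exponent p = 1 it is plain average pooling, and as p grows it approaches global max pooling. Here is a small NumPy sketch of the formula. The function name, the default p = 3 and the clipping constant are assumptions for illustration, and in a real network p is often a learnable parameter.

```python
import numpy as np


def gem_pool(features, p=3.0, eps=1e-6):
    """Generalized mean pooling over the spatial axes.

    features: array of shape (C, H, W) with non-negative activations.
    Returns an array of shape (C,), whatever H and W are.
    """
    clipped = np.clip(features, eps, None)  # avoid zero or negative bases
    return np.mean(clipped ** p, axis=(1, 2)) ** (1.0 / p)


features = np.array([[[1.0, 2.0], [3.0, 4.0]]])  # one channel, 2x2 map
print(gem_pool(features, p=1.0))   # equals average pooling
print(gem_pool(features, p=3.0))   # between average and max
print(gem_pool(features, p=50.0))  # close to max pooling
```

Like global max pooling, this returns a fixed-length vector regardless of the input resolution.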

**Hardware Setups**

- Use of the AWS GPU instance p2.xlarge with a NVIDIA K80 GPU
- Pascal Titan-X GPU
- Use of 8 TITAN X GPUs
- 6 GPUs: 2×1080Ti + 4×1080
- Server with 8×NVIDIA Tesla P40, 256 GB RAM and 28 CPU cores
- Intel Core i7 5930k, 2×1080, 64 GB of RAM, 2x512GB SSD, 3TB HDD
- GCP 1x P100, 8x CPU, 15 GB RAM, SSD or 2x P100, 16x CPU, 30 GB RAM
- NVIDIA Tesla P100 GPU with 16GB of RAM
- 980Ti GPU, 2600k CPU, and 14GB RAM

**Loss Functions**

- Dice Coefficient because it works well with imbalanced data
- Weighted boundary loss whose aim is to reduce the distance between the predicted segmentation and the ground truth
- MultiLabelSoftMarginLoss that creates a criterion that optimizes a multi-label one-versus-all loss based on max-entropy, between input and target
- Balanced cross entropy (BCE) with logit loss that involves weighing the positive and negative examples by a certain coefficient
- Lovasz that performs direct optimization of the mean intersection-over-union loss in neural networks based on the convex Lovasz extension of sub-modular losses
- FocalLoss + Lovasz obtained by summing the Focal and Lovasz losses
- Arc margin loss that incorporates margin in order to maximise face class separability
- Npairs loss that computes the npairs loss between y_true and y_pred.
- A combination of BCE and Dice loss functions
- LSEP – a pairwise ranking loss that is smooth everywhere and thus is easier to optimize
- Center loss that simultaneously learns a center for deep features of each class and penalizes the distances between the deep features and their corresponding class centers
- Ring Loss that augments standard loss functions such as Softmax
- Hard triplet loss that trains a network to embed features of the same class close together while maximizing the embedding distance between different classes
- 1 + BCE – Dice, which subtracts the Dice term from the BCE loss and then adds 1
- Binary cross-entropy – log(dice), that is, the binary cross-entropy minus the log of the Dice coefficient
- Combinations of BCE, dice and focal
- Lovasz loss that performs direct optimization of the mean intersection-over-union loss
- BCE + DICE, where the Dice loss is obtained by calculating a smooth Dice coefficient function
- Focal loss with Gamma 2 that is an improvement to the standard cross-entropy criterion
- BCE + DICE + Focal – this is basically a summation of the three loss functions
- Active Contour Loss that incorporates the area and size information and integrates the information in a dense deep learning model
- 1024 * BCE(results, masks) + BCE(cls, cls_target)
- Focal + kappa – Kappa is a loss function for multi-class classification of ordinal data in deep learning. In this case we sum it and the focal loss
- ArcFaceLoss — Additive Angular Margin Loss for Deep Face Recognition
- soft Dice trained on positives only – Soft Dice uses predicted probabilities
- 2.7 * BCE(pred_mask, gt_mask) + 0.9 * DICE(pred_mask, gt_mask) + 0.1 * BCE(pred_empty, gt_empty) which is a custom loss used by the Kaggler
- nn.SmoothL1Loss()
- Use of the Mean Squared Error objective function in scenarios where it seems to work better than the binary cross-entropy objective function
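
Several of the losses above are weighted sums of binary cross-entropy and a smooth (soft) Dice loss. The NumPy sketch below shows one common formulation so the arithmetic is easy to follow. It is not the exact code of any winning solution: the function names, the smoothing constant and the default weights are assumptions, and in practice you would write this in your deep learning framework so that gradients flow.

```python
import numpy as np


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy on predicted probabilities."""
    probs = np.clip(probs, eps, 1.0 - eps)  # keep log() finite
    return float(-np.mean(targets * np.log(probs)
                          + (1.0 - targets) * np.log(1.0 - probs)))


def soft_dice_loss(probs, targets, smooth=1.0):
    """1 minus the smooth Dice coefficient, computed on probabilities."""
    intersection = np.sum(probs * targets)
    dice = (2.0 * intersection + smooth) / (np.sum(probs) + np.sum(targets) + smooth)
    return float(1.0 - dice)


def bce_dice_loss(probs, targets, bce_weight=1.0, dice_weight=1.0):
    """Weighted sum of BCE and soft Dice loss."""
    return bce_weight * bce(probs, targets) + dice_weight * soft_dice_loss(probs, targets)


targets = np.array([[0.0, 1.0], [1.0, 0.0]])
perfect = bce_dice_loss(targets, targets)        # prediction matches the mask
inverted = bce_dice_loss(1.0 - targets, targets) # prediction is exactly wrong
print(perfect, inverted)
```

The mask part of a weighted variant such as 2.7 * BCE + 0.9 * DICE corresponds to calling `bce_dice_loss(probs, targets, bce_weight=2.7, dice_weight=0.9)`.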

**Training tips**

- Try different learning rates
- Try different batch sizes
- Use SGD with momentum with manual rate scheduling
- Too much augmentation will reduce the accuracy
- Train on image crops and predict on full images
- Use of Keras’s ReduceLROnPlateau() to reduce the learning rate
- Train without augmentation until plateau then apply soft and hard augmentation to some epochs
- Freeze all layers except the last one and use 1000 images from Stage1 for tuning
- Make labels more balanced by developing a sampler
- Use of class aware sampling
- Use dropout and augmentation while tuning the last layer
- Pseudo Labeling to improve score
- Use Adam reducing LR on plateau with patience 2–4
- Use Cyclic LR with SGD
- Reduce the learning rate by a factor of two if validation loss does not improve for two consecutive epochs
- Repeat the worst batch out of 10 batches
- Train with default UNET
- Overlap tiles so that each edge pixel is covered twice
- Hyperparameter tuning: learning rate on training, non-maximum suppression and score threshold on inference
- Remove bounding boxes with low confidence scores
- Train different convolutional neural networks then build an ensemble
- Stop training when the F1 score is decreasing
- Differential learning rate with gradual reducing
- Train ANNs in a stacking way using 5 folds and 30 repeats
- Keep track of your experiments using Neptune
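
The rule "halve the learning rate if validation loss does not improve for two consecutive epochs" is simple enough to write out by hand. The class below is a framework-free sketch of that logic with hypothetical names. It is not Keras's ReduceLROnPlateau implementation, which offers more options, but it follows the same idea.

```python
class HalveOnPlateau:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr, patience=2, factor=0.5, min_lr=1e-6):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0  # start counting again after a reduction
        return self.lr


# Made-up validation losses for illustration.
scheduler = HalveOnPlateau(lr=0.1)
val_losses = [1.0, 0.9, 0.95, 0.92, 0.8, 0.85, 0.9]
lrs = [scheduler.step(loss) for loss in val_losses]
print(lrs)
```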

**Evaluation and cross-validation**

- Use a non-uniform split stratified by class
- Avoid overfitting by applying cross-validation while tuning the last layer
- 10-fold CV ensemble for classification
- Combination of 5 10-fold CV ensembles for detection
- Sklearn’s stratified K fold function
- 5 KFold Cross-Validation
- Adversarial Validation & Weighting
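
For the stratified K-fold tip, scikit-learn's `StratifiedKFold` keeps the class ratio roughly constant in every fold. The snippet below is a minimal sketch on toy labels. The data, the number of splits and the random seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 10 samples of class 0 and 5 of class 1.
labels = np.array([0] * 10 + [1] * 5)
features = np.zeros((len(labels), 1))  # placeholder, only the labels drive the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(skf.split(features, labels))

for fold, (train_idx, val_idx) in enumerate(folds):
    # Each validation fold should keep the 2:1 class ratio.
    print(fold, np.bincount(labels[val_idx], minlength=2))
```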

**Ensembling methods**

- Use simple majority voting for ensemble
- XGBoost on the max malignancy at 3 zoom levels, the z-location and the amount of strange tissue
- LightGBM for models with too many classes. This was done for raw data features only.
- CatBoost for a second-layer model
- Training with 7 features for the gradient boosting classifier
- Use ‘curriculum learning’ to speed up model training. In this technique, models are first trained on simple samples and then progressively move to hard ones.
- Ensemble with ResNet50, InceptionV3, and InceptionResNetV2
- Ensemble method for object detection
- An ensemble of Mask RCNN, YOLOv3, and Faster RCNN architectures with a classification network (DenseNet-121 architecture)
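
Simple majority voting is the easiest ensemble to reproduce. Below is a small NumPy sketch that assumes every model outputs one integer class label per sample. The function name is hypothetical, and ties go to the lowest class index, which is just one possible convention.

```python
import numpy as np


def majority_vote(predictions):
    """Combine class predictions from several models by majority vote.

    predictions: array-like of shape (n_models, n_samples) with integer labels.
    Returns an array of shape (n_samples,).
    """
    predictions = np.asarray(predictions)
    n_classes = int(predictions.max()) + 1
    # votes[c, i] = number of models that predicted class c for sample i
    votes = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    return votes.argmax(axis=0)


model_predictions = [
    [0, 1, 2, 1],  # model A
    [0, 1, 1, 1],  # model B
    [1, 1, 2, 0],  # model C
]
voted = majority_vote(model_predictions)
print(voted)
```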

**Post Processing**

- Apply test time augmentation: present an image to a model several times with different random transformations and average the predictions you get
- Equalize test prediction probabilities instead of only using predicted classes
- Apply geometric mean to the predictions
- Overlap tiles during inferencing so that each edge pixel is covered at least thrice because UNET tends to have bad predictions around edge areas.
- Non-maximum suppression and bounding box shrinkage
- Watershed post processing to detach objects in instance segmentation problems.
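
Test time augmentation and geometric-mean blending can both be sketched in a few lines of NumPy. The example assumes a hypothetical `predict_fn` that maps a 2D image to a probability mask of the same shape. It uses fixed flips rather than random transformations so that each prediction can be mapped back exactly before averaging.

```python
import numpy as np


def predict_with_flip_tta(predict_fn, image):
    """Average predictions over the identity, a horizontal flip and a vertical flip."""
    transforms = [
        (lambda x: x, lambda y: y),  # identity
        (np.fliplr, np.fliplr),      # a flip is its own inverse
        (np.flipud, np.flipud),
    ]
    predictions = []
    for forward, inverse in transforms:
        # Transform the input, predict, then undo the transform on the output.
        predictions.append(inverse(predict_fn(forward(image))))
    return np.mean(predictions, axis=0)


def geometric_mean(prob_list, eps=1e-7):
    """Blend several probability maps with a geometric mean."""
    stacked = np.clip(np.stack(prob_list), eps, 1.0)  # avoid log(0)
    return np.exp(np.mean(np.log(stacked), axis=0))


# A stand-in "model" that simply returns its input as probabilities.
image = np.array([[0.1, 0.9], [0.4, 0.6]])
tta_prediction = predict_with_flip_tta(lambda x: x, image)
blended = geometric_mean([np.array([0.25]), np.array([1.0])])
print(tta_prediction, blended)
```

For segmentation, undoing the transform on the output is the important step: without it the averaged masks would not line up.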

**Final Thoughts**

Hopefully, this article has given you some background on image segmentation tips and tricks, along with some tools and frameworks that you can use to start competing.

We’ve covered tips on:

- architectures
- training tricks
- losses
- pre-processing
- post-processing
- ensembling
- tools and frameworks

If you want to go deeper down the rabbit hole, simply follow the links and see how the best image segmentation models are built.

Happy segmenting!

### Derrick Mwiti

Data Scientist | Author | Mentor