MLOps Blog

Depth Estimation Models with Fully Convolutional Residual Networks (FCRN)

8 min
Aymane Hachcham
19th April, 2023

Estimating and evaluating the depth of a 2D scene is a difficult computer vision task. First, you need costly equipment to produce a depth map. Bulky systems like 3D stereoscopic sensors for vision, motion, and light projection are the most reliable technologies used nowadays, and they depend on external factors that need to be additionally assessed to produce accurate renderings.

If you don’t want to carry and manipulate a large set of equipment for this one task, there is another solution. A lot of work has gone into building compact systems that unify and process all the functionalities delivered separately by each piece of equipment. A great example is light field cameras, which account for plenoptic imaging distortion.

Depth estimation has a wide range of uses, which is why a lot of research goes on in this field. Some of the most known uses of depth estimation are: 

  • 3D rendering and 3D motion capture,
  • Self-driving cars,
  • Robotics (including robot-assisted surgery),
  • 2D and 3D film conversions,
  • 3D rendering and shadow-mapping in computer graphics and the gaming industry.
Depth estimation rendering for a video | Source: Deep Learning approach to Depth prediction, Google AI

Different approaches for the same objective 

Recently, several approaches have been engineered for depth estimation, and deep learning has once again proven its ability to solve the problem and respond to its various challenges. Most of these methods take a single point of view (a 2D image) and optimize a regression against the reference depth map.

Multiple neural architectures have been tested, and some major implementations have paved the way for future research, which makes them state-of-the-art techniques in the field. Let’s take a look at some of them.

Deeper Depth Prediction with Fully Convolutional Residual Networks

This approach addresses the problem by leveraging fully convolutional architectures returning the depth map of a 2D scene from an RGB image.

The proposed architecture includes fully convolutional layers, transpose-convolutions, and efficient residual up-sampling blocks that are well suited to this high-dimensional regression problem.

The research paper introducing the model explains how this original architecture deals with the ambiguous mapping between monocular images and depth maps using residual learning. The reverse Huber (berHu) loss function is used for optimization, and the model can run on images and videos at a steady frame rate of 30 FPS.
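
For clarity, here is a minimal sketch of the reverse Huber (berHu) loss in PyTorch, following the formulation in the paper: the error is penalized linearly below a threshold c and quadratically above it, with c set to 20% of the maximum absolute error in the batch. This is an illustrative snippet, not the authors' code.

import torch

def berhu_loss(pred, target):
    # Reverse Huber (berHu) loss: L1 below the threshold c, scaled L2 above it
    diff = torch.abs(target - pred)
    c = torch.clamp(0.2 * diff.max(), min=1e-6)   # threshold: 20% of the max absolute error
    l2 = (diff ** 2 + c ** 2) / (2 * c)           # quadratic branch for large errors
    return torch.where(diff <= c, diff, l2).mean()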

Depth-Estimation FCRN
Depth prediction on NYU Depth: qualitative results | Source: official research paper

The network starts with a ResNet-50 block initialized with pre-trained weights, followed by a sequence of unpooling and convolutional layers that let the network learn its own upscaling. A dropout layer and a final convolution are placed at the end to yield the predicted depth map.
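
To make that layout concrete, here is a skeletal sketch of the structure in PyTorch. The up-sampling block is passed in as a parameter (it could be any of the decoder blocks discussed later), and the layer sizes follow the description above; this is an assumption-laden outline, not the paper's exact implementation.

import torch.nn as nn
import torchvision

class FCRNSkeleton(nn.Module):
    def __init__(self, up_block):
        super().__init__()
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")   # pre-trained encoder
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])     # drop avgpool + fc
        self.conv2 = nn.Conv2d(2048, 1024, kernel_size=1, bias=False)   # channel reduction
        self.bn2 = nn.BatchNorm2d(1024)
        # four up-sampling blocks, each halving the channel count: 1024 -> 512 -> 256 -> 128 -> 64
        self.decoder = nn.Sequential(up_block(1024), up_block(512), up_block(256), up_block(128))
        self.drop = nn.Dropout2d(p=0.5)
        self.pred = nn.Conv2d(64, 1, kernel_size=3, padding=1)          # final depth prediction

    def forward(self, x):
        x = self.bn2(self.conv2(self.encoder(x)))
        x = self.decoder(x)
        return self.pred(self.drop(x))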

Depth Estimation network architecture
Network architecture: the proposed architecture builds upon ResNet-50 | Source: official research paper

More details for the overall architecture can be found in the official research paper: Deeper Depth Prediction with Fully Convolutional Residual Networks.

Unsupervised Learning of Depth and Ego-Motion from Video

This specific approach is entirely based on an unsupervised learning strategy, combining a single-view depth CNN (like the one previously shown) with a camera pose estimation CNN trained on unlabeled video sequences.

The approach is original and the first of its kind. The authors explain that the whole supervision pipeline for the training is based on view synthesis. Roughly explained, the network is fed a target view and outputs a per-pixel depth map. The target view is then synthesized from a nearby view, given the per-pixel depth map plus the pose and visibility of that nearby view.

Hence, the network jointly balances the CNN-based image synthesis and the pose estimation module.
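
In pseudo-PyTorch, the supervision signal can be sketched as a photometric reconstruction loss; synthesize_view stands for a hypothetical differentiable warping helper (projection with the camera intrinsics plus bilinear sampling), not a real API:

import torch

def view_synthesis_loss(target_img, source_imgs, depth, poses, synthesize_view):
    # Reconstruct the target view from each nearby source view using the predicted
    # depth and relative pose, then penalize the per-pixel reconstruction error.
    loss = 0.0
    for src, pose in zip(source_imgs, poses):
        reconstructed = synthesize_view(src, depth, pose)
        loss = loss + torch.mean(torch.abs(target_img - reconstructed))
    return loss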

Unsupervised-Depth-Ego-Vision
Test of the single-view depth and multi-view pose estimation | Source: official research paper

Note: The detailed explanation for the theory behind this specific architecture can be checked out in the original research paper published in 2017: Unsupervised Learning of Depth and Ego-Motion from Video.

Unsupervised Monocular Depth Estimation with Left-Right Consistency

This specific architecture is end-to-end and performs unsupervised monocular depth estimation without ground-truth data. As the title suggests, the training loss for this network enforces left-right depth consistency, which means that the network estimates depth by inferring the disparities that warp the left image to match the right one.

The mechanism that powers the network relies on generating disparities and consistently correcting them over the course of training. The left input image is used to infer both left-to-right and right-to-left disparities, and the network then reconstructs the predicted image by backward mapping with a bilinear sampler.
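
As an illustration, backward mapping with a bilinear sampler can be sketched with F.grid_sample; the disparity here is assumed to be normalized by the image width, and the sign convention depends on which view is being reconstructed:

import torch
import torch.nn.functional as F

def warp_with_disparity(img, disp):
    # img: (N, C, H, W) source image, disp: (N, 1, H, W) horizontal disparity in [0, 1] of image width
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1).to(img)
    shift = torch.zeros_like(grid)
    shift[..., 0] = 2.0 * disp.squeeze(1)            # scale the disparity to the [-1, 1] grid range
    return F.grid_sample(img, grid - shift, align_corners=True)   # bilinear backward mapping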

The authors have used a convolutional architecture quite similar to DispNet. It builds upon an encoder and decoder pattern. The decoder uses skip connections from the encoder’s activation blocks to resolve higher resolution details.

Depth estimation DispNet
DispNet architecture introduced in 2015 | Source: official paper

Finally, the network predicts two disparity maps: left-to-right and right-to-left.

FCRN: Fully Convolutional Residual Networks

FCRN is one of the most widely used models for on-device depth prediction. This architecture became famous when Apple implemented and integrated the model to work with the depth sensors of the front-facing camera in their iPhone lineup. Since the model is based on a CNN (ResNet-50), the compression and quantization processes for on-device deployment were relatively straightforward.

We’ll discuss some of the important components and building blocks of the FCRN architecture, and have a glimpse at the actual implementation in Pytorch. 

Implementation and building blocks    

It’s interesting to note that the overall FCRN architecture is highly inspired by the U-Net scheme. Both use three downsampling and three upsampling convolutional blocks with a fixed 3×3 filter size. Originally, U-Net was built with two convolutional layers in each block, and the number of filters was kept constant across all convolutional layers.

Conversely, in the FCRN implementation, the authors increased the number of subsequent layers to compensate for the loss of higher-resolution information caused by pooling.

Convolutions FCRN
Faster up-convolutions. Illustration of what we’ve explained above. The common up-convolutional steps: unpooling doubles a feature map’s size, filling the holes with zeros, and a 5×5 convolution filters this map | Source: official research paper

To start, let’s take a look at the basic building block of the FCRN network: the convolutional block, consisting of a convolutional layer, batch normalization, and an activation function.

from typing import Tuple

import torch.nn as nn


def conv_block(channels: Tuple[int, int],
               size: Tuple[int, int],
               stride: Tuple[int, int] = (1, 1),
               N: int = 1):
    """
    Create a block with N convolutional layers with ReLU activation function.
    The first layer is IN x OUT, and all others - OUT x OUT.
    Args:
        channels: (IN, OUT) - no. of input and output channels
        size: kernel size (fixed for all convolution in a block)
        stride: stride (fixed for all convolution in a block)
        N: no. of convolutional layers
    Returns:
        A sequential container of N convolutional layers.
    """
    # a single convolution + batch normalization + ReLU block
    block = lambda in_channels: nn.Sequential(
        nn.Conv2d(in_channels=in_channels,
                  out_channels=channels[1],
                  kernel_size=size,
                  stride=stride,
                  bias=False,
                  padding=(size[0] // 2, size[1] // 2)),
        nn.BatchNorm2d(num_features=channels[1]),
        nn.ReLU()
    )
    # create and return a sequential container of convolutional layers
    # input size = channels[0] for first block and channels[1] for all others
    return nn.Sequential(*[block(channels[bool(i)]) for i in range(N)])

The conv_block function creates N convolutional layers with OUT filters, each followed by batch normalization and a ReLU activation.
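
As a quick sanity check, the block can be used like this (a small usage sketch with an arbitrary input size):

import torch

block = conv_block(channels=(3, 32), size=(3, 3), N=2)   # 3 -> 32, then 32 -> 32
out = block(torch.randn(1, 3, 228, 304))
print(out.shape)   # torch.Size([1, 32, 228, 304]) - padding keeps the spatial size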

The overall FCRN architecture can be obtained by stacking together specifically defined blocks, like upsampling blocks, DeConv blocks, UpConv decoders, and FasterUpConv decoders, which rely on a pixel-shuffle-style interleaving for efficiency.
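
To give an intuition for the pixel-shuffle idea (an illustrative module, not the repo's exact FasterUpConv): a convolution produces four times the channels, and nn.PixelShuffle interleaves them into a feature map with twice the spatial resolution.

import torch.nn as nn

class PixelShuffleUpsample(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels * 4, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(upscale_factor=2)   # (N, 4C, H, W) -> (N, C, 2H, 2W)

    def forward(self, x):
        return self.shuffle(self.conv(x))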

I won’t describe in detail how each block is implemented, since the theory and practical details are dense enough to deserve an article of their own. However, the general framework can be depicted as follows:

self.model = nn.Sequential(
    # downsampling
    conv_block(channels=(input_filters, 32), size=(3, 3), N=N),
    nn.MaxPool2d(2),

    conv_block(channels=(32, 64), size=(3, 3), N=N),
    nn.MaxPool2d(2),

    conv_block(channels=(64, 128), size=(3, 3), N=N),
    nn.MaxPool2d(2),

    # "convolutional fully connected"
    DeConv(channels=(128, 512), size=(3, 3), N=N),

    # upsampling
    nn.Upsample(scale_factor=2),
    UpConv(channels=(512, 128), size=(3, 3), N=N),

    nn.Upsample(scale_factor=2),
    FasterUpConv(channels=(128, 64), size=(3, 3), N=N),

    nn.Upsample(scale_factor=2),
    conv_block(channels=(64, 1), size=(3, 3), N=N),
)

The DeConv and UpConv decoders consist of four convolutional block modules with a decreasing number of channels and an increasing feature map size. Their implementation in PyTorch can resemble the following:

import collections


class DeConv(Decoder):
    def __init__(self, in_channels, kernel_size):
        super(DeConv, self).__init__()

        def convt(in_channels):
            # transpose convolution that halves the channels and doubles the spatial size
            stride = 2
            padding = (kernel_size - 1) // 2
            output_padding = kernel_size % 2
            assert -2 - 2 * padding + kernel_size + output_padding == 0, "deconv parameters incorrect"

            module_name = "deconv{}".format(kernel_size)
            return nn.Sequential(collections.OrderedDict([
                (module_name, nn.ConvTranspose2d(in_channels, in_channels // 2, kernel_size,
                                                 stride, padding, output_padding, bias=False)),
                ('batchnorm', nn.BatchNorm2d(in_channels // 2)),
                ('relu', nn.ReLU(inplace=True)),
            ]))

        # four stacked transpose-convolution blocks, each halving the channel count
        self.layer1 = convt(in_channels)
        self.layer2 = convt(in_channels // 2)
        self.layer3 = convt(in_channels // (2 ** 2))
        self.layer4 = convt(in_channels // (2 ** 3))

class UpConv(Decoder):
    def upconv_module(self, in_channels):
        # UpConv module: unpool -> 5*5 conv -> batchnorm -> ReLU
        upconv = nn.Sequential(collections.OrderedDict([
            ('unpool', Unpool(in_channels)),
            ('conv', nn.Conv2d(in_channels, in_channels // 2, kernel_size=5, stride=1, padding=2, bias=False)),
            ('batchnorm', nn.BatchNorm2d(in_channels // 2)),
            ('relu', nn.ReLU()),
        ]))
        return upconv

    def __init__(self, in_channels):
        super(UpConv, self).__init__()
        self.layer1 = self.upconv_module(in_channels)
        self.layer2 = self.upconv_module(in_channels // 2)
        self.layer3 = self.upconv_module(in_channels // 4)
        self.layer4 = self.upconv_module(in_channels // 8)
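
Since the hyperparameters used later select the 'upproj' decoder, here is also a hedged sketch of an up-projection block as described in the paper: after unpooling, a 5×5 → 3×3 branch and a 5×5 projection branch are summed and passed through a ReLU. Unpool is the repo's zero-insertion unpooling helper, and the exact module in the repo may differ slightly.

class UpProjModule(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        out_channels = in_channels // 2
        self.unpool = Unpool(in_channels)
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=5, stride=1, padding=2, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=5, stride=1, padding=2, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.unpool(x)
        return self.relu(self.branch1(x) + self.branch2(x))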

If you’re interested and curious about the whole infrastructure, there’s a very good GitHub repo that implements the FCRN architecture with ResNet-50 and thoroughly follows the indications of the research paper: FCRN Pytorch implementation.

Training on NYU Depth V2 Dataset

This dataset accompanies the paper Indoor Segmentation and Support Inference from RGBD (RGB and Depth) Images. It contains video sequences from a wide variety of indoor 3D scenes recorded by both the RGB and depth cameras of the Microsoft Kinect.

Link: NYU Depth Dataset V2

Basically, it features:

  • 1449 densely labeled pairs of aligned RGB and depth images
  • 464 new scenes taken from 3 cities 
  • 407024 new unlabeled frames

The dataset contains multiple types of data:

  • Labeled: a subset of the video data accompanied by dense multi-class labels.
  • Raw: the raw RGB, depth, and accelerometer data as provided by the Microsoft Kinect.

To start working with the model, we’ll rely on an open-source GitHub implementation in PyTorch. The implementation is by Shane Wang (big shout out to him!), who provided an alternative version written entirely in PyTorch, carefully following the official paper’s indications regarding the model architecture and the training process.

In Shane’s own words: “This is a PyTorch implementation of Deeper Depth Prediction with Fully Convolutional Residual Networks. It can use Fully Convolutional Residual Networks to realize monocular depth prediction. Currently, we can train FCRN using NYUDepthv2 and Kitti Odometry Dataset.”

Installation guide

  • Clone the repo: git clone git@github.com:dontLoveBugs/FCRN_pytorch.git
  • Install the required dependencies: pip install matplotlib pillow tensorboardX torch torchvision

Configure the dataset path in the dataloaders folder

Download the NYU Depth V2 dataset, the labeled version (approximately 2.8 GB): download here.
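
The labeled release comes as a single MATLAB (HDF5) file, so it can be inspected directly with h5py; the file name and the exact array shapes below are assumptions based on the standard release.

import h5py

with h5py.File('nyu_depth_v2_labeled.mat', 'r') as f:
    images = f['images']    # RGB images, roughly (1449, 3, 640, 480) when read through h5py
    depths = f['depths']    # depth maps in meters, roughly (1449, 640, 480)
    print(images.shape, depths.shape)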

Let’s define the NYUDataset class (in nyu_dataloader), which loads the dataset from the root directory and performs data transformations and data augmentation.

import numpy as np
import dataloaders.transforms as transforms        # the repo's custom transform ops
from dataloaders.dataloader import MyDataloader    # the repo's base dataset class

height, width = 480, 640  # raw NYU image size

class NYUDataset(MyDataloader):
    def __init__(self, root, type, sparsifier=None, modality='rgb'):
        super(NYUDataset, self).__init__(root, type, sparsifier, modality)
        self.output_size = (228, 304)

    def train_transform(self, rgb, depth):
        s = np.random.uniform(1.0, 1.5)  # random scaling
        depth_np = depth / s
        angle = np.random.uniform(-5.0, 5.0)  # random rotation degrees
        do_flip = np.random.uniform(0.0, 1.0) < 0.5  # random horizontal flip

        # perform 1st step of data augmentation
        transform = transforms.Compose([
            transforms.Resize(250.0 / height),  # this is for computational efficiency, since rotation can be slow
            transforms.Rotate(angle),
            transforms.Resize(s),
            transforms.CenterCrop(self.output_size),
            transforms.HorizontalFlip(do_flip)
        ])
        rgb_np = transform(rgb)
        rgb_np = self.color_jitter(rgb_np)  # random color jittering
        rgb_np = np.asfarray(rgb_np, dtype='float') / 255
        depth_np = transform(depth_np)

        return rgb_np, depth_np
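
The dataset class can then be wrapped in a standard PyTorch DataLoader; the path and arguments below are illustrative and depend on how you configured the dataset folder.

from torch.utils.data import DataLoader

train_set = NYUDataset(root='Documents/FCRN/dataset', type='train')
train_loader = DataLoader(train_set, batch_size=128, shuffle=True,
                          num_workers=0, pin_memory=True)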

We also log all this information in our Neptune experiment, in order to keep track of the dataset directory and the transformations applied.

Start your experiment:

import neptune

run = neptune.init_run(
    project="aymane.hachcham/FCRN",  
    api_token="YourNeptuneApiToken" # your credentials
)
run['config/dataset/path'] = 'Documents/FCRN/dataset'
run['config/dataset/size'] = 407024
run['config/dataset/transforms/train'] = str(transforms.Compose([
        transforms.Rotate(angle),              # angle, do_flip: augmentation parameters, as in train_transform above
        transforms.Resize(250.0 / height),
        transforms.CenterCrop((228, 304)),
        transforms.HorizontalFlip(do_flip)
    ]))  # logged as a string, since Neptune fields expect primitive values
Depth estimation - Neptune
Neptune dataset configurations

Before starting the training, let’s log the hyperparameters of the model we want to use.

hparams = {
    'batch_size': 128,
    'decoder':'upproj',
    'epochs':10,
    'lr':0.01,
    'lr_patience':2,
    'manual_seed':1,
    'momentum':0.9,
    'print_freq':10,
    'resume':None,
    'weight_decay':0.0005,
    'workers':0
}
run["params"] = hparams

After logging all the required hyperparameters, we’ll launch the training session for the specified number of epochs. We’ll be logging the loss and all the metrics to Neptune to keep track of the progress.

def train(train_loader, model, criterion, optimizer, epoch, logger):
    average_meter = AverageMeter()
    model.train()  # switch to train mode
    end = time.time()

    batch_num = len(train_loader)

    for i, (input, target) in enumerate(train_loader):

        # itr_count += 1
        input, target = input.cuda(), target.cuda()
        torch.cuda.synchronize()
        data_time = time.time() - end

        # compute predictions
        end = time.time()

        pred = model(input)
        loss = criterion(pred, target)
        optimizer.zero_grad()
        loss.backward()  # compute gradient and do SGD step
        optimizer.step()
        torch.cuda.synchronize()
        gpu_time = time.time() - end

        # measure accuracy and record loss
        result = Result()
        result.evaluate(pred.data, target.data)
        acc = result.delta1  # delta1 threshold accuracy, used as the "accuracy" metric
        average_meter.update(result, gpu_time, data_time, input.size(0))
        end = time.time()
        # Logging the loss and Accuracy in Neptune
        run["training/batch/accuracy"].append(acc)
        run["training/batch/loss"].append(loss)

        if (i + 1) % args.print_freq == 0:
            current_step = epoch * batch_num + i
            logger.add_scalar('Train/RMSE', result.rmse, current_step)
            logger.add_scalar('Train/rml', result.absrel, current_step)
            logger.add_scalar('Train/Log10', result.lg10, current_step)
            logger.add_scalar('Train/Delta1', result.delta1, current_step)
            logger.add_scalar('Train/Delta2', result.delta2, current_step)
            logger.add_scalar('Train/Delta3', result.delta3, current_step)
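
To put it together, the loop above might be driven like this (a sketch with assumed names: criterion could be the berHu loss sketched earlier, and validate() is a hypothetical function returning the validation loss):

for epoch in range(hparams['epochs']):
    train(train_loader, model, criterion, optimizer, epoch, logger)
    scheduler.step(validate(val_loader, model))   # decay the learning rate when validation stalls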

The accuracy and loss curves in Neptune

We can clearly observe that, across the training sessions, the two metrics behave well, except for a gradual degradation in the accuracy curve caused by data interpolation, where the upper convolutional layers lose depth information during upsampling and are dominated by the index matrices produced by unpooling.

Even if we run new training sessions with a different number of epochs or slightly tweaked training data, the variations are not that big.

By increasing the number of epochs and reducing the number of data transformations, we get a slight improvement in the accuracy score and also eliminate the discrepancies caused by data interpolation.

A useful tip Shane suggests when training with a low number of epochs (< 100) is to decrease the learning rate and increase the learning-rate patience (lr_patience). That way, the weight updates are smaller and the learning rate decays less eagerly, leaving room for the model to adapt accordingly.

So the hyperparameters for the new run look like this:

hparams = {
    'batch_size': 128,
    'decoder':'upproj',
    'epochs':10,
    'lr':0.002,
    'lr_patience': 2.5,
    'manual_seed':1,
    'momentum':0.9,
    'print_freq':10,
    'resume':None,
    'weight_decay':0.0005,
    'workers':0
}

Here’s a comparison between the two runs:

Depth estimation - Neptune comparison
Blue chart: previous accuracy, Red chart: the improved one | See in the app

We clearly notice how hyperparameter tweaking and opting for longer training sessions with 30 epochs have largely contributed to improving the accuracy.

I’ll leave you the link for the project in Neptune. Don’t hesitate to check it out: FCRN Experiment

Final output

Once training is complete, we can take a look at the depth maps the model generates for unseen indoor images from the validation set.

To back up these results, the GitHub repo also provides a pretrained version of the same model we used here, along with a set of satisfying metrics confirming the quality of model inference.

Results

After successfully completing the training, the author presents some error-related metrics to assess the performance of his implementation in comparison with previous ones.

Clearly, his PyTorch implementation beats the previous attempts, slightly improving the rel metric (relative absolute error, i.e., the absolute error normalized by the ground-truth depth values) and the log10 metric (abs(log10(gt) – log10(pred)), where gt is the ground-truth depth map and pred is the predicted depth map).
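
These two metrics are easy to compute from the predicted and ground-truth depth maps; here is a minimal sketch (assuming strictly positive ground-truth depths):

import torch

def rel_and_log10(pred, gt):
    valid = gt > 0                               # ignore invalid (zero) depth pixels
    pred, gt = pred[valid], gt[valid]
    rel = torch.mean(torch.abs(gt - pred) / gt)  # relative absolute error
    log10 = torch.mean(torch.abs(torch.log10(gt) - torch.log10(pred)))
    return rel.item(), log10.item()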

Depth estimation - Error metrics
Error metrics on NYU Depth V2 | Source: FCRN Pytorch by Shane Wang 

Qualitative results

Depth estimation results
RGB images from the validation set, their ground truth, and the final results generated by the model

As explained in the official paper, fully convolutional residual networks improve accuracy thanks to their specific architecture, using enhanced up-projection blocks and up-sampling techniques that involve deconvolution with successive 2×2 kernels. Qualitatively, this method preserves more structure in the output when compared with other fully convolutional variants on the same dataset.

Conclusion and perspectives

We took a tour of the various techniques and architectures used in the field of depth estimation, and we trained an existing implementation of FCRN to demonstrate and observe the power of this approach in terms of qualitative results.

One interesting project you might want to try is to implement the on-device version of the FCRN model and create a small iOS application that performs depth estimation in real time. Core ML and Apple’s Vision framework already offer pre-trained FCRN model variants which can quickly be utilized for tasks involving the front-facing camera and the depth sensor. Maybe it’s something I can consider for a future article on the subject. Stay tuned!

As always, I’ll leave you with some helpful resources that you can look up for in-depth knowledge about the topic: