Binarized Neural Network (BNN) and Its Implementation in Machine Learning

Binarized Neural Networks (BNNs) come from a 2016 paper by Courbariaux, Hubara, Soudry, El-Yaniv and Bengio. It introduced a method for training neural networks in which weights and activations are binarized at train time, and the binarized values are then used to compute the gradients.

This way, memory size is reduced, and bitwise operations improve the power efficiency. GPUs consume huge amounts of power, making it difficult for neural networks to be trained on low-power devices. BNNs can reduce power consumption by more than 32 times.

The paper also showed that a binary matrix multiplication GPU kernel can reduce run time: with it, the MNIST BNN ran 7 times faster than with an unoptimized GPU kernel, with no loss in classification accuracy. The BNNs themselves achieved near state-of-the-art results.

In this article, we’ll see how Binarized Neural Networks work. We’ll dig into the algorithm, and look at the libraries that implement BNNs.

How Binarized Neural Networks work

Before we dig any deeper, let’s see how BNNs work.

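The paper uses two functions to binarize a value x (a weight or an activation): a deterministic one and a stochastic one.

$$x^b = \mathrm{Sign}(x) = \begin{cases} +1 & \text{if } x \ge 0, \\ -1 & \text{otherwise} \end{cases}$$
Eq. 1

Eq. 1 is the deterministic function: the signum function applied to a real-valued variable.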

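The stochastic function uses the hard sigmoid:

$$x^b = \begin{cases} +1 & \text{with probability } p = \sigma(x), \\ -1 & \text{with probability } 1 - p \end{cases}$$
Eq. 2

x^b in Eq. 1 and Eq. 2 is the binarized value of the real-valued variable x (a weight or an activation). Eq. 1 is quite straightforward.

In Eq. 2, σ is the hard sigmoid function:

$$\sigma(x) = \mathrm{clip}\left(\frac{x + 1}{2}, 0, 1\right) = \max\left(0, \min\left(1, \frac{x + 1}{2}\right)\right)$$
Eq. 3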

The deterministic function is used in most cases, except for a few experiments where stochastic is used with activations.
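
To make the two binarization functions concrete, here is a minimal sketch in plain Python. The function names (`hard_sigmoid`, `binarize_deterministic`, `binarize_stochastic`) are made up for illustration and do not come from the paper's code; the sketch works on single floats rather than tensors.

```python
import random


def hard_sigmoid(x: float) -> float:
    """Hard sigmoid: clip((x + 1) / 2, 0, 1)."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def binarize_deterministic(x: float) -> int:
    """Sign function: +1 for x >= 0, -1 otherwise."""
    return 1 if x >= 0 else -1


def binarize_stochastic(x: float, rng: random.Random) -> int:
    """Return +1 with probability hard_sigmoid(x), otherwise -1."""
    return 1 if rng.random() < hard_sigmoid(x) else -1


weights = [-1.7, -0.3, 0.0, 0.4, 2.5]
print([binarize_deterministic(w) for w in weights])  # [-1, -1, 1, 1, 1]

# For x = 0.5 the hard sigmoid is 0.75, so roughly 75% of samples should be +1.
rng = random.Random(0)
samples = [binarize_stochastic(0.5, rng) for _ in range(10_000)]
print(sum(s == 1 for s in samples) / len(samples))
```

The stochastic version needs a random number per value, which is one reason the deterministic function is the more practical choice.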

There are two more important aspects that make BNNs work, apart from binarizing the weights and activations:

  • For optimizers to work, you need real-valued weights, so the updates are accumulated in real-valued variables. Even though we use binarized weights/activations in the forward pass, we use real-valued weights for optimization.
  • Another problem is that the derivative of both binarization functions is zero almost everywhere, so during backpropagation the whole gradient becomes zero. The fix is the saturated STE (Straight-Through Estimator), which was previously introduced by Hinton and studied by Bengio. In the saturated STE, the derivative of signum is replaced by 1{|x| <= 1}: the gradient passes through unchanged (as if the derivative were 1) when |x| <= 1, and it is cancelled (set to zero) when |x| is too large.
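
The two points above can be sketched on a toy problem with a single weight. This is an illustration under stated assumptions, not the paper's implementation: the helper names (`ste_grad`, `train_step`) are hypothetical, the loss is a simple squared error, and the clipping of the real-valued weight to [-1, 1] after each update is added here following the paper's training algorithm.

```python
def sign(x: float) -> float:
    """Deterministic binarization: +1 for x >= 0, -1 otherwise."""
    return 1.0 if x >= 0 else -1.0


def ste_grad(x: float, upstream_grad: float) -> float:
    """Saturated STE: pass the gradient through when |x| <= 1, cancel it otherwise."""
    return upstream_grad if abs(x) <= 1.0 else 0.0


def clip(x: float, low: float = -1.0, high: float = 1.0) -> float:
    return max(low, min(high, x))


def train_step(w_real: float, a: float, target: float, lr: float = 0.1) -> float:
    """One SGD step on loss = 0.5 * (sign(w_real) * a - target) ** 2."""
    w_bin = sign(w_real)                        # forward pass uses the binarized weight
    y = w_bin * a
    grad_y = y - target                         # dL/dy
    grad_w_bin = grad_y * a                     # dL/dw_bin
    grad_w_real = ste_grad(w_real, grad_w_bin)  # STE stands in for the derivative of sign
    return clip(w_real - lr * grad_w_real)      # update the real-valued weight, keep it in [-1, 1]


w = 0.05
for step in range(3):
    w = train_step(w, a=1.0, target=-1.0)
    print(step, round(w, 3), sign(w))
```

The real-valued weight moves in small steps, and the binarized weight only flips once the real-valued one crosses zero. Once the prediction matches the target, the gradient is zero and the weight stops moving.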
Fig. 1 – A binary weight filter from the 1st convolutional layer of a BNN | Source

Shift-based Batch Normalization and shift-based AdaMax Optimization

The paper proposes alternatives to regular Batch Normalization and the Adam optimizer. Both involve a lot of multiplications. To speed up the process, they're replaced by shift-based methods, which approximate those multiplications with bitwise shift operations to save time. The BNN paper reports that no accuracy loss was observed when Batch Normalization and the Adam optimizer were replaced with shift-based Batch Normalization and shift-based AdaMax.
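
The core trick behind both shift-based methods is to round a scale factor to the nearest power of two, so that multiplying by it becomes a bit shift. The sketch below only illustrates that idea on integers; it is not the paper's full shift-based Batch Normalization or AdaMax, and the names `ap2` and `shift_multiply` are chosen here for illustration.

```python
import math


def ap2_exponent(x: float) -> int:
    """Exponent of the power of two closest to |x| on a log scale (x must be non-zero)."""
    return round(math.log2(abs(x)))


def ap2(x: float) -> float:
    """Approximate power of two: sign(x) * 2 ** round(log2(|x|))."""
    return math.copysign(2.0 ** ap2_exponent(x), x)


def shift_multiply(value: int, scale: float) -> int:
    """Approximate value * scale with a bit shift (assumes value >= 0 and scale > 0)."""
    exponent = ap2_exponent(scale)
    if exponent >= 0:
        return value << exponent
    return value >> -exponent


print(ap2(0.23), ap2(-3.7))      # 0.25 -4.0
print(shift_multiply(40, 0.23))  # 10, while the exact product is 9.2
print(shift_multiply(40, 3.7))   # 160, while the exact product is 148
```

The result is only an approximation of the exact product, which is why the paper's report of no observed accuracy loss is the interesting part.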

Speeding up the training

The BNN paper also introduces a method that speeds up the GPU implementation of BNNs. It improves time efficiency even more than using cuBLAS does.

cuBLAS is a CUDA toolkit library that provides GPU-accelerated basic linear algebra subroutines (BLAS).

A method called SWAR (SIMD within a register), which performs parallel operations within a single register, is used to speed up the calculations. It concatenates groups of 32 binary variables into 32-bit registers.

SWAR can evaluate these 32 connections in just 6 clock cycles on an Nvidia GPU, hence the theoretical speed improvement of 32/6 = 5.3 times. This only works because every variable is restricted to the two values +1 and -1, so each one fits in a single bit.
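
Here is a minimal sketch of the idea in plain Python, using an integer as a stand-in for a 32-bit register. The helper names (`pack_bits`, `binary_dot`) are hypothetical, and a real kernel would use hardware XNOR and popcount instructions rather than Python integers.

```python
import random


def pack_bits(values):
    """Pack a list of +1/-1 values into one integer, one bit per value (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_word: int, w_word: int, n_bits: int = 32) -> int:
    """Dot product of two packed +1/-1 vectors using XNOR and popcount."""
    mask = (1 << n_bits) - 1
    xnor = ~(a_word ^ w_word) & mask  # a bit is 1 where the two vectors agree
    matches = bin(xnor).count("1")    # popcount
    # Each agreeing pair contributes +1 and each disagreeing pair contributes -1.
    return matches - (n_bits - matches)


rng = random.Random(0)
a = [rng.choice((-1, 1)) for _ in range(32)]
w = [rng.choice((-1, 1)) for _ in range(32)]

# Both numbers printed here are the same dot product, computed two ways.
print(binary_dot(pack_bits(a), pack_bits(w)), sum(x * y for x, y in zip(a, w)))
```

One XNOR and one popcount replace 32 multiply-accumulate operations, which is where the speed-up comes from.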

Let’s see some performance stats:

Fig. 2 – Comparison between Baseline kernel, cuBLAS and the XNOR kernel for time and accuracy. | Source

As Fig. 2 shows, all three methods (the unoptimized baseline kernel, the cuBLAS library and the paper's XNOR kernel) reach the same accuracy, which is shown in the third section of the graph. The first section compares the time of an 8192 x 8192 x 8192 matrix multiplication. In the second section, the full MNIST test set is run through a multi-layer perceptron. The XNOR kernel clearly performs better: for the matrix multiplication, it is 23 times faster than the baseline kernel and 3.4 times faster than the cuBLAS kernel.

We can see there’s a smaller difference between cuBLAS and XNOR kernels while running MNIST test data. That’s because on the first layer, the values are not binary, so the baseline kernel is used for computation, thus resulting in a little delay. But it’s not that big of a problem, since the input image usually has only 3 channels, which means less computation.

Code

Let's look at some GitHub repos that implement BNNs.

The first two implementations were released with the original paper. One is in Lua (Torch), and the other is in Python, implemented with Theano.

Theano:

https://github.com/MatthieuCourbariaux/BinaryNet

Torch:

https://github.com/itayhubara/BinaryNet

PyTorch:

There's a PyTorch implementation from one of the authors of the BNN paper, which includes binarized versions of architectures like AlexNet, ResNet and VGG, with different numbers of layers (resnet18, resnet34, resnet50, etc.).

https://github.com/itayhubara/BinaryNet.pytorch

There’s no documentation, but the code is intuitive. In the subdirectory ‘models’ there are three binarized networks implemented: vgg, resnet and alexnet.

Use the file ‘data.py’ to send a custom dataset to the BNN network. There are also a lot of transformation options in ‘preprocess.py’.

Keras/TensorFlow:

One of the best packages I have seen so far is Larq, an open source package that makes building and training a Binarized Neural Network really easy.

The previously discussed packages come with pre-implemented networks that you can use. With Larq, you can also create new networks in a really easy way. It works just like the Keras API. For instance, if you want to add a binarized conv layer, you use 'larq.layers.QuantConv2D' instead of 'tf.keras.layers.Conv2D'.

The best thing about the package is that the documentation is really good, and the community is actively developing it, so the support is also good.

Even though it has great documentation, let's look at an example from it, so that you get the gist of how easy the library is to use.

import tensorflow as tf
import larq as lq

# Arguments shared by every binarized layer except the first one
kwargs = dict(input_quantizer="ste_sign",
              kernel_quantizer="ste_sign",
              kernel_constraint="weight_clip",
              use_bias=False)

model = tf.keras.models.Sequential([
    # First layer: no input_quantizer, because the input image is not binary
    lq.layers.QuantConv2D(128, 3,
                          kernel_quantizer="ste_sign",
                          kernel_constraint="weight_clip",
                          use_bias=False,
                          input_shape=(32, 32, 3)),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),
 
    lq.layers.QuantConv2D(128, 3, padding="same", **kwargs),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2)),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),
 
    lq.layers.QuantConv2D(256, 3, padding="same", **kwargs),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),
 
    lq.layers.QuantConv2D(256, 3, padding="same", **kwargs),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2)),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),
 
    lq.layers.QuantConv2D(512, 3, padding="same", **kwargs),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),
 
    lq.layers.QuantConv2D(512, 3, padding="same", **kwargs),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2)),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),
    tf.keras.layers.Flatten(),
 
    lq.layers.QuantDense(1024, **kwargs),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),
 
    lq.layers.QuantDense(1024, **kwargs),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),
 
    lq.layers.QuantDense(10, **kwargs),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),
    tf.keras.layers.Activation("softmax")
])

Note that the first layer has no input quantizer: as explained before, the input image is not binary, so only the weights of that layer are binarized. Let's look at the final architecture.

model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
quant_conv2d (QuantConv2D)   (None, 30, 30, 128)       3456      
_________________________________________________________________
batch_normalization (BatchNo (None, 30, 30, 128)       384       
_________________________________________________________________
quant_conv2d_1 (QuantConv2D) (None, 30, 30, 128)       147456    
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 15, 15, 128)       0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 15, 15, 128)       384       
_________________________________________________________________
quant_conv2d_2 (QuantConv2D) (None, 15, 15, 256)       294912    
_________________________________________________________________
batch_normalization_2 (Batch (None, 15, 15, 256)       768       
_________________________________________________________________
quant_conv2d_3 (QuantConv2D) (None, 15, 15, 256)       589824    
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 256)         0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 7, 7, 256)         768       
_________________________________________________________________
quant_conv2d_4 (QuantConv2D) (None, 7, 7, 512)         1179648   
_________________________________________________________________
batch_normalization_4 (Batch (None, 7, 7, 512)         1536      
_________________________________________________________________
quant_conv2d_5 (QuantConv2D) (None, 7, 7, 512)         2359296   
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 3, 3, 512)         0         
_________________________________________________________________
batch_normalization_5 (Batch (None, 3, 3, 512)         1536      
_________________________________________________________________
flatten (Flatten)            (None, 4608)              0         
_________________________________________________________________
quant_dense (QuantDense)     (None, 1024)              4718592   
_________________________________________________________________
batch_normalization_6 (Batch (None, 1024)              3072      
_________________________________________________________________
quant_dense_1 (QuantDense)   (None, 1024)              1048576   
_________________________________________________________________
batch_normalization_7 (Batch (None, 1024)              3072      
_________________________________________________________________
quant_dense_2 (QuantDense)   (None, 10)                10240     
_________________________________________________________________
batch_normalization_8 (Batch (None, 10)                30        
_________________________________________________________________
activation (Activation)      (None, 10)                0         
=================================================================
Total params: 10,363,550
Trainable params: 10,355,850
Non-trainable params: 7,700

And now you can train this just like you would train a normal neural network implemented in Keras.


Applications

BNNs are power efficient, so they can be used on low-power devices. This is one of their greatest advantages. You can use LCE (Larq Compute Engine) with TensorFlow Lite Java to run inference with binarized networks on Android while consuming less power.

Fig. 3 – Example of Image Classification using LCE Lite | Source

You can head over to the following link to read more about using BNNs on an Android device.

Conclusion

Deep networks require power-hungry GPUs, so it's difficult to train them on low-power devices. That makes the concept of Binarized Neural Networks seem promising.

They consume less power with little to no accuracy loss in the paper's experiments, and can be used to run DNNs on mobile devices. Seems pretty useful!

Thanks for reading.
