# Binarized Neural Network (BNN) and Its Implementation in Machine Learning

Binarized Neural Network (BNN) comes from a paper by Courbariaux, Hubara, Soudry, El-Yaniv and Bengio from 2016. It introduced a new method to train neural networks, where weights and activations are binarized at train time, and then used to compute the gradients.

Binarization shrinks memory use, and replacing most arithmetic with bitwise operations improves power efficiency. This matters because GPUs consume huge amounts of power, which makes it difficult to train or run neural networks on low-power devices. Compared with 32-bit networks, the paper expects BNNs to reduce energy consumption by more than 32 times.

The paper also showed that a binary matrix multiplication GPU kernel can cut run time: with it, the MNIST BNN ran 7 times faster than with an unoptimized GPU kernel, with no loss in classification accuracy, while achieving near state-of-the-art results.

In this article, we’ll see how Binarized Neural Networks work. We’ll dig into the algorithm, and look at the libraries that implement BNNs.

## How Binarized Neural Networks work

Before we dig any deeper, let’s see how BNNs work.

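In the paper, the authors use two functions to binarize a real-valued variable x (a weight or an activation): a deterministic one and a stochastic one.

Eq. 1 is the deterministic function, which applies the signum function to the real-valued variable:

$$x^b = \mathrm{Sign}(x) = \begin{cases} +1 & \text{if } x \ge 0, \\ -1 & \text{otherwise.} \end{cases} \qquad \text{(Eq. 1)}$$

The stochastic function uses the hard sigmoid:

$$x^b = \begin{cases} +1 & \text{with probability } p = \sigma(x), \\ -1 & \text{with probability } 1 - p. \end{cases} \qquad \text{(Eq. 2)}$$

In Eq. 1 and Eq. 2, $x^b$ is the binarized value of the real-valued variable x (weight/activation). Eq. 1 is quite straightforward.

In Eq. 2, $\sigma$ is the hard sigmoid function, $\sigma(x) = \mathrm{clip}\left(\frac{x+1}{2}, 0, 1\right) = \max\left(0, \min\left(1, \frac{x+1}{2}\right)\right)$.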

The deterministic function is used in most cases, except for a few experiments where stochastic is used with activations.
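As a rough illustration of the two binarization functions, here is a minimal pure-Python sketch. The function names (`hard_sigmoid`, `binarize_deterministic`, `binarize_stochastic`) are made up for this example and do not come from any of the libraries discussed later.

```python
import random


def hard_sigmoid(x: float) -> float:
    """Hard sigmoid: clip((x + 1) / 2, 0, 1)."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def binarize_deterministic(x: float) -> int:
    """Eq. 1: +1 when x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1


def binarize_stochastic(x: float, rng: random.Random) -> int:
    """Eq. 2: +1 with probability hard_sigmoid(x), otherwise -1."""
    return 1 if rng.random() < hard_sigmoid(x) else -1


values = [-1.5, -0.2, 0.0, 0.4, 2.0]
rng = random.Random(0)  # seeded only to make the demo repeatable

print([binarize_deterministic(v) for v in values])  # [-1, -1, 1, 1, 1]
print([binarize_stochastic(v, rng) for v in values])
```

Note that the stochastic version is only random inside the interval (-1, 1): values at or below -1 always map to -1, and values at or above 1 always map to +1.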

There are two more important aspects that make BNNs work, apart from binarizing the weights and activations:

• For optimizers to work, you need real-valued weights, so the updates are accumulated in real-valued variables. Even though the forward and backward passes use binarized weights and activations, the optimizer updates the real-valued weights.
• Another problem comes from the binarization functions themselves. The derivative of the sign function is zero almost everywhere, so exact backpropagation would make the whole gradient zero. The paper therefore uses a saturated STE (Straight-Through Estimator), which was previously introduced by Hinton and studied by Bengio. In the saturated STE, the derivative of signum is replaced by the indicator 1{|x| <= 1}: the gradient is passed through unchanged (as if the derivative were 1) when |x| <= 1, and it is cancelled (set to zero) when |x| is too large.

Fig. 1 – A binary weight filter from the 1st convolutional layer of BNN
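The two points above can be sketched together in one training step for a single linear neuron. This is a simplified pure-Python illustration, not the paper's full algorithm: the squared-error loss, the plain SGD update, and the helper names (`ste_grad`, `train_step`) are assumptions made for the example.

```python
def sign(x: float) -> int:
    """Deterministic binarization: +1 when x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1


def ste_grad(x: float, upstream: float) -> float:
    """Saturated STE: pass the gradient through when |x| <= 1, else zero it."""
    return upstream if abs(x) <= 1.0 else 0.0


def clip(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))


def train_step(latent_w, inputs, target, lr=0.1):
    """One SGD step: binary weights in the forward pass, real-valued update."""
    w_b = [sign(w) for w in latent_w]              # binarize for the forward pass
    y = sum(wb * x for wb, x in zip(w_b, inputs))  # neuron output
    dloss_dy = 2.0 * (y - target)                  # derivative of (y - target)^2
    new_w = []
    for w, x in zip(latent_w, inputs):
        grad_wb = dloss_dy * x                     # gradient w.r.t. the binary weight
        grad_w = ste_grad(w, grad_wb)              # STE carries it to the real weight
        new_w.append(clip(w - lr * grad_w))        # update, keep weight in [-1, 1]
    return new_w, y


latent = [0.3, -0.6, 0.05]  # real-valued weights kept by the optimizer
new_latent, output = train_step(latent, inputs=[1.0, -1.0, 1.0], target=1.0)
print(output)       # 3.0
print(new_latent)   # roughly [-0.1, -0.2, -0.35]
```

The small updates accumulate in the real-valued weights, and a binary weight only flips once its real-valued counterpart crosses zero, as happens to the first and third weights here.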

## Shift-based Batch Normalization and shift-based AdaMax Optimization

There’s an alternative to regular Batch Normalization and AdaMax optimization. Both BatchNorm and the AdaMax optimizer contain lots of multiplications. To speed up the process, they’re replaced by shift-based methods, which approximate those multiplications with bitwise shift operations to save time. The BNN paper claims that no accuracy loss is observed when Batch Normalization and the AdaMax optimizer are replaced with shift-based Batch Normalization and shift-based AdaMax.
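To give a feel for the idea, the sketch below approximates a multiplication by rounding one factor to the nearest power of two (in log scale) and then scaling by that power, which on integer hardware corresponds to a bit shift. It emulates the shift on floats with `math.ldexp`, and the helper names `ap2` and `shift_multiply` are chosen for this example; it is not the paper's full shift-based BatchNorm or AdaMax procedure.

```python
import math


def ap2(x: float) -> float:
    """Power-of-two approximation: sign(x) * 2 ** round(log2(|x|))."""
    if x == 0:
        return 0.0
    return math.copysign(2.0 ** round(math.log2(abs(x))), x)


def shift_multiply(a: float, b: float) -> float:
    """Approximate a * b by shifting a by round(log2(|b|)) binary places."""
    if b == 0:
        return 0.0
    k = round(math.log2(abs(b)))   # shift amount (negative means shift right)
    shifted = math.ldexp(a, k)     # a * 2**k, the float analogue of a bit shift
    return -shifted if b < 0 else shifted


print(ap2(0.3))                   # 0.25
print(ap2(-6.0))                  # -8.0
print(shift_multiply(3.0, 0.3))   # 0.75 (the exact product is 0.9)
```

The result is only an approximation of the true product, which is the trade-off these shift-based methods accept in exchange for cheaper operations.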

## Speeding up the training

The BNN paper also introduces a method that speeds up the GPU implementation of BNNs. It is more time-efficient than even cuBLAS.

cuBLAS is a CUDA toolkit library that provides GPU-accelerated basic linear algebra subroutines (BLAS).

The speedup relies on SWAR (SIMD within a register), a technique for performing parallel operations within a register. It packs groups of 32 binary variables into 32-bit registers.

SWAR can evaluate these 32 connections in just 6 clock cycles on an Nvidia GPU, hence the theoretical speed improvement of 32/6 ≈ 5.3 times. This only works because the variables are constrained to the two values +1 and -1: with one bit per value, multiplication becomes an XNOR and accumulation becomes a population count.
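The XNOR-and-popcount trick can be sketched in plain Python, using an integer as a stand-in for a 32-bit register. This is an illustration of the arithmetic only, not the paper's GPU kernel; `pack_bits` and `xnor_popcount_dot` are names invented for the example, and the encoding (bit set to 1 for +1, 0 for -1) is an assumption.

```python
import random


def pack_bits(values) -> int:
    """Pack a sequence of +1/-1 values into an int; bit i is 1 when values[i] is +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_word: int, b_word: int, n_bits: int = 32) -> int:
    """Dot product of two packed +1/-1 vectors using XNOR and a bit count."""
    mask = (1 << n_bits) - 1
    agreements = bin(~(a_word ^ b_word) & mask).count("1")  # XNOR, then popcount
    # Each agreeing position contributes +1 and each disagreeing one -1.
    return 2 * agreements - n_bits


# Small example: 1*1 + (-1)*1 + 1*(-1) + 1*1 = 0
small = xnor_popcount_dot(pack_bits([1, -1, 1, 1]), pack_bits([1, 1, -1, 1]), n_bits=4)
print(small)  # 0

# 32 connections evaluated with one XNOR and one popcount
rng = random.Random(42)
a = [rng.choice([-1, 1]) for _ in range(32)]
b = [rng.choice([-1, 1]) for _ in range(32)]
packed_result = xnor_popcount_dot(pack_bits(a), pack_bits(b))
plain_result = sum(x * y for x, y in zip(a, b))
print(packed_result == plain_result)  # True
```

The packed version replaces 32 multiply-accumulate steps with a couple of word-level operations, which is where the speedup comes from.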

Let’s see some performance stats:

Fig. 2 – Comparison between Baseline kernel, cuBLAS and the XNOR kernel for time and accuracy.

As we can see in the third section of Fig. 2, the accuracy is the same for all three methods: the unoptimized baseline kernel, the cuBLAS library, and the paper’s XNOR kernel. The first section compares the time taken by an 8192 x 8192 x 8192 matrix multiplication. The second section compares the time taken to run inference on the full MNIST test set with a multi-layer perceptron. We can clearly see that the XNOR kernel performs better. In the case of matrix multiplication, XNOR is 23 times faster than the baseline kernel, and 3.4 times faster than the cuBLAS kernel.

We can see there’s a smaller difference between cuBLAS and XNOR kernels while running MNIST test data. That’s because on the first layer, the values are not binary, so the baseline kernel is used for computation, thus resulting in a little delay. But it’s not that big of a problem, since the input image usually has only 3 channels, which means less computation.

## Code

Let’s look at some GitHub repos that implement BNNs.

The first two implementations accompany the original paper. One is written in Python with Theano, and the other is written in Lua with Torch.

### Theano:

https://github.com/MatthieuCourbariaux/BinaryNet

### Torch:

https://github.com/itayhubara/BinaryNet

### PyTorch:

There’s a PyTorch implementation from one of the authors of the BNN paper, which includes architectures like binary AlexNet, binary ResNet and binary VGG, with different numbers of layers (resnet18, resnet34, resnet50, etc.).

https://github.com/itayhubara/BinaryNet.pytorch

There’s no documentation, but the code is intuitive. In the subdirectory ‘models’ there are three binarized networks implemented: vgg, resnet and alexnet.

Use the file ‘data.py’ to send a custom dataset to the BNN network. There are also a lot of transformation options in ‘preprocess.py’.

### Keras/TensorFlow:

One of the best packages I have seen so far is Larq, an open source package that makes building and training a Binarized Neural Network really easy.

The previously discussed packages come with pre-implemented networks that you can use. With Larq, you can also create new networks in a really easy way. It works just like the Keras API: for instance, if you want to add a binarized conv layer, instead of ‘tf.keras.layers.Conv2D’, you can use ‘larq.layers.QuantConv2D’.

The best thing about the package is that the documentation is really good, and the community is actively developing it, so the support is also good.

Even though the documentation is great, let’s look at an example from it so that you get the gist of how easy the library is to use.

```python
import tensorflow as tf
import larq as lq

# Shared options for every fully binarized layer: binarize both the inputs
# and the weights with the sign function and a straight-through estimator.
kwargs = dict(input_quantizer="ste_sign",
              kernel_quantizer="ste_sign",
              kernel_constraint="weight_clip",
              use_bias=False)

model = tf.keras.models.Sequential([
    # First layer: only the weights are quantized, not the input image.
    lq.layers.QuantConv2D(128, 3,
                          kernel_quantizer="ste_sign",
                          kernel_constraint="weight_clip",
                          use_bias=False,
                          input_shape=(32, 32, 3)),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),

    lq.layers.QuantConv2D(128, 3, padding="same", **kwargs),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2)),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),

    lq.layers.QuantConv2D(256, 3, padding="same", **kwargs),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),

    lq.layers.QuantConv2D(256, 3, padding="same", **kwargs),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2)),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),

    lq.layers.QuantConv2D(512, 3, padding="same", **kwargs),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),

    lq.layers.QuantConv2D(512, 3, padding="same", **kwargs),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2)),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),
    tf.keras.layers.Flatten(),

    lq.layers.QuantDense(1024, **kwargs),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),

    lq.layers.QuantDense(1024, **kwargs),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),

    lq.layers.QuantDense(10, **kwargs),
    tf.keras.layers.BatchNormalization(momentum=0.999, scale=False),
    tf.keras.layers.Activation("softmax")
])
```

Note that the first layer binarizes only its weights and not its inputs (it sets no input quantizer), because the input image is not binary, as explained before. Let’s look at the final architecture.

```python
model.summary()
```
```
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
quant_conv2d (QuantConv2D)   (None, 30, 30, 128)       3456
_________________________________________________________________
batch_normalization (BatchNo (None, 30, 30, 128)       384
_________________________________________________________________
quant_conv2d_1 (QuantConv2D) (None, 30, 30, 128)       147456
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 15, 15, 128)       0
_________________________________________________________________
batch_normalization_1 (Batch (None, 15, 15, 128)       384
_________________________________________________________________
quant_conv2d_2 (QuantConv2D) (None, 15, 15, 256)       294912
_________________________________________________________________
batch_normalization_2 (Batch (None, 15, 15, 256)       768
_________________________________________________________________
quant_conv2d_3 (QuantConv2D) (None, 15, 15, 256)       589824
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 256)         0
_________________________________________________________________
batch_normalization_3 (Batch (None, 7, 7, 256)         768
_________________________________________________________________
quant_conv2d_4 (QuantConv2D) (None, 7, 7, 512)         1179648
_________________________________________________________________
batch_normalization_4 (Batch (None, 7, 7, 512)         1536
_________________________________________________________________
quant_conv2d_5 (QuantConv2D) (None, 7, 7, 512)         2359296
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 3, 3, 512)         0
_________________________________________________________________
batch_normalization_5 (Batch (None, 3, 3, 512)         1536
_________________________________________________________________
flatten (Flatten)            (None, 4608)              0
_________________________________________________________________
quant_dense (QuantDense)     (None, 1024)              4718592
_________________________________________________________________
batch_normalization_6 (Batch (None, 1024)              3072
_________________________________________________________________
quant_dense_1 (QuantDense)   (None, 1024)              1048576
_________________________________________________________________
batch_normalization_7 (Batch (None, 1024)              3072
_________________________________________________________________
quant_dense_2 (QuantDense)   (None, 10)                10240
_________________________________________________________________
batch_normalization_8 (Batch (None, 10)                30
_________________________________________________________________
activation (Activation)      (None, 10)                0
=================================================================
Total params: 10,363,550
Trainable params: 10,355,850
Non-trainable params: 7,700
```
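The parameter counts in this summary can be checked by hand, and they also give a rough sense of the memory savings. The sketch below recomputes the weights of the quantized layers (none of which use a bias) and compares 32-bit storage with 1 bit per weight. It is a back-of-the-envelope estimate that ignores the batch normalization parameters, which stay real-valued; the helper names are made up for the example.

```python
def conv_params(kernel: int, in_channels: int, out_channels: int) -> int:
    """Weights of a square conv layer without bias."""
    return kernel * kernel * in_channels * out_channels


def dense_params(in_units: int, out_units: int) -> int:
    """Weights of a dense layer without bias."""
    return in_units * out_units


quant_layers = [
    conv_params(3, 3, 128),     # quant_conv2d
    conv_params(3, 128, 128),   # quant_conv2d_1
    conv_params(3, 128, 256),   # quant_conv2d_2
    conv_params(3, 256, 256),   # quant_conv2d_3
    conv_params(3, 256, 512),   # quant_conv2d_4
    conv_params(3, 512, 512),   # quant_conv2d_5
    dense_params(4608, 1024),   # quant_dense
    dense_params(1024, 1024),   # quant_dense_1
    dense_params(1024, 10),     # quant_dense_2
]

total_weights = sum(quant_layers)
float32_bytes = total_weights * 4   # 32 bits per weight
binary_bytes = total_weights // 8   # 1 bit per weight

print(quant_layers[0])                  # 3456
print(total_weights)                    # 10352000
print(float32_bytes / binary_bytes)     # 32.0
```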

And now you can train this model just like a normal neural network implemented in Keras.


## Applications

BNNs are power efficient, so they can be used on low-power devices. This is one of the greatest advantages of BNNs. For example, you can use LCE (Larq Compute Engine) with TensorFlow Lite Java to run BNN inference on Android while consuming less power.

## Conclusion

Deep networks require power-hungry GPUs, so it’s difficult to train them on low-power devices. That’s why the concept of Binarized Neural Networks seems promising.

They consume less power, with no accuracy loss reported in the paper’s kernel benchmarks, and can be used on mobile devices. Seems pretty useful!

