# Guide to Building Your Own Neural Network [With Breast Cancer Classification Example]

This is a hands-on guide to building your own neural network for breast cancer classification. I will start off with the basics and then go through the implementation.

Accurately identifying and categorizing breast cancer subtypes is a crucial clinical task, and it can take trained pathologists hours to complete. So, we will try to automate breast cancer classification by analyzing breast histology images with deep learning, using image classification in PyTorch.

My focus will be to provide a guide for new data scientists, or those who want to revise the basics, and then go on to build a neural network of our own. We’re going to cover:

- How neural networks work
- How convolutional neural networks work
- Implementation of breast cancer classification with a CNN from scratch

## What are neural networks and how do they work?

In order to understand neural networks, we must start with perceptrons.

Perceptrons, or artificial neurons, are mathematical models that mimic biological neurons. Like neurons, a perceptron takes in several binary inputs to give a single binary output. Simple!

The importance of each input can be expressed by assigning a weight to it. The neuron’s output, 0 or 1, is determined by whether the weighted sum is greater or less than the *threshold value*. Mathematically, it’s:

$$
\text{output} =
\begin{cases}
0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\
1 & \text{if } \sum_j w_j x_j > \text{threshold}
\end{cases}
$$

Where **w** is the weight of each input and **x** is the input.

By varying the weights and the threshold, we can get different models. Now, to simplify how we express the perceptron, let’s move the threshold to the other side of the inequality and replace it with what’s known as the perceptron’s bias, b = -threshold. Using the bias instead of the threshold, the perceptron rule can be written as:

$$
\text{output} =
\begin{cases}
0 & \text{if } W \cdot x + b \le 0 \\
1 & \text{if } W \cdot x + b > 0
\end{cases}
$$

Where **W** and **x** are vectors whose components are the weights and inputs respectively.
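As a minimal sketch of the bias form of the rule, here is a perceptron in plain Python. The function name `perceptron` and the example weights (-2, -2) with bias 3 are assumptions chosen for illustration, not values from a trained model; with them the perceptron happens to behave like a NAND gate.

```python
def perceptron(weights, inputs, bias):
    """Return 1 if the weighted sum plus the bias is positive, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum + bias > 0 else 0

# Illustrative weights and bias (assumed for this example).
weights, bias = [-2, -2], 3
for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), perceptron(weights, [x1, x2], bias))
```

Changing the weights or the bias changes which inputs fire the neuron, which is exactly the "different models" idea described above.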

Now, when these neurons are arranged on multiple levels, it’s called a neural network. A single perceptron is not a complete model of decision-making, but it illustrates how a neuron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of neurons could make quite subtle decisions.

An **activation function** is a function that’s used to get the output of a neuron. There are two types of activation functions: linear and nonlinear. (The weighted sum W · x + b above is linear in the inputs; the perceptron’s thresholding step is what makes its output nonlinear.) Nonlinear functions are most commonly used because they let the model generalise better across a wide variety of data, and we will use one in this article. Some of the most commonly used activation functions are:

- Sigmoid function
- Tanh activation function
- Rectified linear Unit or ReLU
- Leaky ReLU
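To make the list concrete, here is a small sketch of the four functions in plain Python, applied to a single number. The `slope=0.01` default for Leaky ReLU is an assumed, commonly seen value, not one taken from this article.

```python
import math

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # squashes any real number into the range (-1, 1)
    return math.tanh(z)

def relu(z):
    # passes positive values through and clips negative values to 0
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    # like ReLU, but keeps a small slope for negative inputs (0.01 is assumed)
    return z if z > 0 else slope * z

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```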

A simple neural network is arranged in layers.

The first layer is called the input layer, and the rightmost layer is the output layer. Layers between these two are called hidden layers. In this network, the 1st layer of perceptrons is making decisions by weighing the inputs. The output is fed to the second layer, and so on till the last layer.

Since each perceptron is making a decision by weighing up the inputs from the previous layer, the complexity of decision-making increases down the layers. This way, a multi-layer network of perceptrons engages in complex decision-making tasks. Neural networks where the output of one layer is used as input for the next layer are called **feedforward networks**.

Now, we know what neural networks are, so let’s discuss how they learn to give the correct output. When we say the network learns, it means that through calculations or following some process, the network has found the right set of weights and biases so that its loss is minimum.

The loss is the discrepancy between the target and the predicted output that we get from the values of **w** and **b**. Our goal is to minimize this error to obtain the most accurate values of **w** and **b**. Let’s use the mean squared error as our error function.

There are three steps to calculate the mean squared error:

- Find the difference between the actual y and the predicted y value (ȳ = wx + b), for a given x.
- Square this difference.
- Find the mean of the squares for every value in X.

$$
E = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y}_i)^2
$$

Here yᵢ is the actual value and ȳᵢ is the predicted value. Let’s substitute the value of ȳᵢ:

$$
E = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - (w x_i + b)\bigr)^2
$$

So we square the error and find the mean. Hence the name mean squared error.
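The three steps translate almost line by line into code. The sketch below is plain Python; the function name `mean_squared_error` and the toy data (generated from y = 2x + 1) are assumptions for illustration.

```python
def mean_squared_error(xs, ys, w, b):
    """Mean of the squared differences between actual y and predicted w*x + b."""
    squared_errors = [(y - (w * x + b)) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / len(squared_errors)

# Toy data generated from y = 2x + 1 (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
print(mean_squared_error(xs, ys, w=2.0, b=1.0))  # perfect fit: 0.0
print(mean_squared_error(xs, ys, w=1.0, b=0.0))  # worse fit: 7.5
```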

Why introduce the error function? After all, aren’t we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the error function?

The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, **making small changes to the weights and biases won’t cause any change at all in the number of training images classified correctly**. See this blog for further reading.

That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function, like the error function defined above, it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That’s why we focus first on minimizing the error function, and only after that will we examine the classification accuracy.

## Gradient descent

Now that we have defined the loss function, let’s get into the interesting part: minimizing it and finding **w** and **b**. Gradient descent is an iterative optimization algorithm for finding the minimum of a function. Here our function is the error function we defined earlier. I am going to explain gradient descent using scalar values and move to matrix operations later when we discuss image classification, since an image is basically a matrix.

Let’s try applying gradient descent to **w** and **b** and approach it step by step:

1. Initially let w = 4 and b = 0. Let **L** be our learning rate. This controls how much the value of **w** changes with each step. **L** could be a small value like 0.0001 for good accuracy. Bear in mind, the weight **w** should always be initialised randomly and not at 1 or 0.

2. Calculate the partial derivative of the loss function with respect to **w**, and plug in the current values of x, y, w and b to obtain the derivative value **D**:

$$
D_w = \frac{\partial E}{\partial w} = \frac{-2}{n} \sum_{i=1}^{n} x_i (y_i - \bar{y}_i)
$$

Here $D_w$ is the value calculated with respect to **w**. Let’s calculate D with respect to **b**, i.e. $D_b$:

$$
D_b = \frac{\partial E}{\partial b} = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \bar{y}_i)
$$

3. Now we update the current values of **w** and **b** using the following equations:

$$
w = w - L \cdot D_w
$$

$$
b = b - L \cdot D_b
$$

4. We repeat this process until our loss function is a very small value or ideally 0 (which means 0 error or 100% accuracy). The values of **w** and **b** that we are left with now will be the optimum values. With the optimum values of **w** and **b**, our model is ready to make predictions! Please note that finding the “right set” of optimum values is crucial. Please look at this article to learn about overfitting and underfitting of data, which interfere with finding the “right set” of optimum values.
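The four steps above can be sketched in a few lines of plain Python. The toy data (generated from y = 2x + 1), the number of steps and the learning rate `L=0.01` are assumptions for illustration; the learning rate is larger than the 0.0001 mentioned above only so that this tiny example converges within a few thousand steps.

```python
def gradient_descent(xs, ys, w=4.0, b=0.0, L=0.01, steps=5000):
    """Fit y = w*x + b by repeatedly stepping against the MSE gradient."""
    n = len(xs)
    for _ in range(steps):
        # step 2: errors y - (w*x + b) at the current w and b
        errors = [y - (w * x + b) for x, y in zip(xs, ys)]
        D_w = (-2.0 / n) * sum(x * e for x, e in zip(xs, errors))
        D_b = (-2.0 / n) * sum(errors)
        # step 3: move w and b a small step against the gradient
        w = w - L * D_w
        b = b - L * D_b
    return w, b

# Toy data generated from y = 2x + 1 (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

Starting from w = 4 and b = 0, the loop ends up close to the values used to generate the data.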

To make gradient descent work correctly, we need to choose a small enough learning rate **L** so that the above equation is a good approximation, but not too small or the gradient descent will work too slowly.

Gradient descent often works extremely well, and in neural networks we’ll find that it’s a powerful way of minimizing the cost function, and helping the net learn.

Now, there’s a challenge in applying the gradient descent rule. A quick look at the error function:

$$
E = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y}_i)^2
$$

tells us that it’s an average over the errors for individual training samples. In practice, to compute the gradient **D** we need to compute the gradients $D_x$ separately for each training input x, and then average them. Unfortunately, when the number of training inputs is very large, this can take a long time, so learning occurs slowly.

To handle this issue, *stochastic gradient descent* can be used. Here, instead of calculating the exact gradient **D**, an estimated gradient is calculated for a small sample of randomly chosen training inputs, or a mini-batch. By averaging over this mini-batch it turns out that we can quickly get a good estimate of the true gradient, and this helps speed up gradient descent and learning.

**How does this connect to learning in a neural network?** Let **w** and **b** be the weights and biases in our network. Stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those. Then it picks out another batch randomly and trains with those. This goes on until the training inputs are exhausted, which is said to complete an *epoch* of training. At that point, a new training epoch starts.
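Here is a rough sketch of that epoch loop for the same one-weight, one-bias model, in plain Python. The function name `sgd_epoch`, the toy data (y = 2x + 1 with x between 0 and 1), the batch size of 10 and the learning rate of 0.1 are all assumptions for illustration.

```python
import random

def sgd_epoch(data, w, b, L, batch_size):
    """One epoch: shuffle the data, then update w and b once per mini-batch."""
    data = data[:]                # copy so the caller's list is not reordered
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # gradient estimated from this mini-batch only
        errors = [(x, y - (w * x + b)) for x, y in batch]
        D_w = (-2.0 / len(batch)) * sum(x * e for x, e in errors)
        D_b = (-2.0 / len(batch)) * sum(e for _, e in errors)
        w, b = w - L * D_w, b - L * D_b
    return w, b

random.seed(0)
# Toy data from y = 2x + 1 (assumed for illustration).
data = [(k / 100.0, 2.0 * (k / 100.0) + 1.0) for k in range(100)]
w, b = 4.0, 0.0
for epoch in range(200):
    w, b = sgd_epoch(data, w, b, L=0.1, batch_size=10)
print(round(w, 3), round(b, 3))
```

Each update only looks at 10 samples, yet over many epochs the estimates still settle near the values that generated the data.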

There’s a fast algorithm for computing the gradient of the error function known as backpropagation.

**Backpropagation** is about how changing the weights and biases in a network changes the error function. The goal of backpropagation is to compute the partial derivatives $D_w = \partial E / \partial w$ and $D_b = \partial E / \partial b$ of the error function **E** with respect to any weight **w** or bias **b** in the network.

To compute those, let me introduce an intermediate quantity $\delta^l_j$, which will be the error in the $j^{th}$ neuron in the $l^{th}$ layer. Backpropagation will give us a procedure to compute $\delta^l_j$, and will then relate it to $D_w$ and $D_b$.

Let’s understand how this error affects our neural network. The error sits at the $j^{th}$ neuron in the $l^{th}$ layer. As the input to the neuron comes in, the error messes with the neuron’s operation. It adds a little change $\Delta e^l_j$ to the neuron’s weighted input $e^l_j$, so instead of outputting $\sigma(e^l_j)$, the neuron outputs $\sigma(e^l_j + \Delta e^l_j)$. This change propagates through later layers in the network, finally causing the overall cost to change by an amount $\frac{\partial E}{\partial e^l_j} \Delta e^l_j$. This is why the error of a neuron is defined as $\delta^l_j = \partial E / \partial e^l_j$.

Backpropagation is based around four fundamental equations:

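1. **Error in the output layer**

$$
\delta^L_j = \frac{\partial E}{\partial a^L_j} \, \sigma'(e^L_j)
$$

Where E is the error function and σ is the activation function. $\partial E / \partial a^L_j$ measures how fast the error function is changing as a function of the $j^{th}$ output activation. The second term, $\sigma'(e^L_j)$, measures how fast the activation function is changing at $e^L_j$. To simplify, let’s rewrite the above expression in vector form, where $\nabla_a E$ is the vector of partial derivatives $\partial E / \partial a^L_j$ and $\odot$ is the elementwise product (eq 1):

$$
\delta^L = \nabla_a E \odot \sigma'(e^L)
$$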

2. **Error in terms of the error in the next layer (eq 2)**

$$
\delta^l = \bigl((w^{l+1})^T \delta^{l+1}\bigr) \odot \sigma'(e^l)
$$

Where $(w^{l+1})^T$ is the transpose of the weight matrix $w^{l+1}$ for the $(l+1)^{th}$ layer. This appears complicated, but let me break it down. Suppose we know the error $\delta^{l+1}$ at the $(l+1)^{th}$ layer. When we apply the transpose weight matrix, $(w^{l+1})^T$, we can think of this as moving the error backward through the network, giving us some sort of measure of the error at the output of the $l^{th}$ layer. We then take the elementwise (Hadamard) product with $\sigma'(e^l)$, which is what $\odot$ symbolises. This moves the error backward through the activation function in layer l, giving us the error $\delta^l$ in the weighted input to layer l. By combining **(eq 1)** and **(eq 2)** we can compute the error $\delta^l$ for any layer in the network. We start by using **(eq 1)** to compute $\delta^L$, then apply **(eq 2)** to compute $\delta^{L-1}$, then **(eq 2)** again to compute $\delta^{L-2}$, and so on, all the way back through the network.

3. **Rate of change of the error function with respect to any bias in the network (eq 3)**

$$
\frac{\partial E}{\partial b^l_j} = \delta^l_j
$$

That is, the error $\delta^l_j$ is exactly equal to the rate of change $\partial E / \partial b^l_j$. **(eq 1)** and **(eq 2)** already give us $\delta^l_j$. We can simplify **(eq 3)** as:

$$
\frac{\partial E}{\partial b} = \delta
$$

Where it is understood that δ is being evaluated at the same neuron as the bias **b**.

4. **Rate of change of the error with respect to any weight in the network (eq 4)**

$$
\frac{\partial E}{\partial w^l_{jk}} = a^{l-1}_k \, \delta^l_j
$$

This shows how to compute the partial derivatives $\partial E / \partial w^l_{jk}$ in terms of the quantities $\delta^l$ and $a^{l-1}$, which we already know how to compute. Here $a^{l-1}$ is the activation of the neuron input to the weight **w**, and $\delta^l$ is the error of the neuron output from the weight **w**. Looking at **(eq 4)**, when $a^{l-1} \approx 0$ the gradient term will also tend to be small, which means the weight learns slowly: gradient descent barely changes it. In other words, a consequence of **(eq 4)** is that weights output from low-activation neurons learn slowly.

Summing up, now you’ve seen that a weight will learn slowly if either the input neuron is low-activation, or the output neuron has saturated, i.e. either high- or low-activation.

The four fundamental equations turn out to hold for any activation function, not just the standard sigmoid function or the perceptron we discussed in the beginning. Let’s write this out in the form of a pseudo algorithm:

1. **Input x:** Set the corresponding activation $a^1$ for the input layer.
2. **Feedforward:** For each $l = 2, 3, \dots, L$ compute $e^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(e^l)$.
3. **Output error $\delta^L$:** Compute the vector $\delta^L = \nabla_a E \odot \sigma'(e^L)$.
4. **Backpropagate the error:** For each $l = L-1, L-2, \dots, 2$ compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(e^l)$.
5. **Output:** The gradient of the error function is given by $\partial E / \partial w^l_{jk} = a^{l-1}_k \delta^l_j$ and $\partial E / \partial b^l_j = \delta^l_j$.
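To see the algorithm run end to end, here is a deliberately tiny sketch in plain Python. It assumes a chain of layers with a single sigmoid neuron each, so every weight matrix is just one number, and it assumes the error function E = (a - y)² / 2 for one training sample, so that ∂E/∂a = a - y. The names `backprop` and `loss` and the example weights are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Gradients of E = (a_L - y)^2 / 2 for a chain of single-neuron layers."""
    # Feedforward: store every weighted input e and activation a.
    activations, weighted_inputs = [x], []
    for w, b in zip(weights, biases):
        e = w * activations[-1] + b
        weighted_inputs.append(e)
        activations.append(sigmoid(e))
    # Output error (eq 1): dE/da times sigma'(e) at the last layer.
    delta = (activations[-1] - y) * sigmoid_prime(weighted_inputs[-1])
    grad_w = [0.0] * len(weights)
    grad_b = [0.0] * len(biases)
    for l in reversed(range(len(weights))):
        grad_b[l] = delta                     # eq 3
        grad_w[l] = activations[l] * delta    # eq 4
        if l > 0:
            # eq 2: move the error back through the weight and the activation
            delta = weights[l] * delta * sigmoid_prime(weighted_inputs[l - 1])
    return grad_w, grad_b

def loss(x, y, weights, biases):
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w * a + b)
    return 0.5 * (a - y) ** 2

weights, biases = [0.5, -1.2, 0.8], [0.1, 0.3, -0.4]   # illustrative values
grad_w, grad_b = backprop(1.0, 0.0, weights, biases)

# Sanity check: compare with a finite-difference estimate for the first weight.
eps = 1e-6
bumped = [weights[0] + eps] + weights[1:]
numeric = (loss(1.0, 0.0, bumped, biases) - loss(1.0, 0.0, weights, biases)) / eps
print(grad_w[0], numeric)
```

The two printed numbers should agree closely, which is a handy way to check any hand-written backpropagation code.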

Examining the algorithm, you can see why it’s called *back*propagation. We compute the error vectors $\delta^l$ backwards, starting from the final layer. It may seem strange that we’re going through the network backwards. But if you think about the proof of backpropagation, the backward movement is a consequence of the fact that the cost is a function of outputs from the network. To understand how the cost varies with earlier weights and biases we need to repeatedly apply the chain rule, working backwards through the layers to obtain usable expressions. If you aren’t familiar with the chain rule, please check out this video by Josh Starmer.

If you still aren’t clear about the essence of backpropagation, I suggest you check out this video and this video to catch the backpropagation calculus.

For the rest of the blog, I will use PyTorch’s `loss.backward()`, as it’s optimised. In order to use this, you need to clear existing gradients using the `zero_grad()` function, or else the gradients will accumulate.
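A small sketch of what "accumulate" means in practice, using a single scalar parameter. The tensor `w` and the toy loss `w ** 2` are assumptions for illustration; `w.grad.zero_()` clears the stored gradient by hand, which is the job `optimizer.zero_grad()` does for all of a model's parameters.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

loss = w ** 2              # d(loss)/dw = 2w = 6
loss.backward()
first = w.grad.item()

loss = w ** 2              # second backward pass without clearing the gradient
loss.backward()
accumulated = w.grad.item()

w.grad.zero_()             # clear the stored gradient by hand
loss = w ** 2
loss.backward()
cleared = w.grad.item()

print(first, accumulated, cleared)
```

The second value is double the first because the new gradient was added on top of the old one; after clearing, the value is back to the correct gradient.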


We’ve been focusing on feed-forward neural networks. Now, for the task of breast cancer classification, let’s look at a neural network famous for image classification.

## What are Convolutional Neural Networks? How do they work?

Let’s start with why we need convolutional neural nets (ConvNets, CNNs) over feed-forward neural nets.

Consider a small image of size 100×100. For a feed-forward neural net, there are 10000 weights for each neuron in the second layer. This makes the network prone to overfitting the data. Also, flattening the image into a vector of 10000 values loses the essence of an image, namely its spatial structure.

CNNs are regularised versions of feed-forward (fully connected) neural networks. Common ways of regularising a network include penalising large weights while the loss function is minimized (weight decay) and randomly trimming connectivity (dropout).

CNNs take advantage of hierarchical patterns in image data; in each layer they capture small localised features (w.r.t. the previous layer), but as the depth increases, the complexity of these features increases w.r.t. the input image. Hence, this stacking of localised filters (neurons connected locally over a small area) enables CNNs to capture complex and spatially invariant features like dogs, cats, cars, etc., with fewer trainable parameters than fully connected networks. We can say that they’re more efficient at capturing relevant features from images than fully connected networks. To learn more about the importance of CNNs in image classification, check out this video by Computerphile.
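A quick back-of-the-envelope comparison shows how large the difference in trainable parameters can be for a 100×100 image. The layer sizes below (100 hidden neurons, 16 filters of size 5×5) are assumed values picked only for illustration.

```python
height, width = 100, 100
inputs = height * width                    # 10000 pixel values after flattening

# Fully connected: every hidden neuron has one weight per pixel plus a bias.
hidden_neurons = 100                       # assumed layer size
fc_params = hidden_neurons * (inputs + 1)

# Convolutional: every filter has kernel_size * kernel_size weights plus a bias,
# and the same filter is reused at every position in the image.
filters, kernel_size = 16, 5               # assumed values
conv_params = filters * (kernel_size * kernel_size + 1)

print(fc_params, conv_params)              # 1000100 vs 416
```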

Convolutional neural networks are a specialised type of neural network, which uses convolution (filters/kernels convolve with the input image to generate the activation) instead of regular matrix multiplication in at least one of the layers. The architecture of CNNs is similar to that of a fully connected neural network. There’s an input layer, the hidden layer and the final output layer.

Here, the hidden layers perform convolution. They are followed by layers that perform other functions, such as pooling layers, fully connected layers and normalisation layers. Let’s look at these in detail.

### Convolutional layers

As I mentioned earlier, convolution takes place in the hidden layers. To be precise, the kernel, which we shall refer to here as the filter, moves to different positions in the image, and the stride of the convolution sets how far it moves at each step. For each position of the filter, the dot product is calculated between the filter and the image pixels under the filter, which results in a single pixel in the output image.

So, moving the filter across the entire input image results in a new image being generated. These images are called feature maps. The feature maps generated in the first convolutional layer are down-sampled. These feature maps are then passed through a second convolutional layer. Here, for each of these newly generated images, filter-weights are needed. The resulting images are further down-sampled. If you are interested to know in depth how convolution works on images, you can refer to this blog on performing convolution operations.
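The sliding dot product can be sketched in plain Python on a tiny single-channel image. The function name `convolve2d`, the 5×5 image and the 3×3 filter are assumptions for illustration; like most deep learning libraries, the sketch does not flip the filter, and it uses no padding.

```python
def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image (no padding) and return the feature map."""
    k = len(kernel)
    out_rows = (len(image) - k) // stride + 1
    out_cols = (len(image[0]) - k) // stride + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            top, left = r * stride, c * stride
            # dot product between the filter and the pixels under it
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k) for j in range(k)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map

image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
print(convolve2d(image, kernel))            # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
print(convolve2d(image, kernel, stride=2))  # [[4, 4], [2, 4]]
```

With a stride of 2 the filter skips every other position, so the feature map comes out smaller, which is the down-sampling effect mentioned above.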

### Pooling layer

Now, instead of down-sampling by changing the stride of the convolution, there are other robust ways of down-sampling the image, like using a pooling layer. Pooling layers reduce data dimensions by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. Local pooling combines small clusters, typically 2 x 2, which halves the resolution along each dimension. There are two types of pooling:

- Max Pooling – maximum value of each cluster of neurons at the prior layer is picked up for each feature map
- Average Pooling – average value of each cluster of neurons is picked up for each feature map

Max pooling is usually preferred, as it performs denoising along with dimensionality reduction.

Pooling helps in extracting the dominant features, which are position-invariant. Also, the dimensionality reduction decreases the computational power required to process the data.
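Both kinds of pooling fit in one short function. This is a plain Python sketch; the name `pool2d` and the 4×4 feature map are assumptions for illustration, and the clusters are non-overlapping 2 x 2 blocks.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Combine each non-overlapping size x size cluster into one value."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, size):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, size):
            cluster = [
                feature_map[top + i][left + j]
                for i in range(size) for j in range(size)
            ]
            row.append(max(cluster) if mode == "max" else sum(cluster) / len(cluster))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]
print(pool2d(feature_map, mode="max"))      # [[6, 4], [7, 9]]
print(pool2d(feature_map, mode="average"))  # [[3.75, 2.25], [3.5, 5.0]]
```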

### Fully connected layers

The final layer is a fully connected layer which classifies our image. The output from the convolution network is then flattened into a column vector, and fed to a fully connected neural network; backpropagation is applied to every iteration of training.

Over a series of epochs, the model is able to distinguish between dominating and low-level features in images, and classify them using the **softmax classification** technique. I won’t go into detail about softmax, but in a few words, the softmax classifier gives probabilities for each class. To know more about softmax classification, please go through this blog by Adrian Rosebrock where he beautifully explains softmax classification.
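For a feel of what "probabilities for each class" means, here is a minimal softmax sketch in plain Python. The function name `softmax` and the three example scores are assumptions for illustration; subtracting the largest score first is a common trick to avoid overflow and does not change the result.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
print(sum(probs))
```

The class with the highest raw score gets the highest probability, and the predicted class is simply the one with the largest probability.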

## CNN code

Now that we have gone through the basics, let’s build our own CNN and see how it performs on the MNIST dataset, with PyTorch in Colab, using the GPU.

First, import the libraries.

```
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
```

Download the training and testing dataset.

```
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
trainset = torchvision.datasets.MNIST(
    root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=64, shuffle=True, num_workers=2)
```

```
testset = torchvision.datasets.MNIST(
    root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(
    testset, batch_size=20, shuffle=False, num_workers=2)
```

Let’s visualise the training images which we’re going to use as input.

```
import matplotlib.pyplot as plt
import numpy as np

# helper to display a tensor image with matplotlib
def imshow(img):
    img = img / 2 + 0.5  # undo Normalize((0.5,), (0.5,))
    npimg = img.numpy()
    # move channels last: (C, H, W) -> (H, W, C)
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()

# get one batch of random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)
# show the first six images in a 3-column grid
imshow(torchvision.utils.make_grid(images[:6], nrow=3))
```

`device = torch.device('cuda' if torch.cuda.is_available() else "cpu")`

**Now, time to build our CNN.**

```
class NumClassifyNet(nn.Module):
    def __init__(self):
        super(NumClassifyNet, self).__init__()
        # 1 input image channel, 16 output channels, 5x5 square convolutional kernels
        self.conv1 = nn.Conv2d(1, 16, kernel_size=5)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5)
        self.pool = nn.MaxPool2d(2, 2)
        # after two conv + pool stages a 28x28 image becomes 32 maps of 4x4 = 512 features
        self.fc1 = nn.Linear(512, 120)
        self.fc2 = nn.Linear(120, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        # flatten everything except the batch dimension
        x = x.view(-1, self.flat_features(x))
        x = F.relu(self.fc1(x))
        # raw class scores; nn.CrossEntropyLoss applies softmax internally
        x = self.fc2(x)
        return x

    def flat_features(self, x):
        # number of features per sample: product of all dimensions except the batch
        size = x.size()[1:]
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

net = NumClassifyNet()
net = net.to(device)
```
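Where does the 512 in `nn.Linear(512, 120)` come from? The sketch below traces the image size through the two convolution and pooling stages, assuming no padding and a stride of 1 for the convolutions, as in the network above. The helper names `conv_out` and `pool_out` are made up for this illustration.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Output width/height of a convolution layer."""
    return (size + 2 * padding - kernel_size) // stride + 1

def pool_out(size, pool_size=2):
    """Output width/height of non-overlapping pooling."""
    return size // pool_size

size = 28                                  # MNIST images are 28x28
size = pool_out(conv_out(size, 5))         # conv1 + pool: 28 -> 24 -> 12
size = pool_out(conv_out(size, 5))         # conv2 + pool: 12 -> 8 -> 4
flat_features = 32 * size * size           # 32 output channels from conv2
print(size, flat_features)                 # 4 512
```

If you change the kernel size, the number of channels or the input resolution, the input size of the first fully connected layer has to be recomputed in the same way.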

```
import torch.optim as optim
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr = 0.001)
```

Time to put the model to train!

```
# keep one fixed batch of test images to check predictions during training
test_data_iter = iter(testloader)
test_images, test_labels = next(test_data_iter)

for epoch in range(10):
    running_loss = 0
    for i, data in enumerate(trainloader, 0):
        input_imgs, labels = data
        optimizer.zero_grad()  # clear gradients left over from the previous step
        input_imgs = input_imgs.to(device)
        labels = labels.to(device)
        outputs = net(input_imgs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # print stats to check that our model is being trained correctly
        # and test on one image as we train
        running_loss += loss.item()
        if i % 1000 == 0:
            # average loss over the batches seen so far in this epoch
            print('epoch', epoch + 1, 'loss', running_loss / (i + 1))
            imshow(torchvision.utils.make_grid(test_images[0].detach()))
            with torch.no_grad():
                test_out = net(test_images.to(device))
            _, predicted_out = torch.max(test_out, 1)
            print('Predicted : ', predicted_out[0].item())
print('Training finished')
```

The output on our last batch was:

The loss is low and the prediction is accurate, so we can stop the training and use this model for making predictions now.

The accuracy achieved on the whole test set was:

## Breast Cancer Classification using CNN

### Information about dataset: breast histopathology images

Breast histopathology images can be downloaded from Kaggle’s website. The image data consists of 177,010 patches of 50×50 pixels, extracted from 162 whole mount slide images of breast cancer specimens scanned at 40x. The data contains images of both negative and positive samples.

Let’s download the data from Kaggle to our drive so that we can use it. I found the documentation ambiguous, so I’m going to explain how to do it in my own words. Hope it helps. This is a one-time setup:

1. **Setting up Kaggle API access:** Collect your Kaggle API access token. Navigate to your Kaggle profile “Account” page. Find “create your API token”. Download your token as a JSON file containing your username and key.

2. **Save API token in Drive:** Create a folder for Kaggle in your Google Drive. Save a copy of the API token as a private file in this folder, so you can access it easily.

3. **Mount Google Drive to Colab**: This will make sure you don’t have to download data every time you restart your runtime.

```
from google.colab import drive
drive.mount('/content/gdrive')
```

4. **Configure a ‘Kaggle Environment’ using OS**: This stores the path of the folder that holds your API token in an OS environment variable. When you run Kaggle terminal commands (in the next step), your machine will be linked to your account through your API token. Linking to the private directory in your drive ensures that your token information will remain hidden.

```
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/MyDrive/kaggle"
```

5. **Download the data**

```
os.chdir('/content/gdrive/MyDrive/kaggle')
!kaggle datasets download -d paultimothymooney/breast-histopathology-images
```

Now that we have our dataset, let’s start building our network!

```
import torch
import torchvision
from torchvision import transforms
from torchvision.datasets import ImageFolder
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import random_split
import torch.optim as optim
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
```

Convert the images to tensors.

```
data_dir = "/content/gdrive/MyDrive"
folder_name = "kaggle"
image_folders = os.path.join(data_dir, folder_name)
transform = transforms.Compose([transforms.Resize((50, 50)), transforms.ToTensor()])
images = []
for file in os.listdir(image_folders):
    try:
        # ImageFolder expects one sub-directory per class inside each folder
        images.append(ImageFolder(os.path.join(image_folders, file), transform=transform))
    except Exception:
        # not an image folder; print its name and skip it
        print(file)
datasets = torch.utils.data.ConcatDataset(images)
```

Check out the dataset to find the number of samples in each class.

```
result = Counter()
for dataset in datasets.datasets:
    # targets holds the class index of every image in this folder
    result += Counter(dataset.targets)
result = dict(result)
print("""Total Number of Images for each Class:
Class 0 (No Breast Cancer): {}
Class 1 (Breast Cancer present): {}""".format(result[0], result[1]))
```

Output:

Now, split the dataset into a training set (75%) and a testing set (25%).

```
random_seed = 42
torch.manual_seed(random_seed)
test_size = int(0.25*(result[0]+result[1]))
print(test_size)
train_size = len(datasets) - test_size
train_dataset, test_dataset = random_split(datasets, [train_size, test_size])
```

```
trainloader = torch.utils.data.DataLoader(
    train_dataset, batch_size=128, shuffle=True, num_workers=2)
testloader = torch.utils.data.DataLoader(
    test_dataset, batch_size=64, shuffle=False, num_workers=2)
```

Now, have a look at our dataset.

```
# helper to show an image
def imshow(img):
    npimg = img.numpy()
    # move channels last: (C, H, W) -> (H, W, C)
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()

# get one batch of random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)
# show images
imshow(torchvision.utils.make_grid(images[:6], nrow=3))
# show labels
labels[:6]
```

Output:

Use the GPU.

`device = torch.device('cuda' if torch.cuda.is_available() else "cpu")`

Build the breast cancer classification neural net.

```
class BreastCancerClassifyNet(nn.Module):
    def __init__(self):
        super(BreastCancerClassifyNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3)
        self.conv3 = nn.Conv2d(128, 256, kernel_size=3)
        self.pool = nn.MaxPool2d(2, 2)
        # after three conv + pool stages a 50x50 image becomes 256 maps of 4x4 = 4096 features
        self.fc1 = nn.Linear(4096, 1024)
        self.fc2 = nn.Linear(1024, 512)
        self.fc3 = nn.Linear(512, 1)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        # flatten everything except the batch dimension
        x = x.view(-1, self.flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        # sigmoid maps the single output to a probability in (0, 1), as nn.BCELoss expects;
        # log_softmax over a single output would always return 0
        x = torch.sigmoid(x)
        return x

    def flat_features(self, x):
        size = x.size()[1:]
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

net = BreastCancerClassifyNet()
net = net.to(device)
```

We use binary cross-entropy loss, as we’re doing binary classification.

```
criterion = nn.BCELoss()
optimizer = optim.SGD(net.parameters(), lr = 0.001)
```
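To get a feel for what binary cross-entropy measures, here is a plain Python sketch of the formula, averaged over a batch. The function name `binary_cross_entropy` and the example probabilities are assumptions for illustration. Like `nn.BCELoss`, it expects each prediction to be a probability between 0 and 1.

```python
import math

def binary_cross_entropy(probabilities, targets):
    """Mean of -[t*log(p) + (1-t)*log(1-p)] over all samples."""
    losses = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for p, t in zip(probabilities, targets)
    ]
    return sum(losses) / len(losses)

targets = [1, 0, 1]
good = binary_cross_entropy([0.9, 0.1, 0.8], targets)  # confident and correct
bad = binary_cross_entropy([0.2, 0.7, 0.3], targets)   # mostly wrong
print(good, bad)
```

Predictions that are confident and correct give a small loss, while predictions that lean towards the wrong class are penalised heavily.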

Time to train!

```
# keep one fixed batch of test images to check predictions during training
test_data_iter = iter(testloader)
test_images, test_labels = next(test_data_iter)

for epoch in range(20):
    running_loss = 0
    for i, data in enumerate(trainloader, 0):
        input_imgs, labels = data
        input_imgs = input_imgs.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        outputs = net(input_imgs)
        # BCELoss needs float targets with the same shape as the outputs: (batch, 1)
        labels = labels.unsqueeze(1).float()
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # print stats and check a prediction as we train
        running_loss += loss.item()
        if i % 10000 == 0:
            # average loss over the batches seen so far in this epoch
            print('epoch', epoch + 1, 'loss', running_loss / (i + 1))
            imshow(torchvision.utils.make_grid(test_images[0].detach()))
            with torch.no_grad():
                test_out = net(test_images.to(device))
            # the single output is thresholded at 0.5 to get the predicted class
            predicted_out = (test_out > 0.5).long()
            print('Predicted : ', predicted_out[0].item())
print('Training finished')
```

Finally, test our trained model on the whole test set and calculate the accuracy.

```
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        test_images, test_labels = data
        test_out = net(test_images.to(device))
        # threshold the single output at 0.5 to get the predicted class (0 or 1)
        predicted = (test_out > 0.5).long().squeeze(1).cpu()
        total += test_labels.size(0)
        correct += (predicted == test_labels).sum().item()
print('Accuracy of the network on the %d test images: %d %%' % (
    total, 100 * correct / total))
```

Output:

Now, this accuracy seems lower than what we had achieved earlier, but note that we used a much more complex dataset, and we built the model from scratch. Nonetheless, we still achieved a good accuracy for 20 epochs.

In order to achieve higher accuracy, you can use pretrained networks, trained on millions of images, as a base and build your classification model on top of them, i.e. by applying transfer learning.

## Conclusion

We went from defining neural networks to building our own neural network for breast cancer classification. Let’s recap on what we learnt:

- We first looked into the very definition of neural nets: what neurons signify and how they form a network.
- Then we moved on to how they work. After a brief look at activation functions, we went into error functions and how gradient descent helps in reducing the error.
- We further looked into backpropagation, where I gave a brief explanation of its mathematics.
- We then moved to CNNs and each of their layers, and built our own CNN from scratch for classifying the MNIST dataset.
- With our collective knowledge of neural nets, we built our own neural net for **breast cancer classification**.


I showed you how to build your own breast cancer classification network, but I hope this blog will be helpful in building your own classification neural net for any dataset.

I hope you enjoyed the journey! Thanks for reading.