Machine Learning Methods Explained

Posted October 1, 2020

In machine learning, every problem is a little different and requires a different approach. The problem can be predicting a stock price, classifying an image, detecting objects in an image, grouping similar data points, or about a million other things.

In this article, we will cover the methods that are most commonly used in machine learning projects (with examples).

Types of Learning

We can categorize machine learning methods into three types of learning.

  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement Learning

1. Supervised Learning

In supervised learning, we get the data together with the target value. To put this in layman's terms, say we want to purchase a house in a particular area, and we have data about houses there: lot size, house age, and number of bedrooms, along with the price of each house.

Now we can write a mathematical function that takes X = {lot_size, house_age, bedroom_number} and returns the house price y, where X is the feature vector and y is the target.
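To make the idea of features and targets concrete, here is a minimal sketch in Python. The three houses, their prices, and the hand-picked weights are all invented for illustration, and `predict_price` is a hypothetical name. Nothing is learned here; the code only shows the shape of a function that maps X to y.

```python
# Hypothetical toy data: each row of X is (lot_size in sq. ft, house_age in years, bedroom_number).
X = [
    (5000, 10, 3),
    (7500, 2, 4),
    (3000, 35, 2),
]
# y holds the known price (in dollars) for each row of X. All values are invented.
y = [250_000, 420_000, 130_000]


def predict_price(features, weights, bias):
    """Return a price estimate: a weighted sum of the features plus a bias."""
    return sum(w * f for w, f in zip(weights, features)) + bias


# Hand-picked (not learned) parameters, used only to illustrate the function.
weights = (40.0, -1500.0, 20_000.0)
bias = 10_000.0

for features, target in zip(X, y):
    estimate = predict_price(features, weights, bias)
    print(features, "predicted:", estimate, "actual:", target)
```

In a real project the weights and bias would be learned from the labeled data rather than chosen by hand.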

This approach is often the best option, but the downside is that you need access to y. In some projects that is easy (and cheap); in others, getting lots of labeled data can be very difficult (and expensive).

Supervised Learning Methods

1. Regression

In regression, we try to predict a continuous output from a feature vector by minimizing the error between the model's predictions and the ground truth, as measured by a loss function. Let’s understand this through simple linear regression. Say we have a line in the x-y plane, y = m*x + b.

Say every feature value x in our data has a target value y assigned to it.

Now our task is to come up with a 2-D line that can take the features and return the target value. How can we do that?

To get the predictive line (our predictive model), we need to tweak the values of m (the slope) and b (the intercept). Training the model means finding the m and b that produce predictions closest to the ground truth, that is, with the lowest error.

In regression, the error is the difference between the actual and the predicted value. Let’s say we build a regression model that predicts house prices. For the input feature X1 the target value is 200K dollars, but when we put X1 into our predictive model it returns 230K dollars. So the error is 30K dollars, a huge amount that we need to reduce by building a better model. Errors over the whole dataset are summarized by a so-called objective (or cost) function.


In regression, we usually square the difference between each actual and predicted value and then average those squares over all examples. This cost function is called the Mean Squared Error (MSE).
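As a small sketch of this cost function, the Mean Squared Error can be written as (1/n) * sum((actual_i - predicted_i)^2). The function name `mean_squared_error` and the second and third price pairs are assumptions made for illustration. The first pair is the 200K versus 230K example from above.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Prices in thousands of dollars. The first pair is the 200K vs 230K case;
# the other two pairs are made up.
actual = [200, 150, 320]
predicted = [230, 140, 330]

mse = mean_squared_error(actual, predicted)
print(mse)
```

Because the differences are squared, one large miss (like the 30K error) weighs much more than several small ones.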

There are various methods of updating the model parameters (slope and intercept). One of the most common is Gradient Descent, an iterative process that repeatedly nudges the parameters in the direction that reduces the cost, so that the model fits the data better and makes better predictions.
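Here is a minimal sketch of gradient descent for the line y = m*x + b, using the mean squared error as the cost. The data, the learning rate, and the number of steps are assumptions chosen so that this toy problem converges; real projects usually need to tune them.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = m*x + b by gradient descent on the mean squared error."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors with the current parameters.
        errors = [(m * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to m and b.
        grad_m = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Move the parameters a small step against the gradient.
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b


# Toy data generated from y = 2*x + 1, so we know what the answer should be.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

m, b = fit_line(xs, ys)
print("slope:", m, "intercept:", b)
```

If the learning rate is too large the updates overshoot and diverge; if it is too small, training takes many more steps.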

2. Classification

In classification, we assign a class or a label to a given object (represented by some feature vector). For example, we can train a classification model to predict whether an email is spam or not.

Binary classification

In classification problems, we are not limited to just two classes (known as binary classification). We can also have multiple classes to be classified with our model.

For example, our task could be to classify the digits from 0 to 9. That means we have 10 classes to choose from. Problems like this one are called multi-class classification problems.

Multi-class classification
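To show what choosing among more than two classes can look like, here is a minimal sketch of a nearest-centroid classifier. This is one simple technique picked for illustration, not the only way to do multi-class classification. The 2-D points and the three class labels are invented.

```python
import math


def train_centroids(points, labels):
    """Compute the mean point (centroid) of each class."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        if label not in sums:
            sums[label] = [0.0] * len(point)
            counts[label] = 0
        sums[label] = [s + p for s, p in zip(sums[label], point)]
        counts[label] += 1
    return {label: tuple(s / counts[label] for s in sums[label]) for label in sums}


def predict(centroids, point):
    """Return the label whose centroid is closest to the point."""
    return min(centroids, key=lambda label: math.dist(centroids[label], point))


# Hypothetical 2-D features for three classes (0, 1 and 2).
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9), (9.0, 1.0), (8.8, 1.3)]
labels = [0, 0, 1, 1, 2, 2]

centroids = train_centroids(points, labels)
print(predict(centroids, (1.1, 1.0)), predict(centroids, (5.0, 5.2)), predict(centroids, (9.0, 1.2)))
```

The same code works for 10 digit classes as long as each class has at least one training example.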

Let’s understand this with one more example.

Let’s say we have two types of data points. One is Red and the other is Blue. Our objective is to separate the Red points from the Blue points with a hyperplane, also called a decision boundary. The farther apart the two groups of points are, the easier it is to create a boundary that divides them into classes.
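One classic way to find such a boundary for linearly separable points is the perceptron algorithm. The sketch below is a minimal version of it. The Red and Blue coordinates are made up, and the labels -1 (Red) and +1 (Blue) are a convention assumed for this example.

```python
def train_perceptron(points, labels, epochs=1000, learning_rate=1.0):
    """Learn weights w and bias b so that the sign of w.x + b matches the labels (+1 / -1)."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, label in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if label * activation <= 0:  # misclassified, or exactly on the boundary
                w = [wi + learning_rate * label * xi for wi, xi in zip(w, x)]
                b += learning_rate * label
                mistakes += 1
        if mistakes == 0:  # every point is on the correct side
            break
    return w, b


def classify(w, b, x):
    """Return +1 for one side of the boundary and -1 for the other."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Hypothetical, linearly separable points.
red = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)]   # label -1
blue = [(5.0, 5.0), (6.0, 5.5), (5.5, 7.0)]  # label +1

points = red + blue
labels = [-1] * len(red) + [1] * len(blue)
w, b = train_perceptron(points, labels)
print("weights:", w, "bias:", b)
```

The learned boundary is the set of points where w.x + b equals zero. If the two groups overlap, this simple algorithm never reaches zero mistakes and just stops after the given number of epochs.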

3. Ensemble Learning

Ensemble learning is the process of combining multiple models to get improved results.

There are various flavors of model ensembling (blending, bagging, stacking). For example, in bootstrap aggregation, or bagging, the process goes as follows.

From our dataset D we draw random samples {D1, D2, D3… Dn}, sampling with replacement so that each sample is typically the same size as D but contains a different mix of examples. We then train one model per sample, giving models {M1, M2, M3… Mn}, and aggregate all the trained models.

Now if we have a new query to process with our ensemble model, the query point goes through each of the models that were previously trained on the sampled data. To get the final result, we combine (aggregate) their predictions, for example by majority vote for classification or by averaging for regression. By doing that we often get better results than with a single model.
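The bagging procedure can be sketched in a few lines. The base model used here, a one-feature threshold rule often called a decision stump, and the tiny dataset are assumptions made for illustration. Any model could take the stump's place.

```python
import random
from collections import Counter


def train_stump(sample):
    """Fit a one-feature threshold rule: predict 1 if x > threshold, else 0."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(label) for x, label in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagging_fit(data, n_models, rng):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # same size as data, with replacement
        models.append(train_stump(sample))
    return models


def bagging_predict(models, x):
    """Aggregate the stumps' predictions by majority vote."""
    votes = Counter(int(x > threshold) for threshold in models)
    return votes.most_common(1)[0][0]


# Hypothetical (feature, label) pairs: small values are class 0, large values are class 1.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0), (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]

rng = random.Random(0)  # fixed seed so the run is repeatable
models = bagging_fit(data, n_models=25, rng=rng)
print(bagging_predict(models, 2.5), bagging_predict(models, 7.5))
```

An odd number of models avoids ties in the vote for a two-class problem.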

2. Unsupervised Learning

In Unsupervised Learning, we are given only feature values (no target values) which means the dataset we have is unlabelled.

The goal of unsupervised learning is to find structure in the features themselves, even though we don’t know in advance what we are looking for.

For example, say we have data on the customers of an e-commerce website, with features such as expense score, credit card score, number of items purchased, etc. We don’t know in advance what the groups are, but grouping customers with similar features can reveal segments that share commonalities. We can then target each segment with a better-suited promotional plan.

Unsupervised Learning Techniques

1. Clustering

Clustering is a technique that helps you group similar observations.

For example, imagine a scatter of unlabeled data points with no obvious structure. After performing a clustering analysis, each point is assigned to a group of nearby points, and the clusters become visible.
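One widely used clustering algorithm is k-means. The sketch below is a minimal version under a few assumptions: the points are invented, the number of clusters is fixed at two, and the starting centroids are picked by hand rather than at random.

```python
import math


def k_means(points, initial_centroids, max_iterations=100):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its old centroid
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, assignments


# Hypothetical 2-D observations forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centroids, assignments = k_means(points, initial_centroids=[points[0], points[3]])
print(assignments)
print(centroids)
```

The result of k-means depends on the starting centroids, so in practice it is usually run several times with different random starts.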

Clustering quality depends on two distances: the inter-cluster distance and the intra-cluster distance. We can measure clustering performance with, for example, the Dunn Index, which is the smallest inter-cluster distance divided by the largest intra-cluster distance.

Inter-cluster distance: the distance between clusters.

Intra-cluster distance: the distance between the data points in a cluster.

Ideally, we want the inter-cluster distance to be high and the intra-cluster distance to be very low, which gives a high Dunn Index and good clusters.
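Here is a small sketch of the Dunn Index. There are several ways to define the two distances. This version assumes the inter-cluster distance is the closest pair of points from different clusters and the intra-cluster distance is the farthest pair of points within one cluster. Both example clusterings are made up.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Smallest inter-cluster distance divided by the largest intra-cluster distance."""
    # Inter-cluster: closest pair of points that belong to different clusters.
    min_inter = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Intra-cluster: largest distance between two points of the same cluster.
    max_intra = max(
        math.dist(p, q)
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    )
    return min_inter / max_intra


# The same six hypothetical points, grouped well ("tight") and badly ("loose").
tight = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
loose = [[(0, 0), (0, 1), (10, 10)], [(1, 0), (10, 11), (11, 10)]]

print("tight:", dunn_index(tight))
print("loose:", dunn_index(loose))
```

The well-separated grouping gets the higher score, which matches the intuition that good clusters are compact and far from each other.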

2. Dimensionality Reduction

Dimensionality Reduction is a technique that projects higher-dimensional data onto a lower-dimensional space.

Why could it be useful?

Because we can only really visualize and understand data in low dimensions, usually 2 or 3. Such a projection can give us insights that we wouldn’t immediately see in 1000-dimensional vectors of floats but can see in a 2D approximation.

[Figure: dimensionality reduction. Source: Wikipedia]

The most popular technique for dimensionality reduction is Principal Component Analysis (PCA).

In PCA, we look for the directions (principal components) along which the data varies the most, and keep only the first few of them.

For intuition, say we have three features: height, weight, and hair length, and we want to drop one of them before deciding whether a person is obese or not.

It is quite likely that hair length doesn’t add much information to our problem, so we can neglect it and carry on with the rest of the features. PCA follows the same spirit, but instead of dropping an original feature it builds new axes as combinations of the features and keeps the ones that preserve the maximum variance.
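The sketch below shows the core of PCA on a tiny, invented 2-D dataset. It estimates the first principal component with power iteration on the covariance matrix, which is one simple way to do it and is assumed here for brevity. Libraries typically use a more robust decomposition.

```python
import math


def first_principal_component(rows, iterations=200):
    """Estimate the direction of maximum variance by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then rescale to unit length.
        v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return means, v


def project(rows, means, component):
    """Reduce each row to one number: its coordinate along the component."""
    return [sum((x - m) * c for x, m, c in zip(row, means, component)) for row in rows]


# Hypothetical 2-D data lying close to the line y = x.
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]

means, component = first_principal_component(rows)
reduced = project(rows, means, component)
print("component:", component)
print("1-D representation:", reduced)
```

Each 2-D point is replaced by a single number, yet most of the spread in the data survives because the points vary mainly along that one direction.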

3. Reinforcement Learning

Reinforcement learning is agent-based learning in which an agent learns how to behave in an environment by performing actions and trying to maximize the rewards it receives.

In reinforcement learning, the agent learns without any labeled data; it has to learn from experience alone.

For example, consider a game in which Mario’s mission is to rescue the princess.

Now we can consider Mario as our agent and the princess as a reward with many hurdles in between. Mario is supposed to find the best possible path to reach the princess.

For each correct block Mario steps on, he is rewarded with +1 point and advances one block; for each incorrect block, he is punished with -1 point and moves back to the previous block.

So +1 and -1 can be considered positive and negative reinforcement, respectively.

Through many iterations, Mario will learn which blocks he should step on and which he should not in order to rescue the princess.
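To make this loop concrete, here is a minimal sketch using tabular Q-learning, one common reinforcement learning algorithm chosen here for illustration. The environment is a hypothetical simplification of the Mario example: a corridor of five blocks with the princess on the last one, where stepping forward counts as the correct block (+1) and stepping back as the incorrect one (-1). All parameter values are assumptions.

```python
import random

N_BLOCKS = 5        # positions 0..4; the princess is at position 4
ACTIONS = (-1, +1)  # step back or step forward


def step(state, action):
    """Hypothetical environment: forward is rewarded (+1), backward is punished (-1)."""
    next_state = min(max(state + action, 0), N_BLOCKS - 1)
    reward = 1 if action == +1 else -1
    done = next_state == N_BLOCKS - 1
    return next_state, reward, done


def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Learn a table q[state][action index] of expected future rewards."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_BLOCKS)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(2)  # explore: try a random action
            else:
                a = max((0, 1), key=lambda i: q[state][i])  # exploit: best known action
            next_state, reward, done = step(state, ACTIONS[a])
            # Target: immediate reward plus discounted value of the next state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q = q_learning()
for state in range(N_BLOCKS - 1):
    best = "forward" if q[state][1] > q[state][0] else "back"
    print("block", state, "-> step", best)
```

After training, the table tells the agent which action to prefer on every block, which plays the role of Mario's learned experience.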

Final thoughts

I have tried to touch on the most common techniques used in data science projects.

But just reading about them is not going to be enough to successfully finish those projects. Sorry.

Now that you know what machine learning methods are out there, you need to practice and dive deeper into whichever method your project requires.

Good luck!

Python & Machine Learning Instructor | Founder of