In this article, we’ll look at a major problem with using Random Forest for regression: **extrapolation**.

We’ll cover the following items:

- Random Forest Regression vs Linear Regression
- Random Forest Regression Extrapolation Problem
- Potential solutions
- Should you use Random Forest for Regression?

Let’s dive in.

## Random Forest Regression vs Linear Regression

Random Forest Regression is quite a robust algorithm. The question, however, is **should you use it for regression**?

Why not use linear regression instead? The function in a Linear Regression can easily be written as y = mx + c, while a complex Random Forest Regression looks like a black box that can’t easily be represented as a function.

Generally, Random Forests produce better results, work well on large datasets, and are able to work with missing data by creating estimates for the missing values. However, they have one major limitation: they can’t extrapolate beyond the range of values seen in the training data. We’ll dive deeper into this limitation in a minute.

**Decision Tree Regression**

Decision Trees are great for obtaining non-linear relationships between input features and the target variable.

The **inner working of a Decision Tree can be thought of as a bunch of if-else conditions.**

It starts at the very top with one node. This node then splits into a left and right node — decision nodes. These nodes then split into their respective right and left nodes.

The bottom-most nodes are referred to as leaves or terminal nodes. Each leaf covers one region of the feature space, defined by the conditions on the path leading to it.

The value in a leaf is usually the mean of the training observations that fall within that region. For instance, in the rightmost leaf node below, 552.889 is the average of the 5 samples.

How far this splitting goes is what is known as the depth of the tree. This is one of the hyperparameters that can be tuned. The maximum depth of the tree is specified so as to prevent the tree from becoming too deep — a scenario that leads to overfitting.
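To make the if-else picture concrete, here is a minimal hand-written sketch of a depth-2 regression tree. The toy data and the split thresholds are made up for illustration (a real tree learns its thresholds from the data), and the names `leaf_id`, `leaf_value` and `predict` are hypothetical helpers, not part of any library.

```python
from statistics import mean

# Toy training data, invented for illustration: (carat, price) pairs.
train = [
    (0.3, 400.0), (0.4, 600.0),
    (0.9, 3000.0), (1.0, 4000.0),
    (1.5, 9000.0), (2.0, 15000.0),
]

def leaf_id(carat):
    """Route a sample to a leaf using plain if-else rules (hand-picked thresholds)."""
    if carat <= 0.65:
        return "small"
    else:
        if carat <= 1.25:
            return "medium"
        else:
            return "large"

# Each leaf stores the mean target of the training samples that land in it.
leaf_value = {
    leaf: mean(price for carat, price in train if leaf_id(carat) == leaf)
    for leaf in ("small", "medium", "large")
}

def predict(carat):
    """A prediction is simply the stored mean of the leaf the sample falls into."""
    return leaf_value[leaf_id(carat)]

print(leaf_value)
print(predict(0.35))  # falls into the "small" leaf
print(predict(5.0))   # far beyond the training data, still gets the "large" leaf mean
```

Note that a 5-carat stone, far larger than anything in the toy training data, still receives the mean of the last leaf. This already hints at the extrapolation issue discussed later in the article.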

**Random Forest Regression**

A Random Forest is an ensemble of decision trees. This is to say that many trees, constructed in a certain “random” way, form a Random Forest.

- Each tree is created from a different sample of rows and at each node, a different sample of features is selected for splitting.
- Each of the trees makes its own individual prediction.
- These predictions are then averaged to produce a single result.

Averaging over many trees improves accuracy and reduces overfitting, which is what makes a Random Forest better than a single Decision Tree.

A prediction from the Random Forest Regressor is an average of the predictions produced by the trees in the forest.
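The averaging step can be checked directly. The sketch below assumes scikit-learn's `RandomForestRegressor` (the article does not name a library) and uses synthetic data invented for the example. It compares the forest's prediction with a manual average over the individual trees in `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for illustration: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

X_new = np.array([[1.0], [5.0], [9.0]])

# One row of predictions per tree in the forest.
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])

# Averaging the per-tree predictions should reproduce the forest's own prediction.
manual_average = per_tree.mean(axis=0)
print(np.allclose(manual_average, forest.predict(X_new)))
```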

**Example of trained Linear Regression and Random Forest**

In order to dive in further, let’s look at an example of a Linear Regression and a Random Forest Regression. For this, we’ll apply the Linear Regression and a Random Forest Regression to the same dataset and compare the result.

Let’s take this example dataset, where the task is to predict the price of diamonds based on other features like carat, depth, table, x, y and z. Let’s look at the distribution of price below:

We can see that the **price ranges from 326 to 18823.**

Let’s train the Linear Regression model and run predictions on the validation set.

The distribution of predicted prices is the following:

**Predicted prices are clearly outside the range of values of “price” seen in the training dataset.**

A Linear Regression model, just like the name suggests, creates a linear model of the data. A simple way to think about it is in the form of y = mx + c. Since it fits a linear function, it is able to produce values outside the range of the training set during prediction. **It is able to extrapolate based on the data.**

Let’s now look at the results obtained from a Random Forest Regressor using the same dataset.

These values are clearly **within the range of 326 and 18823** — just like in our training set. There are no values outside that range. **Random Forest cannot extrapolate.**
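The same contrast can be reproduced on a small scale. The sketch below does not use the diamonds data; it assumes scikit-learn and a synthetic dataset with a linear trend, where the validation inputs cover a wider range than the training inputs. All variable names are chosen for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the article's dataset: a noisy linear trend.
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 10, size=(300, 1))
y_train = 3.0 * X_train[:, 0] + rng.normal(scale=1.0, size=300)

# Validation inputs deliberately reach beyond the training range (up to 20).
X_valid = rng.uniform(0, 20, size=(300, 1))

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

linear_pred = linear.predict(X_valid)
forest_pred = forest.predict(X_valid)

print(f"training target range: {y_train.min():.1f} to {y_train.max():.1f}")
print(f"linear predictions:    {linear_pred.min():.1f} to {linear_pred.max():.1f}")
print(f"forest predictions:    {forest_pred.min():.1f} to {forest_pred.max():.1f}")
```

On data like this, the linear model's predictions should run well past the largest training target, while the forest's predictions stay inside the training target range.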

## Extrapolation problem

As you have seen above, when using a Random Forest Regressor, the predicted values are never outside the training set values for the target variable.

If you look at prediction values they will look like this:

Wondering why?

Let’s explore that phenomenon here. The data used above has the following columns for predicting the price: carat, depth, table, x, y and z.

The diagram below shows one decision tree from the Random Forest Regressor.

Let’s zoom in to a smaller section of this tree. For example, there are 4 samples with depth <= 62.75, x <= 5.545, carat <= 0.905, and z <= 3.915. The price being predicted for these is 2775.75, which is the mean price of these four samples. Therefore, **any value in the test set that falls in this leaf will be predicted as 2775.75.**

This is to say that when the Random Forest Regressor is asked to predict for values not previously seen, it will always predict an average of target values seen previously. An average of a sample cannot fall outside the highest and lowest values in that sample.

The **Random Forest Regressor is unable to discover trends that would enable it to extrapolate to values that fall outside the training set**. When faced with such a scenario, its prediction stays close to the most extreme target values it saw in training, for example close to the maximum value in the training set when the trend is increasing. Figure 1 above illustrates that.
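A small experiment shows this plateau. The sketch assumes scikit-learn and a synthetic increasing trend (not the diamonds data). Every input beyond the largest training value follows the same path through each tree, so all of them should receive the same prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for illustration: a noisy increasing trend on [0, 10).
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 2.0 * X_train[:, 0] + rng.normal(scale=0.5, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=1).fit(X_train, y_train)

# Inputs far beyond anything seen in training.
X_far = np.array([[15.0], [50.0], [1000.0]])
far_pred = forest.predict(X_far)

print("predictions far outside the training range:", far_pred)
print("largest training target:", y_train.max())
```

The true trend would give targets of roughly 30, 100 and 2000 for these inputs, but the forest has no way to continue the trend past the last split it learned.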

## Potential solutions

Ok, so how can you deal with this extrapolation problem?

There are a couple of options:

- Use a linear model, such as Linear Regression or Support Vector Regression with a linear kernel.
- Build a deep learning model, because neural nets are able to extrapolate (they are basically stacked linear regression models on steroids).
- Combine predictors using stacking. For example, you can create a stacking regressor using a Linear model and a Random Forest Regressor.
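
As a rough sketch of the stacking option, the example below assumes scikit-learn's `StackingRegressor` and synthetic data with a linear trend. Choosing `LinearRegression` as the final estimator is an assumption made for this example. Because the final estimator is linear, the stacked model can pass on the linear model's out-of-range predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data, invented for illustration: a noisy linear trend on [0, 10).
rng = np.random.default_rng(7)
X_train = rng.uniform(0, 10, size=(300, 1))
y_train = 3.0 * X_train[:, 0] + rng.normal(scale=1.0, size=300)

# Stack a linear model and a Random Forest; a linear meta-learner combines them.
stack = StackingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=7)),
    ],
    final_estimator=LinearRegression(),
)
stack.fit(X_train, y_train)

# A plain Random Forest for comparison.
forest_only = RandomForestRegressor(n_estimators=50, random_state=7).fit(X_train, y_train)

X_far = np.array([[20.0]])  # well outside the training range
stack_pred = stack.predict(X_far)[0]
forest_pred = forest_only.predict(X_far)[0]

print(f"largest training target: {y_train.max():.1f}")
print(f"forest alone at x=20:    {forest_pred:.1f}")
print(f"stacked model at x=20:   {stack_pred:.1f}")
```

How much weight the meta-learner gives each base model is learned from the data, so how well this extrapolates depends on whether the linear component actually captures the trend.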

**Use modified versions of Random Forest**

One such extension is Regression-Enhanced Random Forests (RERFs). The authors of the RERF paper propose a technique that borrows the strengths of penalized parametric regression to give better results in extrapolation problems.

Specifically there are two steps to the process:

- run Lasso before Random Forest,
- train a Random Forest on the residuals from Lasso.

Since Random Forest is a fully nonparametric predictive algorithm, it may not efficiently incorporate known relationships between the response and the predictors. Its predictions are built from the response values, which are the observed values Y1, ..., Yn from the training data. RERFs are able to incorporate known relationships between the responses and the predictors, which is another benefit of using Regression-Enhanced Random Forests for regression problems.
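
The two steps can be sketched as follows. This is a simplified illustration assuming scikit-learn, synthetic data, and a fixed Lasso penalty (`alpha=0.1`) chosen arbitrarily. It is not the authors' implementation, which may involve tuning steps that are skipped here. The helper `rerf_predict` is a hypothetical name.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# Synthetic data, invented for illustration: linear trend + non-linear wiggle + noise.
rng = np.random.default_rng(3)
X_train = rng.uniform(0, 10, size=(300, 1))
y_train = (
    3.0 * X_train[:, 0]
    + 2.0 * np.sin(X_train[:, 0])
    + rng.normal(scale=0.5, size=300)
)

# Step 1: fit Lasso to capture the parametric (linear) part of the signal.
lasso = Lasso(alpha=0.1).fit(X_train, y_train)

# Step 2: fit a Random Forest on what Lasso could not explain (the residuals).
residuals = y_train - lasso.predict(X_train)
forest = RandomForestRegressor(n_estimators=100, random_state=3).fit(X_train, residuals)

def rerf_predict(X):
    """Final prediction = Lasso's linear part + the forest's residual correction."""
    return lasso.predict(X) + forest.predict(X)

X_far = np.array([[20.0]])  # well outside the training range
rerf_pred = rerf_predict(X_far)[0]

print(f"largest training target:  {y_train.max():.1f}")
print(f"two-step model at x=20:   {rerf_pred:.1f}")
```

Outside the training range, the forest's residual correction flattens out just like any Random Forest prediction, so the extrapolation comes entirely from the Lasso part.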

## Final thoughts

At this point, I am sure you might be wondering whether or not you should use a Random Forest for regression problems.

Let’s look at that.

**When to use it**

- When the data has a non-linear trend and extrapolation outside the training data is not important.

**When not to use it**

- When your data is a time series with a trend. Such problems require identifying a growing or decreasing trend, which a Random Forest Regressor cannot continue beyond the range of the training data.

Hopefully, this article gave you some background into the inner workings of Random Forest Regression.
