ARIMA vs Prophet vs LSTM for Time Series Prediction

Assuming we subscribe to a linear understanding of time and causality, as Dr. Sheldon Cooper says, then representing historical events as a series of values and features observed over time provides the foundations for learning from the past. However, time series are somewhat different from other datasets, including sequential data like text or DNA sequences.

The time component provides additional information that can be useful when predicting the future. Thus, there are many different techniques designed specifically for dealing with time series. Such techniques range from simple visualization tools that show trends evolving or repeating over time to advanced machine learning models that utilize the specific structure of time series.

In this post, we will discuss three popular approaches to learning from time-series data:

  • The classic ARIMA framework for time series prediction
  • Facebook’s in-house model Prophet, which is specifically designed for learning from business time series
  • The LSTM model, a powerful recurrent neural network approach that has been used to achieve the best-known results for many problems on sequential data

We will then show how to compare the results across the three models using Neptune and its powerful features.

Let’s start with a brief overview of the three methods.

Overview of the three methods: ARIMA, Prophet and LSTM


ARIMA is a class of time series prediction models, and the name is an abbreviation for AutoRegressive Integrated Moving Average. The backbone of ARIMA is a mathematical model that represents the time series values using its past values. This model is based on two main features: 

  1. Past Values: Clearly, past behaviour is a good predictor of the future. The only question is how many past values we should use. The model uses the last p time series values as features. Here p is a hyperparameter that needs to be determined when we design the model.
  2. Past Errors: The model can use the information on how well it has performed in the past. Thus, we add as features the most recent q errors the model made. Again, q is a hyperparameter.   

An important aspect here is that the time series must first be transformed so that the model becomes independent of seasonal or temporary trends. Formally, we want the model to be trained on a stationary time series. Intuitively, stationarity means that the statistical properties of the process generating the time series do not change over time. It does not mean that the series itself does not change over time, just that the way it changes does not itself change over time.

There are several approaches to making a time series stationary, the most popular being differencing. By replacing the n values in the series with the n-1 differences between consecutive values, the model sees changes rather than levels, which forces it to learn more advanced patterns. When the model predicts a new value, we simply add the last observed value to it in order to obtain a final prediction. Stationarity can be somewhat confusing when you encounter the concept for the first time; you can refer to this tutorial for more details.
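The differencing step and its inverse can be sketched in a few lines of NumPy. This is only an illustration on made-up numbers; the helper names `difference` and `undo_first_difference` are hypothetical and not part of any library used in this post.

```python
import numpy as np

def difference(series, d=1):
    """Apply first-order differencing d times (each pass shortens the series by one)."""
    values = np.asarray(series, dtype=float)
    for _ in range(d):
        values = np.diff(values)
    return values

def undo_first_difference(last_observed, predicted_difference):
    """Turn a predicted difference back into a prediction on the original scale."""
    return last_observed + predicted_difference

series = np.array([10.0, 12.0, 15.0, 19.0, 24.0])  # made-up values
once = difference(series, d=1)    # changes between consecutive values
twice = difference(series, d=2)   # changes of the changes

# If a model predicted a difference of 6.0 for the next step,
# the forecast on the original scale is the last observation plus 6.0.
forecast = undo_first_difference(series[-1], 6.0)
```

Note that the once-differenced series still has a trend here (2, 3, 4, 5), while the twice-differenced series is constant, which is the kind of situation where a second round of differencing helps.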


Formally, ARIMA is defined by three parameters p, d, and q that describe the three main components of the model. 

  • Integrated (the I in ARIMA): The number of differences needed to achieve stationarity is given by the parameter d. Let the original values be Y_t, where t is the index in the sequence. We create a stationary time series y_t using the following transformations for different values of d.

For d=0:

$$y_t = Y_t$$

In this case the series is already stationary and we have nothing to do.

For d=1:

$$y_t = Y_t - Y_{t-1}$$

This is the most typical transformation.

For d=2:

$$y_t = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) = Y_t - 2Y_{t-1} + Y_{t-2}$$

Observe that differencing can be seen as a discrete version of differentiation. For d=1 the new features represent how the values change, while for d=2 they represent the rate of that change, just like the second derivative in calculus. The above can be generalized to d>2 as well, but this is rarely used in practice.

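  • AutoRegressive (AR): The parameter p tells us how many past values to consider when expressing the current value. Writing y_t for the series after differencing, we essentially learn a model that predicts the value at time t as:

$$\hat{y}_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p}$$

  • Moving Average (MA): The parameter q tells us how many of the past forecast errors should be considered. A new value is computed as:

$$\hat{y}_t = \mu + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q}$$

where the past prediction errors are:

$$\varepsilon_{t-i} = y_{t-i} - \hat{y}_{t-i}$$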

The combination of the three components gives the ARIMA(p, d, q) model. More precisely, we first integrate the time series, and then we add the AR and MA models and learn the corresponding coefficients.
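To make the autoregressive part concrete, here is a minimal sketch that fits an AR(p) model by ordinary least squares with NumPy. It is a simplification: real ARIMA implementations estimate the AR and MA coefficients jointly, typically by maximum likelihood, and the function names `fit_ar` and `predict_next` are hypothetical.

```python
import numpy as np

def fit_ar(series, p):
    """Least-squares fit of y_t = c + phi_1 * y_{t-1} + ... + phi_p * y_{t-p}."""
    y = np.asarray(series, dtype=float)
    # The row for time t holds [1, y[t-1], ..., y[t-p]]
    rows = [np.concatenate(([1.0], y[t - p:t][::-1])) for t in range(p, len(y))]
    X = np.vstack(rows)
    targets = y[p:]
    coef, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return coef  # [c, phi_1, ..., phi_p]

def predict_next(series, coef):
    """One-step-ahead prediction from the p most recent values."""
    p = len(coef) - 1
    y = np.asarray(series, dtype=float)
    recent = y[-p:][::-1]  # most recent value first
    return coef[0] + float(np.dot(coef[1:], recent))

# Made-up series generated exactly by y_t = 1 + 0.5 * y_{t-1}
series = [0.0]
for _ in range(9):
    series.append(1.0 + 0.5 * series[-1])

coef = fit_ar(series, p=1)
next_value = predict_next(series, coef)
```

Because the toy series follows the recurrence exactly, the fit recovers the intercept and the coefficient it was generated with; on real data the fit would only be approximate.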


Prophet was developed by Facebook as an in-house algorithm for predicting time series values in different business applications. It is therefore specifically designed for the prediction of business time series.

It is an additive model consisting of four components:

$$y(t) = g(t) + s(t) + h(t) + \varepsilon_t$$


Let us discuss the meaning of each component:

  1. g(t): It represents the trend, and the objective is to capture the general trend of the series. For example, the number of advertisement views on Facebook is likely to increase over time as more people join the network. But what would be the exact function of increase?
  2. s(t): It is the Seasonality component. The number of advertisement views might also depend on the season. For example, in the Northern hemisphere during the summer months, people are likely to spend more time outdoors and less time in front of their computers. Such seasonal fluctuations can be very different for different business time series. The second component is thus a function that models seasonal trends.
  3. h(t): The Holidays component. We use the information for holidays, which have a clear impact on most business time series. Note that holidays vary between years, countries, etc., and therefore the information needs to be explicitly provided to the model.
  4. The error term ε_t stands for random fluctuations that cannot be explained by the model. As usual, it is assumed that ε_t follows a normal distribution N(0, σ²) with zero mean and unknown variance σ² that has to be derived from the data.
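The additive structure can be illustrated with a synthetic series built from the four components. This sketch does not use Prophet and does not reflect how Prophet parameterizes its trend or seasonality; all numbers, including the two "holiday" days per year, are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(730)  # two years of daily observations

g = 100.0 + 0.05 * t                               # trend: slow linear growth
s = (5.0 * np.sin(2 * np.pi * t / 365.25)          # yearly seasonality
     + 2.0 * np.sin(2 * np.pi * t / 7))            # weekly seasonality
h = np.where(np.isin(t % 365, [358, 359]), 10.0, 0.0)  # bump on two "holiday" days
eps = rng.normal(loc=0.0, scale=1.0, size=t.size)  # unexplained noise

y = g + s + h + eps  # the observed series is simply the sum of the parts
```

Fitting a model like Prophet goes in the opposite direction: given only y (and the holiday dates), it estimates the individual components.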

LSTM recurrent neural networks

LSTM stands for Long Short-Term Memory. LSTM cells are used in recurrent neural networks that learn to predict the future from sequences of variable lengths. Note that recurrent neural networks work with any kind of sequential data and, unlike ARIMA and Prophet, are not restricted to time series.

The main idea behind LSTM cells is to learn the important parts of the sequence seen so far and forget the less important ones. This is achieved by the so-called gates, i.e., functions that have different learning objectives such as: 

  1. a compact representation of the time series seen so far
  2. how to combine new input with the past representation of the series
  3. what to forget about the series
  4. what to output as a prediction for the next time step. 

See Figure 1 and the Wikipedia article for more details.
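The gates listed above can be written down as a single update step. The following NumPy sketch implements the standard LSTM cell equations with randomly initialized (untrained) weights; it is meant to show the data flow only, and the names `lstm_step` and `run_lstm` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell.

    Shapes: W is (4*hidden, n_inputs), U is (4*hidden, hidden), b is (4*hidden,).
    """
    z = W @ x + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, and output gates
    g = np.tanh(g)                                # candidate update from the new input
    c = f * c_prev + i * g   # cell state: forget part of the past, add part of the new
    h = o * np.tanh(c)       # hidden state: what the cell outputs at this step
    return h, c

def run_lstm(sequence, hidden, rng):
    """Run an untrained cell over a sequence and return every hidden state h_t."""
    n_inputs = sequence.shape[1]
    W = rng.normal(scale=0.1, size=(4 * hidden, n_inputs))
    U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
    b = np.zeros(4 * hidden)
    h, c = np.zeros(hidden), np.zeros(hidden)
    states = []
    for x in sequence:
        h, c = lstm_step(x, h, c, W, U, b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
sequence = rng.normal(size=(5, 1))       # 5 time steps, 1 feature (made-up data)
states = run_lstm(sequence, hidden=3, rng=rng)
last_state = states[-1]                  # the representation after the full sequence
```

In a trained network, W, U, and b are learned so that the gates keep the useful parts of the history in the cell state c and discard the rest.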

Designing an optimal LSTM based model can be a difficult task that requires careful hyperparameter tuning. Here is the list of the most important parameters an LSTM based model needs to consider:

  • How many LSTM cells to use to represent the sequence? Note that each LSTM cell will focus on specific aspects of the time series processed so far. Too few LSTM cells are unlikely to capture the structure of the sequence, while too many might lead to overfitting.
  • Typically, we first convert the input sequence into another sequence, i.e. the values h_t. This yields a new representation, as the h_t states capture the structure of the series processed so far. But at some point, we no longer need all h_t values but only the last one. The last h_t of each individual LSTM cell can then be fed into a fully connected layer. Designing the exact architecture might require careful fine-tuning and many trials.
Figure 1: the structure of an LSTM cell | Source

Finally, we would like to reiterate that recurrent neural networks are a general class of methods for learning from sequential data and they can work with arbitrary sequences such as natural text or audio.  

Experimental evaluation: ARIMA vs Prophet vs LSTM


We are going to use stock exchange data for Bajaj Finserv Ltd, an Indian financial services company, to compare the three models. The dataset spans the period from 2008 until the end of 2021. It contains the daily stock price (mean, low, and high values) as well as the total volume and the turnover of traded stocks. A subsample of the dataset is shown in Figure 2.

Figure 2: the data used for evaluation | Source: Author

We are interested in predicting the Volume Weighted Average Price (VWAP) variable at the end of each day.  A graph of the time series VWAP values is presented in Figure 3.

Figure 3: the daily values of the VWAP variable | Source: Author

For the evaluation, we divided the time series into a train and test time series where the training series consists of the data until the end of 2018 (see Figure 4). 

Total number of observations: 3201 

Training observations: 2624

Test observations: 577
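A chronological split like this one can be sketched with pandas as follows. The toy data frame and the helper `split_by_date` are hypothetical; the only assumption carried over from the post is that there is a date column and that everything up to a cutoff date goes into the training set.

```python
import pandas as pd

def split_by_date(df, date_column, last_train_date):
    """Split chronologically: rows up to the cutoff train the model, later rows test it."""
    df = df.sort_values(date_column)
    cutoff = pd.Timestamp(last_train_date)
    train = df[df[date_column] <= cutoff]
    test = df[df[date_column] > cutoff]
    return train, test

# Made-up rows around the cutoff used in this post (end of 2018)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2018-12-28", "2018-12-31", "2019-01-01", "2019-01-02"]),
    "VWAP": [1.0, 2.0, 3.0, 4.0],
})
train, test = split_by_date(toy, "Date", "2018-12-31")
```

Unlike a random split, this keeps every test observation strictly later than every training observation, so the model is never evaluated on the past.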

Figure 4: the train and test subsets of the VWAP time series | Source: Author


To work properly, machine learning models require good data, so we will do a little feature engineering. The objective behind feature engineering is to design more powerful models that exploit different patterns in the data. As the three models learn patterns observed in the past, we create additional features that thoroughly describe the recent trends of the stock movements.

In particular, we track the moving average for the different trade features over a period of 3, 7, and 30 days. In addition, we consider features such as the month, the week number, and the weekday. Thus, the input to our models is multidimensional.  A small example of the used feature engineering looks as follows:

lag_features = ["High", "Low", "Volume", "Turnover", "Trades"]
df_rolled_7d = df[lag_features].rolling(window=7, min_periods=0)
df_mean_7d = df_rolled_7d.mean().shift(1).reset_index().astype(np.float32)

The above code excerpt shows how to add the running mean over the last week of several features describing the trading of the stock. Overall, we create a set of exogenous features from these rolling statistics and the calendar fields.
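For readers who want to see the whole feature engineering step in one place, here is a hedged sketch on a toy data frame. The column naming scheme and the helper `add_features` are hypothetical and may differ from the features actually used in the experiments; only the windows (3, 7, and 30 days) and the calendar fields come from the description above.

```python
import pandas as pd

def add_features(df, columns, windows=(3, 7, 30)):
    """Add lagged rolling means and calendar features to a copy of df."""
    out = df.copy()
    for window in windows:
        # shift(1) ensures that each row only sees values from previous days
        rolled = df[columns].rolling(window=window, min_periods=1).mean().shift(1)
        for column in columns:
            out[f"{column}_mean_lag{window}"] = rolled[column]
    out["month"] = df["Date"].dt.month
    out["week"] = df["Date"].dt.isocalendar().week.astype(int)
    out["weekday"] = df["Date"].dt.dayofweek
    return out

# Made-up data: five consecutive days starting on a Monday
toy = pd.DataFrame({
    "Date": pd.date_range("2021-01-04", periods=5, freq="D"),
    "High": [1.0, 2.0, 3.0, 4.0, 5.0],
})
features = add_features(toy, ["High"])
```

The first row of every rolling feature is empty because no earlier observation exists; such rows need to be filled or dropped before training.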


Now, let’s get started with our main models:


We used the ARIMA implementation from the publicly available package pmdarima. The function auto_arima accepts, as an additional parameter, a list of exogenous features, for which we provide the features created in the feature engineering step. The main advantage of auto_arima is that it first performs several tests to decide whether the time series is stationary. It also employs a smart grid search strategy that determines suitable values for the parameters p, d, and q discussed in the previous section.

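from pmdarima import auto_arima
# Arguments inferred from the surrounding text; newer pmdarima releases name the second one X instead of exogenous.
model = auto_arima(df_train["VWAP"], exogenous=df_train[exogenous_features])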

auto_arima searches over different values of the parameters p, d, and q and, in the end, returns the model with the smallest AIC value. (The AIC, or Akaike information criterion, balances how well a model fits the data against its complexity; lower values are better.)
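As a rough illustration of how an AIC-based selection works, the sketch below computes AIC = 2k - 2 ln(L) for a few candidate orders and picks the smallest. The log-likelihoods and parameter counts are invented for the example and are not results from our experiments; pmdarima computes these quantities internally.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: penalize parameters, reward likelihood."""
    return 2 * n_parameters - 2 * log_likelihood

# Hypothetical candidates: (p, d, q) -> (log-likelihood, number of fitted parameters)
candidates = {
    (1, 1, 0): (-520.0, 2),
    (1, 1, 1): (-510.0, 3),
    (3, 1, 3): (-509.5, 7),
}

scores = {order: aic(ll, k) for order, (ll, k) in candidates.items()}
best_order = min(scores, key=scores.get)
```

In this made-up example the largest model fits slightly better than ARIMA(1, 1, 1), but not by enough to justify its four extra parameters, so the smaller model wins.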


Predictions on the test set are then obtained by

forecast = model.predict(n_periods=len(df_valid),  exogenous=df_valid[exogenous_features])


We use the publicly available Python implementation of Prophet. The input data must contain two specific columns:

  1. ds: the date, which should be a valid calendar date from which the holidays can be computed (the Date column in our data)
  2. y: the target variable we want to predict (VWAP in our data).

We instantiate the model as:

from prophet import Prophet
model = Prophet()

The features created during feature engineering have to be explicitly added to the model as follows:

for feature in exogenous_features:
	model.add_regressor(feature)

Finally, we fit the model:

model.fit(df_train[["Date", "VWAP"] + exogenous_features].rename(columns={"Date": "ds", "VWAP": "y"}))

And the forecast for the test set is obtained as:

forecast = model.predict(df_test[["Date", "VWAP"] + exogenous_features].rename(columns={"Date": "ds"}))


We used the Keras implementation of LSTMs:

import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout, LSTM
from tensorflow.keras.metrics import RootMeanSquaredError, MeanAbsoluteError
from tensorflow.keras.models import Sequential

The model is defined by the following function.

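def get_model(params, input_shape):
	model = Sequential()
	# Four stacked LSTM layers, each followed by dropout for regularization
	model.add(LSTM(units=params["lstm_units"], return_sequences=True, input_shape=(input_shape, 1)))
	model.add(Dropout(rate=params["dropout"]))
	model.add(LSTM(units=params["lstm_units"], return_sequences=True))
	model.add(Dropout(rate=params["dropout"]))
	model.add(LSTM(units=params["lstm_units"], return_sequences=True))
	model.add(Dropout(rate=params["dropout"]))
	# The last LSTM layer returns only its final state
	model.add(LSTM(units=params["lstm_units"], return_sequences=False))
	model.add(Dropout(rate=params["dropout"]))
	# A single output unit: the predicted next value
	model.add(Dense(1))
	model.compile(loss=params["loss"],
              	optimizer=params["optimizer"],
              	metrics=[RootMeanSquaredError(), MeanAbsoluteError()])

	return model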

Then we instantiate a model with a given set of parameters. We use the past 90 observations in the time series as a sequence for the input to the model. The other hyperparameters describe the architecture and the specific choices for training the model. 

params = {
	"loss": "mean_squared_error",
	"optimizer": "adam",
	"dropout": 0.2,
	"lstm_units": 90,
	"epochs": 30,
	"batch_size": 128,
	"es_patience": 10
}

model = get_model(params=params, input_shape=x_train.shape[1])

The above results in the following Keras model (see Figure 5):

Figure 5: a summary of the Keras LSTM model | Source: Author

We then create a callback to implement early stopping, i.e., to stop training the model if it yields no improvement on the validation dataset for a given number of epochs (in our case 10):

es_callback = tf.keras.callbacks.EarlyStopping(monitor='val_root_mean_squared_error', mode='min', patience=params["es_patience"])

The parameter es_patience refers to the number of epochs for early stopping.
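The logic behind the patience parameter can be sketched in plain Python. This is a simplified model of what an early stopping callback does, not the Keras implementation; the function `early_stopping_epoch` and the validation errors are hypothetical.

```python
def early_stopping_epoch(val_errors, patience):
    """Return the index of the epoch after which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors):
        if error < best:
            best = error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_errors) - 1  # patience never exhausted: train to the end

val_errors = [5.0, 4.0, 4.5, 4.2, 4.1, 3.9]   # made-up validation errors per epoch
stop_impatient = early_stopping_epoch(val_errors, patience=2)
stop_patient = early_stopping_epoch(val_errors, patience=10)
```

With a patience of 2, training stops at epoch 3 and never reaches the better result at epoch 5, which illustrates the trade-off: a small patience saves time but can stop too early on a noisy validation curve.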

Finally, we fit the model using the predefined parameters:

model.fit(x_train, y_train, epochs=params["epochs"], batch_size=params["batch_size"],
	validation_data=(x_test, y_test),
	callbacks=[neptune_callback, es_callback])
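The arrays x_train and y_train hold sliding windows over the series: each input is a sequence of the past 90 observations and each target is the value that follows it. A minimal sketch of such a windowing step, assuming a univariate series and using a hypothetical helper `make_windows`, could look like this:

```python
import numpy as np

def make_windows(values, lookback):
    """Build (samples, lookback, 1) inputs and next-value targets from a 1-D series."""
    values = np.asarray(values, dtype=np.float32)
    x = np.stack([values[i:i + lookback] for i in range(len(values) - lookback)])
    y = values[lookback:]
    # Add a trailing feature axis, the input layout Keras LSTM layers expect
    return x[..., np.newaxis], y

# Made-up series 0, 1, ..., 99 with a lookback of 90 steps
x_demo, y_demo = make_windows(np.arange(100), lookback=90)
```

Here x_demo has shape (10, 90, 1): the first window covers the values 0 to 89 and its target is 90. In practice the series is usually also scaled before windowing, which this sketch leaves out.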

Experiment tracking and model comparison 

Since this blog post aims to answer the simple question of which model yields the most accurate predictions on the test dataset, we need to see how the three models fare against each other.

There are many different approaches for model comparisons such as creating tables and charts that record the evaluation of different metrics, creating graphs that plot the predicted values vs the true values on a test set, etc. However, for this exercise, we will be using Neptune.

What is Neptune?

Neptune is a metadata store for MLOps, built for teams that run a lot of experiments.‌

It gives you a single place to log, store, display, organize, compare, and query all your model-building metadata.

‌Neptune is used for:‌

  • Experiment tracking: Log, display, organize, and compare ML experiments in a single place.
  • Model registry: Version, store, manage, and query trained models, and model building metadata.
  • Monitoring ML runs live: Record and monitor model training, evaluation, or production runs live

As described in this tutorial, we first create a Neptune project and record the API token of our account:

run = neptune.init(project='<YOUR_WORKSPACE/YOUR_PROJECT>',
                   api_token='<YOUR_API_TOKEN>')

The variable run can be seen as a folder in which we can create subfolders containing different information. For example, we can create a subfolder called model and record in it the name of the model:

run["model/name"] = "Arima"

We will compare the accuracy of these models with respect to two different metrics: 

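  1. The root mean square error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

  2. The mean absolute error (MAE):

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

Here y_i are the true values, ŷ_i the predicted values, and n the number of observations in the test set.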
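Both metrics are straightforward to compute with NumPy. The sketch below uses made-up values; the function names `rmse` and `mae` are our own and not taken from a library.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: penalizes large errors more strongly."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the errors."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(errors)))

y_true = [1.0, 2.0, 3.0]   # made-up true values
y_pred = [1.0, 2.0, 5.0]   # made-up predictions with one error of size 2
example_rmse = rmse(y_true, y_pred)
example_mae = mae(y_true, y_pred)
```

In this example the single large error makes the RMSE noticeably bigger than the MAE, which is why it is useful to report both.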

Note that these values can be logged into Neptune by setting the corresponding values, for example, setting:

run["test/mae"] = mae
run["test/rmse"] = rmse

The root mean square error and the mean absolute error for the three models can be seen next to each other in the runs table:

Figure 6: the RMSE and the MAE for the three models in the Neptune UI (the tags for each project are at the top) | Source

The comparison of the three algorithms can be then seen side by side in Neptune as shown in Figure 7.

Figure 7: side-by-side comparison of ARIMA, Prophet, and LSTM; the root mean square error and the mean absolute error for the three models can be seen next to each other (the tags for each project are at the top) | Source

We see that ARIMA yields the best performance, i.e. it achieves the smallest root mean square error and mean absolute error on the test set. In contrast, the LSTM neural network performs the worst of the three models.

The exact predictions plotted against the true values can be seen in the following images. We observe that all three models capture the overall trend of the time series, but the LSTM appears to be running behind the curve, i.e. it needs more time to adjust itself to a change in trend. Prophet appears to lose against ARIMA in the last few months of the considered test period, where it underestimates the true values.

Figure 8: ARIMA predictions | Source: Author
Figure 9: Prophet predictions | Source: Author
Figure 10: LSTM predictions | Source: Author

A deeper look into the performance of the models 

ARIMA grid-search

When doing grid-search over different values for p, d, and q in ARIMA, we can plot the individual values for the mean squared error. The colored dots in Figure 11 show the mean square error values for different ARIMA parameters over a validation set.

Figure 11: grid-search over the ARIMA parameters | Source

Trends in Prophet

In Figure 12 we show how the different components of the Prophet model change over time. We observe that the trend follows a linear increase while the seasonal components exhibit fluctuations.

Figure 12: the change of values of the different components in the Prophet over time | Source: Author

Why did LSTM fare the worst?

We collect in Neptune the mean absolute error while training the LSTM model over several epochs. This is achieved using a Neptune callback which captures training metadata and logs it automatically to Neptune. The results are shown in Figure 13. 

Observe that while the error on the training dataset decreases over subsequent epochs, this is not the case for the error on the validation set, which reaches its minimum in the second epoch and then fluctuates. This suggests that the LSTM model is too complex for a rather small dataset and is prone to overfitting. Despite adding regularization such as dropout, we still cannot avoid overfitting.

Figure 13: the evolution of train and test error over different epochs of training the LSTM model | Source


In this blog post, we presented and compared three different algorithms for time series prediction. As expected, there is no clear winner and each algorithm has its own advantages and limitations. Below we summarize our observations for each algorithm:

  1. ARIMA is a powerful model and as we saw it achieved the best result for the stock data. A challenge is that it might need careful hyperparameter tuning and a good understanding of the data. 
  2. Prophet is specifically designed for business time series prediction. It achieves very good results for the stock data but, anecdotally, it can fail spectacularly on time series datasets from other domains. In particular, this holds for time series where the notion of calendar date is not applicable and we cannot learn any seasonal patterns. Prophet’s advantage is that it requires less hyperparameter tuning, as it is specifically designed to detect patterns in business time series.
  3. LSTM-based recurrent neural networks are probably the most powerful approach to learning from sequential data, and time series are only a special case. The potential of LSTM-based models is fully revealed when learning from massive datasets where we can detect complex patterns. Unlike ARIMA or Prophet, they do not rely on specific assumptions about the data such as time series stationarity or the existence of a Date field. A disadvantage is that LSTM-based RNNs are difficult to interpret, and it is challenging to gain intuition into their behaviour. Also, careful hyperparameter tuning is required in order to achieve good results.

Future directions

I hope you enjoyed reading this article and now have a better understanding of the time series algorithms discussed here. If you want to dig deeper, here are some useful resources. Happy experimenting!

  1. PMD ARIMA. The documentation for the respective Python package.
  2. Prophet. Documentation and tutorial for Facebook Prophet.
  3. Keras LSTM. Documentation and examples for LSTM RNNs in Keras.
  4. Neptune. The Neptune website with tutorials and documentation.
  5. A blog post on ML experiment tracking with Neptune. 
  6. A deeper overview of ARIMA models.
  7. A tutorial on time series prediction with LSTM RNNs.
  8. The original Prophet research paper.

