Usually, in the traditional machine learning approach, we randomly split the data into training data, test data, and cross-validation data.
Here, each point xi in the dataset has:
- 60% probability of going into Dtrain
- 20% probability of going into Dtest
- 20% probability of going into Validation
Instead of random-based splitting, we can use another approach called time-based splitting. When we have a timestamp given in our dataset, we can split the data according to time.
Imagine you’re an ML engineer at Amazon, trying to productionize a model to classify reviews. You randomly split the data into training data and test data, and after obtaining the required accuracy, you deploy the model. With additional reviews being added to new products, over time the model’s accuracy could decrease. Time-based splitting is a way to overcome this issue.
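In time-based splitting, we sort the data by timestamp, train the model on the older portion, and evaluate it on the most recent portion. Because this mirrors how the model is used after deployment, the test score is generally a more realistic estimate of future accuracy than the one obtained from a random split.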
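As a rough illustration, a time-based split can be sketched with Pandas as below. The helper name `time_based_split`, the column names, and the toy `reviews` frame are assumptions made for this example; the 60/20/20 proportions mirror the ones mentioned earlier.

```python
import pandas as pd

def time_based_split(df, time_col, train_frac=0.6, val_frac=0.2):
    """Sort by time, then cut into train / validation / test without shuffling."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    train_end = round(len(ordered) * train_frac)
    val_end = round(len(ordered) * (train_frac + val_frac))
    return ordered.iloc[:train_end], ordered.iloc[train_end:val_end], ordered.iloc[val_end:]

# Hypothetical toy data: ten daily reviews, deliberately stored newest-first
reviews = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01", periods=10, freq="D")[::-1],
    "score": range(10),
})

train, val, test = time_based_split(reviews, "timestamp")
print(len(train), len(val), len(test))
```

Every row in the training set is older than every row in the validation set, which in turn is older than the test set, so the model is never evaluated on data that precedes what it was trained on.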
Why do we need a different approach?
The standard ML approach doesn’t work well for time series models:
- The features and the target are the same variable (we predict a series from its own past),
- The data is correlated over time,
- The data is often non-stationary (hard to model),
- We need a lot of data to capture the patterns and trends and model those changes appropriately.
In this article we’ll cover:
What is a time series?
A time series is a sequence of data points organized in time order.
Types of forecasting
Time series are everywhere
Finance: we’re trying to predict perhaps stock prices over time, asset prices, different macroeconomic factors that will have a large effect on our business objectives.
E-commerce: we’re trying to predict future page views compared to what happened in the past, and whether it’s trending up, down, or if there’s seasonality. Same with new users, how many new users are you getting/losing over time?
Business: we’re trying to predict the number of transactions, future revenue, and future inventory levels that you will need.
Time series decomposition involves thinking of a series as a combination of level, trend, seasonality, and noise components. Decomposition provides a useful abstract model for thinking about time series generally and for better understanding problems during time series analysis and forecasting.
One of the fundamental topics in time series is time series decomposition:
- Components of time series data
- Seasonal patterns and trends
- Decomposition of time series data
What are the components of time series?
Trend: the long-term direction of change over a period of time
Seasonality: seasonality is about periodic behavior, spikes or drops caused by different factors, for example:
- Naturally occurring events, like weather fluctuations
- Business or administrative procedures, like start or end of a fiscal year
- Social and cultural behavior, like holidays or religious observances
- Calendar events, like the number of Mondays per month or holidays shifting year to year
Residual: irregular fluctuations that we cannot predict using trend or seasonality.
The graphs of trends, seasonality, and residual factors are constructed below using Pandas and NumPy arrays in Python.
The additive model assumes the observed time series is the sum of components:
Observation = trend + seasonality + residual
Additive models are used when the magnitudes of the seasonal and residual values are independent of the trend.
The above graph is generated using Python, which we will learn in a while.
In the above example, we can see that the size of the seasonal swings doesn’t increase or decrease as the trend increases; it stays constant throughout. If we subtract out the straight line that is the trend, what remains is a seasonal component that stays the same no matter what the trend is.
The multiplicative model assumes the observed time series is a product of its components:
Observation = trend * seasonality * residual
We can transform the multiplicative model to an additive model by applying a log transformation:
log(trend * seasonality * residual) = log(trend) + log(seasonality) + log(residual)
Multiplicative models are used if the magnitudes of the seasonal and residual values fluctuate with the trend.
The above graph is generated using Python, which we will learn in a while.
In the above image, we see the trend increases, so we’re trending up. The seasonal component is also trending up with the trend. This means that it’s likely a multiplicative model, so we should divide out that trend, and then we would end up with more reasonable looking (more consistent) seasonality.
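The log transformation mentioned above can be checked numerically. This is a small sketch with made-up, strictly positive components (the logarithm is undefined otherwise); the specific numbers are assumptions chosen only for illustration.

```python
import numpy as np

t = np.arange(1, 51)
trend = 2.75 * t                          # strictly positive trend
seasonal = 1.0 + 0.3 * np.sin(t)          # seasonal factor that oscillates around 1
rng = np.random.default_rng(10)
residual = np.exp(rng.normal(0.0, 0.05, size=len(t)))  # positive noise factor near 1

multiplicative = trend * seasonal * residual

# Taking logs turns the product of components into a sum of components
log_sum = np.log(trend) + np.log(seasonal) + np.log(residual)
print(np.allclose(np.log(multiplicative), log_sum))
```

After the transformation, the tools that assume an additive structure can be applied to the logged series.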
Pseudo-additive models combine elements of both additive and multiplicative models. They can be useful when:
- Time series values are close to or equal to zero,
- We expect features related to the multiplicative model.
A purely multiplicative model often runs into division-by-zero problems when this is the case.
Time series decomposition using Python-Pandas
We will individually construct fictional trends, seasonality, and residual components. This is an example to show how a simple time-series dataset can be constructed using the Pandas module.
import numpy as np

time = np.arange(1, 51)
Now we need to create a trend. Let’s pretend we have a sensor measuring electricity demand. We’ll ignore units to keep things simple.
trend = time * 2.75
Now let’s plot the trend as a function of time.
Now let’s generate a seasonal component.
seasonal = 10 + np.sin(time) * 10
Let’s plot seasonality against time.
Now, let’s construct the residual component.
np.random.seed(10)  # reproducible results
residual = np.random.normal(loc=0.0, scale=1, size=len(time))
A quick plot of residuals:
Aggregate trend, seasonality, and residual components
Additive time series
Remember the equation for additive time series is simply: O_t = T_t + S_t + R_t
O_t = output
T_t = trend
S_t = seasonality
R_t = residual
t = variable representing a particular point in time
additive = trend + seasonal + residual
The same follows for multiplicative time series, except we don’t add, but multiply the values of trend, seasonality, and residual.
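Going in the other direction, a series can be split back into its parts. The sketch below is a simplified classical additive decomposition written for this article, not a library routine: the function name, the restriction to odd periods, and the weekly toy pattern are all assumptions. The trend is estimated with a centered moving average over one full period, and the seasonal effect is the average detrended value at each position within the period.

```python
import numpy as np

def classical_additive_decompose(x, period):
    """Split x into trend, seasonal and residual parts (sketch: odd periods only)."""
    x = np.asarray(x, dtype=float)
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    half = period // 2
    # Trend: centered moving average over one full period; the edges stay NaN
    trend = np.full(len(x), np.nan)
    trend[half:len(x) - half] = np.convolve(x, np.ones(period) / period, mode="valid")
    # Seasonal: average the detrended values at each position within the period
    detrended = x - trend
    phase = np.arange(len(x)) % period
    pattern = np.array([np.nanmean(detrended[phase == p]) for p in range(period)])
    pattern -= pattern.mean()            # make the seasonal effects sum to zero
    seasonal = pattern[phase]
    residual = x - trend - seasonal      # NaN wherever the trend is undefined
    return trend, seasonal, residual

# Hypothetical noise-free series: linear trend plus a repeating 7-step pattern
t = np.arange(70)
true_trend = 2.75 * t
true_seasonal = np.tile([3.0, 1.0, -2.0, -4.0, -1.0, 1.0, 2.0], 10)
observed = true_trend + true_seasonal

trend, seasonal, residual = classical_additive_decompose(observed, period=7)
```

On this noise-free example the recovered trend and seasonal pattern match the ones used to build the series, and the residual is zero wherever the trend is defined. With real data the residual absorbs whatever the two other components cannot explain.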
Stationarity and autocorrelation
What is stationarity?
For time series data to be stationary, the data must exhibit the following properties over time:
1. Constant Mean:
A stationary time series will have a constant mean throughout the entire series.
For example, if we drew the mean of the series as a horizontal line, the series would fluctuate around that same line over its whole length.
A good example where the mean wouldn’t be constant is if we had some type of trend. With an upward or downward trend, for example, the mean at the end of our series would be noticeably higher or lower than the mean at the beginning of the series.
2. Constant Variance:
A stationary time series will have a constant variance throughout the entire series.
3. Constant Autocorrelation Structure:
Autocorrelation simply means that the current time series measurement is correlated with a past measurement. For example, today’s stock price is often highly correlated with yesterday’s price.
The time interval between correlated values is called the lag. Suppose we wanted to know if today’s stock price correlated better with yesterday’s price, or the price from two days ago. We could test this by computing the correlation between the original time series and the same series delayed by one time interval. So, the second value of the original time series would be compared with the first of the delayed. The third original value would be compared with the second of the delayed, and so on. Performing this process for a lag of 1 and a lag of 2, respectively, would yield two correlation outputs. This output would tell which lag is more correlated. That is autocorrelation in a nutshell.
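The lag comparison described above can be sketched in a few lines of NumPy. The helper `lag_correlation` and the simulated `prices` series are assumptions for this example (the prices are a random walk, not real market data).

```python
import numpy as np

def lag_correlation(x, lag):
    """Correlation between a series and a copy of itself delayed by `lag` steps."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[lag:], x[:-lag])[0, 1]

# Hypothetical "stock price": a random walk around 100
rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(size=500))

lag1 = lag_correlation(prices, 1)   # today vs. yesterday
lag2 = lag_correlation(prices, 2)   # today vs. two days ago
print(lag1, lag2)
```

Whichever of the two values is larger tells us which lag the series is more strongly correlated with.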
Time series smoothing
What is Smoothing?
Smoothing is a process that often improves our ability to forecast series by reducing the impact of noise.
Why is smoothing important?
Smoothing is an important tool that lets us improve forward-looking forecasts.
Consider the data in the below graph. How could we forecast what will happen in one, two, or three steps into the future?
One solution is to calculate the mean of the series and predict the value in the future.
But, using the mean to predict future values doesn’t seem like a good way, and we might not get accurate predictions. Instead, we employ a technique called exponential smoothing.
Single Exponential Smoothing
Single Exponential Smoothing, also called Simple Exponential Smoothing, is a time series forecasting method for univariate data without a trend or seasonality.
It requires a single parameter, called alpha (α), also called the smoothing factor or smoothing coefficient.
This parameter controls the rate at which the influence of observations at prior time steps decays exponentially. Alpha is often set to a value between 0 and 1. Large values mean that the model pays attention mainly to the most recent past observations, whereas smaller values mean more of the history is taken into account when making a prediction.
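To make the role of alpha concrete, here is a minimal sketch of single exponential smoothing using the usual recursion s_t = α·x_t + (1 − α)·s_(t−1). The function name, the choice to initialize with the first observation, and the toy data are assumptions for this example.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed series; the last value doubles as the one-step forecast."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]                      # initialize with the first observation
    for value in series[1:]:
        # New estimate = weighted mix of the newest observation and the old estimate
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

data = [10.0, 12.0, 11.0, 13.0]
smoothed = simple_exponential_smoothing(data, alpha=0.5)
print(smoothed)        # [10.0, 11.0, 11.0, 12.0]
print(smoothed[-1])    # forecast for the next step: 12.0
```

With alpha = 1 the smoothed series simply repeats the observations, while values close to 0 make it react very slowly to new data.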
Double Exponential Smoothing
Double Exponential Smoothing is an extension to Exponential Smoothing that explicitly adds support for trends in the univariate time series.
In addition to the alpha parameter for controlling the smoothing factor for the level, a smoothing factor is added to control the decay of the influence of the change in a trend, called beta (β).
The method supports trends that change in different ways: additive and multiplicative, depending on whether the trend is linear or exponential respectively.
Double Exponential Smoothing with an additive trend is classically referred to as Holt’s linear trend model, named after the developer of the method, Charles Holt.
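A minimal sketch of double exponential smoothing with an additive trend follows. The function name and the initialization (level from the first observation, trend from the first difference) are assumptions made for this example; other initializations are common too.

```python
def holt_linear_forecast(series, alpha, beta, horizon):
    """Holt's linear trend method: smooth a level and a trend, then extrapolate."""
    level = series[0]
    trend = series[1] - series[0]               # assumed initial trend
    for value in series[1:]:
        previous_level = level
        # Level: mix of the new observation and the previous one-step forecast
        level = alpha * value + (1 - alpha) * (previous_level + trend)
        # Trend: mix of the latest change in level and the previous trend
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]

line = [3.0 + 2.0 * t for t in range(10)]       # 3, 5, ..., 21
forecast = holt_linear_forecast(line, alpha=0.5, beta=0.5, horizon=3)
print(forecast)        # [23.0, 25.0, 27.0]
```

On a perfectly linear series the method continues the line, which is exactly the behavior single exponential smoothing lacks.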
Triple Exponential Smoothing
Triple Exponential Smoothing is an extension of Exponential Smoothing that explicitly adds support for seasonality to the univariate time series.
This method is sometimes called Holt-Winters Exponential Smoothing, named for two contributors to the method: Charles Holt and Peter Winters.
In addition to the alpha and beta smoothing factors, a new parameter is added called gamma (γ), which controls the influence on the seasonal component.
As with the trend, the seasonality may be modeled as either an additive or multiplicative process, for a linear or exponential change in the seasonality.
Autoregressive models and Moving Average (ARMA) models
ARMA models combine two models:
The first is an autoregressive (AR) model. Autoregressive models anticipate series dependence on its past values.
The second is the moving average (MA) model. The moving average model anticipates series dependence on past forecast errors.
The combination (ARMA) is also known as the Box-Jenkins approach.
ARMA model: Auto regressive (AR) part
ARMA models are often expressed using p and q for the orders of the AR and MA components. For a time series variable X that we want to predict at time t, the last few observations are:
X_{t-3}, X_{t-2}, X_{t-1}
AR(p) models are assumed to depend on the last p values of the time series. With p = 2, the forecast has the form:
X_t = β_1 X_{t-1} + β_2 X_{t-2}
MA(q) models are assumed to depend on the last q forecast errors of the time series. With q = 2, the forecast has the form:
X_t = θ_1 e_{t-1} + θ_2 e_{t-2}
Here the β’s and θ’s are coefficients and e_{t-1}, e_{t-2} are the past forecast errors. We’ll discuss what exactly these equations mean and how the errors are calculated in a while.
Now, to get our AR(p) and MA(q) models together, we combine the AR(p) and MA(q) to yield the ARMA(p, q) model. For p = 2 and q = 2 the ARMA(2, 2) forecast will be:
X_t = β_1 X_{t-1} + β_2 X_{t-2} + θ_1 e_{t-1} + θ_2 e_{t-2}
Again we’ll see all these while doing the hands-on.
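Before the hands-on part, a small sketch may help show how an ARMA series is assembled from its own past values and past errors. The function `simulate_arma`, the zero-mean setup, and the coefficient values are assumptions for this example, and missing history at the start of the series is simply treated as zero.

```python
import random

def simulate_arma(ar_coefs, ma_coefs, errors):
    """Build x_t = sum_i ar_i * x_(t-i) + e_t + sum_j ma_j * e_(t-j)."""
    values = []
    for t, error in enumerate(errors):
        # AR part: weighted sum of the last p values of the series itself
        ar_part = sum(c * values[t - i] for i, c in enumerate(ar_coefs, start=1) if t - i >= 0)
        # MA part: weighted sum of the last q error terms
        ma_part = sum(c * errors[t - j] for j, c in enumerate(ma_coefs, start=1) if t - j >= 0)
        values.append(ar_part + error + ma_part)
    return values

# Response of an ARMA(1, 1) process to a single shock at t = 0
impulse = simulate_arma([0.5], [0.4], [1.0, 0.0, 0.0, 0.0])
print(impulse)

# A hypothetical ARMA(2, 2) series driven by Gaussian errors
random.seed(0)
errors = [random.gauss(0, 1) for _ in range(200)]
arma22 = simulate_arma([0.3, 0.3], [0.4, 0.2], errors)
```

The impulse example shows both effects at once: the MA term adds an extra kick one step after the shock, and the AR term then makes the effect fade geometrically.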
There are some things to keep in mind while implementing ARMA models:
- First, the time series is going to be assumed to be stationary, and that regression approach will fail if we’re working with a non-stationary example.
- A good rule of thumb is to have at least 100 observations when fitting an ARMA model, so that we can adequately demonstrate those past autocorrelations.
Now we’ll take a practical approach to understand auto-regressive models, and get a practical understanding of moving averages.
One of the key concepts in the quantitative toolbox is that of mean reversion. This process refers to a time series that displays a tendency to revert to its historical mean value. Mathematically, such a (continuous) time series is referred to as an Ornstein-Uhlenbeck process.
This is in contrast to a random walk (aka Brownian motion), which has no “memory” of where it has been at each particular instance of time.
The mean-reverting property of a time series can be exploited to produce better predictions.
A continuous mean-reverting time series can be represented by an Ornstein-Uhlenbeck stochastic differential equation:
dx_t = θ(μ − x_t) dt + σ dW_t
- θ is the rate of reversion to the mean,
- μ is the mean value of the process,
- σ is the volatility of the process (the scale of the random shocks),
- W_t is a Wiener process, or Brownian motion.
In a discrete setting, the equation states that the change of the price series in the next time period is proportional to the difference between the mean price and the current price, with the addition of Gaussian noise.
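The discrete version of the equation can be simulated with a simple step-by-step update (an Euler-type scheme). The function name and all parameter values below are assumptions chosen for illustration.

```python
import numpy as np

def simulate_ou(theta, mu, sigma, x0, n_steps, dt=1.0, seed=0):
    """Simulate a discretized Ornstein-Uhlenbeck (mean-reverting) path."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for t in range(n_steps):
        noise = sigma * np.sqrt(dt) * rng.normal()
        # Pull toward the mean in proportion to the current gap, plus Gaussian noise
        x[t + 1] = x[t] + theta * (mu - x[t]) * dt + noise
    return x

# Hypothetical price that starts far above its long-run mean of 50
path = simulate_ou(theta=0.2, mu=50.0, sigma=1.0, x0=80.0, n_steps=500)
print(path[:5], path[-5:])
```

Setting `sigma=0` removes the noise and makes the pull toward the mean easy to see: each step closes a fixed fraction (θ·dt) of the remaining gap.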
Section 1: ARMA
Enter Autoregressive Integrated Moving Average (ARIMA) modeling. When we have autocorrelation between outcomes and their earlier values, we will see a theme or relationship in the outcome plot. This relationship can be modeled in its own right, allowing us to predict the future with a confidence level proportionate to the strength of the relationship and the proximity to known values (prediction weakens the further out we go).
For second-order stationary data (constant mean and variance: μ_t = μ and σ²_t = σ² for all t), autocovariance is expressed as a function only of the time lag k:
γ_k = E[(x_t − μ)(x_{t+k} − μ)]
Therefore, the autocorrelation function is defined as:
ρ_k = γ_k / σ²
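The definition translates almost directly into NumPy. This sketch estimates the autocovariance at each lag from a sample and divides by the lag-0 value (the variance); the function name and the alternating test series are assumptions for this example.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation rho_k = gamma_k / sigma^2 for k = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    variance = np.sum(centered ** 2) / n                 # gamma_0, i.e. sigma^2
    acf = []
    for k in range(max_lag + 1):
        # Sample autocovariance at lag k: average product of values k steps apart
        gamma_k = np.sum(centered[:n - k] * centered[k:]) / n
        acf.append(gamma_k / variance)
    return np.array(acf)

# A series that flips sign at every step is strongly negatively correlated at lag 1
alternating = np.tile([1.0, -1.0], 50)
rho = autocorrelation(alternating, max_lag=2)
print(rho)             # approximately [1.0, -0.99, 0.98]
```

The value at lag 0 is always 1, since every series is perfectly correlated with itself.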
We use the plot of these values at different lags to determine optimal ARIMA parameters. Notice how phi changes the process.
Section 2: Autoregressive (AR) Models
Autocorrelation: a variable’s correlation with itself at different lags.
AR models regress on actual past values.
This is the first order or AR(1) formula you should know:
y_t = β_0 + β_1 y_{t-1} + ϵ_t
The β’s are just like those in linear regression and ϵ is an irreducible error.
A second-order or AR(2) would look like this:
y_t = β_0 + β_1 y_{t-1} + β_2 y_{t-2} + ϵ_t
We’ll generate our data to gain insight into how AR models work.
import matplotlib.pyplot as plt

# reproducibility
np.random.seed(123)

# create autocorrelated data
time = np.arange(100)

# assume a mean of 0
ar1_sample = np.zeros(100)

# set the first value to a random draw with mean 0 and standard deviation 2.5
ar1_sample[0] = np.random.normal(loc=0, scale=2.5)

# set every value thereafter to 0.7 * the previous value plus a random error
for t in time[1:]:
    ar1_sample[t] = (0.7 * ar1_sample[t-1]) + np.random.normal(loc=0, scale=2.5)

plt.fill_between(time, ar1_sample)
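Here we fit a model to the generated data to show that we recover approximately an AR(1) process with phi ≈ 0.7.
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA

# sm.tsa.ARMA was removed in statsmodels 0.13; an ARMA(1, 0) model is ARIMA with order (1, 0, 0)
# trend='n' fits no constant, matching the zero-mean data (the role of the old trend='nc')
model = ARIMA(ar1_sample, order=(1, 0, 0), trend='n').fit()
model.params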
# create autocorrelated data
np.random.seed(112)

# mean is again 0
ar2_sample = np.zeros(100)

# set the first two values to random draws with mean 0 and standard deviation 2.5
ar2_sample[0:2] += np.random.normal(loc=0, scale=2.5, size=2)

# set each later value to 0.3 * the prior value plus 0.3 * the value two prior, plus a random error
for t in time[2:]:
    ar2_sample[t] = (0.3 * ar2_sample[t-1]) + (0.3 * ar2_sample[t-2]) + np.random.normal(loc=0, scale=2.5)

plt.fill_between(time, ar2_sample)
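Since the β’s behave like linear regression coefficients, one way to build intuition is to estimate them with ordinary least squares on lagged copies of the series. The sketch below is an illustration, not the estimator statsmodels uses: the function name is hypothetical, no intercept is fitted (the simulated data has mean zero), and a longer series than above is simulated so the estimates are reasonably tight.

```python
import numpy as np

def fit_ar_least_squares(y, p):
    """Estimate AR(p) coefficients by regressing y_t on y_(t-1), ..., y_(t-p)."""
    y = np.asarray(y, dtype=float)
    # Column i holds the series lagged by i + 1 steps, aligned with y[p:]
    lagged = np.column_stack([y[p - i - 1:len(y) - i - 1] for i in range(p)])
    target = y[p:]
    coefs, *_ = np.linalg.lstsq(lagged, target, rcond=None)
    return coefs

# Simulate a long AR(2) series with both coefficients equal to 0.3
rng = np.random.default_rng(112)
n = 5000
shocks = rng.normal(0, 2.5, size=n)
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.3 * y[t - 1] + 0.3 * y[t - 2] + shocks[t]

estimates = fit_ar_least_squares(y, p=2)
print(estimates)       # both values should land close to 0.3
```

With only 100 observations, as in the samples above, the same procedure still works but the estimates scatter much more widely around the true coefficients.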
Section 3: Moving Average (MA) models
MA Model Specifics
A MA model is defined by this equation:
y_t = c + e_t + θ_1 e_{t-1} + θ_2 e_{t-2} + ⋯ + θ_q e_{t-q}
- e_t is the white noise value,
- c is a constant value,
- θ’s are coefficients, not unlike those found in linear regression.
MA Models != Moving Average Smoothing
An important distinction is that a moving average model is not the same thing as moving average smoothing. What we did in the earlier sections was smoothing. It has important properties that we’ve discussed. However, moving average models are a completely different beast.
Moving average smoothing is useful for estimating the trend and seasonality of past data. MA models, on the other hand, are a useful forecasting model that regresses past forecast errors to forecast future values.
It’s easy to lump the two techniques together, but they serve very different functions. Thus, a moving-average model is conceptually a linear regression of the current value of the series against current and previous (unobserved) white noise error terms or random shocks.
The random shocks at each point are assumed to be mutually independent and to come from the same distribution, typically a normal distribution, with a location at zero and constant scale.
We’ll generate our data so we know the generative process for an MA series.
# reproducibility
np.random.seed(12)

# create autocorrelated data
time = np.arange(100)

# mean 0
ma1_sample = np.zeros(100)

# create a vector of normally distributed random errors
error = np.random.normal(loc=0, scale=2.5, size=100)

# start every value at its own error term (so the first value is just its error)
ma1_sample += error

# set each later value to 0.4 times the error of the prior value plus the current error term
for t in time[1:]:
    ma1_sample[t] = (0.4 * error[t-1]) + error[t]

plt.fill_between(time, ma1_sample)

# find model params for the generated sample
# (sm.tsa.ARMA was removed in statsmodels 0.13; an ARMA(0, 1) model is ARIMA with order (0, 0, 1))
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(ma1_sample, order=(0, 0, 1), trend='n').fit()
model.params
Section 4: The Autocorrelation Function (ACF)
There’s a crucial question we need to answer: how do you choose the orders (p and q) for a time series?
To answer that question, we need to understand the Autocorrelation Function (ACF). Let’s start by showing an example ACF plot for our different simulated series.
fig = sm.tsa.graphics.plot_acf(ar1_sample, lags=range(1, 30), alpha=0.05, title='ar1 ACF')
fig = sm.tsa.graphics.plot_acf(ma1_sample, lags=range(1, 15), alpha=0.05, title='ma1 ACF')
An explanation is in order. First, the blue region represents a confidence interval. Alpha, in this case, was set to 0.05 (95% confidence interval). This can be set to whatever float value you require. See the plot_acf function for details.
The stems represent lagged correlation values. In other words, a lag of 1 will show a correlation with the prior endogenous value. A lag of 2 shows a correlation to the value 2 prior and so on. Remember that we’re regressing on past values; that’s the correlation we’re inspecting here.
Correlations outside of the confidence interval are statistically significant, whereas the others are not.
Note that if lag 1 shows strong autocorrelation, lag 2 will show strong autocorrelation as well, since lag 1 is correlated with lag 2, lag 2 with lag 3, and so on. That’s why you see the ar1 model with slowly decaying correlation.
If we think about the functions, we note that autocorrelation will propagate for AR(1) models:
- y_t = β_0 + β_1 y_{t-1} + ϵ_t
- y_{t-1} = β_0 + β_1 y_{t-2} + ϵ_{t-1}
- Substituting the second line into the first: y_t = β_0 + β_1 β_0 + β_1² y_{t-2} + β_1 ϵ_{t-1} + ϵ_t
The past errors will propagate into the future, leading to the slowly decaying plot we just mentioned.
For MA(1) models:
y_t = c + θ_1 e_{t-1} + e_t
Only the previous error affects the current value, so the correlation does not carry over beyond lag 1.
So an easy way to identify an AR(1) model or MA(1) model is to see if the correlation from one affects the next.
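The contrast can also be seen in the theoretical autocorrelations. For a stationary AR(1) process with coefficient φ, the autocorrelation at lag k is φ^k, so it decays gradually. For an MA(1) process with coefficient θ, it is θ / (1 + θ²) at lag 1 and exactly zero afterwards. The helper names below are assumptions for this example; the coefficients 0.7 and 0.4 match the simulated series above.

```python
def ar1_theoretical_acf(phi, max_lag):
    """Autocorrelation of a stationary AR(1) process: phi ** k at lag k."""
    return [phi ** k for k in range(max_lag + 1)]

def ma1_theoretical_acf(theta, max_lag):
    """Autocorrelation of an MA(1) process: nonzero only at lags 0 and 1."""
    acf = [1.0]
    for k in range(1, max_lag + 1):
        acf.append(theta / (1 + theta ** 2) if k == 1 else 0.0)
    return acf

ar1_acf = ar1_theoretical_acf(0.7, max_lag=5)   # slow geometric decay
ma1_acf = ma1_theoretical_acf(0.4, max_lag=5)   # cuts off after lag 1
print(ar1_acf)
print(ma1_acf)
```

A sample ACF plot will not match these values exactly, but the shape (gradual decay versus a sharp cut-off) is what to look for.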
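# ma2_sample is not defined earlier in this article; as an assumption, simulate an MA(2)
# series here (the coefficients 0.4 and 0.3 are illustrative choices, and the first two
# values keep only their own error term)
np.random.seed(12)
error = np.random.normal(loc=0, scale=2.5, size=100)
ma2_sample = error.copy()
ma2_sample[2:] += 0.4 * error[1:-1] + 0.3 * error[:-2]

fig = sm.tsa.graphics.plot_acf(ar2_sample, lags=range(1, 15), alpha=0.05, title='ar2 ACF')
fig = sm.tsa.graphics.plot_acf(ma2_sample, lags=range(1, 15), alpha=0.05, title='ma2 ACF')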
In this post, we explored what time series forecasting is and what its important components are, i.e., the constituent components that a time series can be decomposed into when performing an analysis.
We also went through different types of forecasting, and dove into moving averages, stationary models, and how to plot time series using Python.
In the next article, we’ll focus on how to model time series data using ARIMA, SARIMA, and FB PROPHET. Thanks for reading!