3rd February 2021 • Tan Xue Ying
We often hear about time series analysis and forecasting in the context of deriving impactful, data-driven business decisions and insights, but what is a time series actually all about? Why is it so interesting, and how can we interpret and work with time series data? If you have absolutely no clue about it, or just want a quick refresher, then this article is for you!
As the name suggests, a time series is a series of data points that have a time order. Just imagine recording the outdoor temperature every hour: that gives you a time series. Plot that series of temperatures against time, and you get a time series plot. Below are some examples of univariate time series plots:
Trend and seasonality are terms that always come together with any time series. The trend is the general long-term movement of the series: it may be increasing or decreasing overall, or increasing in one section but decreasing in another (not to be confused with seasonality, where there has to be a repeating cycle). For example, in the first figure above we see a general increasing trend, whereas in the second figure there is no obvious trend.
Seasonality is the repeating cycle in a time series. For example, in the second figure there is a strong sign of seasonality: the minimum daily temperature increases, peaks, decreases, and then increases again in a regular cycle.
Sometimes it is not straightforward for the human eye to separate trend and seasonality in the raw time series data. For example, the minimum daily temperature series can actually be decomposed into three components: trend, seasonal and residual. In Python, this can be done using the seasonal_decompose() function from the statsmodels package:
https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html
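Below is a minimal sketch of how such a decomposition might look. It assumes the minimum daily temperature data has been loaded into a pandas Series with a DatetimeIndex; the file name and column names here are purely illustrative:

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical loading step: a CSV with a 'Date' column and a 'Temp' column
temp = pd.read_csv("daily-min-temperatures.csv",
                   index_col="Date", parse_dates=True)["Temp"]

# Additive decomposition with a yearly cycle (period=365 for daily data)
result = seasonal_decompose(temp, model="additive", period=365)

# The result exposes the three components as separate series
trend, seasonal, residual = result.trend, result.seasonal, result.resid

result.plot()  # panels for observed, trend, seasonal and residual
plt.show()
```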
Another term that often comes together with a time series is ‘stationarity’. A stationary time series is a time series whose statistical properties are not time-dependent. That is, the mean and autocovariance stay constant, and there is no trend over time. For example, the S&P 500 adjusted closing price shown above is obviously non-stationary. However, if we take its first difference (note: given any time series Y, the first difference at time T is Y_T − Y_(T-1)), then we get an approximately stationary time series.
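First differencing is a one-liner in pandas. In the sketch below, close stands for a hypothetical Series of adjusted closing prices:

```python
# Assumes `close` is a pandas Series of S&P 500 adjusted closing prices
# indexed by date (the variable name is illustrative).
first_diff = close.diff().dropna()  # Y_T - Y_(T-1); the first value is NaN and is dropped
first_diff.plot(title="First difference of adjusted closing price")
```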
Sometimes we would prefer a quantitative measure of stationarity. This can be obtained by performing the Augmented Dickey-Fuller (ADF) test, which is also commonly known as the ‘unit root test’. The null hypothesis of the test is that the time series is non-stationary. In Python, we can run it using the adfuller() function from the statsmodels package:
https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html
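Here is a minimal sketch of the ADF test, reusing the hypothetical first_diff series from above:

```python
from statsmodels.tsa.stattools import adfuller

# adfuller() returns the test statistic, p-value, lags used, number of
# observations, critical values and the information criterion value
adf_stat, p_value, used_lag, n_obs, critical_values, _ = adfuller(first_diff)

print(f"ADF statistic: {adf_stat:.3f}")
print(f"p-value: {p_value:.3f}")
# A p-value below 0.05 lets us reject the null hypothesis of
# non-stationarity at the 5% significance level.
```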
As we can see, being stationary does not imply that the time series does not change at all, just that the way it changes does not itself change over time. Because of the fixed statistical properties, a stationary time series is much easier to analyze, and in fact many time series tools and forecasting methods assume stationarity. For this reason, it is common practice to transform a time series to stationarity before modelling and forecasting. Even models that can deal with non-stationary time series, for example the ARIMA and seasonal ARIMA models (discussed below), work by first detrending (differencing) the time series to make it stationary.
For a given time series, if we want to quantify the similarity between observations at any two points in time, we use the autocorrelation function. Analogous to correlation, autocorrelation measures how ‘related’ or similar the observations in a time series are as a function of the time lag (the number of time steps before the present), and it is very commonly used for model diagnostics. To illustrate what a time lag is, and how autocorrelation links back to the concept of correlation, let us look at the following example.
The table below shows the first 5 rows of the temperature data, with the addition of two lag features (Temp_Lag1 and Temp_Lag2). We can easily see that at lag 1 the observations are shifted down by one position, while at lag 2 they are shifted down by two positions. This means that on 1981-01-03, the temperature’s present value is 18.8, its lag 1 value is 17.9 and its lag 2 value is 20.7. Mathematically, for any time series Y we can denote the present value at time T as Y_T, the lag 1 value as Y_(T-1), and so on.
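Such lag features are straightforward to build in pandas with .shift(); the sketch below reuses the hypothetical temp Series from earlier:

```python
import pandas as pd

# Assumes `temp` is the pandas Series of daily minimum temperatures from above.
lags = pd.DataFrame({
    "Temp": temp,
    "Temp_Lag1": temp.shift(1),  # value one time step before the present
    "Temp_Lag2": temp.shift(2),  # value two time steps before the present
})
print(lags.head())
```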
With that said, below is the scatter plot of Temp against Temp_Lag1. We can see that Temp and Temp_Lag1 are strongly and positively correlated. Since this is in essence comparing observations of the same variable (time series) with itself, we use the term ‘autocorrelation’ to describe the property. Hence in this scenario we say that the time series has a strong lag 1 autocorrelation. The same logic applies to all other lags.
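For a quick numerical check, pandas also offers Series.autocorr(), which computes the Pearson correlation between the series and a lagged copy of itself:

```python
# Lag 1 and lag 2 autocorrelation of the hypothetical `temp` series;
# values close to 1 indicate strong positive autocorrelation.
print(temp.autocorr(lag=1))
print(temp.autocorr(lag=2))
```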
As a consequence, and just like correlation, autocorrelation can be positive or negative. In Python, autocorrelation can be visualized using the plot_acf() function from the statsmodels package:
https://www.statsmodels.org/stable/generated/statsmodels.graphics.tsaplots.plot_acf.html
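A minimal sketch, again assuming the hypothetical temp Series:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

plot_acf(temp, lags=500)  # autocorrelation up to lag 500
plt.show()
```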
Below is the autocorrelation plot (up to lag = 500) for the temperature data, showing how related the present value of the series is to each of its lags, up to 500.
An Autoregressive–moving-average (ARMA) model is one of the simplest classical univariate time series models for modelling and forecasting. It consists of two parts, the AR (autoregressive) part and the MA (moving-average) part, each of which is a time series model in its own right.
Given a time series Y, an Autoregressive (AR(p)) model, parameterized by an integer p, models Y_t as a linear function of its p past values (i.e. Y_(t-1), Y_(t-2), ..., Y_(t-p)). On the other hand, a Moving-average (MA(q)) model, parameterized by an integer q, models Y_t as a linear function of its q past white noise error terms (i.e. ε_(t-1), ε_(t-2), ..., ε_(t-q)). If you are wondering, white noise is defined as a series of independently and identically distributed random variables that have mean zero, and have neither trend nor seasonality.
Below is a table summarizing the aforementioned time series models and their respective equations. The coefficients (i.e. ϕ_i and θ_j) are learnt by the model from the time series data, μ is the mean of the MA(q) model and ε_t is white noise.
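For reference, here are the standard textbook forms of these equations, written in the same notation as above (note that conventions vary slightly; some texts also include a constant term in the AR and ARMA equations):

AR(p): Y_t = ϕ_1 Y_(t-1) + ϕ_2 Y_(t-2) + ... + ϕ_p Y_(t-p) + ε_t
MA(q): Y_t = μ + ε_t + θ_1 ε_(t-1) + θ_2 ε_(t-2) + ... + θ_q ε_(t-q)
ARMA(p, q): Y_t = ϕ_1 Y_(t-1) + ... + ϕ_p Y_(t-p) + ε_t + θ_1 ε_(t-1) + ... + θ_q ε_(t-q)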
Note that an ARMA model assumes that the time series is stationary. The Autoregressive integrated moving average (ARIMA) model is its extension for modelling non-stationary time series (in terms of trend), where the d term in ARIMA(p, d, q) is the degree of differencing, that is, how many times the time series has to be differenced (subtracted from its own past values) in order to become stationary. In an earlier section, we saw that d = 1 is sufficient to make the S&P 500 adjusted closing price series stationary.
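As a rough sketch of how this looks in statsmodels, reusing the hypothetical close Series (the order (1, 1, 1) is purely illustrative, not a tuned choice for the S&P 500 data):

```python
from statsmodels.tsa.arima.model import ARIMA

# order=(p, d, q); d=1 applies one round of differencing internally
model = ARIMA(close, order=(1, 1, 1))
fitted = model.fit()

print(fitted.summary())
print(fitted.forecast(steps=5))  # forecast the next 5 time steps
```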
Last but not least, ARIMA can be further extended to incorporate a seasonal component, resulting in a Seasonal Autoregressive Integrated Moving Average (SARIMA) model. If you are interested in classical time series modelling in Python, the SARIMAX implementation in statsmodels could be useful.
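A minimal sketch of fitting such a model, again with illustrative, untuned orders (the seasonal period s = 12 would suit monthly data):

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Assumes `series` is a pandas Series with a regular (e.g. monthly) frequency.
# order=(p, d, q) covers the non-seasonal part,
# seasonal_order=(P, D, Q, s) the seasonal part with period s.
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fitted = model.fit(disp=False)

print(fitted.summary())
print(fitted.forecast(steps=12))  # forecast one full seasonal cycle ahead
```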
This article covered the very basics of time series, but the fun does not stop here! If you are new to time series, there are many interesting and evolving modelling techniques and packages available, most of which are used for forecasting, be it future sales or stock prices. If you wish to work on a time series project but wonder where to start, the M4 Forecasting Competition Dataset could be helpful.
Also note that due to some special characteristics of stock price data, the classical models introduced in this article (even SARIMA) might not be appropriate for direct modelling. In that context, you might want to find out more about financial time series as well as GARCH models. Happy exploring!