multi variable auto regression

Vector autoregression (VAR) is a statistical model for multivariate time series analysis, especially for time series in which the variables influence one another over time. VAR models differ from univariate autoregressive models because they allow analysis of, and predictions on, multivariate time series data. VAR models are often used in economics and meteorology.

The basic requirements for using a VAR model are:

  • Time series with a minimum of two variables.
  • Relationship between variables.

It is considered an autoregressive model because its predictions depend on past values, which means that every observation is modelled as a function of its lagged values.

The basic difference between the ARIMA family and VAR models is that all ARIMA models are used for univariate time series, whereas VAR models work with multivariate time series. In addition, ARIMA models are unidirectional, which means that the dependent variable is influenced only by its own past (lagged) values, whereas VAR is a bidirectional model, which means that a variable is affected by its own past values, by the past values of another variable, or by both.

To learn more about time series, please refer to these articles:

  • General Overview of Time Series Data Analysis.
  • Comprehensive Guide To Deseasonalizing Time Series.
  • How To Apply Smoothing Methods In Time Series Analysis.
  • Guide To AC and PAC Plots In Time Series.

To learn more about time-series modeling, please refer to these articles:

  • Comprehensive Guide To Time Series Analysis Using ARIMA.
  • Complete Guide To SARIMAX in Python for Time Series Modeling.
  • Tutorial on Univariate Single-Step Style LSTM in Time Series Forecasting.

This article is another guide to time-series modeling, but this time with multivariate time series data. We will also go through some tests that are needed to understand a multivariate time series. Before starting the modeling part, let's go through the mathematics behind the model.

What is Vector Autoregression(VAR)?

A typical autoregression model, AR(p), for a univariate time series can be represented by

$$y_t = c + A_1 y_{t-1} + A_2 y_{t-2} + \dots + A_p y_{t-p} + e_t$$

Where

  • y_{t-i} is the value of the variable i periods earlier.
  • A_i is a time-invariant coefficient (a (k × k) matrix when y_t is a vector of k variables).
  • e_t is an error term.
  • c is the intercept of the model.

Here the order p means that up to p lags of y are used.
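To make the AR(p) recursion concrete, here is a minimal NumPy sketch that simulates a univariate AR(2) series and computes a one-step prediction by hand. The coefficient values, the random seed and the helper name `ar_predict` are illustrative assumptions and do not come from any library.

```python
import numpy as np

def ar_predict(history, coefs, intercept=0.0):
    """One-step AR(p) prediction: c + a_1*y[t-1] + ... + a_p*y[t-p]."""
    p = len(coefs)
    lags = np.asarray(history[-p:])[::-1]   # most recent value first
    return intercept + float(np.dot(coefs, lags))

# Illustrative AR(2) parameters (assumed values, chosen to be stationary)
c, a = 0.5, np.array([0.6, -0.2])

rng = np.random.default_rng(0)
y = [0.0, 0.0]
for _ in range(200):
    # next value = prediction from the last two values + random noise
    y.append(ar_predict(y, a, c) + rng.normal(scale=0.1))

print(ar_predict(y, a, c))            # prediction for the next, unseen period
print(ar_predict([1.0, 2.0], a, c))   # 0.5 + 0.6*2.0 - 0.2*1.0 = 1.5
```

The same idea carries over to VAR: the scalars become vectors and the coefficients become matrices.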

As we know, the VAR model deals with multivariate time series, which means there will be two or more variables affecting each other. Therefore, the VAR model equation increases with the number of variables in the time series.

Let’s suppose there are two time-series variables, y1 and y2, so to calculate y1(t), the VAR model will use the lags of both time-series variables.

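For example, the equations for the VAR(1) model with two time-series variables (y1 and y2) will look like this:

$$
\begin{aligned}
Y_{1,t} &= c_1 + a_{11,1} Y_{1,t-1} + a_{12,1} Y_{2,t-1} + e_{1,t} \\
Y_{2,t} &= c_2 + a_{21,1} Y_{1,t-1} + a_{22,1} Y_{2,t-1} + e_{2,t}
\end{aligned}
$$

Where Y_{1,t-1} is the first lag value of y1 and Y_{2,t-1} is the first lag value of y2. Each coefficient a_{ij,l} measures the effect of lag l of variable j on variable i, c_i is the intercept of equation i, and e_{i,t} is its error term.

For the VAR(2) model with the y1 and y2 time series variables, the equations will look like:

$$
\begin{aligned}
Y_{1,t} &= c_1 + a_{11,1} Y_{1,t-1} + a_{12,1} Y_{2,t-1} + a_{11,2} Y_{1,t-2} + a_{12,2} Y_{2,t-2} + e_{1,t} \\
Y_{2,t} &= c_2 + a_{21,1} Y_{1,t-1} + a_{22,1} Y_{2,t-1} + a_{21,2} Y_{1,t-2} + a_{22,2} Y_{2,t-2} + e_{2,t}
\end{aligned}
$$

Here we can clearly see how the model's equations grow with the number of variables and lags. For example, the VAR(3) model with 3 time-series variables will look like:

$$
\begin{aligned}
Y_{1,t} &= c_1 + a_{11,1} Y_{1,t-1} + a_{12,1} Y_{2,t-1} + a_{13,1} Y_{3,t-1} + a_{11,2} Y_{1,t-2} + a_{12,2} Y_{2,t-2} + a_{13,2} Y_{3,t-2} \\
        &\quad + a_{11,3} Y_{1,t-3} + a_{12,3} Y_{2,t-3} + a_{13,3} Y_{3,t-3} + e_{1,t} \\
Y_{2,t} &= c_2 + a_{21,1} Y_{1,t-1} + a_{22,1} Y_{2,t-1} + a_{23,1} Y_{3,t-1} + a_{21,2} Y_{1,t-2} + a_{22,2} Y_{2,t-2} + a_{23,2} Y_{3,t-2} \\
        &\quad + a_{21,3} Y_{1,t-3} + a_{22,3} Y_{2,t-3} + a_{23,3} Y_{3,t-3} + e_{2,t} \\
Y_{3,t} &= c_3 + a_{31,1} Y_{1,t-1} + a_{32,1} Y_{2,t-1} + a_{33,1} Y_{3,t-1} + a_{31,2} Y_{1,t-2} + a_{32,2} Y_{2,t-2} + a_{33,2} Y_{3,t-2} \\
        &\quad + a_{31,3} Y_{1,t-3} + a_{32,3} Y_{2,t-3} + a_{33,3} Y_{3,t-3} + e_{3,t}
\end{aligned}
$$

So this is how the lag order p increases the length of each equation, and the number of variables increases the height of the system (the number of equations).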
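The growth of the system is easier to see in matrix form. The following is a small NumPy sketch, not taken from any library: the helper names `var_step` and `n_coefficients` and all numeric values are made up for illustration. It evaluates one step of a two-variable VAR(1) and counts how many coefficients a VAR(p) with k variables contains.

```python
import numpy as np

def var_step(lagged, coef_mats, intercept):
    """One VAR(p) step: c + A_1 @ y[t-1] + ... + A_p @ y[t-p].

    lagged    -- list of p vectors, most recent first
    coef_mats -- list of p (k x k) matrices A_1..A_p
    """
    y_next = np.array(intercept, dtype=float)
    for A, y_lag in zip(coef_mats, lagged):
        y_next = y_next + A @ y_lag
    return y_next

def n_coefficients(k, p):
    """Parameters in a VAR(p) with k variables: k intercepts + p*k*k lag terms."""
    return k + p * k * k

# VAR(1) with two variables; all numbers are made up for illustration
c = np.array([0.1, 0.2])
A1 = np.array([[0.5, 0.1],
               [0.3, 0.4]])
y_prev = np.array([1.0, 2.0])

print(var_step([y_prev], [A1], c))   # approximately [0.8, 1.3]
print(n_coefficients(2, 1), n_coefficients(2, 2), n_coefficients(3, 3))  # 6 10 30
```

Each row of a coefficient matrix corresponds to one equation of the system, so adding a variable adds a row (and a column) while adding a lag adds a whole matrix.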

Implementing Vector Autoregression(VAR) in Python

Let's build a basic VAR model using Python.

To build the model, we can use Python's statsmodels package, which provides most of the modules needed for time series analysis and also ships with some datasets to practice on.

Importing libraries

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
from statsmodels.tsa.base import  datetools

Importing the data

mdata = sm.datasets.macrodata.load_pandas().data
mdata.head()

Output:

[Table: first five rows of mdata]

Here we can see what our data looks like. I am considering three variables, realgdp, realcons and realinv, for further modeling. We also need to build datetime values from the year and quarter columns before going further.

# Build the date strings in a separate frame so that mdata keeps all its columns
dates = mdata[['year', 'quarter']].astype(int).astype(str)
quarterly = dates["year"] + "Q" + dates["quarter"]
quarterly = datetools.dates_from_str(quarterly)
quarterly

Output:

[Output: list of datetime values, one per quarter]
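As far as I understand, dates_from_str maps a string such as 1959Q1 to the last day of that quarter. The sketch below reproduces that convention with the standard library only, so the mapping is easy to inspect; the helper name `quarter_end` is hypothetical and is not part of statsmodels.

```python
import calendar
from datetime import date

def quarter_end(year, quarter):
    """Return the last calendar day of the given quarter (1-4)."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3 or 4")
    month = 3 * quarter                              # last month of the quarter
    last_day = calendar.monthrange(year, month)[1]   # number of days in that month
    return date(year, month, last_day)

print(quarter_end(1959, 1))   # 1959-03-31
print(quarter_end(1959, 2))   # 1959-06-30
print(quarter_end(2009, 3))   # 2009-09-30
```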

Now we can use the datetime values as the index of our data.

mdata = mdata[['realgdp','realcons','realinv']]
mdata.index = pd.DatetimeIndex(quarterly)
mdata.head()

Output:

[Table: first five rows of realgdp, realcons and realinv, indexed by date]

Visualizing the data.

mdata.plot()

Output:

[Figure: line plot of realgdp, realcons and realinv over time]

Here we can see that realgdp and realcons are highly correlated, and that there is a weaker correlation between realinv and the other variables. From the trend we can see that all of them grow steadily up to the last point, although each shows some decline along the way.

So we can check the pattern of the time series using the differences of their log values.

data = np.log(mdata).diff().dropna()
data.plot()

Output:

[Figure: line plot of the log-differenced series]

Here we can see the log-differenced values plotted over time. The relationship between realgdp and realcons is strong, as they follow the same pattern, which suggests that a change in realgdp will affect realcons or vice versa. For further analysis, we can run some tests on the time series to define the relationship between the variables statistically.
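The transform np.log(mdata).diff() is worth a closer look: for small changes, the difference of logs is close to the period-to-period growth rate, which is why it removes the upward trend. Here is a tiny NumPy sketch on a made-up series (the numbers are assumptions, not taken from the dataset):

```python
import numpy as np

levels = np.array([100.0, 102.0, 105.0, 104.0])   # made-up series

log_diff = np.diff(np.log(levels))                 # log(y_t) - log(y_{t-1})
pct_change = levels[1:] / levels[:-1] - 1.0        # (y_t - y_{t-1}) / y_{t-1}

print(np.round(log_diff, 4))
print(np.round(pct_change, 4))
```

The two printed rows should agree to about three decimal places, so either transform gives a similar picture for series that change by a few percent per period.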

Granger’s causality test

Using Granger's causality test, we can check the relationship between the variables before building the model. If there is no relationship between the variables, we can drop them and model them separately. If there is a relationship between them, we need to consider them together in the modeling part.

Mathematically, the test provides a p-value for each pair of variables. The null hypothesis is that one variable does not Granger-cause the other: if the p-value is higher than 0.05 we fail to reject the null hypothesis, and if the p-value is lower than 0.05 we reject it.
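The idea behind the test can be sketched with two ordinary regressions: one that predicts y from its own lags only, and one that also uses the lags of x. If adding the lags of x reduces the residual error by a lot, the F statistic is large and the null hypothesis is rejected. The code below is a simplified NumPy illustration on simulated data, not the statsmodels implementation; the helper names and the simulated series are assumptions.

```python
import numpy as np

def lagged_design(target, others, lag):
    """Columns: constant, lags of target, lags of each series in others."""
    n = len(target)
    cols = [np.ones(n - lag)]
    for series in [target] + list(others):
        for l in range(1, lag + 1):
            cols.append(series[lag - l:n - l])   # value l periods earlier
    return np.column_stack(cols)

def granger_f_stat(y, x, lag=1):
    """F statistic for H0: lags of x add nothing to a regression of y on its own lags."""
    Y = y[lag:]
    X_restricted = lagged_design(y, [], lag)     # own lags only
    X_full = lagged_design(y, [x], lag)          # own lags + lags of x
    rss = []
    for X in (X_restricted, X_full):
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        rss.append(np.sum((Y - X @ beta) ** 2))
    df_resid = len(Y) - X_full.shape[1]
    return ((rss[0] - rss[1]) / lag) / (rss[1] / df_resid)

# Simulated data in which x drives y with a one-period delay (assumed setup)
rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = np.zeros(500)
y[1:] = 0.8 * x[:-1] + rng.normal(scale=0.5, size=499)

f_x_to_y = granger_f_stat(y, x)   # should be large: x helps to predict y
f_y_to_x = granger_f_stat(x, y)   # typically small: y does not help to predict x
print("x -> y:", f_x_to_y)
print("y -> x:", f_y_to_x)
```

Note that the test is directional, so the result for "x causes y" says nothing about "y causes x".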

statsmodels also provides a function to perform the test, so next I am using statsmodels to perform Granger's causality test.

from statsmodels.tsa.stattools import grangercausalitytests

data = mdata[["realgdp", "realcons"]].pct_change().dropna()
# Test whether the second column (realcons) Granger-causes the first (realgdp), for lags 1 to 12.
gc_res = grangercausalitytests(data, 12)

Output:

[Output: Granger causality test results for realgdp and realcons, lags 1 to 12]

Here we can see that the p-values for every lag are zero. So now, let's move on to the causality test between realgdp and realinv.

data = mdata[["realgdp", "realinv"]].pct_change().dropna()
gc_res = grangercausalitytests(data, 12)

Output:

[Output: Granger causality test results for realgdp and realinv, lags 1 to 12]

Here we can see that the p-values for every lag are higher than 0.05, which means we fail to reject the null hypothesis. We might expect a similar result if we perform the test between realcons and realinv.

data = mdata[["realcons", "realinv"]].pct_change().dropna()
gc_res = grangercausalitytests(data, 12)

Output:

[Output: Granger causality test results for realcons and realinv, lags 1 to 12]

Here the p-values again differ from those of the previous test, and I read this as a sign that realcons and realinv are related. From the graphs we had guessed that there was no significant relationship between realinv and the other variables, which shows why a formal test such as Granger's causality test is useful.

Cointegration test

Cointegration helps to find the statistical connection between two or more time series. When two or more time series are cointegrated, they have a long-run, statistically significant relationship.
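A quick way to build intuition for cointegration is to simulate two series that share the same random-walk trend. Each series wanders without bound, but a suitable combination of the two cancels the trend and stays close to zero. The sketch below uses NumPy only; the series, the seed and the factor 2.0 are assumptions chosen for illustration, and it does not perform a formal test.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000

trend = np.cumsum(rng.normal(size=n))              # shared random-walk trend
a = trend + rng.normal(scale=0.5, size=n)          # series 1 = trend + noise
b = 2.0 * trend + rng.normal(scale=0.5, size=n)    # series 2 = 2 * trend + noise

spread = b - 2.0 * a                               # the trend cancels in this combination

var_a, var_b, var_spread = np.var(a), np.var(b), np.var(spread)
print("variance of a:     ", var_a)
print("variance of b:     ", var_b)
print("variance of spread:", var_spread)
```

The spread has a small, stable variance while the individual series do not, which is the long-run relationship that a cointegration test looks for.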

We can perform this test using the statsmodels package.

from statsmodels.tsa.vector_ar.vecm import coint_johansen

data = mdata[["realgdp", "realcons", "realinv"]].pct_change().dropna()

def cointegration_test(data, alpha=0.05):
    """Perform Johansen's cointegration test and report a summary."""
    out = coint_johansen(data, -1, 5)       # det_order=-1, k_ar_diff=5
    d = {'0.90': 0, '0.95': 1, '0.99': 2}   # column of the critical value for each level
    traces = out.lr1                        # trace statistics
    cvts = out.cvt[:, d[str(1 - alpha)]]    # critical values at the chosen level

    def adjust(val, length=6):
        return str(val).ljust(length)

    # Summary
    print('Name   ::  Test Stat > C(95%)    =>   Signif  \n', '--'*20)
    for col, trace, cvt in zip(data.columns, traces, cvts):
        print(adjust(col), ':: ', adjust(round(trace, 2), 9), ">", adjust(cvt, 8), ' =>  ', trace > cvt)

cointegration_test(data)

Output:

[Output: Johansen test summary with the trace statistic, the 95% critical value and the significance flag for each variable]

Here we can see the significance of the variables on the whole system.

So here we have seen how we can use these two tests; next, we can further proceed with the modeling part.

Modeling

We can directly put the preprocessed data in the VAR module for modeling purposes.

var = VAR(data)

After this step, one question comes up: how to select the order. The usual approach is to look for the best-fit lag value by comparing the AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), FPE (Final Prediction Error) and HQIC (Hannan–Quinn Information Criterion) across lags. These are all criteria that help to select the best-fit lag value.
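To show what such a comparison involves, here is a rough NumPy sketch that fits VAR(p) models by ordinary least squares for several lags and computes AIC and BIC in one common log-determinant form. It is a simplified illustration on simulated data, not the statsmodels implementation; the helper names, the simulated coefficients and the exact parameter-counting convention are assumptions.

```python
import numpy as np

def fit_var_ols(data, p, start):
    """OLS fit of a VAR(p) on rows start..end; returns residual covariance and nobs."""
    n, k = data.shape
    Y = data[start:]
    cols = [np.ones((n - start, 1))]             # intercept column
    for lag in range(1, p + 1):
        cols.append(data[start - lag:n - lag])   # all variables, lag periods earlier
    X = np.hstack(cols)
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ B
    return resid.T @ resid / len(Y), len(Y)

def criteria(data, p, max_lag):
    """AIC and BIC as log|Sigma| plus a penalty on the number of parameters."""
    k = data.shape[1]
    sigma, nobs = fit_var_ols(data, p, start=max_lag)   # same sample for every p
    log_det = np.log(np.linalg.det(sigma))
    n_params = k * (k * p + 1)                          # lag coefficients + intercepts
    aic = log_det + 2.0 * n_params / nobs
    bic = log_det + np.log(nobs) * n_params / nobs
    return aic, bic

# Simulate a two-variable VAR(1) with made-up coefficients
rng = np.random.default_rng(1)
A1 = np.array([[0.5, 0.1], [0.3, 0.4]])
sim = np.zeros((500, 2))
for t in range(1, 500):
    sim[t] = A1 @ sim[t - 1] + rng.normal(size=2)

max_lag = 4
scores = {p: criteria(sim, p, max_lag) for p in range(1, max_lag + 1)}
best_bic = min(scores, key=lambda p: scores[p][1])
for p, (aic, bic) in scores.items():
    print(p, round(aic, 4), round(bic, 4))
print("lag chosen by BIC:", best_bic)
```

Lower values are better for both criteria. BIC penalizes extra lags more heavily than AIC, so the two can disagree, which is why several criteria are compared side by side.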

statsmodels provides the select_order method to analyze these values.

x= var.select_order()
x.summary()

Output:

[Table: AIC, BIC, FPE and HQIC for each candidate lag, with the minimum of each criterion marked by *]

Here the minimum value of each criterion (AIC, BIC, FPE and HQIC) is marked with a * sign. We can see that sign in the third row and the first row. I am choosing the third row, which means that the lag order p of the VAR(p) model is three, because it is suggested to go with the lag at which AIC, together with other criteria, reaches its minimum.

results = var.fit(3)
# We can check the summary of the model with:
results.summary()

Output:

[Output: VAR model summary with the regression results for each equation and the correlation matrix of residuals]

Here we can see the coefficients, standard errors, t-statistics and p-values for every lag up to lag 3. At the end, there is a correlation matrix of the residuals, which shows the correlation between the variables. In the results, we found that the correlation between realgdp and realinv is high, and we can cross-check this against the p-values. We can also plot the model, which is a better way to understand its performance.

Visualizing the input:

results.plot();

Output:

[Figure: plots of the input series realgdp, realcons and realinv]

We can also plot our forecasted values by the model.

results.plot_forecast(20);

Output:

[Figure: forecast plots for realgdp, realcons and realinv over the next 20 steps]

Here we can see the model's forecasts in the plot for every variable. The forecast lines for the next 20 steps move in a steady manner, which suggests that the lag order we chose is satisfactory and provides reasonable results.
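One reason multi-step forecasts look so steady is that each forecast is fed back in as the next lag, so for a stationary model the path settles towards the long-run mean. The NumPy sketch below shows this for a two-variable VAR(1); the coefficients, the starting point and the helper name `iterate_forecast` are made up for illustration and are not the fitted model from this article.

```python
import numpy as np

def iterate_forecast(A, c, last_obs, steps):
    """Multi-step VAR(1) forecast: feed each prediction back in as the next lag."""
    path, y = [], np.asarray(last_obs, dtype=float)
    for _ in range(steps):
        y = c + A @ y
        path.append(y)
    return np.array(path)

# Made-up stationary VAR(1) coefficients
A = np.array([[0.5, 0.1],
              [0.3, 0.4]])
c = np.array([0.1, 0.2])

path = iterate_forecast(A, c, last_obs=[2.0, -1.0], steps=20)
long_run_mean = np.linalg.solve(np.eye(2) - A, c)   # (I - A)^-1 c

print(path[:3])                    # first three forecast steps
print(path[-1], long_run_mean)     # step 20 is already close to the long-run mean
```

A smooth forecast line is therefore expected behaviour rather than proof of accuracy, so it is worth comparing forecasts against held-out data as well.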

In this article we have seen how to perform time series modeling when the data is multivariate. We have seen how to understand the relationship between two time-series variables present in one dataset. There are many examples of this kind of situation: for example, variation in atmospheric temperature can be driven by humidity and seasonal factors, and stock market data behaves similarly. I encourage you to use this model on real-world datasets to understand multivariate time series analysis in more depth.
