Prediction (out of sample)

In [1]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm
/builds/AstraOS/buildsystem/tbs_build/statsmodels/statsmodels-0.8.0/.pybuild/cpython3_3.7_statsmodels/build/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.977
Model:                            OLS   Adj. R-squared:                  0.975
Method:                 Least Squares   F-statistic:                     646.2
Date:                Tue, 03 Dec 2024   Prob (F-statistic):           1.36e-37
Time:                        12:52:36   Log-Likelihood:                -7.7049
No. Observations:                  50   AIC:                             23.41
Df Residuals:                      46   BIC:                             31.06
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9646      0.100     49.493      0.000       4.763       5.167
x1             0.5050      0.015     32.644      0.000       0.474       0.536
x2             0.4084      0.061      6.716      0.000       0.286       0.531
x3            -0.0209      0.001    -15.366      0.000      -0.024      -0.018
==============================================================================
Omnibus:                        1.586   Durbin-Watson:                   2.601
Prob(Omnibus):                  0.452   Jarque-Bera (JB):                1.568
Skew:                           0.381   Prob(JB):                        0.457
Kurtosis:                       2.586   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.44286238  4.89281744  5.30918319  5.66970056  5.96014365  6.17665703
  6.32638917  6.42631794  6.50046124  6.57593088  6.67847824  6.82826332
  7.03654259  7.30381952  7.61976188  7.96489932  8.31382226  8.63935664
  8.9170285   9.12908504  9.26741132  9.3348635   9.34479963  9.31888495
  9.2835323   9.26556265  9.2877984   9.36531381  9.50295829  9.69455964
  9.92393819 10.16756502 10.3984277  10.59046839 10.72286503 10.78345237
 10.77072066 10.69406375 10.57223564 10.43026929 10.29536379 10.19241559
 10.13992808 10.1469716  10.21169312 10.32162075 10.45571282 10.58781497
 10.6909572  10.74178503]

Create a new sample of explanatory variables Xnew, predict and plot

In [5]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred =  olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.71026806 10.56863462 10.33290179 10.03957047  9.7366887   9.4720878
  9.2816716   9.18062596  9.1597007   9.1874744 ]

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I to indicate use of the Identity transform. Ie., we don't want any expansion magic from using **2

In [8]:
res.params
Out[8]:
Intercept           4.964641
x1                  0.505014
np.sin(x1)          0.408430
I((x1 - 5) ** 2)   -0.020871
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.710268
1    10.568635
2    10.332902
3    10.039570
4     9.736689
5     9.472088
6     9.281672
7     9.180626
8     9.159701
9     9.187474
dtype: float64