在Python中实现线性回归 (四)---statsmodels库的线性回归应用

我们也可以使用statsmodels库实现线性回归，这个库的优点是可以输出回归模型更详细的结果。使用statsmodels的方法也和scikit-learn类似。

Step 1：导入相关库和类

import pandas as pd
import tushare as ts
import statsmodels.api as sm

除了pandas，还需要导入statsmodels.api模块。

Step 2：输出结果

x = df_existing_data[['x1_metal', 'x2_finance']]
y = df_existing_data['y_hs300']

如果用statsmodels计算截距，需要在x中添加全为1的列。statsmodels.api中有添加截距项的方法.add_constant()，该方法在数组中的第一列添加了值为1的列：

x = sm.add_constant(x)
print(x.head())
            const  x1_metal  x2_finance
trade_date                             
20180102      1.0    0.8574      1.6890
20180103      1.0    1.0870     -0.0458
20180104      1.0    0.9429     -0.3276
20180105      1.0   -0.7645      0.1391
20180108      1.0    2.2010      0.0529

修改后的x现在有三列，第一列是新的截距项，另外两列是原始数据。

Step 3：建立模型并拟合数据

回归模型是在statsmodels.regression.linear_model.OLS类建立的一个实例，OLS为最小二乘法：

model = sm.OLS(y, x)

【注】.OLS()中第一个参数是因变量，后面一个是自变量。其中还有几个可选参数，可以参考statsmodels的官方文档。

模型建立后进行拟合：

results = model.fit()

通过调用.fit()，可以获得模型结果results，results是statsmodels.regression.linear_model.RegressionResultsWrapper类的实例。模型结果中包含了大量关于这次回归模型的信息。

Step 4：获取结果

results中包含了回归模型的详细信息。提取结果的方法可以调用.summary():

print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                y_zz300   R-squared:                       0.892
Model:                            OLS   Adj. R-squared:                  0.891
Method:                 Least Squares   F-statistic:                     995.0
Date:                Sat, 20 Jul 2019   Prob (F-statistic):          6.76e-117
Time:                        17:35:37   Log-Likelihood:                -146.32
No. Observations:                 243   AIC:                             298.6
Df Residuals:                     240   BIC:                             309.1
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0005      0.029      0.019      0.985      -0.056       0.057
x1_metal       0.3053      0.023     13.385      0.000       0.260       0.350
x2_finance     0.6422      0.026     24.357      0.000       0.590       0.694
==============================================================================
Omnibus:                        0.419   Durbin-Watson:                   1.890
Prob(Omnibus):                  0.811   Jarque-Bera (JB):                0.224
Skew:                          -0.055   Prob(JB):                        0.894
Kurtosis:                       3.101   Cond. No.                         2.19
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

这张表提供的非常全面的回归结果。里面可以找到包括R2，b0，b1和b2在内的信息。同时也可以从上表中提取想要的值：

print('coefficient of determination:', results.rsquared)
print('adjusted coefficient of determination:', results.rsquared_adj)
print('regression coefficients:', results.params)
coefficient of determination: 0.8923741832489497
adjusted coefficient of determination: 0.8914773014426909
regression coefficients: 
const         0.000533
x1_metal      0.305295
x2_finance    0.642216
dtype: float64

以下是获取线性回归的属性的方法：

.rsquared：代表R2
.rsquared_adj：代表修正R2（经过自变量个数的调整后的R2）
.params：返回回归模型参数

可以对比scikit-learn中同类的回归模型，得到的模型结果都是一样的。

Step 5：预测模型

可以使用.fittedvalues或者.predict()方法获取模型产生的因变量f(x1,x2):

print('predicted response:', results.fittedvalues.head(), sep='\n')
predicted response:
trade_date
20180102    1.346996
20180103    0.302975
20180104    0.078006
20180105   -0.143532
20180108    0.706460
dtype: float64
print('predicted response:', results.predict(x.head()), sep='\n')
predicted response:
trade_date
20180102    1.346996
20180103    0.302975
20180104    0.078006
20180105   -0.143532
20180108    0.706460
dtype: float64

或者使用新数据进行预测：

print('predicted response:', results.fittedvalues.head(), sep='\n')
# 选中2019年前7个交易日的数据作为新数据进行预测
df_new_data = df[df.index>'20190101']

# 取自变量x1，x2
new_x = df_new_data[['x1_metal', 'x2_finance']]

# 添加截距项
new_x = sm.add_constant(new_x)

# 预测
new_y = results.predict(new_x)
print(new_y)
trade_date
20190102   -1.471684
20190103    1.178916
20190104    2.550342
20190107    0.177020
20190108   -0.812838
20190109    0.571591
20190110   -0.197293dtype: float64