Implementing Linear Regression in Python (Part 2): Multivariate Linear Regression with sklearn

Multiple linear regression follows essentially the same steps as simple linear regression.

Step 1: Import the required libraries and classes

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import tushare as ts

Step 2: Load the data

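This example uses the CSI 300 index (沪深300) from tushare.pro as the dependent variable, and the sub-sector finance index (000818.CSI) and the sub-sector non-ferrous metals index (000811.CSI) as the two independent variables.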

# Fetch the data with tushare
ts.set_token('your token')
pro = ts.pro_api()

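# Fetch daily bars for the CSI 300, the sub-sector non-ferrous metals index, and the sub-sector finance index
# The date range covers all of 2018 plus the first trading days of 2019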
index_hs300 = pro.index_daily(ts_code='399300.SZ', start_date='20180101', end_date='20190110')
index_metal = ts.pro_bar(ts_code='000811.CSI', freq='D', asset='I', start_date='20180101', end_date='20190110')
index_finance = ts.pro_bar(ts_code='000818.CSI', freq='D', asset='I', start_date='20180101', end_date='20190110')

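# Keep only the data needed for the regression and align it by trade date
# join='inner' drops trade dates that are not present in all three index series
# Keep the daily returns (pct_chg) of the three indices and name the columns y_hs300, x1_metal, x2_finance, which fixes the dependent and independent variables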
index_hs300_pct_chg = index_hs300.set_index('trade_date')['pct_chg']
index_metal_pct_chg = index_metal.set_index('trade_date')['pct_chg']
index_finance_pct_chg = index_finance.set_index('trade_date')['pct_chg']
df = pd.concat([index_hs300_pct_chg, index_metal_pct_chg, index_finance_pct_chg], keys=['y_hs300', 'x1_metal', 'x2_finance'], join='inner', axis=1, sort=True)

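# Select the 2018 rows of x1, x2, y as the existing data for the multivariate regression
df_existing_data = df[df.index < '20190101']

# Extract the independent variables x1, x2 as a DataFrame. np.array(x) shows the structure numpy expects, which is also a two-dimensional array.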
x = df_existing_data[['x1_metal', 'x2_finance']]
y = df_existing_data['y_hs300']
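
The date alignment done by pd.concat with join='inner' can be seen on a toy example. This is only a sketch: the three series, their dates, and their values are made up for illustration and are not tushare data.

```python
import pandas as pd

# Hypothetical daily returns; dates and values are invented for this sketch.
s_y = pd.Series([1.0, 2.0, 3.0, 4.0],
                index=['20180102', '20180103', '20180104', '20180105'])
s_x1 = pd.Series([0.5, 0.6, 0.7],
                 index=['20180102', '20180103', '20180105'])
s_x2 = pd.Series([0.1, 0.2, 0.3],
                 index=['20180103', '20180104', '20180105'])

# axis=1 places the series side by side, keys become the column names,
# and join='inner' keeps only the dates present in all three series.
aligned = pd.concat([s_y, s_x1, s_x2], keys=['y', 'x1', 'x2'],
                    join='inner', axis=1)
print(aligned)
```

Only 20180103 and 20180105 appear in all three inputs, so only those two rows survive. An outer join would keep every date and fill the gaps with NaN, which LinearRegression does not accept.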

The independent variables x and the dependent variable y look like this:

print(x.head())
print(y.head())
            x1_metal  x2_finance
trade_date                      
20180102      0.8574      1.6890
20180103      1.0870     -0.0458
20180104      0.9429     -0.3276
20180105     -0.7645      0.1391
20180108      2.2010      0.0529
trade_date
20180102    1.4028
20180103    0.5870
20180104    0.4237
20180105    0.2407
20180108    0.5173
Name: y_hs300, dtype: float64

A DataFrame lists multiple independent variables in a readable way with concise syntax, and, most importantly, it can be passed directly to LinearRegression. A numpy.array works here as well.
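
As a rough check of that claim, the sketch below fits the same model once from a DataFrame and once from plain numpy arrays. The data is synthetic (generated from y = 1 + 2*x1 + 3*x2), not the index returns used in this article, and the column names x1, x2 are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only: y = 1 + 2*x1 + 3*x2, no noise.
rng = np.random.default_rng(0)
x_df = pd.DataFrame(rng.normal(size=(50, 2)), columns=['x1', 'x2'])
y_sr = 1.0 + 2.0 * x_df['x1'] + 3.0 * x_df['x2']

# Same fit from a DataFrame/Series and from the underlying ndarrays.
model_df = LinearRegression().fit(x_df, y_sr)
model_np = LinearRegression().fit(x_df.to_numpy(), y_sr.to_numpy())

# The ndarray is two-dimensional: one row per sample, one column per variable.
print(x_df.to_numpy().shape)
print(model_df.intercept_, model_df.coef_)
print(model_np.intercept_, model_np.coef_)
```

Both fits should recover the same intercept and weights, since the DataFrame is converted to a two-dimensional array internally.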

Step 3: Create the model and fit the data

As in the previous example, create the model and fit it to the data:

model = LinearRegression().fit(x, y)

Step 4: Get the results

As before, the model's .score() method returns the coefficient of determination, and the .intercept_ and .coef_ attributes give the intercept b0 and the weights b1, b2:

model = LinearRegression().fit(x, y)
r_sq = model.score(x, y)
print('coefficient of determination:', r_sq)
print('intercept:', model.intercept_)
print('slope:', model.coef_)
coefficient of determination: 0.8923741832489497
intercept: 0.0005333314305123599
slope: [0.30529472 0.64221632]
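
For readers who want to see what .score() reports, the sketch below recomputes the coefficient of determination by hand as 1 - SS_res / SS_tot and compares it with the value from sklearn. The data is synthetic and the variable names (ss_res, ss_tot, r_sq_manual) are chosen for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only, with some noise added.
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(100, 2))
y_demo = 0.3 * x_demo[:, 0] + 0.6 * x_demo[:, 1] + rng.normal(scale=0.2, size=100)

demo_model = LinearRegression().fit(x_demo, y_demo)
y_hat = demo_model.predict(x_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r_sq_manual = 1.0 - ss_res / ss_tot

print(r_sq_manual)
print(demo_model.score(x_demo, y_demo))
```

The two printed values should agree up to floating-point rounding.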

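The output shows that the intercept b0 is approximately 0, so f(x1, x2) is approximately 0 when x1 = x2 = 0. Furthermore, with x2 held fixed, f(x1, x2) rises by about 0.305 when x1 increases by 1, and with x1 held fixed it rises by about 0.642 when x2 increases by 1. A decrease of 1 lowers f(x1, x2) by the same amounts.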
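
This reading of the coefficients can be checked with a few lines of arithmetic. The sketch below plugs the printed intercept and weights into the fitted plane; the helper name f follows the notation in the text and is not part of sklearn.

```python
# Intercept and weights copied from the output printed in Step 4.
b0 = 0.0005333314305123599
b1, b2 = 0.30529472, 0.64221632

def f(x1, x2):
    """Fitted plane f(x1, x2) = b0 + b1*x1 + b2*x2."""
    return b0 + b1 * x1 + b2 * x2

print(f(0.0, 0.0))                 # the intercept b0
print(f(1.0, 0.0) - f(0.0, 0.0))   # effect of raising x1 by 1, x2 fixed
print(f(0.0, 1.0) - f(0.0, 0.0))   # effect of raising x2 by 1, x1 fixed

# Fitted value for the first row of the data shown in Step 2
# (x1_metal = 0.8574, x2_finance = 1.6890; the observed y_hs300 is 1.4028).
print(f(0.8574, 1.6890))
```

Because the model is linear, these per-unit effects are the same at every point, not only at the origin.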

Step 5: Predict with the model

Prediction works the same way as before:

# Select the first 7 trading days of 2019 as new data for prediction
df_new_data = df[df.index > '20190101']

# Take the independent variables x1, x2
new_x = df_new_data[['x1_metal', 'x2_finance']]

# Predict
y_pred = model.predict(new_x)
print('predicted response:', y_pred, sep='\n')
predicted response:
[-1.4716842   1.17891556  2.55034238  0.17701977 -0.81283822  0.57159138
 -0.19729323]

The same predictions can also be computed this way:

y_pred = model.intercept_ + (model.coef_ * new_x).sum(axis=1)
print('predicted response:', y_pred, sep='\n')
predicted response:
trade_date
20190102   -1.471684
20190103    1.178916
20190104    2.550342
20190107    0.177020
20190108   -0.812838
20190109    0.571591
20190110   -0.197293
dtype: float64

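This approach multiplies each independent-variable column x1, x2 by its weight b1, b2, sums the products row by row, and adds the fixed intercept b0 to obtain the predicted dependent variable.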
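
The equivalence of the two approaches can be confirmed numerically. The sketch below uses synthetic training data and three made-up new rows (not the 2019 index data), and adds a third, matrix-product form of the same calculation; the names pred_api, pred_sum, and pred_dot are chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic training data for illustration only.
rng = np.random.default_rng(2)
x_train = pd.DataFrame(rng.normal(size=(30, 2)),
                       columns=['x1_metal', 'x2_finance'])
y_train = (0.3 * x_train['x1_metal'] + 0.6 * x_train['x2_finance']
           + rng.normal(scale=0.1, size=30))
demo_model = LinearRegression().fit(x_train, y_train)

# Three hypothetical new observations.
demo_new_x = pd.DataFrame({'x1_metal': [1.0, -0.5, 0.3],
                           'x2_finance': [0.2, 0.4, -1.1]})

# 1) sklearn's predict: returns a numpy array
pred_api = demo_model.predict(demo_new_x)
# 2) weighted column sum plus intercept: returns a Series that keeps the index
pred_sum = demo_model.intercept_ + (demo_model.coef_ * demo_new_x).sum(axis=1)
# 3) the same calculation written as a matrix-vector product
pred_dot = demo_model.intercept_ + demo_new_x.to_numpy() @ demo_model.coef_

print(np.allclose(pred_api, pred_sum))
print(np.allclose(pred_api, pred_dot))
```

The manual form returns a pandas Series that keeps the row index, which is why the second output in Step 5 shows the trade dates while the output of .predict() is a bare array.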