在Python中实现线性回归 (三)---基于sklearn的多项式回归

执行多项式回归步骤与前面也相似，只是需要增加一组经过转换的自变量作为非线性项，如x2。

Step 1：导入相关库和类

除了需要导入pandas和tushare，sklearn.linear_model.LinearRegression外，还需要导入sklearn.preprocessing中的PolynomialFeatures类用来处理非线性项。

import pandas as pd
import tushare as ts
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

Step 2：读取数据

本例就采用tushare.pro上面的沪深300和中金细分金融指数数据。

# 获取沪深300，中金细分金融指数2018年全年日线数据及2019年前7个交易日数据
index_hs300 = pro.index_daily(ts_code='399300.SZ', start_date='20180101', end_date='20190110')
index_finance = ts.pro_bar(ts_code='000818.CSI', freq='D', asset='I', start_date='20180101', end_date='20190110')
index_hs300_pct_chg = index_hs300.set_index('trade_date')['pct_chg']
index_finance_pct_chg = index_finance.set_index('trade_date')['pct_chg']
df = pd.concat([index_hs300_pct_chg, index_finance_pct_chg], keys=['y_hs300', 'x_finance'], join='inner', axis=1, sort=True)
df_existing_data = df[df.index x = df_existing_data[['x_finance']]
y = df_existing_data['y_hs300']

Step 3：自变量转换

这个步骤是多项式回归需要执行的步骤。由于多项式回归中有x2项，因此需要将x数组转换成x2并作为新的列。

有很多种转换方式，比如使用numpy中的insert()方法，或者pandas的DataFrame添加x2列。本例中使用的是PolynomialFeatures类：

transformer = PolynomialFeatures(degree=2, include_bias=False)

PolynomialFeatures类中有以下几个可选参数：

degree：整型，默认为2。用于决定线性回归模型的阶数。
interaction_only：布尔值，默认为False。如果指定为True，那么就不会有xi2项，只有如xixj的交叉项。
include_bias：布尔值，默认为True。此参数决定是否将截距项添加为回归模型中的一项，即增加一列值为1的列。False表示不添加。

先将自变量x数据传入转换器transformer中：

transformer.fit(x)

传入transformer之后，就可以使用.transform()方法获得转换后的数据x和x2：

x_ = transformer.transform(x)

当然也可以用.fit_transform()来替代以上三个语句：

x_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)

转换后的数组效果如下：

print(x_[:5])
[[ 1.6890000e+00  2.8527210e+00]
 [-4.5800000e-02  2.0976400e-03]
 [-3.2760000e-01  1.0732176e-01]
 [ 1.3910000e-01  1.9348810e-02]
 [ 5.2900000e-02  2.7984100e-03]]

可以看到新的数据中有两列：一列是原始的自变量x的数据，另一列是x2项。

Step 4：建立模型并拟合数据

数据经过转换后，模型的建立和拟合的方式与前面一样：

model = LinearRegression().fit(x_, y)

【注】模型中第一个参数变为经过处理后的数据x_，而不是原始的自变量x数组。

Step 5：输出结果

r_sq = model.score(x_, y)
print('coefficient of determination:', r_sq)
print('intercept:', model.intercept_)
print('coefficients:', model.coef_)

.score()返回R2=0.813，截距项b0大约为-0.014，x项的权重b1约为0.862，x2项的权重b2约为-0.014。

另外一种回归方法，也可以通过向转换器中传入不同的参数获得相同的回归结果：

x_ = PolynomialFeatures(degree=2, include_bias=True).fit_transform(x)

这里将include_bias参数设置为True，其他的依然是默认参数。这种方法和step3中的相比，x_中会多出值为1的列，这列对应的是回归模型中的截距项：

print(x_[:5])
[[ 1.0000000e+00  1.6890000e+00  2.8527210e+00]
 [ 1.0000000e+00 -4.5800000e-02  2.0976400e-03]
 [ 1.0000000e+00 -3.2760000e-01  1.0732176e-01]
 [ 1.0000000e+00  1.3910000e-01  1.9348810e-02]
 [ 1.0000000e+00  5.2900000e-02  2.7984100e-03]]

这组数据中，第一列全为1，第二列为x，第三列为x2。由于截距项已经包含在新的数据x_中了，因此在创建LinearRegression类时应该忽略截距项，即传入参数fit_intercept=False：

model = LinearRegression(fit_intercept=False).fit(x_, y)

至此，模型和数据x_是匹配的。模型结果如下：

r_sq = model.score(x_, y)
print('coefficient of determination:', r_sq)
print('intercept:', model.intercept_)
print('coefficients:', model.coef_)
coefficient of determination: 0.8133705974408868
intercept: 0.0
coefficients: [-0.01395755  0.86215933 -0.01444605]

可以看到.intercept_为0，但实际上截距b0已经包含在.coef_中了。其他结果都是一样的。

Step 6：模型预测

同样的，模型预测只需要调用.predict()方法，但是需要注意的是，传入的新数据也需要做转换：

# 选中2019年前7个交易日的数据作为新数据进行预测df_new_data = df[df.index>'20190101']
# 取自变量x
new_x = df_new_data[['x_finance']]

# 自变量转换
new_x_ = PolynomialFeatures(degree=2, include_bias=True).fit_transform(new_x)

# 预测
y_pred = model.predict(new_x_)
print('predicted response:', y_pred, sep='\n')
predicted response:
[-1.32738808  0.98948547  2.38535216 -0.30448932 -0.40485458  0.78721444
 -0.23448668]

已经了解到了，多项式回归模型的预测方式与前面两例基本一样，只需要将用于预测的数据做下转换就好了。

如果有多个自变量，执行的过程也是一样的。这里使用了和前面提到的多变量回归中相同的三组指数日收益率数据：

x = df_existing_data[['x1_metal', 'x2_finance']]
y = df_existing_data['y_hs300']
# step3 数据转换
x_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)

# step4 建立模型并拟合数据
model = LinearRegression().fit(x_, y)
r_sq = model.score(x_, y)
intercept, coefficients = model.intercept_, model.coef_

# step5：模型预测
# 选中2019年前7个交易日的数据作为新数据进行预测
df_new_data = df[df.index > '20190101']
new_x = df_new_data[['x1_metal', 'x2_finance']]

# 自变量转换
new_x_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(new_x)

# 预测
y_pred = model.predict(new_x_)

模型结果和预测如下：

print('coefficient of determination:', r_sq)
print('intercept:', intercept)
print('predicted response:', y_pred, sep='\n')
coefficient of determination: 0.8960573740272783
intercept: 0.020518382683722233
predicted response:
[-0.18180669  0.55726287 -0.79432939  0.17404118  2.57114362  1.23813068
 -1.45582482]

此结果的回归函数为f(x1, x2)=b0+b1x1+b2x2+b3x12+b4x1x2+b5x22。同时可以看到多项式回归的决定系数会高于多变量线性回归。决定系数R2高可能说明这是个不错的回归模型

但是，在实践中，如此复杂的模型产生的高R2，也可能是过拟合导致的。如果想要检验回归模型的表现，需要使用新的数据测试模型，即用样本外的数据进行模型训练。