有这么一句话在业界广泛流传:数据和特征决定了机器学习的上限,而模型和算法只是逼近这个上限而已。由此可见,特征工程在机器学习中占有相当重要的地位。在实际应用当中,可以说特征工程是机器学习成功的关键。
特征工程是数据分析中最耗时间和精力的一部分工作,它不像算法和模型那样是确定的步骤,更多是工程上的经验和权衡。因此没有统一的方法。这里只是对一些常用的方法做一个总结。
特征工程包含了 Data PreProcessing(数据预处理)、Feature Extraction(特征提取)、Feature Selection(特征选择)和 Feature construction(特征构造)等子问题。
Jupyter Notebook 代码连接:feature_engineering_demo_p5_time_series
导入必要的包
import numpy as npimport pandas as pdimport reimport sysimport matplotlib.pyplot as pltimport seaborn as snsimport warningsimport gcwarnings.filterwarnings('ignore' ) sns.set_style('whitegrid' ) SEED = 42
时间序列特征
本案例并不涉及时序特征(time series),但在其他许多场景,时序特征处理可能会非常复杂。时序特征的处理方法有很多,这里参考sklearn中关于时间特征的案例简单介绍几种有代表性的方法。这些方法可以帮助我们更好地理解和利用时间序列数据中的模式和趋势,从而提高机器学习模型的性能和预测准确度。
参考资料:
Time-related feature engineering
Lagged features for time series forecasting
数据探索
Data exploration
from sklearn.datasets import fetch_openmlbike_sharing = fetch_openml("Bike_Sharing_Demand" , version=2 , as_frame=True , parser="pandas" ) df = bike_sharing.frame
season
year
month
hour
holiday
weekday
workingday
weather
temp
feel_temp
humidity
windspeed
count
0
spring
0
1
0
False
6
False
clear
9.84
14.395
0.81
0.0
16
1
spring
0
1
1
False
6
False
clear
9.02
13.635
0.80
0.0
40
2
spring
0
1
2
False
6
False
clear
9.02
13.635
0.80
0.0
32
3
spring
0
1
3
False
6
False
clear
9.84
14.395
0.75
0.0
13
4
spring
0
1
4
False
6
False
clear
9.84
14.395
0.75
0.0
1
X = df.drop('count' , axis=1 ) y = df['count' ]
目标变量为共享单车的每小时租赁次数
fig, ax = plt.subplots(figsize=(12 , 4 )) average_week_demand = df.groupby(["weekday" , "hour" ])["count" ].mean() average_week_demand.plot(ax=ax) _ = ax.set ( title="Average hourly bike demand during the week" , xticks=[i * 24 for i in range (7 )], xticklabels=["Sun" , "Mon" , "Tue" , "Wed" , "Thu" , "Fri" , "Sat" ], xlabel="Time of the week" , ylabel="Number of bike rentals" , )
由于数据集是按时间排序的,我们使用基于时间的交叉验证
from sklearn.model_selection import TimeSeriesSplit, cross_val_scorets_cv = TimeSeriesSplit( n_splits=3 , gap=48 , max_train_size=10000 , test_size=3000 , ) all_splits = list (ts_cv.split(X, y)) train_idx, test_idx = all_splits[0 ] X_train, X_test = X.iloc[train_idx], X.iloc[test_idx] y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
[(array([ 0, 1, 2, ..., 8328, 8329, 8330]),
array([ 8379, 8380, 8381, ..., 11376, 11377, 11378])),
(array([ 1331, 1332, 1333, ..., 11328, 11329, 11330]),
array([11379, 11380, 11381, ..., 14376, 14377, 14378])),
(array([ 4331, 4332, 4333, ..., 14328, 14329, 14330]),
array([14379, 14380, 14381, ..., 17376, 17377, 17378]))]
构造评估函数
import numpy as npfrom sklearn.model_selection import cross_validatedef evaluate (model, X, y, cv, model_prop=None , model_step=None ): cv_results = cross_validate( model, X, y, cv=cv, scoring=["neg_mean_absolute_error" , "neg_root_mean_squared_error" ], return_estimator=model_prop is not None , ) if model_prop is not None : if model_step is not None : values = [ getattr (m[model_step], model_prop) for m in cv_results["estimator" ] ] else : values = [getattr (m, model_prop) for m in cv_results["estimator" ]] print (f"Mean model.{model_prop} = {np.mean(values)} " ) mae = -cv_results["test_neg_mean_absolute_error" ] rmse = -cv_results["test_neg_root_mean_squared_error" ] print ( f"Mean Absolute Error: {mae.mean():.3 f} +/- {mae.std():.3 f} \n" f"Root Mean Squared Error: {rmse.mean():.3 f} +/- {rmse.std():.3 f} " )
from sklearn.ensemble import HistGradientBoostingRegressorgbrt = HistGradientBoostingRegressor(categorical_features="from_dtype" , random_state=42 ) categorical_columns = X.columns[X.dtypes == "category" ] print ("Categorical features:" , categorical_columns.tolist())evaluate(gbrt, X, y, cv=ts_cv, model_prop="n_iter_" )
Categorical features: ['season', 'holiday', 'workingday', 'weather']
Mean model.n_iter_ = 100.0
Mean Absolute Error: 61.872 +/- 17.473
Root Mean Squared Error: 91.119 +/- 23.725
时间组件
一般情况下,时间变量都存储在字符中,我们需要先将字符特征转换为时间戳。由于共享单车数据集中没有时间变量,我们自行构造时间数据演示。
ts_df = pd.DataFrame([ '2019-01-01 01:22:26' , '2019-02-02 04:34:52' , '2019-03-03 06:16:40' , '2019-04-04 08:11:38' , '2019-05-05 10:52:39' , '2019-06-06 12:06:25' , '2019-07-07 14:05:25' , '2019-08-08 16:51:33' , '2019-09-09 18:28:28' , '2019-10-10 20:55:12' , '2019-11-11 22:55:12' , '2019-12-12 00:55:12' , ], columns=['time' ]) ts_df['time' ] = pd.to_datetime(ts_df['time' ]) print (ts_df.head())
time
0 2019-01-01 01:22:26
1 2019-02-02 04:34:52
2 2019-03-03 06:16:40
3 2019-04-04 08:11:38
4 2019-05-05 10:52:39
然后,根据时间所在的自然周期,将一个时间特征转化为若干个离散特征,这种方法在分析具有明显时间趋势的问题比较好用。
ts_df['year' ] = ts_df['time' ].dt.year ts_df['month' ] = ts_df['time' ].dt.month ts_df['day' ] = ts_df['time' ].dt.day ts_df['hour' ] = ts_df['time' ].dt.hour ts_df['minute' ] = ts_df['time' ].dt.minute ts_df['second' ] = ts_df['time' ].dt.second
进一步提取
ts_df['season' ] = ts_df['time' ].dt.quarter ts_df['season' ] = ts_df['season' ].map ({0 : "spring" , 1 : "summer" , 2 : "spring" , 3 : "winter" }) ts_df['dayofweek' ] = ts_df['time' ].dt.dayofweek
还可以提取小时所在每一天的周期:凌晨,上午,下午,晚上
ts_df['hour_section' ] = (ts_df['hour' ] // 6 ).astype(str )
布尔特征提取
ts_df['is_leap_year' ] = ts_df['time' ].dt.is_leap_year ts_df['is_month_start' ] = ts_df['time' ].dt.is_month_start ts_df['is_month_end' ] = ts_df['time' ].dt.is_month_end ts_df['is_quarter_start' ] = ts_df['time' ].dt.is_quarter_start ts_df['is_quarter_end' ] = ts_df['time' ].dt.is_quarter_end ts_df['is_year_start' ] = ts_df['time' ].dt.is_year_start ts_df['is_year_end' ] = ts_df['time' ].dt.is_year_end ts_df['is_weekend' ] = ts_df['dayofweek' ].isin([5 , 6 ])
time
year
month
day
hour
minute
second
season
dayofweek
hour_section
is_leap_year
is_month_start
is_month_end
is_quarter_start
is_quarter_end
is_year_start
is_year_end
is_weekend
0
2019-01-01 01:22:26
2019
1
1
1
22
26
summer
1
0
False
True
False
True
False
True
False
False
1
2019-02-02 04:34:52
2019
2
2
4
34
52
summer
5
0
False
False
False
False
False
False
False
True
2
2019-03-03 06:16:40
2019
3
3
6
16
40
summer
6
1
False
False
False
False
False
False
False
True
3
2019-04-04 08:11:38
2019
4
4
8
11
38
spring
3
1
False
False
False
False
False
False
False
False
4
2019-05-05 10:52:39
2019
5
5
10
52
39
spring
6
1
False
False
False
False
False
False
False
True
差分编码:对于时间型数据来说,即可以把它转换成离散值,也可以计算出时间变量到某一时间之间的数值差距,从而转化为连续值。
ts_df['diff' ] = ts_df['time' ] - pd.Timestamp('2019-01-01' ) ts_df['diff' ].dt.days ts_df['diff' ].dt.seconds ts_df['diff_hours' ] = ts_df['diff' ].values.astype("timedelta64[h]" ).astype('int' )
ts_df['diff' ].values.astype("timedelta64[h]" )
array([ 1, 772, 1470, 2240, 2986, 3756, 4502, 5272, 6042, 6788, 7558,
8280], dtype='timedelta64[h]')
注意,这里的days属性是完全忽略时分秒属性,seconds则是忽略了天数的计算结果。后面的diff_hours则是将时间间隔转换成相差的小时数。
one-hot编码
Time-steps as categories
由于时间特征是离散的,我们可以忽略排序,使用one-hot编码,从而给线性模型带来更大的灵活性。
from sklearn.pipeline import make_pipelinefrom sklearn.compose import ColumnTransformerfrom sklearn.linear_model import RidgeCVfrom sklearn.preprocessing import MinMaxScaler, OneHotEncoderone_hot_encoder = OneHotEncoder(handle_unknown="ignore" , sparse_output=False ) alphas = np.logspace(-6 , 6 , 25 ) one_hot_linear_pipeline = make_pipeline( ColumnTransformer( transformers=[ ("categorical" , one_hot_encoder, categorical_columns), ("one_hot_time" , one_hot_encoder, ["hour" , "weekday" , "month" ]), ], remainder=MinMaxScaler(), ), RidgeCV(alphas=alphas), ) evaluate(one_hot_linear_pipeline, X, y, cv=ts_cv)
Mean Absolute Error: 100.844 +/- 6.861
Root Mean Squared Error: 132.145 +/- 2.966
绘制拟合曲线:
one_hot_linear_pipeline.fit(X_train, y_train) one_hot_linear_predictions = one_hot_linear_pipeline.predict(X_test) plt.figure(figsize=(12 , 4 )) plt.title("Predictions by one-hot linear models" ) plt.plot(y_test.values[-96 :], "x-" , alpha=0.2 , label="Actual demand" , color="black" ) plt.plot(one_hot_linear_predictions[-96 :], "x-" , label="One-hot time features" ) plt.legend() plt.show()
周期性编码
在我们之前看到的拟合线中,呈现出了阶梯状的特点。这是因为每个虚拟变量都被单独处理,缺乏连续性。然而,时间等变量具有明显的周期性连续性。这时,我们可以尝试对时间特征进行周期编码(cyclical encoding),重新编码后,周期范围内的第一个值和最后一个值之间还可以保持平滑。
三角函数变换
cyclical encoding with trigonometric transformation
我们可以使用正弦/余弦变换进行循环编码。我们可以利用这种变换将周期性的时间特征编码成两个新的特征,从而更好地表达时间的连续性。
from sklearn.preprocessing import FunctionTransformerdef sin_transformer (period ): return FunctionTransformer(lambda x: np.sin(x / period * 2 * np.pi)) def cos_transformer (period ): return FunctionTransformer(lambda x: np.cos(x / period * 2 * np.pi))
现在,我们可以使用以下策略构建特征提取管道:
cyclic_cossin_transformer = ColumnTransformer( transformers=[ ("categorical" , one_hot_encoder, categorical_columns), ("month_sin" , sin_transformer(12 ), ["month" ]), ("month_cos" , cos_transformer(12 ), ["month" ]), ("weekday_sin" , sin_transformer(7 ), ["weekday" ]), ("weekday_cos" , cos_transformer(7 ), ["weekday" ]), ("hour_sin" , sin_transformer(24 ), ["hour" ]), ("hour_cos" , cos_transformer(24 ), ["hour" ]), ], remainder=MinMaxScaler(), ) cyclic_cossin_linear_pipeline = make_pipeline( cyclic_cossin_transformer, RidgeCV(alphas=alphas), ) evaluate(cyclic_cossin_linear_pipeline, X, y, cv=ts_cv)
Mean Absolute Error: 115.371 +/- 14.248
Root Mean Squared Error: 159.276 +/- 5.759
绘制拟合曲线
cyclic_cossin_linear_pipeline.fit(X_train, y_train) cyclic_cossin_linear_predictions = cyclic_cossin_linear_pipeline.predict(X_test) plt.figure(figsize=(12 , 4 )) plt.title("Predictions by one-hot linear models" ) plt.plot(y_test.values[-96 :], "x-" , alpha=0.2 , label="Actual demand" , color="black" ) plt.plot(cyclic_cossin_linear_predictions[-96 :], "x-" , label="One-hot + Trigonometric" ) plt.legend() plt.show()
上图说明,正弦/余弦特征允许模型拾取主要模式,但不足以完全捕获该系列的动态。
周期性样条特征
Periodic spline features
同样我们可以尝试对每个周期性特征使用足够数量的样条变换
from sklearn.preprocessing import SplineTransformerdef periodic_spline_transformer (period, n_splines=None , degree=3 ): if n_splines is None : n_splines = period n_knots = n_splines + 1 return SplineTransformer( degree=degree, n_knots=n_knots, knots=np.linspace(0 , period, n_knots).reshape(n_knots, 1 ), extrapolation="periodic" , include_bias=True , )
参数 extrapolation="periodic"
,使周期范围内的第一个值和最后一个值之间保持平滑。
对于那些有序的离散特征,样条编码在保留更多信息的同时,比one-hot编码更有效。
cyclic_spline_transformer = ColumnTransformer( transformers=[ ("categorical" , one_hot_encoder, categorical_columns), ("cyclic_month" , periodic_spline_transformer(12 , n_splines=6 ), ["month" ]), ("cyclic_weekday" , periodic_spline_transformer(7 , n_splines=3 ), ["weekday" ]), ("cyclic_hour" , periodic_spline_transformer(24 , n_splines=12 ), ["hour" ]), ], remainder=MinMaxScaler(), ) cyclic_spline_linear_pipeline = make_pipeline( cyclic_spline_transformer, RidgeCV(alphas=alphas), ) evaluate(cyclic_spline_linear_pipeline, X, y, cv=ts_cv)
Mean Absolute Error: 98.339 +/- 6.321
Root Mean Squared Error: 132.398 +/- 2.490
绘制拟合曲线
cyclic_spline_linear_pipeline.fit(X_train, y_train) cyclic_spline_linear_predictions = cyclic_spline_linear_pipeline.predict(X_test) plt.figure(figsize=(12 , 4 )) plt.title("Predictions by one-hot linear models" ) plt.plot(y_test.values[-96 :], "x-" , alpha=0.2 , label="Actual demand" , color="black" ) plt.plot(cyclic_spline_linear_predictions[-96 :], "x-" , label="One-hot + Spline" ) plt.legend() plt.show()
特征交互
feature interactions
线性模型不能自动捕获特征间的交互效应,因此我们可以显式的引入交互项,然后与之前已经预处理的特征合并。
from sklearn.pipeline import FeatureUnionfrom sklearn.preprocessing import PolynomialFeatureshour_workday_interaction = make_pipeline( ColumnTransformer( [ ("cyclic_hour" , periodic_spline_transformer(24 , n_splines=8 ), ["hour" ]), ("workingday" , FunctionTransformer(lambda x: x == "True" ), ["workingday" ]), ] ), PolynomialFeatures(degree=2 , interaction_only=True , include_bias=False ), ) cyclic_spline_interactions_pipeline = make_pipeline( FeatureUnion( [ ("marginal" , cyclic_spline_transformer), ("interactions" , hour_workday_interaction), ] ), RidgeCV(alphas=alphas), ) evaluate(cyclic_spline_interactions_pipeline, X, y, cv=ts_cv)
Mean Absolute Error: 84.792 +/- 6.839
Root Mean Squared Error: 110.399 +/- 7.442
cyclic_spline_interactions_pipeline.fit(X_train, y_train) cyclic_spline_interactions_predictions = cyclic_spline_interactions_pipeline.predict(X_test)
plt.figure(figsize=(12 , 4 )) plt.title("Predictions by linear models" ) plt.plot(y_test.values[-96 :], "x-" , alpha=0.2 , label="Actual demand" , color="black" ) plt.plot(one_hot_linear_predictions[-96 :], "x-" , label="One-hot time features" ) plt.plot(cyclic_cossin_linear_predictions[-96 :], "x-" , label="One-hot + Trigonometric" ) plt.plot(cyclic_spline_linear_predictions[-96 :], "x-" , label="One-hot + Spline" ) plt.plot(cyclic_spline_interactions_predictions[-96 :], "x-" , label="Splines + polynomial" ) plt.legend() plt.show()
窗口特征
将时间序列在时间轴上划分窗口是一个常用且有效的方法,包括滑动窗口 ( 根据指定的单位长度来框住时间序列,每次滑动一个单位),与滚动窗口 (根据指定的单位长度来框住时间序列,每次滑动窗口长度的多个单位)。窗口分析对平滑噪声或粗糙的数据非常有用,比如移动平均法等,这种方式结合基础的统计方法,即按照时间的顺序对每一个时间段的数据进行统计,从而可以得到每个时间段内目标所体现的特征,进而从连续的时间片段中,通过对同一特征在不同时间维度下的分析,得到数据整体的变化趋势。
count = df["count" ] lagged_df = pd.concat( [ count, count.shift(1 ).rename("lagged_count_1h" ), count.shift(2 ).rename("lagged_count_2h" ), count.shift(3 ).rename("lagged_count_3h" ), count.shift(24 ).rename("lagged_count_1d" ), count.shift(24 + 1 ).rename("lagged_count_1d_1h" ), count.shift(7 * 24 ).rename("lagged_count_7d" ), count.shift(7 * 24 + 1 ).rename("lagged_count_7d_1h" ), count.shift(1 ).rolling(24 ).mean().rename("lagged_mean_24h" ), count.shift(1 ).rolling(24 ).max ().rename("lagged_max_24h" ), count.shift(1 ).rolling(24 ).min ().rename("lagged_min_24h" ), count.shift(1 ).rolling(7 * 24 ).mean().rename("lagged_mean_7d" ), count.shift(1 ).rolling(7 * 24 ).max ().rename("lagged_max_7d" ), count.shift(1 ).rolling(7 * 24 ).min ().rename("lagged_min_7d" ), ], axis="columns" , ) lagged_df.tail(10 )
count
lagged_count_1h
lagged_count_2h
lagged_count_3h
lagged_count_1d
lagged_count_1d_1h
lagged_count_7d
lagged_count_7d_1h
lagged_mean_24h
lagged_max_24h
lagged_min_24h
lagged_mean_7d
lagged_max_7d
lagged_min_7d
17369
247
203.0
224.0
157.0
160.0
169.0
70.0
135.0
93.500000
224.0
1.0
67.732143
271.0
1.0
17370
315
247.0
203.0
224.0
138.0
160.0
46.0
70.0
97.125000
247.0
1.0
68.785714
271.0
1.0
17371
214
315.0
247.0
203.0
133.0
138.0
33.0
46.0
104.500000
315.0
1.0
70.386905
315.0
1.0
17372
164
214.0
315.0
247.0
123.0
133.0
33.0
33.0
107.875000
315.0
1.0
71.464286
315.0
1.0
17373
122
164.0
214.0
315.0
125.0
123.0
26.0
33.0
109.583333
315.0
1.0
72.244048
315.0
1.0
17374
119
122.0
164.0
214.0
102.0
125.0
26.0
26.0
109.458333
315.0
1.0
72.815476
315.0
1.0
17375
89
119.0
122.0
164.0
72.0
102.0
18.0
26.0
110.166667
315.0
1.0
73.369048
315.0
1.0
17376
90
89.0
119.0
122.0
47.0
72.0
23.0
18.0
110.875000
315.0
1.0
73.791667
315.0
1.0
17377
61
90.0
89.0
119.0
36.0
47.0
22.0
23.0
112.666667
315.0
1.0
74.190476
315.0
1.0
17378
49
61.0
90.0
89.0
49.0
36.0
12.0
22.0
113.708333
315.0
1.0
74.422619
315.0
1.0
目标变量的自相关性
0.8438107986135388