|
减少过拟合方法2:通过特征选择进行降维,包含特征选择和特征提取;
此方法对未经正则化处理的模型特别有效;
特征选择:是基于已有特征,构建子集;包含:
序列后向选择算法(Sequential Backward Selection, SBS)
基于特征权重的递归后向消除算法;
基于特征重要性的特征选择树方法;
单变量统计测试;
...
特征提取:是基于已有特征,构造新的特征;
利用随机森林判断特征的重要性;
利用决策树得到的平均不纯度衰减来度量特征的重要性,且不用考虑数据是否线性可分;
scikit-learn中使用tranform()可以基于用户设定的阈值选择特征值;
随机森林程序运行结果如下:
1) Alcohol 0.182483
2) Malic acid 0.158610
3) Ash 0.150948
4) Alcalinity of ash 0.131987
5) Magnesium 0.106589
6) Total phenols 0.078243
7) Flavanoids 0.060718
8) Nonflavanoid phenols 0.032033
9) Proanthocyanins 0.025400
10) Color intensity 0.022351
11) Hue 0.022078
12) OD280/OD315 of diluted wines 0.014645
13) Proline 0.013916
import pandas as pd import numpy as np from sklearn.ensemble import RandomForestClassifier # 读入数据,增加特征名称 df_wine = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None) df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline'] print('Class labels', np.unique(df_wine['Class label'])) #print(df_wine.head()) #将数据分成训练集和测试集 from sklearn.cross_validation import train_test_split X, y = df_wine.iloc[:,1:].values, df_wine.iloc[:,0].values X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=0) #标准化 from sklearn.preprocessing import StandardScaler stdsc = StandardScaler() X_trian_std = stdsc.fit_transform(X_train) X_test_std = stdsc.transform(X_test) # 随机森林 对各个特征的平均不纯度衰减进行排序 feat_labels = df_wine.columns[1:] forest = RandomForestClassifier(n_estimators = 10000, random_state = 0, n_jobs=-1) forest.fit(X_train, y_train) importances = forest.feature_importances_ indices = np.argsort(importances)[::-1] for f in range(X_train.shape[1]): print("%2d) %-*s %f" % (f + 1, 30, feat_labels[f], importances[indices[f]])) # 对结果图形化显示 import matplotlib.pyplot as plt plt.title('Feature Importance') plt.bar(range(X_train.shape[1]), importances[indices], color='lightblue', align='center') plt.xticks(range(X_train.shape[1]), feat_labels, rotation=90) plt.xlim([-1, X_train.shape[1]]) plt.tight_layout() plt.show()
#参考《Python 机器学习》,作者:Sebastian Raschaka, 机械工业出版社;
Archiver|手机版|科学网 ( 京ICP备07017567号-12 )
GMT+8, 2024-4-24 03:32
Powered by ScienceNet.cn
Copyright © 2007- 中国科学报社