Machine learning involves two kinds of parameters:
Parameters learned from the training data --- these can be seen as identified during fitting, e.g. the model coefficients (weights);
Parameters of the learning algorithm itself that must be optimized separately --- hyperparameters, also called tuning parameters, e.g. the depth parameter of a decision tree;
Grid search: given a list of candidate hyperparameter values, exhaustively search all combinations and pick the best one. Its drawback is the heavy computational cost; a cheaper alternative is randomized search.
Example: training and tuning an SVM pipeline
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
###############################################################################
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
# Encode the class labels (M = malignant, B = benign) as integers
X = df.loc[:, 2:].values
y = df.loc[:, 1].values
le = LabelEncoder()
y = le.fit_transform(y)
#print(le.transform(['M', 'B']))
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
###############################################################################
pipe_svc = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'clf__C': param_range, 'clf__kernel': ['linear']},
              {'clf__C': param_range, 'clf__gamma': param_range, 'clf__kernel': ['rbf']}]
gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid,
                  scoring='accuracy', cv=10, n_jobs=-1)
gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)
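The randomized-search alternative mentioned above can be sketched as follows. This is a minimal illustration, not from the original post: it assumes the modern sklearn.model_selection API, scikit-learn's built-in copy of the same WDBC breast-cancer data, and scipy's loguniform distribution for sampling C and gamma.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same kind of pipeline as above, on the built-in breast-cancer data
X, y = load_breast_cancer(return_X_y=True)
pipe_svc = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])

# Instead of enumerating a fixed grid, sample 20 candidate settings from
# log-uniform distributions over C and gamma (rbf kernel only)
param_dist = {'clf__C': loguniform(1e-4, 1e3),
              'clf__gamma': loguniform(1e-4, 1e3),
              'clf__kernel': ['rbf']}
rs = RandomizedSearchCV(pipe_svc, param_distributions=param_dist,
                        n_iter=20, scoring='accuracy', cv=10,
                        random_state=1, n_jobs=-1)
rs.fit(X, y)
print(rs.best_score_)
print(rs.best_params_)
```

With n_iter=20, only 20 parameter settings are evaluated instead of the full grid, which is where the computational savings come from.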
Nested cross-validation:
When selecting among different machine learning algorithms, nested cross-validation is recommended rather than plain k-fold cross-validation alone;
In the outer loop of nested cross-validation, the data are split into training and test folds;
In the inner model-selection loop, k-fold cross-validation is run on the training fold;
After model selection is complete, the test fold is used to evaluate model performance;
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
###############################################################################
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
# Encode the class labels (M = malignant, B = benign) as integers
X = df.loc[:, 2:].values
y = df.loc[:, 1].values
le = LabelEncoder()
y = le.fit_transform(y)
#print(le.transform(['M', 'B']))
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
###############################################################################
pipe_svc = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'clf__C': param_range, 'clf__kernel': ['linear']},
              {'clf__C': param_range, 'clf__gamma': param_range, 'clf__kernel': ['rbf']}]
# Inner loop: 10-fold grid search for model selection
gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid,
                  scoring='accuracy', cv=10, n_jobs=-1)
# Outer loop: 5-fold cross-validation to estimate generalization performance
scores = cross_val_score(gs, X, y, scoring='accuracy', cv=5)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
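The point about choosing between algorithms can be made concrete: run the same outer loop over two different estimators and compare the outer-fold scores. This is a sketch, not from the original post; it assumes scikit-learn's built-in copy of the WDBC data, a reduced inner grid (cv=2, fewer C values) to keep the run fast, and an illustrative decision-tree depth grid.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate 1: SVM pipeline; inner 2-fold grid search over C
pipe_svc = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])
gs_svc = GridSearchCV(pipe_svc,
                      param_grid=[{'clf__C': [0.1, 1.0, 10.0, 100.0],
                                   'clf__kernel': ['rbf']}],
                      scoring='accuracy', cv=2)

# Candidate 2: decision tree; inner 2-fold grid search over max_depth
gs_tree = GridSearchCV(DecisionTreeClassifier(random_state=1),
                       param_grid=[{'max_depth': [1, 2, 3, 4, 5, 6, 7, None]}],
                       scoring='accuracy', cv=2)

# Outer 5-fold loop for each candidate; the algorithm with the better
# mean outer score is the one to prefer
results = {}
for name, gs in [('SVM', gs_svc), ('Decision tree', gs_tree)]:
    scores = cross_val_score(gs, X, y, scoring='accuracy', cv=5)
    results[name] = (np.mean(scores), np.std(scores))
    print('%s CV accuracy: %.3f +/- %.3f' % (name, *results[name]))
```

Because the outer test folds are never seen by the inner grid search, the outer scores are nearly unbiased estimates of each algorithm's generalization performance.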
# Reference: Python Machine Learning, by Sebastian Raschka; Chinese edition published by China Machine Press (机械工业出版社).