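Commonly used functions in preprocessing:
df = pd.read_csv(...): reads the data into a pandas DataFrame;
df.isnull(): returns a Boolean mask marking which entries are NaN;
df.isnull().sum(): counts the NaN values in each column;
df.dropna(): drops the rows or columns that contain NaN, as requested;
sklearn.preprocessing.Imputer: imputes the NaN values in the data (removed in scikit-learn 0.22; newer releases provide sklearn.impute.SimpleImputer instead);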
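The functions listed above can be combined roughly as follows. This is a minimal sketch, assuming a small made-up CSV string as sample data and scikit-learn 0.20 or later (for SimpleImputer); the variable names are illustrative only.

```python
from io import StringIO

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sample data with two missing cells.
csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
0.0,11.0,12.0,"""

df = pd.read_csv(StringIO(csv_data))

# Count the missing values in each column.
nan_counts = df.isnull().sum()

# Drop every row that contains at least one NaN.
rows_dropped = df.dropna(axis=0)

# Drop every column that contains at least one NaN.
cols_dropped = df.dropna(axis=1)

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(df.values)

print(nan_counts)
print(imputed)
```

Dropping is simple but discards data; imputation keeps every sample at the cost of inserting estimated values.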
Handling categorical data: mapping ordinal features. The feature values have a magnitude and can be ordered, so they are mapped with a dictionary.
import pandas as pd

df = pd.DataFrame([
    ['green', 'M', 10.1, 'class1'],
    ['red', 'L', 13.5, 'class2'],
    ['blue', 'XL', 15.3, 'class1'],
])
df.columns = ['color', 'size', 'price', 'classlabel']
print(df)

size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)  # 'size' is the column name
print(df)
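If the original strings are needed again later, the dictionary can be inverted. A short sketch, assuming the same size_mapping as above; inv_size_mapping and restored are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"size": ["M", "L", "XL"]})

size_mapping = {"XL": 3, "L": 2, "M": 1}
df["size"] = df["size"].map(size_mapping)

# Invert the dictionary to map the integers back to the original strings.
inv_size_mapping = {v: k for k, v in size_mapping.items()}
restored = df["size"].map(inv_size_mapping)

print(df["size"].tolist())
print(restored.tolist())
```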
Class label encoding: class labels have no magnitude and cannot be ordered, so they are numbered with enumerate.
import pandas as pd
import numpy as np

df = pd.DataFrame([
    ['green', 'M', 10.1, 'class1'],
    ['red', 'L', 13.5, 'class2'],
    ['blue', 'XL', 15.3, 'class1'],
])
df.columns = ['color', 'size', 'price', 'classlabel']
print(df)

# Assign each unique label an integer index
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
df['classlabel'] = df['classlabel'].map(class_mapping)
print(df)
The LabelEncoder class makes class label encoding more convenient:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame([
    ['green', 'M', 10.1, 'class1'],
    ['red', 'L', 13.5, 'class2'],
    ['blue', 'XL', 15.3, 'class1'],
])
df.columns = ['color', 'size', 'price', 'classlabel']
print(df)

class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
print(y)
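LabelEncoder can also reverse the encoding. The sketch below assumes the same three class labels as above, held in a plain NumPy array.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["class1", "class2", "class1"])

class_le = LabelEncoder()
y = class_le.fit_transform(labels)

# classes_ holds the sorted unique labels; the label at position i is encoded as i.
print(class_le.classes_)

# inverse_transform maps the integers back to the original strings.
restored = class_le.inverse_transform(y)
print(restored)
```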
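One-hot encoding
class sklearn.preprocessing.OneHotEncoder(n_values='auto', categorical_features='all', dtype=<class 'numpy.float64'>, sparse=True, handle_unknown='error')
(This is the signature from older scikit-learn releases; n_values and categorical_features were deprecated in 0.20 and removed in 0.22.)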
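Parameters:
n_values: the string 'auto', an integer, or an array of integers; specifies the upper bound on the values of each feature. 'auto' infers the bound from the training data; a single integer applies the same bound to every feature; an array gives one bound per feature, in order;
categorical_features: 'all' encodes every feature as one-hot; an array of indices encodes only the features at those indices; a Boolean mask encodes the features marked True;
dtype: the numeric type of the encoded output, np.float64 by default;
sparse: Boolean, whether the result is returned as a sparse matrix;
handle_unknown: string; what to do when transform encounters a category value that was not seen during fit: 'error' raises an exception, 'ignore' ignores it;
Attributes:
active_features_: array of active feature indices, available only when n_values='auto'. If some value of some original feature is active in the i-th feature of the transformed data, then i is an element of the array;
feature_indices_: array; the i-th feature of the original data corresponds to the transformed features in the range from feature_indices_[i] to feature_indices_[i+1];
n_values_: array holding the number of values of each feature;
Methods: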
fit (X[, y]) | Fit OneHotEncoder to X. |
fit_transform (X[, y]) | Fit OneHotEncoder to X, then transform X. |
get_params ([deep]) | Get parameters for this estimator. |
set_params (**params) | Set the parameters of this estimator. |
transform (X) | Transform X using one-hot encoding. |
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame([
    ['green', 'M', 10.1, 'class1'],
    ['red', 'L', 13.5, 'class2'],
    ['blue', 'X', 15.3, 'class1'],
])
df.columns = ['color', 'size', 'price', 'classlabel']
print(df)

X = df[['color', 'size', 'price']].values
class_le = LabelEncoder()
# Encode the two string columns as integers first
X[:, 0] = class_le.fit_transform(X[:, 0])
X[:, 1] = class_le.fit_transform(X[:, 1])
print(X)

# categorical_features belongs to the older OneHotEncoder signature listed above
ohe = OneHotEncoder(categorical_features=[0], sparse=False)
X = ohe.fit_transform(X)
print(X)
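On scikit-learn 0.22 and later the categorical_features argument no longer exists, so the same idea is usually expressed with ColumnTransformer. The following is only a sketch under that assumption: the transformer name 'color_ohe' is made up for illustration, and the size column is mapped to integers with a dictionary rather than with LabelEncoder.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    [["green", "M", 10.1], ["red", "L", 13.5], ["blue", "XL", 15.3]],
    columns=["color", "size", "price"],
)

# Ordinal feature: map to integers by hand so the order is preserved.
df["size"] = df["size"].map({"M": 1, "L": 2, "XL": 3})

# One-hot encode only the nominal 'color' column and pass the rest through.
# sparse_threshold=0 asks ColumnTransformer to always return a dense array.
ct = ColumnTransformer(
    [("color_ohe", OneHotEncoder(), ["color"])],
    remainder="passthrough",
    sparse_threshold=0,
)
X = ct.fit_transform(df)

# Columns: blue, green, red (categories in sorted order), then size, price.
print(X)
```

Each colour becomes its own 0/1 column, so no artificial ordering between colours is introduced.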
One-hot encoding can also be done with get_dummies() in pandas:
import pandas as pd

df = pd.DataFrame([
    ['green', 'M', 10.1, 'class1'],
    ['red', 'L', 13.5, 'class2'],
    ['blue', 'X', 15.3, 'class1'],
])
df.columns = ['color', 'size', 'price', 'classlabel']
print(df)

# Only the string columns are expanded; the numeric price column is left unchanged
print(pd.get_dummies(df[['color', 'size', 'price']]))
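Splitting the data into training and test sets: common ratios are 6:4, 7:3 and 8:2; for very large datasets, 9:1 or 99:1 is common.
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)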
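To make the 7:3 split concrete, here is a small self-contained sketch. The arrays X and y are synthetic placeholders introduced here, not data from the original notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples with 2 features each, and binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.3 holds out 30% of the samples; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible across runs, which helps when comparing models.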
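Feature scaling: normalization
class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
Parameters:
feature_range: tuple (min, max), the desired range of the feature values after the transformation;
copy: Boolean; True (the default) works on a copy and leaves the original data untouched, False scales in place;
Attributes:
min_: array, the per-feature adjustment for the minimum;
scale_: array, the per-feature scaling factor;
data_min_: array, the minimum of each feature in the original data;
data_max_: array, the maximum of each feature in the original data;
data_range_: array, the range of each feature in the original data (maximum minus minimum);
Methods: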
fit (X[, y]) | Compute the minimum and maximum to be used for later scaling. |
fit_transform (X[, y]) | Fit to data, then transform it. |
get_params ([deep]) | Get parameters for this estimator. |
inverse_transform (X) | Undo the scaling of X according to feature_range. |
partial_fit (X[, y]) | Online computation of min and max on X for later scaling. |
set_params (**params) | Set the parameters of this estimator. |
transform (X) | Scaling features of X according to feature_range. |
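A short sketch of min-max normalization, checked against the formula x_norm = (x - x_min) / (x_max - x_min). The arrays are made-up sample values; the scaler is fitted on the training data only and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and test data with two features.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
X_test = np.array([[2.5, 25.0]])

mms = MinMaxScaler(feature_range=(0, 1))
X_train_norm = mms.fit_transform(X_train)  # learn min/max, then scale
X_test_norm = mms.transform(X_test)        # reuse the training min/max

# The same result computed by hand, column by column.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual = (X_train - col_min) / (col_max - col_min)

print(X_train_norm)
print(X_test_norm)
```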
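Feature scaling: standardization
class sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)
Parameters:
copy: Boolean; True (the default) works on a copy and leaves the original data untouched, False scales in place;
with_mean: Boolean; True centers the data before scaling, i.e. subtracts the feature's mean from each value. with_mean=True cannot be used when the data is a sparse matrix;
with_std: True scales the data to unit variance;
Attributes:
scale_: array, the per-feature standard deviation by which the data is divided;
mean_: array, the mean of each feature in the original data;
var_: array, the variance of each feature in the original data;
n_samples_seen_: the number of samples processed so far (used for incremental fitting with partial_fit);
Methods: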
fit (X[, y]) | Compute the mean and std to be used for later scaling. |
fit_transform (X[, y]) | Fit to data, then transform it. |
get_params ([deep]) | Get parameters for this estimator. |
inverse_transform (X[, copy]) | Scale back the data to the original representation |
partial_fit (X[, y]) | Online computation of mean and std on X for later scaling. |
set_params (**params) | Set the parameters of this estimator. |
transform (X[, y, copy]) | Perform standardization by centering and scaling |
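A comparable sketch for standardization, checked against x_std = (x - mean) / std. The sample values are made up; note that StandardScaler uses the population standard deviation, which matches NumPy's default ddof=0.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data with two features.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)

# The same result computed by hand (np.std defaults to ddof=0).
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(stdsc.mean_)
print(stdsc.scale_)
print(X_train_std)
```

After the transformation each column has zero mean and unit variance, which many gradient-based learners handle better than raw values on very different scales.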
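Reducing overfitting, method 1: regularization (L1 and L2 regularization)
The type of regularization is selected with the penalty parameter of estimators such as LogisticRegression;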
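As an illustration of the penalty parameter, the sketch below fits LogisticRegression with L1 and with L2 regularization on synthetic data and counts the weights that end up exactly zero. The estimator, the dataset from make_classification and the value C=0.05 are assumptions chosen for the example; the notes do not name a specific model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 3 of which carry information.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=3, n_redundant=0, random_state=0
)
X_std = StandardScaler().fit_transform(X)

# A smaller C means stronger regularization. liblinear supports both penalties.
lr_l1 = LogisticRegression(penalty="l1", C=0.05, solver="liblinear", random_state=0)
lr_l2 = LogisticRegression(penalty="l2", C=0.05, solver="liblinear", random_state=0)
lr_l1.fit(X_std, y)
lr_l2.fit(X_std, y)

# L1 tends to drive some weights to exactly zero; L2 only shrinks them.
n_zero_l1 = int(np.sum(lr_l1.coef_ == 0))
n_zero_l2 = int(np.sum(lr_l2.coef_ == 0))
print("zero weights with L1:", n_zero_l1)
print("zero weights with L2:", n_zero_l2)
```

Because L1 produces sparse weight vectors, it can double as a form of feature selection.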
--------------------------------------------------------------------------------------------------------
Supplement:
1. pandas.DataFrame
Parameters:
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
index : Index or array-like
columns : Index or array-like
dtype : dtype, default None
copy : boolean, default False
Attributes
T | Transpose index and columns. |
at | Access a single value for a row/column label pair. |
axes | Return a list representing the axes of the DataFrame. |
blocks | (DEPRECATED) Internal property, property synonym for as_blocks() |
columns | The column labels of the DataFrame. |
dtypes | Return the dtypes in the DataFrame. |
empty | Indicator whether DataFrame is empty. |
ftypes | Return the ftypes (indication of sparse/dense and dtype) in DataFrame. |
iat | Access a single value for a row/column pair by integer position. |
iloc | Purely integer-location based indexing for selection by position. |
index | The index (row labels) of the DataFrame. |
ix | A primarily label-location based indexer, with integer position fallback. |
loc | Access a group of rows and columns by label(s) or a boolean array. |
ndim | Return an int representing the number of axes / array dimensions. |
shape | Return a tuple representing the dimensionality of the DataFrame. |
size | Return an int representing the number of elements in this object. |
style | Property returning a Styler object containing methods for building a styled HTML representation for the DataFrame. |
values | Return a Numpy representation of the DataFrame. |
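A few of the attributes in the table can be tried on a small DataFrame. This is a sketch using the same toy colour/size/price data as earlier in the notes; it deliberately sticks to attributes that exist in current pandas releases.

```python
import pandas as pd

df = pd.DataFrame(
    [["green", "M", 10.1], ["red", "L", 13.5], ["blue", "XL", 15.3]],
    columns=["color", "size", "price"],
)

print(df.shape)             # (rows, columns)
print(df.ndim)              # number of axes
print(df.size)              # total number of elements
print(df.columns.tolist())  # column labels
print(df.dtypes)            # dtype of each column

print(df.loc[1, "price"])   # label-based access
print(df.iloc[2, 0])        # position-based access
print(df.at[0, "color"])    # fast scalar access by label
print(df.T.shape)           # transpose swaps rows and columns
```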
2. pandas.read_csv: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, doublequote=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
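Purpose: reads CSV-formatted data into a DataFrame
Return value: a DataFrame
Selected parameters:
filepath_or_buffer: the path to the data file, which may be a URL; a bare "filename.csv" also works;
header: the row number to use as the column names, which also marks the start of the data. Note that when skip_blank_lines=True this parameter ignores commented lines and blank lines, so header=0 refers to the first line of data rather than the first line of the file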
from io import StringIO
import pandas as pd

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
0.0,11.0,12.0,'''
-----------------------------
df = pd.read_csv(StringIO(csv_data))
print(df)
Output:
     A     B     C    D
0  1.0   2.0   3.0  4.0
1  5.0   6.0   NaN  8.0
2  0.0  11.0  12.0  NaN
-----------------------------
df = pd.read_csv(StringIO(csv_data), header=1)
print(df)
   1.0   2.0   3.0  4.0
0  5.0   6.0   NaN  8.0
1  0.0  11.0  12.0  NaN
df = pd.read_csv(StringIO(csv_data), header=2)
print(df)
   5.0   6.0  Unnamed: 2  8.0
0  0.0  11.0        12.0  NaN
-----------------------------
df = pd.read_csv(StringIO(csv_data), header=2, skip_blank_lines=True)
print(df)
   5.0   6.0  Unnamed: 2  8.0
0  0.0  11.0        12.0  NaN