lijun255sysu's personal blog http://blog.sciencenet.cn/u/lijun255sysu


Python_Machine Learning_Summary 9: Data Preprocessing (1)

3717 reads · 2018-8-25 23:33 | Category: Research notes

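 Commonly used functions in preprocessing:

  • df = pd.read_csv(...): read the data into a pandas DataFrame;

  • df.isnull(): flag which entries are NaN;

  • df.isnull().sum(): count the NaN values in each column;

  • df.dropna(): drop rows or columns that contain NaN, as required;

  • sklearn.preprocessing.Imputer: impute the NaN values in the DataFrame (in scikit-learn 0.20 and later this class is replaced by sklearn.impute.SimpleImputer);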
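The calls listed so far can be tried together on a tiny table. The following is a minimal sketch, not the author's code: the CSV string is made up for illustration, and it assumes scikit-learn 0.20 or later, where SimpleImputer stands in for the older Imputer class.

```python
from io import StringIO

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with two missing cells
csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
0.0,11.0,12.0,
"""
df = pd.read_csv(StringIO(csv_data))

# Count the NaN values in each column
nan_counts = df.isnull().sum()

# Drop rows (axis=0, the default) or columns (axis=1) that contain NaN
rows_kept = df.dropna()
cols_kept = df.dropna(axis=1)

# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(df.values)

print(nan_counts)
print(rows_kept)
print(cols_kept)
print(imputed)
```

Dropping is simple but discards data; imputing keeps every row at the cost of inventing values, so which one fits depends on how much data is missing.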

  • Handling categorical data: mapping ordinal features, whose values have a magnitude and can be ordered; use a dictionary

import pandas as pd
df = pd.DataFrame([
        ['green', 'M', 10.1, 'class1'],
        ['red', 'L', 13.5, 'class2'],
        ['blue', 'XL', 15.3, 'class1'],])
df.columns = ['color', 'size', 'price', 'classlabel']
print(df)
size_mapping = {
            'XL':3,
            'L':2,
            'M':1}
df['size'] = df['size'].map(size_mapping) # 'size' is the column name
print(df)
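If the original size strings are needed again later, the mapping can be reversed. This is a small sketch under the same size_mapping as above; the names inv_size_mapping and restored are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"size": ["M", "L", "XL"]})
size_mapping = {"XL": 3, "L": 2, "M": 1}
df["size"] = df["size"].map(size_mapping)

# Swap keys and values to build the reverse dictionary
inv_size_mapping = {v: k for k, v in size_mapping.items()}
restored = df["size"].map(inv_size_mapping)

print(df["size"].tolist())
print(restored.tolist())
```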
  • Class label encoding: class labels have no magnitude and cannot be ordered; use enumerate

import pandas as pd
import numpy as np
df = pd.DataFrame([
        ['green', 'M', 10.1, 'class1'],
        ['red', 'L', 13.5, 'class2'],
        ['blue', 'XL', 15.3, 'class1'],])
df.columns = ['color', 'size', 'price', 'classlabel']
print(df)
class_mapping = {label:idx for idx, label in enumerate(np.unique(df['classlabel']))}
df['classlabel'] = df['classlabel'].map(class_mapping)
print(df)
  • The LabelEncoder class makes class label encoding more convenient;

import pandas as pd
import numpy as np
df = pd.DataFrame([
        ['green', 'M', 10.1, 'class1'],
        ['red', 'L', 13.5, 'class2'],
        ['blue', 'XL', 15.3, 'class1'],])
df.columns = ['color', 'size', 'price', 'classlabel']
print(df)
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
print(y)
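A fitted LabelEncoder also remembers the labels it saw, so the integer codes can be turned back into strings. A minimal sketch, using a made-up label array in place of the DataFrame column:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["class1", "class2", "class1"])

class_le = LabelEncoder()
y = class_le.fit_transform(labels)

# classes_ holds the sorted unique labels; position i is the label encoded as i
classes = class_le.classes_

# inverse_transform maps the integer codes back to the original labels
restored = class_le.inverse_transform(y)

print(y)
print(classes)
print(restored)
```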
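  • One-hot encoding

        class sklearn.preprocessing.OneHotEncoder(n_values='auto', categorical_features='all', dtype=<class 'numpy.float64'>, sparse=True, handle_unknown='error')

        (This is the signature from older scikit-learn releases; n_values and categorical_features were deprecated in 0.20 and removed in 0.22.)

        Parameters:

        n_values: the string 'auto', an integer, or an array of integers; specifies the upper bound of each feature's values. 'auto' infers the upper bound from the training data; a single integer is the upper bound for all features; an array gives the upper bound of each feature in turn;

        categorical_features: 'all' one-hot encodes every feature; an array of indices encodes only the features at those indices; a mask encodes the features marked True;

        dtype: the numeric type of the one-hot output, np.float64 by default;

        sparse: boolean, whether the result is returned as a sparse matrix;

        handle_unknown: string, what to do when transform meets a category that was not seen during fit: 'error' raises an exception; 'ignore' ignores it;

        Attributes:

        active_features_: array of the active features; only available when n_values='auto'. If some value of some original feature activates the i-th feature of the transformed data, then i is an element of this array;

        feature_indices_: array; the i-th feature of the original data maps to the transformed features from feature_indices_[i] to feature_indices_[i+1];

        n_values_: array holding the number of values of each feature;

        Methods: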

                

        fit(X[, y]): Fit OneHotEncoder to X.
        fit_transform(X[, y]): Fit OneHotEncoder to X, then transform X.
        get_params([deep]): Get parameters for this estimator.
        set_params(**params): Set the parameters of this estimator.
        transform(X): Transform X using one-hot encoding.

        

        

import pandas as pd
df = pd.DataFrame([
        ['green', 'M', 10.1, 'class1'],
        ['red', 'L', 13.5, 'class2'],
        ['blue', 'X', 15.3, 'class1'],])
df.columns = ['color', 'size', 'price', 'classlabel']
print(df)
from sklearn.preprocessing import LabelEncoder
X = df[['color','size','price']].values
class_le = LabelEncoder()
# Integer-encode the two string columns first
X[:,0] = class_le.fit_transform(X[:,0])
X[:,1] = class_le.fit_transform(X[:,1])
print(X)
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 0 (color) only.
# Note: categorical_features was removed in scikit-learn 0.22, so this call needs an older release.
ohe = OneHotEncoder(categorical_features=[0],sparse=False)
X = ohe.fit_transform(X)
print(X)
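On recent scikit-learn releases the same result can be reached without categorical_features. The sketch below is one possible approach, not the method from the original post: it assumes scikit-learn 0.20 or later and uses ColumnTransformer to apply OneHotEncoder to the color column only, passing the other columns through unchanged. The size column is given as integers here to keep the example short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    [["green", 1, 10.1], ["red", 2, 13.5], ["blue", 3, 15.3]],
    columns=["color", "size", "price"],
)

ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), ["color"])],
    remainder="passthrough",  # keep size and price as they are
    sparse_threshold=0,       # always return a dense array
)
X_encoded = ct.fit_transform(df)

# The one-hot columns come first, in sorted category order: blue, green, red
print(X_encoded)
```

OneHotEncoder accepts string categories directly in these releases, so the LabelEncoder step is no longer needed for the input features.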
  • get_dummies() in pandas can also perform one-hot encoding

import pandas as pd
df = pd.DataFrame([
        ['green', 'M', 10.1, 'class1'],
        ['red', 'L', 13.5, 'class2'],
        ['blue', 'X', 15.3, 'class1'],])
df.columns = ['color', 'size', 'price', 'classlabel']
print(df)
print(pd.get_dummies(df[['color','size','price']]))
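  • Splitting into training and test sets: common ratios are 6:4, 7:3 and 8:2; for very large datasets, 9:1 or 99:1 is common;

# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
# X is the feature matrix and y holds the class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)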
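To see the 7:3 split in action, here is a self-contained sketch on made-up data (the arrays X and y below are placeholders, not the data from the post):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 hypothetical samples with 2 features each, and binary labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 30% of 10 samples go to the test set, the remaining 70% to training
print(X_train.shape, X_test.shape)
```

Fixing random_state makes the shuffle reproducible, which helps when comparing models on the same split.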
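  • Feature scaling: normalization

        sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)

        feature_range: tuple (min, max), the range of the feature values after the transform;

        copy: boolean; True (the default) works on a copy, False modifies the original data in place;

        Attributes:

        min_: array, the per-feature adjustment for the minimum;

        scale_: array, the per-feature scaling factor;

        data_min_: array, the original minimum of each feature;

        data_max_: array, the original maximum of each feature;

        data_range_: array, the original range of each feature (maximum minus minimum);

        Methods: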

        

        fit(X[, y]): Compute the minimum and maximum to be used for later scaling.
        fit_transform(X[, y]): Fit to data, then transform it.
        get_params([deep]): Get parameters for this estimator.
        inverse_transform(X): Undo the scaling of X according to feature_range.
        partial_fit(X[, y]): Online computation of min and max on X for later scaling.
        set_params(**params): Set the parameters of this estimator.
        transform(X): Scaling features of X according to feature_range.
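With the default feature_range=(0, 1), min-max scaling computes (x - min) / (max - min) per feature. The sketch below checks this against MinMaxScaler on made-up arrays; it also shows that test data is scaled with the minimum and range learned from the training data, so values outside the training range can fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])
X_test = np.array([[3.0, 50.0]])

mms = MinMaxScaler(feature_range=(0, 1))
X_train_norm = mms.fit_transform(X_train)  # learn min/max from the training set
X_test_norm = mms.transform(X_test)        # reuse them on the test set

# The same computation written out by hand
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min
manual = (X_train - col_min) / col_range

print(X_train_norm)
print(X_test_norm)
print(mms.data_min_, mms.data_range_)
```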
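  • Feature scaling: standardization

        class sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)

        Parameters:

        copy: boolean; True (the default) works on a copy, False modifies the original data in place;

        with_mean: boolean; True centers the data before scaling, that is, subtracts each feature's mean from its values. with_mean=True cannot be used when the data is a sparse matrix;

        with_std: boolean; True scales the data to unit variance;

        Attributes:

        scale_: array, the per-feature scale (standard deviation) that the data is divided by, that is, the reciprocal of the scaling multiplier;

        mean_: array, the mean of each feature in the original data;

        var_: array, the variance of each feature in the original data;

        n_samples_seen_: the number of samples processed so far (used for training in batches);

        Methods: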

        

        fit(X[, y]): Compute the mean and std to be used for later scaling.
        fit_transform(X[, y]): Fit to data, then transform it.
        get_params([deep]): Get parameters for this estimator.
        inverse_transform(X[, copy]): Scale back the data to the original representation.
        partial_fit(X[, y]): Online computation of mean and std on X for later scaling.
        set_params(**params): Set the parameters of this estimator.
        transform(X[, y, copy]): Perform standardization by centering and scaling.
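Standardization computes (x - mean) / std per feature, using the population standard deviation. A minimal sketch on made-up data that compares StandardScaler with the hand-written formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)

# The same computation by hand; np.std uses the population formula (ddof=0) by default
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(X_train_std)
print(stdsc.mean_, stdsc.var_, stdsc.scale_)
```

After the transform each column has mean 0 and standard deviation 1, which is what many gradient-based learners expect.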


  • Reducing overfitting, method 1: regularization (L1 and L2 regularization)

        Choose between them with the penalty parameter;
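The post does not say which estimator the penalty parameter belongs to. As an illustration only, the sketch below assumes LogisticRegression with the liblinear solver (which supports both penalties) and uses synthetic data in which only the first two features matter. The value C=0.1 is an arbitrary choice; smaller C means stronger regularization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, but only the first two influence the label
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

lr_l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
lr_l2 = LogisticRegression(penalty="l2", C=0.1, solver="liblinear")
lr_l1.fit(X, y)
lr_l2.fit(X, y)

# L1 tends to push the weights of uninformative features to exactly zero,
# while L2 only shrinks them toward zero
n_zero_l1 = int(np.sum(lr_l1.coef_ == 0))
n_zero_l2 = int(np.sum(lr_l2.coef_ == 0))

print(lr_l1.coef_)
print(lr_l2.coef_)
print(n_zero_l1, n_zero_l2)
```

Because L1 yields sparse weight vectors, it can double as a simple form of feature selection.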


--------------------------------------------------------------------------------------------------------

Supplement:

1. pandas.DataFrame https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

Parameters:

data : numpy ndarray (structured or homogeneous), dict, or DataFrame

Dict can contain Series, arrays, constants, or list-like objects

Changed in version 0.23.0: If data is a dict, argument order is maintained for Python 3.6 and later.

index : Index or array-like

Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided

columns : Index or array-like

Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided

dtype : dtype, default None

Data type to force. Only a single dtype is allowed. If None, infer

copy : boolean, default False

Copy data from inputs. Only affects DataFrame / 2d ndarray input

Attributes

T: Transpose index and columns.
at: Access a single value for a row/column label pair.
axes: Return a list representing the axes of the DataFrame.
blocks: (DEPRECATED) Internal property, property synonym for as_blocks()
columns: The column labels of the DataFrame.
dtypes: Return the dtypes in the DataFrame.
empty: Indicator whether DataFrame is empty.
ftypes: Return the ftypes (indication of sparse/dense and dtype) in DataFrame.
iat: Access a single value for a row/column pair by integer position.
iloc: Purely integer-location based indexing for selection by position.
index: The index (row labels) of the DataFrame.
ix: A primarily label-location based indexer, with integer position fallback.
loc: Access a group of rows and columns by label(s) or a boolean array.
ndim: Return an int representing the number of axes / array dimensions.
shape: Return a tuple representing the dimensionality of the DataFrame.
size: Return an int representing the number of elements in this object.
style: Property returning a Styler object containing methods for building a styled HTML representation of the DataFrame.
values: Return a Numpy representation of the DataFrame.
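A short sketch of the constructor parameters and a few of the attributes above, on a made-up array. It sticks to loc, iloc and at, since ix was removed in pandas 1.0.

```python
import numpy as np
import pandas as pd

# data, index and columns are the constructor parameters described above
data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
df = pd.DataFrame(data, index=["a", "b", "c"], columns=["x", "y"])

shape = df.shape  # (rows, columns)
ndim = df.ndim    # number of axes
size = df.size    # total number of elements

by_label = df.loc["b", "y"]  # label-based access
by_pos = df.iloc[1, 1]       # position-based access to the same cell
single = df.at["c", "x"]     # fast access to a single value by label

transposed = df.T            # swap index and columns

print(shape, ndim, size)
print(by_label, by_pos, single)
print(transposed)
```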




2. pandas.read_csv http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, doublequote=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)


Purpose: read CSV-formatted data into a DataFrame

Return value: a DataFrame

Selected parameters:

filepath_or_buffer: path to the data file; it can be a URL, or simply "filename.csv";

header: the row number to use as the column names, and the start of the data. Note that when skip_blank_lines=True this parameter ignores commented lines and blank lines, so header=0 denotes the first line of data rather than the first line of the file


import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
0.0,11.0,12.0,'''
-----------------------------
df=pd.read_csv(StringIO(csv_data))
print(df)

Output:
     A     B     C    D
0  1.0   2.0   3.0  4.0
1  5.0   6.0   NaN  8.0
2  0.0  11.0  12.0  NaN
-----------------------------
df=pd.read_csv(StringIO(csv_data), header=1)
print(df)

Output:
    1.0   2.0   3.0   4.0
0   5.0   6.0   NaN   8.0
1   0.0  11.0  12.0   NaN

df=pd.read_csv(StringIO(csv_data), header=2)
print(df)

Output:
    5.0   6.0  Unnamed: 2   8.0
0   0.0  11.0        12.0   NaN
-----------------------------
df=pd.read_csv(StringIO(csv_data), header=2,skip_blank_lines=True)
print(df)

Output (the same as above, since skip_blank_lines=True is the default):
    5.0   6.0  Unnamed: 2   8.0
0   0.0  11.0        12.0   NaN






https://blog.sciencenet.cn/blog-3377553-1131141.html
