jilingxf's personal blog http://blog.sciencenet.cn/u/jilingxf


EXPLORATORY DATA ANALYSIS USING R: Study Notes (1)

Posted 2019-7-26 09:47 | Personal category: r | System category: Research notes | Tags: EXPLORATORY, DATA, ANALYSIS, USING, data analysis

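First, a disclaimer: my English is quite weak, and I read this material with the help of the Youdao dictionary, so I only grasp the general idea. I am posting it here partly as notes for myself and partly in the hope that kind readers will point out mistakes, for which I would be very grateful.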

Also, I personally think this book is well worth reading.

Chapter 1

Data, Exploratory Analysis, and R

1.1 Why do we analyze data?

The basic subject of this book is data analysis, so it is useful to begin by addressing the question of why we might want to do this. There are at least three motivations for analyzing data:


1. to understand what has happened or what is happening;

2. to predict what is likely to happen, either in the future or in other circumstances we haven't seen yet;

3. to guide us in making decisions.


The primary focus of this book is on exploratory data analysis, discussed further in the next section and throughout the rest of this book, and this approach is most useful in addressing problems of the first type: understanding our data. That said, the predictions required in the second type of problem listed above are typically based on mathematical models like those discussed in Chapters 5 and 10, which are optimized to give reliable predictions for data we have available, in the hope and expectation that they will also give reliable predictions for cases we haven't yet considered. In building these models, it is important to use representative, reliable data, and the exploratory analysis techniques described in this book can be extremely useful in making certain this is the case. Similarly, in the third class of problems listed above (making decisions), it is important that we base our decisions on an accurate understanding of the situation and/or accurate predictions of what is likely to happen next. Again, the techniques of exploratory data analysis described here can be extremely useful in verifying and/or improving the accuracy of our data and our predictions.


1.2 The view from 90,000 feet

This book is intended as an introduction to the three title subjects (data, its exploratory analysis, and the R programming language), and the following sections give high-level overviews of each, emphasizing key details and interrelationships.


1.2.1 Data

Loosely speaking, the term "data" refers to a collection of details, recorded to characterize a source like one of the following:


- an entity, e.g.: family history from a patient in a medical study; manufacturing lot information for a material sample in a physical testing application; or competing company characteristics in a marketing analysis;

- an event, e.g.: demographic characteristics of those who voted for different political candidates in a particular election;

- a process, e.g.: operating data from an industrial manufacturing process.


This book will generally use the term "data" to refer to a rectangular array of observed values, where each row refers to a different observation of entity, event, or process characteristics (e.g., distinct patients in a medical study), and each column represents a different characteristic (e.g., diastolic blood pressure) recorded (or at least potentially recorded) for each row. In R's terminology, this description defines a data frame, one of R's key data types.
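As a rough illustration of this rectangular layout, the sketch below builds a tiny data frame by hand. The patient identifiers, column names, and values are all invented for illustration and do not come from the book.

```r
# Hypothetical example: three patients (rows) and three recorded
# characteristics (columns). All names and values are made up.
patients <- data.frame(
  age         = c(54, 61, 47),
  diastolicBP = c(82, NA, 76),   # NA marks a value that was not recorded
  smoker      = c(TRUE, FALSE, FALSE),
  row.names   = c("patient01", "patient02", "patient03")
)

dim(patients)            # number of rows, then number of columns
str(patients)            # one line per column, showing its type
is.data.frame(patients)  # confirms the object is a data frame
```

The NA in the second row shows the "potentially recorded" case: the column exists for every row even when a particular value is absent.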


The mtcars data frame is one of many built-in data examples in R. This data frame has 32 rows, each one corresponding to a different car. Each of these cars is characterized by 11 variables, which constitute the columns of the data frame. These variables include the car's mileage (in miles per gallon, mpg), the number of gears in its transmission, the transmission type (manual or automatic), the number of cylinders, the horsepower, and various other characteristics. The original source of this data was a comparison of 32 cars from model years 1973 and 1974 published in Motor Trend Magazine. The first six records of this data frame may be examined using the head command in R:


head(mtcars)




An important feature of data frames in R is that both rows and columns have names associated with them. In favorable cases, these names are informative, as they are here: the row names identify the particular cars being characterized, and the column names identify the characteristics recorded for each car.
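The row and column names described above can be inspected directly. This is a minimal sketch using only base R functions; the choice of which car and variable to look up is arbitrary.

```r
# mtcars ships with base R (datasets package), so nothing needs to be installed.
dim(mtcars)              # rows (cars) and columns (variables)
head(rownames(mtcars))   # row names identify the cars
colnames(mtcars)         # column names identify the recorded characteristics

# Row and column names can be used directly for indexing.
mtcars["Mazda RX4", "mpg"]
```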


A more complete description of this dataset is available through R's built-in help facility. Typing "help(mtcars)" at the R command prompt will bring up a help page that gives the original source of the data, cites a paper from the statistical literature that analyzes this dataset [39], and briefly describes the variables included. This information constitutes metadata for the mtcars data frame: metadata is "data about data," and it can vary widely in terms of its completeness, consistency, and general accuracy. Since metadata often provides much of our preliminary insight into the contents of a dataset, it is extremely important, and any limitations of this metadata (incompleteness, inconsistency, and/or inaccuracy) can cause serious problems in our subsequent analysis. For these reasons, discussions of metadata will recur frequently throughout this book. The key point here is that, potentially valuable as metadata is, we cannot afford to accept it uncritically: we should always cross-check the metadata with the actual data values, with our intuition and prior understanding of the subject matter, and with other sources of information that may be available.

[39] H.V. Henderson and P.F. Velleman. Building multiple regression models interactively. Biometrics, 37(2):391-411, 1981.
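One possible way to act on this advice is to compare a few claims from the help page against the data itself. The particular checks below are illustrative choices, not a procedure prescribed by the book, and the variable names are introduced here only for the example.

```r
# help(mtcars) describes 32 observations on 11 variables; compare with the data.
claimed_rows <- 32
claimed_cols <- 11
dims_match <- nrow(mtcars) == claimed_rows && ncol(mtcars) == claimed_cols

# The help page describes 'am' as transmission type (0 = automatic, 1 = manual),
# so no other values should appear in that column.
am_values <- sort(unique(mtcars$am))
am_is_binary <- all(am_values %in% c(0, 1))

# Count explicitly coded missing values (NA) in each column.
na_counts <- colSums(is.na(mtcars))

dims_match
am_values
na_counts
```

Checks like these only catch explicit disagreements; they would not reveal missing values that are disguised as ordinary numbers.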


As a specific illustration of this last point, a popular benchmark dataset for evaluating binary classification algorithms (i.e., computational procedures that attempt to predict a binary outcome from other variables) is the Pima Indians diabetes dataset, available from the UCI Machine Learning Repository, an important Internet data source discussed further in Chapter 4. In this particular case, the dataset characterizes female adult members of the Pima Indians tribe, giving a number of different medical status and history characteristics (e.g., diastolic blood pressure, age, and number of times pregnant), along with a binary diagnosis indicator with the value 1 if the patient had been diagnosed with diabetes and 0 if they had not. Several versions of this dataset are available: the one considered here was obtained from the UCI website on May 10, 2014, and it has 768 rows and 9 columns. In contrast, the data frame Pima.tr included in R's MASS package is a subset of this original, with 200 rows and 8 columns. The metadata available for this dataset from the UCI Machine Learning Repository now indicates that this dataset exhibits missing values, but there is also a note that prior to February 28, 2011 the metadata indicated that there were no missing values. In fact, the missing values in this dataset are not coded explicitly as missing with a special code (e.g., R's "NA" code), but are instead coded as zero. As a result, a number of studies characterizing binary classifiers have been published using this dataset as a benchmark where the authors were not aware that data values were missing: in some cases, quite a large fraction of the total observations. As a specific example, the serum insulin measurement included in the dataset is 48.7% missing.
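The zero-coding problem can be sketched on a small made-up data frame. The values and column names below are invented for illustration and are not taken from the Pima Indians diabetes dataset; the recoding step is one common approach, not necessarily the one the book uses.

```r
# Hypothetical toy data imitating zero-coded missing values.
toy <- data.frame(
  diastolicBP = c(72, 0, 64, 80, 0),
  insulin     = c(0, 94, 0, 168, 0),
  age         = c(50, 31, 32, 21, 33)
)

# Fraction of zeros per column: a zero blood pressure or insulin level is
# physically implausible, so a high fraction hints at disguised missing data.
zero_fraction <- colMeans(toy == 0)
zero_fraction

# Recode zeros as NA only in the columns where zero cannot be a real value.
suspect <- c("diastolicBP", "insulin")
cleaned <- toy
cleaned[suspect] <- lapply(cleaned[suspect], function(x) replace(x, x == 0, NA))
colSums(is.na(cleaned))
```

The recoding is restricted to named columns because some variables, such as the number of times pregnant, can legitimately be zero.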





https://blog.sciencenet.cn/blog-853805-1191136.html
