Summary of cluster analysis in R

10405 reads | 2015-7-10 18:49 | Category: Research notes | Cluster analysis

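After finishing the posts on software and databases for next-generation high-throughput sequencing data analysis, I had planned to discuss OTU-based analysis (alpha diversity and beta diversity) next. But since I happened to need cluster analysis in my own recent data processing, I will cover it here first. There are already many blog posts on cluster analysis online; my aim is mainly to gather their strengths into an overview that is as complete as I can make it, which I can also refer back to later.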

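From a statistical point of view, cluster analysis simplifies data through modeling: it is a statistical technique that groups the objects under study into clusters according to their similarity. One step that is often overlooked in cluster analysis is data transformation. The OTU tables or microarray data we typically receive need to be standardized or centered before further analysis, and in my experience, whether or not the data are transformed has a large effect on the final result. So I will first introduce the decostand function in the R vegan package, which provides standardization methods for community ecology. The main transformations are: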

1. total: divide by margin total (default MARGIN=1)

2. max: divide by margin maximum (default MARGIN=2)

3. freq: divide by margin maximum and multiply by the number of non-zero items, so that the average of non-zero entries is one (Oksanen 1983; default MARGIN=2)

4. normalize: make margin sum of squares equal to one (default MARGIN=1)

5. range: standardize values into range 0..1 (default MARGIN=2). If all values are constant, they will be transformed to 0.

6. standardize: scale x to zero mean and unit variance (default MARGIN=2).

7. pa: scale x to presence/absence scale (0/1)

8. chi.square: divide by row sums and square root of column sums, and adjust for square root of matrix total (Legendre & Gallagher 2001).

9. hellinger: square root of method="total" (Legendre & Gallagher 2001).

10. log: logarithmic transformation as suggested by Anderson et al. (2006): log_b(x) + 1 for x > 0, where b is the base of the logarithm; zeros are left as zeros. Higher bases give less weight to quantities and more to presences, and logbase=Inf gives the presence/absence scaling.
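To make two of these transformations concrete, here is a minimal sketch that computes total and hellinger by hand in base R. The small matrix otu and its values are invented for illustration. If vegan happens to be installed, the last lines compare the hand-computed result with decostand(); otherwise that step is skipped.

```r
# Hypothetical 3 x 3 OTU table: rows are samples, columns are OTUs
otu <- matrix(c(10, 0,  5,
                 3, 2,  8,
                 0, 0, 12),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("site", 1:3), paste0("otu", 1:3)))

# method = "total": divide each row by its row total (MARGIN = 1)
rel <- sweep(otu, 1, rowSums(otu), "/")

# method = "hellinger": square root of the "total" result
hel <- sqrt(rel)
print(round(hel, 3))

# Optional cross-check against vegan, only if the package is installed
if (requireNamespace("vegan", quietly = TRUE)) {
  hel_vegan <- vegan::decostand(otu, method = "hellinger")
  print(max(abs(hel_vegan - hel)))  # largest absolute difference
}
```

After the hellinger transformation every row has a sum of squares equal to one, which is why Euclidean distances on the transformed table behave well for community data.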

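Choose the transformation that suits your data type; in practice, hellinger and log are used most often. I have spent a fair amount of space on data transformation because I consider it important, yet it is often neglected. For example, the t-test, ANOVA, and Pearson correlation are all derived under the assumption of normally distributed data, so your data must satisfy that assumption before you use them. With that said, here is a summary of the clustering methods available in R.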
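The snippets in this post all operate on an object called mydata, which is never defined here. A minimal sketch of one way to prepare it, assuming the numeric columns of the built-in iris data set stand in for your own table:

```r
# Assumption: the four numeric columns of iris stand in for "mydata"
mydata <- iris[, 1:4]
mydata <- na.omit(mydata)  # drop rows with missing values
mydata <- scale(mydata)    # center each column and scale to unit variance
dim(mydata)
```

Scaling matters for distance-based methods because otherwise the variable with the largest numeric range dominates the Euclidean distances.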

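1. Partitioning Clustering

The kmeans() function in the stats package provides several partitioning algorithms based on Euclidean distance.
pam() in the cluster package partitions the data around medoids and works with any distance measure. clara() is a wrapper around pam() suited to large data sets. Silhouette plots and spanning ellipses can be used for visualization.
The flexclust package does k-centroid clustering with arbitrary distance measures and centroid computations, and also includes other methods such as hard competitive learning, neural gas, and QT clustering. Neighborhood graphs and image plots can be used to visualize the partitions.
The trimcluster package provides trimmed k-means clustering.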

K-means clustering is the most popular partitioning method. It requires the analyst to specify the number of clusters to extract. A plot of the within-groups sum of squares against the number of clusters extracted can help determine the appropriate number of clusters. The analyst looks for a bend in the plot, similar to a scree test in factor analysis. See Everitt & Hothorn (pg. 251).

# Determine number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

# K-Means Cluster Analysis
fit <- kmeans(mydata, 5) # 5 cluster solution
# get cluster means
aggregate(mydata, by=list(fit$cluster), FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)

A robust version of K-means based on medoids can be invoked by using pam() instead of kmeans(). The function pamk() in the fpc package is a wrapper for pam() that also prints the suggested number of clusters based on optimum average silhouette width.
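The idea behind pamk(), picking the number of clusters by average silhouette width, can be sketched with pam() alone from the cluster package. The simulated data, the candidate range 2:6, and the variable names below are assumptions made for illustration; this is not pamk()'s actual implementation.

```r
library(cluster)

# Simulated data (assumption): two well-separated groups of 30 points each
set.seed(1)
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2))

# Average silhouette width for each candidate number of clusters
ks <- 2:6
avg_width <- sapply(ks, function(k) pam(x, k)$silinfo$avg.width)

# Keep the k with the widest average silhouette and refit
best_k  <- ks[which.max(avg_width)]
fit_pam <- pam(x, best_k)

print(data.frame(k = ks, avg_width = round(avg_width, 3)))
print(table(fit_pam$clustering))
```

Silhouette width is undefined for a single cluster, which is why the candidate range starts at 2.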

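2. Hierarchical Clustering

hclust() in the stats package and agnes() in the cluster package are the main functions for agglomerative hierarchical clustering. diana() in the cluster package performs divisive hierarchical clustering.
The dendrogram class in the stats package (see as.dendrogram()) and its associated methods can be used to improve the visualization of cluster trees.
The pvclust package assesses the uncertainty in hierarchical clustering, providing approximately unbiased p-values as well as bootstrap p-values.
The hybridHclust package performs hybrid hierarchical clustering.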
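As a rough illustration of the difference between the agglomerative and divisive functions named above, the sketch below runs agnes() and diana() on the same simulated data and compares the two-cluster cuts. The data and variable names are assumptions for illustration only.

```r
library(cluster)

# Simulated data (assumption): two well-separated groups of 20 points each
set.seed(2)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 4, sd = 0.3), ncol = 2))

ag <- agnes(x, method = "ward")  # agglomerative: merges from the bottom up
di <- diana(x)                   # divisive: splits from the top down

print(ag$ac)  # agglomerative coefficient
print(di$dc)  # divisive coefficient

# Convert both trees to hclust objects so that cutree() can be reused
groups_ag <- cutree(as.hclust(ag), k = 2)
groups_di <- cutree(as.hclust(di), k = 2)
print(table(agnes = groups_ag, diana = groups_di))
```

Both coefficients lie between 0 and 1, with values near 1 suggesting a clearer clustering structure.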

There is a wide range of hierarchical clustering approaches. I have had good luck with Ward's method, described below.

# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
# "ward" was renamed in R 3.1.0; "ward.D2" applies Ward's criterion to Euclidean distances
fit <- hclust(d, method="ward.D2")
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")

The pvclust() function in the pvclust package provides p-values for hierarchical clustering based on multiscale bootstrap resampling. Clusters that are highly supported by the data will have large p-values. Interpretation details are provided by Suzuki. Be aware that pvclust clusters columns, not rows, so transpose your data before using it.

# Ward Hierarchical Clustering with Bootstrapped p values
library(pvclust)
# pvclust clusters columns, so transpose to cluster the rows of mydata
fit <- pvclust(t(mydata), method.hclust="ward.D2",
               method.dist="euclidean")
plot(fit) # dendrogram with p values
# add rectangles around groups highly supported by the data
pvrect(fit, alpha=.95)

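3. Model Based Clustering

The mclust package (and its variant mclust02) fits mixtures of Gaussians using the EM algorithm. It allows fine control over the volume and shape of the covariance matrices and over agglomerative hierarchical clustering. Building on hierarchical clustering, it provides several kinds of analysis: EM, the Bayesian Information Criterion (BIC) for clustering, density estimation, and discriminant analysis.
The prabclus package clusters a presence-absence matrix by computing an MDS from the distances and applying maximum likelihood Gaussian mixture clustering to the MDS points.
Bayesian estimation of finite mixtures of multivariate normals is available in the bayesm package. It samples from such models and estimates them with Gibbs sampling. It also provides analysis of the Markov chain Monte Carlo (MCMC) chains, such as determining marginal densities, clustering observations, and plotting univariate and bivariate marginal densities.
Univariate normal mixture models can be analyzed with the nor1mix package. The bayesmix package provides Bayesian estimation using JAGS. The vabayelMix package implements Bayesian estimation of multivariate normal distributions with diagonal covariance matrices using a variational approach. The wle package does robust weighted likelihood estimation.
The MFDA package does model-based clustering of functional data.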

Model based approaches assume a variety of data models and apply maximum likelihood estimation and Bayes criteria to identify the most likely model and number of clusters. Specifically, the Mclust() function in the mclust package selects the optimal model according to BIC for EM initialized by hierarchical clustering for parameterized Gaussian mixture models. (phew!) One chooses the model and number of clusters with the largest BIC. See help(mclustModelNames) for details on the model chosen as best.

# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
plot(fit) # plot results
summary(fit) # display the best model

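4. Other Cluster Algorithms

The amap package provides alternative implementations of k-means and agglomerative hierarchical clustering.
The cba package provides clustering techniques for business analytics, such as Proximus and Rock.
The clue package implements ensemble methods for hierarchical and partitioning clustering.
Fuzzy clustering and bagged clustering are available in the e1071 package.
The hopach package is a hybrid of hierarchical methods and the PAM partitioning method; it builds a tree by recursively partitioning the data set.
Self-organizing maps are available in the som package.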

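5. Cluster-wise Regression

The flexmix package implements the EM algorithm to estimate mixtures of regression models, including mixtures of (generalized) linear models. The fpc package provides fixed-point methods for model-based clustering and linear regression, along with several projection methods for displaying cluster results.
Multigroup mixtures of latent Markov models on categorical and continuous data (including time series) can be fitted with the depmix package. The parameters are optimized with a general-purpose optimization routine, subject to linear or nonlinear constraints on the parameters.
The mixreg package fits mixtures of regression models and provides a bootstrap test for the number of components.
Mixed-mode latent class regression, especially for longitudinal data, is implemented in the mmlcr package. The data may follow multivariate normal, multinomial, negative binomial, or Poisson distributions, and concomitant variables can be specified to model the prior.
The moc package fits mixture models to multivariate mixed data using a Newton-type algorithm; covariates, concomitant variables, and parameter constraints can also be specified.
The mixtools package provides EM fitting of mixtures of multinomials, normals, normals with repeated measurements, Poisson regressions, and Gaussian regressions (with random effects), as well as a Metropolis-Hastings algorithm for fitting mixtures of Gaussian regressions.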

6. Plotting Cluster Solutions

It is always a good idea to look at the cluster results.

# K-Means Clustering with 5 clusters
fit <- kmeans(mydata, 5)

# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
         labels=2, lines=0)

# Centroid Plot against 1st 2 discriminant functions
library(fpc)
plotcluster(mydata, fit$cluster)

Validating cluster solutions

The function cluster.stats() in the fpc package provides a mechanism for comparing the similarity of two cluster solutions using a variety of validation criteria (Hubert's gamma coefficient, the Dunn index, and the corrected Rand index).

# comparing 2 cluster solutions
library(fpc)
cluster.stats(d, fit1$cluster, fit2$cluster)

where d is a distance matrix among objects, and fit1$cluster and fit2$cluster are integer vectors containing classification results from two different clusterings of the same data.
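The objects fit1 and fit2 are not created anywhere in this post. A minimal sketch of one way to obtain two clusterings of the same data and compare them follows; the simulated data, the choice of k-means versus average-linkage hclust, and the variable names are assumptions for illustration. The cluster.stats() call only runs if fpc is installed; the cross-tabulation works in base R regardless.

```r
# Simulated data (assumption): two well-separated groups of 30 points each
set.seed(42)
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2))
d <- dist(x)  # Euclidean distance matrix among objects

# Two different clusterings of the same data
fit1    <- kmeans(x, centers = 2, nstart = 10)
groups2 <- cutree(hclust(d, method = "average"), k = 2)

# Base R check: cross-tabulate the two sets of labels
agreement <- table(kmeans = fit1$cluster, hclust = groups2)
print(agreement)

# Optional: validation statistics from fpc, only if it is installed
if (requireNamespace("fpc", quietly = TRUE)) {
  cs <- fpc::cluster.stats(d, fit1$cluster, groups2)
  print(cs$corrected.rand)  # corrected Rand index between the two solutions
}
```

Cluster labels are arbitrary, so two solutions agree when each row of the cross-tabulation has a single non-zero cell, even if the label numbers differ.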

References: 1. http://www.statmethods.net/advstats/cluster.html

2. http://blog.sina.com.cn/s/blog_4b2fe4d20100ixut.html

A useful blog post on clustering: http://www.cnblogs.com/foreverycc/archive/2013/04/19/3029873.html

Introduction to cluster analysis in R: http://cran.r-project.org/web/views/Cluster.html

Attachments:

pvclust(clust with p-values)_1.2-0.pdf

Standardization methods for community ecology.pdf

R语言的数据基本转换.pdf (Basic data transformation in R)

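To repost this article, please contact the original author for permission and state that it comes from Zhao Jun's (赵军) ScienceNet blog.
Link: https://blog.sciencenet.cn/blog-2662605-904424.html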
