jiyanbio1983's personal blog http://blog.sciencenet.cn/u/jiyanbio1983


clValid: an R package for cluster validation

Posted 2017-7-10 21:53 | Personal category: Bioinformatics | System category: Teaching notes

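Clustering is an unsupervised technique for grouping objects that lie close to one another in a multidimensional feature space, usually to reveal some inherent structure in the data. It is widely used in the analysis of high-throughput genomic data, where the goal is to group genes or proteins that have similar expression patterns and may therefore share common biological pathways.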


A large number of clustering algorithms exist, and many of them have shown promise for analyzing genomic data.


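Various measures have been proposed to validate the results of a cluster analysis and to determine which clustering algorithm performs best for a particular experiment. Such validation can be based solely on the internal properties of the data or on an external reference, and it can use the expression data alone or in combination with relevant biological information.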



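clValid provides functions for validating the results of a cluster analysis. It offers three types of cluster validation measures: "internal", "stability", and "biological".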

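1) "internal" measures take only the data set and the clustering partition as input, and use information intrinsic to the data to assess the quality of the clustering.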

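2) "stability" measures are a special form of internal measure. They compare the clustering result with the clusterings obtained after removing one column at a time, and so assess the consistency of the result.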

3) "biological" measures evaluate the ability of a clustering algorithm to produce biologically meaningful clusters.
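The leave-one-column-out idea behind the stability measures can be sketched in a few lines. clValid itself is an R package, so the Python below is only an illustration and not the package's code. The function names `average_linkage` and `apn` are hypothetical, the clustering step is a plain average-linkage (UPGMA) routine chosen because it is deterministic, and the score follows the usual definition of the average proportion of non-overlap (APN): the fraction of an observation's cluster mates that are lost when one column is removed, averaged over all observations and columns.

```python
from itertools import combinations
from math import dist


def average_linkage(points, k):
    """Agglomerative clustering with average linkage; returns one label per point."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest mean pairwise distance.
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = sum(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            d /= len(clusters[a]) * len(clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def apn(points, k):
    """Average proportion of non-overlap between the full-data clustering
    and the clusterings obtained with one column removed."""
    n, m = len(points), len(points[0])
    full = average_linkage(points, k)
    total = 0.0
    for col in range(m):
        # Drop column `col` from every observation and cluster again.
        reduced = [p[:col] + p[col + 1:] for p in points]
        part = average_linkage(reduced, k)
        for i in range(n):
            same_full = {j for j in range(n) if full[j] == full[i]}
            same_part = {j for j in range(n) if part[j] == part[i]}
            total += 1 - len(same_full & same_part) / len(same_full)
    return total / (n * m)


# Made-up data: only the first column separates the two groups.
toy = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(apn(toy, 2))
```

A value of 0 means that every observation keeps the same cluster mates whichever column is removed; larger values indicate a less consistent clustering.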


For internal validation, the chosen measures reflect the compactness, connectedness, and separation of the cluster partitions.

1) Connectedness describes the extent to which observations are placed in the same cluster as their nearest neighbors in the data space.

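2) Compactness assesses the homogeneity of the clusters, usually through the within-cluster variance, while separation quantifies how well the clusters are separated from one another (usually by measuring the distance between cluster centroids).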
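Connectedness is commonly scored with a connectivity measure: each observation is penalized by 1/j whenever its j-th nearest neighbor falls in a different cluster, and the penalties are summed, so smaller values are better. The following is a minimal Python sketch of that idea, not clValid's own R code; the function name `connectivity` and the parameter `n_neighbors` are hypothetical, and Euclidean distance is assumed.

```python
from math import dist


def connectivity(points, labels, n_neighbors):
    """Sum of penalties 1/j for every j-th nearest neighbour that is
    assigned to a different cluster than the observation itself."""
    total = 0.0
    for i, p in enumerate(points):
        # All other observations, ordered from nearest to farthest.
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        for rank, j in enumerate(others[:n_neighbors], start=1):
            if labels[j] != labels[i]:
                total += 1.0 / rank
    return total


# Made-up data: two tight pairs of points.
toy = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(connectivity(toy, [0, 0, 1, 1], n_neighbors=1))  # partition follows the pairs
print(connectivity(toy, [0, 1, 0, 1], n_neighbors=1))  # partition splits the pairs
```

The partition that keeps each pair together incurs no penalty, while the one that splits both pairs is penalized once per observation.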


Because compactness and separation show opposing trends, common methods combine the two into a single score.
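The Dunn index is one well-known score of this kind: it divides the smallest distance between observations in different clusters (separation) by the largest distance between observations in the same cluster (compactness), so larger values are better. The Python sketch below is an illustration under that definition and is not taken from clValid; the function name `dunn_index` is hypothetical, Euclidean distance is assumed, and at least one cluster must contain two distinct points so that the denominator is non-zero.

```python
from itertools import combinations
from math import dist


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest
    within-cluster distance (the largest cluster diameter)."""
    min_between = float("inf")
    max_within = 0.0
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        if labels[i] == labels[j]:
            max_within = max(max_within, d)
        else:
            min_between = min(min_between, d)
    return min_between / max_within


# Made-up data: two pairs of points that are far apart.
toy = [(0, 0), (0, 1), (5, 0), (5, 1)]
print(dunn_index(toy, [0, 0, 1, 1]))  # partition follows the pairs
print(dunn_index(toy, [0, 1, 0, 1]))  # partition splits the pairs
```

The partition that follows the two pairs scores much higher than the one that splits them, because it is both more compact and better separated.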


clValid also supports nine clustering algorithms, including hierarchical clustering, K-means, self-organizing maps (SOM), and model-based clustering.


  • UPGMA (Unweighted Pair Group Method with Arithmetic Mean). It is an agglomerative, hierarchical clustering algorithm that yields a dendrogram which can be cut at a chosen height to produce the desired number of clusters.


  • K-means. It is an iterative method which minimizes the within-class sum of squares for a given number of clusters. Often another clustering algorithm (e.g., UPGMA) is run initially to determine starting points for the cluster centers.


  • Diana. It is a divisive hierarchical algorithm that initially starts with all observations in a single cluster, and successively divides the clusters until each cluster contains a single observation.


  • PAM. Partitioning around medoids (PAM) is similar to K-means, but is considered more robust because it admits the use of other dissimilarities besides Euclidean distance. Like K-means, the number of clusters is fixed in advance, and an initial set of cluster centers is required to start the algorithm.


  • Clara. It is a sampling-based algorithm which implements PAM on a number of sub-datasets. This allows for faster running times when the number of observations is relatively large.


  • Fanny. This algorithm performs fuzzy clustering, where each observation can have partial membership in each cluster.


  • SOM. Self-organizing maps (SOM) are an unsupervised learning technique. SOM is based on neural networks, and is highly regarded for its ability to map and visualize high-dimensional data in two dimensions.


  • Model-based clustering. Under this approach, a statistical model consisting of a finite mixture of Gaussian distributions is fit to the data.


  • SOTA. Self-organizing tree algorithm (SOTA) is an unsupervised network with a divisive hierarchical binary tree structure.
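To make the K-means entry in the list above concrete, here is a minimal Python sketch of the iteration, assuming Euclidean distance and starting centers supplied by the caller (for example, taken from an initial UPGMA run, as the list notes). It is an illustration only and not the implementation that clValid calls in R; the function names `kmeans` and `within_ss` are hypothetical.

```python
from math import dist


def kmeans(points, centers, max_iter=100):
    """Alternate between assigning points to the nearest center and
    recomputing each center as the mean of its points, until the
    assignment stops changing."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels, centers


def within_ss(points, labels, centers):
    """Within-cluster sum of squared distances, the quantity K-means reduces."""
    return sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


# Made-up data and starting centers.
toy = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels, centers = kmeans(toy, centers=[(0, 0), (10, 0)])
print(labels, centers, within_ss(toy, labels, centers))
```

PAM follows the same assign-and-update pattern but restricts each center to be an actual observation (a medoid), which is why it can work with dissimilarities other than Euclidean distance.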






https://blog.sciencenet.cn/blog-3291578-1065612.html
