xbinbzy's personal blog http://blog.sciencenet.cn/u/xbinbzy


The subsampled open-reference clustering strategy in 16S sequencing

3711 views | 2017-9-19 17:30 | Personal category: Data analysis | System category: Research notes

Reference article: Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences

Year of publication: 2014

Journal: PeerJ


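The reads produced by 16S rDNA sequencing are first assembled into tags, and these tags then have to be clustered into OTUs. There are three strategies for generating OTUs: 1) de novo OTU picking, 2) closed-reference OTU picking, and 3) open-reference OTU picking.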


The basic principle of de novo OTU picking is:

In de novo OTU picking, input sequences are aligned against one another, and sequences that align with greater than a user-specified percent identity are defined as belonging to the same OTU.

Because the de novo strategy requires a large number of sequence-against-sequence alignments, it costs more compute time.
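The de novo idea can be sketched in a few lines of Python. This is only a minimal greedy-centroid illustration, not the algorithm of any particular tool: `identity` is a toy position-by-position comparison standing in for a real alignment, and the function names and the `>=` comparison against the threshold are assumptions made for the example.

```python
def identity(a: str, b: str) -> float:
    """Toy percent identity: matching positions divided by the longer length.

    Assumption: a stand-in for a real pairwise alignment.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def pick_de_novo(seqs: dict[str, str], threshold: float = 0.97) -> dict[str, list[str]]:
    """Greedy clustering: a read joins the first centroid it matches,
    otherwise it becomes the centroid of a new OTU."""
    otus: dict[str, list[str]] = {}  # centroid sequence -> member read ids
    for read_id, seq in seqs.items():
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid].append(read_id)
                break
        else:  # no existing centroid matched this read
            otus[seq] = [read_id]
    return otus
```

Every read is compared against the centroids found so far, one after another, which is where the extra compute time of the de novo strategy comes from in this sketch.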

The basic principle of closed-reference OTU picking is:

In closed-reference OTU picking, input sequences are aligned to pre-defined cluster centroids in a reference database. If the input sequence does not match any reference sequence at a user-defined percent identity threshold, that sequence is excluded.

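Because closed-reference picking only aligns reads against a database, it is fast, but it loses a fair amount of information, since reads that do not match the reference are discarded.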
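A minimal Python sketch of the closed-reference step, under the same assumptions as any toy example: `identity` is a simplified position-by-position comparison rather than a real alignment, and `pick_closed_reference` and its return shape are hypothetical names chosen for illustration.

```python
def identity(a: str, b: str) -> float:
    """Toy percent identity (assumption: no real alignment is performed)."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def pick_closed_reference(
    seqs: dict[str, str], reference: dict[str, str], threshold: float = 0.97
) -> tuple[dict[str, list[str]], list[str]]:
    """Assign each read to its best-matching reference sequence.

    Returns (otus, excluded): reads whose best match is below the
    threshold end up in `excluded` and are not part of any OTU.
    """
    otus: dict[str, list[str]] = {}
    excluded: list[str] = []
    for read_id, seq in seqs.items():
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in reference.items():
            score = identity(seq, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            otus.setdefault(best_id, []).append(read_id)
        else:
            excluded.append(read_id)
    return otus, excluded
```

Each read is handled independently of every other read, so the loop over reads could be split across workers; the `excluded` list makes the information loss visible.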

The basic principle of open-reference OTU picking is:

It combines the features of the two strategies above. First, input sequences are clustered against a reference database in parallel in a closed-reference OTU picking process. However, rather than discarding sequences that fail to match the reference, these “failures” are clustered de novo in a serial process.
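The combination of the two steps can be sketched as follows. This is a hedged illustration only: `identity` is a toy comparison instead of a real alignment, and the helper names and the `denovo` OTU prefix are assumptions introduced for the example.

```python
def identity(a: str, b: str) -> float:
    """Toy percent identity (assumption: no real alignment is performed)."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def closed_reference(seqs, reference, threshold):
    """Return (otus, failures); failures keep their sequences for later use."""
    otus, failures = {}, {}
    for read_id, seq in seqs.items():
        best = max(reference, key=lambda r: identity(seq, reference[r]), default=None)
        if best is not None and identity(seq, reference[best]) >= threshold:
            otus.setdefault(best, []).append(read_id)
        else:
            failures[read_id] = seq
    return otus, failures


def de_novo(seqs, threshold, prefix):
    """Greedy centroid clustering; returns (otus, centroids) keyed by OTU id."""
    otus, centroids = {}, {}
    for read_id, seq in seqs.items():
        for otu_id, centroid in centroids.items():
            if identity(seq, centroid) >= threshold:
                otus[otu_id].append(read_id)
                break
        else:  # no centroid matched: this read founds a new OTU
            otu_id = f"{prefix}{len(centroids)}"
            centroids[otu_id] = seq
            otus[otu_id] = [read_id]
    return otus, centroids


def pick_open_reference(seqs, reference, threshold=0.97):
    """Closed-reference first, then de novo clustering of the failures."""
    ref_otus, failures = closed_reference(seqs, reference, threshold)
    new_otus, _ = de_novo(failures, threshold, prefix="denovo")
    return {**ref_otus, **new_otus}
```

Unlike the closed-reference sketch, no read is dropped here: whatever fails the reference search is clustered serially in `de_novo`.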

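Subsampled open-reference OTU picking works within the overall framework of open-reference OTU picking, with some modifications and adjustments. The basic idea is as follows:

(The original figure can be viewed at https://peerj.com/articles/545/#p-8)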

A detailed description of the workflow: First, sequences are clustered in parallel using a closed-reference OTU picking workflow, where sequences are queried against the reference database at percent identity s (default 97%). If a read matches a reference sequence at greater than or equal to s% identity, it is assigned to the OTU defined by that reference sequence. These are referred to as the reference OTUs. Next, a random subsample of n% (n should be small, the default value in QIIME 1.8.0-dev and earlier is 0.1%) of the sequences that failed to match the reference sequence collection are clustered de novo, and the cluster centroids for all resulting OTUs are used to define a new reference sequence collection. Those OTUs are referred to as the new reference OTUs. The sequences that were not included in the random subsample that was clustered de novo then go through an additional round of parallel closed-reference OTU picking, this time where they are clustered against the new reference OTUs based on matching a sequence in the new reference sequence collection at greater than or equal to s% identity. This creation of a “new reference database” allows us to harness the parallelization of our closed-reference OTU picking pipeline, greatly decreasing the time it takes for sequences that fail to hit the initial reference database to be clustered into OTUs. In the final clustering step, sequences that fail to hit a reference sequence during this final closed-reference OTU picking step are clustered de novo. These are referred to as the clean-up OTUs. Finally, the reference OTUs, new reference OTUs, and clean-up OTUs are combined into a single OTU table, and this table, as well as a filtered table excluding OTUs with counts less than or equal to a user-defined threshold c, are provided to the user. By default, c = 2, so each OTU is observed at least twice (i.e., singleton OTUs are excluded).
Because many more of the sequences can be clustered using closed-reference OTU picking in this workflow, it can run in far less time than classic open-reference OTU picking.
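The four steps described above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not QIIME's implementation: `identity` is a position-by-position comparison instead of a real alignment, all function names and OTU id prefixes are hypothetical, `n` is given as a fraction (0.001 for 0.1%), at least one failed read is always subsampled so that small inputs work, and the filter keeps OTUs observed at least `c` times, following the "observed at least twice" reading of the default.

```python
import random


def identity(a: str, b: str) -> float:
    """Toy percent identity (assumption: no real alignment is performed)."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def closed_reference(seqs, reference, s):
    """Return (otus, failures); failures map read id -> sequence."""
    otus, failures = {}, {}
    for read_id, seq in seqs.items():
        best = max(reference, key=lambda r: identity(seq, reference[r]), default=None)
        if best is not None and identity(seq, reference[best]) >= s:
            otus.setdefault(best, []).append(read_id)
        else:
            failures[read_id] = seq
    return otus, failures


def de_novo(seqs, s, prefix):
    """Greedy centroid clustering; returns (otus, centroids) keyed by OTU id."""
    otus, centroids = {}, {}
    for read_id, seq in seqs.items():
        for otu_id, centroid in centroids.items():
            if identity(seq, centroid) >= s:
                otus[otu_id].append(read_id)
                break
        else:  # no centroid matched: this read founds a new OTU
            otu_id = f"{prefix}{len(centroids)}"
            centroids[otu_id] = seq
            otus[otu_id] = [read_id]
    return otus, centroids


def subsampled_open_reference(seqs, reference, s=0.97, n=0.001, c=2, seed=0):
    rng = random.Random(seed)
    # Step 1: closed-reference picking against the original reference.
    ref_otus, failed = closed_reference(seqs, reference, s)
    # Step 2: de novo clustering of a random subsample of the failures.
    # Assumption: subsample at least one read when any failures exist.
    k = max(1, round(len(failed) * n)) if failed else 0
    sample_ids = set(rng.sample(sorted(failed), k))
    sample = {i: q for i, q in failed.items() if i in sample_ids}
    new_ref_otus, new_reference = de_novo(sample, s, prefix="NewRef")
    # Step 3: remaining failures go through closed-reference picking
    # against the centroids produced in step 2.
    rest = {i: q for i, q in failed.items() if i not in sample_ids}
    hits, still_failed = closed_reference(rest, new_reference, s)
    for otu_id, members in hits.items():
        new_ref_otus[otu_id].extend(members)
    # Step 4: whatever still fails is clustered de novo (clean-up OTUs).
    cleanup_otus, _ = de_novo(still_failed, s, prefix="CleanUp")
    table = {**ref_otus, **new_ref_otus, **cleanup_otus}
    # Assumption: keep OTUs observed at least c times (drops singletons at c=2).
    filtered = {otu: m for otu, m in table.items() if len(m) >= c}
    return table, filtered
```

Only the subsample in step 2 and the leftovers in step 4 pass through the serial `de_novo` function; with the default of 0.1%, one million failed reads would mean roughly one thousand reads in the step 2 subsample, and every other failed read is first tried in the independent per-read search of step 3.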



https://blog.sciencenet.cn/blog-306699-1076746.html
