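ENCODE holds a very large amount of data: nearly 14,000 experiments.
To help users search datasets and their metadata, ENCODE provides a web interface (a REST API), documented at https://www.encodeproject.org/help/rest-api/. The documentation explains the URL format for searches and how to parse the returned results. The "Additional search examples" section at the end gives some very good examples.
For example, "All biosamples (full metadata with object references)":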
https://www.encodeproject.org/search/?type=biosample&frame=object&limit=all&format=json
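This example returns the metadata of all biosamples in JSON format. The WeChat public account "如何玩转生物大数据" has an article, "TCGA系列:TCGA的样本注释信息和数据类型统计" (on TCGA sample annotation and data type statistics), that briefly covers how to parse JSON.
The already-parsed results are provided here:
http://pan.baidu.com/s/1nv7emeP (see encode.bioSamples.metadata.csv inside the archive).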
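For readers who prefer to parse the JSON themselves, here is a minimal Python sketch of the query above. It assumes the search response keeps its result objects in a top-level "@graph" list. The field names passed to graph_to_csv and the accession in the usage note are illustrative assumptions, not a verified description of the biosample schema.

```python
import csv
import io
import json
import urllib.parse
import urllib.request

BASE_URL = "https://www.encodeproject.org/search/"


def build_search_url(params):
    """Build an ENCODE search URL from a dict of query parameters."""
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Download a URL and decode the body as JSON (requires network access)."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def graph_to_csv(result, fields):
    """Flatten the objects in result["@graph"] into CSV text, one row per object.

    Assumption: search results are listed under the "@graph" key.
    Fields missing from an object are written as empty strings.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for item in result.get("@graph", []):
        writer.writerow({name: item.get(name, "") for name in fields})
    return buffer.getvalue()


# Same query as the example URL above.
BIOSAMPLE_URL = build_search_url(
    {"type": "biosample", "frame": "object", "limit": "all", "format": "json"}
)
```

Usage: graph_to_csv(fetch_json(BIOSAMPLE_URL), ["accession", "organism"]) would download all biosamples and return CSV text. The download with limit=all can be large, so it is worth trying a small limit value first.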
To help users understand the metadata in depth, ENCODE published an article in the journal Database, titled "Principles of metadata organization at the ENCODE data coordination center".
ENCODE metadata has the following six main units:
experiments
biosamples
libraries
antibodies
data files
pipelines
Understanding the definitions of these six units is the basis for understanding ENCODE metadata.
1, experiment
The experiment refers to one or more replicates that are grouped together along with the raw data files and processed data files. Each replicate that is part of an experiment will be performed using the same experimental method or assay (e.g. ChIP-seq). A single replicate, which can be designated as a biological or technical replicate, is linked to a specific library and an antibody used in immunoprecipitation-based assay (e.g. ChIP-seq). Since the library is derived from the biosample, the details of the biosample are affiliated with the replicate through the library used.
2, biosample
The biological material used as input material for an experimental assay. Metadata for the biosample includes the source of the material (such as a company name or a lab), how it was handled in the lab (such as number of passages or starting amounts) and any modifications to the biological material (such as the integration of a fusion gene or the application of a treatment).
3, library
The nucleic acid material that is extracted from the biosample and contains details of the experimental methods used to prepare that nucleic acid for sequencing. Details of the specific population or sub-population of nucleic acid (e.g. DNA, rRNA, nuclear RNA, etc.) and how this material is prepared for sequencing libraries are captured as metadata.
4, antibody
The metadata recorded for antibodies include the source of the antibody, as well as the product number and the specific lot of the antibody if acquired commercially. Capturing the antibody lot id is critical because there is potential for lot-to-lot variation in the specificity and sensitivity of an antibody. Antibody metadata include characterizations of the antibody performed by the labs, which examine the specificity and sensitivity of an antibody, as defined by the ENCODE consortium.
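The relationships in the four definitions above (an experiment groups replicates, each replicate points to a library and optionally an antibody, and each library is derived from a biosample) can be sketched as a small Python data model. This is only an illustration of the definitions. The class and attribute names are assumptions chosen for readability and are not ENCODE's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Biosample:
    accession: str
    source: str = ""           # e.g. a company name or a lab
    treatment: str = ""        # any modification applied to the material


@dataclass
class Library:
    accession: str
    biosample: Biosample       # the library is derived from a biosample
    nucleic_acid: str = ""     # e.g. "DNA", "nuclear RNA"


@dataclass
class Antibody:
    accession: str
    source: str = ""
    product_id: str = ""
    lot_id: str = ""           # lots can differ in specificity and sensitivity


@dataclass
class Replicate:
    library: Library
    kind: str = "biological"   # "biological" or "technical"
    antibody: Optional[Antibody] = None  # only for immunoprecipitation-based assays


@dataclass
class Experiment:
    accession: str
    assay: str                 # e.g. "ChIP-seq"
    replicates: List[Replicate] = field(default_factory=list)

    def biosample_accessions(self) -> List[str]:
        """Reach biosamples through replicate -> library -> biosample."""
        seen: List[str] = []
        for replicate in self.replicates:
            accession = replicate.library.biosample.accession
            if accession not in seen:
                seen.append(accession)
        return seen
```

The biosample_accessions method mirrors the last sentence of the experiment definition: biosample details reach a replicate only through the library it uses.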
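Exercise
How to retrieve human RNA-Seq data from ENCODE
First, visit https://www.encodeproject.org/search/?type=Experiment and make the following selections in two areas of the left sidebar:
Assay: select multiple options
Then choose Download. Following the instructions, you can easily download the metadata and the various datasets.
However, the metadata provided by ENCODE has one small flaw: it does not include the biosample accession. Using the URL-based metadata search described at the beginning of this post, you can link the file accession, the experiment accession and the biosample accession to one another.
The already-organized metadata can be downloaded here:
http://pan.baidu.com/s/1nv7emeP (see encode.rnaseq.metadata.csv inside the archive).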
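As a rough sketch of how the three accessions could be linked, the Python below walks one experiment record from file to replicate to library to biosample. The nested key names ("files", "replicate", "library", "biosample") are assumptions that follow the unit definitions above. The real JSON layout depends on the frame requested and should be checked against an actual response.

```python
import csv
import io


def _as_dict(value):
    """Return value if it is a dict, else an empty dict (e.g. for path strings)."""
    return value if isinstance(value, dict) else {}


def link_accessions(experiment):
    """Return (file, experiment, biosample) accession triples for one experiment.

    Assumed layout (hypothetical): experiment["files"][i]["replicate"]
    ["library"]["biosample"]["accession"]. Missing levels yield "".
    """
    rows = []
    experiment_accession = experiment.get("accession", "")
    for file_obj in experiment.get("files", []):
        file_obj = _as_dict(file_obj)
        replicate = _as_dict(file_obj.get("replicate"))
        library = _as_dict(replicate.get("library"))
        biosample = _as_dict(library.get("biosample"))
        rows.append(
            (
                file_obj.get("accession", ""),
                experiment_accession,
                biosample.get("accession", ""),
            )
        )
    return rows


def rows_to_csv(rows):
    """Write the triples as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file_accession", "experiment_accession", "biosample_accession"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Files whose replicate is not embedded (for example, given only as a path string) get an empty biosample accession, which makes the gaps easy to spot in the resulting table.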
Follow the WeChat public account "如何玩转生物大数据" for more content.