ChengyangWang's personal blog: http://blog.sciencenet.cn/u/ChengyangWang


ENCODE IDR pipeline Q&A: video transcript

Posted 2017-12-19 10:25 | Category: ENCODE


This article is reposted, with permission, from the Jiayin WeChat official account. For the latest articles, follow Jiayin on WeChat (ID: rainbow-genome).

By Chang Yuzhou, Jiayin

Below is the transcript that our meeting-video note-taker made for the previous session's video; the remarks in parentheses () are the editor's asides.

ChIP-seq data analysis basics + ENCODE pipeline Q&A video | Recruiting meeting-video note-takers



Listen when you have time; if you don't, skim the written transcript. The video below is the same as last session's, placed here so you can listen and read along.


(The opening is rather long-winded; I like to jump to the middle and start from the Q&A.)

So what I have to do here is so much easier than what Ben just did, and I am almost embarrassed to go back to slides, because the live demo is incredibly nerve-wracking. But we have been working on the RNA-seq pipeline. In the beginning we told you that we have implemented an RNA-seq processing pipeline. We have also implemented a ChIP-seq processing pipeline, for both transcription factors and for histone modifications, as well as a whole-genome bisulfite pipeline, and the DNase pipeline is coming.

So I am not going to do a live demo of the ChIP-seq pipeline, but I did want to tell you what the architecture of the pipeline is and make you aware that it exists. So if you are more interested, or also interested, in processing ChIP-seq data, you should be able to do that.

The deployment of the ChIP-seq pipeline is identical to the RNA-seq pipeline. It's on DNAnexus, and the same sort of process that you would go through to add data to the beginning of the RNA-seq pipeline applies exactly for ChIP-seq as well.

So the pipeline, like many of the pipelines that we deploy, has a mapping step, then a peak-calling step, and then a statistical framework that is applied to the replicated peaks at the end to assess the concordance of biological replicates.

All ENCODE experiments are replicated, and so this last piece, called IDR, is something that we run on all of the ENCODE experiments, because they are all replicated. If your experiments are not replicated, you cannot run this, but you can still call peaks.

Okay, so briefly, the ChIP-seq pipeline is used for both transcription factors and histone modifications. The mapping step is done with BWA. Duplicates are marked and removed. The peak-calling step for transcription factors uses SPP. The peak-calling step for histone modifications uses MACS2. MACS2 is also used to generate the signal tracks for both histone modifications and transcription factors, and I am going to tell you about the difference between the peak calls and the signal tracks in just a moment.
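(Editor's note: for readers who want to try the stages just described, here is a minimal Python sketch of them. All file names (ref.fa, rep1.fastq.gz, control.nodup.bam, picard.jar) are hypothetical placeholders, single-end reads are assumed, and the real ENCODE pipeline inserts filtering and QC between these stages; only the order of tools is taken from the talk. SPP, used for TF peak calling, is an R package and is not shown.)

```python
import subprocess

def run(cmd, stdout=None):
    """Run one pipeline stage, optionally redirecting stdout to a file."""
    print(" ".join(cmd))
    if stdout:
        with open(stdout, "wb") as fh:
            subprocess.run(cmd, check=True, stdout=fh)
    else:
        subprocess.run(cmd, check=True)

# 1. Mapping with BWA (single-end shown).
run(["bwa", "aln", "-t", "4", "ref.fa", "rep1.fastq.gz"], stdout="rep1.sai")
run(["bwa", "samse", "ref.fa", "rep1.sai", "rep1.fastq.gz"], stdout="rep1.sam")
run(["samtools", "sort", "-o", "rep1.bam", "rep1.sam"])

# 2. Duplicates are marked and removed (Picard MarkDuplicates shown).
run(["java", "-jar", "picard.jar", "MarkDuplicates",
     "I=rep1.bam", "O=rep1.nodup.bam", "M=rep1.dup_metrics.txt",
     "REMOVE_DUPLICATES=true"])

# 3. Histone peak calling with MACS2 (-B also emits the pileup bedGraphs).
run(["macs2", "callpeak", "-t", "rep1.nodup.bam", "-c", "control.nodup.bam",
     "-f", "BAM", "-g", "hs", "-n", "rep1_hist", "--broad", "-B"])

# 4. Signal track: MACS2 fold enrichment of the ChIP over the control.
run(["macs2", "bdgcmp", "-t", "rep1_hist_treat_pileup.bdg",
     "-c", "rep1_hist_control_lambda.bdg", "-m", "FE",
     "-o", "rep1_hist_FE.bdg"])
```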

(MACS2 really is that classic.)


And then, as I mentioned, there is a piece of software called IDR that is a statistical framework that allows assessment of concordance of two replicates.


So here is something we haven't talked about yet, but it is very important. An important advantage of deploying the pipelines in the way that we have is that we can generate all sorts of quality-assurance metrics.

So all ENCODE experiments have target read depths. They have target library complexities. They have data-quality goals that have to be met for those experiments to be accessioned and distributed as ENCODE products.


And so the calculation of those quality-assurance metrics is important to us, because that's how we figure out whether the experiments were any good or not.

But they will also be important for you when you run your experiments through these pipelines because you can compare your data to ENCODE data and see how it stacks up.

So I wanted to point you to resources for learning about the QC metrics rather than step through the math, which I probably couldn't do justice to anyway. We calculate four general categories of quality-assurance metrics for the ChIP-seq experiments, and some of these also apply to DNase experiments.

So of course we calculate the read depth, and there's an excellent paper that I refer you to here that talks about target read depths for ChIP-seq experiments, and what read depth you want to try to achieve for different histone modifications or different transcription factors, depending on how often they bind to the genome.

We also calculate some estimates of the complexity of the library that you sequenced; those are called the NRF (non-redundant fraction) and the PCR bottleneck coefficient, and they are documented in this paper here.
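(Editor's note: the two complexity metrics are simple to compute once reads are reduced to their mapped positions. Below is a minimal Python sketch under that assumption; the definitions (NRF as the fraction of distinct positions, PBC1 = M1/Mdistinct, PBC2 = M1/M2) follow ENCODE's QC documentation.)

```python
from collections import Counter

def complexity_metrics(read_positions):
    """read_positions: iterable of (chrom, strand, pos) for mapped reads."""
    counts = Counter(read_positions)
    total = sum(counts.values())                     # total mapped reads
    distinct = len(counts)                           # distinct positions
    m1 = sum(1 for c in counts.values() if c == 1)   # positions seen once
    m2 = sum(1 for c in counts.values() if c == 2)   # positions seen twice
    nrf = distinct / total
    pbc1 = m1 / distinct
    pbc2 = m1 / m2 if m2 else float("inf")
    return nrf, pbc1, pbc2

# Toy usage with hypothetical reads:
reads = [("chr1", "+", 100), ("chr1", "+", 100), ("chr1", "-", 250),
         ("chr2", "+", 42)]
print(complexity_metrics(reads))  # (0.75, 0.666..., 2.0)
```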

There's also a strand cross-correlation method, documented here, that we calculate as part of the pipeline; that's a measure of the quality of the ChIP itself. And then, as I mentioned, IDR is the way that we quantify the concordance between replicates.
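(Editor's note: the idea behind strand cross-correlation is that a true binding event produces a plus-strand read pileup and a minus-strand pileup offset by roughly the fragment length, so the correlation between the two strands' coverage peaks near that shift. A minimal numpy sketch on synthetic coverage, not real BAM data; the pipeline itself uses the SPP/phantompeakqualtools implementation.)

```python
import numpy as np

def cross_correlation(plus_cov, minus_cov, shifts):
    """Pearson correlation of plus-strand coverage against minus-strand
    coverage shifted back by d, for each candidate shift d."""
    ccs = []
    for d in shifts:
        a = plus_cov[:-d] if d else plus_cov
        b = minus_cov[d:] if d else minus_cov
        ccs.append(np.corrcoef(a, b)[0, 1])
    return ccs

# Synthetic example: minus-strand coverage lags plus-strand by ~50 bp.
rng = np.random.default_rng(0)
plus = rng.poisson(1.0, 10_000).astype(float)
minus = np.roll(plus, 50) + rng.poisson(0.2, 10_000)
shifts = range(0, 200, 10)
best = max(zip(cross_correlation(plus, minus, shifts), shifts))
print("peak shift (estimated fragment length):", best[1])  # ~50
```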

So rather than step through in great detail what all of these metrics are or what they mean, I just wanted to put this in the slide deck so that you can go back, look them up, and read about them if you want to. So you've already seen this, most of you I hope, but this is just what the histone ChIP-seq pipeline looks like running on DNAnexus. It looks just like the RNA-seq pipeline. There is a workflow that is composed of steps. Some of those steps run concurrently, some run by themselves, and others run at the end, and this is the display that you see when you run one of these pipelines to completion on the platform.

So I also wanted to just show you: after we run these pipelines, we, the DCC (Data Coordinating Center), of course accession all the output up at the ENCODE portal. And I thought I would just quickly show you what that looks like.

So, this is an experiment that has been run on DNAnexus and then accessioned back into the ENCODE portal. Here I am at www.encodeproject.org. This is the experiment page for an experiment that has been run through the pipeline. This is what Ur showed you yesterday: you can access the metadata that describes the experiment. But now, if we scroll down, we see the files that were generated by the pipeline.

Okay?

And this is a graphical representation of what just happened on DNAnexus. So you see files as yellow bubbles and software steps as blue rectangles. And you can follow the trajectory, if you want, that the raw data takes through the pipeline by following the arrows through this graph. So I won't step through it, but what I just want you to see is that on the portal, without going to DNAnexus at all, you can see exactly what the relationships are between the input files, the intermediate files that are generated, and the final output.

So that's what we are trying to depict on this graph here: the relationships between the files and the software steps that generated them.

Again! This is all on the ENCODE portal, so this is accessioned metadata about a processing pipeline that was run.

You can click on each one of these. I'm just going to click on this one in the middle at random and scroll down. And you will get additional metadata about that file, as well as a link to download it.

So what we are trying to accomplish on this page here is just to show you how the files were generated and give you direct access to them one by one.




Q & A


The first question:

So the question was: one of these says "signal over control", and what does that mean? And I will go back to the slides.

And so this is important if you care about ChIP-seq.

So what does the ChIP-seq pipeline produce? The pipeline actually creates a number of outputs. It generates peak calls, which are these blocks that you see on tracks. They have a definite start and a definite stop, and they are generated based on the raw signal.

We also generate these continuous tracks that you see on the browser, showing where the ChIP-seq signal was high and where it was low.

And all of those signal tracks are normalized to the control experiment for the ChIP. That could be input DNA, it could be a mock IP. But the signal tracks that you see output from the uniform pipelines are normalized to the controls. So if you see a positive-going trend in that track, you know it came from the experiment and not from the control.
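(Editor's note: "signal over control" just means the ChIP pileup expressed relative to the control pileup. A minimal per-bin sketch; the pseudocount is our own simplification, while MACS2's bdgcmp step in the pipeline does the real version with depth scaling and a local background lambda.)

```python
import numpy as np

def fold_enrichment(chip_pileup, control_pileup, pseudocount=1.0):
    """Per-bin fold enrichment of ChIP coverage over control coverage."""
    chip = np.asarray(chip_pileup, dtype=float)
    ctl = np.asarray(control_pileup, dtype=float)
    return (chip + pseudocount) / (ctl + pseudocount)

# A positive-going stretch (first bin) comes from the experiment, not the
# control: enrichment exceeds 1 only where ChIP exceeds control.
print(fold_enrichment([10, 2, 0], [2, 2, 0]))  # [3.67 1.   1.  ]
```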


Speaker: Did I answer your question?

Audience: No!

Speaker: Okay, no. I answered someone else's question, then.

So the question was actually about exactly how you input the control files into the pipeline. You didn't see anything like that for RNA-seq. The ChIP-seq pipeline takes these types of controls, and you add the fastqs from a control experiment in exactly the same way that you add the input fastqs from the experiment itself. So the input to the pipeline is fastq from your experiment and also fastq from the control.

Yeah. So typically we match controls to experiments. So if you have two replicates, you will also have two control replicates. However, the pipeline will run if you submit the same reads as both controls. We do a certain amount of read normalization between the two controls: if one control is very shallow and the other is very deep, for example, we will pool them and use that pooled control for both replicates.
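(Editor's note: a minimal sketch of that control-matching logic. The depth-ratio cutoff here is purely illustrative, not the pipeline's actual threshold.)

```python
def choose_controls(ctl1_reads, ctl2_reads, max_ratio=1.2):
    """Return the control read sets to use for (rep1, rep2).

    If the two controls' depths differ by more than max_ratio (an
    illustrative cutoff), pool them and use the pool for both replicates.
    """
    n1, n2 = len(ctl1_reads), len(ctl2_reads)
    if max(n1, n2) / min(n1, n2) > max_ratio:
        pooled = list(ctl1_reads) + list(ctl2_reads)
        return pooled, pooled
    return ctl1_reads, ctl2_reads
```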

The second question:

Audience: Do you have any way of aligning the ChIPs, aligning the peaks, so that we can say this is actually the same peak across different experiments?

Speaker: That is a good question. So from two replicates or from two different experiments?

Audience: Different experiments.

Speaker: From different experiments, no, we do not. So this actually brings up an important design criterion for all of our pipelines. Our pipelines really are designed to take one experiment, usually a replicated experiment, and produce a uniform output from that one experiment, output that is then comparable across many experiments. But that comparison across experiments is for you to do. Our pipelines really are designed to take the primary experiment data and process it into some sort of output that can be consumed by any analysis algorithm that you might want to apply to compare experiments. So most of our pipelines operate within a single replicated experiment.

The third question:

Audience: So with these kinds of pipelines, you cannot compare two different time points or two different samples. Is that because it is hard to do?

Speaker: Is it hard to do? That's not really why we didn't do it. It's really because our role as a data coordinating center is to give uniform output from each experiment that can then be used for subsequent analysis. So we would consider that a subsequent analysis, if you want. I'd be happy to take any other questions.

The fourth question:

Audience: Maybe to follow up on that. All of these things are being generated at different centers, with possibly different instruments, different flow cells, lanes, and all that. To sort of follow up on the question that was just asked: how do you normalize across all those things? It sounds like maybe you don't, that it is a downstream thing. But can you give us any idea how we would do that? Because those effects can be kind of significant.

(Here comes the batch-effect question.)

Speaker: That's a good question. So this is one of the reasons why we take primary reads and not, for example, mapped reads. We could build our pipelines to take BAM files, for example. But you might not map your reads in exactly the same way as we would have mapped them. That difference can actually propagate through to the end, and when you do your PCA, PC1 correlates with the lab, right, which is not what you want. But I think that's what you are concerned about.
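(Editor's note: the PC1-versus-lab check the speaker alludes to is easy to run yourself. A minimal scikit-learn sketch on a synthetic signal matrix; the matrix, the injected lab offset, and the lab labels are all hypothetical.)

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
signal = rng.normal(size=(8, 500))   # 8 samples x 500 genomic bins
signal[:4] += 0.5                    # lab-specific offset: a batch effect
labs = ["labA"] * 4 + ["labB"] * 4

pc1 = PCA(n_components=2).fit_transform(signal)[:, 0]
for lab, value in zip(labs, pc1):
    print(lab, round(value, 2))      # if PC1 tracks the lab, worry
```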

So what our experience has been, what we have found within the consortium, is this: there are working groups that set standards for how experiments are performed, and those are documented on the portal; Ur pointed that out yesterday. And what we have found is that if those guidelines are followed (for example, for ChIP, that the antibodies have been characterized to the same levels and that the ChIP is performed in the same way), then even for data from multiple locations run at different times, if you put the fastqs all into the same pipeline, the fastqs are comparable. However, what isn't necessarily comparable is the read depth or the libraries themselves. And that's what I was talking about with the QC metrics that we calculate. Those are definitely not uniform. They are not uniform within a lab, and neither are they uniform across labs. So that's one of the reasons why we generate all those QC metrics: they all should fall within target ranges in order for you to then be able to compare the data at the end.

So I didn't really give you a checklist that you might go down to ensure that an experiment you want to compare to ENCODE is comparable. But you definitely want to calculate the same QC metrics, by running your data through the pipeline, and compare those to other experiments that have been done within ENCODE. If they are very different, then it is unlikely that the results will be comparable. Thank you for that question.

The fifth question:

Audience: So maybe I missed it, but could you explain the step from the BAMs to the pseudoreplicates?

(How the two replicates are handled.)


Speaker: So I am going to give you the answer for histone ChIP first, because it is simpler. And the answer for TF, for transcription factor ChIP will be slightly different.

So for histone ChIP what happens here is we call peaks for each actual replicate.

Let's say an experiment has two replicates. We call peaks on replicate one, and we call peaks on replicate two. We take the reads from both of those replicates, we pool all of them, and we call peaks on the pool. So I've called peaks three times: on each of my true replicates, and on all the reads pooled together. That is different from concatenating the peak lists, right? It's actually an independent peak calling on the pooled reads. Then we back up. We take that set of pooled reads, we split it in half, and we call those pseudoreplicates. The reads are assigned at random, without replacement. So we split the pooled reads in half, and then we call peaks on each of those pseudoreplicates. Now I have called peaks on five read sets: true replicate one, true replicate two, the pool, pseudoreplicate one of the pool, and pseudoreplicate two of the pool. Five sets of reads we've called peaks on.
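(Editor's note: a minimal Python sketch of the pooling and pseudoreplication just described; reads are represented as plain list items, whereas the real pipeline does this on BAM records.)

```python
import random

def make_pseudoreplicates(rep1_reads, rep2_reads, seed=0):
    """Pool two replicates' reads, shuffle, and split in half at random
    without replacement, yielding (pool, pseudorep1, pseudorep2)."""
    pooled = list(rep1_reads) + list(rep2_reads)
    rng = random.Random(seed)
    rng.shuffle(pooled)
    half = len(pooled) // 2
    return pooled, pooled[:half], pooled[half:]

pool, psr1, psr2 = make_pseudoreplicates([f"r1_{i}" for i in range(6)],
                                         [f"r2_{i}" for i in range(6)])
# Peaks are then called on five read sets: rep1, rep2, pool, psr1, psr2.
```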

In the end, when we report the replicated peaks, which you'll see when you bring up the experiment page, you will see rep1 peaks, rep2 peaks, and then the replicated peaks. The replicated peaks are those which appear in both true replicates. That's good, right? You have replicated your peak; it is in both places. And if a peak doesn't, it has, as I said, a last chance to get into this set: if it appears in both of the pseudoreplicates of the pool, then that also qualifies as a replicated peak.
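(Editor's note: the rule just described, as a minimal sketch. Peaks are (chrom, start, end) tuples and any overlap counts as a match, which is our own simplification; the pipeline does this with bedtools-style overlaps on the pooled peak calls.)

```python
def overlaps(peak, peak_set):
    """True if peak (chrom, start, end) overlaps any peak in peak_set."""
    chrom, start, end = peak
    return any(c == chrom and s < end and start < e for c, s, e in peak_set)

def replicated_peaks(pooled, rep1, rep2, psr1, psr2):
    """Keep a pooled peak if it appears in both true replicates, or, as
    a last chance, in both pseudoreplicates of the pool."""
    return [p for p in pooled
            if (overlaps(p, rep1) and overlaps(p, rep2))
            or (overlaps(p, psr1) and overlaps(p, psr2))]
```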


So that's what is happening here when we pool the replicates and subsample the pool into pseudoreplicates, and we call peaks on all of those. All of this is done in order to generate the subsampled pools from which we decide whether peaks are in fact replicated. That is for histone.

For TF ChIP, where we actually run the full IDR protocol, there are additional pseudoreplicates that are generated. So pseudoreplicates of the true replicates are also generated and fed into the IDR framework. Those are not accessioned on the portal, so you will never see those files; they exist only within the pipeline. But they contribute to the IDR-thresholded peaks that you get in a TF ChIP experiment. So it is a subsampling and a pseudoreplication within the true replicates that is then run through this framework, in order to have an unbiased, quantitative way of determining whether a peak came from both replicates. I hope that was helpful.
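(Editor's note: the IDR framework is available as a standalone tool at https://github.com/kundajelab/idr. Here is a hedged sketch of one invocation, on one pair of replicate peak files; the file names are hypothetical, and the real pipeline runs several such comparisons, including the pseudoreplicate ones described above.)

```python
import subprocess

# Compare the two true replicates' peak calls, ranking by signal value;
# peaks passing the 0.05 soft IDR threshold are reported from the pooled
# peak list.
subprocess.run([
    "idr",
    "--samples", "rep1.narrowPeak", "rep2.narrowPeak",
    "--peak-list", "pooled.narrowPeak",
    "--input-file-type", "narrowPeak",
    "--rank", "signal.value",
    "--soft-idr-threshold", "0.05",
    "--output-file", "rep1_vs_rep2.idr.txt",
    "--plot",
], check=True)
```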

The sixth question:

The question was: for the replicated peaks, are the coordinates based on the pool or based on the true replicates? That is a great question. And yes, they are from the pool.

Okay, I am going to... this will not take long. I have shown you the graph of file relationships, and the only other thing that I wanted to show you is that each of these files is also available, not only through the graph in the way that I showed you, by clicking on an individual file, but also in a list of files down here at the bottom of the experiment page on the portal. So what I wanted to make clear through these slides is that we have spent a lot of time talking about this platform where we actually run the experiments on the cloud, but all of the results of those runs are distributed through the portal. So they could have been generated anywhere, I suppose. But we do in fact use these pipelines that we are sharing with you, and the results are accessioned and distributed through the portal.

So I think I will stop there, and now that Ben has had a chance to catch his breath, we will see if we can visualize the results of your pipeline.







https://blog.sciencenet.cn/blog-3372875-1090452.html
