volume 583, pages 711–719 (2020)
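What is ENCODE?
The ENCODE project, in full The Encyclopedia of DNA Elements, aims to build a comprehensive map of the functional elements in the human genome.
The map covers genes, biochemical regions associated with gene regulation, and transcript isoforms, among other content.
The ENCODE Encyclopedia currently integrates annotations for about 900,000 regulatory elements in the human genome and about 300,000 in the mouse genome. These annotations provide a valuable reference resource for the research community.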
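ENCODE is science, and also art
In the 21st century, genome projects on the scale of tens of thousands of individuals have begun to emerge around the world, and genomics-based precision medicine is developing rapidly.
From 2003 to 2020, ENCODE took 17 years to complete three phases. Each phase was a milestone, providing a rich theoretical and data foundation for later research on human genetic diseases and for drug development.
In the fourth phase of the ENCODE project, building on successive advances in technology, more cell types, organs and tissues will be characterized and annotated, opening new paths for life science and medical research.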
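ENCODE phase 3
On 29 July 2020, the ENCODE project released its phase 3 results, reporting more than 1.2 million candidate functional elements that regulate genes in human and mouse, and greatly expanding the data sets and related tools for RNA transcription, RNA-binding proteins, chromatin structure and modification, DNA methylation, and transcription factors.
Phase 3 extended the analysis of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and transcription factor and RNA-binding protein occupancy across a collection of cells and tissues, and generated 5,992 new experimental data sets.
The data and experimental methods from ENCODE phase 3 are now fully open to the public.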
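ENCODE-related articles
•1. Overview and summary of ENCODE phase 3
Summarizes the 5,992 new experimental data sets generated in ENCODE phase 3, including systematic assays of mouse fetal development.
•2. Occupancy maps of chromatin-associated proteins
Occupancy maps of 208 chromatin-associated proteins in one human cell type: reveals where chromatin-associated proteins sit on chromatin in a human cell type. Transcription factors are DNA-binding proteins that play a key role in gene regulation.
•3. Transcriptional programs of the major human cell types
A limited set of transcriptional programs define major cell types: dissects the transcriptional programs of the major cell types in the human body.
•4. An ENCODE resource "tailored" for cancer genomics
An integrative ENCODE resource for cancer genomics: introduces an integrated ENCODE database for cancer genomics.
•5. Binding and functional map of RNA-binding proteins
A large-scale binding and functional map of human RNA-binding proteins: presents a large-scale binding and functional map of RNA-binding proteins in the human genome.
The paper was published jointly by the groups of Gene W. Yeo (University of California, San Diego), Brenton R. Graveley (UConn Health), Christopher B. Burge (Massachusetts Institute of Technology), Eric Lécuyer (IRCM, Canada) and Xiang-Dong Fu (University of California, San Diego).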
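Gene W. Yeo
Mainly studies the role of RNA in regulating normal and tumour stem cell biology and neuronal differentiation, including computational genomics and the building of RNA, stem cell biology and disease models.
Brenton R. Graveley
Uses genomic approaches to study alternative splicing and other interesting post-transcriptional mechanisms of gene regulation.
Christopher Burge
Uses combined approaches to understand the regulatory patterns of pre-mRNA splicing and post-transcriptional gene regulation.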
Eric Lécuyer
RNA Biology, Developmental Genetics, Transcriptomics, Genomics
Xiang-Dong Fu
RNA processing and regulation; microRNA biogenesis and functions; Transcription and mRNA processing coupling; Functional genomics; Genetics and epigenetics regulation of gene expression.
volume 19, pages 327–341 (2018)
Figure 1: Functional crosstalk between proteins and RNA. a | An RNA-binding protein (RBP) can interact with RNA through defined RNA-binding domains to regulate RNA metabolism and function. b | Inversely, the RNA can bind to the RBP to affect its fate and function.
RBPs can form ribonucleoprotein complexes with RNA and regulate many aspects of gene expression.
Figure 2: Comparison of published RNA interactomes. We stringently curated and updated the annotations of RNA-binding proteins (RBPs) identified from various sources. a | Supersets of RBPs identified by the combination of different RBP detection studies in different cell lines and organisms. b–h | Venn diagrams showing the intersections between different RBP sets. b | Human RBPs. RNA interactome capture (RIC) with either conventional ultraviolet light crosslinking (cCL) or photoactivatable ribonucleoside-enhanced crosslinking (PAR-CL) was applied to cervical cancer (HeLa), embryonic kidney (HEK293), hepatocyte (HuH7) and myeloid leukaemia (K562) cell lines. HeLa cells were separately also subjected to RBDmap and RNPxl. The human data sets highly overlap, likely because of the prevalence of long-established cell lines as source material. c | Murine RBPs. RIC was applied to primary mouse embryonic fibroblasts (MEFs), mouse embryonic stem cells (mESCs) and macrophages (RAW264.7). HL-1 cardiomyocytes were subjected to both RIC and RBDmap. d | Budding yeast RBPs. Two studies used either in vitro protein arrays or oligo(dT) capture screens to identify RBPs, and three in vivo RNA interactomes were generated by using cCL or PAR-CL. RNPxl was used with two crosslinking approaches. The diversity of the applied technical approaches likely explains the disparity in coverage and overlap. e | Fruitfly RBPs. RIC was applied through the use of cCL, or both cCL and PAR-CL, on embryos undergoing the maternal-to-zygotic transition. Together with differences in mass spectrometry (MS) approaches, this likely underlies the moderate overlap. f | Plant RBPs. RIC was performed on different plant sources, including cell-suspension cultures and leaves, etiolated seedlings and leaf mesophyll protoplasts. Given the heterogeneous sources, the three data sets agree reasonably well with each other.
The lower RBP identification rates suggest ultraviolet crosslinking limitations, likely because of the presence of a cell wall and/or ultraviolet-absorbing pigments. g | Pairwise comparisons of InParanoid analysis clusters between humans, mice and yeast. h | UpSet plot showing the overlap between the human superset of RBPs and human orthologues. nRIC, nuclear RIC.
Figure 4: Modes of RNA binding. a | An RNA-binding protein (RBP) harbouring a classic RNA-binding domain such as the RNA recognition motif (RRM) can interact with high specificity with an RNA sequence in the context of a stem–loop. b | The eukaryotic translation initiation factor 4F (eIF4F) complex is composed of the cap-binding proteins eIF4E (4E) and eIF4G (4G) and the helicase eIF4A (4A). This complex associates with capped RNA in a sequence-independent manner to enable translation initiation. c | The exon junction complex (EJC) is deposited nonselectively on nascent transcripts by its interaction with the splicing factor CWC22 (complexed with CEF1 22) about 20 nucleotides upstream of the exon–exon junction, immediately following intron removal. d | The intrinsically disordered Arg–Gly–Gly (RGG) repeat motif of fragile X mental retardation protein (FMRP) co-folds with its target RNA, forming a tight electrostatic and shape-complementation-driven interaction. e | The internal ribosome entry site (IRES) of hepatitis C virus (HCV) interacts directly with the ribosome through a complex interaction mode that involves shape complementarity between the IRES and the 40S ribosome subunit. f | The long non-coding RNA nuclear enriched abundant transcript 1 (NEAT1) sequesters the RBPs non-POU domain-containing octamer-binding protein (NONO), paraspeckle component 1 (PSPC1) and splicing factor, proline- and glutamine-rich (SFPQ) to form paraspeckles. g | Interferon-induced, double-stranded RNA-activated protein kinase (PKR) binds to double-stranded RNA (dsRNA) derived from viral replication. Binding RNA promotes PKR dimerization, autophosphorylation and activation. Active PKR phosphorylates eIF2α to block protein synthesis in infected cells. h | Iron-regulatory protein 1 (IRP1) associates with an iron–sulfur cluster to catalyse the interconversion between citrate and isocitrate.
In conditions of low iron levels, the iron–sulfur cluster is no longer synthesized and IRP1 binds mRNAs that encode cellular factors involved in iron homeostasis, thereby regulating their fate. eIF4A3, eukaryotic initiation factor 4A-III; MAGOH, protein mago nashi homolog; Y14, RBP Y14.
Improved identification of RNA binding protein (RBP) targets by enhanced CrossLinking and ImmunoPrecipitation followed by high-throughput sequencing (eCLIP-seq)
(a) RBP-RNA interactions are stabilized with UV crosslinking, followed by limited RNase I digestion, immunoprecipitation of RBP-RNA complexes with a specific antibody of interest, and stringent washes. After dephosphorylation of RNA fragments, an “inline barcoded” RNA adapter is ligated to the 3′ end. After protein gel electrophoresis and nitrocellulose membrane transfer, a region 75 kDa (~220 nt of RNA) above the protein size is excised and proteinase K treated to isolate RNA. RNA is further prepared into paired-end high-throughput sequencing libraries, where read 1 begins with the inline barcode and read 2 begins with a random-mer sequence (added during the 3′ DNA adapter ligation) followed by sequence corresponding to the 5′ end of the original RNA fragment (which often marks reverse transcriptase termination at the crosslink site (red X)). (b) Bars indicate the number of reads remaining after processing steps. PCR duplicate reads that map to the same genomic position and have the same random-mer as previously considered reads are discarded, leaving only “Usable reads”. (c) Varying numbers of uniquely mapped reads were randomly sampled from RBFOX2 iCLIP and eCLIP experiments and PCR duplicate removal was performed. Points indicate the mean of 100 downsampling experiments (for all, s.e.m. is less than 0.1% of mean value). (d) RBFOX2 read density in reads per million usable (RPM). Shown are iCLIP, two biological replicates for eCLIP with paired size-matched input (SMInput) and IgG-only controls. CLIPper-identified clusters indicated as boxes below, with dark colored boxes indicating binding sites enriched above SMInput.
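Panel (b) of this caption describes discarding PCR duplicates: reads that map to the same genomic position and carry the same random-mer as an earlier read. A minimal Python sketch of that rule follows, together with the reads-per-million (RPM) scaling used in panel (d). The `Read` record, its field names and the example reads are hypothetical and only serve the illustration; this is not the eCLIP pipeline's actual code.

```python
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal record for one uniquely mapped read."""
    chrom: str
    start: int      # mapped position of the read
    strand: str
    randomer: str   # random-mer sequence attached during adapter ligation


def usable_reads(reads: Iterable[Read]) -> List[Read]:
    """Keep the first read seen for each (position, random-mer) combination."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.start, read.strand, read.randomer)
        if key in seen:
            continue  # same position and same random-mer: treated as a PCR duplicate
        seen.add(key)
        kept.append(read)
    return kept


def reads_per_million(count: int, total_usable: int) -> float:
    """Scale a raw read count to reads per million usable reads (RPM)."""
    return count * 1_000_000 / total_usable


example_reads = [
    Read("chr1", 100, "+", "ACGTT"),
    Read("chr1", 100, "+", "ACGTT"),  # duplicate of the first read
    Read("chr1", 100, "+", "GGCAT"),  # same position, different random-mer: kept
    Read("chr2", 500, "-", "ACGTT"),
]
kept = usable_reads(example_reads)
print(len(kept), "usable reads out of", len(example_reads))
```

Keying on the random-mer as well as the position is what lets two independent RNA fragments that happen to end at the same nucleotide both survive deduplication.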
Fig. 1: Overview of experiments and data types.
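Five assays targeting different aspects of the activity of 356 RBPs were analysed together, yielding 1,223 replicated data sets.
eCLIP yielded 223 data sets for 150 RBPs.
Knockdown RNA-seq (KD RNA-seq) yielded 472 profiles for 263 RBPs, covering RNA expression and splicing patterns from which function can be inferred.
RNA Bind-N-Seq (RBNS) deciphered the in vitro binding specificity of 78 RBPs.
Immunofluorescence mapped the subcellular localization of 274 RBPs.
ChIP-seq profiled the DNA association patterns of 39 RBPs.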
a, The five assays performed to characterize RBPs. b, Three hundred and fifty-six RBPs profiled by at least one ENCODE experiment (orange or red) with localization by immunofluorescence (green), essential genes from CRISPR screening (maroon), manually annotated RBP functions (blue or purple), and annotated protein domains (pink; RRM, KH, zinc finger, RNA helicase, RNase, double-stranded RNA binding (dsRBD), and pumilio/FBF domain (PUM-HD)). Histograms for each category are shown at bottom. c, Combinatorial expression and splicing regulation of PTBP3. Tracks indicate eCLIP and RNA-seq read density (reads per million). Tracks are shown for replicate 1; eCLIP and KD–RNA-seq were performed in biological duplicate with similar results. Bottom, alternatively spliced exon 2, with lines indicating junction-spanning reads and indicated per cent spliced in (ψ). Boxes indicate reproducible (by IDR) PTBP1 peaks, with red boxes indicating RBNS motifs for the PTB family member PTBP3 located within (or up to 50 bases upstream of) peaks.
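mRNA that includes PTBP3 exon 2 uses a different start codon, which increases the cytoplasmic localization of PTBP3 protein.
PTBP1 knockdown increases inclusion of PTBP3 exon 2. Consistent with previous studies, PTBP1 directly regulates PTBP3 splicing.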
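The change in exon 2 inclusion is reported as per cent spliced in (ψ), estimated from junction-spanning reads. The sketch below uses a common junction-count definition of ψ for a cassette exon; it is an assumption that this matches the estimator used in the paper, and the read counts are invented for illustration.

```python
def percent_spliced_in(inc_upstream: int, inc_downstream: int, exclusion: int) -> float:
    """Estimate psi for a cassette exon from junction-spanning read counts.

    inc_upstream / inc_downstream: reads spanning the two junctions that
    include the exon (for example exon 1-2 and exon 2-3).
    exclusion: reads spanning the junction that skips it (exon 1-3).
    """
    # Two junctions support inclusion but only one supports exclusion,
    # so the inclusion counts are averaged before comparing.
    inclusion = (inc_upstream + inc_downstream) / 2
    total = inclusion + exclusion
    if total == 0:
        raise ValueError("no junction-spanning reads for this exon")
    return inclusion / total


# Hypothetical counts: the exon is mostly skipped in control cells
# and mostly included after knockdown of its repressor.
psi_control = percent_spliced_in(10, 10, 90)
psi_knockdown = percent_spliced_in(60, 60, 40)
print(f"control psi = {psi_control:.2f}, knockdown psi = {psi_knockdown:.2f}")
```

A rise in ψ after knockdown, as in this toy example, is the pattern expected when the depleted protein normally represses inclusion of the exon.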
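Comment and question: the PTBP1 eCLIP results show that PTBP1 protein can bind exon 2 of PTBP3 and thereby block its inclusion, producing the isoform in which exon 1 is joined directly to exon 3. After PTBP1 is knocked down by shRNA, exon 2 is released, allowing a new isoform joining exons 1, 2 and 3 to form. Given that PTBP1 knockdown can induce transdifferentiation, could this newly produced isoform be playing the key role?
Discussion (XYM): That is possible. However, the PTBP1-knockdown transdifferentiation was done in other cell types, such as neural cells, whereas this experiment was done in a liver cancer cell line, so the situation may differ. We need to check whether PTBP3 is also regulated in the cells where transdifferentiation can be induced: if so, the new isoform may be responsible; if not, other effects are probably involved.
Discussion (LB): We also need to see whether PTBP3 is expressed in glial cells, and go back to Prof. Fu's 2013 Cell paper.
REST blocks the expression of neuronal genes and suppresses the production of miR-124. PTB is a target of miR-124, and PTB in turn prevents miR-124 from binding SCP1 and CoREST, both of which promote REST. Once miR-124 increases, PTB decreases and REST decreases with it, releasing a range of neuron-related genes.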
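a. When eCLIP peaks are overlaid on GENCODE transcript annotations, the peaks of most RBPs overlap specific regions, consistent with the previously identified functional roles of those RBPs.
b. A family-aware mapping strategy was developed to accurately quantify relative enrichment at multicopy elements, including gene families with multiple pseudogenes (such as ribosomal RNA or Y RNA), retrotransposons and other repetitive elements. For many eCLIP data sets, uniquely mapped reads account for only a minority of the total (Extended Data Fig. 2b, c).
c. With this approach, we observed clusters of RBPs dominated by rRNA or snRNA signal, consistent with their known functions, as well as clusters dominated by antisense Alu and L1/LINE signal (Fig. 2b, c; Extended Data Fig. 2d–f). This agrees with recent analyses indicating that binding to retrotransposon elements, particularly in the antisense orientation, makes up an underappreciated portion of total RBP binding (ref. 17).
d. For target RNAs whose overall expression differs by less than 5-fold between the two cell types, RBFOX2 eCLIP peaks found in HepG2 cells are usually also enriched in K562 cells.
e. Among the 73 RBPs with eCLIP data in both cell types, most peaks in genes with unchanged or significantly differential expression are enriched at least 4-fold in the second cell type, and the overlap is reproducible and significant.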
Fig. 3: Sequence-specific binding in vivo is determined predominantly by intrinsic RNA affinity of RBPs.
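a. The 5mers significantly enriched in vitro and in vivo agree in most cases, with clear overlap for 15 of 23 RBPs. The top RBNS 5mer of an RBP is also almost always enriched in its eCLIP peaks, with similar results in coding regions, introns and UTRs. For most RBPs, about half of the eCLIP peaks contain one of the top five RBNS 5mers.
b. The eCLIP-only motifs are usually G-, GC- or GU-rich. They may reflect other RNA-binding proteins that interact with the target RBP, or biases in co-purification or crosslinking position, or in the sequence near crosslink sites. C-rich 6mers are the most enriched, and G-rich 6mers are not enriched in RBNS.
c. eCLIP peaks that contain an RBNS motif are associated with stronger repression of exon skipping, about 25% greater on average.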
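The comparison in points a to c rests on asking which k-mers are over-represented in peak sequences relative to a background. The following Python sketch shows one simple way to compute such an enrichment ratio. The pseudocount scheme and the toy sequences are assumptions made for illustration, not the procedure used in the paper; the GCAUG motif in the toy peaks is the well-known RBFOX2 binding motif.

```python
from collections import Counter
from typing import Dict, Iterable


def kmer_counts(seqs: Iterable[str], k: int = 5) -> Counter:
    """Count every overlapping k-mer in a collection of sequences."""
    counts: Counter = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kmer_enrichment(peak_seqs: Iterable[str], background_seqs: Iterable[str],
                    k: int = 5, pseudocount: float = 1.0) -> Dict[str, float]:
    """Ratio of k-mer frequency in peaks to frequency in background.

    A pseudocount is added to every possible k-mer (4**k of them) so that
    k-mers absent from the background do not cause division by zero.
    """
    fg = kmer_counts(peak_seqs, k)
    bg = kmer_counts(background_seqs, k)
    fg_total = sum(fg.values()) + pseudocount * 4 ** k
    bg_total = sum(bg.values()) + pseudocount * 4 ** k
    return {
        kmer: ((fg[kmer] + pseudocount) / fg_total)
              / ((bg[kmer] + pseudocount) / bg_total)
        for kmer in fg
    }


# Toy sequences (hypothetical): every "peak" contains GCAUG.
peaks = ["AAUGCAUGUU", "CCUGCAUGAA", "GGUGCAUGCC"]
background = ["AAUCGAUCUU", "CCAGUCAGAA", "GGACUGACCC"]
enrichment = kmer_enrichment(peaks, background)
top = sorted(enrichment, key=enrichment.get, reverse=True)[:2]
print(top)
```

Running the same function on RBNS-bound reads versus input reads, and on eCLIP peaks versus matched background, would give two rankings whose top k-mers can then be compared.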
Fig. 4: Association between RBP binding and RNA expression upon knockdown.
a, Heatmap indicates significance of overlap between genes with regions that were significantly enriched (P ≤ 10⁻⁵ and ≥4-fold enriched in eCLIP versus input) and genes that were significantly (top) increased or (bottom) decreased (P < 0.05 and false discovery rate (FDR) < 0.05) in RBP knockdown RNA-seq experiments. Significance determined by two-sided Fisher’s exact test or Yates’ χ2 approximation where appropriate; *P < 0.05, **P < 10⁻⁵ after Bonferroni correction. Shown are all overlaps meeting a P < 0.05 threshold; see Extended Data Fig. for all comparisons. b, c, Lines indicate cumulative distribution plots of gene expression fold-change (knockdown versus control) for indicated categories of eCLIP enrichment of DDX6 in HepG2 cells (b), and IGF2BP3 in HepG2 cells (c). **P < 10⁻⁵, *P < 0.05; two-sided Kolmogorov–Smirnov test.
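For 4 RBPs, eCLIP enrichment was associated with increased gene expression after knockdown (by RNA-seq); for 7 RBPs, enrichment was associated with decreased expression.
Comment: knockdown of IGF2BP leads to decreased gene expression, which agrees with the mRNA-stabilizing function of IGF2BP reported in the NCB paper by Prof. Yang and Huilin.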
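The Fig. 4a caption tests whether genes bound by an RBP overlap with genes whose expression changes on knockdown, using a two-sided Fisher's exact test followed by Bonferroni correction. Below is a self-contained sketch of that calculation in pure Python. The function names and gene sets are hypothetical, and the two-sided p-value uses the usual "sum of tables no more likely than the observed one" convention, which is an assumption about the paper's exact implementation.

```python
from math import comb
from typing import Set, Tuple


def overlap_table(bound: Set[str], changed: Set[str],
                  universe: Set[str]) -> Tuple[int, int, int, int]:
    """2x2 contingency table for bound vs. changed genes within a universe."""
    a = len(bound & changed)             # bound and changed
    b = len(bound - changed)             # bound only
    c = len(changed - bound)             # changed only
    d = len(universe - bound - changed)  # neither
    return a, b, c, d


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))


def bonferroni(p: float, n_tests: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)


# Hypothetical gene sets, for illustration only.
universe = {f"gene{i}" for i in range(20)}
bound = {f"gene{i}" for i in range(0, 8)}
decreased = {f"gene{i}" for i in range(2, 9)}
table = overlap_table(bound, decreased, universe)
p = fisher_exact_two_sided(*table)
print(table, p, bonferroni(p, n_tests=10))
```

For large gene counts a chi-squared approximation is cheaper, which is presumably why the caption mentions Yates' approximation "where appropriate".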
Fig. 5: Integration of eCLIP and RNA-seq identifies splicing regulatory patterns.
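a. RBFOX2 is enriched in the downstream proximal intron, and RBFOX2 knockdown leads to exon exclusion, whereas PTBP1 is enriched in the upstream proximal intron, and its knockdown leads to exon inclusion.
b. Binding of SR proteins is generally associated with reduced cassette exon inclusion upon knockdown, whereas binding of hnRNP proteins is associated with increased cassette exon inclusion upon knockdown. This is consistent with the classical model in which SR and hnRNP proteins have antagonistic effects on splicing.
c. For alternative exons, enrichment is higher at the upstream 5′ splice site than in the intronic regions directly flanking the exon.
d. Potential co-regulation is apparent: QKI shows eCLIP enrichment at exons excluded upon RBFOX2 knockdown.
e. Compares the eCLIP signal of RBFOX2 and QKI in HepG2 cells.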
a, Normalized splicing maps of RBFOX2 and PTBP1 for skipped exons that were excluded (blue) or included (red) upon knockdown, relative to a set of ‘native’ skipped exons (nSEs) for which the inclusion rate was between 0.05 and 0.95 in controls. Lines indicate average eCLIP read density in IP versus input for indicated exon categories. Shaded area indicates 0.5th and 99.5th percentiles observed from 1,000 random samplings of native events. b, Heatmap indicates the difference between nSE-normalized eCLIP read density at skipped exons that were included (left) or excluded (right) upon RBP knockdown for all profiled HNRNP and SR proteins (see Extended Data Fig. for all RBPs). c, Lines indicate the average number of RBPs with eCLIP peaks at skipped (green) versus constitutive (grey) exons and flanking introns. Spliceosome machinery RBPs were excluded from this analysis. d, Heatmap indicates normalized eCLIP signal at RBFOX2 knockdown-excluded exons in HepG2 cells relative to nSEs for RBFOX2 (top) and all other RBPs within the same binding class and cell type (bottom). See Extended Data Fig. for all labels. e, Lines indicate normalized signal tracks for eCLIP replicates of RBFOX2 and QKI in downstream proximal introns. Black line, mean of 37 non-RBFOX2 data sets in the same binding class; grey, 10th to 90th percentiles.
Fig. 6: Chromatin association of RBPs and overlap with RNA binding.
a, Overlap between RBP ChIP–seq and DNase I hypersensitive sites and various histone marks in HepG2 and K562 cells. Labels indicate marks associated with regulatory regions (RE), promoters (TSS), enhancers (E), transcribed regions (T) and repressive regions (R). b, Heatmap indicates the Jaccard indexes between ChIP–seq peaks of different RBPs at promoter regions (bottom left) or non-promoter regions (top right) for all HepG2 ChIP–seq data sets. See Extended Data Fig. for all labels and Extended Data Fig. for K562 cells. c, Percentage of RBP eCLIP peaks overlapped by ChIP–seq peaks (red) and percentage of RBP ChIP–seq peaks overlapped by eCLIP peaks (green) for the same RBP. RBPs are sorted by decreasing level of overlapped ChIP–seq peaks. d, Clustering of overlapping chromatin- and RNA-binding activities of different RBPs at non-promoter regions in HepG2. Colour indicates the degree of ChIP enrichment at eCLIP peaks relative to surrounding regions. Significant enrichments (P ≤ 0.001 by two-sided Wilcoxon rank-sum test with no multiple comparison correction) are indicated by filled circles. e, Cross-RBP comparison of chromatin and RNA-binding activities in HepG2 cells. Left, ChIP–seq density of indicated RBPs around HNRNPK, PCBP2 or PCBP1 eCLIP peaks. Right, eCLIP average read density of indicated RBPs around HNRNPK, PCBP2 or PCBP1 eCLIP peaks.
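Panel b of this caption summarizes the similarity of ChIP-seq peak sets with Jaccard indexes. As a rough illustration, the Python sketch below computes a base-pair-level Jaccard index (bases covered by both sets divided by bases covered by either) for two lists of half-open intervals on a single chromosome. Whether the paper used a base-pair-level or peak-level definition is not stated in the caption, so that choice is an assumption, and the intervals are invented.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def merge_intervals(intervals: Iterable[Interval]) -> List[List[int]]:
    """Merge overlapping or touching intervals into disjoint ones."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bases(intervals: Iterable[Interval]) -> int:
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def jaccard_index(peaks_a: Iterable[Interval], peaks_b: Iterable[Interval]) -> float:
    """Bases covered by both peak sets divided by bases covered by either."""
    a, b = list(peaks_a), list(peaks_b)
    union = covered_bases(a + b)
    if union == 0:
        return 0.0
    # Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
    intersection = covered_bases(a) + covered_bases(b) - union
    return intersection / union


# Hypothetical peaks for two RBPs on the same chromosome.
rbp1_peaks = [(100, 200), (300, 400)]
rbp2_peaks = [(150, 250), (500, 600)]
print(round(jaccard_index(rbp1_peaks, rbp2_peaks), 3))
```

The index ranges from 0 (no shared bases) to 1 (identical coverage), which makes it convenient for the pairwise heatmap described in the caption.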
Fig. 7: Subcellular localization of RBPs and links to transcriptome binding and regulation.
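Localization of RBPs to nucleoli corresponds to eCLIP enrichment on 45S precursor rRNA and small nucleolar RNAs, localization to mitochondria corresponds to enrichment on mitochondrial RNAs, and localization to nuclear speckles corresponds to enrichment in proximal intronic regions, confirming that RBP localization is linked to RNA targets (Fig. 7b).
d. RBPs localized to mitochondria by immunofluorescence overlap strongly with RBPs showing significant eCLIP enrichment on mitochondrial RNAs from the heavy (H) strand (QKI, TBRG4), the light (L) strand (GRSF1, SUPV3L1), or both strands (FASTKD2, DHX30), and mitochondrial localization by immunofluorescence is generally associated with significantly increased eCLIP enrichment on mitochondrial RNAs (Fig. 7b–d).
Next, we focused on DHX30, which is essential for ribosome assembly and oxidative phosphorylation in mitochondria. DHX30 not only associates with many mitochondrial transcripts, consistent with previous RIP-seq data, but is also enriched at an unannotated H-strand region downstream of all annotated genes, a region with strong potential to form stem-loop structures (Fig. 7d).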
a, Examples of RBPs (green) co-localized with nine investigated markers (red). RBPs were imaged at five or more sites per co-labelling marker with twelve co-labelled markers in total, and representative images are shown. b, For localization patterns with known localized RNA classes, heatmap indicates significance (from one-sided Wilcoxon rank-sum test) comparing eCLIP relative information for the indicated RNA class (y-axis) for RBPs with versus without the indicated localization (x-axis). c, Bars indicate eCLIP relative information content (IP versus input) for mitochondria H-strand (grey) or L-strand (red). RBPs with mitochondrial localization in HepG2 cells are indicated in red. Inset shows immunofluorescence imaging for DHX30 (representative of ten sites imaged). d, Genome browser tracks indicate eCLIP relative information content along the mitochondrial genome (top) or a roughly 300-nt region for indicated RBPs (bottom). Inset shows RNA secondary structure prediction (RNAfold) for the indicated region. Tracks are shown for replicate 1; eCLIP and KD–RNA-seq were performed in biological duplicate with similar results.
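•In this work, the authors built a new data resource on the RNA-binding proteins that recognize RNA elements in the human genome, as part of ENCODE phase 3. RNA elements serve as binding sites for RNA-binding proteins and control post-transcriptional processes such as RNA splicing, mRNA editing, localization, stability and translation.
•The authors characterized a large number of RNA-binding proteins that recognize RNA elements in K562 and HepG2 cells, integrating five assays to determine their in vivo binding sites on RNA and chromatin, their in vitro binding preferences, the function of the binding sites, and their subcellular localization.
•These data expand the catalogue of functional elements encoded in the human genome and add to a global understanding of how RNA-binding proteins regulate human gene expression.
•Further analyses of other aspects of RNA regulation can be built on these data, such as microRNA processing, RNA editing, modifications such as pseudouridylation and m6A methylation, and translation efficiency.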