zhlingl的个人博客分享 http://blog.sciencenet.cn/u/zhlingl

博文

20200804=富集分析

已有 3557 次阅读 2020-8-5 09:49 |个人分类:文献阅读|系统分类:科研笔记| 功能富集

Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap

Nature Protocols: Published: 21 January 2019

Abstract

Pathway enrichment analysis helps researchers gain mechanistic insight into  gene lists generated from genome-scale (omics) experiments. This method  identifies biological pathways that are enriched in a gene list more  than would be expected by chance. We explain the procedures of pathway  enrichment analysis and present a practical step-by-step guide to help  interpret gene lists resulting from RNA-seq and genome-sequencing  experiments. The protocol comprises three major steps: definition of a  gene list from omics data, determination of statistically enriched  pathways, and visualization and interpretation of the results.协议包括三个主要步骤:从组学数据中定义基因列表,确定统计丰富的途径以及结果的可视化和解释。 We  describe how to use this protocol with published examples of  differentially expressed genes and mutated cancer genes; however, the  principles can be applied to diverse types of omics data. The protocol  describes innovative visualization techniques, provides comprehensive  background and troubleshooting guidelines, and uses freely available and frequently updated software, including g:Profiler, Gene Set Enrichment  Analysis (GSEA), Cytoscape and EnrichmentMap. The complete protocol can  be performed in ~4.5 h and is designed for use by biologists with no  prior bioinformatics training.

Introduction

Comprehensive quantification of DNA, RNA and proteins in biological samples is now  routine. The resulting data are growing exponentially, and their  analysis helps researchers discover novel biological functions,  genotype–phenotype relationships and disease mechanisms1,2. However, analysis and interpretation of these data represent a major  challenge for many researchers. Analyses often result in long lists of  genes that require an impractically large amount of manual literature  searching to interpret. A standard approach to addressing this problem  is pathway enrichment analysis, which summarizes the large gene list as a smaller list of more easily interpretable pathways. Pathways are  statistically tested for over-representation in the experimental gene  list relative to what is expected by chance, using several common  statistical tests that consider the number of genes detected in the  experiment, their relative ranking and the number of genes annotated to a pathway of interest. For instance, experimental data containing 40%  cell cycle genes are surprisingly enriched, given that only 8% of human  protein-coding genes are involved in this process. 例如,实验数据中包含40%细胞周期相关基因,而背景中只有8%的人类蛋白质编码基因参与该过程,说明这一过程在该实验数据中显著富集。

In a recent example, we used pathway enrichment analysis to help  identify histone and DNA methylation by the polycomb repressive complex  (PRC2) 多梳抑制复合物(PRC2) as the first rational therapeutic target for ependymoma 室管膜瘤, one of  the most prevalent childhood brain cancers3. This pathway is targetable by available drugs such as 5-azacytidine,  which was used on a compassionate basis in a terminally ill patient and  stopped rapid metastatic tumor growth3. In another example, we analyzed rare copy-number variants (CNVs) in  autism and identified several significant pathways affected by gene  deletions, whereas few significant hits were identified with  case–control association tests of single genes or loci4,5. These examples illustrate the useful insights into biological  mechanisms that can be achieved using pathway enrichment analysis.

Development of the protocol

This protocol covers  pathway enrichment analysis of large gene lists typically derived from  genome-scale (omics) technology. The protocol is intended for  experimental biologists who are interested in interpreting their omics  data. It requires only an ability to learn and use ‘point-and-click’  computer software, although advanced users can benefit from the  automatic analysis scripts we provide as Supplementary Protocols 14. We analyze previously published human gene expression and somatic mutation data as examples6,7,8; however, our conceptual framework is applicable to analysis of lists of genes or biomolecules from any organism derived from large-scale data,  including proteomics, genomics, epigenomics and gene-regulation studies. We extensively use pathway enrichment analysis for many projects and  have evaluated numerous available tools9,10,11,12. The software packages we cover here have been selected for their ease  of use, free access, advanced features, extensive documentation and  up-to-date databases, and they are ones we use daily in our research and recommend to collaborators and students. In addition, we have provided  feedback to the developers of these tools, allowing them to implement  features we have needed in published analyses. These tools are  g:Profiler13, GSEA14, Cytoscape15 and EnrichmentMap16, all freely available online:

Overview of the procedure

This section outlines the major stages of pathway enrichment analysis. A detailed step-by-step  protocol is provided in the Procedure below. Pathway enrichment analysis involves three major stages (Fig. 1; see Box 1 for basic definitions).

  1. 1Definition of a gene list of interest using omics data. An omics experiment comprehensively measures the activity of genes in  an experimental context. The resulting raw dataset generally requires  computational processing, such as normalization and scoring, to identify genes of interest, considering the experimental design. For example, a  list of genes differentially expressed between two groups of samples can be derived from RNA-seq data17. Gene lists derived from other types of omics experiments, such as gene expression microarrays18, quantitative proteomics19,20, germline and somatic genome sequencing21,22,23, and global DNA methylation assays24,25, can be used in this protocol; however, each type of data may require  specific pre-processing steps (see ‘Comparison to alternative methods’  section).

  2. Pathway enrichment analysis. A  statistical method is used to identify pathways enriched in the gene  list from stage 1, relative to what is expected by chance. All pathways  in a given database are tested for enrichment in the gene list (see Box 2 for a list of pathway databases). Several established pathway  enrichment analysis methods are available, and the choice of which to  use depends on the type of gene list (see ‘Comparison to alternative  methods’ section).

  3. Visualization and interpretation of pathway enrichment analysis results. Many enriched pathways can be identified in stage 2, often including  related versions of the same pathway. Visualization can help identify  the main biological themes and their relationships for in-depth study  and experimental evaluation.

Gene lists derived from diverse omics data undergo pathway enrichment  analysis, using g:Profiler or GSEA, to identify pathways that are  enriched in the experiment. Pathway enrichment analysis results are  visualized and interpreted in Cytoscape using its EnrichmentMap,  AutoAnnotate, WordCloud and clusterMaker2 applications. Protocol  overview is shown on the left, starting from gene list input, and  example outputs at each stage are shown on the right.

Stage 1: definition of a gene list of interest using omics data

Genome-scale experiments generate raw data that must be processed to  obtain gene-level information suitable for pathway enrichment analysis  (Supplementary Protocols 1 and 2). The specific processing steps are particular to the omics experiment  type and may be standard, and thus usually straightforward to implement, or not, in which case advanced computational skills may be needed for  data processing. Standard processing methods are available for  established omics technologies and are most conveniently performed by  the core facility that generates the data.

There are two major  ways to define a gene list from omics data: list or ranked list. Certain omics data naturally produce a gene list, such as all somatically  mutated genes in a tumor identified by exome sequencing, or all proteins that interact with a bait in a proteomics experiment. Such a list is  suitable for direct input into pathway enrichment analysis using  g:Profiler (Step 6A). Other omics data naturally produce ranked lists.  For example, a list of genes can be ranked by differential gene  expression score or sensitivity in a genome-wide CRISPR screen. Some  pathway enrichment analysis approaches analyze a ranked gene list  filtered by a particular threshold (e.g., FDR-adjusted P value  <0.05 and fold-change >2). Alternative approaches, such as GSEA,  are designed to analyze ranked lists of all available genes and do not  require a threshold. A whole-genome ranked list is suitable for input  into pathway enrichment analysis using GSEA (Step 6B). A partial  (non-whole-genome) ranked gene list should be analyzed using g:Profiler.

As an example, we describe the analysis of raw RNA-seq data from ovarian cancer samples to define a ranked gene list7. DNA sequence reads are quality filtered (e.g., by trimming to remove  low-quality bases) and mapped to a genome-wide reference set of  transcripts to enable counting of reads per transcript. Read counts are  aggregated at the gene level (counts per gene). Typically, RNA-seq data  for multiple biological replicates (three or more) for each of multiple  experimental conditions (two or more, e.g., treatment versus control)  are available. Read counts per gene are normalized across all samples to remove unwanted technical variation between samples, for example, due  to differences in sequencing lane or total read number per sequencing  run26,27,28. Next, read counts per gene are tested for differential expression  across sample groups (e.g., treatment versus control) (Supplementary  Protocols 1 and 2 for RNA-seq and microarray data, respectively). Software packages such as edgeR29, DESeq30, limma/voom31,32 and Cufflinks33 implement procedures for RNA-seq data normalization and differential  expression analysis. Differential gene expression analysis results  include: (i) the P value of the significance of differential expression; (ii) the related Q value (a.k.a. adjusted P value) that has been corrected for multiple testing across all genes,  for example, by using the Benjamini–Hochberg false-discovery rate  (BH-FDR) procedure34 (Box 3); (iii) effect size and direction of expression change so that  upregulated genes are positive and at the top of the list and  downregulated genes are negative and at the bottom of the list, often  expressed as log-transformed fold-change. The gene list is then ranked  by one or more of these values (e.g., −log10 P value multiplied by the sign of log-transformed fold-change) and studied using pathway enrichment analysis.

Stage 2A: pathway enrichment analysis of a gene list using g:Profiler (Step 6A)

The default analysis implemented in g:Profiler and similar web-based tools35,36,37,38 searches for pathways whose genes are significantly enriched (i.e.,  over-represented) in the fixed list of genes of interest, as compared to all genes in the genome (Step 6A)(Box 4). The P value of the enrichment of a pathway is computed using a Fisher’s exact test and multiple-test correction is applied (Box 3).

g:Profiler also includes an ordered enrichment test, which is suitable for lists  of up to a few thousand genes that are ordered by a score, whereas the  rest of the genes in the genome lack meaningful signal for ranking. For  example, significantly mutated genes may be ranked by a score from a  cancer driver prediction method6. This analysis repeats a modified Fisher’s exact test on incrementally  larger sub-lists of the input genes and reports the sub-list with the  strongest enrichment P value for each pathway39. g:Profiler searches a collection of gene sets representing Gene  Ontology (GO) terms, pathways, networks, regulatory motifs, and disease  phenotypes. Major categories of gene sets can be selected to customize  the search.

Pathway enrichment methods that use the Fisher’s exact test or related tests require the definition of background genes for  comparison. All annotated protein-coding genes are often used as  default. This leads to inappropriate inflation of P values and  false-positive results if the experiment can directly measure only a  subset of all genes. For example, setting a custom background is  important in analyzing data from targeted sequencing or  phosphoproteomics experiments. The appropriate custom background would  include all genes in the sequencing panel or all known phosphoproteins,  respectively.

Stage 2B: pathway enrichment analysis of a ranked gene list using GSEA (Step 6B)

Pathway enrichment analysis of a ranked gene list is implemented in the GSEA software14 (Step 6B) (Box 4). GSEA is a threshold-free method that analyzes all genes on the basis of their differential expression rank, or other score, without prior gene  filtering. GSEA is particularly suitable and is recommended when ranks  are available for all or most of the genes in the genome (e.g., for  RNA-seq data). However, it is not suitable when only a small portion of  genes have ranks available, for example, in an experiment that  identifies significantly mutated cancer genes (Stage 2A; Step 6A).

GSEA searches for pathways whose genes are enriched at the top or  bottom of the ranked gene list, more so than expected by chance alone.  For instance, if the topmost differentially expressed genes are involved in the cell cycle, this suggests that the cell cycle pathway is  regulated in the experiment. By contrast, the cell cycle pathway is  probably not significantly regulated if the cell cycle genes appear  randomly scattered through the whole ranked list. 例如,如果最显著差异表达的基因参与细胞周期,则表明该细胞周期途径在实验中受到调控。 相比之下,如果细胞周期基因随机出现在整个排名列表中,则细胞周期途径可能不会受到明显调节。 To calculate an  enrichment score (ES) for a pathway, GSEA progressively examines genes  from the top to the bottom of the ranked list, increasing the ES if a  gene is part of the pathway and decreasing the score otherwise. These  running sum values are weighted, so that enrichment in the very top-  (and bottom-) ranking genes is amplified, whereas enrichment in genes  with more moderate ranks are not amplified. The ES score is calculated  as the maximum value of the running sum and normalized relative to  pathway size, resulting in a normalized enrichment score (NES) that  reflects the enrichment of the pathway in the list. Positive and  negative NES values represent enrichment at the top and bottom of the  list, respectively. Finally, a permutation-based P value is computed and corrected for multiple testing to produce a permutation-based false-discovery rate (FDR) Q value that ranges from 0 (highly significant) to 1 (not significant) (Box 3). The same analysis is performed starting from the bottom of the ranked  gene list to identify pathways enriched in the bottom of the list.  Resulting pathways are selected using the FDR Q value threshold (e.g., Q < 0.05) and ranked using NES. In addition, the ‘leading edge’ aspect of the GSEA analysis identifies specific genes that most strongly  contribute to the detected enrichment signal of a pathway.

GSEA has two methods for determining the statistical significance (P value) of the ES: gene set permutation and phenotype permutation. The  gene set permutation test requires a ranked list, and GSEA compares the  observed pathway ES to a distribution of scores obtained by repeating  the analysis with randomly sampled gene sets of matching sizes (e.g.,  1,000 times). The phenotype permutation test requires expression data  for all samples (e.g., biological replicates), along with a definition  of sample groups called ‘phenotypes’ that are compared with each other  (e.g., cases versus controls; tumor versus. normal samples). The  observed pathway ES is compared to a distribution of scores obtained by  randomly shuffling the samples among phenotype categories and repeating  the analysis (e.g., 1,000 times), including computation of the ranked  gene list and resulting pathway ES. Gene set permutation is recommended  for studies with limited variability and biological replicates (i.e.,  two to five per condition). In this case, differential gene expression  values should be computed outside of GSEA, using methods that include  variance stabilization (such as edgeR29, DESeq30 and limma/voom31,32) and imported into the GSEA software before pathway analysis. Phenotype  permutation should be used with a larger number of replicates (e.g., at  least ten per condition). The main advantage of the phenotype  permutation approach is that it maintains the structure of gene sets  with biologically important gene correlations during permutation, in  contrast to the gene set permutation approach. This protocol covers only gene set permutation because it is appropriate for the most common use  case of pathway enrichment analysis. Phenotype permutation is  computationally expensive and, for the current version of GSEA, requires custom programming to compute ESs and differential expression  statistics separately for thousands of phenotype randomizations. For  advanced users, we provide a supplementary protocol for this procedure  (Supplementary Protocol 4).

By default, the GSEA desktop software searches the MSigDB gene set  database, which includes pathways, published gene signatures, microRNA  target genes and other gene set types (Box 2). The user can also provide a custom database as a text-based GMT (Gene  Matrix Transposed) file in which each line defines a pathway, with its  name, identifier and a list of genes it contains. Gene identifiers in  the GMT file must match those in the input gene list.

Stage 3: visualization and interpretation of pathway enrichment analysis results (Steps 7–13)

Pathway information is inherently redundant, as genes often participate in  multiple pathways, and databases may organize pathways hierarchically by including general and specific pathways with many shared genes (e.g.,  ‘cell cycle’ and ‘M-phase of cell cycle’). Consequently, pathway  enrichment analysis often highlights several versions of the same  pathway. Collapsing redundant pathways into a single biological theme  simplifies interpretation. We recommend addressing such redundancy with  visualization methods such as EnrichmentMap16, ClueGO40 and others41,42,43. An ‘enrichment map’ is a network visualization that represents overlaps among enriched pathways (Fig. 1), whereas ‘EnrichmentMap’ refers to the Cytoscape application that  creates the visualization. Pathways are shown as circles (nodes) that  are connected with lines (edges) if the pathways share many genes. Nodes are colored by ES, and edges are sized on the basis of the number of  genes shared by the connected pathways. Network layout and clustering  algorithms automatically group similar pathways into major biological  themes. The EnrichmentMap software takes as input a text file containing pathway enrichment analysis results and another text file containing  the pathway gene sets used in the original enrichment analysis.  Interactive exploration of pathway ES (filtering nodes) and connections  between pathways (filtering edges) is possible (Step 9A(xii and xiii)  and 9B(xiii and xiv)). Multiple enrichment analysis results can be  simultaneously visualized in a single enrichment map, in which case  different colors are used on the nodes for each enrichment. If the gene  expression data are optionally loaded, clicking on a pathway node will  display a gene expression heat map of all genes in the pathway.

An enrichment map helps identify interesting pathways and themes. First,  expected themes should be identified to help validate the pathway  enrichment analysis results (positive controls). For instance,  growth-related pathways and other hallmarks of cancer44 are expected to be identified in analyses of cancer genomics datasets.  Second, pathways not previously associated with the experimental context are evaluated more carefully as potential discoveries. Pathways and  themes with the strongest ESs should be studied first, followed by  progressively weaker signals (Step 12). Third, interesting pathways are  examined in more detail, examining genes within the pathways (e.g.,  expression heat maps and the GSEA leading edge genes). Further, gene  expression values can be overlaid on a pathway diagram, if available,  from databases such as Pathway Commons45, Reactome46, KEGG47 or WikiPathways48, using tools such as PathVisio49. If a diagram is not available, tools such as STRING50 or GeneMANIA51 can be used with Cytoscape15 to define an interaction network among pathway genes for expression  overlay. This helps in visual identification of the pathway components  (e.g., single genes or entire signaling cascades) that are the most  altered (e.g., differentially expressed) in the experiment. In addition, master regulators for enriched pathways can be searched for by  integrating gene sets of miRNA52 or transcription factor53 targets using the EnrichmentMap post-analysis tool. Finally, pathway  enrichment analysis results can be published to support a scientific  conclusion (e.g., functional differences of two cancer subtypes), or  used for hypothesis generation or planning of experiments to support the identification of novel pathways. More pathway enrichment analysis  examples and deeper explanation of core concepts is provided at http://www.pathwaycommons.org/guide/.

Advantages and limitations

Pathway enrichment  analysis of omics data has several advantages as compared to analysis of single genes, transcripts or proteins. First, it improves statistical  power in two ways: (i) it aggregates counts of mutations across all the  genes and genomic regions involved in the given cell mechanism,  providing a higher number of counts, which makes statistical analyses  more reliable; and (ii) it reduces the dimensionality from tens of  thousands of genes or millions of genomic regions (e.g., SNPs) to a much smaller number of ‘systems’ or ‘pathways’, thereby reducing the cost of multiple hypothesis testing. Second, results are often easier to  interpret because the analysis is phrased at the level of familiar  concepts such as ‘cell cycle’. Third, the approach can help identify  potential causal mechanisms and drug targets. Fourth, results obtained  from related, but different, data may be more comparable because results are projected onto a smaller, shared feature space (i.e., a limited  number of pathways). Fifth, the approach facilitates integration of  diverse data types, such as genomics, transcriptomics and proteomics,  which can all be mapped to the same pathways. Thus, projecting disease  data onto known mechanisms increases statistical and interpretative  power.

The following limitations are important to consider when  interpreting pathway enrichment analysis results, in general, including  those covered by this protocol. Additional limitations apply, depending  on the omics data type (see ‘Application to diverse omics data’  section). Advantages and disadvantages of specific and alternative  pathway enrichment analysis methods are presented in the ‘Comparison to  alternative methods’ section.

  • Enrichment analysis is more effective for  pathways in which multiple genes have strong biological signals (e.g.,  differential expression). For instance, in a transcriptomics experiment, we assume that evolution has optimized a cell to express a pathway only when needed, and that pathway activation or deactivation can be  identified as coordinated activity of many genes in a pathway. Pathways  in which activity is controlled by only a few genes or not controlled by gene expression (e.g., by post-translational regulation) will never be  observed as enriched. 如果同路的变化不是通过基因表达反映出来的,则无法评估 Some pathway analysis methods address this by  using activating and inhibiting gene interactions to construct  quantitative models of pathway activity that include genes that are not  differentially expressed yet are still important regulators. However,  these methods require pathway models with detailed biochemical and  regulatory gene interactions that are obtained through focused  experiments and thus are in limited supply (Box 2).

  • Pathway boundaries tend to be arbitrary, and  different databases will disagree about which genes are involved in a  given pathway. By using multiple databases, multiple pathway definitions can be analyzed, and some may be better than others at explaining the  experimental data.

  • Some pathway enrichment methods, such as those  based on the Fisher’s exact test, are statistically more likely to  identify larger pathways as significant. Users can address this  limitation by selecting an upper limit for the size of the gene sets  considered in the analysis.

  • Multi-functional genes that are highly ranked  in the gene list may lead to enrichment of many different pathways, some of which are not relevant to the experiment54. Repeating the analysis after excluding such genes may reveal pathways  whose enrichment is overly dependent on their presence or confirm the  robustness of pathway enrichment.

  • Pathway databases, and therefore enrichment  results, are biased toward well-known pathways. In fact, pathway  enrichment analysis ignores genes with no pathway annotations, sometimes called the ‘dark matter of the genome’, and these genes should be  studied separately. For example, non-coding RNA genes currently lack  systematic annotations and are not directly usable in pathway enrichment analysis. ncRNA没有得到很好的注释,因此不能用于功能富集分析

  • Most enrichment analysis methods make  unrealistic assumptions of statistical independence among genes as well  as pathways. Some genes may always be co-expressed (e.g., genes within a protein complex), and some pathways have genes in common. Thus,  standard FDRs, which assume statistical independence between tests, are  often either more or less conservative than ideal. Nonetheless, they  should still be used to adjust for multiple testing and rank enriched  pathways for exploratory analysis and hypothesis generation. Custom  permutation tests may lead to better estimates of false discovery (see  ‘Comparison to alternative methods’ section).

Experimental design

Pathway enrichment analysis  benefits greatly from careful experimental design. Otherwise, the  analysis may reveal apparently meaningful results caused by experimental biases or other confounders. This section covers a range of  experimental aspects that must be considered before performing this  protocol.

Experimental conditions

The  experimental conditions must be well defined such that the major  variations observed are responses that the experimenter would like to  monitor and that are related to the biological question of interest  (e.g., tumor versus normal, treated versus untreated, comparison of four disease subtypes, time series).

Number of replicates

Biological replicates are independently processed samples obtained from distinct  organisms or cell lines that are required for measuring variability  across samples and to compute statistical significance (P  values). Lack of replication (i.e., one sample per group) will not  permit robust estimation of the significance of the signal. Insufficient replication may result in lack of signal in the data (e.g., no  significantly differentially expressed genes). The larger the variation  in the set of samples, the more biological replicates are needed to  accurately measure the signal. For systems with lower variability (i.e., model organisms with the same genetic background in controlled  laboratory conditions, or stable cell lines derived from the same  clone), at least three to four biological replicates are recommended per condition for differential analysis with variance stabilization  normalization. Variance stabilization uses a global statistical model to ‘stabilize’ gene-wise variance estimates to reduce inaccuracies  resulting from few replicates. For experiments with higher variability  (e.g., tumor samples), more replicates are required; ideally, a pilot  experiment followed by formal statistical power calculations55 (sometimes called sensitivity testing) should be used to determine the  minimal number of replicates required to identify the signal of  differentially expressed genes or enriched pathways. Technical  replicates comprising repeated experiments of the same samples are  usually not needed for well-established experimental techniques, such as RNA-seq, that have low technical variability, but can be helpful for  novel techniques.

Confounding factors

Differences in factors not related to the experimental question should be avoided  or at least balanced across conditions so that statistical techniques  such as generalized linear models can correct for each factor. Common  factors include sequencing batch, nucleic acid extraction protocol,  subject age and many others. Otherwise, it may be impossible to  accurately separate the experimental signals coming from the  experimental response from the confounding factors. Knowing important  factors in advance supports correct experimental design. Statistical  exploratory analyses such as clustering or principal component analysis  (PCA) can help identify unknown factors. For example, cases and controls are expected to cluster separately and not by processing batch.

Outliers

Outlier samples may considerably differ from others because of major  experimental or technical problems, such as contamination or sample  mix-up. Alternatively, they may present extreme biological features,  such as tumor samples with exceptionally aggressive phenotypes. 它们可能表现出极端的生物学特征,例如具有异常侵袭性表型的肿瘤样品。 Unbiased identification of outlier samples is possible using statistical  techniques such as PCA or clustering. Pathway enrichment analysis should be performed with and without outliers to ensure robust results.  Systematic removal of outliers may be justified to reduce variability in the experiment.

Experimental sensitivity

Some  experimental methods can be tuned to be more or less sensitive. For  instance, the number of reads in RNA-seq experiments influences  downstream analysis. For quantifying gene expression in a biological  system with modest variability and testing differential expression with  variance stabilization, at least three to five replicates and 10 million mapped reads are required56. Substantially greater sequencing depth, such as 50–100 million mapped  reads, is required to investigate splice isoforms, to detect poorly  expressed genes or for samples with complex cellular mixtures such as  surgical resection specimens.

Choice of pathway gene set database

We recommend searching enrichment of pathway gene sets only at first, as  these capture familiar normal cellular processes that are easy to  interpret. GO57 biological process terms and manually curated molecular pathways from Reactome58, Panther38, HumanCyc59 and NetPath60 are good resources for human pathways (Box 2). GO biological process annotations include a mix of manually curated and electronically inferred sources.

Filtering GO pathways by evidence code

A large fraction of gene annotations in GO originate from automatic data  analysis and are not verified by human curators. These have the evidence code ‘inferred from electronic annotation’ (IEA). Earlier literature  has cautioned against analyzing and interpreting IEA-labeled annotations61, whereas more recent studies suggest that these are often as reliable as annotations assigned by human curators62. For high-confidence analyses of data from human and common model  organisms that have many manually curated annotations, we generally  recommend comparing versions of the analysis with and without filtering  of IEA annotations to verify robustness. However, IEA annotations make  up the majority of information in less-well-studied species and should  be used by default in these cases. Removing IEA-coded annotations may  bias the analysis toward well-studied biological processes.

Use of non-pathway gene sets

Different types of gene sets help answer a variety of questions. For instance,  non-pathway gene sets corresponding to microRNA and transcription factor targets can be used to discover important regulators52,53. However, simultaneously analyzing all available types of gene sets  reduces data interpretability. It may also lead to false negatives, as  the increased number of conducted tests increases the effect of multiple testing correction and reduces the multiple-test adjusted significance  of individual pathways. We therefore recommend performing the analysis  of non-pathway and pathway gene sets separately.

Gene set size considerations

It is often beneficial to exclude numerous small pathways because they are redundant with larger pathways and complicate interpretation, and their abundance makes multiple-testing correction more stringent. Large  pathways should be also excluded, as these are overly general (e.g.,  ‘metabolism’), they do not contribute to interpretability of results,  and their statistical significance can be inflated when using certain  statistical enrichment methods (e.g., Fisher’s exact test). For  analyzing human gene expression data, we often recommend excluding  pathway gene sets with <10–15 genes and >200–500 genes, although  upper bounds of 200–2,000 genes can be found in the literature. However, for non-human organisms and other types of gene sets, which may have  different gene set size distributions, larger sets may need to be  included. Filtering of pathways depends on the experimental context, as  different areas of biology have variable coverage in pathway databases.  One can determine the lower and upper bounds of pathway size by  examining the sizes of several pathways of interest that are expected to be relevant to the experiment.

Importance of using updated pathway gene sets

Pathway enrichment analysis depends on gene sets and databases used in the  analysis, and many recent studies using pathway enrichment analysis are  strongly impacted by outdated resources11. For improved reproducibility and transparency of research,  investigators should report in publications the analysis date and  versions of pathway enrichment analysis software and gene set databases  used, as well as all analysis parameters. In addition to enrichment  maps, authors should consider adding their studied gene lists and  complete tables of enriched pathways as supplementary information.

Choice of gene identifier

Genes are associated with many diverse database identifiers (IDs). We  recommend using unambiguous, unique and stable IDs, as some IDs become  obsolete over time. For human genes, we recommend using the Entrez Gene  database IDs (e.g., 4193 corresponds to MDM2) or gene symbols (MDM2 is the official symbol recommended by the HUGO Gene Nomenclature  Committee). As gene symbols change over time, we recommend maintaining  both gene symbols and Entrez Gene IDs. The g:Profiler and related  g:Convert tools support automatic conversion of multiple ID types to  standard IDs.

Unexpected pathway results and experimental design

Unexpected biological themes revealed in a pathway analysis may indicate problems  with experimental design, data generation or analysis. For example,  enrichment of the apoptosis pathway may indicate a problem with the  experimental protocol that led to increased cell death during sample  preparation. In these cases, the experimental design and data generation should be carefully reviewed before further data interpretation.

Application to diverse omics data

This protocol uses RNA-seq data7 and somatic mutation data6 as examples because these data types are frequently encountered.  However, the general concepts of pathway enrichment analysis that we  present are applicable to many types of experiments that can generate  lists of genes, such as single-cell transcriptomics, CNVs5, proteomics63, phosphoproteomics64, DNA methylation65 and metabolomics66. Most data types require protocol modifications, which we only briefly  discuss here. With certain data types, specialized computational methods are required to produce a gene list that is appropriate for pathway  enrichment analysis, whereas with other data types, a specialized  pathway enrichment analysis technique is required. Issues specific to  data types and experimental methods must be considered, including:

  • Different gene identifiers are recommended for certain data types. We recommend UniProt accession numbers for proteins (e.g., Q00987 for MDM2) and Human Metabolome Database IDs for metabolites (e.g., ATP is denoted as HMDB00538).

  • Certain types of omics experiments by design  capture only a subset of genes or proteins. To address this limited  coverage, pathway enrichment analysis must define a custom background  gene set of the genes that can be measured in the experiment. For  example, phosphoproteomics experiments measure only proteins with one or more phosphorylation sites and thus must use the set of genes encoding  phosphoproteins as the custom background gene set. 例如,磷酸蛋白质组学实验仅测量具有一个或多个磷酸化位点的蛋白质,因此必须使用编码磷蛋白的基因集作为定制背景基因集。 Otherwise, pathway  enrichment analysis would reveal inflated P values for general processes such as kinase signaling and protein phosphorylation.

  • Pathway enrichment analysis of short non-coding genomic regions such as transcription factor binding sites from  ChIP-seq experiments need additional consideration. Genomic regions must be mapped to protein-coding genes and corrected for biases such as  increased signal in longer genes. Tools such as GREAT67 automatically perform both tasks.

  • Large genomic intervals that span multiple  genes (e.g., from genome-wide associations, CNV and differentially  methylated regions) require specialized enrichment tests such as the  PLINK CNV gene set burden test68 or INRICH69. Standard enrichment tests often reveal genes clustered in the genome  whose signals are strongly statistically inflated because each gene is  incorrectly counted as an independent signal. 标准富集测试通常会揭示基因组中聚集的基因,其信号在统计上被强烈夸大,因为每个基因被错误地视为独立信号。 Gene types that are  correlated with genomic position include olfactory receptors, histones,  major histocompatibility complex (MHC) members and homeobox  transcription factors. A simple solution to address genomic clustering  of genes in a pathway involves selecting only one representative gene  from each functionally homogeneous genomic cluster before enrichment  analysis.

  • For rare genetic variants, case–control pathway ‘burden’ tests are the most appropriate pathway enrichment analysis  method (see ‘Comparison to alternative’ methods section).

Comparison to alternative methods

Pathway enrichment analysis methods

This protocol recommends the use of g:Profiler and GSEA software for pathway enrichment analysis. g:Profiler13,39 analyzes gene lists using Fisher’s exact test and ordered gene lists  using a modified Fisher’s test. It provides a graphical web interface  and access via R and Python programming languages. The software is  frequently updated, and the gene set database can be downloaded as a GMT file (http://biit.cs.ut.ee/gprofiler). GSEA14 analyzes ranked gene lists using a permutation-based test. GSEA分析排名的基因列表基于置换的检验。 The software typically runs as a desktop application (http://software.broadinstitute.org/gsea). Hundreds of pathway enrichment analysis tools exist (reviewed in ref. 70), although many rely on out-of-date pathway databases or lack unique  features as compared to the most commonly used tools; as such, we do not cover them here. The following are alternative free pathway enrichment  analysis software tools. Although we do not cover these tools in our  protocol, we recommend the following, on the basis of their ease of use, unique features or advanced programming features.

  • Enrichr37: This is a web-based enrichment analysis tool for non-ranked gene lists  that is based on Fisher’s exact test. It is easy to use, has rich  interactive reporting features, and includes >100 gene set databases  (called libraries), including >180,000 gene sets in multiple  categories. Functionality is similar to that of the g:Profiler web  server described in this protocol.

  • Camera71: This R Bioconductor package analyzes gene lists and corrects for  inter-gene correlations such as those apparent in gene co-expression  data. The software is available as part of the limma package in  Bioconductor (https://bioconductor.org/packages/release/bioc/html/limma.html; this is an advanced tool that requires programming expertise; Supplementary Protocol 3).

  • GOseq72: This R Bioconductor package analyzes gene lists from RNA-seq  experiments by correcting for user-selected covariates such as gene  length (https://bioconductor.org/packages/release/bioc/html/goseq.html; this is an advanced tool that requires programming expertise).

  • Genomic Regions Enrichment of Annotations Tool (GREAT)67: In contrast to common methods that analyze gene lists, GREAT analyzes  genomic regions such as DNA binding sites and links these to nearby  genes for pathway enrichment analysis (http://bejerano.stanford.edu/great/public/html/). See ‘Application to diverse omics data’ section.

Visualization tools

This  protocol recommends the use of EnrichmentMap for pathway enrichment  analysis visualization to aid interpretation. EnrichmentMap16 is a Cytoscape15 application that visualizes the results from pathway enrichment  analysis and eases interpretation by displaying pathways as a network in which overlapping pathways are clustered together to identify major  biological themes in the results (http://www.baderlab.org/Software/EnrichmentMap). Two alternative useful visualization tools are:

  • ClueGO40: This Cytoscape application is conceptually similar to EnrichmentMap and provides a network-based visualization to reduce redundancy of results  from pathway enrichment analysis. It also includes a pathway enrichment  analysis feature for analysis of GO annotations using Fisher’s exact  tests. However, it currently supports only GO gene sets.

  • PathVisio49: This desktop application provides a complementary visualization  approach to those of EnrichmentMap and ClueGO. PathVisio enables the  user to visually interpret omics data in the context of gene and protein interactions in a pathway of interest. PathVisio colors pathway genes  according to user-provided omics data (https://www.pathvisio.org). This is the main advantage of PathVisio as compared to EnrichmentMap and ClueGO.

Topology-aware pathway analysis methods

Most pathway enrichment analysis methods treat all genes in a pathway  uniformly and ignore gene interactions. By contrast, topology-aware  methods explicitly model the interactions between genes. CePa73, GANPA74 and THINK-Back75 use physical gene interactions or co-expression networks to assign a  weight to each gene in each pathway. Weights can be derived from  measures of the gene importance in the network such as degree, the  number of gene connections and betweenness centrality, and can be  integrated into a traditional pathway enrichment analysis method such as GSEA. Methods such as SPIA76, Pathway-Express77 and EnrichNet78 generate an ES for the entire pathway that considers pathway regulatory interactions such as activation and inhibition. Although useful and  potentially more accurate, regulatory and biochemical gene interactions  are available for fewer genes and pathways as compared to physical  interactions networks and co-expression. We anticipate that these  methods will become more useful as more gene interactions in pathways  are characterized in detailed molecular experiments. However collecting  and curating high-quality and biochemically detailed pathway data from  the literature is currently complex and expensive. Therefore, pathway  enrichment analysis methods described in this protocol will probably  remain the most widely used approaches for the foreseeable future.

Future perspectives

Current pathway enrichment analysis methods provide a useful high-level  overview of the pathways active in a genomics experiment. However, these methods consider a simplified pathway view that involves only gene  sets. Next-generation pathway analysis methods will integrate more  biological pathway details, build pathway models based on multiple types of genomics data measured across many samples, and consider positive  and negative regulatory relationships in the data. For instance,  qualitative mathematical modeling parameterized with single-cell RNA-seq data may one day enable accurate predictions of drug combinations  capable of treating a given disease under study.

Overview of the protocol

This step-by-step protocol explains how to complete pathway enrichment  analysis using g:Profiler (filtered gene list) and GSEA (unfiltered,  whole-genome, ranked gene list), followed by visualization and  interpretation using EnrichmentMap. The example data provided for the  g:Profiler analysis is a list of genes with frequent somatic single  nucleotide variants (SNVs) identified in The Cancer Genome Atlas (TCGA)  exome-sequencing data of 3,200 tumors of 12 types6. The example data provided for the GSEA analysis is a list of  differentially expressed genes in two subtypes of ovarian cancer defined by TCGA8.

Input data

Critical

We provide downloadable example files that are referred to throughout the protocol (Supplementary Tables 1–13). We recommend saving all these files in a personal project data folder before starting. We also recommend creating an additional result data folder to save the files generated while performing the protocol.

  • A gene list or ranked gene list of interest

    • Example data for Step 6A.  g:Profiler requires a list of genes, one per line, in a text file or  spreadsheet, ready to copy and paste into a web page: for this, we use  genes with frequent somatic SNVs identified in TCGA exome sequencing  data of 3,200 tumors of 12 types6. The MuSiC cancer driver mutation detection software was used to find  127 cancer driver genes that displayed higher than expected mutation  frequencies in cancer samples (Supplementary Table 1, which is derived from column B of Supplementary Table 4 in ref. 6). Genes are ranked in decreasing order of significance (FDR Q value) and mutation frequency (not shown).

    • Example data for Step 6B. GSEA  requires an RNK file with gene scores. An RNK file is a two-column text  file with gene IDs in the first column and gene scores in the second  column. All (or most) genes in the genome need to have a score, and the  gene IDs need to match those used in the GMT file. We provide a ranked  list of differentially expressed genes in ovarian cancer from TCGA  (Supplementary Table 2). This cohort was previously stratified into four molecular subtypes on the basis of gene expression data, defined as  differentiated, immunoreactive, mesenchymal and proliferative 该队列先前根据基因表达数据分为四种分子亚型,定义为分化,免疫反应,间充质和增生7,8. We compared the immunoreactive and mesenchymal subtypes to demonstrate the protocol. Step 5 of Supplementary Protocol 1 shows how this file was created.

  • Pathway gene set database

    • In Step 6A, g:Profiler maintains an  up-to-date set of pathway gene sets from multiple sources and no further input from the user is required, but a database of pathway gene sets is required for Step 6B (GSEA). Supplementary Table 3 contains a database  of pathway gene sets used for pathway enrichment analysis in the  standard GMT format, downloaded from http://baderlab.org/GeneSets. This file contains pathways downloaded on 1 July 2017 from eight data sources: GO57, Reactome58, Panther38, NetPath60, NCI79, MSigDB curated gene sets (C2 collection, excluding Reactome and KEGG)80, MSigDB Hallmark (H collection)81 and HumanCyc59. The gene sets available from http://baderlab.org/GeneSets are updated monthly. A GMT file is a text file in which each line  represents a gene set for a single pathway. Each line includes a pathway ID, a name and the list of associated genes in a tab-separated format.

Procedure

Software installation

Timing 5 min

  1. 1

    Download the required input and output files from the Supplementary Materials of the protocol.

    • Create two directories, project data folder and results data folder.

    • Place all downloaded input and example output files into the project data folder.

    • As you progress through the protocol, place any newly generated files into the results data folder.

  2. 2

    Install Java v.8 or higher. Follow the download and installation instructions for Java JRE at http://www.oracle.com/technetwork/java/javase/downloads/index.html.

    troubleshooting

  3. 3

    Download the latest version of GSEA. We recommend the javaGSEA desktop application available at http://www.broadinstitute.org/gsea/downloads.jsp. Free registration is required.

  4. 4

    Download the latest version of Cytoscape from http://www.cytoscape.org. Cytoscape v.3.6.0 or higher is required.

  5. 5

    Install the required Cytoscape applications.

    • Launch Cytoscape.

    • Go to AppsApp Manager (i.e., open the Apps menu and select the item App Manager).

    • In the Install Apps tab search bar, search for EnrichmentMap.

    • Click on EnrichmentMap Pipeline Collection in the center panel. Verify that it is v.1.0.0 or higher.

    • Click on the Install button.

    • Go to the Currently Installed tab and verify that the applications (EnrichmentMap, clusterMaker2, WordCloud and AutoAnnotate) have been installed.

Pathway enrichment analysis

Timing 3–20 min

  1. 6

    Two major types of gene lists are used in  pathway enrichment analysis of omics data. Flat (unranked) gene lists of dozens to thousands of genes can be analyzed using g:Profiler (option  A). A statistical threshold is required to compile a gene list from  omics data. By contrast, ranked, whole-genome gene lists are suitable  for pathway enrichment analysis using GSEA (option B). Gene lists  analyzed with GSEA do not require prior filtering using statistical  thresholds. Partial, filtered ranked gene lists can also be analyzed  with g:Profiler. Select Step 6A or 6B, depending on the type of gene  list you have.

    1. (i)

      Open the g:Profiler website at http://biit.cs.ut.ee/gprofiler/ (Fig. 2).

    2. (ii)

      Paste the gene list (Supplementary Table 1) into the Query field in top-left corner of the screen. The gene list can be space-separated or one per line. The organism for the analysis, Homo sapiens, is selected by default. The input list can contain a mix of gene and  protein IDs, symbols and accession numbers. Duplicated and unrecognized  IDs will be removed automatically, and ambiguous symbols can be refined  in an interactive dialogue after submitting the query.

    3. (iii)

      Check the box next to Ordered query. This option treats the input as an ordered gene list and prioritizes  genes with higher mutation ESs at the beginning of the list.

    4. (iv)

      (Optional) Check the box next to No electronic GO annotations. This option will discard less reliable GO annotations (IEAs) that are not manually reviewed.

    5. (v)

      Set filters on gene  annotation data using the menu on the right. We recommend that initial  pathway enrichment analyses includes only biological processes (BPs) of  GO and molecular pathways of Reactome. Keep the two checkboxes checked  and uncheck all other boxes in the menu.

    6. (vi)

      Click on Show Advanced Options to set additional parameters.

    7. (vii)

      Set the values of Size of functional category in the dropdown menu to 5 (‘min’) and 350 (‘max’). Large pathways are  of limited interpretative value, whereas numerous small pathways  decrease the statistical power because of excessive multiple testing.

    8. (viii)

      Set the Size of query/term intersection in the dropdown menu to 3. The analysis will consider only more  reliable pathways that have three or more genes in the input gene list.

    9. (ix)

      Click g:Profile! to run the analysis. A graphical heat map image will be shown, with detected pathways shown along the y axis (left) and associated genes of the input list shown along the x axis (top). Resulting pathways are organized hierarchically into  related groups. g:Profiler uses graphical output by default and switches to textual output when a large number of pathways is found. g:Profiler  returns only statistically significant pathways with P values adjusted for multiple testing correction (called Q values). By default, results with Q values <0.05 are reported. g:Profiler reports unrecognized and ambiguous gene IDs that can be resolved manually.

    10. (x)

      Use the dropdown menu Output type and select the option Generic Enrichment Map (TAB). This file is required for visualizing pathway results with Cytoscape and EnrichmentMap.

    11. (xi)

      Click g:Profile! again to run the analysis with the updated parameters. The required link Download data in Generic Enrichment Map (GEM) format will appear under the g:Profiler interface. Download the file from the link and save it on your computer in your result data folder created in Step 1. Example results are provided in Supplementary Table 4.

    12. (xii)

      Download the required GMT file by clicking on the link name at the bottom of the Advanced Options form. The GMT file is a compressed ZIP archive that contains all gene  sets used by g:Profiler (e.g., gprofiler_hsapiens.NAME.gmt.zip). The  gene set files are divided by data source. Download and uncompress the  ZIP archive to your project folder. All required gene sets for this  analysis will be in the file hsapiens.pathways.Name.gmt  (Supplementary_Table5_hsapiens.pathways.NAME.gmt). Place the saved file  in your result data folder created in Step 1.

    13. (A)

                                       Pathway enrichment analysis of a gene list using g:Profiler                                

      Timing 3 min

      Fig. 2: Screenshot of g:Profiler user interface.

      figure2

      Protocol Step 6A involves populating the g:Profiler interface. Procedural steps  are highlighted with rectangles and roman numerals (refer to Step  6A(i–xii)). Purple boxes highlight files that must be downloaded for  subsequent protocol steps. The remaining boxes indicate parameters for  the analysis.

Fig. 3: Screenshot of GSEA user interface.

Fig. 4: GSEA output overview.

a, Web page summary of GSEA results showing pathways enriched in  the top or bottom of the ranked list, with the ‘na_pos’ and ‘na_neg’  phenotypes corresponding to enrichment in upregulated and downregulated  genes, respectively. These have been manually labeled here as  mesenchymal and immunoreactive, respectively. Clicking on ‘Snapshot’  under either of the phenotypes will show the top 20 enrichment plots for that phenotype. b, An example enrichment plot for the top pathway in the mesenchymal set. c, An example enrichment plot for the top pathway in the immunoreactive set.

Fig. 5: Class/phenotype-specific GSEA output.

Class/phenotype-specific GSEA output in the web page summary shows how many gene sets were found enriched in upregulated genes, regardless of significance (purple), the total number of gene sets used after size filtering (cyan), the  phenotype name (red) and the number of gene sets that pass different  thresholds (orange).

Visualization of enrichment results with EnrichmentMap

Fig. 6: Screenshot of the EnrichmentMap software user interface.

a,b, Input fields in the EnrichmentMap interface for g:Profiler (a) and GSEA (b) results. Procedural steps are shown for Step 9A and 9B. Other than the  specific input files, the parameters are the same for the two analysis  types. Attributes surrounded by a dashed box should be filled out  automatically if the user selects an appropriate folder with the  required files. Missing file names indicate that EnrichmentMap was  unable to find the specified file. Orange boxes indicate optional files. For the examples presented in the protocol, optional files are used for the GSEA analysis but not for the g:Profiler analysis to demonstrate  the two distinct use cases. EM, EnrichmentMap.



Fig. 7: Resulting enrichment maps (no manual formatting).

a,b, Unformatted enrichment maps generated from Steps 6A (a) and 6B (b). Each node (circle) represents a distinct pathway, and edges (blue  lines) represent the number of genes overlapping between two pathways,  determined using the similarity coefficient. a, Enrichment map of significantly mutated cancer driver genes generated using the g:Profiler analysis in Step 6A. b, Enrichment map of pathways enriched in upregulated genes in mesenchymal (red) and immunoreactive (blue) ovarian cancer samples using the GSEA  analysis in Step 6B.

Fig. 8: Overview of EnrichmentMap panels in Cytoscape.

(i) Cytoscape ‘Control Panel’, which contains ‘Networks’, ‘Styles’ and ‘Select’ tabs as well as the ‘EnrichmentMap’ main panel. (ii) The ‘Table Panel’ contains tables with node, edge and network  attributes, as well as an enrichment map ‘Heat Map’ panel displaying  expression for genes associated with selected nodes and edges. (iii) Cytoscape search bar, which can be used to search for genes in the enrichment map. (iv) ‘Node Table’ containing values for all variables associated with each node in the network. (v) Q-value or P-value slider bar. By default, the slider is set to Q value if the data contains Q values but can be changed to use P values by selecting the ‘P-value’ radio button. All nodes that pass the initial Q-value threshold are displayed in the enrichment map. By moving the slider to the left, the Q-value threshold is adjusted to a lower value, removing any nodes that do not pass the Q-value threshold. The currently set threshold will be displayed in the  accompanying text box. Thresholds can be manually adjusted by modifying  the text box value directly. (vi) ‘Edge Cutoff (Similarity)’  slider bar. The slider bar modifies the similarity threshold. The  similarity threshold can only be increased; i.e., edges are required to  have more genes in common in order to remain visible, which will remove  edges from the network that do not satisfy the threshold. One can also  manually change the threshold by modifying the text box value directly.

Fig. 9: Example heat map in EnrichmentMap.

Heat map created by selecting the immunoreactive pathway interferon alpha  beta signaling pathway from Reactome. The heat map is useful for  visualization of detailed gene expression patterns for a pathway of  interest. Magenta corresponds to high expression, and green corresponds  to low expression. This heat map is for GSEA results, thus the ‘leading  edge’ genes are highlighted in yellow; these genes have the largest  contribution to the enrichment signal. (ivi) Additional controls in the Heat Map panel include sorting options (i), selection of genes to include (ii), expression data visualization options (iii), data compression options (iv), the option to show values (v) and heat map settings (vi).

Fig. 10: Resulting publication-ready enrichment map.

(i) Overall thumbnail view of the publication-ready enrichment map created with parameters FDR Q value < 0.01, and combined coefficient >0.375 with combined constant = 0.5. (ii) Zoomed-in section of publication-ready enrichment map, in which red and blue nodes represent mesenchymal and immunoreactive phenotype pathways, respectively. Nodes were manually laid out to form a clearer picture.  Clusters of nodes were labeled using the AutoAnnotate Cytoscape  application. Individual node labels were removed for clarity using the  publication-ready button in EnrichmentMap and exported as PNG and PDF  files. A legend was manually added at the bottom of the figure.


Fig. 11: Collapsed enrichment map.

The enrichment map was summarized by collapsing node clusters using the AutoAnnotate application. Each cluster of nodes in Fig. 10 is now represented as a single node. The network was scaled for better  node distribution and manually adjusted to reduce node and label  overlap. (i) Overall thumbnail view of the entire collapsed enrichment map. (ii) Zoomed-in section of the publication-ready collapsed enrichment map that corresponds to the zoomed-in network in Fig. 10 (ii).

Fig. 12: Subnetwork example.

Subnetwork of the main enrichment map (Fig. 10) was manually created by selecting pathways with the top NES values and  creating a new network from the selection. Red and blue nodes are  mesenchymal and immunoreactive phenotypes, respectively. Clusters of  nodes were automatically labeled using the AutoAnnotate application.  Annotations in the subnetwork may differ slightly from those in the main network, as word counts were normalized on a network basis.

Fig. 13: Generic enrichment map legend.

Enrichment map attributes can be copied for use in a custom figure legend. Only  components relevant to the analysis should be copied. Post-analysis  ‘Signature set’ nodes are included in the generic legend (not covered in this protocol). Post-analysis nodes highlight pathways in the  enrichment map that contain specific genes of interest such as targets  of drugs or microRNAs.

Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap

Nature Protocols: Published: 21 January 2019

Abstract

Pathway enrichment analysis helps researchers gain mechanistic insight into  gene lists generated from genome-scale (omics) experiments. This method  identifies biological pathways that are enriched in a gene list more  than would be expected by chance. We explain the procedures of pathway  enrichment analysis and present a practical step-by-step guide to help  interpret gene lists resulting from RNA-seq and genome-sequencing  experiments. The protocol comprises three major steps: definition of a  gene list from omics data, determination of statistically enriched  pathways, and visualization and interpretation of the results.协议包括三个主要步骤:从组学数据中定义基因列表,确定统计丰富的途径以及结果的可视化和解释。 We  describe how to use this protocol with published examples of  differentially expressed genes and mutated cancer genes; however, the  principles can be applied to diverse types of omics data. The protocol  describes innovative visualization techniques, provides comprehensive  background and troubleshooting guidelines, and uses freely available and frequently updated software, including g:Profiler, Gene Set Enrichment  Analysis (GSEA), Cytoscape and EnrichmentMap. The complete protocol can  be performed in ~4.5 h and is designed for use by biologists with no  prior bioinformatics training.

Introduction

Comprehensive quantification of DNA, RNA and proteins in biological samples is now  routine. The resulting data are growing exponentially, and their  analysis helps researchers discover novel biological functions,  genotype–phenotype relationships and disease mechanisms1,2. However, analysis and interpretation of these data represent a major  challenge for many researchers. Analyses often result in long lists of  genes that require an impractically large amount of manual literature  searching to interpret. A standard approach to addressing this problem  is pathway enrichment analysis, which summarizes the large gene list as a smaller list of more easily interpretable pathways. Pathways are  statistically tested for over-representation in the experimental gene  list relative to what is expected by chance, using several common  statistical tests that consider the number of genes detected in the  experiment, their relative ranking and the number of genes annotated to a pathway of interest. For instance, experimental data containing 40%  cell cycle genes are surprisingly enriched, given that only 8% of human  protein-coding genes are involved in this process. 例如,实验数据中包含40%细胞周期相关基因,而背景中只有8%的人类蛋白质编码基因参与该过程,说明这一过程在该实验数据中显著富集。

In a recent example, we used pathway enrichment analysis to help  identify histone and DNA methylation by the polycomb repressive complex  (PRC2) 多梳抑制复合物(PRC2) as the first rational therapeutic target for ependymoma 室管膜瘤, one of  the most prevalent childhood brain cancers3. This pathway is targetable by available drugs such as 5-azacytidine,  which was used on a compassionate basis in a terminally ill patient and  stopped rapid metastatic tumor growth3. In another example, we analyzed rare copy-number variants (CNVs) in  autism and identified several significant pathways affected by gene  deletions, whereas few significant hits were identified with  case–control association tests of single genes or loci4,5. These examples illustrate the useful insights into biological  mechanisms that can be achieved using pathway enrichment analysis.

Development of the protocol

This protocol covers  pathway enrichment analysis of large gene lists typically derived from  genome-scale (omics) technology. The protocol is intended for  experimental biologists who are interested in interpreting their omics  data. It requires only an ability to learn and use ‘point-and-click’  computer software, although advanced users can benefit from the  automatic analysis scripts we provide as Supplementary Protocols 14. We analyze previously published human gene expression and somatic mutation data as examples6,7,8; however, our conceptual framework is applicable to analysis of lists of genes or biomolecules from any organism derived from large-scale data,  including proteomics, genomics, epigenomics and gene-regulation studies. We extensively use pathway enrichment analysis for many projects and  have evaluated numerous available tools9,10,11,12. The software packages we cover here have been selected for their ease  of use, free access, advanced features, extensive documentation and  up-to-date databases, and they are ones we use daily in our research and recommend to collaborators and students. In addition, we have provided  feedback to the developers of these tools, allowing them to implement  features we have needed in published analyses. These tools are  g:Profiler13, GSEA14, Cytoscape15 and EnrichmentMap16, all freely available online:

Overview of the procedure

This section outlines the major stages of pathway enrichment analysis. A detailed step-by-step  protocol is provided in the Procedure below. Pathway enrichment analysis involves three major stages (Fig. 1; see Box 1 for basic definitions).

  1. 1Definition of a gene list of interest using omics data. An omics experiment comprehensively measures the activity of genes in  an experimental context. The resulting raw dataset generally requires  computational processing, such as normalization and scoring, to identify genes of interest, considering the experimental design. For example, a  list of genes differentially expressed between two groups of samples can be derived from RNA-seq data17. Gene lists derived from other types of omics experiments, such as gene expression microarrays18, quantitative proteomics19,20, germline and somatic genome sequencing21,22,23, and global DNA methylation assays24,25, can be used in this protocol; however, each type of data may require  specific pre-processing steps (see ‘Comparison to alternative methods’  section).

  2. Pathway enrichment analysis. A  statistical method is used to identify pathways enriched in the gene  list from stage 1, relative to what is expected by chance. All pathways  in a given database are tested for enrichment in the gene list (see Box 2 for a list of pathway databases). Several established pathway  enrichment analysis methods are available, and the choice of which to  use depends on the type of gene list (see ‘Comparison to alternative  methods’ section).

  3. Visualization and interpretation of pathway enrichment analysis results. Many enriched pathways can be identified in stage 2, often including  related versions of the same pathway. Visualization can help identify  the main biological themes and their relationships for in-depth study  and experimental evaluation.

Gene lists derived from diverse omics data undergo pathway enrichment  analysis, using g:Profiler or GSEA, to identify pathways that are  enriched in the experiment. Pathway enrichment analysis results are  visualized and interpreted in Cytoscape using its EnrichmentMap,  AutoAnnotate, WordCloud and clusterMaker2 applications. Protocol  overview is shown on the left, starting from gene list input, and  example outputs at each stage are shown on the right.

Stage 1: definition of a gene list of interest using omics data

Genome-scale experiments generate raw data that must be processed to  obtain gene-level information suitable for pathway enrichment analysis  (Supplementary Protocols 1 and 2). The specific processing steps are particular to the omics experiment  type and may be standard, and thus usually straightforward to implement, or not, in which case advanced computational skills may be needed for  data processing. Standard processing methods are available for  established omics technologies and are most conveniently performed by  the core facility that generates the data.

There are two major  ways to define a gene list from omics data: list or ranked list. Certain omics data naturally produce a gene list, such as all somatically  mutated genes in a tumor identified by exome sequencing, or all proteins that interact with a bait in a proteomics experiment. Such a list is  suitable for direct input into pathway enrichment analysis using  g:Profiler (Step 6A). Other omics data naturally produce ranked lists.  For example, a list of genes can be ranked by differential gene  expression score or sensitivity in a genome-wide CRISPR screen. Some  pathway enrichment analysis approaches analyze a ranked gene list  filtered by a particular threshold (e.g., FDR-adjusted P value  <0.05 and fold-change >2). Alternative approaches, such as GSEA,  are designed to analyze ranked lists of all available genes and do not  require a threshold. A whole-genome ranked list is suitable for input  into pathway enrichment analysis using GSEA (Step 6B). A partial  (non-whole-genome) ranked gene list should be analyzed using g:Profiler.

As an example, we describe the analysis of raw RNA-seq data from ovarian cancer samples to define a ranked gene list7. DNA sequence reads are quality filtered (e.g., by trimming to remove  low-quality bases) and mapped to a genome-wide reference set of  transcripts to enable counting of reads per transcript. Read counts are  aggregated at the gene level (counts per gene). Typically, RNA-seq data  for multiple biological replicates (three or more) for each of multiple  experimental conditions (two or more, e.g., treatment versus control)  are available. Read counts per gene are normalized across all samples to remove unwanted technical variation between samples, for example, due  to differences in sequencing lane or total read number per sequencing  run26,27,28. Next, read counts per gene are tested for differential expression  across sample groups (e.g., treatment versus control) (Supplementary  Protocols 1 and 2 for RNA-seq and microarray data, respectively). Software packages such as edgeR29, DESeq30, limma/voom31,32 and Cufflinks33 implement procedures for RNA-seq data normalization and differential  expression analysis. Differential gene expression analysis results  include: (i) the P value of the significance of differential expression; (ii) the related Q value (a.k.a. adjusted P value) that has been corrected for multiple testing across all genes,  for example, by using the Benjamini–Hochberg false-discovery rate  (BH-FDR) procedure34 (Box 3); (iii) effect size and direction of expression change so that  upregulated genes are positive and at the top of the list and  downregulated genes are negative and at the bottom of the list, often  expressed as log-transformed fold-change. The gene list is then ranked  by one or more of these values (e.g., −log10 P value multiplied by the sign of log-transformed fold-change) and studied using pathway enrichment analysis.

Stage 2A: pathway enrichment analysis of a gene list using g:Profiler (Step 6A)

The default analysis implemented in g:Profiler and similar web-based tools35,36,37,38 searches for pathways whose genes are significantly enriched (i.e.,  over-represented) in the fixed list of genes of interest, as compared to all genes in the genome (Step 6A)(Box 4). The P value of the enrichment of a pathway is computed using a Fisher’s exact test and multiple-test correction is applied (Box 3).

g:Profiler also includes an ordered enrichment test, which is suitable for lists  of up to a few thousand genes that are ordered by a score, whereas the  rest of the genes in the genome lack meaningful signal for ranking. For  example, significantly mutated genes may be ranked by a score from a  cancer driver prediction method6. This analysis repeats a modified Fisher’s exact test on incrementally  larger sub-lists of the input genes and reports the sub-list with the  strongest enrichment P value for each pathway39. g:Profiler searches a collection of gene sets representing Gene  Ontology (GO) terms, pathways, networks, regulatory motifs, and disease  phenotypes. Major categories of gene sets can be selected to customize  the search.

Pathway enrichment methods that use the Fisher’s exact test or related tests require the definition of background genes for  comparison. All annotated protein-coding genes are often used as  default. This leads to inappropriate inflation of P values and  false-positive results if the experiment can directly measure only a  subset of all genes. For example, setting a custom background is  important in analyzing data from targeted sequencing or  phosphoproteomics experiments. The appropriate custom background would  include all genes in the sequencing panel or all known phosphoproteins,  respectively.

Stage 2B: pathway enrichment analysis of a ranked gene list using GSEA (Step 6B)

Pathway enrichment analysis of a ranked gene list is implemented in the GSEA software14 (Step 6B) (Box 4). GSEA is a threshold-free method that analyzes all genes on the basis of their differential expression rank, or other score, without prior gene  filtering. GSEA is particularly suitable and is recommended when ranks  are available for all or most of the genes in the genome (e.g., for  RNA-seq data). However, it is not suitable when only a small portion of  genes have ranks available, for example, in an experiment that  identifies significantly mutated cancer genes (Stage 2A; Step 6A).

GSEA searches for pathways whose genes are enriched at the top or  bottom of the ranked gene list, more so than expected by chance alone.  For instance, if the topmost differentially expressed genes are involved in the cell cycle, this suggests that the cell cycle pathway is  regulated in the experiment. By contrast, the cell cycle pathway is  probably not significantly regulated if the cell cycle genes appear  randomly scattered through the whole ranked list. 例如,如果最显著差异表达的基因参与细胞周期,则表明该细胞周期途径在实验中受到调控。 相比之下,如果细胞周期基因随机出现在整个排名列表中,则细胞周期途径可能不会受到明显调节。 To calculate an  enrichment score (ES) for a pathway, GSEA progressively examines genes  from the top to the bottom of the ranked list, increasing the ES if a  gene is part of the pathway and decreasing the score otherwise. These  running sum values are weighted, so that enrichment in the very top-  (and bottom-) ranking genes is amplified, whereas enrichment in genes  with more moderate ranks are not amplified. The ES score is calculated  as the maximum value of the running sum and normalized relative to  pathway size, resulting in a normalized enrichment score (NES) that  reflects the enrichment of the pathway in the list. Positive and  negative NES values represent enrichment at the top and bottom of the  list, respectively. Finally, a permutation-based P value is computed and corrected for multiple testing to produce a permutation-based false-discovery rate (FDR) Q value that ranges from 0 (highly significant) to 1 (not significant) (Box 3). The same analysis is performed starting from the bottom of the ranked  gene list to identify pathways enriched in the bottom of the list.  Resulting pathways are selected using the FDR Q value threshold (e.g., Q < 0.05) and ranked using NES. In addition, the ‘leading edge’ aspect of the GSEA analysis identifies specific genes that most strongly  contribute to the detected enrichment signal of a pathway.

GSEA has two methods for determining the statistical significance (P value) of the ES: gene set permutation and phenotype permutation. The  gene set permutation test requires a ranked list, and GSEA compares the  observed pathway ES to a distribution of scores obtained by repeating  the analysis with randomly sampled gene sets of matching sizes (e.g.,  1,000 times). The phenotype permutation test requires expression data  for all samples (e.g., biological replicates), along with a definition  of sample groups called ‘phenotypes’ that are compared with each other  (e.g., cases versus controls; tumor versus. normal samples). The  observed pathway ES is compared to a distribution of scores obtained by  randomly shuffling the samples among phenotype categories and repeating  the analysis (e.g., 1,000 times), including computation of the ranked  gene list and resulting pathway ES. Gene set permutation is recommended  for studies with limited variability and biological replicates (i.e.,  two to five per condition). In this case, differential gene expression  values should be computed outside of GSEA, using methods that include  variance stabilization (such as edgeR29, DESeq30 and limma/voom31,32) and imported into the GSEA software before pathway analysis. Phenotype  permutation should be used with a larger number of replicates (e.g., at  least ten per condition). The main advantage of the phenotype  permutation approach is that it maintains the structure of gene sets  with biologically important gene correlations during permutation, in  contrast to the gene set permutation approach. This protocol covers only gene set permutation because it is appropriate for the most common use  case of pathway enrichment analysis. Phenotype permutation is  computationally expensive and, for the current version of GSEA, requires custom programming to compute ESs and differential expression  statistics separately for thousands of phenotype randomizations. For  advanced users, we provide a supplementary protocol for this procedure  (Supplementary Protocol 4).

By default, the GSEA desktop software searches the MSigDB gene set  database, which includes pathways, published gene signatures, microRNA  target genes and other gene set types (Box 2). The user can also provide a custom database as a text-based GMT (Gene  Matrix Transposed) file in which each line defines a pathway, with its  name, identifier and a list of genes it contains. Gene identifiers in  the GMT file must match those in the input gene list.

Stage 3: visualization and interpretation of pathway enrichment analysis results (Steps 7–13)

Pathway information is inherently redundant, as genes often participate in  multiple pathways, and databases may organize pathways hierarchically by including general and specific pathways with many shared genes (e.g.,  ‘cell cycle’ and ‘M-phase of cell cycle’). Consequently, pathway  enrichment analysis often highlights several versions of the same  pathway. Collapsing redundant pathways into a single biological theme  simplifies interpretation. We recommend addressing such redundancy with  visualization methods such as EnrichmentMap16, ClueGO40 and others41,42,43. An ‘enrichment map’ is a network visualization that represents overlaps among enriched pathways (Fig. 1), whereas ‘EnrichmentMap’ refers to the Cytoscape application that  creates the visualization. Pathways are shown as circles (nodes) that  are connected with lines (edges) if the pathways share many genes. Nodes are colored by ES, and edges are sized on the basis of the number of  genes shared by the connected pathways. Network layout and clustering  algorithms automatically group similar pathways into major biological  themes. The EnrichmentMap software takes as input a text file containing pathway enrichment analysis results and another text file containing  the pathway gene sets used in the original enrichment analysis.  Interactive exploration of pathway ES (filtering nodes) and connections  between pathways (filtering edges) is possible (Step 9A(xii and xiii)  and 9B(xiii and xiv)). Multiple enrichment analysis results can be  simultaneously visualized in a single enrichment map, in which case  different colors are used on the nodes for each enrichment. If the gene  expression data are optionally loaded, clicking on a pathway node will  display a gene expression heat map of all genes in the pathway.

An enrichment map helps identify interesting pathways and themes. First,  expected themes should be identified to help validate the pathway  enrichment analysis results (positive controls). For instance,  growth-related pathways and other hallmarks of cancer44 are expected to be identified in analyses of cancer genomics datasets.  Second, pathways not previously associated with the experimental context are evaluated more carefully as potential discoveries. Pathways and  themes with the strongest ESs should be studied first, followed by  progressively weaker signals (Step 12). Third, interesting pathways are  examined in more detail, examining genes within the pathways (e.g.,  expression heat maps and the GSEA leading edge genes). Further, gene  expression values can be overlaid on a pathway diagram, if available,  from databases such as Pathway Commons45, Reactome46, KEGG47 or WikiPathways48, using tools such as PathVisio49. If a diagram is not available, tools such as STRING50 or GeneMANIA51 can be used with Cytoscape15 to define an interaction network among pathway genes for expression  overlay. This helps in visual identification of the pathway components  (e.g., single genes or entire signaling cascades) that are the most  altered (e.g., differentially expressed) in the experiment. In addition, master regulators for enriched pathways can be searched for by  integrating gene sets of miRNA52 or transcription factor53 targets using the EnrichmentMap post-analysis tool. Finally, pathway  enrichment analysis results can be published to support a scientific  conclusion (e.g., functional differences of two cancer subtypes), or  used for hypothesis generation or planning of experiments to support the identification of novel pathways. More pathway enrichment analysis  examples and deeper explanation of core concepts is provided at http://www.pathwaycommons.org/guide/.

Advantages and limitations

Pathway enrichment  analysis of omics data has several advantages as compared to analysis of single genes, transcripts or proteins. First, it improves statistical  power in two ways: (i) it aggregates counts of mutations across all the  genes and genomic regions involved in the given cell mechanism,  providing a higher number of counts, which makes statistical analyses  more reliable; and (ii) it reduces the dimensionality from tens of  thousands of genes or millions of genomic regions (e.g., SNPs) to a much smaller number of ‘systems’ or ‘pathways’, thereby reducing the cost of multiple hypothesis testing. Second, results are often easier to  interpret because the analysis is phrased at the level of familiar  concepts such as ‘cell cycle’. Third, the approach can help identify  potential causal mechanisms and drug targets. Fourth, results obtained  from related, but different, data may be more comparable because results are projected onto a smaller, shared feature space (i.e., a limited  number of pathways). Fifth, the approach facilitates integration of  diverse data types, such as genomics, transcriptomics and proteomics,  which can all be mapped to the same pathways. Thus, projecting disease  data onto known mechanisms increases statistical and interpretative  power.

The following limitations are important to consider when  interpreting pathway enrichment analysis results, in general, including  those covered by this protocol. Additional limitations apply, depending  on the omics data type (see ‘Application to diverse omics data’  section). Advantages and disadvantages of specific and alternative  pathway enrichment analysis methods are presented in the ‘Comparison to  alternative methods’ section.

  • Enrichment analysis is more effective for  pathways in which multiple genes have strong biological signals (e.g.,  differential expression). For instance, in a transcriptomics experiment, we assume that evolution has optimized a cell to express a pathway only when needed, and that pathway activation or deactivation can be  identified as coordinated activity of many genes in a pathway. Pathways  in which activity is controlled by only a few genes or not controlled by gene expression (e.g., by post-translational regulation) will never be  observed as enriched. 如果同路的变化不是通过基因表达反映出来的,则无法评估 Some pathway analysis methods address this by  using activating and inhibiting gene interactions to construct  quantitative models of pathway activity that include genes that are not  differentially expressed yet are still important regulators. However,  these methods require pathway models with detailed biochemical and  regulatory gene interactions that are obtained through focused  experiments and thus are in limited supply (Box 2).

  • Pathway boundaries tend to be arbitrary, and  different databases will disagree about which genes are involved in a  given pathway. By using multiple databases, multiple pathway definitions can be analyzed, and some may be better than others at explaining the  experimental data.

  • Some pathway enrichment methods, such as those  based on the Fisher’s exact test, are statistically more likely to  identify larger pathways as significant. Users can address this  limitation by selecting an upper limit for the size of the gene sets  considered in the analysis.

  • Multi-functional genes that are highly ranked  in the gene list may lead to enrichment of many different pathways, some of which are not relevant to the experiment54. Repeating the analysis after excluding such genes may reveal pathways  whose enrichment is overly dependent on their presence or confirm the  robustness of pathway enrichment.

  • Pathway databases, and therefore enrichment  results, are biased toward well-known pathways. In fact, pathway  enrichment analysis ignores genes with no pathway annotations, sometimes called the ‘dark matter of the genome’, and these genes should be  studied separately. For example, non-coding RNA genes currently lack  systematic annotations and are not directly usable in pathway enrichment analysis. ncRNA没有得到很好的注释,因此不能用于功能富集分析

  • Most enrichment analysis methods make  unrealistic assumptions of statistical independence among genes as well  as pathways. Some genes may always be co-expressed (e.g., genes within a protein complex), and some pathways have genes in common. Thus,  standard FDRs, which assume statistical independence between tests, are  often either more or less conservative than ideal. Nonetheless, they  should still be used to adjust for multiple testing and rank enriched  pathways for exploratory analysis and hypothesis generation. Custom  permutation tests may lead to better estimates of false discovery (see  ‘Comparison to alternative methods’ section).

Experimental design

Pathway enrichment analysis  benefits greatly from careful experimental design. Otherwise, the  analysis may reveal apparently meaningful results caused by experimental biases or other confounders. This section covers a range of  experimental aspects that must be considered before performing this  protocol.

Experimental conditions

The  experimental conditions must be well defined such that the major  variations observed are responses that the experimenter would like to  monitor and that are related to the biological question of interest  (e.g., tumor versus normal, treated versus untreated, comparison of four disease subtypes, time series).

Number of replicates

Biological replicates are independently processed samples obtained from distinct  organisms or cell lines that are required for measuring variability  across samples and to compute statistical significance (P  values). Lack of replication (i.e., one sample per group) will not  permit robust estimation of the significance of the signal. Insufficient replication may result in lack of signal in the data (e.g., no  significantly differentially expressed genes). The larger the variation  in the set of samples, the more biological replicates are needed to  accurately measure the signal. For systems with lower variability (i.e., model organisms with the same genetic background in controlled  laboratory conditions, or stable cell lines derived from the same  clone), at least three to four biological replicates are recommended per condition for differential analysis with variance stabilization  normalization. Variance stabilization uses a global statistical model to ‘stabilize’ gene-wise variance estimates to reduce inaccuracies  resulting from few replicates. For experiments with higher variability  (e.g., tumor samples), more replicates are required; ideally, a pilot  experiment followed by formal statistical power calculations55 (sometimes called sensitivity testing) should be used to determine the  minimal number of replicates required to identify the signal of  differentially expressed genes or enriched pathways. Technical  replicates comprising repeated experiments of the same samples are  usually not needed for well-established experimental techniques, such as RNA-seq, that have low technical variability, but can be helpful for  novel techniques.

Confounding factors

Differences in factors not related to the experimental question should be avoided  or at least balanced across conditions so that statistical techniques  such as generalized linear models can correct for each factor. Common  factors include sequencing batch, nucleic acid extraction protocol,  subject age and many others. Otherwise, it may be impossible to  accurately separate the experimental signals coming from the  experimental response from the confounding factors. Knowing important  factors in advance supports correct experimental design. Statistical  exploratory analyses such as clustering or principal component analysis  (PCA) can help identify unknown factors. For example, cases and controls are expected to cluster separately and not by processing batch.

Outliers

Outlier samples may considerably differ from others because of major  experimental or technical problems, such as contamination or sample  mix-up. Alternatively, they may present extreme biological features,  such as tumor samples with exceptionally aggressive phenotypes. 它们可能表现出极端的生物学特征,例如具有异常侵袭性表型的肿瘤样品。 Unbiased identification of outlier samples is possible using statistical  techniques such as PCA or clustering. Pathway enrichment analysis should be performed with and without outliers to ensure robust results.  Systematic removal of outliers may be justified to reduce variability in the experiment.

Experimental sensitivity

Some  experimental methods can be tuned to be more or less sensitive. For  instance, the number of reads in RNA-seq experiments influences  downstream analysis. For quantifying gene expression in a biological  system with modest variability and testing differential expression with  variance stabilization, at least three to five replicates and 10 million mapped reads are required56. Substantially greater sequencing depth, such as 50–100 million mapped  reads, is required to investigate splice isoforms, to detect poorly  expressed genes or for samples with complex cellular mixtures such as  surgical resection specimens.

Choice of pathway gene set database

We recommend searching enrichment of pathway gene sets only at first, as  these capture familiar normal cellular processes that are easy to  interpret. GO57 biological process terms and manually curated molecular pathways from Reactome58, Panther38, HumanCyc59 and NetPath60 are good resources for human pathways (Box 2). GO biological process annotations include a mix of manually curated and electronically inferred sources.

Filtering GO pathways by evidence code

A large fraction of gene annotations in GO originate from automatic data  analysis and are not verified by human curators. These have the evidence code ‘inferred from electronic annotation’ (IEA). Earlier literature  has cautioned against analyzing and interpreting IEA-labeled annotations61, whereas more recent studies suggest that these are often as reliable as annotations assigned by human curators62. For high-confidence analyses of data from human and common model  organisms that have many manually curated annotations, we generally  recommend comparing versions of the analysis with and without filtering  of IEA annotations to verify robustness. However, IEA annotations make  up the majority of information in less-well-studied species and should  be used by default in these cases. Removing IEA-coded annotations may  bias the analysis toward well-studied biological processes.

Use of non-pathway gene sets

Different types of gene sets help answer a variety of questions. For instance,  non-pathway gene sets corresponding to microRNA and transcription factor targets can be used to discover important regulators52,53. However, simultaneously analyzing all available types of gene sets  reduces data interpretability. It may also lead to false negatives, as  the increased number of conducted tests increases the effect of multiple testing correction and reduces the multiple-test adjusted significance  of individual pathways. We therefore recommend performing the analysis  of non-pathway and pathway gene sets separately.

Gene set size considerations

It is often beneficial to exclude numerous small pathways because they are redundant with larger pathways and complicate interpretation, and their abundance makes multiple-testing correction more stringent. Large  pathways should be also excluded, as these are overly general (e.g.,  ‘metabolism’), they do not contribute to interpretability of results,  and their statistical significance can be inflated when using certain  statistical enrichment methods (e.g., Fisher’s exact test). For  analyzing human gene expression data, we often recommend excluding  pathway gene sets with <10–15 genes and >200–500 genes, although  upper bounds of 200–2,000 genes can be found in the literature. However, for non-human organisms and other types of gene sets, which may have  different gene set size distributions, larger sets may need to be  included. Filtering of pathways depends on the experimental context, as  different areas of biology have variable coverage in pathway databases.  One can determine the lower and upper bounds of pathway size by  examining the sizes of several pathways of interest that are expected to be relevant to the experiment.

Importance of using updated pathway gene sets

Pathway enrichment analysis depends on gene sets and databases used in the  analysis, and many recent studies using pathway enrichment analysis are  strongly impacted by outdated resources11. For improved reproducibility and transparency of research,  investigators should report in publications the analysis date and  versions of pathway enrichment analysis software and gene set databases  used, as well as all analysis parameters. In addition to enrichment  maps, authors should consider adding their studied gene lists and  complete tables of enriched pathways as supplementary information.

Choice of gene identifier

Genes are associated with many diverse database identifiers (IDs). We  recommend using unambiguous, unique and stable IDs, as some IDs become  obsolete over time. For human genes, we recommend using the Entrez Gene  database IDs (e.g., 4193 corresponds to MDM2) or gene symbols (MDM2 is the official symbol recommended by the HUGO Gene Nomenclature  Committee). As gene symbols change over time, we recommend maintaining  both gene symbols and Entrez Gene IDs. The g:Profiler and related  g:Convert tools support automatic conversion of multiple ID types to  standard IDs.

Unexpected pathway results and experimental design

Unexpected biological themes revealed in a pathway analysis may indicate problems  with experimental design, data generation or analysis. For example,  enrichment of the apoptosis pathway may indicate a problem with the  experimental protocol that led to increased cell death during sample  preparation. In these cases, the experimental design and data generation should be carefully reviewed before further data interpretation.

Application to diverse omics data

This protocol uses RNA-seq data7 and somatic mutation data6 as examples because these data types are frequently encountered.  However, the general concepts of pathway enrichment analysis that we  present are applicable to many types of experiments that can generate  lists of genes, such as single-cell transcriptomics, CNVs5, proteomics63, phosphoproteomics64, DNA methylation65 and metabolomics66. Most data types require protocol modifications, which we only briefly  discuss here. With certain data types, specialized computational methods are required to produce a gene list that is appropriate for pathway  enrichment analysis, whereas with other data types, a specialized  pathway enrichment analysis technique is required. Issues specific to  data types and experimental methods must be considered, including:

  • Different gene identifiers are recommended for certain data types. We recommend UniProt accession numbers for proteins (e.g., Q00987 for MDM2) and Human Metabolome Database IDs for metabolites (e.g., ATP is denoted as HMDB00538).

  • Certain types of omics experiments by design  capture only a subset of genes or proteins. To address this limited  coverage, pathway enrichment analysis must define a custom background  gene set of the genes that can be measured in the experiment. For  example, phosphoproteomics experiments measure only proteins with one or more phosphorylation sites and thus must use the set of genes encoding  phosphoproteins as the custom background gene set. 例如,磷酸蛋白质组学实验仅测量具有一个或多个磷酸化位点的蛋白质,因此必须使用编码磷蛋白的基因集作为定制背景基因集。 Otherwise, pathway  enrichment analysis would reveal inflated P values for general processes such as kinase signaling and protein phosphorylation.

  • Pathway enrichment analysis of short non-coding genomic regions such as transcription factor binding sites from  ChIP-seq experiments need additional consideration. Genomic regions must be mapped to protein-coding genes and corrected for biases such as  increased signal in longer genes. Tools such as GREAT67 automatically perform both tasks.

  • Large genomic intervals that span multiple  genes (e.g., from genome-wide associations, CNV and differentially  methylated regions) require specialized enrichment tests such as the  PLINK CNV gene set burden test68 or INRICH69. Standard enrichment tests often reveal genes clustered in the genome  whose signals are strongly statistically inflated because each gene is  incorrectly counted as an independent signal. 标准富集测试通常会揭示基因组中聚集的基因,其信号在统计上被强烈夸大,因为每个基因被错误地视为独立信号。 Gene types that are  correlated with genomic position include olfactory receptors, histones,  major histocompatibility complex (MHC) members and homeobox  transcription factors. A simple solution to address genomic clustering  of genes in a pathway involves selecting only one representative gene  from each functionally homogeneous genomic cluster before enrichment  analysis.

  • For rare genetic variants, case–control pathway ‘burden’ tests are the most appropriate pathway enrichment analysis  method (see ‘Comparison to alternative’ methods section).

Comparison to alternative methods

Pathway enrichment analysis methods

This protocol recommends the use of g:Profiler and GSEA software for pathway enrichment analysis. g:Profiler13,39 analyzes gene lists using Fisher’s exact test and ordered gene lists  using a modified Fisher’s test. It provides a graphical web interface  and access via R and Python programming languages. The software is  frequently updated, and the gene set database can be downloaded as a GMT file (http://biit.cs.ut.ee/gprofiler). GSEA14 analyzes ranked gene lists using a permutation-based test. GSEA分析排名的基因列表基于置换的检验。 The software typically runs as a desktop application (http://software.broadinstitute.org/gsea). Hundreds of pathway enrichment analysis tools exist (reviewed in ref. 70), although many rely on out-of-date pathway databases or lack unique  features as compared to the most commonly used tools; as such, we do not cover them here. The following are alternative free pathway enrichment  analysis software tools. Although we do not cover these tools in our  protocol, we recommend the following, on the basis of their ease of use, unique features or advanced programming features.

  • Enrichr37: This is a web-based enrichment analysis tool for non-ranked gene lists  that is based on Fisher’s exact test. It is easy to use, has rich  interactive reporting features, and includes >100 gene set databases  (called libraries), including >180,000 gene sets in multiple  categories. Functionality is similar to that of the g:Profiler web  server described in this protocol.

  • Camera71: This R Bioconductor package analyzes gene lists and corrects for  inter-gene correlations such as those apparent in gene co-expression  data. The software is available as part of the limma package in  Bioconductor (https://bioconductor.org/packages/release/bioc/html/limma.html; this is an advanced tool that requires programming expertise; Supplementary Protocol 3).

  • GOseq72: This R Bioconductor package analyzes gene lists from RNA-seq  experiments by correcting for user-selected covariates such as gene  length (https://bioconductor.org/packages/release/bioc/html/goseq.html; this is an advanced tool that requires programming expertise).

  • Genomic Regions Enrichment of Annotations Tool (GREAT)67: In contrast to common methods that analyze gene lists, GREAT analyzes  genomic regions such as DNA binding sites and links these to nearby  genes for pathway enrichment analysis (http://bejerano.stanford.edu/great/public/html/). See ‘Application to diverse omics data’ section.

Visualization tools

This  protocol recommends the use of EnrichmentMap for pathway enrichment  analysis visualization to aid interpretation. EnrichmentMap16 is a Cytoscape15 application that visualizes the results from pathway enrichment  analysis and eases interpretation by displaying pathways as a network in which overlapping pathways are clustered together to identify major  biological themes in the results (http://www.baderlab.org/Software/EnrichmentMap). Two alternative useful visualization tools are:

  • ClueGO40: This Cytoscape application is conceptually similar to EnrichmentMap and provides a network-based visualization to reduce redundancy of results  from pathway enrichment analysis. It also includes a pathway enrichment  analysis feature for analysis of GO annotations using Fisher’s exact  tests. However, it currently supports only GO gene sets.

  • PathVisio49: This desktop application provides a complementary visualization  approach to those of EnrichmentMap and ClueGO. PathVisio enables the  user to visually interpret omics data in the context of gene and protein interactions in a pathway of interest. PathVisio colors pathway genes  according to user-provided omics data (https://www.pathvisio.org). This is the main advantage of PathVisio as compared to EnrichmentMap and ClueGO.

Topology-aware pathway analysis methods

Most pathway enrichment analysis methods treat all genes in a pathway  uniformly and ignore gene interactions. By contrast, topology-aware  methods explicitly model the interactions between genes. CePa73, GANPA74 and THINK-Back75 use physical gene interactions or co-expression networks to assign a  weight to each gene in each pathway. Weights can be derived from  measures of the gene importance in the network such as degree, the  number of gene connections and betweenness centrality, and can be  integrated into a traditional pathway enrichment analysis method such as GSEA. Methods such as SPIA76, Pathway-Express77 and EnrichNet78 generate an ES for the entire pathway that considers pathway regulatory interactions such as activation and inhibition. Although useful and  potentially more accurate, regulatory and biochemical gene interactions  are available for fewer genes and pathways as compared to physical  interactions networks and co-expression. We anticipate that these  methods will become more useful as more gene interactions in pathways  are characterized in detailed molecular experiments. However collecting  and curating high-quality and biochemically detailed pathway data from  the literature is currently complex and expensive. Therefore, pathway  enrichment analysis methods described in this protocol will probably  remain the most widely used approaches for the foreseeable future.

Future perspectives

Current pathway enrichment analysis methods provide a useful high-level  overview of the pathways active in a genomics experiment. However, these methods consider a simplified pathway view that involves only gene  sets. Next-generation pathway analysis methods will integrate more  biological pathway details, build pathway models based on multiple types of genomics data measured across many samples, and consider positive  and negative regulatory relationships in the data. For instance,  qualitative mathematical modeling parameterized with single-cell RNA-seq data may one day enable accurate predictions of drug combinations  capable of treating a given disease under study.

Overview of the protocol

This step-by-step protocol explains how to complete pathway enrichment  analysis using g:Profiler (filtered gene list) and GSEA (unfiltered,  whole-genome, ranked gene list), followed by visualization and  interpretation using EnrichmentMap. The example data provided for the  g:Profiler analysis is a list of genes with frequent somatic single  nucleotide variants (SNVs) identified in The Cancer Genome Atlas (TCGA)  exome-sequencing data of 3,200 tumors of 12 types6. The example data provided for the GSEA analysis is a list of  differentially expressed genes in two subtypes of ovarian cancer defined by TCGA8.

Input data

Critical

We provide downloadable example files that are referred to throughout the protocol (Supplementary Tables 1–13). We recommend saving all these files in a personal project data folder before starting. We also recommend creating an additional result data folder to save the files generated while performing the protocol.

  • A gene list or ranked gene list of interest

    • Example data for Step 6A.  g:Profiler requires a list of genes, one per line, in a text file or  spreadsheet, ready to copy and paste into a web page: for this, we use  genes with frequent somatic SNVs identified in TCGA exome sequencing  data of 3,200 tumors of 12 types6. The MuSiC cancer driver mutation detection software was used to find  127 cancer driver genes that displayed higher than expected mutation  frequencies in cancer samples (Supplementary Table 1, which is derived from column B of Supplementary Table 4 in ref. 6). Genes are ranked in decreasing order of significance (FDR Q value) and mutation frequency (not shown).

    • Example data for Step 6B. GSEA  requires an RNK file with gene scores. An RNK file is a two-column text  file with gene IDs in the first column and gene scores in the second  column. All (or most) genes in the genome need to have a score, and the  gene IDs need to match those used in the GMT file. We provide a ranked  list of differentially expressed genes in ovarian cancer from TCGA  (Supplementary Table 2). This cohort was previously stratified into four molecular subtypes on the basis of gene expression data, defined as  differentiated, immunoreactive, mesenchymal and proliferative 该队列先前根据基因表达数据分为四种分子亚型,定义为分化,免疫反应,间充质和增生7,8. We compared the immunoreactive and mesenchymal subtypes to demonstrate the protocol. Step 5 of Supplementary Protocol 1 shows how this file was created.

  • Pathway gene set database

    • In Step 6A, g:Profiler maintains an  up-to-date set of pathway gene sets from multiple sources and no further input from the user is required, but a database of pathway gene sets is required for Step 6B (GSEA). Supplementary Table 3 contains a database  of pathway gene sets used for pathway enrichment analysis in the  standard GMT format, downloaded from http://baderlab.org/GeneSets. This file contains pathways downloaded on 1 July 2017 from eight data sources: GO57, Reactome58, Panther38, NetPath60, NCI79, MSigDB curated gene sets (C2 collection, excluding Reactome and KEGG)80, MSigDB Hallmark (H collection)81 and HumanCyc59. The gene sets available from http://baderlab.org/GeneSets are updated monthly. A GMT file is a text file in which each line  represents a gene set for a single pathway. Each line includes a pathway ID, a name and the list of associated genes in a tab-separated format.

Procedure

Software installation

Timing 5 min

  1. 1

    Download the required input and output files from the Supplementary Materials of the protocol.

    • Create two directories, project data folder and results data folder.

    • Place all downloaded input and example output files into the project data folder.

    • As you progress through the protocol, place any newly generated files into the results data folder.

  2. 2

    Install Java v.8 or higher. Follow the download and installation instructions for Java JRE at http://www.oracle.com/technetwork/java/javase/downloads/index.html.

    troubleshooting

  3. 3

    Download the latest version of GSEA. We recommend the javaGSEA desktop application available at http://www.broadinstitute.org/gsea/downloads.jsp. Free registration is required.

  4. 4

    Download the latest version of Cytoscape from http://www.cytoscape.org. Cytoscape v.3.6.0 or higher is required.

  5. 5

    Install the required Cytoscape applications.

    • Launch Cytoscape.

    • Go to AppsApp Manager (i.e., open the Apps menu and select the item App Manager).

    • In the Install Apps tab search bar, search for EnrichmentMap.

    • Click on EnrichmentMap Pipeline Collection in the center panel. Verify that it is v.1.0.0 or higher.

    • Click on the Install button.

    • Go to the Currently Installed tab and verify that the applications (EnrichmentMap, clusterMaker2, WordCloud and AutoAnnotate) have been installed.

Pathway enrichment analysis

Timing 3–20 min

  1. 6

    Two major types of gene lists are used in  pathway enrichment analysis of omics data. Flat (unranked) gene lists of dozens to thousands of genes can be analyzed using g:Profiler (option  A). A statistical threshold is required to compile a gene list from  omics data. By contrast, ranked, whole-genome gene lists are suitable  for pathway enrichment analysis using GSEA (option B). Gene lists  analyzed with GSEA do not require prior filtering using statistical  thresholds. Partial, filtered ranked gene lists can also be analyzed  with g:Profiler. Select Step 6A or 6B, depending on the type of gene  list you have.

    1. (i)

      Open the g:Profiler website at http://biit.cs.ut.ee/gprofiler/ (Fig. 2).

    2. (ii)

      Paste the gene list (Supplementary Table 1) into the Query field in top-left corner of the screen. The gene list can be space-separated or one per line. The organism for the analysis, Homo sapiens, is selected by default. The input list can contain a mix of gene and  protein IDs, symbols and accession numbers. Duplicated and unrecognized  IDs will be removed automatically, and ambiguous symbols can be refined  in an interactive dialogue after submitting the query.

    3. (iii)

      Check the box next to Ordered query. This option treats the input as an ordered gene list and prioritizes  genes with higher mutation ESs at the beginning of the list.

    4. (iv)

      (Optional) Check the box next to No electronic GO annotations. This option will discard less reliable GO annotations (IEAs) that are not manually reviewed.

    5. (v)

      Set filters on gene  annotation data using the menu on the right. We recommend that initial  pathway enrichment analyses includes only biological processes (BPs) of  GO and molecular pathways of Reactome. Keep the two checkboxes checked  and uncheck all other boxes in the menu.

    6. (vi)

      Click on Show Advanced Options to set additional parameters.

    7. (vii)

      Set the values of Size of functional category in the dropdown menu to 5 (‘min’) and 350 (‘max’). Large pathways are  of limited interpretative value, whereas numerous small pathways  decrease the statistical power because of excessive multiple testing.

    8. (viii)

      Set the Size of query/term intersection in the dropdown menu to 3. The analysis will consider only more  reliable pathways that have three or more genes in the input gene list.

    9. (ix)

      Click g:Profile! to run the analysis. A graphical heat map image will be shown, with detected pathways shown along the y axis (left) and associated genes of the input list shown along the x axis (top). Resulting pathways are organized hierarchically into  related groups. g:Profiler uses graphical output by default and switches to textual output when a large number of pathways is found. g:Profiler  returns only statistically significant pathways with P values adjusted for multiple testing correction (called Q values). By default, results with Q values <0.05 are reported. g:Profiler reports unrecognized and ambiguous gene IDs that can be resolved manually.

    10. (x)

      Use the dropdown menu Output type and select the option Generic Enrichment Map (TAB). This file is required for visualizing pathway results with Cytoscape and EnrichmentMap.

    11. (xi)

      Click g:Profile! again to run the analysis with the updated parameters. The required link Download data in Generic Enrichment Map (GEM) format will appear under the g:Profiler interface. Download the file from the link and save it on your computer in your result data folder created in Step 1. Example results are provided in Supplementary Table 4.

    12. (xii)

      Download the required GMT file by clicking on the link name at the bottom of the Advanced Options form. The GMT file is a compressed ZIP archive that contains all gene  sets used by g:Profiler (e.g., gprofiler_hsapiens.NAME.gmt.zip). The  gene set files are divided by data source. Download and uncompress the  ZIP archive to your project folder. All required gene sets for this  analysis will be in the file hsapiens.pathways.Name.gmt  (Supplementary_Table5_hsapiens.pathways.NAME.gmt). Place the saved file  in your result data folder created in Step 1.

    13. (A)

                                       Pathway enrichment analysis of a gene list using g:Profiler                                

      Timing 3 min

      Fig. 2: Screenshot of g:Profiler user interface.

      figure2

      Protocol Step 6A involves populating the g:Profiler interface. Procedural steps  are highlighted with rectangles and roman numerals (refer to Step  6A(i–xii)). Purple boxes highlight files that must be downloaded for  subsequent protocol steps. The remaining boxes indicate parameters for  the analysis.

Fig. 3: Screenshot of GSEA user interface.

Fig. 4: GSEA output overview.

a, Web page summary of GSEA results showing pathways enriched in  the top or bottom of the ranked list, with the ‘na_pos’ and ‘na_neg’  phenotypes corresponding to enrichment in upregulated and downregulated  genes, respectively. These have been manually labeled here as  mesenchymal and immunoreactive, respectively. Clicking on ‘Snapshot’  under either of the phenotypes will show the top 20 enrichment plots for that phenotype. b, An example enrichment plot for the top pathway in the mesenchymal set. c, An example enrichment plot for the top pathway in the immunoreactive set.

Fig. 5: Class/phenotype-specific GSEA output.

Class/phenotype-specific GSEA output in the web page summary shows how many gene sets were found enriched in upregulated genes, regardless of significance (purple), the total number of gene sets used after size filtering (cyan), the  phenotype name (red) and the number of gene sets that pass different  thresholds (orange).

Visualization of enrichment results with EnrichmentMap

Fig. 6: Screenshot of the EnrichmentMap software user interface.

a,b, Input fields in the EnrichmentMap interface for g:Profiler (a) and GSEA (b) results. Procedural steps are shown for Step 9A and 9B. Other than the  specific input files, the parameters are the same for the two analysis  types. Attributes surrounded by a dashed box should be filled out  automatically if the user selects an appropriate folder with the  required files. Missing file names indicate that EnrichmentMap was  unable to find the specified file. Orange boxes indicate optional files. For the examples presented in the protocol, optional files are used for the GSEA analysis but not for the g:Profiler analysis to demonstrate  the two distinct use cases. EM, EnrichmentMap.



Fig. 7: Resulting enrichment maps (no manual formatting).

a,b, Unformatted enrichment maps generated from Steps 6A (a) and 6B (b). Each node (circle) represents a distinct pathway, and edges (blue  lines) represent the number of genes overlapping between two pathways,  determined using the similarity coefficient. a, Enrichment map of significantly mutated cancer driver genes generated using the g:Profiler analysis in Step 6A. b, Enrichment map of pathways enriched in upregulated genes in mesenchymal (red) and immunoreactive (blue) ovarian cancer samples using the GSEA  analysis in Step 6B.

Fig. 8: Overview of EnrichmentMap panels in Cytoscape.

(i) Cytoscape ‘Control Panel’, which contains ‘Networks’, ‘Styles’ and ‘Select’ tabs as well as the ‘EnrichmentMap’ main panel. (ii) The ‘Table Panel’ contains tables with node, edge and network  attributes, as well as an enrichment map ‘Heat Map’ panel displaying  expression for genes associated with selected nodes and edges. (iii) Cytoscape search bar, which can be used to search for genes in the enrichment map. (iv) ‘Node Table’ containing values for all variables associated with each node in the network. (v) Q-value or P-value slider bar. By default, the slider is set to Q value if the data contains Q values but can be changed to use P values by selecting the ‘P-value’ radio button. All nodes that pass the initial Q-value threshold are displayed in the enrichment map. By moving the slider to the left, the Q-value threshold is adjusted to a lower value, removing any nodes that do not pass the Q-value threshold. The currently set threshold will be displayed in the  accompanying text box. Thresholds can be manually adjusted by modifying  the text box value directly. (vi) ‘Edge Cutoff (Similarity)’  slider bar. The slider bar modifies the similarity threshold. The  similarity threshold can only be increased; i.e., edges are required to  have more genes in common in order to remain visible, which will remove  edges from the network that do not satisfy the threshold. One can also  manually change the threshold by modifying the text box value directly.

Fig. 9: Example heat map in EnrichmentMap.

Heat map created by selecting the immunoreactive pathway interferon alpha  beta signaling pathway from Reactome. The heat map is useful for  visualization of detailed gene expression patterns for a pathway of  interest. Magenta corresponds to high expression, and green corresponds  to low expression. This heat map is for GSEA results, thus the ‘leading  edge’ genes are highlighted in yellow; these genes have the largest  contribution to the enrichment signal. (ivi) Additional controls in the Heat Map panel include sorting options (i), selection of genes to include (ii), expression data visualization options (iii), data compression options (iv), the option to show values (v) and heat map settings (vi).

Fig. 10: Resulting publication-ready enrichment map.

(i) Overall thumbnail view of the publication-ready enrichment map created with parameters FDR Q value < 0.01, and combined coefficient >0.375 with combined constant = 0.5. (ii) Zoomed-in section of publication-ready enrichment map, in which red and blue nodes represent mesenchymal and immunoreactive phenotype pathways, respectively. Nodes were manually laid out to form a clearer picture.  Clusters of nodes were labeled using the AutoAnnotate Cytoscape  application. Individual node labels were removed for clarity using the  publication-ready button in EnrichmentMap and exported as PNG and PDF  files. A legend was manually added at the bottom of the figure.


Fig. 11: Collapsed enrichment map.

The enrichment map was summarized by collapsing node clusters using the AutoAnnotate application. Each cluster of nodes in Fig. 10 is now represented as a single node. The network was scaled for better  node distribution and manually adjusted to reduce node and label  overlap. (i) Overall thumbnail view of the entire collapsed enrichment map. (ii) Zoomed-in section of the publication-ready collapsed enrichment map that corresponds to the zoomed-in network in Fig. 10 (ii).

Fig. 12: Subnetwork example.

Subnetwork of the main enrichment map (Fig. 10) was manually created by selecting pathways with the top NES values and  creating a new network from the selection. Red and blue nodes are  mesenchymal and immunoreactive phenotypes, respectively. Clusters of  nodes were automatically labeled using the AutoAnnotate application.  Annotations in the subnetwork may differ slightly from those in the main network, as word counts were normalized on a network basis.

Fig. 13: Generic enrichment map legend.

Enrichment map attributes can be copied for use in a custom figure legend. Only  components relevant to the analysis should be copied. Post-analysis  ‘Signature set’ nodes are included in the generic legend (not covered in this protocol). Post-analysis nodes highlight pathways in the  enrichment map that contain specific genes of interest such as targets  of drugs or microRNAs.




https://blog.sciencenet.cn/blog-565558-1245044.html

上一篇:2019年8月文献阅读总结
下一篇:0811=miRNA=target
收藏 IP: 113.108.133.*| 热度|

0

该博文允许注册用户评论 请点击登录 评论 (0 个评论)

数据加载中...

Archiver|手机版|科学网 ( 京ICP备07017567号-12 )

GMT+8, 2025-1-10 07:47

Powered by ScienceNet.cn

Copyright © 2007- 中国科学报社

返回顶部