|||
Nature Protocols:
Pathway enrichment analysis helps researchers gain mechanistic insight into gene lists generated from genome-scale (omics) experiments. This method identifies biological pathways that are enriched in a gene list more than would be expected by chance. We explain the procedures of pathway enrichment analysis and present a practical step-by-step guide to help interpret gene lists resulting from RNA-seq and genome-sequencing experiments. The protocol comprises three major steps: definition of a gene list from omics data, determination of statistically enriched pathways, and visualization and interpretation of the results.协议包括三个主要步骤:从组学数据中定义基因列表,确定统计丰富的途径以及结果的可视化和解释。 We describe how to use this protocol with published examples of differentially expressed genes and mutated cancer genes; however, the principles can be applied to diverse types of omics data. The protocol describes innovative visualization techniques, provides comprehensive background and troubleshooting guidelines, and uses freely available and frequently updated software, including g:Profiler, Gene Set Enrichment Analysis (GSEA), Cytoscape and EnrichmentMap. The complete protocol can be performed in ~4.5 h and is designed for use by biologists with no prior bioinformatics training.
Comprehensive quantification of DNA, RNA and proteins in biological samples is now routine. The resulting data are growing exponentially, and their analysis helps researchers discover novel biological functions, genotype–phenotype relationships and disease mechanisms,. However, analysis and interpretation of these data represent a major challenge for many researchers. Analyses often result in long lists of genes that require an impractically large amount of manual literature searching to interpret. A standard approach to addressing this problem is pathway enrichment analysis, which summarizes the large gene list as a smaller list of more easily interpretable pathways. Pathways are statistically tested for over-representation in the experimental gene list relative to what is expected by chance, using several common statistical tests that consider the number of genes detected in the experiment, their relative ranking and the number of genes annotated to a pathway of interest. For instance, experimental data containing 40% cell cycle genes are surprisingly enriched, given that only 8% of human protein-coding genes are involved in this process. 例如,实验数据中包含40%细胞周期相关基因,而背景中只有8%的人类蛋白质编码基因参与该过程,说明这一过程在该实验数据中显著富集。
In a recent example, we used pathway enrichment analysis to help identify histone and DNA methylation by the polycomb repressive complex (PRC2) 多梳抑制复合物(PRC2) as the first rational therapeutic target for ependymoma 室管膜瘤, one of the most prevalent childhood brain cancers. This pathway is targetable by available drugs such as 5-azacytidine, which was used on a compassionate basis in a terminally ill patient and stopped rapid metastatic tumor growth. In another example, we analyzed rare copy-number variants (CNVs) in autism and identified several significant pathways affected by gene deletions, whereas few significant hits were identified with case–control association tests of single genes or loci,. These examples illustrate the useful insights into biological mechanisms that can be achieved using pathway enrichment analysis.
This protocol covers pathway enrichment analysis of large gene lists typically derived from genome-scale (omics) technology. The protocol is intended for experimental biologists who are interested in interpreting their omics data. It requires only an ability to learn and use ‘point-and-click’ computer software, although advanced users can benefit from the automatic analysis scripts we provide as Supplementary Protocols –. We analyze previously published human gene expression and somatic mutation data as examples,,; however, our conceptual framework is applicable to analysis of lists of genes or biomolecules from any organism derived from large-scale data, including proteomics, genomics, epigenomics and gene-regulation studies. We extensively use pathway enrichment analysis for many projects and have evaluated numerous available tools,,,. The software packages we cover here have been selected for their ease of use, free access, advanced features, extensive documentation and up-to-date databases, and they are ones we use daily in our research and recommend to collaborators and students. In addition, we have provided feedback to the developers of these tools, allowing them to implement features we have needed in published analyses. These tools are g:Profiler, GSEA, Cytoscape and EnrichmentMap, all freely available online:
g:Profiler (https://biit.cs.ut.ee/gprofiler/)
Cytoscape (http://www.cytoscape.org/)
EnrichmentMap (http://www.baderlab.org/Software/EnrichmentMap)
This section outlines the major stages of pathway enrichment analysis. A detailed step-by-step protocol is provided in the Procedure below. Pathway enrichment analysis involves three major stages (Fig. ; see Box for basic definitions).
1Definition of a gene list of interest using omics data. An omics experiment comprehensively measures the activity of genes in an experimental context. The resulting raw dataset generally requires computational processing, such as normalization and scoring, to identify genes of interest, considering the experimental design. For example, a list of genes differentially expressed between two groups of samples can be derived from RNA-seq data. Gene lists derived from other types of omics experiments, such as gene expression microarrays, quantitative proteomics,, germline and somatic genome sequencing,,, and global DNA methylation assays,, can be used in this protocol; however, each type of data may require specific pre-processing steps (see ‘Comparison to alternative methods’ section).
Pathway enrichment analysis. A statistical method is used to identify pathways enriched in the gene list from stage 1, relative to what is expected by chance. All pathways in a given database are tested for enrichment in the gene list (see Box for a list of pathway databases). Several established pathway enrichment analysis methods are available, and the choice of which to use depends on the type of gene list (see ‘Comparison to alternative methods’ section).
Visualization and interpretation of pathway enrichment analysis results. Many enriched pathways can be identified in stage 2, often including related versions of the same pathway. Visualization can help identify the main biological themes and their relationships for in-depth study and experimental evaluation.
Gene lists derived from diverse omics data undergo pathway enrichment analysis, using g:Profiler or GSEA, to identify pathways that are enriched in the experiment. Pathway enrichment analysis results are visualized and interpreted in Cytoscape using its EnrichmentMap, AutoAnnotate, WordCloud and clusterMaker2 applications. Protocol overview is shown on the left, starting from gene list input, and example outputs at each stage are shown on the right.
Genome-scale experiments generate raw data that must be processed to obtain gene-level information suitable for pathway enrichment analysis (Supplementary Protocols and ). The specific processing steps are particular to the omics experiment type and may be standard, and thus usually straightforward to implement, or not, in which case advanced computational skills may be needed for data processing. Standard processing methods are available for established omics technologies and are most conveniently performed by the core facility that generates the data.
There are two major ways to define a gene list from omics data: list or ranked list. Certain omics data naturally produce a gene list, such as all somatically mutated genes in a tumor identified by exome sequencing, or all proteins that interact with a bait in a proteomics experiment. Such a list is suitable for direct input into pathway enrichment analysis using g:Profiler (Step 6A). Other omics data naturally produce ranked lists. For example, a list of genes can be ranked by differential gene expression score or sensitivity in a genome-wide CRISPR screen. Some pathway enrichment analysis approaches analyze a ranked gene list filtered by a particular threshold (e.g., FDR-adjusted P value <0.05 and fold-change >2). Alternative approaches, such as GSEA, are designed to analyze ranked lists of all available genes and do not require a threshold. A whole-genome ranked list is suitable for input into pathway enrichment analysis using GSEA (Step 6B). A partial (non-whole-genome) ranked gene list should be analyzed using g:Profiler.
As an example, we describe the analysis of raw RNA-seq data from ovarian cancer samples to define a ranked gene list. DNA sequence reads are quality filtered (e.g., by trimming to remove low-quality bases) and mapped to a genome-wide reference set of transcripts to enable counting of reads per transcript. Read counts are aggregated at the gene level (counts per gene). Typically, RNA-seq data for multiple biological replicates (three or more) for each of multiple experimental conditions (two or more, e.g., treatment versus control) are available. Read counts per gene are normalized across all samples to remove unwanted technical variation between samples, for example, due to differences in sequencing lane or total read number per sequencing run,,. Next, read counts per gene are tested for differential expression across sample groups (e.g., treatment versus control) (Supplementary Protocols and for RNA-seq and microarray data, respectively). Software packages such as edgeR, DESeq, limma/voom, and Cufflinks implement procedures for RNA-seq data normalization and differential expression analysis. Differential gene expression analysis results include: (i) the P value of the significance of differential expression; (ii) the related Q value (a.k.a. adjusted P value) that has been corrected for multiple testing across all genes, for example, by using the Benjamini–Hochberg false-discovery rate (BH-FDR) procedure (Box ); (iii) effect size and direction of expression change so that upregulated genes are positive and at the top of the list and downregulated genes are negative and at the bottom of the list, often expressed as log-transformed fold-change. The gene list is then ranked by one or more of these values (e.g., −log10 P value multiplied by the sign of log-transformed fold-change) and studied using pathway enrichment analysis.
The default analysis implemented in g:Profiler and similar web-based tools,,, searches for pathways whose genes are significantly enriched (i.e., over-represented) in the fixed list of genes of interest, as compared to all genes in the genome (Step 6A)(Box ). The P value of the enrichment of a pathway is computed using a Fisher’s exact test and multiple-test correction is applied (Box ).
g:Profiler also includes an ordered enrichment test, which is suitable for lists of up to a few thousand genes that are ordered by a score, whereas the rest of the genes in the genome lack meaningful signal for ranking. For example, significantly mutated genes may be ranked by a score from a cancer driver prediction method. This analysis repeats a modified Fisher’s exact test on incrementally larger sub-lists of the input genes and reports the sub-list with the strongest enrichment P value for each pathway. g:Profiler searches a collection of gene sets representing Gene Ontology (GO) terms, pathways, networks, regulatory motifs, and disease phenotypes. Major categories of gene sets can be selected to customize the search.
Pathway enrichment methods that use the Fisher’s exact test or related tests require the definition of background genes for comparison. All annotated protein-coding genes are often used as default. This leads to inappropriate inflation of P values and false-positive results if the experiment can directly measure only a subset of all genes. For example, setting a custom background is important in analyzing data from targeted sequencing or phosphoproteomics experiments. The appropriate custom background would include all genes in the sequencing panel or all known phosphoproteins, respectively.
Pathway enrichment analysis of a ranked gene list is implemented in the GSEA software (Step 6B) (Box ). GSEA is a threshold-free method that analyzes all genes on the basis of their differential expression rank, or other score, without prior gene filtering. GSEA is particularly suitable and is recommended when ranks are available for all or most of the genes in the genome (e.g., for RNA-seq data). However, it is not suitable when only a small portion of genes have ranks available, for example, in an experiment that identifies significantly mutated cancer genes (Stage 2A; Step 6A).
GSEA searches for pathways whose genes are enriched at the top or bottom of the ranked gene list, more so than expected by chance alone. For instance, if the topmost differentially expressed genes are involved in the cell cycle, this suggests that the cell cycle pathway is regulated in the experiment. By contrast, the cell cycle pathway is probably not significantly regulated if the cell cycle genes appear randomly scattered through the whole ranked list. 例如,如果最显著差异表达的基因参与细胞周期,则表明该细胞周期途径在实验中受到调控。 相比之下,如果细胞周期基因随机出现在整个排名列表中,则细胞周期途径可能不会受到明显调节。 To calculate an enrichment score (ES) for a pathway, GSEA progressively examines genes from the top to the bottom of the ranked list, increasing the ES if a gene is part of the pathway and decreasing the score otherwise. These running sum values are weighted, so that enrichment in the very top- (and bottom-) ranking genes is amplified, whereas enrichment in genes with more moderate ranks are not amplified. The ES score is calculated as the maximum value of the running sum and normalized relative to pathway size, resulting in a normalized enrichment score (NES) that reflects the enrichment of the pathway in the list. Positive and negative NES values represent enrichment at the top and bottom of the list, respectively. Finally, a permutation-based P value is computed and corrected for multiple testing to produce a permutation-based false-discovery rate (FDR) Q value that ranges from 0 (highly significant) to 1 (not significant) (Box ). The same analysis is performed starting from the bottom of the ranked gene list to identify pathways enriched in the bottom of the list. Resulting pathways are selected using the FDR Q value threshold (e.g., Q < 0.05) and ranked using NES. In addition, the ‘leading edge’ aspect of the GSEA analysis identifies specific genes that most strongly contribute to the detected enrichment signal of a pathway.
GSEA has two methods for determining the statistical significance (P value) of the ES: gene set permutation and phenotype permutation. The gene set permutation test requires a ranked list, and GSEA compares the observed pathway ES to a distribution of scores obtained by repeating the analysis with randomly sampled gene sets of matching sizes (e.g., 1,000 times). The phenotype permutation test requires expression data for all samples (e.g., biological replicates), along with a definition of sample groups called ‘phenotypes’ that are compared with each other (e.g., cases versus controls; tumor versus. normal samples). The observed pathway ES is compared to a distribution of scores obtained by randomly shuffling the samples among phenotype categories and repeating the analysis (e.g., 1,000 times), including computation of the ranked gene list and resulting pathway ES. Gene set permutation is recommended for studies with limited variability and biological replicates (i.e., two to five per condition). In this case, differential gene expression values should be computed outside of GSEA, using methods that include variance stabilization (such as edgeR, DESeq and limma/voom,) and imported into the GSEA software before pathway analysis. Phenotype permutation should be used with a larger number of replicates (e.g., at least ten per condition). The main advantage of the phenotype permutation approach is that it maintains the structure of gene sets with biologically important gene correlations during permutation, in contrast to the gene set permutation approach. This protocol covers only gene set permutation because it is appropriate for the most common use case of pathway enrichment analysis. Phenotype permutation is computationally expensive and, for the current version of GSEA, requires custom programming to compute ESs and differential expression statistics separately for thousands of phenotype randomizations. For advanced users, we provide a supplementary protocol for this procedure (Supplementary Protocol ).
By default, the GSEA desktop software searches the MSigDB gene set database, which includes pathways, published gene signatures, microRNA target genes and other gene set types (Box ). The user can also provide a custom database as a text-based GMT (Gene Matrix Transposed) file in which each line defines a pathway, with its name, identifier and a list of genes it contains. Gene identifiers in the GMT file must match those in the input gene list.
Pathway information is inherently redundant, as genes often participate in multiple pathways, and databases may organize pathways hierarchically by including general and specific pathways with many shared genes (e.g., ‘cell cycle’ and ‘M-phase of cell cycle’). Consequently, pathway enrichment analysis often highlights several versions of the same pathway. Collapsing redundant pathways into a single biological theme simplifies interpretation. We recommend addressing such redundancy with visualization methods such as EnrichmentMap, ClueGO and others,,. An ‘enrichment map’ is a network visualization that represents overlaps among enriched pathways (Fig. ), whereas ‘EnrichmentMap’ refers to the Cytoscape application that creates the visualization. Pathways are shown as circles (nodes) that are connected with lines (edges) if the pathways share many genes. Nodes are colored by ES, and edges are sized on the basis of the number of genes shared by the connected pathways. Network layout and clustering algorithms automatically group similar pathways into major biological themes. The EnrichmentMap software takes as input a text file containing pathway enrichment analysis results and another text file containing the pathway gene sets used in the original enrichment analysis. Interactive exploration of pathway ES (filtering nodes) and connections between pathways (filtering edges) is possible (Step 9A(xii and xiii) and 9B(xiii and xiv)). Multiple enrichment analysis results can be simultaneously visualized in a single enrichment map, in which case different colors are used on the nodes for each enrichment. If the gene expression data are optionally loaded, clicking on a pathway node will display a gene expression heat map of all genes in the pathway.
An enrichment map helps identify interesting pathways and themes. First, expected themes should be identified to help validate the pathway enrichment analysis results (positive controls). For instance, growth-related pathways and other hallmarks of cancer are expected to be identified in analyses of cancer genomics datasets. Second, pathways not previously associated with the experimental context are evaluated more carefully as potential discoveries. Pathways and themes with the strongest ESs should be studied first, followed by progressively weaker signals (Step 12). Third, interesting pathways are examined in more detail, examining genes within the pathways (e.g., expression heat maps and the GSEA leading edge genes). Further, gene expression values can be overlaid on a pathway diagram, if available, from databases such as Pathway Commons, Reactome, KEGG or WikiPathways, using tools such as PathVisio. If a diagram is not available, tools such as STRING or GeneMANIA can be used with Cytoscape to define an interaction network among pathway genes for expression overlay. This helps in visual identification of the pathway components (e.g., single genes or entire signaling cascades) that are the most altered (e.g., differentially expressed) in the experiment. In addition, master regulators for enriched pathways can be searched for by integrating gene sets of miRNA or transcription factor targets using the EnrichmentMap post-analysis tool. Finally, pathway enrichment analysis results can be published to support a scientific conclusion (e.g., functional differences of two cancer subtypes), or used for hypothesis generation or planning of experiments to support the identification of novel pathways. More pathway enrichment analysis examples and deeper explanation of core concepts is provided at http://www.pathwaycommons.org/guide/.
Pathway enrichment analysis of omics data has several advantages as compared to analysis of single genes, transcripts or proteins. First, it improves statistical power in two ways: (i) it aggregates counts of mutations across all the genes and genomic regions involved in the given cell mechanism, providing a higher number of counts, which makes statistical analyses more reliable; and (ii) it reduces the dimensionality from tens of thousands of genes or millions of genomic regions (e.g., SNPs) to a much smaller number of ‘systems’ or ‘pathways’, thereby reducing the cost of multiple hypothesis testing. Second, results are often easier to interpret because the analysis is phrased at the level of familiar concepts such as ‘cell cycle’. Third, the approach can help identify potential causal mechanisms and drug targets. Fourth, results obtained from related, but different, data may be more comparable because results are projected onto a smaller, shared feature space (i.e., a limited number of pathways). Fifth, the approach facilitates integration of diverse data types, such as genomics, transcriptomics and proteomics, which can all be mapped to the same pathways. Thus, projecting disease data onto known mechanisms increases statistical and interpretative power.
The following limitations are important to consider when interpreting pathway enrichment analysis results, in general, including those covered by this protocol. Additional limitations apply, depending on the omics data type (see ‘Application to diverse omics data’ section). Advantages and disadvantages of specific and alternative pathway enrichment analysis methods are presented in the ‘Comparison to alternative methods’ section.
Enrichment analysis is more effective for pathways in which multiple genes have strong biological signals (e.g., differential expression). For instance, in a transcriptomics experiment, we assume that evolution has optimized a cell to express a pathway only when needed, and that pathway activation or deactivation can be identified as coordinated activity of many genes in a pathway. Pathways in which activity is controlled by only a few genes or not controlled by gene expression (e.g., by post-translational regulation) will never be observed as enriched. 如果同路的变化不是通过基因表达反映出来的,则无法评估 Some pathway analysis methods address this by using activating and inhibiting gene interactions to construct quantitative models of pathway activity that include genes that are not differentially expressed yet are still important regulators. However, these methods require pathway models with detailed biochemical and regulatory gene interactions that are obtained through focused experiments and thus are in limited supply (Box ).
Pathway boundaries tend to be arbitrary, and different databases will disagree about which genes are involved in a given pathway. By using multiple databases, multiple pathway definitions can be analyzed, and some may be better than others at explaining the experimental data.
Some pathway enrichment methods, such as those based on the Fisher’s exact test, are statistically more likely to identify larger pathways as significant. Users can address this limitation by selecting an upper limit for the size of the gene sets considered in the analysis.
Multi-functional genes that are highly ranked in the gene list may lead to enrichment of many different pathways, some of which are not relevant to the experiment. Repeating the analysis after excluding such genes may reveal pathways whose enrichment is overly dependent on their presence or confirm the robustness of pathway enrichment.
Pathway databases, and therefore enrichment results, are biased toward well-known pathways. In fact, pathway enrichment analysis ignores genes with no pathway annotations, sometimes called the ‘dark matter of the genome’, and these genes should be studied separately. For example, non-coding RNA genes currently lack systematic annotations and are not directly usable in pathway enrichment analysis. ncRNA没有得到很好的注释,因此不能用于功能富集分析
Most enrichment analysis methods make unrealistic assumptions of statistical independence among genes as well as pathways. Some genes may always be co-expressed (e.g., genes within a protein complex), and some pathways have genes in common. Thus, standard FDRs, which assume statistical independence between tests, are often either more or less conservative than ideal. Nonetheless, they should still be used to adjust for multiple testing and rank enriched pathways for exploratory analysis and hypothesis generation. Custom permutation tests may lead to better estimates of false discovery (see ‘Comparison to alternative methods’ section).
Pathway enrichment analysis benefits greatly from careful experimental design. Otherwise, the analysis may reveal apparently meaningful results caused by experimental biases or other confounders. This section covers a range of experimental aspects that must be considered before performing this protocol.
The experimental conditions must be well defined such that the major variations observed are responses that the experimenter would like to monitor and that are related to the biological question of interest (e.g., tumor versus normal, treated versus untreated, comparison of four disease subtypes, time series).
Biological replicates are independently processed samples obtained from distinct organisms or cell lines that are required for measuring variability across samples and to compute statistical significance (P values). Lack of replication (i.e., one sample per group) will not permit robust estimation of the significance of the signal. Insufficient replication may result in lack of signal in the data (e.g., no significantly differentially expressed genes). The larger the variation in the set of samples, the more biological replicates are needed to accurately measure the signal. For systems with lower variability (i.e., model organisms with the same genetic background in controlled laboratory conditions, or stable cell lines derived from the same clone), at least three to four biological replicates are recommended per condition for differential analysis with variance stabilization normalization. Variance stabilization uses a global statistical model to ‘stabilize’ gene-wise variance estimates to reduce inaccuracies resulting from few replicates. For experiments with higher variability (e.g., tumor samples), more replicates are required; ideally, a pilot experiment followed by formal statistical power calculations (sometimes called sensitivity testing) should be used to determine the minimal number of replicates required to identify the signal of differentially expressed genes or enriched pathways. Technical replicates comprising repeated experiments of the same samples are usually not needed for well-established experimental techniques, such as RNA-seq, that have low technical variability, but can be helpful for novel techniques.
Differences in factors not related to the experimental question should be avoided or at least balanced across conditions so that statistical techniques such as generalized linear models can correct for each factor. Common factors include sequencing batch, nucleic acid extraction protocol, subject age and many others. Otherwise, it may be impossible to accurately separate the experimental signals coming from the experimental response from the confounding factors. Knowing important factors in advance supports correct experimental design. Statistical exploratory analyses such as clustering or principal component analysis (PCA) can help identify unknown factors. For example, cases and controls are expected to cluster separately and not by processing batch.
Outlier samples may considerably differ from others because of major experimental or technical problems, such as contamination or sample mix-up. Alternatively, they may present extreme biological features, such as tumor samples with exceptionally aggressive phenotypes. 它们可能表现出极端的生物学特征,例如具有异常侵袭性表型的肿瘤样品。 Unbiased identification of outlier samples is possible using statistical techniques such as PCA or clustering. Pathway enrichment analysis should be performed with and without outliers to ensure robust results. Systematic removal of outliers may be justified to reduce variability in the experiment.
Some experimental methods can be tuned to be more or less sensitive. For instance, the number of reads in RNA-seq experiments influences downstream analysis. For quantifying gene expression in a biological system with modest variability and testing differential expression with variance stabilization, at least three to five replicates and 10 million mapped reads are required. Substantially greater sequencing depth, such as 50–100 million mapped reads, is required to investigate splice isoforms, to detect poorly expressed genes or for samples with complex cellular mixtures such as surgical resection specimens.
We recommend searching enrichment of pathway gene sets only at first, as these capture familiar normal cellular processes that are easy to interpret. GO biological process terms and manually curated molecular pathways from Reactome, Panther, HumanCyc and NetPath are good resources for human pathways (Box ). GO biological process annotations include a mix of manually curated and electronically inferred sources.
A large fraction of gene annotations in GO originate from automatic data analysis and are not verified by human curators. These have the evidence code ‘inferred from electronic annotation’ (IEA). Earlier literature has cautioned against analyzing and interpreting IEA-labeled annotations, whereas more recent studies suggest that these are often as reliable as annotations assigned by human curators. For high-confidence analyses of data from human and common model organisms that have many manually curated annotations, we generally recommend comparing versions of the analysis with and without filtering of IEA annotations to verify robustness. However, IEA annotations make up the majority of information in less-well-studied species and should be used by default in these cases. Removing IEA-coded annotations may bias the analysis toward well-studied biological processes.
Different types of gene sets help answer a variety of questions. For instance, non-pathway gene sets corresponding to microRNA and transcription factor targets can be used to discover important regulators,. However, simultaneously analyzing all available types of gene sets reduces data interpretability. It may also lead to false negatives, as the increased number of conducted tests increases the effect of multiple testing correction and reduces the multiple-test adjusted significance of individual pathways. We therefore recommend performing the analysis of non-pathway and pathway gene sets separately.
It is often beneficial to exclude numerous small pathways because they are redundant with larger pathways and complicate interpretation, and their abundance makes multiple-testing correction more stringent. Large pathways should be also excluded, as these are overly general (e.g., ‘metabolism’), they do not contribute to interpretability of results, and their statistical significance can be inflated when using certain statistical enrichment methods (e.g., Fisher’s exact test). For analyzing human gene expression data, we often recommend excluding pathway gene sets with <10–15 genes and >200–500 genes, although upper bounds of 200–2,000 genes can be found in the literature. However, for non-human organisms and other types of gene sets, which may have different gene set size distributions, larger sets may need to be included. Filtering of pathways depends on the experimental context, as different areas of biology have variable coverage in pathway databases. One can determine the lower and upper bounds of pathway size by examining the sizes of several pathways of interest that are expected to be relevant to the experiment.
Pathway enrichment analysis depends on gene sets and databases used in the analysis, and many recent studies using pathway enrichment analysis are strongly impacted by outdated resources. For improved reproducibility and transparency of research, investigators should report in publications the analysis date and versions of pathway enrichment analysis software and gene set databases used, as well as all analysis parameters. In addition to enrichment maps, authors should consider adding their studied gene lists and complete tables of enriched pathways as supplementary information.
Genes are associated with many diverse database identifiers (IDs). We recommend using unambiguous, unique and stable IDs, as some IDs become obsolete over time. For human genes, we recommend using the Entrez Gene database IDs (e.g., corresponds to MDM2) or gene symbols (MDM2 is the official symbol recommended by the HUGO Gene Nomenclature Committee). As gene symbols change over time, we recommend maintaining both gene symbols and Entrez Gene IDs. The g:Profiler and related g:Convert tools support automatic conversion of multiple ID types to standard IDs.
Unexpected biological themes revealed in a pathway analysis may indicate problems with experimental design, data generation or analysis. For example, enrichment of the apoptosis pathway may indicate a problem with the experimental protocol that led to increased cell death during sample preparation. In these cases, the experimental design and data generation should be carefully reviewed before further data interpretation.
This protocol uses RNA-seq data and somatic mutation data as examples because these data types are frequently encountered. However, the general concepts of pathway enrichment analysis that we present are applicable to many types of experiments that can generate lists of genes, such as single-cell transcriptomics, CNVs, proteomics, phosphoproteomics, DNA methylation and metabolomics. Most data types require protocol modifications, which we only briefly discuss here. With certain data types, specialized computational methods are required to produce a gene list that is appropriate for pathway enrichment analysis, whereas with other data types, a specialized pathway enrichment analysis technique is required. Issues specific to data types and experimental methods must be considered, including:
Different gene identifiers are recommended for certain data types. We recommend UniProt accession numbers for proteins (e.g., for MDM2) and Human Metabolome Database IDs for metabolites (e.g., ATP is denoted as ).
Certain types of omics experiments by design capture only a subset of genes or proteins. To address this limited coverage, pathway enrichment analysis must define a custom background gene set of the genes that can be measured in the experiment. For example, phosphoproteomics experiments measure only proteins with one or more phosphorylation sites and thus must use the set of genes encoding phosphoproteins as the custom background gene set. 例如,磷酸蛋白质组学实验仅测量具有一个或多个磷酸化位点的蛋白质,因此必须使用编码磷蛋白的基因集作为定制背景基因集。 Otherwise, pathway enrichment analysis would reveal inflated P values for general processes such as kinase signaling and protein phosphorylation.
Pathway enrichment analysis of short non-coding genomic regions such as transcription factor binding sites from ChIP-seq experiments need additional consideration. Genomic regions must be mapped to protein-coding genes and corrected for biases such as increased signal in longer genes. Tools such as GREAT automatically perform both tasks.
Large genomic intervals that span multiple genes (e.g., from genome-wide associations, CNV and differentially methylated regions) require specialized enrichment tests such as the PLINK CNV gene set burden test or INRICH. Standard enrichment tests often reveal genes clustered in the genome whose signals are strongly statistically inflated because each gene is incorrectly counted as an independent signal. 标准富集测试通常会揭示基因组中聚集的基因,其信号在统计上被强烈夸大,因为每个基因被错误地视为独立信号。 Gene types that are correlated with genomic position include olfactory receptors, histones, major histocompatibility complex (MHC) members and homeobox transcription factors. A simple solution to address genomic clustering of genes in a pathway involves selecting only one representative gene from each functionally homogeneous genomic cluster before enrichment analysis.
For rare genetic variants, case–control pathway ‘burden’ tests are the most appropriate pathway enrichment analysis method (see ‘Comparison to alternative’ methods section).
This protocol recommends the use of g:Profiler and GSEA software for pathway enrichment analysis. g:Profiler, analyzes gene lists using Fisher’s exact test and ordered gene lists using a modified Fisher’s test. It provides a graphical web interface and access via R and Python programming languages. The software is frequently updated, and the gene set database can be downloaded as a GMT file (http://biit.cs.ut.ee/gprofiler). GSEA analyzes ranked gene lists using a permutation-based test. GSEA分析排名的基因列表基于置换的检验。 The software typically runs as a desktop application (http://software.broadinstitute.org/gsea). Hundreds of pathway enrichment analysis tools exist (reviewed in ref. ), although many rely on out-of-date pathway databases or lack unique features as compared to the most commonly used tools; as such, we do not cover them here. The following are alternative free pathway enrichment analysis software tools. Although we do not cover these tools in our protocol, we recommend the following, on the basis of their ease of use, unique features or advanced programming features.
Enrichr: This is a web-based enrichment analysis tool for non-ranked gene lists that is based on Fisher’s exact test. It is easy to use, has rich interactive reporting features, and includes >100 gene set databases (called libraries), including >180,000 gene sets in multiple categories. Functionality is similar to that of the g:Profiler web server described in this protocol.
Camera: This R Bioconductor package analyzes gene lists and corrects for inter-gene correlations such as those apparent in gene co-expression data. The software is available as part of the limma package in Bioconductor (https://bioconductor.org/packages/release/bioc/html/limma.html; this is an advanced tool that requires programming expertise; Supplementary Protocol ).
GOseq: This R Bioconductor package analyzes gene lists from RNA-seq experiments by correcting for user-selected covariates such as gene length (https://bioconductor.org/packages/release/bioc/html/goseq.html; this is an advanced tool that requires programming expertise).
Genomic Regions Enrichment of Annotations Tool (GREAT): In contrast to common methods that analyze gene lists, GREAT analyzes genomic regions such as DNA binding sites and links these to nearby genes for pathway enrichment analysis (http://bejerano.stanford.edu/great/public/html/). See ‘Application to diverse omics data’ section.
This protocol recommends the use of EnrichmentMap for pathway enrichment analysis visualization to aid interpretation. EnrichmentMap is a Cytoscape application that visualizes the results from pathway enrichment analysis and eases interpretation by displaying pathways as a network in which overlapping pathways are clustered together to identify major biological themes in the results (http://www.baderlab.org/Software/EnrichmentMap). Two alternative useful visualization tools are:
ClueGO: This Cytoscape application is conceptually similar to EnrichmentMap and provides a network-based visualization to reduce redundancy of results from pathway enrichment analysis. It also includes a pathway enrichment analysis feature for analysis of GO annotations using Fisher’s exact tests. However, it currently supports only GO gene sets.
PathVisio: This desktop application provides a complementary visualization approach to those of EnrichmentMap and ClueGO. PathVisio enables the user to visually interpret omics data in the context of gene and protein interactions in a pathway of interest. PathVisio colors pathway genes according to user-provided omics data (). This is the main advantage of PathVisio as compared to EnrichmentMap and ClueGO.
Most pathway enrichment analysis methods treat all genes in a pathway uniformly and ignore gene interactions. By contrast, topology-aware methods explicitly model the interactions between genes. CePa, GANPA and THINK-Back use physical gene interactions or co-expression networks to assign a weight to each gene in each pathway. Weights can be derived from measures of the gene importance in the network such as degree, the number of gene connections and betweenness centrality, and can be integrated into a traditional pathway enrichment analysis method such as GSEA. Methods such as SPIA, Pathway-Express and EnrichNet generate an ES for the entire pathway that considers pathway regulatory interactions such as activation and inhibition. Although useful and potentially more accurate, regulatory and biochemical gene interactions are available for fewer genes and pathways as compared to physical interactions networks and co-expression. We anticipate that these methods will become more useful as more gene interactions in pathways are characterized in detailed molecular experiments. However collecting and curating high-quality and biochemically detailed pathway data from the literature is currently complex and expensive. Therefore, pathway enrichment analysis methods described in this protocol will probably remain the most widely used approaches for the foreseeable future.
Current pathway enrichment analysis methods provide a useful high-level overview of the pathways active in a genomics experiment. However, these methods consider a simplified pathway view that involves only gene sets. Next-generation pathway analysis methods will integrate more biological pathway details, build pathway models based on multiple types of genomics data measured across many samples, and consider positive and negative regulatory relationships in the data. For instance, qualitative mathematical modeling parameterized with single-cell RNA-seq data may one day enable accurate predictions of drug combinations capable of treating a given disease under study.
This step-by-step protocol explains how to complete pathway enrichment analysis using g:Profiler (filtered gene list) and GSEA (unfiltered, whole-genome, ranked gene list), followed by visualization and interpretation using EnrichmentMap. The example data provided for the g:Profiler analysis is a list of genes with frequent somatic single nucleotide variants (SNVs) identified in The Cancer Genome Atlas (TCGA) exome-sequencing data of 3,200 tumors of 12 types. The example data provided for the GSEA analysis is a list of differentially expressed genes in two subtypes of ovarian cancer defined by TCGA.
We provide downloadable example files that are referred to throughout the protocol (Supplementary Tables ). We recommend saving all these files in a personal project data folder before starting. We also recommend creating an additional result data folder to save the files generated while performing the protocol.
A gene list or ranked gene list of interest
Example data for Step 6A. g:Profiler requires a list of genes, one per line, in a text file or spreadsheet, ready to copy and paste into a web page: for this, we use genes with frequent somatic SNVs identified in TCGA exome sequencing data of 3,200 tumors of 12 types. The MuSiC cancer driver mutation detection software was used to find 127 cancer driver genes that displayed higher than expected mutation frequencies in cancer samples (Supplementary Table , which is derived from column B of Supplementary Table in ref. ). Genes are ranked in decreasing order of significance (FDR Q value) and mutation frequency (not shown).
Example data for Step 6B. GSEA requires an RNK file with gene scores. An RNK file is a two-column text file with gene IDs in the first column and gene scores in the second column. All (or most) genes in the genome need to have a score, and the gene IDs need to match those used in the GMT file. We provide a ranked list of differentially expressed genes in ovarian cancer from TCGA (Supplementary Table 2). This cohort was previously stratified into four molecular subtypes on the basis of gene expression data, defined as differentiated, immunoreactive, mesenchymal and proliferative 该队列先前根据基因表达数据分为四种分子亚型,定义为分化,免疫反应,间充质和增生,. We compared the immunoreactive and mesenchymal subtypes to demonstrate the protocol. Step 5 of Supplementary Protocol shows how this file was created.
Pathway gene set database
In Step 6A, g:Profiler maintains an up-to-date set of pathway gene sets from multiple sources and no further input from the user is required, but a database of pathway gene sets is required for Step 6B (GSEA). Supplementary Table 3 contains a database of pathway gene sets used for pathway enrichment analysis in the standard GMT format, downloaded from http://baderlab.org/GeneSets. This file contains pathways downloaded on 1 July 2017 from eight data sources: GO, Reactome, Panther, NetPath, NCI, MSigDB curated gene sets (C2 collection, excluding Reactome and KEGG), MSigDB Hallmark (H collection) and HumanCyc. The gene sets available from http://baderlab.org/GeneSets are updated monthly. A GMT file is a text file in which each line represents a gene set for a single pathway. Each line includes a pathway ID, a name and the list of associated genes in a tab-separated format.
Timing 5 min
1
Download the required input and output files from the of the protocol.
Create two directories, project data folder and results data folder.
Place all downloaded input and example output files into the project data folder.
As you progress through the protocol, place any newly generated files into the results data folder.
2
Install Java v.8 or higher. Follow the download and installation instructions for Java JRE at
3
Download the latest version of GSEA. We recommend the javaGSEA desktop application available at http://www.broadinstitute.org/gsea/downloads.jsp. Free registration is required.
4
Download the latest version of Cytoscape from . Cytoscape v.3.6.0 or higher is required.
5
Install the required Cytoscape applications.
Launch Cytoscape.
Go to Apps → App Manager (i.e., open the Apps menu and select the item App Manager).
In the Install Apps tab search bar, search for EnrichmentMap.
Click on EnrichmentMap Pipeline Collection in the center panel. Verify that it is v.1.0.0 or higher.
Click on the Install button.
Go to the Currently Installed tab and verify that the applications (EnrichmentMap, clusterMaker2, WordCloud and AutoAnnotate) have been installed.
Timing 3–20 min
6
Two major types of gene lists are used in pathway enrichment analysis of omics data. Flat (unranked) gene lists of dozens to thousands of genes can be analyzed using g:Profiler (option A). A statistical threshold is required to compile a gene list from omics data. By contrast, ranked, whole-genome gene lists are suitable for pathway enrichment analysis using GSEA (option B). Gene lists analyzed with GSEA do not require prior filtering using statistical thresholds. Partial, filtered ranked gene lists can also be analyzed with g:Profiler. Select Step 6A or 6B, depending on the type of gene list you have.
(i)
Open the g:Profiler website at http://biit.cs.ut.ee/gprofiler/ (Fig. ).
(ii)
Paste the gene list (Supplementary Table 1) into the Query field in top-left corner of the screen. The gene list can be space-separated or one per line. The organism for the analysis, Homo sapiens, is selected by default. The input list can contain a mix of gene and protein IDs, symbols and accession numbers. Duplicated and unrecognized IDs will be removed automatically, and ambiguous symbols can be refined in an interactive dialogue after submitting the query.
(iii)
Check the box next to Ordered query. This option treats the input as an ordered gene list and prioritizes genes with higher mutation ESs at the beginning of the list.
(iv)
(Optional) Check the box next to No electronic GO annotations. This option will discard less reliable GO annotations (IEAs) that are not manually reviewed.
(v)
Set filters on gene annotation data using the menu on the right. We recommend that initial pathway enrichment analyses includes only biological processes (BPs) of GO and molecular pathways of Reactome. Keep the two checkboxes checked and uncheck all other boxes in the menu.
(vi)
Click on Show Advanced Options to set additional parameters.
(vii)
Set the values of Size of functional category in the dropdown menu to 5 (‘min’) and 350 (‘max’). Large pathways are of limited interpretative value, whereas numerous small pathways decrease the statistical power because of excessive multiple testing.
(viii)
Set the Size of query/term intersection in the dropdown menu to 3. The analysis will consider only more reliable pathways that have three or more genes in the input gene list.
(ix)
Click g:Profile! to run the analysis. A graphical heat map image will be shown, with detected pathways shown along the y axis (left) and associated genes of the input list shown along the x axis (top). Resulting pathways are organized hierarchically into related groups. g:Profiler uses graphical output by default and switches to textual output when a large number of pathways is found. g:Profiler returns only statistically significant pathways with P values adjusted for multiple testing correction (called Q values). By default, results with Q values <0.05 are reported. g:Profiler reports unrecognized and ambiguous gene IDs that can be resolved manually.
(x)
Use the dropdown menu Output type and select the option Generic Enrichment Map (TAB). This file is required for visualizing pathway results with Cytoscape and EnrichmentMap.
(xi)
Click g:Profile! again to run the analysis with the updated parameters. The required link Download data in Generic Enrichment Map (GEM) format will appear under the g:Profiler interface. Download the file from the link and save it on your computer in your result data folder created in Step 1. Example results are provided in Supplementary Table 4.
(xii)
Download the required GMT file by clicking on the link name at the bottom of the Advanced Options form. The GMT file is a compressed ZIP archive that contains all gene sets used by g:Profiler (e.g., gprofiler_hsapiens.NAME.gmt.zip). The gene set files are divided by data source. Download and uncompress the ZIP archive to your project folder. All required gene sets for this analysis will be in the file hsapiens.pathways.Name.gmt (Supplementary_Table5_hsapiens.pathways.NAME.gmt). Place the saved file in your result data folder created in Step 1.
(A)
Pathway enrichment analysis of a gene list using g:Profiler
Timing 3 min
Fig. 2: Screenshot of g:Profiler user interface.
Protocol Step 6A involves populating the g:Profiler interface. Procedural steps are highlighted with rectangles and roman numerals (refer to Step 6A(i–xii)). Purple boxes highlight files that must be downloaded for subsequent protocol steps. The remaining boxes indicate parameters for the analysis.
a, Web page summary of GSEA results showing pathways enriched in the top or bottom of the ranked list, with the ‘na_pos’ and ‘na_neg’ phenotypes corresponding to enrichment in upregulated and downregulated genes, respectively. These have been manually labeled here as mesenchymal and immunoreactive, respectively. Clicking on ‘Snapshot’ under either of the phenotypes will show the top 20 enrichment plots for that phenotype. b, An example enrichment plot for the top pathway in the mesenchymal set. c, An example enrichment plot for the top pathway in the immunoreactive set.
Class/phenotype-specific GSEA output in the web page summary shows how many gene sets were found enriched in upregulated genes, regardless of significance (purple), the total number of gene sets used after size filtering (cyan), the phenotype name (red) and the number of gene sets that pass different thresholds (orange).
a,b, Input fields in the EnrichmentMap interface for g:Profiler (a) and GSEA (b) results. Procedural steps are shown for Step 9A and 9B. Other than the specific input files, the parameters are the same for the two analysis types. Attributes surrounded by a dashed box should be filled out automatically if the user selects an appropriate folder with the required files. Missing file names indicate that EnrichmentMap was unable to find the specified file. Orange boxes indicate optional files. For the examples presented in the protocol, optional files are used for the GSEA analysis but not for the g:Profiler analysis to demonstrate the two distinct use cases. EM, EnrichmentMap.
a,b, Unformatted enrichment maps generated from Steps 6A (a) and 6B (b). Each node (circle) represents a distinct pathway, and edges (blue lines) represent the number of genes overlapping between two pathways, determined using the similarity coefficient. a, Enrichment map of significantly mutated cancer driver genes generated using the g:Profiler analysis in Step 6A. b, Enrichment map of pathways enriched in upregulated genes in mesenchymal (red) and immunoreactive (blue) ovarian cancer samples using the GSEA analysis in Step 6B.
(i) Cytoscape ‘Control Panel’, which contains ‘Networks’, ‘Styles’ and ‘Select’ tabs as well as the ‘EnrichmentMap’ main panel. (ii) The ‘Table Panel’ contains tables with node, edge and network attributes, as well as an enrichment map ‘Heat Map’ panel displaying expression for genes associated with selected nodes and edges. (iii) Cytoscape search bar, which can be used to search for genes in the enrichment map. (iv) ‘Node Table’ containing values for all variables associated with each node in the network. (v) Q-value or P-value slider bar. By default, the slider is set to Q value if the data contains Q values but can be changed to use P values by selecting the ‘P-value’ radio button. All nodes that pass the initial Q-value threshold are displayed in the enrichment map. By moving the slider to the left, the Q-value threshold is adjusted to a lower value, removing any nodes that do not pass the Q-value threshold. The currently set threshold will be displayed in the accompanying text box. Thresholds can be manually adjusted by modifying the text box value directly. (vi) ‘Edge Cutoff (Similarity)’ slider bar. The slider bar modifies the similarity threshold. The similarity threshold can only be increased; i.e., edges are required to have more genes in common in order to remain visible, which will remove edges from the network that do not satisfy the threshold. One can also manually change the threshold by modifying the text box value directly.
Heat map created by selecting the immunoreactive pathway interferon alpha beta signaling pathway from Reactome. The heat map is useful for visualization of detailed gene expression patterns for a pathway of interest. Magenta corresponds to high expression, and green corresponds to low expression. This heat map is for GSEA results, thus the ‘leading edge’ genes are highlighted in yellow; these genes have the largest contribution to the enrichment signal. (i–vi) Additional controls in the Heat Map panel include sorting options (i), selection of genes to include (ii), expression data visualization options (iii), data compression options (iv), the option to show values (v) and heat map settings (vi).
(i) Overall thumbnail view of the publication-ready enrichment map created with parameters FDR Q value < 0.01, and combined coefficient >0.375 with combined constant = 0.5. (ii) Zoomed-in section of publication-ready enrichment map, in which red and blue nodes represent mesenchymal and immunoreactive phenotype pathways, respectively. Nodes were manually laid out to form a clearer picture. Clusters of nodes were labeled using the AutoAnnotate Cytoscape application. Individual node labels were removed for clarity using the publication-ready button in EnrichmentMap and exported as PNG and PDF files. A legend was manually added at the bottom of the figure.
The enrichment map was summarized by collapsing node clusters using the AutoAnnotate application. Each cluster of nodes in Fig. is now represented as a single node. The network was scaled for better node distribution and manually adjusted to reduce node and label overlap. (i) Overall thumbnail view of the entire collapsed enrichment map. (ii) Zoomed-in section of the publication-ready collapsed enrichment map that corresponds to the zoomed-in network in Fig. (ii).
Subnetwork of the main enrichment map (Fig. ) was manually created by selecting pathways with the top NES values and creating a new network from the selection. Red and blue nodes are mesenchymal and immunoreactive phenotypes, respectively. Clusters of nodes were automatically labeled using the AutoAnnotate application. Annotations in the subnetwork may differ slightly from those in the main network, as word counts were normalized on a network basis.
Nature Protocols:
Pathway enrichment analysis helps researchers gain mechanistic insight into gene lists generated from genome-scale (omics) experiments. This method identifies biological pathways that are enriched in a gene list more than would be expected by chance. We explain the procedures of pathway enrichment analysis and present a practical step-by-step guide to help interpret gene lists resulting from RNA-seq and genome-sequencing experiments. The protocol comprises three major steps: definition of a gene list from omics data, determination of statistically enriched pathways, and visualization and interpretation of the results.协议包括三个主要步骤:从组学数据中定义基因列表,确定统计丰富的途径以及结果的可视化和解释。 We describe how to use this protocol with published examples of differentially expressed genes and mutated cancer genes; however, the principles can be applied to diverse types of omics data. The protocol describes innovative visualization techniques, provides comprehensive background and troubleshooting guidelines, and uses freely available and frequently updated software, including g:Profiler, Gene Set Enrichment Analysis (GSEA), Cytoscape and EnrichmentMap. The complete protocol can be performed in ~4.5 h and is designed for use by biologists with no prior bioinformatics training.
Comprehensive quantification of DNA, RNA and proteins in biological samples is now routine. The resulting data are growing exponentially, and their analysis helps researchers discover novel biological functions, genotype–phenotype relationships and disease mechanisms,. However, analysis and interpretation of these data represent a major challenge for many researchers. Analyses often result in long lists of genes that require an impractically large amount of manual literature searching to interpret. A standard approach to addressing this problem is pathway enrichment analysis, which summarizes the large gene list as a smaller list of more easily interpretable pathways. Pathways are statistically tested for over-representation in the experimental gene list relative to what is expected by chance, using several common statistical tests that consider the number of genes detected in the experiment, their relative ranking and the number of genes annotated to a pathway of interest. For instance, experimental data containing 40% cell cycle genes are surprisingly enriched, given that only 8% of human protein-coding genes are involved in this process. 例如,实验数据中包含40%细胞周期相关基因,而背景中只有8%的人类蛋白质编码基因参与该过程,说明这一过程在该实验数据中显著富集。
In a recent example, we used pathway enrichment analysis to help identify histone and DNA methylation by the polycomb repressive complex (PRC2) 多梳抑制复合物(PRC2) as the first rational therapeutic target for ependymoma 室管膜瘤, one of the most prevalent childhood brain cancers. This pathway is targetable by available drugs such as 5-azacytidine, which was used on a compassionate basis in a terminally ill patient and stopped rapid metastatic tumor growth. In another example, we analyzed rare copy-number variants (CNVs) in autism and identified several significant pathways affected by gene deletions, whereas few significant hits were identified with case–control association tests of single genes or loci,. These examples illustrate the useful insights into biological mechanisms that can be achieved using pathway enrichment analysis.
This protocol covers pathway enrichment analysis of large gene lists typically derived from genome-scale (omics) technology. The protocol is intended for experimental biologists who are interested in interpreting their omics data. It requires only an ability to learn and use ‘point-and-click’ computer software, although advanced users can benefit from the automatic analysis scripts we provide as Supplementary Protocols –. We analyze previously published human gene expression and somatic mutation data as examples,,; however, our conceptual framework is applicable to analysis of lists of genes or biomolecules from any organism derived from large-scale data, including proteomics, genomics, epigenomics and gene-regulation studies. We extensively use pathway enrichment analysis for many projects and have evaluated numerous available tools,,,. The software packages we cover here have been selected for their ease of use, free access, advanced features, extensive documentation and up-to-date databases, and they are ones we use daily in our research and recommend to collaborators and students. In addition, we have provided feedback to the developers of these tools, allowing them to implement features we have needed in published analyses. These tools are g:Profiler, GSEA, Cytoscape and EnrichmentMap, all freely available online:
g:Profiler (https://biit.cs.ut.ee/gprofiler/)
Cytoscape (http://www.cytoscape.org/)
EnrichmentMap (http://www.baderlab.org/Software/EnrichmentMap)
This section outlines the major stages of pathway enrichment analysis. A detailed step-by-step protocol is provided in the Procedure below. Pathway enrichment analysis involves three major stages (Fig. ; see Box for basic definitions).
1Definition of a gene list of interest using omics data. An omics experiment comprehensively measures the activity of genes in an experimental context. The resulting raw dataset generally requires computational processing, such as normalization and scoring, to identify genes of interest, considering the experimental design. For example, a list of genes differentially expressed between two groups of samples can be derived from RNA-seq data. Gene lists derived from other types of omics experiments, such as gene expression microarrays, quantitative proteomics,, germline and somatic genome sequencing,,, and global DNA methylation assays,, can be used in this protocol; however, each type of data may require specific pre-processing steps (see ‘Comparison to alternative methods’ section).
Pathway enrichment analysis. A statistical method is used to identify pathways enriched in the gene list from stage 1, relative to what is expected by chance. All pathways in a given database are tested for enrichment in the gene list (see Box for a list of pathway databases). Several established pathway enrichment analysis methods are available, and the choice of which to use depends on the type of gene list (see ‘Comparison to alternative methods’ section).
Visualization and interpretation of pathway enrichment analysis results. Many enriched pathways can be identified in stage 2, often including related versions of the same pathway. Visualization can help identify the main biological themes and their relationships for in-depth study and experimental evaluation.
Gene lists derived from diverse omics data undergo pathway enrichment analysis, using g:Profiler or GSEA, to identify pathways that are enriched in the experiment. Pathway enrichment analysis results are visualized and interpreted in Cytoscape using its EnrichmentMap, AutoAnnotate, WordCloud and clusterMaker2 applications. Protocol overview is shown on the left, starting from gene list input, and example outputs at each stage are shown on the right.
Genome-scale experiments generate raw data that must be processed to obtain gene-level information suitable for pathway enrichment analysis (Supplementary Protocols and ). The specific processing steps are particular to the omics experiment type and may be standard, and thus usually straightforward to implement, or not, in which case advanced computational skills may be needed for data processing. Standard processing methods are available for established omics technologies and are most conveniently performed by the core facility that generates the data.
There are two major ways to define a gene list from omics data: list or ranked list. Certain omics data naturally produce a gene list, such as all somatically mutated genes in a tumor identified by exome sequencing, or all proteins that interact with a bait in a proteomics experiment. Such a list is suitable for direct input into pathway enrichment analysis using g:Profiler (Step 6A). Other omics data naturally produce ranked lists. For example, a list of genes can be ranked by differential gene expression score or sensitivity in a genome-wide CRISPR screen. Some pathway enrichment analysis approaches analyze a ranked gene list filtered by a particular threshold (e.g., FDR-adjusted P value <0.05 and fold-change >2). Alternative approaches, such as GSEA, are designed to analyze ranked lists of all available genes and do not require a threshold. A whole-genome ranked list is suitable for input into pathway enrichment analysis using GSEA (Step 6B). A partial (non-whole-genome) ranked gene list should be analyzed using g:Profiler.
As an example, we describe the analysis of raw RNA-seq data from ovarian cancer samples to define a ranked gene list. DNA sequence reads are quality filtered (e.g., by trimming to remove low-quality bases) and mapped to a genome-wide reference set of transcripts to enable counting of reads per transcript. Read counts are aggregated at the gene level (counts per gene). Typically, RNA-seq data for multiple biological replicates (three or more) for each of multiple experimental conditions (two or more, e.g., treatment versus control) are available. Read counts per gene are normalized across all samples to remove unwanted technical variation between samples, for example, due to differences in sequencing lane or total read number per sequencing run,,. Next, read counts per gene are tested for differential expression across sample groups (e.g., treatment versus control) (Supplementary Protocols and for RNA-seq and microarray data, respectively). Software packages such as edgeR, DESeq, limma/voom, and Cufflinks implement procedures for RNA-seq data normalization and differential expression analysis. Differential gene expression analysis results include: (i) the P value of the significance of differential expression; (ii) the related Q value (a.k.a. adjusted P value) that has been corrected for multiple testing across all genes, for example, by using the Benjamini–Hochberg false-discovery rate (BH-FDR) procedure (Box ); (iii) effect size and direction of expression change so that upregulated genes are positive and at the top of the list and downregulated genes are negative and at the bottom of the list, often expressed as log-transformed fold-change. The gene list is then ranked by one or more of these values (e.g., −log10 P value multiplied by the sign of log-transformed fold-change) and studied using pathway enrichment analysis.
The default analysis implemented in g:Profiler and similar web-based tools,,, searches for pathways whose genes are significantly enriched (i.e., over-represented) in the fixed list of genes of interest, as compared to all genes in the genome (Step 6A)(Box ). The P value of the enrichment of a pathway is computed using a Fisher’s exact test and multiple-test correction is applied (Box ).
g:Profiler also includes an ordered enrichment test, which is suitable for lists of up to a few thousand genes that are ordered by a score, whereas the rest of the genes in the genome lack meaningful signal for ranking. For example, significantly mutated genes may be ranked by a score from a cancer driver prediction method. This analysis repeats a modified Fisher’s exact test on incrementally larger sub-lists of the input genes and reports the sub-list with the strongest enrichment P value for each pathway. g:Profiler searches a collection of gene sets representing Gene Ontology (GO) terms, pathways, networks, regulatory motifs, and disease phenotypes. Major categories of gene sets can be selected to customize the search.
Pathway enrichment methods that use the Fisher’s exact test or related tests require the definition of background genes for comparison. All annotated protein-coding genes are often used as default. This leads to inappropriate inflation of P values and false-positive results if the experiment can directly measure only a subset of all genes. For example, setting a custom background is important in analyzing data from targeted sequencing or phosphoproteomics experiments. The appropriate custom background would include all genes in the sequencing panel or all known phosphoproteins, respectively.
Pathway enrichment analysis of a ranked gene list is implemented in the GSEA software (Step 6B) (Box ). GSEA is a threshold-free method that analyzes all genes on the basis of their differential expression rank, or other score, without prior gene filtering. GSEA is particularly suitable and is recommended when ranks are available for all or most of the genes in the genome (e.g., for RNA-seq data). However, it is not suitable when only a small portion of genes have ranks available, for example, in an experiment that identifies significantly mutated cancer genes (Stage 2A; Step 6A).
GSEA searches for pathways whose genes are enriched at the top or bottom of the ranked gene list, more so than expected by chance alone. For instance, if the topmost differentially expressed genes are involved in the cell cycle, this suggests that the cell cycle pathway is regulated in the experiment. By contrast, the cell cycle pathway is probably not significantly regulated if the cell cycle genes appear randomly scattered through the whole ranked list. 例如,如果最显著差异表达的基因参与细胞周期,则表明该细胞周期途径在实验中受到调控。 相比之下,如果细胞周期基因随机出现在整个排名列表中,则细胞周期途径可能不会受到明显调节。 To calculate an enrichment score (ES) for a pathway, GSEA progressively examines genes from the top to the bottom of the ranked list, increasing the ES if a gene is part of the pathway and decreasing the score otherwise. These running sum values are weighted, so that enrichment in the very top- (and bottom-) ranking genes is amplified, whereas enrichment in genes with more moderate ranks are not amplified. The ES score is calculated as the maximum value of the running sum and normalized relative to pathway size, resulting in a normalized enrichment score (NES) that reflects the enrichment of the pathway in the list. Positive and negative NES values represent enrichment at the top and bottom of the list, respectively. Finally, a permutation-based P value is computed and corrected for multiple testing to produce a permutation-based false-discovery rate (FDR) Q value that ranges from 0 (highly significant) to 1 (not significant) (Box ). The same analysis is performed starting from the bottom of the ranked gene list to identify pathways enriched in the bottom of the list. Resulting pathways are selected using the FDR Q value threshold (e.g., Q < 0.05) and ranked using NES. In addition, the ‘leading edge’ aspect of the GSEA analysis identifies specific genes that most strongly contribute to the detected enrichment signal of a pathway.
GSEA has two methods for determining the statistical significance (P value) of the ES: gene set permutation and phenotype permutation. The gene set permutation test requires a ranked list, and GSEA compares the observed pathway ES to a distribution of scores obtained by repeating the analysis with randomly sampled gene sets of matching sizes (e.g., 1,000 times). The phenotype permutation test requires expression data for all samples (e.g., biological replicates), along with a definition of sample groups called ‘phenotypes’ that are compared with each other (e.g., cases versus controls; tumor versus. normal samples). The observed pathway ES is compared to a distribution of scores obtained by randomly shuffling the samples among phenotype categories and repeating the analysis (e.g., 1,000 times), including computation of the ranked gene list and resulting pathway ES. Gene set permutation is recommended for studies with limited variability and biological replicates (i.e., two to five per condition). In this case, differential gene expression values should be computed outside of GSEA, using methods that include variance stabilization (such as edgeR, DESeq and limma/voom,) and imported into the GSEA software before pathway analysis. Phenotype permutation should be used with a larger number of replicates (e.g., at least ten per condition). The main advantage of the phenotype permutation approach is that it maintains the structure of gene sets with biologically important gene correlations during permutation, in contrast to the gene set permutation approach. This protocol covers only gene set permutation because it is appropriate for the most common use case of pathway enrichment analysis. Phenotype permutation is computationally expensive and, for the current version of GSEA, requires custom programming to compute ESs and differential expression statistics separately for thousands of phenotype randomizations. For advanced users, we provide a supplementary protocol for this procedure (Supplementary Protocol ).
By default, the GSEA desktop software searches the MSigDB gene set database, which includes pathways, published gene signatures, microRNA target genes and other gene set types (Box ). The user can also provide a custom database as a text-based GMT (Gene Matrix Transposed) file in which each line defines a pathway, with its name, identifier and a list of genes it contains. Gene identifiers in the GMT file must match those in the input gene list.
Pathway information is inherently redundant, as genes often participate in multiple pathways, and databases may organize pathways hierarchically by including general and specific pathways with many shared genes (e.g., ‘cell cycle’ and ‘M-phase of cell cycle’). Consequently, pathway enrichment analysis often highlights several versions of the same pathway. Collapsing redundant pathways into a single biological theme simplifies interpretation. We recommend addressing such redundancy with visualization methods such as EnrichmentMap, ClueGO and others,,. An ‘enrichment map’ is a network visualization that represents overlaps among enriched pathways (Fig. ), whereas ‘EnrichmentMap’ refers to the Cytoscape application that creates the visualization. Pathways are shown as circles (nodes) that are connected with lines (edges) if the pathways share many genes. Nodes are colored by ES, and edges are sized on the basis of the number of genes shared by the connected pathways. Network layout and clustering algorithms automatically group similar pathways into major biological themes. The EnrichmentMap software takes as input a text file containing pathway enrichment analysis results and another text file containing the pathway gene sets used in the original enrichment analysis. Interactive exploration of pathway ES (filtering nodes) and connections between pathways (filtering edges) is possible (Step 9A(xii and xiii) and 9B(xiii and xiv)). Multiple enrichment analysis results can be simultaneously visualized in a single enrichment map, in which case different colors are used on the nodes for each enrichment. If the gene expression data are optionally loaded, clicking on a pathway node will display a gene expression heat map of all genes in the pathway.
An enrichment map helps identify interesting pathways and themes. First, expected themes should be identified to help validate the pathway enrichment analysis results (positive controls). For instance, growth-related pathways and other hallmarks of cancer are expected to be identified in analyses of cancer genomics datasets. Second, pathways not previously associated with the experimental context are evaluated more carefully as potential discoveries. Pathways and themes with the strongest ESs should be studied first, followed by progressively weaker signals (Step 12). Third, interesting pathways are examined in more detail, examining genes within the pathways (e.g., expression heat maps and the GSEA leading edge genes). Further, gene expression values can be overlaid on a pathway diagram, if available, from databases such as Pathway Commons, Reactome, KEGG or WikiPathways, using tools such as PathVisio. If a diagram is not available, tools such as STRING or GeneMANIA can be used with Cytoscape to define an interaction network among pathway genes for expression overlay. This helps in visual identification of the pathway components (e.g., single genes or entire signaling cascades) that are the most altered (e.g., differentially expressed) in the experiment. In addition, master regulators for enriched pathways can be searched for by integrating gene sets of miRNA or transcription factor targets using the EnrichmentMap post-analysis tool. Finally, pathway enrichment analysis results can be published to support a scientific conclusion (e.g., functional differences of two cancer subtypes), or used for hypothesis generation or planning of experiments to support the identification of novel pathways. More pathway enrichment analysis examples and deeper explanation of core concepts is provided at http://www.pathwaycommons.org/guide/.
Pathway enrichment analysis of omics data has several advantages as compared to analysis of single genes, transcripts or proteins. First, it improves statistical power in two ways: (i) it aggregates counts of mutations across all the genes and genomic regions involved in the given cell mechanism, providing a higher number of counts, which makes statistical analyses more reliable; and (ii) it reduces the dimensionality from tens of thousands of genes or millions of genomic regions (e.g., SNPs) to a much smaller number of ‘systems’ or ‘pathways’, thereby reducing the cost of multiple hypothesis testing. Second, results are often easier to interpret because the analysis is phrased at the level of familiar concepts such as ‘cell cycle’. Third, the approach can help identify potential causal mechanisms and drug targets. Fourth, results obtained from related, but different, data may be more comparable because results are projected onto a smaller, shared feature space (i.e., a limited number of pathways). Fifth, the approach facilitates integration of diverse data types, such as genomics, transcriptomics and proteomics, which can all be mapped to the same pathways. Thus, projecting disease data onto known mechanisms increases statistical and interpretative power.
The following limitations are important to consider when interpreting pathway enrichment analysis results, in general, including those covered by this protocol. Additional limitations apply, depending on the omics data type (see ‘Application to diverse omics data’ section). Advantages and disadvantages of specific and alternative pathway enrichment analysis methods are presented in the ‘Comparison to alternative methods’ section.
Enrichment analysis is more effective for pathways in which multiple genes have strong biological signals (e.g., differential expression). For instance, in a transcriptomics experiment, we assume that evolution has optimized a cell to express a pathway only when needed, and that pathway activation or deactivation can be identified as coordinated activity of many genes in a pathway. Pathways in which activity is controlled by only a few genes or not controlled by gene expression (e.g., by post-translational regulation) will never be observed as enriched. 如果同路的变化不是通过基因表达反映出来的,则无法评估 Some pathway analysis methods address this by using activating and inhibiting gene interactions to construct quantitative models of pathway activity that include genes that are not differentially expressed yet are still important regulators. However, these methods require pathway models with detailed biochemical and regulatory gene interactions that are obtained through focused experiments and thus are in limited supply (Box ).
Pathway boundaries tend to be arbitrary, and different databases will disagree about which genes are involved in a given pathway. By using multiple databases, multiple pathway definitions can be analyzed, and some may be better than others at explaining the experimental data.
Some pathway enrichment methods, such as those based on the Fisher’s exact test, are statistically more likely to identify larger pathways as significant. Users can address this limitation by selecting an upper limit for the size of the gene sets considered in the analysis.
Multi-functional genes that are highly ranked in the gene list may lead to enrichment of many different pathways, some of which are not relevant to the experiment. Repeating the analysis after excluding such genes may reveal pathways whose enrichment is overly dependent on their presence or confirm the robustness of pathway enrichment.
Pathway databases, and therefore enrichment results, are biased toward well-known pathways. In fact, pathway enrichment analysis ignores genes with no pathway annotations, sometimes called the ‘dark matter of the genome’, and these genes should be studied separately. For example, non-coding RNA genes currently lack systematic annotations and are not directly usable in pathway enrichment analysis. ncRNA没有得到很好的注释,因此不能用于功能富集分析
Most enrichment analysis methods make unrealistic assumptions of statistical independence among genes as well as pathways. Some genes may always be co-expressed (e.g., genes within a protein complex), and some pathways have genes in common. Thus, standard FDRs, which assume statistical independence between tests, are often either more or less conservative than ideal. Nonetheless, they should still be used to adjust for multiple testing and rank enriched pathways for exploratory analysis and hypothesis generation. Custom permutation tests may lead to better estimates of false discovery (see ‘Comparison to alternative methods’ section).
Pathway enrichment analysis benefits greatly from careful experimental design. Otherwise, the analysis may reveal apparently meaningful results caused by experimental biases or other confounders. This section covers a range of experimental aspects that must be considered before performing this protocol.
The experimental conditions must be well defined such that the major variations observed are responses that the experimenter would like to monitor and that are related to the biological question of interest (e.g., tumor versus normal, treated versus untreated, comparison of four disease subtypes, time series).
Biological replicates are independently processed samples obtained from distinct organisms or cell lines that are required for measuring variability across samples and to compute statistical significance (P values). Lack of replication (i.e., one sample per group) will not permit robust estimation of the significance of the signal. Insufficient replication may result in lack of signal in the data (e.g., no significantly differentially expressed genes). The larger the variation in the set of samples, the more biological replicates are needed to accurately measure the signal. For systems with lower variability (i.e., model organisms with the same genetic background in controlled laboratory conditions, or stable cell lines derived from the same clone), at least three to four biological replicates are recommended per condition for differential analysis with variance stabilization normalization. Variance stabilization uses a global statistical model to ‘stabilize’ gene-wise variance estimates to reduce inaccuracies resulting from few replicates. For experiments with higher variability (e.g., tumor samples), more replicates are required; ideally, a pilot experiment followed by formal statistical power calculations (sometimes called sensitivity testing) should be used to determine the minimal number of replicates required to identify the signal of differentially expressed genes or enriched pathways. Technical replicates comprising repeated experiments of the same samples are usually not needed for well-established experimental techniques, such as RNA-seq, that have low technical variability, but can be helpful for novel techniques.
Differences in factors not related to the experimental question should be avoided or at least balanced across conditions so that statistical techniques such as generalized linear models can correct for each factor. Common factors include sequencing batch, nucleic acid extraction protocol, subject age and many others. Otherwise, it may be impossible to accurately separate the experimental signals coming from the experimental response from the confounding factors. Knowing important factors in advance supports correct experimental design. Statistical exploratory analyses such as clustering or principal component analysis (PCA) can help identify unknown factors. For example, cases and controls are expected to cluster separately and not by processing batch.
Outlier samples may considerably differ from others because of major experimental or technical problems, such as contamination or sample mix-up. Alternatively, they may present extreme biological features, such as tumor samples with exceptionally aggressive phenotypes. 它们可能表现出极端的生物学特征,例如具有异常侵袭性表型的肿瘤样品。 Unbiased identification of outlier samples is possible using statistical techniques such as PCA or clustering. Pathway enrichment analysis should be performed with and without outliers to ensure robust results. Systematic removal of outliers may be justified to reduce variability in the experiment.
Some experimental methods can be tuned to be more or less sensitive. For instance, the number of reads in RNA-seq experiments influences downstream analysis. For quantifying gene expression in a biological system with modest variability and testing differential expression with variance stabilization, at least three to five replicates and 10 million mapped reads are required. Substantially greater sequencing depth, such as 50–100 million mapped reads, is required to investigate splice isoforms, to detect poorly expressed genes or for samples with complex cellular mixtures such as surgical resection specimens.
We recommend searching enrichment of pathway gene sets only at first, as these capture familiar normal cellular processes that are easy to interpret. GO biological process terms and manually curated molecular pathways from Reactome, Panther, HumanCyc and NetPath are good resources for human pathways (Box ). GO biological process annotations include a mix of manually curated and electronically inferred sources.
A large fraction of gene annotations in GO originate from automatic data analysis and are not verified by human curators. These have the evidence code ‘inferred from electronic annotation’ (IEA). Earlier literature has cautioned against analyzing and interpreting IEA-labeled annotations, whereas more recent studies suggest that these are often as reliable as annotations assigned by human curators. For high-confidence analyses of data from human and common model organisms that have many manually curated annotations, we generally recommend comparing versions of the analysis with and without filtering of IEA annotations to verify robustness. However, IEA annotations make up the majority of information in less-well-studied species and should be used by default in these cases. Removing IEA-coded annotations may bias the analysis toward well-studied biological processes.
Different types of gene sets help answer a variety of questions. For instance, non-pathway gene sets corresponding to microRNA and transcription factor targets can be used to discover important regulators,. However, simultaneously analyzing all available types of gene sets reduces data interpretability. It may also lead to false negatives, as the increased number of conducted tests increases the effect of multiple testing correction and reduces the multiple-test adjusted significance of individual pathways. We therefore recommend performing the analysis of non-pathway and pathway gene sets separately.
It is often beneficial to exclude numerous small pathways because they are redundant with larger pathways and complicate interpretation, and their abundance makes multiple-testing correction more stringent. Large pathways should be also excluded, as these are overly general (e.g., ‘metabolism’), they do not contribute to interpretability of results, and their statistical significance can be inflated when using certain statistical enrichment methods (e.g., Fisher’s exact test). For analyzing human gene expression data, we often recommend excluding pathway gene sets with <10–15 genes and >200–500 genes, although upper bounds of 200–2,000 genes can be found in the literature. However, for non-human organisms and other types of gene sets, which may have different gene set size distributions, larger sets may need to be included. Filtering of pathways depends on the experimental context, as different areas of biology have variable coverage in pathway databases. One can determine the lower and upper bounds of pathway size by examining the sizes of several pathways of interest that are expected to be relevant to the experiment.
Pathway enrichment analysis depends on gene sets and databases used in the analysis, and many recent studies using pathway enrichment analysis are strongly impacted by outdated resources. For improved reproducibility and transparency of research, investigators should report in publications the analysis date and versions of pathway enrichment analysis software and gene set databases used, as well as all analysis parameters. In addition to enrichment maps, authors should consider adding their studied gene lists and complete tables of enriched pathways as supplementary information.
Genes are associated with many diverse database identifiers (IDs). We recommend using unambiguous, unique and stable IDs, as some IDs become obsolete over time. For human genes, we recommend using the Entrez Gene database IDs (e.g., corresponds to MDM2) or gene symbols (MDM2 is the official symbol recommended by the HUGO Gene Nomenclature Committee). As gene symbols change over time, we recommend maintaining both gene symbols and Entrez Gene IDs. The g:Profiler and related g:Convert tools support automatic conversion of multiple ID types to standard IDs.
Unexpected biological themes revealed in a pathway analysis may indicate problems with experimental design, data generation or analysis. For example, enrichment of the apoptosis pathway may indicate a problem with the experimental protocol that led to increased cell death during sample preparation. In these cases, the experimental design and data generation should be carefully reviewed before further data interpretation.
This protocol uses RNA-seq data and somatic mutation data as examples because these data types are frequently encountered. However, the general concepts of pathway enrichment analysis that we present are applicable to many types of experiments that can generate lists of genes, such as single-cell transcriptomics, CNVs, proteomics, phosphoproteomics, DNA methylation and metabolomics. Most data types require protocol modifications, which we only briefly discuss here. With certain data types, specialized computational methods are required to produce a gene list that is appropriate for pathway enrichment analysis, whereas with other data types, a specialized pathway enrichment analysis technique is required. Issues specific to data types and experimental methods must be considered, including:
Different gene identifiers are recommended for certain data types. We recommend UniProt accession numbers for proteins (e.g., for MDM2) and Human Metabolome Database IDs for metabolites (e.g., ATP is denoted as ).
Certain types of omics experiments by design capture only a subset of genes or proteins. To address this limited coverage, pathway enrichment analysis must define a custom background gene set of the genes that can be measured in the experiment. For example, phosphoproteomics experiments measure only proteins with one or more phosphorylation sites and thus must use the set of genes encoding phosphoproteins as the custom background gene set. 例如,磷酸蛋白质组学实验仅测量具有一个或多个磷酸化位点的蛋白质,因此必须使用编码磷蛋白的基因集作为定制背景基因集。 Otherwise, pathway enrichment analysis would reveal inflated P values for general processes such as kinase signaling and protein phosphorylation.
Pathway enrichment analysis of short non-coding genomic regions such as transcription factor binding sites from ChIP-seq experiments need additional consideration. Genomic regions must be mapped to protein-coding genes and corrected for biases such as increased signal in longer genes. Tools such as GREAT automatically perform both tasks.
Large genomic intervals that span multiple genes (e.g., from genome-wide associations, CNV and differentially methylated regions) require specialized enrichment tests such as the PLINK CNV gene set burden test or INRICH. Standard enrichment tests often reveal genes clustered in the genome whose signals are strongly statistically inflated because each gene is incorrectly counted as an independent signal. 标准富集测试通常会揭示基因组中聚集的基因,其信号在统计上被强烈夸大,因为每个基因被错误地视为独立信号。 Gene types that are correlated with genomic position include olfactory receptors, histones, major histocompatibility complex (MHC) members and homeobox transcription factors. A simple solution to address genomic clustering of genes in a pathway involves selecting only one representative gene from each functionally homogeneous genomic cluster before enrichment analysis.
For rare genetic variants, case–control pathway ‘burden’ tests are the most appropriate pathway enrichment analysis method (see ‘Comparison to alternative’ methods section).
This protocol recommends the use of g:Profiler and GSEA software for pathway enrichment analysis. g:Profiler, analyzes gene lists using Fisher’s exact test and ordered gene lists using a modified Fisher’s test. It provides a graphical web interface and access via R and Python programming languages. The software is frequently updated, and the gene set database can be downloaded as a GMT file (http://biit.cs.ut.ee/gprofiler). GSEA analyzes ranked gene lists using a permutation-based test. GSEA分析排名的基因列表基于置换的检验。 The software typically runs as a desktop application (http://software.broadinstitute.org/gsea). Hundreds of pathway enrichment analysis tools exist (reviewed in ref. ), although many rely on out-of-date pathway databases or lack unique features as compared to the most commonly used tools; as such, we do not cover them here. The following are alternative free pathway enrichment analysis software tools. Although we do not cover these tools in our protocol, we recommend the following, on the basis of their ease of use, unique features or advanced programming features.
Enrichr: This is a web-based enrichment analysis tool for non-ranked gene lists that is based on Fisher’s exact test. It is easy to use, has rich interactive reporting features, and includes >100 gene set databases (called libraries), including >180,000 gene sets in multiple categories. Functionality is similar to that of the g:Profiler web server described in this protocol.
Camera: This R Bioconductor package analyzes gene lists and corrects for inter-gene correlations such as those apparent in gene co-expression data. The software is available as part of the limma package in Bioconductor (https://bioconductor.org/packages/release/bioc/html/limma.html; this is an advanced tool that requires programming expertise; Supplementary Protocol ).
GOseq: This R Bioconductor package analyzes gene lists from RNA-seq experiments by correcting for user-selected covariates such as gene length (https://bioconductor.org/packages/release/bioc/html/goseq.html; this is an advanced tool that requires programming expertise).
Genomic Regions Enrichment of Annotations Tool (GREAT): In contrast to common methods that analyze gene lists, GREAT analyzes genomic regions such as DNA binding sites and links these to nearby genes for pathway enrichment analysis (http://bejerano.stanford.edu/great/public/html/). See ‘Application to diverse omics data’ section.
This protocol recommends the use of EnrichmentMap for pathway enrichment analysis visualization to aid interpretation. EnrichmentMap is a Cytoscape application that visualizes the results from pathway enrichment analysis and eases interpretation by displaying pathways as a network in which overlapping pathways are clustered together to identify major biological themes in the results (http://www.baderlab.org/Software/EnrichmentMap). Two alternative useful visualization tools are:
ClueGO: This Cytoscape application is conceptually similar to EnrichmentMap and provides a network-based visualization to reduce redundancy of results from pathway enrichment analysis. It also includes a pathway enrichment analysis feature for analysis of GO annotations using Fisher’s exact tests. However, it currently supports only GO gene sets.
PathVisio: This desktop application provides a complementary visualization approach to those of EnrichmentMap and ClueGO. PathVisio enables the user to visually interpret omics data in the context of gene and protein interactions in a pathway of interest. PathVisio colors pathway genes according to user-provided omics data (). This is the main advantage of PathVisio as compared to EnrichmentMap and ClueGO.
Most pathway enrichment analysis methods treat all genes in a pathway uniformly and ignore gene interactions. By contrast, topology-aware methods explicitly model the interactions between genes. CePa, GANPA and THINK-Back use physical gene interactions or co-expression networks to assign a weight to each gene in each pathway. Weights can be derived from measures of the gene importance in the network such as degree, the number of gene connections and betweenness centrality, and can be integrated into a traditional pathway enrichment analysis method such as GSEA. Methods such as SPIA, Pathway-Express and EnrichNet generate an ES for the entire pathway that considers pathway regulatory interactions such as activation and inhibition. Although useful and potentially more accurate, regulatory and biochemical gene interactions are available for fewer genes and pathways as compared to physical interactions networks and co-expression. We anticipate that these methods will become more useful as more gene interactions in pathways are characterized in detailed molecular experiments. However collecting and curating high-quality and biochemically detailed pathway data from the literature is currently complex and expensive. Therefore, pathway enrichment analysis methods described in this protocol will probably remain the most widely used approaches for the foreseeable future.
Current pathway enrichment analysis methods provide a useful high-level overview of the pathways active in a genomics experiment. However, these methods consider a simplified pathway view that involves only gene sets. Next-generation pathway analysis methods will integrate more biological pathway details, build pathway models based on multiple types of genomics data measured across many samples, and consider positive and negative regulatory relationships in the data. For instance, qualitative mathematical modeling parameterized with single-cell RNA-seq data may one day enable accurate predictions of drug combinations capable of treating a given disease under study.
This step-by-step protocol explains how to complete pathway enrichment analysis using g:Profiler (filtered gene list) and GSEA (unfiltered, whole-genome, ranked gene list), followed by visualization and interpretation using EnrichmentMap. The example data provided for the g:Profiler analysis is a list of genes with frequent somatic single nucleotide variants (SNVs) identified in The Cancer Genome Atlas (TCGA) exome-sequencing data of 3,200 tumors of 12 types. The example data provided for the GSEA analysis is a list of differentially expressed genes in two subtypes of ovarian cancer defined by TCGA.
We provide downloadable example files that are referred to throughout the protocol (Supplementary Tables ). We recommend saving all these files in a personal project data folder before starting. We also recommend creating an additional result data folder to save the files generated while performing the protocol.
A gene list or ranked gene list of interest
Example data for Step 6A. g:Profiler requires a list of genes, one per line, in a text file or spreadsheet, ready to copy and paste into a web page: for this, we use genes with frequent somatic SNVs identified in TCGA exome sequencing data of 3,200 tumors of 12 types. The MuSiC cancer driver mutation detection software was used to find 127 cancer driver genes that displayed higher than expected mutation frequencies in cancer samples (Supplementary Table , which is derived from column B of Supplementary Table in ref. ). Genes are ranked in decreasing order of significance (FDR Q value) and mutation frequency (not shown).
Example data for Step 6B. GSEA requires an RNK file with gene scores. An RNK file is a two-column text file with gene IDs in the first column and gene scores in the second column. All (or most) genes in the genome need to have a score, and the gene IDs need to match those used in the GMT file. We provide a ranked list of differentially expressed genes in ovarian cancer from TCGA (Supplementary Table 2). This cohort was previously stratified into four molecular subtypes on the basis of gene expression data, defined as differentiated, immunoreactive, mesenchymal and proliferative 该队列先前根据基因表达数据分为四种分子亚型,定义为分化,免疫反应,间充质和增生,. We compared the immunoreactive and mesenchymal subtypes to demonstrate the protocol. Step 5 of Supplementary Protocol shows how this file was created.
Pathway gene set database
In Step 6A, g:Profiler maintains an up-to-date set of pathway gene sets from multiple sources and no further input from the user is required, but a database of pathway gene sets is required for Step 6B (GSEA). Supplementary Table 3 contains a database of pathway gene sets used for pathway enrichment analysis in the standard GMT format, downloaded from http://baderlab.org/GeneSets. This file contains pathways downloaded on 1 July 2017 from eight data sources: GO, Reactome, Panther, NetPath, NCI, MSigDB curated gene sets (C2 collection, excluding Reactome and KEGG), MSigDB Hallmark (H collection) and HumanCyc. The gene sets available from http://baderlab.org/GeneSets are updated monthly. A GMT file is a text file in which each line represents a gene set for a single pathway. Each line includes a pathway ID, a name and the list of associated genes in a tab-separated format.
Timing 5 min
1
Download the required input and output files from the of the protocol.
Create two directories, project data folder and results data folder.
Place all downloaded input and example output files into the project data folder.
As you progress through the protocol, place any newly generated files into the results data folder.
2
Install Java v.8 or higher. Follow the download and installation instructions for Java JRE at
3
Download the latest version of GSEA. We recommend the javaGSEA desktop application available at http://www.broadinstitute.org/gsea/downloads.jsp. Free registration is required.
4
Download the latest version of Cytoscape from . Cytoscape v.3.6.0 or higher is required.
5
Install the required Cytoscape applications.
Launch Cytoscape.
Go to Apps → App Manager (i.e., open the Apps menu and select the item App Manager).
In the Install Apps tab search bar, search for EnrichmentMap.
Click on EnrichmentMap Pipeline Collection in the center panel. Verify that it is v.1.0.0 or higher.
Click on the Install button.
Go to the Currently Installed tab and verify that the applications (EnrichmentMap, clusterMaker2, WordCloud and AutoAnnotate) have been installed.
Timing 3–20 min
6
Two major types of gene lists are used in pathway enrichment analysis of omics data. Flat (unranked) gene lists of dozens to thousands of genes can be analyzed using g:Profiler (option A). A statistical threshold is required to compile a gene list from omics data. By contrast, ranked, whole-genome gene lists are suitable for pathway enrichment analysis using GSEA (option B). Gene lists analyzed with GSEA do not require prior filtering using statistical thresholds. Partial, filtered ranked gene lists can also be analyzed with g:Profiler. Select Step 6A or 6B, depending on the type of gene list you have.
(i)
Open the g:Profiler website at http://biit.cs.ut.ee/gprofiler/ (Fig. ).
(ii)
Paste the gene list (Supplementary Table 1) into the Query field in top-left corner of the screen. The gene list can be space-separated or one per line. The organism for the analysis, Homo sapiens, is selected by default. The input list can contain a mix of gene and protein IDs, symbols and accession numbers. Duplicated and unrecognized IDs will be removed automatically, and ambiguous symbols can be refined in an interactive dialogue after submitting the query.
(iii)
Check the box next to Ordered query. This option treats the input as an ordered gene list and prioritizes genes with higher mutation ESs at the beginning of the list.
(iv)
(Optional) Check the box next to No electronic GO annotations. This option will discard less reliable GO annotations (IEAs) that are not manually reviewed.
(v)
Set filters on gene annotation data using the menu on the right. We recommend that initial pathway enrichment analyses includes only biological processes (BPs) of GO and molecular pathways of Reactome. Keep the two checkboxes checked and uncheck all other boxes in the menu.
(vi)
Click on Show Advanced Options to set additional parameters.
(vii)
Set the values of Size of functional category in the dropdown menu to 5 (‘min’) and 350 (‘max’). Large pathways are of limited interpretative value, whereas numerous small pathways decrease the statistical power because of excessive multiple testing.
(viii)
Set the Size of query/term intersection in the dropdown menu to 3. The analysis will consider only more reliable pathways that have three or more genes in the input gene list.
(ix)
Click g:Profile! to run the analysis. A graphical heat map image will be shown, with detected pathways shown along the y axis (left) and associated genes of the input list shown along the x axis (top). Resulting pathways are organized hierarchically into related groups. g:Profiler uses graphical output by default and switches to textual output when a large number of pathways is found. g:Profiler returns only statistically significant pathways with P values adjusted for multiple testing correction (called Q values). By default, results with Q values <0.05 are reported. g:Profiler reports unrecognized and ambiguous gene IDs that can be resolved manually.
(x)
Use the dropdown menu Output type and select the option Generic Enrichment Map (TAB). This file is required for visualizing pathway results with Cytoscape and EnrichmentMap.
(xi)
Click g:Profile! again to run the analysis with the updated parameters. The required link Download data in Generic Enrichment Map (GEM) format will appear under the g:Profiler interface. Download the file from the link and save it on your computer in your result data folder created in Step 1. Example results are provided in Supplementary Table 4.
(xii)
Download the required GMT file by clicking on the link name at the bottom of the Advanced Options form. The GMT file is a compressed ZIP archive that contains all gene sets used by g:Profiler (e.g., gprofiler_hsapiens.NAME.gmt.zip). The gene set files are divided by data source. Download and uncompress the ZIP archive to your project folder. All required gene sets for this analysis will be in the file hsapiens.pathways.Name.gmt (Supplementary_Table5_hsapiens.pathways.NAME.gmt). Place the saved file in your result data folder created in Step 1.
(A)
Pathway enrichment analysis of a gene list using g:Profiler
Timing 3 min
Fig. 2: Screenshot of g:Profiler user interface.
Protocol Step 6A involves populating the g:Profiler interface. Procedural steps are highlighted with rectangles and roman numerals (refer to Step 6A(i–xii)). Purple boxes highlight files that must be downloaded for subsequent protocol steps. The remaining boxes indicate parameters for the analysis.
a, Web page summary of GSEA results showing pathways enriched in the top or bottom of the ranked list, with the ‘na_pos’ and ‘na_neg’ phenotypes corresponding to enrichment in upregulated and downregulated genes, respectively. These have been manually labeled here as mesenchymal and immunoreactive, respectively. Clicking on ‘Snapshot’ under either of the phenotypes will show the top 20 enrichment plots for that phenotype. b, An example enrichment plot for the top pathway in the mesenchymal set. c, An example enrichment plot for the top pathway in the immunoreactive set.
Class/phenotype-specific GSEA output in the web page summary shows how many gene sets were found enriched in upregulated genes, regardless of significance (purple), the total number of gene sets used after size filtering (cyan), the phenotype name (red) and the number of gene sets that pass different thresholds (orange).
a,b, Input fields in the EnrichmentMap interface for g:Profiler (a) and GSEA (b) results. Procedural steps are shown for Step 9A and 9B. Other than the specific input files, the parameters are the same for the two analysis types. Attributes surrounded by a dashed box should be filled out automatically if the user selects an appropriate folder with the required files. Missing file names indicate that EnrichmentMap was unable to find the specified file. Orange boxes indicate optional files. For the examples presented in the protocol, optional files are used for the GSEA analysis but not for the g:Profiler analysis to demonstrate the two distinct use cases. EM, EnrichmentMap.
a,b, Unformatted enrichment maps generated from Steps 6A (a) and 6B (b). Each node (circle) represents a distinct pathway, and edges (blue lines) represent the number of genes overlapping between two pathways, determined using the similarity coefficient. a, Enrichment map of significantly mutated cancer driver genes generated using the g:Profiler analysis in Step 6A. b, Enrichment map of pathways enriched in upregulated genes in mesenchymal (red) and immunoreactive (blue) ovarian cancer samples using the GSEA analysis in Step 6B.
(i) Cytoscape ‘Control Panel’, which contains ‘Networks’, ‘Styles’ and ‘Select’ tabs as well as the ‘EnrichmentMap’ main panel. (ii) The ‘Table Panel’ contains tables with node, edge and network attributes, as well as an enrichment map ‘Heat Map’ panel displaying expression for genes associated with selected nodes and edges. (iii) Cytoscape search bar, which can be used to search for genes in the enrichment map. (iv) ‘Node Table’ containing values for all variables associated with each node in the network. (v) Q-value or P-value slider bar. By default, the slider is set to Q value if the data contains Q values but can be changed to use P values by selecting the ‘P-value’ radio button. All nodes that pass the initial Q-value threshold are displayed in the enrichment map. By moving the slider to the left, the Q-value threshold is adjusted to a lower value, removing any nodes that do not pass the Q-value threshold. The currently set threshold will be displayed in the accompanying text box. Thresholds can be manually adjusted by modifying the text box value directly. (vi) ‘Edge Cutoff (Similarity)’ slider bar. The slider bar modifies the similarity threshold. The similarity threshold can only be increased; i.e., edges are required to have more genes in common in order to remain visible, which will remove edges from the network that do not satisfy the threshold. One can also manually change the threshold by modifying the text box value directly.
Heat map created by selecting the immunoreactive pathway interferon alpha beta signaling pathway from Reactome. The heat map is useful for visualization of detailed gene expression patterns for a pathway of interest. Magenta corresponds to high expression, and green corresponds to low expression. This heat map is for GSEA results, thus the ‘leading edge’ genes are highlighted in yellow; these genes have the largest contribution to the enrichment signal. (i–vi) Additional controls in the Heat Map panel include sorting options (i), selection of genes to include (ii), expression data visualization options (iii), data compression options (iv), the option to show values (v) and heat map settings (vi).
(i) Overall thumbnail view of the publication-ready enrichment map created with parameters FDR Q value < 0.01, and combined coefficient >0.375 with combined constant = 0.5. (ii) Zoomed-in section of publication-ready enrichment map, in which red and blue nodes represent mesenchymal and immunoreactive phenotype pathways, respectively. Nodes were manually laid out to form a clearer picture. Clusters of nodes were labeled using the AutoAnnotate Cytoscape application. Individual node labels were removed for clarity using the publication-ready button in EnrichmentMap and exported as PNG and PDF files. A legend was manually added at the bottom of the figure.
The enrichment map was summarized by collapsing node clusters using the AutoAnnotate application. Each cluster of nodes in Fig. is now represented as a single node. The network was scaled for better node distribution and manually adjusted to reduce node and label overlap. (i) Overall thumbnail view of the entire collapsed enrichment map. (ii) Zoomed-in section of the publication-ready collapsed enrichment map that corresponds to the zoomed-in network in Fig. (ii).
Subnetwork of the main enrichment map (Fig. ) was manually created by selecting pathways with the top NES values and creating a new network from the selection. Red and blue nodes are mesenchymal and immunoreactive phenotypes, respectively. Clusters of nodes were automatically labeled using the AutoAnnotate application. Annotations in the subnetwork may differ slightly from those in the main network, as word counts were normalized on a network basis.
Archiver|手机版|科学网 ( 京ICP备07017567号-12 )
GMT+8, 2025-1-10 07:47
Powered by ScienceNet.cn
Copyright © 2007- 中国科学报社