10 Enrichment analysis

In this chapter, we demonstrate how to perform enrichment analyses on the results of omic-wide association analyses. Identifying single variables in omic data through massive univariate testing is clearly a simplistic way to study biological systems that are inherently complex. Its strength lies in its scope and in the ease with which it identifies landmarks in the data. These studies frequently identify several variables that are significantly associated with the phenotype of interest and, as more individuals are included, more significant associations are found. How do we make sense of the observed associations? Enrichment analyses try to answer the question of whether the emerging pattern of associations can be mapped to known biological functions.

There are two predominantly used enrichment methods. The first, known as over-representation analysis (ORA), tests whether a gene set contains disproportionately many genes with a significant expression change. The second, called gene set enrichment analysis (GSEA), assesses whether the genes of a gene set accumulate at the top or bottom of the full gene vector ordered by direction and magnitude of expression change. However, the term gene set enrichment analysis nowadays subsumes a general strategy implemented by a wide range of methods. These methods share the same goal, although their approaches and statistical models can vary substantially. In this chapter we focus on the ORA approach, which is commonly used in practice because it can be applied to any gene set, such as those defined in the Molecular Signatures Database (MSigDB).
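As a minimal sketch of the idea behind ORA, the over-representation of a gene set among the significant genes can be assessed with a one-sided hypergeometric test, which is equivalent to a one-sided Fisher's exact test on the corresponding 2x2 table. The gene identifiers, the gene set, and the list of significant genes below are all hypothetical toy data, and only base R functions are used; dedicated packages may apply further filtering and multiple-testing corrections.

```r
# Toy data (all hypothetical): a universe of 1000 tested genes
universe <- paste0("gene", 1:1000)

# A hypothetical gene set of 50 genes (e.g., one pathway)
gene_set <- universe[1:50]

# Hypothetical significant genes: 20 inside the gene set, 30 outside
significant <- c(universe[1:20], universe[101:130])

N <- length(universe)                          # genes tested
K <- length(gene_set)                          # genes in the gene set
n <- length(significant)                       # significant genes
k <- length(intersect(significant, gene_set))  # significant genes in the set

# P(X >= k) under the hypergeometric null of no enrichment
p_hyper <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# The same test written as a one-sided Fisher's exact test
tab <- matrix(c(k, n - k, K - k, N - K - n + k), nrow = 2,
              dimnames = list(in_set = c("yes", "no"),
                              significant = c("yes", "no")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(tab)
print(c(hypergeometric = p_hyper, fisher = p_fisher))
```

Under the null hypothesis, only about n * K / N = 2.5 of the significant genes would be expected to fall in the gene set, so observing 20 gives a very small p-value. In a real analysis the test is repeated for every gene set and the p-values are corrected for multiple comparisons.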

We illustrate the practical issues of the approach using the transcriptomic study of Alzheimer’s disease discussed throughout the book. In particular, we demonstrate how to annotate the results of the association analysis to the reference genome using Bioconductor’s annotation databases (.db), how to map the results to biological functions curated in databases such as GO and KEGG, and how to assess whether the mapping, or enrichment, is significant using the Bioconductor packages GOstats and clusterProfiler.

We also outline how to perform this type of analysis in the CNV setting, where the outcome is a list of genomic ranges instead of genes. In such situations, one may be interested in determining whether the resulting CNV regions overlap with functional genomic regions such as genes, promoters, or enhancers. Finally, we explain the use of CTDquerier to study whether the significant associations are enriched in diseases and interactions with chemicals.
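To make the notion of overlap between CNV regions and functional regions concrete, the following base R sketch flags which regions share at least one base with an annotated feature. The coordinates, the gene names, and the helper `overlaps_any()` are hypothetical and serve only as an illustration; in practice this kind of operation is usually done with Bioconductor's GenomicRanges package, and assessing whether the observed overlap is larger than expected requires an additional statistical test.

```r
# Hypothetical CNV regions resulting from an association analysis
cnv <- data.frame(chr   = c("chr1", "chr1", "chr2"),
                  start = c(100, 5000, 300),
                  end   = c(900, 6000, 400))

# Hypothetical functional regions (here, genes)
genes <- data.frame(gene  = c("geneA", "geneB", "geneC"),
                    chr   = c("chr1", "chr1", "chr2"),
                    start = c(800, 2000, 1000),
                    end   = c(1500, 3000, 2000))

# Hypothetical helper: TRUE when a region overlaps any feature.
# Two closed intervals on the same chromosome overlap when each
# one starts before (or where) the other one ends.
overlaps_any <- function(regions, features) {
  vapply(seq_len(nrow(regions)), function(i) {
    any(features$chr == regions$chr[i] &
        features$start <= regions$end[i] &
        features$end >= regions$start[i])
  }, logical(1))
}

cnv$hits_gene <- overlaps_any(cnv, genes)
print(cnv)
```

In this toy example only the first CNV region (chr1:100-900) overlaps a gene (geneA, chr1:800-1500); the other two regions fall outside every annotated gene.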

The R code to reproduce the results and figures of this chapter can be found here.