11 Transcriptomic analysis in the HELIX cohort.
In this section, we will illustrate how to perform gene expression analysis using data from the HELIX project. Particularly, we will analyse microarray data deriving from the Human Transcriptome Array 2.0 Affymetrix platform. On these data, we will show how to perform a differential gene expression analysis comparing the transcriptomic profiles of boys and girls in the HELIX cohorts.
The microarray data to be analysed has been previously pre-processed (e.g., normalized, low expression or cross-reactive probes removed, etc.), so, this showcase will be restricted to the analysis part. For that, we will employ the dsOmicsClient functions that adapt some of the most important “limma” methods for their use in dataSHIELD. Please, note that pre-processing of gene expression data is also included in dsOmicsClient as indicated in our bookdown. For this tutorial, data are hosted in the Opal BRGE site simulating a single-site DataSHIELD architecture, as previously described in the section 3.5.1.1.
11.1 Getting started.
In this section, we will describe how to configure R and DataSHIELD with the needed packages to carry out proposed analyses in remote. We start by installing the client-side version of the following DataSHIELD/Opal integration packages.
install.packages("DSOpal", dependencies=TRUE)
install.packages("DSI", dependencies=TRUE)
Make sure you also install the DataSHIELD client-side version of the package dsBaseClient.
install.packages("dsBaseClient",
repos = c("http://cran.datashield.org","https://cloud.r-project.org/"),
dependencies = TRUE)
Then, install the client-side version of the dsOmicsClient package directly from GitHub.
install.packages("devtools")
require("devtools")
::install_github("isglobal-brge/dsOmicsClient",ref="master",force=TRUE) devtools
Once installations are completed, all the packages are loaded as usual.
require(DSOpal)
require(DSI)
require(dsBaseClient)
require(dsOmicsClient)
# Loading additional required packages (if not installed, you can easly install them using the BiocManager::install() function)
require(clusterProfiler)
require(org.Hs.eg.db)
require(ggplot2)
require(biomaRt)
11.2 Data formatting and manipulation in DataSHIELD.
In this section, we will cover how to load and inspect input microarray data with DataSHIELD. We start by creating the connection to the opal server using an user who have DataSHIELD permissions.
<- DSI::newDSLoginBuilder()
builder $append(server = "BIB", url = "https://opal-demo.obiba.org",
builderuser = "dsuser", password = "P@ssw0rd",
profile = "omics")
$append(server = "EDEN", url = "https://opal-demo.obiba.org",
builderuser = "dsuser", password = "P@ssw0rd",
profile = "omics")
$append(server = "KANC", url = "https://opal-demo.obiba.org",
builderuser = "dsuser", password = "P@ssw0rd",
profile = "omics")
$append(server = "MoBA", url = "https://opal-demo.obiba.org",
builderuser = "dsuser", password = "P@ssw0rd",
profile = "omics")
$append(server = "Rhea", url = "https://opal-demo.obiba.org",
builderuser = "dsuser", password = "P@ssw0rd",
profile = "omics")
$append(server = "INMASAB", url = "https://opal-demo.obiba.org",
builderuser = "dsuser", password = "P@ssw0rd",
profile = "omics")
<- builder$build()
logindata <- DSI::datashield.login(logins = logindata) conns
Now that we have created a connection object to the Opal, we have started a new R session on the server, and our analysis will take place in this remote session, so we have to load the data into it. In this use case, available data are in the form ExpressionSet files. ExpressionSet is a file-format of the BioConductor project that may contain different sources of genomic data. Alongside the genomic data, ExpressionSet may also contain some phenotypes and metadata associated to the population. Researchers who are not familiar with ExpressionSet can find further information here.
For our showcase, one ExpressionSet file per cohort is hosted in the Opal BRGE site in the form of a resource, all organized within the Opal server in a project called “OMICS”.
For loading these data into the remote DataSHIELD session we will use the DSI::datashield.assign.resource() function. This function takes the connections to the server created in the previous code chunk to assign all available resource objects from a project in the Opal to an R object in the DataSHIELD remote session. As it can be seen in the code, resources in DataSHIELD are called in the function following the next structure “NameOfOpalProject.NameOfResource”.
# We assign microarray data from all cohorts to an object called resource_pos
::datashield.assign.resource(conns[1], "resource_expr",
DSI"OMICS.genexpr_BIB")
::datashield.assign.resource(conns[2], "resource_expr",
DSI"OMICS.genexpr_EDE")
::datashield.assign.resource(conns[3], "resource_expr",
DSI"OMICS.genexpr_KAN")
::datashield.assign.resource(conns[4], "resource_expr",
DSI"OMICS.genexpr_MOB")
::datashield.assign.resource(conns[5], "resource_expr",
DSI"OMICS.genexpr_RHE")
::datashield.assign.resource(conns[6], "resource_expr",
DSI"OMICS.genexpr_SAB")
With these instructions, we have assigned the resources into our remote R session to a variable called “resource_expr”. To verify this step has been performed correctly, we could use the ds.class() function to check for their class and that they exist on the remote session. Then, we have to resolve the resource and retrieve the data in the remote session. For that, we will use the DSI::datashield.assign.expr() function.
# We resolve the resource
::datashield.assign.expr(conns = conns, symbol = "resource_expr_HELIX",
DSIexpr = as.symbol("as.resource.object(resource_expr)"))
As a result, we will get an R object (here named “resource_expr_HELIX”) containing the available “ExpressionSets” files for cohorts. For created objects, we can extract some useful information including the class, and the name of the genes included in the microarray and additional phenotype data available.
# Retrieve the class of loaded files:
ds.class("resource_expr_HELIX")
# Get the names of the first features (probes) available for the first cohort:
<- ds.featureNames("resource_expr_HELIX")
name.features lapply(name.features, head)[[1]]
# Get the names of the phenotype data available for the first cohort:
<- ds.varLabels("resource_expr_HELIX")
name.phenotypes lapply(name.phenotypes, head)[[1]]
# Also the annotation information can be inspected by:
<- ds.fvarLabels("resource_expr_HELIX")
name.annot lapply(name.annot, head)[[1]]
# Annotation information refers to additional data available at the feature
# level and that could be added to the summary statistics for a better
# identification of significant genes.
11.3 Differential gene expression analysis.
The differential expression analysis in dsOmicsClient is implemented in the funcion ds.limma(). Although, by default, the function considers that data is obtained from a microarray experiment (type.data = “microarray”), it could also be applied to RNAseq data.
In this case, since we are interested in evaluating the gene expression differences between boys and girls, we will run the next model (adjusted by possible confounders such as age and ethnicity):
<- ds.limma(model = ~ e3_sex + age_sample_years + h_ethnicity_cauc,
model.sex Set = "resource_expr_HELIX")
The top differentially expressed genes by cohort can then be visualized by:
model.sex
$BIB
# A tibble: 28,738 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TC13001096.hg.1 162 -0.126 0.164 -7.09 4.12e-11 0.00000119
2 TC22000217.hg.1 162 -0.156 0.0180 -6.74 2.72e-10 0.00000391
3 TC17000757.hg.1 162 0.0900 0.0233 6.22 4.23e- 9 0.0000387
4 TC20000999.hg.1 162 0.172 0.0242 6.10 7.48e- 9 0.0000387
5 TC20001182.hg.1 162 0.411 0.0267 6.10 7.58e- 9 0.0000387
6 TC20000178.hg.1 162 0.341 0.0321 6.09 8.09e- 9 0.0000387
7 TC19001398.hg.1 162 -0.207 0.0258 -5.79 3.68e- 8 0.000151
8 TC04001402.hg.1 162 -0.296 0.0459 -5.75 4.30e- 8 0.000154
9 TC05001973.hg.1 162 -0.390 0.0231 -5.52 1.34e- 7 0.000429
10 TC09000119.hg.1 162 0.0775 0.0309 5.49 1.57e- 7 0.000452
# ... with 28,728 more rows
$EDEN
# A tibble: 28,738 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TC12000357.hg.1 109 -0.0574 0.127 -4.71 0.00000729 0.121
2 TC15000639.hg.1 109 0.135 0.0200 4.64 0.00000985 0.121
3 TC02004993.hg.1 109 0.124 0.0302 4.52 0.0000158 0.121
4 TC02001326.hg.1 109 -0.0736 0.0418 -4.51 0.0000168 0.121
5 TC15001330.hg.1 109 0.0945 0.0488 4.34 0.0000317 0.135
6 TC19001398.hg.1 109 -0.191 0.0335 -4.27 0.0000416 0.135
7 TC20000999.hg.1 109 0.115 0.0279 4.26 0.0000441 0.135
8 TC12002776.hg.1 109 -0.155 0.0504 -4.25 0.0000459 0.135
9 TC05000721.hg.1 109 -0.171 0.0266 -4.23 0.0000491 0.135
10 TC01006291.hg.1 109 0.113 0.0315 4.23 0.0000498 0.135
# ... with 28,728 more rows
$KANC
# A tibble: 28,738 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TC11001104.hg.1 151 -0.346 0.148 -6.57 7.84e-10 0.0000225
2 TC10000002.hg.1 151 0.0839 0.0192 6.34 2.51e- 9 0.0000360
3 TC01001596.hg.1 151 -0.224 0.0245 -5.93 1.96e- 8 0.000158
4 TC04001402.hg.1 151 -0.226 0.0272 -5.91 2.20e- 8 0.000158
5 TC01000909.hg.1 151 -0.147 0.0510 -5.76 4.53e- 8 0.000248
6 TC15000872.hg.1 151 0.102 0.0409 5.72 5.44e- 8 0.000248
7 TC08000814.hg.1 151 0.173 0.0296 5.70 6.03e- 8 0.000248
8 TC13000150.hg.1 151 -0.0825 0.0424 -5.66 7.56e- 8 0.000272
9 TC20000178.hg.1 151 0.257 0.0298 5.55 1.27e- 7 0.000394
10 TC05001973.hg.1 151 -0.294 0.0292 -5.53 1.37e- 7 0.000394
# ... with 28,728 more rows
$MoBA
# A tibble: 28,738 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TC19001274.hg.1 245 -0.291 0.136 -7.19 8.03e-12 0.000000231
2 TC13000150.hg.1 245 -0.0768 0.0125 -6.99 2.68e-11 0.000000385
3 TC06000733.hg.1 245 0.110 0.0184 6.81 7.63e-11 0.000000563
4 TC15001330.hg.1 245 0.0911 0.0194 6.80 7.84e-11 0.000000563
5 TC01001720.hg.1 245 -0.262 0.0229 -6.75 1.07e-10 0.000000617
6 TC19001398.hg.1 245 -0.229 0.0277 -6.56 3.19e-10 0.00000142
7 TC01001596.hg.1 245 -0.204 0.0187 -6.55 3.46e-10 0.00000142
8 TC09000119.hg.1 245 0.0859 0.0368 6.47 5.21e-10 0.00000187
9 TC10002872.hg.1 245 0.0882 0.0192 6.39 8.39e-10 0.00000268
10 TC10000002.hg.1 245 0.0677 0.0243 6.27 1.62e- 9 0.00000466
# ... with 28,728 more rows
$Rhea
# A tibble: 28,738 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TC09001482.hg.1 156 0.108 0.150 7.17 2.77e-11 0.000000487
2 TC10000002.hg.1 156 0.0843 0.0201 7.14 3.39e-11 0.000000487
3 TC12003242.hg.1 156 0.132 0.0220 6.52 9.25e-10 0.00000886
4 TC15000872.hg.1 156 0.128 0.0209 6.12 7.39e- 9 0.0000531
5 TC17000757.hg.1 156 0.0990 0.0279 5.77 4.24e- 8 0.000244
6 TC11002451.hg.1 156 0.0894 0.0380 5.63 8.26e- 8 0.000367
7 TC20001182.hg.1 156 0.351 0.0225 5.61 8.94e- 8 0.000367
8 TC20000501.hg.1 156 0.0949 0.0540 5.36 2.93e- 7 0.00105
9 TC01000598.hg.1 156 0.0911 0.0234 5.29 4.03e- 7 0.00129
10 TC13000150.hg.1 156 -0.0741 0.0375 -5.22 5.54e- 7 0.00159
# ... with 28,728 more rows
$INMASAB
# A tibble: 28,738 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TC03003362.hg.1 184 -0.0779 0.127 -6.57 5.11e-10 0.0000147
2 TC09000450.hg.1 184 0.0782 0.0151 6.35 1.61e- 9 0.0000193
3 TC01006291.hg.1 184 0.114 0.0199 6.31 2.02e- 9 0.0000193
4 TC03000563.hg.1 184 -0.200 0.0225 -6.18 3.96e- 9 0.0000284
5 TC19001398.hg.1 184 -0.219 0.0319 -6.07 7.08e- 9 0.0000407
6 TC10002953.hg.1 184 0.0983 0.0333 5.62 6.96e- 8 0.000322
7 TC01000909.hg.1 184 -0.129 0.0247 -5.60 7.84e- 8 0.000322
8 TC17000757.hg.1 184 0.0793 0.0417 5.40 2.00e- 7 0.000709
9 TC10002872.hg.1 184 0.0917 0.0225 5.38 2.22e- 7 0.000709
10 TC01003403.hg.1 184 -0.182 0.0331 -5.32 2.97e- 7 0.000855
# ... with 28,728 more rows
attr(,"class")
[1] "dsLimma" "list"
We can verify whether the distribution of the observed p-values are the ones we would expect in this type of analyses:
ggplot(model.sex$BIB, aes(x=P.Value))+
geom_histogram(color="darkblue", fill="lightblue")+labs(
title="P-Value histogram plot",x="Raw p-value gender effect", y = "Count")
Another functionality of dsOmicsClient is that we can add annotation information at the probe-level to the output that is obtained with ds.limma(). Annotation information will be available as part of the expression set file “resource_expr_HELIX” and can be inspected by:
# Get annotation information for the first cohort
ds.fvarLabels("resource_expr_HELIX")[[1]]
[1] "transcript_cluster_id" "probeset_id" "seqname"
[4] "strand" "start" "stop"
[7] "total_probes" "gene_assignment" "mrna_assignment"
[10] "notes" "phase" "TC_size"
[13] "TSS_Affy" "EntrezeGeneID_Affy" "GeneSymbol_Affy"
[16] "GeneSymbolDB" "GeneSymbolDB2" "mrna_ID"
[19] "mrna_DB" "mrna_N" "notesYN"
[22] "geneYN" "genes_N" "CallRate"
[25] "fil1"
In case we were interested in adding gene symbol, entrezID and chromosome number mapping probes from each output, we can run:
<- ds.limma(model = ~ e3_sex + age_sample_years + h_ethnicity_cauc,
model.sex_annot Set = "resource_expr_HELIX",
annotCols = c("GeneSymbol_Affy","seqname",
"EntrezeGeneID_Affy"))
The results for each of the cohort (ordered by level of significance) can be accessed by:
model.sex_annot
$BIB
# A tibble: 28,738 x 10
id n beta SE t P.Value adj.P.Val GeneSymbol_Affy seqname
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 TC1300~ 162 -0.126 0.164 -7.09 4.12e-11 1.19e-6 INTS6-AS1 chr13
2 TC2200~ 162 -0.156 0.0180 -6.74 2.72e-10 3.91e-6 OSBP2 chr22
3 TC1700~ 162 0.0900 0.0233 6.22 4.23e- 9 3.87e-5 TLK2 chr17
4 TC2000~ 162 0.172 0.0242 6.10 7.48e- 9 3.87e-5 SYCP2 chr20
5 TC2000~ 162 0.411 0.0267 6.10 7.58e- 9 3.87e-5 FRG1BP chr20
6 TC2000~ 162 0.341 0.0321 6.09 8.09e- 9 3.87e-5 FRG1BP;FRG1CP chr20
7 TC1900~ 162 -0.207 0.0258 -5.79 3.68e- 8 1.51e-4 RPS4XP21 chr19
8 TC0400~ 162 -0.296 0.0459 -5.75 4.30e- 8 1.54e-4 TSPAN5 chr4
9 TC0500~ 162 -0.390 0.0231 -5.52 1.34e- 7 4.29e-4 FAXDC2 chr5
10 TC0900~ 162 0.0775 0.0309 5.49 1.57e- 7 4.52e-4 IFT74 chr9
# ... with 28,728 more rows, and 1 more variable: EntrezeGeneID_Affy <chr>
$EDEN
# A tibble: 28,738 x 10
id n beta SE t P.Value adj.P.Val GeneSymbol_Affy seqname
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 TC12000~ 109 -0.0574 0.127 -4.71 7.29e-6 0.121 CACNB3 chr12
2 TC15000~ 109 0.135 0.0200 4.64 9.85e-6 0.121 RPLP1 chr15
3 TC02004~ 109 0.124 0.0302 4.52 1.58e-5 0.121 C2orf15 chr2
4 TC02001~ 109 -0.0736 0.0418 -4.51 1.68e-5 0.121 DES chr2
5 TC15001~ 109 0.0945 0.0488 4.34 3.17e-5 0.135 MYEF2 chr15
6 TC19001~ 109 -0.191 0.0335 -4.27 4.16e-5 0.135 RPS4XP21 chr19
7 TC20000~ 109 0.115 0.0279 4.26 4.41e-5 0.135 SYCP2 chr20
8 TC12002~ 109 -0.155 0.0504 -4.25 4.59e-5 0.135 ART4 chr12
9 TC05000~ 109 -0.171 0.0266 -4.23 4.91e-5 0.135 IGIP chr5
10 TC01006~ 109 0.113 0.0315 4.23 4.98e-5 0.135 CFH chr1
# ... with 28,728 more rows, and 1 more variable: EntrezeGeneID_Affy <chr>
$KANC
# A tibble: 28,738 x 10
id n beta SE t P.Value adj.P.Val GeneSymbol_Affy seqname
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 TC1100~ 151 -0.346 0.148 -6.57 7.84e-10 0.0000225 TBCEL chr11
2 TC1000~ 151 0.0839 0.0192 6.34 2.51e- 9 0.0000360 ZMYND11 chr10
3 TC0100~ 151 -0.224 0.0245 -5.93 1.96e- 8 0.000158 SWT1 chr1
4 TC0400~ 151 -0.226 0.0272 -5.91 2.20e- 8 0.000158 TSPAN5 chr4
5 TC0100~ 151 -0.147 0.0510 -5.76 4.53e- 8 0.000248 S1PR1 chr1
6 TC1500~ 151 0.102 0.0409 5.72 5.44e- 8 0.000248 MAN2A2 chr15
7 TC0800~ 151 0.173 0.0296 5.70 6.03e- 8 0.000248 LY6E chr8
8 TC1300~ 151 -0.0825 0.0424 -5.66 7.56e- 8 0.000272 DGKH chr13
9 TC2000~ 151 0.257 0.0298 5.55 1.27e- 7 0.000394 FRG1BP;FRG1CP chr20
10 TC0500~ 151 -0.294 0.0292 -5.53 1.37e- 7 0.000394 FAXDC2 chr5
# ... with 28,728 more rows, and 1 more variable: EntrezeGeneID_Affy <chr>
$MoBA
# A tibble: 28,738 x 10
id n beta SE t P.Value adj.P.Val GeneSymbol_Affy seqname
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 TC1900~ 245 -0.291 0.136 -7.19 8.03e-12 2.31e-7 PLVAP chr19
2 TC1300~ 245 -0.0768 0.0125 -6.99 2.68e-11 3.85e-7 DGKH chr13
3 TC0600~ 245 0.110 0.0184 6.81 7.63e-11 5.63e-7 DDX43 chr6
4 TC1500~ 245 0.0911 0.0194 6.80 7.84e-11 5.63e-7 MYEF2 chr15
5 TC0100~ 245 -0.262 0.0229 -6.75 1.07e-10 6.17e-7 CTSE chr1
6 TC1900~ 245 -0.229 0.0277 -6.56 3.19e-10 1.42e-6 RPS4XP21 chr19
7 TC0100~ 245 -0.204 0.0187 -6.55 3.46e-10 1.42e-6 SWT1 chr1
8 TC0900~ 245 0.0859 0.0368 6.47 5.21e-10 1.87e-6 IFT74 chr9
9 TC1000~ 245 0.0882 0.0192 6.39 8.39e-10 2.68e-6 FAM24B;CUZD1 chr10
10 TC1000~ 245 0.0677 0.0243 6.27 1.62e- 9 4.66e-6 ZMYND11 chr10
# ... with 28,728 more rows, and 1 more variable: EntrezeGeneID_Affy <chr>
$Rhea
# A tibble: 28,738 x 10
id n beta SE t P.Value adj.P.Val GeneSymbol_Affy seqname
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 TC0900~ 156 0.108 0.150 7.17 2.77e-11 4.87e-7 KIAA0368 chr9
2 TC1000~ 156 0.0843 0.0201 7.14 3.39e-11 4.87e-7 ZMYND11 chr10
3 TC1200~ 156 0.132 0.0220 6.52 9.25e-10 8.86e-6 PCBP2 chr12
4 TC1500~ 156 0.128 0.0209 6.12 7.39e- 9 5.31e-5 MAN2A2 chr15
5 TC1700~ 156 0.0990 0.0279 5.77 4.24e- 8 2.44e-4 TLK2 chr17
6 TC1100~ 156 0.0894 0.0380 5.63 8.26e- 8 3.67e-4 ZBTB44 chr11
7 TC2000~ 156 0.351 0.0225 5.61 8.94e- 8 3.67e-4 FRG1BP chr20
8 TC2000~ 156 0.0949 0.0540 5.36 2.93e- 7 1.05e-3 GID8 chr20
9 TC0100~ 156 0.0911 0.0234 5.29 4.03e- 7 1.29e-3 CMPK1 chr1
10 TC1300~ 156 -0.0741 0.0375 -5.22 5.54e- 7 1.59e-3 DGKH chr13
# ... with 28,728 more rows, and 1 more variable: EntrezeGeneID_Affy <chr>
$INMASAB
# A tibble: 28,738 x 10
id n beta SE t P.Value adj.P.Val GeneSymbol_Affy seqname
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 TC0300~ 184 -0.0779 0.127 -6.57 5.11e-10 0.0000147 PHLDB2 chr3
2 TC0900~ 184 0.0782 0.0151 6.35 1.61e- 9 0.0000193 PHF2 chr9
3 TC0100~ 184 0.114 0.0199 6.31 2.02e- 9 0.0000193 CFH chr1
4 TC0300~ 184 -0.200 0.0225 -6.18 3.96e- 9 0.0000284 CD96 chr3
5 TC1900~ 184 -0.219 0.0319 -6.07 7.08e- 9 0.0000407 RPS4XP21 chr19
6 TC1000~ 184 0.0983 0.0333 5.62 6.96e- 8 0.000322 CUZD1 chr10
7 TC0100~ 184 -0.129 0.0247 -5.60 7.84e- 8 0.000322 S1PR1 chr1
8 TC1700~ 184 0.0793 0.0417 5.40 2.00e- 7 0.000709 TLK2 chr17
9 TC1000~ 184 0.0917 0.0225 5.38 2.22e- 7 0.000709 FAM24B;CUZD1 chr10
10 TC0100~ 184 -0.182 0.0331 -5.32 2.97e- 7 0.000855 SLAMF6 chr1
# ... with 28,728 more rows, and 1 more variable: EntrezeGeneID_Affy <chr>
attr(,"class")
[1] "dsLimma" "list"
11.4 Surrogate variable analysis.
We can also check whether there is inflation just executing:
qqplot(model.sex$BIB$P.Value)
So, in that case, the model does not need to remove unwanted variability (λ<2). If that was not the case, we could use surrogate variable analysis just changing the argument sva=TRUE..
The ds.limma function has another arguments that can be used to fit other type of models:
- method: Linear regression (“ls”) or robust regression (“robust”) used in limma (lmFit).
- robust: Logical. Robust method used for outlier sample variances used in limma (eBayes).
- normalization: normalization method used in the voom transformation (default “none”).
- voomQualityWeights: Logical. Should voomQualityWeights function be used instead of voom? (default FALSE)
- big: Logical. If sva was set to TRUE, should SmartSVA be implemented?
Then, data can be combined meta-analysed as follows:
<- metaPvalues(model.sex_annot)
metaP.model.sex.annot metaP.model.sex.annot
# A tibble: 28,738 x 8
id BIB EDEN KANC MoBA Rhea INMASAB p.meta
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TC1000000~ 3.10e- 6 8.20e-4 2.51e-9 1.62e- 9 3.39e-11 1.09e-5 2.63e-34
2 TC1900139~ 3.68e- 8 4.16e-5 1.92e-7 3.19e-10 6.26e- 5 7.08e-9 2.54e-33
3 TC2000118~ 7.58e- 9 7.07e-5 5.48e-7 8.83e- 9 8.94e- 8 8.94e-7 1.16e-32
4 TC1700075~ 4.23e- 9 7.24e-4 9.59e-6 3.60e- 9 4.24e- 8 2.00e-7 4.64e-32
5 TC1500087~ 4.20e- 6 6.67e-5 5.44e-8 3.83e- 8 7.39e- 9 8.93e-7 1.84e-31
6 TC2000017~ 8.09e- 9 9.86e-5 1.27e-7 9.23e- 8 1.26e- 6 1.46e-6 7.53e-31
7 TC1300109~ 4.12e-11 1.01e-4 3.58e-5 3.58e- 7 1.02e- 6 1.88e-5 3.56e-29
8 TC0100090~ 5.82e- 5 8.74e-4 4.53e-8 3.43e- 9 2.77e- 6 7.84e-8 5.75e-29
9 TC0900148~ 7.50e- 6 1.11e-3 1.74e-5 2.15e- 6 2.77e-11 8.15e-7 2.17e-28
10 TC1300015~ 5.68e- 6 8.56e-3 7.56e-8 2.68e-11 5.54e- 7 1.97e-4 3.22e-28
# ... with 28,728 more rows
# Get the number of differentially expressed genes in the meta approach
dim(metaP.model.sex.annot[metaP.model.sex.annot$p.meta<0.05,])
[1] 3839 8
11.5 Enrichment analysis of functional annotations.
Once we have obtained the top differentially expressed genes per cohort, we could extract their gene symbols or gene entrez ids directly from each output dataframe and continue with the functional annotation analysis in our local session as usual. To this end, one of the classic approaches is to evaluate whether any information (in the form of annotations) is over-represented in the gene list compared to the rest of the genes in the genome. This type of enrichment analysis is based on annotating genes with information available in different databases such as Gene Ontology or the Kyoto Encyclopedia of Genes and Genomes, and establishing the frequencies of each term in the gene list and the rest of the genome. This allows a statistical test to be applied to determine which functional annotations are significantly enriched in the list.
To extract the list of top differentially expressed genes by cohort and perform enrichment analysis with KEGG we can run the next code lines:
# Get the entrez id for the top differentially expressed genes in each cohort.
<- function(x){ x[(x$adj.P.Val < 0.05),c("id","EntrezeGeneID_Affy")] }
get_entrezid <- lapply(model.sex_annot,get_entrezid)
topgenes_by_cohort length(unlist(lapply(topgenes_by_cohort,function(x) x[,2])))
[1] 666
# Get the entrez id for the top differentially expressed genes in the meta-analysis
# approach
<- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
ensembl <- listFilters(ensembl)
filters <- listAttributes(ensembl)
attributes <- gsub(x = as.character(metaP.model.sex.annot[metaP.model.sex.annot$p.meta<
affy_id 0.05/nrow(metaP.model.sex.annot),1]$id), pattern = "hg.1", "hg")
# Number of probes passing multiple-testing correction in the meta-analysis.
length(affy_id)
[1] 325
# Get the entrez id of the genes mapped by the probes from the previous list.
<- getBM(attributes = c("entrezgene_id"),
topgenes_metaP filters = "affy_hta_2_0",
values = affy_id,
mart = ensembl)
length(topgenes_metaP[,1])
[1] 276
# Functional enrichment analysis with KEGG.
# We will use as reference a list of genes loaded from the DOSE package.
data(geneList, package="DOSE")
# Enrichment for top genes from each cohort
<- enrichKEGG(gene = unlist(lapply(topgenes_by_cohort,function(x) x[,2])),
enrichKEGG organism = 'hsa',
pvalueCutoff = 0.1)
head(enrichKEGG)
ID Description GeneRatio
hsa03010 hsa03010 Ribosome 20/183
hsa03008 hsa03008 Ribosome biogenesis in eukaryotes 16/183
hsa04211 hsa04211 Longevity regulating pathway 10/183
hsa04213 hsa04213 Longevity regulating pathway - multiple species 7/183
hsa04150 hsa04150 mTOR signaling pathway 11/183
hsa04152 hsa04152 AMPK signaling pathway 9/183
BgRatio pvalue p.adjust qvalue
hsa03010 158/8135 2.903164e-10 7.490164e-08 6.295282e-08
hsa03008 109/8135 2.144118e-09 2.765912e-07 2.324675e-07
hsa04211 89/8135 2.831858e-05 2.435398e-03 2.046887e-03
hsa04213 62/8135 4.497217e-04 2.900705e-02 2.437965e-02
hsa04150 156/8135 7.577933e-04 3.910213e-02 3.286430e-02
hsa04152 120/8135 1.483310e-03 5.812931e-02 4.885613e-02
geneID
hsa03010 6176/6171/6208/100169751/100169753/100169754/100169755/100169756/100169757/100169758/100169759/100169761/100169762/100169763/100169764/100169765/100169766/100169767/100169768/23521
hsa03008 100169751/100169753/100169754/100169755/100169756/100169757/100169758/100169759/100169761/100169762/100169763/100169764/100169765/100169766/100169767/100169768
hsa04211 5567/5562/6198/27244/9586/64764/6648/3480/10000/207
hsa04213 5567/5562/6198/6648/3480/10000/207
hsa04150 523/6446/5562/6198/7477/1975/3480/5578/10000/207/9706
hsa04152 5562/6198/55844/1938/9586/64764/3480/10000/207
Count
hsa03010 20
hsa03008 16
hsa04211 10
hsa04213 7
hsa04150 11
hsa04152 9
# Enrichment for top genes from the meta-analysis
<- enrichKEGG(gene = as.character(topgenes_metaP[,1]),
enrichKEGG_meta organism = 'hsa',
pvalueCutoff = 0.1)
head(enrichKEGG_meta)
ID Description GeneRatio BgRatio pvalue
hsa04211 hsa04211 Longevity regulating pathway 7/117 89/8135 0.0002767746
p.adjust qvalue geneID Count
hsa04211 0.06393493 0.0614731 9586/6198/3480/26060/27244/5562/5567 7
# The pathways can be visualized alongside the genes associated with them:
browseKEGG(enrichKEGG, 'hsa04211')
As a result, we identify a list of 325 probes (mapping 287 genes) differentially expressed between boys and girls of the HELIX cohort, passing multiple-testing correction filter (P-value threshold=1.74e-06). As a proof of the ability of OmicSHIELD to be integrated with other R functionalities and bioconductor utilties, we continued the showcase presenting a functional enrichment analysis (FEA) of results, showing that significant DE genes participate in processes with evident and previously-described sexual dimorphism such is the case of “Longevity regulating pathways” (hsa04211).
Once analyses are completed, we proceed to close the DataSHIELD session by:
datashield.logout(conns)