14 Differential gene expression (DGE) analysis
Let us illustrate how to perform transcriptomic data analysis using data from TCGA project. We have uploaded to the opal server a resource called tcga_liver
whose URL is http://duffel.rail.bio/recount/TCGA/rse_gene_liver.Rdata which is available through the recount project. This resource contains the RangeSummarizedExperiment
with the RNAseq profiling of liver cancer data from TCGA. Next, we illustrate how a differential expression analysis to compare RNAseq profiling of women vs men (variable gdc_cases.demographic.gender
). The DGE analysis is normally performed using limma package. In that case, as we are analyzing RNA-seq data, limma + voom
method will be required.
Let us start by creating the connection to the opal server:
builder <- newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org",
user = "dsuser", password = "password",
resource = "RSRC.tcga_liver", driver = "OpalDriver")
logindata <- builder$build()
conns <- datashield.login(logins = logindata, assign = TRUE,
symbol = "res")
Then, let us coerce the resource to a RangedSummarizedExperiment
which is the type of object that is available in the recount project.
datashield.assign.expr(conns, symbol = "rse",
expr = quote(as.resource.object(res)))
ds.class("rse")
$study1
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"
The number of features and samples can be inspected by
$`dimensions of rse in study1`
[1] 58037 424
$`dimensions of rse in combined studies`
[1] 58037 424
And the names of the features using the same function used in the case of analyzing an ExpressionSet
$study1
[1] "ENSG00000000003.14" "ENSG00000000005.5" "ENSG00000000419.12" "ENSG00000000457.13" "ENSG00000000460.16"
[6] "ENSG00000000938.12"
Also the covariate names can be inspected by
$study1
[1] "project" "sample"
[3] "experiment" "run"
[5] "read_count_as_reported_by_sra" "reads_downloaded"
[7] "proportion_of_reads_reported_by_sra_downloaded" "paired_end"
[9] "sra_misreported_paired_end" "mapped_read_count"
[11] "auc" "sharq_beta_tissue"
[13] "sharq_beta_cell_type" "biosample_submission_date"
[15] "biosample_publication_date"
We can visualize the levels of the variable having gender information
Data in all studies were valid
Study 1 : No errors reported from this study
$output.list
$output.list$TABLE_rvar.by.study_row.props
study
rse$gdc_cases.demographic.gender 1
female 1
male 1
$output.list$TABLE_rvar.by.study_col.props
study
rse$gdc_cases.demographic.gender 1
female 0.3372642
male 0.6627358
$output.list$TABLE_rvar.by.study_counts
study
rse$gdc_cases.demographic.gender 1
female 143
male 281
$output.list$TABLES.COMBINED_all.sources_proportions
rse$gdc_cases.demographic.gender
female male
0.337 0.663
$output.list$TABLES.COMBINED_all.sources_counts
rse$gdc_cases.demographic.gender
female male
143 281
$validity.message
[1] "Data in all studies were valid"
The differential expression analysis is then performed by:
ans.gender <- ds.limma(model = ~ gdc_cases.demographic.gender,
Set = "rse", type.data = "RNAseq",
sva = FALSE)
Notice that we have set type.data='RNAseq'
to consider that our data are counts obtained from a RNA-seq experiment. By indicating so, the differential analysis is performed by using voom
+ limma
as previously mention.
The top differentially expressed genes can be visualized by:
$study1
# A tibble: 58,037 x 10
id logFC CI.L CI.R AveExpr t P.Value adj.P.Val B SE
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000233070.1 10.9 10.5 11.3 -5.59 53.6 7.05e-191 4.09e-186 402. 0.0761
2 ENSG00000213318.4 11.4 10.9 11.8 -4.35 49.4 5.93e-178 1.72e-173 376. 0.462
3 ENSG00000067048.16 9.63 9.24 10.0 0.856 47.9 5.05e-173 9.78e-169 366. 0.0608
4 ENSG00000260197.1 10.2 9.76 10.6 -5.88 46.3 5.27e-168 7.65e-164 355. 0.0654
5 ENSG00000012817.15 11.3 10.8 11.8 -0.226 44.7 1.29e-162 1.50e-158 344. 0.0880
6 ENSG00000131002.11 11.4 10.9 11.9 -1.10 44.5 7.80e-162 7.55e-158 343. 0.120
7 ENSG00000198692.9 12.3 11.8 12.9 -0.882 44.4 1.62e-161 1.34e-157 342. 0.129
8 ENSG00000183878.15 8.66 8.27 9.05 -0.781 43.9 9.83e-160 7.13e-156 338. 0.0813
9 ENSG00000129824.15 11.4 10.8 11.9 2.37 43.3 1.44e-157 9.29e-154 334. 0.0896
10 ENSG00000274655.1 -12.4 -12.9 -11.8 -7.71 -43.1 5.20e-157 3.02e-153 333. 0.0931
# ... with 58,027 more rows
attr(,"class")
[1] "dsLimma" "list"
We have also implemented two other functions ds.DESeq2
and ds.edgeR
that perform DGE analysis using DESeq2 and edgeR methods. This is the R code used to that purpose:
To be supplied
We close the DataSHIELD session by: