14 Differential gene expression (DGE) analysis

Let us illustrate how to perform transcriptomic data analysis using data from TCGA project. We have uploaded to the opal server a resource called tcga_liver whose URL is http://duffel.rail.bio/recount/TCGA/rse_gene_liver.Rdata which is available through the recount project. This resource contains the RangeSummarizedExperiment with the RNAseq profiling of liver cancer data from TCGA. Next, we illustrate how a differential expression analysis to compare RNAseq profiling of women vs men (variable gdc_cases.demographic.gender). The DGE analysis is normally performed using limma package. In that case, as we are analyzing RNA-seq data, limma + voom method will be required.

Let us start by creating the connection to the opal server:

builder <- newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org", 
               user = "dsuser", password = "password", 
               resource = "RSRC.tcga_liver", driver = "OpalDriver")

logindata <- builder$build()

conns <- datashield.login(logins = logindata, assign = TRUE, 
                          symbol = "res")

Then, let us coerce the resource to a RangedSummarizedExperiment which is the type of object that is available in the recount project.

datashield.assign.expr(conns, symbol = "rse", 
                       expr = quote(as.resource.object(res)))
ds.class("rse")

$study1
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"

The number of features and samples can be inspected by

ds.dim("rse")

$`dimensions of rse in study1`
[1] 58037   424

$`dimensions of rse in combined studies`
[1] 58037   424

And the names of the features using the same function used in the case of analyzing an ExpressionSet

name.features <- ds.featureNames("rse")
lapply(name.features, head)

$study1
[1] "ENSG00000000003.14" "ENSG00000000005.5"  "ENSG00000000419.12" "ENSG00000000457.13" "ENSG00000000460.16"
[6] "ENSG00000000938.12"

Also the covariate names can be inspected by

name.vars <- ds.featureData("rse")
lapply(name.vars, head, n=15)

$study1
 [1] "project"                                        "sample"                                        
 [3] "experiment"                                     "run"                                           
 [5] "read_count_as_reported_by_sra"                  "reads_downloaded"                              
 [7] "proportion_of_reads_reported_by_sra_downloaded" "paired_end"                                    
 [9] "sra_misreported_paired_end"                     "mapped_read_count"                             
[11] "auc"                                            "sharq_beta_tissue"                             
[13] "sharq_beta_cell_type"                           "biosample_submission_date"                     
[15] "biosample_publication_date"

We can visualize the levels of the variable having gender information

ds.table("rse$gdc_cases.demographic.gender")


 Data in all studies were valid 

Study 1 :  No errors reported from this study

$output.list
$output.list$TABLE_rvar.by.study_row.props
                                study
rse$gdc_cases.demographic.gender 1
                          female 1
                          male   1

$output.list$TABLE_rvar.by.study_col.props
                                study
rse$gdc_cases.demographic.gender         1
                          female 0.3372642
                          male   0.6627358

$output.list$TABLE_rvar.by.study_counts
                                study
rse$gdc_cases.demographic.gender   1
                          female 143
                          male   281

$output.list$TABLES.COMBINED_all.sources_proportions
rse$gdc_cases.demographic.gender
female   male 
 0.337  0.663 

$output.list$TABLES.COMBINED_all.sources_counts
rse$gdc_cases.demographic.gender
female   male 
   143    281 


$validity.message
[1] "Data in all studies were valid"

The differential expression analysis is then performed by:

ans.gender <- ds.limma(model =  ~ gdc_cases.demographic.gender, 
                       Set = "rse", type.data = "RNAseq", 
                       sva = FALSE)

Notice that we have set type.data='RNAseq' to consider that our data are counts obtained from a RNA-seq experiment. By indicating so, the differential analysis is performed by using voom + limma as previously mention.

The top differentially expressed genes can be visualized by:

ans.gender

$study1
# A tibble: 58,037 x 10
   id                  logFC   CI.L   CI.R AveExpr     t   P.Value adj.P.Val     B     SE
   <chr>               <dbl>  <dbl>  <dbl>   <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl>
 1 ENSG00000233070.1   10.9   10.5   11.3   -5.59   53.6 7.05e-191 4.09e-186  402. 0.0761
 2 ENSG00000213318.4   11.4   10.9   11.8   -4.35   49.4 5.93e-178 1.72e-173  376. 0.462 
 3 ENSG00000067048.16   9.63   9.24  10.0    0.856  47.9 5.05e-173 9.78e-169  366. 0.0608
 4 ENSG00000260197.1   10.2    9.76  10.6   -5.88   46.3 5.27e-168 7.65e-164  355. 0.0654
 5 ENSG00000012817.15  11.3   10.8   11.8   -0.226  44.7 1.29e-162 1.50e-158  344. 0.0880
 6 ENSG00000131002.11  11.4   10.9   11.9   -1.10   44.5 7.80e-162 7.55e-158  343. 0.120 
 7 ENSG00000198692.9   12.3   11.8   12.9   -0.882  44.4 1.62e-161 1.34e-157  342. 0.129 
 8 ENSG00000183878.15   8.66   8.27   9.05  -0.781  43.9 9.83e-160 7.13e-156  338. 0.0813
 9 ENSG00000129824.15  11.4   10.8   11.9    2.37   43.3 1.44e-157 9.29e-154  334. 0.0896
10 ENSG00000274655.1  -12.4  -12.9  -11.8   -7.71  -43.1 5.20e-157 3.02e-153  333. 0.0931
# ... with 58,027 more rows

attr(,"class")
[1] "dsLimma" "list"

We have also implemented two other functions ds.DESeq2 and ds.edgeR that perform DGE analysis using DESeq2 and edgeR methods. This is the R code used to that purpose:

To be supplied

We close the DataSHIELD session by:

datashield.logout(conns)