6 RNAseq data analysis
In this chapter we will assume that the data analyst has obtained a read count table from raw fastq reads obtained from an Illumina sequencing run. This can also be performed using Bioconductor R packages, but sometimes you can ask the core facility for this data since it can be very computational expensive. Here we demonstrate how to process the count table, make a case-control differential expression analysis, and do some downstream functional enrichment analysis that can also be applied to the results obtained from microarrays experiments.
6.1 Introduction
The transcription profile of a particular gene follows from counting the number of times the transcripts of the genes were mapped by sequenced reads. The summarized RNA-seq data is known as count data. Figure 6.1 describes the main steps that are taken to obtain the count table that measures the transcription of active genes in biological samples.
The sequenced reads can be counted in a number of different ways:
- By alignment to the genome and summarized at either gene or transcript (isoform) level.
- By alignment to the transcriptome and summarized at either gene or transcript (isoform) level.
- By assembling directly into transcripts and summarized at either gene or transcript (isoform) level.
Table 6.2 shows a hypothetical RNA-seq data of \(G\) genes obtained for 6 samples, belonging to two different conditions. The main goal is to discover the genes that are differentially expressed between individuals from condition A and B.
In RNA-seq analysis, we deal with the number of reads (counts) that map to the biological feature of interest (gene, transcript, exon, etc.). The count number depends linearly with the abundance of the target’s transcription because the sequencing of RNA is a direct measure of transcription. This is considered as one of the advantages over microarrays that indirectly measure transcription by hybridization. Figure 6.3 illustrates the type of data from microarray and RNA-seq experiments. While microarray data is continuous, RNA-seq is discrete and, therefore, the modeling of each type of data is different.
There are two important factors that influence the number of gene counts and which need to be taken into account, see figure 6.4. The first factor is the sequencing depth or library size, that is, the total number of reads mapped to the genome; the second factor is the gene length, i.e. the number of bases covering a gene. It is expected that larger genes, for a given level of transcription, will have more gene counts.
6.2 Within sample normalization of read counts
The most common application after a gene’s expression is quantified (as the number of reads aligned to the gene), is to compare the gene’s expression in different conditions, for instance, in a case-control setting (e.g. disease versus normal) or in a time-series (e.g. along different developmental stages). Making such comparisons helps identify the genes that might be responsible for a disease or an impaired developmental trajectory. However, there are multiple caveats that needs to be addressed before making a comparison between the read counts of a gene in different conditions (Maza, et al. 2013).
Library size (i.e. sequencing depth) varies between samples coming from different lanes of the flow cell of the sequencing machine. Longer genes will have a higher number of reads. Library composition (i.e. relative size of the studied transcriptome) can be different in two different biological conditions. GC content biases across different samples may lead to a biased sampling of genes.
The most basic normalization approaches address the sequencing depth bias. Such procedures normalize the read counts per gene by dividing each gene’s read count by a certain value and multiplying it by \(10^6\). These normalized values are usually referred to as CPM (counts per million reads):
- Total Counts Normalization (divide counts by the sum of all counts)
- Upper Quartile Normalization (divide counts by the upper quartile value of the counts)
- Median Normalization (divide counts by the median of all counts)
Popular metrics that improve upon CPM are RPKM/FPKM (reads/fragments per kilobase of million reads) and TPM (transcripts per million). RPKM is obtained by dividing the CPM value by another factor, which is the length of the gene per kilobase. FPKM is the same as RPKM, but is used for paired-end reads. Thus, RPKM/FPKM methods account for, firstly, the library size, and secondly, the gene lengths.
TPM also controls for both the library size and the gene lengths, however, with the TPM method, the read counts are first normalized by the gene length (per kilobase), and then gene-length normalized values are divided by the sum of the gene-length normalized values and multiplied by \(10^6\). Thus, the sum of normalized values for TPM will always be equal to \(10^6\) for each library, while the sum of RPKM/FPKM values do not sum to \(10^6\). Therefore, it is easier to interpret TPM values than RPKM/FPKM values.
Other methods that consider GC content are: cqn (Hansen and Irizarry, 2012), EDAseq (Risso, et al. 2011).
We demonstrate different methods to normalize the count data from lymphoblastoid cell lines from 69 unrelated Nigerian individuals, described in Pickrell, et al. 2010. This data are available through the Bioconductor’s tweeDEseqCountData
package.
library(tweeDEseqCountData)
data(pickrell)
pickrell.eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 52580 features, 69 samples
element names: exprs
protocolData: none
phenoData
sampleNames: NA18486 NA18498 ... NA19257 (69 total)
varLabels: num.tech.reps population study gender
varMetadata: labelDescription
featureData
featureNames: ENSG00000000003 ENSG00000000005 ... LRG_99
(52580 total)
fvarLabels: gene
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
The object pickrell.est
is an ExpressionSet
containing RNA-seq count data obtained from the ReCount repository available at . Details on the pre-processing steps to obtain this table of counts from the raw reads are provided on the website and in Frazee, et al. 2011.
Our aim is to use this data to compare different normalization methods. The most simple normalization is to standardize the counts dividing them by the total number of reads in the sample (size of the libraries) and multiplying by \(10^6\) (CPM):
<- exprs(pickrell.eset)
counts <- colSums(counts)
lib.size <- sweep(counts, 2, FUN="/", lib.size) * 10^6 NormByCPM
Check that the sum of each column after CPM normalization equals to \(10^6\)
colSums(NormByCPM)
NA18486 NA18498 NA18499 NA18501 NA18502 NA18504 NA18505 NA18507
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
NA18508 NA18510 NA18511 NA18516 NA18517 NA18519 NA18520 NA18522
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
NA18523 NA18852 NA18853 NA18855 NA18856 NA18858 NA18861 NA18862
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
NA18870 NA18871 NA18909 NA18912 NA18913 NA18916 NA19093 NA19098
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
NA19099 NA19101 NA19102 NA19108 NA19114 NA19116 NA19119 NA19127
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
NA19128 NA19130 NA19131 NA19137 NA19138 NA19140 NA19143 NA19144
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
NA19147 NA19152 NA19153 NA19159 NA19160 NA19171 NA19172 NA19190
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
NA19192 NA19193 NA19200 NA19201 NA19203 NA19204 NA19209 NA19210
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
NA19222 NA19225 NA19238 NA19239 NA19257
1e+06 1e+06 1e+06 1e+06 1e+06
For an RPKM (reads per kilobase of transcript and million mapped reads) normalization, we first need to know the length of query regions (e.g. length of genes). This information is also available from the tweeDEseqCountData
package, but can also be obtained from biomaRt
.
data(annotEnsembl63)
head(annotEnsembl63)
Symbol Chr Start End EntrezID
ENSG00000252775 U7 5 133913821 133913880 <NA>
ENSG00000207459 U6 5 133970529 133970635 <NA>
ENSG00000252899 U7 5 133997420 133997479 <NA>
ENSG00000201298 U6 5 134036862 134036968 <NA>
ENSG00000222266 U6 5 134051173 134051272 <NA>
ENSG00000222924 U6 5 137405044 137405147 <NA>
Description
ENSG00000252775 U7 small nuclear RNA [Source:RFAM;Acc:RF00066]
ENSG00000207459 U6 spliceosomal RNA [Source:RFAM;Acc:RF00026]
ENSG00000252899 U7 small nuclear RNA [Source:RFAM;Acc:RF00066]
ENSG00000201298 U6 spliceosomal RNA [Source:RFAM;Acc:RF00026]
ENSG00000222266 U6 spliceosomal RNA [Source:RFAM;Acc:RF00026]
ENSG00000222924 U6 spliceosomal RNA [Source:RFAM;Acc:RF00026]
Length GCcontent
ENSG00000252775 NA NA
ENSG00000207459 NA NA
ENSG00000252899 NA NA
ENSG00000201298 NA NA
ENSG00000222266 NA NA
ENSG00000222924 NA NA
<- intersect(as.character(rownames(counts)),
genes.ok as.character(rownames(annotEnsembl63)))
<- annotEnsembl63[genes.ok,]
geneAnnot <- counts[genes.ok,]
counts.ok identical(rownames(geneAnnot), rownames(counts.ok))
[1] TRUE
We obtain annotation data in the object annotEnsembl63
and map it to counts
, we then check that both data.frame
are in the same order. The RPKM normalization is computed by
<- geneAnnot$Length
geneLengths <- t(t(counts.ok / geneLengths *10^3)
NormByRPKM /colSums(counts.ok)*10^6)
Check that the sum of each column after RPKM normalization is not equal to \(10^6\)
colSums(NormByRPKM, na.rm=TRUE)
NA18486 NA18498 NA18499 NA18501 NA18502 NA18504 NA18505
446915.7 468875.1 451749.9 424849.8 465086.6 462163.4 461488.9
NA18507 NA18508 NA18510 NA18511 NA18516 NA18517 NA18519
474510.7 461575.7 463082.1 469532.5 494144.4 453952.3 465812.3
NA18520 NA18522 NA18523 NA18852 NA18853 NA18855 NA18856
475356.5 450742.2 465948.1 449445.6 541396.1 458492.5 446573.0
NA18858 NA18861 NA18862 NA18870 NA18871 NA18909 NA18912
461737.2 419591.0 481777.5 464991.6 448378.9 487215.0 529199.2
NA18913 NA18916 NA19093 NA19098 NA19099 NA19101 NA19102
471316.9 447690.0 436454.4 436349.0 428527.4 450986.4 539169.3
NA19108 NA19114 NA19116 NA19119 NA19127 NA19128 NA19130
464412.9 552272.0 452169.8 488080.0 533040.2 565341.8 460997.0
NA19131 NA19137 NA19138 NA19140 NA19143 NA19144 NA19147
459320.6 451752.3 425970.4 503160.3 462846.1 440995.7 512702.4
NA19152 NA19153 NA19159 NA19160 NA19171 NA19172 NA19190
462656.1 443485.7 435608.1 483709.9 428850.3 459705.1 491931.6
NA19192 NA19193 NA19200 NA19201 NA19203 NA19204 NA19209
465900.2 567158.2 439382.1 465267.3 512908.7 424076.6 468026.6
NA19210 NA19222 NA19225 NA19238 NA19239 NA19257
491900.9 440928.6 510870.3 455355.5 439798.3 498785.3
#find gene length normalized values
<- apply(counts.ok, 2,
rpk function(x) x/(geneLengths/10^3))
#normalize by the sample size using rpk values
<- apply(rpk, 2,
NormByTPM function(x) x / sum(as.numeric(x), na.rm=TRUE) * 10^6)
Check that the sum of each column after RPM normalization equals to \(10^6\)
colSums(NormByTPM, na.rm=TRUE)
NA18486 NA18498 NA18499 NA18501 NA18502 NA18504 NA18505 NA18507
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
NA18508 NA18510 NA18511 NA18516 NA18517 NA18519 NA18520 NA18522
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
NA18523 NA18852 NA18853 NA18855 NA18856 NA18858 NA18861 NA18862
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
NA18870 NA18871 NA18909 NA18912 NA18913 NA18916 NA19093 NA19098
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
NA19099 NA19101 NA19102 NA19108 NA19114 NA19116 NA19119 NA19127
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
NA19128 NA19130 NA19131 NA19137 NA19138 NA19140 NA19143 NA19144
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
NA19147 NA19152 NA19153 NA19159 NA19160 NA19171 NA19172 NA19190
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
NA19192 NA19193 NA19200 NA19201 NA19203 NA19204 NA19209 NA19210
1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
NA19222 NA19225 NA19238 NA19239 NA19257
1e+06 1e+06 1e+06 1e+06 1e+06
None of these metrics (CPM, RPKM/FPKM, TPM) account for the other important confounding factor when comparing expression levels of genes across samples: the library composition, which may also be referred to as the relative size of the compared transcriptomes. This factor is not dependent on the sequencing technology, it is rather biological. For instance, when comparing transcriptomes of different tissues, there can be sets of genes in one tissue that consume a big chunk of the reads, while in the other tissues they are not expressed at all. This kind of imbalance in the composition of compared transcriptomes can lead to wrong conclusions about which genes are actually differentially expressed. This consideration is addressed in two popular R packages: DESeq2
(Love, Huber, and Anders 2014) and edgeR
(Robinson, McCarthy, and Smyth 2010) each with a different algorithm. edgeR
uses a normalization procedure called Trimmed Mean of M-values (TMM). DESeq2
implements a normalization procedure using median of Ratios, which is obtained by finding the ratio of the log-transformed count of a gene divided by the average of log-transformed values of the gene in all samples (geometric mean), and then taking the median of these values for all genes. The raw read count of the gene is finally divided by this value (median of ratios) to obtain the normalized counts.
The DESeq2
normalization is automatically performed when using DESeq2
package (we will see it later). We now apply a TMM (trimmed mean of M-values) normalization which, as previously mentioned, was developed to correct the gene counts by the expression properties of the whole sample. The TMM normalization method is implemented in the Bioconductor’s tweeDEseq
package and can be performed by:
library(tweeDEseq)
<- normalizeCounts(counts.ok, method="TMM") NormByTMM
Another type of normalization is CQN, available in the package cqn
. It requires different steps that have been encapsulated in the function called normalizeCounts ()
from tweeDEseq
. The function requires the length and the percentage of GC-content of each gene.
library(cqn)
<- geneAnnot[,c("Length", "GCcontent")]
annotation <- normalizeCounts(counts.ok, method="cqn",
NormByCQN annot=annotation)
RQ fit .....................................................................
SQN fitting ...
|
| | 0%
|
|=================== | 33%
|
|====================================== | 67%
|
|=========================================================| 100%
.
We now compare the performance of each normalization method. Assuming that between two samples, most genes are not differentially expressed, the distribution of the difference of log-ratios between the samples should be centered around 0 when data is correctly normalized. We thus examine the distributions for the 1st and 2nd sample under normalization by the total number of reads and by RPKM (Figure (fig:checkNorm):
<- log2(NormByCPM[, 1] / NormByCPM[, 2])
MbyT <- log2(NormByRPKM[, 1] / NormByRPKM[, 2])
MbyRPKM
par(mfrow=c(1,2))
hist(MbyT, xlab="log2-ratio", main="Total reads")
abline(v=0, col="red")
hist(MbyRPKM, xlab="log2-ratio", main="RPKM")
abline(v=0, col="red")
Figure 6.5, however, does not show whether the null difference between samples holds for different levels of gene expression. With a MA-plot, one can check whether data has been correctly normalized at any expression level. The MA-plot is available from the Bioconductor package edgeR
. The resulting plot can be seen in Figure 6.6 (top left part). In the X-axis the plot shows the mean between the log-ratios. In the Y-axis the plot shows the difference of log-ratios between the samples, similar to figure 6.5. The MA-plot is, therefore, the difference against the mean of the gene counts between the two samples. The red line shows the expected M-values as a function of A-values. In our example of non-normalized data, we can see that for genes with low counts the distribution of the difference between samples is not zero; in particular, we observe that sample 2 has more counts than sample 1.
library(edgeR)
par(mfrow=c(3,2))
maPlot(counts[,1], counts[,2], pch=19, cex=.5, ylim=c(-8,8),
allCol="darkgray", lowess=TRUE,
xlab=expression( A == log[2] (sqrt(Sample1 %.% Sample3)) ),
ylab=expression(M == log[2](Sample1/Sample3)))
grid(col="black")
title("Raw data")
maPlot(NormByCPM[,1], NormByCPM[,2], pch=19, cex=.5, ylim=c(-8,8),
allCol="darkgray", lowess=TRUE,
xlab=expression( A == log[2] (sqrt(Sample1 %.% Sample3)) ),
ylab=expression(M == log[2](Sample1/Sample3)))
grid(col="black")
title("CPM normalization")
<- NormByRPKM[complete.cases(NormByRPKM),]
NormByRPKM maPlot(NormByRPKM[,1], NormByRPKM[,2], pch=19, cex=.5, ylim=c(-8,8),
allCol="darkgray", lowess=TRUE,
xlab=expression( A == log[2] (sqrt(Sample1 %.% Sample3)) ),
ylab=expression(M == log[2](Sample1/Sample3)))
grid(col="black")
title("RPKM normalization")
<- NormByTPM[complete.cases(NormByTPM),]
NormByTPM maPlot(NormByTPM[,1], NormByTPM[,2], pch=19, cex=.5, ylim=c(-8,8),
allCol="darkgray", lowess=TRUE,
xlab=expression( A == log[2] (sqrt(Sample1 %.% Sample3)) ),
ylab=expression(M == log[2](Sample1/Sample3)))
grid(col="black")
title("TPM normalization")
maPlot(NormByTMM[,1], NormByTMM[,2], pch=19, cex=.5, ylim=c(-8,8),
allCol="darkgray", lowess=TRUE,
xlab=expression( A == log[2] (sqrt(Sample1 %.% Sample3)) ),
ylab=expression(M == log[2](Sample1/Sample3)))
grid(col="black")
title("TMM normalization")
maPlot(NormByCQN[,1], NormByCQN[,2], pch=19, cex=.5, ylim=c(-8,8),
allCol="darkgray", lowess=TRUE,
xlab=expression( A == log[2] (sqrt(Sample1 %.% Sample3)) ),
ylab=expression(M == log[2](Sample1/Sample3)))
grid(col="black")
title("CQN normalization")
6.3 Exploratory analysis of the read counts table
A typical quality control, in this case interrogating the RNA-seq experiment design, is to measure the similarity of the samples with each other in terms of the quantified expression level profiles across a set of genes. One important observation to make is to see whether the most similar samples to any given sample are the biological replicates of that sample. This can be computed using unsupervised clustering techniques such as hierarchical clustering and visualized as a heatmap with dendrograms. Another most commonly applied technique is a dimensionality reduction technique called Principal Component Analysis (PCA) and visualized as a two-dimensional (or in some cases three-dimensional) scatter plot.
6.3.1 Clustering
We can combine clustering and visualization of the clustering results by using heatmap functions that are available in a variety of R libraries. The basic R installation comes with the stats::heatmap ()
function. However, there are other libraries available in CRAN (e.g. pheatmap
) or Bioconductor (e.g. ComplexHeatmap
) that come with more flexibility and more appealing visualizations.
Here we demonstrate a heatmap using the pheatmap
package and the previously calculated NormByTPM
matrix. As these matrices can be quite large, both computing the clustering and rendering the heatmaps can take a lot of resources and time. Therefore, a quick and informative way to compare samples is to select a subset of genes that are, for instance, most variable across samples, and use that subset to do the clustering and visualization.
Let’s select the top 100 most variable genes among the samples.
#compute the variance of each gene across samples
<- apply(NormByTPM, 1, var)
V #sort the results by variance in decreasing order
#and select the top 100 genes
<- names(V[order(V, decreasing = T)][1:100]) selectedGenes
Now we can quickly produce a heatmap where samples and genes are clustered (see Figure @ref{fig:clust}).
library(pheatmap)
pheatmap(NormByTPM[selectedGenes,], scale = 'row', show_rownames = FALSE)
6.3.2 PCA
Let’s make a PCA plot to see the clustering of replicates as a scatter plot in two dimensions. We can put a different color for our grouping variable (Figure @ref{fig:pcaRNAseq}). NOTE: Having clusters of individuals given a third variable (plate, population, …) can be a problem that must be corrected in downstream analyses.
library(stats)
library(ggplot2)
library(ggfortify)
#transpose the matrix
<- t(NormByTPM[selectedGenes,])
M # transform the counts to log2 scale
<- log2(M + 1)
M #compute PCA
<- prcomp(M)
pcaResults
#plot PCA results making use of ggplot2's autoplot function
#ggfortify is needed to let ggplot2 know about PCA data structure.
autoplot(pcaResults, data = pData(pickrell.eset), colour = 'gender')
6.4 Differential expression analysis
RNA-seq data are discrete therefore linear models, such that those implemented in limma
cannot be applied. While it is common practice to apply logarithmic transformations before fitting linear models, the transformations depend on an offset level to account for zero counts, which, in turn, can affect the group differences assessed in regression models. Other transformations may be applied, however, it is more adequate to make inferences based on distributions of count data such as the Poisson or the negative binomial distributions . Negative binomial modeling is preferred over Poisson’s, as biological variability results in a difference between the mean and variance of the data. The negative binomial is defined by the parameters \(\lambda\) and \(\phi\) that model the intensity and overdispersion of data.
Let us then assume that \(X_{gA}\) corresponds to the number of reads that mapped into gene \(g\) (\(g=1, \ldots, G\)) across subjects within condition \(A\) and that \(X_{gA} \sim NB(\lambda_{gA}, \phi_g)\). If counts in condition \(B\) also distribute \(X_{gB} \sim NB(\lambda_{gB}, \phi_g)\) then we aim to test whether \(\lambda_{gA}\) and \(\lambda_{gB}\) are significantly different across all \(g\). The dispersion of the gene \(\phi_g\) cannot be estimated when few individuals are analyzed, therefore we can:
There is another approach proposed by Law, et al. 2014 that makes a transformation of the count data (voom) that allows the use of normal linear models to analyze RNA-sea experiments. This method estimates the mean-variance relationship of the log-counts, generates some weights for each observation that are used in the linear models implemented in limma
.
::include_graphics('figures/mean_variance.png') knitr
We demonstrate the use of both DESeq2
and voom
using our previous example of lymphoblastoid cell lines. Our final aim is to detect the genes that are differentially expressed between males and females.
6.4.1 DESeq2
Let us start with DESeq2
workflow and how it calculates differential expression.
- The read counts are normalized by computing size factors, which addresses the differences not only in the library sizes, but also the library compositions.
- For each gene, a dispersion estimate is calculated. The dispersion value computed by DESeq2 is equal to the squared coefficient of variation (variation divided by the mean).
- A line is fit across the dispersion estimates of all genes computed in step 2 versus the mean normalized counts of the genes (Figure 6.8).
- Dispersion values of each gene are shrunk towards the fitted line in step 3.
- A Generalized Linear Model is fitted which considers additional confounding variables related to the experimental design such as sequencing batches, treatment, temperature, patient’s age, sequencing technology, etc., and uses negative binomial distribution for fitting count data.
- For a given contrast (e.g. treatment type: drug-A versus untreated), a test for differential expression is carried out against the null hypothesis that the log fold change of the normalized counts of the gene in the given pair of groups is exactly zero.
- It adjusts p-values for multiple-testing.
In order to carry out a differential expression analysis using DESeq2, three kinds of inputs are necessary:
- The read count table: This table must be raw read counts as integers that are not processed in any form by a normalization technique. The rows represent features (e.g. genes, transcripts, genomic intervals) and columns represent samples.
- A colData table: This table describes the experimental design.
- A design formula: This formula is needed to describe the variable of interest in the analysis (e.g. treatment status) along with (optionally) other covariates (e.g. batch, temperature, sequencing technology).
Let’s define these inputs:
<- as.matrix(counts)
countData #define the experimental setup
<- pData(pickrell.eset)
colData #define the design formula
<- as.formula(~ gender) designFormula
Now, we are ready to run DESeq2
.
library(DESeq2)
#create a DESeq dataset object from the count matrix and the colData
<- DESeqDataSetFromMatrix(countData = countData,
dds colData = colData,
design = designFormula)
#print dds object to see the contents
print(dds)
class: DESeqDataSet
dim: 52580 69
metadata(1): version
assays(1): counts
rownames(52580): ENSG00000000003 ENSG00000000005 ... LRG_98
LRG_99
rowData names(0):
colnames(69): NA18486 NA18498 ... NA19239 NA19257
colData names(4): num.tech.reps population study gender
The DESeqDataSet
object contains all the information about the experimental setup, the read counts, and the design formulas. Certain functions can be used to access this information separately:
rownames(dds)
shows which features are used in the study (e.g. genes),colnames(dds)
displays the studied samples,counts(dds)
displays the count table, andcolData(dds)
displays the experimental setup.
Remove genes that have almost no information in any of the given samples.
<- dds[ rowSums(counts(dds)) > 1, ] dds
Now, we can use the DESeq ()
function of DESeq2
, which is a wrapper function that implements estimation of size factors to normalize the counts, estimation of dispersion values, and computing a GLM model based on the experimental design formula. This function returns a DESeqDataSet
object, which is an updated version of the dds
object that we pass to the function as input.
<- DESeq(dds) dds
Now, we can compare and contrast the samples based on different variables of interest. In this case, we currently have only one variable, which is the gender variable that determines if a sample is a male or a female.
<- results(dds, contrast = c("gender", 'male', 'female'))
DEresults <- DEresults[order(DEresults$pvalue),] DEresults
Thus we have obtained a table containing the differential expression status of male samples compared to the femal samples (NOTE: in general you will have contrast = c("group", 'CASE', 'CONTROL')
).
It is important to note that the sequence of the elements provided in the contrast argument determines which group of samples are to be used as the control. This impacts the way the results are interpreted, for instance, if a gene is found up-regulated (has a positive log2 fold change), the up-regulation status is only relative to the factor that is provided as control. In this case, we used samples from the “female” group as control and contrasted the samples from the “male” group with respect to the “female” samples. Thus genes with a positive log2 fold change are called up-regulated in females with respect to females, while genes with a negative log2 fold change are down-regulated in the males. Whether the deregulation is significant or not, warrants assessment of the adjusted p-values.
Let’s have a look at the contents of the DEresults
table.
DEresults
log2 fold change (MLE): gender male vs female
Wald test p-value: gender male vs female
DataFrame with 11916 rows and 6 columns
baseMean log2FoldChange lfcSE stat
<numeric> <numeric> <numeric> <numeric>
ENSG00000129824 153.69742 10.38990 0.339760 30.5801
ENSG00000099749 15.35847 7.58113 0.346235 21.8959
ENSG00000198692 8.19446 6.79870 0.369269 18.4112
ENSG00000154620 10.90122 5.88750 0.344185 17.1056
ENSG00000157828 5.71040 6.11005 0.462473 13.2117
... ... ... ... ...
ENSG00000136352 0.1022197 9.16111e-05 2.0766732 4.41144e-05
ENSG00000081800 0.0923587 9.14659e-05 2.1146437 4.32536e-05
ENSG00000184302 0.0920789 9.14620e-05 2.1156773 4.32306e-05
ENSG00000203993 93.4492562 -2.71260e-06 0.0904723 -2.99827e-05
ENSG00000204019 0.0000000 0.00000e+00 0.0000000 0.00000e+00
pvalue padj
<numeric> <numeric>
ENSG00000129824 2.24914e-205 2.00444e-201
ENSG00000099749 2.84178e-106 1.26630e-102
ENSG00000198692 1.06759e-75 3.17147e-72
ENSG00000154620 1.34765e-65 3.00257e-62
ENSG00000157828 7.51022e-40 1.33862e-36
... ... ...
ENSG00000136352 0.999965 NA
ENSG00000081800 0.999965 NA
ENSG00000184302 0.999966 NA
ENSG00000203993 0.999976 0.999976
ENSG00000204019 1.000000 NA
The first three lines in this output show the contrast and the statistical test that were used to compute these results, along with the dimensions of the resulting table (number of columns and rows). Below these lines is the actual table with 6 columns: baseMean
represents the average normalized expression of the gene across all considered samples. log2FoldChange
represents the base-2 logarithm of the fold change of the normalized expression of the gene in the given contrast. lfcSE
represents the standard error of log2 fold change estimate, and stat
is the statistic calculated in the contrast which is translated into a pvalue
and adjusted for multiple testing in the padj
column.
6.4.1.1 Diagnostic plots
At this point, before proceeding to do any downstream analysis and jumping to conclusions about the biological insights that are reachable with the experimental data at hand, it is important to do some more diagnostic tests to improve our confidence about the quality of the data and the experimental setup.
6.4.1.1.1 MA plot
An MA plot is useful to observe if the data normalization worked well (Figure 6.9). The MA plot is a scatter plot where the x-axis denotes the average of normalized counts across samples and the y-axis denotes the log fold change in the given contrast. Most points are expected to be on the horizontal 0 line (most genes are not expected to be differentially expressed).
::plotMA(object = dds, ylim = c(-5, 5)) DESeq2
6.4.1.1.2 Volcano plot
We highly recommend to read the vignette of EnhancedVolcano Bioconductor package to get this type of plots. This figure has the default cut-off for log2FC as >|2| and the default cut-off for P value as 10e-6.:
library(EnhancedVolcano)
EnhancedVolcano(DEresults,
lab = rownames(DEresults),
x = 'log2FoldChange',
y = 'pvalue', ylim=c(0,50))
6.4.1.1.3 P-value distribution
It is also important to observe the distribution of raw p-values (Figure 6.11). We expect to see a peak around low p-values and a uniform distribution at P-values above 0.1. Otherwise, adjustment for multiple testing does not work and the results are not meaningful.
library(ggplot2)
ggplot(data = as.data.frame(DEresults), aes(x = pvalue)) +
geom_histogram(bins = 100)
This type of plots indicates that there is unwanted variability driven by unobserved variables. Let’s talk about it later.
6.4.1.1.4 PCA plot
A final diagnosis is to check the biological reproducibility of the sample replicates in a PCA plot or a heatmap. To plot the PCA results, we need to extract the normalized counts from the DESeqDataSet
object. It is possible to color the points in the scatter plot by the variable of interest, which helps to see if the replicates cluster well (Figure 6.12).
# extract normalized counts from the DESeqDataSet object
<- DESeq2::counts(dds, normalized = TRUE)
countsNormalized
# select top 500 most variable genes
<- names(sort(apply(countsNormalized, 1, var),
selectedGenes decreasing = TRUE)[1:500])
<- prcomp(t(countsNormalized[selectedGenes,]))
pcaResults2 autoplot(pcaResults2, data = pData(pickrell.eset), colour = 'gender')
Alternatively, the normalized counts can be transformed using the DESeq2::rlog ()
or DESeq2::vst ()
function (which is much faster) and DESeq2::plotPCA ()
can be readily used to plot the PCA results
<- vst(dds)
rld ::plotPCA(rld, ntop = 500, intgroup = 'gender') +
DESeq2ylim(-25, 25) + theme_bw()
A similar plot to the MA plot is the RLE (Relative Log Expression) plot that is useful in finding out if the data at hand needs normalization (Gandolfo and Speed 2018). Sometimes, even the datasets normalized using the explained methods above may need further normalization due to unforeseen sources of variation that might stem from the library preparation, the person who carries out the experiment, the date of sequencing, the temperature changes in the laboratory at the time of library preparation, and so on and so forth. The RLE plot is a quick diagnostic that can be applied on the raw or normalized count matrices to see if further processing is required.
Let’s do RLE plots on the raw counts and normalized counts using the EDASeq package.
library(EDASeq)
par(mfrow = c(2, 1))
plotRLE(DESeq2::counts(dds, normalized = FALSE),
outline=FALSE, ylim=c(-2, 2),
col = as.numeric(colData$gender),
main = 'Raw counts')
plotRLE(DESeq2::counts(dds, normalized = TRUE),
outline=FALSE, ylim=c(-2, 2),
col = as.numeric(colData$gender),
main = 'Normalized Counts (DESeq2)')
Here the RLE plot is comprised of boxplots, where each box-plot represents the distribution of the relative log expression of the genes expressed in the corresponding sample. Each gene’s expression is divided by the median expression value of that gene across all samples. Then this is transformed to log scale, which gives the relative log expression value for a single gene. The RLE values for all the genes from a sample are visualized as a boxplot.
Ideally the boxplots are centered around the horizontal zero line and are as tightly distributed as possible (Risso, et al. 2014). From the plots that we have made for the raw and normalized count data, we can observe how the normalized dataset has improved upon the raw count data for all the samples. However, in some cases, it is important to visualize RLE plots in combination with other diagnostic plots such as PCA plots, heatmaps, and correlation plots to see if there is more unwanted variation in the data, which can be further accounted for using packages such as RUVSeq
(Risso, et al. 2014) and sva
(Leek, et al. 2012). We will cover details about the RUVSeq
and sva
packages to account for unwanted sources of noise in RNA-seq datasets in later sections.
6.4.2 voom
The analyses using voom
are similar to those performed using limma
the only difference is that count data is heteroscedastic. Basically, in order to remove Heteroscedasticity edgeR
proposed to do a transformation of count data called voom
. Here, we have the core we need to use for thi approach
library(limma)
<- model.matrix( ~ gender, data=colData)
design <- voom(counts, design=design, plot=TRUE) v
Figure 6.14 shows the mean-variance relationship, from which precision weights are given to the count data, so continuous data are derived and analyzed by the usual limma
procedure
<- lmFit(v, design)
fit <- eBayes(fit)
fit topTable(fit, coef=ncol(design))
logFC AveExpr t P.Value
ENSG00000129824 9.1311273 1.95469533 54.225084 1.065179e-64
ENSG00000099749 6.1321934 0.40983758 45.639189 6.485999e-59
ENSG00000198692 5.1906429 -0.03456802 35.287216 2.001501e-50
ENSG00000154620 4.9963042 0.48128403 25.017166 1.672057e-39
ENSG00000157828 4.1312237 -0.38999476 16.850518 5.322729e-28
ENSG00000006757 -0.9164784 5.30180979 -9.994009 1.027507e-15
ENSG00000183878 1.8245692 -1.50072787 10.254164 3.210781e-16
ENSG00000092377 0.7785641 -1.95781202 4.607544 1.524213e-05
ENSG00000177606 -0.5767409 8.24149085 -4.269167 5.368351e-05
ENSG00000143921 0.8071775 -1.86105437 4.513827 2.170952e-05
adj.P.Val B
ENSG00000129824 5.600714e-60 67.155001
ENSG00000099749 1.705169e-54 56.640302
ENSG00000198692 3.507964e-46 48.692481
ENSG00000154620 2.197919e-35 41.492680
ENSG00000157828 5.597382e-24 30.056641
ENSG00000006757 7.718044e-12 20.728249
ENSG00000183878 2.813714e-12 15.860830
ENSG00000092377 8.974261e-02 1.957700
ENSG00000177606 8.974261e-02 1.735214
ENSG00000143921 8.974261e-02 1.623161
We see that the top genes with significant differences of transcription between males and females are consistent with those found using the DESeq2
.
Other plots implemented in limma
can be used to visually inspect the results.
volcanoplot(fit, coef=2, highlight = 5,
names = rownames(fit$coefficients))
6.5 Accounting for additional sources of variation
When doing a differential expression analysis in a case-control setting, the variable of interest, i.e. the variable that explains the separation of the case samples from the control, is usually the treatment, genotypic differences, a certain phenotype, etc. However, in reality, depending on how the experiment and the sequencing were designed, there may be additional factors that might contribute to the variation between the compared samples. Sometimes, such variables are known, for instance, the date of the sequencing for each sample (batch information), or the temperature under which samples were kept. Such variables are not necessarily biological but rather technical, however, they still impact the measurements obtained from an RNA-seq experiment. Such variables can introduce systematic shifts in the obtained measurements. Here, we will demonstrate: firstly how to account for such variables using DESeq2
, when the possible sources of variation are actually known; secondly, how to account for such variables when all we have is just a count table but we observe that the variable of interest only explains a small proportion of the differences between case and control samples.
6.5.1 Accounting for covariates using DESeq2
For demonstration purposes, we will use a subset of the count table obtained for a heart disease study, where there are RNA-seq samples from subjects with normal and failing hearts. We will use a subset of the samples, focusing on 6 case and 6 control samples and we only consider protein-coding genes (for speed concerns). Data are available at a package called compGenomRdata
that can be installed with:
::install_github("compgenomr/compGenomRData") devtools
This package contains text, RDS and other genomic specific data. Let’s import count and colData for the experiment having data on the heart disease study.
<- system.file('extdata/rna-seq/SRP021193.raw_counts.tsv',
counts_file package = 'compGenomRData')
<- system.file('extdata/rna-seq/SRP021193.colData.tsv',
colData_file package = 'compGenomRData')
<- read.table(counts_file)
counts <- read.table(colData_file, header = T, sep = '\t',
colData stringsAsFactors = TRUE)
Let’s take a look at how the samples cluster by calculating the TPM counts as displayed as a heatmap in Figure 6.16.
library(pheatmap)
#find gene length normalized values
<- counts$width
geneLengths <- apply( subset(counts, select = c(-width)), 2,
rpk function(x) x/(geneLengths/1000))
#normalize by the sample size using rpk values
<- apply(rpk, 2, function(x) x / sum(as.numeric(x)) * 10^6)
tpm
<- names(sort(apply(tpm, 1, var),
selectedGenes decreasing = TRUE)[1:100])
pheatmap(tpm[selectedGenes,],
scale = 'row',
annotation_col = colData,
show_rownames = FALSE)
Here we can see from the clusters that the dominating variable is the ‘Library Selection’ variable rather than the ‘diagnosis’ variable, which determines the state of the organ from which the sample was taken. Case and control samples are all mixed in both two major clusters. However, ideally, we’d like to see a separation of the case and control samples regardless of the additional covariates. When testing for differential gene expression between conditions, such confounding variables can be accounted for using DESeq2
. Below is a demonstration of how we instruct DESeq2
to account for the ‘library selection’ variable:
library(DESeq2)
# remove the 'width' column from the counts matrix
<- as.matrix(subset(counts, select = c(-width)))
countData # set up a DESeqDataSet object
<- DESeqDataSetFromMatrix(countData = countData,
dds colData = colData,
design = ~ LibrarySelection + group)
Now, we can run the differential expression analysis as has been demonstrated previously.
# run DESeq
<- DESeq(dds)
dds # extract results that are adjusted for 'LibrarySelection'
<- results(dds, contrast = c('group', 'CASE', 'CTRL')) DEresults
6.5.2 Accounting for estimated covariates using sva
and RUVSeq
In cases when the sources of potential variation are not known, it is worthwhile to use tools such as sva
or RUVSeq
that can estimate potential sources of variation and clean up the counts table from those sources of variation. Later on, the estimated covariates can be integrated into DESeq2
’s design formula.
6.5.2.1 Surrogate variable analysis (SVA
)
Let us start by illustrating how to use sva
package to estimate surrogate variables that account for unwanted variability.
Here, for demonstration purposes, we’ll use a count table from a lung carcinoma study in which a transcription factor (Ets homologous factor - EHF) is overexpressed and compared to the control samples with baseline EHF expression. Again, we only consider protein coding genes and use only five case and five control samples. The original data can be found on the recount2 database with the accession ‘SRP049988.’
<- system.file('extdata/rna-seq/SRP049988.raw_counts.tsv',
counts_file package = 'compGenomRData')
<- system.file('extdata/rna-seq/SRP049988.colData.tsv',
colData_file package = 'compGenomRData')
<- read.table(counts_file)
counts <- read.table(colData_file, header = T,
colData sep = '\t', stringsAsFactors = TRUE)
# simplify condition descriptions
$source_name <- ifelse(colData$group == 'CASE',
colData'EHF_overexpression', 'Empty_Vector')
Let’s start by making heatmaps of the samples using TPM counts (see Figure 6.17).
#find gene length normalized values
<- counts$width
geneLengths <- apply( subset(counts, select = c(-width)), 2,
rpk function(x) x/(geneLengths/1000))
#normalize by the sample size using rpk values
<- apply(rpk, 2, function(x) x / sum(as.numeric(x)) * 10^6)
tpm <- names(sort(apply(tpm, 1, var),
selectedGenes decreasing = T)[1:100])
pheatmap(tpm[selectedGenes,],
scale = 'row',
annotation_col = colData,
cutree_cols = 2,
show_rownames = FALSE)
We can see that the overall clusters look fine, except that one of the case samples (CASE_5) clusters more closely with the control samples than the other case samples. This mis-clustering could be a result of some batch effect, or any other technical preparation steps. However, the colData object doesn’t contain any variables that we can use to pinpoint the exact cause of this. So, let’s use RUVSeq
to estimate potential covariates to see if the clustering results can be improved.
First, we set up the experiment:
library(EDASeq)
# remove 'width' column from counts
<- as.matrix(subset(counts, select = c(-width)))
countData # create a seqExpressionSet object using EDASeq package
<- newSeqExpressionSet(counts = countData,
set phenoData = colData)
Next, let’s make a diagnostic RLE plot on the raw count table.
par(mfrow = c(1,2))
plotRLE(set, outline=FALSE, ylim=c(-4, 4), col=as.numeric(colData$group))
plotPCA(set, col = as.numeric(colData$group), adj = 0.5,
ylim = c(-0.7, 0.5), xlim = c(-0.5, 0.5))
The plot shows that normalization is required. Let us see what happen with TPM normalized counts.
par(mfrow = c(1,2))
plotRLE(tpm, outline=FALSE, ylim=c(-4, 4), col=as.numeric(colData$group))
plotPCA(tpm, col=as.numeric(colData$group), adj = 0.5,
ylim = c(-0.3, 1), xlim = c(-0.5, 0.5))
Both RLE and PCA plots look better on normalized data compared to raw data, but still suggest the necessity of further improvement, because the CASE_5 sample still clusters with the control samples. We haven’t yet accounted for the source of unwanted variation.
This is also reflected when analyzing the data to detect differentially expressed genes:
library(DESeq2)
<- DESeqDataSetFromMatrix(countData = countData,
dds colData = colData,
design = ~ group)
<- dds[rowSums(DESeq2::counts(dds)) > 10]
dds <- DESeq(dds)
dds <- results(dds, contrast = c("group", 'CASE', 'CTRL'))
ans
ans
log2 fold change (MLE): group CASE vs CTRL
Wald test p-value: group CASE vs CTRL
DataFrame with 18508 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
TSPAN6 89382.1826 -0.545795 0.0418376 -13.04555 6.73674e-39 2.25551e-36
DPM1 60432.9217 0.166516 0.1194548 1.39396 1.63328e-01 2.68234e-01
SCYL3 26767.1487 0.097745 0.0750204 1.30291 1.92604e-01 3.05730e-01
C1ORF112 16738.9727 0.720767 0.2040321 3.53261 4.11471e-04 1.75710e-03
FGR 15.2858 -6.043118 3.0119833 -2.00636 4.48180e-02 9.34917e-02
... ... ... ... ... ... ...
ETDC 46.8742 0.2189282 1.851210 0.1182622 0.905860 0.942735
LMLN2 390.4337 -0.5897963 0.871550 -0.6767213 0.498583 0.634202
AC008687.8 154.7583 -0.0487054 0.511696 -0.0951842 0.924168 0.954562
MYOCOS 27.2624 1.2110715 2.668909 0.4537703 0.649994 0.762304
HSFX3 1457.9518 0.0714840 0.272921 0.2619223 0.793381 0.869712
Now, if we look at the QQplot we observe that there is huge inflation.
::qqPlot(ans$pvalue) GWASTools
This figure indicates that there exists unwanted variability that must be removed from the association analysis. sva
estimates some variables that can capture hidden or unobserved variables that define hidden groups of samples. The idea is similar to PCA analysis in GWAS studies to capture population stratification. The main difference here is that sva
iterativelty estimates the required number of surrogate variables that can then be used in the adjusted models. Let us illustrate how to do it.
The method removes all the existing variability but the one driven by our grouping/condition variable by defining two models
library(sva)
<- model.matrix( ~ group, data=colData)
mod1 <- model.matrix( ~ 1, data=colData) mod0
NOTE: If there are variables that you want to adjust for (e.g. age, sex, batch) the to models would be (not run)
<- model.matrix( ~ group + age + sex + batch, data=colData)
mod1 <- model.matrix( ~ age + sex + batch, data=colData) mod0
Then, the surrogate variables are estimated with
<- countData[ rowSums(countData) > 10, ]
countData <- svaseq(countData, mod1, mod0) sv
Number of significant surrogate variables is: 2
Iteration (out of 5 ):1 2 3 4 5
Note that we remove genes with low counts to avoid numerical problems sva
procedure.
The number of estimated surrogate variables are
$n.sv sv
[1] 2
And the surrogate variables are saved in sv$sv
head(sv$sv)
[,1] [,2]
[1,] 0.1518950 -0.60401773
[2,] 0.3107944 -0.01966796
[3,] 0.3185363 0.04966395
[4,] 0.3496187 -0.18502593
[5,] -0.5496443 0.28947418
[6,] -0.3927037 -0.26278458
Now we can re-run the DE analysis adjusting for those covariates as following:
# add surrogate variables to the colData
<- DataFrame(colData, sv$sv)
temp colData(dds) <- temp
head(colData(dds))
DataFrame with 6 rows and 4 columns
source_name group V1 V2
<character> <factor> <numeric> <numeric>
CASE_1 EHF_overexpression CASE 0.151895 -0.6040177
CASE_2 EHF_overexpression CASE 0.310794 -0.0196680
CASE_3 EHF_overexpression CASE 0.318536 0.0496639
CASE_4 EHF_overexpression CASE 0.349619 -0.1850259
CASE_5 EHF_overexpression CASE -0.549644 0.2894742
CTRL_1 Empty_Vector CTRL -0.392704 -0.2627846
# update the design
design(dds) <- as.formula( ~ group + V1 + V2)
# re-run the analysis
<- dds[rowSums(DESeq2::counts(dds)) > 10]
dds <- DESeq(dds)
dds <- results(dds, contrast = c("group", 'CASE', 'CTRL')) ans.sv
Now, let us verify whether the QQ-plot is showing inflation with regard to the p-values
::qqPlot(ans.sv$pvalue) GWASTools
In that case, we still see a huge inflation. This is explain by the fact that we are analyzing gene expression between cases diagnosed with cancer and controls where the transcriptome is highly altered. In practical terms, one can state in the manuscript that DE analysis was adjusted for surrogate variables to guarantee that the differences we observed are not due to technical artifacts.
6.5.2.2 Remove unwanted variability (RUV
)
Let’s see how to utilize the RUVseq
package to first diagnose the problem or detecting that unwanted variability is present and then how solve it. RUVSeq
has three main functions for removing unwanted variation: RUVg()
, RUVs()
, and RUVr()
. Here, we will demonstrate how to use RUVg
and RUVs
. RUVr will be left as an exercise.
Using RUVg
One way of removing unwanted variation depends on using a set of reference genes that are not expected to change by the sources of technical variation. One strategy along this line is to use spike-in genes, which are artificially introduced into the sequencing run (Jiang, et al. 2011). However, there are many sequencing datasets that don’t have this spike-in data available. In such cases, an empirical set of genes can be collected from the expression data by doing a differential expression analysis and discovering genes that are unchanged in the given conditions. These unchanged genes are used to clean up the data from systematic shifts in expression due to the unwanted sources of variation. Another strategy could be to use a set of house-keeping genes as negative controls, and use them as a reference to correct the systematic biases in the data. Let’s use a list of ~500 house-keeping genes compiled here: https://www.tau.ac.il/~elieis/HKG/HK_genes.txt.
library(RUVSeq)
#source for house-keeping genes collection:
#https://m.tau.ac.il/~elieis/HKG/HK_genes.txt
<- read.table(url("https://m.tau.ac.il/~elieis/HKG/HK_genes.txt"))
HK_genes
# let's take an intersection of the house-keeping genes with the genes available
# in the count table
<- intersect(rownames(set), HK_genes$V1) house_keeping_genes
We will now run RUVg()
with the different number of factors of unwanted variation. We will plot the PCA after removing the unwanted variation. We should be able to see which \(k\) values (e.g. number of factors) produce better separation between sample groups.
# now, we use these genes as the empirical set of genes as input to RUVg.
# we try different values of k and see how the PCA plots look
par(mfrow = c(2, 2))
for(k in 1:4) {
<- RUVg(x = set, cIdx = house_keeping_genes, k = k)
set_g plotPCA(set_g, col=as.numeric(colData$group), cex = 0.9, adj = 0.5,
main = paste0('with RUVg, k = ',k),
ylim = c(-1, 1), xlim = c(-1, 1), )
}
Based on the separation of case and control samples in the PCA plots in Figure 8.16, we choose \(k = 1\) and re-run the RUVg()
function with the house-keeping genes to do more diagnostic plots.
<- RUVg(x = set, cIdx = house_keeping_genes, k = 1) set_g
Now let’s do diagnostics: compare the count matrices with or without RUVg processing, comparing RLE plots (Figure @ref{fig:rleRUV}) and PCA plots (Figure @ref{fig:pcaRUV}) to see the effect of RUVg on the normalization and separation of case and control samples.
par(mfrow = c(1,2))
plotRLE(set, outline=FALSE, ylim=c(-4, 4),
col=as.numeric(colData$group), main = 'without RUVg')
plotRLE(set_g, outline=FALSE, ylim=c(-4, 4),
col=as.numeric(colData$group), main = 'with RUVg')
par(mfrow = c(1,2))
plotPCA(set, col=as.numeric(colData$group), adj = 0.5,
main = 'without RUVg',
ylim = c(-1, 0.5), xlim = c(-0.5, 0.5))
plotPCA(set_g, col=as.numeric(colData$group), adj = 0.5,
main = 'with RUVg',
ylim = c(-1, 0.5), xlim = c(-0.5, 0.5))
We can observe that using RUVg()
with house-keeping genes as reference has improved the clusters, however not yielded ideal separation. Probably the effect that is causing the ‘CASE_5’ to cluster with the control samples still hasn’t been completely eliminated.
Using RUVs
There is another strategy of RUVSeq
that works better in the presence of replicates in the absence of a confounded experimental design, which is the RUVs()
function. Let’s see how that performs with this data. This time we don’t use the house-keeping genes. We rather use all genes as input to RUVs ()
. This function estimates the correction factor by assuming that replicates should have constant biological variation, rather, the variation in the replicates are the unwanted variation.
# make a table of sample groups from colData
<- makeGroups(colData$group)
differences
## looking for two different sources of unwanted variation (k = 2)
## use information from all genes in the expression object
par(mfrow = c(2, 2))
for(k in 1:4) {
<- RUVs(set, unique(rownames(set)),
set_s k=k, differences) #all genes
plotPCA(set_s, col=as.numeric(colData$group),
cex = 0.9, adj = 0.5,
main = paste0('with RUVs, k = ',k),
ylim = c(-1, 1), xlim = c(-0.6, 0.6))
}
Based on the separation of case and control samples in the PCA plots in Figure 6.25, we can see that the samples are better separated even at k = 2
when using RUVs()
. Here, we re-run the RUVs() function using k = 2
, in order to do more diagnostic plots. We try to pick a value of k that is good enough to distinguish the samples by condition of interest. While setting the value of k to higher values could improve the percentage of explained variation by the first principle component to up to 61%, we try to avoid setting the value unnecessarily high to avoid removing factors that might also correlate with important biological differences between conditions.
# we set k=2
<- RUVs(set, unique(rownames(set)), k=2, differences) set_s
Now let’s do diagnostics again: compare the count matrices with or without RUVs processing, comparing RLE plots (Figure 6.26) and PCA plots (Figure 6.27) to see the effect of RUVg on the normalization and separation of case and control samples.
par(mfrow = c(1,2))
plotRLE(set, outline=FALSE, ylim=c(-4, 4),
col=as.numeric(colData$group),
main = 'without RUVs')
plotRLE(set_s, outline=FALSE, ylim=c(-4, 4),
col=as.numeric(colData$group),
main = 'with RUVs')
par(mfrow = c(1,2))
plotPCA(set, col=as.numeric(colData$group),
main = 'without RUVs', adj = 0.5,
ylim = c(-0.75, 0.75), xlim = c(-0.75, 0.75))
plotPCA(set_s, col=as.numeric(colData$group),
main = 'with RUVs', adj = 0.5,
ylim = c(-0.75, 0.75), xlim = c(-0.75, 0.75))
Let’s compare PCA results from RUVs and RUVg with the initial raw counts matrix. We will simply run the plotPCA()
function on different normalization schemes. The resulting plots are in Figure 6.28:
par(mfrow = c(1,3))
plotPCA(countData, col=as.numeric(colData$group),
main = 'without RUV - raw counts', adj = 0.5,
ylim = c(-0.75, 0.75), xlim = c(-0.75, 0.75))
plotPCA(set_g, col=as.numeric(colData$group),
main = 'with RUVg', adj = 0.5,
ylim = c(-0.75, 0.75), xlim = c(-0.75, 0.75))
plotPCA(set_s, col=as.numeric(colData$group),
main = 'with RUVs', adj = 0.5,
ylim = c(-0.75, 0.75), xlim = c(-0.75, 0.75))
It looks like RUVs ()
has performed better than RUVg()
in this case. So, let’s use count data that is processed by RUVs()
to re-do the initial heatmap.
library(EDASeq)
library(pheatmap)
# extract normalized counts that are cleared from unwanted variation using RUVs
<- normCounts(set_s)
normCountData <- names(sort(apply(normCountData, 1, var),
selectedGenes decreasing = TRUE))[1:500]
pheatmap(normCountData[selectedGenes,],
annotation_col = colData,
show_rownames = FALSE,
cutree_cols = 2,
scale = 'row')
Having computed the sources of variation using RUVs()
, we can actually integrate these variables with DESeq2 to re-do the differential expression analysis. Here the idea is that the comparisons among groups will be ajusted for unwanted variability.
library(DESeq2)
#set up DESeqDataSet object
<- DESeqDataSetFromMatrix(countData = countData,
dds colData = colData,
design = ~ group)
# filter for low count genes
<- dds[rowSums(DESeq2::counts(dds)) > 10]
dds
# insert the covariates W1 and W2 (the code shows how to add 1-9)
# computed using RUVs into DESeqDataSet object
colData(dds) <- cbind(colData(dds),
pData(set_s)[rownames(colData(dds)),
grep('W_[0-9]',
colnames(pData(set_s)))])
# update the design formula for the DESeq analysis (save the variable of
# interest to the last!)
design(dds) <- ~ W_1 + W_2 + group
# repeat the analysis
<- DESeq(dds)
dds
# extract deseq results
<- results(dds, contrast = c('group', 'CASE', 'CTRL'))
res <- res[order(res$padj),]
res head(res)
log2 fold change (MLE): group CASE vs CTRL
Wald test p-value: group CASE vs CTRL
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
USP9Y 29712.5 5.16933 0.1189552 43.4561 0 0
RPS4Y1 140830.2 5.25714 0.0724517 72.5606 0 0
NTS 251982.3 -5.05912 0.0947389 -53.4007 0 0
RSPO3 94636.1 -1.76698 0.0470231 -37.5769 0 0
KRT19 367911.7 2.08835 0.0439788 47.4855 0 0
CES1 127785.8 4.65580 0.0858893 54.2070 0 0