9 Exercises
Here we propose several exercises whose answers will have to be delivered through the Moodle site. The data are in a folder called data_for_exercises
. The answers will also be available in the Moodle site.
9.1 Bioconductor
EXERCISE 1: Read PLINK data (‘obesity.bed,’ ‘obesity.fam,’ ‘obesity.bim’) that are available here
- How many samples are genotyped?
- How many SNPs are in the chromosome 10?
- Get the SNP names available in the region chr2:134500-2300000.
- In which chromosome is located the SNP rs1527354?
- Get the number of individuals of each genoytpe corresponding to the SNP rs7859525. Which the homozygous normal genotype?
EXERCISE 2: Install the Bioconductor package called
airway
and load the dataairway
which is aSummarizedExperiment
by executing:library(airway) data(airway)
- Print the object
airway
and interpret the output.- Get the table of counts (use
asssays()
function) and the variables of the experiment (usecolData()
function).- Compare the number of counts of the gene ENSG00000000419 between treated and untreated cells (variable
dex
) usingt.test
function.
EXERCISE 3: Recount2 provides data for different RNA-seq experiments. These includes data from GTEx or TCGA projects. We have downloaded a subset of data corresponding to breast cancer and created a variable called
er
which encodes the estrogen receptor status (Negative and Positive). TheSummarizedExperiment
object is calledbreast
and is available in the file ‘breast_tcga.Rdata.’ Load the data into R and answer the next questions
- How many samples are in the
SummarizedExperiment
object?- And how many genes?
- Which is the number of samples with positive estrogen receptor status (variable
er
)?- Subset the individuals having Negative strogen receptor status and draw a boxplot of the first gene.
- Create a
SummarizedExperiment
object of the genomic region chr6:151.2-151.8Mb. How many genes are in that region? How many of them are annotated? That is, how many of them have a gene symbol name? (HINT: userowRanges()
function and remember thatmcols()
function is used to get acccess to columns in aGRanges
object)
EXERCISE 4: The file
snpsGWAS.Rdata
contains information about 28 SNPs that were called as a significantly associated with a given disease after performing a GWAS study.
- Load this data into R.
- Create a data.frame with the name of the SNP, its chromosome, its position and the allele name.
- Add another column to this data.frame with the minor allele frequency (MAF) estimated from 1000 genomes (HINT: use
listAttributes
to retrieve the name of the attribute we are interested in).- Annotate the variants and get the gene symbol
- How many SNPs are located in promoter regions?
9.2 Genetic association studies
EXERCISE 5: Researchers are interested in assessing possible association between candidate SNPs and the response to treatment in patients diagnosed with major depression (file ‘DM.txt’ - NOTE: check how alleles are separated). The file also includes clinical information about other covariates of interest such as:
- HDRS: Hamilton Depression Rating Scale (continuous variable)
- PSICOT: Was the patient psychotic? (No and Yes)
- MELANCOL: Was the patient melancholic? (No and Yes)
- EPD_PREV: Number of previous episodes of depression (continuous variable)
- RESP: Response to treatment (outcome)
Answer the following questions:
- Is it necessary to check HWE hypothesis for this example?
- Is there any SNP associated with the response to the treatment? If so, write a sentence interpreting the results of that association (only for one SNP)
- Does the result change after adjusting for other clinical covariates?
- Create a plot with the p-values only for dominant, recessive and additive models
- Compute the p-values using MAX-statistic for all SNPs
EXERCISE 6: Researchers are now interested in assessing the effect of haplotypes on the response to treatment. Answer the following questions:
- Which is the combination of SNPs (e.g. number of consecutive haplotypes) that is more associated with the response to treatment? (NOTE1: Try only haplotypes of length 2 up to 4)
- Which is the most frequent haplotype?
- Compute the OR of association and its 95% confidence intervals between the estimate haplotypes and the response to treatment
EXERCISE 7: Researchers are now interested in creating a genetic score to predict the response to the treatment
- Perform all the required steps to get such score and evaluate its performance using a ROC curve
9.3 GWAS
EXERCISE 8: Researchers are interested in detecting new SNPs associated with BMI (body mass index). To do so, they performed a GWAS using DNA information about 425 individuals. Genotype information is available in plink format (files ‘coronary.bed,’ ‘coronary.bim,’ ‘coronary.fam’) while phenotypic information can be found in the file ‘coronary.txt.’
- Read genotypes using
snpStats
library- Verify that both data sets are in the same order (NOTE: ‘id’ variable in the file ‘coronary.txt’ must be used since it corresponds to the unique patient number)
- Remove those SNPs that do not pass QC
- Perform QC filtering of individuals
- Assess association between BMI’ and the SNPs (NOTE: remember that you are analyzing a quantitative trait)
- Calculate ‘lambda’ and create a Q-Q plot to assess population stratification
- Assess association between BMI’ and the SNPs adjusting for population stratification (variables ‘ev3’ and ‘ev4’ in the file ‘coronary.txt’) (NOTE: remember that you are analyzing a quantitative trait). Are there differences with crude analysis?
- Create a Manhattan plot
9.4 Microarray data analysis
EXERCISE 9: To be supplied
9.5 RNAseq data analysis
EXERICSE 10: Download the data SRP029880 from recount2: http://duffel.rail.bio/recount/v2/SRP029880/rse_gene.Rdata. This file contains a
RangedSummarizedExperiment
object. Do the following tasks:
- Get the
counts
andcolData
tables. (NOTE:counts
can be retrieved byassay(rse_gene, 1)
orassays(rse_gene)$counts
)- Normalize the counts using CPM, RPKM, TPM and TMM methods and create MA-plots for two individuals.
- Using TPM normalized data:
- Plot a heatmap of the top 500 most variable genes. Compare with the heatmap obtained using the 100 most variable genes.
- Re-do the heatmaps setting the scale argument to
none
, andcolumn
. Compare the results withscale = 'row'
.
- Create some RLE and PCA plots for each method as well as for the raw data (do not forget to log-transform the data).
- Create a new variable called group by renaming the variable title as:
normal
,primary
andmetastasis
.- Perform a differential expression analysis using the variable
group
and get the list of differentially expressed genes comparingmetastasis vs normal
andprimary vs normal
.- Create two MA and volcano plots to describe the previous comparisons.
- Examine the distribution of counts for a single gene (most associated with
metastasis
). Hint: useDESeq2::plotCounts ()
- Do the same plot for the top-10 genes (this tries to mimic your supplementary material in your paper!)
- Investigate whether the analysis should be adjusted for other covariates.
- Repeat differential expression analysis accounting for unwanted variability using
sva
package.
9.6 Enrichment Analysis
EXERCISE 11: Library
tweeDEseqCountData
contains data corresponding to an RNA-seq experiment described in Pickrell et al. (2010). Data correspond to lymphoblastoid cell lines about 69 non-related Nigerian individuals. This information as well as phenotypic data is available as an object of classeSet
calledpickrell.eset
that can be loaded after typing:data(pickrell)
Use limma-voom method to detect those genes that are differentially expressed (DE) between males and females (variale
gender
).Select those genes that are DE at 5% FDR with a minimum fold-change of 1.5
Create a gene set of sex-expecific genes by
geneUniverse <- featureNames(pickrell.eset) geneSex <- unique(intersect(geneUniverse, c(msYgenes, XiEgenes)))
and test whether the list of DE genes is enriched in that gene set.
EXERCISE 12: Use the data in exercise 10 to obtain a list of genes that are differentially expressed between
metastasis vs normal
samples (you can run the R script that is available in the Moodle - ‘Answer exercises’)
Perform an enrichment analysis of GO, KEGG and DisGeNET databases using
clusterProfiler
.Investigate whether the DE genes are enriched in protein-coding genes.