9 Exercises

Here we propose several exercises whose answers will have to be delivered through the Moodle site. The data are in a folder called data_for_exercises. The answers will also be available in the Moodle site.

9.1 Bioconductor

EXERCISE 1: Read PLINK data (‘obesity.bed,’ ‘obesity.fam,’ ‘obesity.bim’) that are available here

How many samples are genotyped?

How many SNPs are in the chromosome 10?

Get the SNP names available in the region chr2:134500-2300000.

In which chromosome is located the SNP rs1527354?

Get the number of individuals of each genoytpe corresponding to the SNP rs7859525. Which the homozygous normal genotype?

EXERCISE 2: Install the Bioconductor package called airway and load the data airway which is a SummarizedExperiment by executing:
library(airway)
data(airway)
Print the object airway and interpret the output.

Get the table of counts (use asssays() function) and the variables of the experiment (use colData() function).

Compare the number of counts of the gene ENSG00000000419 between treated and untreated cells (variable dex) using t.test function.

EXERCISE 3: Recount2 provides data for different RNA-seq experiments. These includes data from GTEx or TCGA projects. We have downloaded a subset of data corresponding to breast cancer and created a variable called er which encodes the estrogen receptor status (Negative and Positive). The SummarizedExperiment object is called breast and is available in the file ‘breast_tcga.Rdata.’ Load the data into R and answer the next questions

How many samples are in the SummarizedExperiment object?

And how many genes?

Which is the number of samples with positive estrogen receptor status (variable er)?

Subset the individuals having Negative strogen receptor status and draw a boxplot of the first gene.

Create a SummarizedExperiment object of the genomic region chr6:151.2-151.8Mb. How many genes are in that region? How many of them are annotated? That is, how many of them have a gene symbol name? (HINT: use rowRanges() function and remember that mcols() function is used to get acccess to columns in a GRanges object)

EXERCISE 4: The file snpsGWAS.Rdata contains information about 28 SNPs that were called as a significantly associated with a given disease after performing a GWAS study.

Load this data into R.

Create a data.frame with the name of the SNP, its chromosome, its position and the allele name.

Add another column to this data.frame with the minor allele frequency (MAF) estimated from 1000 genomes (HINT: use listAttributes to retrieve the name of the attribute we are interested in).

Annotate the variants and get the gene symbol

How many SNPs are located in promoter regions?

9.2 Genetic association studies

EXERCISE 5: Researchers are interested in assessing possible association between candidate SNPs and the response to treatment in patients diagnosed with major depression (file ‘DM.txt’ - NOTE: check how alleles are separated). The file also includes clinical information about other covariates of interest such as:

HDRS: Hamilton Depression Rating Scale (continuous variable)

PSICOT: Was the patient psychotic? (No and Yes)

MELANCOL: Was the patient melancholic? (No and Yes)

EPD_PREV: Number of previous episodes of depression (continuous variable)

RESP: Response to treatment (outcome)

Answer the following questions:

Is it necessary to check HWE hypothesis for this example?

Is there any SNP associated with the response to the treatment? If so, write a sentence interpreting the results of that association (only for one SNP)

Does the result change after adjusting for other clinical covariates?

Create a plot with the p-values only for dominant, recessive and additive models

Compute the p-values using MAX-statistic for all SNPs

EXERCISE 6: Researchers are now interested in assessing the effect of haplotypes on the response to treatment. Answer the following questions:

Which is the combination of SNPs (e.g. number of consecutive haplotypes) that is more associated with the response to treatment? (NOTE1: Try only haplotypes of length 2 up to 4)

Which is the most frequent haplotype?

Compute the OR of association and its 95% confidence intervals between the estimate haplotypes and the response to treatment

EXERCISE 7: Researchers are now interested in creating a genetic score to predict the response to the treatment

Perform all the required steps to get such score and evaluate its performance using a ROC curve

9.3 GWAS

EXERCISE 8: Researchers are interested in detecting new SNPs associated with BMI (body mass index). To do so, they performed a GWAS using DNA information about 425 individuals. Genotype information is available in plink format (files ‘coronary.bed,’ ‘coronary.bim,’ ‘coronary.fam’) while phenotypic information can be found in the file ‘coronary.txt.’

Read genotypes using snpStats library

Verify that both data sets are in the same order (NOTE: ‘id’ variable in the file ‘coronary.txt’ must be used since it corresponds to the unique patient number)

Remove those SNPs that do not pass QC

Perform QC filtering of individuals

Assess association between BMI’ and the SNPs (NOTE: remember that you are analyzing a quantitative trait)

Calculate ‘lambda’ and create a Q-Q plot to assess population stratification

Assess association between BMI’ and the SNPs adjusting for population stratification (variables ‘ev3’ and ‘ev4’ in the file ‘coronary.txt’) (NOTE: remember that you are analyzing a quantitative trait). Are there differences with crude analysis?

Create a Manhattan plot

9.4 Microarray data analysis

EXERCISE 9: To be supplied

9.5 RNAseq data analysis

EXERICSE 10: Download the data SRP029880 from recount2: http://duffel.rail.bio/recount/v2/SRP029880/rse_gene.Rdata. This file contains a RangedSummarizedExperiment object. Do the following tasks:

Get the counts and colData tables. (NOTE: counts can be retrieved by assay(rse_gene, 1) or assays(rse_gene)$counts)

Normalize the counts using CPM, RPKM, TPM and TMM methods and create MA-plots for two individuals.

Using TPM normalized data:

Plot a heatmap of the top 500 most variable genes. Compare with the heatmap obtained using the 100 most variable genes.

Re-do the heatmaps setting the scale argument to none, and column. Compare the results with scale = 'row'.

Create some RLE and PCA plots for each method as well as for the raw data (do not forget to log-transform the data).

Create a new variable called group by renaming the variable title as: normal, primary and metastasis.

Perform a differential expression analysis using the variable group and get the list of differentially expressed genes comparing metastasis vs normal and primary vs normal.

Create two MA and volcano plots to describe the previous comparisons.

Examine the distribution of counts for a single gene (most associated with metastasis). Hint: use DESeq2::plotCounts ()

Do the same plot for the top-10 genes (this tries to mimic your supplementary material in your paper!)

Investigate whether the analysis should be adjusted for other covariates.

Repeat differential expression analysis accounting for unwanted variability using sva package.

9.6 Enrichment Analysis

EXERCISE 11: Library tweeDEseqCountData contains data corresponding to an RNA-seq experiment described in Pickrell et al. (2010). Data correspond to lymphoblastoid cell lines about 69 non-related Nigerian individuals. This information as well as phenotypic data is available as an object of class eSet called pickrell.eset that can be loaded after typing:
data(pickrell)
Use limma-voom method to detect those genes that are differentially expressed (DE) between males and females (variale gender).

Select those genes that are DE at 5% FDR with a minimum fold-change of 1.5

Create a gene set of sex-expecific genes by
geneUniverse <- featureNames(pickrell.eset)
geneSex <- unique(intersect(geneUniverse,
c(msYgenes, XiEgenes)))
and test whether the list of DE genes is enriched in that gene set.

EXERCISE 12: Use the data in exercise 10 to obtain a list of genes that are differentially expressed between metastasis vs normal samples (you can run the R script that is available in the Moodle - ‘Answer exercises’)

Perform an enrichment analysis of GO, KEGG and DisGeNET databases using clusterProfiler.

Investigate whether the DE genes are enriched in protein-coding genes.