16 Genome-wide association analyses (GWAS)

Genomic data can be stored in different formats. VCF and PLINK files are commonly used in genetic epidemiology studies. We have a GWAS example available at BRGE data repository that aims to find SNPs associated with asthma. We have data stored in VCF (brge.vcf) with several covariates and phenotypes available in the file brge.txt (gender, age, obesity, smoking, country and asthma status). The same data is also available in PLINK format (brge.bed, brge.bim, brge.fam) with covariates in the file brge.phe.

Here we illustrate how to perform GWAS using R and Bioconductor packages or PLINK shell command line.

16.1 GWAS with Bioconductor

We have created a resource having the VCF file of our study on asthma as previously described. The name of the resource is brge_vcf the phenotypes are available in another resource called brge that is a .txt file. The GWAS analysis is then perform as follows

We first start by preparing login data

builder <- newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org",
               user = "dsuser", password = "password",
               resource = "RSRC.brge_vcf", driver = "OpalDriver")
logindata <- builder$build()

conns <- datashield.login(logins = logindata, assign = TRUE,
                          symbol = "res")

In this case we have to assign to different resources. One for the VCF (obesity_vcf) and another one for the phenotypic data (obesity). To this end, the datashield.assign.resource function is required before assigning any object to the specific resource. Notice that the VCF resource can be load into R as a GDS thanks to our extension of existing resources in the reourcer

datashield.assign.resource(conns, symbol = "vcf.res", 
                           resource = list(study1 = "RSRC.brge_vcf"))
datashield.assign.expr(conns, symbol = "gds", 
                       expr = quote(as.resource.object(vcf.res)))


datashield.assign.resource(conns, symbol = "covars.res", 
                           resource = list(study1 = "RSRC.brge"))
datashield.assign.expr(conns, symbol = "covars", 
                       expr = quote(as.resource.data.frame(covars.res)))

These are the objects available in the Opal server

ds.ls()
$study1
$study1$environment.searched
[1] "R_GlobalEnv"

$study1$objects.found
[1] "covars"     "covars.res" "gds"        "res"        "vcf.res"   

We can use dsBaseClient functions to inspect the variables that are in the covars data.frame. The variables are

ds.colnames("covars")
$study1
[1] "scanID"  "gender"  "obese"   "age"     "smoke"   "country" "asthma" 

The asthma variable has this number of individuals at each level (0: controls, 1: cases)

ds.table("covars$asthma")

 Data in all studies were valid 

Study 1 :  No errors reported from this study
$output.list
$output.list$TABLE_rvar.by.study_row.props
             study
covars$asthma 1
            0 1
            1 1

$output.list$TABLE_rvar.by.study_col.props
             study
covars$asthma         1
            0 0.6864187
            1 0.3135813

$output.list$TABLE_rvar.by.study_counts
             study
covars$asthma    1
            0 1587
            1  725

$output.list$TABLES.COMBINED_all.sources_proportions
covars$asthma
    0     1 
0.686 0.314 

$output.list$TABLES.COMBINED_all.sources_counts
covars$asthma
   0    1 
1587  725 


$validity.message
[1] "Data in all studies were valid"

There may be interest in only studying certain genes, for that matter, the loaded VCF resource can be subsetting as follows

genes <- c("A1BG","A2MP1")
ds.getSNPSbyGen("gds", genes)

The previous code will over-write the VCF with the SNPs corresponding to the selected genes, if the intention is to perform studies with both the complete VCF and a subsetted one, the argument name can be used to create a new object on the server with the subsetted VCF, preserving the complete one.

genes <- c("A1BG","A2MP1")
ds.getSNPSbyGen("gds", genes = genes, name = "subset.vcf")

Then, an object of class GenotypeData must be created at the server side to perform genetic data analyses. This is a container defined in the GWASTools package for storing genotype and phenotypic data from genetic association studies. By doing that we will also verify whether individuals in the GDS (e.g VCF) and covariates files have the same individuals and are in the same order. This can be performed by

ds.GenotypeData(x='gds', covars = 'covars', columnId = 1, newobj.name = 'gds.Data')

Before performing the association analyses, quality control (QC) can be performed to the loaded data. Three methodologies are available; 1) Principal Component Analysis (PCA) of the genomic data, 2) Hardy-Weinberg Equilibrium (HWE) testing and 3) Allelic frequency estimates. The QC methods 2 and 3 have as inputs a GenotypeData object, created with a covariates file that has a gender column; while method 1 has as input a VCF.

To perform the PCA, a pruning functionality is built inside so that redundant SNPs are discarted (there is an extra argument ld.threshold which controls the pruning, more information about it at the SNPRelate documentation), speeding up the execution time

ds.PCASNPS("gds", prune = TRUE)

To perform QC methodologies 2 and 3, the name of the gender column as well as the keys to describe male or female have to be provided. Remember that we can visualize the names of the variables from our data by executing ds.colnames("covars"). In our case, this variable is called “gender”, and the levels of this variable are 1 for male and 2 for female as we can see here (NOTE: we cannot use ds.levels since gender variable is not a factor):

ds.table1D("covars$gender")$counts
      covars$gender
1              1215
2              1097
Total          2312

The HWE test can be performed to selected chromosomes using the argument chromosomes, only the autosomes can be selected when performing a HWE test, the encoding of the autosomes can be fetched with

ds.getChromosomeNames("gds.Data")$autosomes
NULL

Therefore, HWE can be performed by:

ds.exactHWE("gds.Data", sexcol = "gender", male = "1", female = "2", chromosome = "22")
$study1
# A tibble: 1,581 x 9
   snpID chr   nAA   nAB   nBB   MAF                minor.allele f                   pval                
   <int> <chr> <chr> <chr> <chr> <chr>              <chr>        <chr>               <chr>               
 1 95140 22    531   1082  640   0.475810031069685  A            0.0372494534103404  0.0762378420092437  
 2 95141 22    580   1092  590   0.497789566755084  A            0.0344638881244839  0.101115839682148   
 3 95142 22    0     333   1951  0.0728984238178634 A            -0.0786304604486423 3.29321452000707e-06
 4 95143 22    0     225   2080  0.0488069414316703 A            -0.0513112884834663 0.00571353422173819 
 5 95144 22    176   831   1249  0.262189716312057  A            0.0479240933754878  0.025554465064758   
 6 95145 22    249   1003  1055  0.32531426094495   A            0.00958157673233362 0.635607524466151   
 7 95146 22    55    516   1738  0.135556517973149  A            0.0464603328061876  0.0325126363699498  
 8 95147 22    203   929   1164  0.290722996515679  A            0.0188880417746163  0.362631409282596   
 9 95148 22    291   1035  985   0.349848550411077  A            0.0154998317584483  0.464409674690009   
10 95149 22    11    276   2013  0.0647826086956522 A            0.00966929694008412 0.604847402951204   
# ... with 1,571 more rows

attr(,"class")
[1] "dsexactHWE" "list"      

Similarly, allele frequencies estimates can be estimated by:

ds.alleleFrequency("gds.Data", sexcol = "gender", male = "1", female = "2")
$study1
# A tibble: 99,289 x 7
          M       F      all   n.M   n.F     n      MAF
      <dbl>   <dbl>    <dbl> <dbl> <dbl> <dbl>    <dbl>
 1 0.015    0.0147  0.0149    1200  1086  2286 0.0149  
 2 0.000837 0.00276 0.00175   1195  1087  2282 0.00175 
 3 0.00330  0.00548 0.00433   1212  1095  2307 0.00433 
 4 0.0124   0.00868 0.0106    1213  1095  2308 0.0106  
 5 0.00165  0       0.000866  1214  1095  2309 0.000866
 6 0.0712   0.0754  0.0732    1208  1094  2302 0.0732  
 7 0.00660  0.00913 0.00780   1213  1095  2308 0.00780 
 8 0.00248  0.00274 0.00261   1209  1093  2302 0.00261 
 9 0.182    0.181   0.181     1188  1071  2259 0.181   
10 0.00165  0       0.000868  1210  1094  2304 0.000868
# ... with 99,279 more rows

attr(,"class")
[1] "dsalleleFrequency" "list"             

In the future, more functions will be created to perform quality control (QC) for both, SNPs and inviduals.

Association analysis for a given SNP is performed by simply

ds.glmSNP(snps.fit = "rs11247693", model = asthma ~ gender + age, genoData='gds.Data')
             Estimate Std. Error   p-value
rs11247693 -0.1543215  0.2309585 0.5040196
attr(,"class")
[1] "dsGlmSNP" "matrix"   "array"   

The analysis of all available SNPs is performed when the argument snps.fit is missing. The function performs the analysis of the selected SNPs in a single repository or in multiple repositories as performing pooled analyses (it uses ds.glm DataSHIELD function). As in the case of transcriptomic data, analyzing all the SNPs in the genome (e.g GWAS) will be high time-consuming. We can adopt a similar approach as the one adopted using the limma at each server. That is, we run GWAS at each repository using specific and scalable packages available in R/Bioc. In that case we use the GWASTools and GENESIS packages. The complete pipeline is implemented in this function

ans.bioC <- ds.GWAS('gds.Data', model=asthma~age+country)
ans.bioC
$study1
# A tibble: 99,288 x 14
   variant.id rs         chr         pos n.obs    freq   MAC Score Score.SE Score.Stat  Score.pval    Est Est.SE     PVE
        <int> <chr>      <chr>     <int> <dbl>   <dbl> <dbl> <dbl>    <dbl>      <dbl>       <dbl>  <dbl>  <dbl>   <dbl>
 1      97800 rs12008773 X      64955287  2227 0.0126     56  19.5     3.84       5.07 0.000000408  1.32  0.260  0.0111 
 2      91449 rs2267914  20      1858988  2311 0.104     481  47.5     9.62       4.93 0.000000809  0.513 0.104  0.0105 
 3      19742 rs7153     3     122865987  2308 0.256    1181  62.3    13.7        4.56 0.00000508   0.334 0.0732 0.00900
 4      93268 rs6097326  20     51324797  2311 0.125     579  46.4    10.3        4.51 0.00000642   0.439 0.0973 0.00881
 5      19744 rs3732410  3     122898410  2308 0.254    1173  60.3    13.6        4.43 0.00000940   0.325 0.0734 0.00849
 6      66984 rs11055608 12     13804693  2306 0.446    2057 -67.6    15.7       -4.31 0.0000161   -0.275 0.0638 0.00805
 7      74760 rs7995146  13    111804035  2304 0.0527    243  29.7     6.96       4.27 0.0000195    0.614 0.144  0.00789
 8      59678 rs7098143  10     83329037  2295 0.210     963 -54.5    12.8       -4.25 0.0000214   -0.331 0.0780 0.00781
 9      13835 rs13002717 2     186197782  2311 0.00649    30  10.6     2.52       4.22 0.0000244    1.67  0.397  0.00770
10      27337 rs1602679  4     167576714  2299 0.101     463  39.7     9.46       4.20 0.0000264    0.444 0.106  0.00764
# ... with 99,278 more rows

attr(,"class")
[1] "dsGWAS" "list"  

This close the DataSHIELD session

datashield.logout(conns)