16 Genome-wide association analyses (GWAS)

Genomic data can be stored in different formats. VCF and PLINK files are commonly used in genetic epidemiology studies. We have a GWAS example available at BRGE data repository that aims to find SNPs associated with asthma. We have data stored in VCF (brge.vcf) with several covariates and phenotypes available in the file brge.txt (gender, age, obesity, smoking, country and asthma status). The same data is also available in PLINK format (brge.bed, brge.bim, brge.fam) with covariates in the file brge.phe.

Here we illustrate how to perform GWAS using R and Bioconductor packages or PLINK shell command line.

16.1 GWAS with Bioconductor

We have created a resource having the VCF file of our study on asthma as previously described. The name of the resource is brge_vcf the phenotypes are available in another resource called brge that is a .txt file. The GWAS analysis is then perform as follows

We first start by preparing login data

builder <- newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org",
               user = "dsuser", password = "password",
               resource = "RSRC.brge_vcf", driver = "OpalDriver")
logindata <- builder$build()

conns <- datashield.login(logins = logindata, assign = TRUE,
                          symbol = "res")

In this case we have to assign to different resources. One for the VCF (obesity_vcf) and another one for the phenotypic data (obesity). To this end, the datashield.assign.resource function is required before assigning any object to the specific resource. Notice that the VCF resource can be load into R as a GDS thanks to our extension of existing resources in the reourcer

datashield.assign.resource(conns, symbol = "vcf.res", 
                           resource = list(study1 = "RSRC.brge_vcf"))
datashield.assign.expr(conns, symbol = "gds", 
                       expr = quote(as.resource.object(vcf.res)))


datashield.assign.resource(conns, symbol = "covars.res", 
                           resource = list(study1 = "RSRC.brge"))
datashield.assign.expr(conns, symbol = "covars", 
                       expr = quote(as.resource.data.frame(covars.res)))

These are the objects available in the Opal server

ds.ls()

$study1
$study1$environment.searched
[1] "R_GlobalEnv"

$study1$objects.found
[1] "covars"     "covars.res" "gds"        "res"        "vcf.res"

We can use dsBaseClient functions to inspect the variables that are in the covars data.frame. The variables are

ds.colnames("covars")

$study1
[1] "scanID"  "gender"  "obese"   "age"     "smoke"   "country" "asthma"

The asthma variable has this number of individuals at each level (0: controls, 1: cases)

ds.table("covars$asthma")


 Data in all studies were valid 

Study 1 :  No errors reported from this study

$output.list
$output.list$TABLE_rvar.by.study_row.props
             study
covars$asthma 1
            0 1
            1 1

$output.list$TABLE_rvar.by.study_col.props
             study
covars$asthma         1
            0 0.6864187
            1 0.3135813

$output.list$TABLE_rvar.by.study_counts
             study
covars$asthma    1
            0 1587
            1  725

$output.list$TABLES.COMBINED_all.sources_proportions
covars$asthma
    0     1 
0.686 0.314 

$output.list$TABLES.COMBINED_all.sources_counts
covars$asthma
   0    1 
1587  725 


$validity.message
[1] "Data in all studies were valid"

There may be interest in only studying certain genes, for that matter, the loaded VCF resource can be subsetting as follows

genes <- c("A1BG","A2MP1")
ds.getSNPSbyGen("gds", genes)

The previous code will over-write the VCF with the SNPs corresponding to the selected genes, if the intention is to perform studies with both the complete VCF and a subsetted one, the argument name can be used to create a new object on the server with the subsetted VCF, preserving the complete one.

genes <- c("A1BG","A2MP1")
ds.getSNPSbyGen("gds", genes = genes, name = "subset.vcf")

Then, an object of class GenotypeData must be created at the server side to perform genetic data analyses. This is a container defined in the GWASTools package for storing genotype and phenotypic data from genetic association studies. By doing that we will also verify whether individuals in the GDS (e.g VCF) and covariates files have the same individuals and are in the same order. This can be performed by

ds.GenotypeData(x='gds', covars = 'covars', columnId = 1, newobj.name = 'gds.Data')

Before performing the association analyses, quality control (QC) can be performed to the loaded data. Three methodologies are available; 1) Principal Component Analysis (PCA) of the genomic data, 2) Hardy-Weinberg Equilibrium (HWE) testing and 3) Allelic frequency estimates. The QC methods 2 and 3 have as inputs a GenotypeData object, created with a covariates file that has a gender column; while method 1 has as input a VCF.

To perform the PCA, a pruning functionality is built inside so that redundant SNPs are discarted (there is an extra argument ld.threshold which controls the pruning, more information about it at the SNPRelate documentation), speeding up the execution time

ds.PCASNPS("gds", prune = TRUE)

To perform QC methodologies 2 and 3, the name of the gender column as well as the keys to describe male or female have to be provided. Remember that we can visualize the names of the variables from our data by executing ds.colnames("covars"). In our case, this variable is called “gender”, and the levels of this variable are 1 for male and 2 for female as we can see here (NOTE: we cannot use ds.levels since gender variable is not a factor):

ds.table1D("covars$gender")$counts

      covars$gender
1              1215
2              1097
Total          2312

The HWE test can be performed to selected chromosomes using the argument chromosomes, only the autosomes can be selected when performing a HWE test, the encoding of the autosomes can be fetched with

ds.getChromosomeNames("gds.Data")$autosomes

NULL

Therefore, HWE can be performed by:

ds.exactHWE("gds.Data", sexcol = "gender", male = "1", female = "2", chromosome = "22")

$study1
# A tibble: 1,581 x 9
   snpID chr   nAA   nAB   nBB   MAF                minor.allele f                   pval                
   <int> <chr> <chr> <chr> <chr> <chr>              <chr>        <chr>               <chr>               
 1 95140 22    531   1082  640   0.475810031069685  A            0.0372494534103404  0.0762378420092437  
 2 95141 22    580   1092  590   0.497789566755084  A            0.0344638881244839  0.101115839682148   
 3 95142 22    0     333   1951  0.0728984238178634 A            -0.0786304604486423 3.29321452000707e-06
 4 95143 22    0     225   2080  0.0488069414316703 A            -0.0513112884834663 0.00571353422173819 
 5 95144 22    176   831   1249  0.262189716312057  A            0.0479240933754878  0.025554465064758   
 6 95145 22    249   1003  1055  0.32531426094495   A            0.00958157673233362 0.635607524466151   
 7 95146 22    55    516   1738  0.135556517973149  A            0.0464603328061876  0.0325126363699498  
 8 95147 22    203   929   1164  0.290722996515679  A            0.0188880417746163  0.362631409282596   
 9 95148 22    291   1035  985   0.349848550411077  A            0.0154998317584483  0.464409674690009   
10 95149 22    11    276   2013  0.0647826086956522 A            0.00966929694008412 0.604847402951204   
# ... with 1,571 more rows

attr(,"class")
[1] "dsexactHWE" "list"

Similarly, allele frequencies estimates can be estimated by:

ds.alleleFrequency("gds.Data", sexcol = "gender", male = "1", female = "2")

$study1
# A tibble: 99,289 x 7
          M       F      all   n.M   n.F     n      MAF
      <dbl>   <dbl>    <dbl> <dbl> <dbl> <dbl>    <dbl>
 1 0.015    0.0147  0.0149    1200  1086  2286 0.0149  
 2 0.000837 0.00276 0.00175   1195  1087  2282 0.00175 
 3 0.00330  0.00548 0.00433   1212  1095  2307 0.00433 
 4 0.0124   0.00868 0.0106    1213  1095  2308 0.0106  
 5 0.00165  0       0.000866  1214  1095  2309 0.000866
 6 0.0712   0.0754  0.0732    1208  1094  2302 0.0732  
 7 0.00660  0.00913 0.00780   1213  1095  2308 0.00780 
 8 0.00248  0.00274 0.00261   1209  1093  2302 0.00261 
 9 0.182    0.181   0.181     1188  1071  2259 0.181   
10 0.00165  0       0.000868  1210  1094  2304 0.000868
# ... with 99,279 more rows

attr(,"class")
[1] "dsalleleFrequency" "list"

In the future, more functions will be created to perform quality control (QC) for both, SNPs and inviduals.

Association analysis for a given SNP is performed by simply

ds.glmSNP(snps.fit = "rs11247693", model = asthma ~ gender + age, genoData='gds.Data')

             Estimate Std. Error   p-value
rs11247693 -0.1543215  0.2309585 0.5040196
attr(,"class")
[1] "dsGlmSNP" "matrix"   "array"

The analysis of all available SNPs is performed when the argument snps.fit is missing. The function performs the analysis of the selected SNPs in a single repository or in multiple repositories as performing pooled analyses (it uses ds.glm DataSHIELD function). As in the case of transcriptomic data, analyzing all the SNPs in the genome (e.g GWAS) will be high time-consuming. We can adopt a similar approach as the one adopted using the limma at each server. That is, we run GWAS at each repository using specific and scalable packages available in R/Bioc. In that case we use the GWASTools and GENESIS packages. The complete pipeline is implemented in this function

ans.bioC <- ds.GWAS('gds.Data', model=asthma~age+country)

ans.bioC

$study1
# A tibble: 99,288 x 14
   variant.id rs         chr         pos n.obs    freq   MAC Score Score.SE Score.Stat  Score.pval    Est Est.SE     PVE
        <int> <chr>      <chr>     <int> <dbl>   <dbl> <dbl> <dbl>    <dbl>      <dbl>       <dbl>  <dbl>  <dbl>   <dbl>
 1      97800 rs12008773 X      64955287  2227 0.0126     56  19.5     3.84       5.07 0.000000408  1.32  0.260  0.0111 
 2      91449 rs2267914  20      1858988  2311 0.104     481  47.5     9.62       4.93 0.000000809  0.513 0.104  0.0105 
 3      19742 rs7153     3     122865987  2308 0.256    1181  62.3    13.7        4.56 0.00000508   0.334 0.0732 0.00900
 4      93268 rs6097326  20     51324797  2311 0.125     579  46.4    10.3        4.51 0.00000642   0.439 0.0973 0.00881
 5      19744 rs3732410  3     122898410  2308 0.254    1173  60.3    13.6        4.43 0.00000940   0.325 0.0734 0.00849
 6      66984 rs11055608 12     13804693  2306 0.446    2057 -67.6    15.7       -4.31 0.0000161   -0.275 0.0638 0.00805
 7      74760 rs7995146  13    111804035  2304 0.0527    243  29.7     6.96       4.27 0.0000195    0.614 0.144  0.00789
 8      59678 rs7098143  10     83329037  2295 0.210     963 -54.5    12.8       -4.25 0.0000214   -0.331 0.0780 0.00781
 9      13835 rs13002717 2     186197782  2311 0.00649    30  10.6     2.52       4.22 0.0000244    1.67  0.397  0.00770
10      27337 rs1602679  4     167576714  2299 0.101     463  39.7     9.46       4.20 0.0000264    0.444 0.106  0.00764
# ... with 99,278 more rows

attr(,"class")
[1] "dsGWAS" "list"

This close the DataSHIELD session

datashield.logout(conns)

16.2 GWAS with PLINK

Here we illustrate how to perform the same GWAS analyses on the asthma using PLINK secure shell commands. This can be performed thanks to the posibility of having ssh resources as described here.

It is worth to notice that this workflow and the new R functions implemented in dsOmicsClient could be used as a guideline to carry out similar analyses using existing analysis tools in genomics such as IMPUTE, SAMtools or BEDtools among many others.

We start by assigning login resources

  library(DSOpal)
  library(dsBaseClient)
  library(dsOmicsClient)
  builder <- newDSLoginBuilder()
  builder$append(server = "study1", url = "https://opal-demo.obiba.org",
                 user = "dsuser", password = "password",
                 resource = "RSRC.brge_plink", driver = "OpalDriver")
  logindata <- builder$build()

Then we assign the resource to a symbol (i.e. R object) called client which is a ssh resource

  conns <- datashield.login(logins = logindata, assign = TRUE,
                            symbol = "client")
  ds.class("client")

$study1
[1] "SshResourceClient"     "CommandResourceClient" "ResourceClient"        "R6"

Now, we are ready to run any PLINK command from the client site. Notice that in this case we want to assess association between the genotype data in bed format and use as phenotype the variable ‘asthma’ that is in the file ‘brge.phe’ in the 6th column. The sentence in a PLINK command would be (NOTE: we avoid –out to indicate the output file since the file will be available in R as a tibble).

  plink --bfile brge --logistic --pheno brge.phe --mpheno 6 --covar brge.phe --covar-name gender,age

The arguments must be encapsulated in a single character without the command ‘plink’

  plink.arguments <- "--bfile brge --logistic --pheno brge.phe --mpheno 6 --covar brge.phe --covar-name gender,age"

the analyses are then performed by

  ans.plink <- ds.PLINK("client", plink.arguments)

The object ans.plink contains the PLINK results at each server as well as the outuput provided by PLINK

lapply(ans.plink, names)

$study1
[1] "results"   "plink.out"

head(ans.plink$study1$results)

# A tibble: 6 x 9
    CHR SNP           BP A1    TEST   NMISS    OR   STAT        P
  <dbl> <chr>      <dbl> <chr> <chr>  <dbl> <dbl>  <dbl>    <dbl>
1     0 MitoC3993T  3993 T     ADD     2286 0.752 -1.33  0.182   
2     0 MitoC3993T  3993 T     gender  2286 0.742 -3.27  0.00107 
3     0 MitoC3993T  3993 T     age     2286 1.00   0.565 0.572   
4     0 MitoG4821A  4821 A     ADD     2282 2.68   1.71  0.0879  
5     0 MitoG4821A  4821 A     gender  2282 0.740 -3.31  0.000940
6     0 MitoG4821A  4821 A     age     2282 1.00   0.465 0.642

ans.plink$study$plink.out

$status
[1] 0

$output
 [1] ""                                                                                   
 [2] "@----------------------------------------------------------@"                       
 [3] "|        PLINK!       |     v1.07      |   10/Aug/2009     |"                       
 [4] "|----------------------------------------------------------|"                       
 [5] "|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |"                       
 [6] "|----------------------------------------------------------|"                       
 [7] "|  For documentation, citation & bug-report instructions:  |"                       
 [8] "|        http://pngu.mgh.harvard.edu/purcell/plink/        |"                       
 [9] "@----------------------------------------------------------@"                       
[10] ""                                                                                   
[11] "Skipping web check... [ --noweb ] "                                                 
[12] "Writing this text to log file [ /tmp/ssh-2566/out.log ]"                            
[13] "Analysis started: Tue Jan 12 14:50:23 2021"                                         
[14] ""                                                                                   
[15] "Options in effect:"                                                                 
[16] "\t--bfile brge"                                                                      
[17] "\t--logistic"                                                                        
[18] "\t--covar brge.phe"                                                                  
[19] "\t--covar-name gender,age"                                                           
[20] "\t--noweb"                                                                           
[21] "\t--out /tmp/ssh-2566/out"                                                           
[22] ""                                                                                   
[23] "Reading map (extended format) from [ brge.bim ] "                                   
[24] "100000 markers to be included from [ brge.bim ]"                                    
[25] "Reading pedigree information from [ brge.fam ] "                                    
[26] "2312 individuals read from [ brge.fam ] "                                           
[27] "2312 individuals with nonmissing phenotypes"                                        
[28] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"                              
[29] "Missing phenotype value is also -9"                                                 
[30] "725 cases, 1587 controls and 0 missing"                                             
[31] "1097 males, 1215 females, and 0 of unspecified sex"                                 
[32] "Reading genotype bitfile from [ brge.bed ] "                                        
[33] "Detected that binary PED file is v1.00 SNP-major mode"                              
[34] "Reading 6 covariates from [ brge.phe ] with nonmissing values for 2199 individuals" 
[35] "Selected subset of 2 from 6 covariates"                                             
[36] "For these, nonmissing covariate values for 2312 individuals"                        
[37] "Before frequency and genotyping pruning, there are 100000 SNPs"                     
[38] "2312 founders and 0 non-founders found"                                             
[39] "6009 heterozygous haploid genotypes; set to missing"                                
[40] "Writing list of heterozygous haploid genotypes to [ /tmp/ssh-2566/out.hh ]"         
[41] "7 SNPs with no founder genotypes observed"                                          
[42] "Warning, MAF set to 0 for these SNPs (see --nonfounders)"                           
[43] "Writing list of these SNPs to [ /tmp/ssh-2566/out.nof ]"                            
[44] "Total genotyping rate in remaining individuals is 0.994408"                         
[45] "0 SNPs failed missingness test ( GENO > 1 )"                                        
[46] "0 SNPs failed frequency test ( MAF < 0 )"                                           
[47] "After frequency and genotyping pruning, there are 100000 SNPs"                      
[48] "After filtering, 725 cases, 1587 controls and 0 missing"                            
[49] "After filtering, 1097 males, 1215 females, and 0 of unspecified sex"                
[50] "Converting data to Individual-major format"                                         
[51] "Writing logistic model association results to [ /tmp/ssh-2566/out.assoc.logistic ] "
[52] ""                                                                                   
[53] "Analysis finished: Tue Jan 12 14:52:20 2021"                                        
[54] ""                                                                                   

$error
character(0)

$command
[1] "cd /home/master/brge && plink1 --bfile brge --logistic --covar brge.phe --covar-name gender,age --noweb --out /tmp/ssh-2566/out"

attr(,"class")
[1] "resource.exec"

We can compare the p-values obtained using PLINK with Bioconductor-based packages for the top-10 SNPs as follows:

library(tidyverse)
# get SNP p.values (additive model - ADD)
res.plink <- ans.plink$study1$results %>% filter(TEST=="ADD") %>%
arrange(P)
# compare top-10 with Biocoductor's results
snps <- res.plink$SNP[1:10]
plink <- res.plink %>% filter(SNP%in%snps) %>% dplyr::select(SNP, P)
bioC <- ans.bioC$study1 %>% filter(rs%in%snps) %>% dplyr::select(rs, Score.pval)
left_join(plink, bioC, by=c("SNP" = "rs"))

# A tibble: 10 x 3
   SNP                 P  Score.pval
   <chr>           <dbl>       <dbl>
 1 rs2267914  0.00000151 0.000000809
 2 rs6097326  0.00000424 0.00000642 
 3 rs7153     0.00000440 0.00000508 
 4 rs3732410  0.00000817 0.00000940 
 5 rs7995146  0.0000170  0.0000195  
 6 rs6495788  0.0000213  0.0000278  
 7 rs1602679  0.0000268  0.0000264  
 8 rs11055608 0.0000270  0.0000161  
 9 rs7098143  0.0000313  0.0000214  
10 rs7676164  0.0000543  0.0000537

As expected, the p-values are in the same order of magnitud having little variations due to the implemented methods of each software.

We can do the same comparions of minor allele frequency (MAF) estimation performed with Bioconductor and PLINK. To this end, we need first to estimate MAF using PLINK

  plink.arguments <- "--bfile brge --freq"
  ans.plink2 <- ds.PLINK("client", plink.arguments)
  maf.plink <- ans.plink2$study1$results
  
  plink <- maf.plink %>% filter(SNP%in%snps) %>% dplyr::select(SNP, MAF)
  bioC <- ans.bioC$study1 %>% filter(rs%in%snps) %>% dplyr::select(rs, freq)
  left_join(plink, bioC, by=c("SNP" = "rs"))

# A tibble: 10 x 3
   SNP           MAF   freq
   <chr>       <dbl>  <dbl>
 1 rs7153     0.256  0.256 
 2 rs3732410  0.254  0.254 
 3 rs7676164  0.304  0.304 
 4 rs1602679  0.101  0.101 
 5 rs7098143  0.210  0.210 
 6 rs11055608 0.446  0.446 
 7 rs7995146  0.0527 0.0527
 8 rs6495788  0.267  0.267 
 9 rs2267914  0.104  0.104 
10 rs6097326  0.125  0.125

This close the DataSHIELD session

  datashield.logout(conns)