9 Extension to secure shell programs: GWAS with PLINK

GWAS can also be performed using programs that are executed using shell commands. This is the case for PLINK, one of the state-of-the-art programs to run GWAS and other genomic data analyses such gene-enviroment interactions or polygenic risc score analyses that require efficient and scalable pipelines.

Resources also allow the use of secure SSH service to run programs on a remote server accessible through SSH containing data and analysis tools where R is just used for launching the analyses and aggregating results. This feature allows us to create functions to analyze data using specific shell programs.

Here we describe how the PLINK program can be used to perform GWAS. In this case, the resource describes that access is given via SSH, the credentials required to connect, and the commands that can be run (of which one is plink).

We use the following code to illustrate how analyses should be performed using the resourcer package. This code could be considered as the base code for creating a server-side DataSHIELD package as performed in the plinkDS() function implemented in the dsOmics package

We access the ssh resource called brge_plink (Figure 7.2) using the resourcer package as follows:

library(resourcer)
brge_plink <- resourcer::newResource(url="ssh://plink-demo.obiba.org:2222/home/master/brge?exec=ls,plink,plink1", 
                                     identity = "master", secret = "nSATfSy9Y8JAo5zK")
client <- resourcer::newResourceClient(brge_plink)

This creates an object of this class:

class(client)
[1] "SshResourceClient"     "CommandResourceClient" "ResourceClient"        "R6"                   

These are the actions we can do with an SshResourceClient object

names(client)
 [1] ".__enclos_env__"    "clone"              "close"              "exec"               "removeTempDir"
 [6] "tempDir"            "uploadFile"         "downloadFile"       "getConnection"      "getAllowedCommands"
[11] "initialize"         "asTbl"              "asDataFrame"        "getResource"       

For this specific resource (e.g. PLINK) we can execute these shell commands

client$getAllowedCommands()
[1] "ls"     "plink"  "plink1"

For instance

client$exec("ls", "-la")
$status
[1] 0

$output
[1] "total 92992"
[2] "dr-xr-xr-x 2 master master     4096 Apr 29  2020 ."
[3] "drwxr-xr-x 7 master master     4096 Jan  4 18:41 .."
[4] "-r--r--r-- 1 master master 57800003 Apr 29  2020 brge.bed"
[5] "-r--r--r-- 1 master master  2781294 Apr 29  2020 brge.bim"
[6] "-r--r--r-- 1 master master    45771 Apr 29  2020 brge.fam"
[7] "-r--r--r-- 1 master master 34442346 Apr 29  2020 brge.gds"
[8] "-r--r--r-- 1 master master    59802 Apr 29  2020 brge.phe"
[9] "-r--r--r-- 1 master master    72106 Apr 29  2020 brge.txt"

$error
character(0)

$command
[1] "cd /home/master/brge && ls -la"

attr(,"class")
[1] "resource.exec"

Then, to avoid multiple accesses to the resource, it is recommended to create a temporary directory to save our results

tempDir <- client$tempDir()
tempDir
[1] "/tmp/ssh-3451"
client$exec("ls", tempDir)
$status
[1] 0

$output
character(0)

$error
character(0)

$command
[1] "cd /home/master/brge && ls /tmp/ssh-3451"

attr(,"class")
[1] "resource.exec"

Then, we can use R to launch the shell commands

client$exec('plink1', c('--bfile', 'brge', '--freq', '--out', paste0(tempDir, '/out'), '--noweb'))
$status
[1] 0

$output
 [1] ""
 [2] "@----------------------------------------------------------@"
 [3] "|        PLINK!       |     v1.07      |   10/Aug/2009     |"
 [4] "|----------------------------------------------------------|"
 [5] "|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |"
 [6] "|----------------------------------------------------------|"
 [7] "|  For documentation, citation & bug-report instructions:  |"
 [8] "|        http://pngu.mgh.harvard.edu/purcell/plink/        |"
 [9] "@----------------------------------------------------------@"
[10] ""
[11] "Skipping web check... [ --noweb ] "
[12] "Writing this text to log file [ /tmp/ssh-3451/out.log ]"
[13] "Analysis started: Tue Jan 12 14:41:06 2021"
[14] ""
[15] "Options in effect:"
[16] "\t--bfile brge"
[17] "\t--freq"
[18] "\t--out /tmp/ssh-3451/out"
[19] "\t--noweb"
[20] ""
[21] "Reading map (extended format) from [ brge.bim ] "
[22] "100000 markers to be included from [ brge.bim ]"
[23] "Reading pedigree information from [ brge.fam ] "
[24] "2312 individuals read from [ brge.fam ] "
[25] "2312 individuals with nonmissing phenotypes"
[26] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"
[27] "Missing phenotype value is also -9"
[28] "725 cases, 1587 controls and 0 missing"
[29] "1097 males, 1215 females, and 0 of unspecified sex"
[30] "Reading genotype bitfile from [ brge.bed ] "
[31] "Detected that binary PED file is v1.00 SNP-major mode"
[32] "Before frequency and genotyping pruning, there are 100000 SNPs"
[33] "2312 founders and 0 non-founders found"
[34] "6009 heterozygous haploid genotypes; set to missing"
[35] "Writing list of heterozygous haploid genotypes to [ /tmp/ssh-3451/out.hh ]"
[36] "7 SNPs with no founder genotypes observed"
[37] "Warning, MAF set to 0 for these SNPs (see --nonfounders)"
[38] "Writing list of these SNPs to [ /tmp/ssh-3451/out.nof ]"
[39] "Writing allele frequencies (founders-only) to [ /tmp/ssh-3451/out.frq ] "
[40] ""
[41] "Analysis finished: Tue Jan 12 14:41:10 2021"
[42] ""

$error
character(0)

$command
[1] "cd /home/master/brge && plink1 --bfile brge --freq --out /tmp/ssh-3451/out --noweb"

attr(,"class")
[1] "resource.exec"

The results can be retrieved as an R object

outs <- client$exec('ls', tempDir)$output
outs
[1] "out.frq" "out.hh"  "out.log" "out.nof"
client$downloadFile(paste0(tempDir, '/out.frq'))  
NULL
ans <- readr::read_table("out.frq")
ans
# A tibble: 100,000 x 6
     CHR SNP         A1    A2         MAF NCHROBS
   <dbl> <chr>       <chr> <chr>    <dbl>   <dbl>
 1     0 MitoC3993T  T     C     0.0149      4572
 2     0 MitoG4821A  A     G     0.00175     4564
 3     0 MitoG6027A  A     G     0.00434     4614
 4     0 MitoT6153C  C     T     0.0106      4616
 5     0 MitoC7275T  T     C     0.000866    4618
 6     0 MitoT9699C  C     T     0.0732      4604
 7     0 MitoA10045G G     A     0.00780     4616
 8     0 MitoG10311A A     G     0.00261     4604
 9     0 MitoA11252G G     A     0.182       4518
10     0 MitoT11900C C     T     0.000868    4608
# ... with 99,990 more rows

Finally temporary directories are removed and the session closed

client$removeTempDir()
client$close()