9 Extension to secure shell programs: GWAS with PLINK
GWAS can also be performed using programs that are executed using shell commands. This is the case for PLINK, one of the state-of-the-art programs to run GWAS and other genomic data analyses such gene-enviroment interactions or polygenic risc score analyses that require efficient and scalable pipelines.
Resources also allow the use of secure SSH service to run programs on a remote server accessible through SSH containing data and analysis tools where R is just used for launching the analyses and aggregating results. This feature allows us to create functions to analyze data using specific shell programs.
Here we describe how the PLINK program can be used to perform GWAS. In this case, the resource describes that access is given via SSH, the credentials required to connect, and the commands that can be run (of which one is plink).
We use the following code to illustrate how analyses should be performed using the resourcer package. This code could be considered as the base code for creating a server-side DataSHIELD package as performed in the plinkDS()
function implemented in the dsOmics package
We access the ssh resource called brge_plink
(Figure 7.2) using the resourcer package as follows:
library(resourcer)
brge_plink <- resourcer::newResource(url="ssh://plink-demo.obiba.org:2222/home/master/brge?exec=ls,plink,plink1",
identity = "master", secret = "nSATfSy9Y8JAo5zK")
client <- resourcer::newResourceClient(brge_plink)
This creates an object of this class:
[1] "SshResourceClient" "CommandResourceClient" "ResourceClient" "R6"
These are the actions we can do with an SshResourceClient
object
[1] ".__enclos_env__" "clone" "close" "exec" "removeTempDir"
[6] "tempDir" "uploadFile" "downloadFile" "getConnection" "getAllowedCommands"
[11] "initialize" "asTbl" "asDataFrame" "getResource"
For this specific resource (e.g. PLINK) we can execute these shell commands
[1] "ls" "plink" "plink1"
For instance
$status
[1] 0
$output
[1] "total 92992"
[2] "dr-xr-xr-x 2 master master 4096 Apr 29 2020 ."
[3] "drwxr-xr-x 7 master master 4096 Jan 4 18:41 .."
[4] "-r--r--r-- 1 master master 57800003 Apr 29 2020 brge.bed"
[5] "-r--r--r-- 1 master master 2781294 Apr 29 2020 brge.bim"
[6] "-r--r--r-- 1 master master 45771 Apr 29 2020 brge.fam"
[7] "-r--r--r-- 1 master master 34442346 Apr 29 2020 brge.gds"
[8] "-r--r--r-- 1 master master 59802 Apr 29 2020 brge.phe"
[9] "-r--r--r-- 1 master master 72106 Apr 29 2020 brge.txt"
$error
character(0)
$command
[1] "cd /home/master/brge && ls -la"
attr(,"class")
[1] "resource.exec"
Then, to avoid multiple accesses to the resource, it is recommended to create a temporary directory to save our results
[1] "/tmp/ssh-3451"
$status
[1] 0
$output
character(0)
$error
character(0)
$command
[1] "cd /home/master/brge && ls /tmp/ssh-3451"
attr(,"class")
[1] "resource.exec"
Then, we can use R to launch the shell commands
$status
[1] 0
$output
[1] ""
[2] "@----------------------------------------------------------@"
[3] "| PLINK! | v1.07 | 10/Aug/2009 |"
[4] "|----------------------------------------------------------|"
[5] "| (C) 2009 Shaun Purcell, GNU General Public License, v2 |"
[6] "|----------------------------------------------------------|"
[7] "| For documentation, citation & bug-report instructions: |"
[8] "| http://pngu.mgh.harvard.edu/purcell/plink/ |"
[9] "@----------------------------------------------------------@"
[10] ""
[11] "Skipping web check... [ --noweb ] "
[12] "Writing this text to log file [ /tmp/ssh-3451/out.log ]"
[13] "Analysis started: Tue Jan 12 14:41:06 2021"
[14] ""
[15] "Options in effect:"
[16] "\t--bfile brge"
[17] "\t--freq"
[18] "\t--out /tmp/ssh-3451/out"
[19] "\t--noweb"
[20] ""
[21] "Reading map (extended format) from [ brge.bim ] "
[22] "100000 markers to be included from [ brge.bim ]"
[23] "Reading pedigree information from [ brge.fam ] "
[24] "2312 individuals read from [ brge.fam ] "
[25] "2312 individuals with nonmissing phenotypes"
[26] "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)"
[27] "Missing phenotype value is also -9"
[28] "725 cases, 1587 controls and 0 missing"
[29] "1097 males, 1215 females, and 0 of unspecified sex"
[30] "Reading genotype bitfile from [ brge.bed ] "
[31] "Detected that binary PED file is v1.00 SNP-major mode"
[32] "Before frequency and genotyping pruning, there are 100000 SNPs"
[33] "2312 founders and 0 non-founders found"
[34] "6009 heterozygous haploid genotypes; set to missing"
[35] "Writing list of heterozygous haploid genotypes to [ /tmp/ssh-3451/out.hh ]"
[36] "7 SNPs with no founder genotypes observed"
[37] "Warning, MAF set to 0 for these SNPs (see --nonfounders)"
[38] "Writing list of these SNPs to [ /tmp/ssh-3451/out.nof ]"
[39] "Writing allele frequencies (founders-only) to [ /tmp/ssh-3451/out.frq ] "
[40] ""
[41] "Analysis finished: Tue Jan 12 14:41:10 2021"
[42] ""
$error
character(0)
$command
[1] "cd /home/master/brge && plink1 --bfile brge --freq --out /tmp/ssh-3451/out --noweb"
attr(,"class")
[1] "resource.exec"
The results can be retrieved as an R object
[1] "out.frq" "out.hh" "out.log" "out.nof"
NULL
# A tibble: 100,000 x 6
CHR SNP A1 A2 MAF NCHROBS
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 0 MitoC3993T T C 0.0149 4572
2 0 MitoG4821A A G 0.00175 4564
3 0 MitoG6027A A G 0.00434 4614
4 0 MitoT6153C C T 0.0106 4616
5 0 MitoC7275T T C 0.000866 4618
6 0 MitoT9699C C T 0.0732 4604
7 0 MitoA10045G G A 0.00780 4616
8 0 MitoG10311A A G 0.00261 4604
9 0 MitoA11252G G A 0.182 4518
10 0 MitoT11900C C T 0.000868 4608
# ... with 99,990 more rows
Finally temporary directories are removed and the session closed