13 Fast GWAS

The fast GWAS is based on the method of Sikorska et al. (2013). We adapted their algorithm to our infrastructure to perform a fast pooled GWAS. Our implementation is described in the following pseudo-code:

Step 1: Fit the objective model⁵

Iteratively, each DP_i⁶ extracts the coefficients of the objective model
The client sends the coefficient values to each DP_i
Each DP_i computes the fitted values and residuals [FIT_i, RES_i]
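
As an illustration of the last line of Step 1, the following is a minimal sketch of how one data processor could turn the received coefficients into fitted values and residuals. It assumes a linear (identity-link) objective model; the function name, the covariates and the coefficient values are hypothetical and are not taken from the actual implementation.

```python
from typing import List, Sequence, Tuple


def fitted_and_residuals(
    covariates: Sequence[Sequence[float]],
    phenotype: Sequence[float],
    coefficients: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Return (FIT_i, RES_i) for one data processor.

    coefficients[0] is the intercept; the remaining entries line up with
    the columns of `covariates`. Assumes a linear (identity-link) model.
    """
    fit = []
    for row in covariates:
        value = coefficients[0]
        for beta, x in zip(coefficients[1:], row):
            value += beta * x
        fit.append(value)
    res = [y - f for y, f in zip(phenotype, fit)]
    return fit, res


# Hypothetical local data held by one DP_i: two covariates (age, sex).
covariates_i = [[50.0, 0.0], [61.0, 1.0], [47.0, 1.0]]
phenotype_i = [24.0, 29.5, 25.0]
# Hypothetical coefficients received from the client (intercept, age, sex).
coefs = [10.0, 0.3, 1.0]

FIT_i, RES_i = fitted_and_residuals(covariates_i, phenotype_i, coefs)
```

In this sketch the individual-level values FIT_i and RES_i stay on the data processor; only the coefficients travel between client and server.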

Step 2: Using the residuals and genotype information [RES_i, GEN_i]

Each DP_i computes RES_i - mean(RES_i) :> YC_i
Each DP_i computes colSums(YC_i * GEN_i); colSums(GEN_i); colSums(GEN_i ^ 2) and returns to the client [B_i, S1_i, S2_i]
The client merges [B_i, S1_i, S2_i] into [B, S1, S2]
The client computes the total number of individuals :> N_IND
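
Step 2 can be sketched as follows. The sketch assumes that "merging" means an element-wise sum of the per-processor vectors and that each data processor also returns its local sample size so the client can obtain N_IND. Neither detail is spelled out above, and all function names and data are hypothetical.

```python
from typing import List, Sequence, Tuple

# (B_i, S1_i, S2_i, local sample size)
Aggregates = Tuple[List[float], List[float], List[float], int]


def dp_aggregates(res: Sequence[float], gen: Sequence[Sequence[float]]) -> Aggregates:
    """Server side: compute [B_i, S1_i, S2_i] on one DP_i.

    `gen` has one row per individual and one column per SNP.
    """
    mean_res = sum(res) / len(res)
    yc = [r - mean_res for r in res]  # YC_i: centred residuals
    n_snps = len(gen[0])
    b = [sum(yc[k] * gen[k][j] for k in range(len(gen))) for j in range(n_snps)]
    s1 = [sum(row[j] for row in gen) for j in range(n_snps)]
    s2 = [sum(row[j] ** 2 for row in gen) for j in range(n_snps)]
    return b, s1, s2, len(res)


def merge(parts: Sequence[Aggregates]) -> Aggregates:
    """Client side: element-wise sum of the per-DP aggregates (assumed merge rule)."""
    n_snps = len(parts[0][0])
    b = [sum(p[0][j] for p in parts) for j in range(n_snps)]
    s1 = [sum(p[1][j] for p in parts) for j in range(n_snps)]
    s2 = [sum(p[2][j] for p in parts) for j in range(n_snps)]
    n_ind = sum(p[3] for p in parts)
    return b, s1, s2, n_ind


# Two hypothetical data processors, two SNPs coded as 0/1/2.
part_1 = dp_aggregates([1.0, -1.0, 0.0], [[0, 1], [1, 2], [2, 0]])
part_2 = dp_aggregates([2.0, 4.0], [[1, 1], [2, 0]])

B, S1, S2, N_IND = merge([part_1, part_2])
```

Only the per-SNP sums and the sample sizes cross the network; the residuals and genotypes never leave their data processor.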

Step 3: Using [B, S1, S2, N_IND, YC_i] compute betas and pvalues

The client computes S2 - (S1 ^ 2) / N_IND :> DEN1
The client computes B / DEN1 :> BETA to obtain the beta values
Each DP_i computes colSums(YC_i ^ 2) and returns to client :> YC2_i
The client merges YC2_i :> YC2
The client computes (YC2 - BETA ^ 2 * DEN1) / (N_IND - K - 2) :> SIGMA, where K is the number of covariates in the objective model
The client computes sqrt(SIGMA * (1 / DEN1)) :> ERR
The client computes 2 * pnorm(-abs(BETA / ERR)) :> PVAL to obtain the pvalues
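
The client-side arithmetic of Step 3 can be sketched with the Python standard library alone. The sketch treats YC2 as a single number (the pooled sum of squared centred residuals) and takes K as an input; the function name and the numeric inputs are hypothetical. The two-sided normal p-value 2 * pnorm(-abs(z)) is computed through the identity 2 * pnorm(-abs(z)) = erfc(abs(z) / sqrt(2)).

```python
import math
from typing import List, Sequence, Tuple


def fast_gwas_stats(
    b: Sequence[float],
    s1: Sequence[float],
    s2: Sequence[float],
    yc2: float,
    n_ind: int,
    k: int,
) -> Tuple[List[float], List[float], List[float]]:
    """Return per-SNP (BETA, ERR, PVAL) from the merged aggregates."""
    betas, errs, pvals = [], [], []
    for bj, s1j, s2j in zip(b, s1, s2):
        # DEN1: sum of squares of the centred genotype column.
        # It is zero for a monomorphic SNP, which this sketch does not handle.
        den1 = s2j - s1j ** 2 / n_ind
        beta = bj / den1
        sigma = (yc2 - beta ** 2 * den1) / (n_ind - k - 2)  # residual variance
        err = math.sqrt(sigma / den1)
        z = beta / err
        pval = math.erfc(abs(z) / math.sqrt(2))  # same value as 2 * pnorm(-abs(z))
        betas.append(beta)
        errs.append(err)
        pvals.append(pval)
    return betas, errs, pvals


# Hypothetical merged aggregates for a single SNP and six individuals.
BETA, ERR, PVAL = fast_gwas_stats(
    b=[2.0], s1=[6.0], s2=[10.0], yc2=5.0, n_ind=6, k=1
)
```

With these inputs DEN1 is 10 - 36 / 6 = 4, so BETA is 2 / 4 = 0.5, SIGMA is (5 - 0.25 * 4) / 3 = 4 / 3 and ERR is sqrt(1 / 3).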

The data shared with the client is always the result of an aggregation (i.e. column sums of values and of products). For that reason, we added disclosure controls that guarantee that each aggregate is computed over a minimum number of valid data points, which prevents data leaks (e.g. an aggregate over a single individual would reveal that individual's data). Note that performance is limited by the slowest server: because all servers work in parallel, adding data providers to the study does not necessarily increase the computation time.
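
The kind of disclosure control described above can be illustrated with a small server-side guard. This is only a sketch of the idea: the threshold value, the function name and the use of `None` for missing genotypes are assumptions, and the actual controls used by the servers may work differently.

```python
from typing import List, Optional, Sequence

MIN_VALID_POINTS = 5  # hypothetical threshold, not the real server setting


def safe_col_sums(
    gen: Sequence[Sequence[Optional[float]]],
    min_valid: int = MIN_VALID_POINTS,
) -> List[float]:
    """Column sums that are only released when every column is backed by
    at least `min_valid` non-missing values."""
    n_snps = len(gen[0])
    sums = []
    for j in range(n_snps):
        valid = [row[j] for row in gen if row[j] is not None]
        if len(valid) < min_valid:
            raise ValueError(
                f"column {j}: only {len(valid)} valid points, refusing to aggregate"
            )
        sums.append(sum(valid))
    return sums


# Six individuals, one SNP with a single missing call: the sum is released.
released = safe_col_sums([[0], [1], [2], [None], [1], [0]])

# A single individual: the guard blocks the request.
try:
    safe_col_sums([[2]])
    blocked = False
except ValueError:
    blocked = True
```

Raising an error instead of returning a partial result keeps a client from isolating one individual's genotype by requesting aggregates over very small groups.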

References

Sikorska, Karolina, Emmanuel Lesaffre, Patrick FJ Groenen, and Paul HC Eilers. 2013. “GWAS on Your Notebook: Fast Semi-Parallel Linear and Logistic Regression for Genome-Wide Association Studies.” BMC Bioinformatics 14 (1): 1–11.

⁵ Model obtained using dsBaseClient::ds.glm
⁶ DP_i: Data processor (Opal node)