13 Fast GWAS

The fast GWAS implementation is based on the semi-parallel regression method of Sikorska et al. (2013). We adapted their algorithm to our infrastructure to perform a fast pooled GWAS. Our implementation is described in the following pseudo-code:


Step 1: Fit the objective model [5]


  1. Iteratively, each DP_i [6] extracts the coefficients of the objective model
  2. The client sends the coefficient values to each DP_i
  3. Each DP_i computes the fitted values and residuals [FIT_i, RES_i]
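As an illustration of Step 1, the following is a minimal base R sketch, not the DataSHIELD implementation. It assumes a Gaussian linear objective model with two hypothetical covariates (`age`, `sex`), simulated data for two data processors, and a pooled `lm()` fit standing in for the iterative federated fit. The helper names `simulate_dp` and `fit_and_residuals` are invented for this example.

```r
set.seed(1)

# Hypothetical simulated data held by one data processor (DP_i).
simulate_dp <- function(n) {
  d <- data.frame(age = rnorm(n, 50, 10), sex = rbinom(n, 1, 0.5))
  d$y <- 1 + 0.02 * d$age + 0.5 * d$sex + rnorm(n)
  d
}
dps <- list(simulate_dp(60), simulate_dp(40))

# Stand-in for the federated fit: here the coefficients are simply
# estimated on the stacked data (an assumption made for this sketch).
coefs <- coef(lm(y ~ age + sex, data = do.call(rbind, dps)))

# Each DP_i applies the client-provided coefficients to its own covariates
# and derives the fitted values and residuals [FIT_i, RES_i].
fit_and_residuals <- function(d, coefs) {
  X <- model.matrix(~ age + sex, data = d)
  fit <- as.vector(X %*% coefs[colnames(X)])
  list(FIT = fit, RES = d$y - fit)
}
local_results <- lapply(dps, fit_and_residuals, coefs = coefs)
```

In this sketch only the coefficient vector travels from the client to each data processor; the fitted values and residuals stay local.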

Step 2: Using the residuals and genotype information [RES_i, GEN_i], compute the aggregates


  1. Each DP_i computes RES_i - mean(RES_i) :> YC_i
  2. Each DP_i computes colSums(YC_i * GEN_i); colSums(GEN_i); colSums(GEN_i ^ 2) and returns to the client [B_i, S1_i, S2_i]
  3. The client merges [B_i, S1_i, S2_i] into [B, S1, S2]
  4. The client computes the total number of individuals :> N_IND
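Step 2 can be sketched in base R as follows. The data are simulated, and the sketch assumes that "merging" means an element-wise sum of the per-DP vectors, which the pseudo-code does not state explicitly. The helpers `simulate_dp`, `dp_aggregates` and `merge_sum` are hypothetical names.

```r
set.seed(2)

# Hypothetical per-DP data: residuals and a genotype matrix (individuals x SNPs).
simulate_dp <- function(n, n_snps) {
  list(
    RES = rnorm(n),
    GEN = matrix(rbinom(n * n_snps, 2, 0.3), nrow = n, ncol = n_snps)
  )
}
dps <- list(simulate_dp(60, 5), simulate_dp(40, 5))

# Computed locally on each DP_i; only these aggregates are returned.
dp_aggregates <- function(dp) {
  YC <- dp$RES - mean(dp$RES)        # centered residuals
  list(
    B  = colSums(YC * dp$GEN),       # YC is recycled down each SNP column
    S1 = colSums(dp$GEN),
    S2 = colSums(dp$GEN ^ 2),
    N  = nrow(dp$GEN)
  )
}
per_dp <- lapply(dps, dp_aggregates)

# Client side: merge by summing across data processors (assumed merge rule).
merge_sum <- function(name) Reduce(`+`, lapply(per_dp, `[[`, name))
B     <- merge_sum("B")
S1    <- merge_sum("S1")
S2    <- merge_sum("S2")
N_IND <- merge_sum("N")
```

Each returned object has one entry per SNP (or a single count), so no individual-level genotype or residual leaves a data processor in this sketch.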

Step 3: Using [B, S1, S2, N_IND, YC_i], compute betas and p-values


  1. The client computes S2 - (S1 ^ 2) / N_IND :> DEN1
  2. The client computes B / DEN1 :> BETA to obtain the beta values
  3. Each DP_i computes colSums(YC_i ^ 2) and returns to client :> YC2_i
  4. The client merges YC2_i :> YC2
  5. The client computes (YC2 - BETA ^ 2 * DEN1) / (N_IND - K - 2) :> SIGMA, where K is the number of covariates in the objective model
  6. The client computes sqrt(SIGMA * (1 / DEN1)) :> ERR
  7. The client computes 2 * pnorm(-abs(BETA / ERR)) :> PVAL to obtain the p-values
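The client-side arithmetic of Step 3 can be sketched in base R on simulated data as shown below. The sketch assumes that merging is a sum across data processors, that `K` is the number of covariates in the objective model (set to 2 here purely for illustration), and that the phenotype is a single vector, so `sum()` plays the role of `colSums(YC_i ^ 2)`. The final lines compare `BETA` with per-SNP `lm()` slopes on the stacked data as a sanity check, which is only possible in a simulation.

```r
set.seed(3)
n_snps <- 4

# Hypothetical per-DP data: centered residuals YC_i and genotypes GEN_i.
simulate_dp <- function(n) {
  GEN <- matrix(rbinom(n * n_snps, 2, 0.3), nrow = n)
  RES <- 0.4 * GEN[, 1] + rnorm(n)   # SNP 1 carries a simulated effect
  list(YC = RES - mean(RES), GEN = GEN)
}
dps <- list(simulate_dp(120), simulate_dp(80))
K <- 2  # assumed number of covariates in the objective model

# Aggregates returned by each DP_i and summed by the client.
sum_over_dps <- function(f) Reduce(`+`, lapply(dps, f))
B     <- sum_over_dps(function(d) colSums(d$YC * d$GEN))
S1    <- sum_over_dps(function(d) colSums(d$GEN))
S2    <- sum_over_dps(function(d) colSums(d$GEN ^ 2))
N_IND <- sum_over_dps(function(d) length(d$YC))
YC2   <- sum_over_dps(function(d) sum(d$YC ^ 2))

# Client-side computations from the pseudo-code.
DEN1  <- S2 - S1 ^ 2 / N_IND
BETA  <- B / DEN1
SIGMA <- (YC2 - BETA ^ 2 * DEN1) / (N_IND - K - 2)
ERR   <- sqrt(SIGMA * (1 / DEN1))
PVAL  <- 2 * pnorm(-abs(BETA / ERR))

# Sanity check (simulation only): per-SNP regression on the stacked data.
stacked_y <- unlist(lapply(dps, `[[`, "YC"))
stacked_g <- do.call(rbind, lapply(dps, `[[`, "GEN"))
lm_beta <- apply(stacked_g, 2, function(g) coef(lm(stacked_y ~ g))[["g"]])
```

Because the residuals are centered within each data processor, the sum of `YC * GEN` equals the centered cross-product, which is why `BETA` agrees with the `lm()` slopes. The standard errors and p-values differ slightly from `lm()` because of the `N_IND - K - 2` degrees of freedom and the normal approximation used for the p-values.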

The data shared with the client is always the result of an aggregation (i.e. column sums of values and of their products). For that reason, we added disclosure controls that guarantee that the aggregated data is built from a minimum number of valid points, to prevent data leaks (e.g. returning an aggregate computed from a single individual would leak that individual's data). Note that performance is limited by the slowest server; because all servers work in parallel, adding data providers to the study does not necessarily increase the computation time.
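One way such a disclosure control could look is sketched below in base R. This is not the actual server-side check: the function name `safe_col_sums`, the `min_valid` argument and its default of 5 are assumptions chosen only to illustrate the idea of refusing aggregates built from too few valid values.

```r
# Hypothetical server-side helper: only return column sums when every column
# is backed by at least `min_valid` non-missing values.
safe_col_sums <- function(x, min_valid = 5) {
  n_valid <- colSums(!is.na(x))
  if (any(n_valid < min_valid)) {
    stop("aggregate blocked: fewer than ", min_valid,
         " valid values in at least one column")
  }
  colSums(x, na.rm = TRUE)
}

# Six individuals, two SNPs, one missing genotype: enough valid points.
GEN <- matrix(c(0, 1, 2, 1, 0, 2,
                NA, 1, 0, 1, 2, 2), nrow = 6)
allowed <- safe_col_sums(GEN)

# A single individual: the aggregate would equal that individual's data,
# so the request is rejected.
blocked <- tryCatch(
  safe_col_sums(GEN[1, , drop = FALSE]),
  error = function(e) conditionMessage(e)
)
```

The threshold trades utility for privacy: a higher `min_valid` blocks more small-sample requests but makes it harder to infer individual values from an aggregate.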

References

Sikorska, Karolina, Emmanuel Lesaffre, Patrick FJ Groenen, and Paul HC Eilers. 2013. “GWAS on Your Notebook: Fast Semi-Parallel Linear and Logistic Regression for Genome-Wide Association Studies.” BMC Bioinformatics 14 (1): 1–11.

  5. Model obtained using dsBaseClient::ds.glm

  6. DP_i: Data processor (Opal node)