13 Fast GWAS
The method implemented for the fast GWAS is based on the research by Sikorska et al. (2013). We adapted their algorithm to our infrastructure to perform a fast pooled GWAS. Our implementation is described in the following pseudo-code:
Step 1: Fit the objective model
- Iteratively, each DP_i extracts the coefficients of the objective model
- The client sends the coefficient values to each DP_i
- Each DP_i computes the fitted values and residuals [FIT_i, RES_i]
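The last bullet of Step 1 can be sketched as follows. This is a minimal illustration that assumes a linear objective model with an identity link (the pseudo-code does not state the model family); the function name `fitted_and_residuals` and the toy values are hypothetical and not part of the implementation.

```python
def fitted_and_residuals(covariates, phenotype, coefficients):
    """Return (FIT_i, RES_i) for one data provider.

    covariates   -- list of rows, one list of covariate values per individual
    phenotype    -- observed outcome per individual
    coefficients -- [intercept, b1, b2, ...] as received from the client
    Assumes a linear model with identity link (an assumption of this sketch).
    """
    intercept, slopes = coefficients[0], coefficients[1:]
    fit = [intercept + sum(b * x for b, x in zip(slopes, row))
           for row in covariates]
    res = [y - f for y, f in zip(phenotype, fit)]
    return fit, res


# Toy data for a single provider (illustrative values only)
covariates = [[1.0], [2.0], [3.0]]
phenotype = [2.5, 4.0, 6.5]
coefficients = [0.5, 2.0]

FIT_i, RES_i = fitted_and_residuals(covariates, phenotype, coefficients)
```

In this sketch only the coefficients travel between client and data provider; the fitted values and residuals stay on the data provider's side.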
Step 2: Using the residuals and genotype information [RES_i, GEN_i]
- Each DP_i computes RES_i - mean(RES_i) :> YC_i
- Each DP_i computes colSums(YC_i * GEN_i); colSums(GEN_i); colSums(GEN_i ^ 2) and returns to the client [B_i, S1_i, S2_i]
- The client merges [B_i, S1_i, S2_i] into [B, S1, S2]
- The client computes the total number of individuals :> N_IND
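Step 2 can be illustrated with a small sketch. It assumes that "merging" means an element-wise sum across data providers and that each provider also reports its number of individuals so the client can obtain N_IND; neither detail is spelled out in the pseudo-code. The function names and toy data are hypothetical.

```python
def site_summaries(res, gen):
    """Aggregates computed by one data provider.

    res -- residuals RES_i, one per individual
    gen -- genotype rows GEN_i (individuals x SNPs)
    """
    n = len(res)
    mean_res = sum(res) / n
    yc = [r - mean_res for r in res]              # YC_i stays at the provider
    n_snps = len(gen[0])
    b = [sum(yc[k] * gen[k][j] for k in range(n)) for j in range(n_snps)]
    s1 = [sum(gen[k][j] for k in range(n)) for j in range(n_snps)]
    s2 = [sum(gen[k][j] ** 2 for k in range(n)) for j in range(n_snps)]
    # Returning the count is an assumption of this sketch
    return {"B": b, "S1": s1, "S2": s2, "N": n}


def merge(summaries):
    """Client side: element-wise sum of the per-provider vectors."""
    def add(key):
        return [sum(vals) for vals in zip(*(s[key] for s in summaries))]
    return add("B"), add("S1"), add("S2"), sum(s["N"] for s in summaries)


# Two toy data providers, two SNPs (illustrative values only)
site_1 = site_summaries([1.0, -1.0, 0.0], [[0, 1], [1, 2], [2, 0]])
site_2 = site_summaries([2.0, 0.0], [[1, 0], [0, 2]])

B, S1, S2, N_IND = merge([site_1, site_2])
```

Only per-SNP sums and a count leave each provider in this sketch; the individual-level vectors `yc` and `gen` are never returned.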
Step 3: Using [B, S1, S2, N_IND, YC_i] compute betas and pvalues
- The client computes S2 - (S1 ^ 2) / N_IND :> DEN1
- The client computes B / DEN1 :> BETA to obtain the beta values
- Each DP_i computes colSums(YC_i ^ 2) and returns to the client :> YC2_i
- The client merges YC2_i :> YC2
- The client computes (YC2 - BETA ^ 2 * DEN1) / (N_IND - K - 2) :> SIGMA
- The client computes sqrt(SIGMA * (1 / DEN1)) :> ERR
- The client computes 2 * pnorm(-abs(BETA / ERR)) :> PVAL to obtain the pvalues
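The client-side arithmetic of Step 3 can be sketched as below. Two assumptions are made here: K is taken to be the number of covariates in the objective model (the pseudo-code does not define it), and YC2 is treated as a single scalar obtained by summing the per-provider values. The two-sided normal p-value 2 * pnorm(-abs(z)) is computed through the identity 2 * pnorm(-|z|) = erfc(|z| / sqrt(2)). The function name and input values are hypothetical.

```python
import math


def betas_and_pvalues(B, S1, S2, YC2, n_ind, k):
    """Per-SNP effect sizes, standard errors and p-values from merged sums.

    B, S1, S2 -- merged per-SNP sums from Step 2
    YC2       -- merged sum of squared centred residuals (scalar here)
    n_ind     -- total number of individuals, N_IND
    k         -- assumed: number of covariates in the objective model
    """
    den1 = [s2 - s1 ** 2 / n_ind for s1, s2 in zip(S1, S2)]
    beta = [b / d for b, d in zip(B, den1)]
    sigma = [(YC2 - bt ** 2 * d) / (n_ind - k - 2)
             for bt, d in zip(beta, den1)]
    err = [math.sqrt(s / d) for s, d in zip(sigma, den1)]
    # erfc(|z| / sqrt(2)) equals 2 * pnorm(-|z|) for the standard normal
    pval = [math.erfc(abs(bt / e) / math.sqrt(2))
            for bt, e in zip(beta, err)]
    return beta, err, pval


# Illustrative merged values for two SNPs and five individuals
BETA, ERR, PVAL = betas_and_pvalues(
    B=[0.0, -3.0], S1=[4, 5], S2=[6, 9], YC2=4.0, n_ind=5, k=1
)
```

Every quantity used by the client in this sketch is an aggregate over all individuals, which matches the way the pseudo-code keeps individual-level data at the providers.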
The data shared with the client is always the result of an aggregation (i.e. column sums and different products). For that reason, we added disclosure controls that guarantee that the aggregated data is computed from a minimum number of valid points, to prevent data leaks (e.g. returning an aggregate computed from a single individual would leak that individual's data). Note that performance is limited by the slowest server; because all servers work in parallel, adding data providers to the study does not necessarily increase the computation time.
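The kind of disclosure control described above could look like the following sketch. It is only an illustration of the idea: the threshold value, the use of `None` for missing values, and the function name `guarded_column_sums` are all assumptions, not the checks actually used by the implementation.

```python
MIN_VALID_POINTS = 5  # hypothetical threshold, chosen only for illustration


def guarded_column_sums(rows, min_valid=MIN_VALID_POINTS):
    """Column sums over non-missing values, withheld if a column has too few.

    rows -- list of rows (individuals x columns); None marks a missing value
    Raises ValueError instead of returning an aggregate that would be
    computed from fewer than `min_valid` valid points.
    """
    n_cols = len(rows[0])
    sums = []
    for j in range(n_cols):
        valid = [row[j] for row in rows if row[j] is not None]
        if len(valid) < min_valid:
            raise ValueError(
                f"column {j}: only {len(valid)} valid points, aggregate withheld"
            )
        sums.append(sum(valid))
    return sums


# Five individuals, two columns: enough valid points in every column
safe = guarded_column_sums([[0, 1], [1, 2], [2, 0], [1, 1], [0, 2]])

# A single individual: the aggregate would equal that individual's data
try:
    guarded_column_sums([[2, 1]])
    blocked = False
except ValueError:
    blocked = True
```

Refusing the whole request rather than dropping the offending column is a design choice of this sketch; it avoids returning a partial result whose shape reveals which columns were sparse.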