14 PCA
To compute a pooled PCA using horizontally partitioned data, the block method singular value decomposition has been implemented Berry et al. (2005). The pseudo-code of the algorithm is as follows:
Step 1: Intermediate SVD7
- Each
DP_i
8 computes SVD(geno9) [u_i
,v_i
,d_i
] - Each
DP_i
computesu_i * d_i [res_i]
- Each
DP_i
returnsres_i
to the client
Step 2: Client aggregation
- The client receives res_i and merges it horizontally [aggregated]
- The client computes SVD(aggregated) [final_results]
- The client returns the pooled PCA results [final_results]
The algorithm used returns to the client the product of the left singular vectors (u_i
) and singular values (d_i
). This guarantees that the original data contained on each node is not reconstructible (i.e. we only share non-disclosive information among servers).
This is a general method to compute PCA. We have adapted this approach to allow performing PCA on genotype data. To this end, these two extra steps have been implemented.
- Genotype standardization: \(\frac{X_{ij} - 2 \mu_{j}}{\sigma_j}\), where the genotype is \(X_ij\) (\(i\) is the individual index, \(j\) is the SNP index), \(\mu_j=2p_j\) is the mean of the 0,1,2-coded genotype (\(p_j\) being the alternate allele frequency), and \(\sigma_j\) is the expected standard error under Hardy-Weinberg Equilibrium, that is \(\sqrt{2p_j (1-p_j)}\).
- Perform PCA using only selected SNPs that have been linked to differentiate ethnic groups Huang, Shu, and Cai (2015).
References
Berry, Michael W, Dani Mezher, Bernard Philippe, and Ahmed Sameh. 2005. “Parallel Algorithms for the Singular Value Decomposition.” In Handbook of Parallel Computing and Statistics, 133–80. Chapman; Hall/CRC.
Huang, Tao, Yang Shu, and Yu-Dong Cai. 2015. “Genetic Differences Among Ethnic Groups.” BMC Genomics 16 (1): 1–10.