14 PCA

To compute a pooled PCA using horizontally partitioned data, the block method singular value decomposition has been implemented Berry et al. (2005). The pseudo-code of the algorithm is as follows:


Step 1: Intermediate SVD7


  1. Each DP_i8 computes SVD(geno9) [u_i, v_i, d_i]
  2. Each DP_i computes u_i * d_i [res_i]
  3. Each DP_i returns res_i to the client

Step 2: Client aggregation


  1. The client receives res_i and merges it horizontally [aggregated]
  2. The client computes SVD(aggregated) [final_results]
  3. The client returns the pooled PCA results [final_results]

The algorithm used returns to the client the product of the left singular vectors (u_i) and singular values (d_i). This guarantees that the original data contained on each node is not reconstructible (i.e. we only share non-disclosive information among servers).

This is a general method to compute PCA. We have adapted this approach to allow performing PCA on genotype data. To this end, these two extra steps have been implemented.

  • Genotype standardization: \(\frac{X_{ij} - 2 \mu_{j}}{\sigma_j}\), where the genotype is \(X_ij\) (\(i\) is the individual index, \(j\) is the SNP index), \(\mu_j=2p_j\) is the mean of the 0,1,2-coded genotype (\(p_j\) being the alternate allele frequency), and \(\sigma_j\) is the expected standard error under Hardy-Weinberg Equilibrium, that is \(\sqrt{2p_j (1-p_j)}\).
  • Perform PCA using only selected SNPs that have been linked to differentiate ethnic groups Huang, Shu, and Cai (2015).

References

Berry, Michael W, Dani Mezher, Bernard Philippe, and Ahmed Sameh. 2005. “Parallel Algorithms for the Singular Value Decomposition.” In Handbook of Parallel Computing and Statistics, 133–80. Chapman; Hall/CRC.
Huang, Tao, Yang Shu, and Yu-Dong Cai. 2015. “Genetic Differences Among Ethnic Groups.” BMC Genomics 16 (1): 1–10.

  1. Singular value decomposition↩︎

  2. DP_i: Data processor (Opal node)↩︎

  3. Encoded genotype data↩︎