15 Differential privacy

Differential privacy can be briefly described as a property of a randomized algorithm: by observing its output, one cannot tell whether any single individual's information was used in the computation. This concept was introduced by Dwork, Roth, et al. (2014). Formally, it is referred to as \(\epsilon\)-differential privacy and is mathematically expressed as:

\[Pr[A(D_1) \in S] \leq \exp(\epsilon) \cdot Pr[A(D_2) \in S]\]

where \(A\) is a randomized algorithm that takes a dataset as input, \(D_1\) and \(D_2\) are any two datasets that differ in a single individual's record, \(S\) is any subset of the possible outputs, and \(\epsilon\) is the privacy parameter that defines the level of privacy (the closer it is to 0, the stronger the privacy).
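
As a concrete illustration (the value of \(\epsilon\) here is chosen only for exposition), with \(\epsilon = 0.1\) the bound becomes

\[Pr[A(D_1) \in S] \leq e^{0.1} \cdot Pr[A(D_2) \in S] \approx 1.105 \cdot Pr[A(D_2) \in S]\]

so adding or removing a single individual's record changes the probability of observing any given output by at most about 10%.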

To implement differential privacy in our software, the Laplace mechanism has been used. This mechanism adds noise to the output of a function; the noise is drawn from a Laplace distribution with mean 0 and scale \(\Delta f/\epsilon\), where \(\Delta f\) is the \(l_1\) sensitivity defined by:

\[\Delta f = \max_{D_1, D_2} \|f(D_1) - f(D_2)\|_1\]
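
To illustrate how the Laplace mechanism works, the following R sketch adds calibrated noise to a numeric statistic. The function name `laplace_mechanism` and its arguments are hypothetical and shown only for exposition; they are not the exact interface used by the software.

```r
# Minimal sketch of the Laplace mechanism (illustrative, not the package's API).
# Adds Laplace(0, sensitivity / epsilon) noise to a numeric output.
laplace_mechanism <- function(value, sensitivity, epsilon) {
  b <- sensitivity / epsilon
  # The difference of two i.i.d. Exponential(rate = 1/b) draws is Laplace(0, b)
  noise <- rexp(length(value), rate = 1 / b) - rexp(length(value), rate = 1 / b)
  value + noise
}

# Example: release a mean of values bounded in [0, 1], whose l1 sensitivity is 1/n
x <- runif(100)
laplace_mechanism(mean(x), sensitivity = 1 / length(x), epsilon = 0.5)
```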

The assessment of \(\Delta f\) can be particularly complex for certain functions (e.g. limma + voom); for that reason, the sensitivity-sampling method of Rubinstein and Aldà (2017) has been used to estimate \(\Delta f\). Since the number of resamples performed can affect the quality of the differential privacy guarantee, we allow the data owner to configure this parameter on the Opal server.
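
The sketch below illustrates the general idea of sensitivity sampling as a simplification of Rubinstein and Aldà (2017): repeatedly form neighbouring datasets that differ in one record, evaluate the function on both, and take a high empirical quantile of the \(l_1\) distances. The helper `estimate_sensitivity` and its arguments are hypothetical, and the number of resamples plays the role of the parameter configurable on the Opal server; this is not the implementation used by the software.

```r
# Illustrative sketch of sensitivity sampling (after Rubinstein and Aldà 2017).
estimate_sensitivity <- function(data, f, n_resamples = 100, quantile_level = 0.95) {
  f_data <- f(data)
  distances <- replicate(n_resamples, {
    i <- sample(nrow(data), 1)            # record to replace
    j <- sample(nrow(data), 1)            # replacement record
    neighbour <- data
    neighbour[i, ] <- data[j, ]           # dataset differing in a single record
    sum(abs(f_data - f(neighbour)))       # l1 distance between outputs
  })
  quantile(distances, quantile_level)     # high quantile as the sensitivity estimate
}

# Example: estimated sensitivity of column means of a small data frame
d <- data.frame(a = rnorm(50), b = rnorm(50))
estimate_sensitivity(d, f = colMeans, n_resamples = 200)
```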

For the specific case of fast GWAS, the added differential privacy noise can severely affect individual SNPs at random; for that reason, we included a `resample` argument in the function, which performs the analysis `resample` times and returns the median values.
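
A minimal sketch of this resampling idea is shown below, assuming a hypothetical function `noisy_gwas()` that returns one noisy statistic per SNP; the actual fast GWAS function and its `resample` argument may differ.

```r
# Illustrative sketch: repeat the noisy analysis `resample` times and report the
# per-SNP median, which dampens occasional extreme noise draws on random SNPs.
median_over_resamples <- function(noisy_gwas, resample = 10) {
  runs <- replicate(resample, noisy_gwas())   # matrix: SNPs x resamples
  apply(runs, 1, median)                      # per-SNP median across runs
}

# Example with a toy noisy analysis of 5 SNPs
toy_gwas <- function() rnorm(5, mean = 1:5, sd = 2)
median_over_resamples(toy_gwas, resample = 25)
```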

References

Dwork, Cynthia, Aaron Roth, et al. 2014. “The Algorithmic Foundations of Differential Privacy.” Found. Trends Theor. Comput. Sci. 9 (3-4): 211–407.
Rubinstein, Benjamin IP, and Francesco Aldà. 2017. “Pain-Free Random Differential Privacy with Sensitivity Sampling.” In International Conference on Machine Learning, 2950–59. PMLR.