15 Differential privacy
Differential privacy can be briefly described as a property of a randomized algorithm: by observing its output, one cannot tell whether the information of any single individual was used. This concept was introduced by Dwork, Roth, et al. (2014). Formally, it is referred to as \(\epsilon\)-differential privacy and, for any two datasets \(D_1\) and \(D_2\) differing in a single individual's record and any output \(S\), it is mathematically expressed as:
\[\Pr[A(D_1) = S] \leq \exp(\epsilon) \cdot \Pr[A(D_2) = S]\]
where \(A\) is a randomized algorithm that takes a dataset as input, and \(\epsilon\) is the privacy parameter that defines the level of privacy (the closer to 0, the stronger the privacy).
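As a brief illustration of what this bound means in practice (a worked example added here, not part of the cited work): with \(\epsilon = 0.1\), the probability of observing any given output can change by at most a factor of
\[\exp(0.1) \approx 1.105\]
when one individual's record is added to or removed from the dataset, so the output reveals very little about that individual.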
To implement differential privacy in our software, the Laplace mechanism has been used. This mechanism adds noise to the output of a function; the noise is drawn from a Laplace distribution with mean 0 and scale \(\Delta f/\epsilon\), where \(\Delta f\) is the \(l_1\) sensitivity defined by:
\[\Delta f = \max_{D_1, D_2} \| f(D_1) - f(D_2) \|_1\]
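To make the mechanism concrete, the following is a minimal sketch in Python (the function name `laplace_mechanism` and the bounded-mean example are illustrative assumptions, not the actual implementation used in our software):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return `true_value` perturbed with Laplace noise of mean 0
    and scale sensitivity / epsilon."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale, size=np.shape(true_value))

# Example: a mean over 100 records bounded in [0, 1]; changing one record
# changes the mean by at most 1/n, so the sensitivity is 1/n.
data = np.random.default_rng(1).uniform(0.0, 1.0, size=100)
private_mean = laplace_mechanism(data.mean(), sensitivity=1 / len(data), epsilon=0.5)
```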
The assessment of \(\Delta f\) can be particularly complex for certain functions (e.g. limma + voom); for that reason, a sampling method (Rubinstein and Aldà 2017) has been used to estimate \(\Delta f\). Since the number of resamples performed can affect the quality of the differential privacy guarantee, we allow the data owner to configure this parameter in the Opal server.
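The following is a much-simplified sketch of the general idea of sampling-based sensitivity estimation (it is not the exact procedure of Rubinstein and Aldà (2017), and the function name `estimate_sensitivity` is an illustrative assumption): neighbouring datasets differing in one record are generated repeatedly, the analysis function is evaluated on each pair, and the largest observed \(l_1\) distance between outputs is taken as the estimate.

```python
import numpy as np

def estimate_sensitivity(f, data, n_resamples=100, rng=None):
    """Monte Carlo estimate of the l1 sensitivity of `f` on `data`:
    evaluate `f` on resampled pairs of datasets differing in one record
    and return the largest l1 distance observed between the outputs."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    n = len(data)
    max_diff = 0.0
    for _ in range(n_resamples):
        idx = rng.choice(n, size=n, replace=True)       # resampled dataset D1
        neighbour = idx.copy()
        neighbour[rng.integers(n)] = rng.integers(n)    # D2 differs in a single record
        diff = np.sum(np.abs(np.asarray(f(data[idx])) - np.asarray(f(data[neighbour]))))
        max_diff = max(max_diff, diff)
    return max_diff
```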
For the specific case of fast GWAS, the differential privacy methods can severely affect random SNPs; for that reason, we included a resample argument in the function to perform the analysis resample times and take the median values.
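As a rough sketch of how such a resample argument could work (the names `run_gwas`, `genotypes`, and `phenotype` are hypothetical placeholders, not the actual API of the fast GWAS function), each run is perturbed independently and the per-SNP median of the noisy statistics is returned:

```python
import numpy as np

def dp_gwas_median(run_gwas, genotypes, phenotype, sensitivity, epsilon,
                   resample=5, rng=None):
    """Run the GWAS `resample` times, add Laplace noise to each run's
    per-SNP statistics, and return the element-wise median so that SNPs
    hit by occasional large noise draws are damped."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    noisy_runs = []
    for _ in range(resample):
        stats = np.asarray(run_gwas(genotypes, phenotype))  # one vector of per-SNP statistics
        noisy_runs.append(stats + rng.laplace(0.0, scale, size=stats.shape))
    return np.median(np.vstack(noisy_runs), axis=0)
```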