Client-side function that fits a Generalized Linear Model across vertically partitioned data using Block Coordinate Descent with encrypted labels. The response variable y only needs to exist on ONE server (the "label server"). Non-label servers compute gradient updates using y encrypted under the MHE collective public key, and only the aggregated p_k-length gradient is revealed via threshold decryption.
Usage
ds.vertGLM(
data_name,
y_var,
x_vars,
y_server = NULL,
family = "gaussian",
max_iter = 100,
tol = 1e-04,
lambda = 1e-04,
log_n = 12,
log_scale = 40,
verbose = TRUE,
datasources = NULL
)Arguments
- data_name
Character string. Name of the (aligned) data frame on each server.
- y_var
Character string. Name of the response variable (must exist on the label server specified by
y_server).- x_vars
A named list where each name corresponds to a server name and each element is a character vector of predictor variable names from that server.
- y_server
Character string. Name of the server holding the response variable. This server uses plaintext IRLS; all other servers use the encrypted gradient protocol.
- family
Character string. GLM family: "gaussian", "binomial", or "poisson". Default is "gaussian".
- max_iter
Integer. Maximum number of BCD iterations. Default is 100.
- tol
Numeric. Convergence tolerance on coefficient change. Default is 1e-4 (accounts for CKKS approximation noise).
- lambda
Numeric. L2 regularization parameter. Default is 1e-4.
- log_n
Integer. CKKS ring dimension parameter (12, 13, or 14). Default is 12 (2048 slots, supports up to 2048 observations).
- log_scale
Integer. CKKS scale parameter. Default is 40.
- verbose
Logical. Print progress messages. Default is TRUE.
- datasources
DataSHIELD connection object or list of connections. If NULL, uses all available connections.
Value
A list with class "ds.glm" containing:
coefficients: Named vector of coefficient estimates (on original scale, including intercept)iterations: Number of iterations until convergenceconverged: Logical indicating convergencefamily: Family usedn_obs: Number of observationsn_vars: Number of predictor variables (including intercept)lambda: Regularization parameter useddeviance: Residual deviance of the fitted modelnull_deviance: Null deviance (intercept-only model)pseudo_r2: McFadden's pseudo R-squaredaic: Akaike Information Criteriony_server: Name of the label servercall: The matched call
Details
Feature Standardization
Features are automatically standardized (centered and scaled) on each server before BCD to ensure fast convergence. For Gaussian family, the response is also standardized. Coefficients are transformed back to the original scale after convergence, and an intercept is computed.
Encrypted-Label BCD-IRLS Protocol
The response variable y resides on a single "label server". Non-label servers never see y in plaintext. The protocol proceeds as:
MHE Key Setup: All servers generate key shares and combine them into a Collective Public Key (CPK) with Galois keys.
Standardize: Each server standardizes its features.
Encrypt y: The label server encrypts (standardized) y under the CPK and distributes the ciphertext to non-label servers.
BCD Loop: For each iteration, each server updates its block of coefficients on the standardized scale.
Unstandardize: Coefficients are transformed back to the original scale and an intercept is computed.
Deviance: Computed on the label server using plaintext y and the final linear predictor (original scale).
References
van Kesteren, E.J. et al. (2019). Privacy-preserving generalized linear models using distributed block coordinate descent. arXiv:1911.03183.
Mouchet, C. et al. (2021). "Multiparty Homomorphic Encryption from Ring-Learning-With-Errors". Proceedings on Privacy Enhancing Technologies.
See also
ds.vertCor for correlation analysis,
ds.vertPCA for PCA analysis