dsVertClient • dsVertClient

Overview

dsVertClient is a client-side DataSHIELD package that enables privacy-preserving statistical analysis on vertically partitioned federated data. In vertical partitioning, different data sources hold different variables (columns) for the same set of observations (rows).

This package provides user-friendly functions for:

ECDH-PSI Record Alignment: Privacy-preserving record matching using P-256 elliptic curves (no dictionary attacks possible)
Correlation Analysis: Compute correlation matrices using Multiparty Homomorphic Encryption (MHE) with share-wrapped threshold decryption
Principal Component Analysis: Perform PCA from the MHE-based correlation matrix
Generalized Linear Models: Fit GLMs using IRLS with encrypted-label BCD and GLM secure routing (3 families)

Security Guarantees

Property	What it prevents
Dictionary attack resistance	Unlike SHA-256 hashing, the client CANNOT reverse-engineer patient IDs from the masked curve points (requires server’s secret scalar)
Scalar confidentiality	Each server’s P-256 random scalar stays on-server
Unlinkability (DDH)	The client CANNOT link single-masked points across servers
Blind relay	Client sees ONLY opaque encrypted blobs (X25519 + AES-256-GCM). All EC point exchanges are transport-encrypted server-to-server.
PSI Firewall	Server-side FSM enforces phase ordering and one-shot semantics per target, preventing OPRF oracle attacks
MITM-resistant mode	Optional pre-shared key pinning validates transport PKs against administrator-configured peers, detecting key substitution

Statistical Analysis (MHE-CKKS)

Property	What it prevents
Client privacy	The researcher CANNOT decrypt any individual data. Only aggregate statistics (correlations, coefficients) are revealed after all servers cooperate.
Server privacy	Each server’s raw data never leaves the server. Other servers only see encrypted ciphertexts (opaque).
Collusion resistance	Even K-1 colluding servers cannot decrypt without the K-th server’s key share. Full decryption requires ALL K servers.

Transport & Routing Security

Property	What it prevents
Share-wrapping	Client CANNOT see or manipulate partial decryption shares. Each share is transport-encrypted under the fusion server’s X25519 key before relay.
GLM secure routing	Client CANNOT see individual-level vectors (eta, mu, w, v). Only sees p_k-length coefficient vectors and opaque encrypted blobs.
Protocol Firewall	Only ciphertexts produced by authorized operations can be decrypted. Prevents decryption oracle attacks.

Installation

# Install from GitHub (install dsVert first on servers)
devtools::install_github("isglobal-brge/dsVertClient")

Quick Start

library(dsVertClient)
library(DSI)

# Connect to Opal/DataSHIELD servers
conns <- datashield.login(logindata)

# 1. Align records across servers using ECDH-PSI
ds.psiAlign("D", "patient_id", "D_aligned", datasources = conns)

# 2. Define which variables are on which server
variables <- list(
  hospital_A = c("age", "bmi"),
  hospital_B = c("glucose", "systolic_bp"),
  hospital_C = c("cholesterol", "hdl")
)

# 3. Compute correlation matrix (MHE with share-wrapped threshold decryption)
cor_result <- ds.vertCor("D_aligned", variables, datasources = conns)
print(cor_result)

# 4. Perform PCA
pca <- ds.vertPCA("D_aligned", variables, n_components = 3, datasources = conns)
print(pca)

# 5. Fit a GLM
model <- ds.vertGLM("D_aligned", "outcome", variables,
                    y_server = "hospital_B", family = "gaussian",
                    datasources = conns)
summary(model)

# Disconnect
datashield.logout(conns)

Client-Side Functions

Function	Description
`ds.psiAlign`	Align records across servers using ECDH-PSI
`ds.vertCor`	Compute cross-server correlation matrix via MHE share-wrapped threshold decryption
`ds.vertPCA`	Perform PCA from the MHE-based correlation matrix
`ds.vertGLM`	Fit Generalized Linear Models via encrypted-label BCD with GLM secure routing

Workflow: Align -> Analyze

 ┌──────────────────┐   ┌───────────────────────────┐
 │  ds.psiAlign     │──▶│        Analyze            │
 │  (ECDH-PSI)      │   │  ds.vertCor (MHE)         │
 │                  │   │  ds.vertPCA               │
 │                  │   │  ds.vertGLM (BCD + MHE)   │
 └──────────────────┘   └───────────────────────────┘

MHE Correlation: How It Works

The 6-phase threshold MHE protocol:

Key Generation: Each server generates its own secret key share and public key share. Party 0 creates the Common Reference Polynomial (CRP).
Key Combination: Public key shares are combined into a Collective Public Key (CPK). Encryption under the CPK requires ALL servers for decryption.
Encryption: Each server standardizes its data (Z-scores) and encrypts columns under the CPK.
Local Correlation: Within-server correlations are computed in plaintext.
Cross-Server Correlation (share-wrapped threshold decryption): For each server pair (A, B): server A computes Z_A * Enc(Z_B) homomorphically. Each server produces a partial decryption share, wrapped (transport-encrypted) under the fusion server’s X25519 public key. The client relays these opaque wrapped shares to the fusion server (party 0). The fusion server unwraps all shares, computes its own share, and returns only the final aggregate scalar. The client never sees raw partial shares.
Assembly: The full p x p correlation matrix is assembled from local and cross-server blocks.

GLM Secure Routing (Coordinator Model)

The GLM protocol uses a coordinator model so that individual-level vectors never pass through the client:

Role	Responsibilities
Label server (coordinator)	Runs the IRLS loop. Encrypts individual-level vectors (mu, w, v) under each non-label server’s public key and sends the ciphertexts through the client as opaque blobs.
Non-label servers	Decrypt locally, compute their gradient contribution (X_k’ W v), encrypt eta_k for the coordinator.
Client (blind relay)	Routes opaque encrypted blobs between servers. Sees only p_k-length coefficient vectors (public model output) and encrypted payloads it cannot decrypt.

This ensures the client acts as a transport layer with no access to observation-level information.

Supported GLM Families

Family	Link	Use Case
`gaussian`	Identity	Continuous outcomes (linear regression)
`binomial`	Logit	Binary outcomes (logistic regression)
`poisson`	Log	Count data

Chunked Transfer Protocol

DataSHIELD’s R expression parser limits the size of arguments that can be passed inline in call() expressions. Cryptographic objects (CKKS ciphertexts, EC points, key shares, transport-encrypted blobs) routinely exceed this limit. dsVertClient automatically splits all large payloads into 10 KB chunks via mheStoreBlobDS, which the server reassembles transparently. This chunking is applied uniformly across all protocols (PSI, MHE correlation, GLM) for any data that scales with the number of observations, variables, or servers — ensuring that no DataSHIELD call can overflow regardless of dataset size or number of parties.

Requirements

R >= 4.0.0
DSI package
jsonlite package
dsVert package installed on DataSHIELD servers (Opal/Rock)

Documentation

Validation Study: Full pipeline validation against centralized R

Authors

David Sarrat González (david.sarrat@isglobal.org)
Miron Banjac (miron.banjac@isglobal.org)
Juan R González (juanr.gonzalez@isglobal.org)

dsVertClient: DataSHIELD Client Functions for Vertically Partitioned Data