library(BigDataStatMeth)
library(rhdf5)
library(data.table)
getQRbyBlocks <- function(strdataset, file, m, center, scale, bcols, overwrt) {
# Parse dataset path into group and dataset components
# Example: "data/X" becomes group="data", dataset="X"
strgroup <- gsub("/.?$", "", strdataset)
strdataset <- gsub("^.*/", "", strdataset)
# Step 1: Normalize the dataset
# Centering removes column means, scaling divides by standard deviations
# This ensures numerical stability in QR computation
bdNormalize_hdf5(
filename = file,
group = strgroup,
dataset = strdataset,
bcenter = center,
bscale = scale,
overwrite = overwrt
)
# Step 2: Split matrix into m blocks by rows
# Each block gets approximately n/m rows
# Normalized data stored under NORMALIZED/, output goes to Step1/
bdSplit_matrix_hdf5(
filename = file,
group = paste0("NORMALIZED/", strgroup),
dataset = strdataset,
outgroup = paste0("Step1/", strdataset, "rows"),
nblocks = m,
bycols = bcols, # FALSE = split by rows (samples)
overwrite = overwrt
)
# Step 3: Apply QR to each block independently
# Get list of block datasets created by split
blocks <- bdgetDatasetsList_hdf5(file, paste0("Step1/", strdataset, "rows"))
# Apply QR decomposition to each block
# Creates both Q and R for each block
bdapply_Function_hdf5(
filename = file,
group = paste0("Step1/", strdataset, "rows"),
datasets = blocks,
outgroup = paste0("Step2/", strdataset, "rows"),
func = "QR",
overwrite = overwrt
)
# Step 4: Merge R matrices from all blocks
# QR gives Q (orthogonal) and R (upper triangular) for each block
# We need to stack all R matrices vertically
blocks_qr <- bdgetDatasetsList_hdf5(file, paste0("Step2/", strdataset, "rows"))
# Filter to get only R matrices (dataset names contain ".R")
r_matrices <- blocks_qr[which(blocks_qr %like% ".R")]
# Stack R matrices vertically (bind by rows)
bdBind_hdf5_datasets(
filename = file,
group = paste0("Step2/", strdataset, "rows"),
datasets = r_matrices,
outgroup = "Step3/merged",
outdataset = paste0(strdataset, "Rt"),
func = "bindRows",
overwrite = overwrt
)
# Step 5: Final QR on merged R matrices
# Taking QR of the stacked R matrices gives us components for final Q
bdapply_Function_hdf5(
filename = file,
group = "Step3/merged",
datasets = paste0(strdataset, "Rt"),
outgroup = "Step3/Final_QR",
func = "QR",
overwrite = overwrt
)
# Step 6: Split final Q matrix for block-wise multiplication
# We need to split this Q to multiply it with original Q blocks
bdSplit_matrix_hdf5(
filename = file,
group = "Step3/Final_QR",
dataset = paste0(strdataset, "Rt.Q"),
outgroup = "Step4/splitted",
nblocks = m,
bycols = bcols,
overwrite = overwrt
)
# Step 7: Multiply original Q blocks by final Q blocks
# This reconstructs the full Q matrix in block form
tmp <- bdgetDatasetsList_hdf5(file, "Step4/splitted")
Rt_Q_divide <- tmp[which(tmp %like% paste0(strdataset, "Rt.Q"))]
# Get original Q matrices from Step 2
q_matrices <- blocks_qr[which(blocks_qr %like% ".Q")]
# Multiply each original Q block by corresponding final Q block
bdapply_Function_hdf5(
filename = file,
group = paste0("Step2/", strdataset, "rows"),
datasets = q_matrices,
outgroup = "Step5",
func = "blockmult",
b_group = "Step4/splitted",
b_datasets = Rt_Q_divide,
overwrite = TRUE
)
# Step 8: Combine all Q blocks into final Q matrix
# Stack the multiplied Q blocks vertically
blocks_Q <- bdgetDatasetsList_hdf5(file, "Step5")
bdBind_hdf5_datasets(
filename = file,
group = "Step5",
datasets = blocks_Q[which(blocks_Q %like% paste0(strdataset, "."))],
outgroup = "Step6",
outdataset = paste0(strdataset, "Q"),
func = "bindRows",
overwrite = overwrt
)
cat("✓ QR decomposition complete for", strdataset, "\n")
}CCA Implementation in R
Building Canonical Correlation Analysis Using BigDataStatMeth R API
This page uses functions that have been deprecated following the major restructuring introduced in BigDataStatMeth v2.0.1. The new version replaces the procedural bd* API with a unified S3 interface based on HDF5Matrix objects. All code blocks are shown for reference but are not executed.
An updated version of this page will be published shortly
1 Overview
This document shows how to implement Canonical Correlation Analysis (CCA) from scratch using BigDataStatMeth’s R API. Rather than using a pre-built CCA function, you’ll see how to combine BigDataStatMeth’s building blocks—QR decomposition, matrix operations, SVD—to create a complete statistical method. This is the foundation for understanding how to develop your own statistical methods for big data.
The implementation you’ll see here is actual working code from the BDStatMethExamples package. It’s not pseudocode or simplified for teaching—it’s production code that handles real TCGA multi-omic data with thousands of samples and features. By studying this implementation, you learn both the mathematical algorithm and the practical engineering required to make complex statistical methods work on big data.
Think of this as a blueprint. CCA serves as the example, but the patterns you’ll learn—partitioning data, applying operations iteratively, combining results hierarchically, managing intermediate outputs—apply to any algorithm you might develop. Whether you’re implementing sparse PCA, regularized regression, factor analysis, or novel methods from research papers, this workflow shows you how to structure the implementation.
1.1 What You’ll Learn
By the end of this document, you will:
- Understand CCA’s mathematical algorithm and its implementation challenges
- Implement QR decomposition by blocks for matrices exceeding memory
- Combine intermediate QR results using hierarchical merging
- Compute canonical correlations from Q matrix cross-products
- Extract canonical coefficients by solving triangular systems
- Structure complex multi-step algorithms with clear function separation
- Manage intermediate results strategically in HDF5 files
- Create reusable patterns applicable to other statistical methods
2 CCA Algorithm Overview
2.1 Mathematical Foundation
Canonical Correlation Analysis finds linear combinations of features in two datasets that maximize correlation. Given matrices X (n × p) and Y (n × q), CCA seeks weight vectors α and β such that:
\max_{\alpha, \beta} \text{cor}(X\alpha, Y\beta)
subject to the constraint that the canonical variates have unit variance. This optimization problem has a classical solution through matrix decompositions.
The standard algorithm proceeds through these mathematical steps:
Step 1: QR decompositions
X = Q_X R_X, \quad Y = Q_Y R_Y
where Q matrices are orthonormal (n × p and n × q) and R matrices are upper triangular.
Step 2: Cross-product of Q matrices
M = Q_X^T Q_Y
This (p × q) matrix captures the relationship between the orthonormalized datasets.
Step 3: SVD of cross-product
M = U D V^T
where D contains singular values (the canonical correlations), and U, V are orthogonal matrices containing canonical directions.
Step 4: Canonical coefficients
\alpha = R_X^{-1} U, \quad \beta = R_Y^{-1} V
These transform original features to canonical space.
Step 5: Canonical variates
s_X = X\alpha, \quad s_Y = Y\beta
These are the actual canonical scores for each sample.
2.2 Why Block-Wise Processing?
The algorithm above assumes you can load X and Y into memory, compute their QR decompositions directly, and proceed. For small datasets (thousands of samples, hundreds of features), this works fine. For genomic or multi-omic data—50,000 samples × 500,000 SNPs, or 10,000 samples × 20,000 genes—loading the full matrices into RAM is impossible.
Block-wise processing solves this by:
- Partitioning each matrix into m blocks by rows (samples)
- Computing QR decomposition on each block independently
- Combining block QR results hierarchically to reconstruct the full QR
- Achieving identical mathematical results with manageable memory
The key insight is that QR decomposition of stacked matrices can be computed by taking QR of each block, stacking the R matrices, taking QR of that stack, then multiplying the original Q matrices by the new Q. This hierarchical approach is mathematically exact—you get the same final Q and R as if you’d computed QR on the full matrix, but you never load the full matrix into memory.
For a matrix X with 50,000 rows, partitioning into m=10 blocks means each block has 5,000 rows. If X has 500,000 columns, each block is 5,000 × 500,000—still large, but processable. The R matrix from each block QR is only 500,000 × 500,000 (or smaller if you compute economy QR). Stacking 10 such R matrices gives 5,000,000 × 500,000, and taking QR of that is manageable because it’s column-heavy, not row-heavy.
This is the pattern we’ll implement: partition → process blocks → merge hierarchically → reconstruct full result.
3 Step 1: QR Decomposition by Blocks
3.1 The getQRbyBlocks Function
This function is the algorithmic core of block-wise CCA. It takes a dataset stored in HDF5 and computes its QR decomposition by processing the data in m blocks. The function doesn’t return matrices to R—instead, it orchestrates a sequence of HDF5 operations, with intermediate results stored in the file under “Step1”, “Step2”, etc.
3.2 Understanding the Nine Steps
This function implements a carefully orchestrated sequence. Each step builds on previous results, and the order is mathematically necessary.
Step 1 (Normalization): Centering and scaling improve numerical stability. Without centering, QR decomposition can be numerically unstable if features have very different means. Without scaling, features with large variance dominate. For multi-omic CCA where X might be methylation (0-1 scale) and Y might be expression (0-10,000 scale), scaling is essential.
Step 2 (Partitioning): Splitting into m blocks creates manageable chunks. If your original matrix is 50,000 × 500,000 and won’t fit in RAM, m=10 gives you blocks of 5,000 × 500,000 that might fit. The choice of m balances memory usage (larger m = smaller blocks) against computational overhead (more blocks = more operations).
Step 3 (Block QR): Computing QR on each block is independent—blocks can be processed in parallel if desired. Each block’s QR gives Q_i (5,000 × 500,000 in economy form) and R_i (500,000 × 500,000). The Q_i matrices stay in their original locations; we’ll use them later. The R_i matrices move to the next step.
Step 4 (Merging R): Stacking all R_i matrices vertically creates a tall-skinny matrix. With m=10 blocks, stacking gives (10 × 500,000) × 500,000 = 5,000,000 × 500,000. This is column-heavy—many rows, but relatively few columns—which is much easier to handle than the original row-heavy matrix.
Step 5 (Final QR): Taking QR of the stacked R matrices gives Q_final and R_final. The R_final is the R component of the full matrix’s QR decomposition—this is exact, not approximate. The Q_final will be used to combine the original Q blocks.
Step 6 (Splitting Q_final): We need to multiply each original Q_i by the corresponding rows of Q_final. Splitting Q_final into m blocks aligns them with the original m blocks of Q.
Step 7 (Block Multiplication): For each block i, compute Q_i × Q_final_i. This gives the final Q matrix in block form. The mathematics: if X_i = Q_i R_i, and stacking R_i gives [R_1; R_2; …] = Q_final R_final, then X = [Q_1; Q_2; …] Q_final R_final. So the final Q is [Q_1 Q_final; Q_2 Q_final; …].
Step 8 (Combining Final Q): Stack the multiplied Q blocks vertically to reconstruct the full Q matrix. This final Q, combined with R_final from Step 5, gives you the complete QR decomposition.
Step 9 (Implicit - Storage): The function doesn’t return anything to R. Instead, results are stored in HDF5 at /Step6/[dataset]Q for Q and /Step3/Final_QR/[dataset]Rt.R for R. Subsequent functions know to look in these locations.
This multi-step process seems complex, but each step is simple. The complexity is in the orchestration, not the individual operations. And the payoff is that you can compute QR decomposition of matrices that would never fit in memory using a standard algorithm.
The mathematical justification for this approach:
If X_i = Q_i R_i for each block, then stacking gives:
\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{bmatrix} = \begin{bmatrix} Q_1 R_1 \\ Q_2 R_2 \\ \vdots \\ Q_m R_m \end{bmatrix}
Taking QR of the stacked R matrices:
\begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_m \end{bmatrix} = \tilde{Q} \tilde{R}
Then the full QR decomposition is:
X = \begin{bmatrix} Q_1 \\ Q_2 \\ \vdots \\ Q_m \end{bmatrix} \tilde{Q} \tilde{R}
The final Q matrix is the block-diagonal matrix of Q_i’s times the dense matrix \tilde{Q}. The final R is \tilde{R}. This is mathematically identical to QR on the full matrix X but computed in memory-friendly steps.
4 Step 2: Main CCA Function
4.1 Orchestrating the Complete Algorithm
Now that we have block-wise QR decomposition, we can implement complete CCA. The bdCCA_hdf5 function orchestrates all steps: QR for both X and Y, computing their cross-product, SVD to extract canonical correlations, solving for coefficients, and computing variates.
bdCCA_hdf5 <- function(filename, X, Y, m = 10,
bcenter = TRUE, bscale = FALSE, bycols = FALSE,
overwriteResults = FALSE, keepInteResults = FALSE) {
matrices <- c(X, Y)
# Step 1-8: Compute QR decomposition for both X and Y matrices
# This is the computationally intensive part - processes data in blocks
# Results stored in HDF5 under Step1/ through Step6/
sapply(
matrices,
getQRbyBlocks,
file = filename,
m = m,
center = bcenter,
scale = bscale,
bcols = bycols,
overwrt = overwriteResults
)
cat("✓ QR decompositions complete for both matrices\n")
# Step 9: Compute cross-product of Q matrices
# This matrix captures the canonical relationship between X and Y
# Q_X is (n × p), Q_Y is (n × q), so Q_X^T Q_Y is (p × q)
res <- bdCrossprod_hdf5(
filename = filename,
group = "Step6",
A = "XQ",
groupB = "Step6",
B = "YQ",
outdataset = "CrossProd_XQ_YQ",
outgroup = "Step7",
overwrite = TRUE
)
cat("✓ Cross-product computed\n")
# Step 10: SVD of cross-product
# Singular values are the canonical correlations
# U and V matrices contain canonical directions
res <- bdSVD_hdf5(
file = filename,
group = "Step7",
dataset = "CrossProd_XQ_YQ",
bcenter = FALSE, # Already centered in Step 1
bscale = FALSE,
k = 16, # Number of blocks for SVD computation
q = 2, # Two-level hierarchy for large-scale SVD
threads = 3,
overwrite = TRUE
)
cat("✓ SVD computed - canonical correlations extracted\n")
# Get dimensions needed for coefficient computation
# Need number of columns in X and Y to properly dimension coefficients
res <- sapply(matrices, bdgetDim_hdf5, filename = filename)
# Step 11: Compute canonical coefficients and variates
# Coefficients show which features contribute to canonical variates
# Variates are the actual canonical scores for each sample
writeCCAComponents_hdf5(filename, res[2, X], res[2, Y])
cat("✓ Canonical coefficients and variates saved\n")
# Optional: Remove intermediate results to save disk space
# Step1-Step7 contain temporary matrices used during computation
# Results/ contains final canonical components
if (keepInteResults == FALSE) {
sapply(paste0("Step", 1:7), function(x) {
invisible(bdRemove_hdf5_element(filename, element = x))
cat(" ", x, "removed\n")
})
cat("✓ Intermediate results cleaned up\n")
}
return(list(
filename = filename,
results_group = "Results"
))
}4.2 Function Flow and Design Decisions
This function is deceptively simple—only about 50 lines—but it orchestrates a complex multi-stage computation. Several design decisions make it work effectively.
Separation of concerns: The function doesn’t implement QR decomposition itself; it calls getQRbyBlocks for both matrices. This separation means if you want to optimize QR (try different block sizes, use different algorithms), you modify getQRbyBlocks without touching this orchestration function. Similarly, SVD computation is delegated to bdSVD_hdf5. The main function’s job is orchestration, not implementation.
Sequential dependency management: Steps must execute in order—you can’t compute cross-product before QR, can’t do SVD before cross-product, can’t solve for coefficients before SVD. The function structure makes this dependency explicit. Each step validates that its inputs exist before proceeding. If Step 3 fails, Step 4 never runs—preventing cascading errors from invalid intermediate results.
Memory efficiency through streaming: At no point does this function load the full X or Y matrices into R. All operations happen in HDF5, with only metadata (filenames, dimensions, group names) in R memory. Even when computing with datasets containing millions of elements, R’s memory usage stays constant. The HDF5 file grows as results accumulate, but that’s disk space, not RAM.
Parameter propagation: The m parameter (number of blocks) propagates to getQRbyBlocks, which propagates it to splitting and merging operations. This single parameter controls memory usage throughout. Want to use less memory? Increase m. Have more RAM available? Decrease m for speed. One parameter tunes the entire computation’s memory footprint.
Result management flexibility: The keepInteResults parameter lets users choose between disk space (delete intermediates) and debugging capability (keep them). During development, keep intermediate results to inspect where computations fail. In production, delete them to save space. This flexibility costs nothing—it’s just a flag controlling cleanup.
Error propagation: Each BigDataStatMeth function returns status. If bdNormalize_hdf5 fails, getQRbyBlocks doesn’t proceed to splitting. If getQRbyBlocks fails for X, Y’s QR never starts. Errors stop propagation immediately, preventing corrupted results from appearing as valid output.
The function’s apparent simplicity hides careful engineering. It’s not just “call these functions in order”—it’s “orchestrate a complex computation with clear dependencies, graceful error handling, flexible resource management, and predictable behavior.”
5 Step 3: Computing Canonical Coefficients
5.1 From SVD Results to Interpretable Components
The SVD gave us singular values (canonical correlations) and singular vectors (U and V). But these aren’t directly the canonical coefficients we want—they’re in the orthonormalized Q space. To get coefficients in terms of original features, we need to “undo” the QR transformation by solving triangular systems.
writeCCAComponents_hdf5 <- function(filename, ncolsX, ncolsY) {
if (!file.exists(filename)) {
stop("ERROR - File does not exist")
}
# Read all necessary components from HDF5
h5f <- H5Fopen(filename)
# Extract Q matrices (first ncolsX × ncolsX and ncolsY × ncolsY portions)
# These are the orthonormal bases from QR decomposition
XQ <- h5f$Step6$XQ[1:ncolsX, 1:ncolsX]
YQ <- h5f$Step6$YQ[1:ncolsY, 1:ncolsY]
# Extract R matrices from final QR
# These are upper triangular matrices from the decomposition
XR <- h5f$Step3$Final_QR$XRt.R
YR <- h5f$Step3$Final_QR$YRt.R
# Extract SVD results
# d: singular values (canonical correlations)
# u: left singular vectors (directions in X space)
# v: right singular vectors (directions in Y space)
d <- h5f$SVD$CrossProd_XQ_YQ$d
u <- h5f$SVD$CrossProd_XQ_YQ$u
v <- h5f$SVD$CrossProd_XQ_YQ$v
# Extract centering parameters (means subtracted during normalization)
xcenter <- h5f$NORMALIZED$data$mean.X
ycenter <- h5f$NORMALIZED$data$mean.Y
# Extract feature names
x.names <- h5f$data$.X_dimnames$`2`
y.names <- h5f$data$.Y_dimnames$`2`
h5closeAll()
# Reconstruct QR in compact form
# Q and R are stored separately; combine into single triangular matrix
# Lower triangle = Q (without diagonal), Upper triangle = R (with diagonal)
XR[lower.tri(XR, diag = FALSE)] <- 0 # Zero below diagonal
XQ[upper.tri(XQ, diag = TRUE)] <- 0 # Zero on and above diagonal
XQR <- XR + XQ # Combined representation
YR[lower.tri(YR, diag = FALSE)] <- 0
YQ[upper.tri(YQ, diag = TRUE)] <- 0
YQR <- YR + YQ
# Solve for canonical coefficients
# Since X = Q_X R_X and canonical directions are U in Q space,
# coefficients in original space are: α = R_X^(-1) U
# bdSolve efficiently solves the triangular system R_X α = U
xcoef <- bdSolve(XQR, u)
ycoef <- bdSolve(YQR, v)
# Assign feature names to coefficients
rownames(xcoef) <- as.matrix(x.names)
rownames(ycoef) <- as.matrix(y.names)
# Store coefficients in HDF5
bdCreate_hdf5_matrix(
filename,
object = xcoef,
group = "Results",
dataset = "xcoef",
overwriteDataset = TRUE
)
bdCreate_hdf5_matrix(
filename = filename,
object = ycoef,
group = "Results",
dataset = "ycoef",
overwriteDataset = TRUE
)
# Store canonical correlations (diagonal of D from SVD)
bdCreate_hdf5_matrix(
filename,
object = as.matrix(diag(d)),
group = "Results",
dataset = "cor",
overwriteDataset = TRUE
)
# Store centering parameters for reconstructing original scale
bdCreate_hdf5_matrix(
filename,
object = xcenter,
group = "Results",
dataset = "xcenter",
overwriteDataset = TRUE
)
bdCreate_hdf5_matrix(
filename,
object = ycenter,
group = "Results",
dataset = "ycenter",
overwriteDataset = TRUE
)
# Compute canonical variates (sample scores)
# Multiply original data by coefficients: s_X = X α
bdblockmult_hdf5(
filename = filename,
group = "data",
A = "X",
B = "xcoef",
groupB = "Results",
outgroup = "Results",
outdataset = "xscores",
overwrite = TRUE
)
bdblockmult_hdf5(
filename,
group = "data",
A = "Y",
B = "ycoef",
groupB = "Results",
outgroup = "Results",
outdataset = "yscores",
overwrite = TRUE
)
cat("✓ All CCA components saved to Results/ group\n")
}5.2 Why We Need This Step
The SVD gave us U and V in the Q space—the orthonormalized coordinate system created by QR decomposition. But users want coefficients in terms of original features: “which genes contribute to this canonical variate?” not “which Q-space dimensions contribute?”
The transformation back requires solving a triangular system. We have X = Q_X R_X, and in Q space the canonical direction is U. To find the direction α in original X space such that Xα corresponds to Q_X U, we solve:
R_X \alpha = U
Since R_X is upper triangular (from QR decomposition), this system solves efficiently via back-substitution. The bdSolve function implements this triangular solve, which is O(n²) rather than O(n³) for general matrix inversion.
5.3 Understanding the Results Structure
After this function completes, the HDF5 file contains a /Results/ group with these datasets:
- xcoef: (p × k) matrix of X feature loadings for k canonical components
- ycoef: (q × k) matrix of Y feature loadings
- cor: (k × 1) vector of canonical correlations
- xcenter, ycenter: Mean vectors used for centering
- xscores: (n × k) matrix of X canonical variates (sample scores)
- yscores: (n × k) matrix of Y canonical variates
These components support different downstream analyses:
Feature interpretation: Sort xcoef and ycoef by absolute value to identify which original features (genes, CpG sites) contribute most to each canonical component. High-loading features drive the canonical relationship.
Sample visualization: Plot xscores vs yscores, colored by sample metadata (cancer type, treatment group). This shows how samples separate in canonical space.
Validation: Compute correlation between xscores and yscores columns—should match the canonical correlations in cor.
Downstream modeling: Use xscores and yscores as features in classification, regression, or survival models. They’re lower-dimensional representations capturing coordinated variation.
The centering parameters let you reconstruct original-scale predictions if needed, though typically canonical variates are used directly.
6 Example Usage
Here’s how to use the complete CCA implementation on real data:
library(BigDataStatMeth)
library(rhdf5)
# Assume we have multi-omic data already in HDF5
# X: methylation (n × p)
# Y: expression (n × q)
cca_file <- "multi_omic_cca.hdf5"
# Perform CCA with 4 blocks, centering but not scaling
result <- bdCCA_hdf5(
filename = cca_file,
X = "data/X",
Y = "data/Y",
m = 4,
bcenter = TRUE,
bscale = FALSE,
overwriteResults = TRUE,
keepInteResults = FALSE # Clean up intermediate steps
)
# Examine results
h5ls(cca_file, recursive = TRUE)
# Load canonical correlations
h5f <- H5Fopen(cca_file)
canonical_cors <- diag(h5f$Results$cor)
cat("Canonical correlations:\n")
print(canonical_cors[1:5])
# Load feature coefficients
xcoef <- h5f$Results$xcoef
ycoef <- h5f$Results$ycoef
# Identify top features for first canonical component
x_top <- order(abs(xcoef[, 1]), decreasing = TRUE)[1:10]
y_top <- order(abs(ycoef[, 1]), decreasing = TRUE)[1:10]
cat("\nTop 10 methylation sites:\n")
print(xcoef[x_top, 1])
cat("\nTop 10 genes:\n")
print(ycoef[y_top, 1])
# Load sample scores for visualization
xscores <- h5f$Results$xscores
yscores <- h5f$Results$yscores
h5closeAll()
# Verify: correlation of scores should match canonical correlation
cor_check <- cor(xscores[, 1], yscores[, 1])
cat("\nVerification - CC1:", canonical_cors[1],
"| Computed:", cor_check, "\n")This example shows the typical workflow: run CCA, examine canonical correlations to see how strong the relationships are, inspect feature loadings to understand what drives the relationships, and extract sample scores for visualization or downstream modeling.
7 Interactive Exercise
7.1 Practice: Extending the CCA Implementation
Understanding comes from experimentation. Try these modifications:
# Exercise 1: Vary the number of blocks
# Run CCA with m = 2, 4, 8, 16
# Time each execution
# How does block count affect speed and memory?
system.time({
bdCCA_hdf5(cca_file, "data/X", "data/Y", m = 2,
overwriteResults = TRUE, keepInteResults = FALSE)
})
system.time({
bdCCA_hdf5(cca_file, "data/X", "data/Y", m = 8,
overwriteResults = TRUE, keepInteResults = FALSE)
})
# Exercise 2: Inspect intermediate results
# Run with keepInteResults = TRUE
# Examine Step1 through Step7 in HDF5 file
# What's the size and structure of each step?
bdCCA_hdf5(cca_file, "data/X", "data/Y", m = 4,
keepInteResults = TRUE)
h5ls(cca_file, recursive = TRUE)
# Exercise 3: Implement sparse CCA
# Modify writeCCAComponents_hdf5 to:
# - Apply L1 penalty to coefficients
# - Threshold small coefficients to zero
# - Re-normalize remaining coefficients
# How does sparsity affect interpretation?
# Exercise 4: Add cross-validation
# Split samples into train/test
# Compute CCA on train set
# Project test samples onto canonical variates
# Evaluate stability of canonical correlations1. Algorithm Design: - Why 9 steps in getQRbyBlocks, not fewer? - Could you compute CCA without QR decomposition? - What if you swapped SVD for eigendecomposition?
2. Performance Tuning: - With 100GB data, what m value would you try first? - When does increasing m stop helping? - Would parallel processing help? Which steps?
3. Numerical Stability: - Why does normalization improve QR stability? - What happens if R matrices become singular? - How would you detect and handle ill-conditioning?
4. Extension Possibilities: - How would you implement regularized CCA? - Could you handle >2 datasets (multi-block CCA)? - How would you add permutation testing?
5. Production Deployment: - What validation checks would you add? - How would you handle computation failures? - What logging would be essential?
These questions push beyond the implementation to understand design trade-offs and extension possibilities.
8 Key Takeaways
Let’s consolidate what you’ve learned about implementing statistical methods using BigDataStatMeth’s R API.
8.1 Essential Concepts
Block-wise algorithms decompose computation into manageable steps. The QR decomposition by blocks shows the fundamental pattern: partition data, process each partition independently, combine results hierarchically. This isn’t an approximation or heuristic—it’s mathematically exact. The block-wise QR produces identical results to computing QR on the full matrix, but with dramatically reduced memory requirements. This pattern applies beyond QR: eigendecomposition, SVD, matrix factorizations, even iterative algorithms like EM can be adapted to block-wise processing. The key insight is identifying which operations can be parallelized (processing blocks) and which must be sequential (combining results).
HDF5 acts as persistent working memory. Traditional R functions load data, compute, return results—everything happens in RAM. BigDataStatMeth functions read from HDF5, compute, write results back to HDF5, return metadata. The HDF5 file becomes working memory that persists between function calls and exceeds RAM capacity. Step1 through Step7 in getQRbyBlocks aren’t temporary variables—they’re persistent intermediate results that subsequent functions access directly. This changes how you think about algorithms: instead of “what fits in memory,” you ask “what’s the optimal sequence of disk operations.” The HDF5 file documents the computation’s progression—if it fails at Step 5, Steps 1-4 remain intact for debugging.
Hierarchical combination preserves mathematical correctness. Combining block QR results through “stack R matrices, take QR, multiply Q matrices” seems complicated. Why not just concatenate the final Q blocks? Because that wouldn’t produce the correct overall QR decomposition. The hierarchical approach is derived from the QR decomposition’s mathematical properties—specifically, that QR of a vertically stacked matrix can be computed from the QR decompositions of the blocks. This isn’t algorithmic cleverness; it’s mathematical necessity. The pattern generalizes: whenever you break a matrix operation into blocks, you need to understand the mathematical relationship between block results and the full result. Sometimes it’s simple (addition, subtraction), sometimes it requires hierarchical combining (QR, SVD), sometimes it requires iteration (eigenvalue methods).
Normalization isn’t preprocessing—it’s algorithmic necessity. The bcenter and bscale parameters aren’t optional niceties for cleaner data. They’re essential for numerical stability in QR decomposition. Without centering, features with large means can cause catastrophic cancellation in floating-point arithmetic. Without scaling, features with large variance dominate the decomposition, and features with small variance contribute numerical noise rather than signal. For CCA across multi-omic datasets—methylation (0-1 scale), expression (0-10,000 scale)—scaling isn’t just recommended, it’s mandatory. The algorithm will run without it, produce results, and those results will be numerically garbage. This is true for most decomposition methods: normalization isn’t data cleaning, it’s numerical stability.
Intermediate result management balances disk space and debugging. The keepInteResults parameter exemplifies a design decision in all complex algorithms: clean up as you go (save disk space, hide complexity) or keep intermediates (enable debugging, allow inspection). During development, keep everything—if Step 7 fails, you need Step 6 results to diagnose why. In production, delete intermediates—a CCA pipeline that processes 1000 datasets doesn’t need to keep Step1 through Step7 for all of them. The implementation makes this choice explicit and user-controllable. Good algorithm design anticipates both use cases. Consider adding timestamps to intermediate results, logging which steps completed, even storing parameter values that generated each result. These cost little disk space but dramatically improve debugging.
Function structure mirrors mathematical algorithm structure. The separation between getQRbyBlocks, bdCCA_hdf5, and writeCCAComponents_hdf5 isn’t arbitrary. Each function corresponds to a conceptual step in the CCA algorithm: compute QR, compute SVD of cross-product, solve for coefficients. This mapping between mathematical steps and code functions makes the implementation easier to understand, test, and modify. If you want to try a different SVD algorithm, you modify the bdSVD_hdf5 call without touching QR computation. If you want to add regularization, you modify writeCCAComponents_hdf5 to apply penalties when solving for coefficients. Clear separation of concerns—each function does one mathematical thing well—enables compositional algorithm development.
8.2 When to Implement in R vs C++
Understanding when to use R versus when to move to C++ guides efficient development.
✅ Implement in R when:
Prototyping new methods - R’s interactive development, immediate feedback, and high-level syntax make it ideal for trying new ideas. Get the algorithm working in R, validate correctness, optimize the mathematical approach. Only then consider C++ for performance.
Orchestrating existing functions - If your method combines BigDataStatMeth functions like bdSVD_hdf5, bdCrossprod_hdf5, etc., R orchestration is fine. These functions already use compiled C++ code—R just sequences the calls. The overhead of R function calls is negligible compared to the actual computations.
Methods used occasionally - If you run this analysis once per month, R performance is adequate. A method that takes 10 minutes in R versus 3 minutes in C++ doesn’t justify C++ development time for infrequent use.
Leveraging R’s statistical ecosystem - If your method needs permutation testing, cross-validation, statistical tests, or model fitting, R’s packages provide these capabilities. Reimplementing in C++ would be massive effort for marginal performance gain.
✅ Consider C++ when:
Performance bottlenecks identified - Profile your R code. If one specific operation consumes most of the runtime, rewrite that operation in C++, not the entire method. Surgical optimization is more efficient than wholesale rewriting.
Production deployment at scale - If this method will run thousands of times in production pipelines, C++ performance gains accumulate. A 50% speedup saves hours daily in high-throughput settings.
Custom algorithms not in BigDataStatMeth - If you’re implementing novel decomposition methods, custom optimization algorithms, or specialized linear algebra, C++ gives you control. You’re not constrained to BigDataStatMeth’s provided operations.
Real-time or interactive requirements - If users interact with the method (parameter changes trigger recomputation), sub-second response requires C++. R’s several-second latency is acceptable for batch processing but not interactive exploration.
The key principle: start in R, optimize selectively in C++ only when profiling shows clear bottlenecks and usage patterns justify development effort. Most statistical method implementations belong entirely in R, with BigDataStatMeth functions handling the computationally intensive matrix operations.
9 Next Steps
Explore C++ implementation:
- CCA Implementation in C++ - Same algorithm, performance optimization
Apply the patterns:
- Implement sparse CCA with L1 regularization
- Build regularized CCA for high-dimensional settings
- Develop partial least squares (PLS) using similar structure
- Create custom decomposition methods
Understand the underlying functions:
- Study BigDataStatMeth’s block-wise SVD implementation
- Examine how bdCrossprod_hdf5 handles large matrices
- Learn HDF5 dataset management strategies
Extend for production:
- Add comprehensive error handling
- Implement parameter validation
- Create progress reporting
- Build result verification functions