CCA Implementation in R

Building Canonical Correlation Analysis Using BigDataStatMeth R API

1 Overview

This document shows how to implement Canonical Correlation Analysis (CCA) from scratch using BigDataStatMeth’s R API. Rather than using a pre-built CCA function, you’ll see how to combine BigDataStatMeth’s building blocks—QR decomposition, matrix operations, SVD—to create a complete statistical method.

Think of this as a blueprint for developing your own statistical methods. CCA serves as the example, but the patterns you’ll learn apply to any algorithm: partition data into blocks, apply operations iteratively, combine results hierarchically. This is how you extend BigDataStatMeth beyond its built-in functions to implement novel statistical methods for big data.

1.1 What You’ll Learn

By the end of this document, you will:

  • Understand CCA’s mathematical algorithm and why block-wise processing helps
  • Implement QR decomposition by blocks for large matrices
  • Combine intermediate QR results hierarchically
  • Compute canonical correlations from Q matrices
  • Extract canonical coefficients and variates
  • Structure complex multi-step algorithms in R
  • Manage intermediate results in HDF5 files
  • Create reusable functions for new statistical methods

2 CCA Algorithm Overview

2.1 Mathematical Foundation

Canonical Correlation Analysis finds linear combinations of features in X and Y that maximize correlation:

\max_{\alpha, \beta} \text{cor}(X\alpha, Y\beta)

The standard algorithm:

  1. Compute QR decompositions: X = Q_X R_X, Y = Q_Y R_Y
  2. Compute cross-product: M = Q_X^T Q_Y
  3. SVD of cross-product: M = U D V^T
  4. Solve for canonical coefficients: \alpha = R_X^{-1} U, \beta = R_Y^{-1} V
  5. Compute canonical variates: s_X = X\alpha, s_Y = Y\beta
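
To make these steps concrete, here is a minimal in-memory sketch using base R's qr(), svd(), and backsolve() (variable names are illustrative; this is only practical when the data fit in memory, unlike the block-wise HDF5 version developed below):

set.seed(1)
X <- scale(matrix(rnorm(100 * 5), 100, 5), center = TRUE, scale = FALSE)
Y <- scale(matrix(rnorm(100 * 4), 100, 4), center = TRUE, scale = FALSE)

qx <- qr(X); qy <- qr(Y)                 # Step 1: QR decompositions
M  <- crossprod(qr.Q(qx), qr.Q(qy))      # Step 2: M = Q_X^T Q_Y
sv <- svd(M)                             # Step 3: sv$d are the canonical correlations
alpha <- backsolve(qr.R(qx), sv$u)       # Step 4: alpha = R_X^{-1} U
beta  <- backsolve(qr.R(qy), sv$v)       #         beta  = R_Y^{-1} V
sX <- X %*% alpha                        # Step 5: canonical variates
sY <- Y %*% beta
round(sv$d, 3)                           # compare with cancor(X, Y)$cor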

2.2 Why Block-Wise Processing?

For large matrices (e.g., 50,000 samples × 500,000 genomic features), computing QR decomposition directly requires massive RAM. Block-wise processing:

  • Partitions X into m blocks by rows
  • Computes QR on each block separately (manageable size)
  • Combines block QR results hierarchically
  • Final result identical to full QR, but memory-friendly

3 Step 1: QR Decomposition by Blocks

3.1 The getQRbyBlocks Function

This is the core building block. It computes QR decomposition of a matrix stored in HDF5 by processing it in blocks:

library(BigDataStatMeth)
library(rhdf5)
library(data.table)  # provides the %like% operator used below

getQRbyBlocks <- function(strdataset, file, m, center, scale, bcols, overwrt) {
  
  # Parse dataset path (group/dataset)
  strgroup <- gsub("/[^/]*$", "", strdataset)
  strdataset <- gsub("^.*/", "", strdataset)
  
  # Step 1: Normalize the dataset
  # Centering and scaling ensure numerical stability
  bdNormalize_hdf5(
    filename = file,
    group = strgroup,
    dataset = strdataset,
    bcenter = center,
    bscale = scale,
    overwrite = overwrt
  )
  
  # Step 2: Split matrix into m blocks
  # Process by rows (samples) - each block gets ~n/m rows
  bdSplit_matrix_hdf5(
    filename = file,
    group = paste0("NORMALIZED/", strgroup),
    dataset = strdataset,
    outgroup = paste0("Step1/", strdataset, "rows"),
    nblocks = m,
    bycols = bcols,  # FALSE = split by rows
    overwrite = overwrt
  )
  
  # Step 3: Apply QR to each block independently
  blocks <- bdgetDatasetsList_hdf5(file, paste0("Step1/", strdataset, "rows"))
  
  bdapply_Function_hdf5(
    filename = file,
    group = paste0("Step1/", strdataset, "rows"),
    datasets = blocks,
    outgroup = paste0("Step2/", strdataset, "rows"),
    func = "QR",  # Apply QR decomposition
    overwrite = overwrt
  )
  
  # Step 4: Merge R matrices from all blocks
  # QR gives Q (orthogonal) and R (upper triangular)
  # We need to combine all R matrices
  blocks_qr <- bdgetDatasetsList_hdf5(file, paste0("Step2/", strdataset, "rows"))
  
  # Filter to get only R matrices (not Q)
  r_matrices <- blocks_qr[which(blocks_qr %like% ".R")]
  
  bdBind_hdf5_datasets(
    filename = file,
    group = paste0("Step2/", strdataset, "rows"),
    datasets = r_matrices,
    outgroup = "Step3/merged",
    outdataset = paste0(strdataset, "Rt"),
    func = "bindRows",  # Stack R matrices vertically
    overwrite = overwrt
  )
  
  # Step 5: Final QR on merged R matrices
  # This gives us the overall R matrix
  bdapply_Function_hdf5(
    filename = file,
    group = "Step3/merged",
    datasets = paste0(strdataset, "Rt"),
    outgroup = "Step3/Final_QR",
    func = "QR",
    overwrite = overwrt
  )
  
  # Step 6: Split final Q for block-wise multiplication
  bdSplit_matrix_hdf5(
    filename = file,
    group = "Step3/Final_QR",
    dataset = paste0(strdataset, "Rt.Q"),
    outgroup = "Step4/splitted",
    nblocks = m,
    bycols = bcols,
    overwrite = overwrt
  )
  
  # Step 7: Multiply original Q blocks by final Q blocks
  # This reconstructs the full Q matrix in blocks
  tmp <- bdgetDatasetsList_hdf5(file, "Step4/splitted")
  Rt_Q_divide <- tmp[which(tmp %like% paste0(strdataset, "Rt.Q"))]
  
  q_matrices <- blocks_qr[which(blocks_qr %like% ".Q")]
  
  bdapply_Function_hdf5(
    filename = file,
    group = paste0("Step2/", strdataset, "rows"),
    datasets = q_matrices,
    outgroup = "Step5",
    func = "blockmult",
    b_group = "Step4/splitted",
    b_datasets = Rt_Q_divide,
    overwrite = TRUE
  )
  
  # Step 8: Combine all Q blocks into final Q matrix
  blocks_Q <- bdgetDatasetsList_hdf5(file, "Step5")
  
  bdBind_hdf5_datasets(
    filename = file,
    group = "Step5",
    datasets = blocks_Q[which(blocks_Q %like% paste0(strdataset, "."))],
    outgroup = "Step6",
    outdataset = paste0(strdataset, "Q"),
    func = "bindRows",
    overwrite = overwrt
  )
  
  cat("✓ QR decomposition complete for", strdataset, "\n")
}
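
As a usage sketch, assuming an HDF5 file that already contains a data/X dataset (such as the example file created in Section 6), the function would be called on a single dataset path:

# Illustrative call for one matrix; arguments mirror those passed by bdCCA_hdf5 below
getQRbyBlocks("data/X", file = "example_cca.hdf5", m = 4,
              center = TRUE, scale = FALSE, bcols = FALSE, overwrt = TRUE)
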
Note: Why This Multi-Step Process?

Each block gets QR: X_i = Q_i R_i

Stacking all R_i and taking QR again gives: [R_1; R_2; ...; R_m] = \tilde{Q} \tilde{R}

Splitting \tilde{Q} into row blocks \tilde{Q}_1, ..., \tilde{Q}_m and stacking the products gives the final Q: Q = [Q_1 \tilde{Q}_1; Q_2 \tilde{Q}_2; ...; Q_m \tilde{Q}_m]

This hierarchical approach is mathematically equivalent to QR on the full matrix but processes data in memory-friendly chunks.
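
A tiny base-R check of this equivalence on an in-memory matrix split into two row blocks (illustrative only; the HDF5 functions above perform the same steps at scale):

set.seed(2)
A      <- matrix(rnorm(40 * 3), 40, 3)
blocks <- list(A[1:20, ], A[21:40, ])           # split by rows into 2 blocks
qr_b   <- lapply(blocks, qr)                    # QR of each block
Rstack <- do.call(rbind, lapply(qr_b, qr.R))    # stack the R_i (6 x 3)
qr_t   <- qr(Rstack)                            # QR of the stack: Q~ R~
Qt     <- qr.Q(qr_t); Rt <- qr.R(qr_t)
Q      <- rbind(qr.Q(qr_b[[1]]) %*% Qt[1:3, ],  # Q_i %*% Q~_i, stacked
                qr.Q(qr_b[[2]]) %*% Qt[4:6, ])
max(abs(Q %*% Rt - A))                          # ~ 1e-15: reproduces A, like a single QR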


4 Step 2: Main CCA Function

Now we combine everything into the complete CCA implementation:

bdCCA_hdf5 <- function(filename, X, Y, m = 10,
                       bcenter = TRUE, bscale = FALSE, bycols = FALSE,
                       overwriteResults = FALSE, keepInteResults = FALSE) {
  
  matrices <- c(X, Y)
  
  # Step 1-6: Compute QR for both X and Y
  # This is the computationally intensive part
  sapply(
    matrices,
    getQRbyBlocks,
    file = filename,
    m = m,
    center = bcenter,
    scale = bscale,
    bcols = bycols,
    overwrt = overwriteResults
  )
  
  cat("✓ QR decompositions complete for both matrices\n")
  
  # Step 7: Compute cross-product of Q matrices
  # This finds the canonical space
  res <- bdCrossprod_hdf5(
    filename = filename,
    group = "Step6",
    A = "XQ",
    groupB = "Step6",
    B = "YQ",
    outdataset = "CrossProd_XQ_YQ",
    outgroup = "Step7",
    overwrite = TRUE
  )
  
  cat("✓ Cross-product computed\n")
  
  # Step 8: SVD of cross-product
  # Singular values = canonical correlations
  # U, V = canonical directions
  res <- bdSVD_hdf5(
    file = filename,
    group = "Step7",
    dataset = "CrossProd_XQ_YQ",
    bcenter = FALSE,  # Already centered
    bscale = FALSE,
    k = 16,  # Process in blocks
    q = 2,   # Two-level hierarchy
    threads = 3,
    overwrite = TRUE
  )
  
  cat("✓ SVD computed - canonical correlations extracted\n")
  
  # Get dimensions for coefficient computation
  res <- sapply(matrices, bdgetDim_hdf5, filename = filename)
  
  # Step 9: Compute and save canonical coefficients
  writeCCAComponents_hdf5(filename, res[2, X], res[2, Y])
  
  cat("✓ Canonical coefficients and variates saved\n")
  
  # Cleanup intermediate results if requested
  if (keepInteResults == FALSE) {
    sapply(paste0("Step", 1:7), function(x) {
      invisible(bdRemove_hdf5_element(filename, element = x))
      cat("  ", x, "removed\n")
    })
    cat("✓ Intermediate results cleaned up\n")
  }
  
  return(list(
    filename = filename,
    results_group = "Results"
  ))
}

5 Step 3: Computing Canonical Coefficients

The final step converts SVD results into interpretable canonical components:

writeCCAComponents_hdf5 <- function(filename, ncolsX, ncolsY) {
  
  if (!file.exists(filename)) {
    stop("ERROR - File does not exist")
  }
  
  # Read all necessary components from HDF5
  h5f <- H5Fopen(filename)
  
  # Get first ncolsX × ncolsX and ncolsY × ncolsY from Q matrices
  XQ <- h5f$Step6$XQ[1:ncolsX, 1:ncolsX]
  YQ <- h5f$Step6$YQ[1:ncolsY, 1:ncolsY]
  
  # Get R matrices from final QR
  XR <- h5f$Step3$Final_QR$XRt.R
  YR <- h5f$Step3$Final_QR$YRt.R
  
  # Get SVD results (canonical correlations and directions)
  d <- h5f$SVD$CrossProd_XQ_YQ$d
  u <- h5f$SVD$CrossProd_XQ_YQ$u
  v <- h5f$SVD$CrossProd_XQ_YQ$v
  
  # Get centering values
  xcenter <- h5f$NORMALIZED$data$mean.X
  ycenter <- h5f$NORMALIZED$data$mean.Y
  
  # Get feature names
  x_names <- h5f$data$.X_dimnames$`2`
  y_names <- h5f$data$.Y_dimnames$`2`
  
  h5closeAll()
  
  # Reconstruct compact QR representation
  # Q and R are stored separately, combine them
  XR[lower.tri(XR, diag = FALSE)] <- 0  # R is upper triangular
  XQ[upper.tri(XQ, diag = TRUE)] <- 0   # Q lower part
  XQR <- XR + XQ
  
  YR[lower.tri(YR, diag = FALSE)] <- 0
  YQ[upper.tri(YQ, diag = TRUE)] <- 0
  YQR <- YR + YQ
  
  # Solve for canonical coefficients
  # X canonical coefficients: solve XQR * xcoef = u
  xcoef <- bdSolve(XQR, u)
  
  # Y canonical coefficients: solve YQR * ycoef = v
  ycoef <- bdSolve(YQR, v)
  
  # Add feature names
  rownames(xcoef) <- as.matrix(x_names)
  rownames(ycoef) <- as.matrix(y_names)
  
  # Save canonical coefficients to HDF5
  bdCreate_hdf5_matrix(
    filename,
    object = xcoef,
    group = "Results",
    dataset = "xcoef",
    overwriteDataset = TRUE
  )
  
  bdCreate_hdf5_matrix(
    filename = filename,
    object = ycoef,
    group = "Results",
    dataset = "ycoef",
    overwriteDataset = TRUE
  )
  
  # Save canonical correlations
  bdCreate_hdf5_matrix(
    filename,
    object = as.matrix(diag(d)),
    group = "Results",
    dataset = "cor",
    overwriteDataset = TRUE
  )
  
  # Save centering values (needed for new data projection)
  bdCreate_hdf5_matrix(
    filename,
    object = xcenter,
    group = "Results",
    dataset = "xcenter",
    overwriteDataset = TRUE
  )
  
  bdCreate_hdf5_matrix(
    filename,
    object = ycenter,
    group = "Results",
    dataset = "ycenter",
    overwriteDataset = TRUE
  )
  
  # Compute canonical variates (scores) for samples
  # X scores = X * xcoef
  bdblockmult_hdf5(
    filename = filename,
    group = "data",
    A = "X",
    B = "xcoef",
    groupB = "Results",
    outgroup = "Results",
    outdataset = "xscores",
    overwrite = TRUE
  )
  
  # Y scores = Y * ycoef
  bdblockmult_hdf5(
    filename,
    group = "data",
    A = "Y",
    B = "ycoef",
    groupB = "Results",
    outgroup = "Results",
    outdataset = "yscores",
    overwrite = TRUE
  )
  
  cat("✓ All CCA components saved to /Results/\n")
}

Important: Understanding the Coefficients
  • xcoef, ycoef: Feature loadings - which features contribute to each canonical variate
  • cor: Canonical correlations - strength of relationship for each component
  • xscores, yscores: Sample scores on canonical variates
  • xcenter, ycenter: Means used for centering (needed for projecting new data)

Together, these components provide complete CCA results that can be interpreted just like standard CCA output.
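
As a sketch, projecting new samples onto the canonical variates would use the saved centers and coefficients (assuming cca_file is the results file from the next section and Xnew is a new matrix with the same columns as X; only centering is undone because this implementation stores centers but not scale values):

h5f     <- H5Fopen(cca_file)
xcoef   <- h5f$Results$xcoef
xcenter <- h5f$Results$xcenter
h5closeAll()

# Center new data with the training means, then project onto the canonical directions
xscores_new <- sweep(Xnew, 2, as.numeric(xcenter)) %*% xcoef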


6 Example Usage

Here’s how to use the implemented CCA function:

library(BigDataStatMeth)
library(rhdf5)

# Create example multi-omic data
set.seed(123)
n_samples <- 500
n_genes_x <- 3000
n_genes_y <- 2000

# Genomic data (X)
X <- matrix(rnorm(n_samples * n_genes_x), n_samples, n_genes_x)
rownames(X) <- paste0("Sample_", 1:n_samples)
colnames(X) <- paste0("Gene_X_", 1:n_genes_x)

# Expression data (Y)
Y <- matrix(rnorm(n_samples * n_genes_y), n_samples, n_genes_y)
rownames(Y) <- paste0("Sample_", 1:n_samples)
colnames(Y) <- paste0("Gene_Y_", 1:n_genes_y)

# Create HDF5 file
cca_file <- "example_cca.hdf5"

bdCreate_hdf5_matrix(cca_file, X, "data", "X", overwriteFile = TRUE)
bdCreate_hdf5_matrix(cca_file, Y, "data", "Y", overwriteFile = FALSE)

# Run CCA
result <- bdCCA_hdf5(
  filename = cca_file,
  X = "data/X",
  Y = "data/Y",
  m = 4,              # 4 blocks for QR
  bcenter = TRUE,     # Center data
  bscale = TRUE,      # Scale data
  overwriteResults = TRUE,
  keepInteResults = FALSE  # Remove intermediate steps
)

# Examine results
h5ls(cca_file)

# Extract canonical correlations
h5f <- H5Fopen(cca_file)
canonical_cors <- diag(h5f$Results$cor)
xcoef <- h5f$Results$xcoef
ycoef <- h5f$Results$ycoef
H5Fclose(h5f)

cat("\nCanonical correlations:\n")
print(canonical_cors[1:5])

cat("\nTop features for first canonical variate:\n")
cat("X features:\n")
print(head(sort(abs(xcoef[, 1]), decreasing = TRUE)))
cat("Y features:\n")
print(head(sort(abs(ycoef[, 1]), decreasing = TRUE)))

7 Interactive Exercise

7.1 Practice: Modifying the CCA Implementation

Understanding how the algorithm works helps you adapt it for your needs or create entirely new methods.

# Exercise 1: Change block size
# Try m = 2, 4, 8, 16
# How does memory usage change?
# How does computation time change?
# Do results differ?

# Exercise 2: Skip normalization
# Set bcenter = FALSE, bscale = FALSE
# How do canonical correlations change?
# Why does this matter?

# Exercise 3: Keep intermediate results
# Set keepInteResults = TRUE
# Explore Step1 through Step7 in HDF5 file
# What does each step contain?
# How does data transform through the pipeline?

# Exercise 4: Add progress reporting
# Modify getQRbyBlocks to print progress
# Track time for each major step
# Identify bottlenecks
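
For Exercise 4, one possible starting point is a small timing wrapper (a sketch; the helper name and labels are illustrative):

time_step <- function(label, expr) {
  t0  <- Sys.time()
  res <- force(expr)                      # evaluate the wrapped call
  cat(sprintf("  %-20s %6.1f s\n", label,
              as.numeric(difftime(Sys.time(), t0, units = "secs"))))
  invisible(res)
}

# Example: wrap individual calls inside getQRbyBlocks, e.g.
# time_step("normalize", bdNormalize_hdf5(filename = file, group = strgroup,
#                                         dataset = strdataset, bcenter = center,
#                                         bscale = scale, overwrite = overwrt))
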
Tip: Reflection Questions

1. Algorithm Understanding:
  • Why QR decomposition instead of direct SVD on X and Y?
  • What would break if you skipped the hierarchical QR merge?
  • How does block size affect accuracy vs. efficiency?

2. Design Decisions:
  • Why store intermediate results in HDF5 instead of memory?
  • keepInteResults = TRUE vs. FALSE: when should you use each?
  • Could you implement this with fewer steps? What are the trade-offs?

3. Extending to New Methods:
  • What if you wanted sparse CCA? Which step changes?
  • How would you implement regularized CCA?
  • Could you use this pattern for PLS regression?

4. Practical Considerations:
  • Your data has missing values. Where in the pipeline do you handle them?
  • X has 100,000 features and Y has 50,000. What breaks?
  • How do you validate that results match standard CCA?

5. Performance Optimization:
  • Which step is slowest: QR blocks, SVD, or coefficient solving?
  • Could you parallelize any steps?
  • When does the C++ version become worth the effort?

Think about how you’d adapt this pattern for your own statistical method. The structure—partition, process blocks, combine hierarchically—applies broadly beyond CCA.


8 Key Takeaways

Let’s consolidate what you’ve learned about implementing CCA using BigDataStatMeth’s R API.

8.1 Essential Concepts

Block-wise algorithms decompose complex operations into manageable steps. Rather than computing QR decomposition on a 50,000 × 500,000 matrix (requiring hundreds of GB RAM), we partition into blocks, compute QR on each (requiring only GB), then combine results hierarchically. This pattern—partition, process, combine—applies to virtually any matrix algorithm. Understanding this pattern lets you implement statistical methods that would otherwise be impossible on standard hardware.

HDF5 files serve as persistent working memory. Each step saves results to HDF5: Step1 has split matrices, Step2 has block QRs, Step3 has merged results. This isn’t just storage—it enables debugging (inspect intermediate results), resuming (restart from any step), and memory management (only current block in RAM). Without HDF5’s structure, you’d need everything in memory simultaneously or manage dozens of temporary files manually.
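
For example, with keepInteResults = TRUE the intermediate groups can be inspected directly from R (a brief sketch using the example file from Section 6):

h5ls(cca_file)                 # lists Step1 ... Step7 alongside data and Results
h5f <- H5Fopen(cca_file)
dim(h5f$Step6$XQ)              # dimensions of the reconstructed Q matrix for X
h5closeAll()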

Hierarchical combination preserves mathematical correctness. Computing QR on blocks, stacking R matrices, taking QR of the stack, and multiplying back through Q matrices gives identical results to single-shot QR on the full matrix. The mathematics works because QR decomposition composes correctly. Not all algorithms have this property—you can’t naively block-wise any operation and expect correct results. Understanding which operations compose correctly is essential for designing block-wise algorithms.

Normalization affects numerical stability more than correctness. Centering and scaling aren’t just preprocessing niceties for CCA—they prevent numerical precision loss when working with disparate scales. Mixing genomic copy number values (±2) and expression values (0-100,000) in the same analysis without scaling produces correlations driven by scale, not biology. For block-wise processing on HDF5, normalization happens once upfront and stores normalized data for all subsequent operations, avoiding repeated normalization overhead.

Intermediate result management is a design decision. keepInteResults = FALSE removes Step1-Step7 after completion, saving disk space. keepInteResults = TRUE preserves them for inspection, debugging, or reusing with different final steps. For production pipelines, clean up. For method development, keep everything until you’re certain it works. The choice affects both disk usage and your ability to troubleshoot problems.

Code structure mirrors algorithm structure. getQRbyBlocks handles one matrix’s QR decomposition. bdCCA_hdf5 calls it twice (for X and Y), computes cross-product, performs SVD, and extracts coefficients. writeCCAComponents_hdf5 handles result extraction and formatting. This modular structure makes code understandable and reusable—getQRbyBlocks could be used for methods beyond CCA. Clear function boundaries with specific responsibilities enable building complex methods from simple components.

8.2 When to Implement Methods in R

Understanding when R implementation suffices versus when C++ becomes necessary guides efficient development.

Implement in R when:

  • Prototyping new methods - R’s interactive nature enables rapid testing. Write the algorithm, run it on small data, iterate until correct. Once working, optimize or port to C++ if needed. Starting in R saves development time even if you eventually rewrite in C++.

  • Combining existing BigDataStatMeth functions - If your method uses bdSVD_hdf5, bdCrossprod_hdf5, bdSolve, etc., R glue code suffices. The heavy lifting happens in C++ (BigDataStatMeth internals), R just orchestrates. CCA is this case—all computational work happens in compiled code, R just sequences operations.

  • Method will be used occasionally - If you run CCA once per project, R performance is fine. Minutes vs. seconds doesn’t matter when you run it monthly. Save development effort for methods you’ll run thousands of times.

  • Your team works primarily in R - Keeping implementation in R enables others to understand, modify, and extend your code without C++ expertise. Collaboration and maintainability sometimes outweigh raw performance.

  • Intermediate result inspection matters - R makes it trivial to peek at any step: load data from HDF5, examine dimensions, plot distributions. This transparency helps verify correctness and debug problems. C++ implementations bury these details in compiled code.

Consider C++ implementation when:

  • Performance is critical - If your method runs hundreds of times daily, saving 50% computation time through C++ pays for development effort. Profile first—identify if performance matters before investing in optimization.

  • Custom algorithms, not BigDataStatMeth combinations - If you’re implementing novel linear algebra (not using existing bdXXX functions), C++ gives you control over algorithms, memory layout, and parallelization. R becomes a bottleneck for truly custom computation.

  • Production deployment required - Software going into production pipelines benefits from C++’s speed and static typing. Development takes longer but runtime is faster and more predictable.

  • You need fine-grained parallelization - R’s parallelization works at function level. C++ with OpenMP or threading libraries enables parallelizing within matrix operations, achieving better speedups for large-scale data.

Don’t reimplement in C++ when:

  • R version works fine - “Working” beats “optimal” when deadlines loom. If R implementation meets your needs, moving to C++ wastes time that could go toward science.

  • The bottleneck isn’t computation - If your analysis spends 90% of time reading files or waiting for database queries, optimizing computation achieves nothing. Profile to find true bottlenecks before optimizing.

  • You’re still figuring out the algorithm - Algorithm design in C++ is painful—compile, run, crash, debug, repeat. R’s interactivity (run line-by-line, inspect variables, iterate quickly) is invaluable during method development. Get it working in R, then optimize if needed.

The key principle: R for development and prototyping, C++ for production and performance-critical code. Many methods never need C++ because R orchestration of C++ building blocks (BigDataStatMeth functions) provides adequate performance.


9 Next Steps

Explore the C++ implementation:

  • CCA Implementation in C++ - Same algorithm, different language
  • Compare R and C++ approaches
  • Understand when each makes sense

Apply this pattern to new methods:

  • Implement PLS (Partial Least Squares) using similar structure
  • Create sparse CCA with L1 penalties
  • Develop custom dimensionality reduction methods

Extend CCA functionality:

  • Add cross-validation for component selection
  • Implement permutation testing for significance
  • Create visualization functions for results
  • Build prediction functions for new samples

10 Cleanup

# Close any open HDF5 connections
h5closeAll()

# Note: Keep example_cca.hdf5 for testing and learning
# It contains complete CCA results you can explore