CCA Implementation in R

Building Canonical Correlation Analysis Using BigDataStatMeth R API

1 Overview

This document shows how to implement Canonical Correlation Analysis (CCA) from scratch using BigDataStatMeth’s R API. Rather than using a pre-built CCA function, you’ll see how to combine BigDataStatMeth’s building blocks—QR decomposition, matrix operations, SVD—to create a complete statistical method.

Think of this as a blueprint for developing your own statistical methods. CCA serves as the example, but the patterns you’ll learn apply to any algorithm: partition data into blocks, apply operations iteratively, combine results hierarchically. This is how you extend BigDataStatMeth beyond its built-in functions to implement novel statistical methods for big data.

1.1 What You’ll Learn

By the end of this document, you will:

  • Understand CCA’s mathematical algorithm and why block-wise processing helps
  • Implement QR decomposition by blocks for large matrices
  • Combine intermediate QR results hierarchically
  • Compute canonical correlations from Q matrices
  • Extract canonical coefficients and variates
  • Structure complex multi-step algorithms in R
  • Manage intermediate results in HDF5 files
  • Create reusable functions for new statistical methods

2 CCA Algorithm Overview

2.1 Mathematical Foundation

Canonical Correlation Analysis finds linear combinations of features in X and Y that maximize correlation:

\max_{\alpha, \beta} \text{cor}(X\alpha, Y\beta)

The standard algorithm:

  1. Compute QR decompositions: X = Q_X R_X, Y = Q_Y R_Y
  2. Compute cross-product: M = Q_X^T Q_Y
  3. SVD of cross-product: M = U D V^T
  4. Solve for canonical coefficients: \alpha = R_X^{-1} U, \beta = R_Y^{-1} V
  5. Compute canonical variates: s_X = X\alpha, s_Y = Y\beta
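
To make these steps concrete, here is a minimal in-memory sketch using base R's qr(), svd(), and backsolve() (variable names are illustrative; this is only practical when the data fit in memory, unlike the block-wise HDF5 version developed below):

set.seed(1)
X <- scale(matrix(rnorm(100 * 5), 100, 5), center = TRUE, scale = FALSE)
Y <- scale(matrix(rnorm(100 * 4), 100, 4), center = TRUE, scale = FALSE)

qx <- qr(X); qy <- qr(Y)                 # Step 1: QR decompositions
M  <- crossprod(qr.Q(qx), qr.Q(qy))      # Step 2: M = Q_X^T Q_Y
sv <- svd(M)                             # Step 3: sv$d are the canonical correlations
alpha <- backsolve(qr.R(qx), sv$u)       # Step 4: alpha = R_X^{-1} U
beta  <- backsolve(qr.R(qy), sv$v)       #         beta  = R_Y^{-1} V
sX <- X %*% alpha                        # Step 5: canonical variates
sY <- Y %*% beta
round(sv$d, 3)                           # compare with cancor(X, Y)$cor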

2.2 Why Block-Wise Processing?

For large matrices (e.g., 50,000 samples × 500,000 genomic features), computing QR decomposition directly requires massive RAM. Block-wise processing:

  • Partitions X into m blocks by rows
  • Computes QR on each block separately (manageable size)
  • Combines block QR results hierarchically
  • Final result identical to full QR, but memory-friendly

3 Step 1: QR Decomposition by Blocks

3.1 The getQRbyBlocks Function

This is the core building block. It computes QR decomposition of a matrix stored in HDF5 by processing it in blocks:

library(BigDataStatMeth)
library(rhdf5)
library(data.table)  # provides the %like% operator used below

getQRbyBlocks <- function(strdataset, file, m, center, scale, bcols, overwrt) {
  
  # Parse dataset path (group/dataset)
  strgroup <- gsub("/[^/]*$", "", strdataset)
  strdataset <- gsub("^.*/", "", strdataset)
  
  # Step 1: Normalize the dataset
  # Centering and scaling ensure numerical stability
  bdNormalize_hdf5(
    filename = file,
    group = strgroup,
    dataset = strdataset,
    bcenter = center,
    bscale = scale,
    overwrite = overwrt
  )
  
  # Step 2: Split matrix into m blocks
  # Process by rows (samples) - each block gets ~n/m rows
  bdSplit_matrix_hdf5(
    filename = file,
    group = paste0("NORMALIZED/", strgroup),
    dataset = strdataset,
    outgroup = paste0("Step1/", strdataset, "rows"),
    nblocks = m,
    bycols = bcols,  # FALSE = split by rows
    overwrite = overwrt
  )
  
  # Step 3: Apply QR to each block independently
  blocks <- bdgetDatasetsList_hdf5(file, paste0("Step1/", strdataset, "rows"))
  
  bdapply_Function_hdf5(
    filename = file,
    group = paste0("Step1/", strdataset, "rows"),
    datasets = blocks,
    outgroup = paste0("Step2/", strdataset, "rows"),
    func = "QR",  # Apply QR decomposition
    overwrite = overwrt
  )
  
  # Step 4: Merge R matrices from all blocks
  # QR gives Q (orthogonal) and R (upper triangular)
  # We need to combine all R matrices
  blocks_qr <- bdgetDatasetsList_hdf5(file, paste0("Step2/", strdataset, "rows"))
  
  # Filter to get only R matrices (not Q)
  r_matrices <- blocks_qr[which(blocks_qr %like% ".R")]
  
  bdBind_hdf5_datasets(
    filename = file,
    group = paste0("Step2/", strdataset, "rows"),
    datasets = r_matrices,
    outgroup = "Step3/merged",
    outdataset = paste0(strdataset, "Rt"),
    func = "bindRows",  # Stack R matrices vertically
    overwrite = overwrt
  )
  
  # Step 5: Final QR on merged R matrices
  # This gives us the overall R matrix
  bdapply_Function_hdf5(
    filename = file,
    group = "Step3/merged",
    datasets = paste0(strdataset, "Rt"),
    outgroup = "Step3/Final_QR",
    func = "QR",
    overwrite = overwrt
  )
  
  # Step 6: Split final Q for block-wise multiplication
  bdSplit_matrix_hdf5(
    filename = file,
    group = "Step3/Final_QR",
    dataset = paste0(strdataset, "Rt.Q"),
    outgroup = "Step4/splitted",
    nblocks = m,
    bycols = bcols,
    overwrite = overwrt
  )
  
  # Step 7: Multiply original Q blocks by final Q blocks
  # This reconstructs the full Q matrix in blocks
  tmp <- bdgetDatasetsList_hdf5(file, "Step4/splitted")
  Rt_Q_divide <- tmp[which(tmp %like% paste0(strdataset, "Rt.Q"))]
  
  q_matrices <- blocks_qr[which(blocks_qr %like% ".Q")]
  
  bdapply_Function_hdf5(
    filename = file,
    group = paste0("Step2/", strdataset, "rows"),
    datasets = q_matrices,
    outgroup = "Step5",
    func = "blockmult",
    b_group = "Step4/splitted",
    b_datasets = Rt_Q_divide,
    overwrite = TRUE
  )
  
  # Step 8: Combine all Q blocks into final Q matrix
  blocks_Q <- bdgetDatasetsList_hdf5(file, "Step5")
  
  bdBind_hdf5_datasets(
    filename = file,
    group = "Step5",
    datasets = blocks_Q[which(blocks_Q %like% paste0(strdataset, "."))],
    outgroup = "Step6",
    outdataset = paste0(strdataset, "Q"),
    func = "bindRows",
    overwrite = overwrt
  )
  
  cat("✓ QR decomposition complete for", strdataset, "\n")
}
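
As a usage sketch, assuming an HDF5 file that already contains a data/X dataset (such as the example file created in Section 6), the function would be called on a single dataset path:

# Illustrative call for one matrix; arguments mirror those passed by bdCCA_hdf5 below
getQRbyBlocks("data/X", file = "example_cca.hdf5", m = 4,
              center = TRUE, scale = FALSE, bcols = FALSE, overwrt = TRUE)
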
Note: Why This Multi-Step Process?

Each block gets QR: X_i = Q_i R_i

Stacking all R_i and taking QR again gives: [R_1; R_2; ...; R_m] = \tilde{Q} \tilde{R}

Splitting \tilde{Q} into row blocks \tilde{Q}_1, ..., \tilde{Q}_m and stacking the products gives the final Q: Q = [Q_1 \tilde{Q}_1; Q_2 \tilde{Q}_2; ...; Q_m \tilde{Q}_m]

This hierarchical approach is mathematically equivalent to QR on the full matrix but processes data in memory-friendly chunks.
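
A tiny base-R check of this equivalence on an in-memory matrix split into two row blocks (illustrative only; the HDF5 functions above perform the same steps at scale):

set.seed(2)
A      <- matrix(rnorm(40 * 3), 40, 3)
blocks <- list(A[1:20, ], A[21:40, ])           # split by rows into 2 blocks
qr_b   <- lapply(blocks, qr)                    # QR of each block
Rstack <- do.call(rbind, lapply(qr_b, qr.R))    # stack the R_i (6 x 3)
qr_t   <- qr(Rstack)                            # QR of the stack: Q~ R~
Qt     <- qr.Q(qr_t); Rt <- qr.R(qr_t)
Q      <- rbind(qr.Q(qr_b[[1]]) %*% Qt[1:3, ],  # Q_i %*% Q~_i, stacked
                qr.Q(qr_b[[2]]) %*% Qt[4:6, ])
max(abs(Q %*% Rt - A))                          # ~ 1e-15: reproduces A, like a single QR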


4 Step 2: Main CCA Function

Now we combine everything into the complete CCA implementation:

bdCCA_hdf5 <- function(filename, X, Y, m = 10,
                       bcenter = TRUE, bscale = FALSE, bycols = FALSE,
                       overwriteResults = FALSE, keepInteResults = FALSE) {
  
  matrices <- c(X, Y)
  
  # Step 1-6: Compute QR for both X and Y
  # This is the computationally intensive part
  sapply(
    matrices,
    getQRbyBlocks,
    file = filename,
    m = m,
    center = bcenter,
    scale = bscale,
    bcols = bycols,
    overwrt = overwriteResults
  )
  
  cat("✓ QR decompositions complete for both matrices\n")
  
  # Step 7: Compute cross-product of Q matrices
  # This finds the canonical space
  res <- bdCrossprod_hdf5(
    filename = filename,
    group = "Step6",
    A = "XQ",
    groupB = "Step6",
    B = "YQ",
    outdataset = "CrossProd_XQ_YQ",
    outgroup = "Step7",
    overwrite = TRUE
  )
  
  cat("✓ Cross-product computed\n")
  
  # Step 8: SVD of cross-product
  # Singular values = canonical correlations
  # U, V = canonical directions
  res <- bdSVD_hdf5(
    file = filename,
    group = "Step7",
    dataset = "CrossProd_XQ_YQ",
    bcenter = FALSE,  # Already centered
    bscale = FALSE,
    k = 16,  # Process in blocks
    q = 2,   # Two-level hierarchy
    threads = 3,
    overwrite = TRUE
  )
  
  cat("✓ SVD computed - canonical correlations extracted\n")
  
  # Get dimensions for coefficient computation
  res <- sapply(matrices, bdgetDim_hdf5, filename = filename)
  
  # Step 9: Compute and save canonical coefficients
  writeCCAComponents_hdf5(filename, res[2, X], res[2, Y])
  
  cat("✓ Canonical coefficients and variates saved\n")
  
  # Cleanup intermediate results if requested
  if (keepInteResults == FALSE) {
    sapply(paste0("Step", 1:7), function(x) {
      invisible(bdRemove_hdf5_element(filename, element = x))
      cat("  ", x, "removed\n")
    })
    cat("✓ Intermediate results cleaned up\n")
  }
  
  return(list(
    filename = filename,
    results_group = "Results"
  ))
}

5 Step 3: Computing Canonical Coefficients

The final step converts SVD results into interpretable canonical components:

writeCCAComponents_hdf5 <- function(filename, ncolsX, ncolsY) {
  
  if (!file.exists(filename)) {
    stop("ERROR - File does not exist")
  }
  
  # Read all necessary components from HDF5
  h5f <- H5Fopen(filename)
  
  # Get first ncolsX × ncolsX and ncolsY × ncolsY from Q matrices
  XQ <- h5f$Step6$XQ[1:ncolsX, 1:ncolsX]
  YQ <- h5f$Step6$YQ[1:ncolsY, 1:ncolsY]
  
  # Get R matrices from final QR
  XR <- h5f$Step3$Final_QR$XRt.R
  YR <- h5f$Step3$Final_QR$YRt.R
  
  # Get SVD results (canonical correlations and directions)
  d <- h5f$SVD$CrossProd_XQ_YQ$d
  u <- h5f$SVD$CrossProd_XQ_YQ$u
  v <- h5f$SVD$CrossProd_XQ_YQ$v
  
  # Get centering values
  xcenter <- h5f$NORMALIZED$data$mean.X
  ycenter <- h5f$NORMALIZED$data$mean.Y
  
  # Get feature names
  x_names <- h5f$data$.X_dimnames$`2`
  y_names <- h5f$data$.Y_dimnames$`2`
  
  h5closeAll()
  
  # Reconstruct compact QR representation
  # Q and R are stored separately, combine them
  XR[lower.tri(XR, diag = FALSE)] <- 0  # R is upper triangular
  XQ[upper.tri(XQ, diag = TRUE)] <- 0   # Q lower part
  XQR <- XR + XQ
  
  YR[lower.tri(YR, diag = FALSE)] <- 0
  YQ[upper.tri(YQ, diag = TRUE)] <- 0
  YQR <- YR + YQ
  
  # Solve for canonical coefficients
  # X canonical coefficients: solve XQR * xcoef = u
  xcoef <- bdSolve(XQR, u)
  
  # Y canonical coefficients: solve YQR * ycoef = v
  ycoef <- bdSolve(YQR, v)
  
  # Add feature names
  rownames(xcoef) <- as.matrix(x_names)
  rownames(ycoef) <- as.matrix(y_names)
  
  # Save canonical coefficients to HDF5
  bdCreate_hdf5_matrix(
    filename,
    object = xcoef,
    group = "Results",
    dataset = "xcoef",
    overwriteDataset = TRUE
  )
  
  bdCreate_hdf5_matrix(
    filename = filename,
    object = ycoef,
    group = "Results",
    dataset = "ycoef",
    overwriteDataset = TRUE
  )
  
  # Save canonical correlations
  bdCreate_hdf5_matrix(
    filename,
    object = as.matrix(diag(d)),
    group = "Results",
    dataset = "cor",
    overwriteDataset = TRUE
  )
  
  # Save centering values (needed for new data projection)
  bdCreate_hdf5_matrix(
    filename,
    object = xcenter,
    group = "Results",
    dataset = "xcenter",
    overwriteDataset = TRUE
  )
  
  bdCreate_hdf5_matrix(
    filename,
    object = ycenter,
    group = "Results",
    dataset = "ycenter",
    overwriteDataset = TRUE
  )
  
  # Compute canonical variates (scores) for samples
  # X scores = X * xcoef
  bdblockmult_hdf5(
    filename = filename,
    group = "data",
    A = "X",
    B = "xcoef",
    groupB = "Results",
    outgroup = "Results",
    outdataset = "xscores",
    overwrite = TRUE
  )
  
  # Y scores = Y * ycoef
  bdblockmult_hdf5(
    filename,
    group = "data",
    A = "Y",
    B = "ycoef",
    groupB = "Results",
    outgroup = "Results",
    outdataset = "yscores",
    overwrite = TRUE
  )
  
  cat("✓ All CCA components saved to /Results/\n")
}

Important: Understanding the Coefficients
  • xcoef, ycoef: Feature loadings - which features contribute to each canonical variate
  • cor: Canonical correlations - strength of relationship for each component
  • xscores, yscores: Sample scores on canonical variates
  • xcenter, ycenter: Means used for centering (needed for projecting new data)

Together, these components provide complete CCA results that can be interpreted just like standard CCA output.
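
As a sketch, projecting new samples onto the canonical variates would use the saved centers and coefficients (assuming cca_file is the results file from the next section and Xnew is a new matrix with the same columns as X; only centering is undone because this implementation stores centers but not scale values):

h5f     <- H5Fopen(cca_file)
xcoef   <- h5f$Results$xcoef
xcenter <- h5f$Results$xcenter
h5closeAll()

# Center new data with the training means, then project onto the canonical directions
xscores_new <- sweep(Xnew, 2, as.numeric(xcenter)) %*% xcoef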


6 Example Usage

Here’s how to use the implemented CCA function:

library(BigDataStatMeth)
library(rhdf5)

# Create example multi-omic data
set.seed(123)
n_samples <- 500
n_genes_x <- 3000
n_genes_y <- 2000

# Genomic data (X)
X <- matrix(rnorm(n_samples * n_genes_x), n_samples, n_genes_x)
rownames(X) <- paste0("Sample_", 1:n_samples)
colnames(X) <- paste0("Gene_X_", 1:n_genes_x)

# Expression data (Y)
Y <- matrix(rnorm(n_samples * n_genes_y), n_samples, n_genes_y)
rownames(Y) <- paste0("Sample_", 1:n_samples)
colnames(Y) <- paste0("Gene_Y_", 1:n_genes_y)

# Create HDF5 file
cca_file <- "example_cca.hdf5"

bdCreate_hdf5_matrix(cca_file, X, "data", "X", overwriteFile = TRUE)
bdCreate_hdf5_matrix(cca_file, Y, "data", "Y", overwriteFile = FALSE)

# Run CCA
result <- bdCCA_hdf5(
  filename = cca_file,
  X = "data/X",
  Y = "data/Y",
  m = 4,              # 4 blocks for QR
  bcenter = TRUE,     # Center data
  bscale = TRUE,      # Scale data
  overwriteResults = TRUE,
  keepInteResults = FALSE  # Remove intermediate steps
)

# Examine results
h5ls(cca_file)

# Extract canonical correlations
h5f <- H5Fopen(cca_file)
canonical_cors <- diag(h5f$Results$cor)
xcoef <- h5f$Results$xcoef
ycoef <- h5f$Results$ycoef
H5Fclose(h5f)

cat("\nCanonical correlations:\n")
print(canonical_cors[1:5])

cat("\nTop features for first canonical variate:\n")
cat("X features:\n")
print(head(sort(abs(xcoef[, 1]), decreasing = TRUE)))
cat("Y features:\n")
print(head(sort(abs(ycoef[, 1]), decreasing = TRUE)))

7 Interactive Exercise

7.1 Practice: Modifying the CCA Implementation

Understanding how the algorithm works helps you adapt it for your needs or create entirely new methods.

# Exercise 1: Change block size
# Try m = 2, 4, 8, 16
# How does memory usage change?
# How does computation time change?
# Do results differ?

# Exercise 2: Skip normalization
# Set bcenter = FALSE, bscale = FALSE
# How do canonical correlations change?
# Why does this matter?

# Exercise 3: Keep intermediate results
# Set keepInteResults = TRUE
# Explore Step1 through Step7 in HDF5 file
# What does each step contain?
# How does data transform through the pipeline?

# Exercise 4: Add progress reporting
# Modify getQRbyBlocks to print progress
# Track time for each major step
# Identify bottlenecks
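
For Exercise 4, one possible starting point is a small timing wrapper (a sketch; the helper name and labels are illustrative):

time_step <- function(label, expr) {
  t0  <- Sys.time()
  res <- force(expr)                      # evaluate the wrapped call
  cat(sprintf("  %-20s %6.1f s\n", label,
              as.numeric(difftime(Sys.time(), t0, units = "secs"))))
  invisible(res)
}

# Example: wrap individual calls inside getQRbyBlocks, e.g.
# time_step("normalize", bdNormalize_hdf5(filename = file, group = strgroup,
#                                         dataset = strdataset, bcenter = center,
#                                         bscale = scale, overwrite = overwrt))
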
Tip: Reflection Questions

1. Algorithm Understanding:
  • Why QR decomposition instead of direct SVD on X and Y?
  • What would break if you skipped the hierarchical QR merge?
  • How does block size affect accuracy vs. efficiency?

2. Design Decisions:
  • Why store intermediate results in HDF5 instead of memory?
  • keepInteResults = TRUE vs. FALSE: when should you use each?
  • Could you implement this with fewer steps? What are the trade-offs?

3. Extending to New Methods:
  • What if you wanted sparse CCA? Which step changes?
  • How would you implement regularized CCA?
  • Could you use this pattern for PLS regression?

4. Practical Considerations:
  • Your data has missing values. Where in the pipeline do you handle them?
  • X has 100,000 features and Y has 50,000. What breaks?
  • How do you validate that results match standard CCA?

5. Performance Optimization:
  • Which step is slowest: QR blocks, SVD, or coefficient solving?
  • Could you parallelize any steps?
  • When does the C++ version become worth the effort?

Think about how you’d adapt this pattern for your own statistical method. The structure—partition, process blocks, combine hierarchically—applies broadly beyond CCA.


8 Key Takeaways

Let’s consolidate what you’ve learned about implementing CCA using BigDataStatMeth’s R API.

8.1 Essential Concepts

Block-wise algorithms decompose complex operations into manageable steps. Rather than computing QR decomposition on a 50,000 × 500,000 matrix (requiring hundreds of GB RAM), we partition into blocks, compute QR on each (requiring only GB), then combine results hierarchically. This pattern—partition, process, combine—applies to virtually any matrix algorithm. Understanding this pattern lets you implement statistical methods that would otherwise be impossible on standard hardware.

HDF5 files serve as persistent working memory. Each step saves results to HDF5: Step1 has split matrices, Step2 has block QRs, Step3 has merged results. This isn’t just storage—it enables debugging (inspect intermediate results), resuming (restart from any step), and memory management (only current block in RAM). Without HDF5’s structure, you’d need everything in memory simultaneously or manage dozens of temporary files manually.
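
For example, with keepInteResults = TRUE the intermediate groups can be inspected directly from R (a brief sketch using the example file from Section 6):

h5ls(cca_file)                 # lists Step1 ... Step7 alongside data and Results
h5f <- H5Fopen(cca_file)
dim(h5f$Step6$XQ)              # dimensions of the reconstructed Q matrix for X
h5closeAll()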

Hierarchical combination preserves mathematical correctness. Computing QR on blocks, stacking R matrices, taking QR of the stack, and multiplying back through Q matrices gives identical results to single-shot QR on the full matrix. The mathematics works because QR decomposition composes correctly. Not all algorithms have this property—you can’t naively block-wise any operation and expect correct results. Understanding which operations compose correctly is essential for designing block-wise algorithms.

Normalization affects numerical stability more than correctness. Centering and scaling aren’t just preprocessing niceties for CCA—they prevent numerical precision loss when working with disparate scales. Mixing genomic copy number values (±2) and expression values (0-100,000) in the same analysis without scaling produces correlations driven by scale, not biology. For block-wise processing on HDF5, normalization happens once upfront and stores normalized data for all subsequent operations, avoiding repeated normalization overhead.

Intermediate result management is a design decision. keepInteResults = FALSE removes Step1-Step7 after completion, saving disk space. keepInteResults = TRUE preserves them for inspection, debugging, or reusing with different final steps. For production pipelines, clean up. For method development, keep everything until you’re certain it works. The choice affects both disk usage and your ability to troubleshoot problems.

Code structure mirrors algorithm structure. getQRbyBlocks handles one matrix’s QR decomposition. bdCCA_hdf5 calls it twice (for X and Y), computes cross-product, performs SVD, and extracts coefficients. writeCCAComponents_hdf5 handles result extraction and formatting. This modular structure makes code understandable and reusable—getQRbyBlocks could be used for methods beyond CCA. Clear function boundaries with specific responsibilities enable building complex methods from simple components.

8.2 When to Implement Methods in R

Understanding when R implementation suffices versus when C++ becomes necessary guides efficient development.

Implement in R when:

  • Prototyping new methods - R’s interactive nature enables rapid testing. Write the algorithm, run it on small data, iterate until correct. Once working, optimize or port to C++ if needed. Starting in R saves development time even if you eventually rewrite in C++.

  • Combining existing BigDataStatMeth functions - If your method uses bdSVD_hdf5, bdCrossprod_hdf5, bdSolve, etc., R glue code suffices. The heavy lifting happens in C++ (BigDataStatMeth internals), R just orchestrates. CCA is this case—all computational work happens in compiled code, R just sequences operations.

  • Method will be used occasionally - If you run CCA once per project, R performance is fine. Minutes vs. seconds doesn’t matter when you run it monthly. Save development effort for methods you’ll run thousands of times.

  • Your team works primarily in R - Keeping implementation in R enables others to understand, modify, and extend your code without C++ expertise. Collaboration and maintainability sometimes outweigh raw performance.

  • Intermediate result inspection matters - R makes it trivial to peek at any step: load data from HDF5, examine dimensions, plot distributions. This transparency helps verify correctness and debug problems. C++ implementations bury these details in compiled code.

Consider C++ implementation when:

  • Performance is critical - If your method runs hundreds of times daily, saving 50% computation time through C++ pays for development effort. Profile first—identify if performance matters before investing in optimization.

  • Custom algorithms, not BigDataStatMeth combinations - If you’re implementing novel linear algebra (not using existing bdXXX functions), C++ gives you control over algorithms, memory layout, and parallelization. R becomes a bottleneck for truly custom computation.

  • Production deployment required - Software going into production pipelines benefits from C++’s speed and static typing. Development takes longer but runtime is faster and more predictable.

  • You need fine-grained parallelization - R’s parallelization works at function level. C++ with OpenMP or threading libraries enables parallelizing within matrix operations, achieving better speedups for large-scale data.

Don’t reimplement in C++ when:

  • R version works fine - “Working” beats “optimal” when deadlines loom. If R implementation meets your needs, moving to C++ wastes time that could go toward science.

  • The bottleneck isn’t computation - If your analysis spends 90% of time reading files or waiting for database queries, optimizing computation achieves nothing. Profile to find true bottlenecks before optimizing.

  • You’re still figuring out the algorithm - Algorithm design in C++ is painful—compile, run, crash, debug, repeat. R’s interactivity (run line-by-line, inspect variables, iterate quickly) is invaluable during method development. Get it working in R, then optimize if needed.

The key principle: R for development and prototyping, C++ for production and performance-critical code. Many methods never need C++ because R orchestration of C++ building blocks (BigDataStatMeth functions) provides adequate performance.


9 Next Steps

Explore the C++ implementation:

  • CCA Implementation in C++ - Same algorithm, different language
  • Compare R and C++ approaches
  • Understand when each makes sense

Apply this pattern to new methods:

  • Implement PLS (Partial Least Squares) using similar structure
  • Create sparse CCA with L1 penalties
  • Develop custom dimensionality reduction methods

Extend CCA functionality:

  • Add cross-validation for component selection
  • Implement permutation testing for significance
  • Create visualization functions for results
  • Build prediction functions for new samples

10 Cleanup

# Close any open HDF5 connections
h5closeAll()

# Note: Keep example_cca.hdf5 for testing and learning
# It contains complete CCA results you can explore