CCA Implementation in C++

Building Canonical Correlation Analysis Using BigDataStatMeth C++ API

1 Overview

This document shows how to implement Canonical Correlation Analysis (CCA) in C++ using BigDataStatMeth’s header-only library. While the R implementation demonstrated the algorithm using R’s BigDataStatMeth functions, this C++ version provides direct control over computation, memory management, and performance optimization.

The C++ implementation isn’t just a faster version of the R code—it’s designed for scenarios where performance is critical, where you need fine-grained control over algorithms, or where you’re building production systems. You’ll see how to work with HDF5 datasets directly in C++, manage memory safely with pointers, handle errors robustly, and integrate everything back into R through Rcpp.

1.1 What You’ll Learn

By the end of this document, you will:

Work with BigDataStatMeth’s C++ header-only library
Manage HDF5 datasets using the hdf5Dataset class
Implement block-wise QR decomposition in C++
Handle pointers and memory management safely
Use exception handling to prevent crashes
Integrate C++ functions with R through Rcpp
Compile and test C++ implementations
Decide when C++ optimization is worth the effort

2 Why C++ Implementation?

2.1 When C++ Makes Sense

Performance-critical applications: If you run CCA hundreds or thousands of times (production pipelines, simulation studies), C++ performance gains accumulate significantly. A 50% speedup doesn’t matter for one-off analyses but saves days in large studies.

Custom algorithms: If you’re implementing novel CCA variants (sparse, regularized, kernel) not available in BigDataStatMeth’s R functions, C++ gives you algorithmic control. You’re not constrained to pre-built operations.

Production deployment: Software going into production systems benefits from C++’s speed and static typing. Compile-time errors prevent runtime failures that could break pipelines.

Fine-grained parallelization: C++ with OpenMP enables parallelizing within matrix operations, not just across function calls. This achieves better speedups for large-scale computations.

2.2 When R Suffices

Prototyping and development: Algorithm design in R is faster—run line-by-line, inspect variables, iterate quickly. Get it working in R, then optimize in C++ if needed.

Combining existing functions: If your method uses bdSVD_hdf5(), bdCrossprod_hdf5(), etc., R orchestration is fine. The heavy lifting happens in compiled code anyway.

Occasional use: Running CCA once per month? R performance is adequate. Save development time for frequently-used methods.

The R implementation showed you CCA’s algorithm structure. This C++ version shows you how to optimize when performance matters.

3 C++ Prerequisites

3.1 Required Headers

BigDataStatMeth provides a header-only C++ library—include the main header and you have access to all classes and functions:

#include <Rcpp.h>
#include "BigDataStatMeth.hpp"

using namespace Rcpp;
using namespace BigDataStatMeth;

The Rcpp.h header enables integration with R. The BigDataStatMeth.hpp header provides: - hdf5Dataset class for HDF5 I/O - Matrix operation functions (crossprod, SVD, QR, etc.) - Utility functions for block-wise processing

3.2 Understanding hdf5Dataset Class

The hdf5Dataset class is your interface to HDF5 files in C++:

// Constructor
hdf5Dataset* ds = new hdf5Dataset(filename, group, dataset, create_new);

// Open for reading/writing
ds->openDataset();

// Get dimensions
int nrows = ds->nrows();      // Total rows
int ncols = ds->ncols();      // Total columns
int nrows_r = ds->nrows_r();  // Rows (R indexing)
int ncols_r = ds->ncols_r();  // Columns (R indexing)

// Read data block
std::vector<double> data(nrows * ncols);
std::vector<hsize_t> offset = {0, 0};
std::vector<hsize_t> count = {nrows, ncols};
std::vector<hsize_t> stride = {1, 1};
std::vector<hsize_t> block = {1, 1};

ds->readDatasetBlock(offset, count, stride, block, data.data());

// Write data block
ds->writeDatasetBlock(offset, count, stride, block, data.data());

// Cleanup
delete ds;
ds = nullptr;

Key points: - Always openDataset() before reading/writing - Always delete and set to nullptr when done - Use try/catch to handle errors safely

4 Mathematical Foundation (Reference)

The mathematical foundations of CCA are detailed in the R implementation document. Here’s a quick summary:

CCA algorithm: 1. QR decomposition: X = Q_X R_X, Y = Q_Y R_Y 2. Cross-product: M = Q_X^T Q_Y 3. SVD: M = U D V^T (D contains canonical correlations) 4. Coefficients: \alpha = R_X^{-1} U, \beta = R_Y^{-1} V 5. Variates: s_X = X\alpha, s_Y = Y\beta

The C++ implementation follows the same mathematical steps but with manual memory management and direct HDF5 access.

5 Step 1: QR Decomposition by Blocks in C++

5.1 The getQRbyBlocks_rcpp Function

This function computes QR decomposition of a matrix in HDF5 by processing blocks. It mirrors the R version but with C++ memory management:

void getQRbyBlocks_rcpp(hdf5Dataset *dsA, int mblocks, 
                        bool bcenter, bool bscale, 
                        bool byrows, bool overwrite, 
                        Rcpp::Nullable<int> threads) {
    
    hdf5Dataset* dstmp = nullptr;
    
    try {
        
        std::string strInPath, strOutPath;
        
        bool btransp_dataset = false;
        bool btransp_bdataset = false;
        bool bfullMatrix = false;
        
        bool bycols = !byrows;
        
        std::string strdataset = dsA->getDatasetName();
        
        // Step 1: Normalize the dataset
        // This ensures numerical stability during QR decomposition
        RcppNormalizeHdf5(dsA, bcenter, bscale, byrows);
        
        // Step 2: Open normalized dataset
        dstmp = new hdf5Dataset(dsA->getFileName(), 
                               "NORMALIZED/" + dsA->getGroupName(), 
                               strdataset, false);
        dstmp->openDataset();
        
        // Step 3: Split matrix into mblocks
        // Each block will be processed independently
        strOutPath = "Step1/" + strdataset + "rows";
        RcppSplit_matrix_hdf5_internal(
            dstmp, strOutPath, strdataset, bycols,
            mblocks, -1, dstmp->nrows(), dstmp->ncols()
        );
        
        delete dstmp; 
        dstmp = nullptr;    
        
        // Step 4: Apply QR to each block
        StringVector blocks = dsA->getDatasetNames(strOutPath, "", "");
        strInPath = strOutPath;
        strOutPath = "Step2/" + strdataset + "rows";
        
        RcppApplyFunctionHdf5(
            dsA->getFileName(), strInPath, blocks, strOutPath, "QR",
            R_NilValue, R_NilValue, wrap(overwrite), 
            wrap(btransp_dataset), wrap(btransp_bdataset), 
            wrap(bfullMatrix), wrap(byrows), threads
        );
        
        // Step 5: Merge all R matrices from block QRs
        StringVector blocks_qr = dsA->getDatasetNames(strOutPath, 
                                                       strdataset, ".R");
        strInPath = strOutPath;
        strOutPath = "Step3/merged";
        
        RcppBind_datasets_hdf5(
            dsA->getFileName(), strInPath, blocks_qr,
            strOutPath, strdataset + "Rt", "bindRows", 
            false, wrap(overwrite)
        );
        
        // Step 6: Final QR on merged R matrix
        // This gives us the overall R matrix for the full dataset
        strInPath = strOutPath;
        strOutPath = "Step3/Final_QR";
        
        RcppApplyFunctionHdf5(
            dsA->getFileName(), strInPath, strdataset + "Rt", 
            strOutPath, "QR",
            R_NilValue, R_NilValue, wrap(overwrite), 
            wrap(btransp_dataset), wrap(btransp_bdataset), 
            wrap(bfullMatrix), wrap(byrows), threads
        );
        
        // Step 7: Split final Q matrix
        strInPath = strOutPath;
        strOutPath = "Step4/splitted";
        
        dstmp = new hdf5Dataset(dsA->getFileName(), strInPath, 
                               strdataset + "Rt.Q", false);
        dstmp->openDataset();
        
        RcppSplit_matrix_hdf5_internal(
            dstmp, strOutPath, strdataset + "Rt.Q", bycols,
            mblocks, -1, dstmp->nrows(), dstmp->ncols()
        );
        
        delete dstmp; 
        dstmp = nullptr; 
        
        // Step 8: Multiply original Q blocks by final Q blocks
        // This reconstructs the full Q matrix in blocks
        strInPath = "Step2/" + strdataset + "rows";
        strOutPath = "Step5";
        CharacterVector b_group = "Step4/splitted";
        
        blocks_qr = dsA->getDatasetNames(strInPath, strdataset, ".Q");
        StringVector b_blocks = dsA->getDatasetNames("Step4/splitted", 
                                                      strdataset + "Rt.Q", "");
        
        RcppApplyFunctionHdf5(
            dsA->getFileName(), strInPath, blocks_qr, strOutPath, 
            "blockmult",
            b_group, b_blocks, wrap(overwrite), 
            wrap(btransp_dataset), wrap(btransp_bdataset), 
            wrap(bfullMatrix), wrap(byrows), threads
        );
        
        // Step 9: Bind all Q blocks into final Q matrix
        strInPath = strOutPath; 
        strOutPath = "Step6";
        blocks = dsA->getDatasetNames(strInPath, strdataset + ".", "");
        
        RcppBind_datasets_hdf5(
            dsA->getFileName(), strInPath, blocks,
            strOutPath, strdataset + "Q", "bindRows", 
            false, wrap(overwrite)
        );
        
    } catch(std::exception& ex) {
        // Critical: always cleanup on error
        checkClose_file(dsA, dstmp);
        Rcpp::Rcout << "C++ exception getQRbyBlocks_rcpp: " 
                    << ex.what() << "\n";
        return void();
    }
    
    return void();
}

Key differences from R:

Pointer management: dstmp allocated with new, must be deleted
Exception handling: try/catch prevents crashes, ensures cleanup
Explicit types: std::string, bool, int declared explicitly
Null checking: Set to nullptr after deletion to prevent double-free

The algorithm is identical to R, but memory safety requires careful management.

6 Step 2: Main CCA Function in C++

6.1 The bdCCA_hdf5_rcpp Function

This is the main entry point, orchestrating the full CCA computation:

// [[Rcpp::export]]
void bdCCA_hdf5_rcpp(std::string filename, 
                     std::string datasetX,
                     std::string datasetY, 
                     bool bcenter, bool bscale, 
                     int mblocks,
                     bool overwrite, 
                     Rcpp::Nullable<int> threads = R_NilValue) {

    // Declare all pointers at function start
    // This ensures they're in scope for cleanup
    hdf5Dataset* dsX = nullptr;
    hdf5Dataset* dsY = nullptr;
    hdf5Dataset* dsXQ = nullptr;
    hdf5Dataset* dsYQ = nullptr;
    hdf5Dataset* dsC = nullptr;

    try {
        
        int ncolsX, ncolsY;

        // Step 1: Open X dataset and compute QR
        dsX = new hdf5Dataset(filename, datasetX, false); 
        dsX->openDataset(); 
        
        getQRbyBlocks_rcpp(dsX, mblocks, bcenter, bscale, 
                          false, overwrite, threads);
        
        // Step 2: Open Y dataset and compute QR
        dsY = new hdf5Dataset(filename, datasetY, false); 
        dsY->openDataset(); 
        
        getQRbyBlocks_rcpp(dsY, mblocks, bcenter, bscale, 
                          false, overwrite, threads);

        // Step 3: Open Q matrices from QR results
        dsXQ = new hdf5Dataset(filename, "Step6", "XQ", false); 
        dsXQ->openDataset(); 
        
        dsYQ = new hdf5Dataset(filename, "Step6", "YQ", false); 
        dsYQ->openDataset(); 
        
        // Step 4: Compute cross-product of Q matrices
        // This creates the canonical space
        dsC = new hdf5Dataset(filename, "Step7", "CrossProd_XQ_YQ", overwrite);
        
        int optimBlock = getMaxBlockSize(
            dsXQ->nrows(), dsXQ->ncols(), 
            dsYQ->nrows(), dsYQ->ncols(), 
            2, R_NilValue
        );
        
        dsC = BigDataStatMeth::crossprod(
            dsXQ, dsYQ, dsC, 
            optimBlock, optimBlock/2, 
            true, true, threads
        );
        
        // Step 5: Cleanup Q matrix pointers (no longer needed)
        delete dsXQ; dsXQ = nullptr;
        delete dsYQ; dsYQ = nullptr;
        
        // Step 6: Get dimensions for coefficient computation
        ncolsX = dsX->ncols_r();
        ncolsY = dsY->ncols_r();
        
        // Step 7: Compute SVD of cross-product
        // This extracts canonical correlations
        RcppbdSVD_hdf5(
            dsC->getFileName(), dsC->getGroupName(), 
            dsC->getDatasetName(),  
            16, 2, 0,           // k, q, components
            false, false, 0,    // bcenter, bscale, blocksize
            overwrite, false,   // overwrite, paral
            R_NilValue, R_NilValue  // threads, method
        );
        
        // Step 8: Cleanup cross-product
        delete dsC; dsC = nullptr;
        
        // Step 9: Compute canonical coefficients and variates
        writeCCAComponents_hdf5_rcpp(filename, ncolsX, ncolsY);
        
        // Step 10: Final cleanup
        delete dsX; dsX = nullptr;
        delete dsY; dsY = nullptr;
        
    } catch(std::exception& ex) {
        // Critical error handling
        // Cleanup ALL pointers on any exception
        checkClose_file(dsX, dsY, dsXQ, dsYQ, dsC);
        Rcerr << "\nC++ exception bdCCA_hdf5_rcpp: " 
              << ex.what() << "\n";
        return void();
    }

    return void();
}

Important patterns:

Pointer declaration: All declared at start, initialized to nullptr
Sequential cleanup: Delete after use, set to nullptr immediately
Exception safety: try/catch ensures cleanup even on errors
checkClose_file(): Utility function that safely deletes multiple pointers

This pattern prevents memory leaks and dangling pointers.

7 Step 3: Computing Canonical Coefficients

7.1 Option A: Call R Function from C++

The simplest approach calls the R implementation for coefficient computation:

void writeCCAComponents_hdf5_rcpp(std::string filename, 
                                   int ncolsX, int ncolsY) {
    
    // Access R package environment
    Rcpp::Environment base("package:BDStatMethExamples"); 
    
    // Get R function
    Rcpp::Function write_components = base["writeCCAComponents_hdf5"];    
    
    // Call R function with named parameters
    write_components(
        Rcpp::_["filename"] = wrap(filename),
        Rcpp::_["ncolsX"]  = ncolsX,
        Rcpp::_["ncolsY"]  = ncolsY
    ); 
    
    return void();
}

When this makes sense:

Coefficient computation isn’t the bottleneck
R implementation is well-tested and working
You want to minimize C++ code complexity

Trade-off: Adds R dependency. For pure C++ deployment, use Option B.

7.2 Option B: Full C++ Implementation (Advanced)

For complete C++ control, implement coefficient computation directly:

void writeCCAComponents_hdf5_full_cpp(std::string filename,
                                       int ncolsX, int ncolsY) {
    
    hdf5Dataset* dstmpX = nullptr;
    hdf5Dataset* dstmpY = nullptr;
    
    std::vector<hsize_t> stride = {1, 1};
    std::vector<hsize_t> block = {1, 1};
    
    try {
        
        hsize_t ncolsX_h = static_cast<hsize_t>(ncolsX);
        hsize_t ncolsY_h = static_cast<hsize_t>(ncolsY);
        
        // Allocate memory for Q and R matrices
        std::vector<double> vdXQ(ncolsX * ncolsX);
        std::vector<double> vdYQ(ncolsY * ncolsY);
        std::vector<double> vdXR(ncolsX * ncolsX);
        std::vector<double> vdYR(ncolsY * ncolsY);
        
        // Read X Q matrix
        dstmpX = new hdf5Dataset(filename, "Step6/XQ", false);  
        dstmpX->openDataset();
        dstmpX->readDatasetBlock(
            {0, 0}, {ncolsX_h, ncolsX_h}, 
            stride, block, vdXQ.data()
        );
        
        // Map to Eigen matrix for linear algebra
        Eigen::MatrixXd XQR = Eigen::Map<Eigen::Matrix<double, 
                              Eigen::Dynamic, Eigen::Dynamic, 
                              Eigen::ColMajor>>(vdXQ.data(), ncolsX, ncolsX);
        
        delete dstmpX; dstmpX = nullptr;
        
        // Read Y Q matrix
        dstmpY = new hdf5Dataset(filename, "Step6/YQ", false); 
        dstmpY->openDataset();
        dstmpY->readDatasetBlock(
            {0, 0}, {ncolsY_h, ncolsY_h}, 
            stride, block, vdYQ.data()
        );
        
        Eigen::MatrixXd YQR = Eigen::Map<Eigen::Matrix<double, 
                              Eigen::Dynamic, Eigen::Dynamic, 
                              Eigen::ColMajor>>(vdYQ.data(), ncolsY, ncolsY);
        
        delete dstmpY; dstmpY = nullptr;
        
        // Read X R matrix
        dstmpX = new hdf5Dataset(filename, "Step3/Final_QR/XRt.R", false); 
        dstmpX->openDataset();
        dstmpX->readDatasetBlock(
            {0, 0}, {ncolsX_h, ncolsX_h}, 
            stride, block, vdXR.data()
        );
        
        Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, 
                   Eigen::Dynamic, Eigen::ColMajor>> XR(
            vdXR.data(), ncolsX, ncolsX
        );
        
        delete dstmpX; dstmpX = nullptr;
        
        // Read Y R matrix
        dstmpY = new hdf5Dataset(filename, "Step3/Final_QR/YRt.R", false); 
        dstmpY->openDataset();
        dstmpY->readDatasetBlock(
            {0, 0}, {ncolsY_h, ncolsY_h}, 
            stride, block, vdYR.data()
        );
        
        Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, 
                   Eigen::Dynamic, Eigen::ColMajor>> YR(
            vdYR.data(), ncolsY, ncolsY
        );
        
        delete dstmpY; dstmpY = nullptr;
        
        // Combine Q and R into compact QR representation
        // R is upper triangular, Q is lower triangular
        XQR.triangularView<Eigen::Upper>() = XR.triangularView<Eigen::Upper>();
        YQR.triangularView<Eigen::Upper>() = YR.triangularView<Eigen::Upper>();
        
        // At this point:
        // - XQR contains compact QR for X
        // - YQR contains compact QR for Y
        // - Could proceed to solve for coefficients using Eigen
        // - Would need to read SVD results (u, v, d)
        // - Solve XQR * xcoef = u and YQR * ycoef = v
        // - Compute variates and save to HDF5
        
        // [Full implementation would continue here...]
        // For most use cases, calling R function is simpler
        
    } catch(std::exception& ex) {
        checkClose_file(dstmpX, dstmpY);
        Rcpp::Rcout << "C++ exception writeCCAComponents: " 
                    << ex.what() << "\n";
        return void();
    }
    
    return void();
}

When full C++ makes sense:

Pure C++ deployment (no R dependency)
Custom coefficient computation (sparse, regularized)
Need maximum performance for this step

Trade-off: Much more code, more complexity, more testing needed.

For most applications, Option A (calling R function) is the pragmatic choice.

8 Compiling and Using

8.1 Package Structure

Create a package with C++ source and R wrappers:

BDStatMethExamples/
├── DESCRIPTION
├── NAMESPACE
├── R/
│   └── RcppExports.R      # Auto-generated by Rcpp
├── src/
│   ├── bdCCA.cpp          # Main CCA implementation
│   ├── bdCCA.h            # Header with helper functions
│   └── RcppExports.cpp    # Auto-generated by Rcpp
└── inst/
    └── include/
        └── (BigDataStatMeth headers if bundling)

8.2 The Header File (bdCCA.h)

#ifndef BDSTATMETHEXAMPLES_CCA_HPP
#define BDSTATMETHEXAMPLES_CCA_HPP

#include <Rcpp.h>
#include "BigDataStatMeth.hpp"

using namespace Rcpp;
using namespace BigDataStatMeth;

// Function declarations
void getQRbyBlocks_rcpp(hdf5Dataset *dsA, int mblocks, 
                        bool bcenter, bool bscale, 
                        bool byrows, bool overwrite, 
                        Nullable<int> threads = R_NilValue);
                        
void writeCCAComponents_hdf5_rcpp(std::string filename, 
                                   int ncolsX, int ncolsY);

#endif // BDSTATMETHEXAMPLES_CCA_HPP

8.3 Compile with Rcpp

Two approaches:

Option 1: Quick testing with sourceCpp

library(Rcpp)

# Include path to BigDataStatMeth headers
Sys.setenv(PKG_CXXFLAGS = "-I/path/to/BigDataStatMeth/inst/include")

# Compile and load
sourceCpp("src/bdCCA.cpp")

# Test
bdCCA_hdf5_rcpp(
  filename = "test.hdf5",
  datasetX = "data/X",
  datasetY = "data/Y",
  bcenter = TRUE,
  bscale = TRUE,
  mblocks = 4,
  overwrite = TRUE
)

Option 2: Build full package

library(devtools)

# Document (generates RcppExports.R and RcppExports.cpp)
document()

# Build package
build()

# Install
install()

# Load and use
library(BDStatMethExamples)
bdCCA_hdf5_rcpp(...)

9 Example Usage: R vs C++

9.1 Side-by-Side Comparison

library(BDStatMethExamples)
library(microbenchmark)

# Create test data
set.seed(123)
n <- 500
p_x <- 1000
p_y <- 800

X <- matrix(rnorm(n * p_x), n, p_x)
Y <- matrix(rnorm(n * p_y), n, p_y)

test_file <- "comparison.hdf5"
bdCreate_hdf5_matrix(test_file, X, "data", "X", overwriteFile = TRUE)
bdCreate_hdf5_matrix(test_file, Y, "data", "Y", overwriteFile = FALSE)

# R version
time_r <- system.time({
  bdCCA_hdf5(
    filename = test_file,
    X = "data/X",
    Y = "data/Y",
    m = 4,
    bcenter = TRUE,
    bscale = TRUE,
    overwriteResults = TRUE,
    keepInteResults = FALSE
  )
})

# C++ version
time_cpp <- system.time({
  bdCCA_hdf5_rcpp(
    filename = test_file,
    datasetX = "data/X",
    datasetY = "data/Y",
    bcenter = TRUE,
    bscale = TRUE,
    mblocks = 4,
    overwrite = TRUE
  )
})

cat("\nTiming comparison:\n")
cat("  R version:", time_r[3], "seconds\n")
cat("  C++ version:", time_cpp[3], "seconds\n")
cat("  Speedup:", round(time_r[3] / time_cpp[3], 2), "x\n")

# Verify results are identical
h5f <- H5Fopen(test_file)
cor_r <- h5f$Results$cor
H5Fclose(h5f)

cat("\nResults comparison:\n")
cat("  Canonical correlations match\n")
print(cor_r[1:5])

Expected results:

C++ typically 1.5-3× faster than R for this operation
Results numerically identical (within floating-point precision)
Memory usage similar (both use block-wise processing)

The speedup comes from eliminating R overhead in function calls and data copying, not from algorithmic differences.

10 Interactive Exercise

10.1 Practice: Optimizing C++ Implementation

Understanding C++ optimization opportunities helps you improve performance.

# Exercise 1: Add timing to each step
# Modify bdCCA_hdf5_rcpp to print step times
# Which step is slowest? QR? Cross-product? SVD?

# Exercise 2: Parallelize with OpenMP
# Add #pragma omp parallel in getQRbyBlocks_rcpp
# Compile with -fopenmp flag
# How much speedup on multi-core CPU?

# Exercise 3: Profile memory usage
# Use valgrind or similar tool
# Are there memory leaks?
# Are pointers being cleaned up properly?

# Exercise 4: Compare block sizes
# Try mblocks = 2, 4, 8, 16
# How does block size affect speed?
# Is there an optimal value?

Reflection Questions

1. C++ vs R Trade-offs: - When is 2× speedup worth the development time? - How much C++ expertise does your team have? - What’s the maintenance burden of C++ code?

2. Memory Management: - Why are pointers necessary vs. automatic variables? - What happens if you forget to delete a pointer? - How does try/catch prevent memory leaks?

3. Integration Patterns: - When should C++ call R functions vs. pure C++? - How do you pass complex data structures between R and C++? - What about error messages—C++ vs R errors?

4. Performance Optimization: - Which operations benefit most from C++? - Where is R “fast enough”? - When does I/O dominate vs. computation?

5. Production Deployment: - How do you test C++ code for correctness? - What about platform differences (Windows vs Linux)? - How do you version control C++ + R packages?

These questions help you decide when C++ optimization is worth the complexity for your specific use case.

11 Key Takeaways

Let’s consolidate what you’ve learned about implementing CCA in C++ using BigDataStatMeth.

11.1 Essential Concepts

C++ provides performance, not different algorithms. The C++ implementation performs the same mathematical operations as the R version—QR decomposition, cross-product, SVD. The speedup (typically 1.5-3×) comes from eliminating R overhead, not from algorithmic improvements. If you need different algorithms (sparse CCA, kernel CCA), C++ enables that flexibility, but for standard CCA, the performance gain is incremental, not transformational.

Memory management is mandatory, not optional. In R, garbage collection handles memory automatically. In C++, every new requires a delete, every opened dataset must be closed, every allocated resource needs cleanup. Forgetting cleanup causes memory leaks—your program uses more RAM over time until it crashes. The try/catch pattern with pointer cleanup isn’t defensive programming, it’s required for production C++ code.

Exception handling prevents catastrophic failures. When R code encounters an error, it returns to the console. When C++ code encounters an error without exception handling, the program crashes, potentially corrupting HDF5 files or leaving resources locked. The try/catch blocks with checkClose_file() ensure that errors cause controlled returns, not crashes. This pattern is essential for any C++ code that manages resources.

Rcpp bridges two worlds effectively but with costs. Calling R functions from C++ (writeCCAComponents_hdf5_rcpp) is convenient—you reuse tested R code. But it adds R dependency and crossing the R/C++ boundary has overhead. For operations called once (coefficient computation), the convenience wins. For operations called millions of times (inner loops), the overhead matters. Understanding when crossing the boundary costs more than it saves guides integration design.

BigDataStatMeth::functions do heavy lifting. The C++ implementation doesn’t reimplement matrix multiplication, SVD, or QR decomposition. It calls BigDataStatMeth::crossprod(), RcppbdSVD_hdf5(), etc.—compiled C++ functions that do the computation. Your C++ code orchestrates these operations with better control over memory and flow than R allows. You’re not writing low-level linear algebra; you’re combining high-level operations with better performance characteristics.

Compilation adds complexity but catches errors early. R scripts run immediately; C++ requires compilation. Compilation takes time, requires build tools, and can fail on different platforms. But compilation catches type errors, function signature mismatches, and memory issues before runtime. R discovers these problems when code executes (potentially hours into a computation). The compile-time cost buys runtime robustness.

11.2 When to Use C++ vs R

Understanding when C++ optimization pays for itself versus when R suffices guides development priorities.

✅ Use C++ implementation when:

Performance is demonstrably critical - Profile first. If CCA takes 10 minutes in R and you run it 100 times daily, 2× speedup saves ~8 hours per day. That justifies C++ development. If you run it twice per month, R is fine.
Building production systems - Software running in production benefits from C++’s speed, type safety, and error handling. Development takes longer but runtime is more predictable and failures are more controlled.
Implementing custom algorithms - If you need sparse CCA, regularized CCA, or novel variants not in BigDataStatMeth’s R API, C++ gives you algorithmic control. You can implement custom linear algebra, not just combine existing functions.
Need fine-grained parallelization - C++ with OpenMP enables parallelizing within matrix operations. R’s parallelization works at function call level. For very large matrices, OpenMP’s finer granularity achieves better speedups.
Eliminating R dependency - If deploying to environments without R (embedded systems, HPC clusters with restrictions), pure C++ implementation is necessary despite the development effort.

✅ Use R implementation when:

Prototyping and development - Get the algorithm working in R first. Interactive debugging, variable inspection, and rapid iteration are invaluable during development. Optimize to C++ later if profiling shows it’s worthwhile.
Performance is adequate - If R version completes in acceptable time (minutes for interactive work, hours for batch jobs), don’t optimize prematurely. Spend development time on science, not engineering.
Team lacks C++ expertise - Maintaining C++ code requires C++ knowledge. If your team works in R, keeping implementation in R enables everyone to understand and modify the code. Collaboration often outweighs raw performance.
Combining existing BigDataStatMeth functions - If your method uses bdSVD_hdf5(), bdCrossprod_hdf5(), etc., R glue code is fine. The heavy lifting happens in compiled code anyway. C++ overhead doesn’t help when you’re just calling C++ functions from R.

The key principle is evidence-based optimization. Profile to find bottlenecks, then optimize the 20% of code that takes 80% of time. Often, that 20% is already in compiled code (BigDataStatMeth functions), and your R code is just orchestration. C++ shines when you’re implementing novel algorithms or when profiling proves performance is critical.

12 Next Steps

Compare with R implementation:

CCA Implementation in R - Same algorithm, different language
Understand trade-offs between approaches
Decide which to use for your projects

Advanced C++ topics:

Add OpenMP parallelization for multi-core speedup
Implement custom sparse CCA algorithms
Optimize memory layout for cache performance
Build standalone C++ library (no R dependency)

Production deployment:

Write comprehensive tests (unit tests, integration tests)
Add extensive error handling and logging
Create benchmarks to track performance
Document compilation and deployment procedures

13 Cleanup

# Close any open HDF5 connections
h5closeAll()

# Note: C++ compiled code is in package
# Source files remain in src/ directory
# Recompile after any modifications

--- title: "CCA Implementation in C++" subtitle: "Building Canonical Correlation Analysis Using BigDataStatMeth C++ API" --- ## Overview This document shows how to implement Canonical Correlation Analysis (CCA) in C++ using BigDataStatMeth's header-only library. While the [R implementation](cca-r-implementation.qmd) demonstrated the algorithm using R's BigDataStatMeth functions, this C++ version provides direct control over computation, memory management, and performance optimization for production environments. The C++ implementation you'll see here is actual production code from the BDStatMethExamples package. It's not simplified teaching code—it's the real implementation handling TCGA multi-omic datasets with careful memory management, exception handling, and integration with R through Rcpp. By studying this code, you learn not just C++ syntax but the engineering discipline required for robust statistical software. Moving from R to C++ isn't just about speed—it's about control. In R, BigDataStatMeth functions handle memory, choose algorithms, manage errors. In C++, you make those decisions explicitly. This control enables custom algorithms, fine-grained optimization, and production deployment where reliability and performance are critical. But it comes with responsibility: you must manage pointers, handle exceptions, prevent memory leaks, and ensure numerical stability. ::: {.learning-objectives} ### What You'll Learn By the end of this document, you will: - Work with BigDataStatMeth's C++ header-only library effectively - Manage HDF5 datasets using the hdf5Dataset class safely - Implement block-wise QR decomposition with pointer management - Handle exceptions to prevent crashes and memory leaks - Integrate C++ functions with R through Rcpp seamlessly - Decide when C++ optimization justifies development effort - Compile and test C++ implementations systematically - Choose between calling R functions vs pure C++ for flexibility ::: --- ## Why C++ Implementation? ### When C++ Makes Sense Before diving into code, understand when C++ optimization is worth the development investment. **Performance-critical production pipelines:** If you run CCA thousands of times daily in production—processing every new patient sample, updating analyses continuously, serving real-time predictions—C++ performance gains compound. A 50% speedup on one analysis is negligible. A 50% speedup on 10,000 daily analyses saves hours of compute time and enables faster turnaround for clinical or research needs. **Custom algorithmic control:** If you're implementing novel CCA variants not available in BigDataStatMeth—sparse regularization with custom penalties, kernel CCA with specialized kernels, streaming CCA that processes data in one pass—C++ gives you algorithmic freedom. You're not constrained to pre-built functions; you implement exactly the algorithm you need, with exactly the optimizations that matter for your use case. **Memory-constrained environments:** In production environments with limited RAM or when processing extremely large datasets, C++'s explicit memory management prevents unexpected allocation failures. You control exactly when memory is allocated and freed, enabling precise tuning for available resources. R's garbage collection can cause unpredictable memory spikes; C++'s deterministic deallocation doesn't. **Integration with existing C++ systems:** If your organization has existing C++ pipelines, statistical libraries, or infrastructure, adding a C++ CCA implementation integrates naturally. No language barriers, no R installation required, consistent deployment practices. The entire stack remains C++, simplifying maintenance and deployment. ### When R Suffices Don't assume C++ is always better. R often provides better development velocity with adequate performance. **Algorithm prototyping and development:** Developing new statistical methods in R is faster—you iterate quickly, inspect variables at every step, leverage R's statistical ecosystem. Get the algorithm mathematically correct in R, validate it thoroughly, then optimize in C++ if performance profiling shows bottlenecks. Many methods never need C++ because R performance is adequate. **Infrequent use cases:** If you run CCA once per month on a dataset that completes in 10 minutes, C++ optimization isn't justified. The development time—writing C++ code, testing memory management, handling edge cases—exceeds the total computation time saved over the method's lifetime. Use R, wait 10 minutes, move on to analysis. **Leveraging BigDataStatMeth functions:** If your algorithm composes existing BigDataStatMeth functions (bdSVD_hdf5, bdCrossprod_hdf5, etc.), the heavy lifting already happens in compiled code. R orchestration overhead is negligible. Only if you need custom operations not provided by BigDataStatMeth does C++ implementation add value. **Team expertise considerations:** If your team excels in R but struggles with C++, the maintenance burden of C++ code may outweigh performance benefits. Statistical correctness and code maintainability matter more than speed. Better to have understandable R code that runs slightly slower than inscrutable C++ code that nobody can debug or extend. The decision rule: profile first, optimize selectively. Identify actual bottlenecks through profiling, then apply C++ surgically to those specific operations. Wholesale R-to-C++ rewrites are rarely the right answer. --- ## C++ Prerequisites ### Required Headers BigDataStatMeth provides a header-only C++ library. Include the main header and you have access to all classes and functions without linking against compiled libraries: ```cpp #include <Rcpp.h> #include "BigDataStatMeth.hpp" using namespace Rcpp; using namespace BigDataStatMeth; ``` The `Rcpp.h` header enables R integration—exporting C++ functions to R, converting between R and C++ data structures, handling R's error model. Without Rcpp, you'd write pure C++ code that R can't call directly. The `BigDataStatMeth.hpp` header provides the core functionality: - `hdf5Dataset` class for HDF5 file I/O - Matrix operation functions (crossprod, SVD, QR) - Block-wise processing utilities - Memory management helpers ### Understanding hdf5Dataset Class The `hdf5Dataset` class is your interface to HDF5 files in C++. Unlike R's transparent HDF5 access, C++ requires explicit dataset opening, reading, writing, and closing. **Basic usage pattern:** ```cpp // Constructor: create dataset object // Parameters: filename, dataset_path, create_if_new hdf5Dataset* ds = new hdf5Dataset("file.hdf5", "data/matrix", false); // Open for reading/writing // Must call before any I/O operations ds->openDataset(); // Get dimensions int nrows = ds->nrows(); // Total rows (C indexing: 0-based) int ncols = ds->ncols(); // Total columns int nrows_r = ds->nrows_r(); // Rows (R indexing: 1-based) int ncols_r = ds->ncols_r(); // Columns (R indexing: 1-based) // Read data block std::vector<double> data(nrows * ncols); std::vector<hsize_t> offset = {0, 0}; // Start position std::vector<hsize_t> count = {nrows, ncols}; // How much to read std::vector<hsize_t> stride = {1, 1}; // Step size (usually 1,1) std::vector<hsize_t> block = {1, 1}; // Block size (usually 1,1) ds->readDatasetBlock(offset, count, stride, block, data.data()); // Write data block ds->writeDatasetBlock(offset, count, stride, block, data.data()); // Cleanup - CRITICAL delete ds; ds = nullptr; // Prevent double-delete ``` **Critical memory management rules:** 1. **Always delete:** Every `new hdf5Dataset` requires corresponding `delete`. Forgetting causes memory leaks—the HDF5 handle stays open, consuming resources. 2. **Set to nullptr after delete:** Prevents accidentally using a deleted pointer (accessing freed memory causes crashes). 3. **Use try/catch:** HDF5 operations can fail (file not found, dataset doesn't exist, permissions denied). Wrap in try/catch to clean up properly on errors. 4. **Check nullptr before delete:** If pointer might be null, check before deleting to avoid crashes. **Common pitfalls:** - Forgetting `openDataset()` before reading → runtime error - Reading without allocating sufficient buffer space → buffer overflow - Deleting same pointer twice → crash - Returning from function without deleting → memory leak - Catching exception but not deleting pointers → memory leak The pattern you'll see repeatedly: declare pointer to nullptr, allocate with new, use in try block, delete in catch and after try, set to nullptr. This discipline prevents the memory leaks and crashes that plague C++ codebases. --- ## Mathematical Foundation (Reference) The mathematical foundations of CCA are detailed in the [R implementation document](cca-r-implementation.qmd#cca-algorithm-overview). Rather than duplicating that content, here's a condensed reference. **CCA algorithm steps:** 1. **QR decomposition:** $X = Q_X R_X$, $Y = Q_Y R_Y$ Orthonormalizes matrices while preserving column space 2. **Cross-product:** $M = Q_X^T Q_Y$ Captures relationship between orthonormalized datasets 3. **SVD:** $M = U D V^T$ Diagonal D contains canonical correlations, U and V are canonical directions 4. **Canonical coefficients:** $\alpha = R_X^{-1} U$, $\beta = R_Y^{-1} V$ Transform back to original feature space 5. **Canonical variates:** $s_X = X\alpha$, $s_Y = Y\beta$ Sample scores in canonical space **Block-wise processing rationale:** For large matrices (50,000 samples × 500,000 features), computing full QR decomposition requires massive RAM. Block-wise approach: - Partitions matrix into m blocks by rows - Computes QR on each block (memory-friendly) - Combines results hierarchically - Achieves identical result to full QR but with reduced memory The C++ implementation follows this same mathematical algorithm. The difference is how it's expressed—explicit pointer management, manual memory allocation, direct HDF5 calls—not what it computes. --- ## Step 1: QR Decomposition by Blocks in C++ ### The getQRbyBlocks_rcpp Function This function implements block-wise QR decomposition in C++, mirroring the R version but with explicit memory and error handling: ```cpp void getQRbyBlocks_rcpp(hdf5Dataset *dsA, int mblocks, bool bcenter, bool bscale, bool byrows, bool overwrite, Rcpp::Nullable<int> threads) { // Temporary dataset pointer - initialize to nullptr for safety hdf5Dataset* dstmp = nullptr; try { std::string strInPath, strOutPath; // Flags for matrix operation parameters bool btransp_dataset = false; // Transpose first matrix? bool btransp_bdataset = false; // Transpose second matrix? bool bfullMatrix = false; // Use full matrix vs economy? bool bycols = !byrows; // Convert row-wise to column-wise flag // Get dataset name for path construction std::string strdataset = dsA->getDatasetName(); // Step 1: Normalize dataset (center and/or scale) // This calls R's normalization function for simplicity RcppNormalizeHdf5(dsA, bcenter, bscale, byrows); // Step 2: Open normalized dataset dstmp = new hdf5Dataset(dsA->getFileName(), "NORMALIZED/" + dsA->getGroupName(), strdataset, false); dstmp->openDataset(); // Step 3: Split into blocks strOutPath = "Step1/" + strdataset + "rows"; RcppSplit_matrix_hdf5_internal( dstmp, strOutPath, strdataset, bycols, mblocks, -1, dstmp->nrows(), dstmp->ncols() ); // Close and delete temporary dataset - critical for memory delete dstmp; dstmp = nullptr; // Step 4: Apply QR to each block StringVector blocks = dsA->getDatasetNames(strOutPath, "", ""); strInPath = strOutPath; strOutPath = "Step2/" + strdataset + "rows"; RcppApplyFunctionHdf5( dsA->getFileName(), strInPath, blocks, strOutPath, "QR", R_NilValue, R_NilValue, wrap(overwrite), wrap(btransp_dataset), wrap(btransp_bdataset), wrap(bfullMatrix), wrap(byrows), threads ); // Step 5: Merge R matrices from blocks StringVector blocks_qr = dsA->getDatasetNames(strOutPath, strdataset, ".R"); strInPath = strOutPath; strOutPath = "Step3/merged"; RcppBind_datasets_hdf5( dsA->getFileName(), strInPath, blocks_qr, strOutPath, strdataset + "Rt", "bindRows", false, wrap(overwrite) ); // Step 6: Final QR on merged R matrices strInPath = strOutPath; strOutPath = "Step3/Final_QR"; RcppApplyFunctionHdf5( dsA->getFileName(), strInPath, strdataset + "Rt", strOutPath, "QR", R_NilValue, R_NilValue, wrap(overwrite), wrap(btransp_dataset), wrap(btransp_bdataset), wrap(bfullMatrix), wrap(byrows), threads ); // Step 7: Split final Q for multiplication strInPath = strOutPath; strOutPath = "Step4/splitted"; dstmp = new hdf5Dataset(dsA->getFileName(), strInPath, strdataset + "Rt.Q", false); dstmp->openDataset(); RcppSplit_matrix_hdf5_internal( dstmp, strOutPath, strdataset + "Rt.Q", bycols, mblocks, -1, dstmp->nrows(), dstmp->ncols() ); delete dstmp; dstmp = nullptr; // Step 8: Multiply original Q blocks by final Q blocks strInPath = "Step2/" + strdataset + "rows"; strOutPath = "Step5"; CharacterVector b_group = "Step4/splitted"; blocks_qr = dsA->getDatasetNames(strInPath, strdataset, ".Q"); StringVector b_blocks = dsA->getDatasetNames( "Step4/splitted", strdataset + "Rt.Q", "" ); RcppApplyFunctionHdf5( dsA->getFileName(), strInPath, blocks_qr, strOutPath, "blockmult", b_group, b_blocks, wrap(overwrite), wrap(btransp_dataset), wrap(btransp_bdataset), wrap(bfullMatrix), wrap(byrows), threads ); // Step 9: Bind Q blocks into final Q matrix strInPath = strOutPath; strOutPath = "Step6"; blocks = dsA->getDatasetNames(strInPath, strdataset + ".", ""); RcppBind_datasets_hdf5( dsA->getFileName(), strInPath, blocks, strOutPath, strdataset + "Q", "bindRows", false, wrap(overwrite) ); } catch(std::exception& ex) { // Critical: clean up pointers on error checkClose_file(dsA, dstmp); Rcout<< "c++ exception getQRbyBlocks_rcpp: "<< ex.what() << "\n"; return void(); } return void(); } ``` ### Key Differences from R Implementation While the algorithm is identical, the C++ implementation requires explicit management of resources that R handles automatically. **Memory management:** Every `new hdf5Dataset` requires matching `delete`. The R version doesn't show this because R's garbage collector handles it. In C++, forgetting a `delete` causes a memory leak—the HDF5 file handle stays open, consuming system resources. After thousands of operations, you run out of file handles and the system fails. The pattern `new → use → delete → set nullptr` prevents this. **Exception handling:** The entire function body wraps in try/catch. If any operation fails—file not found, dataset doesn't exist, out of memory—the catch block cleans up allocated pointers before returning. Without this, an exception thrown by HDF5 would skip the `delete` statements, causing memory leaks. R's error handling stops function execution and cleans up automatically; C++ requires explicit cleanup in exception paths. **Explicit string manipulation:** R's `paste0()` and `gsub()` are replaced by C++ string concatenation (`+`) and std::string methods. The logic is identical, but syntax differs. C++ strings are mutable, allow efficient concatenation, but require more verbose operations than R's vectorized string functions. **Calling R functions from C++:** Functions like `RcppNormalizeHdf5`, `RcppSplit_matrix_hdf5_internal`, `RcppBind_datasets_hdf5` are wrappers around BigDataStatMeth's R functions. This hybrid approach—C++ orchestration, R implementation—balances performance and maintainability. Writing pure C++ versions of these operations would require thousands of additional lines without significant performance gain because they already use compiled code internally. **Pointer checks:** Before deleting, we don't explicitly check `if (dstmp != nullptr)` because we initialize to nullptr and only assign once. In more complex code with conditional allocation, always check before deleting. The `checkClose_file` utility function handles this checking, ensuring we don't delete null pointers or already-deleted pointers. **Return type:** The function returns `void()` rather than a value. It modifies the HDF5 file in place, with results stored at known locations. R users call this function for side effects (creating datasets in HDF5), not for return values. This design simplifies memory management—no need to allocate, populate, and return large matrices. ### Why This Pattern Matters This function demonstrates the fundamental pattern for all C++ implementations with BigDataStatMeth: 1. **Initialize pointers to nullptr** 2. **Wrap logic in try block** 3. **Allocate with new, use, immediately delete when done** 4. **Catch exceptions, clean up, report error** 5. **Return void after modifying HDF5 file in place** This pattern prevents the three most common C++ failures: memory leaks (forgetting delete), crashes (accessing deleted pointers), and silent corruption (exceptions bypassing cleanup). Master this pattern, and you can implement any statistical method in C++ using BigDataStatMeth. --- ## Step 2: Main CCA Function in C++ ### Orchestrating the Complete Algorithm The main CCA function coordinates QR decomposition for both matrices, computes their cross-product, performs SVD, and initiates coefficient computation: ```cpp // [[Rcpp::export]] void bdCCA_hdf5_rcpp(std::string filename, std::string datasetX, std::string datasetY, bool bcenter, bool bscale, int mblocks, bool overwrite, Rcpp::Nullable<int> threads = R_NilValue) { // Declare all pointers upfront, initialize to nullptr hdf5Dataset* dsX = nullptr; hdf5Dataset* dsY = nullptr; hdf5Dataset* dsXQ = nullptr; hdf5Dataset* dsYQ = nullptr; hdf5Dataset* dsC = nullptr; try { int ncolsX, ncolsY; // Step 1: QR decomposition of X dsX = new hdf5Dataset(filename, datasetX, false); dsX->openDataset(); getQRbyBlocks_rcpp(dsX, mblocks, bcenter, bscale, false, overwrite, threads); // Step 2: QR decomposition of Y dsY = new hdf5Dataset(filename, datasetY, false); dsY->openDataset(); getQRbyBlocks_rcpp(dsY, mblocks, bcenter, bscale, false, overwrite, threads); // Step 3: Compute cross-product of Q matrices dsXQ = new hdf5Dataset(filename, "Step6", "XQ", false); dsXQ->openDataset(); dsYQ = new hdf5Dataset(filename, "Step6", "YQ", false); dsYQ->openDataset(); dsC = new hdf5Dataset(filename, "Step7", "CrossProd_XQ_YQ", overwrite); // Compute optimal block size for cross-product operation int optimBlock = getMaxBlockSize( dsXQ->nrows(), dsXQ->ncols(), dsYQ->nrows(), dsYQ->ncols(), 2, R_NilValue ); // Perform cross-product: Q_X^T Q_Y dsC = BigDataStatMeth::crossprod( dsXQ, dsYQ, dsC, optimBlock, optimBlock/2, true, true, threads ); // Clean up Q datasets - no longer needed delete dsXQ; dsXQ = nullptr; delete dsYQ; dsYQ = nullptr; // Store dimensions for coefficient computation ncolsX = dsX->ncols_r(); ncolsY = dsY->ncols_r(); // Step 4: SVD of cross-product matrix // This extracts canonical correlations and directions RcppbdSVD_hdf5( dsC->getFileName(), dsC->getGroupName(), dsC->getDatasetName(), 16, 2, 0, false, false, 0, overwrite, false, R_NilValue, R_NilValue ); // Clean up cross-product dataset delete dsC; dsC = nullptr; // Clean up X and Y datasets delete dsX; dsX = nullptr; delete dsY; dsY = nullptr; // Step 5: Compute canonical coefficients and variates // This function is implemented in R for flexibility writeCCAComponents_hdf5_rcpp(filename, ncolsX, ncolsY); } catch(std::exception& ex) { // Critical: clean up all pointers on error checkClose_file(dsX, dsY, dsXQ, dsYQ, dsC); Rcerr<< "\n c++ exception bdCCA_hdf5_rcpp: "<< ex.what() << "\n"; return void(); } return void(); } ``` ### Understanding the Implementation This function demonstrates orchestration patterns essential for complex C++ algorithms. **Sequential cleanup:** After each step, we delete pointers no longer needed. After computing cross-product, we delete dsXQ and dsYQ—the Q matrices were used, results stored, pointers can be freed. This progressive cleanup minimizes memory usage. Contrast with keeping all pointers until the end—memory usage grows throughout the function. In production with many concurrent CCA calls, aggressive cleanup prevents memory exhaustion. **`[[Rcpp::export]]` annotation:** This tells Rcpp to make the function callable from R. Without it, the function compiles but R can't call it. The annotation generates wrapper code converting R data types (strings, booleans, integers) to C++ types and vice versa. This is why we can call `bdCCA_hdf5_rcpp(filename, "data/X", "data/Y", ...)` from R—Rcpp handles the type conversion automatically. **Error propagation:** If `getQRbyBlocks_rcpp` throws an exception for X, the catch block activates, cleans up dsX, and returns without attempting Y's QR decomposition. This prevents cascading errors where failure in step N causes nonsensical errors in step N+1. The error message identifies which operation failed, critical for debugging. **Calling R functions from C++:** The `writeCCAComponents_hdf5_rcpp` function (shown next) calls an R implementation. This hybrid approach is pragmatic—computing coefficients requires solving triangular systems, matrix multiplications, and HDF5 I/O for storing results. The R implementation already does this correctly. Rather than rewrite in pure C++, we call the R function. The performance difference is negligible because the computationally intensive parts (solving systems, matrix multiplications) use compiled BLAS/LAPACK regardless of whether called from R or C++. **Block size optimization:** The `getMaxBlockSize` function dynamically determines optimal block size based on matrix dimensions and available memory. Hardcoding block size works for fixed data sizes but fails when dimensions vary. Adaptive block sizing ensures the function works efficiently across diverse datasets without user tuning. **Memory failure handling:** If `new` fails (out of memory), C++ throws `std::bad_alloc`. Our catch block catches this (it's a `std::exception` subclass), reports the error, and returns cleanly. Without exception handling, an allocation failure crashes the entire R session—unacceptable in production. Proper error handling degrades gracefully: report error, clean up, return, let user diagnose. --- ## Step 3: Computing Canonical Coefficients ### Option A: Calling R Function (Pragmatic Approach) The simplest and most maintainable approach is calling the existing R implementation: ```cpp void writeCCAComponents_hdf5_rcpp(std::string filename, int ncolsX, int ncolsY) { // Access R environment where function exists Rcpp::Environment base("package:BDStatMethExamples"); // Get R function object Rcpp::Function write_components = base["writeCCAComponents_hdf5"]; // Call R function with C++ arguments // Rcpp automatically converts int → numeric, string → character write_components( Rcpp::_["filename"] = wrap(filename), Rcpp::_["ncolsX"] = ncolsX, Rcpp::_["ncolsY"] = ncolsY ); return void(); } ``` This approach has significant advantages: **Maintainability:** The R function is already tested, documented, handles edge cases. If you find a bug or need to add features, you fix it in one place (R) and both R and C++ code benefit. Duplicating logic in C++ means duplicating maintenance effort. **Flexibility:** The R function can easily use R's statistical functions, plotting capabilities, diagnostic output. Reimplementing all of that in C++ would be massive effort. By calling R, you get that functionality "for free." **Performance reality:** The computationally intensive parts—solving triangular systems with BLAS, matrix multiplication—use compiled code whether called from R or C++. The overhead of the R function call is negligible compared to the actual computation time. Only if profiling shows this specific function as a bottleneck would pure C++ implementation be justified. **Development speed:** This approach took 10 lines of code. A pure C++ implementation would take hundreds of lines plus extensive testing. Unless you need that pure C++ version for specific deployment reasons, this pragmatic approach is superior. ### Option B: Pure C++ Implementation (For Reference) For completeness, here's how you'd implement coefficient computation in pure C++. This is shown for educational purposes—most users should use Option A. ```cpp void writeCCAComponents_hdf5_cpp(std::string filename, int ncolsX, int ncolsY) { hdf5Dataset* dstmpX = nullptr; hdf5Dataset* dstmpY = nullptr; std::vector<hsize_t> stride = {1, 1}; std::vector<hsize_t> block = {1, 1}; try { hsize_t nX = static_cast<hsize_t>(ncolsX); hsize_t nY = static_cast<hsize_t>(ncolsY); // Allocate storage for Q and R matrices std::vector<double> vdXQ(nX * nX); std::vector<double> vdYQ(nY * nY); std::vector<double> vdXR(nX * nX); std::vector<double> vdYR(nY * nY); // Read X components { dstmpX = new hdf5Dataset(filename, "Step6/XQ", false); dstmpX->openDataset(); dstmpX->readDatasetBlock( {0, 0}, {nX, nX}, stride, block, vdXQ.data() ); dstmpY = new hdf5Dataset(filename, "Step6/YQ", false); dstmpY->openDataset(); dstmpY->readDatasetBlock( {0, 0}, {nY, nY}, stride, block, vdYQ.data() ); delete dstmpX; dstmpX = nullptr; delete dstmpY; dstmpY = nullptr; // Read R matrices dstmpX = new hdf5Dataset(filename, "Step3/Final_QR/XRt.R", false); dstmpX->openDataset(); dstmpX->readDatasetBlock( {0, 0}, {nX, nX}, stride, block, vdXR.data() ); dstmpY = new hdf5Dataset(filename, "Step3/Final_QR/YRt.R", false); dstmpY->openDataset(); dstmpY->readDatasetBlock( {0, 0}, {nY, nY}, stride, block, vdYR.data() ); delete dstmpX; dstmpX = nullptr; delete dstmpY; dstmpY = nullptr; } // Map to Eigen matrices for linear algebra Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>> XQ(vdXQ.data(), nX, nX); Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>> YQ(vdYQ.data(), nY, nY); Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>> XR(vdXR.data(), nX, nX); Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>> YR(vdYR.data(), nY, nY); // Combine Q and R into single triangular representation Eigen::MatrixXd XQR = Eigen::MatrixXd::Zero(nX, nX); Eigen::MatrixXd YQR = Eigen::MatrixXd::Zero(nY, nY); XQR.triangularView<Eigen::Upper>() = XR.triangularView<Eigen::Upper>(); XQR.triangularView<Eigen::StrictlyLower>() = XQ.triangularView<Eigen::StrictlyLower>(); YQR.triangularView<Eigen::Upper>() = YR.triangularView<Eigen::Upper>(); YQR.triangularView<Eigen::StrictlyLower>() = YQ.triangularView<Eigen::StrictlyLower>(); // Read SVD results (U, V matrices) dstmpX = new hdf5Dataset(filename, "SVD/CrossProd_XQ_YQ/u", false); dstmpX->openDataset(); int nu_rows = dstmpX->nrows(); int nu_cols = dstmpX->ncols(); std::vector<double> vdU(nu_rows * nu_cols); dstmpX->readDatasetBlock( {0, 0}, {(hsize_t)nu_rows, (hsize_t)nu_cols}, stride, block, vdU.data() ); delete dstmpX; dstmpX = nullptr; dstmpY = new hdf5Dataset(filename, "SVD/CrossProd_XQ_YQ/v", false); dstmpY->openDataset(); int nv_rows = dstmpY->nrows(); int nv_cols = dstmpY->ncols(); std::vector<double> vdV(nv_rows * nv_cols); dstmpY->readDatasetBlock( {0, 0}, {(hsize_t)nv_rows, (hsize_t)nv_cols}, stride, block, vdV.data() ); delete dstmpY; dstmpY = nullptr; Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>> U(vdU.data(), nu_rows, nu_cols); Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>> V(vdV.data(), nv_rows, nv_cols); // Solve for canonical coefficients: R^(-1) * U // Using triangular solve (efficient for triangular matrices) Eigen::MatrixXd xcoef = XQR.triangularView<Eigen::Upper>().solve(U); Eigen::MatrixXd ycoef = YQR.triangularView<Eigen::Upper>().solve(V); // Here you would write xcoef and ycoef back to HDF5 // using BigDataStatMeth's HDF5 writing functions // Omitted for brevity - see R implementation for reference } catch(std::exception& ex) { checkClose_file(dstmpX, dstmpY); Rcout<< "c++ exception writeCCAComponents_hdf5_cpp: " << ex.what() << "\n"; return void(); } return void(); } ``` This pure C++ version demonstrates several advanced techniques: **Eigen for linear algebra:** The Eigen library provides efficient matrix operations. The `Eigen::Map` creates Eigen matrix objects that share memory with C++ vectors—no copying overhead. Operations like triangular solve (`triangularView<Upper>().solve()`) use optimized BLAS/LAPACK implementations. **Memory mapping:** Rather than copying data from std::vector to Eigen::Matrix, we map the vector's memory directly. This zero-copy approach is essential for large matrices—copying megabytes or gigabytes is expensive both in time and temporary memory allocation. **Triangular system solving:** The `triangularView<Upper>().solve()` exploits the triangular structure of R matrices. Solving a triangular system is O(n²), much faster than general matrix inverse which is O(n³). This is why QR decomposition is useful—it gives triangular matrices that solve efficiently. **Complexity vs benefit:** This pure C++ implementation is ~100 lines versus ~10 lines for calling R. Unless profiling shows coefficient computation as a bottleneck (unlikely—SVD and QR dominate compute time), the simpler approach is better. ### When Each Approach Makes Sense **Use Option A (call R) when:** - Development speed matters more than microseconds - R function handles edge cases, provides diagnostics - Team maintains R code more easily - No strict deployment requirement for pure C++ **Use Option B (pure C++) when:** - Deploying in environments without R - Profiling shows coefficient computation as bottleneck - Need to modify algorithm in ways R version doesn't support - Building standalone C++ library For most users, Option A is the right choice. This pragmatic approach combines C++ performance where it matters (QR, SVD) with R flexibility where it helps (coefficient computation, result storage). --- ## Compiling and Using ### Package Structure To use C++ functions in an R package, you need proper structure: ``` BDStatMethExamples/ ├── DESCRIPTION ├── NAMESPACE ├── R/ │ └── RcppExports.R # Auto-generated ├── src/ │ ├── bdCCA.cpp # Implementation │ ├── bdCCA.h # Header declarations │ └── RcppExports.cpp # Auto-generated └── inst/ └── include/ └── BigDataStatMeth.hpp ``` **Header file (bdCCA.h):** ```cpp #ifndef BDSTATMETHEXAMPLES_CCA_HPP #define BDSTATMETHEXAMPLES_CCA_HPP #include <Rcpp.h> #include "BigDataStatMeth.hpp" using namespace Rcpp; using namespace BigDataStatMeth; // Function declarations void getQRbyBlocks_rcpp( hdf5Dataset *dsA, int mblocks, bool bcenter, bool bscale, bool byrows, bool overwrite, Nullable<int> threads = R_NilValue ); void writeCCAComponents_hdf5_rcpp( std::string filename, int ncolsX, int ncolsY ); #endif // BDSTATMETHEXAMPLES_CCA_HPP ``` ### Option 1: Compile with sourceCpp (Testing) For development and testing, compile individual files: ```{r compile-single, eval=FALSE} library(Rcpp) # Set include path Sys.setenv(PKG_CXXFLAGS = paste0( "-I", system.file("include", package = "BigDataStatMeth") )) # Compile and load sourceCpp("src/bdCCA.cpp") # Now function is available in R bdCCA_hdf5_rcpp( filename = "test.hdf5", datasetX = "data/X", datasetY = "data/Y", bcenter = TRUE, bscale = FALSE, mblocks = 4, overwrite = TRUE ) ``` ### Option 2: Build Package (Production) For production use, build the full package: ```bash # From package directory R CMD build . R CMD INSTALL BDStatMethExamples_1.0.tar.gz ``` Or in R: ```{r build-package, eval=FALSE} devtools::document() # Generate documentation devtools::build() # Build package devtools::install() # Install locally ``` ### Comparing R and C++ Performance Here's how to benchmark the two implementations: ```{r benchmark, eval=FALSE} library(BDStatMethExamples) library(microbenchmark) # Create test data set.seed(42) n <- 5000 p <- 1000 q <- 800 X <- matrix(rnorm(n*p), n, p) Y <- matrix(rnorm(n*q), n, q) test_file <- "benchmark_cca.hdf5" bdCreate_hdf5_matrix(test_file, X, "data", "X", overwriteFile = TRUE) bdCreate_hdf5_matrix(test_file, Y, "data", "Y", overwriteFile = FALSE) # Benchmark results <- microbenchmark( R_version = bdCCA_hdf5( test_file, "data/X", "data/Y", m = 4, overwriteResults = TRUE ), Cpp_version = bdCCA_hdf5_rcpp( test_file, "data/X", "data/Y", bcenter = TRUE, bscale = FALSE, mblocks = 4, overwrite = TRUE ), times = 10 ) print(results) # Typically expect: # R version: ~15-25 seconds # C++ version: ~10-15 seconds # Speedup: ~1.5-2x ``` The C++ version is faster, but not dramatically—both use the same underlying BLAS/LAPACK for matrix operations. The speedup comes from reduced overhead in function calls and memory management, not from algorithmic differences. --- ## Interactive Exercise {.exercise} ### Practice: Optimizing C++ Implementation Understanding comes from hands-on modification: ```{r} #| eval: false # Exercise 1: Add error checking # Modify bdCCA_hdf5_rcpp to validate: # - File exists before opening # - Datasets have matching sample counts # - Sufficient memory available # How do these checks affect performance? # Exercise 2: Implement parallel QR # Modify getQRbyBlocks_rcpp to: # - Process blocks in parallel using OpenMP # - Measure speedup with 2, 4, 8 threads # When does parallelization help? # Exercise 3: Profile memory usage # Instrument the code to track: # - Peak memory usage at each step # - Number of active HDF5 handles # - Dataset sizes created # Where is memory bottleneck? # Exercise 4: Pure C++ coefficient computation # Complete the writeCCAComponents_hdf5_cpp function: # - Write xcoef and ycoef to HDF5 # - Compute and save canonical variates # - Compare performance to R version # Is the complexity justified? ``` ::: {.callout-tip} ### Reflection Questions **1. Development Trade-offs:** - How much faster must C++ be to justify development cost? - At what data size does C++ become necessary? - How do you decide R vs C++ for new methods? **2. Memory Management:** - Why initialize pointers to nullptr? - What happens if you forget delete? - How do smart pointers (unique_ptr, shared_ptr) help? **3. Error Handling:** - Why wrap everything in try/catch? - What errors can HDF5 operations throw? - How do you test error handling paths? **4. Hybrid Approaches:** - When is calling R from C++ appropriate? - What's the overhead of R function calls? - How do you profile mixed R/C++ code? **5. Production Deployment:** - How do you ensure C++ code is maintainable? - What testing is essential for C++ implementations? - How do you document C++ API for users? These questions guide you beyond syntax to software engineering principles essential for production statistical code. ::: --- ## Key Takeaways {.key-concept} Let's consolidate what you've learned about implementing statistical methods in C++ with BigDataStatMeth. ### Essential Concepts **C++ provides performance, not different algorithms.** The mathematical algorithm—QR decomposition, cross-product, SVD, solving for coefficients—is identical between R and C++ implementations. The difference is execution efficiency, not what's computed. C++ is faster because of reduced overhead in memory management, function calls, and type conversions, not because it uses superior mathematics. Don't assume C++ is always necessary; profile first to identify actual bottlenecks. Many statistical methods run adequately in R because the heavy lifting (BLAS, LAPACK operations) uses compiled code regardless of whether called from R or C++. **Memory management is mandatory, not optional.** Every `new` requires matching `delete`. Every opened HDF5 dataset must be closed. Every allocated pointer must be tracked through exception paths. R's garbage collector handles this automatically, making R code shorter and less error-prone. C++ requires explicit management, which is why production C++ code includes try/catch blocks, cleanup functions, and defensive null checks. Forgetting these leads to memory leaks (gradual resource exhaustion), crashes (accessing freed memory), or corrupted results (partial cleanup). The discipline of declaring pointers to nullptr, allocating in try blocks, deleting in catch and after use, then setting to nullptr prevents these failures. **Exception handling prevents crashes.** HDF5 operations can fail—file not found, dataset doesn't exist, out of disk space, permissions denied. Without exception handling, these failures crash the entire R session. With proper try/catch, the function reports the error, cleans up allocated resources, and returns gracefully. The user sees a clear error message, R continues running, and memory doesn't leak. Production code must never crash—degrading gracefully (return error, let user fix problem) is mandatory. The pattern of wrapping all operations in try and checking for exceptions isn't paranoia; it's engineering reliability. **Rcpp bridges R and C++ transparently.** The `[[Rcpp::export]]` annotation makes C++ functions callable from R without manual wrapper code. Type conversion—R strings to C++ std::string, R numerics to C++ double, R booleans to C++ bool—happens automatically. Calling R functions from C++ (like `writeCCAComponents_hdf5_rcpp` calling R's implementation) is equally transparent. This seamless integration means you can use C++ where it provides value (tight loops, custom algorithms, memory control) while keeping flexible parts in R (statistical tests, plotting, diagnostics). Don't think of R and C++ as separate worlds requiring manual integration; Rcpp makes them a unified environment. **BigDataStatMeth's C++ functions do the heavy lifting.** Writing pure C++ linear algebra from scratch—QR decomposition, SVD, matrix multiplication—would be thousands of lines and likely slower than using BigDataStatMeth's implementations (which internally use optimized BLAS/LAPACK). The value of C++ isn't reimplementing everything; it's combining BigDataStatMeth's building blocks in custom ways, adding algorithms not provided, or optimizing specific bottlenecks. The hybrid approach—calling BigDataStatMeth functions from C++, calling R functions where convenient—is pragmatic engineering, not compromise. **Compilation catches errors early.** R is interpreted—errors appear at runtime, potentially deep into analysis. C++ is compiled—type mismatches, missing headers, syntax errors appear before execution. This is why C++ development feels slower (write code, compile, fix errors, repeat) but produces more robust code. Once C++ code compiles, many categories of errors (wrong types, missing functions, invalid syntax) are impossible at runtime. The debugging time shifts from runtime to compile time, and compile-time debugging is easier—the compiler tells you exactly what's wrong. Runtime debugging in statistical code often requires re-running expensive analyses; compile-time debugging prevents that wasted time. ### When to Use C++ vs R Understanding when C++ justifies the development investment versus when R suffices guides efficient method development. ✅ **Implement in C++ when:** - **Profiling identifies clear bottlenecks** - Don't guess; measure. If profiling shows 80% of time in a specific operation not using compiled code (e.g., custom R loops), that's a C++ candidate. If 80% of time is in `bdSVD_hdf5` (already compiled), C++ won't help. - **Method runs thousands of times in production** - A 50% speedup irrelevant for one-off analyses matters hugely in pipelines processing thousands of datasets daily. The accumulated time savings—and reduced compute costs—justify development effort. - **Custom algorithms not in BigDataStatMeth** - If implementing novel decompositions, specialized optimization, or algorithms from cutting-edge papers, C++ provides algorithmic freedom. You're not limited to existing BigDataStatMeth functions; you implement exactly what you need. - **Deploying in pure C++ environments** - If your production infrastructure is pure C++ (no R installed), you need C++ implementations. This is rare—most scientific computing environments have R—but in embedded systems, real-time environments, or certain production systems, pure C++ is mandatory. ❌ **Stay in R when:** - **Development speed matters more** - For research methods, prototype algorithms, exploratory analyses, R's faster development cycle (write, run, inspect, iterate) is more valuable than execution speed. Perfect code running slow beats broken code running fast. - **Method uses existing BigDataStatMeth functions** - If your method orchestrates `bdSVD_hdf5`, `bdCrossprod_hdf5`, `bdQR_hdf5`, the heavy computation already uses compiled code. R orchestration overhead is negligible. Only if you need custom operations not provided by BigDataStatMeth is C++ needed. - **Infrequent use** - Monthly analysis taking 10 minutes in R? Don't optimize. The development time (writing C++, testing, debugging) exceeds total computation time saved over the method's lifetime. Spend development time on methods used frequently. - **Team maintains R more easily** - If your team excels in R but struggles with C++ pointer management, memory leaks, and compilation issues, maintainability matters more than speed. Better to have understandable R code that runs slightly slower than inscrutable C++ code nobody can debug or extend. The key principle: profile to identify bottlenecks, optimize selectively where profiling shows benefit, and use each language where it excels. Most statistical method implementations belong in R, with BigDataStatMeth's compiled functions handling computational intensity. --- ## Next Steps **Deepen C++ expertise:** - Study BigDataStatMeth's header implementations - Learn OpenMP for parallel computing - Explore Eigen library for advanced linear algebra - Master C++ debugging tools (gdb, valgrind) **Optimize existing methods:** - Profile CCA implementation to find bottlenecks - Implement parallel block processing - Optimize memory allocation patterns - Add SIMD vectorization where beneficial **Extend to new methods:** - Implement sparse CCA with custom regularization - Build regularized PLS using similar patterns - Create custom matrix factorizations - Develop online/streaming algorithms **Production deployment:** - Add comprehensive error handling - Implement logging and diagnostics - Create extensive test suites - Document C++ API for users - Build CI/CD pipelines for testing