CCA Implementation in C++

Building Canonical Correlation Analysis Using BigDataStatMeth C++ API

WarningDeprecated Functions — Pending Migration to BigDataStatMeth v2.0.1

This page uses functions that have been deprecated following the major restructuring introduced in BigDataStatMeth v2.0.1. The new version replaces the procedural bd* API with a unified S3 interface based on HDF5Matrix objects. All code blocks are shown for reference but are not executed.

An updated version of this page will be published shortly

1 Overview

This document shows how to implement Canonical Correlation Analysis (CCA) in C++ using BigDataStatMeth’s header-only library. While the R implementation demonstrated the algorithm using R’s BigDataStatMeth functions, this C++ version provides direct control over computation, memory management, and performance optimization for production environments.

The C++ implementation you’ll see here is actual production code from the BDStatMethExamples package. It’s not simplified teaching code—it’s the real implementation handling TCGA multi-omic datasets with careful memory management, exception handling, and integration with R through Rcpp. By studying this code, you learn not just C++ syntax but the engineering discipline required for robust statistical software.

Moving from R to C++ isn’t just about speed—it’s about control. In R, BigDataStatMeth functions handle memory, choose algorithms, manage errors. In C++, you make those decisions explicitly. This control enables custom algorithms, fine-grained optimization, and production deployment where reliability and performance are critical. But it comes with responsibility: you must manage pointers, handle exceptions, prevent memory leaks, and ensure numerical stability.

1.1 What You’ll Learn

By the end of this document, you will:

  • Work with BigDataStatMeth’s C++ header-only library effectively
  • Manage HDF5 datasets using the hdf5Dataset class safely
  • Implement block-wise QR decomposition with pointer management
  • Handle exceptions to prevent crashes and memory leaks
  • Integrate C++ functions with R through Rcpp seamlessly
  • Decide when C++ optimization justifies development effort
  • Compile and test C++ implementations systematically
  • Choose between calling R functions vs pure C++ for flexibility

2 Why C++ Implementation?

2.1 When C++ Makes Sense

Before diving into code, understand when C++ optimization is worth the development investment.

Performance-critical production pipelines: If you run CCA thousands of times daily in production—processing every new patient sample, updating analyses continuously, serving real-time predictions—C++ performance gains compound. A 50% speedup on one analysis is negligible. A 50% speedup on 10,000 daily analyses saves hours of compute time and enables faster turnaround for clinical or research needs.

Custom algorithmic control: If you’re implementing novel CCA variants not available in BigDataStatMeth—sparse regularization with custom penalties, kernel CCA with specialized kernels, streaming CCA that processes data in one pass—C++ gives you algorithmic freedom. You’re not constrained to pre-built functions; you implement exactly the algorithm you need, with exactly the optimizations that matter for your use case.

Memory-constrained environments: In production environments with limited RAM or when processing extremely large datasets, C++’s explicit memory management prevents unexpected allocation failures. You control exactly when memory is allocated and freed, enabling precise tuning for available resources. R’s garbage collection can cause unpredictable memory spikes; C++’s deterministic deallocation doesn’t.

Integration with existing C++ systems: If your organization has existing C++ pipelines, statistical libraries, or infrastructure, adding a C++ CCA implementation integrates naturally. No language barriers, no R installation required, consistent deployment practices. The entire stack remains C++, simplifying maintenance and deployment.

2.2 When R Suffices

Don’t assume C++ is always better. R often provides better development velocity with adequate performance.

Algorithm prototyping and development: Developing new statistical methods in R is faster—you iterate quickly, inspect variables at every step, leverage R’s statistical ecosystem. Get the algorithm mathematically correct in R, validate it thoroughly, then optimize in C++ if performance profiling shows bottlenecks. Many methods never need C++ because R performance is adequate.

Infrequent use cases: If you run CCA once per month on a dataset that completes in 10 minutes, C++ optimization isn’t justified. The development time—writing C++ code, testing memory management, handling edge cases—exceeds the total computation time saved over the method’s lifetime. Use R, wait 10 minutes, move on to analysis.

Leveraging BigDataStatMeth functions: If your algorithm composes existing BigDataStatMeth functions (bdSVD_hdf5, bdCrossprod_hdf5, etc.), the heavy lifting already happens in compiled code. R orchestration overhead is negligible. Only if you need custom operations not provided by BigDataStatMeth does C++ implementation add value.

Team expertise considerations: If your team excels in R but struggles with C++, the maintenance burden of C++ code may outweigh performance benefits. Statistical correctness and code maintainability matter more than speed. Better to have understandable R code that runs slightly slower than inscrutable C++ code that nobody can debug or extend.

The decision rule: profile first, optimize selectively. Identify actual bottlenecks through profiling, then apply C++ surgically to those specific operations. Wholesale R-to-C++ rewrites are rarely the right answer.


3 C++ Prerequisites

3.1 Required Headers

BigDataStatMeth provides a header-only C++ library. Include the main header and you have access to all classes and functions without linking against compiled libraries:

#include <Rcpp.h>
#include "BigDataStatMeth.hpp"

using namespace Rcpp;
using namespace BigDataStatMeth;

The Rcpp.h header enables R integration—exporting C++ functions to R, converting between R and C++ data structures, handling R’s error model. Without Rcpp, you’d write pure C++ code that R can’t call directly.

The BigDataStatMeth.hpp header provides the core functionality: - hdf5Dataset class for HDF5 file I/O - Matrix operation functions (crossprod, SVD, QR) - Block-wise processing utilities - Memory management helpers

3.2 Understanding hdf5Dataset Class

The hdf5Dataset class is your interface to HDF5 files in C++. Unlike R’s transparent HDF5 access, C++ requires explicit dataset opening, reading, writing, and closing.

Basic usage pattern:

// Constructor: create dataset object
// Parameters: filename, dataset_path, create_if_new
hdf5Dataset* ds = new hdf5Dataset("file.hdf5", "data/matrix", false);

// Open for reading/writing
// Must call before any I/O operations
ds->openDataset();

// Get dimensions
int nrows = ds->nrows();      // Total rows (C indexing: 0-based)
int ncols = ds->ncols();      // Total columns
int nrows_r = ds->nrows_r();  // Rows (R indexing: 1-based)
int ncols_r = ds->ncols_r();  // Columns (R indexing: 1-based)

// Read data block
std::vector<double> data(nrows * ncols);
std::vector<hsize_t> offset = {0, 0};        // Start position
std::vector<hsize_t> count = {nrows, ncols}; // How much to read
std::vector<hsize_t> stride = {1, 1};        // Step size (usually 1,1)
std::vector<hsize_t> block = {1, 1};         // Block size (usually 1,1)

ds->readDatasetBlock(offset, count, stride, block, data.data());

// Write data block
ds->writeDatasetBlock(offset, count, stride, block, data.data());

// Cleanup - CRITICAL
delete ds;
ds = nullptr;  // Prevent double-delete

Critical memory management rules:

  1. Always delete: Every new hdf5Dataset requires corresponding delete. Forgetting causes memory leaks—the HDF5 handle stays open, consuming resources.

  2. Set to nullptr after delete: Prevents accidentally using a deleted pointer (accessing freed memory causes crashes).

  3. Use try/catch: HDF5 operations can fail (file not found, dataset doesn’t exist, permissions denied). Wrap in try/catch to clean up properly on errors.

  4. Check nullptr before delete: If pointer might be null, check before deleting to avoid crashes.

Common pitfalls:

  • Forgetting openDataset() before reading → runtime error
  • Reading without allocating sufficient buffer space → buffer overflow
  • Deleting same pointer twice → crash
  • Returning from function without deleting → memory leak
  • Catching exception but not deleting pointers → memory leak

The pattern you’ll see repeatedly: declare pointer to nullptr, allocate with new, use in try block, delete in catch and after try, set to nullptr. This discipline prevents the memory leaks and crashes that plague C++ codebases.


4 Mathematical Foundation (Reference)

The mathematical foundations of CCA are detailed in the R implementation document. Rather than duplicating that content, here’s a condensed reference.

CCA algorithm steps:

  1. QR decomposition: X = Q_X R_X, Y = Q_Y R_Y
    Orthonormalizes matrices while preserving column space

  2. Cross-product: M = Q_X^T Q_Y
    Captures relationship between orthonormalized datasets

  3. SVD: M = U D V^T
    Diagonal D contains canonical correlations, U and V are canonical directions

  4. Canonical coefficients: \alpha = R_X^{-1} U, \beta = R_Y^{-1} V
    Transform back to original feature space

  5. Canonical variates: s_X = X\alpha, s_Y = Y\beta
    Sample scores in canonical space

Block-wise processing rationale:

For large matrices (50,000 samples × 500,000 features), computing full QR decomposition requires massive RAM. Block-wise approach:

  • Partitions matrix into m blocks by rows
  • Computes QR on each block (memory-friendly)
  • Combines results hierarchically
  • Achieves identical result to full QR but with reduced memory

The C++ implementation follows this same mathematical algorithm. The difference is how it’s expressed—explicit pointer management, manual memory allocation, direct HDF5 calls—not what it computes.


5 Step 1: QR Decomposition by Blocks in C++

5.1 The getQRbyBlocks_rcpp Function

This function implements block-wise QR decomposition in C++, mirroring the R version but with explicit memory and error handling:

void getQRbyBlocks_rcpp(hdf5Dataset *dsA, int mblocks, 
                        bool bcenter, bool bscale, 
                        bool byrows, bool overwrite, 
                        Rcpp::Nullable<int> threads) {
    
    // Temporary dataset pointer - initialize to nullptr for safety
    hdf5Dataset* dstmp = nullptr;
    
    try {
        
        std::string strInPath, strOutPath;
        
        // Flags for matrix operation parameters
        bool btransp_dataset = false;   // Transpose first matrix?
        bool btransp_bdataset = false;  // Transpose second matrix?
        bool bfullMatrix = false;       // Use full matrix vs economy?
        
        bool bycols = !byrows;  // Convert row-wise to column-wise flag
        
        // Get dataset name for path construction
        std::string strdataset = dsA->getDatasetName();
        
        // Step 1: Normalize dataset (center and/or scale)
        // This calls R's normalization function for simplicity
        RcppNormalizeHdf5(dsA, bcenter, bscale, byrows);
        
        // Step 2: Open normalized dataset
        dstmp = new hdf5Dataset(dsA->getFileName(), 
                                "NORMALIZED/" + dsA->getGroupName(), 
                                strdataset, false);
        dstmp->openDataset();
        
        // Step 3: Split into blocks
        strOutPath = "Step1/" + strdataset + "rows";
        RcppSplit_matrix_hdf5_internal(
            dstmp, strOutPath, strdataset, bycols,
            mblocks, -1, dstmp->nrows(), dstmp->ncols()
        );
        
        // Close and delete temporary dataset - critical for memory
        delete dstmp; 
        dstmp = nullptr;    
        
        // Step 4: Apply QR to each block
        StringVector blocks = dsA->getDatasetNames(strOutPath, "", "");
        strInPath = strOutPath;
        strOutPath = "Step2/" + strdataset + "rows";
        
        RcppApplyFunctionHdf5(
            dsA->getFileName(), strInPath, blocks, strOutPath, "QR",
            R_NilValue, R_NilValue, wrap(overwrite), 
            wrap(btransp_dataset), wrap(btransp_bdataset), 
            wrap(bfullMatrix), wrap(byrows), threads
        );
        
        // Step 5: Merge R matrices from blocks
        StringVector blocks_qr = dsA->getDatasetNames(strOutPath, 
                                                       strdataset, ".R");
        strInPath = strOutPath;
        strOutPath = "Step3/merged";
        
        RcppBind_datasets_hdf5(
            dsA->getFileName(), strInPath, blocks_qr,
            strOutPath, strdataset + "Rt", "bindRows", 
            false, wrap(overwrite)
        );
        
        // Step 6: Final QR on merged R matrices
        strInPath = strOutPath;
        strOutPath = "Step3/Final_QR";
        
        RcppApplyFunctionHdf5(
            dsA->getFileName(), strInPath, strdataset + "Rt", 
            strOutPath, "QR",
            R_NilValue, R_NilValue, wrap(overwrite), 
            wrap(btransp_dataset), wrap(btransp_bdataset), 
            wrap(bfullMatrix), wrap(byrows), threads
        );
        
        // Step 7: Split final Q for multiplication
        strInPath = strOutPath;
        strOutPath = "Step4/splitted";
        
        dstmp = new hdf5Dataset(dsA->getFileName(), strInPath, 
                                strdataset + "Rt.Q", false);
        dstmp->openDataset();
        
        RcppSplit_matrix_hdf5_internal(
            dstmp, strOutPath, strdataset + "Rt.Q", bycols,
            mblocks, -1, dstmp->nrows(), dstmp->ncols()
        );
        
        delete dstmp; 
        dstmp = nullptr; 
        
        // Step 8: Multiply original Q blocks by final Q blocks
        strInPath = "Step2/" + strdataset + "rows";
        strOutPath = "Step5";
        CharacterVector b_group = "Step4/splitted";
        
        blocks_qr = dsA->getDatasetNames(strInPath, strdataset, ".Q");
        StringVector b_blocks = dsA->getDatasetNames(
            "Step4/splitted", strdataset + "Rt.Q", ""
        );
        
        RcppApplyFunctionHdf5(
            dsA->getFileName(), strInPath, blocks_qr, strOutPath, 
            "blockmult", b_group, b_blocks, wrap(overwrite), 
            wrap(btransp_dataset), wrap(btransp_bdataset), 
            wrap(bfullMatrix), wrap(byrows), threads
        );
        
        // Step 9: Bind Q blocks into final Q matrix
        strInPath = strOutPath; 
        strOutPath = "Step6";
        blocks = dsA->getDatasetNames(strInPath, strdataset + ".", "");
        
        RcppBind_datasets_hdf5(
            dsA->getFileName(), strInPath, blocks,
            strOutPath, strdataset + "Q", "bindRows", 
            false, wrap(overwrite)
        );
        
    } catch(std::exception& ex) {
        // Critical: clean up pointers on error
        checkClose_file(dsA, dstmp);
        Rcout<< "c++ exception getQRbyBlocks_rcpp: "<< ex.what() << "\n";
        return void();
    }
    
    return void();
}

5.2 Key Differences from R Implementation

While the algorithm is identical, the C++ implementation requires explicit management of resources that R handles automatically.

Memory management: Every new hdf5Dataset requires matching delete. The R version doesn’t show this because R’s garbage collector handles it. In C++, forgetting a delete causes a memory leak—the HDF5 file handle stays open, consuming system resources. After thousands of operations, you run out of file handles and the system fails. The pattern new → use → delete → set nullptr prevents this.

Exception handling: The entire function body wraps in try/catch. If any operation fails—file not found, dataset doesn’t exist, out of memory—the catch block cleans up allocated pointers before returning. Without this, an exception thrown by HDF5 would skip the delete statements, causing memory leaks. R’s error handling stops function execution and cleans up automatically; C++ requires explicit cleanup in exception paths.

Explicit string manipulation: R’s paste0() and gsub() are replaced by C++ string concatenation (+) and std::string methods. The logic is identical, but syntax differs. C++ strings are mutable, allow efficient concatenation, but require more verbose operations than R’s vectorized string functions.

Calling R functions from C++: Functions like RcppNormalizeHdf5, RcppSplit_matrix_hdf5_internal, RcppBind_datasets_hdf5 are wrappers around BigDataStatMeth’s R functions. This hybrid approach—C++ orchestration, R implementation—balances performance and maintainability. Writing pure C++ versions of these operations would require thousands of additional lines without significant performance gain because they already use compiled code internally.

Pointer checks: Before deleting, we don’t explicitly check if (dstmp != nullptr) because we initialize to nullptr and only assign once. In more complex code with conditional allocation, always check before deleting. The checkClose_file utility function handles this checking, ensuring we don’t delete null pointers or already-deleted pointers.

Return type: The function returns void() rather than a value. It modifies the HDF5 file in place, with results stored at known locations. R users call this function for side effects (creating datasets in HDF5), not for return values. This design simplifies memory management—no need to allocate, populate, and return large matrices.

5.3 Why This Pattern Matters

This function demonstrates the fundamental pattern for all C++ implementations with BigDataStatMeth:

  1. Initialize pointers to nullptr
  2. Wrap logic in try block
  3. Allocate with new, use, immediately delete when done
  4. Catch exceptions, clean up, report error
  5. Return void after modifying HDF5 file in place

This pattern prevents the three most common C++ failures: memory leaks (forgetting delete), crashes (accessing deleted pointers), and silent corruption (exceptions bypassing cleanup). Master this pattern, and you can implement any statistical method in C++ using BigDataStatMeth.


6 Step 2: Main CCA Function in C++

6.1 Orchestrating the Complete Algorithm

The main CCA function coordinates QR decomposition for both matrices, computes their cross-product, performs SVD, and initiates coefficient computation:

// [[Rcpp::export]]
void bdCCA_hdf5_rcpp(std::string filename, std::string datasetX,
                     std::string datasetY, bool bcenter, bool bscale, 
                     int mblocks, bool overwrite, 
                     Rcpp::Nullable<int> threads = R_NilValue) {

    // Declare all pointers upfront, initialize to nullptr
    hdf5Dataset* dsX = nullptr;
    hdf5Dataset* dsY = nullptr;
    hdf5Dataset* dsXQ = nullptr;
    hdf5Dataset* dsYQ = nullptr;
    hdf5Dataset* dsC = nullptr;

    try {
        
        int ncolsX, ncolsY;

        // Step 1: QR decomposition of X
        dsX = new hdf5Dataset(filename, datasetX, false); 
        dsX->openDataset(); 
        getQRbyBlocks_rcpp(dsX, mblocks, bcenter, bscale, 
                          false, overwrite, threads);
        
        // Step 2: QR decomposition of Y
        dsY = new hdf5Dataset(filename, datasetY, false); 
        dsY->openDataset(); 
        getQRbyBlocks_rcpp(dsY, mblocks, bcenter, bscale, 
                          false, overwrite, threads);
        
        // Step 3: Compute cross-product of Q matrices
        dsXQ = new hdf5Dataset(filename, "Step6", "XQ", false); 
        dsXQ->openDataset(); 
        
        dsYQ = new hdf5Dataset(filename, "Step6", "YQ", false); 
        dsYQ->openDataset(); 
        
        dsC = new hdf5Dataset(filename, "Step7", 
                             "CrossProd_XQ_YQ", overwrite);
        
        // Compute optimal block size for cross-product operation
        int optimBlock = getMaxBlockSize(
            dsXQ->nrows(), dsXQ->ncols(), 
            dsYQ->nrows(), dsYQ->ncols(), 
            2, R_NilValue
        );
        
        // Perform cross-product: Q_X^T Q_Y
        dsC = BigDataStatMeth::crossprod(
            dsXQ, dsYQ, dsC, 
            optimBlock, optimBlock/2, 
            true, true, threads
        );
        
        // Clean up Q datasets - no longer needed
        delete dsXQ; dsXQ = nullptr;
        delete dsYQ; dsYQ = nullptr;
        
        // Store dimensions for coefficient computation
        ncolsX = dsX->ncols_r();
        ncolsY = dsY->ncols_r();
        
        // Step 4: SVD of cross-product matrix
        // This extracts canonical correlations and directions
        RcppbdSVD_hdf5(
            dsC->getFileName(), dsC->getGroupName(), 
            dsC->getDatasetName(),  
            16, 2, 0, false, false, 0, 
            overwrite, false, R_NilValue, R_NilValue
        );
        
        // Clean up cross-product dataset
        delete dsC; dsC = nullptr;
        
        // Clean up X and Y datasets
        delete dsX; dsX = nullptr;
        delete dsY; dsY = nullptr;
        
        // Step 5: Compute canonical coefficients and variates
        // This function is implemented in R for flexibility
        writeCCAComponents_hdf5_rcpp(filename, ncolsX, ncolsY);
        
    } catch(std::exception& ex) {
        // Critical: clean up all pointers on error
        checkClose_file(dsX, dsY, dsXQ, dsYQ, dsC);
        Rcerr<< "\n c++ exception bdCCA_hdf5_rcpp: "<< ex.what() << "\n";
        return void();
    }

    return void();
}

6.2 Understanding the Implementation

This function demonstrates orchestration patterns essential for complex C++ algorithms.

Sequential cleanup: After each step, we delete pointers no longer needed. After computing cross-product, we delete dsXQ and dsYQ—the Q matrices were used, results stored, pointers can be freed. This progressive cleanup minimizes memory usage. Contrast with keeping all pointers until the end—memory usage grows throughout the function. In production with many concurrent CCA calls, aggressive cleanup prevents memory exhaustion.

[[Rcpp::export]] annotation: This tells Rcpp to make the function callable from R. Without it, the function compiles but R can’t call it. The annotation generates wrapper code converting R data types (strings, booleans, integers) to C++ types and vice versa. This is why we can call bdCCA_hdf5_rcpp(filename, "data/X", "data/Y", ...) from R—Rcpp handles the type conversion automatically.

Error propagation: If getQRbyBlocks_rcpp throws an exception for X, the catch block activates, cleans up dsX, and returns without attempting Y’s QR decomposition. This prevents cascading errors where failure in step N causes nonsensical errors in step N+1. The error message identifies which operation failed, critical for debugging.

Calling R functions from C++: The writeCCAComponents_hdf5_rcpp function (shown next) calls an R implementation. This hybrid approach is pragmatic—computing coefficients requires solving triangular systems, matrix multiplications, and HDF5 I/O for storing results. The R implementation already does this correctly. Rather than rewrite in pure C++, we call the R function. The performance difference is negligible because the computationally intensive parts (solving systems, matrix multiplications) use compiled BLAS/LAPACK regardless of whether called from R or C++.

Block size optimization: The getMaxBlockSize function dynamically determines optimal block size based on matrix dimensions and available memory. Hardcoding block size works for fixed data sizes but fails when dimensions vary. Adaptive block sizing ensures the function works efficiently across diverse datasets without user tuning.

Memory failure handling: If new fails (out of memory), C++ throws std::bad_alloc. Our catch block catches this (it’s a std::exception subclass), reports the error, and returns cleanly. Without exception handling, an allocation failure crashes the entire R session—unacceptable in production. Proper error handling degrades gracefully: report error, clean up, return, let user diagnose.


7 Step 3: Computing Canonical Coefficients

7.1 Option A: Calling R Function (Pragmatic Approach)

The simplest and most maintainable approach is calling the existing R implementation:

void writeCCAComponents_hdf5_rcpp(std::string filename, 
                                   int ncolsX, int ncolsY) {
    
    // Access R environment where function exists
    Rcpp::Environment base("package:BDStatMethExamples"); 
    
    // Get R function object
    Rcpp::Function write_components = base["writeCCAComponents_hdf5"];    
    
    // Call R function with C++ arguments
    // Rcpp automatically converts int → numeric, string → character
    write_components(
        Rcpp::_["filename"] = wrap(filename),
        Rcpp::_["ncolsX"]  = ncolsX,
        Rcpp::_["ncolsY"]  = ncolsY
    ); 
    
    return void();
}

This approach has significant advantages:

Maintainability: The R function is already tested, documented, handles edge cases. If you find a bug or need to add features, you fix it in one place (R) and both R and C++ code benefit. Duplicating logic in C++ means duplicating maintenance effort.

Flexibility: The R function can easily use R’s statistical functions, plotting capabilities, diagnostic output. Reimplementing all of that in C++ would be massive effort. By calling R, you get that functionality “for free.”

Performance reality: The computationally intensive parts—solving triangular systems with BLAS, matrix multiplication—use compiled code whether called from R or C++. The overhead of the R function call is negligible compared to the actual computation time. Only if profiling shows this specific function as a bottleneck would pure C++ implementation be justified.

Development speed: This approach took 10 lines of code. A pure C++ implementation would take hundreds of lines plus extensive testing. Unless you need that pure C++ version for specific deployment reasons, this pragmatic approach is superior.

7.2 Option B: Pure C++ Implementation (For Reference)

For completeness, here’s how you’d implement coefficient computation in pure C++. This is shown for educational purposes—most users should use Option A.

void writeCCAComponents_hdf5_cpp(std::string filename, 
                                  int ncolsX, int ncolsY) {
    
    hdf5Dataset* dstmpX = nullptr;
    hdf5Dataset* dstmpY = nullptr;
    
    std::vector<hsize_t> stride = {1, 1};
    std::vector<hsize_t> block = {1, 1};
    
    try {
        
        hsize_t nX = static_cast<hsize_t>(ncolsX);
        hsize_t nY = static_cast<hsize_t>(ncolsY);
        
        // Allocate storage for Q and R matrices
        std::vector<double> vdXQ(nX * nX);
        std::vector<double> vdYQ(nY * nY);
        std::vector<double> vdXR(nX * nX);
        std::vector<double> vdYR(nY * nY);
        
        // Read X components
        {
            dstmpX = new hdf5Dataset(filename, "Step6/XQ", false);  
            dstmpX->openDataset();
            dstmpX->readDatasetBlock(
                {0, 0}, {nX, nX}, stride, block, vdXQ.data()
            );
            
            dstmpY = new hdf5Dataset(filename, "Step6/YQ", false); 
            dstmpY->openDataset();
            dstmpY->readDatasetBlock(
                {0, 0}, {nY, nY}, stride, block, vdYQ.data()
            );
            
            delete dstmpX; dstmpX = nullptr;
            delete dstmpY; dstmpY = nullptr;
            
            // Read R matrices
            dstmpX = new hdf5Dataset(filename, 
                                    "Step3/Final_QR/XRt.R", false); 
            dstmpX->openDataset();
            dstmpX->readDatasetBlock(
                {0, 0}, {nX, nX}, stride, block, vdXR.data()
            );
            
            dstmpY = new hdf5Dataset(filename, 
                                    "Step3/Final_QR/YRt.R", false); 
            dstmpY->openDataset();
            dstmpY->readDatasetBlock(
                {0, 0}, {nY, nY}, stride, block, vdYR.data()
            );
            
            delete dstmpX; dstmpX = nullptr;
            delete dstmpY; dstmpY = nullptr;
        }
        
        // Map to Eigen matrices for linear algebra
        Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, 
                                 Eigen::ColMajor>> 
            XQ(vdXQ.data(), nX, nX);
        Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, 
                                 Eigen::ColMajor>> 
            YQ(vdYQ.data(), nY, nY);
        Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, 
                                 Eigen::ColMajor>> 
            XR(vdXR.data(), nX, nX);
        Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, 
                                 Eigen::ColMajor>> 
            YR(vdYR.data(), nY, nY);
        
        // Combine Q and R into single triangular representation
        Eigen::MatrixXd XQR = Eigen::MatrixXd::Zero(nX, nX);
        Eigen::MatrixXd YQR = Eigen::MatrixXd::Zero(nY, nY);
        
        XQR.triangularView<Eigen::Upper>() = XR.triangularView<Eigen::Upper>();
        XQR.triangularView<Eigen::StrictlyLower>() = XQ.triangularView<Eigen::StrictlyLower>();
        
        YQR.triangularView<Eigen::Upper>() = YR.triangularView<Eigen::Upper>();
        YQR.triangularView<Eigen::StrictlyLower>() = YQ.triangularView<Eigen::StrictlyLower>();
        
        // Read SVD results (U, V matrices)
        dstmpX = new hdf5Dataset(filename, 
                                "SVD/CrossProd_XQ_YQ/u", false); 
        dstmpX->openDataset();
        int nu_rows = dstmpX->nrows();
        int nu_cols = dstmpX->ncols();
        std::vector<double> vdU(nu_rows * nu_cols);
        dstmpX->readDatasetBlock(
            {0, 0}, {(hsize_t)nu_rows, (hsize_t)nu_cols}, 
            stride, block, vdU.data()
        );
        delete dstmpX; dstmpX = nullptr;
        
        dstmpY = new hdf5Dataset(filename, 
                                "SVD/CrossProd_XQ_YQ/v", false); 
        dstmpY->openDataset();
        int nv_rows = dstmpY->nrows();
        int nv_cols = dstmpY->ncols();
        std::vector<double> vdV(nv_rows * nv_cols);
        dstmpY->readDatasetBlock(
            {0, 0}, {(hsize_t)nv_rows, (hsize_t)nv_cols}, 
            stride, block, vdV.data()
        );
        delete dstmpY; dstmpY = nullptr;
        
        Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, 
                                 Eigen::ColMajor>> 
            U(vdU.data(), nu_rows, nu_cols);
        Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, 
                                 Eigen::ColMajor>> 
            V(vdV.data(), nv_rows, nv_cols);
        
        // Solve for canonical coefficients: R^(-1) * U
        // Using triangular solve (efficient for triangular matrices)
        Eigen::MatrixXd xcoef = XQR.triangularView<Eigen::Upper>().solve(U);
        Eigen::MatrixXd ycoef = YQR.triangularView<Eigen::Upper>().solve(V);
        
        // Here you would write xcoef and ycoef back to HDF5
        // using BigDataStatMeth's HDF5 writing functions
        // Omitted for brevity - see R implementation for reference
        
    } catch(std::exception& ex) {
        checkClose_file(dstmpX, dstmpY);
        Rcout<< "c++ exception writeCCAComponents_hdf5_cpp: "
             << ex.what() << "\n";
        return void();
    }
    
    return void();
}

This pure C++ version demonstrates several advanced techniques:

Eigen for linear algebra: The Eigen library provides efficient matrix operations. The Eigen::Map creates Eigen matrix objects that share memory with C++ vectors—no copying overhead. Operations like triangular solve (triangularView<Upper>().solve()) use optimized BLAS/LAPACK implementations.

Memory mapping: Rather than copying data from std::vector to Eigen::Matrix, we map the vector’s memory directly. This zero-copy approach is essential for large matrices—copying megabytes or gigabytes is expensive both in time and temporary memory allocation.

Triangular system solving: The triangularView<Upper>().solve() exploits the triangular structure of R matrices. Solving a triangular system is O(n²), much faster than general matrix inverse which is O(n³). This is why QR decomposition is useful—it gives triangular matrices that solve efficiently.

Complexity vs benefit: This pure C++ implementation is ~100 lines versus ~10 lines for calling R. Unless profiling shows coefficient computation as a bottleneck (unlikely—SVD and QR dominate compute time), the simpler approach is better.

7.3 When Each Approach Makes Sense

Use Option A (call R) when: - Development speed matters more than microseconds - R function handles edge cases, provides diagnostics - Team maintains R code more easily - No strict deployment requirement for pure C++

Use Option B (pure C++) when: - Deploying in environments without R - Profiling shows coefficient computation as bottleneck - Need to modify algorithm in ways R version doesn’t support - Building standalone C++ library

For most users, Option A is the right choice. This pragmatic approach combines C++ performance where it matters (QR, SVD) with R flexibility where it helps (coefficient computation, result storage).


8 Compiling and Using

8.1 Package Structure

To use C++ functions in an R package, you need proper structure:

BDStatMethExamples/
├── DESCRIPTION
├── NAMESPACE
├── R/
│   └── RcppExports.R          # Auto-generated
├── src/
│   ├── bdCCA.cpp              # Implementation
│   ├── bdCCA.h                # Header declarations
│   └── RcppExports.cpp        # Auto-generated
└── inst/
    └── include/
        └── BigDataStatMeth.hpp

Header file (bdCCA.h):

#ifndef BDSTATMETHEXAMPLES_CCA_HPP
#define BDSTATMETHEXAMPLES_CCA_HPP

#include <Rcpp.h>
#include "BigDataStatMeth.hpp"

using namespace Rcpp;
using namespace BigDataStatMeth;

// Function declarations
void getQRbyBlocks_rcpp(
    hdf5Dataset *dsA, int mblocks, bool bcenter, bool bscale, 
    bool byrows, bool overwrite, Nullable<int> threads = R_NilValue
);

void writeCCAComponents_hdf5_rcpp(
    std::string filename, int ncolsX, int ncolsY
);

#endif // BDSTATMETHEXAMPLES_CCA_HPP

8.2 Option 1: Compile with sourceCpp (Testing)

For development and testing, compile individual files:

library(Rcpp)

# Set include path
Sys.setenv(PKG_CXXFLAGS = paste0(
  "-I", system.file("include", package = "BigDataStatMeth")
))

# Compile and load
sourceCpp("src/bdCCA.cpp")

# Now function is available in R
bdCCA_hdf5_rcpp(
  filename = "test.hdf5",
  datasetX = "data/X",
  datasetY = "data/Y",
  bcenter = TRUE,
  bscale = FALSE,
  mblocks = 4,
  overwrite = TRUE
)

8.3 Option 2: Build Package (Production)

For production use, build the full package:

# From package directory
R CMD build .
R CMD INSTALL BDStatMethExamples_1.0.tar.gz

Or in R:

devtools::document()    # Generate documentation
devtools::build()       # Build package
devtools::install()     # Install locally

8.4 Comparing R and C++ Performance

Here’s how to benchmark the two implementations:

library(BDStatMethExamples)
library(microbenchmark)

# Create test data
set.seed(42)
n <- 5000
p <- 1000
q <- 800

X <- matrix(rnorm(n*p), n, p)
Y <- matrix(rnorm(n*q), n, q)

test_file <- "benchmark_cca.hdf5"

bdCreate_hdf5_matrix(test_file, X, "data", "X", overwriteFile = TRUE)
bdCreate_hdf5_matrix(test_file, Y, "data", "Y", overwriteFile = FALSE)

# Benchmark
results <- microbenchmark(
  R_version = bdCCA_hdf5(
    test_file, "data/X", "data/Y", m = 4,
    overwriteResults = TRUE
  ),
  
  Cpp_version = bdCCA_hdf5_rcpp(
    test_file, "data/X", "data/Y", 
    bcenter = TRUE, bscale = FALSE, mblocks = 4, 
    overwrite = TRUE
  ),
  
  times = 10
)

print(results)

# Typically expect:
# R version:   ~15-25 seconds
# C++ version: ~10-15 seconds
# Speedup:     ~1.5-2x

The C++ version is faster, but not dramatically—both use the same underlying BLAS/LAPACK for matrix operations. The speedup comes from reduced overhead in function calls and memory management, not from algorithmic differences.


9 Interactive Exercise

9.1 Practice: Optimizing C++ Implementation

Understanding comes from hands-on modification:

# Exercise 1: Add error checking
# Modify bdCCA_hdf5_rcpp to validate:
# - File exists before opening
# - Datasets have matching sample counts
# - Sufficient memory available
# How do these checks affect performance?

# Exercise 2: Implement parallel QR
# Modify getQRbyBlocks_rcpp to:
# - Process blocks in parallel using OpenMP
# - Measure speedup with 2, 4, 8 threads
# When does parallelization help?

# Exercise 3: Profile memory usage
# Instrument the code to track:
# - Peak memory usage at each step
# - Number of active HDF5 handles
# - Dataset sizes created
# Where is memory bottleneck?

# Exercise 4: Pure C++ coefficient computation
# Complete the writeCCAComponents_hdf5_cpp function:
# - Write xcoef and ycoef to HDF5
# - Compute and save canonical variates
# - Compare performance to R version
# Is the complexity justified?
TipReflection Questions

1. Development Trade-offs: - How much faster must C++ be to justify development cost? - At what data size does C++ become necessary? - How do you decide R vs C++ for new methods?

2. Memory Management: - Why initialize pointers to nullptr? - What happens if you forget delete? - How do smart pointers (unique_ptr, shared_ptr) help?

3. Error Handling: - Why wrap everything in try/catch? - What errors can HDF5 operations throw? - How do you test error handling paths?

4. Hybrid Approaches: - When is calling R from C++ appropriate? - What’s the overhead of R function calls? - How do you profile mixed R/C++ code?

5. Production Deployment: - How do you ensure C++ code is maintainable? - What testing is essential for C++ implementations? - How do you document C++ API for users?

These questions guide you beyond syntax to software engineering principles essential for production statistical code.


10 Key Takeaways

Let’s consolidate what you’ve learned about implementing statistical methods in C++ with BigDataStatMeth.

10.1 Essential Concepts

C++ provides performance, not different algorithms. The mathematical algorithm—QR decomposition, cross-product, SVD, solving for coefficients—is identical between R and C++ implementations. The difference is execution efficiency, not what’s computed. C++ is faster because of reduced overhead in memory management, function calls, and type conversions, not because it uses superior mathematics. Don’t assume C++ is always necessary; profile first to identify actual bottlenecks. Many statistical methods run adequately in R because the heavy lifting (BLAS, LAPACK operations) uses compiled code regardless of whether called from R or C++.

Memory management is mandatory, not optional. Every new requires matching delete. Every opened HDF5 dataset must be closed. Every allocated pointer must be tracked through exception paths. R’s garbage collector handles this automatically, making R code shorter and less error-prone. C++ requires explicit management, which is why production C++ code includes try/catch blocks, cleanup functions, and defensive null checks. Forgetting these leads to memory leaks (gradual resource exhaustion), crashes (accessing freed memory), or corrupted results (partial cleanup). The discipline of declaring pointers to nullptr, allocating in try blocks, deleting in catch and after use, then setting to nullptr prevents these failures.

Exception handling prevents crashes. HDF5 operations can fail—file not found, dataset doesn’t exist, out of disk space, permissions denied. Without exception handling, these failures crash the entire R session. With proper try/catch, the function reports the error, cleans up allocated resources, and returns gracefully. The user sees a clear error message, R continues running, and memory doesn’t leak. Production code must never crash—degrading gracefully (return error, let user fix problem) is mandatory. The pattern of wrapping all operations in try and checking for exceptions isn’t paranoia; it’s engineering reliability.

Rcpp bridges R and C++ transparently. The [[Rcpp::export]] annotation makes C++ functions callable from R without manual wrapper code. Type conversion—R strings to C++ std::string, R numerics to C++ double, R booleans to C++ bool—happens automatically. Calling R functions from C++ (like writeCCAComponents_hdf5_rcpp calling R’s implementation) is equally transparent. This seamless integration means you can use C++ where it provides value (tight loops, custom algorithms, memory control) while keeping flexible parts in R (statistical tests, plotting, diagnostics). Don’t think of R and C++ as separate worlds requiring manual integration; Rcpp makes them a unified environment.

BigDataStatMeth’s C++ functions do the heavy lifting. Writing pure C++ linear algebra from scratch—QR decomposition, SVD, matrix multiplication—would be thousands of lines and likely slower than using BigDataStatMeth’s implementations (which internally use optimized BLAS/LAPACK). The value of C++ isn’t reimplementing everything; it’s combining BigDataStatMeth’s building blocks in custom ways, adding algorithms not provided, or optimizing specific bottlenecks. The hybrid approach—calling BigDataStatMeth functions from C++, calling R functions where convenient—is pragmatic engineering, not compromise.

Compilation catches errors early. R is interpreted—errors appear at runtime, potentially deep into analysis. C++ is compiled—type mismatches, missing headers, syntax errors appear before execution. This is why C++ development feels slower (write code, compile, fix errors, repeat) but produces more robust code. Once C++ code compiles, many categories of errors (wrong types, missing functions, invalid syntax) are impossible at runtime. The debugging time shifts from runtime to compile time, and compile-time debugging is easier—the compiler tells you exactly what’s wrong. Runtime debugging in statistical code often requires re-running expensive analyses; compile-time debugging prevents that wasted time.

10.2 When to Use C++ vs R

Understanding when C++ justifies the development investment versus when R suffices guides efficient method development.

Implement in C++ when:

  • Profiling identifies clear bottlenecks - Don’t guess; measure. If profiling shows 80% of time in a specific operation not using compiled code (e.g., custom R loops), that’s a C++ candidate. If 80% of time is in bdSVD_hdf5 (already compiled), C++ won’t help.

  • Method runs thousands of times in production - A 50% speedup irrelevant for one-off analyses matters hugely in pipelines processing thousands of datasets daily. The accumulated time savings—and reduced compute costs—justify development effort.

  • Custom algorithms not in BigDataStatMeth - If implementing novel decompositions, specialized optimization, or algorithms from cutting-edge papers, C++ provides algorithmic freedom. You’re not limited to existing BigDataStatMeth functions; you implement exactly what you need.

  • Deploying in pure C++ environments - If your production infrastructure is pure C++ (no R installed), you need C++ implementations. This is rare—most scientific computing environments have R—but in embedded systems, real-time environments, or certain production systems, pure C++ is mandatory.

Stay in R when:

  • Development speed matters more - For research methods, prototype algorithms, exploratory analyses, R’s faster development cycle (write, run, inspect, iterate) is more valuable than execution speed. Perfect code running slow beats broken code running fast.

  • Method uses existing BigDataStatMeth functions - If your method orchestrates bdSVD_hdf5, bdCrossprod_hdf5, bdQR_hdf5, the heavy computation already uses compiled code. R orchestration overhead is negligible. Only if you need custom operations not provided by BigDataStatMeth is C++ needed.

  • Infrequent use - Monthly analysis taking 10 minutes in R? Don’t optimize. The development time (writing C++, testing, debugging) exceeds total computation time saved over the method’s lifetime. Spend development time on methods used frequently.

  • Team maintains R more easily - If your team excels in R but struggles with C++ pointer management, memory leaks, and compilation issues, maintainability matters more than speed. Better to have understandable R code that runs slightly slower than inscrutable C++ code nobody can debug or extend.

The key principle: profile to identify bottlenecks, optimize selectively where profiling shows benefit, and use each language where it excels. Most statistical method implementations belong in R, with BigDataStatMeth’s compiled functions handling computational intensity.


11 Next Steps

Deepen C++ expertise:

  • Study BigDataStatMeth’s header implementations
  • Learn OpenMP for parallel computing
  • Explore Eigen library for advanced linear algebra
  • Master C++ debugging tools (gdb, valgrind)

Optimize existing methods:

  • Profile CCA implementation to find bottlenecks
  • Implement parallel block processing
  • Optimize memory allocation patterns
  • Add SIMD vectorization where beneficial

Extend to new methods:

  • Implement sparse CCA with custom regularization
  • Build regularized PLS using similar patterns
  • Create custom matrix factorizations
  • Develop online/streaming algorithms

Production deployment:

  • Add comprehensive error handling
  • Implement logging and diagnostics
  • Create extensive test suites
  • Document C++ API for users
  • Build CI/CD pipelines for testing