library(Rcpp)
# Set include path
Sys.setenv(PKG_CXXFLAGS = paste0(
"-I", system.file("include", package = "BigDataStatMeth")
))
# Compile and load
sourceCpp("src/bdCCA.cpp")
# Now function is available in R
bdCCA_hdf5_rcpp(
filename = "test.hdf5",
datasetX = "data/X",
datasetY = "data/Y",
bcenter = TRUE,
bscale = FALSE,
mblocks = 4,
overwrite = TRUE
)CCA Implementation in C++
Building Canonical Correlation Analysis Using BigDataStatMeth C++ API
This page uses functions that have been deprecated following the major restructuring introduced in BigDataStatMeth v2.0.1. The new version replaces the procedural bd* API with a unified S3 interface based on HDF5Matrix objects. All code blocks are shown for reference but are not executed.
An updated version of this page will be published shortly
1 Overview
This document shows how to implement Canonical Correlation Analysis (CCA) in C++ using BigDataStatMeth’s header-only library. While the R implementation demonstrated the algorithm using R’s BigDataStatMeth functions, this C++ version provides direct control over computation, memory management, and performance optimization for production environments.
The C++ implementation you’ll see here is actual production code from the BDStatMethExamples package. It’s not simplified teaching code—it’s the real implementation handling TCGA multi-omic datasets with careful memory management, exception handling, and integration with R through Rcpp. By studying this code, you learn not just C++ syntax but the engineering discipline required for robust statistical software.
Moving from R to C++ isn’t just about speed—it’s about control. In R, BigDataStatMeth functions handle memory, choose algorithms, manage errors. In C++, you make those decisions explicitly. This control enables custom algorithms, fine-grained optimization, and production deployment where reliability and performance are critical. But it comes with responsibility: you must manage pointers, handle exceptions, prevent memory leaks, and ensure numerical stability.
1.1 What You’ll Learn
By the end of this document, you will:
- Work with BigDataStatMeth’s C++ header-only library effectively
- Manage HDF5 datasets using the hdf5Dataset class safely
- Implement block-wise QR decomposition with pointer management
- Handle exceptions to prevent crashes and memory leaks
- Integrate C++ functions with R through Rcpp seamlessly
- Decide when C++ optimization justifies development effort
- Compile and test C++ implementations systematically
- Choose between calling R functions vs pure C++ for flexibility
2 Why C++ Implementation?
2.1 When C++ Makes Sense
Before diving into code, understand when C++ optimization is worth the development investment.
Performance-critical production pipelines: If you run CCA thousands of times daily in production—processing every new patient sample, updating analyses continuously, serving real-time predictions—C++ performance gains compound. A 50% speedup on one analysis is negligible. A 50% speedup on 10,000 daily analyses saves hours of compute time and enables faster turnaround for clinical or research needs.
Custom algorithmic control: If you’re implementing novel CCA variants not available in BigDataStatMeth—sparse regularization with custom penalties, kernel CCA with specialized kernels, streaming CCA that processes data in one pass—C++ gives you algorithmic freedom. You’re not constrained to pre-built functions; you implement exactly the algorithm you need, with exactly the optimizations that matter for your use case.
Memory-constrained environments: In production environments with limited RAM or when processing extremely large datasets, C++’s explicit memory management prevents unexpected allocation failures. You control exactly when memory is allocated and freed, enabling precise tuning for available resources. R’s garbage collection can cause unpredictable memory spikes; C++’s deterministic deallocation doesn’t.
Integration with existing C++ systems: If your organization has existing C++ pipelines, statistical libraries, or infrastructure, adding a C++ CCA implementation integrates naturally. No language barriers, no R installation required, consistent deployment practices. The entire stack remains C++, simplifying maintenance and deployment.
2.2 When R Suffices
Don’t assume C++ is always better. R often provides better development velocity with adequate performance.
Algorithm prototyping and development: Developing new statistical methods in R is faster—you iterate quickly, inspect variables at every step, leverage R’s statistical ecosystem. Get the algorithm mathematically correct in R, validate it thoroughly, then optimize in C++ if performance profiling shows bottlenecks. Many methods never need C++ because R performance is adequate.
Infrequent use cases: If you run CCA once per month on a dataset that completes in 10 minutes, C++ optimization isn’t justified. The development time—writing C++ code, testing memory management, handling edge cases—exceeds the total computation time saved over the method’s lifetime. Use R, wait 10 minutes, move on to analysis.
Leveraging BigDataStatMeth functions: If your algorithm composes existing BigDataStatMeth functions (bdSVD_hdf5, bdCrossprod_hdf5, etc.), the heavy lifting already happens in compiled code. R orchestration overhead is negligible. Only if you need custom operations not provided by BigDataStatMeth does C++ implementation add value.
Team expertise considerations: If your team excels in R but struggles with C++, the maintenance burden of C++ code may outweigh performance benefits. Statistical correctness and code maintainability matter more than speed. Better to have understandable R code that runs slightly slower than inscrutable C++ code that nobody can debug or extend.
The decision rule: profile first, optimize selectively. Identify actual bottlenecks through profiling, then apply C++ surgically to those specific operations. Wholesale R-to-C++ rewrites are rarely the right answer.
3 C++ Prerequisites
3.1 Required Headers
BigDataStatMeth provides a header-only C++ library. Include the main header and you have access to all classes and functions without linking against compiled libraries:
#include <Rcpp.h>
#include "BigDataStatMeth.hpp"
using namespace Rcpp;
using namespace BigDataStatMeth;The Rcpp.h header enables R integration—exporting C++ functions to R, converting between R and C++ data structures, handling R’s error model. Without Rcpp, you’d write pure C++ code that R can’t call directly.
The BigDataStatMeth.hpp header provides the core functionality: - hdf5Dataset class for HDF5 file I/O - Matrix operation functions (crossprod, SVD, QR) - Block-wise processing utilities - Memory management helpers
3.2 Understanding hdf5Dataset Class
The hdf5Dataset class is your interface to HDF5 files in C++. Unlike R’s transparent HDF5 access, C++ requires explicit dataset opening, reading, writing, and closing.
Basic usage pattern:
// Constructor: create dataset object
// Parameters: filename, dataset_path, create_if_new
hdf5Dataset* ds = new hdf5Dataset("file.hdf5", "data/matrix", false);
// Open for reading/writing
// Must call before any I/O operations
ds->openDataset();
// Get dimensions
int nrows = ds->nrows(); // Total rows (C indexing: 0-based)
int ncols = ds->ncols(); // Total columns
int nrows_r = ds->nrows_r(); // Rows (R indexing: 1-based)
int ncols_r = ds->ncols_r(); // Columns (R indexing: 1-based)
// Read data block
std::vector<double> data(nrows * ncols);
std::vector<hsize_t> offset = {0, 0}; // Start position
std::vector<hsize_t> count = {nrows, ncols}; // How much to read
std::vector<hsize_t> stride = {1, 1}; // Step size (usually 1,1)
std::vector<hsize_t> block = {1, 1}; // Block size (usually 1,1)
ds->readDatasetBlock(offset, count, stride, block, data.data());
// Write data block
ds->writeDatasetBlock(offset, count, stride, block, data.data());
// Cleanup - CRITICAL
delete ds;
ds = nullptr; // Prevent double-deleteCritical memory management rules:
Always delete: Every
new hdf5Datasetrequires correspondingdelete. Forgetting causes memory leaks—the HDF5 handle stays open, consuming resources.Set to nullptr after delete: Prevents accidentally using a deleted pointer (accessing freed memory causes crashes).
Use try/catch: HDF5 operations can fail (file not found, dataset doesn’t exist, permissions denied). Wrap in try/catch to clean up properly on errors.
Check nullptr before delete: If pointer might be null, check before deleting to avoid crashes.
Common pitfalls:
- Forgetting
openDataset()before reading → runtime error - Reading without allocating sufficient buffer space → buffer overflow
- Deleting same pointer twice → crash
- Returning from function without deleting → memory leak
- Catching exception but not deleting pointers → memory leak
The pattern you’ll see repeatedly: declare pointer to nullptr, allocate with new, use in try block, delete in catch and after try, set to nullptr. This discipline prevents the memory leaks and crashes that plague C++ codebases.
4 Mathematical Foundation (Reference)
The mathematical foundations of CCA are detailed in the R implementation document. Rather than duplicating that content, here’s a condensed reference.
CCA algorithm steps:
QR decomposition: X = Q_X R_X, Y = Q_Y R_Y
Orthonormalizes matrices while preserving column spaceCross-product: M = Q_X^T Q_Y
Captures relationship between orthonormalized datasetsSVD: M = U D V^T
Diagonal D contains canonical correlations, U and V are canonical directionsCanonical coefficients: \alpha = R_X^{-1} U, \beta = R_Y^{-1} V
Transform back to original feature spaceCanonical variates: s_X = X\alpha, s_Y = Y\beta
Sample scores in canonical space
Block-wise processing rationale:
For large matrices (50,000 samples × 500,000 features), computing full QR decomposition requires massive RAM. Block-wise approach:
- Partitions matrix into m blocks by rows
- Computes QR on each block (memory-friendly)
- Combines results hierarchically
- Achieves identical result to full QR but with reduced memory
The C++ implementation follows this same mathematical algorithm. The difference is how it’s expressed—explicit pointer management, manual memory allocation, direct HDF5 calls—not what it computes.
5 Step 1: QR Decomposition by Blocks in C++
5.1 The getQRbyBlocks_rcpp Function
This function implements block-wise QR decomposition in C++, mirroring the R version but with explicit memory and error handling:
void getQRbyBlocks_rcpp(hdf5Dataset *dsA, int mblocks,
bool bcenter, bool bscale,
bool byrows, bool overwrite,
Rcpp::Nullable<int> threads) {
// Temporary dataset pointer - initialize to nullptr for safety
hdf5Dataset* dstmp = nullptr;
try {
std::string strInPath, strOutPath;
// Flags for matrix operation parameters
bool btransp_dataset = false; // Transpose first matrix?
bool btransp_bdataset = false; // Transpose second matrix?
bool bfullMatrix = false; // Use full matrix vs economy?
bool bycols = !byrows; // Convert row-wise to column-wise flag
// Get dataset name for path construction
std::string strdataset = dsA->getDatasetName();
// Step 1: Normalize dataset (center and/or scale)
// This calls R's normalization function for simplicity
RcppNormalizeHdf5(dsA, bcenter, bscale, byrows);
// Step 2: Open normalized dataset
dstmp = new hdf5Dataset(dsA->getFileName(),
"NORMALIZED/" + dsA->getGroupName(),
strdataset, false);
dstmp->openDataset();
// Step 3: Split into blocks
strOutPath = "Step1/" + strdataset + "rows";
RcppSplit_matrix_hdf5_internal(
dstmp, strOutPath, strdataset, bycols,
mblocks, -1, dstmp->nrows(), dstmp->ncols()
);
// Close and delete temporary dataset - critical for memory
delete dstmp;
dstmp = nullptr;
// Step 4: Apply QR to each block
StringVector blocks = dsA->getDatasetNames(strOutPath, "", "");
strInPath = strOutPath;
strOutPath = "Step2/" + strdataset + "rows";
RcppApplyFunctionHdf5(
dsA->getFileName(), strInPath, blocks, strOutPath, "QR",
R_NilValue, R_NilValue, wrap(overwrite),
wrap(btransp_dataset), wrap(btransp_bdataset),
wrap(bfullMatrix), wrap(byrows), threads
);
// Step 5: Merge R matrices from blocks
StringVector blocks_qr = dsA->getDatasetNames(strOutPath,
strdataset, ".R");
strInPath = strOutPath;
strOutPath = "Step3/merged";
RcppBind_datasets_hdf5(
dsA->getFileName(), strInPath, blocks_qr,
strOutPath, strdataset + "Rt", "bindRows",
false, wrap(overwrite)
);
// Step 6: Final QR on merged R matrices
strInPath = strOutPath;
strOutPath = "Step3/Final_QR";
RcppApplyFunctionHdf5(
dsA->getFileName(), strInPath, strdataset + "Rt",
strOutPath, "QR",
R_NilValue, R_NilValue, wrap(overwrite),
wrap(btransp_dataset), wrap(btransp_bdataset),
wrap(bfullMatrix), wrap(byrows), threads
);
// Step 7: Split final Q for multiplication
strInPath = strOutPath;
strOutPath = "Step4/splitted";
dstmp = new hdf5Dataset(dsA->getFileName(), strInPath,
strdataset + "Rt.Q", false);
dstmp->openDataset();
RcppSplit_matrix_hdf5_internal(
dstmp, strOutPath, strdataset + "Rt.Q", bycols,
mblocks, -1, dstmp->nrows(), dstmp->ncols()
);
delete dstmp;
dstmp = nullptr;
// Step 8: Multiply original Q blocks by final Q blocks
strInPath = "Step2/" + strdataset + "rows";
strOutPath = "Step5";
CharacterVector b_group = "Step4/splitted";
blocks_qr = dsA->getDatasetNames(strInPath, strdataset, ".Q");
StringVector b_blocks = dsA->getDatasetNames(
"Step4/splitted", strdataset + "Rt.Q", ""
);
RcppApplyFunctionHdf5(
dsA->getFileName(), strInPath, blocks_qr, strOutPath,
"blockmult", b_group, b_blocks, wrap(overwrite),
wrap(btransp_dataset), wrap(btransp_bdataset),
wrap(bfullMatrix), wrap(byrows), threads
);
// Step 9: Bind Q blocks into final Q matrix
strInPath = strOutPath;
strOutPath = "Step6";
blocks = dsA->getDatasetNames(strInPath, strdataset + ".", "");
RcppBind_datasets_hdf5(
dsA->getFileName(), strInPath, blocks,
strOutPath, strdataset + "Q", "bindRows",
false, wrap(overwrite)
);
} catch(std::exception& ex) {
// Critical: clean up pointers on error
checkClose_file(dsA, dstmp);
Rcout<< "c++ exception getQRbyBlocks_rcpp: "<< ex.what() << "\n";
return void();
}
return void();
}5.2 Key Differences from R Implementation
While the algorithm is identical, the C++ implementation requires explicit management of resources that R handles automatically.
Memory management: Every new hdf5Dataset requires matching delete. The R version doesn’t show this because R’s garbage collector handles it. In C++, forgetting a delete causes a memory leak—the HDF5 file handle stays open, consuming system resources. After thousands of operations, you run out of file handles and the system fails. The pattern new → use → delete → set nullptr prevents this.
Exception handling: The entire function body wraps in try/catch. If any operation fails—file not found, dataset doesn’t exist, out of memory—the catch block cleans up allocated pointers before returning. Without this, an exception thrown by HDF5 would skip the delete statements, causing memory leaks. R’s error handling stops function execution and cleans up automatically; C++ requires explicit cleanup in exception paths.
Explicit string manipulation: R’s paste0() and gsub() are replaced by C++ string concatenation (+) and std::string methods. The logic is identical, but syntax differs. C++ strings are mutable, allow efficient concatenation, but require more verbose operations than R’s vectorized string functions.
Calling R functions from C++: Functions like RcppNormalizeHdf5, RcppSplit_matrix_hdf5_internal, RcppBind_datasets_hdf5 are wrappers around BigDataStatMeth’s R functions. This hybrid approach—C++ orchestration, R implementation—balances performance and maintainability. Writing pure C++ versions of these operations would require thousands of additional lines without significant performance gain because they already use compiled code internally.
Pointer checks: Before deleting, we don’t explicitly check if (dstmp != nullptr) because we initialize to nullptr and only assign once. In more complex code with conditional allocation, always check before deleting. The checkClose_file utility function handles this checking, ensuring we don’t delete null pointers or already-deleted pointers.
Return type: The function returns void() rather than a value. It modifies the HDF5 file in place, with results stored at known locations. R users call this function for side effects (creating datasets in HDF5), not for return values. This design simplifies memory management—no need to allocate, populate, and return large matrices.
5.3 Why This Pattern Matters
This function demonstrates the fundamental pattern for all C++ implementations with BigDataStatMeth:
- Initialize pointers to nullptr
- Wrap logic in try block
- Allocate with new, use, immediately delete when done
- Catch exceptions, clean up, report error
- Return void after modifying HDF5 file in place
This pattern prevents the three most common C++ failures: memory leaks (forgetting delete), crashes (accessing deleted pointers), and silent corruption (exceptions bypassing cleanup). Master this pattern, and you can implement any statistical method in C++ using BigDataStatMeth.
6 Step 2: Main CCA Function in C++
6.1 Orchestrating the Complete Algorithm
The main CCA function coordinates QR decomposition for both matrices, computes their cross-product, performs SVD, and initiates coefficient computation:
// [[Rcpp::export]]
void bdCCA_hdf5_rcpp(std::string filename, std::string datasetX,
std::string datasetY, bool bcenter, bool bscale,
int mblocks, bool overwrite,
Rcpp::Nullable<int> threads = R_NilValue) {
// Declare all pointers upfront, initialize to nullptr
hdf5Dataset* dsX = nullptr;
hdf5Dataset* dsY = nullptr;
hdf5Dataset* dsXQ = nullptr;
hdf5Dataset* dsYQ = nullptr;
hdf5Dataset* dsC = nullptr;
try {
int ncolsX, ncolsY;
// Step 1: QR decomposition of X
dsX = new hdf5Dataset(filename, datasetX, false);
dsX->openDataset();
getQRbyBlocks_rcpp(dsX, mblocks, bcenter, bscale,
false, overwrite, threads);
// Step 2: QR decomposition of Y
dsY = new hdf5Dataset(filename, datasetY, false);
dsY->openDataset();
getQRbyBlocks_rcpp(dsY, mblocks, bcenter, bscale,
false, overwrite, threads);
// Step 3: Compute cross-product of Q matrices
dsXQ = new hdf5Dataset(filename, "Step6", "XQ", false);
dsXQ->openDataset();
dsYQ = new hdf5Dataset(filename, "Step6", "YQ", false);
dsYQ->openDataset();
dsC = new hdf5Dataset(filename, "Step7",
"CrossProd_XQ_YQ", overwrite);
// Compute optimal block size for cross-product operation
int optimBlock = getMaxBlockSize(
dsXQ->nrows(), dsXQ->ncols(),
dsYQ->nrows(), dsYQ->ncols(),
2, R_NilValue
);
// Perform cross-product: Q_X^T Q_Y
dsC = BigDataStatMeth::crossprod(
dsXQ, dsYQ, dsC,
optimBlock, optimBlock/2,
true, true, threads
);
// Clean up Q datasets - no longer needed
delete dsXQ; dsXQ = nullptr;
delete dsYQ; dsYQ = nullptr;
// Store dimensions for coefficient computation
ncolsX = dsX->ncols_r();
ncolsY = dsY->ncols_r();
// Step 4: SVD of cross-product matrix
// This extracts canonical correlations and directions
RcppbdSVD_hdf5(
dsC->getFileName(), dsC->getGroupName(),
dsC->getDatasetName(),
16, 2, 0, false, false, 0,
overwrite, false, R_NilValue, R_NilValue
);
// Clean up cross-product dataset
delete dsC; dsC = nullptr;
// Clean up X and Y datasets
delete dsX; dsX = nullptr;
delete dsY; dsY = nullptr;
// Step 5: Compute canonical coefficients and variates
// This function is implemented in R for flexibility
writeCCAComponents_hdf5_rcpp(filename, ncolsX, ncolsY);
} catch(std::exception& ex) {
// Critical: clean up all pointers on error
checkClose_file(dsX, dsY, dsXQ, dsYQ, dsC);
Rcerr<< "\n c++ exception bdCCA_hdf5_rcpp: "<< ex.what() << "\n";
return void();
}
return void();
}6.2 Understanding the Implementation
This function demonstrates orchestration patterns essential for complex C++ algorithms.
Sequential cleanup: After each step, we delete pointers no longer needed. After computing cross-product, we delete dsXQ and dsYQ—the Q matrices were used, results stored, pointers can be freed. This progressive cleanup minimizes memory usage. Contrast with keeping all pointers until the end—memory usage grows throughout the function. In production with many concurrent CCA calls, aggressive cleanup prevents memory exhaustion.
[[Rcpp::export]] annotation: This tells Rcpp to make the function callable from R. Without it, the function compiles but R can’t call it. The annotation generates wrapper code converting R data types (strings, booleans, integers) to C++ types and vice versa. This is why we can call bdCCA_hdf5_rcpp(filename, "data/X", "data/Y", ...) from R—Rcpp handles the type conversion automatically.
Error propagation: If getQRbyBlocks_rcpp throws an exception for X, the catch block activates, cleans up dsX, and returns without attempting Y’s QR decomposition. This prevents cascading errors where failure in step N causes nonsensical errors in step N+1. The error message identifies which operation failed, critical for debugging.
Calling R functions from C++: The writeCCAComponents_hdf5_rcpp function (shown next) calls an R implementation. This hybrid approach is pragmatic—computing coefficients requires solving triangular systems, matrix multiplications, and HDF5 I/O for storing results. The R implementation already does this correctly. Rather than rewrite in pure C++, we call the R function. The performance difference is negligible because the computationally intensive parts (solving systems, matrix multiplications) use compiled BLAS/LAPACK regardless of whether called from R or C++.
Block size optimization: The getMaxBlockSize function dynamically determines optimal block size based on matrix dimensions and available memory. Hardcoding block size works for fixed data sizes but fails when dimensions vary. Adaptive block sizing ensures the function works efficiently across diverse datasets without user tuning.
Memory failure handling: If new fails (out of memory), C++ throws std::bad_alloc. Our catch block catches this (it’s a std::exception subclass), reports the error, and returns cleanly. Without exception handling, an allocation failure crashes the entire R session—unacceptable in production. Proper error handling degrades gracefully: report error, clean up, return, let user diagnose.
7 Step 3: Computing Canonical Coefficients
7.1 Option A: Calling R Function (Pragmatic Approach)
The simplest and most maintainable approach is calling the existing R implementation:
void writeCCAComponents_hdf5_rcpp(std::string filename,
int ncolsX, int ncolsY) {
// Access R environment where function exists
Rcpp::Environment base("package:BDStatMethExamples");
// Get R function object
Rcpp::Function write_components = base["writeCCAComponents_hdf5"];
// Call R function with C++ arguments
// Rcpp automatically converts int → numeric, string → character
write_components(
Rcpp::_["filename"] = wrap(filename),
Rcpp::_["ncolsX"] = ncolsX,
Rcpp::_["ncolsY"] = ncolsY
);
return void();
}This approach has significant advantages:
Maintainability: The R function is already tested, documented, handles edge cases. If you find a bug or need to add features, you fix it in one place (R) and both R and C++ code benefit. Duplicating logic in C++ means duplicating maintenance effort.
Flexibility: The R function can easily use R’s statistical functions, plotting capabilities, diagnostic output. Reimplementing all of that in C++ would be massive effort. By calling R, you get that functionality “for free.”
Performance reality: The computationally intensive parts—solving triangular systems with BLAS, matrix multiplication—use compiled code whether called from R or C++. The overhead of the R function call is negligible compared to the actual computation time. Only if profiling shows this specific function as a bottleneck would pure C++ implementation be justified.
Development speed: This approach took 10 lines of code. A pure C++ implementation would take hundreds of lines plus extensive testing. Unless you need that pure C++ version for specific deployment reasons, this pragmatic approach is superior.
7.2 Option B: Pure C++ Implementation (For Reference)
For completeness, here’s how you’d implement coefficient computation in pure C++. This is shown for educational purposes—most users should use Option A.
void writeCCAComponents_hdf5_cpp(std::string filename,
int ncolsX, int ncolsY) {
hdf5Dataset* dstmpX = nullptr;
hdf5Dataset* dstmpY = nullptr;
std::vector<hsize_t> stride = {1, 1};
std::vector<hsize_t> block = {1, 1};
try {
hsize_t nX = static_cast<hsize_t>(ncolsX);
hsize_t nY = static_cast<hsize_t>(ncolsY);
// Allocate storage for Q and R matrices
std::vector<double> vdXQ(nX * nX);
std::vector<double> vdYQ(nY * nY);
std::vector<double> vdXR(nX * nX);
std::vector<double> vdYR(nY * nY);
// Read X components
{
dstmpX = new hdf5Dataset(filename, "Step6/XQ", false);
dstmpX->openDataset();
dstmpX->readDatasetBlock(
{0, 0}, {nX, nX}, stride, block, vdXQ.data()
);
dstmpY = new hdf5Dataset(filename, "Step6/YQ", false);
dstmpY->openDataset();
dstmpY->readDatasetBlock(
{0, 0}, {nY, nY}, stride, block, vdYQ.data()
);
delete dstmpX; dstmpX = nullptr;
delete dstmpY; dstmpY = nullptr;
// Read R matrices
dstmpX = new hdf5Dataset(filename,
"Step3/Final_QR/XRt.R", false);
dstmpX->openDataset();
dstmpX->readDatasetBlock(
{0, 0}, {nX, nX}, stride, block, vdXR.data()
);
dstmpY = new hdf5Dataset(filename,
"Step3/Final_QR/YRt.R", false);
dstmpY->openDataset();
dstmpY->readDatasetBlock(
{0, 0}, {nY, nY}, stride, block, vdYR.data()
);
delete dstmpX; dstmpX = nullptr;
delete dstmpY; dstmpY = nullptr;
}
// Map to Eigen matrices for linear algebra
Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic,
Eigen::ColMajor>>
XQ(vdXQ.data(), nX, nX);
Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic,
Eigen::ColMajor>>
YQ(vdYQ.data(), nY, nY);
Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic,
Eigen::ColMajor>>
XR(vdXR.data(), nX, nX);
Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic,
Eigen::ColMajor>>
YR(vdYR.data(), nY, nY);
// Combine Q and R into single triangular representation
Eigen::MatrixXd XQR = Eigen::MatrixXd::Zero(nX, nX);
Eigen::MatrixXd YQR = Eigen::MatrixXd::Zero(nY, nY);
XQR.triangularView<Eigen::Upper>() = XR.triangularView<Eigen::Upper>();
XQR.triangularView<Eigen::StrictlyLower>() = XQ.triangularView<Eigen::StrictlyLower>();
YQR.triangularView<Eigen::Upper>() = YR.triangularView<Eigen::Upper>();
YQR.triangularView<Eigen::StrictlyLower>() = YQ.triangularView<Eigen::StrictlyLower>();
// Read SVD results (U, V matrices)
dstmpX = new hdf5Dataset(filename,
"SVD/CrossProd_XQ_YQ/u", false);
dstmpX->openDataset();
int nu_rows = dstmpX->nrows();
int nu_cols = dstmpX->ncols();
std::vector<double> vdU(nu_rows * nu_cols);
dstmpX->readDatasetBlock(
{0, 0}, {(hsize_t)nu_rows, (hsize_t)nu_cols},
stride, block, vdU.data()
);
delete dstmpX; dstmpX = nullptr;
dstmpY = new hdf5Dataset(filename,
"SVD/CrossProd_XQ_YQ/v", false);
dstmpY->openDataset();
int nv_rows = dstmpY->nrows();
int nv_cols = dstmpY->ncols();
std::vector<double> vdV(nv_rows * nv_cols);
dstmpY->readDatasetBlock(
{0, 0}, {(hsize_t)nv_rows, (hsize_t)nv_cols},
stride, block, vdV.data()
);
delete dstmpY; dstmpY = nullptr;
Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic,
Eigen::ColMajor>>
U(vdU.data(), nu_rows, nu_cols);
Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic,
Eigen::ColMajor>>
V(vdV.data(), nv_rows, nv_cols);
// Solve for canonical coefficients: R^(-1) * U
// Using triangular solve (efficient for triangular matrices)
Eigen::MatrixXd xcoef = XQR.triangularView<Eigen::Upper>().solve(U);
Eigen::MatrixXd ycoef = YQR.triangularView<Eigen::Upper>().solve(V);
// Here you would write xcoef and ycoef back to HDF5
// using BigDataStatMeth's HDF5 writing functions
// Omitted for brevity - see R implementation for reference
} catch(std::exception& ex) {
checkClose_file(dstmpX, dstmpY);
Rcout<< "c++ exception writeCCAComponents_hdf5_cpp: "
<< ex.what() << "\n";
return void();
}
return void();
}This pure C++ version demonstrates several advanced techniques:
Eigen for linear algebra: The Eigen library provides efficient matrix operations. The Eigen::Map creates Eigen matrix objects that share memory with C++ vectors—no copying overhead. Operations like triangular solve (triangularView<Upper>().solve()) use optimized BLAS/LAPACK implementations.
Memory mapping: Rather than copying data from std::vector to Eigen::Matrix, we map the vector’s memory directly. This zero-copy approach is essential for large matrices—copying megabytes or gigabytes is expensive both in time and temporary memory allocation.
Triangular system solving: The triangularView<Upper>().solve() exploits the triangular structure of R matrices. Solving a triangular system is O(n²), much faster than general matrix inverse which is O(n³). This is why QR decomposition is useful—it gives triangular matrices that solve efficiently.
Complexity vs benefit: This pure C++ implementation is ~100 lines versus ~10 lines for calling R. Unless profiling shows coefficient computation as a bottleneck (unlikely—SVD and QR dominate compute time), the simpler approach is better.
7.3 When Each Approach Makes Sense
Use Option A (call R) when: - Development speed matters more than microseconds - R function handles edge cases, provides diagnostics - Team maintains R code more easily - No strict deployment requirement for pure C++
Use Option B (pure C++) when: - Deploying in environments without R - Profiling shows coefficient computation as bottleneck - Need to modify algorithm in ways R version doesn’t support - Building standalone C++ library
For most users, Option A is the right choice. This pragmatic approach combines C++ performance where it matters (QR, SVD) with R flexibility where it helps (coefficient computation, result storage).
8 Compiling and Using
8.1 Package Structure
To use C++ functions in an R package, you need proper structure:
BDStatMethExamples/
├── DESCRIPTION
├── NAMESPACE
├── R/
│ └── RcppExports.R # Auto-generated
├── src/
│ ├── bdCCA.cpp # Implementation
│ ├── bdCCA.h # Header declarations
│ └── RcppExports.cpp # Auto-generated
└── inst/
└── include/
└── BigDataStatMeth.hpp
Header file (bdCCA.h):
#ifndef BDSTATMETHEXAMPLES_CCA_HPP
#define BDSTATMETHEXAMPLES_CCA_HPP
#include <Rcpp.h>
#include "BigDataStatMeth.hpp"
using namespace Rcpp;
using namespace BigDataStatMeth;
// Function declarations
void getQRbyBlocks_rcpp(
hdf5Dataset *dsA, int mblocks, bool bcenter, bool bscale,
bool byrows, bool overwrite, Nullable<int> threads = R_NilValue
);
void writeCCAComponents_hdf5_rcpp(
std::string filename, int ncolsX, int ncolsY
);
#endif // BDSTATMETHEXAMPLES_CCA_HPP8.2 Option 1: Compile with sourceCpp (Testing)
For development and testing, compile individual files:
8.3 Option 2: Build Package (Production)
For production use, build the full package:
# From package directory
R CMD build .
R CMD INSTALL BDStatMethExamples_1.0.tar.gzOr in R:
devtools::document() # Generate documentation
devtools::build() # Build package
devtools::install() # Install locally8.4 Comparing R and C++ Performance
Here’s how to benchmark the two implementations:
library(BDStatMethExamples)
library(microbenchmark)
# Create test data
set.seed(42)
n <- 5000
p <- 1000
q <- 800
X <- matrix(rnorm(n*p), n, p)
Y <- matrix(rnorm(n*q), n, q)
test_file <- "benchmark_cca.hdf5"
bdCreate_hdf5_matrix(test_file, X, "data", "X", overwriteFile = TRUE)
bdCreate_hdf5_matrix(test_file, Y, "data", "Y", overwriteFile = FALSE)
# Benchmark
results <- microbenchmark(
R_version = bdCCA_hdf5(
test_file, "data/X", "data/Y", m = 4,
overwriteResults = TRUE
),
Cpp_version = bdCCA_hdf5_rcpp(
test_file, "data/X", "data/Y",
bcenter = TRUE, bscale = FALSE, mblocks = 4,
overwrite = TRUE
),
times = 10
)
print(results)
# Typically expect:
# R version: ~15-25 seconds
# C++ version: ~10-15 seconds
# Speedup: ~1.5-2xThe C++ version is faster, but not dramatically—both use the same underlying BLAS/LAPACK for matrix operations. The speedup comes from reduced overhead in function calls and memory management, not from algorithmic differences.
9 Interactive Exercise
9.1 Practice: Optimizing C++ Implementation
Understanding comes from hands-on modification:
# Exercise 1: Add error checking
# Modify bdCCA_hdf5_rcpp to validate:
# - File exists before opening
# - Datasets have matching sample counts
# - Sufficient memory available
# How do these checks affect performance?
# Exercise 2: Implement parallel QR
# Modify getQRbyBlocks_rcpp to:
# - Process blocks in parallel using OpenMP
# - Measure speedup with 2, 4, 8 threads
# When does parallelization help?
# Exercise 3: Profile memory usage
# Instrument the code to track:
# - Peak memory usage at each step
# - Number of active HDF5 handles
# - Dataset sizes created
# Where is memory bottleneck?
# Exercise 4: Pure C++ coefficient computation
# Complete the writeCCAComponents_hdf5_cpp function:
# - Write xcoef and ycoef to HDF5
# - Compute and save canonical variates
# - Compare performance to R version
# Is the complexity justified?1. Development Trade-offs: - How much faster must C++ be to justify development cost? - At what data size does C++ become necessary? - How do you decide R vs C++ for new methods?
2. Memory Management: - Why initialize pointers to nullptr? - What happens if you forget delete? - How do smart pointers (unique_ptr, shared_ptr) help?
3. Error Handling: - Why wrap everything in try/catch? - What errors can HDF5 operations throw? - How do you test error handling paths?
4. Hybrid Approaches: - When is calling R from C++ appropriate? - What’s the overhead of R function calls? - How do you profile mixed R/C++ code?
5. Production Deployment: - How do you ensure C++ code is maintainable? - What testing is essential for C++ implementations? - How do you document C++ API for users?
These questions guide you beyond syntax to software engineering principles essential for production statistical code.
10 Key Takeaways
Let’s consolidate what you’ve learned about implementing statistical methods in C++ with BigDataStatMeth.
10.1 Essential Concepts
C++ provides performance, not different algorithms. The mathematical algorithm—QR decomposition, cross-product, SVD, solving for coefficients—is identical between R and C++ implementations. The difference is execution efficiency, not what’s computed. C++ is faster because of reduced overhead in memory management, function calls, and type conversions, not because it uses superior mathematics. Don’t assume C++ is always necessary; profile first to identify actual bottlenecks. Many statistical methods run adequately in R because the heavy lifting (BLAS, LAPACK operations) uses compiled code regardless of whether called from R or C++.
Memory management is mandatory, not optional. Every new requires matching delete. Every opened HDF5 dataset must be closed. Every allocated pointer must be tracked through exception paths. R’s garbage collector handles this automatically, making R code shorter and less error-prone. C++ requires explicit management, which is why production C++ code includes try/catch blocks, cleanup functions, and defensive null checks. Forgetting these leads to memory leaks (gradual resource exhaustion), crashes (accessing freed memory), or corrupted results (partial cleanup). The discipline of declaring pointers to nullptr, allocating in try blocks, deleting in catch and after use, then setting to nullptr prevents these failures.
Exception handling prevents crashes. HDF5 operations can fail—file not found, dataset doesn’t exist, out of disk space, permissions denied. Without exception handling, these failures crash the entire R session. With proper try/catch, the function reports the error, cleans up allocated resources, and returns gracefully. The user sees a clear error message, R continues running, and memory doesn’t leak. Production code must never crash—degrading gracefully (return error, let user fix problem) is mandatory. The pattern of wrapping all operations in try and checking for exceptions isn’t paranoia; it’s engineering reliability.
Rcpp bridges R and C++ transparently. The [[Rcpp::export]] annotation makes C++ functions callable from R without manual wrapper code. Type conversion—R strings to C++ std::string, R numerics to C++ double, R booleans to C++ bool—happens automatically. Calling R functions from C++ (like writeCCAComponents_hdf5_rcpp calling R’s implementation) is equally transparent. This seamless integration means you can use C++ where it provides value (tight loops, custom algorithms, memory control) while keeping flexible parts in R (statistical tests, plotting, diagnostics). Don’t think of R and C++ as separate worlds requiring manual integration; Rcpp makes them a unified environment.
BigDataStatMeth’s C++ functions do the heavy lifting. Writing pure C++ linear algebra from scratch—QR decomposition, SVD, matrix multiplication—would be thousands of lines and likely slower than using BigDataStatMeth’s implementations (which internally use optimized BLAS/LAPACK). The value of C++ isn’t reimplementing everything; it’s combining BigDataStatMeth’s building blocks in custom ways, adding algorithms not provided, or optimizing specific bottlenecks. The hybrid approach—calling BigDataStatMeth functions from C++, calling R functions where convenient—is pragmatic engineering, not compromise.
Compilation catches errors early. R is interpreted—errors appear at runtime, potentially deep into analysis. C++ is compiled—type mismatches, missing headers, syntax errors appear before execution. This is why C++ development feels slower (write code, compile, fix errors, repeat) but produces more robust code. Once C++ code compiles, many categories of errors (wrong types, missing functions, invalid syntax) are impossible at runtime. The debugging time shifts from runtime to compile time, and compile-time debugging is easier—the compiler tells you exactly what’s wrong. Runtime debugging in statistical code often requires re-running expensive analyses; compile-time debugging prevents that wasted time.
10.2 When to Use C++ vs R
Understanding when C++ justifies the development investment versus when R suffices guides efficient method development.
✅ Implement in C++ when:
Profiling identifies clear bottlenecks - Don’t guess; measure. If profiling shows 80% of time in a specific operation not using compiled code (e.g., custom R loops), that’s a C++ candidate. If 80% of time is in
bdSVD_hdf5(already compiled), C++ won’t help.Method runs thousands of times in production - A 50% speedup irrelevant for one-off analyses matters hugely in pipelines processing thousands of datasets daily. The accumulated time savings—and reduced compute costs—justify development effort.
Custom algorithms not in BigDataStatMeth - If implementing novel decompositions, specialized optimization, or algorithms from cutting-edge papers, C++ provides algorithmic freedom. You’re not limited to existing BigDataStatMeth functions; you implement exactly what you need.
Deploying in pure C++ environments - If your production infrastructure is pure C++ (no R installed), you need C++ implementations. This is rare—most scientific computing environments have R—but in embedded systems, real-time environments, or certain production systems, pure C++ is mandatory.
❌ Stay in R when:
Development speed matters more - For research methods, prototype algorithms, exploratory analyses, R’s faster development cycle (write, run, inspect, iterate) is more valuable than execution speed. Perfect code running slow beats broken code running fast.
Method uses existing BigDataStatMeth functions - If your method orchestrates
bdSVD_hdf5,bdCrossprod_hdf5,bdQR_hdf5, the heavy computation already uses compiled code. R orchestration overhead is negligible. Only if you need custom operations not provided by BigDataStatMeth is C++ needed.Infrequent use - Monthly analysis taking 10 minutes in R? Don’t optimize. The development time (writing C++, testing, debugging) exceeds total computation time saved over the method’s lifetime. Spend development time on methods used frequently.
Team maintains R more easily - If your team excels in R but struggles with C++ pointer management, memory leaks, and compilation issues, maintainability matters more than speed. Better to have understandable R code that runs slightly slower than inscrutable C++ code nobody can debug or extend.
The key principle: profile to identify bottlenecks, optimize selectively where profiling shows benefit, and use each language where it excels. Most statistical method implementations belong in R, with BigDataStatMeth’s compiled functions handling computational intensity.
11 Next Steps
Deepen C++ expertise:
- Study BigDataStatMeth’s header implementations
- Learn OpenMP for parallel computing
- Explore Eigen library for advanced linear algebra
- Master C++ debugging tools (gdb, valgrind)
Optimize existing methods:
- Profile CCA implementation to find bottlenecks
- Implement parallel block processing
- Optimize memory allocation patterns
- Add SIMD vectorization where beneficial
Extend to new methods:
- Implement sparse CCA with custom regularization
- Build regularized PLS using similar patterns
- Create custom matrix factorizations
- Develop online/streaming algorithms
Production deployment:
- Add comprehensive error handling
- Implement logging and diagnostics
- Create extensive test suites
- Document C++ API for users
- Build CI/CD pipelines for testing