flowchart TD
A[Matrix on Disk<br/>100,000 × 50,000] --> B[Initialize<br/>column_sums = 0]
B --> C{More blocks?}
C -->|Yes| D[Read Block i<br/>10,000 rows]
D --> E[Compute<br/>colSums for block]
E --> F[Accumulate:<br/>column_sums += block_sums]
F --> C
C -->|No| G[Finalize:<br/>means = column_sums / n_rows]
style A fill:#f0f8ff
style G fill:#e8f6e8
style D fill:#fff8e1
style E fill:#fff8e1
style F fill:#fff8e1
Block-Wise Computing
Adapting Statistical Algorithms for Disk-Based Data
0.1 What You’ll Learn
By the end of this section, you will:
- Understand the divide-process-combine paradigm of block-wise algorithms
- Know which operations decompose naturally to block-wise processing
- Recognize which algorithms are challenging to adapt for blocks
- Understand memory-computation trade-offs with block size
- See concrete examples of block-wise matrix operations
- Know how to compose new methods from existing BigDataStatMeth functions
1 From In-Memory to Block-Wise: A Paradigm Shift
You now understand why large datasets don’t fit in RAM and how HDF5 enables efficient access to disk-stored data. But knowing you can read arbitrary portions of a matrix from disk doesn’t automatically tell you how to perform statistical analyses on that data. The challenge is algorithmic: how do you adapt methods designed for in-memory matrices to work with data that must be processed in pieces?
This is not merely an implementation detail - it requires rethinking how algorithms work. A standard PCA implementation might call svd(X) where X is a complete matrix in memory. That single function call encapsulates a sophisticated numerical algorithm, but it fundamentally assumes it can access any element of X at any time. When X lives on disk, this assumption breaks. You must decompose the algorithm into steps that process manageable blocks of data, with intermediate results that fit in memory.
The mathematical operations remain the same - we’re still computing eigenvalues, performing matrix multiplications, or fitting regression models. But the computational strategy changes fundamentally. This section explains how standard statistical methods are adapted for block-wise processing, using examples from BigDataStatMeth’s implementation.
2 The Core Concept: Divide, Process, Combine
Block-wise computing follows a conceptual pattern that applies across different statistical methods:
- Divide: Partition the data matrix into blocks that fit comfortably in memory
- Process: Perform computations on each block independently (when possible)
- Combine: Merge the block-level results to obtain the final answer
The art lies in steps 2 and 3 - figuring out which computations can be done on blocks independently, what information must be passed between blocks, and how to combine results validly. Not all algorithms decompose equally well, and some require sophisticated mathematical reformulations to work in a block-wise framework.
Let’s start with simple examples and build toward more complex methods.
3 Example 1: Computing Means and Standard Deviations
Computing column means illustrates the simplest case of block-wise processing. Suppose you have a matrix with 100,000 rows and 50,000 columns that’s too large for memory, but you can comfortably work with blocks of 10,000 rows at a time.
3.1 The In-Memory Approach
# Traditional approach - assumes X fits in memory
column_means <- colMeans(X)

This single line hides significant computation: for each column, R reads all 100,000 values, sums them, and divides by 100,000.
3.2 The Block-Wise Approach
For block-wise computation, we leverage a mathematical property: the mean of a set of values equals the weighted average of the means of subsets, where weights are the subset sizes.
Step 1: Initialize accumulators
We start by creating a vector to accumulate the sum of each column across all blocks. We also query the HDF5 file metadata to know how many rows exist in total and calculate how many blocks we’ll need to process.
Step 2: Process blocks sequentially
For each block:
- Calculate boundaries: Determine which rows belong to this block. Most blocks have block_size rows, but the last block might be smaller.
- Read from disk: Load only this block's rows into RAM. This is the key to memory efficiency - we never hold the entire matrix.
- Compute contribution: Sum the values in each column of this block. This gives us a vector of length n_cols.
- Accumulate: Add this block's column sums to our running total. This works because addition is associative and commutative.
- Free memory: Explicitly release the block's memory before reading the next one. This ensures peak memory usage is just one block, not all blocks.
Step 3: Calculate final means
After processing all blocks, we have the total sum for each column. Divide by the total number of rows to get the mean.
Why this works mathematically:
\text{mean}(X) = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum_{j=1}^{k} \sum_{i \in \text{block}_j} x_i}{n}
The inner sums are what we compute for each block, and we accumulate them in the outer sum.
# Algorithm: BlockwiseMeans
# Input: HDF5 dataset with n_rows × n_cols, block_size
# Output: column_means vector of length n_cols
# 1. Initialize
column_sums <- vector of zeros (length n_cols)
n_rows_total <- get_total_rows_from_hdf5()
n_blocks <- ceiling(n_rows_total / block_size)
# 2. Process each block
for (i in 1:n_blocks) {
# Define block boundaries
start_row <- (i-1) * block_size + 1
end_row <- min(i * block_size, n_rows_total)
# Read block from disk
block <- read_hdf5_rows(start_row, end_row)
# Compute block contribution
block_sums <- compute_column_sums(block)
# Accumulate results
column_sums <- column_sums + block_sums
# Free memory
free(block)
}
# 3. Finalize computation
column_means <- column_sums / n_rows_total
# 4. Return result
return(column_means)

Peak memory is one block (10,000 × 50,000 doubles = ~4 GB) plus the accumulator vector (50,000 values = ~0.4 MB).
Total: ~4 GB instead of 40 GB for the full matrix - a 10× reduction!
Key insight: For operations that can be expressed as combining independent contributions from blocks (sums, counts), block-wise processing is straightforward. The mathematical property that makes this work is additive decomposition: the sum over all data equals the sum of sums over blocks.
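As a sanity check of this additive decomposition, here is a small, self-contained R sketch that mimics the algorithm above on an in-memory matrix (the row subsetting stands in for HDF5 block reads; this is an illustration, not the package's C++ code):

# Toy check of block-wise column means; X stands in for a disk-backed dataset
set.seed(1)
X <- matrix(rnorm(1e4 * 50), nrow = 1e4, ncol = 50)
block_size <- 2500

col_sums <- numeric(ncol(X))                        # accumulator
for (start in seq(1, nrow(X), by = block_size)) {
  rows <- start:min(start + block_size - 1, nrow(X))
  block <- X[rows, , drop = FALSE]                  # "read" one block
  col_sums <- col_sums + colSums(block)             # accumulate its contribution
}
block_means <- col_sums / nrow(X)

max(abs(block_means - colMeans(X)))                 # ~1e-16: same result up to rounding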
BigDataStatMeth implementation: This is handled internally by C++ functions like get_HDF5_mean_sd_by_column() (or by_row variant) which manage block reading, computation, and accumulation efficiently using OpenMP parallelization when multiple cores are available.
3.3 Why Standard Deviation Is Slightly More Complex
Standard deviation requires both means and squared deviations from means. You might think you could compute it in one pass through blocks, but there’s a subtle issue: you need the overall mean to compute deviations, but you won’t know the overall mean until you’ve seen all blocks.
Computing standard deviation requires two complete passes through the data:
- Pass 1: Compute overall means
- Pass 2: Compute squared deviations using those means
This doubles I/O but keeps memory constant.
Two passes through the data:
flowchart TD
A[Dataset on Disk] --> B[Pass 1:<br/>Compute Means]
B --> C[Store means<br/>in memory]
C --> D[Pass 2:<br/>Start over]
D --> E[Read Block i]
E --> F[Center:<br/>block - means]
F --> G[Square:<br/>centered²]
G --> H[Accumulate<br/>squared_devs]
H --> I{More blocks?}
I -->|Yes| E
I -->|No| J[Finalize:<br/>sqrt of sum over n-1]
style A fill:#f0f8ff
style B fill:#fff8e1
style C fill:#ffe8e8
style J fill:#e8f6e8
Why two passes are necessary:
To compute standard deviation, we need: \text{SD} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}
The problem: we need \bar{x} (the mean) to compute each (x_i - \bar{x})^2 term. But we can’t know the overall mean until we’ve processed all blocks.
Pass 1: Compute means (same as before)
Use the block-wise mean algorithm to get the global mean for each column.
Pass 2: Compute squared deviations
Now that we know the means, we:
Read each block again: Yes, this means reading the entire dataset twice. For very large datasets on slow storage, this I/O cost matters.
Center the block: Subtract the global mean from each column. This “centering” operation broadcasts the mean vector across all rows of the block.
Square element-wise: Each centered value is squared: (x - \bar{x})^2
Accumulate squared deviations: Sum these squared values, column by column, across all blocks.
Finalize: After processing all blocks, divide by n-1 (Bessel’s correction for sample variance) and take the square root.
Alternative: One-pass algorithm
There exist one-pass algorithms (e.g., Welford's algorithm) that compute the mean and variance simultaneously, updating them observation by observation. They are numerically stable but more intricate to implement, and the straightforward two-pass approach is at least as accurate. BigDataStatMeth opts for the two-pass approach for clarity and numerical reliability in typical statistical applications.
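For illustration only, a minimal single-vector Welford update looks like this in R (a sketch for comparison, not the package's implementation):

# Minimal Welford sketch for one numeric vector (illustration only)
welford_sd <- function(x) {
  n <- 0; mean_x <- 0; M2 <- 0
  for (xi in x) {
    n <- n + 1
    delta <- xi - mean_x
    mean_x <- mean_x + delta / n          # running mean
    M2 <- M2 + delta * (xi - mean_x)      # running sum of squared deviations
  }
  sqrt(M2 / (n - 1))
}

x <- rnorm(1e4, mean = 1e6, sd = 0.5)
c(welford = welford_sd(x), two_pass = sd(x))   # the two estimates agree closely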
# Algorithm: BlockwiseStandardDeviation
# Input: HDF5 dataset with n_rows × n_cols, block_size
# Output: column_sds vector of length n_cols
# Pass 1: Compute means
column_means <- BlockwiseMeans(dataset, block_size)
# Pass 2: Compute squared deviations
squared_devs <- vector of zeros (length n_cols)
n_rows_total <- get_total_rows_from_hdf5()
n_blocks <- ceiling(n_rows_total / block_size)
for (i in 1:n_blocks) {
# Define block boundaries
start_row <- (i-1) * block_size + 1
end_row <- min(i * block_size, n_rows_total)
# Read block from disk
block <- read_hdf5_rows(start_row, end_row)
# Center the block (subtract column means)
centered_block <- block - column_means # Broadcast subtraction
# Square element-wise
squared_block <- centered_block^2
# Sum squared deviations for this block
block_sq_devs <- compute_column_sums(squared_block)
# Accumulate
squared_devs <- squared_devs + block_sq_devs
# Free memory
free(block)
}
# Finalize: compute standard deviation
column_sds <- sqrt(squared_devs / (n_rows_total - 1))
return(column_sds)

I/O cost: This two-pass approach doubles the I/O (reading data twice), but keeps memory requirements constant regardless of data size.
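The same two-pass logic can be checked on a small in-memory matrix, with row chunks standing in for HDF5 blocks (a toy sketch, not the package's C++ code):

set.seed(1)
X <- matrix(rnorm(1e4 * 20), nrow = 1e4, ncol = 20)
block_size <- 2500
starts <- seq(1, nrow(X), by = block_size)

# Pass 1: accumulate column sums block by block
col_sums <- numeric(ncol(X))
for (s in starts) {
  rows <- s:min(s + block_size - 1, nrow(X))
  col_sums <- col_sums + colSums(X[rows, , drop = FALSE])
}
col_means <- col_sums / nrow(X)

# Pass 2: accumulate squared deviations using the global means
sq_devs <- numeric(ncol(X))
for (s in starts) {
  rows <- s:min(s + block_size - 1, nrow(X))
  centered <- sweep(X[rows, , drop = FALSE], 2, col_means)  # broadcast subtraction
  sq_devs <- sq_devs + colSums(centered^2)
}
col_sds <- sqrt(sq_devs / (nrow(X) - 1))

max(abs(col_sds - apply(X, 2, sd)))   # ~1e-15: matches the in-memory result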
BigDataStatMeth implementation: The package provides two ways to compute these statistics:
- bdNormalize_hdf5() - Performs centering and/or scaling in-place, automatically handling the multi-pass computation. Results are written back to the HDF5 file.
- bdgetSDandMean_hdf5() - Computes means and standard deviations and returns them either in memory (onmemory = TRUE) or stores them in the HDF5 file (onmemory = FALSE) for later use. This is useful when you need the statistics themselves, not just normalized data.
Both functions use the C++ internal implementation get_HDF5_mean_sd_by_column() which handles block iteration, parallel computation with OpenMP, and numerically stable accumulation.
4 Example 2: Matrix Multiplication by Blocks
Matrix multiplication is fundamental to many statistical methods - it appears in linear regression, principal component analysis, and countless other algorithms. It also illustrates a more complex block-wise pattern because the result depends on interactions between different parts of both input matrices.
4.1 The Mathematical Foundation
Recall that matrix multiplication C = A \times B where A is m \times k and B is k \times n, produces:
C_{ij} = \sum_{p=1}^{k} A_{ip} \times B_{pj}
Each element of C is a dot product of a row from A and a column from B. The key insight for block-wise processing is that this operation can be decomposed spatially - we can compute different regions of C independently.
4.2 Block-Wise Strategy
If we partition A into row blocks and B into column blocks:
A = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_m \end{bmatrix}, \quad B = \begin{bmatrix} B_1 & B_2 & \cdots & B_n \end{bmatrix}
Then each block of the result C_{ij} = A_i \times B_j can be computed independently. More usefully, if we partition along the shared dimension (the k dimension):
A = \begin{bmatrix} A_1 & A_2 & \cdots & A_p \end{bmatrix}, \quad B = \begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_p \end{bmatrix}
Then: C = \sum_{i=1}^{p} A_i \times B_i
This decomposition means we can:

1. Read one block pair (A_i, B_i) at a time from disk
2. Compute their contribution to C
3. Accumulate the result
4. Never hold more than two blocks and the accumulator in memory
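A few lines of base R verify this identity on matrices small enough to hold in memory (the column and row subsetting stands in for reading block pairs from disk):

set.seed(1)
m <- 200; k <- 1000; n <- 150; block_size <- 250
A <- matrix(rnorm(m * k), m, k)
B <- matrix(rnorm(k * n), k, n)

C <- matrix(0, m, n)                                            # accumulator
for (start in seq(1, k, by = block_size)) {
  idx <- start:min(start + block_size - 1, k)
  C <- C + A[, idx, drop = FALSE] %*% B[idx, , drop = FALSE]    # block pair contribution
}

max(abs(C - A %*% B))   # ~1e-12: identical to the direct product up to rounding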
4.3 Practical Implementation
flowchart TD
A["Matrix A m × k<br/>on disk"] --> B["Partition along<br/>k dimension"]
C["Matrix B k × n<br/>on disk"] --> B
B --> D["Block pairs:<br/>A₁ B₁, A₂ B₂, ..."]
D --> E{More blocks?}
E -->|Yes| F["Read block pair<br/>Aᵢ Bᵢ"]
F --> G["Compute<br/>Aᵢ × Bᵢ"]
G --> H["Accumulate:<br/>C += Aᵢ × Bᵢ"]
H --> E
E -->|No| I["Result matrix<br/>C m × n"]
style A fill:#f0f8ff
style C fill:#f0f8ff
style I fill:#e8f6e8
style F fill:#fff8e1
style G fill:#fff8e1
style H fill:#fff8e1
Key insight: Matrix multiplication can be decomposed along the shared dimension (the k in A_{m \times k} \times B_{k \times n}).
Step 1: Initialize
Create a zero matrix C of dimensions m \times n to accumulate results. Calculate how many blocks we’ll need along the k dimension.
Step 2: Process each block pair
For each block i:
- Read from A: Extract columns start:end of matrix A. This gives us a "tall thin" matrix: m rows (all of them) by block_width columns (just this block's slice).
- Read from B: Extract rows start:end of matrix B. This gives us a "short wide" matrix: block_width rows (matching A's columns) by n columns (all of them).
- Multiply: Compute the standard matrix product of these two small pieces. The result is m \times n - the same size as our final answer.
- Accumulate: Add this contribution to C. This works because C = A \times B = \sum_{i=1}^{\text{blocks}} A_i \times B_i, where A_i and B_i are the block slices.
Step 3: Handle result
If C fits in memory, return it directly. If not (which can happen when m and n are both large), write it to HDF5 incrementally and return a reference.
Memory analysis:
Peak memory = A_{block} + B_{block} + C = (m \times b) + (b \times n) + (m \times n) values
For large k but moderate m and n, this is much less than holding full A and B which would require (m \times k) + (k \times n) values.
# Algorithm: BlockwiseMatrixMultiplication
# Input: Matrix A m × k in HDF5, Matrix B k × n in HDF5, block_size
# Output: Matrix C = A × B m × n
# Initialize result matrix
C <- zero matrix m × n
n_blocks <- ceiling(k / block_size)
# Process blocks along shared dimension
for (i in 1:n_blocks) {
# Determine block range along k dimension
start_idx <- (i-1) * block_size + 1
end_idx <- min(i * block_size, k)
block_width <- end_idx - start_idx + 1
# Read block from A (columns start_idx:end_idx)
A_block <- read_hdf5_columns(A, start_idx:end_idx) # m × block_width
# Read corresponding block from B (rows start_idx:end_idx)
B_block <- read_hdf5_rows(B, start_idx:end_idx) # block_width × n
# Compute this block's contribution to C
C_contribution <- A_block %*% B_block # Standard matrix mult
# Accumulate into result
C <- C + C_contribution
# Free memory
free(A_block, B_block)
}
# Write result (if too large for memory, write directly to HDF5)
if (C fits in memory) {
return(C)
} else {
write_to_hdf5(C)
return(HDF5_reference)
}

Matrix multiplication is decomposed spatially along the shared dimension k. Each block pair contributes independently to the final result - this is different from the temporal decomposition used for means, where blocks contribute sequentially.
BigDataStatMeth implementation: The bdblockmult_hdf5() function implements this strategy with additional optimizations:
- Automatic block size selection based on available memory
- Writing results directly to HDF5 to handle cases where even the result matrix is too large
- Parallel processing of independent block multiplications using OpenMP when multiple cores are available
- Careful memory management to avoid unnecessary copies
- All computation done in C++ for maximum efficiency
5 Example 3: The QR Decomposition Challenge
The QR decomposition is more challenging because it requires maintaining orthogonality across all columns, which creates dependencies between blocks. This illustrates how not all algorithms decompose trivially.
5.1 Why QR Matters
The QR decomposition factors a matrix A (dimensions m \times n) into: A = Q \times R
where Q is orthogonal (m \times n, with Q^T Q = I) and R is upper triangular (n \times n). This decomposition is fundamental to:
- Solving least squares problems (linear regression)
- Computing principal components
- Many iterative algorithms that need orthonormal bases
5.2 The Block-Wise Strategy: Hierarchical QR
The key insight is that QR decomposition can be computed hierarchically through a series of smaller QR decompositions.
flowchart TD
A["Matrix A<br/>m × n<br/>Too large for RAM"] --> B["Partition into<br/>k row blocks"]
B --> C1["Block A₁<br/>m₁ × n"]
B --> C2["Block A₂<br/>m₂ × n"]
B --> C3["Block Aₖ<br/>mₖ × n"]
C1 --> D1["Local QR:<br/>A₁ = Q₁R₁"]
C2 --> D2["Local QR:<br/>A₂ = Q₂R₂"]
C3 --> D3["Local QR:<br/>Aₖ = QₖRₖ"]
D1 --> E["Stack R matrices:<br/>[R₁; R₂; ...; Rₖ]<br/>kn × n<br/>Fits in RAM!"]
D2 --> E
D3 --> E
E --> F["Final QR:<br/>Rstack = QR × Rfinal"]
F --> G["Reconstruct Q:<br/>Multiply Q_i by QR blocks"]
G --> H["Result:<br/>A = Q × Rfinal"]
style A fill:#ffe8e8
style E fill:#fff8e1
style H fill:#e8f6e8
style D1 fill:#f0f8ff
style D2 fill:#f0f8ff
style D3 fill:#f0f8ff
Step 1: Partition
Divide the matrix A into blocks along rows: A = [A_1; A_2; ...; A_k]
Step 2: Local QR
Compute QR for each block independently: A_i = Q_i R_i
Each block’s QR decomposition can be computed separately because they operate on independent rows.
Step 3: Collect R matrices
Stack all the upper triangular R_i matrices vertically: R_{stack} = [R_1; R_2; ...; R_k]
Critical insight: R_{stack} is much smaller than A. It’s kn \times n instead of m \times n, and typically kn \ll m, so this stacked matrix fits in memory even when A doesn’t!
Step 4: Final QR
Compute QR of the stacked R matrices (this fits in RAM): R_{stack} = Q_R \times R_{final}
Step 5: Reconstruct Q
The overall Q is obtained by multiplying each local Q_i by the corresponding block of Q_R. This reconstruction can be done block by block, again staying within memory constraints.
Why this works mathematically:
A = \begin{bmatrix} Q_1 R_1 \\ Q_2 R_2 \\ \vdots \\ Q_k R_k \end{bmatrix} = \begin{bmatrix} Q_1 & & \\ & \ddots & \\ & & Q_k \end{bmatrix} \begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_k \end{bmatrix}
The block-diagonal matrix built from the Q_i has orthonormal columns, so applying another QR to the stacked R matrices, R_{stack} = Q_R R_{final}, gives A = \big(\operatorname{diag}(Q_1, \ldots, Q_k)\, Q_R\big) R_{final} - the final factorization, with the global Q obtained block by block as Q_i times the corresponding block of Q_R.
# Algorithm: HierarchicalQR
# Input: Matrix A m × n in HDF5, block_size
# Output: Q matrix m × n, R matrix n × n
# Step 1: Partition and compute local QRs
n_blocks <- ceiling(m / block_size)
R_list <- list() # Store R matrices
for (i in 1:n_blocks) {
# Define block boundaries
start_row <- (i-1) * block_size + 1
end_row <- min(i * block_size, m)
# Read block from HDF5
A_block <- read_hdf5_rows(start_row, end_row)
# Compute local QR
qr_result <- qr_decomposition(A_block)
Q_local <- qr_result$Q
R_local <- qr_result$R
# Store Q to HDF5 for later
write_hdf5(Q_local, group = "local_Q", dataset = paste0("Q_", i))
# Keep R in memory (small)
R_list[[i]] <- R_local
free(A_block, Q_local)
}
# Step 2: Stack R matrices (fits in RAM)
R_stacked <- do.call(rbind, R_list) # (k*n × n)
# Step 3: Final QR on stacked Rs
qr_final <- qr_decomposition(R_stacked)
Q_R <- qr_final$Q
R_final <- qr_final$R
# Step 4: Reconstruct global Q (block by block)
for (i in 1:n_blocks) {
# Read local Q_i
Q_local <- read_hdf5(group = "local_Q", dataset = paste0("Q_", i))
# Multiply by corresponding Q_R block
start_idx <- (i-1) * n + 1
end_idx <- i * n
Q_R_block <- Q_R[start_idx:end_idx, ]
Q_global_block <- Q_local %*% Q_R_block
# Write back to HDF5
write_hdf5(Q_global_block, group = "final_Q", dataset = paste0("Q_", i))
}
return(list(Q = "final_Q", R = R_final))

The stacked R matrices are only (k \times n) \times n in size. Even with 10 blocks, if n=1000, that's just a 10{,}000 \times 1000 matrix - about 80 MB. This fits comfortably in RAM even when the original matrix is hundreds of GB!
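The whole scheme can be verified in base R on a matrix that does fit in memory; qr(), qr.Q() and qr.R() are base R functions, and the row-index blocks stand in for HDF5 reads (a toy sketch, not the package's implementation):

set.seed(1)
m <- 1000; n <- 20; k_blocks <- 4
A <- matrix(rnorm(m * n), m, n)
blocks <- split(seq_len(m), rep(seq_len(k_blocks), each = m / k_blocks))

# Steps 1-2: local QR of each row block
local <- lapply(blocks, function(idx) {
  f <- qr(A[idx, , drop = FALSE])
  list(Q = qr.Q(f), R = qr.R(f))
})

# Steps 3-4: stack the small R factors and factor them again
R_stack <- do.call(rbind, lapply(local, `[[`, "R"))   # (k_blocks * n) x n
f2 <- qr(R_stack)
Q_R <- qr.Q(f2); R_final <- qr.R(f2)

# Step 5: reconstruct the global Q block by block
Q_global <- matrix(0, m, n)
for (i in seq_len(k_blocks)) {
  rows_R <- ((i - 1) * n + 1):(i * n)
  Q_global[blocks[[i]], ] <- local[[i]]$Q %*% Q_R[rows_R, , drop = FALSE]
}

max(abs(Q_global %*% R_final - A))        # ~1e-14: A is recovered
max(abs(crossprod(Q_global) - diag(n)))   # ~1e-15: Q has orthonormal columns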
BigDataStatMeth implementation: The package implements hierarchical QR in bdQR_hdf5(), handling the block partitioning, intermediate storage, and final assembly automatically. This enables methods like PCA on out-of-memory data.
6 Example 4: Block-Wise SVD for Principal Component Analysis
Singular Value Decomposition (SVD) underlies principal component analysis and is one of the most important matrix decompositions in statistics. Computing SVD on matrices too large for memory is a significant challenge that demonstrates the full power of block-wise thinking.
6.1 The Mathematical Goal
For a matrix X (dimensions n \times p), we want: X = U \Sigma V^T
where:
- U is n \times r with orthonormal columns (left singular vectors)
- \Sigma is r \times r diagonal (singular values)
- V is p \times r with orthonormal columns (right singular vectors)
- r = \min(n, p), or smaller if we want only the leading components
For PCA, we typically want only the first k components where k \ll r, which simplifies the problem significantly.
6.2 The Block-Wise Strategy: Hierarchical Decomposition
BigDataStatMeth implements a hierarchical SVD algorithm (Iwen & Ong, 2017) that works in levels.
flowchart TD
A["Matrix X n × p<br/>Too large for RAM"] --> B["Level 1:<br/>Partition into m blocks"]
B --> C1["Block X₁"]
B --> C2["Block X₂"]
B --> C3["Block Xₘ"]
C1 --> D1["Local SVD:<br/>X₁ = U₁Σ₁V₁ᵀ<br/>Keep k components"]
C2 --> D2["Local SVD:<br/>X₂ = U₂Σ₂V₂ᵀ<br/>Keep k components"]
C3 --> D3["Local SVD:<br/>Xₘ = UₘΣₘVₘᵀ<br/>Keep k components"]
D1 --> E1["Form Z₁ = Σ₁V₁ᵀ<br/>k × p"]
D2 --> E2["Form Z₂ = Σ₂V₂ᵀ<br/>k × p"]
D3 --> E3["Form Zₘ = ΣₘVₘᵀ<br/>k × p"]
E1 --> F["Level 2:<br/>Stack Z matrices<br/>Z = [Z₁; Z₂; ...; Zₘ]<br/>mk × p<br/>Fits in RAM!"]
E2 --> F
E3 --> F
F --> G["Global SVD:<br/>Z = UzΣzVzᵀ"]
G --> H["Extract Results:<br/>Singular values = Σz<br/>Right vectors = Vz<br/>Left vectors from U₁...Uₘ"]
H --> I["Final SVD:<br/>X ≈ UΣVᵀ"]
style A fill:#ffe8e8
style F fill:#fff8e1
style I fill:#e8f6e8
style D1 fill:#f0f8ff
style D2 fill:#f0f8ff
style D3 fill:#f0f8ff
style G fill:#e8f6e8
Level 1: Partition and Local SVD
Partition X into blocks along rows: X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{bmatrix}
Compute truncated SVD for each block, keeping only k components: X_i = U_i \Sigma_i V_i^T
where U_i is n_i \times k, \Sigma_i is k \times k, and V_i is p \times k.
Form weighted matrices from the right singular vectors: Z_i = \Sigma_i V_i^T
Each Z_i is k \times p - much smaller than the original X_i, which was n_i \times p!
Level 2: Merge and Global SVD
Stack the Z_i matrices: Z = \begin{bmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_m \end{bmatrix}
This Z matrix is mk \times p. Since the number of blocks m and the number of components k are both small, Z fits in memory even when X does not.
Compute global SVD of Z: Z = U_Z \Sigma_Z V_Z^T
Extract results:
- Global singular values = \Sigma_Z
- Global right singular vectors = V_Z
- Global left singular vectors: multiply back through blocks
Why this works: The row space of X (spanned by right singular vectors) is well-approximated by combining the row spaces of blocks. Block SVDs capture main variation within each block; global SVD captures overall variation.
Approximation quality: Not exact, but excellent for large n relative to p with sufficient k. Error decreases as k increases.
Concrete example: X is 100,000 × 20,000 (16 GB)
- Partition into m = 10 blocks of 10,000 rows
- Keep k = 50 components
Memory per block during local SVD:
- Block X_i: 10,000 × 20,000 values = 1.6 GB
- U_i: 10,000 × 50 values = 4 MB
- \Sigma_i: 50 × 50 values = 20 KB
- V_i: 20,000 × 50 values = 8 MB
- Peak: ~1.6 GB (dominated by the block)

Memory for global SVD:
- Z: (10 × 50) × 20,000 values ≈ 80 MB
- This easily fits in memory!
Result: 16 GB problem → 1.6 GB peak memory (10× reduction) while computing accurate approximation.
The stacked Z matrix is small: (m \times k) \times p instead of n \times p. With 10 blocks and 50 components, even for p=20{,}000, that's only about 80 MB - fitting comfortably in RAM even when the original is 16 GB!
We’ve reduced a 16 GB problem to one requiring ~1.6 GB peak memory - a 10× reduction - while computing an accurate approximation to the full SVD.
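A toy base-R sketch of the two-level idea, keeping the notation Z_i = \Sigma_i V_i^T from above and using in-memory row blocks as stand-ins for HDF5 reads (an illustration, not the package's implementation), shows how well the leading singular values are recovered:

set.seed(42)
n <- 600; p <- 80; m_blocks <- 3; k <- 10

# Low-rank-plus-noise data, so a small k captures most of the structure
X <- matrix(rnorm(n * k), n, k) %*% matrix(rnorm(k * p), k, p) +
  0.01 * matrix(rnorm(n * p), n, p)
blocks <- split(seq_len(n), rep(seq_len(m_blocks), each = n / m_blocks))

# Level 1: truncated SVD per row block, keep Z_i = Sigma_i V_i^T (k x p)
Z <- do.call(rbind, lapply(blocks, function(idx) {
  s <- svd(X[idx, , drop = FALSE], nu = 0, nv = k)
  diag(s$d[1:k], k, k) %*% t(s$v)
}))                                        # (m_blocks * k) x p - small

# Level 2: SVD of the stacked matrix
s2 <- svd(Z, nu = 0, nv = k)

# Compare singular values with the full in-memory SVD
s_full <- svd(X, nu = 0, nv = k)
round(cbind(hierarchical = s2$d[1:k], full = s_full$d[1:k]), 4)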
BigDataStatMeth implementation: The bdSVD_hdf5() function implements this hierarchical strategy with additional optimizations:
- Automatic selection of block size and hierarchy depth based on data dimensions and available memory
- Parallel processing of independent blocks
- Optional centering and scaling before SVD
- Multiple hierarchy levels for very large datasets (you can set q levels with k blocks per level)
7 The Pattern: Recognizing Block-Wise Opportunities
Looking across these examples, several patterns emerge for when and how algorithms can be adapted for block-wise processing:
7.1 1. Operations with Additive Decomposition
If an operation can be expressed as summing contributions from different data portions, it’s straightforward to implement block-wise:
- Examples: Means, sums, counts, sufficient statistics for simple models
- Strategy: Process each block, accumulate results, finalize
- Memory: One block plus small accumulator
- I/O: Single pass through data
7.2 2. Operations with Hierarchical Decomposition
If an operation can be computed hierarchically (computing on blocks, then combining block results), block-wise processing is feasible:
- Examples: QR decomposition, SVD, certain optimization algorithms
- Strategy: Local computation on blocks, merge results, possibly iterate
- Memory: One block plus moderately-sized merge structure
- I/O: Multiple passes, but still practical
7.3 3. Operations Requiring Global Information
Some operations need information about the entire dataset to process any part of it:
- Examples: Sorting, ranking, quantiles, certain graph algorithms
- Challenge: May require multiple passes or sophisticated algorithms
- Strategy: Often use approximation or sampling strategies
7.4 4. Operations That Don’t Decompose Well
A few operations are fundamentally difficult to decompose:
- Examples: Some machine learning algorithms with complex dependencies, certain nonlinear optimizations
- Options: Use approximations, reformulate the problem, or accept that in-memory computation is necessary
BigDataStatMeth focuses on category 1 and 2 operations, which cover most standard statistical analyses. For category 3, the package often provides approximate but practical solutions.
8 Practical Considerations
8.1 Block Size Selection
Choosing appropriate block sizes involves trade-offs:
Too small blocks:
- More overhead from reading many blocks
- More function calls and loop iterations
- Potentially less cache-friendly computations

Too large blocks:
- Risk of memory exhaustion
- Less opportunity for parallelization
- Longer time to first result
Rule of thumb: Blocks should use ~10-20% of available RAM, leaving room for intermediate results and OS operations.

BigDataStatMeth selects block sizes automatically using heuristics based on:
- Available RAM (conservative estimates)
- Data dimensions (balanced partitioning)
- Operation type (some operations need larger blocks for efficiency)
- Computational complexity (CPU-bound vs. I/O-bound)

Users can override the defaults when they have specific knowledge about their system, but the automatic selection works well for most cases.
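As a rough illustration of the rule of thumb above, you can sketch the arithmetic yourself; the helper name and the 15% target below are arbitrary choices for this example, not package defaults:

# Hypothetical helper: rows per block for a target fraction of RAM
suggest_block_rows <- function(n_cols, ram_gb, fraction = 0.15, bytes_per_value = 8) {
  budget_bytes <- ram_gb * 1024^3 * fraction
  max(1L, floor(budget_bytes / (bytes_per_value * n_cols)))
}

# 50,000 columns of doubles with 32 GB of RAM, targeting ~15% of memory per block
suggest_block_rows(n_cols = 50000, ram_gb = 32)   # roughly 12,900 rows per block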
8.2 Numerical Stability
Block-wise algorithms can affect numerical properties, though the impact varies by operation:
Potential issues:
- Accumulation of rounding errors across many blocks
- Loss of precision in hierarchical methods (e.g., hierarchical SVD is approximate)
- Order-dependent results for some operations
- Cancellation errors when subtracting nearly equal numbers
The package prioritizes practical numerical reliability over theoretical guarantees:
- ✅ Tested algorithms: Extensively verified against in-memory methods
- ✅ Established methods: Uses well-known algorithms from numerical linear algebra literature
- ✅ Double precision: All computations use 64-bit floating point
- ✅ Documented approximations: When algorithms are approximate (hierarchical SVD), this is clearly stated
What is NOT guaranteed:
- ❌ Bit-exact reproducibility across different hardware
- ❌ Formal error bounds for all operations
- ❌ Specific compensated algorithms (e.g., Kahan summation)
Practical implications:
For most statistical applications, the numerical precision is more than adequate:
- PCA/SVD components match in-memory results to many decimal places
- Regression coefficients are statistically indistinguishable
- Hypothesis test results are identical
If your application requires provable numerical bounds or arbitrary precision, BigDataStatMeth may not be suitable.
The package optimizes for the common case where statistical noise far exceeds numerical error - which describes most genomic and large-scale data analyses.
User control:
Some functions provide parameters to control accuracy:
- SVD/PCA: the k parameter controls how many components to compute (more components = better approximation to the full decomposition)
- Iterative methods: tolerance parameters control convergence criteria
Check function documentation for available accuracy controls.
8.3 Parallelization
Many block-wise operations are embarrassingly parallel - blocks can be processed independently. BigDataStatMeth leverages this through OpenMP parallelization at the C++ level.
flowchart TD
A[Matrix in HDF5<br/>divided into blocks] --> B[OpenMP Scheduler<br/>threads=4]
B --> C1[Thread 1:<br/>Block 1]
B --> C2[Thread 2:<br/>Block 2]
B --> C3[Thread 3:<br/>Block 3]
B --> C4[Thread 4:<br/>Block 4]
C1 --> D1[Process<br/>independently]
C2 --> D2[Process<br/>independently]
C3 --> D3[Process<br/>independently]
C4 --> D4[Process<br/>independently]
D1 --> E[Combine results<br/>sequentially]
D2 --> E
D3 --> E
D4 --> E
E --> F[Final result]
style A fill:#f0f8ff
style B fill:#fff8e1
style C1 fill:#e8f6e8
style C2 fill:#e8f6e8
style C3 fill:#e8f6e8
style C4 fill:#e8f6e8
style F fill:#e8f6e8
OpenMP (Open Multi-Processing) is a parallel programming API for C++ that BigDataStatMeth uses internally. When you specify threads = 4 in a function like bdSVD_hdf5(), the C++ code distributes independent block operations across 4 CPU cores.
Key characteristics:
- Shared memory: All threads access the same HDF5 file, but read different blocks
- Thread-safe I/O: BigDataStatMeth ensures HDF5 reads don’t conflict between threads
- Load balancing: OpenMP automatically distributes blocks to available threads
- Overhead vs. benefit: Parallelism only helps when block processing time exceeds coordination overhead
When parallelization helps:
- Large blocks that take seconds to process each
- CPU-intensive operations (matrix multiplication, decompositions)
- Many blocks to process (good load distribution)
When it doesn’t help:
- Very small blocks (overhead dominates)
- I/O-bound operations (disk is the bottleneck, not CPU)
- Few blocks (can’t keep all cores busy)
BigDataStatMeth automatically uses parallelization where beneficial. Users control thread count via the threads parameter.
// Algorithm: ParallelBlockProcessing
// Input: HDF5 dataset, n_blocks, n_threads
// Output: Combined results from all blocks
// 1. Initialize
results <- array to store each block's output
// 2. Parallel processing (OpenMP)
#pragma omp parallel for num_threads(n_threads)
for (int i = 0; i < n_blocks; i++) {
// Each thread processes its assigned blocks
int thread_id = omp_get_thread_num();
// Thread-safe read
block_i = read_hdf5_block(i);
// Independent computation
result_i = process_block(block_i);
// Store result
results[i] = result_i;
// Free memory
free(block_i);
}
// 3. Combine results (sequential)
final_result = combine_function(results);
// 4. Return
return final_result;

Match threads to your CPU cores but leave 1-2 cores free for the system. For a 12-core machine, threads = 10 is often optimal. Use threads = 1 for debugging to simplify error tracking.
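In R, a conservative way to pick that number is to query the machine and leave headroom; the commented call sketch reuses the threads and k arguments described above, but check each function's documentation for the exact signature:

# Leave a couple of cores free for the OS; parallel ships with base R
n_threads <- max(1L, parallel::detectCores() - 2L)

# e.g. bdSVD_hdf5(filename, dataset, k = 50, threads = n_threads)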
9 Implementing Your Own Block-Wise Methods
One of BigDataStatMeth’s design goals is enabling users to implement new statistical methods using block-wise processing. The package provides different levels of abstraction:
9.1 Architecture Overview
BigDataStatMeth uses a three-layer architecture:
1. Internal C++ Layer (Accessible from C++, not from R):
The lowest level consists of efficient C++ classes and functions that handle direct HDF5 file operations:
- hdf5Dataset class - Wraps the HDF5 C API for file access and provides block reading/writing
- Efficient memory management and buffer handling
- Direct HDF5 C API access for maximum performance
- OpenMP-parallelized block iteration
Important: These internal components are NOT directly exposed to R users - you can’t call them from an R script. However, they ARE fully accessible if you’re developing new methods in C++ using BigDataStatMeth as a header-only library. If you’re implementing a new statistical method in C++, you have full access to these low-level building blocks.
Each high-level algorithm (PCA, regression, etc.) uses these internal functions and manages its own block iteration strategy for optimal performance. The block-reading logic is algorithm-specific rather than exposed as general-purpose functions because different algorithms benefit from different iteration patterns.
2. Mid-Level Functions (C++ with R bindings):
Block-wise operations implemented in C++ but accessible from both R and C++:
- bdCrossprod_hdf5() / bdtCrossprod_hdf5() - Crossproduct operations
- bdblockmult_hdf5() - Matrix multiplication
- bdNormalize_hdf5() - Centering and scaling
- bdApply_hdf5() - Apply functions to blocks
These functions handle the complexity of block iteration, parallel processing, and result accumulation internally. From R, you simply call them with appropriate parameters.
3. High-Level Statistical Methods (C++ with R bindings):
Complete statistical analyses implemented using the mid-level functions:
- bdSVD_hdf5() - Singular Value Decomposition
- bdPCA_hdf5() - Principal Component Analysis
- bdQR_hdf5() - QR Decomposition
- Regression methods, association tests, etc.
9.2 Developing New Methods
Approach 1: Compose from existing functions (Recommended for R users)
Use existing mid-level and high-level functions as building blocks. Example: implementing Canonical Correlation Analysis (CCA) in R:
# CCA implementation using BigDataStatMeth functions
bdCCA_hdf5 <- function(filename, X_dataset, Y_dataset, k = 10) {
# Step 1: Normalize both datasets
bdNormalize_hdf5(filename, X_dataset, bcenter = TRUE, bscale = TRUE)
bdNormalize_hdf5(filename, Y_dataset, bcenter = TRUE, bscale = TRUE)
# Step 2: Compute QR decomposition for each
qr_x <- bdQR_hdf5(filename, X_dataset)
qr_y <- bdQR_hdf5(filename, Y_dataset)
# Step 3: Compute cross-product of Q matrices
cross <- bdCrossprod_hdf5(
filename,
A = qr_x$Q_dataset,
B = qr_y$Q_dataset
)
# Step 4: SVD of cross-product
svd_result <- bdSVD_hdf5(filename, cross$dataset, k = k)
# Step 5: Back-transform to get canonical variates
# ... (using bdblockmult_hdf5 to multiply through)
return(list(correlations = svd_result$d, ...))
}

This approach leverages existing optimized functions without writing C++ code.
Approach 2: Implement in C++ (For developers)
For maximum performance or novel algorithms, implement directly in C++. You’ll work with:
- hdf5Dataset class for file access
- Eigen library for linear algebra
- OpenMP for parallelization
See the CCA implementation examples in the package source (Example_bdCCA.cpp, Example_bdCCA.h) which show both R-based and C++-based approaches to the same algorithm.
Key consideration: The C++ approach requires understanding:
- Memory management in C++
- HDF5 C API through the hdf5Dataset wrapper
- Eigen matrix operations
- Thread safety for parallel operations
Most users will find composing from existing functions sufficient. The C++ API is there for those developing new core algorithms or requiring maximum performance.
9.3 Learning from Examples
The package includes several complete examples showing different implementation strategies:
- Example_bdCCA_hdf5_R.R - CCA implemented purely in R using BigDataStatMeth functions
- Example_bdCCA.cpp / .h - CCA implemented in C++ for comparison
- Example_getQRbyBlocks.R - Hierarchical QR showing block partitioning strategies
These examples demonstrate how to think through:
1. What can be computed block-by-block?
2. What intermediate results need to be merged?
3. What's the memory footprint at each step?
4. When to write intermediate results to HDF5 vs. keep in memory?
10 Interactive Exercise
10.1 Practice: Analyzing Block-Wise Efficiency
Understanding which operations work well with block-wise processing and which present challenges helps you design efficient analysis pipelines. This exercise develops that intuition.
# Consider this analysis workflow
analyze_workflow <- function() {
# Operation 1: Compute column means
means <- bdColMeans_hdf5(file, dataset)
# Operation 2: Center the data
centered <- bdNormalize_hdf5(file, dataset, bcenter = TRUE, bscale = FALSE)
# Operation 3: Compute t(X) %*% X
crossprod <- bdCrossprod_hdf5(file, centered)
# Operation 4: Eigen-decomposition for PCA
pca <- bdSVD_hdf5(file, crossprod)
# Question: Which operations are most/least block-amenable?
}

Think through these scenarios - understanding the principles matters more than having "correct" answers:
1. Operation Efficiency Analysis:
- Which operation above is most naturally block-wise? Why?
- Which operation requires the most inter-block coordination?
- If you had to do this with limited memory, which step would be the bottleneck?

2. Block Size Decisions:
- For a 100,000 × 50,000 matrix with 32 GB RAM available, what block size would you choose?
- What if you had 128 GB RAM? Would you still use blocks?
- How would block size affect the accuracy of the results?

3. Algorithm Adaptation:
- Consider k-means clustering. Can you decompose it block-wise?
- What about hierarchical clustering?
- Which machine learning algorithms would be easy vs. hard to adapt?

4. Designing Your Pipeline:
- Your analysis needs: QC filtering, normalization, PCA, then regression
- Which steps can process data block-by-block independently?
- Which steps need access to summary statistics from all data?
- Where would you store intermediate results?

5. Trade-offs:
- Smaller blocks = less memory but more disk I/O
- Larger blocks = more memory but fewer I/O operations
- For your system, where's the sweet spot?
Try sketching out a block-wise version of your actual analysis pipeline. Which parts translate easily? Which require creative rethinking? This mental exercise builds intuition for designing efficient big data workflows.
11 Key Takeaways
Let’s consolidate what you’ve learned about adapting algorithms for block-wise processing with disk-based data.
11.1 Essential Concepts
The divide-process-combine paradigm underlies all block-wise computing. You partition data into manageable pieces, process each piece (often independently), then combine results to get the final answer. This simple pattern adapts to operations from simple means to complex matrix factorizations. The challenge lies in figuring out what to process and how to combine results validly.
Block-amenable operations share common mathematical properties that make them decompose naturally. Operations that can be expressed as sums, products, or aggregations across independent portions of data work well. Computing means, sums, element-wise operations, and many matrix products fall into this category. The mathematical structure of these operations allows valid decomposition.
Block-resistant operations require information flow between all parts of the data or produce outputs that scale problematically. Anything requiring global sorting, finding specific quantiles, or creating O(n²) outputs challenges block-wise approaches. These operations don’t decompose cleanly - they need special algorithmic tricks or fundamentally different approaches.
Block size creates a memory-accuracy-speed trade-off with no universally optimal choice. Smaller blocks use less memory but require more disk I/O and potentially more approximation error in hierarchical algorithms. Larger blocks use more memory but require fewer I/O operations and may be more accurate. The optimal choice depends on your available RAM, disk speed, and accuracy requirements.
Hierarchical processing enables algorithms that don’t naturally decompose to single-pass block operations. By creating multiple levels of computation - process blocks, merge results into smaller intermediate blocks, repeat until small enough - you can handle very large matrices while controlling memory usage at each level. This is how BigDataStatMeth implements complex operations like SVD/PCA.
Composability is power. You can build complex analyses by composing simple block-wise operations. Rather than implementing every possible analysis from scratch in C++, combine existing functions: normalize, compute crossproduct, perform SVD, multiply back through. This compositional approach lets you implement new methods at the R level, leveraging optimized core operations.
11.2 When Block-Wise Processing Works Well
Understanding which problems benefit from block-wise approaches helps you make good architectural decisions for your analyses.
✅ Highly effective for:
Matrix multiplication and products - These decompose naturally into block-block operations. BigDataStatMeth’s implementations are highly optimized and scale well.
Element-wise operations - Operations like adding, subtracting, multiplying by scalars process each element independently. Perfect for block-wise processing with no communication overhead.
Row or column operations - Computing means, sums, normalizations by rows or columns work naturally with blocks containing complete rows or columns.
Many factorizations - SVD, QR, and Cholesky decompositions have hierarchical block-wise algorithms that maintain numerical accuracy while limiting memory usage.
❌ Challenging for:
Global operations - Finding the median, computing percentiles, or operations that require knowing the full data distribution don’t decompose easily.
Fine-grained dependencies - Algorithms where each element depends on many others (like some iterative optimization methods) resist block decomposition.
Operations creating large outputs - Computing all pairwise distances or correlations creates O(n²) results. Even block-wise approaches struggle when output size explodes.
Highly iterative methods - Algorithms requiring many passes through data with convergence checks add overhead when each pass means disk I/O.
For most statistical analyses in genomics and similar fields - PCA, regression, association tests, basic machine learning - block-wise approaches work very well. The operations these methods need decompose naturally. More specialized algorithms may require case-by-case evaluation of whether block-wise processing provides benefits.
12 Next Steps
You now understand how standard statistical algorithms adapt for disk-based data processing. This conceptual foundation prepares you to:
- Understand Linear Algebra Foundations → Review the mathematical operations BigDataStatMeth implements
- Getting Started Tutorial → Apply these concepts to your own data
- CCA Implementation Example → See a complete complex method implemented block-wise
The BigDataStatMeth paper provides mathematical details and proofs for the hierarchical algorithms. The C++ API documentation shows lower-level implementation details if you want to understand exactly how operations are executed.