Block-Wise Computing

Adapting Statistical Algorithms for Disk-Based Data

0.1 What You’ll Learn

By the end of this section, you will:

  • Understand the divide-process-combine paradigm of block-wise algorithms
  • Know which operations decompose naturally to block-wise processing
  • Recognize which algorithms are challenging to adapt for blocks
  • Understand memory-computation trade-offs with block size
  • See concrete examples of block-wise matrix operations
  • Know how to compose new methods from existing BigDataStatMeth functions

1 From In-Memory to Block-Wise: A Paradigm Shift

You now understand why large datasets don’t fit in RAM and how HDF5 enables efficient access to disk-stored data. But knowing you can read arbitrary portions of a matrix from disk doesn’t automatically tell you how to perform statistical analyses on that data. The challenge is algorithmic: how do you adapt methods designed for in-memory matrices to work with data that must be processed in pieces?

This is not merely an implementation detail - it requires rethinking how algorithms work. A standard PCA implementation might call svd(X) where X is a complete matrix in memory. That single function call encapsulates a sophisticated numerical algorithm, but it fundamentally assumes it can access any element of X at any time. When X lives on disk, this assumption breaks. You must decompose the algorithm into steps that process manageable blocks of data, with intermediate results that fit in memory.

The mathematical operations remain the same - we’re still computing eigenvalues, performing matrix multiplications, or fitting regression models. But the computational strategy changes fundamentally. This section explains how standard statistical methods are adapted for block-wise processing, using examples from BigDataStatMeth’s implementation.

2 The Core Concept: Divide, Process, Combine

Block-wise computing follows a conceptual pattern that applies across different statistical methods:

  1. Divide: Partition the data matrix into blocks that fit comfortably in memory
  2. Process: Perform computations on each block independently (when possible)
  3. Combine: Merge the block-level results to obtain the final answer

The art lies in steps 2 and 3 - figuring out which computations can be done on blocks independently, what information must be passed between blocks, and how to combine results validly. Not all algorithms decompose equally well, and some require sophisticated mathematical reformulations to work in a block-wise framework.

Let’s start with simple examples and build toward more complex methods.

3 Example 1: Computing Means and Standard Deviations

Computing column means illustrates the simplest case of block-wise processing. Suppose you have a matrix with 100,000 rows and 50,000 columns that’s too large for memory, but you can comfortably work with blocks of 10,000 rows at a time.

3.1 The In-Memory Approach

# Traditional approach - assumes X fits in memory
column_means <- colMeans(X)

This single line hides significant computation: for each column, R reads all 100,000 values, sums them, and divides by 100,000.

3.2 The Block-Wise Approach

For block-wise computation, we leverage a mathematical property: the mean of a set of values equals the weighted average of the means of subsets, where weights are the subset sizes.
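
The property is easy to verify with a small in-memory example in plain R (purely illustrative; this is not how BigDataStatMeth computes means):

# Overall column means equal the size-weighted average of per-block column means
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
blocks <- split(seq_len(nrow(X)), ceiling(seq_len(nrow(X)) / 30))  # 4 row blocks

block_means <- lapply(blocks, function(idx) colMeans(X[idx, , drop = FALSE]))
block_sizes <- vapply(blocks, length, integer(1))

weighted <- Reduce(`+`, Map(`*`, block_means, block_sizes)) / nrow(X)
all.equal(weighted, colMeans(X))  # TRUE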

flowchart TD
    A[Matrix on Disk<br/>100,000 × 50,000] --> B[Initialize<br/>column_sums = 0]
    B --> C{More blocks?}
    C -->|Yes| D[Read Block i<br/>10,000 rows]
    D --> E[Compute<br/>colSums for block]
    E --> F[Accumulate:<br/>column_sums += block_sums]
    F --> C
    C -->|No| G[Finalize:<br/>means = column_sums / n_rows]
    
    style A fill:#f0f8ff
    style G fill:#e8f6e8
    style D fill:#fff8e1
    style E fill:#fff8e1
    style F fill:#fff8e1

Step 1: Initialize accumulators

We start by creating a vector to accumulate the sum of each column across all blocks. We also query the HDF5 file metadata to know how many rows exist in total and calculate how many blocks we’ll need to process.

Step 2: Process blocks sequentially

For each block:

  1. Calculate boundaries: Determine which rows belong to this block. Most blocks have block_size rows, but the last block might be smaller.

  2. Read from disk: Load only this block’s rows into RAM. This is the key to memory efficiency - we never hold the entire matrix.

  3. Compute contribution: Sum the values in each column of this block. This gives us a vector of length n_cols.

  4. Accumulate: Add this block’s column sums to our running total. This works because addition is associative and commutative.

  5. Free memory: Explicitly release the block’s memory before reading the next one. This ensures peak memory usage is just one block, not all blocks.

Step 3: Calculate final means

After processing all blocks, we have the total sum for each column. Divide by the total number of rows to get the mean.

Why this works mathematically:

\text{mean}(X) = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum_{j=1}^{k} \sum_{i \in \text{block}_j} x_i}{n}

The inner sums are what we compute for each block, and we accumulate them in the outer sum.

# Algorithm: BlockwiseMeans
# Input: HDF5 dataset with n_rows × n_cols, block_size
# Output: column_means vector of length n_cols

# 1. Initialize
column_sums <- vector of zeros (length n_cols)
n_rows_total <- get_total_rows_from_hdf5()
n_blocks <- ceiling(n_rows_total / block_size)

# 2. Process each block
for (i in 1:n_blocks) {
  # Define block boundaries
  start_row <- (i-1) * block_size + 1
  end_row <- min(i * block_size, n_rows_total)
  
  # Read block from disk
  block <- read_hdf5_rows(start_row, end_row)
  
  # Compute block contribution
  block_sums <- compute_column_sums(block)
  
  # Accumulate results
  column_sums <- column_sums + block_sums
  
  # Free memory
  free(block)
}

# 3. Finalize computation
column_means <- column_sums / n_rows_total

# 4. Return result
return(column_means)
Note: Memory Efficiency

Peak memory is one block (10,000 × 50,000 doubles ≈ 4 GB) plus the accumulator vector (50,000 values ≈ 0.4 MB).

Total: ~4 GB instead of 40 GB for the full matrix - a 10× reduction, and the peak is set by the block size you choose rather than by the size of the matrix.

Key insight: For operations that can be expressed as combining independent contributions from blocks (sums, counts), block-wise processing is straightforward. The mathematical property that makes this work is additive decomposition: the sum over all data equals the sum of sums over blocks.

BigDataStatMeth implementation: This is handled internally by C++ functions like get_HDF5_mean_sd_by_column() (or by_row variant) which manage block reading, computation, and accumulation efficiently using OpenMP parallelization when multiple cores are available.

3.3 Why Standard Deviation Is Slightly More Complex

Standard deviation requires both means and squared deviations from means. You might think you could compute it in one pass through blocks, but there’s a subtle issue: you need the overall mean to compute deviations, but you won’t know the overall mean until you’ve seen all blocks.

Important: Two-Pass Algorithm Required

Computing standard deviation requires two complete passes through the data:

  1. Pass 1: Compute overall means
  2. Pass 2: Compute squared deviations using those means

This doubles I/O but keeps memory constant.

Two passes through the data:

flowchart TD
    A[Dataset on Disk] --> B[Pass 1:<br/>Compute Means]
    B --> C[Store means<br/>in memory]
    C --> D[Pass 2:<br/>Start over]
    D --> E[Read Block i]
    E --> F[Center:<br/>block - means]
    F --> G[Square:<br/>centered²]
    G --> H[Accumulate<br/>squared_devs]
    H --> I{More blocks?}
    I -->|Yes| E
    I -->|No| J[Finalize:<br/>sqrt of sum over n-1]
    
    style A fill:#f0f8ff
    style B fill:#fff8e1
    style C fill:#ffe8e8
    style J fill:#e8f6e8

Why two passes are necessary:

To compute standard deviation, we need: \text{SD} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}

The problem: we need \bar{x} (the mean) to compute each (x_i - \bar{x})^2 term. But we can’t know the overall mean until we’ve processed all blocks.

Pass 1: Compute means (same as before)

Use the block-wise mean algorithm to get the global mean for each column.

Pass 2: Compute squared deviations

Now that we know the means, we:

  1. Read each block again: Yes, this means reading the entire dataset twice. For very large datasets on slow storage, this I/O cost matters.

  2. Center the block: Subtract the global mean from each column. This “centering” operation broadcasts the mean vector across all rows of the block.

  3. Square element-wise: Each centered value is squared: (x - \bar{x})^2

  4. Accumulate squared deviations: Sum these squared values, column by column, across all blocks.

  5. Finalize: After processing all blocks, divide by n-1 (Bessel’s correction for sample variance) and take the square root.

Alternative: One-pass algorithm

There exist one-pass algorithms (e.g., Welford's algorithm) that compute the mean and variance simultaneously and avoid the catastrophic cancellation of the naive sum-of-squares shortcut. However, they are slightly more complex to implement, and the classic two-pass approach is generally at least as accurate. BigDataStatMeth opts for the two-pass approach for clarity and numerical reliability in typical statistical applications.
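
For reference, a minimal in-memory R sketch of Welford's updates for a single variable (an illustration of the one-pass alternative, not what BigDataStatMeth does internally):

welford <- function(x) {
  n <- 0; m <- 0; M2 <- 0
  for (xi in x) {
    n  <- n + 1
    d  <- xi - m
    m  <- m + d / n              # running mean
    M2 <- M2 + d * (xi - m)      # running sum of squared deviations (uses the updated mean)
  }
  list(mean = m, var = M2 / (n - 1))
}

x <- rnorm(1e5, mean = 1e6, sd = 0.01)   # large mean, tiny variance
unlist(welford(x))
c(mean(x), var(x))                        # agrees closely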

# Algorithm: BlockwiseStandardDeviation
# Input: HDF5 dataset with n_rows × n_cols, block_size
# Output: column_sds vector of length n_cols

# Pass 1: Compute means
column_means <- BlockwiseMeans(dataset, block_size)

# Pass 2: Compute squared deviations
squared_devs <- vector of zeros (length n_cols)
n_rows_total <- get_total_rows_from_hdf5()
n_blocks <- ceiling(n_rows_total / block_size)

for (i in 1:n_blocks) {
  # Define block boundaries (as in Pass 1)
  start_row <- (i-1) * block_size + 1
  end_row <- min(i * block_size, n_rows_total)
  
  # Read block from disk
  block <- read_hdf5_rows(start_row, end_row)
  
  # Center the block (subtract column means)
  centered_block <- block - column_means  # Broadcast subtraction
  
  # Square element-wise
  squared_block <- centered_block^2
  
  # Sum squared deviations for this block
  block_sq_devs <- compute_column_sums(squared_block)
  
  # Accumulate
  squared_devs <- squared_devs + block_sq_devs
  
  # Free memory
  free(block)
}

# Finalize: compute standard deviation
column_sds <- sqrt(squared_devs / (n_rows_total - 1))

return(column_sds)

I/O cost: This two-pass approach doubles the I/O (reading data twice), but keeps memory requirements constant regardless of data size.

BigDataStatMeth implementation: The package provides two ways to compute these statistics:

  1. bdNormalize_hdf5() - Performs centering and/or scaling in-place, automatically handling the multi-pass computation. Results are written back to the HDF5 file.

  2. bdgetSDandMean_hdf5() - Computes means and standard deviations and returns them either in memory (onmemory = TRUE) or stores them in the HDF5 file (onmemory = FALSE) for later use. This is useful when you need the statistics themselves, not just normalized data.

Both functions use the C++ internal implementation get_HDF5_mean_sd_by_column() which handles block iteration, parallel computation with OpenMP, and numerically stable accumulation.
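
A hedged usage sketch (the file and dataset paths are placeholders, and the argument order and names should be checked against the package documentation):

library(BigDataStatMeth)

# Column means and standard deviations returned to the R session
stats <- bdgetSDandMean_hdf5("study.hdf5", "data/X", onmemory = TRUE)

# Or: center and scale the dataset in place inside the HDF5 file
bdNormalize_hdf5("study.hdf5", "data/X", bcenter = TRUE, bscale = TRUE)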

4 Example 2: Matrix Multiplication by Blocks

Matrix multiplication is fundamental to many statistical methods - it appears in linear regression, principal component analysis, and countless other algorithms. It also illustrates a more complex block-wise pattern because the result depends on interactions between different parts of both input matrices.

4.1 The Mathematical Foundation

Recall that matrix multiplication C = A \times B where A is m \times k and B is k \times n, produces:

C_{ij} = \sum_{p=1}^{k} A_{ip} \times B_{pj}

Each element of C is a dot product of a row from A and a column from B. The key insight for block-wise processing is that this operation can be decomposed spatially - we can compute different regions of C independently.

4.2 Block-Wise Strategy

If we partition A into r row blocks and B into c column blocks:

A = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_r \end{bmatrix}, \quad B = \begin{bmatrix} B_1 & B_2 & \cdots & B_c \end{bmatrix}

Then each block of the result, C_{ij} = A_i \times B_j, can be computed independently. More usefully, if we partition along the shared dimension (the k dimension):

A = \begin{bmatrix} A_1 & A_2 & \cdots & A_p \end{bmatrix}, \quad B = \begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_p \end{bmatrix}

Then: C = \sum_{i=1}^{p} A_i \times B_i

This decomposition means we can:

  1. Read one block pair (A_i, B_i) at a time from disk
  2. Compute its contribution to C
  3. Accumulate the result
  4. Never hold more than two blocks and the accumulator in memory
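
The identity is easy to check in plain R on a small in-memory example (illustrative only; BigDataStatMeth performs this on HDF5-backed blocks in C++):

# Verify C = sum_i A_i %*% B_i when A is split into column blocks and
# B into matching row blocks along the shared dimension k
set.seed(4)
m <- 6; k <- 10; n <- 4; block_size <- 3
A <- matrix(rnorm(m * k), m, k)
B <- matrix(rnorm(k * n), k, n)

C <- matrix(0, m, n)
for (start in seq(1, k, by = block_size)) {
  idx <- start:min(start + block_size - 1, k)
  C <- C + A[, idx, drop = FALSE] %*% B[idx, , drop = FALSE]
}
all.equal(C, A %*% B)  # TRUE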

4.3 Practical Implementation

flowchart TD
    A["Matrix A m × k<br/>on disk"] --> B["Partition along<br/>k dimension"]
    C["Matrix B k × n<br/>on disk"] --> B
    B --> D["Block pairs:<br/>A₁ B₁, A₂ B₂, ..."]
    D --> E{More blocks?}
    E -->|Yes| F["Read block pair<br/>Aᵢ Bᵢ"]
    F --> G["Compute<br/>Aᵢ × Bᵢ"]
    G --> H["Accumulate:<br/>C += Aᵢ × Bᵢ"]
    H --> E
    E -->|No| I["Result matrix<br/>C m × n"]
    
    style A fill:#f0f8ff
    style C fill:#f0f8ff
    style I fill:#e8f6e8
    style F fill:#fff8e1
    style G fill:#fff8e1
    style H fill:#fff8e1

Key insight: Matrix multiplication can be decomposed along the shared dimension (the k in A_{m \times k} \times B_{k \times n}).

Step 1: Initialize

Create a zero matrix C of dimensions m \times n to accumulate results. Calculate how many blocks we’ll need along the k dimension.

Step 2: Process each block pair

For each block i:

  1. Read from A: Extract columns start:end of matrix A. This gives us a “tall thin” matrix: m rows (all of them) by block_width columns (just this block’s slice).

  2. Read from B: Extract rows start:end of matrix B. This gives us a “short wide” matrix: block_width rows (matching A’s columns) by n columns (all of them).

  3. Multiply: Compute the standard matrix product of these two small pieces. The result is m \times n - the same size as our final answer.

  4. Accumulate: Add this contribution to C. This works because: C = A \times B = \sum_{i=1}^{blocks} A_i \times B_i where A_i and B_i are the block slices.

Step 3: Handle result

If C fits in memory, return it directly. If not (which can happen when m and n are both large), write it to HDF5 incrementally and return a reference.

Memory analysis:

Peak memory = A_{block} + B_{block} + C = (m \times b) + (b \times n) + (m \times n) values

For large k but moderate m and n, this is much less than holding full A and B which would require (m \times k) + (k \times n) values.

# Algorithm: BlockwiseMatrixMultiplication
# Input: Matrix A m × k in HDF5, Matrix B k × n in HDF5, block_size
# Output: Matrix C = A × B m × n

# Initialize result matrix
C <- zero matrix m × n
n_blocks <- ceiling(k / block_size)

# Process blocks along shared dimension
for (i in 1:n_blocks) {
  # Determine block range along k dimension
  start_idx <- (i-1) * block_size + 1
  end_idx <- min(i * block_size, k)
  block_width <- end_idx - start_idx + 1
  
  # Read block from A (columns start_idx:end_idx)
  A_block <- read_hdf5_columns(A, start_idx:end_idx)  # m × block_width
  
  # Read corresponding block from B (rows start_idx:end_idx)
  B_block <- read_hdf5_rows(B, start_idx:end_idx)     # block_width × n
  
  # Compute this block's contribution to C
  C_contribution <- A_block %*% B_block  # Standard matrix mult
  
  # Accumulate into result
  C <- C + C_contribution
  
  # Free memory
  free(A_block, B_block)
}

# Write result (if too large for memory, write directly to HDF5)
if (C fits in memory) {
  return(C)
} else {
  write_to_hdf5(C)
  return(HDF5_reference)
}
Tip: Spatial Decomposition

Matrix multiplication is decomposed spatially along the shared dimension k. Each block pair contributes independently to the final result - this is different from the temporal decomposition used for means where blocks contribute sequentially.

BigDataStatMeth implementation: The bdblockmult_hdf5() function implements this strategy with additional optimizations:

  • Automatic block size selection based on available memory
  • Writing results directly to HDF5 to handle cases where even the result matrix is too large
  • Parallel processing of independent block multiplications using OpenMP when multiple cores are available
  • Careful memory management to avoid unnecessary copies
  • All computation done in C++ for maximum efficiency
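
A hedged usage sketch (the dataset paths are placeholders, and the argument names shown here - A, B, threads - are assumptions to verify against the bdblockmult_hdf5() documentation):

# Multiply two HDF5-stored matrices block-wise; names and paths are illustrative
res <- bdblockmult_hdf5("study.hdf5", A = "data/X", B = "data/Y", threads = 4)
# As noted above, the product can be written back to the HDF5 file rather than
# returned in memory when it is too large.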

5 Example 3: The QR Decomposition Challenge

The QR decomposition is more challenging because it requires maintaining orthogonality across all columns, which creates dependencies between blocks. This illustrates how not all algorithms decompose trivially.

5.1 Why QR Matters

The QR decomposition factors a matrix A (dimensions m \times n) into: A = Q \times R

where Q (m \times n) has orthonormal columns (Q^T Q = I) and R is upper triangular (n \times n). This decomposition is fundamental to:

  • Solving least squares problems (linear regression)
  • Computing principal components
  • Many iterative algorithms that need orthonormal bases

5.2 The Block-Wise Strategy: Hierarchical QR

The key insight is that QR decomposition can be computed hierarchically through a series of smaller QR decompositions.

flowchart TD
    A["Matrix A<br/>m × n<br/>Too large for RAM"] --> B["Partition into<br/>k row blocks"]
    B --> C1["Block A₁<br/>m₁ × n"]
    B --> C2["Block A₂<br/>m₂ × n"]
    B --> C3["Block Aₖ<br/>mₖ × n"]
    
    C1 --> D1["Local QR:<br/>A₁ = Q₁R₁"]
    C2 --> D2["Local QR:<br/>A₂ = Q₂R₂"]
    C3 --> D3["Local QR:<br/>Aₖ = QₖRₖ"]
    
    D1 --> E["Stack R matrices:<br/>[R₁; R₂; ...; Rₖ]<br/>kn × n<br/>Fits in RAM!"]
    D2 --> E
    D3 --> E
    
    E --> F["Final QR:<br/>Rstack = QR × Rfinal"]
    
    F --> G["Reconstruct Q:<br/>Multiply Q_i by QR blocks"]
    
    G --> H["Result:<br/>A = Q × Rfinal"]
    
    style A fill:#ffe8e8
    style E fill:#fff8e1
    style H fill:#e8f6e8
    style D1 fill:#f0f8ff
    style D2 fill:#f0f8ff
    style D3 fill:#f0f8ff

Step 1: Partition

Divide the matrix A into blocks along rows: A = [A_1; A_2; ...; A_k]

Step 2: Local QR

Compute QR for each block independently: A_i = Q_i R_i

Each block’s QR decomposition can be computed separately because they operate on independent rows.

Step 3: Collect R matrices

Stack all the upper triangular R_i matrices vertically: R_{stack} = [R_1; R_2; ...; R_k]

Critical insight: R_{stack} is much smaller than A. It’s kn \times n instead of m \times n, and typically kn \ll m, so this stacked matrix fits in memory even when A doesn’t!

Step 4: Final QR

Compute QR of the stacked R matrices (this fits in RAM): R_{stack} = Q_R \times R_{final}

Step 5: Reconstruct Q

The overall Q is obtained by multiplying each local Q_i by the corresponding block of Q_R. This reconstruction can be done block by block, again staying within memory constraints.

Why this works mathematically:

A = \begin{bmatrix} Q_1 R_1 \\ Q_2 R_2 \\ \vdots \\ Q_k R_k \end{bmatrix} = \begin{bmatrix} Q_1 & & \\ & \ddots & \\ & & Q_k \end{bmatrix} \begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_k \end{bmatrix}

Since each Q_i has orthonormal columns, so does the block-diagonal matrix \operatorname{diag}(Q_1, \ldots, Q_k). Applying a second QR to the stacked R matrices, R_{stack} = Q_R R_{final}, then gives A = \left(\operatorname{diag}(Q_1, \ldots, Q_k)\, Q_R\right) R_{final} - the final factorization.
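
Before the disk-based pseudocode, here is a compact in-memory R sketch of the same construction using base qr(); it illustrates the idea and is not the package's implementation:

set.seed(2)
m <- 600; n <- 5; k <- 3                     # 3 row blocks of 200 rows each
A <- matrix(rnorm(m * n), m, n)
blocks <- split(seq_len(m), rep(seq_len(k), each = m / k))

# Local QR of each row block
local_qr <- lapply(blocks, function(idx) {
  qr_i <- qr(A[idx, , drop = FALSE])
  list(Q = qr.Q(qr_i), R = qr.R(qr_i))       # Q_i: 200 x n, R_i: n x n
})

# Stack the small R factors and take a second QR
R_stack <- do.call(rbind, lapply(local_qr, `[[`, "R"))   # (k*n) x n
qr2     <- qr(R_stack)
Q_R     <- qr.Q(qr2)
R_final <- qr.R(qr2)

# Reconstruct the global Q block by block
Q <- do.call(rbind, lapply(seq_len(k), function(i) {
  rows <- ((i - 1) * n + 1):(i * n)
  local_qr[[i]]$Q %*% Q_R[rows, , drop = FALSE]
}))

all.equal(Q %*% R_final, A)          # TRUE (up to numerical error)
max(abs(crossprod(Q) - diag(n)))     # ~ 0: Q has orthonormal columns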

# Algorithm: HierarchicalQR
# Input: Matrix A m × n in HDF5, block_size
# Output: Q matrix m × n, R matrix n × n

# Step 1: Partition and compute local QRs
n_blocks <- ceiling(m / block_size)
R_list <- list()  # Store R matrices

for (i in 1:n_blocks) {
  # Define block boundaries
  start_row <- (i-1) * block_size + 1
  end_row <- min(i * block_size, m)
  
  # Read block from HDF5
  A_block <- read_hdf5_rows(start_row, end_row)
  
  # Compute local QR
  qr_result <- qr_decomposition(A_block)
  Q_local <- qr_result$Q
  R_local <- qr_result$R
  
  # Store Q to HDF5 for later
  write_hdf5(Q_local, group = "local_Q", dataset = paste0("Q_", i))
  
  # Keep R in memory (small)
  R_list[[i]] <- R_local
  
  free(A_block, Q_local)
}

# Step 2: Stack R matrices (fits in RAM)
R_stacked <- do.call(rbind, R_list)  # (k*n) × n matrix

# Step 3: Final QR on stacked Rs
qr_final <- qr_decomposition(R_stacked)
Q_R <- qr_final$Q
R_final <- qr_final$R

# Step 4: Reconstruct global Q (block by block)
for (i in 1:n_blocks) {
  # Read local Q_i
  Q_local <- read_hdf5(group = "local_Q", dataset = paste0("Q_", i))
  
  # Multiply by corresponding Q_R block
  start_idx <- (i-1) * n + 1
  end_idx <- i * n
  Q_R_block <- Q_R[start_idx:end_idx, ]
  
  Q_global_block <- Q_local %*% Q_R_block
  
  # Write back to HDF5
  write_hdf5(Q_global_block, group = "final_Q", dataset = paste0("Q_", i))
}

return(list(Q = "final_Q", R = R_final))
Tip: The Magic of Hierarchical QR

The stacked R matrices are only (k \times n) \times n in size. Even with 10 blocks, if n=1000, that’s just a 10{,}000 \times 1000 matrix - about 80 MB. This fits comfortably in RAM even when the original matrix is hundreds of GB!

BigDataStatMeth implementation: The package implements hierarchical QR in bdQR_hdf5(), handling the block partitioning, intermediate storage, and final assembly automatically. This enables methods like PCA on out-of-memory data.

6 Example 4: Block-Wise SVD for Principal Component Analysis

Singular Value Decomposition (SVD) underlies principal component analysis and is one of the most important matrix decompositions in statistics. Computing SVD on matrices too large for memory is a significant challenge that demonstrates the full power of block-wise thinking.

6.1 The Mathematical Goal

For a matrix X (dimensions n \times p), we want: X = U \Sigma V^T

where:

  • U is n \times r with orthonormal columns (left singular vectors)
  • \Sigma is r \times r diagonal (singular values)
  • V is p \times r with orthonormal columns (right singular vectors)
  • r = \min(n, p), or smaller if we want only the leading components

For PCA, we typically want only the first k components where k \ll r, which simplifies the problem significantly.

6.2 The Block-Wise Strategy: Hierarchical Decomposition

BigDataStatMeth implements a hierarchical SVD algorithm (Iwen & Ong, 2017) that works in levels.

flowchart TD
    A["Matrix X n × p<br/>Too large for RAM"] --> B["Level 1:<br/>Partition into m blocks"]
    
    B --> C1["Block X₁"]
    B --> C2["Block X₂"]  
    B --> C3["Block Xₘ"]
    
    C1 --> D1["Local SVD:<br/>X₁ = U₁Σ₁V₁ᵀ<br/>Keep k components"]
    C2 --> D2["Local SVD:<br/>X₂ = U₂Σ₂V₂ᵀ<br/>Keep k components"]
    C3 --> D3["Local SVD:<br/>Xₘ = UₘΣₘVₘᵀ<br/>Keep k components"]
    
    D1 --> E1["Form Z₁ = Σ₁V₁ᵀ<br/>k × p"]
    D2 --> E2["Form Z₂ = Σ₂V₂ᵀ<br/>k × p"]
    D3 --> E3["Form Zₘ = ΣₘVₘᵀ<br/>k × p"]
    
    E1 --> F["Level 2:<br/>Stack Z matrices<br/>Z = [Z₁; Z₂; ...; Zₘ]<br/>mk × p<br/>Fits in RAM!"]
    E2 --> F
    E3 --> F
    
    F --> G["Global SVD:<br/>Z = UzΣzVzᵀ"]
    
    G --> H["Extract Results:<br/>Singular values = Σz<br/>Right vectors = Vz<br/>Left vectors from U₁...Uₘ"]
    
    H --> I["Final SVD:<br/>X ≈ UΣVᵀ"]
    
    style A fill:#ffe8e8
    style F fill:#fff8e1
    style I fill:#e8f6e8
    style D1 fill:#f0f8ff
    style D2 fill:#f0f8ff
    style D3 fill:#f0f8ff
    style G fill:#e8f6e8

Level 1: Partition and Local SVD

  1. Partition X into blocks along rows: X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{bmatrix}

  2. Compute truncated SVD for each block, keeping only k components: X_i = U_i \Sigma_i V_i^T

    where U_i is n_i \times k, \Sigma_i is k \times k, and V_i is p \times k.

  3. Form weighted matrices from the right singular vectors: Z_i = \Sigma_i V_i^T

    Each Z_i is k \times p - much smaller than the original X_i, which was n_i \times p!

Level 2: Merge and Global SVD

  1. Stack the Z_i matrices: Z = \begin{bmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_m \end{bmatrix}

    This Z matrix is mk \times p. Since m (blocks) and k (components) are both small, mk is tiny compared with n, and Z fits comfortably in memory.

  2. Compute global SVD of Z: Z = U_Z \Sigma_Z V_Z^T

  3. Extract results:

    • Global singular values = \Sigma_Z
    • Global right singular vectors = V_Z
    • Global left singular vectors: multiply back through blocks

Why this works: The row space of X (spanned by its right singular vectors) is well approximated by combining the truncated row spaces of the blocks. The block SVDs capture the main variation within each block; the global SVD then captures the overall variation.

Approximation quality: The result is not exact, but it is excellent when n is large relative to p and k is chosen large enough; the error decreases as k increases.
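
A compact in-memory R sketch of the two-level idea (illustrative only; the package performs this in C++ on HDF5-backed blocks):

# Rank-5 signal plus small noise, so keeping k = 10 components per block is plenty
set.seed(3)
n <- 2000; p <- 200; r <- 5; k <- 10
X <- matrix(rnorm(n * r), n, r) %*% matrix(rnorm(r * p), r, p) +
     matrix(rnorm(n * p, sd = 0.01), n, p)

blocks <- split(seq_len(n), rep(1:4, each = n / 4))      # 4 row blocks of 500 rows

# Level 1: local truncated SVDs, keep k components, form Z_i = Sigma_i t(V_i) (k x p)
Z <- do.call(rbind, lapply(blocks, function(idx) {
  s <- svd(X[idx, , drop = FALSE], nu = 0, nv = k)
  diag(s$d[1:k]) %*% t(s$v)
}))

# Level 2: SVD of the small stacked matrix (4k x p = 40 x 200)
s_global <- svd(Z, nu = 0)

# Leading singular values closely match the full in-memory SVD
round(rbind(hierarchical = s_global$d[1:r], full = svd(X)$d[1:r]), 3)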

Concrete example: X is 100,000 × 20,000 (16 GB)

  • Partition into m = 10 blocks of 10,000 rows
  • Keep k = 50 components

Memory per block during local SVD:

  • Block X_i: 10,000 × 20,000 doubles = 1.6 GB
  • U_i: 10,000 × 50 = 4 MB
  • \Sigma_i: 50 × 50 = 20 KB
  • V_i: 20,000 × 50 = 8 MB
  • Peak: ~1.6 GB (dominated by the block itself)

Memory for global SVD:

  • Z: (10 × 50) × 20,000 doubles = 80 MB - this easily fits in memory!

Result: a 16 GB problem is processed with ~1.6 GB of peak memory (a 10× reduction) while computing an accurate approximation.

Note: Hierarchical Decomposition Power

The stacked Z matrix is small: (m \times k) \times p instead of n \times p. With 10 blocks and 50 components, even for p=20{,}000, that's just 80 MB - fitting comfortably in RAM even when the original matrix is 16 GB!

We’ve reduced a 16 GB problem to one requiring ~1.6 GB peak memory - a 10× reduction - while computing an accurate approximation to the full SVD.

BigDataStatMeth implementation: The bdSVD_hdf5() function implements this hierarchical strategy with additional optimizations:

  • Automatic selection of block size and hierarchy depth based on data dimensions and available memory
  • Parallel processing of independent blocks
  • Optional centering and scaling before SVD
  • Multiple hierarchy levels for very large datasets (you can set q levels with k blocks per level)

7 The Pattern: Recognizing Block-Wise Opportunities

Looking across these examples, several patterns emerge for when and how algorithms can be adapted for block-wise processing:

7.1 1. Operations with Additive Decomposition

If an operation can be expressed as summing contributions from different data portions, it’s straightforward to implement block-wise:

  • Examples: Means, sums, counts, sufficient statistics for simple models
  • Strategy: Process each block, accumulate results, finalize
  • Memory: One block plus small accumulator
  • I/O: Single pass through data

7.2 2. Operations with Hierarchical Decomposition

If an operation can be computed hierarchically (computing on blocks, then combining block results), block-wise processing is feasible:

  • Examples: QR decomposition, SVD, certain optimization algorithms
  • Strategy: Local computation on blocks, merge results, possibly iterate
  • Memory: One block plus moderately-sized merge structure
  • I/O: Multiple passes, but still practical

7.3 3. Operations Requiring Global Information

Some operations need information about the entire dataset to process any part of it:

  • Examples: Sorting, ranking, quantiles, certain graph algorithms
  • Challenge: May require multiple passes or sophisticated algorithms
  • Strategy: Often use approximation or sampling strategies

7.4 4. Operations That Don’t Decompose Well

A few operations are fundamentally difficult to decompose:

  • Examples: Some machine learning algorithms with complex dependencies, certain nonlinear optimizations
  • Options: Use approximations, reformulate the problem, or accept that in-memory computation is necessary

BigDataStatMeth focuses on category 1 and 2 operations, which cover most standard statistical analyses. For category 3, the package often provides approximate but practical solutions.

8 Practical Considerations

8.1 Block Size Selection

Choosing appropriate block sizes involves trade-offs:

Blocks that are too small:

  • More overhead from reading many blocks
  • More function calls and loop iterations
  • Potentially less cache-friendly computation

Blocks that are too large:

  • Risk of memory exhaustion
  • Less opportunity for parallelization
  • Longer time to first result

Tip: Finding the Sweet Spot

BigDataStatMeth uses smart heuristics based on available RAM, data dimensions, and operation type. Users can override defaults when they have specific system knowledge, but the automatic selection works well for most cases.

Rule of thumb: Blocks should use ~10-20% of available RAM, leaving room for intermediate results and OS operations.

BigDataStatMeth's heuristics take into account:

  • Available RAM (conservative estimates)
  • Data dimensions (balanced partitioning)
  • Operation type (some operations need larger blocks for efficiency)
  • Computational profile (CPU-bound vs. I/O-bound)

Users can override defaults when they have specific knowledge about their system.
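
As a back-of-the-envelope illustration of that rule of thumb, a hypothetical helper (not part of BigDataStatMeth) might look like this:

# Hypothetical helper: choose a row-block size so one block of doubles
# uses roughly 15% of a stated RAM budget
choose_block_rows <- function(n_cols, ram_gb, fraction = 0.15) {
  bytes_budget <- ram_gb * 1024^3 * fraction
  max(1L, floor(bytes_budget / (8 * n_cols)))   # 8 bytes per double value
}

choose_block_rows(n_cols = 50000, ram_gb = 32)  # roughly 12,900 rows per block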

8.2 Numerical Stability

Block-wise algorithms can affect numerical properties, though the impact varies by operation:

Potential issues:

  • Accumulation of rounding errors across many blocks
  • Loss of precision in hierarchical methods (e.g., hierarchical SVD is approximate)
  • Order-dependent results for some operations (see the short example below)
  • Cancellation errors when subtracting nearly equal numbers
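
The order-dependence point is easy to see with an exaggerated plain-R example:

x <- c(1, 1e16, -1e16)
(x[1] + x[2]) + x[3]   # 0: the 1 is absorbed when added to 1e16 first
x[1] + (x[2] + x[3])   # 1: the large terms cancel before the 1 is added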

Important: What BigDataStatMeth Guarantees

The package prioritizes practical numerical reliability over theoretical guarantees:

  • Tested algorithms: Extensively verified against in-memory methods
  • Established methods: Uses well-known algorithms from numerical linear algebra literature
  • Double precision: All computations use 64-bit floating point
  • Documented approximations: When algorithms are approximate (hierarchical SVD), this is clearly stated

What is NOT guaranteed:

  • ❌ Bit-exact reproducibility across different hardware
  • ❌ Formal error bounds for all operations
  • ❌ Specific compensated algorithms (e.g., Kahan summation)

Practical implications:

For most statistical applications, the numerical precision is more than adequate:

  • PCA/SVD components match in-memory results to many decimal places
  • Regression coefficients are statistically indistinguishable
  • Hypothesis test results are identical

Note: When to Worry About Numerical Precision

If your application requires provable numerical bounds or arbitrary precision, BigDataStatMeth may not be suitable.

The package optimizes for the common case where statistical noise far exceeds numerical error - which describes most genomic and large-scale data analyses.

User control:

Some functions provide parameters to control accuracy:

  • SVD/PCA: the k parameter controls how many components to compute (more components means a better approximation to the full decomposition)
  • Iterative methods: tolerance parameters control convergence criteria

Check function documentation for available accuracy controls.

8.3 Parallelization

Many block-wise operations are embarrassingly parallel - blocks can be processed independently. BigDataStatMeth leverages this through OpenMP parallelization at the C++ level.

flowchart TD
    A[Matrix in HDF5<br/>divided into blocks] --> B[OpenMP Scheduler<br/>threads=4]
    
    B --> C1[Thread 1:<br/>Block 1]
    B --> C2[Thread 2:<br/>Block 2]
    B --> C3[Thread 3:<br/>Block 3]
    B --> C4[Thread 4:<br/>Block 4]
    
    C1 --> D1[Process<br/>independently]
    C2 --> D2[Process<br/>independently]
    C3 --> D3[Process<br/>independently]
    C4 --> D4[Process<br/>independently]
    
    D1 --> E[Combine results<br/>sequentially]
    D2 --> E
    D3 --> E
    D4 --> E
    
    E --> F[Final result]
    
    style A fill:#f0f8ff
    style B fill:#fff8e1
    style C1 fill:#e8f6e8
    style C2 fill:#e8f6e8
    style C3 fill:#e8f6e8
    style C4 fill:#e8f6e8
    style F fill:#e8f6e8

OpenMP (Open Multi-Processing) is a parallel programming API for C++ that BigDataStatMeth uses internally. When you specify threads = 4 in a function like bdSVD_hdf5(), the C++ code distributes independent block operations across 4 CPU cores.

Key characteristics:

  • Shared memory: All threads access the same HDF5 file, but read different blocks
  • Thread-safe I/O: BigDataStatMeth ensures HDF5 reads don’t conflict between threads
  • Load balancing: OpenMP automatically distributes blocks to available threads
  • Overhead vs. benefit: Parallelism only helps when block processing time exceeds coordination overhead

When parallelization helps:

  • Large blocks that take seconds to process each
  • CPU-intensive operations (matrix multiplication, decompositions)
  • Many blocks to process (good load distribution)

When it doesn’t help:

  • Very small blocks (overhead dominates)
  • I/O-bound operations (disk is the bottleneck, not CPU)
  • Few blocks (can’t keep all cores busy)

BigDataStatMeth automatically uses parallelization where beneficial. Users control thread count via the threads parameter.
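
From R, the main knob is the threads argument; a hedged usage sketch (the dataset path is a placeholder, and the exact signature should be checked in the bdSVD_hdf5() documentation):

# Use most, but not all, of the machine's cores (see the tip below)
svd_res <- bdSVD_hdf5("study.hdf5", "data/X", k = 50, threads = 10)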

// Algorithm: ParallelBlockProcessing
// Input: HDF5 dataset, n_blocks, n_threads
// Output: Combined results from all blocks

// 1. Initialize
results = array to store each block's output

// 2. Parallel processing (OpenMP)
#pragma omp parallel for num_threads(n_threads)
for (int i = 0; i < n_blocks; i++) {
  // Each thread processes its assigned blocks
  int thread_id = omp_get_thread_num();
  
  // Thread-safe read
  block_i = read_hdf5_block(i);
  
  // Independent computation
  result_i = process_block(block_i);
  
  // Store result
  results[i] = result_i;
  
  // Free memory
  free(block_i);
}

// 3. Combine results (sequential)
final_result = combine_function(results);

// 4. Return
return final_result;
Tip: Thread Configuration

Match threads to your CPU cores but leave 1-2 cores free for the system. For a 12-core machine, threads = 10 is often optimal. Use threads = 1 for debugging to simplify error tracking.

9 Implementing Your Own Block-Wise Methods

One of BigDataStatMeth’s design goals is enabling users to implement new statistical methods using block-wise processing. The package provides different levels of abstraction:

9.1 Architecture Overview

BigDataStatMeth uses a three-layer architecture:

1. Internal C++ Layer (Accessible from C++, not from R):

The lowest level consists of efficient C++ classes and functions that handle direct HDF5 file operations:

  • hdf5Dataset class - Wraps HDF5 C API for file access, provides block reading/writing
  • Efficient memory management and buffer handling
  • Direct HDF5 C API access for maximum performance
  • OpenMP-parallelized block iteration

Important: These internal components are NOT directly exposed to R users - you can't call them from an R script. They ARE, however, fully accessible if you're developing new methods in C++ using BigDataStatMeth as a header-only library, giving you full access to these low-level building blocks.

Each high-level algorithm (PCA, regression, etc.) uses these internal functions and manages its own block iteration strategy for optimal performance. The block-reading logic is algorithm-specific rather than exposed as general-purpose functions because different algorithms benefit from different iteration patterns.

2. Mid-Level Functions (C++ with R bindings):

Block-wise operations implemented in C++ but accessible from both R and C++:

  • bdCrossprod_hdf5() / bdtCrossprod_hdf5() - Crossproduct operations
  • bdblockmult_hdf5() - Matrix multiplication
  • bdNormalize_hdf5() - Centering and scaling
  • bdApply_hdf5() - Apply functions to blocks

These functions handle the complexity of block iteration, parallel processing, and result accumulation internally. From R, you simply call them with appropriate parameters.

3. High-Level Statistical Methods (C++ with R bindings):

Complete statistical analyses implemented using the mid-level functions:

  • bdSVD_hdf5() - Singular Value Decomposition
  • bdPCA_hdf5() - Principal Component Analysis
  • bdQR_hdf5() - QR Decomposition
  • Regression methods, association tests, etc.

9.2 Developing New Methods

Approach 1: Compose from existing functions (Recommended for R users)

Use existing mid-level and high-level functions as building blocks. Example: implementing Canonical Correlation Analysis (CCA) in R:

# CCA implementation using BigDataStatMeth functions
bdCCA_hdf5 <- function(filename, X_dataset, Y_dataset, k = 10) {
  
  # Step 1: Normalize both datasets
  bdNormalize_hdf5(filename, X_dataset, bcenter = TRUE, bscale = TRUE)
  bdNormalize_hdf5(filename, Y_dataset, bcenter = TRUE, bscale = TRUE)
  
  # Step 2: Compute QR decomposition for each
  qr_x <- bdQR_hdf5(filename, X_dataset)
  qr_y <- bdQR_hdf5(filename, Y_dataset)
  
  # Step 3: Compute cross-product of Q matrices
  cross <- bdCrossprod_hdf5(
    filename, 
    A = qr_x$Q_dataset,
    B = qr_y$Q_dataset
  )
  
  # Step 4: SVD of cross-product
  svd_result <- bdSVD_hdf5(filename, cross$dataset, k = k)
  
  # Step 5: Back-transform to get canonical variates
  # ... (using bdblockmult_hdf5 to multiply through)
  
  return(list(correlations = svd_result$d, ...))
}

This approach leverages existing optimized functions without writing C++ code.

Approach 2: Implement in C++ (For developers)

For maximum performance or novel algorithms, implement directly in C++. You’ll work with:

  • hdf5Dataset class for file access
  • Eigen library for linear algebra
  • OpenMP for parallelization

See the CCA implementation examples in the package source (Example_bdCCA.cpp, Example_bdCCA.h) which show both R-based and C++-based approaches to the same algorithm.

Key consideration: The C++ approach requires understanding:

  • Memory management in C++
  • The HDF5 C API through the hdf5Dataset wrapper
  • Eigen matrix operations
  • Thread safety for parallel operations

Most users will find composing from existing functions sufficient. The C++ API is there for those developing new core algorithms or requiring maximum performance.

9.3 Learning from Examples

The package includes several complete examples showing different implementation strategies:

  • Example_bdCCA_hdf5_R.R - CCA implemented purely in R using BigDataStatMeth functions
  • Example_bdCCA.cpp/.h - CCA implemented in C++ for comparison
  • Example_getQRbyBlocks.R - Hierarchical QR showing block partitioning strategies

These examples demonstrate how to think through:

  1. What can be computed block-by-block?
  2. What intermediate results need to be merged?
  3. What's the memory footprint at each step?
  4. When should intermediate results be written to HDF5 rather than kept in memory?

10 Interactive Exercise

10.1 Practice: Analyzing Block-Wise Efficiency

Understanding which operations work well with block-wise processing and which present challenges helps you design efficient analysis pipelines. This exercise develops that intuition.

# Consider this analysis workflow
analyze_workflow <- function() {
  # Operation 1: Compute column means
  means <- bdColMeans_hdf5(file, dataset)
  
  # Operation 2: Center the data  
  centered <- bdNormalize_hdf5(file, dataset, bcenter = TRUE, bscale = FALSE)
  
  # Operation 3: Compute t(X) %*% X
  crossprod <- bdCrossprod_hdf5(file, centered)
  
  # Operation 4: Eigen-decomposition for PCA
  pca <- bdSVD_hdf5(file, crossprod)
  
  # Question: Which operations are most/least block-amenable?
}
Tip: Reflection Questions

Think through these scenarios - understanding the principles matters more than having “correct” answers:

1. Operation Efficiency Analysis:

  • Which operation above is most naturally block-wise? Why?
  • Which operation requires the most inter-block coordination?
  • If you had to do this with limited memory, which step would be the bottleneck?

2. Block Size Decisions:

  • For a 100,000 × 50,000 matrix with 32 GB of RAM available, what block size would you choose?
  • What if you had 128 GB of RAM? Would you still use blocks?
  • How would block size affect the accuracy of the results?

3. Algorithm Adaptation:

  • Consider k-means clustering. Can you decompose it block-wise?
  • What about hierarchical clustering?
  • Which machine learning algorithms would be easy vs. hard to adapt?

4. Designing Your Pipeline:

  • Your analysis needs QC filtering, normalization, PCA, then regression.
  • Which steps can process data block-by-block independently?
  • Which steps need access to summary statistics from all the data?
  • Where would you store intermediate results?

5. Trade-offs:

  • Smaller blocks = less memory but more disk I/O
  • Larger blocks = more memory but fewer I/O operations
  • For your system, where's the sweet spot?

Try sketching out a block-wise version of your actual analysis pipeline. Which parts translate easily? Which require creative rethinking? This mental exercise builds intuition for designing efficient big data workflows.

11 Key Takeaways

Let’s consolidate what you’ve learned about adapting algorithms for block-wise processing with disk-based data.

11.1 Essential Concepts

The divide-process-combine paradigm underlies all block-wise computing. You partition data into manageable pieces, process each piece (often independently), then combine results to get the final answer. This simple pattern adapts to operations from simple means to complex matrix factorizations. The challenge lies in figuring out what to process and how to combine results validly.

Block-amenable operations share common mathematical properties that make them decompose naturally. Operations that can be expressed as sums, products, or aggregations across independent portions of data work well. Computing means, sums, element-wise operations, and many matrix products fall into this category. The mathematical structure of these operations allows valid decomposition.

Block-resistant operations require information flow between all parts of the data or produce outputs that scale problematically. Anything requiring global sorting, finding specific quantiles, or creating O(n²) outputs challenges block-wise approaches. These operations don’t decompose cleanly - they need special algorithmic tricks or fundamentally different approaches.

Block size creates a memory-accuracy-speed trade-off with no universally optimal choice. Smaller blocks use less memory but require more disk I/O and potentially more approximation error in hierarchical algorithms. Larger blocks use more memory but require fewer I/O operations and may be more accurate. The optimal choice depends on your available RAM, disk speed, and accuracy requirements.

Hierarchical processing enables algorithms that don’t naturally decompose to single-pass block operations. By creating multiple levels of computation - process blocks, merge results into smaller intermediate blocks, repeat until small enough - you can handle very large matrices while controlling memory usage at each level. This is how BigDataStatMeth implements complex operations like SVD/PCA.

Composability is power. You can build complex analyses by composing simple block-wise operations. Rather than implementing every possible analysis from scratch in C++, combine existing functions: normalize, compute crossproduct, perform SVD, multiply back through. This compositional approach lets you implement new methods at the R level, leveraging optimized core operations.

11.2 When Block-Wise Processing Works Well

Understanding which problems benefit from block-wise approaches helps you make good architectural decisions for your analyses.

Highly effective for:

  • Matrix multiplication and products - These decompose naturally into block-block operations. BigDataStatMeth’s implementations are highly optimized and scale well.

  • Element-wise operations - Operations like adding, subtracting, multiplying by scalars process each element independently. Perfect for block-wise processing with no communication overhead.

  • Row or column operations - Computing means, sums, normalizations by rows or columns work naturally with blocks containing complete rows or columns.

  • Many factorizations - SVD, QR, and Cholesky decompositions have hierarchical block-wise algorithms that maintain numerical accuracy while limiting memory usage.

Challenging for:

  • Global operations - Finding the median, computing percentiles, or operations that require knowing the full data distribution don’t decompose easily.

  • Fine-grained dependencies - Algorithms where each element depends on many others (like some iterative optimization methods) resist block decomposition.

  • Operations creating large outputs - Computing all pairwise distances or correlations creates O(n²) results. Even block-wise approaches struggle when output size explodes.

  • Highly iterative methods - Algorithms requiring many passes through data with convergence checks add overhead when each pass means disk I/O.

For most statistical analyses in genomics and similar fields - PCA, regression, association tests, basic machine learning - block-wise approaches work very well. The operations these methods need decompose naturally. More specialized algorithms may require case-by-case evaluation of whether block-wise processing provides benefits.

12 Next Steps

You now understand how standard statistical algorithms adapt for disk-based data processing. This conceptual foundation prepares you for the hands-on material that follows.


Note: Want to Learn More?

The BigDataStatMeth paper provides mathematical details and proofs for the hierarchical algorithms. The C++ API documentation shows lower-level implementation details if you want to understand exactly how operations are executed.