Performance Optimization

Benchmarks and Tuning Strategies for Large-Scale Analysis

1 Overview

Performance optimization in BigDataStatMeth requires understanding a fundamental truth: you cannot simultaneously maximize speed, minimize memory usage, and maintain perfect numerical accuracy. Every optimization decision involves tradeoffs, and the “best” configuration depends entirely on your specific constraints—hardware limitations, dataset characteristics, and analysis requirements.

Consider a common scenario: you have a 100GB dataset and need to perform PCA. One configuration processes it in 10 minutes using 8GB RAM. Another finishes in 5 minutes but requires 64GB RAM. Which is “better”? If you only have 16GB available, the second configuration is impossible—the first configuration is objectively superior for your situation. If you have 128GB RAM but are analyzing hundreds of datasets, the 5-minute option saves hours of total computation time. Neither configuration is universally optimal; context determines the right choice.

This guide teaches you to make these decisions systematically. You’ll learn to identify where your computation spends time (profiling), understand how parameters affect performance (tuning), and recognize when you’re fighting fundamental limitations (bottlenecks). The goal isn’t to memorize optimal parameter values—those don’t exist—but to develop intuition for balancing constraints in your specific environment.

The strategies here apply broadly across BigDataStatMeth’s operations: PCA, SVD, CCA, matrix operations, statistical tests. While optimal values vary by dataset and hardware, the underlying principles remain constant. Master these principles, and you can optimize any workflow.

1.1 What You’ll Learn

  • Identify performance bottlenecks in your workflows using profiling tools
  • Understand how block size (k, q) controls the memory-speed tradeoff
  • Tune thread counts while recognizing parallelization limits
  • Optimize HDF5 chunk sizes to match your access patterns
  • Balance numerical accuracy against computational efficiency
  • Recognize fundamental bottlenecks (I/O, memory bandwidth, Amdahl’s law)
  • Make informed optimization decisions based on your constraints
  • Benchmark systematically to measure actual improvements

2 Understanding the Three Constraints

Every optimization navigates three competing constraints. Understanding why these constraints conflict helps you make better tradeoff decisions.

flowchart LR
    A[Memory] <--> B[Speed]
    B <--> C[Accuracy]
    C <--> A
    
    style A fill:#e8f6e8
    style B fill:#e7f3ff
    style C fill:#fff8e1

2.1 Memory vs Speed

Block-wise computing divides matrices into smaller pieces to control memory usage. Smaller blocks mean lower peak memory—you can process datasets much larger than your RAM. But smaller blocks come with costs: more operations to combine results, more HDF5 reads/writes, more overhead from setup and coordination.

Think of it like moving furniture. Disassembling a table into many small pieces lets you fit it through doorways (memory constraint), but assembly and disassembly take time (speed penalty). Keeping the table intact is faster but requires a truck large enough to carry it whole (memory requirement). The optimal approach depends on what size “truck” (RAM) you have available.

Mathematically, if you double the number of blocks (k), you roughly halve peak memory usage but may increase computation time by 20-50% depending on the operation. For pure computation (matrix multiplication, SVD), the overhead is lower. For I/O-heavy operations (reading data, writing results), overhead dominates.
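To see the tradeoff numerically, a quick back-of-envelope sketch (the 1.5× overhead factor is only an illustrative assumption, not a package constant) tabulates block size and approximate peak memory as k grows:

# Back-of-envelope: block size and rough peak memory as k increases
# (the 1.5x overhead factor is an illustrative assumption)
n_rows <- 100000; n_cols <- 20000          # ~15 GB matrix of doubles
block_gb <- (n_rows / c(2, 4, 8, 16)) * n_cols * 8 / 1024^3
data.frame(k = c(2, 4, 8, 16),
           block_gb = round(block_gb, 1),
           approx_peak_gb = round(1.5 * block_gb, 1))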

2.2 Speed vs Accuracy

Some algorithms offer approximate solutions faster than exact solutions. Block-wise SVD, for example, can run fewer refinement iterations and converge faster, accepting slightly lower precision in the singular values. For exploratory data analysis—visualizing principal components, identifying outliers, getting a general sense of data structure—high precision matters less than rapid iteration. You’re happy with “close enough” if it lets you test ten parameter combinations instead of one.

For publication-quality results, accuracy cannot be compromised. The difference between a singular value of 42.157 and 42.154 might seem trivial, but reviewers and reproducibility standards demand exact, verifiable computations. In these cases, you accept longer computation times to guarantee numerical precision.

BigDataStatMeth defaults to conservative, accurate algorithms. When speed matters more than the last decimal place, you can adjust—but always verify results remain scientifically valid for your purposes.

2.3 Memory vs Accuracy

This constraint is subtler and less obvious. Extreme block sizes can affect numerical stability. Very small blocks (dozens of rows) may create ill-conditioned subproblems—matrices where numerical algorithms struggle with rounding errors. Very large blocks approaching memory limits may cause intermittent failures when memory fragmentation prevents allocation despite “enough” RAM theoretically existing.

The practical guidance: stay in reasonable ranges. Blocks with 1,000-10,000 rows typically work well. Below 500 rows per block, watch for numerical warnings. Above 50,000 rows per block, ensure you have substantial memory headroom. BigDataStatMeth’s algorithms handle edge cases, but respecting these guidelines avoids potential issues.
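If you prefer to think in rows per block, converting a target row count into k is a one-liner (a simple sketch following the guideline above):

# Translate a target rows-per-block into a block count k
n_rows <- 100000
target_rows_per_block <- 5000              # inside the 1,000-10,000 guideline
k <- max(2, ceiling(n_rows / target_rows_per_block))
k                                          # 20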


3 Quick Start: Decision Tree

Before diving into theory, here’s a practical decision tree for choosing starting configurations. These values aren’t optimal—they’re reasonable starting points based on common scenarios. Always profile and tune for your specific case.

flowchart TD
    Start[Start: Choose Configuration] --> DataSize{Dataset Size?}
    
    DataSize -->|< 10 GB| Small[Small Dataset]
    DataSize -->|10-100 GB| Medium[Medium Dataset]
    DataSize -->|> 100 GB| Large[Large Dataset]
    
    Small --> SmallMem{Available RAM?}
    SmallMem -->|> 16 GB| SmallConfig[k=2, q=1<br/>threads=4]
    SmallMem -->|< 16 GB| SmallConfigMem[k=4, q=1<br/>threads=2]
    
    Medium --> MediumMem{Available RAM?}
    MediumMem -->|> 32 GB| MediumConfig[k=4, q=1<br/>threads=8]
    MediumMem -->|16-32 GB| MediumConfigMed[k=8, q=1<br/>threads=4]
    MediumMem -->|< 16 GB| MediumConfigLow[k=16, q=2<br/>threads=2]
    
    Large --> LargeMem{Available RAM?}
    LargeMem -->|> 64 GB| LargeConfig[k=8, q=2<br/>threads=8]
    LargeMem -->|< 64 GB| LargeConfigMem[k=32, q=2<br/>threads=4]
    
    SmallConfig --> Profile[Profile & Tune Iteratively]
    SmallConfigMem --> Profile
    MediumConfig --> Profile
    MediumConfigMed --> Profile
    MediumConfigLow --> Profile
    LargeConfig --> Profile
    LargeConfigMem --> Profile
    
    style Start fill:#f0f8ff
    style Profile fill:#e8f6e8

How to use this tree:

Start by estimating your dataset size in GB: (rows × cols × 8 bytes) / (1024^3). Then follow the tree based on your available RAM. The resulting k, q, and threads values are starting points—safe configurations unlikely to crash but not necessarily optimal. After establishing this baseline, profile to identify bottlenecks and tune iteratively.

For example, if you have a 50GB dataset and 32GB RAM, the tree suggests k=8, q=1, threads=4. This configuration splits data into 8 blocks (each ~6.25GB), processes them in a single hierarchy level, and uses 4 threads. Peak memory usage will be roughly 8-10GB (safe for 32GB RAM), and 4 threads should parallelize well on modern CPUs. But your specific data might benefit from k=4 (fewer blocks, faster) or k=16 (more blocks if memory pressure appears). The tree gets you started; profiling reveals improvements.
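If you find yourself consulting the tree often, its logic fits in a small helper; this is a hypothetical convenience function mirroring the diagram above, not a package function:

# Hypothetical helper mirroring the decision tree above
starting_config <- function(data_gb, ram_gb) {
  if (data_gb < 10) {
    if (ram_gb > 16) list(k = 2, q = 1, threads = 4) else list(k = 4, q = 1, threads = 2)
  } else if (data_gb <= 100) {
    if (ram_gb > 32) {
      list(k = 4, q = 1, threads = 8)
    } else if (ram_gb >= 16) {
      list(k = 8, q = 1, threads = 4)
    } else {
      list(k = 16, q = 2, threads = 2)
    }
  } else {
    if (ram_gb > 64) list(k = 8, q = 2, threads = 8) else list(k = 32, q = 2, threads = 4)
  }
}

starting_config(data_gb = 50, ram_gb = 32)   # k = 8, q = 1, threads = 4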


4 Key Parameters Deep Dive

4.1 Block Size: The k Parameter

The k parameter fundamentally controls BigDataStatMeth’s memory-speed tradeoff. Understanding k deeply helps you tune confidently.

What k does mechanically: For a matrix with n rows, setting k=4 creates 4 blocks of approximately n/4 rows each (the last block may be slightly larger if n doesn’t divide evenly). Each block is processed sequentially or in parallel depending on the operation. Results from blocks are combined—summed, stacked, or merged—to produce final output.

Memory impact: Peak memory usage equals approximately the size of the largest block plus overhead for intermediate results. If your matrix is 100,000 rows × 20,000 columns (16GB), setting k=4 creates blocks of ~25,000 × 20,000 (4GB each). Peak usage will be roughly 5-6GB (one block + overhead). Doubling k to 8 halves block size to 2GB and reduces peak memory to 3-4GB. This linear relationship makes memory management straightforward: if hitting memory limits, double k.

Speed impact: More blocks mean more operations. Consider matrix multiplication: A × B with A split into 4 blocks requires 4 block multiplications then combining results. With 8 blocks, you need 8 block multiplications. Pure computation time increases roughly linearly with k. But I/O time (reading from HDF5, writing results) also increases—more blocks mean more separate read/write operations, each with overhead. For operations where I/O dominates (simple additions, subtractions), doubling k might increase total time by 80-100%. For computation-heavy operations (SVD, matrix inversion), time increases by 20-40%.

Choosing k strategically:

Start with this formula:

# Calculate reasonable k based on available memory
dataset_size_gb <- (n_rows * n_cols * 8) / (1024^3)
available_gb <- 16  # Your available RAM
safety_factor <- 0.7  # Use 70% of available RAM (headroom for overhead)

k_minimum <- ceiling(dataset_size_gb / (available_gb * safety_factor))
k_safe <- max(2, k_minimum)  # Never use k=1 (no blocking benefits)

cat("Recommended k:", k_safe, "\n")

This formula ensures your blocks fit comfortably in memory. But you might adjust:

  • If memory is abundant: Reduce k below the calculated value for speed
  • If computation is very slow: Try k/2 to reduce overhead (if memory allows)
  • If memory pressure appears: Increase k to provide more headroom
  • If using many threads: Ensure k ≥ threads (otherwise threads sit idle)

Guidelines by use case:

Scenario                               Recommended k   Reasoning
Exploratory analysis, abundant RAM     k = 2-4         Minimize overhead, iterate quickly
Production analysis, moderate RAM      k = 8-16        Balance memory safety and speed
Memory-constrained server              k = 16-32       Guarantee successful completion
Repeated analysis on same data         k = 4-8         Moderate value, tune based on profiling

Remember: these are starting points. Profile your actual workflow to find your optimal k.

4.2 Hierarchy Levels: The q Parameter

The q parameter controls hierarchy depth in multi-level algorithms, primarily block-wise SVD. Understanding when hierarchy helps requires grasping how block results combine.

Single-level (q=1) approach: Compute SVD on each of k blocks independently, producing k sets of singular vectors. Then combine all k results simultaneously—stack the k vectors, compute SVD of the combined matrix. This final SVD operates on a k×p matrix where p is the number of retained components. For k=8 and p=50, the final SVD processes an 8×50 matrix—tiny, fast.

Two-level (q=2) approach: Compute SVD on each of k blocks (first level). Group results into pairs or quadruplets, compute SVD on each group (second level). Finally, combine group results (top level). For k=16, first level produces 16 results, second level combines these into 4 intermediate results, top level combines those 4 into final output.

When does hierarchy help?

Hierarchy reduces peak memory during combination. With q=1 and k=32, combining 32 results simultaneously might require substantial memory—allocating arrays for 32 matrices, stacking them, computing SVD. With q=2, you combine 32 results in groups of 4 (8 groups), then combine 8 intermediate results—peak memory is lower because you never hold all 32 original results simultaneously.

The cost: more intermediate operations. Two-level hierarchy does roughly 1.3-1.5× as much computation as single-level. Three-level increases this to 1.5-2×. Unless memory-constrained, extra computation isn’t worthwhile.

Decision guide:

  • q=1: Default for k ≤ 8, or when memory isn’t limiting
  • q=2: Use when k > 8 AND memory is tight (less than 2× block size available)
  • q>2: Rarely needed; only for extreme cases with k > 32 and severe memory limits

Example decision: You have k=16 blocks and 16GB RAM. Each block uses ~3GB. With q=1, combining 16 results needs perhaps 6GB peak (manageable). With q=2, peak is ~4GB (safer but slower). If your system has 32GB RAM, q=1 is fine—extra headroom makes hierarchy unnecessary. If RAM is exactly 16GB and other processes use 8GB, q=2 provides safety margin.
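The decision guide reduces to a small heuristic; block_gb and free_gb below are values you estimate yourself (a sketch, not a package API):

# Sketch of the q decision guide above
choose_q <- function(k, block_gb, free_gb) {
  if (k <= 8 || free_gb >= 2 * block_gb) return(1)   # memory not limiting
  if (k <= 32) return(2)                             # many blocks, tight memory
  3                                                  # extreme cases only
}

choose_q(k = 16, block_gb = 3, free_gb = 5)   # ~5 GB free vs ~3 GB blocks -> q = 2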

4.3 Thread Count: Parallel Processing

Threading enables processing multiple blocks simultaneously, dramatically reducing wall-clock time. But effective threading requires understanding both software parallelism and hardware limitations.

How BigDataStatMeth uses threads: When you specify threads=4, eligible operations distribute work across 4 CPU cores. For block-wise operations, different blocks compute in parallel—while core 1 processes block 1, core 2 handles block 2, etc. For matrix operations using BLAS/LAPACK (multiplication, decompositions), underlying libraries may parallelize internally. Not all operations parallelize equally: some are inherently sequential, others embarrassingly parallel.

Why more threads isn’t always better:

This is crucial to understand. Adding threads only helps when:

  1. Operations are parallelizable: Sequential algorithms gain nothing from threads
  2. Enough independent work exists: With k=4 blocks and threads=8, 4 threads sit idle
  3. Computation dominates I/O: If most time is disk access, parallel computation doesn’t help
  4. Memory bandwidth suffices: Multiple cores reading simultaneously can saturate bandwidth
  5. CPU cores are available: Oversubscription (more threads than cores) causes thrashing

Understanding bottlenecks: Three fundamental bottlenecks limit threading efficiency:

I/O Bottleneck: Hard drives (even SSDs) have finite bandwidth. If 4 threads each try to read 1GB/second from a disk with 2GB/second total bandwidth, they compete—each gets only 500MB/second. Adding more threads doesn’t increase total bandwidth; it just divides it further. This is why simple operations (addition, scaling) often show poor threading efficiency—computation is fast, but reading data from HDF5 is slow. All threads wait for I/O.

Memory Bandwidth Bottleneck: Modern CPUs are fast, but RAM is comparatively slow. If 8 threads each try to read 8GB/second from RAM with 25GB/second total bandwidth, they get ~3GB/second each—less than needed. Memory-intensive operations (large matrix multiplications, cache-unfriendly access patterns) can saturate memory bandwidth. Adding more threads can actually slow things down by increasing cache misses and memory contention.

Amdahl’s Law (Inherent Sequentiality): Some parts of algorithms cannot parallelize. If 20% of your computation is inherently sequential, the maximum possible speedup is 5× regardless of thread count. With 4 threads, the ceiling is 1/(0.2 + 0.8/4) = 2.5×; with 8 threads it rises only to about 3.3×—doubling the threads buys less than 1× more speedup because the sequential portion increasingly dominates.
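Amdahl’s law itself is a one-line formula, and the ceilings above fall straight out of it:

# Amdahl's law: speedup with n threads when a fraction p of the work parallelizes
amdahl <- function(p, n) 1 / ((1 - p) + p / n)

amdahl(p = 0.8, n = 4)     # 2.5
amdahl(p = 0.8, n = 8)     # ~3.33
amdahl(p = 0.8, n = Inf)   # 5: the hard ceiling when 20% is sequential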

Measuring threading efficiency:

library(microbenchmark)

# Measure speedup with different thread counts
mb <- microbenchmark(
  t1 = bdSVD_hdf5("data.hdf5", "data", "matrix", k=8, threads=1),
  t2 = bdSVD_hdf5("data.hdf5", "data", "matrix", k=8, threads=2),
  t4 = bdSVD_hdf5("data.hdf5", "data", "matrix", k=8, threads=4),
  t8 = bdSVD_hdf5("data.hdf5", "data", "matrix", k=8, threads=8),
  times = 3
)

# Calculate speedup and efficiency
times <- summary(mb, unit = "s")$median  # medians in seconds
speedup <- times[1] / times
efficiency <- speedup / c(1, 2, 4, 8)

print(data.frame(
  threads = c(1, 2, 4, 8),
  time_sec = round(times, 2),
  speedup = round(speedup, 2),
  efficiency = round(efficiency, 2)
))

Interpreting efficiency:

  • 90-100% efficiency: Ideal scaling, rare in practice
  • 70-90% efficiency: Excellent scaling, computation dominates
  • 50-70% efficiency: Good scaling, acceptable for most use cases
  • 30-50% efficiency: Moderate scaling, likely hitting I/O or memory bandwidth
  • <30% efficiency: Poor scaling, fundamental bottleneck exists

If you see 4 threads achieving only 1.5× speedup (37% efficiency), investigate:

  • Is operation I/O-heavy? (Check if system time >> user time)
  • Is memory bandwidth saturated? (Monitor with tools like htop or iostat)
  • Is BLAS library already parallel? (Check with sessionInfo())
  • Are blocks too small? (Very small blocks have high overhead relative to computation)

Practical thread guidelines:

Hardware                     Recommended Threads           Notes
Laptop (4 cores)             threads = 2-4                 Use all cores unless thermal throttling appears
Workstation (8-16 cores)     threads = 4-8                 Start with half, increase if efficiency good
Shared server (32+ cores)    threads = n_cores / n_users   Be considerate of others
Cloud instance (variable)    threads = physical_cores      Avoid hyperthreads for consistency
High-memory, low-core        threads = n_cores             CPU-bound, use all available
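For the shared-server row above, a quick calculation keeps you from monopolizing the machine (n_users is simply your estimate of concurrent heavy users; purely illustrative):

# Shared-server etiquette: split physical cores among concurrent users (illustrative)
n_cores <- parallel::detectCores(logical = FALSE)
n_users <- 4                                 # your estimate of concurrent heavy users
threads <- max(1, floor(n_cores / n_users))
threads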

BLAS library interaction:

BigDataStatMeth uses BLAS/LAPACK for matrix operations. Some BLAS implementations (Intel MKL, OpenBLAS) are multi-threaded by default. If BLAS uses 4 threads and you specify threads=4, you may have 16 total threads (4 × 4), causing oversubscription.

Check BLAS configuration:

sessionInfo()
# Look for BLAS/LAPACK implementation
# If using OpenBLAS or MKL, they may be threaded

# Limit BLAS threads if needed (Linux/Mac)
Sys.setenv(OPENBLAS_NUM_THREADS = 1)
Sys.setenv(MKL_NUM_THREADS = 1)

If BLAS is single-threaded (reference BLAS), BigDataStatMeth’s threading provides all parallelism. If BLAS is multi-threaded, consider reducing BigDataStatMeth threads to avoid oversubscription.
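If you prefer to inspect and control BLAS threading from within R rather than through environment variables, the RhpcBLASctl package (an optional dependency you would install separately) provides helpers:

# Optional: inspect and limit BLAS threading from R (requires the RhpcBLASctl package)
library(RhpcBLASctl)
get_num_cores()            # physical cores detected
blas_get_num_procs()       # threads the BLAS library currently uses
blas_set_num_threads(1)    # hand all parallelism to BigDataStatMeth's own threading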

4.4 HDF5 Chunk Size

HDF5 stores data in chunks—contiguous blocks on disk. Chunk size dramatically affects I/O performance but is set when creating datasets, not when computing. If you control dataset creation (via bdCreate_hdf5_matrix), chunk size becomes a key optimization lever.

How chunking affects I/O: When you read a single element from HDF5, the entire chunk containing it loads into memory. Read 100 elements scattered across 100 different chunks, and HDF5 performs 100 disk seeks, loading 100 chunks. Read 100 elements all in one chunk, and HDF5 performs 1 disk seek, loading 1 chunk. Sequential access to well-chunked data is fast; random access to poorly-chunked data is slow.
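A chunk’s footprint is easy to compute before committing to it; the helper below is a simple sketch for double-precision data:

# Chunk footprint in MB for double-precision values
chunk_mb <- function(chunk_rows, chunk_cols, bytes_per_value = 8) {
  chunk_rows * chunk_cols * bytes_per_value / 1024^2
}

chunk_mb(1000, 1000)   # ~7.6 MB, inside the 1-10 MB target discussed below
chunk_mb(100, 2000)    # ~1.5 MB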

BigDataStatMeth access patterns:

Most BigDataStatMeth operations access data in rectangular blocks—reading several rows and all columns, or all rows and several columns. Optimal chunking aligns chunks with typical access blocks.

  • Row-wise operations (processing samples): Chunk by rows. If reading 1,000 rows at a time, chunk size ~1,000 rows × all columns
  • Column-wise operations (processing features): Chunk by columns. If reading 100 columns at a time, chunk size ~all rows × 100 columns
  • Block operations (typical for BigDataStatMeth): Square-ish chunks work well. For a 100,000 × 10,000 matrix read in large rectangular blocks, chunks of roughly 1,000 × 1,000 (about 8 MB) align with block access while staying within the size guidelines below

Chunk size guidelines:

  1. Target 1-10 MB chunks: Very small chunks (<100 KB) have excessive overhead. Very large chunks (>100 MB) waste space and time loading unused data.

  2. Match access pattern: If you always read full rows, make chunks span full rows. If you read columns, span full columns.

  3. Consider compression: Larger chunks compress better (more data for compression algorithms to find patterns). If using compression, prefer larger chunks (5-10 MB).

  4. Balance with cache: Chunks should fit comfortably in CPU cache (L3 cache often 8-16 MB). Chunks larger than cache cause cache thrashing.

Example chunking decisions:

# Genomics: 50,000 samples × 500,000 SNPs
# Typical access: process 5,000 samples at a time (row-wise)
bdCreate_hdf5_matrix(
  filename = "genomics.hdf5",
  object = genotype_matrix,
  group = "data",
  dataset = "snps",
  chunk_rows = 2,         # Chunks span full rows: ~8 MB each (within the 1-10 MB target),
  chunk_cols = 500000     # so each 5,000-row block reads a contiguous run of chunks
)

# Gene expression: 100 samples × 20,000 genes
# Typical access: analyze all samples for subset of genes (column-wise)
bdCreate_hdf5_matrix(
  filename = "expression.hdf5",
  object = expression_matrix,
  group = "data",
  dataset = "genes",
  chunk_rows = 100,       # All samples
  chunk_cols = 2000       # ~10 chunks column-wise
)

# General matrix: 10,000 × 10,000
# Typical access: square blocks
bdCreate_hdf5_matrix(
  filename = "matrix.hdf5",
  object = large_matrix,
  group = "data",
  dataset = "values",
  chunk_rows = 1000,      # Square chunks
  chunk_cols = 1000       # ~100 total chunks
)

When chunk size matters most:

  • Very large datasets (>10 GB) where I/O is significant fraction of time
  • Repeated access to same data (poor chunking penalizes every access)
  • Network storage (NFS, cloud) where disk seeks are expensive
  • Compression enabled (larger chunks compress better)

When chunk size matters less:

  • Small datasets (<1 GB) that fit entirely in cache
  • Single-pass operations (read once, compute, done)
  • Local SSD storage where seeks are cheap
  • Computation-heavy operations where I/O is <10% of time

If profiling shows system time (I/O) is >30% of elapsed time, investigate chunking. Otherwise, default chunking is likely fine.
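A quick way to apply that rule of thumb is to compute the system-time fraction directly from a system.time() result, reusing the bdSVD_hdf5 call pattern from the earlier examples:

# I/O fraction of a run; above ~30% suggests chunking deserves a look
tm <- system.time(
  bdSVD_hdf5("data.hdf5", "data", "matrix", k = 8, threads = 4)
)
io_fraction <- tm[["sys.self"]] / tm[["elapsed"]]
round(io_fraction, 2)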


5 Profiling: Finding the Bottleneck

Before optimizing, you must identify where time actually goes. Optimizing the wrong part of your code wastes effort and provides minimal benefit. Profiling reveals the truth.

5.1 Why Profile First

Intuition about bottlenecks is often wrong. You might assume SVD computation dominates, but profiling reveals 60% of time is normalization. You might think I/O is the problem, but profiling shows memory allocation overhead. Without measurement, optimization is guesswork.

The profiling workflow:

  1. Establish baseline: Run with conservative parameters, measure total time
  2. Profile: Identify which operations consume time
  3. Prioritize: Focus on operations using >20% of total time
  4. Optimize: Tune those specific operations
  5. Verify: Ensure optimization helped and results remain correct
  6. Repeat: Profile again to find next bottleneck

5.2 Basic Timing with system.time()

The simplest profiling tool is system.time():

timing <- system.time({
  result <- bdPCA_hdf5(
    filename = "genomics.hdf5",
    group = "data",
    dataset = "genotypes",
    k = 8,
    q = 1,
    threads = 4
  )
})

print(timing)
#    user  system elapsed 
#  45.231   2.109  12.847

Interpreting the output:

  • user: CPU time spent in R code and user-level functions. This is “computation time”—the sum of all CPU time across all threads. With 4 threads, user time can exceed elapsed time (4 threads × 10 seconds = 40 seconds user time in 10 seconds elapsed).

  • system: CPU time spent in system calls—primarily I/O (reading/writing files), memory allocation, and inter-process communication. High system time indicates I/O or memory management overhead.

  • elapsed: Wall-clock time—what you experience. This is total time from start to finish, regardless of threading.

Common patterns and what they mean:

# Pattern 1: user >> elapsed (e.g., user=45, elapsed=13)
# Good parallelization! 4 threads did 45 seconds of work in 13 seconds
# Speedup: 45/13 = 3.46× with 4 threads (87% efficiency)

# Pattern 2: system high (e.g., user=20, system=15, elapsed=35)
# I/O bottleneck! System time (15s) is 43% of elapsed time
# Solution: Optimize HDF5 chunk size, use faster storage, or cache data

# Pattern 3: elapsed >> user + system (e.g., user=10, system=2, elapsed=40)
# Waiting for something! CPU idle most of the time
# Possible causes: disk very slow, network storage, memory swapping
# Check: iostat, iotop, or system monitor during execution

# Pattern 4: user ≈ elapsed, threads=4 (e.g., user=15, elapsed=14)
# No parallelization benefit! Work is sequential or I/O-bound
# Solution: Reduce threads (they're not helping), optimize sequential portions

5.3 Detailed Profiling with Rprof()

For deeper insight, use Rprof() to identify which specific functions consume time:

# Start profiling
Rprof("pca_profile.out", memory.profiling = TRUE)

# Run your analysis
result <- bdPCA_hdf5(
  filename = "large_data.hdf5",
  group = "data",
  dataset = "matrix",
  k = 16,
  q = 2,
  threads = 8
)

# Stop profiling
Rprof(NULL)

# Analyze results
profile_summary <- summaryRprof("pca_profile.out", memory = "both")  # memory stats used below

# Print time spent in each function
print(profile_summary$by.self)

Example output and interpretation:

           self.time self.pct total.time total.pct
bdSVD_hdf5      24.5     48.5       35.2      69.8
bdNormalize     12.3     24.4       12.8      25.4
h5read           8.2     16.3        8.2      16.3
crossprod        3.1      6.1        3.1       6.1

This output tells a clear story:

  • bdSVD_hdf5 (48.5% self time): SVD computation dominates—this is the primary optimization target
  • bdNormalize (24.4% self time): Normalization takes 1/4 of total time—consider pre-normalizing if running multiple analyses
  • h5read (16.3% self time): I/O is significant but not dominant—chunking optimization might help but won’t transform performance
  • crossprod (6.1% self time): Minor contributor—optimizing this won’t noticeably improve total time

Memory profiling insights:

# Memory allocation by function
print(profile_summary$by.total[1:10, c("total.time", "mem.total")])

Shows which functions allocate most memory. High memory allocation indicates where increasing k (more blocks) would help most.

5.4 Visual Profiling with profvis

For interactive exploration, profvis provides flamegraphs showing time distribution visually:

library(profvis)

profvis({
  bdSVD_hdf5(
    filename = "data.hdf5",
    group = "data",
    dataset = "matrix",
    k = 8,
    threads = 4
  )
})

The interactive visualization shows:

  • Horizontal axis: Time from start to finish
  • Vertical axis: Call stack (functions calling functions)
  • Width of bars: Time spent in each function
  • Colors: Different colors for different code sections

Click on bars to zoom in, hover to see details. Look for:

  • Wide bars: Functions consuming significant time (optimization targets)
  • Deep stacks: Many nested function calls (potential overhead)
  • Memory spikes: Sudden memory allocations (candidates for blocking)

6 Optimization Workflows

Different bottlenecks require different optimization strategies. These workflows guide you through systematic tuning.

6.1 Memory-Constrained Environment

Symptoms you’re memory-constrained:

  • Frequent swapping (system very slow, disk activity constant)
  • “Cannot allocate vector of size X” errors
  • R crashes with memory-related errors
  • System monitor shows RAM usage at 95-100%

The optimization strategy:

flowchart TD
    A[Memory Issues Detected] --> B[Increase k by 2x]
    B --> C{Issues resolved?}
    C -->|Yes| H[Success: Monitor future runs]
    C -->|No| D[Add hierarchy: q=2]
    D --> E{Issues resolved?}
    E -->|Yes| H
    E -->|No| F[Reduce threads by half]
    F --> G{Issues resolved?}
    G -->|Yes| H
    G -->|No| I[Process in stages OR<br/>upgrade hardware]
    
    style A fill:#ffe8e8
    style H fill:#e8f6e8
    style I fill:#fff8e1

Step-by-step implementation:

Step 1: Increase block count (k)

Doubling k halves memory usage. If currently using k=4 and hitting memory limits, try k=8:

# Original configuration (memory issues)
result <- bdPCA_hdf5(
  filename = "large_data.hdf5",
  group = "data",
  dataset = "matrix",
  k = 4,
  q = 1,
  threads = 4
)
# Error or very slow due to swapping

# Optimized: Double k
result <- bdPCA_hdf5(
  filename = "large_data.hdf5",
  group = "data",
  dataset = "matrix",
  k = 8,    # Doubled
  q = 1,
  threads = 4
)
# Should complete successfully, though slower

Step 2: Add hierarchy if still problematic

If k=8 still causes issues, add q=2 to reduce peak memory during combination:

result <- bdPCA_hdf5(
  filename = "large_data.hdf5",
  group = "data",
  dataset = "matrix",
  k = 8,
  q = 2,    # Added hierarchy
  threads = 4
)

Step 3: Reduce threading if memory pressure persists

More threads means more blocks loaded simultaneously. Reducing threads decreases parallel memory pressure:

result <- bdPCA_hdf5(
  filename = "large_data.hdf5",
  group = "data",
  dataset = "matrix",
  k = 8,
  q = 2,
  threads = 2    # Reduced from 4
)

Step 4: Process in stages

If still problematic, break analysis into smaller stages, writing intermediate results to HDF5:

# Stage 1: Normalize (writes to HDF5)
bdNormalize_hdf5(
  filename = "large_data.hdf5",
  group = "data",
  dataset = "matrix",
  bcenter = TRUE,
  bscale = FALSE
)

# Stage 2: QR decomposition (done internally, writes to HDF5)
# Stage 3: SVD on reduced data (smaller working set)
result <- bdPCA_hdf5(
  filename = "large_data.hdf5",
  group = "NORMALIZED/data",
  dataset = "matrix",
  bcenter = FALSE,  # Already normalized
  bscale = FALSE,
  k = 8,
  q = 2,
  threads = 2
)

Each stage has smaller peak memory than attempting everything at once.

6.2 Speed-Constrained Environment

Symptoms you’re speed-constrained:

  • Computation takes too long for your workflow
  • Memory usage is well below capacity (30-50% of available RAM)
  • Waiting for results interrupts iterative analysis
  • System monitor shows RAM available but computation slow

The optimization strategy:

flowchart TD
    A[Speed Issues Detected] --> B[Check memory usage]
    B --> C{RAM < 70% used?}
    C -->|Yes| D[Decrease k by half]
    D --> E{Faster?}
    E -->|Yes| F[Increase threads]
    F --> G{Faster?}
    G -->|Yes| J[Optimize HDF5 chunking]
    C -->|No| H[Memory limiting speed<br/>Use balanced approach]
    E -->|No| I[Check I/O bottleneck]
    G -->|No| K[Check parallelization limits]
    J --> L[Success: Optimal configuration]
    
    style A fill:#fff8e1
    style L fill:#e8f6e8

Step-by-step implementation:

Step 1: Reduce block count if memory allows

Fewer blocks mean less overhead. If memory usage is low, try reducing k:

# Current: Slow but plenty of RAM available
result <- bdPCA_hdf5("data.hdf5", "data", "matrix", 
                     k=8, threads=4)
# Elapsed: 45 seconds, RAM usage: 8GB / 32GB available

# Optimized: Fewer blocks
result <- bdPCA_hdf5("data.hdf5", "data", "matrix",
                     k=4, threads=4)  # Halved k
# Elapsed: 28 seconds (38% faster)

Step 2: Increase thread count

More threads reduce wall-clock time for parallelizable operations:

# Check available cores
n_cores <- parallel::detectCores()
cat("Available cores:", n_cores, "\n")

# Use more threads
result <- bdPCA_hdf5("data.hdf5", "data", "matrix",
                     k=4, threads=n_cores)
# Measure speedup vs fewer threads

Monitor efficiency—if adding threads provides minimal speedup, you’ve hit a bottleneck (I/O, memory bandwidth, or Amdahl’s law).

Step 3: Pre-normalize data

If running multiple analyses on the same data, normalize once and reuse:

# Normalize once
bdNormalize_hdf5("data.hdf5", "data", "matrix",
                 bcenter=TRUE, bscale=FALSE)

# Run analyses without re-normalizing
for (param in parameter_grid) {
  result <- bdPCA_hdf5(
    filename = "data.hdf5",
    group = "NORMALIZED/data",
    dataset = "matrix",
    bcenter = FALSE,  # Skip normalization
    bscale = FALSE,
    k = 4,
    threads = 8
    # Other parameters vary
  )
  # Process result
}

This eliminates redundant normalization across runs, saving 20-40% of time in iterative workflows.

Step 4: Optimize HDF5 chunk size

If profiling shows significant I/O time, recreate dataset with optimal chunking:

# Read existing data (h5read comes from the rhdf5 package)
library(rhdf5)
original_data <- h5read("data.hdf5", "data/matrix")

# Recreate with optimized chunks
bdCreate_hdf5_matrix(
  filename = "data_optimized.hdf5",
  object = original_data,
  group = "data",
  dataset = "matrix",
  chunk_rows = ceiling(nrow(original_data) / 4),  # Match k=4
  chunk_cols = ncol(original_data),
  compression = 0  # Disable compression for max speed
)

# Use optimized dataset
result <- bdPCA_hdf5("data_optimized.hdf5", "data", "matrix",
                     k=4, threads=8)

Optimal chunking can reduce I/O time by 50-80% for I/O-heavy operations.

6.3 Balanced Optimization: Iterative Tuning

When neither memory nor speed is catastrophically limiting, tune iteratively to find the sweet spot:

Phase 1: Establish baseline

# Conservative configuration (guaranteed to work)
cat("Testing baseline configuration...\n")
baseline_time <- system.time({
  baseline_result <- bdPCA_hdf5(
    filename = "data.hdf5",
    group = "data",
    dataset = "matrix",
    k = 8,
    q = 1,
    threads = 2
  )
})

cat("Baseline time:", baseline_time["elapsed"], "seconds\n")
cat("Peak memory: Check system monitor\n")

Phase 2: Test reduced blocking

cat("Testing k=4 (fewer blocks)...\n")
test1_time <- system.time({
  test1_result <- bdPCA_hdf5(
    filename = "data.hdf5",
    group = "data",
    dataset = "matrix",
    k = 4,  # Halved
    q = 1,
    threads = 2
  )
})

improvement1 <- (baseline_time["elapsed"] - test1_time["elapsed"]) / 
                baseline_time["elapsed"] * 100

cat("k=4 time:", test1_time["elapsed"], "seconds\n")
cat("Improvement:", round(improvement1, 1), "%\n")

Phase 3: Test increased threading

cat("Testing threads=4 (more parallelism)...\n")
test2_time <- system.time({
  test2_result <- bdPCA_hdf5(
    filename = "data.hdf5",
    group = "data",
    dataset = "matrix",
    k = 4,
    q = 1,
    threads = 4  # Doubled
  )
})

improvement2 <- (test1_time["elapsed"] - test2_time["elapsed"]) / 
                test1_time["elapsed"] * 100

cat("threads=4 time:", test2_time["elapsed"], "seconds\n")
cat("Additional improvement:", round(improvement2, 1), "%\n")

Phase 4: Verify and select

# Verify results match baseline
cat("Verifying correctness...\n")
all.equal(
  baseline_result$components[,1:5],
  test2_result$components[,1:5],
  tolerance = 1e-10
)

# Choose best configuration
configurations <- data.frame(
  config = c("Baseline", "k=4", "k=4,t=4"),
  time = c(baseline_time["elapsed"], 
           test1_time["elapsed"],
           test2_time["elapsed"]),
  speedup = c(1.0,
              baseline_time["elapsed"] / test1_time["elapsed"],
              baseline_time["elapsed"] / test2_time["elapsed"])
)

print(configurations)

cat("\nRecommended configuration: k=4, threads=4\n")

This systematic approach finds good configurations without exhaustive testing.


7 Common Performance Pitfalls

Experience reveals recurring mistakes that degrade performance. Recognizing these patterns helps you avoid them.

7.1 Pitfall 1: Excessive Blocking

Symptom: Computation is very slow despite low memory usage (30-40% of RAM used).

Cause: Block count (k) is much larger than necessary, creating excessive overhead from operations and I/O.

Why this happens: Overly conservative memory estimates or copying someone else’s configuration without adapting to your hardware.

Example:

# Poor configuration
system.time({
  result <- bdSVD_hdf5("data.hdf5", "data", "small", k=32)
})
#   elapsed: 45 seconds
# Dataset: 10,000×5,000 (400 MB)
# Available RAM: 16 GB (plenty of headroom!)

# Good configuration
system.time({
  result <- bdSVD_hdf5("data.hdf5", "data", "small", k=2)
})
#   elapsed: 8 seconds (82% faster)

Fix: Calculate appropriate k based on actual memory needs:

# Calculate dataset size
n_rows <- 10000
n_cols <- 5000
dataset_size_gb <- (n_rows * n_cols * 8) / (1024^3)
cat("Dataset size:", round(dataset_size_gb, 2), "GB\n")

# Available RAM
available_gb <- 16
k_needed <- ceiling(dataset_size_gb / (available_gb * 0.5))
cat("Minimum k needed:", k_needed, "\n")
cat("Recommended k:", max(2, k_needed), "\n")

Rule: Use smallest k that keeps memory usage comfortably below limits (60-70% of RAM).


7.2 Pitfall 2: Thread Oversubscription

Symptom: Adding more threads doesn’t improve performance or actually makes it worse.

Cause: More threads than physical CPU cores, causing context switching overhead and cache thrashing.

Why this happens: Confusing physical cores with hyperthreads/logical cores, or assuming “more threads = more speed.”

Example:

# System: 4 physical cores (8 hyperthreads)

# Poor: threads=16 (4× oversubscription)
system.time({
  result <- bdPCA_hdf5("data.hdf5", "data", "matrix", 
                       k=8, threads=16)
})
#   elapsed: 25 seconds
# Context switching overhead, cache thrashing

# Good: threads=4 (matches physical cores)
system.time({
  result <- bdPCA_hdf5("data.hdf5", "data", "matrix",
                       k=8, threads=4)
})
#   elapsed: 12 seconds (52% faster)

Understanding the problem: When threads exceed physical cores, the operating system rapidly switches between them (context switching). Each switch: saves one thread’s state, loads another’s state, invalidates CPU caches. With 16 threads on 4 cores, each core switches between 4 threads. Switching costs accumulate, and cache efficiency plummets—data constantly evicted before use.

Fix: Match threads to physical cores:

# Find physical cores (not hyperthreads)
n_physical <- parallel::detectCores(logical = FALSE)
cat("Physical cores:", n_physical, "\n")

# Use physical cores as upper bound
optimal_threads <- min(n_physical, 8)  # Cap at 8 for safety

Additional check—BLAS threading:

Some BLAS libraries (MKL, OpenBLAS) use threading internally. If BLAS uses 4 threads and you specify threads=4 in BigDataStatMeth, you have 4×4=16 threads—oversubscription on any machine with fewer than 16 physical cores.

# Check BLAS configuration
sessionInfo()
# Look for "BLAS: " line

# If multi-threaded BLAS, limit BigDataStatMeth threads
Sys.setenv(OPENBLAS_NUM_THREADS = 1)  # Disable BLAS threading
# OR
# Use fewer BigDataStatMeth threads to account for BLAS threading

7.3 Pitfall 3: Redundant Operations

Symptom: Repeated analyses (parameter sweeps, cross-validation) each take the same amount of time, including time for operations that don’t change between runs.

Cause: Not leveraging HDF5’s persistence to reuse computed results.

Why this happens: Treating each analysis as independent rather than recognizing shared computations.

Example:

# Poor: Normalize every iteration
parameter_values <- c(2, 5, 10, 20, 50)

for (n_components in parameter_values) {
  result <- bdPCA_hdf5(
    filename = "data.hdf5",
    group = "data",
    dataset = "raw",
    bcenter = TRUE,   # Normalizes every iteration
    bscale = TRUE,
    k = 4,
    # n_components varies
  )
  # Process result
}
# Total time: 5 × (normalization + PCA) = 5 × 60s = 300s

# Good: Normalize once, reuse
bdNormalize_hdf5(
  filename = "data.hdf5",
  group = "data",
  dataset = "raw",
  bcenter = TRUE,
  bscale = TRUE
)

for (n_components in parameter_values) {
  result <- bdPCA_hdf5(
    filename = "data.hdf5",
    group = "NORMALIZED/data",  # Pre-normalized
    dataset = "raw",
    bcenter = FALSE,  # Skip normalization
    bscale = FALSE,
    k = 4,
    # n_components varies
  )
  # Process result
}
# Total time: normalization + 5 × PCA = 50s + 5×10s = 100s (67% faster)

Understanding the savings: If normalization takes 20% of total time and you run 10 analyses, normalizing once saves 9/10 × 20% = 18% of total workflow time. For workflows with hundreds of runs (bootstrapping, cross-validation), savings multiply dramatically.

Fix strategy:

  1. Identify operations that don’t change between runs
  2. Perform them once, store results in HDF5
  3. Reuse stored results in subsequent runs

Applies to normalization, QR decomposition (for CCA), centering, scaling—any deterministic operation shared across runs.
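In scripts, this pattern is easy to enforce with a guard that skips work when the stored result already exists. The sketch below assumes rhdf5 is available and that bdNormalize_hdf5 writes its output under NORMALIZED/data, as in the examples above:

# Compute-once guard: normalize only if the normalized dataset is not already stored
library(rhdf5)

contents <- h5ls("data.hdf5")
already_normalized <- any(contents$group == "/NORMALIZED/data" &
                          contents$name == "raw")

if (!already_normalized) {
  bdNormalize_hdf5("data.hdf5", "data", "raw",
                   bcenter = TRUE, bscale = TRUE)
}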


8 Benchmarking: Measuring Real Performance

To validate optimizations, benchmark systematically with real data and hardware. The following framework provides structure for comprehensive performance testing.

8.1 Systematic Benchmarking Framework

library(microbenchmark)

# Define configurations to test
test_configs <- expand.grid(
  k = c(2, 4, 8, 16),
  threads = c(1, 2, 4, 8),
  stringsAsFactors = FALSE
)

# Run each configuration multiple times
results <- lapply(seq_len(nrow(test_configs)), function(i) {
  cfg <- test_configs[i, ]
  
  cat("Testing k=", cfg$k, ", threads=", cfg$threads, "...\n", sep="")
  
  timing <- system.time({
    bdSVD_hdf5(
      file = "benchmark_data.hdf5",
      group = "data",
      dataset = "matrix",
      k = cfg$k,
      threads = cfg$threads,
      bcenter = TRUE,
      bscale = FALSE
    )
  })
  
  data.frame(
    k = cfg$k,
    threads = cfg$threads,
    user = timing[["user"]],
    system = timing[["system"]],
    elapsed = timing[["elapsed"]]
  )
})

benchmark_df <- do.call(rbind, results)

# Analyze results
cat("\nBenchmark Results:\n")
print(benchmark_df)

# Find optimal configuration
optimal_idx <- which.min(benchmark_df$elapsed)
optimal <- benchmark_df[optimal_idx, ]

cat("\nOptimal configuration:\n")
cat("  k =", optimal$k, "\n")
cat("  threads =", optimal$threads, "\n")
cat("  elapsed time =", round(optimal$elapsed, 2), "seconds\n")

8.2 Visualizing Performance

library(ggplot2)

# Describe the run for the plot subtitle (adjust to your data and hardware)
n_rows <- 100000; n_cols <- 20000; ram_gb <- 16

# Plot: k vs time for each thread count
ggplot(benchmark_df, aes(x=k, y=elapsed, color=factor(threads))) +
  geom_line(linewidth=1) +
  geom_point(size=3) +
  scale_x_continuous(trans="log2", breaks=c(2,4,8,16)) +
  scale_color_brewer(palette="Set1", name="Threads") +
  labs(
    title="SVD Performance: Block Size vs Thread Count",
    subtitle=paste("Dataset:", n_rows, "×", n_cols, "- Hardware:", ram_gb, "GB RAM"),
    x="Number of Blocks (k)",
    y="Elapsed Time (seconds)"
  ) +
  theme_minimal(base_size=14) +
  theme(legend.position="right")

8.3 Interpreting Benchmarks

Look for these patterns:

  1. Optimal k: Time decreases as k decreases, then levels off or increases. The minimum is your optimal k for this dataset/hardware combination.

  2. Threading efficiency: Compare times across thread counts for fixed k. If doubling threads doesn’t halve time, you’re hitting parallelization limits.

  3. Interaction effects: Sometimes optimal k changes with thread count. With few threads, smaller k is best (less overhead). With many threads, larger k helps (more blocks to parallelize).

Example interpretation:

k=2,  threads=1: 48.3s
k=2,  threads=4: 14.2s  (speedup: 3.4×, efficiency: 85%)
k=8,  threads=4: 18.7s
k=16, threads=4: 26.1s

Conclusion: k=2 optimal (data fits in memory), threads=4 good efficiency

9 Real-World Case Study: 1000 Genomes PCA

Let’s walk through complete optimization of a real analysis: PCA on 1000 Genomes data.

Dataset: 2,504 samples × 500 genes from 1000 Genomes Project
Hardware: Standard workstation (16 GB RAM, 4 physical cores)
Goal: Optimize PCA to minimize analysis time while ensuring correctness

9.1 Initial Baseline

# Conservative starting configuration
cat("=== Baseline Configuration ===\n")
baseline_time <- system.time({
  baseline <- bdPCA_hdf5(
    filename = "1000G.hdf5",
    group = "data",
    dataset = "genotypes",
    bcenter = TRUE,
    bscale = FALSE,
    k = 8,    # Conservative blocking
    q = 1,
    threads = 2  # Conservative threading
  )
})

cat("Elapsed time:", baseline_time["elapsed"], "seconds\n")
cat("User time:", baseline_time["user"], "seconds\n")
cat("System time:", baseline_time["system"], "seconds\n\n")

# Typical output:
#   Elapsed time: 23.45 seconds
#   User time: 41.23 seconds
#   System time: 3.12 seconds

Initial observations:

  • User time >> elapsed time: Parallelization working (2 threads doing ~1.8× work)
  • System time moderate (~13% of elapsed): I/O not dominant but noticeable
  • Memory usage: ~4 GB peak (plenty of headroom with 16 GB available)

9.2 Profile to Identify Bottlenecks

cat("=== Profiling to Find Bottlenecks ===\n")
Rprof("pca_profile.out")

bdPCA_hdf5(
  filename = "1000G.hdf5",
  group = "data",
  dataset = "genotypes",
  bcenter = TRUE,
  bscale = FALSE,
  k = 8,
  q = 1,
  threads = 2
)

Rprof(NULL)

# Analyze profile
profile <- summaryRprof("pca_profile.out")
print(profile$by.self[1:10,])

# Typical output shows:
#   bdSVD_hdf5:      60% of time (primary bottleneck)
#   bdNormalize:     25% of time (opportunity for pre-computation)
#   h5read:          12% of time (I/O overhead)
#   Other:           3% of time

Bottleneck identified: SVD computation dominates (60%), normalization is significant (25%), I/O is noticeable (12%).

Optimization strategy based on profiling:

  1. Reduce k to speed up SVD (we have memory headroom)
  2. Increase threads to parallelize SVD better (4 cores available)
  3. Pre-normalize data to eliminate repeated normalization
  4. Consider HDF5 chunking if I/O remains problematic after other optimizations

9.3 Optimization 1: Reduce Block Count

cat("=== Optimization 1: Reduce k ===\n")
opt1_time <- system.time({
  opt1 <- bdPCA_hdf5(
    filename = "1000G.hdf5",
    group = "data",
    dataset = "genotypes",
    bcenter = TRUE,
    bscale = FALSE,
    k = 4,    # Halved from 8
    q = 1,
    threads = 2
  )
})

cat("Elapsed time:", opt1_time["elapsed"], "seconds\n")
improvement1 <- (baseline_time["elapsed"] - opt1_time["elapsed"]) / 
                baseline_time["elapsed"] * 100
cat("Improvement:", round(improvement1, 1), "%\n\n")

# Typical output:
#   Elapsed time: 18.32 seconds
#   Improvement: 21.9%

Why this helped: Reducing from k=8 to k=4 halves the number of block operations. SVD on 4 blocks is faster than SVD on 8 blocks. Memory usage increased from 4GB to 6GB—well within our 16GB capacity.

9.4 Optimization 2: Increase Threading

cat("=== Optimization 2: Increase threads ===\n")
opt2_time <- system.time({
  opt2 <- bdPCA_hdf5(
    filename = "1000G.hdf5",
    group = "data",
    dataset = "genotypes",
    bcenter = TRUE,
    bscale = FALSE,
    k = 4,
    q = 1,
    threads = 4  # Doubled from 2
  )
})

cat("Elapsed time:", opt2_time["elapsed"], "seconds\n")
improvement2 <- (opt1_time["elapsed"] - opt2_time["elapsed"]) / 
                opt1_time["elapsed"] * 100
cat("Additional improvement:", round(improvement2, 1), "%\n")
total_improvement <- (baseline_time["elapsed"] - opt2_time["elapsed"]) / 
                     baseline_time["elapsed"] * 100
cat("Total improvement vs baseline:", round(total_improvement, 1), "%\n\n")

# Calculate threading efficiency
speedup <- opt1_time["elapsed"] / opt2_time["elapsed"]
efficiency <- speedup / 2  # Doubled threads
cat("Threading efficiency:", round(efficiency * 100, 1), "%\n\n")

# Typical output:
#   Elapsed time: 12.84 seconds
#   Additional improvement: 29.9%
#   Total improvement vs baseline: 45.3%
#   Threading efficiency: 71.2%

Why this helped: With k=4, we have 4 blocks that can compute in parallel. Using 4 threads (matching our 4 physical cores) provides good parallelization. Efficiency of 71% is solid—some overhead from coordination and I/O, but most computation parallelizes well.

9.5 Optimization 3: Pre-normalize Data

cat("=== Optimization 3: Pre-normalize ===\n")

# Normalize once
norm_time <- system.time({
  bdNormalize_hdf5(
    filename = "1000G.hdf5",
    group = "data",
    dataset = "genotypes",
    bcenter = TRUE,
    bscale = FALSE
  )
})

cat("Normalization time (one-time cost):", norm_time["elapsed"], "seconds\n")

# Run PCA on pre-normalized data
opt3_time <- system.time({
  opt3 <- bdPCA_hdf5(
    filename = "1000G.hdf5",
    group = "NORMALIZED/data",
    dataset = "genotypes",
    bcenter = FALSE,  # Already normalized
    bscale = FALSE,
    k = 4,
    q = 1,
    threads = 4
  )
})

cat("PCA time (using pre-normalized):", opt3_time["elapsed"], "seconds\n")
improvement3 <- (opt2_time["elapsed"] - opt3_time["elapsed"]) / 
                opt2_time["elapsed"] * 100
cat("Additional improvement:", round(improvement3, 1), "%\n")

final_improvement <- (baseline_time["elapsed"] - opt3_time["elapsed"]) / 
                     baseline_time["elapsed"] * 100
cat("FINAL improvement vs baseline:", round(final_improvement, 1), "%\n\n")

# Typical output:
#   Normalization time: 4.23 seconds
#   PCA time: 9.18 seconds
#   Additional improvement: 28.5%
#   FINAL improvement vs baseline: 60.9%

Why this helped: Profiling showed normalization consumed 25% of time. By pre-normalizing, we eliminate that 25% from each subsequent PCA run. If running only once, the one-time normalization cost negates some benefit. But for multiple runs (common in parameter tuning, cross-validation), savings multiply.

Amortized analysis: If running 10 PCAs with different parameters:

  • Without pre-normalization: 10 × 12.84s = 128.4s total
  • With pre-normalization: 4.23s + 10 × 9.18s = 96.0s total (25% faster)

For 100 runs: 4.23s + 100 × 9.18s = 922s vs 1284s (28% faster)

9.6 Verification of Correctness

cat("=== Verifying Correctness ===\n")

# Compare principal components
pc_match <- all.equal(
  baseline$components[, 1:10],
  opt3$components[, 1:10],
  tolerance = 1e-10
)

cat("Principal components match:", pc_match, "\n")

# Compare variance explained
var_match <- all.equal(
  baseline$variance_prop[1:10],
  opt3$variance_prop[1:10],
  tolerance = 1e-10
)

cat("Variance explained matches:", var_match, "\n\n")

# Typical output:
#   Principal components match: TRUE
#   Variance explained matches: TRUE

Critical step: Always verify optimization doesn’t change results. Numerical differences within tolerance (1e-10) are acceptable—floating-point arithmetic isn’t perfectly deterministic. But results should be essentially identical.
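One subtlety worth guarding against: the sign of each principal component is mathematically arbitrary, so two correct runs can legitimately differ in sign. A sign-insensitive comparison (a sketch, assuming both results expose the $components matrix used above) avoids false alarms:

# Compare components up to sign: TRUE when each column matches directly or negated
compare_components <- function(a, b, n = 10, tol = 1e-8) {
  vapply(seq_len(n), function(j) {
    isTRUE(all.equal(a[, j], b[, j], tolerance = tol)) ||
      isTRUE(all.equal(a[, j], -b[, j], tolerance = tol))
  }, logical(1))
}

all(compare_components(baseline$components, opt3$components))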

9.7 Final Configuration Summary

cat("=== Final Optimization Summary ===\n\n")

summary_df <- data.frame(
  Configuration = c("Baseline", "Opt1: k=4", "Opt2: +threads", "Opt3: +prenorm"),
  Time_seconds = c(
    baseline_time["elapsed"],
    opt1_time["elapsed"],
    opt2_time["elapsed"],
    opt3_time["elapsed"]
  ),
  Speedup = c(
    1.0,
    baseline_time["elapsed"] / opt1_time["elapsed"],
    baseline_time["elapsed"] / opt2_time["elapsed"],
    baseline_time["elapsed"] / opt3_time["elapsed"]
  ),
  Config_details = c(
    "k=8, threads=2",
    "k=4, threads=2",
    "k=4, threads=4",
    "k=4, threads=4, prenorm"
  )
)

print(summary_df)

cat("\nFinal configuration:\n")
cat("  - Pre-normalize data once\n")
cat("  - k = 4 (balance speed/memory)\n")
cat("  - threads = 4 (match physical cores)\n")
cat("  - Result: 60.9% faster than baseline\n")
cat("  - Verified: Results identical to baseline\n")

Key lessons from case study:

  1. Profile identified the right targets: Without profiling, we might have optimized I/O (only 12% of time) instead of SVD (60% of time)
  2. Incremental optimization works: Each step provided measurable benefit (22%, 30%, 29%)
  3. Memory headroom enabled speed: Having 16GB for a 2GB dataset let us reduce k aggressively
  4. Pre-computation pays off: One-time normalization cost amortizes over multiple runs
  5. Verification is essential: Always confirm results unchanged after optimization

10 Interactive Exercise

10.1 Apply Optimization to Your Data

Use this exercise to practice optimization on your own datasets:

# ==========================================
# Exercise 1: Baseline Measurement
# ==========================================

cat("Exercise 1: Establishing Baseline\n")
cat("----------------------------------\n")

# TODO: Replace with your actual file, group, dataset
your_file <- "your_data.hdf5"
your_group <- "your_group"
your_dataset <- "your_dataset"

baseline_time <- system.time({
  baseline_result <- bdPCA_hdf5(
    filename = your_file,
    group = your_group,
    dataset = your_dataset,
    k = 4,
    threads = 2
  )
})

cat("Baseline time:", round(baseline_time["elapsed"], 2), "seconds\n")
cat("User time:", round(baseline_time["user"], 2), "seconds\n")
cat("System time:", round(baseline_time["system"], 2), "seconds\n\n")

# ==========================================
# Exercise 2: Test Different k Values
# ==========================================

cat("Exercise 2: Testing Block Sizes (k)\n")
cat("------------------------------------\n")

k_values <- c(2, 4, 8, 16)
k_times <- numeric(length(k_values))

for (i in seq_along(k_values)) {
  cat("Testing k =", k_values[i], "... ")
  
  t <- system.time({
    bdPCA_hdf5(your_file, your_group, your_dataset,
               k=k_values[i], threads=2)
  })
  
  k_times[i] <- t["elapsed"]
  cat(round(k_times[i], 2), "seconds\n")
}

# Find optimal k
optimal_k <- k_values[which.min(k_times)]
cat("\nOptimal k:", optimal_k, "\n")
cat("Best time:", round(min(k_times), 2), "seconds\n\n")

# Plot k vs time
plot(k_values, k_times, type="b", pch=19,
     xlab="Number of Blocks (k)", ylab="Time (seconds)",
     main="Impact of Block Size on Performance",
     col="blue", lwd=2)
grid()

# ==========================================
# Exercise 3: Test Different Thread Counts
# ==========================================

cat("Exercise 3: Testing Thread Counts\n")
cat("----------------------------------\n")

thread_values <- c(1, 2, 4, 8)
thread_times <- numeric(length(thread_values))

for (i in seq_along(thread_values)) {
  cat("Testing threads =", thread_values[i], "... ")
  
  t <- system.time({
    bdPCA_hdf5(your_file, your_group, your_dataset,
               k=optimal_k, threads=thread_values[i])
  })
  
  thread_times[i] <- t["elapsed"]
  cat(round(thread_times[i], 2), "seconds\n")
}

# Calculate speedup and efficiency
speedup <- thread_times[1] / thread_times
efficiency <- speedup / thread_values

cat("\nThreading Analysis:\n")
results_df <- data.frame(
  Threads = thread_values,
  Time = round(thread_times, 2),
  Speedup = round(speedup, 2),
  Efficiency = paste0(round(efficiency * 100, 1), "%")
)
print(results_df)

# ==========================================
# Exercise 4: Measure Memory Usage
# ==========================================

cat("\nExercise 4: Memory Usage by Configuration\n")
cat("-------------------------------------------\n")

library(pryr)

measure_memory <- function(k_val) {
  # Note: mem_used() tracks R's heap only, not the C++/HDF5 working memory,
  # so treat these figures as rough indicators rather than true peak usage
  mem_before <- mem_used()
  
  bdPCA_hdf5(your_file, your_group, your_dataset,
             k=k_val, threads=2)
  
  mem_after <- mem_used()
  (mem_after - mem_before) / 1024^2  # Convert bytes to MB
}

memory_usage <- sapply(k_values, measure_memory)

cat("\nMemory Usage by k:\n")
mem_df <- data.frame(
  k = k_values,
  Memory_MB = round(memory_usage, 1)
)
print(mem_df)

# ==========================================
# Summary and Recommendations
# ==========================================

cat("\n\n========================================\n")
cat("OPTIMIZATION SUMMARY\n")
cat("========================================\n\n")

cat("Baseline configuration: k=4, threads=2\n")
cat("Baseline time:", round(baseline_time["elapsed"], 2), "seconds\n\n")

cat("Optimal configuration:\n")
optimal_threads <- thread_values[which.min(thread_times)]
cat("  k =", optimal_k, "\n")
cat("  threads =", optimal_threads, "\n")
cat("  time =", round(min(thread_times), 2), "seconds\n\n")

improvement <- (baseline_time["elapsed"] - min(thread_times)) / 
               baseline_time["elapsed"] * 100
cat("Improvement:", round(improvement, 1), "%\n\n")

cat("Threading efficiency:", 
    round(efficiency[which.min(thread_times)] * 100, 1), "%\n")

if (efficiency[which.min(thread_times)] < 0.5) {
  cat("\nNote: Threading efficiency <50% suggests bottleneck\n")
  cat("Possible causes: I/O limitation, memory bandwidth, or BLAS threading\n")
}
Reflection Questions

After completing the exercises, consider:

About k (block size):

  • How does k affect time in your data? Is relationship linear?
  • What’s the memory usage at your optimal k? Could you use smaller k?
  • Does optimal k change with different thread counts?

About threading:

  • What speedup did you achieve with maximum threads?
  • What’s your threading efficiency? Is it >50%?
  • If efficiency is low, what bottleneck might exist? (Check user vs system time)

About your workflow:

  • Could you pre-normalize to save time in repeated runs?
  • Is I/O (system time) significant? (>20% of elapsed time?)
  • Would different HDF5 chunking help if I/O is high?
  • What’s the best tradeoff for your specific needs?

Decision making:

  • Which configuration would you use for production? Why?
  • How would your choice change if RAM was limited?
  • How would your choice change if speed was critical?
  • What evidence supports your configuration choice?

11 Key Takeaways

Let’s consolidate essential principles that guide effective performance optimization.

Optimization is about tradeoffs, not absolutes. There’s no universally “best” configuration—optimal settings depend on your dataset size, available hardware, and whether memory or speed limits you more. Understanding why these tradeoffs exist helps you make intelligent decisions for your specific situation. When memory is tight, accept slower computation with more blocking. When speed matters and memory is abundant, minimize blocking for faster execution.

Profile before optimizing. Intuition about bottlenecks is frequently wrong. You might assume I/O dominates when actually computation takes 80% of time, or vice versa. Profiling with system.time(), Rprof(), or profvis reveals where time actually goes. Focus optimization effort on operations consuming >20% of total time—optimizing minor contributors wastes effort with minimal benefit.

Block size (k) controls the fundamental memory-speed tradeoff. Smaller k means fewer, larger blocks—faster but more RAM required. Larger k means many small blocks—slower but memory-efficient. Start with k that uses 50-70% of available RAM. If hitting memory limits, increase k. If memory usage is low and speed matters, decrease k. The relationship is roughly linear: doubling k halves memory usage and increases time by 20-50% depending on operation intensity.

Threading scales only when operations parallelize. More threads help when work divides independently (processing blocks in parallel, parallel BLAS operations). Threading doesn’t help when operations are sequential, I/O-bound, or memory-bandwidth-limited. If 4 threads provide <2× speedup compared to 1 thread, you’ve hit a fundamental bottleneck—adding more threads won’t help. Common bottlenecks: disk I/O (all threads wait for disk), memory bandwidth (all threads compete for RAM access), Amdahl’s law (inherent sequential portions limit parallelism).

HDF5 chunk size affects I/O performance dramatically. Optimal chunking aligns chunks with your access patterns—row-wise chunks for row-wise operations, column-wise chunks for column-wise operations. Target 1-10 MB chunks that match typical block sizes. If profiling shows high system time (>30% of elapsed time), investigate chunk size. Proper chunking can reduce I/O time by 50-80% for I/O-heavy operations.

Leverage HDF5 persistence to avoid redundant computation. Operations like normalization, centering, and scaling are deterministic—same input always produces same output. Compute these operations once, store results in HDF5, and reuse them in subsequent analyses. For workflows with repeated analyses (parameter sweeps, cross-validation, bootstrapping), pre-computation saves enormous time. A one-time normalization cost amortizes over hundreds of runs.

Always verify correctness after optimization. Faster incorrect results are worse than slower correct results. After each optimization, verify results match baseline within numerical tolerance. Test with all.equal(result1, result2, tolerance=1e-10). If results differ beyond floating-point noise, optimization introduced a bug—revert and investigate.

Iterate systematically. Start with conservative configuration guaranteed to complete successfully. Profile to identify bottlenecks. Optimize the biggest bottleneck. Measure improvement. Verify correctness. Repeat. Change one parameter at a time so you know what helped. Keep what works, revert what doesn’t. Document your final configuration and the reasoning behind it.


12 Next Steps

Apply to your data:

  • Profile your current workflows to identify where time goes
  • Experiment with different k and thread values systematically
  • Measure actual improvements—don’t assume
  • Verify results remain correct after each optimization

Explore advanced optimization:

  • Study how different operations scale (SVD vs matrix multiplication vs statistical tests)
  • Experiment with HDF5 chunking if I/O is your bottleneck
  • Consider hybrid strategies: pre-compute expensive operations, cache intermediate results
  • Learn about your BLAS library configuration (single-threaded vs multi-threaded)

Deepen understanding:

Share knowledge:

  • Document configurations that work well for your datasets and hardware
  • Share benchmarking results with colleagues analyzing similar data
  • Contribute performance findings to the community
  • Help others optimize their workflows based on your experience

Performance optimization is empirical and iterative. What works for one dataset on one machine might not work for another dataset on different hardware. The principles here provide a systematic framework for discovering what works in your environment. Apply them methodically, measure carefully, and you’ll achieve substantial improvements in your BigDataStatMeth workflows.