Getting Started
Installation, First Steps, and the HDF5Matrix Object
1 Overview
BigDataStatMeth enables statistical analysis on datasets too large for RAM by using HDF5 file storage and block-wise processing. This tutorial guides you through installation, introduces the fundamental object that makes everything work, and shows your first operations, so that you have a solid working foundation before diving into more complex analyses.
Think of this as setting up your laboratory bench before starting experiments. You’ll not only verify that all your tools work correctly, but also understand the conceptual model that underlies every operation in the package.
1.1 What You’ll Learn
By the end of this tutorial, you will:
- Install BigDataStatMeth and all required dependencies correctly
- Understand the HDF5Matrix object and how it differs from in-memory matrices
- Create, open, inspect, and close HDF5Matrix objects
- Access data efficiently using subsetting and conversion to memory
- Configure global options for parallelization, block size, and compression
- Perform basic matrix operations using standard R syntax on disk-backed data
- Know where to find help when you encounter issues
- Be prepared for more advanced tutorials
2 Prerequisites
2.1 System Requirements
Operating System:
- Linux (recommended)
- macOS
- Windows (requires Rtools; see below)
R Version:
- R ≥ 4.1.0 (check with R.version.string)
RAM:
- Minimum: 4 GB
- Recommended: 8+ GB for comfortable work
Disk Space:
- ~500 MB for package and dependencies
- Additional space for your HDF5 data files
BigDataStatMeth is compiled from C++ source code. Windows users must install Rtools before installing the package. Download the Rtools version that matches your R installation from CRAN.
3 Step 1: Install Dependencies
BigDataStatMeth requires packages from both CRAN and Bioconductor.
3.1 Install BiocManager
# BiocManager helps install Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
3.2 Install Required Packages
BigDataStatMeth’s only Bioconductor dependency is Rhdf5lib, which provides the HDF5 C library used at compilation. All other dependencies are standard CRAN packages.
# Bioconductor: HDF5 C library (required for compilation)
BiocManager::install("Rhdf5lib")
# CRAN packages (installed automatically, but explicit install avoids surprises)
install.packages(c("R6", "Rcpp", "RcppEigen", "RCurl", "data.table"))
When installing the CRAN release with install.packages("BigDataStatMeth"), all dependencies, including Rhdf5lib, are resolved and installed automatically.
When installing the GitHub version with devtools::install_github(), Rhdf5lib may not be pulled in automatically. Install it first with BiocManager::install("Rhdf5lib") to avoid compilation errors.
If installation fails:
- Update R: Some packages require recent R versions
- Update BiocManager: Run BiocManager::install() with no arguments
- Check compilation tools: Especially on Windows/macOS
- Installation logs: Look for specific error messages about missing libraries
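The checklist above can be partly scripted. The sketch below is one way to collect the relevant facts before installing. Note that `check_build_env()` is a hypothetical helper, not part of BigDataStatMeth, and that treating a `make` executable on the PATH as evidence of a working toolchain is only a rough assumption.

```r
# Hypothetical helper: gathers the prerequisites named in this tutorial.
check_build_env <- function(min_r = "4.1.0") {
  list(
    r_version       = as.character(getRversion()),
    # The tutorial asks for R >= 4.1.0
    r_ok            = getRversion() >= min_r,
    # Assumption: 'make' on the PATH is a rough proxy for build tools
    has_make        = unname(nzchar(Sys.which("make"))),
    # requireNamespace() returns FALSE (without error) if a package is missing
    has_biocmanager = requireNamespace("BiocManager", quietly = TRUE),
    has_rhdf5lib    = requireNamespace("Rhdf5lib", quietly = TRUE)
  )
}

env <- check_build_env()
str(env)
```

If `r_ok` or `has_make` comes back FALSE, that is probably the first thing to fix before retrying the installation.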
4 Step 2: Install BigDataStatMeth
4.1 From CRAN (Recommended — Stable Version)
The stable version is available on CRAN:
install.packages("BigDataStatMeth")
This installs the latest stable, tested release.
4.2 From GitHub (Development Version)
For the latest development features:
# Install devtools if needed
if (!requireNamespace("devtools", quietly = TRUE))
  install.packages("devtools")
# Install development version from GitHub
devtools::install_github("isglobal-brge/BigDataStatMeth")
Use CRAN version if:
- You want maximum stability
- You’re doing production analysis
- You prefer well-tested releases
Use GitHub version if:
- You need the latest features
- You’re contributing to development
- You want to test new functionality
5 Step 3: Load and Verify
5.1 Load the Package
library(BigDataStatMeth)
If no errors appear, the package loaded successfully!
5.2 Quick Verification
Run this simple test to verify everything works:
# Create a small test matrix
test_matrix <- matrix(rnorm(100), nrow = 10, ncol = 10)
# Create an HDF5Matrix object — the fundamental object in BigDataStatMeth
test_file <- "verification_test.hdf5"
X_test <- hdf5_create_matrix(
  filename = test_file,
  dataset = "test/data",
  data = test_matrix,
  overwrite = TRUE
)
# The object knows its dimensions without loading data into RAM
if (all(dim(X_test) == c(10, 10))) {
  cat("✓ Installation verified!\n")
  cat("✓ HDF5Matrix created:", dim(X_test)[1], "×", dim(X_test)[2], "\n")
  close(X_test)
  file.remove(test_file)
} else {
  cat("✗ Installation issue: unexpected dimensions\n")
}
Expected output (the final [1] TRUE is the return value of file.remove()):
✓ Installation verified!
✓ HDF5Matrix created: 10 × 10
[1] TRUE
Common issues:
- “Package not found”: Restart R session
- “HDF5 library error”: Reinstall Rhdf5lib with BiocManager::install("Rhdf5lib"), then recompile BigDataStatMeth
- Permission denied: Check write permissions in working directory
- Symbol not found: Recompile package from source
Try sessionInfo() to check loaded packages and versions. For function-level help: ?hdf5_create_matrix.
6 Step 4: The HDF5Matrix Object
Before working with real data, you need to understand the building block of BigDataStatMeth: the HDF5Matrix object. This is the conceptual shift from earlier versions of the package — instead of calling functions with file paths and dataset names every time, you create an object once and then work with it using standard R syntax.
Think of an HDF5Matrix as a window onto data that lives on disk. The data doesn’t move into RAM until you explicitly ask for it — but the object itself knows where the data is, how big it is, and can operate on it block by block. To R code, it behaves like a regular matrix.
6.1 Creating an HDF5Matrix
hdf5_create_matrix() writes data to an HDF5 file and returns an HDF5Matrix object pointing to it. The dataset argument combines the group path and dataset name using HDF5’s standard /group/dataset convention:
set.seed(42)
A <- matrix(rnorm(300 * 80), nrow = 300, ncol = 80)
example_file <- "hdf5matrix_intro.hdf5"
A_h5 <- hdf5_create_matrix(
  filename = example_file,
  dataset = "data/A",
  data = A,
  overwrite = TRUE
)
# Printing the object shows where it lives and how big it is
A_h5
HDF5Matrix object
File: hdf5matrix_intro.hdf5
Path: data/A
Dimensions: 300 x 80
Type:
Status: OPEN
# Standard R accessors work directly
dim(A_h5)
[1] 300 80
nrow(A_h5)
[1] 300
ncol(A_h5)
[1] 80
The dataset argument in hdf5_create_matrix() uses the path "group/dataset". Groups are like folders inside the HDF5 file — they are created automatically if they don’t exist yet. So "data/A" creates a group called data containing a dataset called A.
This means you can organize multiple datasets in a single file just by choosing their paths:
hdf5_create_matrix(file, "raw/genotypes", data = geno)
hdf5_create_matrix(file, "raw/expression", data = expr)
hdf5_create_matrix(file, "results/pca", data = scores)
6.2 Accessing Data
Subsetting an HDF5Matrix uses standard R bracket syntax. Crucially, only the requested block is read from disk — the rest stays on disk untouched:
# Read a 5×6 block — only this block is loaded into RAM
A_h5[1:5, 1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1.3709584 -0.004620768 -0.2484829 0.94192422 -0.74651645 -0.6013830
[2,] -0.5646982 0.760242168 0.4223204 -0.24861404 0.03660612 -0.1358161
[3,] 0.3631284 0.038990913 0.9876533 0.09647886 0.32330962 -0.9872728
[4,] 0.6328626 0.735072142 0.8355682 -0.43393094 0.37967603 0.8319250
[5,] 0.4042683 -0.146472627 -0.6605219 2.17866787 0.87655650 -0.7950595
When you need an in-memory copy — for visualization, export, or passing to other R functions — use as.matrix(). Subset on disk first whenever possible:
# Subset on disk first, then bring to memory — efficient
small_block <- as.matrix(A_h5[1:10, 1:5])
dim(small_block)
[1] 10 5
# Verify values match the original
all.equal(small_block, A[1:10, 1:5])
[1] TRUE
Use as.matrix() for:
- Small results: Singular values, summary statistics, final scores
- Visualization: ggplot2 and base R plots need in-memory data
- Export: Writing results to CSV or other formats
Avoid as.matrix() on the full object if your matrix is large. The whole point of HDF5Matrix is that data stays on disk until needed. Subset first: as.matrix(X[rows, cols]), not as.matrix(X).
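A quick size estimate makes this advice easier to apply. The sketch below assumes 8 bytes per element (double precision) and relies only on dim(), which the tutorial shows working on HDF5Matrix objects. `estimate_mb()` and `to_memory()` are hypothetical helpers, and an ordinary matrix stands in for a disk-backed object so the example runs without any HDF5 file.

```r
# Hypothetical helper: approximate in-memory size of a matrix-like object.
# Assumes double precision (8 bytes per element).
estimate_mb <- function(x, bytes_per_element = 8) {
  prod(dim(x)) * bytes_per_element / 1024^2
}

# Hypothetical guard: refuse to materialize objects above a size limit.
to_memory <- function(x, limit_mb = 500) {
  size_mb <- estimate_mb(x)
  if (size_mb > limit_mb) {
    stop(sprintf("Refusing to load ~%.1f MB (limit %.1f MB); subset first.",
                 size_mb, limit_mb))
  }
  as.matrix(x)
}

m <- matrix(0, nrow = 1000, ncol = 500)  # stand-in for a disk-backed object
estimate_mb(m)                           # 1000 * 500 * 8 bytes, about 3.8 MB
small <- to_memory(m[1:10, 1:5])         # subset first, then convert
dim(small)
```

The limit of 500 MB is arbitrary; a sensible value depends on how much RAM the rest of the session needs.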
6.3 Inspecting the File
list_datasets() shows all datasets currently stored in an HDF5 file. You can pass either the file path or an HDF5Matrix object:
# By file path: lists everything in the file
list_datasets(example_file)
[1] "data/A"
# By HDF5Matrix object: lists datasets in its group
list_datasets(A_h5)
[1] "A"
list_datasets() accepts either the HDF5 file path (to see everything in the file) or an HDF5Matrix object (to see the datasets in its group). Both forms are useful: the file path gives you a global picture; the object form scopes the view to where you are currently working.
6.4 Reopening an Existing Dataset
If you already have an HDF5 file with data from a previous session, hdf5_matrix() opens an existing dataset and returns an HDF5Matrix object:
# Open an existing dataset — no data is read into RAM yet
A_reopen <- hdf5_matrix(example_file, "data/A")
dim(A_reopen)
[1] 300 80
This is the pattern you’ll use when loading results from a previous analysis run.
6.5 Global Options
BigDataStatMeth provides a global options system to configure parallelization, block size, and compression across all operations. This is particularly important for operators like +, -, *, / where you can’t pass extra arguments — the global options control their behaviour.
# View the current defaults
old_opts <- hdf5matrix_options()
old_opts
$paral
NULL
$block_size
NULL
$threads
NULL
$compression
NULL
The four options are:
| Option | Default | Effect |
|---|---|---|
| paral | NULL (auto-detect) | Enable OpenMP parallelization |
| threads | NULL (auto-detect) | Number of parallel threads |
| block_size | NULL (auto-calculate) | Elements per I/O block |
| compression | NULL → uses 6 | gzip compression level (0–9) |
Set them for a session or restore them after local overrides:
# Configure for a parallel analysis session
hdf5matrix_options(
  paral = TRUE,
  threads = 2L,
  block_size = 512L,
  compression = 6L
)
hdf5matrix_options()
$paral
[1] TRUE
$block_size
[1] 512
$threads
[1] 2
$compression
[1] 6
The compression option controls gzip compression for all output datasets. Higher levels produce smaller files but take longer to write. Level 6 (the default) is a balanced choice for most workflows.
The benchmark below illustrates the trade-off on a moderate-sized matrix:
set.seed(123)
X_bench <- round(matrix(rnorm(2000 * 200), 2000, 200), 2)
# No compression
f_none <- tempfile(fileext = ".h5")
t0 <- system.time(
  hdf5_create_matrix(f_none, "data/X",
                     data = X_bench, compression = 0, overwrite = TRUE)
)
# Default compression (level 6)
f_def <- tempfile(fileext = ".h5")
t6 <- system.time(
  hdf5_create_matrix(f_def, "data/X",
                     data = X_bench, compression = 6, overwrite = TRUE)
)
data.frame(
  compression = c(0, 6),
  write_time_s = c(t0[["elapsed"]], t6[["elapsed"]]),
  file_size_MB = round(file.info(c(f_none, f_def))$size / 1024^2, 3)
)
  compression write_time_s file_size_MB
1           0        0.001        3.054
2           6        0.106        0.731
The exact numbers depend on your hardware and data, but the pattern is consistent: compression trades write time for file size. In the run above, level 6 shrank the file by about 76% (3.054 MB to 0.731 MB) and added roughly 0.1 s of write time. For interactive analysis on a laptop, level 6 is a sensible default. For high-throughput pipelines writing many large datasets, consider level 1 or 2.
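How much compression helps depends heavily on the data itself. The base R sketch below illustrates this with memCompress() as a rough proxy for gzip: it compares the compressed size of raw random doubles against the same values rounded to two decimals, as in the benchmark's X_bench. This is only an approximation of what happens inside an HDF5 file, since HDF5 compresses chunk by chunk through its own filter pipeline; `gzip_size()` is a hypothetical helper.

```r
set.seed(123)
x_raw     <- rnorm(50000)
x_rounded <- round(x_raw, 2)   # same rounding as X_bench in the benchmark

# Hypothetical helper: size in bytes after gzip-compressing the binary values
gzip_size <- function(x) {
  bytes <- writeBin(x, raw())                 # 8 bytes per double
  length(memCompress(bytes, type = "gzip"))
}

sizes <- c(
  uncompressed = length(writeBin(x_raw, raw())),
  gzip_raw     = gzip_size(x_raw),
  gzip_rounded = gzip_size(x_rounded)
)
sizes
round(sizes / sizes[["uncompressed"]], 2)     # compressed size as a fraction
```

Rounded values repeat often, so gzip finds far more redundancy in them than in full-precision noise. Expect smaller gains than in the benchmark when your real data are high-precision measurements.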
# Restore original defaults for the rest of the tutorial
hdf5matrix_options(
  paral = old_opts$paral,
  threads = old_opts$threads,
  block_size = old_opts$block_size,
  compression = old_opts$compression
)
6.6 Releasing Resources
HDF5 files remain open while their HDF5Matrix objects exist, which improves performance for repeated access. Release handles explicitly when you’re done:
# Close a single object
close(A_reopen)
# Close all open HDF5Matrix handles at once
hdf5_close_all()
R’s garbage collector will release HDF5 handles automatically when objects go out of scope. Explicit close() or hdf5_close_all() is good practice at the end of analysis sections or before re-running code interactively, because it avoids accumulating stale handles between runs.
7 Step 5: Your First Analysis Dataset
Now that you understand the HDF5Matrix object, let’s create a realistic dataset and explore it.
7.1 Create Sample Data
# Simulate a genomic dataset: 1,000 samples × 5,000 SNPs
set.seed(123)
n_samples <- 1000
n_snps <- 5000
genotype_data <- matrix(
  sample(0:2, n_samples * n_snps, replace = TRUE),
  nrow = n_samples,
  ncol = n_snps
)
# Add meaningful row/column names
rownames(genotype_data) <- paste0("Sample_", 1:n_samples)
colnames(genotype_data) <- paste0("SNP_", 1:n_snps)
# Check in-memory size
format(object.size(genotype_data), units = "MB")
[1] "19.5 Mb"
7.2 Create an HDF5Matrix from the Data
data_file <- "my_first_dataset.hdf5"
geno_h5 <- hdf5_create_matrix(
  filename = data_file,
  dataset = "genotypes/data",
  data = genotype_data,
  overwrite = TRUE
)
cat("✓ HDF5Matrix created\n")
✓ HDF5Matrix created
geno_h5
HDF5Matrix object
File: my_first_dataset.hdf5
Path: genotypes/data
Dimensions: 1000 x 5000
Type:
Status: OPEN
HDF5 files organize data hierarchically:
- File: my_first_dataset.hdf5
  - Group: genotypes (like a folder)
    - Dataset: data (the actual matrix)
The HDF5Matrix object holds a reference to this location. No data is in RAM — only the pointer to where the data lives on disk.
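Because the "group/dataset" convention mirrors file system paths, base R's path helpers can illustrate how such a string decomposes. `split_h5_path()` below is a hypothetical helper for illustration only. HDF5 itself allows nested groups, but whether hdf5_create_matrix() accepts deeper paths like the second example is not shown in this tutorial, so treat that case as an assumption.

```r
# Hypothetical helper: split an HDF5-style path into its group and dataset.
split_h5_path <- function(path) {
  list(
    group   = dirname(path),   # everything before the last "/"
    dataset = basename(path)   # the final path component
  )
}

split_h5_path("genotypes/data")
# group "genotypes", dataset "data"

split_h5_path("raw/batch1/expression")
# group "raw/batch1", dataset "expression" (nested group, assumed supported)
```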
7.3 Inspect and Access the Data
# List what's in the file
list_datasets(data_file)
[1] "genotypes/.data_dimnames/1" "genotypes/.data_dimnames/2"
[3] "genotypes/data"
# Dimensions: answered instantly without reading the data
dim(geno_h5)
[1] 1000 5000
# Read a small subset; only this block is loaded
geno_h5[1:5, 1:10]
SNP_1 SNP_2 SNP_3 SNP_4 SNP_5 SNP_6 SNP_7 SNP_8 SNP_9 SNP_10
Sample_1 2 1 2 2 2 0 1 0 2 0
Sample_2 2 2 1 2 2 1 0 0 0 2
Sample_3 2 1 0 0 2 2 1 0 0 0
Sample_4 1 0 0 0 2 1 0 0 1 1
Sample_5 2 2 0 1 0 2 0 2 2 2
7.4 Verify the Data Stored Correctly
# Bring a small block to memory and compare with original
block_hdf5 <- as.matrix(geno_h5[1:5, 1:10])
block_orig <- genotype_data[1:5, 1:10]
all.equal(block_hdf5, block_orig)
[1] TRUE
If you see output like "Attributes: < Component 'dimnames': ...>", the matrices are numerically identical but have different attributes. This is expected — dimension names are stored separately in HDF5. The numeric values are what matter for calculations. You can verify explicitly:
all.equal(as.numeric(block_hdf5), as.numeric(block_orig))
[1] TRUE
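The attribute message described above is easy to reproduce with two in-memory matrices, which also shows an alternative to as.numeric(): the check.attributes argument of all.equal(). The sketch uses small made-up matrices, not the tutorial's data.

```r
with_names <- matrix(1:6, nrow = 2,
                     dimnames = list(c("r1", "r2"), c("a", "b", "c")))
without_names <- unname(with_names)   # same values, dimnames dropped

# all.equal() returns TRUE or a character description of the differences,
# so wrap it in isTRUE() whenever it is used as a condition.
strict <- all.equal(with_names, without_names)
strict                                # a message about attributes, not TRUE
isTRUE(strict)                        # FALSE

# Ignore attributes such as dimnames and compare values only
isTRUE(all.equal(with_names, without_names, check.attributes = FALSE))  # TRUE
```

Compared with as.numeric(), check.attributes = FALSE keeps the matrix shape in the comparison; the flattening approach only compares the values in column-major order.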
8 Step 6: Basic Operations
With an HDF5Matrix object in hand, standard R operators work directly on disk-backed data. No new syntax to learn — BigDataStatMeth dispatches these to block-wise implementations transparently.
8.1 Matrix Multiplication
# Two matrices to multiply
set.seed(456)
A <- matrix(rnorm(500 * 100), nrow = 500, ncol = 100)
B <- matrix(rnorm(100 * 200), nrow = 100, ncol = 200)
ops_file <- "operations_example.hdf5"
A_h5 <- hdf5_create_matrix(ops_file, "matrices/A", data = A, overwrite = TRUE)
B_h5 <- hdf5_create_matrix(ops_file, "matrices/B", data = B, overwrite = TRUE)
# Standard R matrix multiplication — executed block-wise on disk
M_h5 <- A_h5 %*% B_h5
cat("Result dimensions:", dim(M_h5), "\n")
Result dimensions: 500 200
cat("Result preview (first 5×5):\n")
Result preview (first 5×5):
print(as.matrix(M_h5[1:5, 1:5]))
[,1] [,2] [,3] [,4] [,5]
[1,] 3.966529 -1.596117 -10.575227 -1.523170 13.37335
[2,] 10.496300 -4.726875 16.389046 3.817424 16.47317
[3,] -8.331547 -9.299947 -3.084136 4.030101 11.75505
[4,] -5.159786 2.319629 2.815899 -3.407731 -5.72410
[5,] 3.384955 18.492893 1.697668 -13.890756 10.48865
# Verify against in-memory computation
all.equal(as.matrix(M_h5), A %*% B)
[1] TRUE
Notice that A_h5 %*% B_h5 uses exactly the same syntax as in-memory matrix multiplication. Behind the scenes, BigDataStatMeth partitioned both matrices into blocks, multiplied each pair, accumulated the result, and wrote it to disk — all without you managing a single block boundary.
The same code works unchanged on truly large matrices (100 GB+); it simply takes longer.
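To make the partition, multiply, and accumulate description concrete, here is a minimal in-memory sketch of block-wise multiplication in base R. It is not the package's implementation, which reads blocks from disk and writes the result back to the HDF5 file; `blockwise_matmul()` and its `block_size` argument are illustrative names, and the blocks here are ordinary sub-matrices.

```r
# Illustrative only: multiply A (n x k) by B (k x m) one block at a time.
blockwise_matmul <- function(A, B, block_size = 128L) {
  stopifnot(ncol(A) == nrow(B))
  C <- matrix(0, nrow = nrow(A), ncol = ncol(B))
  for (i in seq(1L, nrow(A), by = block_size)) {
    rows <- i:min(i + block_size - 1L, nrow(A))
    for (k in seq(1L, ncol(A), by = block_size)) {
      inner <- k:min(k + block_size - 1L, ncol(A))
      # Each partial product only touches one block of A and one block of B;
      # partial results are accumulated into the matching rows of C.
      C[rows, ] <- C[rows, , drop = FALSE] +
        A[rows, inner, drop = FALSE] %*% B[inner, , drop = FALSE]
    }
  }
  C
}

set.seed(456)
A_mem <- matrix(rnorm(500 * 100), nrow = 500, ncol = 100)
B_mem <- matrix(rnorm(100 * 200), nrow = 100, ncol = 200)

C_block <- blockwise_matmul(A_mem, B_mem, block_size = 64L)
isTRUE(all.equal(C_block, A_mem %*% B_mem))   # TRUE
```

The result agrees with the direct product up to floating-point rounding, because the block sums add the same terms in a different order. At any moment only one block of each input is needed, which is what makes the disk-backed version possible.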
8.2 Crossproduct
A crossproduct computes t(A) %*% A — that is, the transpose of a matrix multiplied by itself. The result is a square symmetric matrix whose entries are dot products between columns of A. This operation is at the heart of many statistical methods: it appears in PCA (as the covariance structure), in ordinary least squares (as the normal equations (XᵀX)β = Xᵀy), and in any method that needs pairwise column similarities. For large matrices the block-wise implementation is essential because the input is never fully loaded into RAM, yet the result is exact.
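The reason a crossproduct can be computed without ever holding the full input is that it decomposes over row blocks: XᵀX is the sum of X_bᵀX_b over the blocks X_b, and Xᵀy decomposes the same way. The base R sketch below accumulates both pieces block by block and then solves the normal equations. It uses simulated in-memory data and illustrative variable names, and is not the package's implementation.

```r
set.seed(1)
n <- 1000
p <- 5
X_mem <- cbind(1, matrix(rnorm(n * (p - 1)), nrow = n))  # intercept + 4 predictors
beta_true <- c(2, -1, 0.5, 0, 3)
y <- drop(X_mem %*% beta_true + rnorm(n))

block_size <- 200L
XtX <- matrix(0, p, p)
Xty <- numeric(p)
for (start in seq(1L, n, by = block_size)) {
  rows <- start:min(start + block_size - 1L, n)
  Xb <- X_mem[rows, , drop = FALSE]          # only this block is "loaded"
  XtX <- XtX + crossprod(Xb)                 # accumulate t(Xb) %*% Xb
  Xty <- Xty + drop(crossprod(Xb, y[rows]))  # accumulate t(Xb) %*% y_b
}

# Solve the normal equations (XtX) beta = Xty
beta_hat <- solve(XtX, Xty)

isTRUE(all.equal(XtX, crossprod(X_mem)))                        # TRUE
isTRUE(all.equal(beta_hat, unname(lm.fit(X_mem, y)$coefficients)))  # TRUE
```

Solving the normal equations directly is fine for a well-conditioned example like this one; for ill-conditioned designs, QR-based fitting (as lm.fit() uses) is numerically safer.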
crossprod() accepts optional outgroup and outdataset arguments to control where the result is written inside the HDF5 file:
# t(A) %*% A — with explicit output location
XtX_h5 <- crossprod(
  A_h5,
  outgroup = "results",
  outdataset = "A_crossprod"
)
cat("Crossproduct dimensions:", dim(XtX_h5), "\n")
Crossproduct dimensions: 100 100
cat("Preview (first 5×5):\n")
Preview (first 5×5):
print(as.matrix(XtX_h5[1:5, 1:5]))
[,1] [,2] [,3] [,4] [,5]
[1,] 475.947503 -26.983942 1.705196 18.276113 22.03469
[2,] -26.983942 488.288982 -24.322274 6.712965 -14.34956
[3,] 1.705196 -24.322274 498.798075 9.800220 52.82859
[4,] 18.276113 6.712965 9.800220 511.565405 -26.92909
[5,] 22.034686 -14.349559 52.828589 -26.929093 438.22361
# Verify
all.equal(as.matrix(XtX_h5), crossprod(A))
[1] TRUE
# See everything written to the file so far
list_datasets(ops_file)
[1] "OUTPUT/A_x_B" "matrices/A" "matrices/B"
[4] "results/A_crossprod"
9 Step 7: Clean Up
HDF5 files keep file handles open as long as the HDF5Matrix objects that point to them are alive. This is intentional — keeping handles open avoids the overhead of opening and closing the file on every operation, which matters when you call dozens of operations in sequence. But it means that at the end of an analysis session, or before re-running code interactively, you should release those handles explicitly.
hdf5_close_all() closes every HDF5 handle currently tracked by the package in one call. It is the safe way to ensure that no file is left locked after you are done:
hdf5_close_all()
After hdf5_close_all(), calling gc() is good practice. R’s garbage collector runs finalizers for objects that are no longer referenced, which releases any remaining C++-level resources held by HDF5Matrix objects that have gone out of scope but whose finalizers have not yet run:
gc()
used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 929515 49.7 1783064 95.3 NA 1783064 95.3
Vcells 4588701 35.1 10146329 77.5 36864 8193715 62.6
Call hdf5_close_all() in these situations:
- End of an analysis script: ensures no file stays locked after the script finishes.
- Before re-running code interactively: prevents “file already open” or “dataset exists” errors from stale handles left by the previous run.
- When disk space seems unexpectedly large: HDF5 files can hold free space internally from deleted datasets; closing handles properly lets the file system account for the current state.
The combination hdf5_close_all(); gc() is the reliable reset for a clean slate.
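In scripts, one way to make sure this reset always happens, even when an analysis step fails, is base R's on.exit(). The sketch below uses a hypothetical wrapper, `with_cleanup()`, and demonstrates it with a simple counter instead of real HDF5 handles so that it runs without the package. With BigDataStatMeth loaded, hdf5_close_all could presumably be passed as the `cleanup` argument.

```r
# Hypothetical wrapper: run `task`, then always run `cleanup` followed by gc(),
# whether `task` succeeds or raises an error.
with_cleanup <- function(task, cleanup) {
  on.exit({
    cleanup()
    invisible(gc())
  }, add = TRUE)
  task()
}

# Stand-in demonstration: count how often the cleanup function runs
state <- new.env()
state$closed <- 0
count_close <- function() state$closed <- state$closed + 1

ok <- with_cleanup(function() sum(1:10), count_close)
failed <- tryCatch(
  with_cleanup(function() stop("analysis failed"), count_close),
  error = function(e) conditionMessage(e)
)

ok            # 55
failed        # "analysis failed"
state$closed  # 2: cleanup ran after the success and after the error
```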
# Remove tutorial files
file.remove(data_file, ops_file,
            "hdf5matrix_intro.hdf5",
            f_none, f_def)
[1] FALSE FALSE FALSE FALSE FALSE
10 Interactive Exercise
10.1 Practice: Creating Your Own HDF5 Workflow
Now that you’ve seen the basic operations, try designing a small workflow with data relevant to your work.
# Exercise: Create a mini-analysis workflow
# 1. Generate or load your own data
my_data <- matrix(rnorm(1000 * 500), nrow = 1000, ncol = 500)
# Replace with: read.csv(), or data from your field
# 2. Create an HDF5Matrix with meaningful names
X <- hdf5_create_matrix(
  filename = "my_analysis.hdf5",
  dataset = "raw_data/measurements", # Choose a descriptive path
  data = my_data,
  overwrite = TRUE
)
# 3. Inspect what you created
X
list_datasets("my_analysis.hdf5")
# 4. Perform an operation
result <- crossprod(X, outgroup = "processed", outdataset = "XtX")
# 5. Bring a small block of the result to memory and inspect it
top_block <- as.matrix(result[1:5, 1:5])
print(top_block)
# 6. Clean up
hdf5_close_all()
As you work through this exercise, consider:
1. File Organization:
   - How are you structuring your HDF5 file? (groups and datasets)
   - If you added more data tomorrow, where would it go?
   - Could someone else understand your organization scheme?
2. Memory vs. Disk:
   - At what data size would your operation fail in memory?
   - How much disk space does your HDF5 file use?
   - Compare: object size in R (object.size(my_data)) vs. file size on disk
3. The HDF5Matrix Object:
   - At which point do you read data into RAM in this workflow?
   - What happens if you call as.matrix(X) on the full matrix?
   - How does X[1:10, 1:5] differ from as.matrix(X)[1:10, 1:5]?
4. Operation Choice:
   - Why use crossprod(X) instead of t(X) %*% X?
   - What statistical question does this operation answer?
   - Could you achieve the same result differently?
5. Error Handling:
   - What happens if you try to create a dataset that already exists?
   - Try it: What does the error message tell you?
   - How do you fix it? (overwrite = TRUE or a different path?)
6. Scaling Up:
   - Your test used 1,000 × 500. What about 100,000 × 50,000?
   - Which operations would work unchanged?
   - Would you adjust compression or thread settings for very large data?
Don’t worry about “correct” answers — the goal is developing intuition about when and how to use these tools.
11 What You’ve Accomplished
✅ Installed BigDataStatMeth and all dependencies
✅ Understood the HDF5Matrix object and its role in the package
✅ Created HDF5Matrix objects from R data
✅ Accessed data with subsetting and as.matrix()
✅ Configured global options including compression
✅ Performed block-wise operations with standard R syntax
✅ Verified that results match in-memory computations
12 Next Steps
Now that you have BigDataStatMeth working and understand the HDF5Matrix paradigm, continue learning:
Continue the tutorial series:
- ✅ Getting Started (you are here)
- → Working with HDF5 Matrices — File operations, data import, and management
- → Your First Analysis — Complete workflow from raw data to results
Explore practical workflows:
- Implementing PCA — Principal Component Analysis on genomic data
- Implementing CCA — Canonical Correlation Analysis
Dive deeper into concepts:
- Understanding HDF5 — How HDF5 storage works
- Block-Wise Computing — Algorithms behind the scenes
13 Getting Help
If you encounter issues:
- Check documentation: ?hdf5_create_matrix, ?hdf5matrix_options, ?svd.HDF5Matrix
- Review examples: Package vignettes contain working code
- GitHub Issues: Report bugs at isglobal-brge/BigDataStatMeth
# Check package version
packageVersion("BigDataStatMeth")
# View all loaded packages
sessionInfo()
# Check current global options
hdf5matrix_options()
# Check working directory (where files are created)
getwd()
14 Key Takeaways
Let’s consolidate what you’ve learned about setting up and using BigDataStatMeth for the first time.
14.1 Essential Concepts
Installation creates the foundation for all subsequent work with BigDataStatMeth. The package requires Rhdf5lib (the HDF5 C library, from Bioconductor) and standard CRAN packages, plus compilation tools. Windows users face additional complexity requiring Rtools, but following the installation sequence systematically prevents most problems. Testing with small examples immediately after installation catches configuration issues before you invest time in real analyses.
The HDF5Matrix object is the central abstraction of BigDataStatMeth. Rather than calling functions with file paths and dataset names on every operation, you create an HDF5Matrix object once — pointing to data stored on disk — and then work with it using standard R syntax. The object knows where its data lives, how large it is, and dispatches every operation block-wise without loading the full matrix into memory. This object-oriented design makes code readable and scalable at the same time.
HDF5 files are organized hierarchically like a file system, with groups acting as folders and datasets as files. The path string "group/dataset" passed to hdf5_create_matrix() encodes this structure directly. Creating your first HDF5 file teaches this fundamental paradigm: data lives on disk, accessed selectively, rather than entirely in RAM. Good organization from the start saves confusion when projects grow complex.
Global options control the computational behaviour of all HDF5Matrix operations. hdf5matrix_options() lets you configure parallelization, number of threads, block size, and compression in one place. Compression level 6 (the default) balances file size and write speed well for most workflows, but you can tune it based on your storage and throughput requirements. These settings are especially useful for operators like + and - where no explicit arguments can be passed.
Standard R generics work directly on HDF5Matrix objects. Calls like prcomp(), svd(), crossprod(), and %*% operate block-wise on disk without loading the full matrix into memory. The familiar R syntax is the interface — BigDataStatMeth simply makes it scale to datasets that don’t fit in RAM. The C++ API exists for developers who need to implement novel statistical methods or integrate directly with the HDF5 computational infrastructure. For the vast majority of analyses, the S3 interface provides everything needed.
Verification prevents wasted effort. Testing installations with small examples catches problems when they’re easy to fix. If basic operations fail on tiny test data, they won’t mysteriously work on real 100,000 × 100,000 matrices. Small-scale testing establishes that your environment works correctly before investing hours generating or converting large datasets.
14.2 When to Use BigDataStatMeth
Making informed decisions about when BigDataStatMeth helps versus when simpler approaches suffice saves time and prevents unnecessary complexity.
✅ Use BigDataStatMeth when:
- Data exceeds 30% of available RAM: This threshold provides headroom for intermediate computations and operating system needs. Below 30%, traditional R approaches work fine. Above 30%, you risk memory exhaustion during operations, and disk-based computing becomes necessary.
- You’re starting a new analysis project: Converting data to HDF5 at the beginning avoids migration pain later. It’s easier to start organized than to reorganize mid-project when you discover your data has grown beyond memory limits.
- Multiple analyses will reuse the same data: Converting to HDF5 once pays off when you’ll run PCA, then regression, then association tests on the same dataset. The upfront conversion cost amortizes across repeated analyses.
- Your workflow spans multiple tools: If you work in R, Python, and command-line tools, HDF5 provides a common format all can read efficiently. This beats converting between CSV, RData, and tool-specific formats repeatedly.
❌ Traditional R works better when:
- Data comfortably fits in less than 20% of RAM: If data <- read.csv(file) works without issues, stick with familiar R approaches. Traditional methods are simpler, more flexible, and better supported by the broader R ecosystem. Don’t add complexity unnecessarily.
- You’re doing one-off exploratory analysis: For quick investigations you won’t repeat, the HDF5 conversion overhead outweighs benefits. Load data, explore, save key results, discard working data, and you’re done.
- You need maximum flexibility: In-memory R data structures support arbitrary manipulations trivially. HDF5 adds structure, which aids organization but constrains spontaneous manipulations. If your workflow involves many ad-hoc transformations, staying in memory maintains flexibility.
The key question isn’t just “can my data fit in memory?” but “does my workflow benefit from disk-based computing?” Understanding your analysis requirements and computational resources helps make this decision rationally rather than by trial and error.
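The 20% and 30% rules of thumb above can be turned into a quick back-of-the-envelope check. The sketch below assumes dense double-precision matrices (8 bytes per element). `matrix_gb()` and `suggest_backend()` are hypothetical helpers, and `ram_gb` has to be supplied by hand because base R has no portable way to query installed memory.

```r
# Hypothetical helper: size of a dense numeric matrix in GB (8 bytes/element).
matrix_gb <- function(nrow, ncol, bytes_per_element = 8) {
  # as.numeric() avoids integer overflow for large dimensions
  as.numeric(nrow) * as.numeric(ncol) * bytes_per_element / 1024^3
}

# Hypothetical helper applying the tutorial's 20% / 30% rules of thumb.
suggest_backend <- function(nrow, ncol, ram_gb) {
  frac <- matrix_gb(nrow, ncol) / ram_gb
  if (frac > 0.30) {
    "disk-backed (HDF5)"
  } else if (frac < 0.20) {
    "in-memory"
  } else {
    "borderline: depends on the workflow"
  }
}

matrix_gb(1000, 500)         # about 0.0037 GB
matrix_gb(100000, 50000)     # about 37.3 GB

suggest_backend(1000, 500, ram_gb = 16)       # "in-memory"
suggest_backend(100000, 50000, ram_gb = 16)   # "disk-backed (HDF5)"
```

The size estimate covers only the input matrix. Intermediate results and copies made during an analysis add to the real footprint, which is the reason for the headroom built into the thresholds.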