Getting Started

Installation, First Steps, and the HDF5Matrix Object

1 Overview

BigDataStatMeth enables statistical analysis on datasets too large for RAM by using HDF5 file storage and block-wise processing. This tutorial guides you through installation, introduces the fundamental object that makes everything work, and shows your first operations — ensuring you have a solid working foundation before diving into more complex analyses.

Think of this as setting up your laboratory bench before starting experiments. You’ll not only verify that all your tools work correctly, but also understand the conceptual model that underlies every operation in the package.

1.1 What You’ll Learn

By the end of this tutorial, you will:

  • Install BigDataStatMeth and all required dependencies correctly
  • Understand the HDF5Matrix object and how it differs from in-memory matrices
  • Create, open, inspect, and close HDF5Matrix objects
  • Access data efficiently using subsetting and conversion to memory
  • Configure global options for parallelization, block size, and compression
  • Perform basic matrix operations using standard R syntax on disk-backed data
  • Know where to find help when you encounter issues
  • Be prepared for more advanced tutorials

2 Prerequisites

2.1 System Requirements

Operating System:

  • Linux (recommended)
  • macOS
  • Windows (requires Rtools; see below)

R Version:

  • R ≥ 4.1.0 (check with R.version.string)

RAM:

  • Minimum: 4 GB
  • Recommended: 8+ GB for comfortable work

Disk Space:

  • ~500 MB for package and dependencies
  • Additional space for your HDF5 data files

Important: Windows Users: Install Rtools First

BigDataStatMeth is compiled from C++ source code. Windows users must install Rtools before installing the package.

Download the Rtools version that matches your R installation from the Rtools page on CRAN.
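
The requirements above can be checked from the R console before installing anything. The following is a minimal sketch using base R only; treating "make is on the PATH" as a sign that build tools are available is a rough heuristic assumed here, not an official Rtools check.

```r
# Pre-flight check using base R only (a sketch, not an official installer check)
r_ok      <- getRversion() >= "4.1.0"       # minimum R version stated above
os_name   <- Sys.info()[["sysname"]]        # e.g. "Linux", "Darwin" (macOS), "Windows"
make_path <- unname(Sys.which("make"))      # "" when no make executable is on the PATH

cat("R version OK (>= 4.1.0):", r_ok, "\n")
cat("Operating system:", os_name, "\n")
if (nzchar(make_path)) {
  cat("Build tool 'make' found at:", make_path, "\n")
} else {
  cat("'make' not found on the PATH; on Windows this often means Rtools is missing\n")
}
```

If the last message appears on Windows, install Rtools and restart R before continuing.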


3 Step 1: Install Dependencies

BigDataStatMeth requires packages from both CRAN and Bioconductor.

3.1 Install BiocManager

# BiocManager helps install Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

3.2 Install Required Packages

BigDataStatMeth’s only Bioconductor dependency is Rhdf5lib, which provides the HDF5 C library used at compilation. All other dependencies are standard CRAN packages.

# Bioconductor: HDF5 C library (required for compilation)
BiocManager::install("Rhdf5lib")

# CRAN packages (installed automatically, but explicit install avoids surprises)
install.packages(c("R6", "Rcpp", "RcppEigen", "RCurl", "data.table"))

Note: Installing from CRAN vs GitHub

When installing the CRAN release with install.packages("BigDataStatMeth"), all dependencies — including Rhdf5lib — are resolved and installed automatically.

When installing the GitHub version with devtools::install_github(), Rhdf5lib may not be pulled in automatically. Install it first with BiocManager::install("Rhdf5lib") to avoid compilation errors.

Tip: Troubleshooting Dependencies

If installation fails:

  1. Update R: Some packages require recent R versions
  2. Update BiocManager: Run BiocManager::install() with no arguments
  3. Check compilation tools: Especially on Windows/macOS
  4. Installation logs: Look for specific error messages about missing libraries
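
When working through the checklist above, it helps to know which dependencies are actually missing. This is a small base R sketch that checks the packages named in this step without loading them; it only tests for presence, not for version compatibility.

```r
# Packages named in Step 1 (CRAN packages plus Rhdf5lib from Bioconductor)
deps <- c("R6", "Rcpp", "RcppEigen", "RCurl", "data.table", "Rhdf5lib")

# system.file() returns "" for a package that is not installed
installed <- vapply(
  deps,
  function(pkg) nzchar(system.file(package = pkg)),
  logical(1)
)

missing_deps <- names(installed)[!installed]
if (length(missing_deps) == 0) {
  cat("All listed dependencies are installed\n")
} else {
  cat("Missing:", paste(missing_deps, collapse = ", "), "\n")
}
```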

4 Step 2: Install BigDataStatMeth

4.1 From CRAN (Release Version)

For the stable release, run install.packages("BigDataStatMeth"). All dependencies, including Rhdf5lib, are resolved and installed automatically.

4.2 From GitHub (Development Version)

For the latest development features:

# Install devtools if needed
if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")

# Install development version from GitHub
devtools::install_github("isglobal-brge/BigDataStatMeth")

Note: Which Version Should I Use?

Use CRAN version if:

  • You want maximum stability
  • You’re doing production analysis
  • You prefer well-tested releases

Use GitHub version if:

  • You need the latest features
  • You’re contributing to development
  • You want to test new functionality


5 Step 3: Load and Verify

5.1 Load the Package

library(BigDataStatMeth)

If no errors appear, the package loaded successfully!

5.2 Quick Verification

Run this simple test to verify everything works:

# Create a small test matrix
test_matrix <- matrix(rnorm(100), nrow = 10, ncol = 10)

# Create an HDF5Matrix object — the fundamental object in BigDataStatMeth
test_file <- "verification_test.hdf5"
X_test <- hdf5_create_matrix(
  filename  = test_file,
  dataset   = "test/data",
  data      = test_matrix,
  overwrite = TRUE
)

# The object knows its dimensions without loading data into RAM
if (all(dim(X_test) == c(10, 10))) {
  cat("✓ Installation verified!\n")
  cat("✓ HDF5Matrix created:", dim(X_test)[1], "×", dim(X_test)[2], "\n")
  close(X_test)
  file.remove(test_file)
} else {
  cat("✗ Installation issue — unexpected dimensions\n")
}

Expected output:

✓ Installation verified!
✓ HDF5Matrix created: 10 × 10
[1] TRUE

The trailing [1] TRUE is the value returned by file.remove(), confirming that the test file was deleted.

Warning: If Verification Fails

Common issues:

  • “Package not found”: Restart R session
  • “HDF5 library error”: Reinstall Rhdf5lib with BiocManager::install("Rhdf5lib") then recompile BigDataStatMeth
  • Permission denied: Check write permissions in working directory
  • Symbol not found: Recompile package from source

Try sessionInfo() to check loaded packages and versions. For function-level help: ?hdf5_create_matrix.
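
For the "Permission denied" case in particular, a quick probe shows whether R can write to the current working directory, which is where the verification file is created. This is a base R sketch; the probe file name is arbitrary and the file is deleted again immediately.

```r
# Try to create and remove a small file in the working directory
probe <- tempfile(pattern = "write_probe_", tmpdir = getwd(), fileext = ".tmp")

can_write <- tryCatch(
  { writeLines("probe", probe); TRUE },
  warning = function(w) FALSE,   # e.g. "cannot open file ... Permission denied"
  error   = function(e) FALSE
)
if (file.exists(probe)) invisible(file.remove(probe))

cat("Working directory:", getwd(), "\n")
cat("Writable:", can_write, "\n")
```

If this reports FALSE, switch to a writable directory with setwd() or pass a full path as filename.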


6 Step 4: The HDF5Matrix Object

Before working with real data, you need to understand the building block of BigDataStatMeth: the HDF5Matrix object. This is the conceptual shift from earlier versions of the package — instead of calling functions with file paths and dataset names every time, you create an object once and then work with it using standard R syntax.

Think of an HDF5Matrix as a window onto data that lives on disk. The data doesn’t move into RAM until you explicitly ask for it — but the object itself knows where the data is, how big it is, and can operate on it block by block. To R code, it behaves like a regular matrix.

6.1 Creating an HDF5Matrix

hdf5_create_matrix() writes data to an HDF5 file and returns an HDF5Matrix object pointing to it. The dataset argument combines the group path and dataset name using HDF5’s standard /group/dataset convention:

set.seed(42)
A <- matrix(rnorm(300 * 80), nrow = 300, ncol = 80)

example_file <- "hdf5matrix_intro.hdf5"

A_h5 <- hdf5_create_matrix(
  filename  = example_file,
  dataset   = "data/A",
  data      = A,
  overwrite = TRUE
)

# Printing the object shows where it lives and how big it is
A_h5
HDF5Matrix object
  File: hdf5matrix_intro.hdf5
  Path: data/A
  Dimensions: 300 x 80
  Type: 
  Status: OPEN
# Standard R accessors work directly
dim(A_h5)
[1] 300  80
nrow(A_h5)
[1] 300
ncol(A_h5)
[1] 80

Note: The Dataset Path Convention

The dataset argument in hdf5_create_matrix() uses the path "group/dataset". Groups are like folders inside the HDF5 file — they are created automatically if they don’t exist yet. So "data/A" creates a group called data containing a dataset called A.

This means you can organize multiple datasets in a single file just by choosing their paths:

hdf5_create_matrix(file, "raw/genotypes", data = geno)
hdf5_create_matrix(file, "raw/expression", data = expr)
hdf5_create_matrix(file, "results/pca",    data = scores)
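
To make the convention concrete, the sketch below splits a path string into its group and dataset parts with base R. The helper split_dataset_path() is hypothetical and exists only for illustration; how the package parses the string internally is not shown in this tutorial.

```r
# Hypothetical helper: separate "group/dataset" into its two parts
split_dataset_path <- function(path) {
  list(
    group   = dirname(path),    # everything before the last "/"
    dataset = basename(path)    # the final path component
  )
}

split_dataset_path("data/A")
split_dataset_path("raw/genotypes")
split_dataset_path("results/pca")
```

Note that dirname() returns "." for a string without a slash, so this helper only makes sense for paths that include a group.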

6.2 Accessing Data

Subsetting an HDF5Matrix uses standard R bracket syntax. Crucially, only the requested block is read from disk — the rest stays on disk untouched:

# Read a 5×6 block — only this block is loaded into RAM
A_h5[1:5, 1:6]
           [,1]         [,2]       [,3]        [,4]        [,5]       [,6]
[1,]  1.3709584 -0.004620768 -0.2484829  0.94192422 -0.74651645 -0.6013830
[2,] -0.5646982  0.760242168  0.4223204 -0.24861404  0.03660612 -0.1358161
[3,]  0.3631284  0.038990913  0.9876533  0.09647886  0.32330962 -0.9872728
[4,]  0.6328626  0.735072142  0.8355682 -0.43393094  0.37967603  0.8319250
[5,]  0.4042683 -0.146472627 -0.6605219  2.17866787  0.87655650 -0.7950595

When you need an in-memory copy — for visualization, export, or passing to other R functions — use as.matrix(). Subset on disk first whenever possible:

# Subset on disk first, then bring to memory — efficient
small_block <- as.matrix(A_h5[1:10, 1:5])
dim(small_block)
[1] 10  5
# Verify values match the original
all.equal(small_block, A[1:10, 1:5])
[1] TRUE

Tip: When to Use as.matrix()

Use as.matrix() for:

  • Small results: Singular values, summary statistics, final scores
  • Visualization: ggplot2 and base R plots need in-memory data
  • Export: Writing results to CSV or other formats

Avoid as.matrix() on the full object if your matrix is large. The whole point of HDF5Matrix is that data stays on disk until needed. Subset first: as.matrix(X[rows, cols]), not as.matrix(X).
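
A quick size estimate before calling as.matrix() shows whether a conversion is safe. The sketch below assumes values are stored as 8-byte doubles once in R (integers would take 4 bytes); estimate_mb() is a hypothetical helper, not part of the package.

```r
# Hypothetical helper: approximate in-memory size of an n_rows x n_cols block
estimate_mb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 1024^2
}

estimate_mb(10, 5)        # the small block read above: a few hundred bytes
estimate_mb(300, 80)      # the full example matrix: about 0.18 MB
estimate_mb(1e5, 5e4)     # a hypothetical 100,000 x 50,000 matrix: about 37 GB
```

The last case is the kind of object where subsetting on disk first is the only practical option on a typical workstation.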

6.3 Inspecting the File

list_datasets() shows all datasets currently stored in an HDF5 file. You can pass either the file path or an HDF5Matrix object:

# By file path — lists everything in the file
list_datasets(example_file)
[1] "data/A"
# By HDF5Matrix object — lists datasets in its group
list_datasets(A_h5)
[1] "A"

Note: list_datasets() and the File Path

list_datasets() accepts either the HDF5 file path (to see everything in the file) or an HDF5Matrix object (to see the datasets in its group). Both forms are useful: the file path gives you a global picture; the object form scopes the view to where you are currently working.

6.4 Reopening an Existing Dataset

If you already have an HDF5 file with data from a previous session, hdf5_matrix() opens an existing dataset and returns an HDF5Matrix object:

# Open an existing dataset — no data is read into RAM yet
A_reopen <- hdf5_matrix(example_file, "data/A")

dim(A_reopen)
[1] 300  80

This is the pattern you’ll use when loading results from a previous analysis run.


6.5 Global Options

BigDataStatMeth provides a global options system to configure parallelization, block size, and compression across all operations. This is particularly important for operators like +, -, *, / where you can’t pass extra arguments — the global options control their behaviour.

# View the current defaults
old_opts <- hdf5matrix_options()
old_opts
$paral
NULL

$block_size
NULL

$threads
NULL

$compression
NULL

The four options are:

Option        Default                 Effect
paral         NULL (auto-detect)      Enable OpenMP parallelization
threads       NULL (auto-detect)      Number of parallel threads
block_size    NULL (auto-calculate)   Elements per I/O block
compression   NULL (uses level 6)     gzip compression level (0–9)

Set them for a session or restore them after local overrides:

# Configure for a parallel analysis session
hdf5matrix_options(
  paral       = TRUE,
  threads     = 2L,
  block_size  = 512L,
  compression = 6L
)

hdf5matrix_options()
$paral
[1] TRUE

$block_size
[1] 512

$threads
[1] 2

$compression
[1] 6

Tip: Compression: the Trade-off Between Size and Speed

The compression option controls gzip compression for all output datasets. Higher levels produce smaller files but take longer to write. Level 6 (the default) is a balanced choice for most workflows.

The benchmark below illustrates the trade-off on a moderate-sized matrix:

set.seed(123)
X_bench <- round(matrix(rnorm(2000 * 200), 2000, 200), 2)

# No compression
f_none <- tempfile(fileext = ".h5")
t0 <- system.time(
  hdf5_create_matrix(f_none, "data/X",
                     data = X_bench, compression = 0, overwrite = TRUE)
)

# Default compression (level 6)
f_def <- tempfile(fileext = ".h5")
t6 <- system.time(
  hdf5_create_matrix(f_def, "data/X",
                     data = X_bench, compression = 6, overwrite = TRUE)
)

data.frame(
  compression   = c(0, 6),
  write_time_s  = c(t0[["elapsed"]], t6[["elapsed"]]),
  file_size_MB  = round(file.info(c(f_none, f_def))$size / 1024^2, 3)
)
  compression write_time_s file_size_MB
1           0        0.001        3.054
2           6        0.106        0.731

The exact numbers depend on your hardware and data, but the pattern holds generally: compression trades write time for file size. In this run, level 6 shrinks the file to roughly a quarter of its uncompressed size (3.054 MB to 0.731 MB), while the write takes about 0.1 s instead of being nearly instantaneous. For interactive analysis on a laptop, level 6 is a sensible default. For high-throughput pipelines writing many large datasets, consider level 1 or 2.
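
The figures in the benchmark table can be turned into a compression ratio with a few lines of arithmetic. The sketch below uses only the numbers reported above plus the fact that a double occupies 8 bytes; the results describe this one run, not compression in general.

```r
# Raw payload of the 2000 x 200 benchmark matrix (8 bytes per double)
raw_mb <- 2000 * 200 * 8 / 1024^2

# File sizes reported in the benchmark table above
size_mb <- c(none = 3.054, gzip6 = 0.731)

ratio  <- size_mb[["none"]] / size_mb[["gzip6"]]       # how many times smaller
saving <- 1 - size_mb[["gzip6"]] / size_mb[["none"]]   # fraction of space saved

round(raw_mb, 3)        # 3.052: the uncompressed file is almost pure payload
round(ratio, 2)         # 4.18
round(100 * saving, 1)  # 76.1 (percent)
```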

# Restore original defaults for the rest of the tutorial
hdf5matrix_options(
  paral      = old_opts$paral,
  threads    = old_opts$threads,
  block_size = old_opts$block_size,
  compression = old_opts$compression
)

6.6 Releasing Resources

HDF5 files remain open while their HDF5Matrix objects exist, which improves performance for repeated access. Release handles explicitly when you’re done:

# Close a single object
close(A_reopen)

# Close all open HDF5Matrix handles at once
hdf5_close_all()

Note: Automatic vs. Explicit Cleanup

R’s garbage collector releases HDF5 handles automatically once objects are no longer referenced, but only when it next runs, which may be some time later. Calling close() or hdf5_close_all() explicitly is good practice at the end of analysis sections or before re-running code interactively, because it avoids accumulating stale handles between runs.


7 Step 5: Your First Analysis Dataset

Now that you understand the HDF5Matrix object, let’s create a realistic dataset and explore it.

7.1 Create Sample Data

# Simulate a genomic dataset: 1,000 samples × 5,000 SNPs
set.seed(123)
n_samples <- 1000
n_snps    <- 5000

genotype_data <- matrix(
  sample(0:2, n_samples * n_snps, replace = TRUE),
  nrow = n_samples,
  ncol = n_snps
)

# Add meaningful row/column names
rownames(genotype_data) <- paste0("Sample_", 1:n_samples)
colnames(genotype_data) <- paste0("SNP_", 1:n_snps)

# Check in-memory size
format(object.size(genotype_data), units = "MB")
[1] "19.5 Mb"

7.2 Create an HDF5Matrix from the Data

data_file <- "my_first_dataset.hdf5"

geno_h5 <- hdf5_create_matrix(
  filename  = data_file,
  dataset   = "genotypes/data",
  data      = genotype_data,
  overwrite = TRUE
)

cat("✓ HDF5Matrix created\n")
✓ HDF5Matrix created
geno_h5
HDF5Matrix object
  File: my_first_dataset.hdf5
  Path: genotypes/data
  Dimensions: 1000 x 5000
  Type: 
  Status: OPEN

Note: HDF5 File Structure

HDF5 files organize data hierarchically:

  • File: my_first_dataset.hdf5
    • Group: genotypes (like a folder)
      • Dataset: data (the actual matrix)

The HDF5Matrix object holds a reference to this location. No data is in RAM — only the pointer to where the data lives on disk.

7.3 Inspect and Access the Data

# List what's in the file
list_datasets(data_file)
[1] "genotypes/.data_dimnames/1" "genotypes/.data_dimnames/2"
[3] "genotypes/data"            
# Dimensions — answered instantly without reading the data
dim(geno_h5)
[1] 1000 5000
# Read a small subset — only this block is loaded
geno_h5[1:5, 1:10]
         SNP_1 SNP_2 SNP_3 SNP_4 SNP_5 SNP_6 SNP_7 SNP_8 SNP_9 SNP_10
Sample_1     2     1     2     2     2     0     1     0     2      0
Sample_2     2     2     1     2     2     1     0     0     0      2
Sample_3     2     1     0     0     2     2     1     0     0      0
Sample_4     1     0     0     0     2     1     0     0     1      1
Sample_5     2     2     0     1     0     2     0     2     2      2

7.4 Verify the Data Stored Correctly

# Bring a small block to memory and compare with original
block_hdf5 <- as.matrix(geno_h5[1:5, 1:10])
block_orig  <- genotype_data[1:5, 1:10]

all.equal(block_hdf5, block_orig)
[1] TRUE

Note: Understanding all.equal() with HDF5Matrix

If you see output like "Attributes: < Component 'dimnames': ...>", the matrices are numerically identical but have different attributes. This is expected — dimension names are stored separately in HDF5. The numeric values are what matter for calculations. You can verify explicitly:

all.equal(as.numeric(block_hdf5), as.numeric(block_orig))
[1] TRUE
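
The same behaviour can be reproduced with two ordinary in-memory matrices, which makes it easier to see what all.equal() is reporting. This base R sketch also shows the check.attributes argument, an alternative to stripping attributes with as.numeric().

```r
# Same values, but only one matrix carries dimension names
with_names <- matrix(1:6, nrow = 2,
                     dimnames = list(c("r1", "r2"), c("a", "b", "c")))
without_names <- matrix(as.numeric(1:6), nrow = 2)

# Returns a description of the attribute differences rather than TRUE
all.equal(with_names, without_names)

# Compare the values only, ignoring dimnames and other attributes
all.equal(with_names, without_names, check.attributes = FALSE)
```

Because all.equal() returns a character vector when it finds differences, wrap it in isTRUE() when using it inside if() conditions.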

8 Step 6: Basic Operations

With an HDF5Matrix object in hand, standard R operators work directly on disk-backed data. No new syntax to learn — BigDataStatMeth dispatches these to block-wise implementations transparently.

8.1 Matrix Multiplication

# Two matrices to multiply
set.seed(456)
A <- matrix(rnorm(500 * 100), nrow = 500, ncol = 100)
B <- matrix(rnorm(100 * 200), nrow = 100, ncol = 200)

ops_file <- "operations_example.hdf5"
A_h5 <- hdf5_create_matrix(ops_file, "matrices/A", data = A, overwrite = TRUE)
B_h5 <- hdf5_create_matrix(ops_file, "matrices/B", data = B, overwrite = TRUE)

# Standard R matrix multiplication — executed block-wise on disk
M_h5 <- A_h5 %*% B_h5

cat("Result dimensions:", dim(M_h5), "\n")
Result dimensions: 500 200 
cat("Result preview (first 5×5):\n")
Result preview (first 5×5):
print(as.matrix(M_h5[1:5, 1:5]))
          [,1]      [,2]       [,3]       [,4]     [,5]
[1,]  3.966529 -1.596117 -10.575227  -1.523170 13.37335
[2,] 10.496300 -4.726875  16.389046   3.817424 16.47317
[3,] -8.331547 -9.299947  -3.084136   4.030101 11.75505
[4,] -5.159786  2.319629   2.815899  -3.407731 -5.72410
[5,]  3.384955 18.492893   1.697668 -13.890756 10.48865
# Verify against in-memory computation
all.equal(as.matrix(M_h5), A %*% B)
[1] TRUE

Tip: Block-Wise Processing in Action

Notice that A_h5 %*% B_h5 uses exactly the same syntax as in-memory matrix multiplication. Behind the scenes, BigDataStatMeth partitioned both matrices into blocks, multiplied each pair, accumulated the result, and wrote it to disk — all without you managing a single block boundary.

For truly large matrices (100 GB+), the same code works unchanged; it just takes longer.
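
The partition, multiply, and accumulate idea can be illustrated entirely in memory. The sketch below is a plain R illustration of block-wise multiplication under an assumed fixed block size; it is not the package's C++ implementation, and block_matmul() is a hypothetical name.

```r
# Illustrative block-wise multiplication: C = P %*% Q computed tile by tile
block_matmul <- function(P, Q, block_size = 64L) {
  stopifnot(ncol(P) == nrow(Q), block_size >= 1)
  result <- matrix(0, nrow = nrow(P), ncol = ncol(Q))

  # Index ranges that tile 1..len in chunks of block_size
  blocks <- function(len) {
    starts <- seq(1L, len, by = block_size)
    lapply(starts, function(s) s:min(s + block_size - 1L, len))
  }

  for (i in blocks(nrow(P))) {        # row block of the result
    for (j in blocks(ncol(Q))) {      # column block of the result
      for (k in blocks(ncol(P))) {    # shared inner dimension
        result[i, j] <- result[i, j, drop = FALSE] +
          P[i, k, drop = FALSE] %*% Q[k, j, drop = FALSE]
      }
    }
  }
  result
}

set.seed(456)
P <- matrix(rnorm(50 * 30), nrow = 50)
Q <- matrix(rnorm(30 * 20), nrow = 30)
all.equal(block_matmul(P, Q, block_size = 8L), P %*% Q)
```

Only three small tiles are needed at any moment, which is why the same scheme works when the full matrices live on disk.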

8.2 Crossproduct

A crossproduct computes t(A) %*% A — that is, the transpose of a matrix multiplied by itself. The result is a square symmetric matrix whose entries are dot products between columns of A. This operation is at the heart of many statistical methods: it appears in PCA (as the covariance structure), in ordinary least squares (as the normal equations (XᵀX)β = Xᵀy), and in any method that needs pairwise column similarities. For large matrices the block-wise implementation is essential because the input is never fully loaded into RAM, yet the result is exact.

crossprod() accepts optional outgroup and outdataset arguments to control where the result is written inside the HDF5 file:

# t(A) %*% A — with explicit output location
XtX_h5 <- crossprod(
  A_h5,
  outgroup   = "results",
  outdataset = "A_crossprod"
)

cat("Crossproduct dimensions:", dim(XtX_h5), "\n")
Crossproduct dimensions: 100 100 
cat("Preview (first 5×5):\n")
Preview (first 5×5):
print(as.matrix(XtX_h5[1:5, 1:5]))
           [,1]       [,2]       [,3]       [,4]      [,5]
[1,] 475.947503 -26.983942   1.705196  18.276113  22.03469
[2,] -26.983942 488.288982 -24.322274   6.712965 -14.34956
[3,]   1.705196 -24.322274 498.798075   9.800220  52.82859
[4,]  18.276113   6.712965   9.800220 511.565405 -26.92909
[5,]  22.034686 -14.349559  52.828589 -26.929093 438.22361
# Verify
all.equal(as.matrix(XtX_h5), crossprod(A))
[1] TRUE
# See everything written to the file so far
list_datasets(ops_file)
[1] "OUTPUT/A_x_B"        "matrices/A"          "matrices/B"         
[4] "results/A_crossprod"
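
The reason a crossproduct can be computed without loading the whole input is that t(X) %*% X equals the sum of the crossproducts of its row blocks. The following in-memory sketch demonstrates that identity and its use in the normal equations mentioned above; block_crossprod() is a hypothetical helper, not the package's implementation.

```r
# Accumulate t(X) %*% X over blocks of rows
block_crossprod <- function(X, rows_per_block = 100L) {
  XtX <- matrix(0, nrow = ncol(X), ncol = ncol(X))
  for (start in seq(1L, nrow(X), by = rows_per_block)) {
    rows <- start:min(start + rows_per_block - 1L, nrow(X))
    XtX <- XtX + crossprod(X[rows, , drop = FALSE])  # contribution of one block
  }
  XtX
}

set.seed(1)
X_demo <- matrix(rnorm(500 * 10), nrow = 500)
y_demo <- rnorm(500)

XtX_demo <- block_crossprod(X_demo, rows_per_block = 64L)
all.equal(XtX_demo, crossprod(X_demo))

# Normal equations: solve (X'X) beta = X'y and compare with lm()
beta <- solve(XtX_demo, crossprod(X_demo, y_demo))
all.equal(drop(beta), unname(coef(lm(y_demo ~ X_demo - 1))))
```

The result has only ncol(X) × ncol(X) entries, so it stays small even when X has millions of rows.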

9 Step 7: Clean Up

HDF5 files keep file handles open as long as the HDF5Matrix objects that point to them are alive. This is intentional — keeping handles open avoids the overhead of opening and closing the file on every operation, which matters when you call dozens of operations in sequence. But it means that at the end of an analysis session, or before re-running code interactively, you should release those handles explicitly.

hdf5_close_all() closes every HDF5 handle currently tracked by the package in one call. It is the safe way to ensure that no file is left locked after you are done:

hdf5_close_all()

After hdf5_close_all(), calling gc() is good practice. R’s garbage collector runs finalizers for objects that are no longer referenced, which releases any remaining C++-level resources associated with HDF5Matrix objects that have already gone out of scope but whose finalizers have not yet run:

gc()
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  929515 49.7    1783064 95.3         NA  1783064 95.3
Vcells 4588701 35.1   10146329 77.5      36864  8193715 62.6

Tip: When to Call hdf5_close_all() and gc()

  • End of an analysis script: ensures no file stays locked after the script finishes.
  • Before re-running code interactively: prevents “file already open” or “dataset exists” errors from stale handles left by the previous run.
  • When disk usage seems unexpectedly large: HDF5 files can hold free space internally from deleted datasets; closing handles properly lets the file system account for the current state.

The combination hdf5_close_all(); gc() is the reliable reset for a clean slate.

# Remove tutorial files
file.remove(data_file, ops_file,
            "hdf5matrix_intro.hdf5",
            f_none, f_def)
[1] FALSE FALSE FALSE FALSE FALSE
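
file.remove() returns FALSE, with a warning, for every path it could not delete, for example because the file is not in the current working directory or was already removed. A slightly more defensive cleanup only touches files that exist. The helper below is a hypothetical base R sketch, demonstrated on temporary files rather than the tutorial files.

```r
# Hypothetical helper: remove files that exist, report FALSE for the rest
remove_if_present <- function(paths) {
  present <- file.exists(paths)
  removed <- rep(FALSE, length(paths))
  removed[present] <- file.remove(paths[present])
  stats::setNames(removed, basename(paths))
}

# Demonstration with one real temporary file and one missing file
tmp_existing <- tempfile(fileext = ".hdf5")
invisible(file.create(tmp_existing))
tmp_missing <- file.path(tempdir(), "does_not_exist.hdf5")

remove_if_present(c(tmp_existing, tmp_missing))
```

The named result makes it easy to see which files were actually deleted.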

10 Interactive Exercise

10.1 Practice: Creating Your Own HDF5 Workflow

Now that you’ve seen the basic operations, try designing a small workflow with data relevant to your work.

# Exercise: Create a mini-analysis workflow

# 1. Generate or load your own data
my_data <- matrix(rnorm(1000 * 500), nrow = 1000, ncol = 500)
# Replace with: read.csv(), or data from your field

# 2. Create an HDF5Matrix with meaningful names
X <- hdf5_create_matrix(
  filename  = "my_analysis.hdf5",
  dataset   = "raw_data/measurements",   # Choose a descriptive path
  data      = my_data,
  overwrite = TRUE
)

# 3. Inspect what you created
X
list_datasets("my_analysis.hdf5")

# 4. Perform an operation
result <- crossprod(X, outgroup = "processed", outdataset = "XtX")

# 5. Bring a small block to memory and inspect
top_block <- as.matrix(result[1:5, 1:5])
print(top_block)

# 6. Clean up
hdf5_close_all()

Tip: Reflection Questions

As you work through this exercise, consider:

1. File Organization:

  • How are you structuring your HDF5 file? (groups and datasets)
  • If you added more data tomorrow, where would it go?
  • Could someone else understand your organization scheme?

2. Memory vs. Disk:

  • At what data size would your operation fail in memory?
  • How much disk space does your HDF5 file use?
  • Compare: object size in R (object.size(my_data)) vs. file size on disk

3. The HDF5Matrix Object:

  • At which point do you read data into RAM in this workflow?
  • What happens if you call as.matrix(X) on the full matrix?
  • How does X[1:10, 1:5] differ from as.matrix(X)[1:10, 1:5]?

4. Operation Choice:

  • Why use crossprod(X) instead of t(X) %*% X?
  • What statistical question does this operation answer?
  • Could you achieve the same result differently?

5. Error Handling:

  • What happens if you try to create a dataset that already exists?
  • Try it: What does the error message tell you?
  • How do you fix it? (overwrite = TRUE or a different path?)

6. Scaling Up:

  • Your test used 1,000 × 500. What about 100,000 × 50,000?
  • Which operations would work unchanged?
  • Would you adjust compression or thread settings for very large data?

Don’t worry about “correct” answers — the goal is developing intuition about when and how to use these tools.


11 What You’ve Accomplished

  • Installed BigDataStatMeth and all dependencies
  • Understood the HDF5Matrix object and its role in the package
  • Created HDF5Matrix objects from R data
  • Accessed data with subsetting and as.matrix()
  • Configured global options including compression
  • Performed block-wise operations with standard R syntax
  • Verified that results match in-memory computations


12 Next Steps

Now that you have BigDataStatMeth working and understand the HDF5Matrix paradigm, continue learning:

Continue the tutorial series:

  1. Getting Started (you are here)
  2. Working with HDF5 Matrices — File operations, data import, and management
  3. Your First Analysis — Complete workflow from raw data to results


13 Getting Help

If you encounter issues:

  1. Check documentation: ?hdf5_create_matrix, ?hdf5matrix_options, ?svd.HDF5Matrix
  2. Review examples: Package vignettes contain working code
  3. GitHub Issues: Report bugs at isglobal-brge/BigDataStatMeth

Tip: Quick Troubleshooting Commands

# Check package version
packageVersion("BigDataStatMeth")

# View all loaded packages
sessionInfo()

# Check current global options
hdf5matrix_options()

# Check working directory (where files are created)
getwd()

14 Key Takeaways

Let’s consolidate what you’ve learned about setting up and using BigDataStatMeth for the first time.

14.1 Essential Concepts

Installation creates the foundation for all subsequent work with BigDataStatMeth. The package requires Rhdf5lib (the HDF5 C library, from Bioconductor) and standard CRAN packages, plus compilation tools. Windows users must also install Rtools, but following the installation sequence step by step prevents most problems. Testing with small examples immediately after installation catches configuration issues before you invest time in real analyses.

The HDF5Matrix object is the central abstraction of BigDataStatMeth. Rather than calling functions with file paths and dataset names on every operation, you create an HDF5Matrix object once — pointing to data stored on disk — and then work with it using standard R syntax. The object knows where its data lives, how large it is, and dispatches every operation block-wise without loading the full matrix into memory. This object-oriented design makes code readable and scalable at the same time.

HDF5 files are organized hierarchically like a file system, with groups acting as folders and datasets as files. The path string "group/dataset" passed to hdf5_create_matrix() encodes this structure directly. Creating your first HDF5 file teaches this fundamental paradigm: data lives on disk, accessed selectively, rather than entirely in RAM. Good organization from the start saves confusion when projects grow complex.

Global options control the computational behaviour of all HDF5Matrix operations. hdf5matrix_options() lets you configure parallelization, number of threads, block size, and compression in one place. Compression level 6 (the default) balances file size and write speed well for most workflows, but you can tune it based on your storage and throughput requirements. These settings are especially useful for operators like + and - where no explicit arguments can be passed.

Standard R generics work directly on HDF5Matrix objects. Calls like prcomp(), svd(), crossprod(), and %*% operate block-wise on disk without loading the full matrix into memory. The familiar R syntax is the interface — BigDataStatMeth simply makes it scale to datasets that don’t fit in RAM. The C++ API exists for developers who need to implement novel statistical methods or integrate directly with the HDF5 computational infrastructure. For the vast majority of analyses, the S3 interface provides everything needed.

Verification prevents wasted effort. Testing installations with small examples catches problems when they’re easy to fix. If basic operations fail on tiny test data, they won’t mysteriously work on real 100,000 × 100,000 matrices. Small-scale testing establishes that your environment works correctly before investing hours generating or converting large datasets.

14.2 When to Use BigDataStatMeth

Making informed decisions about when BigDataStatMeth helps versus when simpler approaches suffice saves time and prevents unnecessary complexity.

Use BigDataStatMeth when:

  • Data exceeds 30% of available RAM — This threshold provides headroom for intermediate computations and operating system needs. Below 30%, traditional R approaches work fine. Above 30%, you risk memory exhaustion during operations, and disk-based computing becomes necessary.

  • You’re starting a new analysis project — Converting data to HDF5 at the beginning avoids migration pain later. It’s easier to start organized than to reorganize mid-project when you discover your data has grown beyond memory limits.

  • Multiple analyses will reuse the same data — Converting to HDF5 once pays off when you’ll run PCA, then regression, then association tests on the same dataset. The upfront conversion cost amortizes across repeated analyses.

  • Your workflow spans multiple tools — If you work in R, Python, and command-line tools, HDF5 provides a common format all can read efficiently. This beats converting between CSV, RData, and tool-specific formats repeatedly.

Traditional R works better when:

  • Data comfortably fits in less than 20% of RAM — If data <- read.csv(file) works without issues, stick with familiar R approaches. Traditional methods are simpler, more flexible, and better supported by the broader R ecosystem. Don’t add complexity unnecessarily.

  • You’re doing one-off exploratory analysis — For quick investigations you won’t repeat, the HDF5 conversion overhead outweighs benefits. Load data, explore, save key results, discard working data, and you’re done.

  • You need maximum flexibility — In-memory R data structures support arbitrary manipulations trivially. HDF5 adds structure, which aids organization but constrains spontaneous manipulations. If your workflow involves many ad-hoc transformations, staying in memory maintains flexibility.

The key question isn’t just “can my data fit in memory?” but “does my workflow benefit from disk-based computing?” Understanding your analysis requirements and computational resources helps make this decision rationally rather than by trial and error.
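
The RAM thresholds above can be turned into a rough rule of thumb. The sketch below assumes 8-byte doubles and takes the amount of RAM as an argument supplied by you, since base R has no portable way to query it. Both helpers are hypothetical, and the range between 20% and 30% is treated here as a judgment call because the two lists leave it open.

```r
# Hypothetical helper: share of RAM a dense n_rows x n_cols matrix would occupy
ram_fraction <- function(n_rows, n_cols, ram_gb, bytes_per_value = 8) {
  (n_rows * n_cols * bytes_per_value) / (ram_gb * 1024^3)
}

# Hypothetical helper: apply the 20% / 30% thresholds from the lists above
suggest_approach <- function(fraction) {
  if (fraction > 0.30) {
    "disk-based (BigDataStatMeth)"
  } else if (fraction < 0.20) {
    "in-memory R"
  } else {
    "judgment call"
  }
}

# The 1,000 x 5,000 tutorial dataset on an 8 GB machine
suggest_approach(ram_fraction(1000, 5000, ram_gb = 8))

# A 100,000 x 50,000 matrix on a 16 GB machine
suggest_approach(ram_fraction(1e5, 5e4, ram_gb = 16))
```

The fraction is only one input to the decision; whether the data will be reused or shared across tools matters as much as its size.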