library(BigDataStatMeth)
Working with HDF5 Matrices
File Operations, Data Conversion, and Management
1 Overview
This tutorial teaches you the practical skills for working efficiently with HDF5 files: importing data from common formats, organizing complex projects with multiple datasets, accessing data strategically, and applying best practices for file structure. Think of this as learning how to organize and navigate your data storage before diving into statistical analyses.
1.1 What You’ll Learn
By the end of this tutorial, you will:
- Import text files (CSV, TSV) directly into HDF5 format
- Organize multiple related datasets in one HDF5 file using groups
- Add datasets to existing files and manage the file structure
- Read data efficiently using HDF5Matrix subsetting — loading only what you need
- Understand how file handles work and when to release them
- Apply best practices for HDF5 file organization in research projects
2 Prerequisites
Complete the Getting Started tutorial first. You should have:
- BigDataStatMeth installed and working
- A solid understanding of the HDF5Matrix object
- Familiarity with hdf5_create_matrix(), list_datasets(), and hdf5_close_all()
3 Importing Text Files into HDF5
3.1 The Problem
You have large text files (CSV, TSV, or other delimiters) that won’t fit comfortably in RAM. Loading them with read.csv() or read.table() is slow or impossible at scale.
3.2 The Solution
hdf5_import() reads a delimited file in chunks and writes it directly to HDF5, returning an HDF5Matrix object ready for analysis. It handles local files and URLs, and auto-detects the separator from the file extension (.csv → comma, .tsv → tab).
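To make the idea of chunked import concrete, here is a minimal base-R sketch that reads a delimited file a fixed number of rows at a time and guesses the separator from the extension. It is not how hdf5_import() is implemented. The helper names guess_sep() and read_in_chunks() are hypothetical, the demo file is synthetic, and each chunk is only summarized rather than written to HDF5.

```r
# Hypothetical helper: pick a separator from the file extension.
guess_sep <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
         csv = ",",
         tsv = "\t",
         stop("Unknown extension: pass the separator explicitly"))
}

# Hypothetical helper: read a numeric delimited file `chunk_rows` lines at a
# time and apply `fun` to each chunk, so only one chunk is in RAM at once.
read_in_chunks <- function(path, chunk_rows = 100, fun = colSums) {
  sep <- guess_sep(path)
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- strsplit(readLines(con, n = 1), sep, fixed = TRUE)[[1]]
  results <- list()
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) break
    chunk <- read.table(text = lines, sep = sep, col.names = header)
    results[[length(results) + 1]] <- fun(as.matrix(chunk))
  }
  results
}

# Demo on a small temporary CSV (synthetic data, not the tutorial's dataset)
tmp <- tempfile(fileext = ".csv")
demo <- data.frame(a = 1:250, b = 251:500)
write.csv(demo, tmp, row.names = FALSE, quote = FALSE)

chunk_sums <- read_in_chunks(tmp, chunk_rows = 100)  # 3 chunks: 100, 100, 50 rows
total <- Reduce(`+`, chunk_sums)                     # combine per-chunk column sums
print(total)
```

The column sums are accumulated across chunks, which is the same pattern a chunked importer can use to process a file far larger than memory.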
3.3 Example: Importing a CSV File
We use a real clinical dataset (colesterol.csv) with cholesterol and metabolic measurements:
if (!file.exists("colesterol.csv")) {
  stop("colesterol.csv not found. Please ensure it's in the working directory.")
}
file_size_kb <- file.info("colesterol.csv")$size / 1024
cat("File size:", round(file_size_kb, 1), "KB\n")
File size: 152.1 KB
cat("\nFirst few lines of the CSV:\n")
First few lines of the CSV:
head(read.csv("colesterol.csv", nrows = 3))
 TCholesterol Age Insulin Creatinine BUN LLDR Triglycerides
1 223.7348 55.25039 14.90246 0.9026095 10.22067 1.450970 199.0737
2 248.1820 53.47404 25.12592 0.8710345 17.10225 1.002655 169.7877
3 180.2071 58.54268 12.95197 0.8882435 11.21530 1.073924 150.2938
HDL_C LDL_C Sex
1 38.37616 107.09286 0
2 36.35374 83.66443 1
3 55.91920 108.02967 1
Now import it into HDF5. The result is an HDF5Matrix pointing to the newly created dataset — no data loaded into RAM yet:
clinical_h5 <- hdf5_import(
  source = "colesterol.csv",
  filename = "clinical_data.hdf5",
  dataset = "clinical/measurements",
  header = TRUE,
  overwrite = TRUE
)
cat("✓ File imported to HDF5\n")
✓ File imported to HDF5
clinical_h5
HDF5Matrix object
File: clinical_data.hdf5
Path: clinical/measurements
Dimensions: 1000 x 10
Type:
Status: OPEN
hdf5_import() writes the numeric content of the file as the main dataset. If header = TRUE, column names are preserved as dimension names on the HDF5Matrix object and can be retrieved with colnames(). If rownames = TRUE, the first column is treated as row identifiers.
3.4 Inspect and Access the Result
# List what was written to the file
list_datasets("clinical_data.hdf5")
[1] "clinical/.measurements_dimnames/2" "clinical/measurements"
# Dimensions: no data loaded yet
dim(clinical_h5)
[1] 1000 10
# Read the first 5 rows: only this block is fetched from disk
as.matrix(clinical_h5[1:5, ])
 TCholesterol Age Insulin Creatinine BUN LLDR Triglycerides
[1,] 223.7348 55.25039 14.90246 0.9026095 10.22067 1.450970 199.0737
[2,] 248.1820 53.47404 25.12592 0.8710345 17.10225 1.002655 169.7877
[3,] 180.2071 58.54268 12.95197 0.8882435 11.21530 1.073924 150.2938
[4,] 200.1522 62.58284 28.27819 0.9361357 12.79399 1.129373 150.0377
[5,] 234.3281 68.70254 13.21404 0.7775238 11.39689 1.235655 161.8820
HDL_C LDL_C Sex
[1,] 38.37616 107.09286 0
[2,] 36.35374 83.66443 1
[3,] 55.91920 108.02967 1
[4,] 51.89241 108.83731 1
[5,] 45.25873 94.98369 1
# Column names from the CSV header
colnames(clinical_h5)
 [1] "TCholesterol" "Age" "Insulin" "Creatinine"
[5] "BUN" "LLDR" "Triglycerides" "HDL_C"
[9] "LDL_C" "Sex"
3.5 Working with Different Delimiters
hdf5_import() auto-detects the separator for .csv and .tsv files. For other formats, pass sep explicitly:
# Tab-separated file (.tsv extension auto-detects tab separator)
hdf5_import("data.tsv", "output.hdf5", "imported/data")
# Custom separator (e.g., semicolons — sep required)
hdf5_import("data.txt", "output.hdf5", "imported/data", sep = ";")
# Import from a URL — downloaded and converted automatically
hdf5_import(
  source = "https://example.com/data.csv",
  filename = "downloaded.hdf5",
  dataset = "data/measurements"
)
4 Managing Multiple Datasets
One HDF5 file can hold many datasets organized into groups. This is the natural way to keep a project’s data together: raw inputs, quality-controlled versions, intermediate results, and final outputs all in a single file, navigable and self-documenting.
4.1 Creating a Multi-Dataset File
set.seed(100)
genotype <- matrix(sample(0:2, 1000*500, replace = TRUE), 1000, 500)
phenotype <- matrix(rnorm(1000*10), 1000, 10)
covariates <- matrix(rnorm(1000*5), 1000, 5)
project_file <- "multi_dataset_project.hdf5"
geno_h5 <- hdf5_create_matrix(project_file, "genetics/snps",
data = genotype, overwrite = TRUE)
pheno_h5 <- hdf5_create_matrix(project_file, "phenotypes/traits",
data = phenotype, overwrite = TRUE)
cov_h5 <- hdf5_create_matrix(project_file, "phenotypes/covariates",
data = covariates, overwrite = TRUE)
cat("✓ Multi-dataset file created\n")
✓ Multi-dataset file created
4.2 Inspecting File Contents
list_datasets(project_file)
[1] "genetics/snps" "phenotypes/covariates" "phenotypes/traits"
Good group-naming strategies:
By data type:
- /genetics/: SNP matrices, genomic data
- /phenotypes/: Clinical measurements, traits
- /results/: Analysis outputs (PCA, regression)
By processing stage:
- /raw/: Original imported data
- /qc/: Quality-controlled data
- /normalized/: Ready for analysis
By analysis:
- /pca/: Components, loadings, variance explained
- /gwas/: Association results
The path "group/subgroup/dataset" in hdf5_create_matrix() creates as many nesting levels as you need.
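Because the hierarchy is encoded entirely in the path string, ordinary string tools are enough to reason about it. The sketch below uses only base R on a hand-written vector of paths (an assumption standing in for whatever a listing function returns) to show how nesting depth and top-level groups can be recovered.

```r
# Dataset paths, written by hand for illustration
paths <- c("genetics/snps", "phenotypes/covariates",
           "phenotypes/traits", "results/pca/components")

# Split each path into its nesting levels
levels_list <- strsplit(paths, "/", fixed = TRUE)
depth <- lengths(levels_list)            # number of levels in each path

# Top-level group and final dataset name of each path
top_group <- vapply(levels_list, `[`, character(1), 1)
dataset   <- basename(paths)

# Dataset names organised by top-level group
by_group <- split(dataset, top_group)
print(by_group)
```

A quick summary like this is a cheap way to check that a growing file still follows the naming scheme you planned.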
5 Adding and Replacing Datasets
5.1 Adding New Datasets
Calling hdf5_create_matrix() on an existing file with overwrite = FALSE adds the new dataset without affecting others:
quality_scores <- matrix(runif(1000*500), 1000, 500)
qc_h5 <- hdf5_create_matrix(
filename = project_file,
dataset = "quality_control/scores",
data = quality_scores,
overwrite = FALSE
)
list_datasets(project_file)
[1] "genetics/snps" "phenotypes/covariates" "phenotypes/traits"
[4] "quality_control/scores"
5.2 Replacing Datasets
To replace an existing dataset with new content, set overwrite = TRUE. The new data is written in place and the old content is discarded:
# Replace a dataset with updated values
updated_scores <- quality_scores * 1.1
hdf5_create_matrix(
filename = project_file,
dataset = "quality_control/scores",
data = updated_scores,
overwrite = TRUE
)
BigDataStatMeth focuses on statistical computation rather than file management, so it does not include a dataset-deletion function. The overwrite = TRUE option replaces the content of a dataset but may not immediately reduce the physical file size on disk: HDF5 files track freed space internally and reuse it for future writes.
For true physical deletion and file compaction (reducing the file size after removing datasets), use the h5repack command-line utility from the HDF Group, which rewrites the file without the deleted content. This is an external HDF5 tool documented at support.hdfgroup.org.
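If you prefer to stay inside R, h5repack can be called through system2(). The wrapper below is a minimal sketch: compact_hdf5() is a hypothetical helper and not part of BigDataStatMeth, the file names are placeholders, and it assumes h5repack is on your PATH and is invoked in its basic form (input file, then output file).

```r
# Hypothetical helper: compact an HDF5 file with the external h5repack tool.
# Returns TRUE on success, FALSE if the tool or the input file is missing
# or if h5repack exits with a non-zero status.
compact_hdf5 <- function(infile, outfile) {
  tool <- Sys.which("h5repack")
  if (!nzchar(tool) || !file.exists(infile)) {
    return(FALSE)
  }
  status <- system2(tool, args = c(shQuote(infile), shQuote(outfile)))
  isTRUE(status == 0)
}

# Example call with placeholder file names
ok <- compact_hdf5("multi_dataset_project.hdf5",
                   "multi_dataset_project_packed.hdf5")
```

Writing to a new output file and swapping it in only after checking the result keeps the original intact if the repack fails.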
6 Efficient Data Access
6.1 Reading Only What You Need
One of the most important features of HDF5 is selective I/O: you can read any subset of a dataset — specific rows, specific columns, or an arbitrary block — without loading the entire matrix into memory. This is what makes it possible to analyze 100 GB files on a 16 GB laptop.
With HDF5Matrix objects, this selective reading happens through standard R bracket syntax. Only the requested indices are fetched from disk:
# Open the dataset as an HDF5Matrix
snps_h5 <- hdf5_matrix(project_file, "genetics/snps")
# Read only the first 100 rows and 50 columns — the rest stays on disk
subset_data <- as.matrix(snps_h5[1:100, 1:50])
cat("Subset dimensions:", dim(subset_data), "\n")
Subset dimensions: 100 50
cat("Preview (first 5×5):\n")
Preview (first 5×5):
print(subset_data[1:5, 1:5])
 [,1] [,2] [,3] [,4] [,5]
[1,] 1 1 0 0 2
[2,] 2 2 1 1 1
[3,] 1 1 0 1 2
[4,] 2 2 0 2 2
[5,] 0 2 0 2 0
cat("\n✓ Loaded", prod(dim(subset_data)), "values\n")
✓ Loaded 5000 values
cat(" Full dataset has", prod(dim(snps_h5)), "values\n")
 Full dataset has 5e+05 values
snps_h5[1:100, 1:50] reads only that 100×50 block from the HDF5 file. as.matrix(snps_h5[1:100, 1:50]) then converts it to a standard R matrix in RAM.
The two steps are often combined as shown above. Avoid as.matrix(snps_h5) without subsetting if the full matrix is large — that loads everything into RAM, defeating the purpose of HDF5 storage.
6.2 Flexible Indexing
You can use any vector of row or column indices, not just contiguous ranges:
# Specific columns only — useful for targeting selected variables
selected_cols <- snps_h5[, c(1, 10, 20, 30, 40)]
cat("Selected columns dimensions:", dim(selected_cols), "\n")
Selected columns dimensions: 1000 5
# Every 10th row — useful for quick overview of large datasets
sparse_sample <- as.matrix(snps_h5[seq(1, 1000, by = 10), 1:10])
cat("Sparse sample dimensions:", dim(sparse_sample), "\n")
Sparse sample dimensions: 100 10
cat("10× reduction in rows vs full dataset\n")
10× reduction in rows vs full dataset
For exploration, the first N rows:
X[1:100, ]
For specific variables:
X[, c(5, 10, 15)]
For random sampling:
rows <- sample(1:10000, 500)
as.matrix(X[rows, ])
The key principle: load only what you need. Combine subsetting with as.matrix() only when your analysis tool requires an in-memory matrix. For BigDataStatMeth operations like %*%, crossprod(), or prcomp(), you pass the HDF5Matrix directly and no explicit loading is needed.
7 Working with Large Files
7.1 How BigDataStatMeth Handles Scale
When you apply a BigDataStatMeth operation to an HDF5Matrix — matrix multiplication, PCA, SVD — the package manages everything automatically:
- Opens the file and navigates to the dataset
- Reads the data in optimally sized blocks
- Performs the computation block by block
- Writes partial results and accumulates them
- Returns a new HDF5Matrix pointing to the output
You pass the object and call the function. Everything else is handled internally:
# BigDataStatMeth handles everything internally:
X <- hdf5_matrix("huge_file.hdf5", "data/matrix_A")
Y <- hdf5_matrix("huge_file.hdf5", "data/matrix_B")
result <- X %*% Y
# ↑ Behind the scenes:
# Blocks of X and Y are read from disk in turn
# Partial products are computed and accumulated
# Final result written to a new HDF5 dataset
# result is an HDF5Matrix pointing to that dataset
With BigDataStatMeth’s S3 interface, you never need to manage block sizes, file handles, or read/write patterns manually. The package optimizes these decisions automatically based on matrix dimensions and the global options set via hdf5matrix_options().
Manual block control is only available when developing new methods using the C++ API — for advanced users implementing custom statistical algorithms where fine-grained control over memory layout and I/O patterns is required.
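The block-by-block idea described in this section can be illustrated in plain R. The sketch below is not BigDataStatMeth's algorithm: blocked_multiply() is a hypothetical helper that works on ordinary in-memory matrices, splitting the shared dimension into blocks and accumulating partial products, so the result can be checked against the usual %*%.

```r
# Hypothetical helper: compute A %*% B by accumulating partial products over
# blocks of the shared (inner) dimension. With disk-backed data, each pair of
# blocks would be read from the file instead of sliced from memory.
blocked_multiply <- function(A, B, block_size = 64) {
  stopifnot(ncol(A) == nrow(B))
  out <- matrix(0, nrow(A), ncol(B))
  for (s in seq(1, ncol(A), by = block_size)) {
    idx <- s:min(s + block_size - 1, ncol(A))
    out <- out + A[, idx, drop = FALSE] %*% B[idx, , drop = FALSE]
  }
  out
}

set.seed(1)
A <- matrix(rnorm(200 * 150), 200, 150)
B <- matrix(rnorm(150 * 30), 150, 30)

C_blocked <- blocked_multiply(A, B, block_size = 64)
max_diff  <- max(abs(C_blocked - A %*% B))   # should be at rounding-error level
```

Only one block of each operand plus the output needs to be held at a time, which is why this pattern scales to matrices larger than RAM.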
7.2 Checking Dataset Dimensions Without Loading Data
hdf5_matrix() opens an existing dataset and gives you access to its properties instantly, without reading any data into memory:
# Open datasets to inspect their dimensions
snps_h5 <- hdf5_matrix(project_file, "genetics/snps")
traits_h5 <- hdf5_matrix(project_file, "phenotypes/traits")
cat("File size:", round(file.info(project_file)$size / (1024^2), 2), "MB\n\n")
File size: 2.83 MB
cat("SNP matrix: ", nrow(snps_h5), "samples ×", ncol(snps_h5), "SNPs\n")
SNP matrix: 1000 samples × 500 SNPs
cat("Trait matrix:", nrow(traits_h5), "samples ×", ncol(traits_h5), "traits\n")
Trait matrix: 1000 samples × 10 traits
close(snps_h5)
close(traits_h5)
This pattern (open, check dimensions, plan your analysis, then proceed) is useful for verifying file contents after a pipeline step or before launching a long computation.
8 Data Organization Best Practices
8.1 1. Use Descriptive Names
The path string in hdf5_create_matrix() is your documentation. A name that describes the data and its processing state is far more useful six months later than a cryptic abbreviation:
# Descriptive: self-explanatory six months later
hdf5_create_matrix("study.hdf5", "gwas_diabetes_2024/genotypes_qc_maf05",
data = data)
# Avoid: what is g1/d?
hdf5_create_matrix("data.hdf5", "g1/d",
data = data)
8.2 2. Document Your Structure
Keep a short text file alongside your HDF5 file describing its organization. This takes two minutes and saves hours of confusion later:
structure_doc <- "
HDF5 File Structure
===================
/genetics/snps - Genotype matrix (samples × SNPs)
/phenotypes/traits - Clinical measurements
/phenotypes/covariates - Age, sex, ancestry PCs
/results/pca - PCA results from genetics data
/results/pca/components - Sample scores
/results/pca/loadings - SNP loadings
"
writeLines(structure_doc, "project_structure.txt")
8.3 3. Use Consistent Dimensions
Keep samples in rows and features in columns throughout your project. This is the standard convention in statistics and makes combining datasets straightforward — a sample that appears as row 42 in the genotype matrix should be row 42 in the phenotype matrix:
# Consistent convention throughout the project
genotypes # n_samples × n_snps
phenotypes # n_samples × n_traits
results # n_samples × n_components
# All share the same row ordering → trivial to align
R and HDF5 lay out matrices differently in storage. This is worth understanding once so it never surprises you again.
R uses column-major order (like Fortran): data is stored column by column in memory. HDF5 uses row-major order (like C/C++): data is stored row by row on disk.
When you create a matrix in R and save it to HDF5:
# In R: 3 samples (rows) × 5 SNPs (columns)
genotypes <- matrix(1:15, nrow = 3, ncol = 5)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 4 7 10 13
# [2,] 2 5 8 11 14
# [3,] 3 6 9 12 15
X <- hdf5_create_matrix("test.hdf5", "data/geno", data = genotypes)
If you open the file in HDFView or h5dump, you’ll see the data appears transposed (5 × 3). This is normal, expected, and handled automatically.
BigDataStatMeth manages the conversion transparently:
✓ When you read with as.matrix() or subsetting X[i, j] → you get the correct R matrix
✓ When you compute with crossprod() or %*% → dimensions are correct
✓ When you save results → they are stored correctly for R
The only time this matters is if you inspect HDF5 files with external tools like HDFView and wonder why dimensions appear swapped. They’re not wrong — just viewed from the HDF5 side rather than the R side. Trust BigDataStatMeth to handle the storage details.
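The apparent transposition follows directly from the two storage orders, and base R can reproduce it. This sketch illustrates the general layout mismatch only and makes no claim about BigDataStatMeth's internals: it takes the column-major buffer of the 3 × 5 matrix used above and reinterprets it the way a row-major reader with reversed dimensions would.

```r
m <- matrix(1:15, nrow = 3, ncol = 5)

# R stores the matrix column by column: 1, 2, 3, 4, ...
col_major <- as.vector(m)

# A row-major reader sees the same buffer with the dimensions reversed:
# it fills a 5 x 3 array row by row from the unchanged buffer
as_seen_row_major <- matrix(col_major, nrow = 5, ncol = 3, byrow = TRUE)

# That view is the transpose of the original R matrix
is_transpose <- identical(as_seen_row_major, t(m))
print(is_transpose)
```

No values move in storage. Only the interpretation of the same sequence changes, which is why the conversion can be handled transparently.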
9 Interactive Exercise
9.1 Practice: Designing Your Project’s HDF5 Structure
Effective HDF5 organization makes the difference between a maintainable project and a confusing mess six months later. This exercise helps you think through organization before starting real analyses.
# Exercise: Plan and create an HDF5 structure for your project
# Scenario: multi-omic study with:
# - Genomic data (SNPs): 50,000 individuals × 500,000 variants
# - Transcriptomic data (RNA-seq): same 50,000 individuals × 20,000 genes
# - Phenotype data: 50,000 individuals × 50 measurements
project_file <- "my_study.hdf5"
# Scaled-down stand-ins for the real matrices so the exercise runs quickly
# (dimensions here are illustrative only)
set.seed(1)
dummy_geno <- matrix(sample(0:2, 200 * 100, replace = TRUE), 200, 100)
dummy_expr <- matrix(rnorm(200 * 40), 200, 40)
dummy_pheno <- matrix(rnorm(200 * 5), 200, 5)
# Create the structure: paths define the hierarchy automatically
geno_h5 <- hdf5_create_matrix(project_file, "raw/genomics/snps",
                              data = dummy_geno, overwrite = TRUE)
expr_h5 <- hdf5_create_matrix(project_file, "raw/transcriptomics/expression",
                              data = dummy_expr, overwrite = TRUE)
pheno_h5 <- hdf5_create_matrix(project_file, "raw/phenotypes/measurements",
                               data = dummy_pheno, overwrite = TRUE)
# Inspect the resulting hierarchy
list_datasets(project_file)
# Access selectively: only what you need for a given step
geno_block <- as.matrix(geno_h5[1:100, 1:50])  # first 100 individuals × first 50 SNPs
# Close when done
hdf5_close_all()
gc()
Think through these design decisions for your actual or planned projects:
1. File Organization Strategy:
   - One large file vs. multiple smaller files?
   - For your project, which approach makes more sense?
   - How does your team typically share data?
   - Will different people access different parts?
2. Group Hierarchy:
   - How deep should your path structure go?
   - "data/type/version" vs. "data_type_version"?
   - Can you navigate your structure six months from now?
   - Would a new collaborator understand it?
3. Dataset Naming:
   - "matrix1" vs. "filtered_normalized_snps_maf05"?
   - How do you indicate processing steps in the name?
   - Do you need version numbers in paths?
4. Metadata Management:
   - Where do row/column names go? (Use rownames() and colnames() on HDF5Matrix)
   - How do you store processing parameters?
   - How do you link datasets that must stay aligned (same row order)?
5. Practical Constraints:
   - How much disk space do you have?
   - Will files be transferred between systems?
   - Backup strategy for large files?
6. Future Expansion:
   - What if you add more samples next year?
   - New data types (proteomics, metabolomics)?
   - Where does each new piece go?
There’s no universal correct organization — it depends on your project scale, team structure, and workflow. The goal is thinking through these questions before you have 50 analysis scripts depending on a particular structure.
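For the alignment question in particular, one common approach is to carry sample identifiers with every table and reorder by ID before storing. The sketch below uses base R and synthetic IDs. The variable names are illustrative and nothing here is specific to BigDataStatMeth.

```r
# Synthetic sample IDs; the phenotype table arrives in a different order
geno_ids <- paste0("S", 1:6)
set.seed(42)
pheno_ids <- sample(geno_ids)
pheno <- matrix(rnorm(6 * 2), 6, 2,
                dimnames = list(pheno_ids, c("bmi", "glucose")))

# Reorder phenotype rows to follow the genotype row order
ord <- match(geno_ids, pheno_ids)
stopifnot(!anyNA(ord))                 # every genotyped sample has a phenotype
pheno_aligned <- pheno[ord, , drop = FALSE]

# Row i now refers to the same sample in both tables
aligned <- identical(rownames(pheno_aligned), geno_ids)
print(aligned)
```

Running a check like this once, before writing the matrices to the file, means every later analysis can rely on matching row order.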
10 Cleanup
hdf5_close_all()
gc()
 used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 893208 47.8 1501440 80.2 NA 1501440 80.2
Vcells 2435000 18.6 8388608 64.0 36864 7101220 54.2
file.remove("clinical_data.hdf5", "multi_dataset_project.hdf5")
[1] TRUE TRUE
cat("✓ Tutorial cleanup complete\n")
✓ Tutorial cleanup complete
11 What You’ve Learned
✅ Import text files (CSV, TSV, URLs) directly into HDF5 with hdf5_import()
✅ Organize multiple datasets in one file using group paths
✅ Add and replace datasets in existing files
✅ Access data selectively using HDF5Matrix subsetting
✅ Inspect file contents and dimensions without loading data
✅ Apply best practices for file organization in research projects
12 Next Steps
Continue the tutorial series:
- ✅ Getting Started
- ✅ Working with HDF5 Matrices (you are here)
- → Your First Analysis — Complete analysis workflow
Ready for real analysis:
- Implementing PCA — Full PCA workflow on genomic data
- Implementing CCA — Canonical correlation analysis
13 Key Takeaways
Let’s consolidate your understanding of HDF5 file management and organization for big data projects.
13.1 Essential Concepts
HDF5 is a data management system, not just a file format. It provides hierarchical organization (groups and datasets), selective I/O (read only what you need), and cross-platform compatibility. You’re not just storing data — you’re organizing an entire analysis project in a structured, queryable format that can outlive any particular analysis script or software version.
hdf5_import() converts text data to HDF5 in one step and returns a ready-to-use HDF5Matrix object. It handles the chunk-by-chunk reading that would otherwise crash memory, auto-detects separators, and works with local files, compressed archives, and remote URLs alike. File conversion is a one-time investment: a 50 GB CSV converted to HDF5 can be accessed selectively in milliseconds thereafter.
Organization decisions made early are hard to change later. Once you have 50 analysis scripts expecting data at "raw/genomics/snps", reorganizing the file breaks everything. Take time designing your group structure before populating it. The path convention in hdf5_create_matrix() makes this planning tangible — sketch your paths on paper before writing a single line of code.
Selective I/O is the performance engine of HDF5. Reading X[1:100, 1:50] from a 50,000 × 500,000 HDF5Matrix fetches exactly 5,000 values regardless of the matrix’s total size. This enables analysis of datasets that are many times larger than available RAM. Structuring your code to read only needed data — rather than loading entire matrices and then subsetting — is the single most impactful practice for working with large files.
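A rough size calculation shows the scale of the difference. The figures below assume 8 bytes per value (double precision) and count only the values requested, ignoring any chunking or compression overhead in the file.

```r
bytes_per_value <- 8                     # assumption: double precision

full_values   <- 50000 * 500000          # the whole matrix
subset_values <- 100 * 50                # the block X[1:100, 1:50]

full_gib   <- full_values * bytes_per_value / 1024^3
subset_kib <- subset_values * bytes_per_value / 1024

cat(sprintf("Full matrix: %.1f GiB\n", full_gib))
cat(sprintf("Subset:      %.1f KiB\n", subset_kib))
```

Under these assumptions the full matrix needs roughly 186 GiB in memory, while the subset needs about 39 KiB.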
HDF5Matrix objects manage file handles automatically. Files stay open while their objects are alive, improving performance for repeated access. Explicit hdf5_close_all(); gc() at the end of an analysis section ensures clean teardown, prevents stale handles on re-runs, and is the reliable way to reset state during interactive development.
File size on disk may not reflect deletions immediately. When datasets are overwritten or the file is restructured, HDF5 tracks freed space internally for reuse by subsequent writes. For physical compaction — actually reducing the file size — use the external h5repack utility from the HDF Group.
13.2 When to Use HDF5 File Management
✅ Use hierarchical HDF5 organization when:
You have multiple related datasets: Genomics, transcriptomics, and phenotypes all in one project. Groups keep everything organized in a single file rather than managing dozens of separate files with mismatched naming conventions.
Your project will grow over time: Starting organized prevents refactoring pain. Hierarchical structure accommodates new samples, time points, or data types without breaking existing paths.
Multiple team members access the same data: Clear organization (/raw/, /qc/, /results/) makes files self-documenting. Collaborators can navigate without constant explanations.
You need to document processing steps: Dataset paths like "qc/genotypes_maf05_missing05" encode processing history. The file structure itself documents your analysis pipeline.
❌ Simpler approaches work better when:
You have a single dataset — If your entire project is one matrix, elaborate group organization adds no value.
Data won’t grow or change — For static, finalized datasets, simple flat structure suffices.
Quick, temporary analyses — For exploratory work you won’t reuse, organizational overhead outweighs benefits.