library(BigDataStatMeth)
library(rhdf5)
Working with HDF5 Matrices
File Operations, Data Conversion, and Management
1 Overview
This tutorial teaches you the practical skills for working efficiently with HDF5 files: converting data from common formats, organizing complex projects with multiple datasets, and performing routine file operations. Think of this as learning the file system before learning programming - you need to understand how to organize and access your data before diving into statistical analyses.
1.1 What You’ll Learn
By the end of this tutorial, you will:
- Convert text files (CSV, TSV) to HDF5 format efficiently
- Organize multiple related datasets in one HDF5 file
- Use hierarchical groups to structure complex projects
- Add, inspect, and remove datasets safely
- Read data efficiently (whole datasets vs. subsets)
- Apply best practices for HDF5 file organization
- Understand when to use rhdf5 vs. BigDataStatMeth functions
2 Prerequisites
Complete the Getting Started tutorial first. You should have:
- BigDataStatMeth installed and working
- Basic understanding of HDF5 structure (groups and datasets)
- R and rhdf5 loaded
3 Converting Text Files to HDF5
3.1 The Problem
You have large text files (CSV, TSV, or custom delimiters) that won’t fit comfortably in RAM. Loading them with read.csv() or read.table() is slow or impossible.
3.2 The Solution
Use bdImportTextFile_hdf5() to convert directly to HDF5 format, processing the file in chunks.
3.3 Example: Converting a CSV File
We’ll use a real clinical dataset (colesterol.csv) with cholesterol and metabolic measurements:
# Check if file exists (should be in same directory as tutorial)
if (!file.exists("colesterol.csv")) {
stop("colesterol.csv not found. Please ensure it's in the working directory.")
}
# Check file size
file_size_kb <- file.info("colesterol.csv")$size / 1024
cat("File size:", round(file_size_kb, 1), "KB\n")File size: 152.1 KB
# Preview first few lines
cat("\nFirst few lines of the CSV:\n")
First few lines of the CSV:
head(read.csv("colesterol.csv", nrows = 3))
  TCholesterol Age Insulin Creatinine BUN LLDR Triglycerides
1 223.7348 55.25039 14.90246 0.9026095 10.22067 1.450970 199.0737
2 248.1820 53.47404 25.12592 0.8710345 17.10225 1.002655 169.7877
3 180.2071 58.54268 12.95197 0.8882435 11.21530 1.073924 150.2938
  HDL_C LDL_C Sex
1 38.37616 107.09286 0
2 36.35374 83.66443 1
3 55.91920 108.02967 1
Now convert to HDF5:
# Convert CSV to HDF5
result <- bdImportTextFile_hdf5(
filename = "colesterol.csv",
sep = ",", # CSV uses commas
outputfile = "clinical_data.hdf5",
outGroup = "clinical",
outDataset = "measurements",
header = TRUE, # First row is column names
overwrite = TRUE
)
cat("✓ File converted to HDF5\n")✓ File converted to HDF5
cat("Output file:", result$fn, "\n")Output file: clinical_data.hdf5
cat("Output dataset:", result$ds, "\n")Output dataset: clinical/measurements
bdImportTextFile_hdf5() imports numeric data. If the first column contains character-based IDs, it is stored separately as row names, and column names are stored in their own dataset. The numeric matrix itself contains only the numerical measurements (Age, Cholesterol, Insulin, etc.).
3.4 Verify the Conversion
# Inspect structure
h5ls(result$fn)
  group name otype dclass dim
0 / clinical H5I_GROUP
1 /clinical .measurements_dimnames H5I_GROUP
2 /clinical/.measurements_dimnames 2 H5I_DATASET COMPOUND 10
3 /clinical measurements H5I_DATASET FLOAT 1000 x 10
# Read back a portion
converted_data <- h5read(result$fn, result$ds,
index = list(1:5, NULL))
converted_data
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 223.7348 55.25039 14.90246 0.9026095 10.22067 1.450970 199.0737 38.37616
[2,] 248.1820 53.47404 25.12592 0.8710345 17.10225 1.002655 169.7877 36.35374
[3,] 180.2071 58.54268 12.95197 0.8882435 11.21530 1.073924 150.2938 55.91920
[4,] 200.1522 62.58284 28.27819 0.9361357 12.79399 1.129373 150.0377 51.89241
[5,] 234.3281 68.70254 13.21404 0.7775238 11.39689 1.235655 161.8820 45.25873
[,9] [,10]
[1,] 107.09286 0
[2,] 83.66443 1
[3,] 108.02967 1
[4,] 108.83731 1
[5,] 94.98369 1
# Read column names
colnames_data <- h5read(result$fn, result$ds_cols)[,1]
colnames_data
[1] "TCholesterol" "Age" "Insulin" "Creatinine"
[5] "BUN" "LLDR" "Triglycerides" "HDL_C"
[9] "LDL_C" "Sex"
3.5 Working with Different Delimiters
# Tab-separated file
bdImportTextFile_hdf5(
filename = "data.tsv",
sep = "\t",
outputfile = "data.hdf5",
outGroup = "imported",
outDataset = "data"
)
# Custom delimiter (e.g., semicolon)
bdImportTextFile_hdf5(
filename = "data.txt",
sep = ";",
outputfile = "data.hdf5",
outGroup = "imported",
outDataset = "data"
)
4 Managing Multiple Datasets
One HDF5 file can store multiple datasets, organized in groups. This is ideal for related analyses.
4.1 Creating a Multi-Dataset File
# Create three related datasets
set.seed(100)
genotype <- matrix(sample(0:2, 1000*500, replace = TRUE), 1000, 500)
phenotype <- matrix(rnorm(1000*10), 1000, 10)
covariates <- matrix(rnorm(1000*5), 1000, 5)
# Store in same file, different groups
project_file <- "multi_dataset_project.hdf5"
bdCreate_hdf5_matrix(
filename = project_file,
object = genotype,
group = "genetics",
dataset = "snps",
overwriteFile = TRUE
)
$fn
[1] "multi_dataset_project.hdf5"
$ds
[1] "genetics/snps"
bdCreate_hdf5_matrix(
filename = project_file,
object = phenotype,
group = "phenotypes",
dataset = "traits"
)
$fn
[1] "multi_dataset_project.hdf5"
$ds
[1] "phenotypes/traits"
bdCreate_hdf5_matrix(
filename = project_file,
object = covariates,
group = "phenotypes",
dataset = "covariates"
)
$fn
[1] "multi_dataset_project.hdf5"
$ds
[1] "phenotypes/covariates"
cat("✓ Multi-dataset file created\n")✓ Multi-dataset file created
4.2 Inspecting File Contents
# List all contents
h5ls(project_file)
  group name otype dclass dim
0 / genetics H5I_GROUP
1 /genetics snps H5I_DATASET INTEGER 1000 x 500
2 / phenotypes H5I_GROUP
3 /phenotypes covariates H5I_DATASET FLOAT 1000 x 5
4 /phenotypes traits H5I_DATASET FLOAT 1000 x 10
Good organization strategies:
By data type:
- /genetics/ - SNP matrices, genomic data
- /phenotypes/ - Clinical measurements, traits
- /results/ - Analysis outputs (PCA, regression)
By processing stage:
- /raw/ - Original imported data
- /qc/ - Quality-controlled data
- /normalized/ - Normalized, ready for analysis
By analysis:
- /pca_analysis/ - PCA components, loadings
- /gwas_results/ - Association results
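If you like to lay out the skeleton before importing anything, rhdf5 can pre-create empty groups (BigDataStatMeth also creates groups on demand when you write datasets, as seen above). A minimal sketch with illustrative group names:
# Pre-create a processing-stage layout with rhdf5 (names are illustrative)
h5createFile("organized_project.hdf5")
for (g in c("raw", "qc", "normalized", "results")) {
  h5createGroup("organized_project.hdf5", g)
}
h5ls("organized_project.hdf5")
file.remove("organized_project.hdf5")   # remove the demo file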
5 Adding and Removing Datasets
5.1 Adding New Datasets
# Add another dataset to existing file
quality_scores <- matrix(runif(1000*500), 1000, 500)
bdCreate_hdf5_matrix(
filename = project_file,
object = quality_scores,
group = "quality_control",
dataset = "scores",
overwriteFile = FALSE # Don't overwrite the file
)
$fn
[1] "multi_dataset_project.hdf5"
$ds
[1] "quality_control/scores"
# Verify it was added
h5ls(project_file)
  group name otype dclass dim
0 / genetics H5I_GROUP
1 /genetics snps H5I_DATASET INTEGER 1000 x 500
2 / phenotypes H5I_GROUP
3 /phenotypes covariates H5I_DATASET FLOAT 1000 x 5
4 /phenotypes traits H5I_DATASET FLOAT 1000 x 10
5 / quality_control H5I_GROUP
6 /quality_control scores H5I_DATASET FLOAT 1000 x 500
5.2 Removing Datasets
To remove datasets, use rhdf5 functions:
# Remove a specific dataset
h5delete(project_file, "/quality_control/scores")
# Remove entire group
h5delete(project_file, "/quality_control")
HDF5 doesn’t reclaim disk space immediately after deletion. The space is marked as free, but the file size doesn’t shrink. To reclaim space, you need to repack the file using the HDF5 command-line tool h5repack.
6 Efficient Data Access
One of the most powerful aspects of HDF5 is partial reading - you can access just the data you need without loading gigabytes into memory. This is the secret to working with datasets larger than your RAM.
You’ll notice we use rhdf5’s h5read() function for reading data. This is intentional! BigDataStatMeth focuses on computation (matrix operations, statistical methods, block-wise algorithms), while rhdf5 excels at file I/O (reading, writing, inspecting).
Division of labor:
- rhdf5: File operations, data access, inspection
- BigDataStatMeth: Matrix algebra, statistics, block-wise processing
There’s no need to reimplement what rhdf5 already does perfectly.
6.1 Reading Subsets
Instead of loading entire matrices (which might be 100 GB), read just what you need:
# Read first 100 samples and first 50 SNPs
subset_data <- h5read(
project_file,
"/genetics/snps",
index = list(1:100, 1:50)
)
cat("Subset dimensions:", dim(subset_data), "\n")Subset dimensions: 100 50
cat("Preview of data (first 5×5):\n")Preview of data (first 5×5):
print(subset_data[1:5, 1:5]) [,1] [,2] [,3] [,4] [,5]
[1,] 1 1 0 0 2
[2,] 2 2 1 1 1
[3,] 1 1 0 1 2
[4,] 2 2 0 2 2
[5,] 0 2 0 2 0
cat("\n✓ Loaded only", prod(dim(subset_data)), "values\n")
✓ Loaded only 5000 values
cat(" Full dataset would have been:", 1000 * 2000, "values\n") Full dataset would have been: 2e+06 values
6.2 Flexible Indexing
You can read specific columns, skip rows, or sample randomly:
# Read specific columns only (useful for specific variables)
column_subset <- h5read(
project_file,
"/genetics/snps",
index = list(NULL, c(1, 10, 20, 30, 40)) # Just these 5 columns
)
cat("\nColumn subset dimensions:", dim(column_subset), "\n")
Column subset dimensions: 1000 5
cat("Selected columns: 1, 10, 20, 30, 40\n")Selected columns: 1, 10, 20, 30, 40
# Read every 10th row (useful for quick checks)
sparse_sample <- h5read(
project_file,
"/genetics/snps",
index = list(seq(1, 1000, by = 10), 1:10)
)
cat("\nSparse sample dimensions:", dim(sparse_sample), "\n")
Sparse sample dimensions: 100 10
cat("Reading every 10th row reduces data by 90%\n")Reading every 10th row reduces data by 90%
For exploration:
h5read(file, dataset, index = list(1:100, NULL))   # First 100 rows
For specific variables:
h5read(file, dataset, index = list(NULL, c(5, 10, 15)))   # Specific columns
For random sampling:
rows <- sample(1:10000, 500)   # Random 500 rows
h5read(file, dataset, index = list(rows, NULL))
The key principle: Load only what you need. This is how you analyze 100 GB files on a 16 GB laptop.
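Before reading, it is worth estimating what a read will cost: R stores numeric values as 8-byte doubles, so memory needs follow directly from the dimensions. A quick sketch with hypothetical sizes:
# Estimate memory from dimensions alone: doubles are 8 bytes each
n_rows <- 100000; n_cols <- 50000            # hypothetical dataset size
full_gb   <- n_rows * n_cols * 8 / 1024^3    # whole matrix: ~37 GB
subset_gb <- 1000 * n_cols * 8 / 1024^3      # a 1,000-row slice: ~0.4 GB
cat("Full:", round(full_gb, 1), "GB; subset:", round(subset_gb, 2), "GB\n")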
7 Working with Large Files
7.1 Understanding BigDataStatMeth’s Internal Management
When you use BigDataStatMeth functions like bdblockmult_hdf5() or bdSVD_hdf5(), the package handles all the complexity internally:
- Opens and closes files automatically
- Reads data in optimal block sizes
- Writes results efficiently
- Manages memory usage
You simply call the function and let BigDataStatMeth do the work:
# BigDataStatMeth handles everything internally:
result <- bdblockmult_hdf5(
filename = "huge_file.hdf5",
group = "data",
A = "matrix_A",
B = "matrix_B"
)
# ↑ Behind the scenes:
# - File opened
# - Data read in blocks
# - Computation performed
# - Results written
# - File closed
# You don't see any of this!
With BigDataStatMeth’s R functions, you never need to manually open/close files or manage blocks. The package does it all.
Manual control is only available when developing new methods using the C++ API. This is for advanced users creating custom statistical methods who need fine-grained control over:
- Block sizes
- Memory allocation
- Read/write patterns
- Custom algorithms
For 99% of users, the automatic management is exactly what you want.
7.2 Checking File Information
Sometimes you just want to know about the data without reading it:
# Get file size
file_size_mb <- file.info(project_file)$size / (1024^2)
cat("File size:", round(file_size_mb, 2), "MB\n")File size: 2.78 MB
# Get dataset dimensions without reading data
h5file <- H5Fopen(project_file)
dim_snps <- dim(h5file$genetics$snps)
cat("\nSNP matrix dimensions:", dim_snps[1], "samples ×", dim_snps[2], "SNPs\n")
SNP matrix dimensions: 1000 samples × 500 SNPs
dim_traits <- dim(h5file$phenotypes$traits)
cat("Trait matrix dimensions:", dim_traits[1], "samples ×", dim_traits[2], "traits\n")Trait matrix dimensions: 1000 samples × 10 traits
H5Fclose(h5file)
This is useful for:
- Planning analyses (how much memory will I need?)
- Verifying dimensions before operations
- Checking if files were created correctly
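An alternative that avoids opening file handles at all: h5ls() reports dimensions as text in its dim column, which you can filter directly. A small sketch:
# Read dimensions from the listing instead of opening the file
info <- h5ls(project_file)
info[info$name == "snps", c("group", "name", "dim")]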
Data Organization Best Practices
1. Use Descriptive Names
# Good naming
bdCreate_hdf5_matrix(filename = "study.hdf5",
                     object = data,
                     group = "gwas_diabetes_2024",
                     dataset = "genotypes_qc_filtered")
# Avoid cryptic names
bdCreate_hdf5_matrix(filename = "data.hdf5",
                     object = data,
                     group = "g1",
                     dataset = "d")
7.3 2. Document Your Structure
Keep a README or comments file describing your HDF5 organization:
# Create a text description
structure_doc <- "
HDF5 File Structure
==================
/genetics/snps - Genotype matrix (samples × SNPs)
/phenotypes/traits - Clinical measurements
/phenotypes/covariates - Age, sex, PCs
/results/pca - PCA results from genetics data
"
# You could store this as an attribute or separate file
writeLines(structure_doc, "project_structure.txt")
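You can also embed the description inside the file itself, so the documentation travels with the data. A sketch using rhdf5 attributes (the attribute name "README" is just a convention, not anything BigDataStatMeth requires):
# Store the description as an attribute on the file's root group
fid <- H5Fopen(project_file)
h5writeAttribute(structure_doc, fid, "README")
H5Fclose(fid)
h5readAttributes(project_file, "/")$README   # read it back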
7.4 3. Use Consistent Dimensions
Always keep samples in rows and features in columns. This is the standard convention in statistics and makes combining datasets much easier:
# Consistent convention
genotypes # n_samples × n_snps
phenotypes # n_samples × n_traits
results # n_samples × n_components
# All have samples in rows → easy to merge/join
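With this convention in place, you can sanity-check that every dataset in a file reports the same sample count. A sketch that parses the h5ls() listing (it assumes all datasets in the file are sample-by-feature matrices):
# Verify all datasets share the same row (sample) count
info <- h5ls(project_file)
ds <- info[info$otype == "H5I_DATASET", ]
n_samples <- as.integer(sub(" x.*", "", ds$dim))   # dim looks like "1000 x 500"
stopifnot(length(unique(n_samples)) == 1)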
This is critical to understand when viewing HDF5 files directly!
R uses column-major order (like Fortran):
- Data is stored column by column in memory
- Matrix[i, j] means: row i, column j
HDF5 uses row-major order (like C/C++):
- Data is stored row by row in memory
- Dataset[i, j] means: row i, column j
7.5 What This Means for You
When you create a matrix in R:
# In R: 3 samples (rows) × 5 SNPs (columns)
genotypes <- matrix(1:15, nrow = 3, ncol = 5)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 4 7 10 13
# [2,] 2 5 8 11 14
# [3,] 3 6 9 12 15
# Save to HDF5
bdCreate_hdf5_matrix(filename = "test.hdf5",
object = genotypes,
group = "data", dataset = "geno")When you view it in HDFView or h5dump, you’ll see the transpose:
# In HDFView: 5 × 3 (appears transposed!)
# [0,0] [0,1] [0,2]
# 1 2 3
# [1,0] [1,1] [1,2]
# 4 5 6
# [2,0] [2,1] [2,2]
# 7 8 9
7.6 Don’t Panic!
This is normal and expected. BigDataStatMeth handles the conversion automatically:
✓ When you read with h5read() → You get the correct R matrix
✓ When you compute with bdCrossprod_hdf5() → Dimensions are correct
✓ When you save results → They’re stored correctly for R
The transposition is handled transparently. You only notice it if you inspect files with external tools like HDFView.
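If you want to convince yourself, a round-trip check takes a few lines. This sketch writes a small matrix with BigDataStatMeth and reads it back with rhdf5 (the file name is arbitrary):
# Round-trip sanity check: write with BigDataStatMeth, read with rhdf5
m <- matrix(1:15, nrow = 3, ncol = 5)
res <- bdCreate_hdf5_matrix(filename = "roundtrip.hdf5", object = m,
                            group = "data", dataset = "geno",
                            overwriteFile = TRUE)
m_back <- h5read(res$fn, res$ds)
stopifnot(all(dim(m_back) == dim(m)), all(m_back == m))   # dims and values match
file.remove("roundtrip.hdf5")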
7.7 Key Takeaway
In R code: Think in R dimensions (samples × features)
In HDFView: Expect to see transposed dimensions
In BigDataStatMeth: Everything works correctly - dimensions match R expectations
The moral: Trust BigDataStatMeth to handle the storage details. Your matrices will always have the dimensions you expect in R.
8 Interactive Exercise
8.1 Practice: Designing Your Project’s HDF5 Structure
Effective HDF5 organization makes the difference between a maintainable project and a confusing mess six months later. This exercise helps you think through organization before starting real analyses.
# Exercise: Plan and create an HDF5 structure for your project
# Scenario: You're analyzing a multi-omic study with:
# - Genomic data (SNPs): 50,000 individuals × 500,000 variants
# - Transcriptomic data (RNA-seq): same 50,000 individuals × 20,000 genes
# - Phenotype data: 50,000 individuals × 50 measurements
# - Sample metadata: IDs, age, sex, ancestry
# Your task: Design the HDF5 structure
# Option 1: Everything in one file
project_file <- "my_study.hdf5"
# How would you organize groups?
# /raw_data/genomics/snps
# /raw_data/transcriptomics/expression
# /raw_data/phenotypes/measurements
# /metadata/samples
# /quality_control/filtered_snps
# /quality_control/normalized_expression
# /results/pca
# /results/associations
# Try creating this structure (define a placeholder matrix first):
dummy_data <- matrix(0, 10, 10)
bdCreate_hdf5_matrix(project_file, dummy_data,
group = "/raw_data/genomics",
dataset = "snps")
# ... add more datasets
# Option 2: Separate files per data type
# genomics.hdf5, transcriptomics.hdf5, phenotypes.hdf5
# When is this better? When worse?
Think through these design decisions for your actual or planned projects:
1. File Organization Strategy:
- One large file vs. multiple smaller files?
- For your project, which approach makes more sense?
- How does your team typically share data?
- Will different people access different parts?
2. Group Hierarchy:
- How deep should your folder structure go?
- /data/type/version vs. /data_type_version?
- Can you navigate your structure 6 months from now?
- Would a new collaborator understand it?
3. Dataset Naming:
- “matrix1” vs. “filtered_normalized_snps_maf05”?
- Descriptive names vs. short names?
- How do you indicate processing steps?
- Do you need version numbers in names?
4. Metadata Management:
- Where do row/column names go?
- How do you store processing parameters?
- Document analysis dates and software versions?
- Link between datasets (which samples are in which matrix)?
5. Practical Constraints:
- How much disk space do you have?
- Will files be transferred between systems?
- Need to share via email/FTP? (file size limits)
- Backup strategy for large files?
6. Future Expansion:
- What if you add more samples next year?
- New data types (proteomics, metabolomics)?
- Reanalysis with different parameters?
- Where does each new piece go?
There’s no universal “correct” organization - it depends on your project scale, team structure, and workflow. The goal is thinking through these questions before you have 50 analysis scripts depending on a particular structure.
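As a starting point, here is a small-scale sketch of Option 1 with tiny placeholder matrices. Real data would be orders of magnitude larger, and whether nested group paths are created in a single call may depend on your BigDataStatMeth version:
# Small-scale sketch of Option 1 (placeholder sizes so it runs instantly)
dummy_snps  <- matrix(sample(0:2, 100 * 50, replace = TRUE), 100, 50)
dummy_expr  <- matrix(rnorm(100 * 20), 100, 20)
dummy_pheno <- matrix(rnorm(100 * 5), 100, 5)
bdCreate_hdf5_matrix("my_study.hdf5", dummy_snps,
                     group = "raw_data/genomics", dataset = "snps",
                     overwriteFile = TRUE)
bdCreate_hdf5_matrix("my_study.hdf5", dummy_expr,
                     group = "raw_data/transcriptomics", dataset = "expression")
bdCreate_hdf5_matrix("my_study.hdf5", dummy_pheno,
                     group = "raw_data/phenotypes", dataset = "measurements")
h5ls("my_study.hdf5")
file.remove("my_study.hdf5")   # remove the demo file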
9 Cleanup
# Close all HDF5 connections
h5closeAll()
# Remove generated HDF5 file
file.remove("clinical_data.hdf5",
"multi_dataset_project.hdf5")[1] TRUE TRUE
cat("✓ Tutorial cleanup complete\n")✓ Tutorial cleanup complete
cat("Note: colesterol.csv is kept for reference\n")Note: colesterol.csv is kept for reference
10 What You’ve Learned
✅ Convert text files (CSV, TSV) to HDF5 format
✅ Manage multiple datasets in one HDF5 file
✅ Organize data using groups and descriptive names
✅ Access data efficiently (subsets, slicing)
✅ Apply best practices for file organization
11 Next Steps
Continue the tutorial series:
- ✅ Getting Started
- ✅ Working with HDF5 Matrices (you are here)
- → Your First Analysis - Complete analysis workflow
Ready for real analysis:
- Implementing PCA - Full PCA workflow on genomic data
- Implementing CCA - Canonical correlation analysis
12 Key Takeaways
Let’s consolidate your understanding of HDF5 file management and organization for big data projects.
12.1 Essential Concepts
HDF5 is a data management system, not just a file format. It provides hierarchical organization (groups and datasets), partial I/O (read only what you need), and cross-platform compatibility. This makes it fundamentally different from flat file formats like CSV. You’re not just storing data - you’re organizing an entire analysis project in a structured, queryable format.
File conversion is a one-time investment that pays dividends throughout your project. Converting a 50 GB CSV file to HDF5 might take 20 minutes, but every subsequent analysis benefits from fast partial reads, organized structure, and efficient access patterns. The upfront cost amortizes across dozens or hundreds of analyses over the project lifetime.
Organization decisions made early are hard to change later. Once you have 50 analysis scripts expecting data at /raw_data/genomics/snps, reorganizing the file breaks everything. Take time designing your structure before populating it. A bit of planning prevents substantial refactoring pain later.
BigDataStatMeth and rhdf5 serve different purposes. BigDataStatMeth creates and operates on matrices (the computational work). rhdf5 provides file I/O and inspection (reading, writing, navigating). You need both: BigDataStatMeth for analysis, rhdf5 for file management. Understanding this division of labor prevents confusion about which tool does what.
Row-major vs. column-major storage is handled automatically. R uses column-major order, HDF5 uses row-major. BigDataStatMeth manages this conversion transparently. Your matrices always have the dimensions you expect in R code, even though they’re stored differently on disk. You only notice the difference if you inspect files with external tools like HDFView - and even then, it’s just a visual curiosity, not a problem.
Efficient HDF5 use means reading strategically. Don’t read entire 100 GB matrices to access a 1 MB subset. Use dimension slicing, work with row/column blocks, and read only required data. The ability to partially read files is HDF5’s main advantage - leverage it. This is why understanding dataset dimensions and organization matters.
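When a computation genuinely needs every row, you can still stream the data block by block instead of loading it all at once. A sketch computing column sums over the SNP matrix from earlier (assuming the file has not yet been removed by the cleanup step):
# Block-wise pass: column sums without holding the full matrix in RAM
block_size <- 200
n <- 1000                                    # total rows, known from h5ls()
col_sums <- 0
for (start in seq(1, n, by = block_size)) {
  rows  <- start:min(start + block_size - 1, n)
  block <- h5read(project_file, "/genetics/snps", index = list(rows, NULL))
  col_sums <- col_sums + colSums(block)
}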
12.2 When to Use HDF5 File Management
Understanding when HDF5’s organizational features help versus when simpler approaches suffice guides practical decisions about file structure and management.
✅ Use hierarchical HDF5 organization when:
You have multiple related datasets - Genomics + transcriptomics + phenotypes all in one project. Groups like /genetics/, /expression/, /clinical/ keep everything organized in a single file rather than managing dozens of separate files.
Your project will grow over time - Starting organized prevents refactoring pain. If you’ll add more samples, time points, or data types later, hierarchical structure accommodates growth naturally.
Multiple team members access the same data - Clear organization (/raw_data/, /quality_control/, /results/) makes files self-documenting. Collaborators can navigate without constant explanations.
You need to document processing steps - Dataset names like genotypes_filtered_maf005_hwe1e6 encode processing history. The file structure itself documents your analysis pipeline.
✅ Use rhdf5 for file operations when:
You need to inspect file contents - h5ls() shows structure, h5read() retrieves data. rhdf5 provides the file navigation and I/O tools.
You’re managing metadata - Row names, column names, attributes, and documentation all go through rhdf5 functions.
You need partial data access - Reading specific rows, columns, or blocks uses rhdf5’s indexing capabilities: h5read(file, dataset, index = list(rows, cols)).
❌ Simpler approaches work better when:
You have a single dataset - If your entire project is one matrix, elaborate HDF5 organization adds no value. A simple flat file or single-group HDF5 suffices.
Data won’t grow or change - For static datasets that are finalized and won’t be extended, elaborate organization provides minimal benefit over simple naming.
Quick, temporary analyses - If you’re not building a reusable analysis pipeline, the organizational overhead outweighs benefits. Use whatever gets you to results fastest.
You never need partial reads - If you always process entire matrices, HDF5’s partial I/O advantage disappears. The organizational features may still help, but the performance benefit is minimal.
Understanding these trade-offs helps you invest effort where it pays off - elaborate organization for complex, growing projects; simple structure for straightforward, static analyses.