library(BigDataStatMeth)
library(rhdf5)
Working with HDF5 Matrices
File Operations, Data Conversion, and Management
1 Overview
This tutorial teaches you the practical skills for working efficiently with HDF5 files: converting data from common formats, organizing complex projects with multiple datasets, and performing routine file operations. Think of this as learning the file system before learning programming - you need to understand how to organize and access your data before diving into statistical analyses.
1.1 What You’ll Learn
By the end of this tutorial, you will:
- Convert text files (CSV, TSV) to HDF5 format efficiently
- Organize multiple related datasets in one HDF5 file
- Use hierarchical groups to structure complex projects
- Add, inspect, and remove datasets safely
- Read data efficiently (whole datasets vs. subsets)
- Apply best practices for HDF5 file organization
- Understand when to use rhdf5 vs. BigDataStatMeth functions
2 Prerequisites
Complete the Getting Started tutorial first. You should have:
- BigDataStatMeth installed and working
- Basic understanding of HDF5 structure (groups and datasets)
- R and rhdf5 loaded
3 Converting Text Files to HDF5
3.1 The Problem
You have large text files (CSV, TSV, or custom delimiters) that won’t fit comfortably in RAM. Loading them with read.csv() or read.table() is slow or impossible.
3.2 The Solution
Use bdImportTextFile_hdf5() to convert directly to HDF5 format, processing the file in chunks.
3.3 Example: Converting a CSV File
We’ll use a real clinical dataset (colesterol.csv) with cholesterol and metabolic measurements:
# Check if file exists (should be in same directory as tutorial)
if (!file.exists("colesterol.csv")) {
stop("colesterol.csv not found. Please ensure it's in the working directory.")
}
# Check file size
file_size_kb <- file.info("colesterol.csv")$size / 1024
cat("File size:", round(file_size_kb, 1), "KB\n")File size: 152.1 KB
# Preview first few lines
cat("\nFirst few lines of the CSV:\n")
First few lines of the CSV:
head(read.csv("colesterol.csv", nrows = 3))
  TCholesterol Age Insulin Creatinine BUN LLDR Triglycerides
1 223.7348 55.25039 14.90246 0.9026095 10.22067 1.450970 199.0737
2 248.1820 53.47404 25.12592 0.8710345 17.10225 1.002655 169.7877
3 180.2071 58.54268 12.95197 0.8882435 11.21530 1.073924 150.2938
  HDL_C LDL_C Sex
1 38.37616 107.09286 0
2 36.35374 83.66443 1
3 55.91920 108.02967 1
Now convert to HDF5:
# Convert CSV to HDF5
result <- bdImportTextFile_hdf5(
filename = "colesterol.csv",
sep = ",", # CSV uses commas
outputfile = "clinical_data.hdf5",
outGroup = "clinical",
outDataset = "measurements",
header = TRUE, # First row is column names
overwrite = TRUE
)
cat("✓ File converted to HDF5\n")✓ File converted to HDF5
cat("Output file:", result$fn, "\n")Output file: clinical_data.hdf5
cat("Output dataset:", result$ds, "\n")Output dataset: clinical/measurements
bdImportTextFile_hdf5() imports numeric data. If the first column contains character-based IDs, it is stored separately as row names, and column names are stored in their own dataset. The numeric matrix itself contains only the numerical measurements (Age, Cholesterol, Insulin, etc.).
3.4 Verify the Conversion
# Inspect structure
h5ls(result$fn)
  group name otype dclass dim
0 / clinical H5I_GROUP
1 /clinical .measurements_dimnames H5I_GROUP
2 /clinical/.measurements_dimnames 2 H5I_DATASET COMPOUND 10
3 /clinical measurements H5I_DATASET FLOAT 1000 x 10
# Read back a portion
converted_data <- h5read(result$fn, result$ds,
index = list(1:5, NULL))
converted_data
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 223.7348 55.25039 14.90246 0.9026095 10.22067 1.450970 199.0737 38.37616
[2,] 248.1820 53.47404 25.12592 0.8710345 17.10225 1.002655 169.7877 36.35374
[3,] 180.2071 58.54268 12.95197 0.8882435 11.21530 1.073924 150.2938 55.91920
[4,] 200.1522 62.58284 28.27819 0.9361357 12.79399 1.129373 150.0377 51.89241
[5,] 234.3281 68.70254 13.21404 0.7775238 11.39689 1.235655 161.8820 45.25873
[,9] [,10]
[1,] 107.09286 0
[2,] 83.66443 1
[3,] 108.02967 1
[4,] 108.83731 1
[5,] 94.98369 1
# Read column names
colnames_data <- h5read(result$fn, result$ds_cols)[,1]
colnames_data
[1] "TCholesterol" "Age" "Insulin" "Creatinine"
[5] "BUN" "LLDR" "Triglycerides" "HDL_C"
[9] "LDL_C" "Sex"
3.5 Working with Different Delimiters
# Tab-separated file
bdImportTextFile_hdf5(
filename = "data.tsv",
sep = "\t",
outputfile = "data.hdf5",
outGroup = "imported",
outDataset = "data"
)
# Custom delimiter (e.g., semicolon)
bdImportTextFile_hdf5(
filename = "data.txt",
sep = ";",
outputfile = "data.hdf5",
outGroup = "imported",
outDataset = "data"
)
4 Managing Multiple Datasets
One HDF5 file can store multiple datasets, organized in groups. This is ideal for related analyses.
4.1 Creating a Multi-Dataset File
# Create three related datasets
set.seed(100)
genotype <- matrix(sample(0:2, 1000*500, replace = TRUE), 1000, 500)
phenotype <- matrix(rnorm(1000*10), 1000, 10)
covariates <- matrix(rnorm(1000*5), 1000, 5)
# Store in same file, different groups
project_file <- "multi_dataset_project.hdf5"
bdCreate_hdf5_matrix(
filename = project_file,
object = genotype,
group = "genetics",
dataset = "snps",
overwriteFile = TRUE
)
$fn
[1] "multi_dataset_project.hdf5"
$ds
[1] "genetics/snps"
bdCreate_hdf5_matrix(
filename = project_file,
object = phenotype,
group = "phenotypes",
dataset = "traits"
)
$fn
[1] "multi_dataset_project.hdf5"
$ds
[1] "phenotypes/traits"
bdCreate_hdf5_matrix(
filename = project_file,
object = covariates,
group = "phenotypes",
dataset = "covariates"
)
$fn
[1] "multi_dataset_project.hdf5"
$ds
[1] "phenotypes/covariates"
cat("✓ Multi-dataset file created\n")✓ Multi-dataset file created
4.2 Inspecting File Contents
# List all contents
h5ls(project_file)
  group name otype dclass dim
0 / genetics H5I_GROUP
1 /genetics snps H5I_DATASET INTEGER 1000 x 500
2 / phenotypes H5I_GROUP
3 /phenotypes covariates H5I_DATASET FLOAT 1000 x 5
4 /phenotypes traits H5I_DATASET FLOAT 1000 x 10
Good organization strategies:
By data type:
- /genetics/ - SNP matrices, genomic data
- /phenotypes/ - Clinical measurements, traits
- /results/ - Analysis outputs (PCA, regression)
By processing stage:
- /raw/ - Original imported data
- /qc/ - Quality-controlled data
- /normalized/ - Normalized, ready for analysis
By analysis:
- /pca_analysis/ - PCA components, loadings
- /gwas_results/ - Association results
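If you like to lay out the skeleton before importing anything, rhdf5 can pre-create empty groups (BigDataStatMeth also creates groups on demand when you write datasets, as seen above). A minimal sketch with illustrative group names:
# Pre-create a processing-stage layout with rhdf5 (names are illustrative)
h5createFile("organized_project.hdf5")
for (g in c("raw", "qc", "normalized", "results")) {
  h5createGroup("organized_project.hdf5", g)
}
h5ls("organized_project.hdf5")
file.remove("organized_project.hdf5")   # remove the demo file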
5 Adding and Removing Datasets
5.1 Adding New Datasets
# Add another dataset to existing file
quality_scores <- matrix(runif(1000*500), 1000, 500)
bdCreate_hdf5_matrix(
filename = project_file,
object = quality_scores,
group = "quality_control",
dataset = "scores",
overwriteFile = FALSE # Don't overwrite the file
)
$fn
[1] "multi_dataset_project.hdf5"
$ds
[1] "quality_control/scores"
# Verify it was added
h5ls(project_file)
  group name otype dclass dim
0 / genetics H5I_GROUP
1 /genetics snps H5I_DATASET INTEGER 1000 x 500
2 / phenotypes H5I_GROUP
3 /phenotypes covariates H5I_DATASET FLOAT 1000 x 5
4 /phenotypes traits H5I_DATASET FLOAT 1000 x 10
5 / quality_control H5I_GROUP
6 /quality_control scores H5I_DATASET FLOAT 1000 x 500
5.2 Removing Datasets
To remove datasets, use rhdf5 functions:
# Remove a specific dataset
h5delete(project_file, "/quality_control/scores")
# Remove entire group
h5delete(project_file, "/quality_control")
HDF5 doesn’t reclaim disk space immediately after deletion. The space is marked as free, but the file size doesn’t shrink. To reclaim space, you need to repack the file using the HDF5 command-line tool h5repack.
6 Efficient Data Access
One of the most powerful aspects of HDF5 is partial reading - you can access just the data you need without loading gigabytes into memory. This is the secret to working with datasets larger than your RAM.
You’ll notice we use rhdf5’s h5read() function for reading data. This is intentional! BigDataStatMeth focuses on computation (matrix operations, statistical methods, block-wise algorithms), while rhdf5 excels at file I/O (reading, writing, inspecting).
Division of labor:
- rhdf5: File operations, data access, inspection
- BigDataStatMeth: Matrix algebra, statistics, block-wise processing
There’s no need to reimplement what rhdf5 already does perfectly.
6.1 Reading Subsets
Instead of loading entire matrices (which might be 100 GB), read just what you need:
# Read first 100 samples and first 50 SNPs
subset_data <- h5read(
project_file,
"/genetics/snps",
index = list(1:100, 1:50)
)
cat("Subset dimensions:", dim(subset_data), "\n")Subset dimensions: 100 50
cat("Preview of data (first 5×5):\n")Preview of data (first 5×5):
print(subset_data[1:5, 1:5]) [,1] [,2] [,3] [,4] [,5]
[1,] 1 1 0 0 2
[2,] 2 2 1 1 1
[3,] 1 1 0 1 2
[4,] 2 2 0 2 2
[5,] 0 2 0 2 0
cat("\n✓ Loaded only", prod(dim(subset_data)), "values\n")
✓ Loaded only 5000 values
cat(" Full dataset would have been:", 1000 * 2000, "values\n") Full dataset would have been: 2e+06 values
6.2 Flexible Indexing
You can read specific columns, skip rows, or sample randomly:
# Read specific columns only (useful for specific variables)
column_subset <- h5read(
project_file,
"/genetics/snps",
index = list(NULL, c(1, 10, 20, 30, 40)) # Just these 5 columns
)
cat("\nColumn subset dimensions:", dim(column_subset), "\n")
Column subset dimensions: 1000 5
cat("Selected columns: 1, 10, 20, 30, 40\n")Selected columns: 1, 10, 20, 30, 40
# Read every 10th row (useful for quick checks)
sparse_sample <- h5read(
project_file,
"/genetics/snps",
index = list(seq(1, 1000, by = 10), 1:10)
)
cat("\nSparse sample dimensions:", dim(sparse_sample), "\n")
Sparse sample dimensions: 100 10
cat("Reading every 10th row reduces data by 90%\n")Reading every 10th row reduces data by 90%
For exploration:
h5read(file, dataset, index = list(1:100, NULL))   # First 100 rows
For specific variables:
h5read(file, dataset, index = list(NULL, c(5, 10, 15)))   # Specific columns
For random sampling:
rows <- sample(1:10000, 500)   # Random 500 rows
h5read(file, dataset, index = list(rows, NULL))
The key principle: Load only what you need. This is how you analyze 100 GB files on a 16 GB laptop.
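Before reading, it is worth estimating what a read will cost: R stores numeric values as 8-byte doubles, so memory needs follow directly from the dimensions. A quick sketch with hypothetical sizes:
# Estimate memory from dimensions alone: doubles are 8 bytes each
n_rows <- 100000; n_cols <- 50000            # hypothetical dataset size
full_gb   <- n_rows * n_cols * 8 / 1024^3    # whole matrix: ~37 GB
subset_gb <- 1000 * n_cols * 8 / 1024^3      # a 1,000-row slice: ~0.4 GB
cat("Full:", round(full_gb, 1), "GB; subset:", round(subset_gb, 2), "GB\n")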
7 Working with Large Files
7.1 Understanding BigDataStatMeth’s Internal Management
When you use BigDataStatMeth functions like bdblockmult_hdf5() or bdSVD_hdf5(), the package handles all the complexity internally:
- Opens and closes files automatically
- Reads data in optimal block sizes
- Writes results efficiently
- Manages memory usage
You simply call the function and let BigDataStatMeth do the work:
# BigDataStatMeth handles everything internally:
result <- bdblockmult_hdf5(
filename = "huge_file.hdf5",
group = "data",
A = "matrix_A",
B = "matrix_B"
)
# ↑ Behind the scenes:
# - File opened
# - Data read in blocks
# - Computation performed
# - Results written
# - File closed
# You don't see any of this!
With BigDataStatMeth’s R functions, you never need to manually open/close files or manage blocks. The package does it all.
Manual control is only available when developing new methods using the C++ API. This is for advanced users creating custom statistical methods who need fine-grained control over:
- Block sizes
- Memory allocation
- Read/write patterns
- Custom algorithms
For 99% of users, the automatic management is exactly what you want.
7.2 Checking File Information
Sometimes you just want to know about the data without reading it:
# Get file size
file_size_mb <- file.info(project_file)$size / (1024^2)
cat("File size:", round(file_size_mb, 2), "MB\n")File size: 2.78 MB
# Get dataset dimensions without reading data
h5file <- H5Fopen(project_file)
dim_snps <- dim(h5file$genetics$snps)
cat("\nSNP matrix dimensions:", dim_snps[1], "samples ×", dim_snps[2], "SNPs\n")
SNP matrix dimensions: 1000 samples × 500 SNPs
dim_traits <- dim(h5file$phenotypes$traits)
cat("Trait matrix dimensions:", dim_traits[1], "samples ×", dim_traits[2], "traits\n")Trait matrix dimensions: 1000 samples × 10 traits
H5Fclose(h5file)
This is useful for:
- Planning analyses (how much memory will I need?)
- Verifying dimensions before operations
- Checking if files were created correctly
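An alternative that avoids opening file handles at all: h5ls() reports dimensions as text in its dim column, which you can filter directly. A small sketch:
# Read dimensions from the listing instead of opening the file
info <- h5ls(project_file)
info[info$name == "snps", c("group", "name", "dim")]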
Data Organization Best Practices
1. Use Descriptive Names
# Good naming
bdCreate_hdf5_matrix(filename = "study.hdf5",
                     object = data,
                     group = "gwas_diabetes_2024",
                     dataset = "genotypes_qc_filtered")
# Avoid cryptic names
bdCreate_hdf5_matrix(filename = "data.hdf5",
                     object = data,
                     group = "g1",
                     dataset = "d")
7.3 2. Document Your Structure
Keep a README or comments file describing your HDF5 organization:
# Create a text description
structure_doc <- "
HDF5 File Structure
==================
/genetics/snps - Genotype matrix (samples × SNPs)
/phenotypes/traits - Clinical measurements
/phenotypes/covariates - Age, sex, PCs
/results/pca - PCA results from genetics data
"
# You could store this as an attribute or separate file
writeLines(structure_doc, "project_structure.txt")
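You can also embed the description inside the file itself, so the documentation travels with the data. A sketch using rhdf5 attributes (the attribute name "README" is just a convention, not anything BigDataStatMeth requires):
# Store the description as an attribute on the file's root group
fid <- H5Fopen(project_file)
h5writeAttribute(structure_doc, fid, "README")
H5Fclose(fid)
h5readAttributes(project_file, "/")$README   # read it back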
7.4 3. Use Consistent Dimensions
Always keep samples in rows and features in columns. This is the standard convention in statistics and makes combining datasets much easier:
# Consistent convention
genotypes # n_samples × n_snps
phenotypes # n_samples × n_traits
results # n_samples × n_components
# All have samples in rows → easy to merge/join
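With this convention in place, you can sanity-check that every dataset in a file reports the same sample count. A sketch that parses the h5ls() listing (it assumes all datasets in the file are sample-by-feature matrices):
# Verify all datasets share the same row (sample) count
info <- h5ls(project_file)
ds <- info[info$otype == "H5I_DATASET", ]
n_samples <- as.integer(sub(" x.*", "", ds$dim))   # dim looks like "1000 x 500"
stopifnot(length(unique(n_samples)) == 1)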
This is critical to understand when viewing HDF5 files directly!
R uses column-major order (like Fortran):
- Data is stored column by column in memory
- Matrix[i, j] means: row i, column j
HDF5 uses row-major order (like C/C++):
- Data is stored row by row in memory
- Dataset[i, j] means: row i, column j
7.5 What This Means for You
When you create a matrix in R:
# In R: 3 samples (rows) × 5 SNPs (columns)
genotypes <- matrix(1:15, nrow = 3, ncol = 5)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 4 7 10 13
# [2,] 2 5 8 11 14
# [3,] 3 6 9 12 15
# Save to HDF5
bdCreate_hdf5_matrix(filename = "test.hdf5",
object = genotypes,
group = "data", dataset = "geno")When you view it in HDFView or h5dump, you’ll see the transpose:
# In HDFView: 5 × 3 (appears transposed!)
# [0,0] [0,1] [0,2]
# 1 2 3
# [1,0] [1,1] [1,2]
# 4 5 6
# [2,0] [2,1] [2,2]
# 7 8 9
7.6 Don’t Panic!
This is normal and expected. BigDataStatMeth handles the conversion automatically:
✓ When you read with h5read() → You get the correct R matrix
✓ When you compute with bdCrossprod_hdf5() → Dimensions are correct
✓ When you save results → They’re stored correctly for R
The transposition is handled transparently. You only notice it if you inspect files with external tools like HDFView.
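If you want to convince yourself, a round-trip check takes a few lines. This sketch writes a small matrix with BigDataStatMeth and reads it back with rhdf5 (the file name is arbitrary):
# Round-trip sanity check: write with BigDataStatMeth, read with rhdf5
m <- matrix(1:15, nrow = 3, ncol = 5)
res <- bdCreate_hdf5_matrix(filename = "roundtrip.hdf5", object = m,
                            group = "data", dataset = "geno",
                            overwriteFile = TRUE)
m_back <- h5read(res$fn, res$ds)
stopifnot(all(dim(m_back) == dim(m)), all(m_back == m))   # dims and values match
file.remove("roundtrip.hdf5")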
7.7 Key Takeaway
In R code: Think in R dimensions (samples × features)
In HDFView: Expect to see transposed dimensions
In BigDataStatMeth: Everything works correctly - dimensions match R expectations
The moral: Trust BigDataStatMeth to handle the storage details. Your matrices will always have the dimensions you expect in R.
8 Interactive Exercise
8.1 Practice: Designing Your Project’s HDF5 Structure
Effective HDF5 organization makes the difference between a maintainable project and a confusing mess six months later. This exercise helps you think through organization before starting real analyses.
# Exercise: Plan and create an HDF5 structure for your project
# Scenario: You're analyzing a multi-omic study with:
# - Genomic data (SNPs): 50,000 individuals × 500,000 variants
# - Transcriptomic data (RNA-seq): same 50,000 individuals × 20,000 genes
# - Phenotype data: 50,000 individuals × 50 measurements
# - Sample metadata: IDs, age, sex, ancestry
# Your task: Design the HDF5 structure
# Option 1: Everything in one file
project_file <- "my_study.hdf5"
# How would you organize groups?
# /raw_data/genomics/snps
# /raw_data/transcriptomics/expression
# /raw_data/phenotypes/measurements
# /metadata/samples
# /quality_control/filtered_snps
# /quality_control/normalized_expression
# /results/pca
# /results/associations
# Try creating this structure (define a placeholder matrix first):
dummy_data <- matrix(0, 10, 10)
bdCreate_hdf5_matrix(project_file, dummy_data,
group = "/raw_data/genomics",
dataset = "snps")
# ... add more datasets
# Option 2: Separate files per data type
# genomics.hdf5, transcriptomics.hdf5, phenotypes.hdf5
# When is this better? When worse?
Think through these design decisions for your actual or planned projects:
1. File Organization Strategy:
- One large file vs. multiple smaller files?
- For your project, which approach makes more sense?
- How does your team typically share data?
- Will different people access different parts?
2. Group Hierarchy:
- How deep should your folder structure go?
- /data/type/version vs. /data_type_version?
- Can you navigate your structure 6 months from now?
- Would a new collaborator understand it?
3. Dataset Naming:
- “matrix1” vs. “filtered_normalized_snps_maf05”?
- Descriptive names vs. short names?
- How do you indicate processing steps?
- Do you need version numbers in names?
4. Metadata Management:
- Where do row/column names go?
- How do you store processing parameters?
- Document analysis dates and software versions?
- Link between datasets (which samples are in which matrix)?
5. Practical Constraints:
- How much disk space do you have?
- Will files be transferred between systems?
- Need to share via email/FTP? (file size limits)
- Backup strategy for large files?
6. Future Expansion:
- What if you add more samples next year?
- New data types (proteomics, metabolomics)?
- Reanalysis with different parameters?
- Where does each new piece go?
There’s no universal “correct” organization - it depends on your project scale, team structure, and workflow. The goal is thinking through these questions before you have 50 analysis scripts depending on a particular structure.
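As a starting point, here is a small-scale sketch of Option 1 with tiny placeholder matrices. Real data would be orders of magnitude larger, and whether nested group paths are created in a single call may depend on your BigDataStatMeth version:
# Small-scale sketch of Option 1 (placeholder sizes so it runs instantly)
dummy_snps  <- matrix(sample(0:2, 100 * 50, replace = TRUE), 100, 50)
dummy_expr  <- matrix(rnorm(100 * 20), 100, 20)
dummy_pheno <- matrix(rnorm(100 * 5), 100, 5)
bdCreate_hdf5_matrix("my_study.hdf5", dummy_snps,
                     group = "raw_data/genomics", dataset = "snps",
                     overwriteFile = TRUE)
bdCreate_hdf5_matrix("my_study.hdf5", dummy_expr,
                     group = "raw_data/transcriptomics", dataset = "expression")
bdCreate_hdf5_matrix("my_study.hdf5", dummy_pheno,
                     group = "raw_data/phenotypes", dataset = "measurements")
h5ls("my_study.hdf5")
file.remove("my_study.hdf5")   # remove the demo file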
9 Cleanup
# Close all HDF5 connections
h5closeAll()
# Remove generated HDF5 file
file.remove("clinical_data.hdf5",
"multi_dataset_project.hdf5")[1] TRUE TRUE
cat("✓ Tutorial cleanup complete\n")✓ Tutorial cleanup complete
cat("Note: colesterol.csv is kept for reference\n")Note: colesterol.csv is kept for reference
10 What You’ve Learned
✅ Convert text files (CSV, TSV) to HDF5 format
✅ Manage multiple datasets in one HDF5 file
✅ Organize data using groups and descriptive names
✅ Access data efficiently (subsets, slicing)
✅ Apply best practices for file organization
11 Next Steps
Continue the tutorial series:
- ✅ Getting Started
- ✅ Working with HDF5 Matrices (you are here)
- → Your First Analysis - Complete analysis workflow
Ready for real analysis:
- Implementing PCA - Full PCA workflow on genomic data
- Implementing CCA - Canonical correlation analysis
12 Key Takeaways
Let’s consolidate your understanding of HDF5 file management and organization for big data projects.
12.1 Essential Concepts
HDF5 is a data management system, not just a file format. It provides hierarchical organization (groups and datasets), partial I/O (read only what you need), and cross-platform compatibility. This makes it fundamentally different from flat file formats like CSV. You’re not just storing data - you’re organizing an entire analysis project in a structured, queryable format.
File conversion is a one-time investment that pays dividends throughout your project. Converting a 50 GB CSV file to HDF5 might take 20 minutes, but every subsequent analysis benefits from fast partial reads, organized structure, and efficient access patterns. The upfront cost amortizes across dozens or hundreds of analyses over the project lifetime.
Organization decisions made early are hard to change later. Once you have 50 analysis scripts expecting data at /raw_data/genomics/snps, reorganizing the file breaks everything. Take time designing your structure before populating it. A bit of planning prevents substantial refactoring pain later.
BigDataStatMeth and rhdf5 serve different purposes. BigDataStatMeth creates and operates on matrices (the computational work). rhdf5 provides file I/O and inspection (reading, writing, navigating). You need both: BigDataStatMeth for analysis, rhdf5 for file management. Understanding this division of labor prevents confusion about which tool does what.
Row-major vs. column-major storage is handled automatically. R uses column-major order, HDF5 uses row-major. BigDataStatMeth manages this conversion transparently. Your matrices always have the dimensions you expect in R code, even though they’re stored differently on disk. You only notice the difference if you inspect files with external tools like HDFView - and even then, it’s just a visual curiosity, not a problem.
Efficient HDF5 use means reading strategically. Don’t read entire 100 GB matrices to access a 1 MB subset. Use dimension slicing, work with row/column blocks, and read only required data. The ability to partially read files is HDF5’s main advantage - leverage it. This is why understanding dataset dimensions and organization matters.
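When a computation genuinely needs every row, you can still stream the data block by block instead of loading it all at once. A sketch computing column sums over the SNP matrix from earlier (assuming the file has not yet been removed by the cleanup step):
# Block-wise pass: column sums without holding the full matrix in RAM
block_size <- 200
n <- 1000                                    # total rows, known from h5ls()
col_sums <- 0
for (start in seq(1, n, by = block_size)) {
  rows  <- start:min(start + block_size - 1, n)
  block <- h5read(project_file, "/genetics/snps", index = list(rows, NULL))
  col_sums <- col_sums + colSums(block)
}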
12.2 When to Use HDF5 File Management
Understanding when HDF5’s organizational features help versus when simpler approaches suffice guides practical decisions about file structure and management.
✅ Use hierarchical HDF5 organization when:
You have multiple related datasets - Genomics + transcriptomics + phenotypes all in one project. Groups like /genetics/, /expression/, /clinical/ keep everything organized in a single file rather than managing dozens of separate files.
Your project will grow over time - Starting organized prevents refactoring pain. If you’ll add more samples, time points, or data types later, hierarchical structure accommodates growth naturally.
Multiple team members access the same data - Clear organization (/raw_data/, /quality_control/, /results/) makes files self-documenting. Collaborators can navigate without constant explanations.
You need to document processing steps - Dataset names like genotypes_filtered_maf005_hwe1e6 encode processing history. The file structure itself documents your analysis pipeline.
✅ Use rhdf5 for file operations when:
You need to inspect file contents - h5ls() shows structure, h5read() retrieves data. rhdf5 provides the file navigation and I/O tools.
You’re managing metadata - Row names, column names, attributes, and documentation all go through rhdf5 functions.
You need partial data access - Reading specific rows, columns, or blocks uses rhdf5’s indexing capabilities: h5read(file, dataset, index = list(rows, cols)).
❌ Simpler approaches work better when:
You have a single dataset - If your entire project is one matrix, elaborate HDF5 organization adds no value. A simple flat file or single-group HDF5 suffices.
Data won’t grow or change - For static datasets that are finalized and won’t be extended, elaborate organization provides minimal benefit over simple naming.
Quick, temporary analyses - If you’re not building a reusable analysis pipeline, the organizational overhead outweighs benefits. Use whatever gets you to results fastest.
You never need partial reads - If you always process entire matrices, HDF5’s partial I/O advantage disappears. The organizational features may still help, but the performance benefit is minimal.
Understanding these trade-offs helps you invest effort where it pays off - elaborate organization for complex, growing projects; simple structure for straightforward, static analyses.