Understanding HDF5 Storage

A Deep Dive into the Foundation of BigDataStatMeth

0.1 What You’ll Learn

By the end of this section, you will:

  • Understand what HDF5 is and why it exists
  • Grasp how HDF5 organizes data hierarchically
  • Learn how chunking enables efficient disk-based computing
  • Know when and why to use compression
  • Be able to create, inspect, and work with HDF5 files
  • Understand how BigDataStatMeth leverages HDF5 features

1 The Problem HDF5 Solves

Imagine you’re analyzing genomic data: a matrix of 100,000 individuals × 50,000 genetic variants. Let’s do some quick math:

# Matrix dimensions
n_individuals <- 100000
n_variants <- 50000

# Memory needed (assuming 8 bytes per double-precision number)
memory_gb <- (n_individuals * n_variants * 8) / (1024^3)
cat(sprintf("Memory required: %.1f GB\n", memory_gb))
Memory required: 37.3 GB

The challenge: A typical laptop has 16GB RAM. Even a high-end workstation with 128GB would struggle when performing operations that create intermediate results.

Note: The Traditional Approach Fails
# This would crash on most systems
genotype_matrix <- read.csv("100k_x_50k_genotypes.csv")  # ❌ Out of memory!
pca_result <- prcomp(genotype_matrix)                     # ❌ Never gets here

We need a fundamentally different approach.

2 What is HDF5?

When we encounter datasets that exceed our computer’s memory capacity, we face a fundamental question: how do we work with data we cannot fully load? Traditional file formats like CSV, RData, or even binary formats force us to read entire files into memory before we can work with them. This all-or-nothing approach creates an insurmountable barrier when data grows beyond available RAM.

HDF5 (Hierarchical Data Format version 5) emerged from this exact challenge in the scientific computing community. Developed by the National Center for Supercomputing Applications (NCSA) and now maintained by The HDF Group, HDF5 was designed by scientists who routinely work with terabytes of data from instruments, simulations, and experiments. It’s not just “another file format” - it’s fundamentally a database system optimized for scientific matrices and arrays.

The key innovation of HDF5 is deceptively simple but profoundly powerful: rather than treating a file as a monolithic block of data that must be read entirely, HDF5 treats files as structured databases where you can efficiently access exactly the pieces you need, when you need them. This changes everything about how we can work with large datasets.

2.1 The HDF5 Philosophy

HDF5 was designed around a core principle that directly addresses our big data problem:

“Data should be accessible in pieces, without loading everything into memory.”

This philosophy manifests in three revolutionary features that distinguish HDF5 from traditional file formats:

1. Hierarchical Organization

Just as you organize documents into folders and subfolders on your computer, HDF5 lets you organize datasets within groups. You might have a group called /raw_data/ containing original measurements, another called /processed/ for cleaned data, and /results/ for analytical outputs. This organization isn’t just cosmetic - it helps both humans and computers understand the relationships between different pieces of data. When working on a genomics project, for example, you might organize by chromosome, by sample type, or by processing stage, all within a single file.

2. Partial I/O

This is where HDF5 truly shines. Imagine you have a matrix with 100,000 rows but only need to analyze the first 10,000. With a CSV file, you must read all 100,000 rows before you can work with any of them. With HDF5, you simply request rows 1-10,000, and only those rows are read from disk. The rest of the data stays untouched on disk, consuming zero memory. This selective reading extends to any dimension - rows, columns, or even arbitrary rectangular regions of your matrices. This capability is what makes disk-based computing practical.

3. Self-Describing Metadata

Every HDF5 dataset carries its own documentation. The file knows the dimensions of each matrix, the data type of each element, the names of rows and columns, when the data was created, and any other information you choose to store. This metadata travels with the data, so six months later, you (or a colleague) can open the file and immediately understand what it contains without consulting separate documentation. The data describes itself.

These three features combine to create a storage system where large datasets remain accessible and usable without overwhelming your computer’s memory. But understanding the features is just the beginning - seeing how HDF5 structures data internally helps us use it effectively.

3 HDF5 File Structure

To truly grasp how HDF5 enables efficient big data workflows, we need to understand its internal organization. Think of an HDF5 file as containing its own miniature filesystem - a hierarchy of containers and data, all within a single file.

At the top level, an HDF5 file can contain two types of objects: groups and datasets. Groups are like directories or folders, providing organizational structure. Datasets are where the actual data lives - these are your matrices, vectors, or arrays. Groups can contain other groups (nested hierarchies) and datasets. This creates a tree-like structure that can represent complex, multi-component analyses within a single file.

Here’s how this might look for a typical genomics analysis:

graph TD
    A[HDF5 File: analysis.hdf5] --> B[Group: /data]
    A --> C[Group: /results]
    A --> D[Group: /metadata]
    
    B --> B1[Dataset: genotypes<br/>100k × 50k matrix]
    B --> B2[Dataset: phenotypes<br/>100k × 10 matrix]
    
    C --> C1[Dataset: pca_components<br/>50k × 20 matrix]
    C --> C2[Dataset: pca_scores<br/>100k × 20 matrix]
    
    D --> D1[Attributes: creation_date]
    D --> D2[Attributes: sample_ids]
    
    style A fill:#f0f8ff
    style B fill:#e8f6e8
    style C fill:#e8f6e8
    style D fill:#e8f6e8
    style B1 fill:#fff8e1
    style B2 fill:#fff8e1
    style C1 fill:#fff8e1
    style C2 fill:#fff8e1

HDF5 hierarchical structure

In this example, we have a single HDF5 file (analysis.hdf5) that contains everything related to a genetic association study. The file is organized into three main groups at the root level. The /data group holds our raw and processed data matrices - the genotype matrix with 100,000 individuals across 50,000 genetic variants, plus a smaller phenotype matrix with clinical measurements for those same individuals. The /results group stores the outputs of our PCA analysis - both the principal components (50,000 variants × 20 components) and the individual scores (100,000 individuals × 20 components). Finally, the /metadata group contains important information about the study, like when the data was collected and which samples correspond to which individuals.

This hierarchical organization does more than keep things tidy. It allows us to work with different parts of the analysis independently. We can read just the phenotype data without touching the much larger genotype matrix. We can update results without disturbing raw data. We can add new analyses in new groups without restructuring existing work. Everything stays together in one file, but nothing forces us to load everything at once.
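
To see what this means in practice, here is a minimal rhdf5 sketch, assuming a file laid out like the analysis.hdf5 in the diagram above actually existed on disk:

library(rhdf5)

# Reads only the 100,000 x 10 phenotype matrix; the much larger
# genotype dataset stored in the same file is never touched
phenotypes <- h5read("analysis.hdf5", "data/phenotypes")
dim(phenotypes)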

3.1 Components Explained

To work effectively with HDF5, you need to understand three types of objects that make up this hierarchy. Each serves a distinct purpose in organizing and documenting your data.

Groups are organizational containers that work exactly like folders on your computer’s filesystem. They provide structure and context for your data. A group can contain other groups (creating deeper hierarchies) and datasets. For instance, you might create a structure like /data/genomics/chromosome1/variants to organize genetic variants by chromosome. Groups help both humans and software understand how different pieces of data relate to each other. When you come back to an analysis months later, this organization helps you quickly locate what you need.
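
As a quick illustration of building such a hierarchy by hand, here is a minimal rhdf5 sketch (the file name project.h5 is purely hypothetical; note that rhdf5 creates nested groups one level at a time):

library(rhdf5)

# Create an empty HDF5 file, then the nested group hierarchy
h5createFile("project.h5")
h5createGroup("project.h5", "data")
h5createGroup("project.h5", "data/genomics")
h5createGroup("project.h5", "data/genomics/chromosome1")

# Confirm the structure
h5ls("project.h5")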

Datasets are where your actual numerical data lives. These are the matrices, vectors, and arrays you’ll perform computations on. A dataset can be one-dimensional (a vector), two-dimensional (a matrix), or even higher-dimensional (tensors for complex data structures). Crucially, datasets in HDF5 can be arbitrarily large - they’re not limited by your computer’s RAM because the data resides on disk. Each dataset stores not just the numbers, but also information about data types, dimensions, and structure. This is what allows HDF5 to read just portions of a dataset efficiently.

Attributes are small pieces of metadata attached to either groups or datasets. Think of attributes as post-it notes that document important information about your data. For a genotype dataset, attributes might store column names (which SNPs?), row names (which individuals?), the date of data collection, quality control thresholds used, or the genome build version. Attributes are always small and loaded entirely into memory, so they’re perfect for storing descriptive information that helps interpret the data. Unlike datasets, attributes are not designed for selective access - they’re meant to be read completely whenever you open a file.

4 Creating Your First HDF5 File

Now that you understand the conceptual foundation of HDF5 - its hierarchical organization, self-describing metadata, and the power of partial I/O - let's make these ideas concrete by actually creating an HDF5 file. We'll build a small example that mirrors a typical genomics study: genotype data for many individuals across many genetic variants, plus associated phenotype measurements.

The key function in BigDataStatMeth for creating HDF5 files is bdCreate_hdf5_matrix(). This function takes an R matrix (or data that can be converted to a matrix) and writes it to an HDF5 file with appropriate chunking, compression, and metadata. Behind the scenes, it handles the technical details - choosing chunk sizes, setting up compression (both explored in the chunking and compression sections below), and creating the hierarchical structure - so you can focus on organizing your data logically.

Let’s walk through a complete example:

library(BigDataStatMeth)
library(rhdf5)

# Create some example data that mimics a genomics study
# In real work, this data would come from your actual measurements
set.seed(123)
genotype_data <- matrix(
  sample(c(0, 1, 2), 1000 * 500, replace = TRUE),  # 0, 1, 2 = genotype calls
  nrow = 1000,  # 1000 individuals
  ncol = 500    # 500 genetic variants
)

phenotype_data <- matrix(
  rnorm(1000 * 10),  # Continuous phenotype measurements
  nrow = 1000,       # Same 1000 individuals
  ncol = 10          # 10 measured traits
)

# Create HDF5 file with hierarchical organization
# This creates the file and writes the first dataset
bdCreate_hdf5_matrix(
  filename = "my_study.hdf5",    # Name of HDF5 file to create
  object = genotype_data,         # Data to write
  group = "data",                 # Group name (like a folder)
  dataset = "genotypes",          # Dataset name within the group
  overwriteFile = TRUE            # OK to overwrite if file exists
)

# Add second dataset to the same file
# Notice overwriteFile = FALSE so we add to existing file
bdCreate_hdf5_matrix(
  filename = "my_study.hdf5",    # Same file as above
  object = phenotype_data,        # Different data
  group = "data",                 # Same group
  dataset = "phenotypes",         # Different dataset name
  overwriteFile = FALSE           # Don't overwrite the file, add to it
)

# Inspect the structure to see what we created
h5ls("my_study.hdf5")
       group        name       otype dclass        dim
0          /        data   H5I_GROUP              
1     /data  genotypes H5I_DATASET  FLOAT 1000 x 500
2     /data phenotypes H5I_DATASET  FLOAT 1000 x 10
Tip: What Just Happened Behind the Scenes?

Let’s unpack what bdCreate_hdf5_matrix() did for us in those few lines of code:

File and Structure Creation: The first call created a new HDF5 file on disk called my_study.hdf5. It automatically created the group /data (since it didn’t exist yet) and added the genotypes dataset within that group. The second call recognized the file already exists and added another dataset to the same group without disturbing the first one.

Data Transfer: Your R matrices were written to disk in HDF5 format. The genotype matrix (1000 × 500 = 500,000 values) and phenotype matrix (1000 × 10 = 10,000 values) now live on disk, organized hierarchically. The original R objects still exist in memory - we’ve created copies on disk, not moved the data.

Automatic Optimization: Behind the scenes, BigDataStatMeth chose appropriate chunk sizes for each dataset based on their dimensions. It applied default compression (level 4) to reduce file size. It stored metadata about dimensions and data types. All of this happened automatically without you needing to specify these technical details.

Memory Efficiency: Notice that at no point did we need to have multiple copies of the data in memory. Each dataset was written directly from the R object to disk. This matters more with larger datasets - you can create multi-gigabyte HDF5 files without needing that much RAM, as long as each individual dataset fits in memory when you write it.

The h5ls() output shows our file structure. The / represents the root of the file, data is a group (H5I_GROUP), and within it are two datasets (H5I_DATASET) with their dimensions clearly visible.
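
Before moving on, note that the datasets we just created can also carry attributes, the "post-it notes" from Section 3.1. Here is a minimal rhdf5 sketch for attaching and reading back an attribute on my_study.hdf5 (the attribute name and value are just examples):

library(rhdf5)

# Open the file and dataset, attach a small descriptive attribute, then close
fid <- H5Fopen("my_study.hdf5")
did <- H5Dopen(fid, "data/genotypes")
h5writeAttribute("simulated genotype calls coded 0/1/2", did, name = "description")
H5Dclose(did)
H5Fclose(fid)

# Attributes come back as a named list
h5readAttributes("my_study.hdf5", "data/genotypes")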

5 Reading Data: The Power of Partial I/O

Now that we have data stored in HDF5 format, let’s explore what makes this format special: the ability to read just the portions we need.

# Read just the first 100 rows and 50 columns
subset_data <- h5read(
  "my_study.hdf5",
  "data/genotypes",
  index = list(1:100, 1:50)
)

# Check memory usage
object.size(subset_data)  # Only ~40 KB instead of ~3.8 MB!
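
The index argument is flexible: a NULL entry means "take everything along this dimension", so you can pull complete columns or rows without listing every index. A small sketch using the same file:

# All 1000 rows, but only the first 5 variants (columns)
first_variants <- h5read(
  "my_study.hdf5",
  "data/genotypes",
  index = list(NULL, 1:5)
)
dim(first_variants)  # 1000 x 5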

This is fundamentally different from CSV or RData files where you must read everything.

graph TB
    T1["<b>Traditional Files (CSV, RData)</b>"]
    T1 --> A1[Full File<br/>37 GB]
    A1 --> B1[Read ALL<br/>37 GB RAM]
    B1 --> C1[Extract Subset<br/>~100 MB used]
    
    T2["<b>HDF5 File</b>"]
    T2 --> A2[HDF5 File<br/>37 GB]
    A2 --> B2[Read SUBSET<br/>~100 MB RAM]
    B2 --> C2[Work with Data<br/>~100 MB used]
    
    style T1 fill:#ffcccc,stroke:#cc0000,stroke-width:2px,color:#000
    style T2 fill:#ccffcc,stroke:#00cc00,stroke-width:2px,color:#000
    style A1 fill:#ffe8e8
    style B1 fill:#ffe8e8
    style C1 fill:#ffe8e8
    style A2 fill:#e8f6e8
    style B2 fill:#e8f6e8
    style C2 fill:#e8f6e8

Partial I/O: Reading only needed data

6 Chunking: How HDF5 Enables Efficient Access

We’ve established that HDF5 allows you to read just the portions of data you need - but how does it accomplish this efficiently? The answer lies in a fundamental design choice called chunking, which is perhaps the most important concept to understand when working with HDF5 files. Chunking is what transforms HDF5 from a simple file format into a high-performance data access system.

To appreciate why chunking matters, consider what happens with traditional file formats. When you store a matrix in a CSV file, the data is written sequentially: row 1, then row 2, then row 3, and so on. If you want to read just one column, you must scan through the entire file, extracting the relevant value from each row as you go. This means touching every single byte of a multi-gigabyte file just to access a tiny slice of data. It’s like having to read an entire book to find a single word.

HDF5 takes a fundamentally different approach by organizing data into rectangular blocks called chunks. This organization is built into how the data is stored on disk, not something that happens when you read the file. Understanding chunking helps you make better decisions about how to structure your data and which operations will be fast versus slow.

6.1 What is Chunking?

Instead of storing your matrix in a single continuous block (row-by-row or column-by-column), HDF5 divides it into rectangular chunks - think of them as tiles in a mosaic. Each chunk is stored as a contiguous unit on disk, meaning all the data in that chunk sits together in one place. When you request data that falls within a chunk, HDF5 can read that entire chunk in a single efficient disk operation.

The key insight is that related data - values that are likely to be accessed together - should live in the same chunk. If you typically access columns of data, you want chunks that contain complete column segments. If you work with rectangular regions, square chunks make sense. HDF5’s flexibility in chunk shapes lets you optimize for your specific access patterns.

Let’s visualize how a matrix gets divided into chunks:

graph TD
    subgraph "Original Matrix (1000 × 500)"
        A[Chunk 1<br/>250×250]
        B[Chunk 2<br/>250×250]
        C[Chunk 3<br/>250×250]
        D[Chunk 4<br/>250×250]
        E[Chunk 5<br/>250×250]
        F[Chunk 6<br/>250×250]
        G[Chunk 7<br/>250×250]
        H[Chunk 8<br/>250×250]
    end
    
    style A fill:#f0f8ff
    style B fill:#e8f6e8
    style C fill:#fff8e1
    style D fill:#ffe8e8
    style E fill:#f0f8ff
    style F fill:#e8f6e8
    style G fill:#fff8e1
    style H fill:#ffe8e8

Matrix chunking concept

In this example, a 1000×500 matrix is divided into eight chunks of 250×250 elements each. Each colored block represents a separate chunk stored contiguously on disk. When you need data from a specific region of your matrix, HDF5 identifies which chunks contain that data and reads only those chunks. The chunks that aren’t needed stay on disk, consuming zero memory.

The coloring in the diagram isn’t just decorative - it helps visualize an important property: chunks are independent storage units. You can read chunk 1 without touching chunks 2 through 8. You can update chunk 5 without affecting any other chunk. This independence is what makes partial I/O possible and efficient.

6.2 Why Chunking Matters

The choice of chunk size and shape has profound implications for performance. To understand why, let’s walk through a concrete example that illustrates both good and bad chunking strategies.

Example scenario: You want to read columns 1-250 of your matrix.

  • Matrix chunked into 250×250 blocks: reading columns 1-250 requires chunks 1, 3, 5, and 7, so only 4 seek + read operations. ✓ Efficient!
  • Matrix stored row-by-row: reading columns 1-250 touches every one of the 1000 rows, so 1000 seek + read operations. ✗ Slow!

The difference is dramatic. With appropriate chunking, reading our column subset requires just 4 disk operations - one for each chunk that contains part of those columns. Without chunking (or with row-major storage), we need 1000 separate disk seeks, one for each row. Since disk seeks are typically measured in milliseconds while data transfer happens at gigabytes per second, the number of seeks dominates performance for partial reads.

This example illustrates a general principle: your chunk layout should match your access patterns. If you primarily access data by columns (common in statistical analysis), use chunks that span the full height of your matrix but only a portion of its width. If you work with rectangular regions, square chunks often work well. If you access entire rows, horizontal chunks make sense.

The good news is that BigDataStatMeth makes intelligent chunking decisions for you, optimizing for the block-wise statistical operations that are common in data analysis. But understanding chunking helps you recognize when certain operations will be fast (they align with chunk boundaries) versus slower (they require reading many partially-needed chunks).

Important: Chunk Size Matters

Chunk size represents a fundamental trade-off in HDF5 performance. Several factors influence the optimal size:

Too small chunks mean more metadata overhead and more disk seeks. If each chunk is only a few kilobytes, you’ll spend more time seeking to chunks than actually reading data. The metadata describing where each chunk lives can become a significant burden.

Too large chunks mean reading more data than you need. If you want a single column but each chunk contains hundreds of columns, you’re transferring far more data from disk to memory than necessary. This wastes both I/O bandwidth and RAM.

The sweet spot typically falls between 10KB and 1MB per chunk, though the exact optimum depends on your hardware (especially disk type - SSD versus hard drive) and access patterns. Modern SSDs are more forgiving of many small reads, while traditional hard drives strongly prefer fewer large reads.

BigDataStatMeth handles chunking automatically with sensible defaults, but it’s important to understand the philosophy behind these choices. The package takes a deliberately conservative approach to chunk sizing. Rather than trying to maximize performance, the defaults prioritize system stability and broad compatibility. This means that BigDataStatMeth’s automatic chunking probably isn’t the absolute fastest possible for your specific system - but it won’t overwhelm your RAM or cause your system to become unresponsive.

Why this conservatism? The package cannot know in advance what hardware you’re running on (laptop with 8GB RAM vs. server with 256GB), what else is running on your system, or your specific storage architecture (local SSD, network drive, cloud storage). Chunk sizes that work beautifully on a high-end workstation might cause out-of-memory errors on a laptop. The defaults aim for “works reliably everywhere” rather than “optimal for this specific configuration.”

If you know your system’s capabilities and access patterns well, you can override the defaults and potentially achieve better performance. But for most users, especially those new to HDF5, the conservative defaults provide a good balance: operations complete successfully without system issues, even if not at maximum theoretical speed. As you gain experience, you can experiment with more aggressive chunk sizes for your specific workflows.
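
If you do want to experiment with explicit chunk layouts, rhdf5's lower-level functions expose the chunk shape directly when a dataset is created. The sketch below assumes a column-wise access pattern; the file and dataset names, and the particular chunk shape, are illustrative rather than recommendations:

library(rhdf5)

h5createFile("chunk_demo.h5")

# Column-oriented chunks: full matrix height, 50 columns per chunk.
# One chunk holds 1000 * 50 * 8 bytes, roughly 390 KB.
h5createDataset(
  "chunk_demo.h5", "expr",
  dims = c(1000, 500),
  storage.mode = "double",
  chunk = c(1000, 50),
  level = 4              # gzip compression level
)

# Writing into the pre-created dataset uses the chunk layout defined above
h5write(matrix(rnorm(1000 * 500), 1000, 500), "chunk_demo.h5", "expr")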

7 Compression: Saving Disk Space

Storage space and I/O bandwidth are precious resources, especially when working with large datasets. HDF5 addresses both concerns through transparent compression - the data is automatically compressed when written and decompressed when read, without you having to manage the process explicitly. This compression happens at the chunk level, meaning each chunk is compressed independently. This chunk-wise compression preserves the ability to access arbitrary portions of your data efficiently.

The key word is “transparent.” From your perspective as a user, compressed and uncompressed datasets work identically. You read and write data using the same functions, and HDF5 handles the compression automatically. The only difference you’ll notice is in file size on disk and potentially in read/write performance, depending on whether your system is limited by disk I/O or CPU speed.

HDF5 uses the widely-tested gzip compression algorithm, which provides a good balance of compression ratio, speed, and universal support. The algorithm is lossless, meaning you get back exactly the data you put in - critical for scientific computing where accuracy matters. You control the compression level, trading off between better compression (smaller files, more CPU time) and faster operation (larger files, less CPU time).

Here’s how to use compression with BigDataStatMeth:

# Create an uncompressed copy first so we have something to compare against
bdCreate_hdf5_matrix(
  filename = "uncompressed.hdf5",
  object = genotype_data,
  group = "data",
  dataset = "genotypes",
  overwriteFile = TRUE,
  compression_level = 0  # 0 = no compression
)

# BigDataStatMeth uses compression by default; here we set the level explicitly
bdCreate_hdf5_matrix(
  filename = "compressed.hdf5",
  object = genotype_data,
  group = "data",
  dataset = "genotypes",
  overwriteFile = TRUE,
  compression_level = 6  # 0 (none) to 9 (maximum)
)

# Compare file sizes on disk
file.info("uncompressed.hdf5")$size / 1024^2  # MB
file.info("compressed.hdf5")$size / 1024^2    # MB

7.1 Compression Trade-offs

Choosing a compression level involves understanding the trade-off between file size and computational overhead. Higher compression levels examine more potential encoding strategies, finding more compact representations at the cost of more CPU cycles.

Level   Compression ratio   Speed     When to use
0       1:1 (none)          Fastest   High-speed temporary files
4-6     ~2:1 to 5:1         Fast      Default - good balance
9       ~3:1 to 8:1         Slower    Archival, limited disk space

The compression ratio you achieve depends heavily on your data characteristics. Genomic data with many repeated values (like genotypes coded as 0, 1, 2) compresses extremely well, often achieving 5:1 or better ratios. Random floating-point numbers compress poorly because compression algorithms rely on finding patterns and redundancy in the data.

For most analytical workflows, levels 4-6 hit the sweet spot. They provide substantial space savings while adding minimal computational overhead. Modern CPUs can decompress data far faster than even fast SSDs can deliver it, so the decompression rarely becomes a bottleneck. In fact, compression can sometimes improve overall performance by reducing the amount of data that must be transferred from disk to memory - the time saved in I/O exceeds the time spent decompressing.

Level 0 (no compression) makes sense for temporary files where you prioritize speed over space, or when working with data that simply doesn’t compress well (already-compressed data, random noise, encrypted data). Level 9 is useful for archival storage where space is at a premium and you don’t mind slower write times, but it rarely makes sense for working files since levels 6-7 typically achieve nearly the same compression with noticeably better performance.

Note: BigDataStatMeth Default

BigDataStatMeth uses compression level 4 by default, providing good compression ratios (typically 2-4× for typical genomic and statistical data) without significant performance penalty. This default works well for most use cases. The package applies compression automatically - you don’t need to think about it unless you have specific reasons to adjust the level.

For most users, the default compression is the right choice. It keeps your files manageable without slowing down your analysis. Only adjust compression if you have unusual requirements: no compression for maximum speed with temporary files, or higher compression for long-term storage of large datasets.

8 Exploring HDF5 Files

Once you’ve created HDF5 files, you’ll often want to inspect their contents to understand what data they contain and how it’s organized. HDF5 provides several tools for this exploration, both programmatic and visual. Understanding what’s in your files is essential for both your own work (remembering what analyses you’ve run) and for sharing data with colleagues who need to understand your file structure.

8.1 Using h5ls()

# List contents of HDF5 file
library(rhdf5)
h5ls("my_study.hdf5")
       group        name       otype dclass        dim
0          /        data   H5I_GROUP              
1     /data  genotypes H5I_DATASET  FLOAT 1000 x 500
2     /data phenotypes H5I_DATASET  FLOAT 1000 x 10
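
For a deeper look than h5ls() provides, rhdf5's h5dump() walks the whole hierarchy and returns it as a nested R list; with load = FALSE it reports dataset descriptions (dimensions, type) instead of reading the data itself. A quick sketch:

library(rhdf5)

str(h5dump("my_study.hdf5", load = FALSE))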

8.2 Using HDFView (GUI Tool)

For visual exploration, HDFView is an excellent free tool that provides a graphical interface to browse HDF5 file contents, inspect datasets, and view metadata.

9 How BigDataStatMeth Uses HDF5

BigDataStatMeth builds on HDF5’s capabilities:

graph LR
    A[Raw Data<br/>CSV, RData, GDS] --> B[bdCreate_hdf5_matrix]
    B --> C[HDF5 File<br/>Chunked & Compressed]
    C --> D[Block-wise<br/>Operations]
    D --> E[Results<br/>Stored in HDF5]
    E --> F[Extract to R<br/>or Keep on Disk]
    
    style C fill:#f0f8ff
    style D fill:#e8f6e8
    style E fill:#fff8e1

BigDataStatMeth’s HDF5 workflow

Key features:

  1. Automatic chunking optimized for statistical operations
  2. Metadata preservation (row names, column names)
  3. Block-wise algorithms that read/write chunks efficiently
  4. Result storage in same file for traceability

10 Practical Example: Complete Workflow

Let’s put it all together with a realistic example:

library(BigDataStatMeth)
library(rhdf5)

# 1. Create HDF5 file from existing data
set.seed(42)
large_matrix <- matrix(rnorm(10000 * 5000), nrow = 10000, ncol = 5000)

bdCreate_hdf5_matrix(
  filename = "analysis.hdf5",
  object = large_matrix,
  group = "data",
  dataset = "expression_matrix",
  overwriteFile = TRUE
)

# 2. Perform SVD without loading full matrix
svd_result <- bdSVD_hdf5(
  filename = "analysis.hdf5",
  group = "data",
  dataset = "expression_matrix",
  bcenter = TRUE,
  bscale = TRUE,
  k = 20  # Number of components
)

# 3. Check what's in the file now
h5ls("analysis.hdf5")

# 4. Extract just the components you need
components <- h5read("analysis.hdf5", svd_result$ds_v)
dim(components)  # 5000 × 20 (not 5000 × 5000!)

# 5. Clean up
h5closeAll()
Tip: What We Achieved
  • Stored the full matrix on disk in HDF5 format (the same workflow scales to the 37 GB genotype matrix from Section 1)
  • Computed a truncated SVD block-wise from the disk-stored copy, using only a fraction of the matrix's size in RAM
  • Kept the results organized in the same file as the input data
  • Can rerun or extend the analysis without re-reading the raw data

11 Interactive Exercise

11.1 Practice: Design Your Own HDF5 Structure

The best way to internalize HDF5 concepts is to apply them to your own data. This exercise guides you through creating a multi-component HDF5 file and asks you to think about organizational decisions. There are no “correct” answers - the goal is to practice translating a research design into an HDF5 file structure and to consider how different organizational choices affect usability.

This is a thinking exercise. We provide starter code and questions for reflection, but no solutions. The questions are designed to make you think about how you would structure real projects. Your answers will depend on your specific research questions and workflow.

Create an HDF5 file with your own data structure:

# 1. Create a hierarchical organization for a study
library(BigDataStatMeth)

# Simulated study data
genomic_data <- matrix(sample(0:2, 5000*1000, replace=TRUE), 5000, 1000)
expression_data <- matrix(rnorm(5000*200), 5000, 200)
clinical_data <- data.frame(
  age = rnorm(5000, 50, 10),
  gender = sample(c("M", "F"), 5000, replace = TRUE)
)
# Note: clinical_data mixes numeric and character columns, so it is not
# written to HDF5 below; deciding how to store it is part of the exercise

# 2. Organize in HDF5
bdCreate_hdf5_matrix(
  "my_study.hdf5", genomic_data, 
  group = "omics/genomics", dataset = "snps",
  overwriteFile = TRUE
)

bdCreate_hdf5_matrix(
  "my_study.hdf5", expression_data,
  group = "omics/transcriptomics", dataset = "genes"
)

# 3. Explore your creation
h5ls("my_study.hdf5")

# Questions for reflection:
# 
# 1. Longitudinal data organization:
#    - How would you organize data collected at multiple time points?
#    - Would you create separate groups for each timepoint (/timepoint1/, /timepoint2/)?
#    - Or group by data type with timepoint as a dimension within datasets?
#    - What are the trade-offs of each approach?
#
# 2. Multi-omic integration:
#    - You now have genomics and transcriptomics. What if you add:
#      * Proteomics data
#      * Metabolomics data
#      * Clinical phenotypes
#    - How would you organize these to make cross-omic analyses easy?
#    - Should raw and processed data live in different groups?
#
# 3. Derived results storage:
#    - Where would you store PCA results for each omic?
#    - Where would integration analysis results go?
#    - How do you link results back to the data they came from?
#    - Should every analysis get its own group, or organize by analysis type?
#
# 4. Metadata strategy:
#    - What attributes would you attach to each dataset?
#    - How would you document processing steps?
#    - What information needs to travel with the data for reproducibility?
#
# Try implementing one of these scenarios in code. The goal is not perfection,
# but practice thinking about data organization for real research projects.
Tip: Reflection, Not Solutions

This exercise deliberately doesn’t provide “the answer” because there often isn’t a single correct way to organize complex data. Your organizational choices should reflect:

  • Your specific research questions (what comparisons matter most?)
  • Your workflow (what data do you access together?)
  • Your collaboration needs (who else needs to use this data?)
  • Your analysis pipeline (what tools will read this file?)

The experience of designing and trying different structures teaches more than following a prescribed solution. If you’re unsure, try multiple approaches and see which one feels more natural when you go to actually use the data.

12 Key Takeaways

We’ve covered a lot of ground in understanding HDF5. Let’s consolidate the essential concepts you need to remember as you work with BigDataStatMeth and HDF5 files.

12.1 Essential Concepts

You’ve learned that HDF5 is fundamentally different from traditional file formats. Let’s review the key ideas:

HDF5 is a database for matrices, not just a file format. This distinction matters because it changes how you think about data storage. Instead of “saving a file,” you’re building a database that can contain multiple related datasets, organized hierarchically, with built-in metadata.

Hierarchical organization mirrors how you think about complex research projects. Just as you organize documents into folders, HDF5 lets you organize datasets into groups. This organization isn’t cosmetic - it helps both you and your analysis code understand the relationships between different data components.

Chunking enables efficient partial I/O by organizing data into blocks that can be read independently. This is what makes the “read only what you need” promise real rather than theoretical. Understanding chunking helps you predict which operations will be fast and which will require reading more data than you’d prefer.

Compression reduces disk usage without adding complexity to your code. HDF5 handles compression transparently, and the default settings work well for most statistical applications. You get smaller files essentially for free.

Metadata is built-in, meaning your data carries its own documentation. Six months later, you (or a colleague) can open an HDF5 file and immediately understand what it contains, when it was created, and how to interpret the values.

BigDataStatMeth automates these HDF5 best practices. The package makes intelligent decisions about chunking, compression, and metadata so you can focus on your analysis rather than storage engineering. But understanding what’s happening under the hood helps you use the package more effectively.

12.2 When to Use HDF5

Making the right choice about file formats matters for both productivity and practicality. Here’s how to decide:

Use HDF5 when:

  • Data exceeds available RAM - This is the primary use case. When you can’t load everything into memory, HDF5 lets you work with arbitrary-sized datasets by processing them in pieces.
  • Need repeated access to subsets - If your workflow involves reading different portions of data at different times, HDF5’s partial I/O capabilities pay off quickly.
  • Want organized, self-documenting storage - For complex projects with multiple data components, HDF5’s hierarchical structure and metadata support help keep everything organized and understandable.
  • Sharing data across platforms - HDF5 is supported by R, Python, MATLAB, Julia, C++, and many other languages. It provides a common data format that works across your entire analysis ecosystem.

Don’t use HDF5 when:

  • Data fits comfortably in RAM - If your dataset uses less than about 20% of your available memory, traditional formats (RData, CSV) are simpler. The overhead of HDF5 doesn’t provide benefits when everything fits in memory anyway.
  • Need simple, human-readable formats - HDF5 files are binary and require special tools to inspect. If you need to open files in a text editor or want maximum simplicity, CSV or similar formats might be better despite their limitations.
  • Working with highly irregular structures - HDF5 excels with matrices and arrays - data that has regular structure. Highly nested or irregular data structures might be better served by other formats designed for those use cases.

The decision isn’t always clear-cut. Many projects start with data that fits in memory but grow over time. Starting with HDF5 can future-proof your analysis pipeline, especially if you anticipate scaling to larger datasets or want to leverage the organizational benefits even for medium-sized data.

13 Next Steps

You now have a solid foundation in HDF5 concepts and how BigDataStatMeth uses them. The natural next step is to understand how computational algorithms are adapted to work efficiently with disk-based data.

14 Further Reading

If you want to deepen your understanding of HDF5 beyond what we’ve covered here, these resources provide different perspectives and more technical detail:

HDF5 User Guide - The official documentation from The HDF Group. This is comprehensive and authoritative, though quite technical. Good for understanding the full capabilities of HDF5 and diving into advanced features like parallel I/O, virtual datasets, and complex datatypes.

rhdf5 Bioconductor Package - The R package that BigDataStatMeth builds upon for low-level HDF5 operations. The rhdf5 documentation provides examples of direct HDF5 manipulation if you need finer control than BigDataStatMeth’s convenience functions provide.

HDFView - A free graphical tool for browsing HDF5 files visually. Essential for debugging file structure issues and understanding what your code has created. Works on Windows, Mac, and Linux.

BigDataStatMeth Documentation - The complete API reference covers all HDF5-related functions in detail, with additional examples and parameter explanations.


Note: Questions or Feedback?

If something is unclear or you’d like more examples, please open an issue on GitHub. We value your feedback to improve this educational material.