Getting Started
Installation and First Steps with BigDataStatMeth
1 Overview
BigDataStatMeth enables statistical analysis on datasets too large for RAM by using HDF5 file storage and block-wise processing. This tutorial guides you through installation and your first analysis, ensuring you have a working environment before diving into more complex operations.
Think of this as setting up your laboratory bench before starting experiments. You’ll verify that all your tools work correctly with small, manageable examples before scaling to real big data analyses. This approach prevents frustrating debugging sessions later when you’re working with large files.
1.1 What You’ll Learn
By the end of this tutorial, you will:
- Install BigDataStatMeth and all required dependencies correctly
- Create your first HDF5 matrix from R data
- Perform basic matrix operations on HDF5-stored data
- Understand the HDF5 file structure and how to navigate it
- Verify your installation works correctly with test examples
- Know where to find help when you encounter issues
- Be prepared for more advanced tutorials
2 Prerequisites
2.1 System Requirements
Operating System:
- Linux (recommended)
- macOS
- Windows (requires Rtools - see below)

R Version:
- R ≥ 4.0.0 (check with R.version.string)

RAM:
- Minimum: 4 GB
- Recommended: 8+ GB for comfortable work

Disk Space:
- ~500 MB for the package and its dependencies
- Additional space for your HDF5 data files
BigDataStatMeth is compiled from C++ source code. Windows users must install Rtools before installing the package.
Download the Rtools version matching your R installation from CRAN before installing BigDataStatMeth.
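If you want to confirm that a compiler toolchain is set up before installing, one optional check (it assumes the pkgbuild package, which is not a BigDataStatMeth dependency) is:

# Optional: confirm that a working compiler toolchain is available.
# pkgbuild is a separate helper package; install it first if you want this check.
if (requireNamespace("pkgbuild", quietly = TRUE)) {
  pkgbuild::has_build_tools(debug = TRUE)   # TRUE means compilation should work
}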
3 Step 1: Install Dependencies
BigDataStatMeth requires packages from both CRAN and Bioconductor.
3.1 Install BiocManager
# BiocManager helps install Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
3.2 Install Required Packages
# CRAN packages
install.packages(c("Matrix", "RcppEigen", "RSpectra"))
# Bioconductor packages
BiocManager::install(c("rhdf5", "HDF5Array"))

If installation fails:
- Update R: Some packages require recent R versions
- Update BiocManager: Run BiocManager::install() with no arguments
- Check compilation tools: Especially on Windows/macOS
- Check installation logs: Look for specific error messages about missing libraries
4 Step 2: Install BigDataStatMeth
4.1 From CRAN (Recommended - Stable Version)
The stable version is available on CRAN:
install.packages("BigDataStatMeth")This installs the latest stable, tested release.
4.2 From GitHub (Development Version)
For the latest development features:
# Install devtools if needed
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
# Install development version from GitHub
devtools::install_github("isglobal-brge/BigDataStatMeth")

Use the CRAN version if:
- You want maximum stability
- You’re doing production analysis
- You prefer well-tested releases

Use the GitHub version if:
- You need the latest features
- You’re contributing to development
- You want to test new functionality
5 Step 3: Load and Verify
5.1 Load the Package
library(BigDataStatMeth)

If no errors appear, the package loaded successfully!
You might see examples that load library(rhdf5). This is only necessary when you use rhdf5 functions directly like h5ls(), h5read(), or H5Fopen().
BigDataStatMeth already depends on rhdf5 internally, so you don’t need to load it for BigDataStatMeth functions. However, in this tutorial we’ll use some rhdf5 functions for file inspection, so we’ll load it when needed.
Key point: BigDataStatMeth implements its own high-level functions. We use rhdf5’s inspection functions because they’re already excellent and there’s no need to reimplement them.
5.2 Quick Verification
Run this simple test to verify everything works:
# Create a small test matrix
test_matrix <- matrix(rnorm(100), nrow = 10, ncol = 10)
# Create HDF5 file
test_file <- "verification_test.hdf5"
bdCreate_hdf5_matrix(
filename = test_file,
object = test_matrix,
group = "test",
dataset = "data",
overwriteFile = TRUE
)

$fn
[1] "verification_test.hdf5"
$ds
[1] "test/data"
# Verify file was created
if (file.exists(test_file)) {
cat("✓ Installation verified!\n")
cat("✓ HDF5 file created successfully\n")
# Clean up
file.remove(test_file)
} else {
cat("✗ Installation issue - file not created\n")
}

✓ Installation verified!
✓ HDF5 file created successfully
[1] TRUE
Expected output:
✓ Installation verified!
✓ HDF5 file created successfully
Common issues:
- “Package not found”: Restart R session
- “HDF5 library error”: Reinstall the rhdf5 package
- Permission denied: Check write permissions in working directory
- Symbol not found: Recompile package from source
Try sessionInfo() to check loaded packages and versions.
6 Step 4: Your First HDF5 Dataset
Now let’s create a realistic dataset and perform basic operations.
6.1 Create Sample Data
# Simulate a genomic dataset: 1,000 samples × 5,000 SNPs
set.seed(123)
n_samples <- 1000
n_snps <- 5000
genotype_data <- matrix(
sample(0:2, n_samples * n_snps, replace = TRUE),
nrow = n_samples,
ncol = n_snps
)
# Add meaningful row/column names
rownames(genotype_data) <- paste0("Sample_", 1:n_samples)
colnames(genotype_data) <- paste0("SNP_", 1:n_snps)
# Check size
format(object.size(genotype_data), units = "MB")

[1] "19.5 Mb"
6.2 Save to HDF5
# Create HDF5 file
data_file <- "my_first_dataset.hdf5"
bdCreate_hdf5_matrix(
filename = data_file,
object = genotype_data,
group = "genotypes",
dataset = "data",
overwriteFile = TRUE
)

$fn
[1] "my_first_dataset.hdf5"

$ds
[1] "genotypes/data"

cat("✓ Dataset saved to HDF5\n")

✓ Dataset saved to HDF5
HDF5 files organize data hierarchically:
- File: my_first_dataset.hdf5
  - Group: genotypes (like a folder)
    - Dataset: data (the actual matrix)

Think of groups as folders and datasets as files inside them.
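For example, you could add a second matrix to the same file under its own group. The "phenotypes" name below is purely illustrative; the call mirrors the second bdCreate_hdf5_matrix() call used later in Step 5, which adds a dataset to an existing file:

# Illustrative only: add a second matrix to the same file, in its own group
pheno <- matrix(rnorm(n_samples * 3), nrow = n_samples, ncol = 3)
bdCreate_hdf5_matrix(
  filename = data_file,
  object = pheno,
  group = "phenotypes",     # a second group alongside "genotypes"
  dataset = "data"
)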
6.3 Inspect the File
For file inspection, we use rhdf5 functions:
library(rhdf5) # Needed for h5ls()
# List file contents
h5ls(data_file)

                      group           name       otype   dclass         dim
0                         /      genotypes   H5I_GROUP
1                /genotypes .data_dimnames   H5I_GROUP
2 /genotypes/.data_dimnames              1 H5I_DATASET COMPOUND        1000
3 /genotypes/.data_dimnames              2 H5I_DATASET COMPOUND        5000
4                /genotypes           data H5I_DATASET  INTEGER 1000 x 5000
6.4 Read Data Back
# Read a small portion
small_chunk <- h5read(data_file, "/genotypes/data",
index = list(1:5, 1:10))
small_chunk

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2 1 2 2 2 0 1 0 2 0
[2,] 2 2 1 2 2 1 0 0 0 2
[3,] 2 1 0 0 2 2 1 0 0 0
[4,] 1 0 0 0 2 1 0 0 1 1
[5,] 2 2 0 1 0 2 0 2 2 2
# Verify it matches original
all.equal(small_chunk, genotype_data[1:5, 1:10])

[1] "Attributes: < Length mismatch: comparison on first 1 components >"
If you see attribute-related output such as "Attributes: < Length mismatch: comparison on first 1 components >" or "Attributes: < Component 'dimnames': target is NULL, current is list >", this means the matrices are numerically identical but have different attributes (row/column names).
This is expected: HDF5 stores dimension names separately from the data matrix. The numeric values are identical, which is what matters for calculations. You can verify with:
# Compare just the numeric values
all.equal(as.numeric(small_chunk), as.numeric(genotype_data[1:5, 1:10]))

[1] TRUE
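Partial reads like this are the main point of HDF5 storage: you can pull any slice without loading the full matrix. For instance (the column number below is arbitrary):

# Read all 1,000 samples for a single SNP without touching the rest of the file
one_snp <- h5read(data_file, "/genotypes/data", index = list(NULL, 42))
dim(one_snp)   # 1000 x 1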
7 Step 5: Basic Operations
Let’s perform operations directly on the HDF5 file using BigDataStatMeth functions.
7.1 Matrix Multiplication
# Create two smaller matrices for demonstration
set.seed(456)
A <- matrix(rnorm(500 * 100), nrow = 500, ncol = 100)
B <- matrix(rnorm(100 * 200), nrow = 100, ncol = 200)
# Save to HDF5
example_file <- "operations_example.hdf5"
bdCreate_hdf5_matrix(filename = example_file, object = A,
group = "matrices", dataset = "A",
overwriteFile = TRUE)

$fn
[1] "operations_example.hdf5"

$ds
[1] "matrices/A"

bdCreate_hdf5_matrix(filename = example_file, object = B,
group = "matrices", dataset = "B")

$fn
[1] "operations_example.hdf5"

$ds
[1] "matrices/B"
# Perform block-wise multiplication
result <- bdblockmult_hdf5(
filename = example_file,
group = "matrices",
A = "A",
B = "B",
outgroup = "results"
)
# Read and display a portion of the result
result_sample <- h5read(result$fn, result$ds, index = list(1:5, 1:5))
cat("Result preview (first 5×5):\n")Result preview (first 5×5):
print(result_sample) [,1] [,2] [,3] [,4] [,5]
[1,] 3.966529 -1.596117 -10.575227 -1.523170 13.37335
[2,] 10.496300 -4.726875 16.389046 3.817424 16.47317
[3,] -8.331547 -9.299947 -3.084136 4.030101 11.75505
[4,] -5.159786 2.319629 2.815899 -3.407731 -5.72410
[5,] 3.384955 18.492893 1.697668 -13.890756 10.48865
cat("\n✓ Matrix multiplication completed\n")
✓ Matrix multiplication completed
cat("Result stored in:", result$ds, "\n")Result stored in: results/A_x_B
Notice we multiplied a 500×100 matrix by a 100×200 matrix without loading the full result into memory. BigDataStatMeth processed this in blocks and wrote directly to HDF5.
For truly large matrices (100 GB+), this same code works identically - just takes longer.
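Because this example is small, you can optionally cross-check the HDF5 result against base R. A tolerance is used because the stored result may differ slightly in floating-point precision:

# Optional cross-check against the in-memory product (small examples only)
in_memory <- (A %*% B)[1:5, 1:5]
all.equal(as.numeric(result_sample), as.numeric(in_memory), tolerance = 1e-6)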
7.2 Crossproduct Operation
# Compute t(A) %*% A
crossprod_result <- bdCrossprod_hdf5(
filename = example_file,
group = "matrices",
A = "A"
)
# Read and display a portion
crossprod_sample <- h5read(crossprod_result$fn, crossprod_result$ds,
index = list(1:5, 1:5))
cat("Crossproduct result preview (first 5×5):\n")Crossproduct result preview (first 5×5):
print(crossprod_sample) [,1] [,2] [,3] [,4] [,5]
[1,] 475.947503 -26.983942 1.705196 18.276113 22.03469
[2,] -26.983942 488.288982 -24.322274 6.712965 -14.34956
[3,] 1.705196 -24.322274 498.798075 9.800220 52.82859
[4,] 18.276113 6.712965 9.800220 511.565405 -26.92909
[5,] 22.034686 -14.349559 52.828589 -26.929093 438.22361
cat("\n✓ Crossprod completed\n")
✓ Crossprod completed
cat("Result dimensions should be 100 × 100\n")Result dimensions should be 100 × 100
# Verify dimensions
h5ls(example_file)

      group            name       otype dclass       dim
0         /          OUTPUT   H5I_GROUP
1   /OUTPUT CrossProd_A_x_A H5I_DATASET  FLOAT 100 x 100
2         /        matrices   H5I_GROUP
3 /matrices               A H5I_DATASET  FLOAT 500 x 100
4 /matrices               B H5I_DATASET  FLOAT 100 x 200
5         /         results   H5I_GROUP
6  /results           A_x_B H5I_DATASET  FLOAT 500 x 200
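Since this crossproduct is only 100 × 100, you can read it back in full and confirm both the dimensions and the values against base R's crossprod() - something you would skip for genuinely large results:

# The result is small enough here to read back completely for a sanity check
full_cp <- h5read(crossprod_result$fn, crossprod_result$ds)
dim(full_cp)                                                # 100 100
all.equal(as.numeric(full_cp), as.numeric(crossprod(A)), tolerance = 1e-6)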
8 Step 6: Clean Up
# Close any open HDF5 connections
h5closeAll()
# Remove test files (optional)
file.remove(data_file, example_file)

[1] TRUE TRUE
9 Interactive Exercise
9.1 Practice: Creating Your Own HDF5 Workflow
Now that you’ve seen the basic operations, try designing a small workflow with data relevant to your work. This helps internalize the concepts through hands-on practice.
# Exercise: Create a mini-analysis workflow
# 1. Generate or load your own data
my_data <- matrix(rnorm(1000 * 500), nrow = 1000, ncol = 500)
# Replace with: read.csv(), or data from your field
# 2. Save to HDF5 with meaningful names
bdCreate_hdf5_matrix(
filename = "my_analysis.hdf5",
object = my_data,
group = "raw_data", # Choose descriptive name
dataset = "measurements", # What does this represent?
overwriteFile = TRUE
)
# 3. Perform an operation
result <- bdCrossprod_hdf5(
filename = "my_analysis.hdf5",
group = "raw_data",
A = "measurements",
outgroup = "processed"
)
# 4. Verify the result
h5ls("my_analysis.hdf5") # Examine the structureAs you work through this exercise, consider:
1. File Organization:
- How are you structuring your HDF5 file? (groups and datasets)
- If you added more data tomorrow, where would it go?
- Could someone else understand your organization scheme?

2. Memory vs. Disk (see the sketch after this list):
- At what data size would your operation fail in memory?
- How much disk space does your HDF5 file use?
- Compare: object size in R (object.size(my_data)) vs. file size on disk

3. Operation Choice:
- Why did you choose bdCrossprod_hdf5() over other operations?
- What statistical question does this operation answer?
- Could you achieve the same result by composing multiple simpler operations?

4. Error Handling:
- What happens if you try to create a dataset that already exists?
- Try it: What does the error message tell you?
- How do you fix it? (overwriteDataset = TRUE or a different name?)

5. Scaling Up:
- Your test used 1,000 × 500. What about 100,000 × 50,000?
- Which operations would work unchanged?
- Which would need adjustment (block sizes, memory)?
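A minimal sketch for consideration 2, assuming the my_data object and the my_analysis.hdf5 file created in the exercise code above:

# Compare in-memory object size with size on disk
format(object.size(my_data), units = "MB")
file.size("my_analysis.hdf5") / 1024^2   # file size in MB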
Don’t worry about “correct” answers - the goal is developing intuition about when and how to use these tools. Each analysis scenario is unique, and hands-on experimentation builds understanding better than reading alone.
10 What You’ve Accomplished
✅ Installed BigDataStatMeth and all dependencies
✅ Created your first HDF5 dataset
✅ Performed block-wise operations on HDF5 data
✅ Understood the HDF5 file structure (groups and datasets)
✅ Verified that block-wise processing works correctly
11 Next Steps
Now that you have BigDataStatMeth working, continue learning:
Continue the tutorial series:
- ✅ Getting Started (you are here)
- → Working with HDF5 Matrices - File operations, data conversion, and management
- → Your First Analysis - Complete workflow from raw data to results
Explore practical workflows:
- Implementing PCA - Principal Component Analysis on genomic data
- Implementing CCA - Canonical Correlation Analysis
Dive deeper into concepts:
- Understanding HDF5 - How HDF5 storage works
- Block-Wise Computing - Algorithms behind the scenes
12 Getting Help
If you encounter issues:
- Check documentation: Most functions have detailed help, e.g. ?bdCreate_hdf5_matrix
- Review examples: Package vignettes contain working code
- GitHub Issues: Report bugs at isglobal-brge/BigDataStatMeth
# Check package version
packageVersion("BigDataStatMeth")
# View all loaded packages
sessionInfo()
# Test HDF5 installation
rhdf5::h5version()
# Check working directory (where files are created)
getwd()

13 Key Takeaways
Let’s consolidate what you’ve learned about setting up and using BigDataStatMeth for the first time.
13.1 Essential Concepts
Installation creates the foundation for all subsequent work with BigDataStatMeth. Without properly installed dependencies (rhdf5, HDF5Array, compilation tools), nothing else works. Windows users face additional complexity requiring Rtools, but following the installation sequence systematically prevents most problems. Testing with small examples immediately after installation catches configuration issues before you invest time in real analyses.
HDF5 files are organized hierarchically like a file system, with groups acting as folders and datasets as files. This structure isn’t just organizational convenience - it enables efficient partial I/O where you read only the data you need. Creating your first HDF5 file teaches this fundamental paradigm: data lives on disk, accessed selectively, rather than entirely in RAM. Good organization from the start saves confusion when projects grow complex.
Block-wise processing happens automatically behind BigDataStatMeth’s functions. You don’t partition matrices manually or manage memory explicitly - the package handles block sizes, I/O patterns, and result aggregation internally. When you call bdCrossprod_hdf5(), it looks like a single function call, but executes sophisticated block-wise algorithms transparently. This abstraction is the package’s main value: complexity hidden, scaling achieved.
Verification prevents wasted effort. Testing installations with small examples (100×100 matrices) catches problems when they’re easy to fix. If basic operations fail on tiny test data, they won’t mysteriously work on real 100,000×100,000 matrices. Small-scale testing establishes that your environment works correctly before investing hours generating or converting large datasets.
The R API suffices for most users. Functions like bdCreate_hdf5_matrix(), bdSVD_hdf5(), and bdCrossprod_hdf5() provide complete functionality for standard analyses. The C++ API exists for developers implementing novel statistical methods who need fine-grained control over algorithms and memory management. Unless you’re developing new methods from scratch, the R interface provides everything needed.
13.2 When to Use BigDataStatMeth
Making informed decisions about when BigDataStatMeth helps versus when simpler approaches suffice saves time and prevents unnecessary complexity.
✅ Use BigDataStatMeth when:
Data exceeds 30% of available RAM - This threshold provides headroom for intermediate computations and operating system needs. Below 30%, traditional R approaches work fine. Above 30%, you risk memory exhaustion during operations, and disk-based computing becomes necessary.
You’re starting a new analysis project - Converting data to HDF5 at the beginning avoids migration pain later. It’s easier to start organized than to reorganize mid-project when you discover your data has grown beyond memory limits.
Multiple analyses will reuse the same data - Converting to HDF5 once pays off when you’ll run PCA, then regression, then association tests on the same dataset. The upfront conversion cost amortizes across repeated analyses.
Your workflow spans multiple tools - If you work in R, Python, and command-line tools, HDF5 provides a common format all can read efficiently. This beats converting between CSV, RData, and tool-specific formats repeatedly.
❌ Traditional R works better when:
Data comfortably fits in less than 20% of RAM - If data <- read.csv(file) works without issues, stick with familiar R approaches. Traditional methods are simpler, more flexible, and better supported by the broader R ecosystem. Don’t add complexity unnecessarily.

You’re doing one-off exploratory analysis - For quick investigations you won’t repeat, the HDF5 conversion overhead outweighs the benefits. Load data, explore, save key results, discard working data, and you’re done.
You need maximum flexibility - In-memory R data structures support arbitrary manipulations trivially: reshape, subset, transform however you want. HDF5 adds structure, which aids organization but constrains spontaneous manipulations. If your workflow involves many ad-hoc transformations, staying in memory maintains flexibility.
The key question isn’t just “can my data fit in memory?” but “does my workflow benefit from disk-based computing?” Sometimes the answer is obvious (500 GB dataset, 32 GB RAM → yes). Sometimes it’s contextual (40 GB dataset, 64 GB RAM → depends on specific operations needed). Understanding your analysis requirements and computational resources helps make this decision rationally rather than by trial and error.
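As a rough way to apply the 30% rule of thumb, estimate the in-memory footprint of a dense numeric matrix (8 bytes per double-precision value) and compare it with your RAM. The 100,000 × 50,000 matrix and 64 GB of RAM below are just the hypothetical figures used earlier:

# Back-of-the-envelope sizing: does the matrix exceed 30% of RAM?
n <- 100000; p <- 50000
matrix_gb <- n * p * 8 / 1024^3   # ~37 GB for a dense double-precision matrix
ram_gb <- 64                      # replace with your machine's RAM
matrix_gb > 0.30 * ram_gb         # TRUE here, so disk-based computing makes sense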