BigDataStatMeth
Scalable Statistical Computing with R, C++, and HDF5
1 What is BigDataStatMeth?
BigDataStatMeth is an R package that enables scalable statistical computing on datasets that exceed available memory. By combining:
- HDF5-based storage for disk-backed matrices
- Block-wise algorithms for memory-efficient computation
- High-performance C++ backend with parallel processing
- Dual R/C++ APIs for flexibility and integration
BigDataStatMeth allows you to perform complex statistical analyses on large datasets using standard hardware.
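The block-wise idea can be illustrated in plain base R: a cross-product t(X) %*% X can be accumulated one row block at a time, so only a single block ever needs to be in memory. This is a minimal sketch of the concept only; BigDataStatMeth applies the same strategy to disk-backed HDF5 matrices through its C++ backend.

```r
# Conceptual sketch (base R only): block-wise accumulation of t(X) %*% X.
set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)

block_crossprod <- function(X, block_size = 25) {
  result <- matrix(0, ncol(X), ncol(X))
  for (start in seq(1, nrow(X), by = block_size)) {
    end <- min(start + block_size - 1, nrow(X))
    block <- X[start:end, , drop = FALSE]  # only this block is "in memory"
    result <- result + crossprod(block)    # accumulate t(block) %*% block
  }
  result
}

all.equal(block_crossprod(X), crossprod(X))  # TRUE
```

With HDF5-backed data, each block would be read from disk instead of sliced from an in-memory matrix, but the accumulation logic is identical.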
2 What You’ll Learn Here
This documentation goes beyond the API reference to teach you the foundations you need to work with large-scale data effectively.
2.1 Learning Objectives
- Understand why traditional in-memory approaches fail with large datasets
- Master HDF5 file format and its role in big data computing
- Grasp block-wise algorithm design and implementation
- Apply BigDataStatMeth to real-world statistical problems
- Develop your own scalable statistical methods
- Integrate BigDataStatMeth into complex analytical workflows
3 Documentation Structure
The documentation is organized as a progressive learning journey:
3.1 Fundamentals
Learn the core concepts that underpin BigDataStatMeth:
- The Big Data Problem - Understanding memory limitations
- Understanding HDF5 - Deep dive into HDF5 storage format
- Block-Wise Computing - Mathematical foundations and practical design
- Linear Algebra Essentials - Key operations and decompositions
3.2 Tutorials
Step-by-step guides to get you started:
- Getting Started - Installation and first steps
- Working with HDF5 Matrices - Creating and managing data
- Your First Analysis - Complete analytical workflow
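Before working through the tutorials, it can help to see what an HDF5 file looks like from R. The sketch below uses the rhdf5 Bioconductor package, which is independent of BigDataStatMeth (an assumption here: rhdf5 is installed); it is a convenient way to inspect files that BigDataStatMeth creates.

```r
# Sketch using the rhdf5 Bioconductor package (assumed installed);
# handy for inspecting the contents of any HDF5 file.
library(rhdf5)

f <- tempfile(fileext = ".hdf5")
h5createFile(f)           # create an empty HDF5 file
h5createGroup(f, "data")  # groups act like directories inside the file
h5write(matrix(1:6, nrow = 2), f, "data/matrix1")  # write a dataset

h5ls(f)                          # list groups and datasets in the file
m <- h5read(f, "data/matrix1")   # read the dataset back into memory
dim(m)                           # 2 x 3
```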
3.3 Workflows
Complete examples of implementing statistical methods:
- Implementing PCA - Principal Component Analysis from scratch
- Implementing CCA - Canonical Correlation Analysis
- Cross-Platform Workflows - R and C++ integration
3.4 API Reference
Technical documentation for all functions:
- R Functions - Complete R API documentation
- C++ API - C++ header-only library reference
3.5 Technical Details
Advanced topics and optimization:
- Performance Optimization - Benchmarks and tuning strategies
4 Quick Start
# Install stable version from CRAN
install.packages("BigDataStatMeth")

# Load package
library(BigDataStatMeth)

# Install development version from GitHub
# (requires devtools package)
install.packages("devtools")
devtools::install_github("isglobal-brge/BigDataStatMeth")

# Load package
library(BigDataStatMeth)

4.1 Your First HDF5 Matrix
# Create an example matrix and write it to an HDF5 file
set.seed(123)
data <- matrix(rnorm(1000 * 500), nrow = 1000, ncol = 500)
bdCreate_hdf5_matrix(
filename = "my_analysis.hdf5",
object = data,
group = "data",
dataset = "matrix1"
)
# Perform SVD on HDF5 data (without loading into memory)
result <- bdSVD_hdf5(
filename = "my_analysis.hdf5",
group = "data",
dataset = "matrix1",
k = 10
)

5 Learning Path
We recommend following this sequence:
- Start with Fundamentals if you’re new to HDF5 or block-wise computing
- Follow the Tutorials for hands-on practice with BigDataStatMeth
- Study the Workflows to see complete method implementations
- Refer to API Reference when developing your own methods
- Explore Technical Details for optimization and advanced usage
6 Getting Help
- Documentation: You’re here! Use the navigation menu to explore
- GitHub Issues: Report bugs or request features
- Contact: BRGE ISGlobal
7 Citation
If you use BigDataStatMeth in your research, please cite:
citation("BigDataStatMeth")

Or use this BibTeX entry:
@Manual{BigDataStatMeth,
  title = {BigDataStatMeth: Scalable Statistical Methods for Big Data},
  author = {Dolors Pelegrí-Sisó and Juan R. González},
  year = {2025},
  note = {R package version 1.0.2},
  url = {https://CRAN.R-project.org/package=BigDataStatMeth},
}

Ready to start? Head to Understanding HDF5 to begin your learning journey!