BigDataStatMeth
Scalable Statistical Computing with R, C++, and HDF5
1 What is BigDataStatMeth?
BigDataStatMeth is an R package that enables scalable statistical computing on datasets that exceed available memory. By combining:
- HDF5-based storage for disk-backed matrices
- Block-wise algorithms for memory-efficient computation
- High-performance C++ backend with parallel processing
- Dual R/C++ APIs for flexibility and integration
BigDataStatMeth allows you to perform complex statistical analyses on large datasets using standard hardware.
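The block-wise idea can be illustrated in plain base R: a cross-product t(X) %*% X can be accumulated one row block at a time, so only a single block ever needs to be in memory. This is a minimal sketch of the concept only; BigDataStatMeth applies the same strategy to disk-backed HDF5 matrices through its C++ backend.

```r
# Conceptual sketch (base R only): block-wise accumulation of t(X) %*% X.
set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)

block_crossprod <- function(X, block_size = 25) {
  result <- matrix(0, ncol(X), ncol(X))
  for (start in seq(1, nrow(X), by = block_size)) {
    end <- min(start + block_size - 1, nrow(X))
    block <- X[start:end, , drop = FALSE]  # only this block is "in memory"
    result <- result + crossprod(block)    # accumulate t(block) %*% block
  }
  result
}

all.equal(block_crossprod(X), crossprod(X))  # TRUE
```

With HDF5-backed data, each block would be read from disk instead of sliced from an in-memory matrix, but the accumulation logic is identical.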
2 What You’ll Learn Here
This documentation goes beyond the API reference to teach you the foundations you need to work with large-scale data effectively.
2.1 Learning Objectives
- Understand why traditional in-memory approaches fail with large datasets
- Master HDF5 file format and its role in big data computing
- Grasp block-wise algorithm design and implementation
- Apply BigDataStatMeth to real-world statistical problems
- Develop your own scalable statistical methods
- Integrate BigDataStatMeth into complex analytical workflows
3 Documentation Structure
The documentation is organized as a progressive learning journey:
3.1 Fundamentals
Learn the core concepts that underpin BigDataStatMeth:
- The Big Data Problem - Understanding memory limitations
- Understanding HDF5 - Deep dive into HDF5 storage format
- Block-Wise Computing - Mathematical foundations and practical design
- Linear Algebra Essentials - Key operations and decompositions
3.2 Tutorials
Step-by-step guides to get you started:
- Getting Started - Installation and first steps
- Working with HDF5 Matrices - Creating and managing data
- Your First Analysis - Complete analytical workflow
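Before working through the tutorials, it can help to see what an HDF5 file looks like from R. The sketch below uses the rhdf5 Bioconductor package, which is independent of BigDataStatMeth (an assumption here: rhdf5 is installed); it is a convenient way to inspect files that BigDataStatMeth creates.

```r
# Sketch using the rhdf5 Bioconductor package (assumed installed);
# handy for inspecting the contents of any HDF5 file.
library(rhdf5)

f <- tempfile(fileext = ".hdf5")
h5createFile(f)           # create an empty HDF5 file
h5createGroup(f, "data")  # groups act like directories inside the file
h5write(matrix(1:6, nrow = 2), f, "data/matrix1")  # write a dataset

h5ls(f)                          # list groups and datasets in the file
m <- h5read(f, "data/matrix1")   # read the dataset back into memory
dim(m)                           # 2 x 3
```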
3.3 Workflows
Complete examples of implementing statistical methods:
- Implementing PCA - Principal Component Analysis from scratch
- Implementing CCA - Canonical Correlation Analysis
- Cross-Platform Workflows - R and C++ integration
3.4 API Reference
Technical documentation for all functions:
- R Functions - Complete R API documentation
- C++ API - C++ header-only library reference
3.5 Technical Details
Advanced topics and optimization:
- Performance Optimization - Benchmarks and tuning strategies
4 Quick Start
# Install stable version from CRAN
install.packages("BigDataStatMeth")

# Load package
library(BigDataStatMeth)

# Install development version from GitHub
# (requires devtools package)
install.packages("devtools")
devtools::install_github("isglobal-brge/BigDataStatMeth")

# Load package
library(BigDataStatMeth)

4.1 Your First HDF5 Matrix
# Create an example matrix and write it to an HDF5 file
set.seed(123)
data <- matrix(rnorm(1000 * 500), nrow = 1000, ncol = 500)
bdCreate_hdf5_matrix(
filename = "my_analysis.hdf5",
object = data,
group = "data",
dataset = "matrix1"
)
# Perform SVD on HDF5 data (without loading into memory)
result <- bdSVD_hdf5(
filename = "my_analysis.hdf5",
group = "data",
dataset = "matrix1",
k = 10
)

5 Learning Path
We recommend following this sequence:
- Start with Fundamentals if you’re new to HDF5 or block-wise computing
- Follow the Tutorials for hands-on practice with BigDataStatMeth
- Study the Workflows to see complete method implementations
- Refer to API Reference when developing your own methods
- Explore Technical Details for optimization and advanced usage
6 Getting Help
- Documentation: You’re here! Use the navigation menu to explore
- GitHub Issues: Report bugs or request features
- Contact: BRGE ISGlobal
7 Citation
If you use BigDataStatMeth in your research, please cite:
citation("BigDataStatMeth")

Or use this BibTeX entry:
@Manual{BigDataStatMeth,
  title = {BigDataStatMeth: Scalable Statistical Methods for Big Data},
  author = {Dolors Pelegrí-Sisó and Juan R. González},
  year = {2025},
  note = {R package version 1.0.2},
  url = {https://CRAN.R-project.org/package=BigDataStatMeth},
}

Ready to start? Head to Understanding HDF5 to begin your learning journey!