Getting Started
Installation and First Steps with BigDataStatMeth
1 Overview
BigDataStatMeth enables statistical analysis on datasets too large for RAM by using HDF5 file storage and block-wise processing. This tutorial guides you through installation and your first analysis, ensuring you have a working environment before diving into more complex operations.
Think of this as setting up your laboratory bench before starting experiments. You’ll verify that all your tools work correctly with small, manageable examples before scaling to real big data analyses. This approach prevents frustrating debugging sessions later when you’re working with large files.
1.1 What You’ll Learn
By the end of this tutorial, you will:
- Install BigDataStatMeth and all required dependencies correctly
- Create your first HDF5 matrix from R data
- Perform basic matrix operations on HDF5-stored data
- Understand the HDF5 file structure and how to navigate it
- Verify your installation works correctly with test examples
- Know where to find help when you encounter issues
- Be prepared for more advanced tutorials
2 Prerequisites
2.1 System Requirements
Operating System:
- Linux (recommended)
- macOS
- Windows (requires Rtools - see below)

R Version:
- R ≥ 4.0.0 (check with R.version.string)

RAM:
- Minimum: 4 GB
- Recommended: 8+ GB for comfortable work

Disk Space:
- ~500 MB for the package and its dependencies
- Additional space for your HDF5 data files
BigDataStatMeth is compiled from C++ source code. Windows users must install Rtools before installing the package.
Download the Rtools version matching your R installation from CRAN before installing BigDataStatMeth.
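If you want to confirm that a compiler toolchain is set up before installing, one optional check (it assumes the pkgbuild package, which is not a BigDataStatMeth dependency) is:

# Optional: confirm that a working compiler toolchain is available.
# pkgbuild is a separate helper package; install it first if you want this check.
if (requireNamespace("pkgbuild", quietly = TRUE)) {
  pkgbuild::has_build_tools(debug = TRUE)   # TRUE means compilation should work
}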
3 Step 1: Install Dependencies
BigDataStatMeth requires packages from both CRAN and Bioconductor.
3.1 Install BiocManager
# BiocManager helps install Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
3.2 Install Required Packages
# CRAN packages
install.packages(c("Matrix", "RcppEigen", "RSpectra"))
# Bioconductor packages
BiocManager::install(c("rhdf5", "HDF5Array"))

If installation fails:
- Update R: Some packages require recent R versions
- Update BiocManager: Run BiocManager::install() with no arguments
- Check compilation tools: Especially on Windows/macOS
- Check installation logs: Look for specific error messages about missing libraries
4 Step 2: Install BigDataStatMeth
4.1 From CRAN (Recommended - Stable Version)
The stable version is available on CRAN:
install.packages("BigDataStatMeth")This installs the latest stable, tested release.
4.2 From GitHub (Development Version)
For the latest development features:
# Install devtools if needed
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
# Install development version from GitHub
devtools::install_github("isglobal-brge/BigDataStatMeth")

Use the CRAN version if:
- You want maximum stability
- You’re doing production analysis
- You prefer well-tested releases

Use the GitHub version if:
- You need the latest features
- You’re contributing to development
- You want to test new functionality
5 Step 3: Load and Verify
5.1 Load the Package
library(BigDataStatMeth)

If no errors appear, the package loaded successfully!
You might see examples that load library(rhdf5). This is only necessary when you use rhdf5 functions directly like h5ls(), h5read(), or H5Fopen().
BigDataStatMeth already depends on rhdf5 internally, so you don’t need to load it for BigDataStatMeth functions. However, in this tutorial we’ll use some rhdf5 functions for file inspection, so we’ll load it when needed.
Key point: BigDataStatMeth implements its own high-level functions. We use rhdf5’s inspection functions because they’re already excellent and there’s no need to reimplement them.
5.2 Quick Verification
Run this simple test to verify everything works:
# Create a small test matrix
test_matrix <- matrix(rnorm(100), nrow = 10, ncol = 10)
# Create HDF5 file
test_file <- "verification_test.hdf5"
bdCreate_hdf5_matrix(
filename = test_file,
object = test_matrix,
group = "test",
dataset = "data",
overwriteFile = TRUE
)

$fn
[1] "verification_test.hdf5"
$ds
[1] "test/data"
# Verify file was created
if (file.exists(test_file)) {
cat("✓ Installation verified!\n")
cat("✓ HDF5 file created successfully\n")
# Clean up
file.remove(test_file)
} else {
cat("✗ Installation issue - file not created\n")
}

✓ Installation verified!
✓ HDF5 file created successfully
[1] TRUE
Expected output:
✓ Installation verified!
✓ HDF5 file created successfully
Common issues:
- “Package not found”: Restart R session
- “HDF5 library error”: Reinstall the rhdf5 package
- Permission denied: Check write permissions in working directory
- Symbol not found: Recompile package from source
Try sessionInfo() to check loaded packages and versions.
6 Step 4: Your First HDF5 Dataset
Now let’s create a realistic dataset and perform basic operations.
6.1 Create Sample Data
# Simulate a genomic dataset: 1,000 samples × 5,000 SNPs
set.seed(123)
n_samples <- 1000
n_snps <- 5000
genotype_data <- matrix(
sample(0:2, n_samples * n_snps, replace = TRUE),
nrow = n_samples,
ncol = n_snps
)
# Add meaningful row/column names
rownames(genotype_data) <- paste0("Sample_", 1:n_samples)
colnames(genotype_data) <- paste0("SNP_", 1:n_snps)
# Check size
format(object.size(genotype_data), units = "MB")

[1] "19.5 Mb"
6.2 Save to HDF5
# Create HDF5 file
data_file <- "my_first_dataset.hdf5"
bdCreate_hdf5_matrix(
filename = data_file,
object = genotype_data,
group = "genotypes",
dataset = "data",
overwriteFile = TRUE
)

$fn
[1] "my_first_dataset.hdf5"

$ds
[1] "genotypes/data"

cat("✓ Dataset saved to HDF5\n")

✓ Dataset saved to HDF5
HDF5 files organize data hierarchically:
- File: my_first_dataset.hdf5
  - Group: genotypes (like a folder)
    - Dataset: data (the actual matrix)

Think of groups as folders and datasets as files inside them.
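For example, you could add a second matrix to the same file under its own group. The "phenotypes" name below is purely illustrative; the call mirrors the second bdCreate_hdf5_matrix() call used later in Step 5, which adds a dataset to an existing file:

# Illustrative only: add a second matrix to the same file, in its own group
pheno <- matrix(rnorm(n_samples * 3), nrow = n_samples, ncol = 3)
bdCreate_hdf5_matrix(
  filename = data_file,
  object = pheno,
  group = "phenotypes",     # a second group alongside "genotypes"
  dataset = "data"
)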
6.3 Inspect the File
For file inspection, we use rhdf5 functions:
library(rhdf5) # Needed for h5ls()
# List file contents
h5ls(data_file)

                      group           name       otype   dclass         dim
0                         /      genotypes   H5I_GROUP
1                /genotypes .data_dimnames   H5I_GROUP
2 /genotypes/.data_dimnames              1 H5I_DATASET COMPOUND        1000
3 /genotypes/.data_dimnames              2 H5I_DATASET COMPOUND        5000
4                /genotypes           data H5I_DATASET  INTEGER 1000 x 5000
6.4 Read Data Back
# Read a small portion
small_chunk <- h5read(data_file, "/genotypes/data",
index = list(1:5, 1:10))
small_chunk

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2 1 2 2 2 0 1 0 2 0
[2,] 2 2 1 2 2 1 0 0 0 2
[3,] 2 1 0 0 2 2 1 0 0 0
[4,] 1 0 0 0 2 1 0 0 1 1
[5,] 2 2 0 1 0 2 0 2 2 2
# Verify it matches original
all.equal(small_chunk, genotype_data[1:5, 1:10])

[1] "Attributes: < Length mismatch: comparison on first 1 components >"
If you see attribute-related output such as "Attributes: < Length mismatch: comparison on first 1 components >" or "Attributes: < Component 'dimnames': target is NULL, current is list >", this means the matrices are numerically identical but have different attributes (row/column names).
This is expected: HDF5 stores dimension names separately from the data matrix. The numeric values are identical, which is what matters for calculations. You can verify with:
# Compare just the numeric values
all.equal(as.numeric(small_chunk), as.numeric(genotype_data[1:5, 1:10]))

[1] TRUE
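Partial reads like this are the main point of HDF5 storage: you can pull any slice without loading the full matrix. For instance (the column number below is arbitrary):

# Read all 1,000 samples for a single SNP without touching the rest of the file
one_snp <- h5read(data_file, "/genotypes/data", index = list(NULL, 42))
dim(one_snp)   # 1000 x 1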
7 Step 5: Basic Operations
Let’s perform operations directly on the HDF5 file using BigDataStatMeth functions.
7.1 Matrix Multiplication
# Create two smaller matrices for demonstration
set.seed(456)
A <- matrix(rnorm(500 * 100), nrow = 500, ncol = 100)
B <- matrix(rnorm(100 * 200), nrow = 100, ncol = 200)
# Save to HDF5
example_file <- "operations_example.hdf5"
bdCreate_hdf5_matrix(filename = example_file, object = A,
group = "matrices", dataset = "A",
overwriteFile = TRUE)

$fn
[1] "operations_example.hdf5"

$ds
[1] "matrices/A"

bdCreate_hdf5_matrix(filename = example_file, object = B,
group = "matrices", dataset = "B")

$fn
[1] "operations_example.hdf5"

$ds
[1] "matrices/B"
# Perform block-wise multiplication
result <- bdblockmult_hdf5(
filename = example_file,
group = "matrices",
A = "A",
B = "B",
outgroup = "results"
)
# Read and display a portion of the result
result_sample <- h5read(result$fn, result$ds, index = list(1:5, 1:5))
cat("Result preview (first 5×5):\n")Result preview (first 5×5):
print(result_sample) [,1] [,2] [,3] [,4] [,5]
[1,] 3.966529 -1.596117 -10.575227 -1.523170 13.37335
[2,] 10.496300 -4.726875 16.389046 3.817424 16.47317
[3,] -8.331547 -9.299947 -3.084136 4.030101 11.75505
[4,] -5.159786 2.319629 2.815899 -3.407731 -5.72410
[5,] 3.384955 18.492893 1.697668 -13.890756 10.48865
cat("\n✓ Matrix multiplication completed\n")
✓ Matrix multiplication completed
cat("Result stored in:", result$ds, "\n")Result stored in: results/A_x_B
Notice we multiplied a 500×100 matrix by a 100×200 matrix without loading the full result into memory. BigDataStatMeth processed this in blocks and wrote directly to HDF5.
For truly large matrices (100 GB+), this same code works identically - just takes longer.
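Because this example is small, you can optionally cross-check the HDF5 result against base R. A tolerance is used because the stored result may differ slightly in floating-point precision:

# Optional cross-check against the in-memory product (small examples only)
in_memory <- (A %*% B)[1:5, 1:5]
all.equal(as.numeric(result_sample), as.numeric(in_memory), tolerance = 1e-6)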
7.2 Crossproduct Operation
# Compute t(A) %*% A
crossprod_result <- bdCrossprod_hdf5(
filename = example_file,
group = "matrices",
A = "A"
)
# Read and display a portion
crossprod_sample <- h5read(crossprod_result$fn, crossprod_result$ds,
index = list(1:5, 1:5))
cat("Crossproduct result preview (first 5×5):\n")Crossproduct result preview (first 5×5):
print(crossprod_sample) [,1] [,2] [,3] [,4] [,5]
[1,] 475.947503 -26.983942 1.705196 18.276113 22.03469
[2,] -26.983942 488.288982 -24.322274 6.712965 -14.34956
[3,] 1.705196 -24.322274 498.798075 9.800220 52.82859
[4,] 18.276113 6.712965 9.800220 511.565405 -26.92909
[5,] 22.034686 -14.349559 52.828589 -26.929093 438.22361
cat("\n✓ Crossprod completed\n")
✓ Crossprod completed
cat("Result dimensions should be 100 × 100\n")Result dimensions should be 100 × 100
# Verify dimensions
h5ls(example_file)

      group            name       otype dclass       dim
0         /          OUTPUT   H5I_GROUP
1   /OUTPUT CrossProd_A_x_A H5I_DATASET  FLOAT 100 x 100
2         /        matrices   H5I_GROUP
3 /matrices               A H5I_DATASET  FLOAT 500 x 100
4 /matrices               B H5I_DATASET  FLOAT 100 x 200
5         /         results   H5I_GROUP
6  /results           A_x_B H5I_DATASET  FLOAT 500 x 200
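Since this crossproduct is only 100 × 100, you can read it back in full and confirm both the dimensions and the values against base R's crossprod() - something you would skip for genuinely large results:

# The result is small enough here to read back completely for a sanity check
full_cp <- h5read(crossprod_result$fn, crossprod_result$ds)
dim(full_cp)                                                # 100 100
all.equal(as.numeric(full_cp), as.numeric(crossprod(A)), tolerance = 1e-6)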
8 Step 6: Clean Up
# Close any open HDF5 connections
h5closeAll()
# Remove test files (optional)
file.remove(data_file, example_file)

[1] TRUE TRUE
9 Interactive Exercise
9.1 Practice: Creating Your Own HDF5 Workflow
Now that you’ve seen the basic operations, try designing a small workflow with data relevant to your work. This helps internalize the concepts through hands-on practice.
# Exercise: Create a mini-analysis workflow
# 1. Generate or load your own data
my_data <- matrix(rnorm(1000 * 500), nrow = 1000, ncol = 500)
# Replace with: read.csv(), or data from your field
# 2. Save to HDF5 with meaningful names
bdCreate_hdf5_matrix(
filename = "my_analysis.hdf5",
object = my_data,
group = "raw_data", # Choose descriptive name
dataset = "measurements", # What does this represent?
overwriteFile = TRUE
)
# 3. Perform an operation
result <- bdCrossprod_hdf5(
filename = "my_analysis.hdf5",
group = "raw_data",
A = "measurements",
outgroup = "processed"
)
# 4. Verify the result
h5ls("my_analysis.hdf5") # Examine the structureAs you work through this exercise, consider:
1. File Organization:
- How are you structuring your HDF5 file? (groups and datasets)
- If you added more data tomorrow, where would it go?
- Could someone else understand your organization scheme?

2. Memory vs. Disk (see the sketch after this list):
- At what data size would your operation fail in memory?
- How much disk space does your HDF5 file use?
- Compare: object size in R (object.size(my_data)) vs. file size on disk

3. Operation Choice:
- Why did you choose bdCrossprod_hdf5() over other operations?
- What statistical question does this operation answer?
- Could you achieve the same result by composing multiple simpler operations?

4. Error Handling:
- What happens if you try to create a dataset that already exists?
- Try it: What does the error message tell you?
- How do you fix it? (overwriteDataset = TRUE or a different name?)

5. Scaling Up:
- Your test used 1,000 × 500. What about 100,000 × 50,000?
- Which operations would work unchanged?
- Which would need adjustment (block sizes, memory)?
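A minimal sketch for consideration 2, assuming the my_data object and the my_analysis.hdf5 file created in the exercise code above:

# Compare in-memory object size with size on disk
format(object.size(my_data), units = "MB")
file.size("my_analysis.hdf5") / 1024^2   # file size in MB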
Don’t worry about “correct” answers - the goal is developing intuition about when and how to use these tools. Each analysis scenario is unique, and hands-on experimentation builds understanding better than reading alone.
10 What You’ve Accomplished
✅ Installed BigDataStatMeth and all dependencies
✅ Created your first HDF5 dataset
✅ Performed block-wise operations on HDF5 data
✅ Understood the HDF5 file structure (groups and datasets)
✅ Verified that block-wise processing works correctly
11 Next Steps
Now that you have BigDataStatMeth working, continue learning:
Continue the tutorial series:
- ✅ Getting Started (you are here)
- → Working with HDF5 Matrices - File operations, data conversion, and management
- → Your First Analysis - Complete workflow from raw data to results
Explore practical workflows:
- Implementing PCA - Principal Component Analysis on genomic data
- Implementing CCA - Canonical Correlation Analysis
Dive deeper into concepts:
- Understanding HDF5 - How HDF5 storage works
- Block-Wise Computing - Algorithms behind the scenes
12 Getting Help
If you encounter issues:
- Check documentation: Most functions have detailed help, e.g. ?bdCreate_hdf5_matrix
- Review examples: Package vignettes contain working code
- GitHub Issues: Report bugs at isglobal-brge/BigDataStatMeth
# Check package version
packageVersion("BigDataStatMeth")
# View all loaded packages
sessionInfo()
# Test HDF5 installation
rhdf5::h5version()
# Check working directory (where files are created)
getwd()

13 Key Takeaways
Let’s consolidate what you’ve learned about setting up and using BigDataStatMeth for the first time.
13.1 Essential Concepts
Installation creates the foundation for all subsequent work with BigDataStatMeth. Without properly installed dependencies (rhdf5, HDF5Array, compilation tools), nothing else works. Windows users face additional complexity requiring Rtools, but following the installation sequence systematically prevents most problems. Testing with small examples immediately after installation catches configuration issues before you invest time in real analyses.
HDF5 files are organized hierarchically like a file system, with groups acting as folders and datasets as files. This structure isn’t just organizational convenience - it enables efficient partial I/O where you read only the data you need. Creating your first HDF5 file teaches this fundamental paradigm: data lives on disk, accessed selectively, rather than entirely in RAM. Good organization from the start saves confusion when projects grow complex.
Block-wise processing happens automatically behind BigDataStatMeth’s functions. You don’t partition matrices manually or manage memory explicitly - the package handles block sizes, I/O patterns, and result aggregation internally. When you call bdCrossprod_hdf5(), it looks like a single function call, but executes sophisticated block-wise algorithms transparently. This abstraction is the package’s main value: complexity hidden, scaling achieved.
Verification prevents wasted effort. Testing installations with small examples (100×100 matrices) catches problems when they’re easy to fix. If basic operations fail on tiny test data, they won’t mysteriously work on real 100,000×100,000 matrices. Small-scale testing establishes that your environment works correctly before investing hours generating or converting large datasets.
The R API suffices for most users. Functions like bdCreate_hdf5_matrix(), bdSVD_hdf5(), and bdCrossprod_hdf5() provide complete functionality for standard analyses. The C++ API exists for developers implementing novel statistical methods who need fine-grained control over algorithms and memory management. Unless you’re developing new methods from scratch, the R interface provides everything needed.
13.2 When to Use BigDataStatMeth
Making informed decisions about when BigDataStatMeth helps versus when simpler approaches suffice saves time and prevents unnecessary complexity.
✅ Use BigDataStatMeth when:
Data exceeds 30% of available RAM - This threshold provides headroom for intermediate computations and operating system needs. Below 30%, traditional R approaches work fine. Above 30%, you risk memory exhaustion during operations, and disk-based computing becomes necessary.
You’re starting a new analysis project - Converting data to HDF5 at the beginning avoids migration pain later. It’s easier to start organized than to reorganize mid-project when you discover your data has grown beyond memory limits.
Multiple analyses will reuse the same data - Converting to HDF5 once pays off when you’ll run PCA, then regression, then association tests on the same dataset. The upfront conversion cost amortizes across repeated analyses.
Your workflow spans multiple tools - If you work in R, Python, and command-line tools, HDF5 provides a common format all can read efficiently. This beats converting between CSV, RData, and tool-specific formats repeatedly.
❌ Traditional R works better when:
Data comfortably fits in less than 20% of RAM - If data <- read.csv(file) works without issues, stick with familiar R approaches. Traditional methods are simpler, more flexible, and better supported by the broader R ecosystem. Don’t add complexity unnecessarily.

You’re doing one-off exploratory analysis - For quick investigations you won’t repeat, the HDF5 conversion overhead outweighs the benefits. Load data, explore, save key results, discard working data, and you’re done.
You need maximum flexibility - In-memory R data structures support arbitrary manipulations trivially: reshape, subset, transform however you want. HDF5 adds structure, which aids organization but constrains spontaneous manipulations. If your workflow involves many ad-hoc transformations, staying in memory maintains flexibility.
The key question isn’t just “can my data fit in memory?” but “does my workflow benefit from disk-based computing?” Sometimes the answer is obvious (500 GB dataset, 32 GB RAM → yes). Sometimes it’s contextual (40 GB dataset, 64 GB RAM → depends on specific operations needed). Understanding your analysis requirements and computational resources helps make this decision rationally rather than by trial and error.
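As a rough way to apply the 30% rule of thumb, estimate the in-memory footprint of a dense numeric matrix (8 bytes per double-precision value) and compare it with your RAM. The 100,000 × 50,000 matrix and 64 GB of RAM below are just the hypothetical figures used earlier:

# Back-of-the-envelope sizing: does the matrix exceed 30% of RAM?
n <- 100000; p <- 50000
matrix_gb <- n * p * 8 / 1024^3   # ~37 GB for a dense double-precision matrix
ram_gb <- 64                      # replace with your machine's RAM
matrix_gb > 0.30 * ram_gb         # TRUE here, so disk-based computing makes sense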