Linear Algebra for Statistical Methods
Understanding the Mathematical Operations Behind BigDataStatMeth

```r
# Common statistical tasks - what operations do they need?

# 1. Principal Component Analysis
pca_operations <- function(X) {
  # Center the data
  # Compute X^T X or directly compute the SVD of X
  # Extract eigenvectors/values
  # Question: Which approach is more memory-efficient?
}

# 2. Linear Regression
regression_operations <- function(X, y) {
  # Solve: (X^T X) β = X^T y
  # Options: Direct inversion? QR decomposition? Normal equations?
  # Question: Which is numerically stable for large p?
}

# 3. Correlation Matrix
correlation_operations <- function(X) {
  # Center/scale X
  # Compute X^T X / (n - 1)
  # Question: For p = 50,000, what's the memory requirement?
}
```
0.1 What You’ll Learn
By the end of this section, you will:
- Understand key matrix operations used in statistical computing
- Recognize which operations BigDataStatMeth implements
- Know the computational costs of different operations
- Understand memory requirements for matrix operations
- See how these operations translate to statistical methods
- Know which operations are fast vs. slow on disk-based data
1 Why Linear Algebra Matters for Statistics
Statistical methods, especially those applied to high-dimensional data like genomics, are fundamentally built on linear algebra operations. When you compute a correlation, fit a linear model, or perform PCA, you’re executing matrix operations - even if the statistical software hides this fact behind convenient function calls.
Understanding these operations serves two purposes in the context of BigDataStatMeth. First, it helps you understand what the package is computing when you call functions like bdSVD_hdf5() or bdCrossprod_hdf5(). Second, and more importantly, it helps you understand why certain operations are faster or more memory-intensive than others, which informs decisions about how to structure your analyses.
This section provides a practical review of key linear algebra operations, focusing on their statistical interpretation and computational characteristics. We won’t derive theorems or prove properties - there are excellent textbooks for that. Instead, we focus on building intuition for how these operations work and why they matter for big data analysis.
2 Matrix Basics and Dimensions
2.1 What Matrices Represent
In statistical applications, matrices typically represent one of two things:
Data matrices: Rows are observations (individuals, samples, experiments) and columns are variables (genetic variants, genes, measurements). A data matrix X with dimensions n \times p contains n observations of p variables.
Relationship matrices: Both rows and columns represent the same set of entities, and entries represent relationships between them. Common examples include correlation matrices, covariance matrices, and similarity matrices.
Understanding which type of matrix you’re working with helps predict the computational characteristics of operations on it.
2.2 Dimension Compatibility
Matrix operations have specific rules about compatible dimensions:
Matrix multiplication C = A \times B requires:
- A is m \times k
- B is k \times n
- Result C is m \times n
The inner dimensions must match (k must be the same), while outer dimensions determine the result size. This matters for big data because:
\text{Memory}(C) = m \times n \times 8 \text{ bytes}
Even if A and B fit in memory individually, C might not if m and n are both large.
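As a quick feasibility check, this memory arithmetic is easy to script. Below is a minimal sketch in base R; the helper name result_size_gb() is illustrative, not a package function, and it uses 1 GB = 10^9 bytes to match the examples later in this section.

```r
# Estimate the memory needed to hold a dense double-precision matrix
# (8 bytes per entry, 1 GB = 1e9 bytes). Illustrative helper, not package API.
result_size_gb <- function(n_rows, n_cols) {
  n_rows * n_cols * 8 / 1e9
}

# C = A %*% B with A: 100,000 x 10,000 and B: 10,000 x 50,000
result_size_gb(1e5, 1e4)  # A alone:   8 GB
result_size_gb(1e4, 5e4)  # B alone:   4 GB
result_size_gb(1e5, 5e4)  # C = A B:  40 GB -- may not fit even if A and B do
```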
Element-wise operations (addition, subtraction, element-wise multiplication) require matrices of identical dimensions. These operations are memory-neutral - the result is the same size as the inputs.
3 Essential Matrix Operations
3.1 Transpose
The transpose A^T swaps rows and columns: if A is m \times n, then A^T is n \times m.
Statistical meaning: Transposing a data matrix switches from individuals-by-variables to variables-by-individuals, which changes what operations naturally express. For example:
- X^T X is a p \times p matrix of relationships between variables (proportional to the covariance matrix when the columns are centered)
- X X^T is an n \times n matrix of relationships between individuals
Computational note: Transpose is conceptually simple but can be expensive for large matrices stored on disk, because it requires rewriting data in a different order. BigDataStatMeth’s HDF5 functions sometimes avoid explicit transposition by cleverly reading data in the needed order.
3.2 Matrix Multiplication
Matrix multiplication C = A \times B is ubiquitous in statistics:
Uses in statistics:
- Projecting data onto a lower-dimensional space: Y = X W
- Computing fitted values in regression: \hat{y} = X \hat{\beta}
- Rotating or transforming coordinates: X_{new} = X R
Computational complexity: Multiplying m \times k and k \times n matrices requires m \times k \times n multiplications. For large matrices, this can be substantial:
- Small example: 1000 \times 1000 by 1000 \times 1000 → 10^9 operations (< 1 second on modern CPU)
- Large example: 100,000 \times 10,000 by 10,000 \times 10,000 → 10^{13} operations (hours without optimized, parallel linear algebra)
Block-wise strategy: As discussed in Block-Wise Computing, matrix multiplication decomposes naturally for disk-based data.
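For intuition, here is a minimal in-memory sketch of that decomposition: C = A B is accumulated from products of column blocks of A with the matching row blocks of B. BigDataStatMeth's HDF5 routines follow the same principle while streaming blocks from disk, though the actual implementation differs in detail.

```r
# In-memory illustration of block-wise multiplication: C = A %*% B is
# accumulated over column blocks of A paired with row blocks of B.
blockwise_multiply <- function(A, B, block_size = 256) {
  stopifnot(ncol(A) == nrow(B))
  C <- matrix(0, nrow(A), ncol(B))
  for (s in seq(1, ncol(A), by = block_size)) {
    idx <- s:min(s + block_size - 1, ncol(A))
    # Each partial product touches only one slice of A and B at a time
    C <- C + A[, idx, drop = FALSE] %*% B[idx, , drop = FALSE]
  }
  C
}

A <- matrix(rnorm(500 * 300), 500, 300)
B <- matrix(rnorm(300 * 200), 300, 200)
all.equal(blockwise_multiply(A, B), A %*% B)  # TRUE up to floating-point error
```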
3.3 Crossproduct and Transposed Crossproduct
These are special cases of multiplication involving a matrix and its transpose:
Crossproduct: X^T X (or more generally X^T Y)
Transposed crossproduct: X X^T (or more generally X Y^T)
Why they’re special: These operations are extremely common in statistics and have useful mathematical properties - X^T X and X X^T are always symmetric. More importantly, they can be computed more efficiently than general matrix multiplication.
Statistical applications:
- Sample covariance: \frac{1}{n-1} X^T X (for centered X)
- Gram matrices for kernel methods: X X^T
- Normal equations in regression: X^T X \beta = X^T y
BigDataStatMeth provides dedicated functions (bdCrossprod_hdf5() and bdtCrossprod_hdf5()) that exploit the special structure of these operations for efficiency.
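In base R, crossprod() and tcrossprod() are the in-memory counterparts of these operations. A short sketch of the identities used above, on simulated data:

```r
set.seed(1)
X <- matrix(rnorm(1000 * 50), nrow = 1000, ncol = 50)

# crossprod(X) computes t(X) %*% X without explicitly forming the transpose;
# tcrossprod(X) does the same for X %*% t(X)
all.equal(crossprod(X), t(X) %*% X)    # TRUE
all.equal(tcrossprod(X), X %*% t(X))   # TRUE

# Sample covariance of column-centered data via the crossproduct
Xc <- scale(X, center = TRUE, scale = FALSE)
all.equal(crossprod(Xc) / (nrow(X) - 1), cov(X))  # TRUE
```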
4 Matrix Decompositions
Decompositions factor a matrix into a product of simpler matrices with useful properties. They’re fundamental to many statistical methods.
4.1 Why Decompositions Matter
Rather than solving problems directly, we often decompose matrices and work with the factors. This provides:
- Numerical stability: Decompositions can be computed reliably even for ill-conditioned problems
- Insight: The factors often have interpretable meaning
- Efficiency: Certain operations on the factors are faster than on the original matrix
- Flexibility: The same decomposition serves multiple purposes
4.2 Singular Value Decomposition (SVD)
The SVD factors any m \times n matrix X as: X = U \Sigma V^T
where:
- U is m \times r with orthonormal columns (left singular vectors)
- \Sigma is r \times r diagonal with non-negative entries (singular values)
- V is n \times r with orthonormal columns (right singular vectors)
- r = \min(m, n), or less if you truncate
Statistical interpretation:
The SVD identifies orthogonal directions of variation in your data:
- Singular values (\Sigma) indicate how much variation lies along each direction
- Left singular vectors (U) express observations in terms of these directions
- Right singular vectors (V) express variables in terms of these directions
Key applications:
Principal Component Analysis (PCA): The columns of V (right singular vectors) are the principal component directions. The columns of U \Sigma are the principal component scores. The squared singular values are proportional to explained variance.
Low-rank approximation: Keeping only the first k singular values and corresponding vectors gives the best rank-k approximation to X (in least-squares sense).
Regression: SVD provides a stable way to solve least squares problems, especially when X^T X is nearly singular.
Computational challenge: Computing the full SVD requires O(\min(mn^2, m^2n)) operations, which is prohibitive for large matrices. Fortunately, we often only need the first k singular values/vectors where k \ll \min(m,n). Iterative algorithms can compute these leading components much more efficiently.
BigDataStatMeth’s bdSVD_hdf5() uses the hierarchical algorithm described in Block-Wise Computing to compute truncated SVD on matrices that don’t fit in memory.
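As an in-memory illustration of the SVD-to-PCA correspondence, the sketch below uses base R's svd() and prcomp() on simulated, column-centered data; the truncation level k is arbitrary. bdSVD_hdf5() targets the same quantities for matrices stored on disk.

```r
set.seed(42)
X  <- matrix(rnorm(200 * 20), nrow = 200, ncol = 20)
Xc <- scale(X, center = TRUE, scale = FALSE)   # PCA requires centered columns

# Keep only the first k singular triplets
k <- 5
s <- svd(Xc, nu = k, nv = k)

scores        <- s$u %*% diag(s$d[1:k])  # principal component scores (U * Sigma)
loadings      <- s$v                     # principal component directions (columns of V)
var_explained <- s$d^2 / sum(s$d^2)      # proportion of variance per component

# Same answer as prcomp(), up to sign flips of individual components
p <- prcomp(X, center = TRUE, scale. = FALSE)
all.equal(abs(scores), abs(p$x[, 1:k]), check.attributes = FALSE)  # TRUE
```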
4.3 QR Decomposition
The QR decomposition factors an m \times n matrix X (with m \geq n) as: X = Q R
where:
- Q is m \times n with orthonormal columns (Q^T Q = I)
- R is n \times n upper triangular
Statistical interpretation:
QR decomposition provides an orthogonal basis for the column space of X. The columns of Q span the same space as the columns of X, but they’re orthonormal (perpendicular and unit length).
Key applications:
Linear regression: Solving X \beta = y via QR is numerically stable. Once you have X = QR, the least-squares solution is \beta = R^{-1} Q^T y.
In practice R^{-1} is never formed explicitly: you compute Q^T y and then solve the triangular system R \beta = Q^T y by back-substitution, which is fast and stable.
Orthogonalization: QR gives you an orthonormal basis for a set of vectors, useful in many iterative algorithms.
Rank determination: The diagonal elements of R reveal the effective rank of X.
Computational note: QR decomposition requires O(mn^2) operations. For block-wise computation, BigDataStatMeth uses hierarchical QR as described in the previous section.
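The sketch below contrasts the QR route with the normal equations in base R on simulated data. For this well-conditioned example the two agree, but the QR route avoids forming X^T X, whose condition number is the square of that of X.

```r
set.seed(7)
n <- 500; p <- 10
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta_true <- rnorm(p)
y <- drop(X %*% beta_true) + rnorm(n)

# Stable route: QR decomposition, then back-substitution with triangular R
qr_X    <- qr(X)
beta_qr <- qr.coef(qr_X, y)

# Normal equations: forms X^T X explicitly (squares the condition number)
beta_ne <- solve(crossprod(X), crossprod(X, y))

all.equal(as.numeric(beta_qr), as.numeric(beta_ne))              # TRUE here
all.equal(as.numeric(beta_qr), as.numeric(coef(lm(y ~ X - 1))))  # matches lm()
```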
4.4 Cholesky Decomposition
For a symmetric positive-definite matrix A (common for covariance matrices), the Cholesky decomposition gives: A = L L^T
where L is lower triangular.
Statistical interpretation:
Cholesky decomposition is like taking the square root of a matrix. For covariance matrices, it’s related to linear transformations that generate the covariance structure.
Key applications:
Matrix inversion: A^{-1} = (L^T)^{-1} L^{-1}, and inverting triangular matrices is efficient
Simulation: To generate random variables with covariance A, generate standard normals z and compute L z
Solving linear systems: For A x = b, solve L y = b then L^T x = y (both easy with triangular L)
Computational efficiency: Cholesky is about twice as fast as general matrix factorizations and uses half the storage (only the triangular part).
BigDataStatMeth provides bdCholesky_hdf5() for computing this decomposition and bdInvCholesky_hdf5() for the common use case of inverting symmetric positive-definite matrices.
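The corresponding base-R operations, as a sketch on a small simulated covariance matrix. Note that chol() returns the upper-triangular factor U with A = U^T U, i.e. L = U^T in the notation above.

```r
set.seed(3)
# A symmetric positive-definite matrix (here, a sample covariance matrix)
Z <- matrix(rnorm(200 * 5), 200, 5)
A <- crossprod(Z) / 199

U <- chol(A)                              # upper-triangular factor: A = t(U) %*% U
all.equal(A, t(U) %*% U)                  # TRUE

# Inversion via the factor
all.equal(chol2inv(U), solve(A))          # TRUE

# Solving A x = b with two triangular solves: L y = b, then L^T x = y
b <- rnorm(5)
x <- backsolve(U, forwardsolve(t(U), b))
all.equal(drop(A %*% x), b)               # TRUE

# Simulating correlated normals with covariance A: x = L z
z   <- rnorm(5)
sim <- t(U) %*% z
```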
5 Operations on Structured Matrices
Some matrices have special structure that enables more efficient computation.
5.1 Symmetric Matrices
A matrix A is symmetric if A = A^T. Many statistical matrices are symmetric:
- Correlation matrices
- Covariance matrices
- Gram matrices (X X^T)
Computational benefits:
- Only need to store roughly half the elements (upper or lower triangle)
- Many operations can be optimized knowing the structure
- Eigenvalues are guaranteed to be real
- Eigenvectors are orthogonal
5.2 Diagonal Matrices
A diagonal matrix has non-zero elements only on the diagonal. For example, the covariance matrix of mutually uncorrelated variables is diagonal.
Computational benefits:
- Multiplication is element-wise: (D X)_{ij} = D_{ii} X_{ij}
- Inversion is trivial: D^{-1} has diagonal elements 1 / D_{ii}
- Can be stored as a vector instead of a matrix
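A quick base-R illustration of why the full diagonal matrix is never needed (small sizes chosen only to keep the check readable):

```r
set.seed(5)
X <- matrix(rnorm(4 * 3), 4, 3)
d <- c(1, 2, 3, 4)

# Scaling rows by a diagonal matrix never requires forming the matrix:
# d * X recycles d down each column, so row i is multiplied by d[i]
all.equal(diag(d) %*% X, d * X)          # TRUE

# Inverting a diagonal matrix is just reciprocals of the diagonal
all.equal(solve(diag(d)), diag(1 / d))   # TRUE
```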
5.3 Sparse Matrices
A sparse matrix has mostly zero elements. While less common in basic statistical applications, sparse matrices appear in:
- Penalty matrices in regularization
- Design matrices for categorical variables
- Network adjacency matrices
BigDataStatMeth doesn’t currently specialize for sparse matrices, as genomic data is typically dense. However, this is a potential future enhancement.
6 Matrix Norms and Conditioning
6.1 Norms: Measuring Matrix “Size”
A matrix norm is a scalar that measures the “size” of a matrix. Common norms include:
Frobenius norm: ||A||_F = \sqrt{\sum_{ij} A_{ij}^2} (like treating the matrix as a long vector)
Operator norms: Measure how much a matrix can “stretch” vectors
Statistical relevance: Norms appear in:
- Regularization (penalizing large coefficients)
- Convergence criteria (stopping when change is small)
- Error quantification (measuring approximation quality)
6.2 Condition Number
The condition number measures how sensitive a matrix’s inverse is to small changes in the matrix. For a matrix A: \kappa(A) = ||A|| \cdot ||A^{-1}||
Interpretation:
- \kappa = 1: Perfectly conditioned (e.g., the identity matrix)
- \kappa \approx 10: Well-conditioned
- \kappa > 1000: Ill-conditioned; numerical issues likely
- \kappa very large: Nearly singular; inversion unreliable
Statistical relevance:
Multicollinearity in regression: High condition number of X^T X indicates highly correlated predictors, leading to unstable coefficient estimates.
Numerical stability: Operations involving ill-conditioned matrices accumulate rounding errors. This is why methods like SVD and QR decomposition are preferred over directly computing (X^T X)^{-1} for regression.
BigDataStatMeth consideration: The package uses numerically stable algorithms, but users should be aware that even with stable algorithms, solving ill-conditioned problems can yield uninformative results. The condition number is something to check in your data, not a limitation of the computational method.
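The base-R sketch below shows how collinearity surfaces in the condition number. kappa() returns an estimate, and the predictors (x3 nearly duplicating x1) are simulated purely for illustration.

```r
set.seed(11)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)                    # independent predictor
x3 <- x1 + rnorm(n, sd = 0.001)   # nearly a copy of x1 (strong collinearity)

X_good <- cbind(1, x1, x2)
X_bad  <- cbind(1, x1, x3)

# kappa() estimates the condition number; forming X^T X squares that of X
kappa(crossprod(X_good))   # modest
kappa(crossprod(X_bad))    # far larger -- (X^T X)^{-1} becomes unreliable
```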
7 Practical Guidelines for Big Data
Understanding these operations informs decisions about computational strategy:
7.1 Memory Requirements Are Predictable
For an operation C = f(A, B):
- Check dimensions of inputs and outputs
- Calculate memory: dimensions × 8 bytes per double-precision entry
- Peak memory includes inputs, outputs, and any intermediate results
Example: Computing X^T X for X that is 100,000 \times 50,000:
- Input: 100,000 \times 50,000 = 40 GB
- Output: 50,000 \times 50,000 = 20 GB
- Peak memory during computation: both at once (60 GB)
This tells you whether an operation is feasible in-memory or requires block-wise processing.
7.2 Asymmetric Operations Create Memory Pressure
Operations where the output is much larger than inputs can exhaust memory:
Example: X X^T for X that is 100,000 \times 1000:
- Input: 100,000 \times 1000 = 800 MB
- Output: 100,000 \times 100,000 = 80 GB
The input fits easily, but the output doesn’t!
Strategy: BigDataStatMeth can write large results directly to HDF5, avoiding the need to hold them in memory.
7.3 Order of Operations Matters
Matrix multiplication is associative but not commutative: A (B C) = (A B) C, but in general A B \neq B A.
For memory efficiency, consider the order:
Computing A (B C) where A is 1000 \times 10000, B is 10000 \times 100, C is 100 \times 10:
- Option 1: Compute BC first (10000 \times 100 by 100 \times 10 → 10000 \times 10), then A(BC) → Final: 1000 \times 10
- Option 2: Compute AB first (1000 \times 10000 by 10000 \times 100 → 1000 \times 100), then (AB)C → Final: 1000 \times 10
Both orders give the same 1000 \times 10 result, but Option 1 needs roughly 1.1 \times 10^8 multiplications versus about 10^9 for Option 2, because the expensive product involving A is taken against a matrix with only 10 columns rather than 100. The order of multiplication determines both the size of intermediate matrices and the total arithmetic, so it is worth planning before computing long products.
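To see the effect concretely, here is a small base-R timing sketch using the same three matrix sizes; exact timings depend on your BLAS, so treat the numbers as illustrative.

```r
set.seed(2)
A <- matrix(rnorm(1000 * 10000), 1000, 10000)
B <- matrix(rnorm(10000 * 100),  10000, 100)
C <- matrix(rnorm(100 * 10),     100,   10)

# Same result, very different work: A %*% (B %*% C) needs ~1.1e8 multiplications,
# while (A %*% B) %*% C needs ~1e9
system.time(r1 <- A %*% (B %*% C))
system.time(r2 <- (A %*% B) %*% C)
all.equal(r1, r2)   # TRUE up to floating-point error
```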
7.4 Decompositions Are Investments
Computing a decomposition (SVD, QR, Cholesky) takes time, but the factors can be reused:
Example: After computing X = QR:
- Solving X \beta = y is fast (multiply by Q^T, then one triangular solve with R)
- Computing X^T X is fast (it equals R^T R)
- Many operations with X become operations with Q and R, which may be more efficient
For big data, it’s often worth computing a decomposition once and reusing it, rather than repeatedly computing operations on the original matrix.
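A base-R sketch of the reuse idea: factor X once, then solve against many responses and recover X^T X from R without touching X again. The dimensions are arbitrary.

```r
set.seed(9)
X <- matrix(rnorm(2000 * 50), 2000, 50)
Y <- matrix(rnorm(2000 * 100), 2000, 100)   # 100 different response vectors

# Factor once...
qr_X <- qr(X)

# ...then each additional solve is cheap back-substitution against the same factor
coefs <- qr.coef(qr_X, Y)   # 50 x 100: one column of coefficients per response
dim(coefs)

# Reusing R: t(X) %*% X equals t(R) %*% R, with no further pass over X
R <- qr.R(qr_X)
all.equal(crossprod(R), crossprod(X), check.attributes = FALSE)   # TRUE
```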
8 Interactive Exercise
8.1 Practice: Matching Statistical Methods to Matrix Operations
Understanding which matrix operations underlie statistical methods helps you predict computational requirements and choose appropriate approaches. This exercise builds that connection.
1. Operation Sequence for PCA:
   - You need PCA on a 100,000 × 50,000 matrix
   - Approach A: Compute X^T X (50k × 50k), then an eigendecomposition
   - Approach B: Direct SVD of X in blocks
   - Which operations dominate computation time?
   - Which approach uses less memory?
2. Regression with Many Predictors:
   - Fitting y ~ X where X is 10,000 × 100,000
   - Computing (X^T X)^{-1} creates a 100,000 × 100,000 matrix
   - How much memory does just this inverse require?
   - Is direct inversion even practical?
   - What alternatives exist (QR? Iterative methods?)
3. Bottleneck Analysis:
   - Your workflow: Normalize → PCA → Regression
   - Which operation will be the memory bottleneck?
   - Which will take the most time?
   - Could you reorder operations to be more efficient?
4. Block-Wise Feasibility:
   - Element-wise operations (centering): Easy to run block-wise?
   - Matrix multiplication: Block-amenable?
   - Computing all pairwise correlations: Challenges?
   - What makes some operations harder to distribute?
5. Practical Decisions:
   - You have 64 GB RAM and a 200,000 × 100,000 matrix (≈160 GB)
   - You need to compute t(X) %*% X
   - Can BigDataStatMeth help? Which function?
   - What block size would be appropriate?
Thinking through these scenarios before encountering them in real analyses helps you make informed architectural decisions and understand why BigDataStatMeth makes certain implementation choices.
9 Key Takeaways
Let’s consolidate your understanding of linear algebra operations in statistical computing and how they behave with large-scale data.
9.1 Essential Concepts
Matrix operations are the building blocks of statistical methods, even when hidden behind high-level function calls. Understanding PCA, regression, or correlation in terms of matrix operations (SVD, solving linear systems, computing crossproducts) helps you predict computational requirements and memory needs. Every statistical method ultimately reduces to a sequence of linear algebra operations.
Computational costs vary dramatically by operation. Matrix multiplication is O(n³) in the worst case but highly optimized in practice. Inversions are expensive and numerically risky. Element-wise operations are cheap and memory-neutral. These cost differences matter when scaling to big data - an O(n³) operation may be impossible where O(n²) is feasible.
Memory requirements aren’t always obvious. Multiplying an m×k matrix by a k×n matrix produces an m×n output. Even if inputs fit in memory, the output might not. This is why operations like computing all pairwise correlations (producing p×p from n×p) quickly become infeasible as p grows. Understanding output dimensions helps predict whether an operation will succeed.
Decompositions are investments that pay off through reuse. Computing QR, SVD, or Cholesky takes time upfront but enables many subsequent operations to be faster, more stable, or more memory-efficient. For big data analyses involving repeated operations, computing a decomposition once and storing it often beats recomputing basic operations many times.
BigDataStatMeth implements both in-memory and HDF5 versions of key operations. The mathematical operation is identical - X^T X produces the same result whether X lives in memory or on disk. But the computational strategy differs: in-memory operations process complete matrices at once, while HDF5 operations process blocks and manage disk I/O automatically.
Operation choice affects feasibility more than implementation details. With big data, sometimes the question isn’t “how do I implement this faster?” but “is there a mathematically equivalent but computationally cheaper way to get this result?” Reformulating (X^T X)^{-1} X^T y as a QR solve changes the problem from impossible to practical.
9.2 Operation Characteristics
Understanding which operations work well with block-wise disk-based processing helps you design efficient analyses.
✅ Efficient operations (fast, block-amenable):
Matrix multiplication (X %*% Y) - Highly optimized, decomposes naturally into blocks. BigDataStatMeth’s implementations scale well.
Crossproducts (t(X) %*% X) - Especially efficient because the transpose never needs to be formed explicitly, which reduces memory requirements and I/O operations.
Element-wise operations - Adding, subtracting, scaling by constants process independently. Perfect for block-wise with minimal overhead.
Row/column operations - Means, sums, normalizations work naturally when blocks contain complete rows or columns.
❌ Expensive operations (slow, memory-intensive):
Matrix inversions - Computationally expensive O(n³) and numerically unstable. Avoid when possible in favor of solving linear systems or using decompositions.
Full matrix factorizations - QR, Cholesky, eigendecompositions require multiple passes and careful numeric handling. Doable but requires sophisticated block-wise algorithms.
Operations creating large outputs - Computing all pairwise operations (correlations, distances) produces O(n²) results. Even block-wise processing struggles when output explodes.
For most statistical analyses, the operations you need (crossproducts, multiplication, factorizations) have efficient implementations in BigDataStatMeth. More exotic operations may require case-by-case evaluation or algorithmic creativity to adapt for big data.
10 Learning More
This section provided a practical overview of linear algebra operations in statistical computing. For deeper mathematical understanding:
Theoretical foundations:
- Matrix Computations by Golub & Van Loan - the definitive reference
- Numerical Linear Algebra by Trefethen & Bau - excellent intuition and theory
Statistical applications:
- The Elements of Statistical Learning by Hastie, Tibshirani & Friedman - Chapter 3 on linear methods
- Computational Statistics by Gentle - practical numerical methods
BigDataStatMeth specifics:
- API Reference - see which operations are implemented
- Block-Wise Computing - how operations adapt for big data
- Package vignettes - worked examples using these operations
11 Next Steps
With understanding of the underlying mathematics, you’re ready to apply these operations to real data:
- Getting Started Tutorial → Work through a complete analysis
- PCA Workflow → See SVD applied to genomic data
- API Reference → Explore available functions
If mathematical concepts in the documentation are unclear, or you’d like more detailed explanations of specific operations, please open an issue. We’re happy to expand explanations based on user feedback.