Performs Principal Component Analysis (PCA) on a large matrix stored in an HDF5 file. PCA reduces the dimensionality of the data while preserving as much variance as possible. The implementation uses SVD internally for efficient and numerically stable computation.
filename
Character string. Path to the HDF5 file containing the input matrix.
group
Character string. Path to the group containing the input dataset.
dataset
Character string. Name of the input dataset to analyze.
ncomponents
Integer. Number of principal components to compute (default = 0, which computes all components).
bcenter
Logical. If TRUE, centers the data by subtracting column means. Default is FALSE.
bscale
Logical. If TRUE, scales the columns by their standard deviations (if the data are centered) or by their root mean square (if not). Default is FALSE.
k
Integer. Number of local SVDs to concatenate at each level (default = 2). Controls memory usage in block computation.
q
Integer. Number of levels for SVD computation (default = 1). Higher values can improve accuracy but increase computation time.
rankthreshold
Numeric. Threshold for determining matrix rank (default = 0). Must be between 0 and 0.1.
SVDgroup
Character string. Group name where intermediate SVD results are stored. If SVD was previously computed, results will be reused from this group.
overwrite
Logical. If TRUE, forces recomputation of the SVD even if results already exist. Default is FALSE.
method
Character string. Computation method:
* "auto": Automatically selects the method based on matrix size
* "blocks": Uses block-based computation (for large matrices)
* "full": Performs direct computation (for smaller matrices)
threads
Integer. Number of threads for parallel computation.
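The effect of bcenter and bscale can be previewed on a small in-memory matrix with base R. The sketch below assumes the preprocessing follows the same convention as base R's scale(), which is what the description of bscale suggests; it does not call bdPCA_hdf5, and the matrix X is made up for illustration.

```r
# Illustration only: base R's scale() on a small in-memory matrix.
set.seed(1)
X <- matrix(rnorm(50, mean = 5, sd = 2), nrow = 10, ncol = 5)

# Centered and scaled: each centered column is divided by its standard deviation
Xcs <- scale(X, center = TRUE, scale = TRUE)

# Scaled without centering: each column is divided by its root mean square,
# defined by base::scale() as sqrt(sum(x^2) / (n - 1))
Xs  <- scale(X, center = FALSE, scale = TRUE)
rms <- sqrt(colSums(X^2) / (nrow(X) - 1))

col_means <- colMeans(Xcs)      # approximately 0 for every column
col_sds   <- apply(Xcs, 2, sd)  # 1 for every column, up to rounding
```

Centering without scaling (bcenter = TRUE, bscale = FALSE) would correspond to scale(X, center = TRUE, scale = FALSE) under the same assumption.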
4 Value
A list containing the paths to the PCA results stored in the HDF5 file:
fn: Character string. Path to the HDF5 file containing the results
lambda: Character string. Dataset path to the eigenvalues (λ)
variance: Character string. Dataset path to variance explained by each PC
cumvar: Character string. Dataset path to cumulative variance explained
var.coord: Character string. Dataset path to variable coordinates on the PCs
var.cos2: Character string. Dataset path to squared cosines (quality of representation) for variables
ind.dist: Character string. Dataset path to distances of individuals from the origin
components: Character string. Dataset path to principal components (rotated data)
ind.coord: Character string. Dataset path to individual coordinates on the PCs
ind.cos2: Character string. Dataset path to squared cosines (quality of representation) for individuals
ind.contrib: Character string. Dataset path to contributions of individuals to each PC
All results are written to the HDF5 file in the group ‘PCA/dataset’.
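To make several of these quantities concrete, here is a minimal in-memory sketch that derives them from an SVD in base R. The normalizations (division by n - 1, percentages rather than proportions) follow common PCA conventions and are assumptions, not taken from the bdPCA_hdf5 source; the variable names simply mirror the list entries above, and only a subset of the outputs is shown.

```r
# Illustration only: PCA summaries from an in-memory SVD.
# The conventions used here are assumptions and may differ from bdPCA_hdf5.
set.seed(42)
X  <- matrix(rnorm(200), nrow = 20, ncol = 10)
Xc <- scale(X, center = TRUE, scale = FALSE)   # center the columns
n  <- nrow(Xc)

s <- svd(Xc)                                   # Xc = U diag(d) V'

lambda   <- s$d^2 / (n - 1)                    # eigenvalues of the covariance matrix
variance <- 100 * lambda / sum(lambda)         # % variance explained by each PC
cumvar   <- cumsum(variance)                   # cumulative % variance explained

ind_coord   <- s$u %*% diag(s$d)               # individual coordinates (equals Xc %*% V)
ind_dist2   <- rowSums(Xc^2)                   # squared distance of each individual from the origin
ind_cos2    <- ind_coord^2 / ind_dist2         # quality of representation on each PC
ind_contrib <- 100 * sweep(ind_coord^2, 2, s$d^2, "/")  # % contribution to each PC
```

With all components retained, the squared cosines of each individual sum to 1 and the contributions to each PC sum to 100, which is a convenient sanity check on any PCA output.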
5 Details
This function implements a scalable PCA algorithm suitable for large matrices that may not fit in memory. Key features include:
- Automatic method selection based on matrix size
- Block-based computation for large matrices
- Optional data preprocessing (centering and scaling)
- Parallel processing support
- Memory-efficient incremental algorithm
- Reuse of existing SVD results
The implementation uses SVD internally and supports two computation methods:
- Full decomposition: Suitable for matrices that fit in memory
- Block-based decomposition: For large matrices, uses an incremental algorithm
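The idea behind a block-based decomposition can be sketched in base R. The example below performs one level of a merge-style SVD: the matrix is split into k column blocks, each block is decomposed locally, and the concatenated local factors are decomposed once more. This is an illustration of the general technique under stated assumptions, not the package's implementation; in particular, splitting by columns and the mapping of k onto the k argument are hypothetical choices made for this sketch.

```r
# Illustration only: one level of a block ("merge") SVD on an in-memory matrix.
set.seed(7)
A <- matrix(rnorm(100 * 12), nrow = 100, ncol = 12)
k <- 2                                        # number of column blocks (hypothetical split)

# Assign columns to k contiguous blocks
block_id <- ceiling(seq_len(ncol(A)) * k / ncol(A))

# Local SVD of each block, keeping the factor U_i %*% diag(d_i)
local <- lapply(split(seq_len(ncol(A)), block_id), function(cols) {
  s <- svd(A[, cols, drop = FALSE])
  s$u %*% diag(s$d, nrow = length(s$d))
})

# Concatenate the local factors and decompose the result
B       <- do.call(cbind, local)
s_block <- svd(B)
s_full  <- svd(A)                             # direct decomposition, for comparison

max_diff <- max(abs(s_block$d - s_full$d))    # difference between the singular values
```

The two decompositions agree because A %*% t(A) equals B %*% t(B), so both share singular values and left singular vectors (up to sign). Only one block needs to be held in memory at a time, which is what makes the approach attractive for matrices stored on disk.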
6 Examples
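    # Create a sample matrix and store it in an HDF5 file
    library(rhdf5)
    library(BigDataStatMeth)  # assumed to be the package that provides bdPCA_hdf5
    X <- matrix(rnorm(10000), 1000, 10)
    h5createFile("data.h5")
    h5createGroup("data.h5", "data")  # create the parent group before writing into it
    h5write(X, "data.h5", "data/matrix")

    # Basic PCA with default parameters
    bdPCA_hdf5("data.h5", "data", "matrix")

    # PCA with preprocessing and a specific number of components
    bdPCA_hdf5("data.h5", "data", "matrix",
               ncomponents = 3,
               bcenter = TRUE, bscale = TRUE,
               method = "blocks",
               threads = 4)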
## Usage

```r
bdPCA_hdf5(filename, group, dataset, ncomponents = 0L, bcenter = FALSE, bscale = FALSE,
           k = 2L, q = 1L, rankthreshold = 0.0, SVDgroup = NULL, overwrite = FALSE,
           method = NULL, threads = NULL)
```

## See Also

::: {.see-also}
- [bdSVD_hdf5](bdSVD_hdf5.html) for the underlying SVD computation
- [bdNormalize_hdf5](../hdf5_statistics/bdNormalize_hdf5.html) for data preprocessing options
:::