bdSVD_hdf5

HDF5_ALGEBRA

1 Description

Computes the Singular Value Decomposition (SVD) of a large matrix stored in an HDF5 file. The SVD decomposes a matrix A into a product A = UDV’ where U and V are orthogonal matrices and D is a diagonal matrix containing the singular values.

2 Usage

bdSVD_hdf5(filename, group = NULL, dataset = NULL, k = 2L, q = 1L, bcenter = TRUE, bscale = TRUE, rankthreshold = 0.0, overwrite = NULL, method = NULL, threads = NULL)

3 Arguments

Parameter Description
filename Character string. Path to the HDF5 file containing the input matrix.
group Character string. Path to the group containing the input dataset.
dataset Character string. Name of the input dataset to decompose.
k Integer. Number of local SVDs to concatenate at each level (default = 2). Controls the trade-off between memory usage and computation speed.
q Integer. Number of levels for SVD computation (default = 1). Higher values can improve accuracy but increase computation time.
bcenter Logical. If TRUE (default), centers the data by subtracting column means.
bscale Logical. If TRUE (default), scales the centered columns by their standard deviations or root mean square.
rankthreshold Numeric. Threshold for determining matrix rank (default = 0). Must be between 0 and 0.1. Used to approximate rank for nearly singular matrices.
overwrite Logical. If TRUE, allows overwriting existing results.
method Character string. Computation method: * “auto”: Automatically selects between “full” and “blocks” based on matrix size * “blocks”: Uses block-based computation (recommended for large matrices) * “full”: Performs direct computation without partitioning
threads Integer. Number of threads for parallel computation.

4 Value

A list with the following elements:

  • fn: Path to the HDF5 file
  • ds_d: Path to the dataset containing singular values
  • ds_u: Path to the dataset containing left singular vectors
  • ds_v: Path to the dataset containing right singular vectors

5 Details

This function implements a block-based SVD algorithm suitable for large matrices that may not fit in memory. Key features include: - Automatic method selection based on matrix size - Block-based computation for large matrices - Data centering and scaling options - Parallel processing support - Rank approximation through threshold - Memory-efficient incremental algorithm

The implementation uses an incremental algorithm with two key parameters: - k: number of local SVDs to concatenate at each level - q: number of levels in the computation

6 Examples

Code
# Create a sample large matrix in HDF5

library(BigDataStatMeth)
library(rhdf5)

# Create a sample large matrix in HDF5
A <- matrix(rnorm(10000), 1000, 10)

fn <- "test_temp.hdf5"
bdCreate_hdf5_matrix(filename = fn, object = A, group = "data", dataset = "matrix")

# Compute SVD with default parameters
res <- bdSVD_hdf5(fn, "data", "matrix")

# Compute SVD with custom parameters
res <- bdSVD_hdf5(fn, "data", "matrix",
           k = 4, q = 2,
           bcenter = TRUE, bscale = TRUE,
           method = "blocks",
           threads = 4)

# list contents
h5ls(res$fn)

# Extract the result from HDF5 (d)
result_d_hdf5 <- h5read(res$fn, res$ds_d)
result_d_hdf5

# Compute the same SVD in R
result_d_r <- svd(A)$d
result_d_r

# Compare both results (should be TRUE)
all.equal(result_d_hdf5, result_d_r)

# Remove file
if (file.exists(fn)) {
  file.remove(fn)
}

7 See Also