bdSplit_matrix_hdf5

HDF5_IO_MANAGEMENT

1 Description

Splits a large dataset in an HDF5 file into smaller submatrices, with support for both row-wise and column-wise splitting.

2 Usage

bdSplit_matrix_hdf5(filename, group, dataset, outgroup = NULL, outdataset = NULL, nblocks = NULL, blocksize = NULL, bycols = TRUE, overwrite = FALSE)

3 Arguments

  • filename: Character string. Path to the HDF5 file.
  • group: Character string. Path to the group containing the input dataset.
  • dataset: Character string. Name of the dataset to split.
  • outgroup: Character string (optional). Output group path. If NULL, the input group is used.
  • outdataset: Character string (optional). Base name for the output datasets. If NULL, the input dataset name is used, with the block number as a suffix.
  • nblocks: Integer (optional). Number of blocks to split into. Mutually exclusive with blocksize.
  • blocksize: Integer (optional). Size of each block along the split dimension. Mutually exclusive with nblocks.
  • bycols: Logical (optional). Whether to split by columns (TRUE) or by rows (FALSE). Default is TRUE.
  • overwrite: Logical (optional). Whether to overwrite existing datasets. Default is FALSE.

4 Value

A list with the following components. If an error occurs, all string values are returned as empty strings (""):

  • fn: Character string with the HDF5 filename
  • ds: Character string with the output group path where the split datasets are stored. Multiple datasets are created in this location named as <outdataset>.1, <outdataset>.2, etc.
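
As a minimal sketch of how the returned list might be checked, the snippet below tests for the empty-string error convention described above before using the result. The temporary file name and the matrix are illustrative assumptions; only `bdCreate_hdf5_matrix()` and `bdSplit_matrix_hdf5()` with the arguments documented on this page are used.

```r
library(BigDataStatMeth)

# Illustrative input: a temporary HDF5 file holding a 100 x 10 matrix
fn <- tempfile(fileext = ".hdf5")
x <- matrix(rnorm(1000), 100, 10)
bdCreate_hdf5_matrix(fn, x, "data", "matrix1", overwriteFile = TRUE)

res <- bdSplit_matrix_hdf5(
  filename = fn, group = "data", dataset = "matrix1",
  outgroup = "data_split", outdataset = "block", nblocks = 4
)

# Per the Value section, empty strings signal that the split failed
if (!nzchar(res$fn) || !nzchar(res$ds)) {
  stop("bdSplit_matrix_hdf5() reported an error")
}

# res$ds is the output group; the blocks are documented to be stored
# inside it as block.1, block.2, ...
message("Blocks written to group '", res$ds, "' in file ", res$fn)

file.remove(fn)
```

Checking both components with `nzchar()` keeps the error handling in one place, so later code can rely on `res$ds` pointing at a real group.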

5 Details

This function provides efficient dataset splitting capabilities with:

  • Splitting options:
      • Row-wise or column-wise splitting
      • Fixed block size splitting
      • Fixed block count splitting
  • Implementation features:
      • Memory-efficient processing
      • Block-based operations
      • Safe file operations
      • Progress reporting

The function supports two splitting strategies:

  1. By number of blocks: splits the dataset into a specified number of roughly equal-sized blocks.
  2. By block size: splits the dataset into blocks of a specified size.
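
The two strategies can be illustrated with a small base-R sketch that computes the index range of each block. The helper `plan_blocks()` is hypothetical and not part of BigDataStatMeth, and the way it distributes the remainder is an assumption; the package may place block boundaries differently.

```r
# Hypothetical helper (not part of BigDataStatMeth): compute 1-based
# start/end indices of each block along a dimension of length n.
plan_blocks <- function(n, nblocks = NULL, blocksize = NULL) {
  # nblocks and blocksize are documented as mutually exclusive
  if (is.null(nblocks) == is.null(blocksize)) {
    stop("Specify exactly one of 'nblocks' or 'blocksize'")
  }
  if (!is.null(nblocks)) {
    # By number of blocks: spread the remainder over the first blocks
    # (assumed scheme; assumes nblocks <= n)
    sizes <- rep(n %/% nblocks, nblocks) + (seq_len(nblocks) <= n %% nblocks)
  } else {
    # By block size: full blocks, plus a shorter final block if needed
    rest <- n %% blocksize
    sizes <- c(rep(blocksize, n %/% blocksize), if (rest > 0) rest)
  }
  ends <- cumsum(sizes)
  data.frame(block = seq_along(sizes), start = ends - sizes + 1, end = ends)
}

by_count <- plan_blocks(10, nblocks = 4)    # 10 columns into 4 blocks
by_size  <- plan_blocks(100, blocksize = 25) # 100 rows, 25 per block
print(by_count)
print(by_size)
```

Under these assumptions, splitting by count always yields exactly `nblocks` ranges whose sizes differ by at most one, while splitting by size yields as many ranges as are needed to cover the dimension.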

6 Examples

library(BigDataStatMeth)

# Create test data
data <- matrix(rnorm(1000), 100, 10)

# Save to HDF5
fn <- "test.hdf5"
bdCreate_hdf5_matrix(fn, data, "data", "matrix1",
                     overwriteFile = TRUE)

# Split by number of blocks
bdSplit_matrix_hdf5(
  filename = fn,
  group = "data",
  dataset = "matrix1",
  outgroup = "data_split",
  outdataset = "block",
  nblocks = 4,
  bycols = TRUE
)

# Split by block size
bdSplit_matrix_hdf5(
  filename = fn,
  group = "data",
  dataset = "matrix1",
  outgroup = "data_split2",
  outdataset = "block",
  blocksize = 25,
  bycols = TRUE
)

# Cleanup
if (file.exists(fn)) {
  file.remove(fn)
}

7 See Also