bdSplit_matrix_hdf5

HDF5_IO_MANAGEMENT

1 Description

Splits a large dataset in an HDF5 file into smaller submatrices, with support for both row-wise and column-wise splitting.

2 Usage

bdSplit_matrix_hdf5(filename, group, dataset, outgroup = NULL, outdataset = NULL, nblocks = NULL, blocksize = NULL, bycols = TRUE, overwrite = FALSE)

3 Arguments

Parameter	Description
`filename`	Character string. Path to the HDF5 file.
`group`	Character string. Path to the group containing input dataset.
`dataset`	Character string. Name of the dataset to split.
`outgroup`	Character string (optional). Output group path. If NULL, uses input group.
`outdataset`	Character string (optional). Base name for output datasets. If NULL, uses input dataset name with block number suffix.
`nblocks`	Integer (optional). Number of blocks to split into. Mutually exclusive with blocksize.
`blocksize`	Integer (optional). Size of each block. Mutually exclusive with nblocks.
`bycols`	Logical (optional). Whether to split by columns (TRUE) or rows (FALSE). Default is TRUE.
`overwrite`	Logical (optional). Whether to overwrite existing datasets. Default is FALSE.

4 Value

List with components. If an error occurs, all string values are returned as empty strings (""):

fn: Character string with the HDF5 filename
ds: Character string with the output group path where the split datasets are stored. Multiple datasets are created in this location named as <outdataset>.1, <outdataset>.2, etc.
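
As a hedged illustration of how this return value might be consumed, the sketch below builds the full paths of the split datasets from `ds` and the `<outdataset>.<i>` naming scheme described above. The `res` list is a hand-written stand-in for a real return value, and `outdataset` and `nblocks` are assumed to be the values passed to the call; the number of datasets actually written may differ when splitting by `blocksize`.

```r
# Hand-written stand-in for the list returned by bdSplit_matrix_hdf5()
# (assumption: a successful call with outgroup = "data_split").
res <- list(fn = "test.hdf5", ds = "data_split")

# Values assumed to match the arguments used in the call.
outdataset <- "block"
nblocks <- 4

# On error, the string components are returned as empty strings,
# so check them before building any paths.
if (!nzchar(res$fn) || !nzchar(res$ds)) {
  stop("Split failed: empty 'fn' or 'ds' in the returned list.")
}

# Datasets are named <outdataset>.1, <outdataset>.2, ... inside group 'ds'.
block_paths <- paste0(res$ds, "/", outdataset, ".", seq_len(nblocks))
print(block_paths)
```

Checking for empty strings first avoids building paths such as `/block.1` from a failed call.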

5 Details

This function provides efficient dataset splitting capabilities with:

- Splitting options:
  - Row-wise or column-wise splitting
  - Fixed block size splitting
  - Fixed block count splitting
- Implementation features:
  - Memory-efficient processing
  - Block-based operations
  - Safe file operations
  - Progress reporting

The function supports two splitting strategies:

1. By number of blocks: Splits the dataset into a specified number of roughly equal-sized blocks
2. By block size: Splits the dataset into blocks of a specified size
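
The arithmetic behind the two strategies can be sketched in base R. The helper `block_ranges()` below is hypothetical and not part of BigDataStatMeth; it only illustrates one plausible way to turn `nblocks` or `blocksize` into index ranges along the split dimension. How the package itself distributes a remainder across blocks is an assumption here, as is the requirement that `nblocks` does not exceed `n`.

```r
# Hypothetical helper (not part of BigDataStatMeth): compute the start and end
# index of each block along the split dimension of length n.
block_ranges <- function(n, nblocks = NULL, blocksize = NULL) {
  # The two arguments are mutually exclusive: exactly one must be given.
  if (is.null(nblocks) == is.null(blocksize)) {
    stop("Supply exactly one of 'nblocks' or 'blocksize'.")
  }
  if (!is.null(nblocks)) {
    # Strategy 1: fixed block count. Spread the remainder over the first
    # blocks so that block sizes differ by at most one (assumed policy).
    sizes <- rep(n %/% nblocks, nblocks) + (seq_len(nblocks) <= n %% nblocks)
  } else {
    # Strategy 2: fixed block size. Full blocks, then a shorter final block
    # if n is not a multiple of blocksize.
    nfull <- n %/% blocksize
    rest <- n %% blocksize
    sizes <- c(rep(blocksize, nfull), if (rest > 0) rest)
  }
  end <- cumsum(sizes)
  data.frame(block = seq_along(sizes), start = end - sizes + 1, end = end)
}

# 10 columns into 4 blocks: block sizes 3, 3, 2, 2
by_count <- block_ranges(10, nblocks = 4)
print(by_count)

# 100 rows in blocks of 25: four full blocks
by_size <- block_ranges(100, blocksize = 25)
print(by_size)
```

With a fixed count the number of blocks is known in advance and the sizes vary slightly; with a fixed size the sizes are known and the number of blocks depends on the dimension being split.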

6 Examples


library(BigDataStatMeth)

# Create test data
data <- matrix(rnorm(1000), 100, 10)

# Save to HDF5
fn <- "test.hdf5"
bdCreate_hdf5_matrix(fn, data, "data", "matrix1",
                     overwriteFile = TRUE)

# Split by number of blocks
bdSplit_matrix_hdf5(
  filename = fn,
  group = "data",
  dataset = "matrix1",
  outgroup = "data_split",
  outdataset = "block",
  nblocks = 4,
  bycols = TRUE
)

# Split by block size
bdSplit_matrix_hdf5(
  filename = fn,
  group = "data",
  dataset = "matrix1",
  outgroup = "data_split2",
  outdataset = "block",
  blocksize = 25,
  bycols = TRUE
)

# Cleanup
if (file.exists(fn)) {
  file.remove(fn)
}

7 See Also

bdCreate_hdf5_matrix for creating HDF5 matrices
