bdsubset_hdf5_dataset

HDF5_IO_MANAGEMENT

1 Description

Creates a new HDF5 dataset containing only the specified rows or columns of an existing dataset. The operation is memory-efficient because it uses HDF5’s hyperslab selection to copy data directly from disk to disk, without loading the entire dataset into memory.
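As a rough in-memory analogy, the two selection modes correspond to ordinary R matrix indexing. The sketch below uses a small base R matrix in place of an HDF5 dataset; the object names are illustrative only, and it assumes the subset on disk follows the same row and column orientation as the R matrix.

```r
# Small in-memory matrix standing in for an HDF5 dataset (5 rows, 4 columns)
m <- matrix(1:20, nrow = 5, ncol = 4)

# Analogue of select_rows = TRUE: keep the listed rows, all columns
row_subset <- m[c(1, 3, 5), , drop = FALSE]

# Analogue of select_rows = FALSE: keep the listed columns, all rows
col_subset <- m[, c(2, 4), drop = FALSE]

dim(row_subset)
dim(col_subset)
```

The difference in practice is that the matrix above lives entirely in memory, whereas the function is described as copying the selection between datasets on disk.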

2 Usage

bdsubset_hdf5_dataset(filename, dataset_path, indices, select_rows = TRUE, new_group = "", new_name = "", overwrite = FALSE)

3 Arguments

Parameter      Description
filename       Character string. Path to the HDF5 file
dataset_path   Character string. Path to the source dataset (e.g., “/group1/dataset1”)
indices        Integer vector. Row or column indices to include (1-based, as per R convention)
select_rows    Logical. If TRUE, selects rows; if FALSE, selects columns (default: TRUE)
new_group      Character string. Target group for the new dataset (default: same as source)
new_name       Character string. Name for the new dataset (default: original_name + “_subset”)
overwrite      Logical. Whether to overwrite destination if it exists (default: FALSE)
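Because indices are 1-based, it can help to check them before calling the function. The helper below is a hypothetical pre-check written in base R. It is not part of the package, and the upper bound n (the number of rows or columns in the source dataset) is assumed to be known to the caller.

```r
# Hypothetical pre-check, not part of the package API.
# n is the number of rows (or columns) in the source dataset.
validate_indices <- function(indices, n) {
  if (!is.numeric(indices) || length(indices) == 0L) {
    stop("indices must be a non-empty numeric vector")
  }
  if (anyNA(indices) || any(indices != floor(indices))) {
    stop("indices must be whole numbers without NA")
  }
  if (any(indices < 1) || any(indices > n)) {
    stop("indices must lie between 1 and ", n)
  }
  # Return an integer vector, matching the documented argument type
  as.integer(indices)
}

idx <- validate_indices(c(1, 3, 5, 10:15), n = 100)
```

The returned vector can then be passed as the indices argument. An index of 0, which would be valid under 0-based conventions, is rejected here.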

4 Value

Logical. TRUE on success, FALSE on failure

5 Details

This function provides an efficient way to create subsets of large HDF5 datasets without loading all data into memory. It uses HDF5’s native hyperslab selection mechanism for optimal performance with big data.
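To give a feel for what a hyperslab selection works with, the sketch below collapses a set of 1-based R indices into contiguous blocks described by a 0-based start and a count, which is the form HDF5 hyperslabs use. The helper indices_to_blocks is hypothetical and written in base R. It is an illustration only and makes no claim about how the package builds its selections internally. In particular, sorting and de-duplicating the indices is an assumption of this sketch.

```r
# Hypothetical helper: collapse 1-based indices into contiguous
# (start, count) blocks with 0-based starts.
indices_to_blocks <- function(indices) {
  # Assumption of this sketch: indices are sorted and de-duplicated first
  idx <- sort(unique(as.integer(indices)))
  stopifnot(length(idx) > 0, all(idx >= 1L))
  # A new block begins wherever the gap to the previous index is not 1
  block_id <- cumsum(c(TRUE, diff(idx) != 1L))
  starts <- tapply(idx, block_id, min)
  counts <- tapply(idx, block_id, length)
  data.frame(start = as.integer(starts) - 1L,  # convert to 0-based offsets
             count = as.integer(counts))
}

blocks <- indices_to_blocks(c(1, 3, 5, 10:15))
print(blocks)
```

For the indices 1, 3, 5, 10:15 this yields four blocks: three single-element blocks and one block of six consecutive elements. Fewer, longer blocks generally mean fewer separate regions to copy.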

Key features:

6 Examples

# Select specific rows (e.g., rows 1, 3, 5, 10-15)
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/matrix/data",
                                 indices = c(1, 3, 5, 10:15),
                                 select_rows = TRUE,
                                 new_name = "selected_rows")

# Select specific columns
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/matrix/data",
                                 indices = c(2, 4, 6:10),
                                 select_rows = FALSE,
                                 new_group = "/filtered",
                                 new_name = "selected_cols")

# Create subset in different group
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/raw_data/matrix",
                                 indices = 1:100,  # First 100 rows
                                 select_rows = TRUE,
                                 new_group = "/processed",
                                 new_name = "top_100_rows")

# Extract specific samples for analysis
interesting_samples <- c(15, 23, 45, 67, 89, 123)
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/experiments/results",
                                 indices = interesting_samples,
                                 select_rows = TRUE,
                                 new_name = "analysis_subset")