bdsubset_hdf5_dataset

HDF5_IO_MANAGEMENT

1 Description

Creates a new HDF5 dataset containing only the specified rows or columns of an existing dataset. The operation is memory-efficient: it uses HDF5’s hyperslab selection to copy data directly from disk to disk, without loading the entire dataset into memory.

2 Usage

bdsubset_hdf5_dataset(filename, dataset_path, indices, select_rows = TRUE, new_group = "", new_name = "", overwrite = FALSE)

3 Arguments

Parameter	Description
`filename`	Character string. Path to the HDF5 file
`dataset_path`	Character string. Path to the source dataset (e.g., "/group1/dataset1")
`indices`	Integer vector. Row or column indices to include (1-based, as per R convention)
`select_rows`	Logical. If TRUE, selects rows; if FALSE, selects columns (default: TRUE)
`new_group`	Character string. Target group for the new dataset (default: "", meaning the same group as the source)
`new_name`	Character string. Name for the new dataset (default: "", meaning the original name followed by "_subset")
`overwrite`	Logical. Whether to overwrite destination if it exists (default: FALSE)
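Because `indices` must be 1-based integers, it can help to normalise a selection before passing it on. The following is a minimal sketch in base R; `prepare_indices()` is a hypothetical helper, not part of the package, and dropping duplicates is a precaution, since the documentation does not state how repeated indices are handled.

```r
# Hypothetical helper (not part of the package): turn a logical mask or a
# numeric vector into clean 1-based integer indices.
prepare_indices <- function(selection, n) {
  # A logical mask of length n becomes the positions that are TRUE
  if (is.logical(selection)) {
    stopifnot(length(selection) == n)
    selection <- which(selection)
  }
  # Reject NA and fractional values
  if (anyNA(selection) || any(selection != floor(selection))) {
    stop("indices must be whole numbers without NA")
  }
  # Reject values outside 1..n (R indices start at 1)
  if (any(selection < 1 | selection > n)) {
    stop("indices must lie between 1 and ", n)
  }
  # Drop duplicates but keep the requested order
  unique(as.integer(selection))
}

# n is the number of rows (or columns) of the source dataset, assumed known
keep <- prepare_indices(c(1, 3, 5, 10:15), n = 20)
```

The resulting vector `keep` could then be supplied as the `indices` argument.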

4 Value

Logical. TRUE on success, FALSE on failure
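Since failure is reported through the return value, callers may want to turn a `FALSE` into an error. A minimal sketch follows, assuming only the documented logical return; `subset_or_stop()` and `fake_subset()` are hypothetical names, and the stand-in function exists only so the sketch runs without an HDF5 file.

```r
# Hypothetical wrapper: `subset_fun` stands in for bdsubset_hdf5_dataset.
subset_or_stop <- function(subset_fun, ...) {
  ok <- subset_fun(...)
  # Anything other than a single TRUE is treated as failure
  if (!isTRUE(ok)) {
    stop("subsetting did not report success")
  }
  invisible(TRUE)
}

# Stand-in that mimics the documented logical return value (assumption:
# it simply succeeds whenever at least one index is given)
fake_subset <- function(filename, dataset_path, indices, ...) {
  length(indices) > 0
}

subset_or_stop(fake_subset, "data.h5", "/matrix/data", indices = c(1, 3, 5))
```

In real use, `bdsubset_hdf5_dataset` would be passed in place of `fake_subset`.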

5 Details

This function creates subsets of large HDF5 datasets without loading all data into memory. It relies on HDF5’s native hyperslab selection mechanism, so only the selected rows or columns are read from the source and written to the destination.
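To make the hyperslab idea concrete, the sketch below maps 1-based R indices to 0-based (offset, count) pairs, one row or column per selection. It is an illustration in base R only, not the package's implementation: `hyperslab_plan()` is a hypothetical name, `dims` is assumed to be `c(n_rows, n_cols)` as seen from R, and any transposition between R's layout and the on-disk layout is ignored.

```r
# Illustration only: plan one hyperslab per selected row or column.
hyperslab_plan <- function(indices, dims, select_rows = TRUE) {
  stopifnot(length(dims) == 2)
  axis <- if (select_rows) 1L else 2L
  stopifnot(all(indices >= 1), all(indices <= dims[axis]))
  lapply(seq_along(indices), function(k) {
    offset <- c(0L, 0L)
    count <- as.integer(dims)
    offset[axis] <- as.integer(indices[k]) - 1L  # HDF5 offsets start at 0
    count[axis] <- 1L                            # a single row or column
    # The k-th selected row/column lands at position k in the destination
    list(src_offset = offset, count = count, dest_index = k)
  })
}

plan <- hyperslab_plan(c(1, 3, 5), dims = c(100, 20), select_rows = TRUE)
str(plan[[2]])
```

Each element describes a block spanning the full extent of the other dimension, which is why memory use stays bounded by a single row or column.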

Key features:

- Memory efficient: processes one row or column at a time
- Direct disk-to-disk copying using HDF5 hyperslab selection
- Preserves all dataset attributes and properties
- Works with datasets of any size
- Automatic creation of parent groups if needed
- Support for both row and column selection

6 Examples

# Select specific rows (e.g., rows 1, 3, 5, 10-15)
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/matrix/data",
                                 indices = c(1, 3, 5, 10:15),
                                 select_rows = TRUE,
                                 new_name = "selected_rows")

# Select specific columns and store them in another group
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/matrix/data",
                                 indices = c(2, 4, 6:10),
                                 select_rows = FALSE,
                                 new_group = "/filtered",
                                 new_name = "selected_cols")

# Create subset in different group
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/raw_data/matrix",
                                 indices = 1:100,  # First 100 rows
                                 select_rows = TRUE,
                                 new_group = "/processed",
                                 new_name = "top_100_rows")

# Extract specific samples for analysis
interesting_samples <- c(15, 23, 45, 67, 89, 123)
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/experiments/results",
                                 indices = interesting_samples,
                                 select_rows = TRUE,
                                 new_name = "analysis_subset")
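Several of the examples rely on the documented defaults for `new_group` and `new_name`. The sketch below shows where the new dataset would be expected to land under those defaults. It is written in base R; `resolve_destination()` is a hypothetical helper that mirrors the Arguments table rather than code from the package.

```r
# Hypothetical helper: compute the destination path implied by the documented
# defaults ("" = same group as the source, "" = original name + "_subset").
resolve_destination <- function(dataset_path, new_group = "", new_name = "") {
  # Split "/group/sub/name" into its non-empty components
  parts <- strsplit(dataset_path, "/", fixed = TRUE)[[1]]
  parts <- parts[nzchar(parts)]
  src_name <- parts[length(parts)]
  src_group <- paste0("/", paste(parts[-length(parts)], collapse = "/"))

  group <- if (nzchar(new_group)) new_group else src_group
  name <- if (nzchar(new_name)) new_name else paste0(src_name, "_subset")

  # Join without doubling the slash when the group is the root "/"
  paste0(sub("/+$", "", group), "/", name)
}

resolve_destination("/matrix/data", new_name = "selected_rows")
resolve_destination("/raw_data/matrix", new_group = "/processed",
                    new_name = "top_100_rows")
resolve_destination("/experiments/results")
```

Checking the expected path up front makes it easier to decide whether `overwrite = TRUE` is needed.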