Creates a new HDF5 dataset containing only the specified rows or columns from an existing dataset. This operation is memory efficient as it uses HDF5’s hyperslab selection for direct disk-to-disk copying without loading the entire dataset into memory.
filename
Character string. Path to the HDF5 file
dataset_path
Character string. Path to the source dataset (e.g., “/group1/dataset1”)
indices
Integer vector. Row or column indices to include (1-based, as per R convention)
select_rows
Logical. If TRUE, selects rows; if FALSE, selects columns (default: TRUE)
new_group
Character string. Target group for the new dataset (default: same as source)
new_name
Character string. Name for the new dataset (default: original_name + “_subset”)
overwrite
Logical. Whether to overwrite destination if it exists (default: FALSE)
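The defaults for `new_group` and `new_name` can be pictured with a small base-R sketch. The helper `resolve_destination` is hypothetical, and its logic is an assumption drawn from the parameter descriptions above (an empty string is taken to mean "use the default"). It is not the package's actual implementation.

```r
# Hypothetical helper: work out where the subset would be written,
# assuming "" means "fall back to the documented default".
resolve_destination <- function(dataset_path, new_group = "", new_name = "") {
  src_group <- dirname(dataset_path)   # e.g. "/group1"
  src_name  <- basename(dataset_path)  # e.g. "dataset1"

  group <- if (nzchar(new_group)) new_group else src_group
  name  <- if (nzchar(new_name)) new_name else paste0(src_name, "_subset")

  # Drop trailing slashes so the root group "/" does not yield "//name".
  paste0(sub("/+$", "", group), "/", name)
}

resolve_destination("/group1/dataset1")
resolve_destination("/matrix/data", new_group = "/filtered", new_name = "selected_cols")
```

Under these assumptions the first call resolves to a dataset named `dataset1_subset` in the source group, and the second to `selected_cols` under `/filtered`.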
4 Value
Logical. TRUE on success, FALSE on failure
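Because the function reports failure through its return value, callers may want to check it explicitly. The wrapper below is a minimal sketch: `check_subset` and `fake_subset` are hypothetical names, and `fake_subset` only stands in for the real call so the example runs without an HDF5 file.

```r
# Hypothetical wrapper: run a subsetting call and stop with a clear
# message unless it returns TRUE.
check_subset <- function(subset_fun, ...) {
  ok <- tryCatch(subset_fun(...), error = function(e) FALSE)
  if (!isTRUE(ok)) {
    stop("Subsetting did not complete; check paths, indices, and `overwrite`.")
  }
  invisible(TRUE)
}

# Stand-in for the real function (assumption: same leading arguments).
fake_subset <- function(filename, dataset_path, indices, ...) {
  length(indices) > 0
}

result <- check_subset(fake_subset, "data.h5",
                       dataset_path = "/matrix/data", indices = 1:3)
```

In real use, the documented function would be passed in place of `fake_subset`.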
5 Details
This function provides an efficient way to create subsets of large HDF5 datasets without loading all data into memory. It uses HDF5’s native hyperslab selection mechanism for optimal performance with big data.
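To make the link between 1-based R indices and hyperslab selections concrete, the sketch below converts an index vector into 0-based `(start, count)` blocks, which is the form HDF5 hyperslabs use along one dimension. The helper `indices_to_blocks` is hypothetical. Grouping contiguous indices into one block is an illustrative assumption, not a description of how the function works internally.

```r
# Hypothetical helper: turn 1-based indices into 0-based (start, count)
# blocks, merging runs of consecutive indices into a single block.
indices_to_blocks <- function(indices, dim_size) {
  stopifnot(is.numeric(indices), length(indices) > 0,
            all(indices == as.integer(indices)),
            all(indices >= 1), all(indices <= dim_size))
  idx <- as.integer(indices)

  # A new block starts wherever an index is not its predecessor + 1.
  starts_block <- c(TRUE, diff(idx) != 1L)
  block_id <- cumsum(starts_block)

  data.frame(
    start = idx[starts_block] - 1L,          # 0-based offset
    count = as.integer(tabulate(block_id))   # number of rows/columns in the block
  )
}

blocks <- indices_to_blocks(c(1, 3, 5, 10:15), dim_size = 20)
```

For the indices `c(1, 3, 5, 10:15)` this yields four blocks: three single-element blocks and one block of six covering indices 10 to 15. The order of `indices` is preserved, so the blocks describe the subset in the order requested.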
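Key features:
- Memory efficient: processes one row/column at a time
- Direct disk-to-disk copying using HDF5 hyperslab selection
- Preserves all dataset attributes and properties
- Works with datasets of any size
- Automatic creation of parent groups if needed
- Support for both row and column selection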
6 Examples
# Select specific rows (e.g., rows 1, 3, 5, 10-15)
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/matrix/data",
                                 indices = c(1, 3, 5, 10:15),
                                 select_rows = TRUE,
                                 new_name = "selected_rows")

# Select specific columns
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/matrix/data",
                                 indices = c(2, 4, 6:10),
                                 select_rows = FALSE,
                                 new_group = "/filtered",
                                 new_name = "selected_cols")

# Create subset in different group
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/raw_data/matrix",
                                 indices = 1:100,  # First 100 rows
                                 select_rows = TRUE,
                                 new_group = "/processed",
                                 new_name = "top_100_rows")

# Extract specific samples for analysis
interesting_samples <- c(15, 23, 45, 67, 89, 123)
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/experiments/results",
                                 indices = interesting_samples,
                                 select_rows = TRUE,
                                 new_name = "analysis_subset")
Source Code
---
title: "bdsubset_hdf5_dataset"
subtitle: "bdsubset_hdf5_dataset"
---

<span class="category-badge hdf5_io_management">HDF5_IO_MANAGEMENT</span>

## Description

Creates a new HDF5 dataset containing only the specified rows or columns
from an existing dataset. This operation is memory efficient as it uses
HDF5's hyperslab selection for direct disk-to-disk copying without loading
the entire dataset into memory.

## Usage

```r
bdsubset_hdf5_dataset(filename, dataset_path, indices, select_rows = TRUE, new_group = "", new_name = "", overwrite = FALSE)
```

## Arguments

::: {.param-table}

| Parameter | Description |
|-----------|-------------|
| `filename` | Character string. Path to the HDF5 file |
| `dataset_path` | Character string. Path to the source dataset (e.g., "/group1/dataset1") |
| `indices` | Integer vector. Row or column indices to include (1-based, as per R convention) |
| `select_rows` | Logical. If TRUE, selects rows; if FALSE, selects columns (default: TRUE) |
| `new_group` | Character string. Target group for the new dataset (default: same as source) |
| `new_name` | Character string. Name for the new dataset (default: original_name + "_subset") |
| `overwrite` | Logical. Whether to overwrite destination if it exists (default: FALSE) |

:::

## Value

::: {.return-value}
Logical. TRUE on success, FALSE on failure
:::

## Details

This function provides an efficient way to create subsets of large HDF5 datasets
without loading all data into memory. It uses HDF5's native hyperslab selection
mechanism for optimal performance with big data.

Key features:

- Memory efficient: processes one row/column at a time
- Direct disk-to-disk copying using HDF5 hyperslab selection
- Preserves all dataset attributes and properties
- Works with datasets of any size
- Automatic creation of parent groups if needed
- Support for both row and column selection

## Examples

```{r}
#| eval: false
#| code-fold: show

# Select specific rows (e.g., rows 1, 3, 5, 10-15)
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/matrix/data",
                                 indices = c(1, 3, 5, 10:15),
                                 select_rows = TRUE,
                                 new_name = "selected_rows")

# Select specific columns
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/matrix/data",
                                 indices = c(2, 4, 6:10),
                                 select_rows = FALSE,
                                 new_group = "/filtered",
                                 new_name = "selected_cols")

# Create subset in different group
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/raw_data/matrix",
                                 indices = 1:100,  # First 100 rows
                                 select_rows = TRUE,
                                 new_group = "/processed",
                                 new_name = "top_100_rows")

# Extract specific samples for analysis
interesting_samples <- c(15, 23, 45, 67, 89, 123)
success <- bdsubset_hdf5_dataset("data.h5",
                                 dataset_path = "/experiments/results",
                                 indices = interesting_samples,
                                 select_rows = TRUE,
                                 new_name = "analysis_subset")
```