bdSort_hdf5_dataset

HDF5_IO_MANAGEMENT

1 Description

Sorts a dataset in an HDF5 file based on a predefined ordering specified through a list of sorting blocks.

2 Usage

bdSort_hdf5_dataset(filename, group, dataset, outdataset, blockedSortlist, func, outgroup = NULL, overwrite = FALSE)

3 Arguments

Parameter	Description
`filename`	Character string. Path to the HDF5 file.
`group`	Character string. Path to the group containing the input dataset.
`dataset`	Character string. Name of the dataset to sort.
`outdataset`	Character string. Name for the sorted dataset.
`blockedSortlist`	List of data frames. Each data frame specifies the sorting order for a block of elements. See Details for structure.
`func`	Character string. Function to apply: "sortRows" for row-wise sorting or "sortCols" for column-wise sorting.
`outgroup`	Character string (optional). Output group path. If NULL, uses input group.
`overwrite`	Logical (optional). Whether to overwrite an existing dataset. Default is FALSE.

4 Value

List with the following components. If an error occurs, all string values are returned as empty strings (""):

fn: Character string with the HDF5 filename
ds: Character string with the full dataset path to the sorted dataset (group/dataset)
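
Because a failure is reported through empty strings rather than through an R error, callers may want to check the result explicitly. The following is a minimal sketch that assumes only the return structure described above; `sort_succeeded` is a hypothetical helper, not part of BigDataStatMeth, and the two lists are mocked return values.

```r
# Hypothetical helper (not part of BigDataStatMeth): TRUE when both
# components of the returned list are single, non-empty strings.
sort_succeeded <- function(res) {
  is.list(res) &&
    is.character(res$fn) && length(res$fn) == 1L && nzchar(res$fn) &&
    is.character(res$ds) && length(res$ds) == 1L && nzchar(res$ds)
}

# Mocked return values shaped like the description above
ok  <- list(fn = "test.hdf5", ds = "data/matrix1_sorted")
bad <- list(fn = "", ds = "")

sort_succeeded(ok)   # TRUE
sort_succeeded(bad)  # FALSE
```

In practice the argument would be the value returned by `bdSort_hdf5_dataset()`.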

5 Details

This function provides efficient dataset sorting capabilities with:

- Sorting options:
  - Row-wise sorting
  - Column-wise sorting
  - Block-based processing
- Implementation features:
  - Memory-efficient processing
  - Block-based operations
  - Safe file operations
  - Progress reporting

The sorting order is specified through a list of data frames, where each data frame represents a block of elements to be sorted. Each data frame must contain:

- Row names (current identifiers)
- `chr` (new identifiers)
- `order` (current positions)
- `newOrder` (target positions)
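
The required structure can be checked before calling the function. This is a sketch only: `validate_sort_block` is a hypothetical helper, not part of BigDataStatMeth, and the no-duplicates check is an extra sanity check assumed here rather than a documented requirement.

```r
# Hypothetical helper (not part of BigDataStatMeth): checks that one
# sorting block has the columns listed above.
validate_sort_block <- function(block) {
  stopifnot(is.data.frame(block))
  required <- c("chr", "order", "newOrder")
  missing_cols <- setdiff(required, names(block))
  if (length(missing_cols) > 0L) {
    stop("block is missing column(s): ",
         paste(missing_cols, collapse = ", "))
  }
  # Assumption: positions should be unique within a block
  if (anyDuplicated(block$order) > 0L || anyDuplicated(block$newOrder) > 0L) {
    stop("'order' and 'newOrder' must not contain duplicates")
  }
  invisible(TRUE)
}

block <- data.frame(
  chr       = paste0("TCGA-OR-A5J", c(2, 1, 3, 4)),
  order     = 1:4,
  newOrder  = c(2, 1, 3, 4),
  row.names = paste0("TCGA-OR-A5J", 1:4),
  stringsAsFactors = FALSE
)
validate_sort_block(block)
```

Running the helper over every element of `blockedSortlist` (for example with `lapply`) would surface a malformed block before any file is touched.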

Example sorting blocks structure:

Block 1 (maintaining order):

(row name)	chr	order	newOrder	Diagonal
TCGA-OR-A5J1	TCGA-OR-A5J1	1	1	1
TCGA-OR-A5J2	TCGA-OR-A5J2	2	2	1
TCGA-OR-A5J3	TCGA-OR-A5J3	3	3	1
TCGA-OR-A5J4	TCGA-OR-A5J4	4	4	1

Block 2 (reordering with new identifiers):

(row name)	chr	order	newOrder	Diagonal
TCGA-OR-A5J5	TCGA-OR-A5JA	10	5	1
TCGA-OR-A5J6	TCGA-OR-A5JB	11	6	1
TCGA-OR-A5J7	TCGA-OR-A5JC	12	7	0
TCGA-OR-A5J8	TCGA-OR-A5JD	13	8	1

Block 3 (reordering with identifier swaps):

(row name)	chr	order	newOrder	Diagonal
TCGA-OR-A5J9	TCGA-OR-A5J5	5	9	1
TCGA-OR-A5JA	TCGA-OR-A5J6	6	10	1
TCGA-OR-A5JB	TCGA-OR-A5J7	7	11	1
TCGA-OR-A5JC	TCGA-OR-A5J8	8	12	1
TCGA-OR-A5JD	TCGA-OR-A5J9	9	13	0

In this example:

- Block 1 maintains the original order
- Block 2 assigns new identifiers (A5JA-D) to elements
- Block 3 swaps identifiers between elements
- The Diagonal column indicates whether the element is on the diagonal (1) or not (0)
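
To make the block semantics concrete, the sketch below rebuilds Block 2 as a data frame and applies it to a small in-memory matrix. It rests on an assumption about what `func = "sortRows"` does: the row at position `order` is written to position `newOrder` and labelled with `chr`. The helper `apply_row_blocks` and the matrix `m` are hypothetical illustrations and do not reproduce the package's HDF5 implementation.

```r
# Block 2 from the example above, rebuilt as a data frame
block2 <- data.frame(
  chr       = paste0("TCGA-OR-A5J", c("A", "B", "C", "D")),
  order     = 10:13,
  newOrder  = 5:8,
  Diagonal  = c(1, 1, 0, 1),
  row.names = paste0("TCGA-OR-A5J", 5:8),
  stringsAsFactors = FALSE
)

# Assumed in-memory analogue of row-wise blocked sorting: copy the rows
# found at `order` into the positions given by `newOrder`, then relabel
# those positions with `chr`.
apply_row_blocks <- function(m, blocks) {
  out <- m
  for (b in blocks) {
    out[b$newOrder, ] <- m[b$order, , drop = FALSE]
    rownames(out)[b$newOrder] <- b$chr
  }
  out
}

# Toy 13 x 2 matrix; column "a" holds the original row position
m <- matrix(seq_len(26), nrow = 13, ncol = 2,
            dimnames = list(paste0("row", 1:13), c("a", "b")))

res <- apply_row_blocks(m, list(block2))
res[5:8, ]
```

Under this reading, column `a` of rows 5 to 8 in `res` holds the values 10 to 13, and those rows carry the new identifiers from `chr`.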

6 Examples

library(BigDataStatMeth)

# Create test data
data <- matrix(rnorm(100), 10, 10)
rownames(data) <- paste0("TCGA-OR-A5J", 1:10)

# Save to HDF5
fn <- "test.hdf5"
bdCreate_hdf5_matrix(fn, data, "data", "matrix1",
                     overwriteFile = TRUE)

# Create sorting blocks
block1 <- data.frame(
  chr = paste0("TCGA-OR-A5J", c(2,1,3,4)),
  order = 1:4,
  newOrder = c(2,1,3,4),
  row.names = paste0("TCGA-OR-A5J", 1:4)
)

block2 <- data.frame(
  chr = paste0("TCGA-OR-A5J", c(6,5,8,7)),
  order = 5:8,
  newOrder = c(6,5,8,7),
  row.names = paste0("TCGA-OR-A5J", 5:8)
)

# Sort dataset
bdSort_hdf5_dataset(
  filename = fn,
  group = "data",
  dataset = "matrix1",
  outdataset = "matrix1_sorted",
  blockedSortlist = list(block1, block2),
  func = "sortRows"
)

# Cleanup
if (file.exists(fn)) {
  file.remove(fn)
}

7 See Also

bdCreate_hdf5_matrix for creating HDF5 matrices
