bdSort_hdf5_dataset

HDF5_IO_MANAGEMENT

1 Description

Sorts a dataset in an HDF5 file based on a predefined ordering specified through a list of sorting blocks.

2 Usage

bdSort_hdf5_dataset(filename, group, dataset, outdataset, blockedSortlist, func, outgroup = NULL, overwrite = FALSE)

3 Arguments

Parameter        Description
filename         Character string. Path to the HDF5 file.
group            Character string. Path to the group containing the input dataset.
dataset          Character string. Name of the dataset to sort.
outdataset       Character string. Name for the sorted dataset.
blockedSortlist  List of data frames. Each data frame specifies the sorting order for a block of elements. See Details for structure.
func             Character string. Function to apply: "sortRows" for row-wise sorting or "sortCols" for column-wise sorting.
outgroup         Character string (optional). Output group path. If NULL, the input group is used.
overwrite        Logical (optional). Whether to overwrite an existing dataset. Default is FALSE.

4 Value

A list with the following components. If an error occurs, all string values are returned as empty strings (""):

  • fn: Character string with the HDF5 filename
  • ds: Character string with the full dataset path to the sorted dataset (group/dataset)
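
Because failures are signalled through empty strings rather than an R error, a caller may want to test the returned list explicitly. The following is a minimal sketch in base R. `sort_succeeded` is a hypothetical helper, not part of BigDataStatMeth, and the two lists below only imitate the documented return shape.

```r
# Hypothetical helper (not part of BigDataStatMeth): TRUE when the result
# list carries non-empty `fn` and `ds` strings, FALSE otherwise.
sort_succeeded <- function(res) {
  is.list(res) &&
    all(c("fn", "ds") %in% names(res)) &&
    all(nzchar(c(res$fn, res$ds)))
}

# Mock results that imitate the documented return shape
ok_result  <- list(fn = "test.hdf5", ds = "data/matrix1_sorted")
err_result <- list(fn = "", ds = "")

sort_succeeded(ok_result)   # TRUE
sort_succeeded(err_result)  # FALSE
```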

5 Details

This function provides efficient dataset sorting capabilities with:

  • Sorting options:
      • Row-wise sorting
      • Column-wise sorting
      • Block-based processing
  • Implementation features:
      • Memory-efficient processing
      • Block-based operations
      • Safe file operations
      • Progress reporting

The sorting order is specified through a list of data frames, where each data frame represents a block of elements to be sorted. Each data frame must contain:

  • Row names (current identifiers)
  • chr (new identifiers)
  • order (current positions)
  • newOrder (target positions)
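
One way to assemble a block with this layout is sketched below in base R. `make_sort_block` and its argument names are hypothetical and not part of BigDataStatMeth. The identifiers and positions are illustrative only.

```r
# Hypothetical helper (not part of BigDataStatMeth): builds one sorting
# block with the row names and columns listed above.
make_sort_block <- function(current_ids, new_ids, order, newOrder) {
  stopifnot(
    length(current_ids) == length(new_ids),
    length(current_ids) == length(order),
    length(current_ids) == length(newOrder),
    !anyDuplicated(current_ids)  # data frame row names must be unique
  )
  data.frame(
    chr = new_ids,            # new identifiers
    order = order,            # current positions
    newOrder = newOrder,      # target positions
    row.names = current_ids,  # current identifiers
    stringsAsFactors = FALSE
  )
}

# Illustrative block: three elements whose positions are rotated
block <- make_sort_block(
  current_ids = paste0("TCGA-OR-A5J", 1:3),
  new_ids     = paste0("TCGA-OR-A5J", 1:3),
  order       = 1:3,
  newOrder    = c(3L, 1L, 2L)
)
str(block)
```

Validating lengths and row-name uniqueness up front keeps malformed blocks from reaching the HDF5 call.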

Example sorting blocks structure:

Block 1 (maintaining order):

              chr           order  newOrder  Diagonal
TCGA-OR-A5J1  TCGA-OR-A5J1      1         1         1
TCGA-OR-A5J2  TCGA-OR-A5J2      2         2         1
TCGA-OR-A5J3  TCGA-OR-A5J3      3         3         1
TCGA-OR-A5J4  TCGA-OR-A5J4      4         4         1

Block 2 (reordering with new identifiers):

              chr           order  newOrder  Diagonal
TCGA-OR-A5J5  TCGA-OR-A5JA     10         5         1
TCGA-OR-A5J6  TCGA-OR-A5JB     11         6         1
TCGA-OR-A5J7  TCGA-OR-A5JC     12         7         0
TCGA-OR-A5J8  TCGA-OR-A5JD     13         8         1

Block 3 (reordering with identifier swaps):

              chr           order  newOrder  Diagonal
TCGA-OR-A5J9  TCGA-OR-A5J5      5         9         1
TCGA-OR-A5JA  TCGA-OR-A5J6      6        10         1
TCGA-OR-A5JB  TCGA-OR-A5J7      7        11         1
TCGA-OR-A5JC  TCGA-OR-A5J8      8        12         1
TCGA-OR-A5JD  TCGA-OR-A5J9      9        13         0

In this example:

  • Block 1 maintains the original order
  • Block 2 assigns new identifiers (A5JA-D) to elements
  • Block 3 swaps identifiers between elements
  • The Diagonal column indicates whether the element is on the diagonal (1) or not (0)
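
The three example blocks can be rebuilt as R data frames and checked for consistency before they are passed on. The sketch below uses base R only. The check that `order` and `newOrder` each cover positions 1 to n exactly once is an assumption about what a well-formed block list looks like, not a documented requirement of the function, and `covers_all` is a hypothetical helper.

```r
# The three example blocks, rebuilt as data frames
block1 <- data.frame(
  chr = paste0("TCGA-OR-A5J", 1:4),
  order = 1:4, newOrder = 1:4, Diagonal = c(1, 1, 1, 1),
  row.names = paste0("TCGA-OR-A5J", 1:4)
)
block2 <- data.frame(
  chr = paste0("TCGA-OR-A5J", c("A", "B", "C", "D")),
  order = 10:13, newOrder = 5:8, Diagonal = c(1, 1, 0, 1),
  row.names = paste0("TCGA-OR-A5J", 5:8)
)
block3 <- data.frame(
  chr = paste0("TCGA-OR-A5J", 5:9),
  order = 5:9, newOrder = 9:13, Diagonal = c(1, 1, 1, 1, 0),
  row.names = paste0("TCGA-OR-A5J", c("9", "A", "B", "C", "D"))
)

# Stack the blocks and count the elements they describe
all_blocks <- do.call(rbind, list(block1, block2, block3))
n <- nrow(all_blocks)

# Hypothetical check (an assumption, not a documented rule):
# TRUE when x contains each position 1..n exactly once
covers_all <- function(x, n) identical(sort(as.integer(x)), seq_len(n))

covers_all(all_blocks$order, n)     # TRUE
covers_all(all_blocks$newOrder, n)  # TRUE
```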

6 Examples

library(BigDataStatMeth)

# Create test data
data <- matrix(rnorm(100), 10, 10)
rownames(data) <- paste0("TCGA-OR-A5J", 1:10)

# Save to HDF5
fn <- "test.hdf5"
bdCreate_hdf5_matrix(fn, data, "data", "matrix1",
                     overwriteFile = TRUE)

# Create sorting blocks
block1 <- data.frame(
  chr = paste0("TCGA-OR-A5J", c(2,1,3,4)),
  order = 1:4,
  newOrder = c(2,1,3,4),
  row.names = paste0("TCGA-OR-A5J", 1:4)
)

block2 <- data.frame(
  chr = paste0("TCGA-OR-A5J", c(6,5,8,7)),
  order = 5:8,
  newOrder = c(6,5,8,7),
  row.names = paste0("TCGA-OR-A5J", 5:8)
)

# Sort dataset
bdSort_hdf5_dataset(
  filename = fn,
  group = "data",
  dataset = "matrix1",
  outdataset = "matrix1_sorted",
  blockedSortlist = list(block1, block2),
  func = "sortRows"
)

# Cleanup
if (file.exists(fn)) {
  file.remove(fn)
}

7 See Also