bdRemovelowdata_hdf5

HDF5_STATISTICS

1 Description

Removes SNPs (Single Nucleotide Polymorphisms) with low representation from genomic data stored in HDF5 format.

2 Usage

bdRemovelowdata_hdf5(filename, group, dataset, outgroup, outdataset, pcent, bycols, overwrite = NULL)

3 Arguments

Parameter    Description
filename     Character string. Path to the HDF5 file.
group        Character string. Path to the group containing the input dataset.
dataset      Character string. Name of the dataset to filter.
outgroup     Character string. Output group path for the filtered data.
outdataset   Character string. Output dataset name for the filtered data.
pcent        Numeric (optional). Removal threshold, given as a proportion between 0 and 1. Default is 0.5. SNPs with representation below this threshold are removed.
bycols       Logical (optional). Whether to filter by columns (TRUE) or by rows (FALSE). Default is TRUE.
overwrite    Logical (optional). Whether to overwrite an existing dataset. Default is FALSE.

4 Value

List with the following components. If an error occurs, all string values are returned as empty strings (""):

  • fn: Character string with the HDF5 filename
  • ds: Character string with the full dataset path to the filtered dataset (group/dataset)
  • nremoved: Integer with the number of rows/columns removed due to low data quality
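Because a failure is signalled by empty strings rather than by an R error, it can help to check the returned list before using it. The sketch below is only an illustration: `res` is a hand-built list with the same shape as the documented return value, and `filter_succeeded()` is a hypothetical helper, not part of BigDataStatMeth.

```r
# Hand-built stand-in for the list documented above (values are illustrative)
res <- list(fn = "snp_data.hdf5",
            ds = "genotype_filtered/filtered_snps",
            nremoved = 2L)

# Hypothetical helper (not part of BigDataStatMeth): treats empty strings
# in the character components as the documented error signal
filter_succeeded <- function(res) {
  nzchar(res$fn) && nzchar(res$ds)
}

if (filter_succeeded(res)) {
  message("Filtered dataset at ", res$ds, " (", res$nremoved, " removed)")
} else {
  warning("Filtering failed: empty strings returned")
}
```

In real use, `res` would be the value returned by `bdRemovelowdata_hdf5()` instead of the hand-built list.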

5 Details

This function provides efficient filtering capabilities for genomic data with support for:

  • Filtering options:
      • Row-wise or column-wise filtering
      • Configurable threshold percentage
      • Flexible output location
  • Implementation features:
      • Memory-efficient processing
      • Safe file operations
      • Comprehensive error handling
      • Progress reporting

The function supports both in-place modification and creation of new datasets.
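The row-wise or column-wise threshold filtering described above can be sketched in base R on a small in-memory matrix. This is not the package's implementation, which works on HDF5 data. It assumes that "representation" means the proportion of non-missing values in a row or column, which is an interpretation and not something this page states. `filter_low_representation()` is a hypothetical helper name.

```r
# Hypothetical base-R helper (not part of BigDataStatMeth).
# Assumption: "representation" = proportion of non-missing values.
filter_low_representation <- function(x, pcent = 0.5, bycols = TRUE) {
  stopifnot(is.matrix(x), pcent >= 0, pcent <= 1)
  # Proportion of non-missing entries per column (or per row)
  represented <- if (bycols) colMeans(!is.na(x)) else rowMeans(!is.na(x))
  # Keep entries whose representation reaches the threshold
  keep <- represented >= pcent
  filtered <- if (bycols) x[, keep, drop = FALSE] else x[keep, , drop = FALSE]
  list(data = filtered, nremoved = sum(!keep))
}

# 4 x 3 matrix, filled column by column; the second column is mostly missing
snps <- matrix(c(0, 1, 2, 1,
                 NA, NA, NA, 2,
                 1, NA, 0, 2),
               nrow = 4, ncol = 3)

res <- filter_low_representation(snps, pcent = 0.5, bycols = TRUE)
res$nremoved   # number of columns dropped
dim(res$data)  # dimensions of the filtered matrix
```

Setting `bycols = FALSE` applies the same threshold to rows instead of columns.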

6 Examples

library(BigDataStatMeth)

# Create test SNP data with missing values
# (fixed seed so the example gives the same matrix on every run)
set.seed(123)
snps <- matrix(sample(c(0, 1, 2, NA), 100, replace = TRUE,
                     prob = c(0.3, 0.3, 0.3, 0.1)), 10, 10)

# Save to HDF5
fn <- "snp_data.hdf5"
bdCreate_hdf5_matrix(fn, snps, "genotype", "raw_snps",
                     overwriteFile = TRUE)

# Remove SNPs with low representation
bdRemovelowdata_hdf5(
  filename = fn,
  group = "genotype",
  dataset = "raw_snps",
  outgroup = "genotype_filtered",
  outdataset = "filtered_snps",
  pcent = 0.3,
  bycols = TRUE
)

# Cleanup
if (file.exists(fn)) {
  file.remove(fn)
}

7 See Also