bdRemoveMAF_hdf5

HDF5_STATISTICS

1 Description

Filters SNPs (Single Nucleotide Polymorphisms) based on Minor Allele Frequency (MAF) in genomic data stored in HDF5 format.
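
For readers unfamiliar with the quantity being filtered, the following is a minimal base-R sketch of how MAF is commonly computed for genotypes coded 0/1/2. The helper `compute_maf` and the layout (samples in rows, SNPs in columns) are assumptions made for illustration; they are not part of BigDataStatMeth, and the package's internal computation may differ.

```r
# Minor allele frequency for genotypes coded 0/1/2 (copies of one allele).
# Assumption: samples are in rows, SNPs in columns, NA marks a missing call.
compute_maf <- function(geno) {
  # Frequency of the counted allele: allele copies / (2 * non-missing samples)
  p <- colMeans(geno, na.rm = TRUE) / 2
  # The minor allele is whichever of the two alleles is rarer
  pmin(p, 1 - p)
}

# matrix() fills column by column, so each row of values below is one SNP
geno <- matrix(c(0, 0, 1, 2,   # SNP 1
                 2, 2, 2, 1,   # SNP 2
                 0, 0, 0, 0),  # SNP 3 (monomorphic)
               nrow = 4, ncol = 3)
maf <- compute_maf(geno)
print(maf)
```

Taking the smaller of `p` and `1 - p` makes the result independent of which allele happens to be coded as the counted one.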

2 Usage

bdRemoveMAF_hdf5(filename, group, dataset, outgroup, outdataset, maf, bycols, blocksize, overwrite = NULL)

3 Arguments

Parameter    Description
filename     Character string. Path to the HDF5 file.
group        Character string. Path to the group containing the input dataset.
dataset      Character string. Name of the dataset to filter.
outgroup     Character string. Output group path for filtered data.
outdataset   Character string. Output dataset name for filtered data.
maf          Numeric (optional). MAF threshold for filtering (0-1). Default is 0.05. SNPs with MAF above this threshold are removed.
bycols       Logical (optional). Whether to process by columns (TRUE) or rows (FALSE). Default is FALSE.
blocksize    Integer (optional). Block size for processing. Default is 100. Larger values use more memory but may be faster.
overwrite    Logical (optional). Whether to overwrite an existing dataset. Default is FALSE.

4 Value

List with components. If an error occurs, all string values are returned as empty strings (""):

  • fn: Character string with the HDF5 filename
  • ds: Character string with the full dataset path to the filtered dataset (group/dataset)
  • nremoved: Integer with the number of SNPs removed by the Minor Allele Frequency (MAF) filter
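
As an illustration of how the returned list might be consumed, here is a small sketch. `describe_maf_result` is a hypothetical helper, not a BigDataStatMeth function, and the `res` value is a mock shaped like the documented components rather than real output.

```r
# Hypothetical helper, not part of BigDataStatMeth: summarise a result list
# with the components fn, ds and nremoved described above.
describe_maf_result <- function(res) {
  # Empty strings signal that the call reported an error
  if (!nzchar(res$fn) || !nzchar(res$ds)) {
    return("MAF filtering failed: no output dataset was reported")
  }
  sprintf("%d SNPs removed; filtered data at '%s' in %s",
          as.integer(res$nremoved), res$ds, res$fn)
}

# Mock value shaped like the documented return list (illustrative numbers only)
res <- list(fn = "snp_data.hdf5",
            ds = "genotype_filtered/filtered_snps",
            nremoved = 3L)
cat(describe_maf_result(res), "\n")
```

Checking for empty strings before using `ds` avoids passing an invalid dataset path to later steps.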

5 Details

This function provides efficient MAF-based filtering capabilities with:

  • Filtering options:
      • MAF threshold-based filtering
      • Row-wise or column-wise processing
      • Block-based processing
  • Implementation features:
      • Memory-efficient processing
      • Block-based operations
      • Safe file operations
      • Progress reporting

The function supports both in-place modification and creation of new datasets.
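
The block-based idea can be sketched in base R on an in-memory matrix. This is only an illustration under stated assumptions: `filter_maf_blocks` is a hypothetical helper, SNPs are assumed to be in columns, and SNPs whose MAF is strictly above the threshold are dropped, following the description of `maf` in the Arguments section. The actual function works on HDF5 datasets and may be implemented differently.

```r
# Sketch of block-wise MAF filtering on an in-memory matrix (SNPs in columns).
# Assumption: SNPs with MAF strictly above `maf` are dropped, as described
# for the `maf` argument; this is not the package's implementation.
filter_maf_blocks <- function(geno, maf = 0.05, blocksize = 100) {
  n <- ncol(geno)
  keep <- logical(n)
  starts <- if (n > 0) seq(1, n, by = blocksize) else integer(0)
  for (s in starts) {
    idx <- s:min(s + blocksize - 1, n)
    # Only one block of columns is inspected at a time
    p <- colMeans(geno[, idx, drop = FALSE], na.rm = TRUE) / 2
    keep[idx] <- pmin(p, 1 - p) <= maf
  }
  list(filtered = geno[, keep, drop = FALSE], nremoved = sum(!keep))
}

# Five SNPs, four samples; per-SNP MAF is 0.125, 0.25, 0, 0, 0.125
geno <- cbind(c(0, 0, 0, 1), c(1, 1, 2, 2), c(0, 0, 0, 0),
              c(2, 2, 2, 2), c(0, 1, 0, 0))
out <- filter_maf_blocks(geno, maf = 0.1, blocksize = 2)
print(out$nremoved)
print(dim(out$filtered))
```

With a real HDF5 file, each block would be read from disk instead of sliced from memory, which is what keeps peak memory use bounded by `blocksize`.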

6 Examples

library(BigDataStatMeth)

# Create test SNP data
snps <- matrix(sample(c(0, 1, 2), 1000, replace = TRUE,
                     prob = c(0.7, 0.2, 0.1)), 100, 10)

# Save to HDF5
fn <- "snp_data.hdf5"
bdCreate_hdf5_matrix(fn, snps, "genotype", "raw_snps",
                     overwriteFile = TRUE)

# Remove SNPs with high MAF
bdRemoveMAF_hdf5(
  filename = fn,
  group = "genotype",
  dataset = "raw_snps",
  outgroup = "genotype_filtered",
  outdataset = "filtered_snps",
  maf = 0.1,
  bycols = TRUE,
  blocksize = 50
)

# Cleanup
if (file.exists(fn)) {
  file.remove(fn)
}

7 See Also