bdRemoveMAF_hdf5

HDF5_STATISTICS

1 Description

Filters SNPs (Single Nucleotide Polymorphisms) based on Minor Allele Frequency (MAF) in genomic data stored in HDF5 format.
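For readers new to the term, MAF is the frequency of the rarer of a SNP's two alleles. The sketch below shows the standard calculation in base R, assuming genotypes are coded 0/1/2 (copies of one allele) with samples in rows and SNPs in columns. `snp_maf` is a hypothetical helper written for illustration; it is not part of BigDataStatMeth, and the package's internal computation may differ in details such as missing-value handling.

```r
# Hypothetical helper (not part of BigDataStatMeth): per-column MAF for a
# genotype matrix coded 0/1/2, with samples in rows and SNPs in columns.
snp_maf <- function(geno) {
  # Frequency of the counted allele; each non-missing genotype carries 2 alleles
  p <- colSums(geno, na.rm = TRUE) / (2 * colSums(!is.na(geno)))
  # The minor allele is whichever of the two alleles is rarer
  pmin(p, 1 - p)
}

geno <- cbind(
  snp1 = c(0, 0, 0, 1),  # counted allele frequency 1/8
  snp2 = c(2, 2, 1, 2),  # counted allele frequency 7/8, so the other allele is minor
  snp3 = c(0, 1, 2, 1)   # counted allele frequency 4/8
)
maf <- snp_maf(geno)
print(maf)
```

The `maf` argument described below is a threshold compared against values of this kind.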

2 Usage

bdRemoveMAF_hdf5(filename, group, dataset, outgroup, outdataset, maf, bycols, blocksize, overwrite = NULL)

3 Arguments

Parameter	Description
`filename`	Character string. Path to the HDF5 file.
`group`	Character string. Path to the group containing input dataset.
`dataset`	Character string. Name of the dataset to filter.
`outgroup`	Character string. Output group path for filtered data.
`outdataset`	Character string. Output dataset name for filtered data.
`maf`	Numeric (optional). MAF threshold for filtering (0-1). Default is 0.05. SNPs with MAF above this threshold are removed.
`bycols`	Logical (optional). Whether to process by columns (TRUE) or rows (FALSE). Default is FALSE.
`blocksize`	Integer (optional). Block size for processing. Default is 100. Larger values use more memory but may be faster.
`overwrite`	Logical (optional). Whether to overwrite existing dataset. Default is FALSE.

4 Value

List with the following components. If an error occurs, all string values are returned as empty strings (""):

fn: Character string with the HDF5 filename
ds: Character string with the full dataset path to the filtered dataset (group/dataset)
nremoved: Integer with the number of SNPs removed by the MAF filter
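
Because errors are signalled through empty strings, it can help to check the result before using it. The following is a minimal sketch in base R; `maf_filter_succeeded` is a hypothetical helper, and the two lists are mock values shaped like the documented return value, not output from a real call (in particular, the `nremoved` value shown for the failed case is invented).

```r
# Hypothetical helper (not part of BigDataStatMeth): TRUE when the result
# list looks like a successful call, i.e. `fn` and `ds` are non-empty strings.
maf_filter_succeeded <- function(res) {
  is.list(res) &&
    is.character(res$fn) && length(res$fn) == 1 && nzchar(res$fn) &&
    is.character(res$ds) && length(res$ds) == 1 && nzchar(res$ds)
}

# Mock results for illustration only
ok  <- list(fn = "snp_data.hdf5",
            ds = "genotype_filtered/filtered_snps",
            nremoved = 3L)
bad <- list(fn = "", ds = "", nremoved = 0L)

if (maf_filter_succeeded(ok)) {
  message("Filtered dataset at ", ok$ds, " (", ok$nremoved, " SNPs removed)")
}
if (!maf_filter_succeeded(bad)) {
  message("Filtering failed: empty file or dataset name returned")
}
```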

5 Details

This function provides efficient MAF-based filtering capabilities with:

- Filtering options:
  - MAF threshold-based filtering
  - Row-wise or column-wise processing
  - Block-based processing
- Implementation features:
  - Memory-efficient processing
  - Block-based operations
  - Safe file operations
  - Progress reporting

The function supports both in-place modification and creation of new datasets.
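
The idea of block-based filtering can be sketched on an ordinary in-memory matrix. This is only an illustration under stated assumptions, not the package's implementation: `maf_drop_by_block` is a hypothetical helper, SNPs are assumed to sit in columns coded 0/1/2, and the comparison direction (drop when MAF is above the threshold) follows the wording of the `maf` argument.

```r
# Hypothetical helper (not part of BigDataStatMeth): flag SNP columns to drop,
# scanning the matrix `blocksize` columns at a time.
maf_drop_by_block <- function(geno, maf = 0.05, blocksize = 100) {
  if (ncol(geno) == 0) return(logical(0))
  drop <- logical(ncol(geno))
  for (s in seq(1, ncol(geno), by = blocksize)) {
    idx <- s:min(s + blocksize - 1, ncol(geno))
    block <- geno[, idx, drop = FALSE]   # only this block is processed at once
    p <- colSums(block, na.rm = TRUE) / (2 * colSums(!is.na(block)))
    # Assumed direction: drop when MAF exceeds the threshold
    drop[idx] <- pmin(p, 1 - p) > maf
  }
  drop
}

geno <- cbind(
  c(0, 0, 0, 0),  # MAF 0
  c(0, 0, 0, 1),  # MAF 1/8
  c(2, 2, 2, 2),  # MAF 0
  c(0, 1, 2, 1),  # MAF 1/2
  c(2, 2, 2, 1)   # MAF 1/8
)
drop <- maf_drop_by_block(geno, maf = 0.1, blocksize = 2)
filtered <- geno[, !drop, drop = FALSE]
nremoved <- sum(drop)
print(nremoved)
```

With an HDF5-backed dataset, each block would be read from and written to disk rather than subset in memory, which is what keeps peak memory bounded by `blocksize`.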

6 Examples

library(BigDataStatMeth)

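# Create test SNP data: 100 x 10 matrix of genotypes coded 0, 1, 2
set.seed(123)  # make the simulated genotypes reproducible
snps <- matrix(sample(c(0, 1, 2), 1000, replace = TRUE,
                      prob = c(0.7, 0.2, 0.1)), 100, 10)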

# Save to HDF5
fn <- "snp_data.hdf5"
bdCreate_hdf5_matrix(fn, snps, "genotype", "raw_snps",
                     overwriteFile = TRUE)

# Remove SNPs with MAF above the 0.1 threshold
bdRemoveMAF_hdf5(
  filename = fn,
  group = "genotype",
  dataset = "raw_snps",
  outgroup = "genotype_filtered",
  outdataset = "filtered_snps",
  maf = 0.1,
  bycols = TRUE,
  blocksize = 50
)

# Cleanup
if (file.exists(fn)) {
  file.remove(fn)
}

7 See Also

bdRemovelowdata_hdf5 for removing low-representation SNPs
bdImputeSNPs_hdf5 for imputing missing SNP values
