Compute correlation matrix for matrices stored in HDF5 format

HDF5_STATISTICS

1 Description

This function computes a Pearson or Spearman correlation matrix for matrices stored in HDF5 format. Matrix Y is optional: with only matrix X the function correlates X with itself, and with both X and Y it computes the cross-correlation between them. It automatically selects between direct computation for small matrices and block-wise processing for large matrices to optimize memory usage and performance.
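The block-wise idea can be illustrated in base R. This is a minimal sketch on an in-memory matrix, not the package's internal implementation: the helper `blockwise_cor` and its `block_size` handling are hypothetical and only show that correlating standardized column blocks reproduces the direct result.

```r
# Illustrative only: block-wise Pearson correlation on an in-memory matrix.
# blockwise_cor() is a hypothetical helper, not part of the package.
set.seed(1)
X <- matrix(rnorm(200 * 12), nrow = 200, ncol = 12)  # 200 samples x 12 variables

blockwise_cor <- function(X, block_size = 5L) {
  n <- nrow(X)
  p <- ncol(X)
  # Standardize columns once, so that correlation = crossprod / (n - 1)
  Z <- scale(X)
  out <- matrix(NA_real_, p, p)
  starts <- seq(1L, p, by = block_size)
  for (i in starts) {
    rows <- i:min(i + block_size - 1L, p)
    for (j in starts) {
      cols <- j:min(j + block_size - 1L, p)
      # Only two column blocks are needed at a time
      out[rows, cols] <- crossprod(Z[, rows, drop = FALSE],
                                   Z[, cols, drop = FALSE]) / (n - 1)
    }
  }
  out
}

R_block  <- blockwise_cor(X, block_size = 5L)
R_direct <- cor(X)
same <- isTRUE(all.equal(R_block, R_direct, check.attributes = FALSE))
```

With data on disk, each pair of column blocks would be read from the HDF5 file instead of sliced from memory, which is what keeps the memory footprint bounded by the block size.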

Correlation types supported:

For omics data analysis:

2 Usage

bdCorr_hdf5(filename_x, group_x, dataset_x,
            filename_y = "", group_y = "", dataset_y = "",
            trans_x = FALSE, trans_y = FALSE,
            method = "pearson",
            use_complete_obs = TRUE,
            compute_pvalues = TRUE,
            block_size = 1000L,
            overwrite = FALSE,
            output_filename = "", output_group = "",
            output_dataset_corr = "", output_dataset_pval = "",
            threads = -1L)

3 Arguments

Parameter Description
filename_x Character string with the path to the HDF5 file containing matrix X
group_x Character string indicating the group containing matrix X
dataset_x Character string indicating the dataset name of matrix X
filename_y Character string with the path to the HDF5 file containing matrix Y (optional, default: "")
group_y Character string indicating the group containing matrix Y (optional, default: "")
dataset_y Character string indicating the dataset name of matrix Y (optional, default: "")
trans_x Logical, whether to transpose matrix X (default: FALSE)
trans_y Logical, whether to transpose matrix Y (default: FALSE, ignored for single matrix)
method Character string indicating correlation method (“pearson” or “spearman”, default: “pearson”)
use_complete_obs Logical, whether to use only complete observations (default: TRUE)
compute_pvalues Logical, whether to compute p-values for correlations (default: TRUE)
block_size Integer, block size for large matrix processing (default: 1000)
overwrite Logical, whether to overwrite existing results (default: FALSE)
output_filename Character string, output HDF5 file (default: same as filename_x)
output_group Character string, custom output group name (default: auto-generated)
output_dataset_corr Character string, custom correlation dataset name (default: “correlation”)
output_dataset_pval Character string, custom p-values dataset name (default: “pvalues”)
threads Integer, number of threads for parallel computation (optional, default: auto)
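The effect of several of these arguments can be previewed with base R equivalents on a small in-memory matrix. This is a sketch under assumptions: the source does not state how the function derives p-values or handles missing values internally, so the t-test and the `use = "complete.obs"` call below are standard textbook counterparts rather than the package's confirmed behavior.

```r
# Base R analogues of selected arguments (illustrative; not the package's code)
set.seed(42)
X <- matrix(rnorm(30 * 4), nrow = 30, ncol = 4)  # 30 samples (rows) x 4 variables (columns)

# method: Spearman correlation equals Pearson correlation on column-wise ranks
r_spearman <- cor(X, method = "spearman")
r_ranked   <- cor(apply(X, 2, rank))

# trans_x: transposing switches from variable-variable to sample-sample correlations
r_vars    <- cor(X)     # one row/column per variable
r_samples <- cor(t(X))  # one row/column per sample

# use_complete_obs: drop rows containing any missing value before correlating
X_na <- X
X_na[1, 2] <- NA
r_complete <- cor(X_na, use = "complete.obs")

# compute_pvalues: two-sided t-test for a Pearson coefficient
# (assumed here; the same test that stats::cor.test() applies)
r <- cor(X)[1, 2]
n <- nrow(X)
t_stat <- r * sqrt((n - 2) / (1 - r^2))
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)
```

Checking a configuration on a small subset like this before launching a run on a large HDF5 file can catch a wrong `trans_x` setting early, since the output dimensions differ immediately.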

4 Value

List with components:

  • fn: Character string with the HDF5 filename
  • ds: Character string with the full dataset path to the correlation matrix (group/dataset)
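A short sketch of how the returned list might be used follows. The values in `res` are hypothetical stand-ins (the real group and dataset names depend on the call), and reading the matrix back with the `rhdf5` package is an assumption about the reader's setup, not something this page prescribes.

```r
# Hypothetical return value; actual names depend on the call and its output_* arguments
res <- list(fn = "data.h5", ds = "CORR/genes/correlation")

# Split the full dataset path into its group and dataset parts
out_group   <- dirname(res$ds)
out_dataset <- basename(res$ds)

# If rhdf5 is installed and the file exists, the matrix could be loaded like this.
# Loading the full matrix into memory is only sensible when it is small enough.
if (requireNamespace("rhdf5", quietly = TRUE) && file.exists(res$fn)) {
  corr <- rhdf5::h5read(res$fn, res$ds)
}
```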

5 Examples

# Backward compatible - existing code works unchanged
result_original <- bdCorr_hdf5("data.h5", "expression", "genes")

# New transpose functionality
# Gene-gene correlations (variables)
gene_corr <- bdCorr_hdf5("omics.h5", "expression", "genes", trans_x = FALSE)

# Sample-sample correlations (individuals)
sample_corr <- bdCorr_hdf5("omics.h5", "expression", "genes", trans_x = TRUE)

# Cross-correlation: genes vs methylation sites (variables vs variables)
cross_vars <- bdCorr_hdf5("omics.h5", "expression", "genes", 
                         "omics.h5", "methylation", "cpg_sites",
                         trans_x = FALSE, trans_y = FALSE)

# Cross-correlation: samples vs methylation sites (samples vs variables)
samples_vs_cpg <- bdCorr_hdf5("omics.h5", "expression", "genes",
                             "omics.h5", "methylation", "cpg_sites", 
                             trans_x = TRUE, trans_y = FALSE)