bdCorr_hdf5

Compute correlation matrix for matrices stored in HDF5 format

HDF5_STATISTICS

1 Description

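This function computes a Pearson or Spearman correlation matrix for matrices stored in HDF5 format. It automatically detects which computation to perform:

Single-matrix correlation cor(X): when only dataset_x is provided
Cross-matrix correlation cor(X, Y): when both dataset_x and dataset_y are provided

It also selects between direct computation for small matrices and block-wise processing for large matrices to optimize memory usage and performance.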
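The block-wise idea can be illustrated in memory with base R. This is only a sketch of the general technique: the package's actual blocking strategy is not described here, and `block_size`, `starts`, and the tiny random matrix are illustrative assumptions. Each column block is standardized on its own, and the correlation block is the scaled cross-product of two standardized blocks.

```r
# Illustrative sketch only: block-wise Pearson correlation on an in-memory matrix.
# The real function works on HDF5 datasets; its internal strategy may differ.
set.seed(1)
X <- matrix(rnorm(200 * 7), nrow = 200, ncol = 7)  # 200 samples x 7 variables
block_size <- 3L                                   # assumed small block for the demo

n <- nrow(X)
p <- ncol(X)
starts <- seq(1L, p, by = block_size)              # first column of each block
R <- matrix(NA_real_, nrow = p, ncol = p)

for (i in starts) {
  ci <- i:min(i + block_size - 1L, p)
  Zi <- scale(X[, ci, drop = FALSE])               # center and scale this block only
  for (j in starts) {
    cj <- j:min(j + block_size - 1L, p)
    Zj <- scale(X[, cj, drop = FALSE])
    # Correlation block = cross-product of standardized columns / (n - 1)
    R[ci, cj] <- crossprod(Zi, Zj) / (n - 1)
  }
}

# The assembled matrix should match the direct computation
blockwise_matches <- isTRUE(all.equal(R, cor(X), check.attributes = FALSE))
```

Only two column blocks need to be held at a time, which is why this pattern suits matrices that do not fit in memory.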

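Correlation types supported:

Single matrix: cor(X) when only dataset_x is provided
Single matrix transposed: cor(t(X)) when trans_x = TRUE
Cross-correlation: cor(X, Y) when both datasets are provided
Cross-correlation with transpose: cor(t(X), Y), cor(X, t(Y)), cor(t(X), t(Y))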

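For omics data analysis:

trans_x = FALSE, trans_y = FALSE: Variables vs Variables (genes vs genes, CpGs vs CpGs)
trans_x = TRUE, trans_y = FALSE: Samples vs Variables (individuals vs genes)
trans_x = FALSE, trans_y = TRUE: Variables vs Samples (genes vs individuals)
trans_x = TRUE, trans_y = TRUE: Samples vs Samples (individuals vs individuals), optimized to cor(X,Y)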
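As a small in-memory illustration of what the transpose flags mean, the base R `cor()` calls below mirror the single-matrix, transposed, and cross-correlation cases. The matrices `X`, `Y`, and `W` are made-up toy data, assuming rows are samples and columns are variables; this shows the shape of each result, not the package's implementation.

```r
# Toy data (assumed layout: rows = samples, columns = variables)
set.seed(123)
X <- matrix(rnorm(6 * 4), nrow = 6, ncol = 4)  # 6 samples x 4 genes
Y <- matrix(rnorm(6 * 3), nrow = 6, ncol = 3)  # same 6 samples x 3 CpG sites
W <- matrix(rnorm(5 * 4), nrow = 5, ncol = 4)  # 5 other samples x the same 4 genes

single   <- cor(X)             # variables vs variables: 4 x 4
single_t <- cor(t(X))          # samples vs samples:     6 x 6
cross    <- cor(X, Y)          # genes vs CpG sites:     4 x 3
cross_tt <- cor(t(X), t(W))    # samples of X vs samples of W: 6 x 5

# cor(A, B) needs the same number of rows in A and B, so each transposed
# variant is only defined when the dimensions line up accordingly.
```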

2 Usage

bdCorr_hdf5(filename_x, group_x, dataset_x,
            filename_y = "", group_y = "", dataset_y = "",
            trans_x = FALSE, trans_y = FALSE,
            method = "pearson", use_complete_obs = TRUE,
            compute_pvalues = TRUE, block_size = 1000L,
            overwrite = FALSE, output_filename = "",
            output_group = "", output_dataset_corr = "",
            output_dataset_pval = "", threads = -1L)

3 Arguments

Parameter	Description
`filename_x`	Character string with the path to the HDF5 file containing matrix X
`group_x`	Character string indicating the group containing matrix X
`dataset_x`	Character string indicating the dataset name of matrix X
`filename_y`	Character string with the path to the HDF5 file containing matrix Y (optional, default: "")
`group_y`	Character string indicating the group containing matrix Y (optional, default: "")
`dataset_y`	Character string indicating the dataset name of matrix Y (optional, default: "")
`trans_x`	Logical, whether to transpose matrix X (default: FALSE)
`trans_y`	Logical, whether to transpose matrix Y (default: FALSE, ignored for single matrix)
`method`	Character string indicating correlation method ("pearson" or "spearman", default: "pearson")
`use_complete_obs`	Logical, whether to use only complete observations (default: TRUE)
`compute_pvalues`	Logical, whether to compute p-values for correlations (default: TRUE)
`block_size`	Integer, block size for large matrix processing (default: 1000)
`overwrite`	Logical, whether to overwrite existing results (default: FALSE)
`output_filename`	Character string, output HDF5 file (default: same as filename_x)
`output_group`	Character string, custom output group name (default: auto-generated)
`output_dataset_corr`	Character string, custom correlation dataset name (default: "correlation")
`output_dataset_pval`	Character string, custom p-values dataset name (default: "pvalues")
`threads`	Integer, number of threads for parallel computation (optional, default: auto)
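To give a feel for what `method` and `compute_pvalues` refer to, here is a base R sketch on a small in-memory matrix. It relies on two standard facts: Spearman correlation is Pearson correlation of ranks, and a two-sided p-value for Pearson r can be obtained from a t statistic with n - 2 degrees of freedom. Whether the package computes its p-values exactly this way is an assumption; the toy matrix `X` is invented for the illustration.

```r
# Illustrative only: relation between the methods and p-values in base R
set.seed(42)
n <- 30
X <- matrix(rnorm(n * 4), nrow = n, ncol = 4)

# Spearman correlation equals Pearson correlation of column-wise ranks
spearman_direct <- cor(X, method = "spearman")
spearman_ranks  <- cor(apply(X, 2, rank))

# Two-sided p-values for Pearson r via the t distribution (n - 2 df)
r      <- cor(X)
t_stat <- r * sqrt((n - 2) / (1 - r^2))
pvals  <- 2 * pt(-abs(t_stat), df = n - 2)
diag(pvals) <- NA   # self-correlations carry no test

# Cross-check one entry against stats::cor.test
reference <- cor.test(X[, 1], X[, 2])$p.value
```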

4 Value

List with components:

fn: Character string with the HDF5 filename
ds: Character string with the full dataset path to the correlation matrix (group/dataset)
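The returned list can be used to locate the result afterwards. The sketch below uses a hypothetical return value (the file name and dataset path are invented for illustration, since the auto-generated group name depends on the call) and reads the matrix back with `rhdf5::h5read()` only if that package is installed and the file exists.

```r
# Hypothetical return value; real group/dataset names depend on the call
result <- list(fn = "omics.h5", ds = "CORR/genes/correlation")

# Split the full dataset path into its group and dataset parts
out_group   <- dirname(result$ds)    # "CORR/genes"
out_dataset <- basename(result$ds)   # "correlation"

# Read the matrix back only when rhdf5 is available and the file is present
if (requireNamespace("rhdf5", quietly = TRUE) && file.exists(result$fn)) {
  corr <- rhdf5::h5read(result$fn, result$ds)
  print(dim(corr))
}
```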

5 Examples


# Backward compatible - existing code works unchanged
result_original <- bdCorr_hdf5("data.h5", "expression", "genes")

# New transpose functionality
# Gene-gene correlations (variables)
gene_corr <- bdCorr_hdf5("omics.h5", "expression", "genes", trans_x = FALSE)

# Sample-sample correlations (individuals) 
sample_corr <- bdCorr_hdf5("omics.h5", "expression", "genes", trans_x = TRUE)

# Cross-correlation: genes vs methylation sites (variables vs variables)
cross_vars <- bdCorr_hdf5("omics.h5", "expression", "genes", 
                         "omics.h5", "methylation", "cpg_sites",
                         trans_x = FALSE, trans_y = FALSE)

# Cross-correlation: samples vs methylation sites (samples vs variables)
samples_vs_cpg <- bdCorr_hdf5("omics.h5", "expression", "genes",
                             "omics.h5", "methylation", "cpg_sites", 
                             trans_x = TRUE, trans_y = FALSE)
