bdblockmult_hdf5

BLOCKWISE_OPS

1 Usage

bdblockmult_hdf5(filename, group, A, B, groupB = NULL, transpose_A = NULL, transpose_B = NULL, block_size = NULL, paral = NULL, threads = NULL, outgroup = NULL, outdataset = NULL, overwrite = NULL)

2 Arguments

Parameter	Description
`filename`	string specifying the path to the HDF5 file
`group`	string specifying the group within the HDF5 file containing matrix A.
`A`	string specifying the dataset name for matrix A, the data matrix used in the calculation.
`B`	string specifying the dataset name for matrix B.
`groupB`	string (optional), the group containing matrix B. Defaults to the value of `group` if not provided.
`transpose_A`	logical (optional), whether to transpose matrix A.
`transpose_B`	logical (optional), whether to transpose matrix B.
`block_size`	integer (optional), the block size used to process the matrices. If not provided, a default block size is used. Choose the block size based on the available memory and the size of the matrices.
`paral`	logical (optional), enables parallel computation. Defaults to FALSE. Set `paral = TRUE` to force parallel execution.
`threads`	integer (optional), the number of threads to use if `paral = TRUE`. Ignored if `paral = FALSE`.
`outgroup`	string (optional), the group where the output matrix will be stored. If NULL, the output is stored in the default group "OUTPUT".
`outdataset`	string (optional), the dataset name for the output matrix. If NULL, the default name is the name of dataset A, followed by `_x_`, followed by the name of dataset B.
`overwrite`	logical (optional), whether existing results in the HDF5 file should be overwritten. Defaults to FALSE. If FALSE and the dataset already exists, an error is raised and no calculations are performed. If TRUE and a dataset with the name given in `outdataset` already exists, it is overwritten.
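
The defaults for `outgroup` and `outdataset` described in the table can be illustrated with a small base R sketch. The helper `default_output_path()` is hypothetical and not part of BigDataStatMeth. It only mirrors the documented naming rule, and it assumes that the full dataset path joins group and dataset name with `/`.

```r
# Hypothetical helper (not a BigDataStatMeth function): mirrors the documented
# defaults for the output location when outgroup / outdataset are NULL.
default_output_path <- function(A, B, outgroup = NULL, outdataset = NULL) {
  if (is.null(outgroup)) outgroup <- "OUTPUT"
  if (is.null(outdataset)) outdataset <- paste0(A, "_x_", B)
  # Assumption: the full path joins group and dataset name with "/"
  paste(outgroup, outdataset, sep = "/")
}

# Defaults applied
default_output_path("datasetA", "datasetB")

# Explicit output location, as in a call with outgroup = "results", outdataset = "res"
default_output_path("datasetA", "datasetB", outgroup = "results", outdataset = "res")
```

In practice the `ds` element of the returned list is the authoritative location of the result, so there is no need to rebuild the path by hand.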

3 Value

A list containing the location of the matrix multiplication result:

fn: Character string. Path to the HDF5 file containing the result
ds: Character string. Full dataset path to the A*B multiplication result within the HDF5 file

4 Details

The function bdblockmult_hdf5() handles both matrices that are too large to fit into memory, which it processes in blocks, and matrices that can be loaded fully into memory, adapting the computation to the available resources.
Ensure that the dimensions of A and B matrices are compatible for matrix multiplication.
The block size should be chosen based on the available memory and the size of the matrices.
If paral = TRUE, threads sets the number of concurrent threads used for the parallel computation. If paral = TRUE and threads = NULL, threads is set to half of the maximum number of available threads.

5 Examples
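
The thread default described in the Details can be approximated in base R. This is only a sketch: `default_threads()` is a hypothetical helper, the use of `parallel::detectCores()` to count available threads is an assumption, and the package may detect threads differently. Returning 1 when `paral` is not TRUE is also an assumption standing in for sequential execution.

```r
# Hypothetical helper: approximates the documented rule
# "half of the maximum number of available threads".
default_threads <- function(paral = FALSE, threads = NULL) {
  # threads is ignored unless parallel execution is requested
  if (!isTRUE(paral)) return(1L)
  # An explicit value takes precedence over the default
  if (!is.null(threads)) return(as.integer(threads))
  n <- parallel::detectCores()  # may return NA on some platforms
  if (is.na(n)) return(1L)
  max(1L, n %/% 2L)             # never fewer than one thread
}

default_threads(paral = TRUE)               # half of the detected cores
default_threads(paral = TRUE, threads = 4)  # explicit value is kept
```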

library("BigDataStatMeth")
library("rhdf5")

N = 1000; M = 1000

set.seed(555)
a <- matrix( rnorm( N*M, mean=0, sd=1), N, M) 
b <- matrix( rnorm( N*M, mean=0, sd=1), M, N) 

fn <- "test_temp.hdf5"
bdCreate_hdf5_matrix(filename = fn, 
                     object = a, group = "groupA", 
                     dataset = "datasetA",
                     transp = FALSE,
                     overwriteFile = TRUE, 
                     overwriteDataset = FALSE, 
                     unlimited = FALSE)
                     
bdCreate_hdf5_matrix(filename = fn, 
                     object = t(b), 
                     group = "groupA", 
                     dataset = "datasetB",
                     transp = FALSE,
                     overwriteFile = FALSE, 
                     overwriteDataset = TRUE, 
                     unlimited = FALSE)
                     
# Multiply the two matrices
res <- bdblockmult_hdf5(filename = fn, group = "groupA", 
    A = "datasetA", B = "datasetB", outgroup = "results", 
    outdataset = "res", overwrite = TRUE ) 
 
# list contents
h5ls(fn)

# Extract the result from HDF5
result_hdf5 <- h5read(res$fn, res$ds)[1:3, 1:5]
result_hdf5

# Compute the same multiplication in R
result_r <- (a %*% b)[1:3, 1:5]
result_r

# Compare both results (should be TRUE)
all.equal(result_hdf5, result_r)

# Remove file
if (file.exists(fn)) {
  file.remove(fn)
}
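
To make the role of `block_size` more concrete, the following base R sketch multiplies two in-memory matrices block by block. It illustrates the general technique only and is not the package's implementation: `block_multiply()` is a hypothetical helper, and the real function reads and writes blocks from the HDF5 file instead of slicing matrices held in memory.

```r
# Block-wise product of two in-memory matrices (illustration only).
block_multiply <- function(A, B, block_size = 256L) {
  stopifnot(ncol(A) == nrow(B))  # dimensions must be compatible
  C <- matrix(0, nrow(A), ncol(B))
  row_starts <- seq(1L, nrow(A), by = block_size)
  col_starts <- seq(1L, ncol(B), by = block_size)
  k_starts   <- seq(1L, ncol(A), by = block_size)
  for (i in row_starts) {
    rows <- i:min(i + block_size - 1L, nrow(A))
    for (j in col_starts) {
      cols <- j:min(j + block_size - 1L, ncol(B))
      for (k in k_starts) {
        ks <- k:min(k + block_size - 1L, ncol(A))
        # Only one block of A, one block of B and one block of C
        # are needed for each partial product
        C[rows, cols] <- C[rows, cols] +
          A[rows, ks, drop = FALSE] %*% B[ks, cols, drop = FALSE]
      }
    }
  }
  C
}

set.seed(1)
A <- matrix(rnorm(7 * 5), 7, 5)
B <- matrix(rnorm(5 * 6), 5, 6)

# Compare with the ordinary product (should be TRUE)
all.equal(block_multiply(A, B, block_size = 3L), A %*% B)
```

A smaller block size lowers the amount of data held at once at the cost of more partial products, which is why the block size should be chosen with the available memory and the matrix dimensions in mind.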
