17 Compression of GDS files vs. performance

When working with genotype data, OmicSHIELD accepts both VCF and GDS formats. When a VCF is supplied, it is converted internally to a GDS before it can be used. For that reason it is always better to start from a GDS: this avoids the conversion step and improves the performance of the pipeline.

When converting to GDS using gdsfmt, there are many compression options that affect the final file size. Compression comes at a cost, which is the read time. Very aggressive compression typically slows reading severely, so it is worth finding a good balance between file size and read time. To help choose the right compression, we provide a table comparing all the compression options. The table has been extracted from the official documentation of the gdsfmt package.

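| Compression Method    | Raw  | ZIP    | ZIP_ra | LZ4   | LZ4_ra | LZMA  | LZMA_ra |
|-----------------------|------|--------|--------|-------|--------|-------|---------|
| Data Size (MB)        | 38.1 | 1.9    | 2.1    | 2.8   | 2.9    | 1.4   | 1.4     |
| Compression Percent   | 100% | 5.08%  | 5.42%  | 7.39% | 7.60%  | 3.65% | 3.78%   |
| Reading Time (second) | 0.21 | 202.64 | 2.97   | 84.43 | 0.84   | 462.1 | 29.7    |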

At ISGlobal we use the LZ4_ra compression method: it still reduces the file to 7.60% of its raw size, and of all the compressed options it has the least effect on reading time (0.84 seconds, against 0.21 seconds for the uncompressed file).
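
The trade-off behind this choice can be checked with a few lines of arithmetic. The sketch below is only an illustration: it copies the data sizes and reading times from the table above and, for each method, computes the size relative to the raw file and the slowdown relative to raw reading. The names `BENCHMARK`, `summarize` and `fastest_compressed` are hypothetical helpers written for this example and are not part of gdsfmt or OmicSHIELD.

```python
# Figures copied from the comparison table above (gdsfmt documentation).
# Each entry maps a method to (data size in MB, reading time in seconds).
BENCHMARK = {
    "Raw":     (38.1, 0.21),
    "ZIP":     (1.9, 202.64),
    "ZIP_ra":  (2.1, 2.97),
    "LZ4":     (2.8, 84.43),
    "LZ4_ra":  (2.9, 0.84),
    "LZMA":    (1.4, 462.1),
    "LZMA_ra": (1.4, 29.7),
}


def summarize(benchmark):
    """Return size and reading time of each method relative to the raw file."""
    raw_size, raw_time = benchmark["Raw"]
    rows = []
    for method, (size_mb, read_s) in benchmark.items():
        rows.append({
            "method": method,
            "size_ratio": size_mb / raw_size,  # fraction of the raw size
            "slowdown": read_s / raw_time,     # multiple of the raw reading time
        })
    return rows


def fastest_compressed(benchmark):
    """Return the compressed method with the shortest reading time."""
    compressed = {m: v for m, v in benchmark.items() if m != "Raw"}
    return min(compressed, key=lambda m: compressed[m][1])


if __name__ == "__main__":
    for row in summarize(BENCHMARK):
        print(f"{row['method']:<8} {row['size_ratio']:7.2%} of raw size, "
              f"{row['slowdown']:8.1f}x raw reading time")
    print("Fastest compressed method:", fastest_compressed(BENCHMARK))
```

With these figures, LZ4_ra reads at about 4 times the raw reading time, while every other compressed option is slower. The size ratios are recomputed from the rounded sizes in the table, so they can differ slightly from the published percentages.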