17 Compression of GDS files vs. performance
When working with genotype data, OmicSHIELD accepts both the VCF and GDS formats. When a VCF is supplied, it is internally converted to a GDS before any work is done on it. For that reason, it is always better to start from a GDS: this skips the conversion step and improves the performance of the pipeline.
When converting to GDS using gdsfmt, there are several compression options that affect the final file size. Compression comes at a cost, which is the read time: very aggressive compression typically slows reading severely, so it is worth finding a good balance. To help choose the right compression, we provide a comparison table of all the compression options, extracted from the official documentation of the gdsfmt package.
Compression Method | Raw | ZIP | ZIP_ra | LZ4 | LZ4_ra | LZMA | LZMA_ra |
---|---|---|---|---|---|---|---|
Data Size (MB) | 38.1 | 1.9 | 2.1 | 2.8 | 2.9 | 1.4 | 1.4 |
Compressed Size (% of Raw) | 100% | 5.08% | 5.42% | 7.39% | 7.60% | 3.65% | 3.78% |
Reading Time (s) | 0.21 | 202.64 | 2.97 | 84.43 | 0.84 | 462.1 | 29.7 |
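At ISGlobal we use the LZ4_ra compression method (the random-access variant of LZ4) because it offers a good compression level (7.60% of the raw size) with the smallest effect on reading time: 0.84 seconds, the shortest of all the compressed options.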
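The trade-off behind this choice can be checked directly against the figures in the table. The following is a minimal sketch in Python, not part of OmicSHIELD or gdsfmt: the names `TABLE`, `fastest_compressed`, `slowdown_vs_raw` and `size_reduction` are hypothetical helpers introduced only for illustration, and the numbers are copied from the table as given.

```python
# Figures copied from the comparison table in this section.
# method: (data size in MB, reading time in seconds)
TABLE = {
    "Raw": (38.1, 0.21),
    "ZIP": (1.9, 202.64),
    "ZIP_ra": (2.1, 2.97),
    "LZ4": (2.8, 84.43),
    "LZ4_ra": (2.9, 0.84),
    "LZMA": (1.4, 462.1),
    "LZMA_ra": (1.4, 29.7),
}


def fastest_compressed(table):
    """Return the compressed method with the shortest reading time."""
    compressed = {m: v for m, v in table.items() if m != "Raw"}
    return min(compressed, key=lambda m: compressed[m][1])


def slowdown_vs_raw(table, method):
    """Reading time of `method` divided by the reading time of Raw."""
    return table[method][1] / table["Raw"][1]


def size_reduction(table, method):
    """Raw size divided by the size of `method` (rounded MB values)."""
    return table["Raw"][0] / table[method][0]


if __name__ == "__main__":
    best = fastest_compressed(TABLE)
    print("fastest compressed method:", best)
    print("slowdown vs Raw:", round(slowdown_vs_raw(TABLE, best), 1))
    print("size reduction vs Raw:", round(size_reduction(TABLE, best), 1))
```

With these figures the sketch selects LZ4_ra: its reading time is about 4 times that of the uncompressed file, while the file itself is roughly 13 times smaller.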