In my last post I looked at how the GZIP variant BGZF (Blocked GNU Zip Format, used in BAM files) allows efficient random access to large compressed files. This time I'm looking at bzip2 (bz2), which offers better compression than GZIP but is also block based, so in theory the same random access strategy can be employed.
BAM files are compressed using a variant of GZIP (GNU ZIP) called BGZF (Blocked GNU Zip Format). Anyone who has read the SAM/BAM Specification will have seen the terms BGZF and virtual offsets, but what you may not realise is how general purpose the approach is: it can provide random access to any large compressed file. The take home message is:
BGZF files are bigger than GZIP files, but they are much faster for random access.
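To make that trade-off concrete, here is a minimal Python sketch of the idea. It is not a real BGZF reader or writer: true BGZF blocks carry an extra header field recording the block size, which this skips. The function names and the `BLOCK_SIZE` constant are my own for illustration. It compresses the data as a series of independent gzip members, records where each one starts, and packs a (block start, offset within block) pair into a 64-bit virtual offset, with the within-block offset in the low 16 bits as described in the SAM/BAM Specification.

```python
import gzip
import zlib

# Illustrative uncompressed block size. Real BGZF blocks hold at most
# 64 KiB, which is why a within-block offset fits in 16 bits.
BLOCK_SIZE = 65536


def compress_blocked(data, block_size=BLOCK_SIZE):
    """Compress data as concatenated gzip members.

    Returns the compressed bytes and the start offset of each member.
    """
    out = bytearray()
    starts = []
    for i in range(0, len(data), block_size):
        starts.append(len(out))
        out += gzip.compress(data[i:i + block_size])
    return bytes(out), starts


def read_block(blocked, start):
    """Decompress only the gzip member beginning at byte offset start."""
    decomp = zlib.decompressobj(wbits=31)  # wbits=31 expects a gzip wrapper
    # Decompression stops at the end of this member; any later members
    # are left untouched in decomp.unused_data.
    return decomp.decompress(blocked[start:])


def make_virtual_offset(block_start, within_block):
    """Pack a compressed block start and an offset inside that block."""
    return (block_start << 16) | within_block


def split_virtual_offset(voffset):
    """Unpack a virtual offset into (block_start, within_block)."""
    return voffset >> 16, voffset & 0xFFFF


def read_at(blocked, voffset, size):
    """Read size bytes at a virtual offset, staying within one block."""
    block_start, within_block = split_virtual_offset(voffset)
    block = read_block(blocked, block_start)
    return block[within_block:within_block + size]
```

Because concatenated gzip members are still a valid gzip stream, `gzip.decompress` recovers the whole input from the output of `compress_blocked`, while `read_at` only has to decompress the one block it needs. The price is that each block gets its own header and trailer and cannot reuse matches from earlier blocks, so the blocked output is typically somewhat larger than a single gzip stream of the same data.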