
Putting it all together

sourmash 3.3 was released last week, and it is the first version supporting zipped databases. Here is my personal account of how that came to be =]

What is a sourmash database?

A sourmash database contains signatures (typically Scaled MinHash sketches built from genomic datasets) and an index that allows efficient similarity and containment queries over these signatures. The two types of index are SBT, a hierarchical index that uses less memory by keeping data on disk, and LCA, an inverted index that uses more memory but is potentially faster. Indices are described as JSON files: LCA stores all the data in one JSON file, while SBT saves a description of the index structure in JSON and the data itself in a hidden directory with many files.

We distribute some prepared databases (with SBT indices) for GenBank and RefSeq as compressed TAR files. The compressed file is ~8 GB, but after decompressing it turns into almost 200k files in a hidden directory, using about 40 GB of disk space.

Can we avoid generating so many hidden files?

The initial issue in this saga is dib-lab/sourmash#490, and the idea was to take the existing support for multiple data storages (hidden dir, TAR files, IPFS and Redis) and save the index description in the storage too, so that everything could be loaded from the storage. Since we already had the databases as TAR files, the first test tried to reuse them, but it didn't take long to see it was a doomed approach: TAR files are terrible for random access (or at least the tarfile module in Python is).

Zip files showed up as a better alternative, and it helps that Python already has the zipfile module in the standard library. Initial tests were promising, and led to dib-lab/sourmash#648. The main issue was performance: compressing and decompressing were slow, but there was also another limitation...

Loading Nodegraphs from a memory buffer

Another challenge was efficiently loading the data from a storage. The two core methods in a storage are save(location, content), where content is a bytes buffer, and load(location), which returns a bytes buffer that was previously saved. This didn't interact well with the khmer Nodegraphs (the Bloom filters we use for SBTs), since khmer only loads data from files, not from memory buffers. We ended up doing a temporary file dance, which made things slower for the default storage (hidden dir), where it could have been optimized to work directly with files, and forced unnecessary filesystem round-trips for the other storages (IPFS and Redis could be pulling data directly from the network, for example).
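For reference, here is a minimal sketch of that storage contract. The trait and the toy in-memory backend below only mirror the description above; the actual interfaces live in sourmash's Python and Rust code, and details like error types here are illustrative.

```rust
use std::collections::HashMap;
use std::io;

/// Minimal sketch of the storage contract: everything is bytes in, bytes out.
trait Storage {
    /// Save a bytes buffer under `location`.
    fn save(&mut self, location: &str, content: &[u8]) -> io::Result<()>;
    /// Return a bytes buffer previously saved under `location`.
    fn load(&self, location: &str) -> io::Result<Vec<u8>>;
}

/// Toy in-memory backend, just to show the shape of an implementation.
struct MemStorage(HashMap<String, Vec<u8>>);

impl Storage for MemStorage {
    fn save(&mut self, location: &str, content: &[u8]) -> io::Result<()> {
        self.0.insert(location.to_string(), content.to_vec());
        Ok(())
    }

    fn load(&self, location: &str) -> io::Result<Vec<u8>> {
        self.0
            .get(location)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, location.to_string()))
    }
}
```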

This could be fixed in khmer by exposing C++ stream methods, and I did a small PoC to test the idea. While doable, this was happening while the sourmash conversion to Rust was underway, and depending on khmer was a problem for my WebAssembly aspirations... so having the Nodegraph implemented in Rust seemed like a better direction. One had actually been quietly living in the sourmash codebase for quite some time, but it was never exposed to the Python layer (and it was also lacking more extensive tests).

After the release of sourmash 3 and the replacement of the C++ implementation with the Rust one, all the pieces for exposing the Nodegraph were in place, so dib-lab/sourmash#799 was the next step. It wasn't a priority at first because other optimizations (released in 3.1 and 3.2) were more important, but then it was time to check how this would perform. And...

Your Rust code is not so fast, huh?

Turns out that my Nodegraph loading code was way slower than khmer's. The Nodegraph binary format is well documented, and an initial implementation wasn't so hard: use the byteorder crate to read binary data with the right endianness, then set the appropriate bits in the internal fixedbitset in memory. But the khmer code doesn't parse bit by bit: it reads a long char buffer directly, and that is many orders of magnitude faster than setting bits one by one.
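In condensed form, the fast path reads a small header and then pulls each table in as one contiguous byte buffer. The field names and layout in this sketch are illustrative, not the exact khmer on-disk format:

```rust
use std::io::Read;

use byteorder::{LittleEndian, ReadBytesExt};

/// Read one table: a couple of header fields, then the table itself as one
/// contiguous byte buffer (instead of setting bits one by one).
fn read_table<R: Read>(reader: &mut R) -> std::io::Result<(u64, Vec<u8>)> {
    // illustrative header fields, read with the right endianness
    let tablesize = reader.read_u64::<LittleEndian>()?;
    let _occupied = reader.read_u64::<LittleEndian>()?;

    // the slow first version walked these bytes and set bits individually;
    // the fast path is a single bulk read of the whole table
    let mut table = vec![0u8; (tablesize as usize / 8) + 1];
    reader.read_exact(&mut table)?;

    Ok((tablesize, table))
}
```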

And there was no way to replicate this behavior directly with fixedbitset. At this point I could either do the bit indexing into a large buffer myself and lose all the useful methods that fixedbitset provides, or try to find a way to support loading the data directly into fixedbitset and open a PR.

I chose the PR (and even got #42! =]).

It was more straightforward than I expected, but it did expose the internal representation of fixedbitset, so I was a bit nervous it wasn't going to be merged. But bluss was super nice, and his suggestions made the PR way better! This simplified the final Nodegraph code, and it also ended up more correct (because I was messing up a few corner cases when doing the bit-by-bit parsing before). Win-win!
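A condensed sketch of what the bulk-loading path enables, assuming a block-based constructor along the lines of fixedbitset's with_capacity_and_blocks with 32-bit blocks; treat the exact API and block type as assumptions:

```rust
use fixedbitset::FixedBitSet;

/// Build a bitset straight from the table bytes read earlier, feeding the
/// blocks in directly instead of inserting one bit at a time.
fn bitset_from_bytes(tablesize: usize, bytes: &[u8]) -> FixedBitSet {
    let blocks = bytes.chunks(4).map(|chunk| {
        let mut block = [0u8; 4];
        block[..chunk.len()].copy_from_slice(chunk);
        u32::from_le_bytes(block)
    });
    FixedBitSet::with_capacity_and_blocks(tablesize, blocks)
}

// The bit-by-bit version this replaces looked roughly like
// (`bit_set_in` being a hypothetical helper):
//   let mut bs = FixedBitSet::with_capacity(tablesize);
//   for i in 0..tablesize { if bit_set_in(bytes, i) { bs.insert(i); } }
```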

Nodegraphs are kind of large, can we compress them?

Being able to save and load Nodegraphs in Rust allowed using memory buffers, but it also opened the way to support operations not available in khmer Nodegraphs. One example is loading and saving compressed files, which is supported for Countgraph (another khmer data structure, based on the Count-Min Sketch) but not for Nodegraph.

If only there was an easy way to support working with compressed files...

Oh wait, there is! niffler is a crate that I made with Pierre Marijon, based on some functionality I saw in one of his projects; we iterated a bit on the API and documented everything to make it more useful for a larger audience. niffler tries to be as transparent as possible, with very little boilerplate but with useful features nonetheless (like auto-detection of the compression format). If you want to know more about the motivation and how it happened, check this Twitter thread.

The cool thing is that adding compressed file support to sourmash mostly took one-line changes for loading (and a bit more for saving, but mostly because converting compression levels could use some refactoring).
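For example, opening a possibly-gzipped file and getting back a plain reader looks roughly like this, using niffler's get_reader with format auto-detection; the exact types and signatures are from memory, so treat them as an approximation:

```rust
use std::fs::File;
use std::io::Read;

/// Read a file that may or may not be compressed; niffler detects the format
/// from the file's magic bytes and hands back a plain reader either way.
fn read_maybe_compressed(path: &str) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    let (mut reader, _format) = niffler::get_reader(Box::new(File::open(path)?))?;
    let mut contents = Vec::new();
    reader.read_to_end(&mut contents)?;
    Ok(contents)
}
```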

Putting it all together: zipped SBT indices

With all these other pieces in place, it was time to go back to dib-lab/sourmash#648. Compressing and decompressing with the Python zipfile module is slow, but Zip files can also be used purely for storage, handing back entries without compressing or decompressing them. And since we have compression and decompression implemented in Rust with niffler, that's what the zipped sourmash databases are: data is loaded and saved into the Zip file without using the Python module's compression or decompression, and all that work is done before (or after) on the Rust side.
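A sketch of that split: gzip the data outside the archive and hand the already-compressed bytes to the Zip file with the Stored method. The zip crate here is only for illustration (sourmash's ZipStorage goes through Python's zipfile), and the niffler and zip APIs are written from memory, so treat the exact paths and signatures as assumptions:

```rust
use std::io::{Seek, Write};

/// Gzip `data` outside the archive, then store the compressed bytes as-is.
fn store_compressed<W: Write + Seek>(
    archive: &mut zip::ZipWriter<W>,
    name: &str,
    data: &[u8],
) -> Result<(), Box<dyn std::error::Error>> {
    // compress into an in-memory buffer (gzip level 1, as discussed below)
    let mut gz = Vec::new();
    {
        let mut writer = niffler::get_writer(
            Box::new(&mut gz),
            niffler::compression::Format::Gzip,
            niffler::Level::One,
        )?;
        writer.write_all(data)?;
    } // writer dropped here, finishing the gzip stream

    // Stored = no compression inside the Zip, so reads are just byte copies
    let options = zip::write::FileOptions::default()
        .compression_method(zip::CompressionMethod::Stored);
    archive.start_file(name, options)?;
    archive.write_all(&gz)?;
    Ok(())
}
```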

This keeps the Zip file at a size similar to the original TAR files we started with, but with very low overhead for decompression. For compression we opted for gzip level 1, which doesn't compress as well but also doesn't take much longer to run:

Level   Size     Time
0       407 MB   16s
1       252 MB   21s
5       250 MB   39s
9       246 MB   1m48s

In this table, 0 is no compression and 9 is the best compression. Level 1 produces a file only 6 MB (~2%) larger than level 9 but runs 5x faster, and it's only about 30% slower than saving the uncompressed data.

The last challenge was updating an existing Zip file. It's easy to support appending new data, but if any of the data already in the file changes (which happens when internal nodes change in the SBT after a new dataset is inserted) there is no easy way to replace it in the Zip file. Worse, the Python zipfile module will add the new data while keeping the old entry around, leading to ginormous files over time¹. So, what to do?

I ended up accepting the complexity and complicating the ZipStorage implementation a bit, by keeping a buffer for new data. If it's a new file, or if the file already exists but there are no insertions, the buffer is ignored and everything works as before.

If the file exists and new data is inserted, it goes into the buffer first (where it might also replace a previous entry with the same name). In this case we also need to check the buffer when trying to load data, because it might exist only in the buffer and not in the original file.

Finally, when the ZipStorage is closed it needs to check whether there are new items in the buffer. If not, it is safe to just close the original file. If there are new items but none of them were present in the original file, the new data can simply be appended to the original file. The final case is when there are new items that were also in the original file: a new Zip file is created, all the content from the buffer and the original file is copied into it (prioritizing items from the buffer), and the original file is replaced by the new one.
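In condensed form, the decision at close time looks like this. This is a Rust-flavored sketch of the logic only; the actual ZipStorage is implemented on the Python side, so all names here are illustrative:

```rust
use std::collections::HashSet;

/// What to do with the archive when the storage is closed.
enum CloseAction {
    KeepOriginal,     // nothing buffered: just close the original file
    AppendNewEntries, // only brand-new names: append the buffer to the original
    RewriteArchive,   // some names collide: write a fresh Zip, buffer entries win
}

fn plan_close(buffered: &HashSet<String>, in_original: &HashSet<String>) -> CloseAction {
    if buffered.is_empty() {
        CloseAction::KeepOriginal
    } else if buffered.is_disjoint(in_original) {
        CloseAction::AppendNewEntries
    } else {
        CloseAction::RewriteArchive
    }
}
```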

Turns out this worked quite well! And so the PR was merged =]

The future

Zipped databases open the possibility of distributing extra data that might be useful for some kinds of analysis. One thing we are already considering is adding taxonomy information; let's see what else shows up.

Having the Nodegraph in Rust is also pretty exciting, because now we can change the internal representation to something that uses less memory (maybe RRR encoding?). More importantly, it can now also be used with WebAssembly, which opens many possibilities for running not only signature computation but also search and gather in the browser, since we now have all the pieces to build it.

Comments?


Footnotes

  1. The zipfile module does throw a UserWarning pointing out that duplicated files were inserted, which is useful during development but generally doesn't show up during regular usage...

Tags: bioinformatics rust