Mastodon icon github projects my shared stories on newsblur RSS feed icon Vogon Poetry, Computers and (some) biology

Scientific Rust #rust2019

The Rust community requested feedback last year for where the language should go in 2018, and now they are running it again for 2019. Last year I was too new in Rust to organize a blog post, but after an year using it I feel more comfortable writing this!

(Check my previous post about replacing the C++ core in sourmash with Rust for more details on how I spend my year in Rust).

What counts as "scientific Rust"?

Anything that involves doing science using computers counts as scientific programming. It includes from embedded software running on satellites to climate models running in supercomputers, from shell scripts running tools in a pipeline to data analysis using notebooks.

It also makes the discussion harder, because it's too general! But it is very important to keep in mind, because scientists are not your regular user: they are highly qualified in their field of expertise, and they are also pushing the boundaries of what we know (and this might need flexibility in their tools).

In this post I will be focusing more in two areas: array computing (what most people consider 'scientific programming' to be) and "data structures".

Array computing

This one is booming in the last couple of years due to industry interest in data sciences and deep learning (where they will talk about tensors instead of arrays), and has its roots in models running in supercomputers (a field where Fortran is still king!). Data tends to be quite regular (representable with matrices) and amenable to parallel processing.

A good example is the SciPy stack in Python, built on top of NumPy and SciPy. The adoption of the SciPy stack (both in academia and industry) is staggering, and many alternative implementations try to provide a NumPy-like API to try to capture its mindshare.

This is the compute-intensive side science (be it CPU or GPU/TPU), and also the kind of data that pushed CPU evolution and is still very important in defining policy in scientific computing funding (see countries competing for the largest supercomputers and measuring performance in floating point operations per second).

Data structures for efficient data representation

For data that is not so regular the situation is a bit different. I'll use bioinformatics as an example: the data we get out of nucleotide sequencers is usually represented by long strings (of ACGT), and algorithms will do a lot of string processing (be it building string-overlap graphs for assembly, or searching for substrings in large collections). This is only one example: there are many analyses that will work with other types of data, and most of them don't have a universal data representation as in the array computing case.

This is the memory-intensive science, and it's hard to measure performance in floating point operations because... most of the time you're not even using floating point numbers. It also suffers from limited data locality (which is almost a prerequisite for compute-intensive performance).

High performance core, interactive API

There is something common in both cases: while performance-intensive code is implemented in C/C++/Fortran, users usually interact with the API from other languages (especially Python or R) because it's faster to iterate and explore the data, and many of the tools already available in these languages are very helpful for these tasks (think Jupyter/pandas or RStudio/tidyverse). These languages are used to define the computation, but it is a lower-level core library that drives it (NumPy or Tensorflow follow this idea, for example).

How to make Rust better for science?

The biggest barrier to learning Rust is the ownership model, and while we can agree it is an important feature it is also quite daunting for newcomers, especially if they don't have previous programming experience and exposure to what bugs are being prevented. I don't see it being the first language we teach to scientists any time soon, because the majority of scientists are not system programmers, and have very different expectations for a programming language. That doesn't mean that they can't benefit from Rust!

Rust is already great for building the performance-intensive parts, and thanks to Cargo it is also a better alternative for sharing this code around, since they tend to get 'stuck' inside Python or R packages. And the 'easy' approach of vendoring C/C++ instead of having packages make it hard to keep track of changes and doesn't encourage reusable code.

And, of course, if this code is Rust instead of C/C++ it also means that Rust users can use them directly, without depending on the other languages. Seems like a good way to bootstrap a scientific community in Rust =]

What I would like to see in 2019?

An attribute proc-macro like #[wasm_bindgen] but for FFI

While FFI is an integral part of Rust goals (interoperability with C/C++), I have serious envy of the structure and tooling developed for WebAssembly! (Even more now that it works in stable too)

We already have #[no_mangle] and pub extern "C", but they are quite low-level. I would love to see something closer to what wasm-bindgen does, and define some traits (like IntoWasmAbi) to make it easier to pass more complex data types through the FFI.

I know it's not that simple, and there are different design restrictions than WebAssembly to take into account... The point here is not having the perfect solution for all use cases, but something that serves as an entry point and helps to deal with the complexity while you're still figuring out all the quirks and traps of FFI. You can still fallback and have more control using the lower-level options when the need rises.

More -sys and Rust-like crates for interoperability with the larger ecosystems

There are new projects bringing more interoperability to dataframes and tensors. While this ship has already sailed and they are implemented in C/C++, it would be great to be a first-class citizen, and not reinvent the wheel. (Note: the arrow project already have pretty good Rust support!)

In my own corner (bioinformatics), the Rust-bio community is doing a great job of wrapping useful libraries in C/C++ and exposing them to Rust (and also a shout-out to 10X Genomics for doing this work for other tools while also contributing to Rust-bio!).

More (bioinformatics) tools using Rust!

We already have great examples like finch and yacrd, since Rust is great for single binary distribution of programs. And with bioinformatics focusing so much in independent tools chained together in workflows, I think we can start convincing people to try it out =]

A place to find other scientists?

Another idea is to draw inspiration from rOpenSci and have a Rust equivalent, where people can get feedback about their projects and how to better integrate it with other crates. This is quite close to the working group idea, but I think it would serve more as a gateway to other groups, more focused on developing entry-level docs and bringing more scientists to the community.

Final words

In the end, I feel like this post ended up turning into my 'wishful TODO list' for 2019, but I would love to find more people sharing these goals (or willing to take any of this and just run with it, I do have a PhD to finish! =P)


Tags: bioinformatics rust