<h1>Gabbleblotchits: Vogon Poetry, Computers and (some) biology</h1>
<p><a href="https://blog.luizirber.org/">https://blog.luizirber.org/</a></p>
<h1>MinHashing all the things: a quick analysis of MAG search results</h1>
<p>luizirber, 2020-07-24</p>
<p><a href="https://blog.luizirber.org/2020/07/22/mag-search/">Last time</a> I described a way to search MAGs in metagenomes,
and teased some interesting results.
Let's dig into some of them!</p>
<p>I prepared a <a href="https://github.com/luizirber/2020-07-22-mag-search/">repo</a> with the data and a notebook with the analysis I did in this
post.
You can also follow along in <a href="https://mybinder.org">Binder</a>,
as well as do your own analysis! <a href="https://mybinder.org/v2/gh/luizirber/2020-07-22-mag-search/master?filepath=index.ipynb"><img alt="Binder" src="https://mybinder.org/badge_logo.svg"></a></p>
<h2>Preparing some metadata</h2>
<p>The supplemental materials for <a href="https://www.nature.com/articles/sdata2017203">Tully et al</a> include more details about each MAG,
so let's download them.
I prepared a small snakemake workflow to do that,
and also to download information about the SRA datasets from Tara Oceans
(the dataset used to generate the MAGs)
and from <a href="https://www.nature.com/articles/s41564-017-0012-7">Parks et al</a>,
which also generated MAGs from Tara Oceans.
Feel free to include them in your analysis,
but I was curious to find matches in other metagenomes.</p>
<h2>Loading the data</h2>
<p>The results from the MAG search are in a CSV file,
with a column for the MAG name,
another for the SRA dataset ID for the metagenome and a third column for the
containment of the MAG in the metagenome.
I also fixed the names to make it easier to query,
and finally removed the Tara and Parks metagenomes
(because we already knew they contained these MAGs).</p>
<p>This left us with 23,644 SRA metagenomes with matches,
covering 2,291 of the 2,631 MAGs.
These are results for a fairly low containment (10%),
so if we limit to MAGs with more than 50% containment we still have 1,407 MAGs and 2,938 metagenomes left.</p>
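<p>The counting above can be sketched with the standard library alone.
This is a minimal example on made-up rows:
the column names (<code>mag</code>, <code>metagenome</code>, <code>containment</code>),
the second MAG name and the run IDs are all assumptions for illustration,
not the contents of the real CSV in the repo.</p>

```python
import csv
import io

# Made-up sample in the three-column layout described above.
# Column names, run IDs and the second MAG name are hypothetical.
SAMPLE = """mag,metagenome,containment
TOBG_NP-110,SRR0000001,0.99
TOBG_NP-110,SRR0000002,0.55
TOBG_NP-110,SRR0000003,0.12
TOBG_XX-000,SRR0000002,0.31
TOBG_XX-000,SRR0000004,0.72
"""


def summarize(rows, threshold):
    """Count distinct MAGs and metagenomes with containment above threshold."""
    mags, metagenomes = set(), set()
    for row in rows:
        if float(row["containment"]) > threshold:
            mags.add(row["mag"])
            metagenomes.add(row["metagenome"])
    return len(mags), len(metagenomes)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
low = summarize(rows, 0.10)   # everything the search reported
high = summarize(rows, 0.50)  # only the stronger matches
print(low, high)
```

<p>Raising the threshold only ever shrinks both counts,
which is why it is a cheap first filter before picking candidates by hand.</p>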
<h2>TOBG_NP-110, I choose you!</h2>
<p>That's still a lot,
so I decided to pick a candidate to check before doing any large scale analysis.
I chose TOBG_NP-110 because there were many matches above 50% containment,
and even some at 99%.
Turns out it is also an Archaeal MAG that failed to be classified further than Phylum level (Euryarchaeota),
with a completeness score of 70.3% in the original analysis.
Oh, let me dissect the name a bit:
TOBG is "Tara Ocean Binned Genome" and "NP" is North Pacific.</p>
<p>And so I went checking where the other metagenome matches came from.
5 of the 12 matches above 50% containment come from one study,
<a href="https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP044185">SRP044185</a>,
with samples collected from a column of water in a station in Manzanillo, Mexico.
Another 3 matches come from
<a href="https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP003331">SRP003331</a>,
in the South Pacific ocean (in northern Chile).
Another match,
<a href="https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR3256923">ERR3256923</a>,
also comes from the South Pacific.</p>
<h2>What else can I do?</h2>
<p>I'm curious to follow <a href="http://merenlab.org/data/refining-mags/">the refining MAGs</a> tutorial from the Meren Lab and see where this goes,
and especially in using <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02066-4"><code>spacegraphcats</code></a>
to extract neighborhoods from the MAG and better evaluate what is missing or if there are other interesting bits that
the MAG generation methods ended up discarding.</p>
<p>So, for now that's it.
But more importantly,
I didn't want to sit on these results until there is a publication in press,
especially when there are people that can do so much more with these,
so I decided to make it all public.
It is way more exciting to see this being used to know more about these
organisms than me being the only one with access to this info.</p>
<p>And yesterday I saw <a href="https://twitter.com/DrJonathanRosa/status/1286381346605027328">this tweet</a> by
<a href="https://twitter.com/DrJonathanRosa/status/1286381346605027328">@DrJonathanRosa</a>,
saying:</p>
<blockquote>
<p>I don’t know who told students that the goal of research is to find some
previously undiscovered research topic, claim individual ownership over it,
& fiercely protect it from theft, but that almost sounds like, well,
colonialism, capitalism, & policing </p>
</blockquote>
<p>Amen.</p>
<h2>I want to run this with my data!</h2>
<p>Next time. But we will have a discussion about scientific infrastructure and
sustainability first =]</p>
<h2>Comments?</h2>
<ul>
<li><a href="https://twitter.com/luizirber/status/1286700888111738880">Thread on Twitter</a></li>
</ul>
<h1>MinHashing all the things: searching for MAGs in the SRA</h1>
<p>luizirber, 2020-07-22</p>
<p>(or: Top-down and bottom-up approaches for working around sourmash limitations)</p>
<p>In the last month I updated <a href="https://wort.oxli.org">wort</a>,
the system I developed for computing sourmash signatures for public genomic databases,
and started calculating signatures
for the <a href="https://www.ncbi.nlm.nih.gov/sra/?term=%22METAGENOMIC%22%5Bsource%5D+NOT+amplicon%5BAll+Fields%5D)">metagenomes</a> in the <a href="https://www.ncbi.nlm.nih.gov/sra/">Sequence Read Archive</a>.
This is a more challenging subset than the <a href="https://blog.luizirber.org/2016/12/28/soursigs-arch-1/">microbial datasets</a> I was doing previously,
since there are around 534k datasets from metagenomic sources in the SRA,
totalling 447 TB of data.
Another problem is the size of the datasets,
ranging from a couple of MB to 170 GB.
Turns out that the workers I have in <code>wort</code> are very good for small-ish datasets,
but I still need to figure out how to pull large datasets faster from the SRA,
because the large ones take forever to process...</p>
<p>The good news is that I managed to calculate signatures for almost 402k of them
<sup id=sf-mag-search-1-back><a href=#sf-mag-search-1 class=simple-footnote title="pulling about 100 TB in 3 days, which was pretty fun to see because I ended up DDoSing myself because I couldn't download the generated sigs fast enough from the S3 bucket where they are temporarily stored =P">1</a></sup>,
which already let us work on some pretty exciting problems =]</p>
<h2>Looking for MAGs in the SRA</h2>
<p>Metagenome-assembled genomes are essential for studying organisms that are hard to isolate and culture in the lab,
especially for environmental metagenomes.
<a href="https://www.nature.com/articles/sdata2017203">Tully et al</a> published 2,631 draft MAGs from 234 samples collected during the Tara Oceans expedition,
and I wanted to check if they can also be found in other metagenomes besides the Tara Oceans ones.
The idea is to extract the reads from these other matches and evaluate how the MAG can be improved,
or at least evaluate what is missing in it.
I chose to use environmental samples under the assumption they are easier to deposit on the SRA and have public access,
but there are many human gut microbiomes in the SRA and this MAG search would work just fine with those too.</p>
<p>Moreover,
I want to search for containment,
and not similarity.
The distinction is subtle,
but similarity takes into account the sizes of both datasets
(well, the size of the union of all elements in both datasets),
while containment only considers the size of the query.
This is relevant because the similarity of a MAG and a metagenome is going to be very small (and is symmetrical),
but the containment of the MAG in the metagenome might be large
(and is asymmetrical, since the containment of the metagenome in the MAG is likely very small because the metagenome is so much larger than the MAG).</p>
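<p>Here is a small sketch of that distinction using plain Python sets as stand-ins for the hashes in a signature.
The numbers are invented,
and real sourmash signatures are Scaled MinHash sketches rather than raw sets,
but the arithmetic is the same idea.</p>

```python
def jaccard_similarity(a, b):
    """Shared elements divided by the size of the union (symmetrical)."""
    return len(a & b) / len(a | b)


def containment(query, target):
    """Shared elements divided by the size of the query (asymmetrical)."""
    return len(query & target) / len(query)


# Invented data: a small "MAG" with 100 hashes,
# and a much larger "metagenome" that holds 90 of them.
mag = set(range(100))
metagenome = set(range(10, 10_000))

print(jaccard_similarity(mag, metagenome))  # tiny: the union is dominated by the metagenome
print(containment(mag, metagenome))         # large: most of the MAG is in the metagenome
print(containment(metagenome, mag))         # tiny again: the MAG covers little of the metagenome
```

<p>Swapping the arguments of <code>jaccard_similarity</code> changes nothing,
while swapping them in <code>containment</code> changes the answer by two orders of magnitude here.</p>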
<h2>The computational challenge: indexing and searching</h2>
<p>sourmash signatures are a small fraction of the original size of the datasets,
but when you have hundreds of thousands of them the collection ends up being pretty large too.
More precisely, 825 GB large.
That is way bigger than any index I ever built for sourmash,
and it would also have pretty different characteristics from what we usually do:
we tend to index genomes and run <code>search</code> (to find similar genomes) or <code>gather</code>
(to decompose metagenomes into their constituent genomes),
but for this MAG search I want to find which metagenomes have my MAG query above a certain containment threshold.
Sort of a <code>sourmash search --containment</code>,
but over hundreds of thousands of metagenome signatures.
The main benefit of an SBT index in this context is to avoid checking all signatures because we can prune the search early,
but currently SBT indices need to be totally loaded in memory during <code>sourmash index</code>.
I will have to fix this in the medium term,
but I want a solution NOW! =]</p>
<p><a href="https://github.com/dib-lab/sourmash/releases/tag/v3.4.0">sourmash 3.4.0</a> introduced <code>--from-file</code> in many commands,
and since I can't build an index I decided to use it to load signatures for the metagenomes.
But... <code>sourmash search</code> tries to load all signatures in memory,
and while I might be able to find a cluster machine with hundreds of GBs of RAM available,
that's not very practical.</p>
<p>So, what to do?</p>
<h2>The top-down solution: a snakemake workflow</h2>
<p>I don't want to modify sourmash now,
so why not make a workflow and use snakemake to run one <code>sourmash search --containment</code> for each metagenome?
That means 402k tasks,
but at least I can use <a href="https://snakemake.readthedocs.io/en/stable/executing/cli.html#dealing-with-very-large-workflows">batches</a> and <a href="https://slurm.schedmd.com/job_array.html">SLURM job arrays</a> to submit reasonably-sized jobs to our HPC queue.
After running all batches I summarized results for each task,
and it worked well for a proof of concept.</p>
<p>But... it was still pretty resource intensive:
each task was running one query MAG against one metagenome,
and so each task had to pay the overhead of starting the Python interpreter and parsing the query signature,
which is exactly the same for all tasks.
Extending it to support multiple queries to the same metagenome would involve duplicating tasks,
and 402k metagenomes times 2,631 MAGs is...
a very large number of jobs.</p>
<p>I also wanted to avoid clogging the job queues,
which is not very nice to the other researchers using the cluster.
This limited how many batches I could run in parallel...</p>
<h2>The bottom-up solution: Rust to the rescue!</h2>
<p>Thinking a bit more about the problem,
here is another solution:
what if we load all the MAGs in memory
(as they will be queried frequently and are not that large),
and then for each metagenome signature load it,
perform all MAG queries,
and then unload the metagenome signature from memory?
This way we can control memory consumption
(it's going to be proportional to all the MAG sizes plus the size of the largest metagenome)
and can also efficiently parallelize the code because each task/metagenome is independent
and the MAG signatures can be shared freely (since they are read-only).</p>
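<p>To make the shape of this idea concrete,
here is a rough Python sketch.
It is not the Rust program:
signatures are plain sets,
<code>load_metagenome</code> is a hypothetical loader backed by a dict,
and a thread pool only illustrates the structure
(CPython threads will not give the speedup that rayon does).</p>

```python
from concurrent.futures import ThreadPoolExecutor


def containment(query, target):
    return len(query & target) / len(query)


def search_one(name, load_metagenome, queries, threshold):
    """Load one metagenome, run every query against it, then let it go."""
    hashes = load_metagenome(name)  # only this task holds the metagenome
    matches = []
    for query_name, query_hashes in queries.items():
        score = containment(query_hashes, hashes)
        if score >= threshold:
            matches.append((query_name, name, score))
    return matches  # `hashes` is released when the function returns


def search_all(names, load_metagenome, queries, threshold, workers=4):
    """Queries stay in memory and are shared read-only by all tasks."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tasks = pool.map(
            lambda name: search_one(name, load_metagenome, queries, threshold),
            names,
        )
        for matches in tasks:
            results.extend(matches)
    return results


# Invented data standing in for signatures on disk.
queries = {"mag_a": {1, 2, 3, 4}, "mag_b": {10, 11}}
store = {
    "meta_1": set(range(6)),
    "meta_2": {3, 4, 10, 11, 12},
    "meta_3": {100},
}
results = search_all(list(store), store.__getitem__, queries, threshold=0.5)
print(results)
```

<p>Peak memory is bounded by the queries plus whichever metagenomes are being processed at that moment,
so the number of workers is the knob that trades memory for throughput.</p>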
<p>This could be done with the sourmash Python API plus <code>multiprocessing</code> or some
other parallelization approach (maybe dask?),
but turns out that everything we need comes from the Rust API.
Why not enjoy a bit of the <a href="https://doc.rust-lang.org/stable/book/ch16-00-concurrency.html">fearless concurrency</a> that is one of the major Rust goals?</p>
<p><a href="https://github.com/luizirber/phd/blob/aa1ed9eb33ba71fdf9b3f2c92931701be6df00cd/experiments/wort/sra_search/src/main.rs">The whole code</a> ended up being 176 lines long,
including command line parsing using <a href="https://docs.rs/structopt/latest/structopt/">structopt</a> and parallelizing the search using <a href="https://docs.rs/rayon/latest/rayon/">rayon</a>
and a <a href="https://doc.rust-lang.org/std/sync/mpsc/fn.channel.html">multiple-producer, single-consumer channel</a> to write results to an output
(either the terminal or a file).
This version took 11 hours to run,
using less than 5GB of RAM and 32 processors,
to search 2k MAGs against 402k metagenomes.
And, bonus! It can also be parallelized again if you have multiple machines,
so it potentially takes a bit more than an hour to run if you can allocate 10 batch jobs,
with each batch processing 1/10 of the metagenome signatures.</p>
<h2>So, is bottom-up always the better choice?</h2>
<p>I would like to answer "Yes!",
but bioinformatics software tends to be organized as command line interfaces,
not as libraries.
Libraries also tend to have even less documentation than CLIs,
and this particular case is not a fair comparison because...
Well, I wrote most of the library,
and the Rust API is not that well documented for general use.</p>
<p>But I'm pretty happy with how the sourmash CLI is viable both for the top-down approach
(and whatever workflow software you want to use) as well as how the Rust core worked for the bottom-up approach.
I think the most important thing is having the option to choose which way to go,
especially because now I can use the bottom-up approach to make the sourmash CLI
and Python API better.
The top-down approach is also way more accessible in general,
because you can pick your favorite workflow software and use all the tricks you're comfortable with.</p>
<h2>But, what about the results?!?!?!</h2>
<p>Next time. But I did find MAGs with over 90% containment in very different locations,
which is pretty exciting!</p>
<p>I also need to find a better way of distributing all these signatures,
because storing 4 TB of data in S3 is somewhat cheap,
but transferring data is very expensive.
All signatures are also available on IPFS,
but I need more people to host them and share.
Get in contact if you're interested in helping =]</p>
<p>And while I'm asking for help,
any tips on pulling data faster from the SRA are greatly appreciated!</p>
<h2>Comments?</h2>
<ul>
<li><a href="https://twitter.com/luizirber/status/1285782732790849537">Thread on Twitter</a></li>
</ul><section class=footnotes><hr><h2>Footnotes</h2><ol><li id=sf-mag-search-1><p>pulling about 100 TB in 3 days, which was pretty fun to see because I
ended up DDoSing myself because I couldn't download the generated sigs fast enough
from the S3 bucket where they are temporarily stored =P <a href=#sf-mag-search-1-back class=simple-footnote-back>↩</a></p></li></ol></section>
<h1>Putting it all together</h1>
<p>luizirber, 2020-05-11</p>
<p>sourmash <a href="https://twitter.com/ctitusbrown/status/1257418140729868291">3.3</a> was released last week,
and it is the first version supporting <a href="http://ivory.idyll.org/blog/2020-sourmash-databases-as-zip-files.html">zipped databases</a>.
Here is my personal account of how that came to be =]</p>
<h2>What is a sourmash database?</h2>
<p>A sourmash database contains signatures (typically Scaled MinHash sketches built from genomic datasets) and
an index for allowing efficient similarity and containment queries over these signatures.
The two types of index are SBT,
a hierarchical index that uses less memory by keeping data on disk,
and LCA,
an inverted index that uses more memory but is potentially faster.
Indices are described as JSON files,
with LCA storing all the data in one JSON file and SBT opting for saving a description of the index structure in JSON,
and all the data into a hidden directory with many files.</p>
<p>We distribute some <a href="https://sourmash.readthedocs.io/en/v3.3.0/databases.html">prepared databases</a> (with SBT indices) for Genbank and RefSeq as compressed TAR files.
The compressed file is ~8GB,
but after decompressing it turns into almost 200k files in a hidden directory,
using about 40 GB of disk space.</p>
<h2>Can we avoid generating so many hidden files?</h2>
<p>The initial issue in this saga is <a href="https://github.com/dib-lab/sourmash/issues/490">dib-lab/sourmash#490</a>,
and the idea was to take the existing support for multiple data storages
(hidden dir,
TAR files,
IPFS and Redis) and save the index description in the storage,
allowing loading everything from the storage.
Since we already had the databases as TAR files,
the first test tried to use them but it didn't take long to see it was a doomed approach:
TAR files are terrible for random access
(or at least the <code>tarfile</code> module in Python is).</p>
<p>Zip files showed up as a better alternative,
and it helps that Python has the <code>zipfile</code> module already available in the
standard library.
Initial tests were promising,
and led to <a href="https://github.com/dib-lab/sourmash/pull/648">dib-lab/sourmash#648</a>.
The main issue was performance:
compressing and decompressing was slow,
but there was also another limitation...</p>
<h2>Loading Nodegraphs from a memory buffer</h2>
<p>Another challenge was efficiently loading the data from a storage.
The two core methods in a storage are <code>save(location, content)</code>,
where <code>content</code> is a bytes buffer,
and <code>load(location)</code>,
which returns a bytes buffer that was previously saved.
This didn't interact well with the <code>khmer</code> <code>Nodegraph</code>s (the Bloom Filter we use for SBTs),
since <code>khmer</code> only loads data from files,
not from memory buffers.
We ended up doing a temporary file dance,
which made things slower for the default storage (hidden dir),
where it could have been optimized to work directly with files,
and involved interacting with the filesystem for the other storages
(IPFS and Redis could be pulling data directly from the network,
for example).</p>
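<p>A minimal sketch of that storage interface and the temporary file dance,
under some assumptions:
<code>MemoryStorage</code> and <code>load_from_path</code> are hypothetical names made up for this example,
with <code>load_from_path</code> standing in for any loader that only accepts a file path
(as <code>khmer</code> did).</p>

```python
import os
import tempfile


class MemoryStorage:
    """Toy storage exposing the two core methods: save and load."""

    def __init__(self):
        self._data = {}

    def save(self, location, content):
        self._data[location] = bytes(content)
        return location

    def load(self, location):
        return self._data[location]


def load_from_path(path):
    """Stand-in for a loader that can only read from the filesystem."""
    with open(path, "rb") as handle:
        return handle.read()


def load_via_tempfile(storage, location, file_loader):
    """Bridge a bytes-based storage to a path-only loader."""
    content = storage.load(location)
    # Write the buffer out just so the loader can read it back in.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(content)
    try:
        return file_loader(tmp.name)
    finally:
        os.unlink(tmp.name)


storage = MemoryStorage()
storage.save("internal.0", b"\x01\x02\x03")
roundtrip = load_via_tempfile(storage, "internal.0", load_from_path)
print(roundtrip)
```

<p>Every load pays for a write and a read on the filesystem,
even when the data was already sitting in memory,
which is the cost a buffer-aware loader avoids.</p>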
<p>This one could be fixed in <code>khmer</code> by exposing C++ stream methods,
and I did a <a href="https://github.com/luizirber/2018-cython-streams">small PoC</a> to test the idea.
While doable,
this is something that was happening while the sourmash conversion to Rust was underway,
and depending on <code>khmer</code> was a problem for my Webassembly aspirations...
so,
having the Nodegraph <a href="https://github.com/luizirber/sourmash-rust/pull/15">implemented in Rust</a> seemed like a better direction.
That has actually been quietly living in the sourmash codebase for quite some time,
but it was never exposed to Python (and it was also lacking more extensive
tests).</p>
<p>After the release of sourmash 3 and the replacement of the C++ implementation with the Rust one,
all the pieces for exposing the Nodegraph were in place,
so <a href="https://github.com/dib-lab/sourmash/pull/799">dib-lab/sourmash#799</a> was the next step.
It wasn't a priority at first because other optimizations
(that were released in 3.1 and 3.2)
were more important,
but then it was time to check how this would perform.
And...</p>
<h2>Your Rust code is not so fast, huh?</h2>
<p>Turns out that my Nodegraph loading code was way slower than <code>khmer</code>.
The Nodegraph binary format <a href="https://khmer.readthedocs.io/en/latest/dev/binary-file-formats.html#nodegraph">is well documented</a>,
and doing an initial implementation wasn't so hard by using the <code>byteorder</code> crate
to read binary data with the right endianness,
and then setting the appropriate bits in the internal <code>fixedbitset</code> in memory.
But the khmer code doesn't parse bit by bit:
it <a href="https://github.com/dib-lab/khmer/blob/fe0ce116456b296c522ba24294a0cabce3b2648b/src/oxli/storage.cc#L233">reads</a> a long <code>char</code> buffer directly,
and that is many orders of magnitude faster than setting bit by bit.</p>
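<p>The difference between the two strategies can be sketched in Python,
using an arbitrary-precision integer as a toy bitset.
The byte and bit ordering here (little-endian, least significant bit first) is an assumption picked for the example,
not a description of the khmer format or of <code>fixedbitset</code> internals.</p>

```python
def load_bit_by_bit(data):
    """Inspect every bit of the input and set it individually."""
    bits = 0
    for byte_index, byte in enumerate(data):
        for bit in range(8):
            if byte & (1 << bit):
                bits |= 1 << (byte_index * 8 + bit)
    return bits


def load_bulk(data):
    """Reinterpret the whole buffer in a single call."""
    return int.from_bytes(data, "little")


def is_set(bits, position):
    return bool((bits >> position) & 1)


data = bytes([0b00000101, 0b10000000, 0b11111111])
slow = load_bit_by_bit(data)
fast = load_bulk(data)
print(slow == fast, is_set(fast, 0), is_set(fast, 1), is_set(fast, 15))
```

<p>Both functions build the same bitset,
but the first does eight checks per input byte while the second hands the whole buffer over in one go.</p>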
<p>And there was no way to replicate this behavior directly with <code>fixedbitset</code>.
At this point I could either do the bit-indexing into a large buffer myself
and lose all the useful methods that <code>fixedbitset</code> provides,
or try to find a way to support loading the data directly into <code>fixedbitset</code> and
open a PR.</p>
<p><a href="https://github.com/petgraph/fixedbitset/pull/42">I chose the PR</a> (and even got #42! =]).</p>
<p>It was more straightforward than I expected,
but it did expose the internal representation of <code>fixedbitset</code>,
so I was a bit nervous it wasn't going to be merged.
But <a href="https://github.com/bluss">bluss</a> was super nice,
and his suggestions made the PR way better!
This <a href="https://github.com/dib-lab/sourmash/blob/9a695fb03b99c060bb8d1384ab78bb3797c5eb65/src/core/src/sketch/nodegraph.rs#L235L261">simplified</a> the final <code>Nodegraph</code> code,
and actually was more correct
(because I was messing up a few corner cases when doing the bit-by-bit parsing before).
Win-win!</p>
<h2>Nodegraphs are kind of large, can we compress them?</h2>
<p>Being able to save and load <code>Nodegraph</code>s in Rust allowed using memory buffers,
but also opened the way to support other operations not supported in khmer <code>Nodegraph</code>s.
One example is loading/saving compressed files,
which is supported for <code>Countgraph</code>
(another khmer data structure,
based on Count-Min Sketch)
but not in <code>Nodegraph</code>.</p>
<p>If only there was an easy way to support working with compressed files...</p>
<p>Oh wait, there is! <a href="https://github.com/luizirber/niffler">niffler</a> is a crate that I made with <a href="https://twitter.com/pierre_marijon">Pierre Marijon</a> based
on some functionality I saw in one of his projects,
and we iterated a bit on the API and documented everything to make it more
useful for a larger audience.
<code>niffler</code> tries to be as transparent as possible,
with very little boilerplate when using it but with useful features nonetheless
(like auto detection of the compression format).
If you want to know more about the motivation and how it happened,
check this <a href="https://twitter.com/luizirber/status/1253445504622424064">Twitter thread</a>.</p>
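<p>To give an idea of what auto detection of the compression format means,
here is a sketch in Python using only the standard library.
It is not how <code>niffler</code> is implemented or what its API looks like;
<code>read_maybe_compressed</code> is a name invented for this example,
and it simply checks the well-known magic bytes at the start of gzip, bzip2 and xz data.</p>

```python
import bz2
import gzip
import lzma

# Leading "magic" bytes of each format, paired with a matching decompressor.
FORMATS = [
    (b"\x1f\x8b", gzip.decompress),
    (b"BZh", bz2.decompress),
    (b"\xfd7zXZ\x00", lzma.decompress),
]


def read_maybe_compressed(data):
    """Return the decompressed payload, or the data itself if no format matches."""
    for magic, decompress in FORMATS:
        if data.startswith(magic):
            return decompress(data)
    return data


payload = b"ACGT" * 100
inputs = [
    payload,
    gzip.compress(payload),
    bz2.compress(payload),
    lzma.compress(payload),
]
outputs = [read_maybe_compressed(item) for item in inputs]
print(all(item == payload for item in outputs))
```

<p>The caller never says which format it is passing in,
which is the kind of low-boilerplate interface described above.</p>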
<p>The cool thing is that adding compressed files support in <code>sourmash</code> was mostly
<a href="https://github.com/dib-lab/sourmash/pull/799/files#diff-313a7ff0fdb14f408a64b3f010f46f65R220">one-line changes</a> for loading
(and <a href="https://github.com/dib-lab/sourmash/pull/648/files#diff-d80ae1dd777d07300d7b6066b3318397L249-R273">a bit more</a> for saving,
but mostly because converting compression levels could use some refactoring).</p>
<h2>Putting it all together: zipped SBT indices</h2>
<p>With all these other pieces in place,
it's time to go back to <a href="https://github.com/dib-lab/sourmash/pull/648">dib-lab/sourmash#648</a>.
Compressing and decompressing with the Python <code>zipfile</code> module is slow,
but Zip files can also be used just for storage,
handing back the data without extracting it.
And since we have compression/decompression implemented in Rust with <code>niffler</code>,
that's what the zipped sourmash databases are:
data is loaded and saved into the Zip file without using the Python module
compression/decompression,
and all the work is done before (or after) in the Rust side.</p>
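<p>A sketch of the same trick in pure Python:
the Zip file only stores entries (<code>ZIP_STORED</code>),
and each entry is compressed before it goes in.
In sourmash that compression happens on the Rust side;
here <code>gzip</code> from the standard library stands in for it,
and <code>save_entry</code> and <code>load_entry</code> are hypothetical helper names.</p>

```python
import gzip
import io
import zipfile


def save_entry(zf, name, content):
    """Compress first, then store the bytes in the Zip file untouched."""
    zf.writestr(name, gzip.compress(content, compresslevel=1))


def load_entry(zf, name):
    """Read the stored bytes back and decompress them outside of zipfile."""
    return gzip.decompress(zf.read(name))


original = b"\x00" * 4096  # invented payload, easy to compress

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as zf:
    save_entry(zf, "internal.0", original)

with zipfile.ZipFile(buffer) as zf:
    info = zf.getinfo("internal.0")
    restored = load_entry(zf, "internal.0")

print(info.compress_type == zipfile.ZIP_STORED, info.file_size, len(original))
```

<p>From the point of view of <code>zipfile</code> the entry is just opaque bytes,
so it does no compression work of its own,
while the archive still ends up much smaller than the raw data.</p>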
<p>This allows keeping the Zip file with similar sizes to the original TAR files we started with,
but with very low overhead for decompression.
For compression we opted for using Gzip level 1,
which doesn't compress perfectly but also doesn't take much longer to run:</p>
<table>
<thead>
<tr>
<th>Level</th>
<th>Size</th>
<th>Time</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>407 MB</td>
<td>16s</td>
</tr>
<tr>
<td>1</td>
<td>252 MB</td>
<td>21s</td>
</tr>
<tr>
<td>5</td>
<td>250 MB</td>
<td>39s</td>
</tr>
<tr>
<td>9</td>
<td>246 MB</td>
<td>1m48s</td>
</tr>
</tbody>
</table>
<p>In this table, <code>0</code> is without compression,
while <code>9</code> is the best compression.
Level <code>1</code> produces a file only 6 MB (~2%) larger than level <code>9</code>
but runs 5x faster,
and it's only about 30% slower than saving the uncompressed data.</p>
<p>The last challenge was updating an existing Zip file.
It's easy to support appending new data,
but if any of the already existing data in the file changes
(which happens when internal nodes change in the SBT,
after a new dataset is inserted) then there is no easy way to replace the data in the Zip file.
Worse,
the Python <code>zipfile</code> will add the new data while keeping the old one around,
leading to ginormous files over time<sup id=sf-sbt-zip-1-back><a href=#sf-sbt-zip-1 class=simple-footnote title="The zipfile module does throw a UserWarning pointing out that duplicated files were inserted, which is useful during development but generally doesn't show during regular usage...">1</a></sup>.
So, what to do?</p>
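<p>The duplication is easy to reproduce with a few lines of Python.
This is a small demonstration on an in-memory Zip file with an invented entry name,
not code from sourmash.</p>

```python
import io
import warnings
import zipfile

buffer = io.BytesIO()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("internal.0", b"old contents")
        zf.writestr("internal.0", b"new contents")  # same name, written again

with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
    latest = zf.read("internal.0")

print(names)
print(latest)
print([str(w.message) for w in caught])
```

<p>Both copies stay in the archive:
the name is listed twice and the file keeps growing with every rewrite,
even though reading by name only ever returns one of them.</p>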
<p>I ended up opting for dealing with the complexity and <a href="https://github.com/dib-lab/sourmash/pull/648/files#diff-a99b088adcc872e1b408fbdcca20ebebR110-R248">complicating the ZipStorage</a> implementation a bit,
by keeping a buffer for new data.
If it's a new file or it already exists but there are no insertions
the buffer is ignored and all works as before.</p>
<p>If the file exists and new data is inserted,
then it is first stored in the buffer
(where it might also replace a previous entry with the same name).
In this case we also need to check the buffer when trying to load some data
(because it might exist only in the buffer,
and not in the original file).</p>
<p>Finally,
when the <code>ZipStorage</code> is closed it needs to verify if there are new items in the buffer.
If not,
it is safe just to close the original file.
If there are new items but they were not present in the original file,
then we can append the new data to the original file.
The final case is if there are new items that were also in the original file,
and in this case a new Zip file is created and all the content from buffer and
original file are copied to it,
prioritizing items from the buffer.
The original file is replaced by the new Zip file.</p>
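<p>Here is a simplified sketch of that buffering logic.
It is not the <code>ZipStorage</code> code from the PR:
the class name <code>BufferedZipStorage</code>,
the return values of <code>close</code>
and the decision to always go through the buffer (even for a brand new file) are simplifications made for this example.</p>

```python
import os
import tempfile
import zipfile


class BufferedZipStorage:
    def __init__(self, path):
        self.path = path
        self.buffer = {}  # new or changed entries, keyed by name
        self.existing = set()
        if os.path.exists(path):
            with zipfile.ZipFile(path) as zf:
                self.existing = set(zf.namelist())

    def save(self, name, content):
        # A later save with the same name replaces the buffered entry.
        self.buffer[name] = content

    def load(self, name):
        # The buffer wins: it may hold data that is not in the file yet.
        if name in self.buffer:
            return self.buffer[name]
        with zipfile.ZipFile(self.path) as zf:
            return zf.read(name)

    def close(self):
        if not self.buffer:
            return "unchanged"
        if not self.existing.intersection(self.buffer):
            # Only new names: appending is safe.
            with zipfile.ZipFile(self.path, "a") as zf:
                for name, content in self.buffer.items():
                    zf.writestr(name, content)
            outcome = "appended"
        else:
            # Some names changed: copy everything into a fresh Zip file,
            # prioritizing the buffer, then replace the original.
            tmp_path = self.path + ".tmp"
            with zipfile.ZipFile(self.path) as old, \
                    zipfile.ZipFile(tmp_path, "w") as new:
                for name in old.namelist():
                    if name not in self.buffer:
                        new.writestr(name, old.read(name))
                for name, content in self.buffer.items():
                    new.writestr(name, content)
            os.replace(tmp_path, self.path)
            outcome = "rewritten"
        self.existing.update(self.buffer)
        self.buffer.clear()
        return outcome


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "index.zip")

    storage = BufferedZipStorage(path)
    storage.save("leaf.0", b"signature data")
    storage.save("internal.0", b"version 1")
    first_close = storage.close()

    storage = BufferedZipStorage(path)
    storage.save("internal.0", b"version 2")  # an internal node changed
    storage.save("leaf.1", b"more data")
    buffered_read = storage.load("internal.0")
    second_close = storage.close()

    with zipfile.ZipFile(path) as zf:
        final_names = sorted(zf.namelist())
        final_internal = zf.read("internal.0")

print(first_close, second_close, final_names, final_internal)
```

<p>The expensive full copy only happens when an existing entry actually changed,
so plain insertions of new data stay cheap.</p>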
<p>Turns out this worked quite well! And so the PR was merged =]</p>
<h2>The future</h2>
<p>Zipped databases open the possibility of distributing extra data that might be
useful for some kinds of analysis.
One thing we are already considering is adding <a href="https://github.com/dib-lab/sourmash/issues/969">taxonomy information</a>,
let's see what else shows up.</p>
<p>Having <code>Nodegraph</code> in Rust is also pretty exciting,
because now we can change the internal representation for something that uses
less memory (maybe using <a href="https://alexbowe.com/rrr/">RRR encoding</a>?),
but more importantly:
now they can also be used with Webassembly,
which opens many possibilities for running not only <a href="https://blog.luizirber.org/2018/08/27/sourmash-wasm/">signature computation</a> but
also <code>search</code> and <code>gather</code> in the browser,
since now we have all the pieces to build it.</p>
<h2>Comments?</h2>
<ul>
<li><a href="https://twitter.com/luizirber/status/1260031886744621059">Thread on Twitter</a></li>
</ul><section class=footnotes><hr><h2>Footnotes</h2><ol><li id=sf-sbt-zip-1><p>The <code>zipfile</code> module does throw a <code>UserWarning</code> pointing out that duplicated files were inserted,
which is useful during development but generally doesn't show during regular usage... <a href=#sf-sbt-zip-1-back class=simple-footnote-back>↩</a></p></li></ol></section>
<h1>Oxidizing sourmash: PR walkthrough</h1>
<p>luizirber, 2020-01-10</p>
<p>sourmash 3 was released last week,
finally landing the Rust backend.
But, what changes when developing new features in sourmash?
I was thinking about how to best document this process,
and since <a href="https://github.com/dib-lab/sourmash/pull/826/files">PR #826</a> is a short example touching all the layers I decided to do a
small walkthrough.</p>
<p>Shall we?</p>
<h2>The problem</h2>
<p>The first step is describing the problem,
and trying to convince reviewers (and yourself) that the changes bring enough benefits to justify a merge.
This is the description I put in the PR:</p>
<blockquote>
<p>Calling <code>.add_hash()</code> on a MinHash sketch is fine,
but if you're calling it all the time it's better to pass a list of hashes and call <code>.add_many()</code> instead.
Before this PR <code>add_many</code> just called <code>add_hash</code> for each hash it was passed,
but now it will pass the full list to Rust (and that's way faster).</p>
<p>No changes for public APIs,
and I changed the <code>_signatures</code> method in LCA to accumulate hashes for each sig first,
and then set them all at once.
This is way faster,
but might use more intermediate memory (I'll evaluate this now).</p>
</blockquote>
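<p>To illustrate why batching helps,
here is a toy model in Python.
<code>Sketch</code> is a made-up class, not the sourmash <code>MinHash</code>:
<code>_backend_insert</code> stands in for a call that crosses into Rust,
and the counter just records how many times that boundary is crossed.</p>

```python
class Sketch:
    """Toy sketch whose data lives behind a (pretend) expensive call boundary."""

    def __init__(self):
        self.hashes = set()
        self.boundary_calls = 0

    def _backend_insert(self, hashes):
        # Stand-in for one call into the Rust side.
        self.boundary_calls += 1
        self.hashes.update(hashes)

    def add_hash(self, h):
        self._backend_insert([h])

    def add_many_one_by_one(self, hashes):
        # The old behavior: one boundary crossing per hash.
        for h in hashes:
            self.add_hash(h)

    def add_many(self, hashes):
        # The new behavior: hand over the full list at once.
        self._backend_insert(hashes)


hashes = list(range(1000))

before = Sketch()
before.add_many_one_by_one(hashes)

after = Sketch()
after.add_many(hashes)

print(before.boundary_calls, after.boundary_calls, before.hashes == after.hashes)
```

<p>The result is identical either way,
but accumulating hashes first means holding the full list in memory before the call,
which is the trade-off mentioned in the PR description.</p>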
<p>There are many details that sound like jargon for someone not familiar with the codebase,
but if I write something too long I'll probably be wasting the reviewers' time too.
The benefit of a very detailed description is extending the knowledge for other people
(not necessarily the maintainers),
but that also takes effort that might be better allocated to solve other problems.
Or, more realistically, putting out other fires =P</p>
<p>Nonetheless,
some points I like to add in PR descriptions:</p>
<ul>
<li>Why is there a problem with the current approach?</li>
<li>Is this the minimal viable change, or is it trying to change too many things
at once? The former is way better, in general.</li>
<li>What are the trade-offs? This PR is using more memory to lower the runtime,
but I hadn't measured it yet when I opened it.</li>
<li>Not changing public APIs is always good to convince reviewers.
If the project follows a <a href="https://semver.org/">semantic versioning</a> scheme,
breaking changes to the public APIs require major version bumps,
and that can bring other consequences for users.</li>
</ul>
<h2>Setting up for changing code</h2>
<p>If this was a bug fix PR,
the first thing I would do is write a new test triggering the bug,
and then proceed to fix it in the code
(Hmm, maybe that would be another good walkthrough?).
But this PR is making performance claims ("it's going to be faster"),
and that's a bit hard to codify in tests.
<sup id=sf-sourmash-pr-1-back><a href=#sf-sourmash-pr-1 class=simple-footnote title="We do have https://asv.readthedocs.io/ set up for micro-benchmarks, and now that I think about it... I could have started by writing a benchmark for add_many, and then showing that it is faster. I will add this approach to the sourmash PR checklist =]">1</a></sup>
Since it's also proposing to change a method (<code>_signatures</code> in LCA indices) that is better to benchmark with a real index (and not a toy example),
I used the same data and command I run in <a href="https://github.com/luizirber/sourmash_resources">sourmash_resources</a> to check how memory consumption and runtime changed.
For reference, this is the command: </p>
<div class=highlight><pre><span></span>sourmash search -o out.csv --scaled <span class=m>2000</span> -k <span class=m>51</span> HSMA33OT.fastq.gz.sig genbank-k51.lca.json.gz
</pre></div>
<p>I'm using the <code>benchmark</code> feature from <a href="https://snakemake.readthedocs.io/">snakemake</a> in <a href="https://github.com/luizirber/sourmash_resources/blob/83ea237397d242e48c9d95eb0d9f50ceb4ad95c7/Snakefile#L99L114">sourmash_resources</a> to
track how much memory, runtime and I/O is used for each command (and version) of sourmash,
and generate the plots in the README in that repo.
That is fine for a high-level view ("what's the maximum memory used?"),
but not so useful for digging into details ("what method is consuming most memory?").</p>
<p>An additional problem is the dual<sup id=sf-sourmash-pr-2-back><a href=#sf-sourmash-pr-2 class=simple-footnote title="or triple, if you count C">2</a></sup> language nature of sourmash,
where we have Python calling into Rust code (via CFFI).
There are great tools for measuring and profiling Python code,
but they tend to not work with extension code...</p>
<p>So, let's bring two of my favorite tools to help!</p>
<h3>Memory profiling: heaptrack</h3>
<p><a href="https://github.com/KDE/heaptrack">heaptrack</a> is a heap profiler, and I first heard about it from <a href="https://www.vincentprouillet.com/">Vincent Prouillet</a>.
Its main advantage over other solutions (like valgrind's massif) is the low
overhead and... how easy it is to use:
just stick <code>heaptrack</code> in front of your command,
and you're good to go!</p>
<p>Example output:</p>
<div class=highlight><pre><span></span>$ heaptrack sourmash search -o out.csv --scaled <span class=m>2000</span> -k <span class=m>51</span> HSMA33OT.fastq.gz.sig genbank-k51.lca.json.gz
heaptrack stats:
allocations: <span class=m>1379353</span>
leaked allocations: <span class=m>1660</span>
temporary allocations: <span class=m>168984</span>
Heaptrack finished! Now run the following to investigate the data:
heaptrack --analyze heaptrack.sourmash.66565.gz
</pre></div>
<p><code>heaptrack --analyze</code> is a very nice graphical interface for analyzing the results,
but for this PR I'm mostly focusing on the Summary page (and overall memory consumption).
Tracking allocations in Python doesn't give many details,
because it shows the CPython functions being called,
but the ability to track allocations inside the extension code (Rust) is amazing
for finding bottlenecks (and memory leaks =P).
<sup id=sf-sourmash-pr-3-back><a href=#sf-sourmash-pr-3 class=simple-footnote title="It would be super cool to have the unwinding code from py-spy in heaptrack, and be able to see exactly what Python methods/lines of code were calling the Rust parts...">3</a></sup></p>
<h3>CPU profiling: py-spy</h3>
<p>Just as other solutions exist for profiling memory,
there are many for profiling CPU usage in Python,
including <code>profile</code> and <code>cProfile</code> in the standard library.
Again, the issue is being able to analyze extension code,
and bringing out the cannon (the <code>perf</code> command in Linux, for example) loses the
benefit of tracking Python code properly (because we get back the CPython
functions, not what you defined in your Python code).</p>
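<p>For reference, a minimal sketch of what the standard library profiler reports. The <code>build_lists</code> workload is a made-up example, not sourmash code. <code>cProfile</code> lists Python functions and built-ins by name, but a call into a compiled extension shows up as a single opaque entry, with no breakdown of what happened inside it.</p>

```python
import cProfile
import io
import pstats


def build_lists(n):
    """Toy workload: group n integers into ten buckets."""
    buckets = {}
    for i in range(n):
        buckets.setdefault(i % 10, []).append(i)
    return buckets


profiler = cProfile.Profile()
profiler.enable()
build_lists(100_000)
profiler.disable()

# Collect the five most expensive entries (by cumulative time) as text.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```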
<p>Enter <a href="https://github.com/benfred/py-spy">py-spy</a> by <a href="https://www.benfrederickson.com">Ben Frederickson</a>,
based on the <a href="https://github.com/rbspy/rbspy">rbspy</a> project by <a href="https://jvns.ca">Julia Evans</a>.
Both use a great idea:
read the process maps for the interpreters and resolve the full stack trace information,
with low overhead (because it uses sampling).
<a href="https://github.com/benfred/py-spy">py-spy</a> also goes further and resolves <a href="https://www.benfrederickson.com/profiling-native-python-extensions-with-py-spy/">native Python extensions</a> stack traces,
meaning we can get the complete picture all the way from the Python CLI to the
Rust core library!<sup id=sf-sourmash-pr-4-back><a href=#sf-sourmash-pr-4 class=simple-footnote title="Even if py-spy doesn't talk explicitly about Rust, it works very very well, woohoo!">4</a></sup></p>
<p><code>py-spy</code> is also easy to use:
stick <code>py-spy record --output search.svg -n --</code> in front of the command,
and it will generate a flamegraph in <code>search.svg</code>.
The full command for this PR is</p>
<div class=highlight><pre><span></span>py-spy record --output search.svg -n -- sourmash search -o out.csv --scaled <span class=m>2000</span> -k <span class=m>51</span> HSMA33OT.fastq.gz.sig genbank-k51.lca.json.gz
</pre></div>
<h2>Show me the code!</h2>
<p>OK, OK, sheesh. But it's worth repeating: the code is important, but there are
many other aspects that are just as important =]</p>
<h3>Replacing <code>add_hash</code> calls with one <code>add_many</code></h3>
<p>Let's start at the <a href="https://github.com/dib-lab/sourmash/pull/826/files#diff-adf06d14c535d5b22da9fb3862e4a487"><code>_signatures()</code></a> method on LCA indices.
This is the original method:</p>
<div class=highlight><pre><span></span><span class=nd>@cached_property</span>
<span class=k>def</span> <span class=nf>_signatures</span><span class=p>(</span><span class=bp>self</span><span class=p>):</span>
<span class=s2>"Create a _signatures member dictionary that contains {idx: minhash}."</span>
<span class=kn>from</span> <span class=nn>..</span> <span class=kn>import</span> <span class=n>MinHash</span>
<span class=n>minhash</span> <span class=o>=</span> <span class=n>MinHash</span><span class=p>(</span><span class=n>n</span><span class=o>=</span><span class=mi>0</span><span class=p>,</span> <span class=n>ksize</span><span class=o>=</span><span class=bp>self</span><span class=o>.</span><span class=n>ksize</span><span class=p>,</span> <span class=n>scaled</span><span class=o>=</span><span class=bp>self</span><span class=o>.</span><span class=n>scaled</span><span class=p>)</span>
<span class=n>debug</span><span class=p>(</span><span class=s1>'creating signatures for LCA DB...'</span><span class=p>)</span>
<span class=n>sigd</span> <span class=o>=</span> <span class=n>defaultdict</span><span class=p>(</span><span class=n>minhash</span><span class=o>.</span><span class=n>copy_and_clear</span><span class=p>)</span>
<span class=k>for</span> <span class=p>(</span><span class=n>k</span><span class=p>,</span> <span class=n>v</span><span class=p>)</span> <span class=ow>in</span> <span class=bp>self</span><span class=o>.</span><span class=n>hashval_to_idx</span><span class=o>.</span><span class=n>items</span><span class=p>():</span>
<span class=k>for</span> <span class=n>vv</span> <span class=ow>in</span> <span class=n>v</span><span class=p>:</span>
<span class=n>sigd</span><span class=p>[</span><span class=n>vv</span><span class=p>]</span><span class=o>.</span><span class=n>add_hash</span><span class=p>(</span><span class=n>k</span><span class=p>)</span>
<span class=n>debug</span><span class=p>(</span><span class=s1>'=> </span><span class=si>{}</span><span class=s1> signatures!'</span><span class=p>,</span> <span class=nb>len</span><span class=p>(</span><span class=n>sigd</span><span class=p>))</span>
<span class=k>return</span> <span class=n>sigd</span>
</pre></div>
<p><code>sigd[vv].add_hash(k)</code> is the culprit.
Each call to <code>.add_hash</code> has to go through CFFI to reach the extension code,
and the overhead is significant.
It is a similar situation to accessing array elements in NumPy:
it works,
but it is way slower than using operations that avoid crossing from Python to
the extension code.
What we want to do instead is call <code>.add_many(hashes)</code>,
which takes a list of hashes and processes it entirely in Rust
(ideally. We will get there).</p>
<p>But, to have a list of hashes, there is another issue with this code.</p>
<div class=highlight><pre><span></span><span class=k>for</span> <span class=p>(</span><span class=n>k</span><span class=p>,</span> <span class=n>v</span><span class=p>)</span> <span class=ow>in</span> <span class=bp>self</span><span class=o>.</span><span class=n>hashval_to_idx</span><span class=o>.</span><span class=n>items</span><span class=p>():</span>
<span class=k>for</span> <span class=n>vv</span> <span class=ow>in</span> <span class=n>v</span><span class=p>:</span>
<span class=n>sigd</span><span class=p>[</span><span class=n>vv</span><span class=p>]</span><span class=o>.</span><span class=n>add_hash</span><span class=p>(</span><span class=n>k</span><span class=p>)</span>
</pre></div>
<p>There are two nested for loops,
and <code>add_hash</code> is being called with values from the inner loop.
So... we don't have the list of hashes beforehand.</p>
<p>But we can change the code a bit to save the hashes for each signature
in a temporary list,
and then call <code>add_many</code> on the temporary list.
Like this:</p>
<div class=highlight><pre><span></span><span class=n>temp_vals</span> <span class=o>=</span> <span class=n>defaultdict</span><span class=p>(</span><span class=nb>list</span><span class=p>)</span>
<span class=k>for</span> <span class=p>(</span><span class=n>k</span><span class=p>,</span> <span class=n>v</span><span class=p>)</span> <span class=ow>in</span> <span class=bp>self</span><span class=o>.</span><span class=n>hashval_to_idx</span><span class=o>.</span><span class=n>items</span><span class=p>():</span>
<span class=k>for</span> <span class=n>vv</span> <span class=ow>in</span> <span class=n>v</span><span class=p>:</span>
<span class=n>temp_vals</span><span class=p>[</span><span class=n>vv</span><span class=p>]</span><span class=o>.</span><span class=n>append</span><span class=p>(</span><span class=n>k</span><span class=p>)</span>
<span class=k>for</span> <span class=n>sig</span><span class=p>,</span> <span class=n>vals</span> <span class=ow>in</span> <span class=n>temp_vals</span><span class=o>.</span><span class=n>items</span><span class=p>():</span>
<span class=n>sigd</span><span class=p>[</span><span class=n>sig</span><span class=p>]</span><span class=o>.</span><span class=n>add_many</span><span class=p>(</span><span class=n>vals</span><span class=p>)</span>
</pre></div>
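<p>A self-contained toy version of the same regrouping, to show the shape of the data before and after. The small <code>hashval_to_idx</code> dictionary here is a hypothetical stand-in for the one in the LCA index, which is far larger.</p>

```python
from collections import defaultdict

# Hypothetical stand-in for self.hashval_to_idx:
# each hash value maps to the indices of the signatures containing it.
hashval_to_idx = {11: [0, 1], 22: [1], 33: [0, 2]}

# Invert the mapping: collect all hash values for each signature index.
temp_vals = defaultdict(list)
for hashval, idxs in hashval_to_idx.items():
    for idx in idxs:
        temp_vals[idx].append(hashval)

grouped = dict(temp_vals)
print(grouped)  # {0: [11, 33], 1: [11, 22], 2: [33]}
```

<p>Each value in <code>grouped</code> is now a complete list of hashes for one signature, which is what a single <code>add_many</code> call needs.</p>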
<p>There is a trade-off here:
if we save the hashes in temporary lists,
will the extra memory consumption cancel out the runtime gains from calling
<code>add_many</code> on these temporary lists?</p>
<p>Time to measure it =]</p>
<table>
<thead>
<tr>
<th align=left>version</th>
<th align=left>mem</th>
<th align=left>time</th>
</tr>
</thead>
<tbody>
<tr>
<td align=left>original</td>
<td align=left>1.5 GB</td>
<td align=left>160s</td>
</tr>
<tr>
<td align=left><code>list</code></td>
<td align=left>1.7 GB</td>
<td align=left>173s</td>
</tr>
</tbody>
</table>
<p>Wait, it got worse?!?! Building temporary lists only takes time and memory,
and brings no benefits!</p>
<p>This mystery goes away when you look at the <a href="https://github.com/dib-lab/sourmash/pull/826/files#diff-2f53b2a5be4083c39a0275847c87f88fR190">add_many method</a>:</p>
<div class=highlight><pre><span></span><span class=k>def</span> <span class=nf>add_many</span><span class=p>(</span><span class=bp>self</span><span class=p>,</span> <span class=n>hashes</span><span class=p>):</span>
<span class=s2>"Add many hashes in at once."</span>
<span class=k>if</span> <span class=nb>isinstance</span><span class=p>(</span><span class=n>hashes</span><span class=p>,</span> <span class=n>MinHash</span><span class=p>):</span>
<span class=bp>self</span><span class=o>.</span><span class=n>_methodcall</span><span class=p>(</span><span class=n>lib</span><span class=o>.</span><span class=n>kmerminhash_add_from</span><span class=p>,</span> <span class=n>hashes</span><span class=o>.</span><span class=n>_objptr</span><span class=p>)</span>
<span class=k>else</span><span class=p>:</span>
<span class=k>for</span> <span class=nb>hash</span> <span class=ow>in</span> <span class=n>hashes</span><span class=p>:</span>
<span class=bp>self</span><span class=o>.</span><span class=n>_methodcall</span><span class=p>(</span><span class=n>lib</span><span class=o>.</span><span class=n>kmerminhash_add_hash</span><span class=p>,</span> <span class=nb>hash</span><span class=p>)</span>
</pre></div>
<p>The first check in the <code>if</code> statement is a shortcut for adding hashes from
another <code>MinHash</code>, so let's focus on the <code>else</code> part...
And it turns out that <code>add_many</code> is lying!
It doesn't process the <code>hashes</code> in the Rust extension,
but just loops and calls <code>add_hash</code> for each <code>hash</code> in the list.
That's not going to be any faster than what we were doing in <code>_signatures</code>.</p>
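<p>A toy model of the difference, counting how many times each approach would cross from Python into native code. <code>CountingMinHash</code> and its methods are hypothetical and only simulate the boundary with a counter; no real FFI call is made.</p>

```python
class CountingMinHash:
    """Toy stand-in for a MinHash; counts simulated Python-to-native calls."""

    def __init__(self):
        self.hashes = set()
        self.crossings = 0

    def _native_add_hash(self, h):
        # Pretend this is one call through the FFI layer.
        self.crossings += 1
        self.hashes.add(h)

    def _native_add_many(self, hashes):
        # Pretend this is one call through the FFI layer for the whole batch.
        self.crossings += 1
        self.hashes.update(hashes)

    def add_many_looping(self, hashes):
        # Same shape as the add_many shown above: one crossing per hash.
        for h in hashes:
            self._native_add_hash(h)

    def add_many_batched(self, hashes):
        # One crossing, no matter how many hashes there are.
        self._native_add_many(list(hashes))


looping = CountingMinHash()
looping.add_many_looping(range(1000))
batched = CountingMinHash()
batched.add_many_batched(range(1000))
print(looping.crossings, batched.crossings)  # 1000 1
```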
<p>Time to fix <code>add_many</code>!</p>
<h3>Oxidizing <code>add_many</code></h3>
<p>The idea is to change this loop in <code>add_many</code>:</p>
<div class=highlight><pre><span></span><span class=k>for</span> <span class=nb>hash</span> <span class=ow>in</span> <span class=n>hashes</span><span class=p>:</span>
<span class=bp>self</span><span class=o>.</span><span class=n>_methodcall</span><span class=p>(</span><span class=n>lib</span><span class=o>.</span><span class=n>kmerminhash_add_hash</span><span class=p>,</span> <span class=nb>hash</span><span class=p>)</span>
</pre></div>
<p>with a call to a Rust extension function:</p>
<div class=highlight><pre><span></span><span class=bp>self</span><span class=o>.</span><span class=n>_methodcall</span><span class=p>(</span><span class=n>lib</span><span class=o>.</span><span class=n>kmerminhash_add_many</span><span class=p>,</span> <span class=nb>list</span><span class=p>(</span><span class=n>hashes</span><span class=p>),</span> <span class=nb>len</span><span class=p>(</span><span class=n>hashes</span><span class=p>))</span>
</pre></div>
<p><code>self._methodcall</code> is a convenience method defined in <a href="https://github.com/dib-lab/sourmash/blob/c6cbdf0398ef836797492e13371a38373c544ae1/sourmash/utils.py#L24">RustObject</a>
which translates a method-like call into a function call,
since our C layer only has functions.
This is the C prototype for this function:</p>
<div class=highlight><pre><span></span><span class=kt>void</span><span class=w> </span><span class=nf>kmerminhash_add_many</span><span class=p>(</span><span class=w></span>
<span class=w> </span><span class=n>KmerMinHash</span><span class=w> </span><span class=o>*</span><span class=n>ptr</span><span class=p>,</span><span class=w></span>
<span class=w> </span><span class=k>const</span><span class=w> </span><span class=kt>uint64_t</span><span class=w> </span><span class=o>*</span><span class=n>hashes_ptr</span><span class=p>,</span><span class=w></span>
<span class=w> </span><span class=kt>uintptr_t</span><span class=w> </span><span class=n>insize</span><span class=w></span>
<span class=w> </span><span class=p>);</span><span class=w></span>
</pre></div>
<p>You can almost read it as a Python method declaration,
where <code>KmerMinHash *ptr</code> means the same as the <code>self</code> in Python methods.
The other two arguments are a common idiom when passing pointers to data in C,
with <code>insize</code> being how many elements we have in the list.
<sup id=sf-sourmash-pr-5-back><a href=#sf-sourmash-pr-5 class=simple-footnote title="Let's not talk about lack of array bounds checks in C...">5</a></sup>
<code>CFFI</code> is very good at converting Python lists into pointers of a specific type,
as long as it is a primitive type
(<code>uint64_t</code> in our case, since each hash is a 64-bit unsigned integer number).</p>
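<p>To see what "a pointer plus a size" looks like from the Python side, here is a minimal sketch using <code>ctypes</code> from the standard library instead of CFFI (sourmash uses CFFI, which does this conversion automatically). The variable names mirror the C prototype, but nothing here calls into sourmash.</p>

```python
import ctypes

hashes = [5, 2**64 - 1, 42]

# Build a contiguous C array of uint64_t from the Python list.
ArrayType = ctypes.c_uint64 * len(hashes)
hashes_array = ArrayType(*hashes)

# A C function with the prototype above would receive these two values
# as its hashes_ptr and insize arguments.
hashes_ptr = ctypes.cast(hashes_array, ctypes.POINTER(ctypes.c_uint64))
insize = ctypes.c_size_t(len(hashes))

# Three 8-byte integers, stored back to back.
print(ctypes.sizeof(hashes_array), insize.value)  # 24 3
```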
<p>And the Rust code with the implementation of the function:</p>
<div class=highlight><pre><span></span><span class=n>ffi_fn</span><span class=o>!</span><span class=w> </span><span class=p>{</span><span class=w></span>
<span class=k>unsafe</span><span class=w> </span><span class=k>fn</span> <span class=nf>kmerminhash_add_many</span><span class=p>(</span><span class=w></span>
<span class=w> </span><span class=n>ptr</span>: <span class=o>*</span><span class=k>mut</span><span class=w> </span><span class=n>KmerMinHash</span><span class=p>,</span><span class=w></span>
<span class=w> </span><span class=n>hashes_ptr</span>: <span class=o>*</span><span class=k>const</span><span class=w> </span><span class=kt>u64</span><span class=p>,</span><span class=w></span>
<span class=w> </span><span class=n>insize</span>: <span class=kt>usize</span><span class=p>,</span><span class=w></span>
<span class=w> </span><span class=p>)</span><span class=w> </span>-> <span class=nb>Result</span><span class=o><</span><span class=p>()</span><span class=o>></span><span class=w> </span><span class=p>{</span><span class=w></span>
<span class=w> </span><span class=kd>let</span><span class=w> </span><span class=n>mh</span><span class=w> </span><span class=o>=</span><span class=w> </span><span class=p>{</span><span class=w></span>
<span class=w> </span><span class=fm>assert!</span><span class=p>(</span><span class=o>!</span><span class=n>ptr</span><span class=p>.</span><span class=n>is_null</span><span class=p>());</span><span class=w></span>
<span class=w> </span><span class=o>&</span><span class=k>mut</span><span class=w> </span><span class=o>*</span><span class=n>ptr</span><span class=w></span>
<span class=w> </span><span class=p>};</span><span class=w></span>
<span class=w> </span><span class=kd>let</span><span class=w> </span><span class=n>hashes</span><span class=w> </span><span class=o>=</span><span class=w> </span><span class=p>{</span><span class=w></span>
<span class=w> </span><span class=fm>assert!</span><span class=p>(</span><span class=o>!</span><span class=n>hashes_ptr</span><span class=p>.</span><span class=n>is_null</span><span class=p>());</span><span class=w></span>
<span class=w> </span><span class=n>slice</span>::<span class=n>from_raw_parts</span><span class=p>(</span><span class=n>hashes_ptr</span><span class=w> </span><span class=k>as</span><span class=w> </span><span class=o>*</span><span class=k>mut</span><span class=w> </span><span class=kt>u64</span><span class=p>,</span><span class=w> </span><span class=n>insize</span><span class=p>)</span><span class=w></span>
<span class=w> </span><span class=p>};</span><span class=w></span>
<span class=w> </span><span class=k>for</span><span class=w> </span><span class=n>hash</span><span class=w> </span><span class=k>in</span><span class=w> </span><span class=n>hashes</span><span class=w> </span><span class=p>{</span><span class=w></span>
<span class=w> </span><span class=n>mh</span><span class=p>.</span><span class=n>add_hash</span><span class=p>(</span><span class=o>*</span><span class=n>hash</span><span class=p>);</span><span class=w></span>
<span class=w> </span><span class=p>}</span><span class=w></span>
<span class=w> </span><span class=nb>Ok</span><span class=p>(())</span><span class=w></span>
<span class=p>}</span><span class=w></span>
<span class=p>}</span><span class=w></span>
</pre></div>
<p>Let's break what's happening here into smaller pieces.
Starting with the function signature:</p>
<div class=highlight><pre><span></span><span class=n>ffi_fn</span><span class=o>!</span><span class=w> </span><span class=p>{</span><span class=w></span>
<span class=k>unsafe</span><span class=w> </span><span class=k>fn</span> <span class=nf>kmerminhash_add_many</span><span class=p>(</span><span class=w></span>
<span class=w> </span><span class=n>ptr</span>: <span class=o>*</span><span class=k>mut</span><span class=w> </span><span class=n>KmerMinHash</span><span class=p>,</span><span class=w></span>
<span class=w> </span><span class=n>hashes_ptr</span>: <span class=o>*</span><span class=k>const</span><span class=w> </span><span class=kt>u64</span><span class=p>,</span><span class=w></span>
<span class=w> </span><span class=n>insize</span>: <span class=kt>usize</span><span class=p>,</span><span class=w></span>
<span class=w> </span><span class=p>)</span><span class=w> </span>-> <span class=nb>Result</span><span class=o><</span><span class=p>()</span><span class=o>></span><span class=w></span>
</pre></div>
<p>The weird <code>ffi_fn! {}</code> syntax around the function is a macro in Rust:
it changes the final generated code to convert the return value (<code>Result<()></code>) into something that is valid C code (in this case, <code>void</code>).
What happens if there is an error, then?
The Rust extension has code for passing back an error code and message to Python,
as well as capturing panics (when things go horribly bad and the program can't recover)
in a way that Python can then deal with (raising exceptions and cleaning up).
It also sets the <code>#[no_mangle]</code> attribute in the function,
meaning that the final name of the function will follow C semantics (instead of Rust semantics),
and can be called more easily from C and other languages.
This <code>ffi_fn!</code> macro comes from <a href="https://github.com/getsentry/symbolic">symbolic</a>,
a big influence on the design of the Python/Rust bridge in sourmash.</p>
<p><code>unsafe</code> is the keyword in Rust to disable some checks in the code to allow
potentially dangerous things (like dereferencing a pointer),
and it is required to interact with C code.
<code>unsafe</code> doesn't mean that the code is always unsafe to use:
it's up to whoever is calling this to verify that valid data is being passed and invariants are being preserved.</p>
<p>If we remove the <code>ffi_fn!</code> macro and the <code>unsafe</code> keyword,
we have</p>
<div class=highlight><pre><span></span><span class=k>fn</span> <span class=nf>kmerminhash_add_many</span><span class=p>(</span><span class=w></span>
<span class=w> </span><span class=n>ptr</span>: <span class=o>*</span><span class=k>mut</span><span class=w> </span><span class=n>KmerMinHash</span><span class=p>,</span><span class=w></span>
<span class=w> </span><span class=n>hashes_ptr</span>: <span class=o>*</span><span class=k>const</span><span class=w> </span><span class=kt>u64</span><span class=p>,</span><span class=w></span>
<span class=w> </span><span class=n>insize</span>: <span class=kt>usize</span>
<span class=p>);</span><span class=w></span>
</pre></div>
<p>At this point we can pretty much map between Rust and the C function prototype:</p>
<div class=highlight><pre><span></span><span class=kt>void</span><span class=w> </span><span class=nf>kmerminhash_add_many</span><span class=p>(</span><span class=w></span>
<span class=w> </span><span class=n>KmerMinHash</span><span class=w> </span><span class=o>*</span><span class=n>ptr</span><span class=p>,</span><span class=w></span>
<span class=w> </span><span class=k>const</span><span class=w> </span><span class=kt>uint64_t</span><span class=w> </span><span class=o>*</span><span class=n>hashes_ptr</span><span class=p>,</span><span class=w></span>
<span class=w> </span><span class=kt>uintptr_t</span><span class=w> </span><span class=n>insize</span><span class=w></span>
<span class=w> </span><span class=p>);</span><span class=w></span>
</pre></div>
<p>Some interesting points:</p>
<ul>
<li>We use <code>fn</code> to declare a function in Rust.</li>
<li>The type of an argument comes after the name of the argument in Rust,
while it's the other way around in C.
Same for the return type (it is omitted in the Rust function, which means it
is <code>-> ()</code>, equivalent to a <code>void</code> return type in C).</li>
<li>In Rust everything is <strong>immutable</strong> by default, so we need to say that we want
a mutable pointer to a <code>KmerMinHash</code> item: <code>*mut KmerMinHash</code>.
In C everything is mutable by default.</li>
<li><code>u64</code> in Rust -> <code>uint64_t</code> in C</li>
<li><code>usize</code> in Rust -> <code>uintptr_t</code> in C</li>
</ul>
<p>Let's check the implementation of the function now.
We start by converting the <code>ptr</code> argument (a raw pointer to a <code>KmerMinHash</code> struct)
into a regular Rust struct:</p>
<div class=highlight><pre><span></span><span class=kd>let</span><span class=w> </span><span class=n>mh</span><span class=w> </span><span class=o>=</span><span class=w> </span><span class=p>{</span><span class=w></span>
<span class=w> </span><span class=fm>assert!</span><span class=p>(</span><span class=o>!</span><span class=n>ptr</span><span class=p>.</span><span class=n>is_null</span><span class=p>());</span><span class=w></span>
<span class=w> </span><span class=o>&</span><span class=k>mut</span><span class=w> </span><span class=o>*</span><span class=n>ptr</span><span class=w></span>
<span class=p>};</span><span class=w></span>
</pre></div>
<p>This block is asserting that <code>ptr</code> is not a null pointer,
and if so it dereferences it and stores the result as a mutable reference.
If it were a null pointer the <code>assert!</code> would panic (which might sound extreme,
but is way better than continuing to run, because dereferencing a null pointer is
BAD).
Note that functions always need all the types in arguments and return values,
but for variables in the body of the function
Rust can figure out types most of the time,
so no need to specify them.</p>
<p>The next block prepares our list of hashes for use:</p>
<div class=highlight><pre><span></span><span class=kd>let</span><span class=w> </span><span class=n>hashes</span><span class=w> </span><span class=o>=</span><span class=w> </span><span class=p>{</span><span class=w></span>
<span class=w> </span><span class=fm>assert!</span><span class=p>(</span><span class=o>!</span><span class=n>hashes_ptr</span><span class=p>.</span><span class=n>is_null</span><span class=p>());</span><span class=w></span>
<span class=w> </span><span class=n>slice</span>::<span class=n>from_raw_parts</span><span class=p>(</span><span class=n>hashes_ptr</span><span class=w> </span><span class=k>as</span><span class=w> </span><span class=o>*</span><span class=k>mut</span><span class=w> </span><span class=kt>u64</span><span class=p>,</span><span class=w> </span><span class=n>insize</span><span class=p>)</span><span class=w></span>
<span class=p>};</span><span class=w></span>
</pre></div>
<p>We are again asserting that the <code>hashes_ptr</code> is not a null pointer,
but instead of dereferencing the pointer like before we use it to create a <code>slice</code>,
a dynamically-sized view into a contiguous sequence.
The list we got from Python is a contiguous sequence of size <code>insize</code>,
and the <code>slice::from_raw_parts</code> function creates a slice from a pointer to data and a size.</p>
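<p>The same "pointer plus size" reconstruction can be sketched in Python with <code>ctypes</code>, as a rough analogy only. <code>hashes_from_raw_parts</code> is a hypothetical helper, and unlike a Rust slice (which is a view over the existing memory) slicing a <code>ctypes</code> pointer copies the values into a new list.</p>

```python
import ctypes


def hashes_from_raw_parts(hashes_ptr, insize):
    """Copy insize uint64 values starting at hashes_ptr into a Python list."""
    # A NULL ctypes pointer is falsy; refuse it, like the assert in Rust.
    if not hashes_ptr:
        raise ValueError("hashes_ptr must not be NULL")
    # Nothing checks that insize matches the real allocation: passing a
    # wrong size reads out of bounds, which is why the Rust side is unsafe.
    return hashes_ptr[:insize]


backing = (ctypes.c_uint64 * 4)(10, 20, 30, 40)
ptr = ctypes.cast(backing, ctypes.POINTER(ctypes.c_uint64))
recovered = hashes_from_raw_parts(ptr, 4)
print(recovered)  # [10, 20, 30, 40]
```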
<p>Oh, and can you spot the bug?
I created the slice using <code>*mut u64</code>,
but the data is declared as <code>*const u64</code>.
Rust lets me change the mutability of a raw pointer with an <code>as</code> cast (the cast itself is allowed even in safe code),
but I shouldn't be doing that,
since we don't need to mutate the slice.
Oops.</p>
<p>Finally, let's add hashes to our MinHash!
We need a <code>for</code> loop, and call <code>add_hash</code> for each <code>hash</code>:</p>
<div class=highlight><pre><span></span><span class=k>for</span><span class=w> </span><span class=n>hash</span><span class=w> </span><span class=k>in</span><span class=w> </span><span class=n>hashes</span><span class=w> </span><span class=p>{</span><span class=w></span>
<span class=w> </span><span class=n>mh</span><span class=p>.</span><span class=n>add_hash</span><span class=p>(</span><span class=o>*</span><span class=n>hash</span><span class=p>);</span><span class=w></span>
<span class=p>}</span><span class=w></span>
<span class=nb>Ok</span><span class=p>(())</span><span class=w></span>
</pre></div>
<p>We finish the function with <code>Ok(())</code> to indicate no errors occurred.</p>
<p>Why is calling <code>add_hash</code> here faster than what we were doing before in Python?
Rust can optimize these calls and generate very efficient native code,
while Python is an interpreted language and most of the time doesn't have the same
guarantees that Rust can leverage to generate the code.
And, again,
calling <code>add_hash</code> here doesn't need to cross FFI boundaries or,
in fact,
do any dynamic evaluation during runtime,
because it is all statically analyzed during compilation.</p>
<h2>Putting it all together</h2>
<p>And... that's the PR code.
There are some other unrelated changes that should have been in new PRs,
but since they were so small, splitting them out felt like more work than necessary.
OK, that's a lame excuse:
it's confusing for reviewers to see these changes here,
so avoid doing that if possible!</p>
<p>But, did it work?</p>
<table>
<thead>
<tr>
<th align=left>version</th>
<th align=left>mem</th>
<th align=left>time</th>
</tr>
</thead>
<tbody>
<tr>
<td align=left>original</td>
<td align=left>1.5 GB</td>
<td align=left>160s</td>
</tr>
<tr>
<td align=left><code>list</code></td>
<td align=left>1.7 GB</td>
<td align=left>73s</td>
</tr>
</tbody>
</table>
<p>We are using 200 MB of extra memory,
but taking less than half the time it was taking before.
I think this is a good trade-off,
and so did the <a href="https://github.com/dib-lab/sourmash/pull/826#pullrequestreview-339020803">reviewer</a>, so the PR was approved.</p>
<p>Hopefully this was useful, 'til next time!</p>
<h2>Comments?</h2>
<ul>
<li><a href="https://social.lasanha.org/@luizirber/103461534713587975">Thread on Mastodon</a></li>
<li><a href="https://twitter.com/luizirber/status/1215772245928235008">Thread on Twitter</a></li>
<li><a href="https://lobste.rs/s/xnaugq/oxidizing_sourmash_pr_walkthrough">Lobste.rs submission</a></li>
</ul>
<h2>Bonus: <code>list</code> or <code>set</code>?</h2>
<p>The first version of the PR used a <code>set</code> instead of a <code>list</code> to accumulate hashes.
Since a <code>set</code> doesn't have repeated elements,
this could potentially use less memory.
The code:</p>
<div class=highlight><pre><span></span><span class=n>temp_vals</span> <span class=o>=</span> <span class=n>defaultdict</span><span class=p>(</span><span class=nb>set</span><span class=p>)</span>
<span class=k>for</span> <span class=p>(</span><span class=n>k</span><span class=p>,</span> <span class=n>v</span><span class=p>)</span> <span class=ow>in</span> <span class=bp>self</span><span class=o>.</span><span class=n>hashval_to_idx</span><span class=o>.</span><span class=n>items</span><span class=p>():</span>
<span class=k>for</span> <span class=n>vv</span> <span class=ow>in</span> <span class=n>v</span><span class=p>:</span>
<span class=n>temp_vals</span><span class=p>[</span><span class=n>vv</span><span class=p>]</span><span class=o>.</span><span class=n>add</span><span class=p>(</span><span class=n>k</span><span class=p>)</span>
<span class=k>for</span> <span class=n>sig</span><span class=p>,</span> <span class=n>vals</span> <span class=ow>in</span> <span class=n>temp_vals</span><span class=o>.</span><span class=n>items</span><span class=p>():</span>
<span class=n>sigd</span><span class=p>[</span><span class=n>sig</span><span class=p>]</span><span class=o>.</span><span class=n>add_many</span><span class=p>(</span><span class=n>vals</span><span class=p>)</span>
</pre></div>
<p>The runtime was again half of the original,
but...</p>
<table>
<thead>
<tr>
<th align=left>version</th>
<th align=left>mem</th>
<th align=left>time</th>
</tr>
</thead>
<tbody>
<tr>
<td align=left>original</td>
<td align=left>1.5 GB</td>
<td align=left>160s</td>
</tr>
<tr>
<td align=left><code>set</code></td>
<td align=left>3.8 GB</td>
<td align=left>80s</td>
</tr>
<tr>
<td align=left><code>list</code></td>
<td align=left>1.7 GB</td>
<td align=left>73s</td>
</tr>
</tbody>
</table>
<p>... memory consumption was about 2.5 times the original! WAT</p>
<p>The culprit this time? The new <code>kmerminhash_add_many</code> call in the <code>add_many</code>
method.
This one:</p>
<div class=highlight><pre><span></span><span class=bp>self</span><span class=o>.</span><span class=n>_methodcall</span><span class=p>(</span><span class=n>lib</span><span class=o>.</span><span class=n>kmerminhash_add_many</span><span class=p>,</span> <span class=nb>list</span><span class=p>(</span><span class=n>hashes</span><span class=p>),</span> <span class=nb>len</span><span class=p>(</span><span class=n>hashes</span><span class=p>))</span>
</pre></div>
<p><code>CFFI</code> doesn't know how to convert a <code>set</code> into something that C understands,
so we need to call <code>list(hashes)</code> to convert it into a list.
Since Python (and <code>CFFI</code>) can't know if the data is going to be used later
<sup id=sf-sourmash-pr-6-back><a href=#sf-sourmash-pr-6 class=simple-footnote title="something that the memory ownership model in Rust does, BTW">6</a></sup>
it needs to keep it around
(and be eventually deallocated by the garbage collector).
And that's how we get at least double the memory being allocated...</p>
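<p>A small sketch of that effect in CPython, using made-up data rather than the real index. Note that <code>sys.getsizeof</code> only counts each container itself (not the integers stored in it), so this illustrates the extra copy, not the exact numbers in the table above.</p>

```python
import sys

hashes_set = set(range(100_000))
# list() builds a second container, alive at the same time as the set.
hashes_list = list(hashes_set)

set_bytes = sys.getsizeof(hashes_set)
list_bytes = sys.getsizeof(hashes_list)

# In CPython the set's hash table is larger than a list of the same length,
# and while both exist we pay for the two of them.
print(set_bytes, list_bytes, set_bytes + list_bytes)
```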
<p>There is another lesson here.
If we look at the <code>for</code> loop again:</p>
<div class=highlight><pre><span></span><span class=k>for</span> <span class=p>(</span><span class=n>k</span><span class=p>,</span> <span class=n>v</span><span class=p>)</span> <span class=ow>in</span> <span class=bp>self</span><span class=o>.</span><span class=n>hashval_to_idx</span><span class=o>.</span><span class=n>items</span><span class=p>():</span>
<span class=k>for</span> <span class=n>vv</span> <span class=ow>in</span> <span class=n>v</span><span class=p>:</span>
<span class=n>temp_vals</span><span class=p>[</span><span class=n>vv</span><span class=p>]</span><span class=o>.</span><span class=n>add</span><span class=p>(</span><span class=n>k</span><span class=p>)</span>
</pre></div>
<p>each <code>k</code> is already unique because they are keys in the <code>hashval_to_idx</code> dictionary,
so the initial assumption
(that a <code>set</code> might save memory because it doesn't have repeated elements)
is... irrelevant for the problem =]</p><section class=footnotes><hr><h2>Footnotes</h2><ol><li id=sf-sourmash-pr-1><p>We do have https://asv.readthedocs.io/ set up for micro-benchmarks,
and now that I think about it...
I could have started by writing a benchmark for <code>add_many</code>,
and then showing that it is faster.
I will add this approach to the sourmash PR checklist =] <a href=#sf-sourmash-pr-1-back class=simple-footnote-back>↩</a></p></li><li id=sf-sourmash-pr-2><p>or triple, if you count C <a href=#sf-sourmash-pr-2-back class=simple-footnote-back>↩</a></p></li><li id=sf-sourmash-pr-3><p>It would be super cool to have the unwinding code from py-spy in heaptrack,
and be able to see exactly what Python methods/lines of code were calling the
Rust parts... <a href=#sf-sourmash-pr-3-back class=simple-footnote-back>↩</a></p></li><li id=sf-sourmash-pr-4><p>Even if py-spy doesn't talk explicitly about Rust,
it works very very well, woohoo! <a href=#sf-sourmash-pr-4-back class=simple-footnote-back>↩</a></p></li><li id=sf-sourmash-pr-5><p>Let's not talk about lack of array bounds checks in C... <a href=#sf-sourmash-pr-5-back class=simple-footnote-back>↩</a></p></li><li id=sf-sourmash-pr-6><p>something that the memory ownership model in Rust does, BTW <a href=#sf-sourmash-pr-6-back class=simple-footnote-back>↩</a></p></li></ol></section>Interoperability #rust20202019-12-01T12:00:00-03:002019-12-01T12:00:00-03:00luizirbertag:blog.luizirber.org,2019-12-01:/2019/12/01/rust-2020/<p>In January I wrote a <a href="https://blog.luizirber.org/2019/01/05/rust-2019/">post</a> for the Rust 2019 call for blogs.
The <a href="https://blog.rust-lang.org/2019/10/29/A-call-for-blogs-2020.html">2020 call</a> is aiming for an RFC and roadmap earlier this time,
so here is my 2020 post =]</p>
<h3>Last call review: what happened?</h3>
<h4>An attribute proc-macro like <code>#[wasm_bindgen]</code> but for FFI</h4>
<p>This sort of happened... because WebAssembly is growing =]</p>
<p>I was very excited when <a href="https://hacks.mozilla.org/2019/08/webassembly-interface-types/">Interface Types</a> showed up in August,
and while it is still very experimental it is moving fast and bringing saner
paths for interoperability than raw C FFIs.
David Beazley even pointed this out at the end of his <a href="https://www.youtube.com/watch?v=r-A78RgMhZU">PyCon India keynote</a>,
talking about how easy it is to get information out of a WebAssembly module
compared to what had to be done for SWIG.</p>
<p>This doesn't solve the problem where strict C compatibility is required,
or for platforms where a WebAssembly runtime is not available,
but I think it is a great solution for scientific software
(or, at least, for my use cases =]).</p>
<h4>"More -sys and Rust-like crates for interoperability with the larger ecosystems" and "More (bioinformatics) tools using Rust!"</h4>
<p>I did some of those this year (<a href="https://crates.io/crates/bbhash-sys">bbhash-sys</a> and <a href="https://crates.io/crates/mqf">mqf</a>),
and also found some great crates to use in my projects.
Rust is picking up steam in bioinformatics,
being used as the primary choice for high quality software
(like <a href="https://varlociraptor.github.io/">varlociraptor</a>,
or the many coming from <a href="https://github.com/10XGenomics/">10X Genomics</a>)
but it is still somewhat hard to find more details
(I mostly find it on Twitter,
and sometimes Google Scholar alerts).
It would be great to start bringing this info together,
which leads to...</p>
<h4>"A place to find other scientists?"</h4>
<p>Hey, this one happened! <a href="https://twitter.com/algo_luca/status/1081966759048028162">Luca Palmieri</a> started a conversation on <a href="https://www.reddit.com/r/rust/comments/ae77gt/scientific_computingmachine_learning_do_we_want_a/">reddit</a> and
the <a href="https://discord.gg/EXTSq4v">#science-and-ai</a> Discord channel on the Rust community server was born!
I think it works pretty well,
and Luca has also been doing a great job running <a href="https://github.com/LukeMathWalker/ndarray-koans">workshops</a>
and guiding the conversation around <a href="https://github.com/rust-ml/discussion">rust-ml</a>.</p>
<h2>Rust 2021: Interoperability</h2>
<p>Rust is amazing because it is very good at bringing together many concepts and ideas that
seem contradictory at first,
but can really shine when <a href="https://rust-lang.github.io/rustconf-2018-keynote/#127">synthesized</a>.
But can we share this combined wisdom and also improve the situation in other
places too?
Despite the "Rewrite it in Rust" meme,
increased interoperability is something that is already driving a lot of the
best aspects of Rust:</p>
<ul>
<li>
<p>Interoperability with other languages: as I said before,
with WebAssembly (and Rust having the best toolchain for it)
there is a clear route to achieve this,
but it will not replace all the software that already exists and can benefit
from FFI and C compatibility.
Bringing together developers from the many language-specific binding
generators (<a href="https://github.com/tildeio/helix">helix</a>, <a href="https://github.com/neon-bindings/neon">neon</a>, <a href="https://github.com/rusterlium/rustler">rustler</a>, <a href="https://github.com/PyO3/pyo3">PyO3</a>...) and figuring out what's missing from
them (or what are the common parts that can be shared) also seems productive.</p>
</li>
<li>
<p>Interoperability with new and unexplored domains.
I think Rust benefits enormously from not focusing on only one domain,
and choosing to prioritize CLI, WebAssembly, Networking and Embedded is a good
subset to start tackling problems,
but how do we guide other domains to also use Rust, bring in new
contributors and expose missing pieces of the larger picture?</p>
</li>
</ul>
<p>Another point extremely close to interoperability is training.
A great way to interoperate with other languages and domains is having good
documentation and material for transitioning into Rust without having to figure
everything out at once.
Rust documentation is already amazing,
especially considering the many books published by each working group.
But... there is a gap in the transitions,
both from understanding the basics of the language to using it,
and in the progression from beginner to intermediate and expert.</p>
<p>I see good resources for <a href="https://github.com/yoshuawuyts/rust-for-js-people">JavaScript</a> and <a href="https://github.com/rochacbruno/py2rs">Python</a> developers,
but we are still covering a pretty small niche:
programmers curious enough to go learn another language,
or looking for solutions for problems in their current language.</p>
<p>Can we bring more people into Rust?
<a href="https://rustbridge.com/">RustBridge</a> is obviously the reference here,
but there is space for much,
much more.
Using Rust in <a href="https://carpentries.org/">The Carpentries</a> lessons?
Creating <code>RustOpenSci</code>,
mirroring the communities of practice of <a href="https://ropensci.org/about/">rOpenSci</a> and <a href="https://www.pyopensci.org/">pyOpenSci</a>?</p>
<h2>Comments?</h2>
<ul>
<li><a href="https://social.lasanha.org/@luizirber/103236549475802733">Thread on Mastodon</a></li>
<li><a href="https://twitter.com/luizirber/status/1201373423592562690">Thread on Twitter</a></li>
</ul>Scientific Rust #rust20192019-01-05T17:00:00-02:002019-01-05T17:00:00-02:00luizirbertag:blog.luizirber.org,2019-01-05:/2019/01/05/rust-2019/<p>The Rust community requested feedback last year for where the language should go
in 2018, and now they are running it again for 2019.
Last year I was too new in Rust to organize a blog post,
but after a year using it I feel more comfortable writing this!</p>
<p>(Check my previous post about <a href="https://blog.luizirber.org/2018/08/23/sourmash-rust/">replacing the C++ core in sourmash with Rust</a> for more details on how I spent my year in Rust).</p>
<h2>What counts as "scientific Rust"?</h2>
<p>Anything that involves doing science using computers counts as
scientific programming. It ranges from embedded software
<a href="https://www.youtube.com/watch?v=y5Yd3FC-kh8">running on satellites</a> to climate models
running on supercomputers, from shell scripts running tools in a pipeline to
data analysis using notebooks.</p>
<p>It also makes the discussion harder, because it's too general! But it is very
important to keep in mind, because scientists are not your regular user: they
are highly qualified in their field of expertise, and they are also pushing the
boundaries of what we know (and this might need flexibility in their tools).</p>
<p>In this post I will be focusing more on two areas: array computing (what most
people consider 'scientific programming' to be) and "data structures".</p>
<h3>Array computing</h3>
<p>This one has been booming over the last couple of years due to industry interest in data
science and deep learning (where they will talk about tensors instead of arrays),
and has its roots in models running on supercomputers (a field where Fortran is
still king!). Data tends to be quite regular (representable with matrices) and
amenable to parallel processing.</p>
<p>A good example is the SciPy stack in Python, built on top of NumPy and SciPy.
The adoption of the SciPy stack (both in academia and industry) is staggering,
and many <a href="https://github.com/cupy/cupy">alternative implementations</a> try to provide a NumPy-like API to try to
capture its mindshare.</p>
<p>This is the compute-intensive side of science (be it CPU or GPU/TPU), and also the kind
of data that pushed CPU evolution and is still very important in defining policy
in scientific computing funding (see countries competing for the largest
supercomputers and measuring performance in floating point operations per second).</p>
<h3>Data structures for efficient data representation</h3>
<p>For data that is not so regular the situation is a bit different. I'll use
bioinformatics as an example: the data we get out of nucleotide sequencers is usually
represented by long strings (of ACGT), and algorithms will do a lot of string processing
(be it building string-overlap graphs for assembly, or searching for substrings
in large collections). This is only one example: there are many analyses that
will work with other types of data, and most of them don't have a
universal data representation as in the array computing case.</p>
<p>This is the memory-intensive science, and it's hard to measure performance in
floating point operations because... most of the time you're not even using
floating point numbers. It also suffers from limited data locality (which is
almost a prerequisite for compute-intensive performance).</p>
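<p>To make the string processing a bit more concrete,
here is a small sketch of the suffix/prefix overlap computation that sits at the heart of string-overlap graphs.
The function name and the toy reads are made up for this example,
and real assemblers use far more efficient indexes than this quadratic scan.</p>

```python
def longest_overlap(a: str, b: str, min_length: int = 1) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    max_len = min(len(a), len(b))
    # Try the longest candidate first and stop at the first match.
    for length in range(max_len, min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


# Toy reads (hypothetical data); each one overlaps the next.
reads = ["ACGTAC", "TACGGA", "GGATTT"]
for left, right in zip(reads, reads[1:]):
    print(left, right, longest_overlap(left, right))
```

<p>Note that there is not a single floating point operation in there:
it is all comparisons and memory access.</p>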
<h2>High performance core, interactive API</h2>
<p>There is something common in both cases: while performance-intensive
code is implemented in C/C++/Fortran, users usually interact with the API from
other languages (especially Python or R) because it's faster to iterate and
explore the data, and many of the tools already available in these languages are
very helpful for these tasks (think Jupyter/pandas or RStudio/tidyverse).
These languages are used to define the computation, but it is a lower-level core
library that drives it (NumPy or Tensorflow follow this idea, for example).</p>
<h2>How to make Rust better for science?</h2>
<p>The biggest barrier to learning Rust is the ownership model, and while we can
agree it is an important feature it is also quite daunting for newcomers,
especially if they don't have previous programming experience and exposure to
what bugs are being prevented. I don't see it being the first language we teach
to scientists any time soon, because the majority of scientists are not system
programmers, and have very different expectations for a programming language.
That doesn't mean that they can't benefit from Rust!</p>
<p>Rust is already great for building the performance-intensive parts,
and thanks to Cargo it is also a better alternative for sharing this code around,
since they tend to get 'stuck' inside Python or R packages.
And the 'easy' approach of vendoring C/C++ instead of having packages makes it
hard to keep track of changes and doesn't encourage reusable code.</p>
<p>And, of course, if this code is Rust instead of C/C++ it also means that Rust
users can use them directly, without depending on the other languages. Seems
like a good way to bootstrap a scientific community in Rust =]</p>
<h2>What I would like to see in 2019?</h2>
<h3>An attribute proc-macro like <code>#[wasm_bindgen]</code> but for FFI</h3>
<p>While FFI is an integral part of Rust's goals (interoperability with C/C++), I
have serious envy of the structure and tooling developed for WebAssembly! (Even
more now that it works in stable too)</p>
<p>We already have <code>#[no_mangle]</code> and <code>pub extern "C"</code>, but they are quite
low-level. I would love to see something closer to what wasm-bindgen does,
and define some traits (like <a href="https://rustwasm.github.io/wasm-bindgen/api/wasm_bindgen/convert/trait.IntoWasmAbi.html"><code>IntoWasmAbi</code></a>) to make it easier to
pass more complex data types through the FFI.</p>
<p>I know it's not that simple, and there are different design restrictions than
WebAssembly to take into account... The point here is not having the perfect
solution for all use cases, but something that serves as an entry point and helps
to deal with the complexity while you're still figuring out all the quirks and
traps of FFI. You can still fall back and have more control using the lower-level
options when the need arises.</p>
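<p>For a feeling of what "low-level" means on the calling side of a C ABI,
here is a small sketch using <code>ctypes</code> from the Python standard library.
It assumes CPython and calls <code>Py_GetVersion</code> from the interpreter's own C API,
chosen only because it is always available there;
it has nothing to do with Rust or sourmash.</p>

```python
import ctypes
import sys

# ctypes.pythonapi exposes the C API of the running CPython interpreter.
get_version = ctypes.pythonapi.Py_GetVersion

# The caller has to restate the C signature by hand. Nothing checks these
# declarations against the real function, so a mistake here is undefined
# behavior instead of a friendly Python exception.
get_version.argtypes = []
get_version.restype = ctypes.c_char_p  # const char *

version = get_version().decode("utf-8")
print(version)
```

<p>Every exported function needs this kind of manual bookkeeping,
and that is for a call with no arguments that returns a plain string.</p>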
<h3>More -sys and Rust-like crates for interoperability with the larger ecosystems</h3>
<p>There are new projects bringing more interoperability to <a href="https://arrow.apache.org/">dataframes</a> and <a href="https://xnd.io/">tensors</a>.
While this ship has already sailed and they are implemented in C/C++,
it would be great to be a first-class citizen,
and not reinvent the wheel.
(Note: the arrow project already has pretty good Rust support!)</p>
<p>In my own corner (bioinformatics), the <a href="https://github.com/rust-bio">Rust-bio community</a> is doing a
great job of wrapping <a href="https://github.com/rust-bio/rust-htslib">useful libraries in C/C++</a> and exposing them to
Rust (and also a shout-out to 10X Genomics for doing this work for
<a href="https://github.com/10XGenomics/rust-bwa">other tools</a> while also contributing to Rust-bio!).</p>
<h3>More (bioinformatics) tools using Rust!</h3>
<p>We already have great examples like <a href="https://github.com/onecodex/finch-rs">finch</a> and <a href="https://github.com/natir/yacrd">yacrd</a>,
since Rust is great for single binary distribution of programs.
And with bioinformatics focusing so much on independent tools chained together in workflows,
I think we can start convincing people to try it out =]</p>
<h3>A place to find other scientists?</h3>
<p>Another idea is to draw inspiration from <a href="https://ropensci.org/about/">rOpenSci</a> and have a Rust equivalent,
where people can get feedback about their projects and how to better integrate it with other crates.
This is quite close to the working group idea,
but I think it would serve more as a gateway to other groups,
more focused on developing entry-level docs and bringing more scientists to the
community.</p>
<h2>Final words</h2>
<p>In the end, I feel like this post ended up turning into my 'wishful TODO list'
for 2019, but I would love to find more people sharing these goals (or willing
to take any of this and just run with it, I do have a PhD to finish! =P)</p>
<h2>Comments?</h2>
<ul>
<li><a href="https://social.lasanha.org/@luizirber/101367104235280253">Thread on Mastodon</a></li>
<li><a href="https://twitter.com/luizirber/status/1081729107170193408">Thread on Twitter</a></li>
</ul>What open science is about2018-09-24T17:00:00-03:002018-09-24T17:00:00-03:00luizirbertag:blog.luizirber.org,2018-09-24:/2018/09/24/open-science/<p>Today I got a pleasant surprise: <a href="http://www.olgabotvinnik.com/">Olga Botvinnik</a> posted on <a href="https://twitter.com/olgabot/status/1044292704513839104">Twitter</a>
about a <a href="https://github.com/czbiohub/kmer-hashing/blob/olgabot/search-compare-ignore-abundance/figures/presentations/2018-09-24_beyond_the_cell_atlas_poster/2018-09-24_Beyond_the_Cell_Atlas.pdf">poster</a> she is presenting at the Beyond the Cell Atlas conference
and she name-dropped a bunch of people that helped her. The cool thing? They
are all open source developers, and Olga interacted thru GitHub to <a href="https://github.com/dib-lab/sourmash/pull/543">ask</a> <a href="https://github.com/dib-lab/sourmash/issues/545">for</a> <a href="https://github.com/betteridiot/bamnostic/issues/15">features</a>,
<a href="https://github.com/betteridiot/bamnostic/issues/20">report bugs</a> and even <a href="https://github.com/dib-lab/sourmash/pull/543">submit</a> <a href="https://github.com/dib-lab/sourmash/pull/539">pull</a> <a href="https://github.com/dib-lab/sourmash/pull/529">requests</a>.</p>
<p>That's what open science is about: collaboration, good practices, and in the end coming up
with something that is larger than each individual piece. Now sourmash is better,
bamnostic is better, reflow is better. I would like to see this becoming more and
more common =]</p>
<h2>Comments?</h2>
<ul>
<li><a href="https://social.lasanha.org/@luizirber/100783375923088450">Thread on Mastodon</a></li>
<li><a href="https://twitter.com/luizirber/status/1044369382233665536">Thread on Twitter</a></li>
</ul>New crate: nthash2018-09-13T17:00:00-03:002018-09-13T17:00:00-03:00luizirbertag:blog.luizirber.org,2018-09-13:/2018/09/13/nthash/<p>A quick announcement: I wrote a <a href="https://github.com/luizirber/nthash">Rust implementation</a> of <a href="https://github.com/bcgsc/ntHash">ntHash</a> and published
it on <a href="https://crates.io/crates/nthash">crates.io</a>. It implements an <code>Iterator</code> to take advantage of the
rolling properties of <code>ntHash</code> which make it so useful in bioinformatics (where
we work a lot with sliding windows over sequences).</p>
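<p>A rough sketch of the rolling idea, in Python for brevity.
This is a generic rotate-and-XOR rolling hash exposed as a generator,
not the crate's code:
the seed values are arbitrary placeholders (not the constants <code>ntHash</code> uses),
the function names are made up,
and it only handles the forward strand of sequences containing A, C, G and T.</p>

```python
MASK = (1 << 64) - 1

# Arbitrary 64-bit placeholder seeds, one per nucleotide.
SEEDS = {
    "A": 0x0123456789ABCDEF,
    "C": 0x0FEDCBA987654321,
    "G": 0x1357924680ACEBDF,
    "T": 0x2468ACE013579BDF,
}


def rotl(x: int, n: int) -> int:
    """Rotate a 64-bit value left by n bits."""
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK


def direct_hash(kmer: str) -> int:
    """Hash one k-mer from scratch: O(k) work."""
    k = len(kmer)
    h = 0
    for i, base in enumerate(kmer):
        h ^= rotl(SEEDS[base], k - 1 - i)
    return h


def rolling_hashes(seq: str, k: int):
    """Yield the hash of every k-mer in seq, O(1) work per step."""
    if len(seq) < k:
        return
    h = direct_hash(seq[:k])
    yield h
    for i in range(k, len(seq)):
        # Rotate, drop the base leaving the window, add the one entering it.
        h = rotl(h, 1) ^ rotl(SEEDS[seq[i - k]], k) ^ SEEDS[seq[i]]
        yield h


seq = "ACGTTGCAAC"
rolled = list(rolling_hashes(seq, 4))
direct = [direct_hash(seq[i:i + 4]) for i in range(len(seq) - 4 + 1)]
print(rolled == direct)
```

<p>The generator plays the role the <code>Iterator</code> plays in the crate:
the caller just loops over hashes and never sees the window bookkeeping.</p>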
<p>It's a pretty small crate, and probably was a better project to learn Rust than
doing a <a href="https://blog.luizirber.org/2018/08/23/sourmash-rust/">sourmash implementation</a> because it doesn't involve gnarly FFI
issues. I also put <a href="https://github.com/luizirber/nthash/blob/d0c16d7deb0a78b8aeb29090db91bba954c14fe8/src/lib.rs#L91">some docs</a>, <a href="https://github.com/luizirber/nthash/blob/d0c16d7deb0a78b8aeb29090db91bba954c14fe8/benches/nthash.rs#L11">benchmarks</a> using <a href="https://japaric.github.io/criterion.rs/">criterion</a>,
and even an <a href="https://github.com/luizirber/nthash/blob/d0c16d7deb0a78b8aeb29090db91bba954c14fe8/tests/nthash.rs#L80">oracle property-based test</a> with <a href="https://github.com/BurntSushi/quickcheck">quickcheck</a>.</p>
<p>More info <a href="https://docs.rs/nthash/">in the docs</a>, and if you want an <s>optimization</s> versioning bug
discussion be sure to check the <a href="https://github.com/luizirber/nthash_bug"><code>ntHash bug?</code></a> repo,
which has a (slow) Python implementation and a pretty nice <a href="https://nbviewer.jupyter.org/github/luizirber/nthash_bug/blob/master/analysis.ipynb">analysis</a> notebook.</p>
<h2>Comments?</h2>
<ul>
<li><a href="https://social.lasanha.org/@luizirber/100721133117928424">Thread on Mastodon</a></li>
<li><a href="https://twitter.com/luizirber/status/1040386666089705472">Thread on Twitter</a></li>
</ul>Oxidizing sourmash: WebAssembly2018-08-27T15:30:00-03:002018-08-27T15:30:00-03:00luizirbertag:blog.luizirber.org,2018-08-27:/2018/08/27/sourmash-wasm/<p>sourmash calculates MinHash signatures for genomic datasets,
meaning we are reducing the data (via subsampling) to a small
representative subset (a signature) capable of answering one question:
how similar is this dataset to another one? The key here is that a dataset with
10-100 GB will be reduced to something in the megabytes range, and two approaches
for doing that are:</p>
<ul>
<li>The user installs our software on their computer.
This is not so bad anymore (yay bioconda!), but still requires knowledge
about command line interfaces and how to install all this stuff. The user
data never leaves their computer, and they can share the signatures later
if they want to.</li>
<li>Provide a web service to calculate signatures. In this case no software
needs to be installed, but it's up to someone (me?) to maintain a server running with
an API and frontend to interact with the users. On top of requiring more
maintenance, another drawback is that
the user needs to send me the data, which is very inefficient network-wise
and leads to questions about what I can do with their raw data (and I'm not
into surveillance capitalism, TYVM).</li>
</ul>
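<p>For readers who have not seen a MinHash signature before,
here is a minimal sketch of the idea described at the top of this post.
It is not how sourmash is implemented:
it uses <code>blake2b</code> from <code>hashlib</code> as a stand-in hash function,
ignores reverse complements,
and all function names and the random test sequences are made up.
The defaults (k-mer size 21, 500 hashes) match the demo form further down.</p>

```python
import hashlib
import random


def kmers(seq, ksize):
    """Yield every substring of length ksize."""
    for i in range(len(seq) - ksize + 1):
        yield seq[i:i + ksize]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash function)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def signature(seq, ksize=21, num=500):
    """Keep only the num smallest distinct hashes: that is the subsample."""
    hashes = {hash_kmer(k) for k in kmers(seq, ksize)}
    return sorted(hashes)[:num]


def similarity(sig_a, sig_b, num=500):
    """Estimate Jaccard similarity from two signatures."""
    a, b = set(sig_a), set(sig_b)
    merged = sorted(a | b)[:num]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in a and h in b)
    return shared / len(merged)


# Hypothetical data: two random sequences sharing their first 4000 bases.
rng = random.Random(42)
common = "".join(rng.choice("ACGT") for _ in range(4000))
seq_a = common + "".join(rng.choice("ACGT") for _ in range(1000))
seq_b = common + "".join(rng.choice("ACGT") for _ in range(1000))

sig_a = signature(seq_a)
sig_b = signature(seq_b)
print(len(sig_a), similarity(sig_a, sig_a), similarity(sig_a, sig_b))
```

<p>However large the input is, the signature never holds more than <code>num</code> integers,
and that is why it is so much cheaper to move around than the raw data.</p>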
<h2>But... what if there is a third way?</h2>
<p>What if we could keep the frontend code from the web service (very
user-friendly) but do all the calculations client-side (and avoid the network
bottleneck)? The main hurdle
here is that our software is implemented in Python (and C++), which are not
supported in browsers. My first solution was to write the core features of
<a href="https://github.com/luizirber/sourmash-node">sourmash in JavaScript</a>, but that quickly started hitting annoying things
like JavaScript not supporting 64-bit integers. There is also the issue of
having another codebase to maintain and keep in sync with the original sourmash,
which would be a relevant burden for us. I gave a <a href="https://drive.google.com/open?id=1JvXiDaEA4J3hmEKw6sV-VHMpuHG_sxls3fLxJOht28E">lab meeting</a> about this
approach, using a <a href="https://soursigs-dnd-luizirber.hashbase.io/">drag-and-drop UI as proof of concept</a>. It did work but it
was finicky (dealing with the 64-bit integer hashes is not fun). The good thing
is that at least I had a working UI for further testing<sup id=sf-sourmash-wasm-1-back><a href=#sf-sourmash-wasm-1 class=simple-footnote title="even if horrible, I need to get some design classes =P">1</a></sup></p>
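<p>The 64-bit problem is easy to reproduce outside the browser.
JavaScript numbers are IEEE 754 doubles, the same representation as a Python <code>float</code>,
so this small sketch (with a made-up helper name) checks which integers survive a round trip through one:</p>

```python
def survives_double(value: int) -> bool:
    """True if value round-trips through an IEEE 754 double unchanged."""
    return int(float(value)) == value


max_safe = 2**53 - 1  # same value as JavaScript's Number.MAX_SAFE_INTEGER

# The last two are the kind of values a 64-bit hash can take.
for value in (max_safe, 2**53 + 1, 2**64 - 1):
    print(value, survives_double(value))
```

<p>Anything above 2<sup>53</sup> can silently turn into a neighboring value,
which is a terrible property for something you want to compare for equality.</p>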
<p>In "<a href="https://blog.luizirber.org/2018/08/23/sourmash-rust/">Oxidizing sourmash: Python and FFI</a>" I described my road to learn Rust,
but something that I omitted was that around the same time the <code>WebAssembly</code>
support in Rust started to look better and better and was a huge influence on
my decision to learn Rust. Reimplementing the sourmash C++ extension in Rust and
using the same codebase in the browser sounded very attractive,
and now that it was working I started looking into how to use the WebAssembly
target in Rust.</p>
<h2>WebAssembly?</h2>
<p>From the <a href="https://webassembly.org/">official site</a>,</p>
<blockquote>
<p>WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine.
Wasm is designed as a portable target for compilation of high-level languages like C/C++/Rust,
enabling deployment on the web for client and server applications.</p>
</blockquote>
<p>You can write WebAssembly by hand, but the goal is to have it as a lower-level
target for other languages. For me the obvious benefit is being able to use
something that is not JavaScript in the browser, even though the goal is not to replace
JS completely but complement it in a big pain point: performance. This also
frees JavaScript from being the target language for other toolchains,
allowing it to grow into other important areas (like language ergonomics).</p>
<p>Rust is not the only language targeting WebAssembly: Go 1.11 includes
<a href="https://golang.org/doc/go1.11#wasm">experimental support for WebAssembly</a>, and there are even projects bringing
the <a href="https://github.com/iodide-project/pyodide">scientific Python to the web</a> using WebAssembly. </p>
<h2>But does it work?</h2>
<p>With the <a href="https://github.com/luizirber/sourmash-rust">Rust implementation in place</a> and with all tests working on sourmash, I
added the finishing touches using <a href="https://github.com/rustwasm/wasm-bindgen"><code>wasm-bindgen</code></a> and built an NPM package using
<a href="https://github.com/rustwasm/wasm-pack"><code>wasm-pack</code></a>: <a href="https://www.npmjs.com/package/sourmash">sourmash</a> is a Rust codebase compiled to WebAssembly and ready
to use in JavaScript projects.</p>
<p>(Many thanks to Madicken Munk, who also presented during SciPy about how they used
<a href="https://munkm.github.io/2018-07-13-scipy/">Rust and WebAssembly to do interactive visualization in Jupyter</a>
and helped with a good example on how to do this properly =] )</p>
<p>Since I already had the working UI from the previous PoC, I <a href="https://github.com/luizirber/wort-dnd">refactored the code</a>
to use the new WebAssembly module and voilà! <a href="https://wort-dnd.hashbase.io/">It works!</a><sup id=sf-sourmash-wasm-2-back><a href=#sf-sourmash-wasm-2 class=simple-footnote title="the first version of this demo only worked in Chrome because they implemented the BigInt proposal, which is not in the official language yet. The funny thing is that BigInt would have made the JS implementation of sourmash viable, and I probably wouldn't have written the Rust implementation =P. Turns out that I didn't need the BigInt support if I didn't expose any 64-bit integers to JS, and that is what I'm doing now.">2</a></sup>.
<sup id=sf-sourmash-wasm-3-back><a href=#sf-sourmash-wasm-3 class=simple-footnote title="Along the way I ended up writing a new FASTQ parser... because it wouldn't be bioinformatics if it didn't otherwise, right? =P">3</a></sup>
But that was the demo from a year ago with updated code and I got a bit
better with frontend development since then, so here is the new demo:</p>
<div id=files class=box ondragover=event.preventDefault()>
<h2>sourmash + Wasm</h2>
<div id=drag-container>
<p><b>Drag & drop</b> a FASTA or FASTQ file here to calculate the sourmash signature.</p>
</div>
<div id=progress-container>
<div id=progress-bar></div>
</div>
<div class=columns>
<fieldset class="box input-button" id=params>
<label for=ksize-input>k-mer size:</label>
<input id=ksize-input type=number value=21>
<label for=scaled-input>scaled:</label>
<input id=scaled-input type=number value=0>
<label for=num-input>number of hashes:</label>
<input id=num-input type=number value=500>
<label for=dna-protein-group>Input type:</label>
<div id=dna-protein-group>
<input id=dna-input name=dna-protein-input type=radio value="DNA/RNA" checked>
<label for=dna-input>DNA/RNA</label>
<input id=protein-input name=dna-protein-input type=radio value=Protein>
<label for=protein-input>Protein</label>
</div>
<label for=track-abundance-input>Track abundance?</label>
<input id=track-abundance-input type=checkbox checked>
</fieldset>
<div class=box id=download>
<button id=download_btn type=button disabled>Download</button>
</div>
</div>
</div>
<p><link rel=stylesheet href="https://blog.luizirber.org/static/sourmash-wasm/app.css">
<script src="https://blog.luizirber.org/static/sourmash-wasm/dist/bundle.js"></script></p>
<p>For the source code for this demo, check the <a href="https://blog.luizirber.org/static/sourmash-wasm/index.html">sourmash-wasm</a> directory.</p>
<h2>Next steps</h2>
<p>The proof of concept works, but it is pretty useless right now.
I'm thinking about building it as a <a href="https://www.webcomponents.org/">Web Component</a> and making it really easy
to add to any webpage<sup id=sf-sourmash-wasm-4-back><a href=#sf-sourmash-wasm-4 class=simple-footnote title="or maybe a React component? I really would like to have something that works independent of framework, but not sure what is the best option in this case...">4</a></sup>.</p>
<p>Another interesting feature would be supporting more input formats (the GMOD
project implemented a lot of those!), but more features are probably better
after something simple but functional is released =P</p>
<h2>Next time!</h2>
<p>Where will we go next? Maybe explore some decentralized web technologies like
IPFS and dat, hmm? =]</p>
<h2>Comments?</h2>
<ul>
<li><a href="https://social.lasanha.org/@luizirber/100624574917435477">Thread on Mastodon</a></li>
<li><a href="https://www.reddit.com/r/rust/comments/9atie8/blog_post_clientside_bioinformatics_in_the/">Thread on reddit</a></li>
<li><a href="https://twitter.com/luizirber/status/1034206952773935104">Thread on Twitter</a></li>
</ul>
<h2>Updates</h2>
<ul>
<li>2018-08-30: Added a demo in the blog post.</li>
</ul><section class=footnotes><hr><h2>Footnotes</h2><ol><li id=sf-sourmash-wasm-1><p>even if horrible, I
need to get some design classes =P <a href=#sf-sourmash-wasm-1-back class=simple-footnote-back>↩</a></p></li><li id=sf-sourmash-wasm-2><p>the first version
of this demo only worked in Chrome because they implemented the <a href="https://github.com/tc39/proposal-bigint">BigInt proposal</a>,
which is not in the official language yet. The funny thing is that BigInt would
have made the JS implementation of sourmash viable, and I probably wouldn't have
written the Rust implementation =P.
Turns out that I didn't need the BigInt support if I didn't expose any 64-bit
integers to JS, and that is what I'm doing now. <a href=#sf-sourmash-wasm-2-back class=simple-footnote-back>↩</a></p></li><li id=sf-sourmash-wasm-3><p>Along the way I ended up writing a new FASTQ parser... because it wouldn't
be bioinformatics if it didn't otherwise, right? =P <a href=#sf-sourmash-wasm-3-back class=simple-footnote-back>↩</a></p></li><li id=sf-sourmash-wasm-4><p>or maybe a React component? I really would like to
have something that works independent of framework, but not sure what is the
best option in this case... <a href=#sf-sourmash-wasm-4-back class=simple-footnote-back>↩</a></p></li></ol></section>Oxidizing sourmash: Python and FFI2018-08-23T17:00:00-03:002018-08-23T17:00:00-03:00luizirbertag:blog.luizirber.org,2018-08-23:/2018/08/23/sourmash-rust/<p>I think the first time I heard about Rust was because Frank McSherry chose it to
write a <a href="https://github.com/frankmcsherry/timely-dataflow">timely dataflow</a> implementation.
Since then it started showing more and more in my news sources,
leading to Armin Ronacher publishing a post in the Sentry blog last November about
writing <a href="https://blog.sentry.io/2017/11/14/evolving-our-rust-with-milksnake">Python extensions in Rust</a>.</p>
<p>Last December I decided to give it a run:
I spent some time porting the C++ bits of <a href="https://github.com/dib-lab/sourmash">sourmash</a>
to Rust.
The main advantage here is that it's a problem I know well,
so I know what the code is supposed to do and can focus on figuring out
syntax and the mental model for the language.
I started digging into the <code>symbolic</code> codebase and understanding what they did,
and tried to mirror or improve it for my use cases.</p>
<p>(About the post title: The process of converting a codebase to Rust is referred to as <a href="https://wiki.mozilla.org/Oxidation">"Oxidation"</a> in
the Rust community, following the codename Mozilla chose for the process of
integrating Rust components into the Firefox codebase.
<sup id=sf-sourmash-rust-1-back><a href=#sf-sourmash-rust-1 class=simple-footnote title='The creator of the language is known to keep making up different explanations for the name of the language, but in this case "oxidation" refers to the chemical process that creates rust, and rust is the closest thing to metal (metal being the hardware). There are many terrible puns in the Rust community.'>1</a></sup>
Many of these components were tested and derived in Servo, an experimental
browser engine written in Rust, and are being integrated into Gecko,
the current browser engine (mostly written in C++).)</p>
<h2>Why Rust?</h2>
<p>There are other programming languages more focused on scientific software
that could be used instead, like Julia<sup id=sf-sourmash-rust-2-back><a href=#sf-sourmash-rust-2 class=simple-footnote title="Even more now that it hit 1.0, it is a really nice language">2</a></sup>. Many programming languages start from a
specific niche (like R and statistics,
or Maple and mathematics) and grow into larger languages over time.
While Rust's goal is not to be a scientific language,
its focus on being a general purpose language allows a phenomenon similar
to what happened with Python, where people from many areas pushed the
language in different directions (system scripting, web development,
numerical programming...) allowing developers to combine all these things
in their systems.</p>
<p>But by far my main interest in Rust is in the many best practices it brings to the default experience:
integrated package management (with Cargo),
documentation (with rustdoc), testing and benchmarking.
It's understandable that older languages like C/C++ need
more effort to support some of these features (like modules and a unified
build system), since they are defined by a standard and need to keep backward
compatibility with codebases that already exist.
Nonetheless, the lack of these features increases the effort needed to have good
software engineering practices, since you need to choose a solution that might
not be compatible with other similar but slightly different options,
leading to fragmentation and raising the barrier to using these features.</p>
<p>Another big reason is that Rust doesn't aim to completely replace what already
exists, but to complement and extend it. There are two very good talks about how to do this:
one by <a href="https://ashleygwilliams.github.io/rustfest-2017/#1">Ashley Williams</a>, another by <a href="http://talks.edunham.net/lca2018/should-you-rewrite-in-rust/beamer.pdf">E. Dunham</a>.</p>
<h2>Converting from a C++ extension to Rust</h2>
<p>The current implementation of the core data structures in sourmash is in a
C++ extension wrapped with Cython. My main goals for converting the code are:</p>
<ul>
<li>
<p>support additional languages and platforms. sourmash is available as a Python
package and CLI, but we have R users in the lab that would benefit from having
an R package, and ideally we wouldn't need to rewrite the software every time
we want to support a new language.</p>
</li>
<li>
<p>reduce the number of wheel packages necessary (one for each OS/platform).</p>
</li>
<li>
<p>in the long run, use the Rust memory management concepts (lifetimes, borrowing)
to increase parallelism in the code.</p>
</li>
</ul>
<p>Many of these goals are attainable with our current C++ codebase, and
"rewrite in a new language" is rarely the best way to solve a problem.
But the reduced burden in maintenance due to better tooling,
on top of features that would require careful planning to execute
(increasing the parallelism without data races) while maintaining compatibility
with the current codebase are promising enough to justify this experiment.
<a href="https://github.com/luizirber/2018-python-rust/blob/master/03.current_impl.md"><img src="/images/arch_cpp.png" title="Current implementation" alt=""></a></p>
<p>Cython provides a nice gradual path to migrate code from Python to C++,
since it is a superset of the Python syntax. It also provides low overhead
for many C++ features, especially the STL containers, which makes it easier
to map C++ features to the Python equivalent.
For research software this also leads to faster exploration of solutions before
having to commit to lower level code, but without a good process it might also
lead to code never crossing into the C++ layer and being stuck in the Cython
layer. This doesn't make any difference for a Python user, but it becomes
harder for users of other languages to benefit from this code (since your
language would need some kind of support for calling Python code, which is not
as readily available as calling C code).</p>
<p>Depending on the requirements, a downside is that Cython is tied to the CPython API,
so generating the extension requires a development environment set up with
the appropriate headers and compiler. This also makes the extension specific
to a Python version: while this is not a problem for source distributions,
generating wheels leads to one wheel for each OS and Python version supported.</p>
<h2>The new implementation</h2>
<p>This is the overall architecture of the Rust implementation:
<a href="https://github.com/luizirber/2018-python-rust/blob/master/04.rust_impl.md"><img src="/images/arch_rust.png" title="The Rust implementation" alt=""></a>
It is pretty close to what <code>symbolic</code> does,
so let's walk through it.</p>
<h3>The Rust code</h3>
<p>If you take a look at my Rust code, you will see it is very... C++. A lot of the
code is very similar to the original implementation, which is both a curse and a
blessing: I'm pretty sure there are more idiomatic and performant ways of doing
things, but most of the time I could lean on my current mental model for C++ to
translate code. The biggest exception was the <code>merge</code> function, where I was doing
something in the C++ implementation that the borrow checker didn't like.
Eventually I found it was because it couldn't keep track of the lifetime
correctly and putting braces around it fixed the problem,
which was both an epiphany and a WTF moment. <a href="https://play.rust-lang.org/?gist=eae9de12950d1b2a7699cd49a3571c37&version=stable">Here</a> is an example that triggers
the problem, and the <a href="https://play.rust-lang.org/?gist=c8733c5125766930a589c8d0412af99c&version=stable">solution</a>.</p>
<p>"Fighting the borrow checker" seems to be a common theme while learning Rust,
but the compiler really tries to help you to understand what is happening and
(most times) how to fix it. A lot of people grow to hate the borrow checker,
but I see it more as a 'eat your vegetables' situation: you might not like it
at first, but it's better in the long run. Even though I don't have a big
codebase in Rust yet, it keeps you from doing things that will come back to bite
you hard later.</p>
<h3>Generating C headers for Rust code: cbindgen</h3>
<p>With the Rust library working, the next step was to take the Rust code and generate C headers describing the
functions and structs we expose with the <code>#[no_mangle]</code> attribute in Rust
(these are defined in the <a href="https://github.com/luizirber/sourmash-rust/blob/ead9ae0ed3b2d16c9e3b8379919f3bfd2efd21ae/src/ffi.rs"><code>ffi.rs</code></a> module in <code>sourmash-rust</code>).
This attribute tells the Rust compiler not to mangle the symbol names, so that
(together with an <code>extern "C"</code> declaration) the functions follow the C ABI
and can be called from other languages that implement FFI
mechanisms. The FFI (foreign function interface) is quite low-level,
and pretty much defines things that C can represent: integers, floats, pointers
and structs. It doesn't support higher level concepts like objects or generics,
so in a sense it looks like a feature funnel.
This might sound bad, but ends up being something that other languages can
understand without needing too much extra functionality in their runtimes,
which means that most languages have support for calling code through an FFI.</p>
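<p>To make the "things that C can represent" point concrete, here is a minimal sketch using Python's <code>ctypes</code>. The <code>SketchParams</code> struct and its fields are hypothetical (they are not the layout used in <code>sourmash-rust</code>); the point is only that integers, structs and pointers are all that crosses the boundary.</p>

```python
import ctypes


class SketchParams(ctypes.Structure):
    """Hypothetical C-compatible struct: plain integers, laid out like a C struct."""
    _fields_ = [
        ("num", ctypes.c_uint32),       # number of hashes to keep
        ("ksize", ctypes.c_uint32),     # k-mer size
        ("max_hash", ctypes.c_uint64),  # upper bound for accepted hashes
    ]


params = SketchParams(num=500, ksize=21, max_hash=0)

# A pointer is just an address: writing through it changes the same memory.
ptr = ctypes.pointer(params)
ptr.contents.ksize = 31

print(params.ksize)                 # the value written through the pointer
print(ctypes.sizeof(SketchParams))  # size in bytes of the C layout
```

<p>Anything richer than this (methods, generics, ownership) has to be rebuilt on each side of the interface, which is the "feature funnel" in practice.</p>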
<p>Writing the C header by hand is possible, but is very error prone.
A better solution is to use <a href="https://github.com/eqrion/cbindgen"><code>cbindgen</code></a>,
a program that takes Rust code and generates a C header file automatically.
<code>cbindgen</code> is developed primarily to generate the C headers for <a href="https://github.com/servo/webrender/">webrender</a>,
the GPU-based renderer for Servo,
so it's pretty likely that if it can handle a complex codebase it will work
just fine for the majority of projects.</p>
<h3>Interfacing with Python: CFFI and Milksnake</h3>
<p>Once we have the C headers, we can use the FFI to
call Rust code in Python. Python has an FFI module in the standard library: <code>ctypes</code>,
but the PyPy developers also created CFFI, which has more features.</p>
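<p>As a rough illustration of what an FFI call looks like from the Python side, this sketch uses <code>ctypes</code> from the standard library. Since a compiled Rust library is not at hand here, CPython's own C API (reachable through <code>ctypes.pythonapi</code>) stands in for it; that substitution is an assumption made only to keep the example self-contained, and it requires CPython.</p>

```python
import ctypes
import sys

# ctypes.pythonapi exposes the C functions of the running CPython interpreter.
get_version = ctypes.pythonapi.Py_GetVersion

# The equivalent of reading the C header: `const char *Py_GetVersion(void);`
get_version.argtypes = []
get_version.restype = ctypes.c_char_p

# The call crosses into C and the returned C string comes back as bytes.
version = get_version().decode("utf-8")
print(version)
```

<p>Declaring <code>argtypes</code> and <code>restype</code> by hand is exactly the step that a generated header plus CFFI takes care of automatically.</p>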
<p>The C headers generated by cbindgen can be interpreted by CFFI to generate
a low-level Python interface for the code. This is the equivalent of declaring
the functions/methods and structs/classes in a <code>pxd</code> file (in the Cython
world): while the code is now usable in Python, it is not well adapted to
the features and idioms available in the language.</p>
<p>Milksnake is the package developed by Sentry that takes care of running cargo
for the Rust compilation and generating the CFFI boilerplate,
making it easy to load the low-level CFFI bindings in Python.
With this low-level binding available we can now write something more Pythonic
(the <code>pyx</code> file in Cython), and I ended up just renaming the <code>_minhash.pyx</code> file
back to <code>minhash.py</code> and doing one-line fixes to replace the Cython-style code
with the equivalent CFFI calls.</p>
<p>All of these changes should be transparent to the Python code, and to guarantee
that I made sure that all the current tests that we have (both for the Python
module and the command line interface) are still working after the changes.
It also led to finding some quirks in the implementation,
and even improvements in the current C++ code (because we were moving a lot of
data from C++ to Python).</p>
<h2>Where I see this going</h2>
<p>It seems it worked as an experiment,
and I presented a <a href="https://github.com/luizirber/2018-python-rust">poster</a> at <a href="https://gccbosc2018.sched.com/event/FEWp/b23-oxidizing-python-writing-extensions-in-rust">GCCBOSC 2018</a> and <a href="https://scipy2018.scipy.org/ehome/index.php?eventid=299527&tabid=712461&cid=2233543&sessionid=21618890&sessionchoice=1&">SciPy 2018</a> that
was met with excitement by many people.
Knowing that it is possible,
I want to reiterate some points why Rust is pretty exciting for bioinformatics
and science in general.</p>
<h3>Bioinformatics as libraries (and command line tools too!)</h3>
<p>Bioinformatics is an umbrella term for many different methods, depending on
what analysis you want to do with your data (or model).
In this sense, it's distinct from other scientific areas where it is possible
to rely on a common set of libraries (numpy in linear algebra, for example), since a
library supporting many disjoint methods tends to grow too big and hard to
maintain.</p>
<p>The environment also tends to be very diverse, with different languages being
used to implement the software. Because it is hard to interoperate,
the methods tend to be implemented in command line programs that are stitched
together in pipelines: workflows describing how to connect the input and output of many different tools to
generate results.
Because the basic unit is a command-line tool,
pipelines tend to rely on standard operating system abstractions like
files and pipes to make the tools communicate with each other. But since tools
might have input requirements distinct from what the previous tool provides,
many times it is necessary to do format conversion or other adaptations to make the
pipeline work.</p>
<p>Using tools as black boxes, controllable through specific parameters at the
command-line level, makes exploratory analysis and algorithm reuse harder:
if something needs to be investigated the user needs to resort to perturbations
of the parameters or the input data, without access to the more feature-rich and
meaningful abstractions happening inside the tool.</p>
<p>Even if many languages are used for writing the software, most of the time there
is some part written in C or C++ for performance reasons, and these tend to be
the core data structures of the computational method. Because it is not easy to
package your C/C++ code in a way that other people can readily use it,
most of this code is reinvented over and over again, or is copied and pasted into
codebases and starts diverging over time. Rust helps solve this problem with the
integrated package management, and due to the FFI it can also be reused inside
other programs written in other languages.</p>
<p>sourmash is not going to be Rust-only and abandon Python,
and it would be crazy to do so when it has so many great exploratory tools
for scientific discovery. But now we can also use our method in other languages
and environments, instead of having our code stuck in one language.</p>
<h3>Don't rewrite it all!</h3>
<p>I could have gone all the way and rewritten sourmash in Rust<sup id=sf-sourmash-rust-3-back><a href=#sf-sourmash-rust-3 class=simple-footnote title="and risk being kicked out of the lab =P">3</a></sup>, but it would be incredibly disruptive for
the current sourmash users and it would take way longer to pull off. Because
Rust is so focused on supporting existing code, you can do a slow transition and
reuse what you already have while moving into more and more Rust code.
A great example is this one-day effort by Rob Patro to bring <a href="https://github.com/COMBINE-lab/cqf-rust">CQF</a> (a C
codebase) into Rust, using <code>bindgen</code> (a generator of C bindings for Rust).
Check <a href="https://blog.luizirber.org/2018/08/27/sourmash-wasm/">the Twitter thread</a> for more =]</p>
<h3>Good scientific citizens</h3>
<p>There is another MinHash implementation already written in Rust, <a href="https://github.com/onecodex/finch-rs">finch</a>.
Early in my experiment I got an email from them asking if I wanted to work
together, but since I wanted to learn the language I kept doing my thing. (They
were totally cool with this, by the way). But the fun thing is that Rust has a
pair of traits called <a href="https://doc.rust-lang.org/rust-by-example/conversion/from_into.html"><code>From</code> and <code>Into</code></a> that you can implement for your
type, and so I <a href="https://github.com/luizirber/sourmash-rust/pull/1">did that</a> and now we can have interoperable
implementations. This synergy allows <code>finch</code> to use <code>sourmash</code> methods,
and vice versa.</p>
<p>Maybe this sounds like a small thing, but I think it is really exciting. We can
stop having incompatible but very similar methods, and instead all benefit from
each other's advances in a way that is supported by the language.</p>
<h2>Next time!</h2>
<p>Turns out Rust supports WebAssembly as a target,
so... what if we run sourmash in the browser?
That's what I'm covering in the <a href="https://blog.luizirber.org/2018/08/27/sourmash-wasm/">next blog post</a>,
so stay tuned =]</p>
<h2>Comments?</h2>
<ul>
<li><a href="https://social.lasanha.org/@luizirber/100602525575698239">Thread on Mastodon</a></li>
<li><a href="https://www.reddit.com/r/rust/comments/99vakd/blog_post_converting_c_to_rust_and_interoperate/">Thread on reddit</a></li>
<li><a href="https://twitter.com/luizirber/status/1032779995129597952">Thread on Twitter</a></li>
</ul>
<!-- rust libs -->
<!-- why rust -->
<!-- WASM and demos -->
<!-- python and FFI -->
<!-- rust examples -->
<!-- cbindgen -->
<!-- rust for pythonistas -->
<!-- understanding rust -->
<!-- rust and bio -->
<!-- rust error handling -->
<!-- other notes
- Implement Default trait for defaults (less verbose than ::new...)
https://doc.rust-lang.org/std/default/trait.Default.html
--><section class=footnotes><hr><h2>Footnotes</h2><ol><li id=sf-sourmash-rust-1><p>The creator of the language is known to keep making up different
explanations for the <a href="https://www.reddit.com/r/rust/comments/27jvdt/internet_archaeology_the_definitive_endall_source/">name of the language</a>,
but in this case "oxidation" refers to the chemical process that creates
rust, and rust is the closest thing to metal (metal being the hardware).
There are many terrible puns in the Rust community. <a href=#sf-sourmash-rust-1-back class=simple-footnote-back>↩</a></p></li><li id=sf-sourmash-rust-2><p>Even more now that it hit 1.0,
it is a really nice language <a href=#sf-sourmash-rust-2-back class=simple-footnote-back>↩</a></p></li><li id=sf-sourmash-rust-3><p>and risk
being kicked out of the lab =P <a href=#sf-sourmash-rust-3-back class=simple-footnote-back>↩</a></p></li></ol></section>Minhashing all the things (part 1): microbial genomes2016-12-28T12:00:00-02:002016-12-28T12:00:00-02:00luizirbertag:blog.luizirber.org,2016-12-28:/2016/12/28/soursigs-arch-1/<p>With the <a href="http://ivory.idyll.org/blog/2016-sourmash-sbt.html">MinHash</a> <a href="http://ivory.idyll.org/blog/2016-sourmash-sbt-more.html">craze</a> currently going on in the <a href="http://ivory.idyll.org/lab/">lab</a>,
we started discussing how to calculate signatures efficiently,
how to index them for search and also how to distribute them.
As a proof of concept I started implementing a system to read public data available on the <a href="https://www.ncbi.nlm.nih.gov/sra">Sequence Read Archive</a>,
as well as a variation of the <a href="https://www.cs.cmu.edu/~ckingsf/software/bloomtree/">Sequence Bloom Tree</a> using MinHashes as leaves/datasets instead of the whole k-mer set (as Bloom Filters).</p>
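<p>For readers who haven't met MinHash signatures before, here is a toy sketch of the idea: hash every k-mer, keep only the smallest hashes, and compare the kept hashes to estimate similarity. The function names, the tiny <code>ksize</code> and <code>num</code> defaults and the use of BLAKE2 as the hash are all assumptions for illustration; this is not how sourmash implements it.</p>

```python
import hashlib


def kmers(seq, ksize):
    """Yield every substring of length ksize."""
    for i in range(len(seq) - ksize + 1):
        yield seq[i:i + ksize]


def hash_kmer(kmer):
    """Stand-in 64-bit hash (not the hash function sourmash uses)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def signature(seq, ksize=5, num=10):
    """Keep the `num` smallest distinct k-mer hashes as the sketch."""
    return sorted({hash_kmer(k) for k in kmers(seq, ksize)})[:num]


def similarity(sig_a, sig_b, num=10):
    """Estimate Jaccard similarity from two bottom-`num` sketches."""
    merged = sorted(set(sig_a) | set(sig_b))[:num]
    if not merged:
        return 0.0
    shared = set(sig_a) & set(sig_b)
    return sum(1 for h in merged if h in shared) / len(merged)


sig = signature("ACGTACGTTGCAAGCTTAGC")
print(similarity(sig, sig))
```

<p>The signature stays small no matter how large the dataset is, which is what makes it practical to compute, index and distribute them.</p>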
<p>Since this is a PoC,
I also wanted to explore some solutions that allow maintaining the least amount of explicit servers:
I'm OK with offloading a queue system to <a href="https://aws.amazon.com/sqs/">Amazon SQS</a> instead of maintaining a server running <a href="https://www.rabbitmq.com/">RabbitMQ</a>,
for example.
Even with all the DevOps movement you still can't ignore the Ops part,
and if you have a team to run your infrastructure,
good for you!
But I'm a grad student and the last thing I want to be doing is babysitting servers =]</p>
<h2>Going serverless: AWS Lambda</h2>
<p>The first plan was to use <a href="https://aws.amazon.com/lambda/">AWS Lambda</a> to calculate signatures.
Lambda is a service that exposes functions,
and it manages all the runtime details (server provisioning and so on),
while charging by the time and memory it takes to run the function.
Despite all the promises,
it is a bit annoying to balance everything to make a useful Lambda,
so I used the <a href="https://gordon.readthedocs.io/en/latest/">Gordon framework</a> to structure it.
I was pretty happy with it,
until I added our <a href="https://github.com/dib-lab/sourmash">MinHash package</a> and,
since it is a C++ extension,
needed to compile and send the resulting package to Lambda.
I was using my local machine for that,
but Lambda packaging is pretty much 'put all the Python files in one directory,
compress and upload it to S3',
which of course didn't work because I don't have the same library versions that <a href="https://aws.amazon.com/amazon-linux-ami/">Amazon Linux</a> runs.
I managed to hack a <a href="https://github.com/jorgebastida/gordon/compare/master...luizirber:refactor/python_package">fix</a>,
but it would be wonderful if Amazon adopted wheels and stayed more in line with the <a href="https://packaging.python.org/">Python Packaging Authority</a> solutions
(and hey, <a href="https://www.python.org/dev/peps/pep-0513/">binary wheels</a> even work on Linux now!).</p>
<p>Anyway,
after I deployed the Lambda function and tried to run it...
I fairly quickly realized that 5 minutes is far too short to calculate a signature.
This is not a CPU-bound problem,
it's just that we are downloading the data and network I/O is the bottleneck.
I think Lambda will still be a good solution together with <a href="https://aws.amazon.com/api-gateway/">API Gateway</a>
for triggering calculations and providing other useful services despite the drawbacks,
but at this point I started looking for alternative architectures.</p>
<h2>Back to the comfort zone: Snakemake</h2>
<p>Focusing on computing signatures first and thinking about other issues later,
I wrote a quick <a href="https://bitbucket.org/snakemake/snakemake/wiki/Home">Snakemake</a> rules file and started calculating signatures
for all the <a href="https://github.com/luizirber/soursigs/blob/6c6acf6429cec2e2e4a076dfc32adbf27fab1eed/Snakefile#L81">transcriptomic</a> datasets I could find on the SRA.
Totaling 671 TB,
it was way over my storage capacity,
but since both the <a href="https://github.com/ncbi/sra-tools">SRA Toolkit</a> and <a href="https://github.com/dib-lab/sourmash">sourmash</a> have streaming modes,
I piped the output of the first as the input for the second and... voila!
We have a duct-taped but working system.
Again,
the issue becomes network bottlenecks:
the SRA seems to limit each IP to ~100 Mbps,
so it would take 621 days to calculate everything.
Classes were happening during this development,
so I just considered it good enough and started running it on a 32-core server hosted at <a href="https://www.rackspace.com/openstack/public">Rackspace</a>
to at least have some signatures to play with.</p>
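<p>The 621 days figure follows directly from the numbers above. A quick back-of-the-envelope check, assuming decimal units (1 TB = 10<sup>12</sup> bytes) and a link that stays saturated the whole time:</p>

```python
def transfer_days(size_bytes, bits_per_second):
    """Days needed to move size_bytes over a link of the given speed."""
    seconds = size_bytes * 8 / bits_per_second
    return seconds / 86_400  # seconds in a day


TB = 10**12    # decimal terabyte (an assumption)
MBPS = 10**6   # one megabit per second

days = transfer_days(671 * TB, 100 * MBPS)
print(round(days))
```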
<h2>Offloading computation: Celery + Amazon SQS</h2>
<p>With classes over,
we changed directions a bit:
instead of going through the transcriptomic dataset,
we decided to focus on microbial genomes,
especially all those unassembled ones on SRA.
(We didn't forget the transcriptomic dataset,
but microbial genomes are small-ish,
more manageable and we already have the microbial SBTs to search against).
There are 412k SRA IDs matching the <a href="https://github.com/luizirber/soursigs/blob/a049cbc5733adbcffaaf91e176bbcda43763ed23/Snakefile#L71">new search</a>,
totalling 28 TB of data.
We have storage to save it,
but since we want a scalable solution (something that would work with the 8 PB of data in the SRA,
for example),
I avoided downloading all the data beforehand and kept doing it in a streaming way.</p>
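<p>The streaming idea can be sketched in a few lines of Python: records are parsed lazily from a stream, so nothing needs to be stored, and the consumer can stop early. The function names are hypothetical and the parser assumes well-formed FASTQ with exactly four lines per record; an in-memory stream stands in for the piped download.</p>

```python
import io
from itertools import islice


def fastq_records(stream):
    """Lazily yield (name, sequence, quality) from a FASTQ text stream."""
    while True:
        header = stream.readline()
        if not header:
            return  # end of stream
        seq = stream.readline().rstrip("\n")
        stream.readline()  # skip the '+' separator line
        qual = stream.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def first_reads(stream, limit):
    """Consume at most `limit` records; the rest of the stream is left unread."""
    return islice(fastq_records(stream), limit)


demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n")
for name, seq, _ in first_reads(demo, 2):
    print(name, seq)
```

<p>In a real pipeline the stream would be <code>sys.stdin</code>, fed by whatever tool is downloading the data.</p>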
<p>I started to redesign the Snakemake solution:
first thing was to move the body of the rule to a <a href="http://docs.celeryproject.org/en/latest/userguide/tasks.html">Celery task</a>
and use Snakemake to control what tasks to run and get the results,
but send the computation to a (local or remote) Celery worker.
I checked other work queue solutions,
but they were either too simple or required running specialized servers.
(And thanks to <a href="http://ggmarcondes.com">Gabriel Marcondes</a> for enlightening me about how to best
use Celery!)
With Celery I managed to use <a href="https://aws.amazon.com/sqs/">Amazon SQS</a> as a broker
(the queue of tasks to be executed,
in Celery parlance),
and <a href="https://github.com/robgolding/celery-s3">celery-s3</a> as the results backend.
While not an official part of Celery,
using S3 to keep results allowed me to avoid deploying another service
(usually Celery uses Redis or RabbitMQ as the result backend).
I didn't configure it properly tho,
and ended up racking up \$200 in charges because I was querying S3 too much,
but my advisor thought it was <a href="https://twitter.com/ctitusbrown/status/812003429535006721">funny and mocked me on Twitter</a> (I don't mind,
he is the one paying the bill =P).
For initial tests I just ran the workers locally on the 32-core server,
but... What if the worker was easy to deploy,
and other people wanted to run additional workers?</p>
<h3>Docker workers</h3>
<p>I wrote a <a href="https://github.com/luizirber/soursigs/blob/6c6acf6429cec2e2e4a076dfc32adbf27fab1eed/Dockerfile">Dockerfile</a> with all the dependencies,
and made it available on <a href="https://hub.docker.com/r/luizirber/soursigs/tags/">Docker hub</a>.
I still need to provide credentials to access SQS and S3,
but now I can deploy workers anywhere,
even... on the <a href="https://cloud.google.com/">Google Cloud Platform</a>.
They have a free trial with \$300 in credits,
so I used the <a href="https://cloud.google.com/container-engine/">Container Engine</a> to deploy a <a href="http://kubernetes.io/">Kubernetes</a> cluster and run
workers under a <a href="http://kubernetes.io/docs/user-guide/replication-controller/">Replication Controller</a>.</p>
<p>Just to keep track: we are posting Celery tasks from a Rackspace server
to Amazon SQS,
running workers inside Docker managed by Kubernetes on GCP,
putting results on Amazon S3
and finally reading the results on Rackspace and then posting them to <a href="https://ipfs.io/ipns/minhash.oxli.org/microbial/">IPFS</a>.
IPFS is the Interplanetary File System,
a decentralized solution to share data.
But more about this later!</p>
<h3>HPCC workers</h3>
<p>Even with Docker workers running on GCP and the Rackspace server,
it was progressing slowly and,
while it wouldn't be terribly expensive to spin up more nodes on GCP,
I decided to go use the resources we already have:
the <a href="https://wiki.hpcc.msu.edu/">MSU HPCC</a>.
I couldn't run Docker containers there (HPC is wary of Docker,
but <a href="https://github.com/NERSC/2016-11-14-sc16-Container-Tutorial">we are trying to change that!</a>),
so I used Conda to create a clean environment and used the <a href="https://github.com/luizirber/soursigs/blob/a049cbc5733adbcffaaf91e176bbcda43763ed23/requirements.txt">requirements</a>
file (coupled with some <code>PATH</code> magic) to replicate what I have inside the Docker container.
The Dockerfile was very useful,
because I mostly ran the same commands to recreate the environment.
Finally,
I wrote a <a href="https://github.com/luizirber/soursigs/blob/a049cbc5733adbcffaaf91e176bbcda43763ed23/submit">submission script</a> to start a job array with 40 jobs,
and after a bit of tuning I decided to use 12 Celery workers for each job,
totalling 480 workers.</p>
<p>This solution still requires a bit of babysitting,
especially when I was tuning how many workers to run per job,
but it achieved around 1600 signatures per hour,
leading to about 10 days to calculate for all 412k datasets.
Instead of downloading the whole dataset,
we are <a href="https://github.com/luizirber/soursigs/blob/a049cbc5733adbcffaaf91e176bbcda43763ed23/soursigs/tasks.py#L15">reading the first million reads</a> and using our <a href="https://peerj.com/preprints/890/">streaming error trimming</a>
solution to calculate the signatures
(and also to test if it is the best solution for this case).</p>
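<p>The numbers in the last two paragraphs can be checked with some quick arithmetic. This assumes the observed rate stays constant and that all workers are busy the whole time:</p>

```python
def days_to_finish(datasets, signatures_per_hour):
    """Days needed to process every dataset at a constant hourly rate."""
    return datasets / signatures_per_hour / 24


workers = 40 * 12                     # 40 jobs in the array, 12 workers each
days = days_to_finish(412_000, 1600)  # 412k datasets at ~1600 signatures/hour
per_worker = 1600 / workers           # signatures per hour for a single worker

print(workers, round(days, 1), round(per_worker, 1))
```

<p>Each worker finishes only a few signatures per hour, which is consistent with the download being the slow part.</p>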
<h3>Clever algorithms are better than brute force?</h3>
<p>While things were progressing,
Titus was using the <a href="https://github.com/dib-lab/sourmash/pull/45">Sequence Bloom Tree + Minhash</a> code to categorize the new datasets into the 50k genomes in the RefSeq database,
but 99% of the signatures didn't match anything.
After assembling a dataset that didn't match,
he found out it did match something,
so... The current approach is not so good.</p>
<p>(UPDATE: it was a bug in the search,
so this way of calculating signatures probably also works.
Anyway,
the next approach is faster and more reasonable,
so yay bug!)</p>
<p>Yesterday he came up with a new way to filter solid k-mers instead of doing
error trimming (and named it... <a href="https://github.com/dib-lab/syrah">syrah</a>?
Oh, SyRAh...
So many puns in this lab).
I <a href="https://github.com/luizirber/soursigs/blob/a049cbc5733adbcffaaf91e176bbcda43763ed23/soursigs/tasks.py#L34">created a new Celery task</a> and refactored the Snakemake rule,
and started running it again...
And wow is it faster!
It is currently doing around 4200 signatures per hour,
and it will end in less than five days.
The syrah approach probably works for the vast majority of the SRA,
but metagenomes and metatranscriptomes will probably fail because the minority members of the population will not be represented.
But hey,
we have people in the lab working on that too =]</p>
<h2>Future</h2>
<p>The solution works,
but several improvements can be made.
First,
I use Snakemake at both ends,
both to keep track of the work done and get the workers' results.
I can make the workers a bit smarter and post the results to an S3 bucket,
and so I only need to use Snakemake to track what work needs to be done and post tasks to the queue.
This removes the need for celery-s3 and querying S3 all the time,
and opens the path to use Lambda again to trigger updates to IPFS.</p>
<p>I'm insisting on using IPFS to make the data available because...
Well, it is super cool!
I always wanted to have a system like bittorrent to distribute data,
but IPFS builds on top of other very good ideas from bitcoin (bitswap),
and git (the DAG representation) to make a resilient system and,
even more important,
something that can be used in a scientific context to both increase bandwidth for important resources (like, well, the SRA)
and to make sure data can stay around if the centralized solution goes away.
The <a href="https://github.com/ga4gh/cgtd">Cancer Gene Trust</a> project is already using it,
and I do hope more projects show up and adopt IPFS as a first-class dependency.
And,
even crazier,
we can actually use IPFS to store our SBT implementation,
but more about this in part 2!</p>When life gives you lemons2015-10-05T12:00:00-03:002015-10-05T12:00:00-03:00luizirbertag:blog.luizirber.org,2015-10-05:/2015/10/05/zotero/<p>Two weeks ago,
during the CS graduate orientation at <a href="http://ucdavis.edu">UC Davis</a>,
the IT person was showing that we have access to 50 GB on <a href="https://ucdavis.box.com/">Box</a> and UNLIMITED storage on Google Drive
(I'm currently testing the limits of UNLIMITED, stay tuned).
I promptly forgot about Box because I've been using Syncthing for file syncing and it's been working quite well,
or at least way better than ownCloud (which took infinite amounts of time to sync my stuff).</p>
<p>Anyway,
today I was checking <a href="http://www.papershipapp.com/">PaperShip</a> (wonderful app, by the way)
and noticed my <a href="https://www.zotero.org/">Zotero</a> free account was almost full (it is only 300 MB).
I knew Zotero had <a href="https://www.zotero.org/support/sync#webdav">WebDAV</a> support,
but I don't want to maintain a WebDAV server.
Turns out Box has <a href="https://support.box.com/hc/en-us/articles/200519748-Does-Box-support-WebDAV-">WebDAV support</a>,
and I even found <a href="http://web.archive.org/web/20180617051357/http://guides.library.cornell.edu/zotero_to_Box">instructions</a> on how to set everything up.</p>HOWTO: Moleculo raw reads coverage in reference genome2014-02-12T12:00:00-02:002014-02-12T12:00:00-02:00luizirbertag:blog.luizirber.org,2014-02-12:/2014/02/12/moleculo/<style type="text/css">/*!
*
* IPython notebook
*
*/
/* CSS font colors for translated ANSI colors. */
.ansibold {
font-weight: bold;
}
/* use dark versions for foreground, to improve visibility */
.ansiblack {
color: black;
}
.ansired {
color: darkred;
}
.ansigreen {
color: darkgreen;
}
.ansiyellow {
color: #c4a000;
}
.ansiblue {
color: darkblue;
}
.ansipurple {
color: darkviolet;
}
.ansicyan {
color: steelblue;
}
.ansigray {
color: gray;
}
/* and light for background, for the same reason */
.ansibgblack {
background-color: black;
}
.ansibgred {
background-color: red;
}
.ansibggreen {
background-color: green;
}
.ansibgyellow {
background-color: yellow;
}
.ansibgblue {
background-color: blue;
}
.ansibgpurple {
background-color: magenta;
}
.ansibgcyan {
background-color: cyan;
}
.ansibggray {
background-color: gray;
}
div.cell {
/* Old browsers */
display: -webkit-box;
-webkit-box-orient: vertical;
-webkit-box-align: stretch;
display: -moz-box;
-moz-box-orient: vertical;
-moz-box-align: stretch;
display: box;
box-orient: vertical;
box-align: stretch;
/* Modern browsers */
display: flex;
flex-direction: column;
align-items: stretch;
border-radius: 2px;
box-sizing: border-box;
-moz-box-sizing: border-box;
-webkit-box-sizing: border-box;
border-width: 1px;
border-style: solid;
border-color: transparent;
width: 100%;
padding: 5px;
/* This acts as a spacer between cells, that is outside the border */
margin: 0px;
outline: none;
border-left-width: 1px;
padding-left: 5px;
background: linear-gradient(to right, transparent -40px, transparent 1px, transparent 1px, transparent 100%);
}
div.cell.jupyter-soft-selected {
border-left-color: #90CAF9;
border-left-color: #E3F2FD;
border-left-width: 1px;
padding-left: 5px;
border-right-color: #E3F2FD;
border-right-width: 1px;
background: #E3F2FD;
}
@media print {
div.cell.jupyter-soft-selected {
border-color: transparent;
}
}
div.cell.selected {
border-color: #ababab;
border-left-width: 0px;
padding-left: 6px;
background: linear-gradient(to right, #42A5F5 -40px, #42A5F5 5px, transparent 5px, transparent 100%);
}
@media print {
div.cell.selected {
border-color: transparent;
}
}
div.cell.selected.jupyter-soft-selected {
border-left-width: 0;
padding-left: 6px;
background: linear-gradient(to right, #42A5F5 -40px, #42A5F5 7px, #E3F2FD 7px, #E3F2FD 100%);
}
.edit_mode div.cell.selected {
border-color: #66BB6A;
border-left-width: 0px;
padding-left: 6px;
background: linear-gradient(to right, #66BB6A -40px, #66BB6A 5px, transparent 5px, transparent 100%);
}
@media print {
.edit_mode div.cell.selected {
border-color: transparent;
}
}
.prompt {
/* This needs to be wide enough for 3 digit prompt numbers: In[100]: */
min-width: 14ex;
/* This padding is tuned to match the padding on the CodeMirror editor. */
padding: 0.4em;
margin: 0px;
font-family: monospace;
text-align: right;
/* This has to match that of the the CodeMirror class line-height below */
line-height: 1.21429em;
/* Don't highlight prompt number selection */
-webkit-touch-callout: none;
-webkit-user-select: none;
-khtml-user-select: none;
-moz-user-select: none;
-ms-user-select: none;
user-select: none;
/* Use default cursor */
cursor: default;
}
@media (max-width: 540px) {
.prompt {
text-align: left;
}
}
div.inner_cell {
min-width: 0;
/* Old browsers */
display: -webkit-box;
-webkit-box-orient: vertical;
-webkit-box-align: stretch;
display: -moz-box;
-moz-box-orient: vertical;
-moz-box-align: stretch;
display: box;
box-orient: vertical;
box-align: stretch;
/* Modern browsers */
display: flex;
flex-direction: column;
align-items: stretch;
/* Old browsers */
-webkit-box-flex: 1;
-moz-box-flex: 1;
box-flex: 1;
/* Modern browsers */
flex: 1;
}
/* input_area and input_prompt must match in top border and margin for alignment */
div.input_area {
border: 1px solid #cfcfcf;
border-radius: 2px;
background: #f7f7f7;
line-height: 1.21429em;
}
/* This is needed so that empty prompt areas can collapse to zero height when there
is no content in the output_subarea and the prompt. The main purpose of this is
to make sure that empty JavaScript output_subareas have no height. */
div.prompt:empty {
padding-top: 0;
padding-bottom: 0;
}
div.unrecognized_cell {
padding: 5px 5px 5px 0px;
/* Old browsers */
display: -webkit-box;
-webkit-box-orient: horizontal;
-webkit-box-align: stretch;
display: -moz-box;
-moz-box-orient: horizontal;
-moz-box-align: stretch;
display: box;
box-orient: horizontal;
box-align: stretch;
/* Modern browsers */
display: flex;
flex-direction: row;
align-items: stretch;
}
div.unrecognized_cell .inner_cell {
border-radius: 2px;
padding: 5px;
font-weight: bold;
color: red;
border: 1px solid #cfcfcf;
background: #eaeaea;
}
div.unrecognized_cell .inner_cell a {
color: inherit;
text-decoration: none;
}
div.unrecognized_cell .inner_cell a:hover {
color: inherit;
text-decoration: none;
}
@media (max-width: 540px) {
div.unrecognized_cell > div.prompt {
display: none;
}
}
div.code_cell {
/* avoid page breaking on code cells when printing */
}
@media print {
div.code_cell {
page-break-inside: avoid;
}
}
/* any special styling for code cells that are currently running goes here */
div.input {
page-break-inside: avoid;
/* Old browsers */
display: -webkit-box;
-webkit-box-orient: horizontal;
-webkit-box-align: stretch;
display: -moz-box;
-moz-box-orient: horizontal;
-moz-box-align: stretch;
display: box;
box-orient: horizontal;
box-align: stretch;
/* Modern browsers */
display: flex;
flex-direction: row;
align-items: stretch;
}
@media (max-width: 540px) {
div.input {
/* Old browsers */
display: -webkit-box;
-webkit-box-orient: vertical;
-webkit-box-align: stretch;
display: -moz-box;
-moz-box-orient: vertical;
-moz-box-align: stretch;
display: box;
box-orient: vertical;
box-align: stretch;
/* Modern browsers */
display: flex;
flex-direction: column;
align-items: stretch;
}
}
/* input_area and input_prompt must match in top border and margin for alignment */
div.input_prompt {
color: #303F9F;
border-top: 1px solid transparent;
}
div.input_area > div.highlight {
margin: 0.4em;
border: none;
padding: 0px;
background-color: transparent;
}
div.input_area > div.highlight > pre {
margin: 0px;
border: none;
padding: 0px;
background-color: transparent;
}
/* The following gets added to the <head> if it is detected that the user has a
* monospace font with inconsistent normal/bold/italic height. See
* notebookmain.js. Such fonts will have keywords vertically offset with
* respect to the rest of the text. The user should select a better font.
* See: https://github.com/ipython/ipython/issues/1503
*
* .CodeMirror span {
* vertical-align: bottom;
* }
*/
.CodeMirror {
line-height: 1.21429em;
/* Changed from 1em to our global default */
font-size: 14px;
height: auto;
/* Changed to auto to autogrow */
background: none;
/* Changed from white to allow our bg to show through */
}
.CodeMirror-scroll {
/* The CodeMirror docs are a bit fuzzy on if overflow-y should be hidden or visible.*/
/* We have found that if it is visible, vertical scrollbars appear with font size changes.*/
overflow-y: hidden;
overflow-x: auto;
}
.CodeMirror-lines {
/* In CM2, this used to be 0.4em, but in CM3 it went to 4px. We need the em value because */
/* we have set a different line-height and want this to scale with that. */
padding: 0.4em;
}
.CodeMirror-linenumber {
padding: 0 8px 0 4px;
}
.CodeMirror-gutters {
border-bottom-left-radius: 2px;
border-top-left-radius: 2px;
}
.CodeMirror pre {
/* In CM3 this went to 4px from 0 in CM2. We need the 0 value because of how we size */
/* .CodeMirror-lines */
padding: 0;
border: 0;
border-radius: 0;
}
/*
Original style from softwaremaniacs.org (c) Ivan Sagalaev <Maniac@SoftwareManiacs.Org>
Adapted from GitHub theme
*/
.highlight-base {
color: #000;
}
.highlight-variable {
color: #000;
}
.highlight-variable-2 {
color: #1a1a1a;
}
.highlight-variable-3 {
color: #333333;
}
.highlight-string {
color: #BA2121;
}
.highlight-comment {
color: #408080;
font-style: italic;
}
.highlight-number {
color: #080;
}
.highlight-atom {
color: #88F;
}
.highlight-keyword {
color: #008000;
font-weight: bold;
}
.highlight-builtin {
color: #008000;
}
.highlight-error {
color: #f00;
}
.highlight-operator {
color: #AA22FF;
font-weight: bold;
}
.highlight-meta {
color: #AA22FF;
}
/* previously not defined, copying from default codemirror */
.highlight-def {
color: #00f;
}
.highlight-string-2 {
color: #f50;
}
.highlight-qualifier {
color: #555;
}
.highlight-bracket {
color: #997;
}
.highlight-tag {
color: #170;
}
.highlight-attribute {
color: #00c;
}
.highlight-header {
color: blue;
}
.highlight-quote {
color: #090;
}
.highlight-link {
color: #00c;
}
/* apply the same style to codemirror */
.cm-s-ipython span.cm-keyword {
color: #008000;
font-weight: bold;
}
.cm-s-ipython span.cm-atom {
color: #88F;
}
.cm-s-ipython span.cm-number {
color: #080;
}
.cm-s-ipython span.cm-def {
color: #00f;
}
.cm-s-ipython span.cm-variable {
color: #000;
}
.cm-s-ipython span.cm-operator {
color: #AA22FF;
font-weight: bold;
}
.cm-s-ipython span.cm-variable-2 {
color: #1a1a1a;
}
.cm-s-ipython span.cm-variable-3 {
color: #333333;
}
.cm-s-ipython span.cm-comment {
color: #408080;
font-style: italic;
}
.cm-s-ipython span.cm-string {
color: #BA2121;
}
.cm-s-ipython span.cm-string-2 {
color: #f50;
}
.cm-s-ipython span.cm-meta {
color: #AA22FF;
}
.cm-s-ipython span.cm-qualifier {
color: #555;
}
.cm-s-ipython span.cm-builtin {
color: #008000;
}
.cm-s-ipython span.cm-bracket {
color: #997;
}
.cm-s-ipython span.cm-tag {
color: #170;
}
.cm-s-ipython span.cm-attribute {
color: #00c;
}
.cm-s-ipython span.cm-header {
color: blue;
}
.cm-s-ipython span.cm-quote {
color: #090;
}
.cm-s-ipython span.cm-link {
color: #00c;
}
.cm-s-ipython span.cm-error {
color: #f00;
}
.cm-s-ipython span.cm-tab {
background: url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAADAAAAAMCAYAAAAkuj5RAAAAAXNSR0IArs4c6QAAAGFJREFUSMft1LsRQFAQheHPowAKoACx3IgEKtaEHujDjORSgWTH/ZOdnZOcM/sgk/kFFWY0qV8foQwS4MKBCS3qR6ixBJvElOobYAtivseIE120FaowJPN75GMu8j/LfMwNjh4HUpwg4LUAAAAASUVORK5CYII=);
background-position: right;
background-repeat: no-repeat;
}
div.output_wrapper {
/* this position must be relative to enable descendents to be absolute within it */
position: relative;
/* Old browsers */
display: -webkit-box;
-webkit-box-orient: vertical;
-webkit-box-align: stretch;
display: -moz-box;
-moz-box-orient: vertical;
-moz-box-align: stretch;
display: box;
box-orient: vertical;
box-align: stretch;
/* Modern browsers */
display: flex;
flex-direction: column;
align-items: stretch;
z-index: 1;
}
/* class for the output area when it should be height-limited */
div.output_scroll {
/* ideally, this would be max-height, but FF barfs all over that */
height: 24em;
/* FF needs this *and the wrapper* to specify full width, or it will shrinkwrap */
width: 100%;
overflow: auto;
border-radius: 2px;
-webkit-box-shadow: inset 0 2px 8px rgba(0, 0, 0, 0.8);
box-shadow: inset 0 2px 8px rgba(0, 0, 0, 0.8);
display: block;
}
/* output div while it is collapsed */
div.output_collapsed {
margin: 0px;
padding: 0px;
/* Old browsers */
display: -webkit-box;
-webkit-box-orient: vertical;
-webkit-box-align: stretch;
display: -moz-box;
-moz-box-orient: vertical;
-moz-box-align: stretch;
display: box;
box-orient: vertical;
box-align: stretch;
/* Modern browsers */
display: flex;
flex-direction: column;
align-items: stretch;
}
div.out_prompt_overlay {
height: 100%;
padding: 0px 0.4em;
position: absolute;
border-radius: 2px;
}
div.out_prompt_overlay:hover {
/* use inner shadow to get border that is computed the same on WebKit/FF */
-webkit-box-shadow: inset 0 0 1px #000;
box-shadow: inset 0 0 1px #000;
background: rgba(240, 240, 240, 0.5);
}
div.output_prompt {
color: #D84315;
}
/* This class is the outer container of all output sections. */
div.output_area {
padding: 0px;
page-break-inside: avoid;
/* Old browsers */
display: -webkit-box;
-webkit-box-orient: horizontal;
-webkit-box-align: stretch;
display: -moz-box;
-moz-box-orient: horizontal;
-moz-box-align: stretch;
display: box;
box-orient: horizontal;
box-align: stretch;
/* Modern browsers */
display: flex;
flex-direction: row;
align-items: stretch;
}
div.output_area .MathJax_Display {
text-align: left !important;
}
div.output_area img,
div.output_area svg {
max-width: 100%;
height: auto;
}
div.output_area img.unconfined,
div.output_area svg.unconfined {
max-width: none;
}
/* This is needed to protect the pre formating from global settings such
as that of bootstrap */
.output {
/* Old browsers */
display: -webkit-box;
-webkit-box-orient: vertical;
-webkit-box-align: stretch;
display: -moz-box;
-moz-box-orient: vertical;
-moz-box-align: stretch;
display: box;
box-orient: vertical;
box-align: stretch;
/* Modern browsers */
display: flex;
flex-direction: column;
align-items: stretch;
}
@media (max-width: 540px) {
div.output_area {
/* Old browsers */
display: -webkit-box;
-webkit-box-orient: vertical;
-webkit-box-align: stretch;
display: -moz-box;
-moz-box-orient: vertical;
-moz-box-align: stretch;
display: box;
box-orient: vertical;
box-align: stretch;
/* Modern browsers */
display: flex;
flex-direction: column;
align-items: stretch;
}
}
div.output_area pre {
margin: 0;
padding: 0;
border: 0;
vertical-align: baseline;
color: black;
background-color: transparent;
border-radius: 0;
}
/* This class is for the output subarea inside the output_area and after
the prompt div. */
div.output_subarea {
overflow-x: auto;
padding: 0.4em;
/* Old browsers */
-webkit-box-flex: 1;
-moz-box-flex: 1;
box-flex: 1;
/* Modern browsers */
flex: 1;
max-width: calc(100% - 14ex);
}
div.output_scroll div.output_subarea {
overflow-x: visible;
}
/* The rest of the output_* classes are for special styling of the different
output types */
/* all text output has this class: */
div.output_text {
text-align: left;
color: #000;
/* This has to match that of the the CodeMirror class line-height below */
line-height: 1.21429em;
}
/* stdout/stderr are 'text' as well as 'stream', but execute_result/error are *not* streams */
div.output_stderr {
background: #fdd;
/* very light red background for stderr */
}
div.output_latex {
text-align: left;
}
/* Empty output_javascript divs should have no height */
div.output_javascript:empty {
padding: 0;
}
.js-error {
color: darkred;
}
/* raw_input styles */
div.raw_input_container {
line-height: 1.21429em;
padding-top: 5px;
}
pre.raw_input_prompt {
/* nothing needed here. */
}
input.raw_input {
font-family: monospace;
font-size: inherit;
color: inherit;
width: auto;
/* make sure input baseline aligns with prompt */
vertical-align: baseline;
/* padding + margin = 0.5em between prompt and cursor */
padding: 0em 0.25em;
margin: 0em 0.25em;
}
input.raw_input:focus {
box-shadow: none;
}
p.p-space {
margin-bottom: 10px;
}
div.output_unrecognized {
padding: 5px;
font-weight: bold;
color: red;
}
div.output_unrecognized a {
color: inherit;
text-decoration: none;
}
div.output_unrecognized a:hover {
color: inherit;
text-decoration: none;
}
.rendered_html {
color: #000;
/* any extras will just be numbers: */
}
.rendered_html :link {
text-decoration: underline;
}
.rendered_html :visited {
text-decoration: underline;
}
.rendered_html h1:first-child {
margin-top: 0.538em;
}
.rendered_html h2:first-child {
margin-top: 0.636em;
}
.rendered_html h3:first-child {
margin-top: 0.777em;
}
.rendered_html h4:first-child {
margin-top: 1em;
}
.rendered_html h5:first-child {
margin-top: 1em;
}
.rendered_html h6:first-child {
margin-top: 1em;
}
.rendered_html * + ul {
margin-top: 1em;
}
.rendered_html * + ol {
margin-top: 1em;
}
.rendered_html pre,
.rendered_html tr,
.rendered_html th,
.rendered_html td,
.rendered_html * + table {
margin-top: 1em;
}
.rendered_html * + p {
margin-top: 1em;
}
.rendered_html * + img {
margin-top: 1em;
}
div.text_cell {
/* Old browsers */
display: -webkit-box;
-webkit-box-orient: horizontal;
-webkit-box-align: stretch;
display: -moz-box;
-moz-box-orient: horizontal;
-moz-box-align: stretch;
display: box;
box-orient: horizontal;
box-align: stretch;
/* Modern browsers */
display: flex;
flex-direction: row;
align-items: stretch;
}
@media (max-width: 540px) {
div.text_cell > div.prompt {
display: none;
}
}
div.text_cell_render {
/*font-family: "Helvetica Neue", Arial, Helvetica, Geneva, sans-serif;*/
outline: none;
resize: none;
width: inherit;
border-style: none;
padding: 0.5em 0.5em 0.5em 0.4em;
color: #000;
box-sizing: border-box;
-moz-box-sizing: border-box;
-webkit-box-sizing: border-box;
}
a.anchor-link:link {
text-decoration: none;
padding: 0px 20px;
visibility: hidden;
}
h1:hover .anchor-link,
h2:hover .anchor-link,
h3:hover .anchor-link,
h4:hover .anchor-link,
h5:hover .anchor-link,
h6:hover .anchor-link {
visibility: visible;
}
.text_cell.rendered .input_area {
display: none;
}
.text_cell.unrendered .text_cell_render {
display: none;
}
.cm-header-1,
.cm-header-2,
.cm-header-3,
.cm-header-4,
.cm-header-5,
.cm-header-6 {
font-weight: bold;
font-family: "Helvetica Neue", Helvetica, Arial, sans-serif;
}
.cm-header-1 {
font-size: 185.7%;
}
.cm-header-2 {
font-size: 157.1%;
}
.cm-header-3 {
font-size: 128.6%;
}
.cm-header-4 {
font-size: 110%;
}
.cm-header-5 {
font-size: 100%;
font-style: italic;
}
.cm-header-6 {
font-size: 100%;
font-style: italic;
}
</style>
<style type="text/css">pre { line-height: 125%; }
td.linenos .normal { color: inherit; background-color: transparent; padding-left: 5px; padding-right: 5px; }
span.linenos { color: inherit; background-color: transparent; padding-left: 5px; padding-right: 5px; }
td.linenos .special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
span.linenos.special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
.highlight .hll { background-color: #ffffcc }
.highlight { background: #f8f8f8; }
.highlight .c { color: #3D7B7B; font-style: italic } /* Comment */
.highlight .err { border: 1px solid #FF0000 } /* Error */
.highlight .k { color: #008000; font-weight: bold } /* Keyword */
.highlight .o { color: #666666 } /* Operator */
.highlight .ch { color: #3D7B7B; font-style: italic } /* Comment.Hashbang */
.highlight .cm { color: #3D7B7B; font-style: italic } /* Comment.Multiline */
.highlight .cp { color: #9C6500 } /* Comment.Preproc */
.highlight .cpf { color: #3D7B7B; font-style: italic } /* Comment.PreprocFile */
.highlight .c1 { color: #3D7B7B; font-style: italic } /* Comment.Single */
.highlight .cs { color: #3D7B7B; font-style: italic } /* Comment.Special */
.highlight .gd { color: #A00000 } /* Generic.Deleted */
.highlight .ge { font-style: italic } /* Generic.Emph */
.highlight .gr { color: #E40000 } /* Generic.Error */
.highlight .gh { color: #000080; font-weight: bold } /* Generic.Heading */
.highlight .gi { color: #008400 } /* Generic.Inserted */
.highlight .go { color: #717171 } /* Generic.Output */
.highlight .gp { color: #000080; font-weight: bold } /* Generic.Prompt */
.highlight .gs { font-weight: bold } /* Generic.Strong */
.highlight .gu { color: #800080; font-weight: bold } /* Generic.Subheading */
.highlight .gt { color: #0044DD } /* Generic.Traceback */
.highlight .kc { color: #008000; font-weight: bold } /* Keyword.Constant */
.highlight .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */
.highlight .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */
.highlight .kp { color: #008000 } /* Keyword.Pseudo */
.highlight .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */
.highlight .kt { color: #B00040 } /* Keyword.Type */
.highlight .m { color: #666666 } /* Literal.Number */
.highlight .s { color: #BA2121 } /* Literal.String */
.highlight .na { color: #687822 } /* Name.Attribute */
.highlight .nb { color: #008000 } /* Name.Builtin */
.highlight .nc { color: #0000FF; font-weight: bold } /* Name.Class */
.highlight .no { color: #880000 } /* Name.Constant */
.highlight .nd { color: #AA22FF } /* Name.Decorator */
.highlight .ni { color: #717171; font-weight: bold } /* Name.Entity */
.highlight .ne { color: #CB3F38; font-weight: bold } /* Name.Exception */
.highlight .nf { color: #0000FF } /* Name.Function */
.highlight .nl { color: #767600 } /* Name.Label */
.highlight .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
.highlight .nt { color: #008000; font-weight: bold } /* Name.Tag */
.highlight .nv { color: #19177C } /* Name.Variable */
.highlight .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
.highlight .w { color: #bbbbbb } /* Text.Whitespace */
.highlight .mb { color: #666666 } /* Literal.Number.Bin */
.highlight .mf { color: #666666 } /* Literal.Number.Float */
.highlight .mh { color: #666666 } /* Literal.Number.Hex */
.highlight .mi { color: #666666 } /* Literal.Number.Integer */
.highlight .mo { color: #666666 } /* Literal.Number.Oct */
.highlight .sa { color: #BA2121 } /* Literal.String.Affix */
.highlight .sb { color: #BA2121 } /* Literal.String.Backtick */
.highlight .sc { color: #BA2121 } /* Literal.String.Char */
.highlight .dl { color: #BA2121 } /* Literal.String.Delimiter */
.highlight .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */
.highlight .s2 { color: #BA2121 } /* Literal.String.Double */
.highlight .se { color: #AA5D1F; font-weight: bold } /* Literal.String.Escape */
.highlight .sh { color: #BA2121 } /* Literal.String.Heredoc */
.highlight .si { color: #A45A77; font-weight: bold } /* Literal.String.Interpol */
.highlight .sx { color: #008000 } /* Literal.String.Other */
.highlight .sr { color: #A45A77 } /* Literal.String.Regex */
.highlight .s1 { color: #BA2121 } /* Literal.String.Single */
.highlight .ss { color: #19177C } /* Literal.String.Symbol */
.highlight .bp { color: #008000 } /* Name.Builtin.Pseudo */
.highlight .fm { color: #0000FF } /* Name.Function.Magic */
.highlight .vc { color: #19177C } /* Name.Variable.Class */
.highlight .vg { color: #19177C } /* Name.Variable.Global */
.highlight .vi { color: #19177C } /* Name.Variable.Instance */
.highlight .vm { color: #19177C } /* Name.Variable.Magic */
.highlight .il { color: #666666 } /* Literal.Number.Integer.Long */</style>
<style type="text/css">
/* Temporary definitions which will become obsolete with Notebook release 5.0 */
.ansi-black-fg { color: #3E424D; }
.ansi-black-bg { background-color: #3E424D; }
.ansi-black-intense-fg { color: #282C36; }
.ansi-black-intense-bg { background-color: #282C36; }
.ansi-red-fg { color: #E75C58; }
.ansi-red-bg { background-color: #E75C58; }
.ansi-red-intense-fg { color: #B22B31; }
.ansi-red-intense-bg { background-color: #B22B31; }
.ansi-green-fg { color: #00A250; }
.ansi-green-bg { background-color: #00A250; }
.ansi-green-intense-fg { color: #007427; }
.ansi-green-intense-bg { background-color: #007427; }
.ansi-yellow-fg { color: #DDB62B; }
.ansi-yellow-bg { background-color: #DDB62B; }
.ansi-yellow-intense-fg { color: #B27D12; }
.ansi-yellow-intense-bg { background-color: #B27D12; }
.ansi-blue-fg { color: #208FFB; }
.ansi-blue-bg { background-color: #208FFB; }
.ansi-blue-intense-fg { color: #0065CA; }
.ansi-blue-intense-bg { background-color: #0065CA; }
.ansi-magenta-fg { color: #D160C4; }
.ansi-magenta-bg { background-color: #D160C4; }
.ansi-magenta-intense-fg { color: #A03196; }
.ansi-magenta-intense-bg { background-color: #A03196; }
.ansi-cyan-fg { color: #60C6C8; }
.ansi-cyan-bg { background-color: #60C6C8; }
.ansi-cyan-intense-fg { color: #258F8F; }
.ansi-cyan-intense-bg { background-color: #258F8F; }
.ansi-white-fg { color: #C5C1B4; }
.ansi-white-bg { background-color: #C5C1B4; }
.ansi-white-intense-fg { color: #A1A6B2; }
.ansi-white-intense-bg { background-color: #A1A6B2; }
.ansi-bold { font-weight: bold; }
</style>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Since last month I've been a PhD student at MSU in <a href="http://ivory.idyll.org/blog/">Titus</a>' <a href="http://ged.msu.edu/">lab</a>, and my
research is focused on building infrastructure for exploring and
merging multiple read types and multiple assemblies.</p>
<p>Titus and all the labbies are awesome mentors, and I'm making some
progress while learning how to deal with this brave new world.</p>
<p>One thing I'm doing is checking the quality of a <em>Gallus gallus</em> Moleculo
sequencing dataset we have, which is being used for the
<a href="http://ivory.idyll.org/blog/2013-posted-chick-improvement-grant.html">chicken genome sequence improvement project</a>. The first question was:
how many of the Moleculo raw reads align to the reference genome, and how much of
the reference do they cover?</p>
<p>To answer these questions we are using <a href="http://arxiv.org/abs/1303.3997">bwa-mem</a> to do the alignments,
<a href="http://samtools.sourceforge.net/">samtools</a> to work with the alignment data, and a mix of Bash and
Python scripts to glue everything together and do the analysis.</p>
<p>First, let's download the reference genome.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [ ]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget -c ftp://hgdownload.cse.ucsc.edu/goldenPath/galGal4/bigZips/galGal4.fa.masked.gz
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>With the reference genome available, we need to prepare it for the BWA algorithms by constructing its FM-index. The command to do this is</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [ ]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>bwa index galGal4.fa.masked.gz
</pre></div>
</div>
</div>
</div>
</div>
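<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>If you are curious about what an FM-index buys us, here is a toy sketch of the idea. It is not how bwa implements it: the function names are made up for this example, and the naive construction only works for tiny inputs. It builds the Burrows-Wheeler transform of a text and then counts the occurrences of a pattern with backward search, without scanning the text itself.</p>
<div class=" highlight hl-ipython3"><pre><span></span>def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    text = text + "$"  # sentinel, sorts before any letter
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(transformed, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    first_column = sorted(transformed)
    # smaller[c] = number of characters in the text that sort before c
    smaller = {c: first_column.index(c) for c in set(transformed)}
    lo, hi = 0, len(transformed)  # half-open range of matching rotations
    for c in reversed(pattern):
        if c not in smaller:
            return 0
        lo = smaller[c] + transformed[:lo].count(c)
        hi = smaller[c] + transformed[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


# Toy "reference"; the real index is built by `bwa index`.
reference = "GATTACAGATTACA"
print(count_occurrences(bwt(reference), "TTA"))
</pre></div>
<p>A real FM-index stores sampled occurrence counts instead of calling <em>count</em> on every step, which is what makes the lookups fast on a whole genome.</p>
</div>
</div>
</div>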
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>SAMtools also requires preprocessing of the original FASTA file:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [ ]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>samtools faidx galGal4.fa.masked.gz
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>I had 10 files with Moleculo reads, with reads varying from 500 bp to 16 Kbp. In this example let's assume all reads are in the same file, <em>reads.fastq</em>, but in the original analysis I ran the next commands inside a Bash for-loop.</p>
<p>Let's align the reads to the reference using the BWA-MEM algorithm:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [ ]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>bwa mem galGal4.fa.masked.gz reads.fastq > reads.fastq.sam
</pre></div>
</div>
</div>
</div>
</div>
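<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>For the multi-file case, a rough Python equivalent of that Bash for-loop could look like the sketch below. The file names are hypothetical placeholders (the real ones are not listed here), the helper names are made up for this example, and it assumes <em>bwa</em> is on your PATH.</p>
<div class=" highlight hl-ipython3"><pre><span></span>import subprocess

REFERENCE = "galGal4.fa.masked.gz"


def bwa_mem_job(reads):
    """Return the bwa command line and the SAM file it should write to."""
    return ["bwa", "mem", REFERENCE, reads], reads + ".sam"


def align_all(read_files):
    """Run bwa mem once per reads file, like the Bash for-loop would."""
    for reads in read_files:
        argv, sam = bwa_mem_job(reads)
        with open(sam, "w") as out:
            # Same as: bwa mem galGal4.fa.masked.gz reads > reads.sam
            subprocess.run(argv, stdout=out, check=True)


# Hypothetical file names, one per Moleculo reads file.
read_files = [f"reads_{i:02d}.fastq" for i in range(1, 11)]
print(bwa_mem_job(read_files[0]))
</pre></div>
<p>Calling <em>align_all(read_files)</em> would then run the alignments one after another; <em>check=True</em> makes the loop stop at the first failed alignment.</p>
</div>
</div>
</div>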
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Next we are going to optimize the <em>reads.fastq.sam</em> file, transforming it into the BAM format (a binary version of SAM). We also sort based on leftmost coordinates and index the file for faster access:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [ ]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>samtools import galGal4.fa.masked.gz.fai reads.fastq.sam reads.fastq.bam
<span class="o">!</span>samtools sort reads.fastq.bam reads.fastq.sorted
<span class="o">!</span>samtools index reads.fastq.sorted.bam
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>We can query our alignments using the <em>view</em> command from samtools. How many reads didn't align to the reference?</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [ ]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>samtools view -c -f <span class="m">4</span> reads.fastq.sorted.bam
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In my case I got 7,985 reads that didn't align to the reference, from a total of 1,579,060 reads.</p>
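<p>As a quick sanity check on those two numbers, this small sketch turns them into a percentage (the variable names are only for illustration). It comes out at roughly half a percent of the reads.</p>
<div class=" highlight hl-ipython3"><pre><span></span>unaligned = 7985        # reads reported by: samtools view -c -f 4
total_reads = 1579060   # total number of reads quoted above

percent_unaligned = 100 * unaligned / total_reads
print(f"{percent_unaligned:.2f}% of the reads did not align")
</pre></div>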
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [ ]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>samtools view -c reads.fastq.sorted.bam
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>There were 4,411,380 possible alignments.</p>
<p>But how good is the coverage of these alignments in the reference genome? To do this calculation I refactored <a href="https://github.com/ngs-docs/ngs-scripts/blob/master/blast/calc-blast-cover.py">calc-blast-cover</a>, resulting in <a href="https://github.com/luizirber/bioinfo/blob/master/bioinfo/bam_coverage.py">bam-coverage</a>. The idea is to create, for each sequence in the reference, an array of zeros with the same length as the sequence. We then go over the alignments and set an array position to 1 if any alignment covers that position. After doing this we can calculate the coverage by summing the array and dividing by the sequence size.</p>
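<p>A minimal sketch of that idea in plain Python follows. It is not the actual bam-coverage code: the function name and the (name, start, end) alignment tuples are assumptions made for this example, with 0-based starts and exclusive ends.</p>
<div class=" highlight hl-ipython3"><pre><span></span>def coverage(seq_lengths, alignments):
    """Fraction of reference positions covered by at least one alignment.

    seq_lengths: dict mapping sequence name to sequence length.
    alignments: iterable of (name, start, end) tuples.
    """
    # One zero-filled array per reference sequence.
    covered = {name: bytearray(length) for name, length in seq_lengths.items()}
    for name, start, end in alignments:
        marks = covered[name]
        # Clamp to the sequence bounds so the array never changes size.
        start = max(start, 0)
        end = min(end, len(marks))
        marks[start:end] = b"\x01" * (end - start)
    total = sum(seq_lengths.values())
    return sum(sum(marks) for marks in covered.values()) / total


# Toy data: two short "chromosomes" and three alignments, two of them overlapping.
lengths = {"chr1": 10, "chr2": 5}
hits = [("chr1", 0, 4), ("chr1", 2, 6), ("chr2", 1, 3)]
print(coverage(lengths, hits))
</pre></div>
<p>Overlapping alignments only mark a position once, so the result is the fraction of covered bases, not the sequencing depth.</p>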
<p>To make it easier to use I've started a new project (oh no! Another bioinformatics scripts collection!)
with my modifications of the original script. This project is called <a href="https://github.com/luizirber/bioinfo">bioinfo</a> and it is available
in <a href="https://pypi.python.org/pypi/bioinfo">PyPI</a>.
Basic dependencies are just <a href="http://docopt.org">docopt</a> (which is awesome, BTW) and
<a href="https://github.com/noamraph/tqdm">tqdm</a> (same). Additional dependencies are needed based on which command
you intend to run. For example, bam_coverage needs <a href="http://pysam.readthedocs.org/en/latest/">PySAM</a> and <a href="http://screed.readthedocs.org/en/latest/">screed</a>. At first
this seems counter-intuitive, because the user needs to explicitly install new
packages, but it avoids another problem: installing all the packages in
the world just to run a subset of the program. I intend to give an informative
message when the user tries to run a command and dependencies are missing.</p>
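<p>One way such a message could work is sketched below. This is only an illustration of the optional-dependency pattern, not code from bioinfo; the <em>require</em> helper is hypothetical.</p>
<div class=" highlight hl-ipython3"><pre><span></span>import importlib


def require(module_name, command):
    """Import an optional dependency, or exit with an informative message."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise SystemExit(
            f"The '{command}' command needs the '{module_name}' package. "
            f"Try: pip install {module_name}"
        )


# Inside a command, import lazily so other commands keep working without it.
json = require("json", "example")  # stdlib module, always available
print(json.dumps({"ok": True}))
</pre></div>
<p>Because the import only happens when a command runs, a missing package only affects the commands that actually need it.</p>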
<p>If you've never used Python before, a good way to have a working environment is to use <a href="http://continuum.io/downloads">Anaconda</a>.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Since bioinfo is available in PyPI, you can install it with</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [ ]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>pip install bioinfo
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To see available commands and options in bioinfo you can run</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [ ]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>bioinfo -h
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Running the bam_coverage command over the alignment generated by BWA-MEM:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [ ]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>bioinfo bam_coverage galGal4.fa.masked reads.fastq.sorted.bam <span class="m">200</span> reads.fastq --mapq<span class="o">=</span><span class="m">30</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The same command is available as a function in bioinfo and can be run inside a Python script:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [ ]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">bioinfo</span> <span class="kn">import</span> <span class="n">bam_coverage</span>
<span class="c1">## same call, using the module.</span>
<span class="n">bam_coverage</span><span class="p">(</span><span class="s2">"galGal4.fa.masked"</span><span class="p">,</span> <span class="s2">"reads.fastq.sorted.bam"</span><span class="p">,</span> <span class="mi">200</span><span class="p">,</span> <span class="s2">"reads.fastq"</span><span class="p">,</span> <span class="mi">30</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>If you don't want to install bioinfo (why not?!??!), you can just download bam-coverage:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [ ]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>wget -c https://raw.githubusercontent.com/luizirber/bioinfo/master/bioinfo/bam_coverage.py
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>And pass the options to the script:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [ ]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>python bam_coverage.py galGal4.fa.masked reads.fastq.sorted.bam <span class="m">200</span> reads.fastq <span class="m">45</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The result for my analysis was 82.3% coverage of the reference genome by the alignments generated with BWA-MEM.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<pre><code>reading query
1579060 [elapsed: 01:16, 20721.82 iters/sec]
creating empty lists
15932 [elapsed: 01:08, 232.67 iters/sec]
building coverage
4411380 [elapsed: 34:36, 2123.96 iters/sec]
Summing stats
|##########| 15932/15932 100% [elapsed: 00:07 left: 00:00, 2008.90 iters/sec]
total bases in reference: 1046932099
total ref bases covered : 861340070
fraction : 0.822727730693
reference : galGal4.fa.masked
alignment file : ../moleculo/LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam
query sequences : ../moleculo/LR6000017-DNA_A01-LRAAA-AllReads.fastq</code></pre>
</div>
</div>
</div>
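<p>The "fraction" line in the log above is simply the covered bases divided by the total bases in the reference. As a rough illustration of the idea, here is a minimal sketch that merges alignment intervals per reference sequence and counts the covered bases. It is not the code inside bioinfo: the function names, the half-open <code>(start, end)</code> convention and the toy data are all assumptions made for this example.</p>

```python
def covered_bases(intervals):
    """Count bases covered by at least one half-open interval [start, end)."""
    total = 0
    current_start, current_end = None, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Disjoint from the running interval: flush it and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the running interval.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def coverage_fraction(reference_lengths, alignments):
    """reference_lengths: {name: length}; alignments: iterable of (name, start, end)."""
    by_reference = {}
    for name, start, end in alignments:
        by_reference.setdefault(name, []).append((start, end))
    covered = sum(covered_bases(spans) for spans in by_reference.values())
    return covered / sum(reference_lengths.values())


# Toy data (hypothetical, not taken from the real run).
lengths = {"chr1": 100, "chr2": 50}
hits = [("chr1", 0, 40), ("chr1", 30, 60), ("chr2", 10, 20)]
print(coverage_fraction(lengths, hits))

# The fraction reported in the log is covered / total:
print(861340070 / 1046932099)
```

<p>Merging intervals first means overlapping alignments are not counted twice, which matters when many reads pile up on the same region.</p>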
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This post was written entirely in the IPython notebook. You can <a href="http://blog.luizirber.org/downloads/notebooks/bam_coverage.ipynb">download</a> this notebook, or see a static view <a href="http://nbviewer.ipython.org/url/blog.luizirber.org/downloads/notebooks/bam_coverage.ipynb">here</a>.</p>
</div>
</div>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Of ThinkPads and MacBooks2013-06-27T12:00:00-03:002013-06-27T12:00:00-03:00luizirbertag:blog.luizirber.org,2013-06-27:/2013/06/27/thinkpad/<p>I had been a Mac user since 2009. I was working with iOS development, and it made
sense to have a MacBook for the SDK. I was curious too, because I had been
using Linux distros (Debian, then Ubuntu, then Gentoo when Ubuntu was getting
too heavy for my old laptop) for some time and was a bit tired of making
everything work. Losing control was uncomfortable at first, but so many
things working out of the box (like sleep and hibernation!) was worth it. And
Mac apps were much more polished (oh, GarageBand).</p>
<p>When I arrived at INPE I got a Linux workstation, the mighty Papera (all the
computers there have Tupi names, Tupi being a language spoken by indigenous
peoples here in Brazil). And I tested some new things, like using Awesome<a href="http://awesome.naquadah.org/">1</a>
as a window manager, and loved it. But it lasted only a few months, because
the machines were swapped for some iMacs and Papera was assigned to another
person. I missed a tiling window manager, but I also found Homebrew<a href="http://mxcl.github.io/homebrew/">2</a> and it helped
a lot in setting up a dev environment on OS X (I know MacPorts and Fink existed,
but writing a Homebrew formula is pretty easy, I even contributed one back),
so no big problems in the transition.</p>
<p>But after some time I was getting uneasy. New OS X versions seemed to remove
features instead of adding them (sigh, matrix-organized Spaces...). Lack of
expandability on new laptops (despite the MacBook Air being an awesome computer)
was pushing me away too, because a maxed-out one would cost way more than I was
willing to pay. And I was spending most of my time in SSH sessions to other
computers or using web apps, so why not go back to Linux?</p>
<p>At the end of 2012 I bought a used ThinkPad X220 with the dock and everything.
When I was younger I always liked the look, all black and red, and
the durability (MacBooks are pretty, but they are easy to scratch and bend).
And the X220 was cheap and in perfect condition, and with a small upgrade when I
went to PyCon (ahem, 16 GB RAM and a 128 GB SSD) it is a BEAST now. And it comes
with all these benefits too:</p>
<ul>
<li>
<p>I have Awesome again!</p>
</li>
<li>
<p>Updated packages thanks to pacman (I installed Arch Linux and I'm loving
it)</p>
</li>
<li>
<p>When I need a new package it is as easy to write a PKGBUILD file as it was
to write a Homebrew formula. I wrote some Debian packages in the past and
they worked, but there were so many rules and parts that I don't think I
want to write one again. I recognize that a lot of the rules and parts make
sense with a project as big as Debian (and Ubuntu and everyone else), but
it could be simpler.</p>
</li>
<li>
<p>Sleep works! Hibernation works! <a href="http://forums.lenovo.com/t5/X-Series-ThinkPad-Laptops/x220-does-not-resume-from-sleep/m-p/1083233/highlight/false#M48825">Except when it doesn't because your EFI
is half full after the kernel wrote some stacktraces and the chip refuses
to wake up.</a></p>
</li>
</ul>
<p>It isn't for the faint of heart, but I'm happy to be back =]</p>A Tale of Two Stadiums2013-06-17T22:00:00-03:002013-06-17T22:00:00-03:00luizirbertag:blog.luizirber.org,2013-06-17:/2013/06/17/two-stadiums/<p>Last weekend I went with some friends to Maracanã to watch Italy vs Mexico
in the first round of the Confederations Cup.
We got some cheap tickets (R$ 57, about US$ 25) in an area meant for
Brazilians only (that's why they were so cheap; tickets usually cost
at least double). And it was so good to go to a stadium again; it's a very
different experience from watching a game on TV, where the camera gives you
a limited perspective. Our seats were behind one of the goals, and we were
lucky: we saw two goals on our side, one from Mexico (a penalty kick) and
one from Italy, a beautiful goal by Balotelli, who was a bit of a diva, by
the way, always complaining and making drama.</p>
<p>It was the third time I went to a stadium. The previous ones were Grêmio matches,
one versus River Plate in 2002 and the other versus Figueirense in 2008. These
games were at Grêmio's previous stadium, Olímpico Monumental, which gave way to
a new one, the Arena, built at the entrance of the city (and far from the
old one). This is happening a lot around here, given that next year we have
the World Cup and there are at least ten stadiums being built or renovated for
the competition. At first they were supposed to be built with private
funding, but as time passed and they all fell behind schedule, public funding came into
the picture, and in Maracanã alone more than one billion Brazilian reais were
spent.</p>
<p>They are now much closer to developed countries' stadiums, like those we used
to see on TV. By Brazilian standards they aren't even stadiums anymore,
looking more like theaters, where you just sit and watch the game. That
shouldn't be bad, but I couldn't avoid comparing this weekend's
game with my previous experiences. OK, this time the crowd didn't have a
preferred side and so it wasn't cheering as much as I saw before, but it
was worrisome because it was obvious that almost everyone at the match was
someone who could pay for the expensive tickets, and who sometimes didn't even
have a strong connection with football, something that always gave a match
that cathartic aura.</p>
<p>Last week I tried to go with my family to see the new Grêmio stadium and we
weren't so lucky. The cheapest tickets cost almost R$ 100 each, and for the four
of us this meant R$ 400, not out of reach but way too expensive. These
high ticket prices mean the usual fan can't go to the stadium anymore,
and is replaced by this new fan, a consumer above everything else. And
maybe it's just a romantic vision, and of course there were a lot of problems
before, but the price paid for comfort might be too expensive in the long run.</p>I finally met Tupã2013-06-07T23:45:00-03:002013-06-07T23:45:00-03:00luizirbertag:blog.luizirber.org,2013-06-07:/2013/06/07/tupa/<p>So, after two and a half years, I finally met <a href="http://supercomputacao.inpe.br/recursos2">Tupã</a>.</p>
<p><a href="http://www.flickr.com/photos/luizirber/sets/72157633820118765/"><img class="center" src="https://farm3.staticflickr.com/2875/8896385074_5a48573f57.jpg" width="500" height="375"></a></p>
<p>Curiously, it was only after I left INPE. <a href="http://ciclotux.blogspot.com">Arnaldo</a> came to Cachoeira
to visit us and I planned to show him the center, but I wasn't expecting
to do the whole tour. When I began working there they only showed the
public part, because Tupã was off-limits at that time, and I thought it
still was. Well, so did everybody else who joined the group
after me... We went on a small tour to see Tupã, the engines, the UPS units
and all the electrical installation needed to support the building.</p>
<p>And it was <strong>AMAZING!</strong> I always liked software more than hardware, but it
is awesome to see things that you only knew from an SSH session. I could see
the tape library, disk arrays, the blinkenlights... It is in some ways the
physical manifestation of your code, or at least you have a better notion
of how many things are in motion when you execute something on the computer.</p>
<p>All in all, good times. It wasn't exactly what I expected at first, but I
made some great friends there, I could bike to work every day, and I
learned (by trial and error) that you need to choose more carefully who you
trust. So it goes.</p>Ocean2011-04-19T19:01:00-03:002011-04-19T19:01:00-03:00luizirbertag:blog.luizirber.org,2011-04-19:/2011/04/19/ocean/<p>It all started with <a href="http://www.reddit.com/r/compsci/comments/gp0wr/bored_with_webdevelopment_want_to_change_careers/c1p731j">a question on Reddit</a>. I shared it on
<a href="https://profiles.google.com/107474098146584337559/posts/Ln5D9CFTMjS">Google Reader</a>, commenting that it was quite similar to what I do at
INPE. <a href="https://profiles.google.com/thiago.camposmoraes/about">Thiago</a> got interested and asked if I could tell
a bit more.</p>
<p>I started working in INPE's coupled ocean modeling group five months
ago. Our group is responsible for the <a href="http://www.ccst.inpe.br/modelo-brasileiro.php">MBSCG</a>, in addition to its
scientific output. I am developing tools to help
researchers analyze the model outputs and relate them to
observations from instruments spread around the world.</p>
<p>But that is a very generic description. The first application I wrote
(no public release yet, shame on me, but it won't be long) is an
ocean grid editor. The ocean model needs a file
describing the grid it will use to know what the world looks like (where there is
land, where there is water, how deep the water is at each location). At
low resolutions, it usually subdivides the world in 1-degree increments
(about 111 km at the Equator). At any chosen resolution
several different problems come up:</p>
<ul>
<li>At low resolutions the Strait of Gibraltar may end up closed, and the
Mediterranean becomes a giant lake with no exchange with the ocean.
Obviously the result is far from reality.</li>
<li>At high resolutions the Panama Canal may open, which creates a
nonexistent flow between the Pacific and the Atlantic.</li>
</ul>
<p>Among many others. The model has tools to generate a grid from
a given bathymetry, but they don't fix every
problem that comes up. Until now fixing the grid was a lot of work,
because it was hard to visualize and select each grid point and check
whether it is consistent with the needs of the experiment.</p>
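<p>To make the Gibraltar case concrete, here is a toy sketch of how a narrow strait can disappear when a land/sea mask is coarsened. It assumes a simple majority rule (a coarse cell is water only if more than half of its fine cells are water) and grid dimensions divisible by the coarsening factor. Both the rule and the tiny grid are assumptions for illustration; the model's real grid generator works from bathymetry and is not shown here.</p>

```python
def coarsen(mask, factor):
    """Coarsen a 0/1 land-sea mask (1 = water) by factor x factor blocks.

    A coarse cell is water only if more than half of its fine cells are water.
    Assumes the mask dimensions are divisible by factor.
    """
    rows, cols = len(mask), len(mask[0])
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block = [mask[i][j]
                     for i in range(r, r + factor)
                     for j in range(c, c + factor)]
            coarse_row.append(1 if sum(block) * 2 > len(block) else 0)
        coarse.append(coarse_row)
    return coarse


# Two basins joined by a strait that is one cell wide (second row).
fine = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
]

# The middle column becomes all land, so the basins no longer connect.
print(coarsen(fine, 2))  # [[1, 0, 1], [1, 0, 1]]
```

<p>Problems like this are why being able to click on a cell and change it by hand is so useful: no single automatic rule gets every strait and canal right.</p>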
<p>So I wrote an application that takes the grid, plots it in the
desired projection (orthographic, by default), and allows selecting
cells and editing their depth at different zoom levels.
Relatively simple, but very useful for
researchers preparing their experiments.</p>
<p>Today I am working on a module to extract data from the model and
compare it with available observations. One example is the data from the
<a href="http://www.pmel.noaa.gov/pirata/">PIRATA</a> project, which includes <a href="http://www.whoi.edu/instruments/viewInstrument.do?id=1003">CTD</a>s and <a href="http://en.wikipedia.org/wiki/Radiosonde">radiosondes</a> (if you want to
play a bit and see what is available, take a look at <a href="http://opendap.ccst.inpe.br/pirata/">this site</a>
that <a href="http://roberto.dealmeida.net">Beto</a> made). PIRATA has buoy maintenance cruises at
some periods of the year, and it is interesting to compare the data collected
on those cruises with a virtual cruise passing through the same locations
in the model. At first I did something quite simple, which just takes the latitude and
longitude of the measurement and interpolates the surroundings of that location in the
model (the resolution may not be high enough to include the exact
point, hence the interpolation). Even with such a simple technique it is
already possible to see how close the model gets to reality in
many places. Last week <a href="http://blog.castelao.net/">Guilherme</a> arrived to work
in our group and made some suggestions for a more
advanced analysis, and actually this whole email is to comment on that.</p>
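<p>The "take the latitude and longitude and interpolate the surroundings" step can be sketched with plain bilinear interpolation on a regular grid. This is only an illustration: the function name, the grid layout (<code>field[i][j]</code> defined at <code>lats[i]</code>, <code>lons[j]</code>, both in ascending order) and the temperature values are made up, and the module I wrote may do this differently.</p>

```python
import bisect


def bilinear(lats, lons, field, lat, lon):
    """Interpolate field[i][j] (defined at lats[i], lons[j]) at the point (lat, lon).

    Assumes lats and lons are sorted in ascending order and that the point
    lies inside the grid; outside of it the result is an extrapolation.
    """
    # Index of the lower corner of the grid cell that surrounds the point.
    i = min(max(bisect.bisect_right(lats, lat) - 1, 0), len(lats) - 2)
    j = min(max(bisect.bisect_right(lons, lon) - 1, 0), len(lons) - 2)
    # Fractional position inside the cell (0 at the lower corner, 1 at the upper).
    ty = (lat - lats[i]) / (lats[i + 1] - lats[i])
    tx = (lon - lons[j]) / (lons[j + 1] - lons[j])
    # Interpolate along longitude on both edges, then along latitude.
    south = field[i][j] * (1 - tx) + field[i][j + 1] * tx
    north = field[i + 1][j] * (1 - tx) + field[i + 1][j + 1] * tx
    return south * (1 - ty) + north * ty


# Hypothetical 1-degree model grid with made-up temperatures.
lats = [-1.0, 0.0, 1.0]
lons = [-36.0, -35.0, -34.0]
sst = [
    [26.0, 26.5, 27.0],
    [27.0, 27.5, 28.0],
    [28.0, 28.5, 29.0],
]

# Value of the model at a measurement taken between grid points.
print(bilinear(lats, lons, sst, 0.5, -34.5))
```

<p>Repeating this for every position along a cruise track gives the "virtual cruise" that can be compared with the real measurements.</p>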
<p>Just as computing people have been in high demand for some time,
due to the rise of the computer as an essential tool for most of
humanity's activities, the next ones will be statisticians (and
information scientists, where library science courses have
updated themselves). We generate mountains of data more and more, but we don't know
very well how to deal with all that complexity. And there are techniques that
are common to many areas of knowledge, and therefore generic: you
can apply them to analyze a huge range of data. For example,
I took signal analysis at university, and although it was oriented toward
circuits almost all of it applies to studying measurements made by
instruments. It was even funny, because I associated it so much with circuits that
I had the impression I would never use it in my life, since I wasn't going to
work with hardware (yes, I was a dumb freshman).</p>
<p>Related to that, another subject came up in the afternoon, talking with
<a href="http://vainalousachefe.wordpress.com">GG</a>. I said that, during my undergrad, there were engineering courses that
left us wondering what they would be good for. Transport phenomena?
Statistics? The aforementioned signal analysis? Just to eat my words, those are
the three things I use the most here: the first is essential for
modeling the physical processes (geophysical oceanography) that govern the
model, the second and third for analyzing the outputs. And we also
realized that not only is computing a means field (one that can be
applied to many end fields), but engineering is too. I'm glad
to have found a place that lets me use my generic professions.</p>My (almost) life of crime2010-08-10T14:08:00-03:002010-08-10T14:08:00-03:00luizirbertag:blog.luizirber.org,2010-08-10:/2010/08/10/minha-quase-vida-bandida/<p>The blog is gathering dust, so let me resurrect a little weekend project
to see if it livens things up a bit.</p>
<p>The initial idea for this post came up at last year's <a href="http://www.pythonbrasil.org.br/">PythonBrasil</a>.
I thought about giving a lightning talk, but it wasn't ready in time (Note:
it's always good to have a lightning talk prepared).</p>
<p>How the story began: I received an email from a cousin, asking
friends and relatives to help vote for her daughter on a children's
clothing website. The child with the most votes would appear in a commercial and
take home a bunch of clothes.</p>
<p>OK, OK. I hate this kind of spam, but it doesn't hurt to help, right? I went to the
voting page. I voted once, after filling in a captcha, and
tried to vote again to see what would happen. "You voted in the last
hour, please wait to vote again". Hmm. I wonder how that
is enforced?</p>
<p>I opened the source. Hmm, this JavaScript function here that handles the
button event, it calls a URL...</p>
<figure class='code'>
<figcaption><span>votoAprovar.js</span> <a href='/downloads/code/votoAprovar.js'>download</a></figcaption>
<div class="highlight"><pre><span></span><span class="kd">function</span> <span class="nx">votoAprovar</span><span class="p">(</span><span class="nx">cadastroId</span><span class="p">){</span>
<span class="nx">captcha</span> <span class="o">=</span> <span class="nb">document</span><span class="p">.</span><span class="nx">getElementById</span><span class="p">(</span><span class="s1">'cadastroCaptcha'</span><span class="p">).</span><span class="nx">value</span><span class="p">;</span>
<span class="nb">window</span><span class="p">.</span><span class="nx">location</span> <span class="o">=</span> <span class="s1">'voto_v.php?votoStatus=1&cadastroId='</span><span class="o">+</span><span class="nx">cadastroId</span><span class="o">+</span><span class="s2">"&captcha="</span><span class="o">+</span><span class="nx">captcha</span><span class="p">;</span>
<span class="p">}</span>
</pre></div>
</figure>
<p>Whoa. What if I try to access that URL directly?</p>
<div class="highlight"><pre><span></span>voto_v.php?votoStatus=1&cadastroId=98374&captcha=adb356
</pre></div>
<p>I have to get the captcha right. Where is the captcha? Ah, look at that, the image
link is a captcha.php file, can I access it directly? I could.</p>
<p><img class="center" src="/images/original1.png" width="173" title="The original image." alt="The original image."></p>
<p>In short, I had the URL to vote, and by refreshing the captcha I
could vote as many times as I wanted. But doing that by hand is
boring. I wonder how captcha recognition works? A quick
search and I landed on <a href="http://alwaysmovefast.com/2007/11/21/cracking-captchas-for-fun-and-profit/">this site</a>. And in Python, to make my life
even easier.</p>
<p>I played a bit with PIL, and managed to get an image with
well-defined characters. Amazingly, all I had to do was convert to
greyscale and apply a threshold.</p>
<figure class='code'>
<figcaption><span>clean_captcha.py</span> <a href='/downloads/code/clean_captcha.py'>download</a></figcaption>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">captcha_to_greyscale</span><span class="p">(</span><span class="n">captcha</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">captcha</span><span class="o">.</span><span class="n">mode</span> <span class="o">==</span> <span class="s1">'L'</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">captcha</span>
    <span class="n">captcha</span> <span class="o">=</span> <span class="n">captcha</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="s1">'L'</span><span class="p">,</span> <span class="p">(</span><span class="mf">.4</span><span class="p">,</span> <span class="mf">.4</span><span class="p">,</span> <span class="mf">.4</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">captcha</span>

<span class="k">def</span> <span class="nf">light_pixels_to_white_pixels</span><span class="p">(</span><span class="n">pixels</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">h</span><span class="p">):</span>
    <span class="c1"># Anything lighter than the threshold becomes pure white.</span>
    <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">w</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">h</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">pixels</span><span class="p">[</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">]</span> <span class="o">></span> <span class="mi">50</span><span class="p">:</span>
                <span class="n">pixels</span><span class="p">[</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">]</span> <span class="o">=</span> <span class="mi">255</span>
    <span class="k">return</span> <span class="n">pixels</span>

<span class="k">def</span> <span class="nf">clean_captcha</span><span class="p">(</span><span class="n">img</span><span class="p">):</span>
    <span class="n">img2</span> <span class="o">=</span> <span class="n">captcha_to_greyscale</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>
    <span class="n">w</span><span class="p">,</span> <span class="n">h</span> <span class="o">=</span> <span class="n">img2</span><span class="o">.</span><span class="n">size</span>
    <span class="n">light_pixels_to_white_pixels</span><span class="p">(</span><span class="n">img2</span><span class="o">.</span><span class="n">load</span><span class="p">(),</span> <span class="n">w</span><span class="p">,</span> <span class="n">h</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">img2</span>
</pre></div>
</figure>
<p><img class="center" src="/images/limpa1.png" width="173" title="The image after being processed with PIL." alt="The image after being processed with PIL."></p>
<p>And, as I collected more images, I saw my life would be even
easier: the captcha only had hexadecimal characters, so I wouldn't even
need to map the whole alphabet, just zero to nine and 'a' to
'f'. After cleaning some images and putting them in a folder, I ran the
<a href="http://code.google.com/p/tesseract-ocr/">tesseract-ocr</a> trainer, and once the training files were
ready, I had 100% accuracy on the images. Sigh, what a wonderful
captcha...</p>
<p>Now, testing. I created a fake profile, and got a scare. I tried running the
script to see if it counted one vote, and when I opened the profile it already had 5!
Apparently the mothers do a "vote-for-my-kid-and-I'll-vote-for-yours",
and since the newest profiles show up on the front page, they found me
pretty quickly. OK, let's run a loop then, a hundred votes. Yep, all counted.</p>
<figure class='code'>
<figcaption><span>break_captcha.py</span> <a href='/downloads/code/break_captcha.py'>download</a></figcaption>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">io</span> <span class="kn">import</span> <span class="n">BytesIO</span>
<span class="kn">from</span> <span class="nn">subprocess</span> <span class="kn">import</span> <span class="n">getoutput</span>

<span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="kn">import</span> <span class="nn">mechanize</span>

<span class="k">def</span> <span class="nf">captcha_to_greyscale</span><span class="p">(</span><span class="n">captcha</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">captcha</span><span class="o">.</span><span class="n">mode</span> <span class="o">==</span> <span class="s1">'L'</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">captcha</span>
    <span class="n">captcha</span> <span class="o">=</span> <span class="n">captcha</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="s1">'L'</span><span class="p">,</span> <span class="p">(</span><span class="mf">.4</span><span class="p">,</span> <span class="mf">.4</span><span class="p">,</span> <span class="mf">.4</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">captcha</span>

<span class="k">def</span> <span class="nf">light_pixels_to_white_pixels</span><span class="p">(</span><span class="n">pixels</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">h</span><span class="p">):</span>
    <span class="c1"># The pixel access object has no size, so width and height are passed in.</span>
    <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">w</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">h</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">pixels</span><span class="p">[</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">]</span> <span class="o">></span> <span class="mi">50</span><span class="p">:</span>
                <span class="n">pixels</span><span class="p">[</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">]</span> <span class="o">=</span> <span class="mi">255</span>
    <span class="k">return</span> <span class="n">pixels</span>

<span class="k">def</span> <span class="nf">clean_captcha</span><span class="p">(</span><span class="n">img</span><span class="p">):</span>
    <span class="n">img2</span> <span class="o">=</span> <span class="n">captcha_to_greyscale</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>
    <span class="n">w</span><span class="p">,</span> <span class="n">h</span> <span class="o">=</span> <span class="n">img2</span><span class="o">.</span><span class="n">size</span>
    <span class="n">light_pixels_to_white_pixels</span><span class="p">(</span><span class="n">img2</span><span class="o">.</span><span class="n">load</span><span class="p">(),</span> <span class="n">w</span><span class="p">,</span> <span class="n">h</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">img2</span>

<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
    <span class="n">br</span> <span class="o">=</span> <span class="n">mechanize</span><span class="o">.</span><span class="n">Browser</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
        <span class="nb">print</span><span class="p">(</span><span class="s1">'voto'</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
        <span class="n">page</span> <span class="o">=</span> <span class="n">br</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s1">'*****/captcha/captcha.php'</span><span class="p">)</span>
        <span class="n">img</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">BytesIO</span><span class="p">(</span><span class="n">page</span><span class="o">.</span><span class="n">read</span><span class="p">()))</span>
        <span class="n">img</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s1">'original.tif'</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s1">'tiff'</span><span class="p">)</span>
        <span class="n">output</span> <span class="o">=</span> <span class="n">clean_captcha</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>
        <span class="n">output</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s1">'tmp.tif'</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s1">'tiff'</span><span class="p">)</span>
        <span class="c1"># tesseract writes the recognized text to output.txt</span>
        <span class="n">getoutput</span><span class="p">(</span><span class="s1">'tesseract tmp.tif output -l brand'</span><span class="p">)</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'output.txt'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fp</span><span class="p">:</span>
            <span class="n">captcha</span> <span class="o">=</span> <span class="n">fp</span><span class="o">.</span><span class="n">read</span><span class="p">()[:</span><span class="mi">6</span><span class="p">]</span>
        <span class="n">cadId</span> <span class="o">=</span> <span class="mi">28477</span>
        <span class="n">vote_page</span> <span class="o">=</span> <span class="s1">'*****/voto_v.php?votoStatus=1&cadastroId=</span><span class="si">%d</span><span class="s1">&captcha=</span><span class="si">%s</span><span class="s1">'</span> <span class="o">%</span> <span class="p">(</span>
            <span class="n">cadId</span><span class="p">,</span> <span class="n">captcha</span><span class="p">)</span>
        <span class="n">br</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">vote_page</span><span class="p">)</span>
</pre></div>
</figure>
<p>I left out the site's address, but that is basically the whole script.</p>
<p>Now comes the big moment, the climax of the story, where the hero chooses
between fame and fortune and what seems morally right. (How
grand!) "With great power comes great responsibility!" And
all that blah-blah.</p>
<p>Although the original goal was to help my cousin, running the
script felt like too much help. And my aim was to test the
hole in the voting system, not to take advantage of it. I ended up letting
it go, and the code sat gathering mold on my computer.</p>
<p>Today, when I sat down to write this post, I saw that the second edition
of the contest has already taken place and, sadly, the system is exactly the same. I'll send
this post to the company; maybe they'll fix it for the next one.</p>
<p>UPDATE: <a href="http://lameiro.wordpress.com">Lameiro</a> gave a tip in the comments: searching
google.com.br for "captcha PHP", the first hit is
<a href="http://www.numaboa.com/informatica/tutos/php/949-captcha">a tutorial teaching how to generate the captcha</a>
that this site uses. And, as he rightly noted,
there must be many sites in Brazil with this same problem.</p>
<p>Here's the tip: never trust the first Google hit when implementing
your security solution. In fact, don't trust any solution until you know how
it works.</p>Robô Shrek2009-06-12T11:29:00-03:002009-06-12T11:29:00-03:00luizirbertag:blog.luizirber.org,2009-06-12:/2009/06/12/robo-shrek/<p>Last Wednesday the Feira de Informática Aplicada (Applied Computing Fair) took place, part of
UFSCar's Semana de Computação (Computing Week). Alphalpha and I decided to take part
to test the Arduino I had bought at the beginning of the year. Time
was short, but we went ahead anyway and put together some parts
left over from <a href="http://www2.dc.ufscar.br/~gedai/">GEDAI</a>'s <a href="http://www.youtube.com/watch?v=4PPXEsENpCY">robot</a>, an N800 and an Arduino Duemilanove, and
built our own robot, named Shrek.</p>
<p>Why Shrek? Because Shrek is an ogre, ogres are like onions
(<span style="text-decoration:line-through;">they stink</span> they have
layers), and our robot is made of several simple layers that, together,
do something complex.</p>
<p>How does it work? The Arduino controls the motors and receives data over USB
(as a serial port) from the N800. The N800 is connected to a
wifi network and receives commands through a socket. It also sends video
and audio to the application (which currently runs on a PC), and the application
sends the commands and plays the video and audio.</p>
<p>Because time was short, only 3 simple commands were implemented
(forward, turn left, turn right), but our goal was
met: communication between the parts was working nicely, and
now we can move on to improving it.</p>
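<p>To make the socket-to-serial hop concrete, here is a minimal sketch of what the relay on the N800 side could look like. It is not the code from the repository: the one-byte protocol (F, L, R) and the function name are assumptions made for illustration, and the serial port is replaced by any object with a write method so the sketch runs without hardware.</p>

```python
import io
import socket

# Hypothetical one-byte protocol: the post only names the three commands,
# not how they are encoded on the wire.
COMMANDS = {b"F": "forward", b"L": "turn left", b"R": "turn right"}


def relay(conn, serial_port):
    """Forward known command bytes from a socket to a serial-like object.

    `serial_port` is anything with a write(bytes) method; on the real
    robot it would be the USB serial link to the Arduino.
    Returns the names of the commands that were forwarded.
    """
    forwarded = []
    while True:
        data = conn.recv(1)
        if not data:  # the application closed the connection
            break
        if data in COMMANDS:  # unknown bytes are silently dropped
            serial_port.write(data)
            forwarded.append(COMMANDS[data])
    return forwarded


if __name__ == "__main__":
    # A connected socket pair stands in for the wifi link.
    app_side, robot_side = socket.socketpair()
    app_side.sendall(b"FLXR")  # "X" is not a command
    app_side.close()
    fake_serial = io.BytesIO()
    print(relay(robot_side, fake_serial))
    print(fake_serial.getvalue())
    robot_side.close()
```

<p>Keeping the relay ignorant of what the bytes mean is what lets each layer stay simple: the Arduino only has to map a byte to a motor action.</p>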
<p>All the code is available on <a href="http://github.com/luizirber/shrekenc">GitHub</a>, and we want to take it
to FISL (once we find better motors).</p>OMG Kitties!2009-05-26T08:49:00-03:002009-05-26T08:49:00-03:00luizirbertag:blog.luizirber.org,2009-05-26:/2009/05/26/omg-kitties/<p>Yesterday was Towel Day. A picture is worth a thousand words.</p>
<p><img class="center" src="/images/toalha2009.jpg" width="300"></p>
<p>On the left, my beloved towel. In the background, <a href="http://www.wowarmory.com/character-sheet.xml?r=Ravenholdt&n=Lebriziur">World of Warcraft</a> (yes,
I made that mistake. God's will be done). And in front, in the
foreground, Ada.</p>
<p>Who is Ada? Last week my girlfriend asked whether I had ever
heard of Ada Byron, Countess of Lovelace (not to be confused with
<a href="http://en.wikipedia.org/wiki/Linda_Lovelace">Linda Lovelace</a>, countess of... click the link and see). I said
"of course, the first programmer", and we started talking about it.
She complained about how little recognition Ada's importance gets, and we ended up
concluding that Ada is a cool name.</p>
<p>On Friday we went to Arca de São Francisco, an animal protection
organization in São Carlos. There we found this kitten, who was supposed to
go somewhere else but, due to incompatible temperaments, was rejected
and ended up staying in the apartment. So we decided to name her
Ada. I should say this is the first time I've had a cat; the only pet
I ever had was a rabbit, when I was about 6. But I'm
enjoying it so far =]</p>
<p>And I ended up finding out that this XKCD strip really is true:</p>
<p><a href="http://xkcd.com/231/"><img class="center" src="https://imgs.xkcd.com/comics/cat_proximity.png" width="300" title="&quot;Yes you are! And you're sitting there! Hi, kitty!&quot;" alt="&quot;Yes you are! And you're sitting there! Hi, kitty!&quot;"></a></p>
<p><em>Randall Munroe, master.</em></p>
<p>Does anyone have tips?</p>Hand2009-02-12T07:13:00-02:002009-02-12T07:13:00-02:00luizirbertag:blog.luizirber.org,2009-02-12:/2009/02/12/hand/<p>Yesterday, while burning a DVD, I started generalizing the scripts that
generate the feeds from the previous post. Out of that came Hand, an RSS
feed generator.</p>
<p><strong>How does it work?</strong></p>
<p>My initial goal was to generate feeds for sites that didn't
provide them, resorting to good old <a href="http://en.wikipedia.org/wiki/Screen_scraping">screen scraping</a>. I started
with the feed for the Folha comics, the most complicated one, since it
required user authentication and walking through several pages to
extract links. When I did the one for Malvados, I followed the same
function structure and began to realize that the process could be
generalized quite a bit.</p>
<p>Enter Hand. At its core it is a class that implements a few methods
(build_date, generate_description, build_feed, process) and requires
you to subclass it and implement the generate_data method.
generate_data returns a list of dictionaries, each
one containing the fields title, page_link, description,
pubDate and guid for one feed item. Simple as that.</p>
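<p>The shape described above can be sketched roughly as follows. This is an illustration of the pattern, not the code in the repository: the method and field names come from the description, but the method bodies, the class attributes, the ExampleComics subclass and the example.com URLs are all assumptions, and generate_description is left out to keep it short.</p>

```python
from email.utils import formatdate
from xml.sax.saxutils import escape


class Hand:
    """Base class: subclasses only have to provide generate_data()."""

    title = "Untitled feed"       # hypothetical channel metadata
    link = "http://example.com/"

    def build_date(self, timestamp):
        # RSS 2.0 expects RFC 822 style dates.
        return formatdate(timestamp, usegmt=True)

    def generate_data(self):
        raise NotImplementedError("subclasses must return a list of dicts")

    def build_feed(self, items):
        entries = []
        for item in items:
            entries.append(
                "<item>"
                f"<title>{escape(item['title'])}</title>"
                f"<link>{escape(item['page_link'])}</link>"
                f"<description>{escape(item['description'])}</description>"
                f"<pubDate>{item['pubDate']}</pubDate>"
                f"<guid>{escape(item['guid'])}</guid>"
                "</item>"
            )
        return (
            '<?xml version="1.0" encoding="UTF-8"?>'
            '<rss version="2.0"><channel>'
            f"<title>{escape(self.title)}</title>"
            f"<link>{escape(self.link)}</link>"
            f"<description>{escape(self.title)}</description>"
            + "".join(entries)
            + "</channel></rss>"
        )

    def process(self):
        return self.build_feed(self.generate_data())


class ExampleComics(Hand):
    """Hypothetical feed; a real one would scrape a site here."""

    title = "Example comics"

    def generate_data(self):
        return [{
            "title": "Strip #1",
            "page_link": "http://example.com/strips/1",
            "description": '<img src="http://example.com/strips/1.png">',
            "pubDate": self.build_date(0),
            "guid": "http://example.com/strips/1",
        }]
```

<p>Calling <code>ExampleComics().process()</code> returns the feed as a string; everything site-specific stays inside generate_data.</p>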
<p><strong>And does it work?</strong></p>
<p>Yep. I maintain four feeds at the moment:</p>
<ul>
<li><a href="http://luizirber.org/rss/fsp.xml">Folha de São Paulo comics</a></li>
<li><a href="http://luizirber.org/rss/malvados.xml">Malvados</a> <strong>Update:</strong> <a href="http://feed43.com/malvados.xml">Alternative feed, without the CATA CORNO GOOGLE</a></li>
<li>Magias e Barbaridades <strong>Update:</strong> <a href="http://magiasebarbaridades.blogspot.com/feeds/posts/default">Official feed</a></li>
<li>Rehabilitating Mr. Wiggles <strong>Update:</strong> <a href="http://www.instantrimshot.com">Official feed</a></li>
</ul>
<p><strong>Where can I see this marvel?</strong></p>
<p>The code is available on <a href="http://github.com/luizirber/Hand">GitHub</a>, but it's still pretty raw; I need to
package it properly.</p>
<p><strong>What are the next steps?</strong></p>
<p>The Folha feed takes a while to generate, because every time the script
runs it has to fetch all the pages. So I'm thinking of
adding persistence, but something very simple; sqlite is more than
enough.</p>
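<p>A minimal sketch of what that persistence could look like, using Python's built-in sqlite3 module. The table layout and the function names are assumptions for illustration; the idea is just to remember which items were already scraped so a run only needs to fetch pages it has not seen.</p>

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the item cache; pass a file path to persist it."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "guid TEXT PRIMARY KEY, title TEXT, page_link TEXT)"
    )
    return conn


def remember(conn, item):
    """Store an item; return True only the first time its guid is seen."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO items (guid, title, page_link) VALUES (?, ?, ?)",
        (item["guid"], item["title"], item["page_link"]),
    )
    conn.commit()
    return cur.rowcount == 1


def known_guids(conn):
    """Return the set of guids already stored, to skip re-fetching them."""
    return {row[0] for row in conn.execute("SELECT guid FROM items")}
```

<p>The guid doubles as the primary key, so inserting the same item twice is harmless and the scraper can check <code>known_guids</code> before requesting a page.</p>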
<p>Besides that, I want to describe the feed configuration (where to generate it, which
template to use) in a file, and have the base class read those options.
That makes writing a new feed even easier.</p>
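<p>One possible shape for such a file is an INI section per feed, read with the standard configparser module. The section name, the option names (output, template) and the default template below are made up for the example; nothing here is an existing Hand feature.</p>

```python
import configparser

# Hypothetical configuration: one section per feed.
EXAMPLE = """
[malvados]
output = /tmp/malvados.xml
template = rss2.xml
"""


def load_feed_options(text, section):
    """Read the options for one feed from INI-formatted text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    options = parser[section]  # raises KeyError for an unknown feed
    return {
        "output": options["output"],
        # Fall back to a default template when none is configured.
        "template": options.get("template", fallback="rss2.xml"),
    }
```

<p>The base class could call something like this in its constructor, so that a new feed needs only a subclass and a few lines of configuration.</p>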
<p><strong>Quite a name, huh?</strong></p>
<p>For those who didn't get the name: what's a good name for a feed generator?
While thinking about it, I remembered a <a href="http://www.nin.com/">NIN</a> song called
<a href="http://en.wikipedia.org/wiki/The_Hand_That_Feeds">'The Hand That Feeds'</a>. And besides, it also gives you a hand
generating feeds, right? <a href="http://www.instantrimshot.com">*TU-DUM-TISH*</a>!</p>Feeds de quadrinhos2009-01-20T17:29:00-02:002009-01-20T17:29:00-02:00luizirbertag:blog.luizirber.org,2009-01-20:/2009/01/20/feeds-de-quadrinhos/<p>Unfortunately <a href="http://www1.folha.uol.com.br/fsp/">Folha de São Paulo</a> doesn't provide feeds for its
daily comics. That means depriving us of <a href="http://www.laerte.com.br">Laerte</a> and
<a href="http://www2.uol.com.br/adaoonline/">Adão</a>, but fear not! If you want fresh strips every morning in
your favorite feed reader, just use the one <a href="http://luizirber.org/rss/fsp.xml">I made</a>.</p>
<p>The idea (yes, I'm old; my grandkids will still say 'my grandpa is from the time
when ideia was written with an accent') is simple, and was based on
<a href="http://leandrosiqueira.com/quadrinhos/">Leandro Siqueira</a>'s: although Folha's content is for
subscribers only, the strip images are accessible. You just need to figure out
their naming pattern. But I noticed that the Sunday strips weren't
showing up, because there are different authors on that day (Allan Sieber and
the Bá brothers, currently). So I decided to make a feed that was a bit
more dynamic, and it worked: when the Sunday lineup
changed, the feed kept working without modification.</p>
<p>How was it made? <a href="http://www.python.org">Python</a>, <a href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> and <a href="http://wwwsearch.sourceforge.net/mechanize/">Mechanize</a>. The script
logs in to the site, fetches the comics index, and finds the
image links to generate the feed. By the way, Beautiful Soup is one of the most
useful libraries I've ever used; for working with HTML there's nothing better.</p>
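<p>The "find the image links" step can be sketched like this. To keep the example self-contained it uses the standard library's html.parser instead of Beautiful Soup, and it skips the login and index-fetching parts entirely; the class and function names and the example.com URLs are invented for illustration.</p>

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collect the absolute URL of every <img> that has a src attribute."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page address.
                self.images.append(urljoin(self.base_url, src))


def extract_images(html, base_url):
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.images
```

<p>Because the links are read from the page rather than guessed from a file-name pattern, a change of authors does not break the feed as long as the page layout stays the same.</p>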
<p>And since the hardest part was already done, last week I quickly made
a <a href="http://luizirber.org/rss/malvados.xml">feed</a> for <a href="http://www.malvados.com.br">Malvados</a> too. There is a feed there, but it's only
for the blog. I still haven't figured out how to show the series, but the
regular strips appear in the feed without problems (I think; I use Google Reader and
they show up. Please test in other readers and let me know).</p>
<p>I've been thinking about generalizing the two scripts a bit, to make
writing <a href="http://en.wikipedia.org/wiki/Screen_scraping">screen scrapers</a> easier, but I don't know if it's worth it, since they
depend heavily on the page structure. But let's see what
comes of it =D</p>Tidbits2008-12-29T06:54:00-02:002008-12-29T06:54:00-02:00luizirbertag:blog.luizirber.org,2008-12-29:/2008/12/29/tidbits/<p>A bunch of things that deserve to be said, but too small for separate
posts. Here we go:</p>
<ul>
<li>
<p>My talk at this year's PyConBrasil is on <a href="http://video.google.com/videoplay?docid=-2177235750911656588">Google Video</a>.
It has been since mid-November, but I only found out now =D</p>
</li>
<li>
<p>By the way, I gave a talk at an event called <a href="http://arcodigital.com.br/mobile/prog_1712.php">Mobile Expert</a>,
organized by Editora Europa, at Livraria Cultura. The slides are
<a href="http://www.luizirber.org/talks/python_games">here</a>. The catch is that they got my name wrong (it's with a Z), the name of the
language (it's PYTHON) and the name of the company (Pinuts StudioS, with an S at the
end). But it was really cool anyway.</p>
</li>
<li>
<p>Liniers published Macanudo #6 on his own, and the first printing,
of 5000 copies, came out with a blank cover. And he DREW EVERY ONE OF THEM BY
HAND. As soon as I heard about it I found a way to buy two, from
an Argentine shop that sells online. They arrived in early
December, and they're little beauties. I kept one and gave the other
to Na as a Christmas present.</p>
</li>
<li>
<p>Who, awesomely, made three cold-porcelain miniatures of some
characters from the strip and gave them to me for Christmas. Check it out:</p>
</li>
</ul>
<p><a href="http://www.flickr.com/photos/luizirber/3143780731/"><img class="center" src="https://farm4.static.flickr.com/3224/3143780731_746a8da28d.jpg"></a> Duende, Fellini y El Misterioso Hombre de Negro</p>
<ul>
<li>
<p>Wrapping up with Liniers: <a href="http://autoliniers.blogspot.com/2008/12/liniers-macanudo_28.html">yesterday</a> he gave a scare to those
who follow the strip and don't understand Spanish very well. Except that yesterday
was Día de los Inocentes in Spanish-speaking countries, the equivalent of our
April Fools' Day. Tsk.</p>
</li>
<li>
<p>Now the satellite dish picks up MTV. Yay! Not that it's a big deal, but at
least it shows some good music videos late at night, and however bad the
programming is, it's still an interesting option compared to the other
channels available on free-to-air TV...</p>
</li>
<li>
<p>Fallout 2 is AWESOME. Sure, everyone is talking about 3, but that one
doesn't run on my PC. So I took the chance to play the <a href="http://www.nma-fallout.com/forum/viewtopic.php?t=40443">Restoration
Project</a>, a mod that adds some details that had been planned
but were never implemented. Sweet!</p>
</li>
<li>
<p>Does anyone know how to play Magic? I got a deck for Christmas, but I need
opponents (preferably understanding ones, willing to explain everything).</p>
</li>
<li>
<p><a href="http://www.imdb.com/name/nm0594898/">She</a> <a href="http://www.imdb.com/title/tt0490196/">proves</a> that <a href="http://vainalousachefe.wordpress.com/">GG</a> is <a href="http://www.flickr.com/photos/luizirber/3145992838/">beautiful</a>.</p>
</li>
</ul>Presente de aniversário2008-11-22T11:20:00-02:002008-11-22T11:20:00-02:00luizirbertag:blog.luizirber.org,2008-11-22:/2008/11/22/presente-de-aniversario/<p>My birthday was on the 17th, but there are impatient people in the world. By
November 5th I had already received a present =]</p>
<p>Teaser:</p>
<p><img class="center" src="/images/asuka-hd1.jpg"></p>
<p>The present:</p>
<p><img class="center" src="/images/asuka-gk1.jpg"></p>
<p>As cool as the present is people's reaction to it. Those who have never heard
of <a href="http://en.wikipedia.org/wiki/Garage_kit">Garage Kit</a> and <a href="http://en.wikipedia.org/wiki/Asuka_Langley_Soryu">Evangelion</a> say 'But isn't that a
little girl's thing?'. Those who get it say 'That's awesome!'</p>
<p>Me? I loved it, whatever anyone else thinks =D</p>
<p>Thank you, Na!</p>PETAR2008-10-22T18:01:00-02:002008-10-22T18:01:00-02:00luizirbertag:blog.luizirber.org,2008-10-22:/2008/10/22/petar/<p>An exciting weekend. On Saturday morning Maja (the Croatian
who had been living with us for two months, on an exchange in
Sanca), Stéfanie (the neighbor) and I went to <a href="http://www.petaronline.com.br/">PETAR</a>, at Sinfa's invitation.
We left São Carlos very early, at five in the morning, passed through São Paulo
to meet Sinfa and his girlfriend, and then headed to the
park. Which, being in the southern part of São Paulo state, is really far away.
We only got there after four in the afternoon.</p>
<p>But we arrived already worried. When we left São Carlos the weather was
hot; by the time we reached Sampa it was raining, and the further south we went, the harder
the rain came down. And when we got to PETAR it hadn't stopped. A terrible
sign: we had few warm clothes, and we were heading to the most
isolated and wild area of the park, <a href="http://www.petaronline.com.br/caboclos.htm">Caboclos</a>.</p>
<p>After bouncing along dirt roads for another hour,
we reach the campsite. Several cabins, but all reserved
for researchers. We take a look at the camping area, and my
bones already feel frozen just imagining what the night will be like...</p>
<p>Maja and I quickly set up the tent where she, Stéfanie and I will
sleep, and then go help Sinfa set up the one for him and his girlfriend
(which, by the way, is much bigger). Another 20 minutes or so, and we are
soaked enough to start cursing the trip. Ah, how I
wish I were home, watching a movie and eating popcorn...</p>
<p>But as the sage says, in for a penny, in for a pound. We have dinner, and as we start
settling in to sleep we realize that our tent isn't all that
waterproof. We start seriously discussing the possibility of
driving on to Curitiba. But before things get worse, we
all move to Sinfa's tent, and there we play a bit of truco before
going to sleep (Maja and I won, even though I almost never play and it was
her first time playing =] )</p>
<p>On Sunday morning we are woken up by the guide's arrival, but the will
to get out of bed was nil. Stéfanie and I already had an idea of what we had
gotten into, and we were against going ahead with the planned trail, not least because
it would take us too long to leave and we would get to São
Carlos very late. But Maja still thought it would be a wide-open,
easy trail with an accessible cave. Since it was 3 votes to 2, we
all went to do the trail.</p>
<p>Since I'm a simple soul, I had a blast. Every time I
slipped and fell in the mud I laughed, possibly at the insanity
of it all. And situations like these are great for letting dark humor and
irony come out, so Stéfanie and I were unstoppable. Too bad the
guide didn't get half the jokes we made.</p>
<p>Speaking of the guide. He took a technical course in mechatronics, but until he
finds a job he works at PETAR as a guide. He gave some explanations
that would make scientists' hair stand on end (special mention to the Blind Catfish),
but he was a nice guy. And he mentioned, at one point, that he thought we
would be a group of crazy experienced adventurers, for wanting to go in
the rain on the hardest trail in the park. HAH! A computer geek, a Croatian
who has never seen the bush in her life, and an anemic? Crazy for sure,
adventurers against our will, but experienced never...</p>
<p>After three hours of many falls, scratches and mud, we reach the
cave. Only to find out that we can't go in, because the river is too
high. Without many options, we start back, now by another route.
We cross the river twice (the second time, with a strong current), climb
a steep slope with the help of ropes, and more and more I feel like a
little boy scout. The guide isn't paying much attention, but Maja and Stéfanie
are exhausted, and several times I have to help them get across some
spots. By the way, the guide's calm is admirable: in several places a
slip would lead to a reeeeally nasty fall, at least some 30 meters
rolling down the ravine, and he couldn't care less.</p>
<p>Near the end, things are really ugly. Maja desperate because she is going
to get sick (err, and she really did...), Stéfanie with hypoglycemia,
Leandro and his girlfriend a bit further behind us. And me? I don't know from where, but
still with energy to keep going. I mean, I do know, but that is
a subject for another post. As much of a mess as it was, with
us having to take a cold shower right after (electricity? pfff)
and having to go back to Sanca sharing the only pair of clean
sneakers, I was having fun. Nothing better than laughing at ourselves =D</p>
<p>On Monday I couldn't even go to work; simply getting out of bed was
very hard. I think the adrenaline of the moment kept me from noticing that I
was wrecking myself.</p>
<p>So, recommendations: if you go to PETAR, start with the Santana
cave, and then venture into the wilder ones. It's worth it, but avoid
going in the rain! =D</p>
<p>UPDATE: since I got an illustrious comment from Stéfanie, let it be
explained that:</p>
<ul>
<li>
<p>She helped stake it to the ground and put on the rainfly. So she did help
set up our tent.</p>
</li>
<li>
<p>Yes, I spent most of the time behind her. In other words, I wasn't
much help to her (only when she slipped and almost fell,
near the end of the trail). Oh, and also when she had hypoglycemia,
handing out Negresco cookies and giving support. So my helpful boy-scout
side showed up more for Maja and, at times, for Sinfa's girlfriend.</p>
</li>
<li>
<p>And she knows very well where I got my energy from =]</p>
</li>
</ul>#iphonedev-br2008-09-30T05:14:00-03:002008-09-30T05:14:00-03:00luizirbertag:blog.luizirber.org,2008-09-30:/2008/09/30/iphonedev-br/<p>Howdy. I was in #python-br when <a href="http://elyezer.com/">Elyézer</a> asked something
about Python and Objective-C. I pointed him to PyObjC, and he asked whether it
was available on the iPhone. Hmm, it isn't.</p>
<p>But I found someone else developing for the iPhone!</p>
<p>Before it got too off-topic for #python-br, we moved to the
#iphonedev-br channel on Freenode. And now we need to make it popular!</p>
<p>So, if you're looking for a place in Portuguese to discuss the
hardships of Objective-C (which, in my opinion, is waaay better than C++
...), show up at #iphonedev-br @ irc.freenode.org .</p>
<p>UPDATE: If you don't have an IRC client installed, <a href="http://mibbit.com/">mibbit</a> can
save you. Tip from <a href="http://ayharano.wordpress.com/">Frank</a>!</p>Minha primeira hackeada2008-08-10T10:27:00-03:002008-08-10T10:27:00-03:00luizirbertag:blog.luizirber.org,2008-08-10:/2008/08/10/minha-primeira-hackeada/<p>Internet memes are like PowerPoints in email: you may follow them
or read them, but you shouldn't pass these plagues along. Since <a href="http://vainalousachefe.wordpress.com/2008/07/31/minha-primeira-hackeada/" title="Vai na Lousa, Chéééééoééfe!">GG</a>
teed up the ball I'll kick it, but I don't recommend anyone else do the
same! =D</p>
<p>My first hack? The first computer at home was a 286,
bought in 1993. It only ran a text editor called Pangloss, and a
mix of calendar, personal organizer and spreadsheet called Every. And
somehow I still managed to have fun with the thing. My mom got hold of
a computing course from Banco do Brasil called INFO2000, which
taught while telling a really nice little story about a
space traveler.</p>
<p>And all the software mentioned is so old or obscure that I found
nothing about it on Google. I'll try to recover it next time
I go home and find out more.</p>
<p>One day my cousin brought 'Outrun' and 'Prince of Persia' on two floppy disks.
And then I had plenty to do =D</p>
<p>But an actual hack, I remember, only came on a 486, in 1994. A
friend of mine brought several Commander Keen games (2, 3 and 5, if I'm not mistaken), and
I wrote a .BAT with a menu to launch the games. Useless, but it was my
first 'shell script': it had to print messages, read the user's
input, and run the right game.</p>
<p>Besides that, I remember having to poke around in CONFIG.SYS and
AUTOEXEC.BAT several times to disable some things in order to free up memory
(ah, the good old 640KB...) for some game. For someone who was nine at the
time and had no one to learn from, not bad, right?</p>
<p>Another interesting story: when I was in fourth grade the nuns
at my school bought a computer and left it in the office. And,
one day, the secretary came to get me in the classroom, because she
couldn't manage to do something (I don't remember what, it's been 14 years, come on!). I went there,
fixed it, and went back to class. Tech support since childhood, hehehe.</p>Dia do Amigo2008-07-20T18:03:00-03:002008-07-20T18:03:00-03:00luizirbertag:blog.luizirber.org,2008-07-20:/2008/07/20/dia-do-amigo/<p><img class="center" src="/images/20072008278-001.jpg"></p>
<p>Sometimes I feel like starting a blog just with Liniers strips, but then
I remember you can always go to the <a href="http://www.livrariacultura.com.br/scripts/cultura/catalogo/busca.asp?tipo_pesq=titulo&palavra=macanudo&topo=livro&sid=01351501910720655571867732&k5=259E5B04&uid=&lastreg=&parceiro=121233">Cultura</a> website and buy the
collections. What are you waiting for?</p>Herbie Hancock2008-06-02T14:28:00-03:002008-06-02T14:28:00-03:00luizirbertag:blog.luizirber.org,2008-06-02:/2008/06/02/herbie-hancock/<p><img class="center" src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Herbie_hancock.jpg/600px-Herbie_hancock.jpg"></p>
<p>I went to his show yesterday, at Parque Villa-Lobos. Very good, even though I
know very little of his career.</p>
<p>But what I really wanted to say is that I have rarely felt so
nerdy. I listened to him play and remembered the soundtrack of
<a href="http://en.wikipedia.org/wiki/Transport_Tycoon">Transport Tycoon</a> =D</p>Crônicas da Fundação2008-05-25T12:18:00-03:002008-05-25T12:18:00-03:00luizirbertag:blog.luizirber.org,2008-05-25:/2008/05/25/cronicas-da-fundacao/<blockquote>
<p>Você é uma criatura complexa, Dors, e não tenho respostas simples para
lhe oferecer. Na minha vida, conheci vários indivíduos em cuja
presença sentia-me melhor comigo mesmo. Tentei comparar o prazer que
sentira na presença dessas pessoas com a tristeza que senti quando
elas finalmente se foram, para saber se o saldo tinha sido positivo ou
negativo. Cheguei à conclusão de que o prazer de sua companhia
superava de longe a tristeza de sua perda. Minha conclusão, portanto,
é de que o que está experimentado agora fará bem a você.</p>
</blockquote>
<p>Forward the Foundation (Crônicas da Fundação), Isaac Asimov.</p>
<p>Now only "Foundation and Earth" is left; I'll finish reading the series
before commenting. For now, here's another quote =D</p>Zone of the Enders2008-05-23T11:53:00-03:002008-05-23T11:53:00-03:00luizirbertag:blog.luizirber.org,2008-05-23:/2008/05/23/zone-of-the-enders/<p><img class="left" src="/images/jehuty1.jpg" width="195"></p>
<p>As mentioned a few posts ago, I bought a PS2. And so
began the search for good games.</p>
<p>I've always read several gaming blogs, but more out of curiosity than out of
a need to play the games. A few days ago, on <a href="http://continue.com.br/29/04/2008/kojima-abre-o-bico-enders-saindo-da-zona-do-esquecimento">one of them</a>,
there was a post about a series produced by Hideo Kojima, creator of Metal
Gear Solid, called <a href="http://en.wikipedia.org/wiki/Zone_of_the_Enders">Zone of the Enders</a>. Putting you in control of
giant robots, in space, with lots of different weapons? The dream of
any nerd who grew up watching <a href="http://en.wikipedia.org/wiki/Choushinsei_Flashman">Flashman</a> and loves <a href="http://en.wikipedia.org/wiki/Neon_Genesis_Evangelion_(TV_series)">EVA</a>.</p>
<p>The game mentioned was the second in the series, but I decided to
<span style="text-decoration:line-through;">download</span> start with the
first, so as not to spoil the story. In fact, that's why I still haven't
started playing MGS; I have to play the first one before. I put it in the
console, started, and it looks great, especially for a game from 2001. The
camera is third-person, right behind the
<span style="text-decoration:line-through;">robot</span> Orbital Frame, and
there is complete freedom of movement, since most of the time you
are flying rather than on the ground.</p>
<p>The game tells the story of Leo, a boy who sees his friends
killed by an Orbital Frame belonging to the invaders of his colony. During his
escape he finds another Orbital Frame, apparently abandoned, and
during an explosion he falls into the OF's cockpit and is recognized as its
pilot (or Runner). From then on he must take the OF to the resistance
movement, albeit against his will, and live with the dilemma of
killing other human beings.</p>
<p>In other words, it's basically the standard plot of every mecha
story. Even if it isn't revolutionary like EVA, the story flows well.
After a little over five hours of play, I reach the meeting point
to hand over the OF, and Anubis shows up, the twin OF of Jehuty, the OF I
had piloted so far. I take a terrible beating, and all that's left is to run,
because Anubis is too strong. Atlantis, the ship that will take Jehuty to
Mars, helps out and I manage to escape. And the credits roll.</p>
<p>It isn't the first time I've seen credits show up this far into
a game, so I watch them. Until the screen goes black, a few more
scroll by and... the game restarts. WHAT? Is that the end of the game already? Sure,
it's not an RPG, but less than six hours of play? And leaving a cliffhanger
that big?!? Since the game was released together with a demo of MGS:2,
I was left with the awful impression that it was made only as a stopgap
for MGS:2, and took advantage of that to sell.</p>
<p>But then, why invest in such a good story only to abandon it
halfway?</p>
<p>Patience. I'm already
<span style="text-decoration:line-through;">downloading</span>
getting hold of the second one, and it seems to be a bit longer.
Fortunately, judging by the <a href="http://www.youtube.com/watch?v=K_B1x8E_mJc">videos on YouTube</a>, the gameplay is still the same,
and the graphics are even better.</p>
<p><img class="center" src="/images/zoe1.jpg" width="300"></p>Fundação2008-05-12T14:53:00-03:002008-05-12T14:53:00-03:00luizirbertag:blog.luizirber.org,2008-05-12:/2008/05/12/fundacao/<blockquote>
<p>A linguagem foi, originalmente, o expediente por meio do qual o Homem
aprendeu, imperfeitamente, a transmitir os pensamentos e emoções do
seu espírito. Erigindo sons arbitrários e combinações de sons como
representação de gradação de cores mentais, desenvolveu um método de
comunicação, porém um método que, na sua inabilidade e pesada
inadequação, fez degenerar toda a delicadeza do espírito numa
transmissão grosseira e gutural de sinais.<br>
Os resultados podem ser seguidos profundamente e todo o sofrimento
que a humanidade conheceu pode ser avaliado apenas pelo fato de nenhum
homem na história da Galáxia até <a href="http://en.wikipedia.org/wiki/Hari_Seldon">Hari Seldon</a>, e muito poucos
homens depois dele, ter conseguido compreender realmente outro homem.
Cada ser humano vivia atrás de uma parede impenetrável de névoa
sufocante, dentro da qual ninguém mais existia senão ele. Havia,
ocasionalmente, os sinais sumidos das profundidades da caverna em que
o outro homem estava metido, de modo que cada um podia caminhar às
apalpadelas na direção do outro. Contudo, por não se conhecerem uns
aos outros, não poderem compreender-se uns aos outros, não ousarem
confiar uns nos outros, e sentirem desde a infância os terrores e
insegurança desse isolamento definitivo, havia o medo da perseguição
do homem pelo homem, a rapacidade selvagem do homem para com o homem.</p>
</blockquote>
<p>Isaac Asimov, Second Foundation.</p>Okami2008-05-03T11:23:00-03:002008-05-03T11:23:00-03:00luizirbertag:blog.luizirber.org,2008-05-03:/2008/05/03/okami/<p><img class="left 150" src="/images/okami1.jpg"></p>
<p>5 months of silence. Not that I went on a spiritual
retreat, or that nothing happened in the last few months
(a lot of cool stuff happened; I'll post about it over time), it's just that I
always forget to write =D</p>
<p>But today there's something that will break my silence. On Tuesday the
PS2 I bought from a freshman arrived. My first video game console! Yes, you read
that right, I got my first console at 22!</p>
<p>Actually, there's a nice story about this: when I was little
my family went to visit Foz do Iguaçu, and obviously we crossed the
border to go to Ciudad del Este. There, my mom made me an
offer: "Which do you prefer, a musical keyboard or a video game?" Like
every good child aware of his potential, I answered "Video game!". And
she gave me a keyboard. Which, by the way, is still gathering dust at home to this day.
Not that it kept me from playing; later I got a computer and
then there was no stopping me (ah, LucasArts adventures...). But I digress.</p>
<p>Today I want to talk about a game called <a href="http://www.capcom.com/okami/">Okami</a>. It was developed
by <a href="http://en.wikipedia.org/wiki/Clover_Studio">Clover Studio</a> for the PS2, and tells the story of Okami
Amaterasu, a wolf who is the incarnation of the goddess
<a href="http://en.wikipedia.org/wiki/Amaterasu">Amaterasu</a> from Japanese mythology, and her fight against the monster <a href="http://en.wikipedia.org/wiki/Orochi">Orochi</a>. The whole story
of the game follows Japanese mythology, and so does the art. Among several
striking features, one of the most interesting is the use of cel
shading, which contributes a lot to the game's visual impact.</p>
<p>But, despite its impressive technical level (there are few games this
beautiful out there, and this on PS2 hardware, which isn't all that
powerful these days), what really caught my attention was the story.
Very well written, with captivating characters and many references to
other mythologies and cultures, it shows that a lot of effort went into
the script, something quite rare in the video game industry.</p>
<p>And for what? Clover Studio survived for two years and was shut down
in 2006, because it wasn't turning a profit. Why didn't people like
Okami, despite all the critical acclaim?</p>
<p>The biggest flaw, in my view, is the beginning of the game. It spends
almost 20 minutes introducing the story, and for today's action-hungry
consumers that is unbearable. And there isn't even an option to skip the
introduction. I understand why they did it, because it really is
important to the story, but the world isn't very patient.</p>
<p>Too bad for those without patience: they missed out on one of the most
curious gameplay mechanics I've ever seen. There are 13 Brush Gods in the
game, and each one grants a different technique. To apply a technique,
you pause the game, a painting canvas appears, and you use the analog
stick to draw on the screen. One horizontal stroke makes a cut, two
strokes slow down time, an infinity symbol creates fire. It's simply
brilliant, and it makes the game extremely fun.</p>
<p>The ending even gives some hope of a sequel, but unfortunately Clover
was shut down shortly after the game's release. A pity, but what remains
is the final lesson given by the narrator at the end of the game, which
really shows a concern that should be more common in the video game
industry.</p>
<p>Even though Clover closed, Capcom ported the game to the Wii and
released it on April 15, and with the Wiimote using the celestial powers
becomes much more interesting, because you really use the controller as a
brush and draw on the screen, instead of having to use the analog stick
(which is a bit fiddly sometimes...).</p>
<p>Farewell, Okami!</p>Feliz ano novo!2008-01-01T07:02:00-02:002008-01-01T07:02:00-02:00luizirbertag:blog.luizirber.org,2008-01-01:/2008/01/01/feliz-ano-novo/<p><img class="center" src="/images/7669141.jpg"></p>Nicholas Was…2007-12-24T23:00:00-02:002007-12-24T23:00:00-02:00luizirbertag:blog.luizirber.org,2007-12-24:/2007/12/24/nicholas-was/<blockquote>
<p>older than sin, and his beard could grow no whiter. He wanted to
die. The dwarfish natives of the Arctic caverns did not speak his
language, but conversed in their own, twittering tongue, conducted
incomprehensible rituals, when they were not actually working in the
factories.</p>
<p>Once every year they forced him, sobbing and protesting, into Endless
Night. During the journey he would stand near every child in the
world, leave one of the dwarves' invisible gifts by its bedside. The
children slept, frozen into time.</p>
<p>He envied <a href="http://en.wikipedia.org/wiki/Prometheus">Prometheus</a> and <a href="http://en.wikipedia.org/wiki/Loki">Loki</a>, <a href="http://en.wikipedia.org/wiki/Sisyphus">Sisyphus</a> and <a href="http://en.wikipedia.org/wiki/Judas_Iscariot">Judas</a>. His
punishment was harsher.</p>
<p>Ho.</p>
<p>Ho.</p>
<p>Ho.</p>
</blockquote>
<p><a href="http://en.wikipedia.org/wiki/Neil_Gaiman">Neil Gaiman</a>, <a href="http://en.wikipedia.org/wiki/Smoke_and_Mirrors_%28book%29">Smoke and Mirrors</a>. And a Merry Christmas! =D</p>Maestrograd2007-12-19T00:47:00-02:002007-12-19T00:47:00-02:00luizirbertag:blog.luizirber.org,2007-12-19:/2007/12/19/maestrograd/<p><a href="http://maestrograd.myminicity.com/">http://maestrograd.myminicity.com/</a></p>
<p>The name Maestrograd came up when I needed to create a repository for
my undergraduate coursework (Maestro + Graduação, Portuguese for
undergraduate studies, = Maestrograd). But it's such a cool name, it
sounds like a city in the USSR...</p>
<p>So, taking advantage of this My Mini City thing, I decided to build a
city! It's located in Brazil, but since the site doesn't let you specify
exactly where, I decided to explain here. It's in the northwest of Rio
Grande do Sul, near Três de Maio (which explains why it has so few
inhabitants =] ), and it's the last refuge of Russian communists in the
world. Er, apparently of the ORIGINAL Russian communists, 1917 style,
because there's almost nobody there =D</p>
<p>If you are a Russian communist, come to our utopian little land! Come
live in Maestrograd!</p>
<p>(Wow, what a great idea these MyMiniCity guys had. Stupidly
simple, stupidly cool).</p>Livro de auto-ajuda?2007-11-20T01:04:00-02:002007-11-20T01:04:00-02:00luizirbertag:blog.luizirber.org,2007-11-20:/2007/11/20/livro-de-auto-ajuda/<blockquote>
<p>Nevertheless, this algorithm is slower, more complicated, more
expensive, and less robust than the original centralized one. Why
bother studying it under these conditions? (...) <strong>Finally, like
eating spinach and learning Latin in high school, some things are said
to be good for you in some abstract way</strong>.</p>
</blockquote>
<p>Andrew Tanenbaum, Distributed Operating Systems. Sometimes the guy gets
a bit carried away in the middle of his explanations =D</p>When yer 222007-11-17T20:04:00-02:002007-11-17T20:04:00-02:00luizirbertag:blog.luizirber.org,2007-11-17:/2007/11/17/when-yer-22/<p>Stuck in the perpetual motion<br>
Dying against the machine<br>
The whole thing leaves<br>
You a nothing instead of a these<br>
The sun is black and the black halos fly<br>
And your number is backwards again when you try<br>
The sound is so cute when you're 22<br>
When you're 22</p>
<p>Eggs break when you walk on the scramble<br>
You're living against the machine<br>
The whole thing leaves<br>
You a nothing instead of a these<br>
The bone is cracked and the cracked eggshells fly<br>
And your number is backwards again when you drive<br>
The whole thing's removed when you're 22<br>
When you're 22</p>
<p>Unrelated, but it's not every day that you have a birthday and your age
matches the song of one of your <a href="http://www.flaminglips.com">favorite bands</a>.</p>
<p>And this is also a test of WordPress post date tampering, since I
had no internet on the 17th and posted this on the 19th.</p>Piada Nerd Infame do dia2007-11-02T20:50:00-02:002007-11-02T20:50:00-02:00luizirbertag:blog.luizirber.org,2007-11-02:/2007/11/02/piada-nerd-infame-do-dia/<p>By the power of GREP, I am...</p>
<p>RE-MAN!</p>O fim da infância2007-11-02T20:41:00-02:002007-11-02T20:41:00-02:00luizirbertag:blog.luizirber.org,2007-11-02:/2007/11/02/o-fim-da-infancia/<blockquote>
<p>There were, of course, some drones, but the number of people
sufficiently strong-willed to indulge in a life of complete idleness is
much smaller than is generally supposed. Supporting such parasites was
considerably less of a burden than providing for the armies of
ticket-collectors, shop assistants, bank clerks, stock-brokers and so
forth, whose main function, when one took the global point of view, was
to transfer items from one ledger to another.</p>
</blockquote>
<p>Arthur C. Clarke, "Childhood's End", 1953.</p>Matemática Concreta2007-10-19T20:43:00-02:002007-10-19T20:43:00-02:00luizirbertag:blog.luizirber.org,2007-10-19:/2007/10/19/matematica-concreta/<p>Here in São Carlos there's a really nice used bookstore called Outros Contos. I
usually go there at least once a month, and yesterday was one of those visits.
But the store had moved!</p>
<p>After a lot of wandering, Frank and I found the new place. Apparently
the house where it used to be was at risk of collapsing, and it had
reopened a week earlier. After walking around inside for a bit, you
could tell it was still somewhat disorganized (although "disorganized
used bookstore" is a pleonasm =D), but I still managed to find a
brand-new Terry Pratchett for R$ 10. When we were about to leave, Frank
shows up with a book in his hand. "Look, it's by Knuth". My jaw nearly
hit the floor.</p>
<p>It wasn't a volume of The Art of Computer Programming, which would
have been asking too much. It was a Portuguese edition of Concrete
Mathematics, a book I had already found in the USP library and had been
wanting to buy for ages. The problem is that the <a href="http://www.amazon.com/Concrete-Mathematics-Foundation-Computer-Science/dp/0201558025/ref=pd_bbs_sr_1/105-2191388-3077258?ie=UTF8&s=books&qid=1192826225&sr=8-1">original American edition</a> costs US$ 55.29 on
Amazon, and my wallet can't take a purchase like that. I took a look and
handed it back to Frank, thinking he was going to buy it, but he turned
around and went to put it back in its place.</p>
<ul>
<li>Wait, you're not buying it? - I said. </li>
<li>No. </li>
<li>THEN GIVE IT HERE, I'LL TAKE IT!</li>
</ul>
<p>(It wasn't quite that shouted, but almost =D). It had been there for
almost two years, but I had never seen it. Lesson: don't skip the math
section just because there are several copies of Leithold there.</p>
<p>Now I have to start doing the exercises; so far I've only read the
opening chapter. And thanks once again, Frank! =D</p>I gave my heart to a simple chord2007-10-13T03:25:00-03:002007-10-13T03:25:00-03:00luizirbertag:blog.luizirber.org,2007-10-13:/2007/10/13/i-gave-my-heart-to-a-simple-chord/<blockquote>
<p>People worry about kids playing with guns, and teenagers watching
violent videos; we are scared that some sort of culture of violence
will take them over. Nobody worries about kids listening to thousands
- literally thousands - of songs about broken hearts and rejection and
pain and misery and loss. The unhappiest people I know, romantically
speaking, are the ones who like pop music the most; and I don't know
whether pop music had caused this unhappiness, but I know that they've
been listening to sad songs longer than they've been living the
unhappy lives.</p>
</blockquote>
<p><a href="http://en.wikipedia.org/wiki/High_Fidelity_%28novel%29">High Fidelity</a>, by <a href="http://en.wikipedia.org/wiki/Nick_Hornby">Nick Hornby</a>.</p>Quase lá2007-08-07T22:02:00-03:002007-08-07T22:02:00-03:00luizirbertag:blog.luizirber.org,2007-08-07:/2007/08/07/quase-la/<p><a href="http://mingle2.com/geek-quiz">80% Geek</a></p>
<p>Tip from <a href="http://palacehotel.blogspot.com">André</a>.</p>Dia da Toalha!2007-05-25T23:00:00-03:002007-05-25T23:00:00-03:00luizirbertag:blog.luizirber.org,2007-05-25:/2007/05/25/dia-da-toalha/<p><a href="http://www.flickr.com/photos/luizirber/sets/72157600266409006/detail/"><img class="center" src="https://farm1.static.flickr.com/198/513954323_d0936294fd.jpg"></a></p>
<p>Today was another Towel Day, a tribute by Douglas Adams fans to his
work. For a few more photos taken during the day (yeah, this time I kept
the towel by my side all day long =D), click on the
photo above.</p>Um fim de semana produtivo2007-05-21T00:17:00-03:002007-05-21T00:17:00-03:00luizirbertag:blog.luizirber.org,2007-05-21:/2007/05/21/um-fim-de-semana-produtivo/<p>A screenshot is worth a thousand words:</p>
<p><img class="center" src="/images/bugbrother-stats.jpg" title="Ignore the valleys, cherish the peaks!" alt="Ignore the valleys, cherish the peaks!"></p>
<p>Bonus: too lazy to write that pile of macros and boilerplate to
declare a GObject? Use <a href="http://www.5z.com/jirka/gob.html">gob</a>! Apparently it hasn't been actively
developed for a while, but it works well. But of course you'll write the
code properly when you have time, right? ;-D</p>
<p>Another very interesting piece of software: <a href="http://sourceforge.net/projects/g-inspector">G-Inspector</a>. This one has gone
even longer without development, but it also works very well. It lets
you inspect graphical interfaces built with GTK+. Just hover the mouse
over a widget and it shows its name, memory address and, if you look
closely, a whole bunch of other properties.</p>
<p>Happy Hacking!</p>
<p>P.S.: Don't hold your breath for the second part of the FISL report.</p>FISL 8.0 - Primeira parte2007-04-13T21:05:00-03:002007-04-13T21:05:00-03:00luizirbertag:blog.luizirber.org,2007-04-13:/2007/04/13/fisl-80-primeira-parte/<p><strong>11.04</strong></p>
<p>Small hours. Arrival in Porto Alegre. Taxi, sister's house, bed.<br>
Morning. Set up the ADSL modem and access point (nobody's made of iron,
after all). Then, prepare the presentation for WSL. </p>
<p>Noon. Sister is home. We go out for lunch, micro-stroll downtown,
mega-meal, R$7.50 all-you-can-eat buffet, dessert included (ice cream).
Stomach hurts. We head back to the apartment, stopping first at Toca do
Disco (which is closed). Presentation still blank. </p>
<p>Around 4-something in the afternoon. Heading to the airport. The bus
passes in front of the apartment and drops you in front of the airport.
Luxury, except that the bus seems about to fall apart every time it stops
(the air conditioning, apparently). A little more time waiting for the
plane to arrive and here come the little red and black blotches. It was
<a href="http://fabiocpn.wordpress.com">Pacu</a> and <a href="http://vainalousachefe.wordpress.com">GG</a>, the latter on his first plane trip. Presentation still blank.<br>
Night. Flaky WiFi connection (blame Pacu's laptop, which doesn't support
WPA). Access point reset, no encryption. Gambiarra is the name of the new
network. Pizza. Shower. </p>
<p>Presentation still blank. OpenOffice open, paper open, the canvas
still blank. A certain thrill sets in, thinking that all that whiteness
means a world without limits for my creativity. </p>
<p>Half an hour later. I finish reading the backlog of RSS feeds.
Presentation still blank. Hmm. Time to work. </p>
<p>Another half hour. Presentation apparently ready. A few small
improvements to the text layout. Save as PDF, copy to the 770, bed.</p>
<p><strong>12.04</strong> </p>
<p>Morning. Woke up a few times during the night; cars passing on the
street sounded like rain. Soon the worry of having to take a taxi to the
event arises. Unfounded worry: cloudy day but no sign of rain. Early
departure for the venue, which is far away. </p>
<p>Quarter to nine. The talk starts at 10 in the morning, but we already
have our badges in hand. A few quick checks of the FISL kit contents,
sorting out the ads of no interest, trash. </p>
<p>Nine thirty. I enter the main hall and go to the wrong room for my
talk. D'oh. After finding the right room, first to arrive. Wait. </p>
<p>Ten fifteen. Two more speakers arrive. No sign of the WSL organizers.
A few more minutes and Marinho arrives, coordinator of WSL, but not of
this session. Wait. Apparently the first speaker didn't show up, so I'll
have to go first. Copy the talk from the 770 to the laptop, open it, wait
for the start. </p>
<p>Ten thirty. Here we go. The talk begins. A few small stammers; I put
my hand in my pocket before realizing that's not recommended.
The talk flows well. <a href="http://www.gnome.org/~lucasr/">Lucas</a> shows up at the door, waves hello and takes a
photo. I return the greeting with a wave, which must have looked
odd to those watching the talk (few people, basically
Pacu and GG, <a href="http://www.iei.org.br/~rafael/blog/">Petro</a> and his brother, and the speakers). I get a little tangled up
at the end, then open for questions. Nothing. OK, it's over. It wasn't that hard. </p>
<p>Noon. Lunch. Same place as last year. </p>
<p>Mid-afternoon. I find out that Kurt Vonnegut died yesterday, at 84.
Official mourning. </p>
<p>Late afternoon. We leave. We stop by the cinema and watch Mr. Bean's
Holiday. It would be silly, if it weren't so funny. </p>
<p>Night. Toast for dinner. A bit of dawdling. Bed.</p>Poeminha da aula de Redes2007-03-20T22:34:00-03:002007-03-20T22:34:00-03:00luizirbertag:blog.luizirber.org,2007-03-20:/2007/03/20/poeminha-da-aula-de-redes/<p><strong>The Synchronization of Love</strong></p>
<p>- Mariazinha, I love you.<br>
- ACK.</p>Café-da-manhã dos Campeões2007-01-13T03:24:00-02:002007-01-13T03:24:00-02:00luizirbertag:blog.luizirber.org,2007-01-13:/2007/01/13/cafe-da-manha-dos-campeoes/<blockquote>
<p>As I approached my fiftieth birthday, I had become more and more
enraged and mystified by the idiot decisions made by my countrymen. And
then I had come suddenly to pity them, for I understood how innocent and
natural it was for them to behave so abominably, and with such
abominable results: They were doing their best to live like people
invented in story books. This was the reason Americans shot each other
so often: It was a convenient literary device for ending short stories
and books.<br>
Why were so many Americans treated by their government as though
their lives were as disposable as paper facial tissues? Because that was
the way authors customarily treated bit-part players in their made-up
tales.<br>
And so on.<br>
Once I understood what was making America such a dangerous, unhappy
nation of people who had nothing to do with real life, I resolved to
shun storytelling. I would write about life. Every person would be
exactly as important as any other. All facts would also be given equal
weightiness. Nothing would be left out. Let others bring order to chaos.
I would bring chaos to order, instead, which I think I have done.<br>
If all writers would do that, then perhaps citizens not in the
literary trades will understand that there is no order in the world
around us, that we must adapt ourselves to the requirements of chaos
instead.<br>
It is hard to adapt to chaos, but it can be done. I am living proof
of that: It can be done.</p>
</blockquote>
<p>Kurt Vonnegut, Breakfast of Champions (Café-da-manhã dos Campeões),
1973.</p>
<p>I've been posting a lot of excerpts from Vonnegut's books lately, but I
couldn't let this one pass.</p>Os próximos quatro anos prometem2007-01-06T02:23:00-02:002007-01-06T02:23:00-02:00luizirbertag:blog.luizirber.org,2007-01-06:/2007/01/06/os-proximos-quatro-anos-prometem/<p><img class="center" src="/images/yeda.jpg"></p>
<p>Ah, long live anti-PT sentiment here in RS. Her term hadn't even
started and she already had a tax reform bill shelved by the state
assembly, with the vice-governor's support.</p>Show dos Lips, um ano depois2006-11-26T21:35:00-02:002006-11-26T21:35:00-02:00luizirbertag:blog.luizirber.org,2006-11-26:/2006/11/26/show-dos-lips-um-ano-depois/<p>Laziness is the mother of invention. Or is it?</p>
<p>So I don't have to write it all over again, I'm reviving a text from
the late Notícias do LC. Nearly the best show of my life, if Wilco hadn't
been so fucking great a month earlier.</p>
<p>Enjoy the read.</p>
<p>x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x<br>
Originally posted on December 4, 2005<br>
x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x</p>
<p>I need to build an altar to the mobile phone companies. First TIM,
which brought Wilco. And now Claro, which brought the Stooges, Nine Inch
Nails and the Flaming Lips. My two favorite bands (Wilco beats the
Flaming Lips by a few millimeters) in less than a month. Long live the
cell phone! (even though it gives you cancer).</p>
<p>Now all that's missing is a confirmed Radiohead show in Argentina. And
there are still the Rolling Stones and U2 early next year, and the
Stones should be FREE. What a season. (UPDATE: it seems the Radiohead
show was denied by the band's spokesperson. Pity).</p>
<p>REAL UPDATE: I went to the Rolling Stones show. Nyah nyah! =D</p>
<p>Well then, the Wilco show was covered in the last edition. I still
think it was the best show of the year, especially considering the
emotional side. What those guys do with just their instruments, with no
big screens, no lights, no pyrotechnics, is sublime. CQÉR, on the other
hand, was an ode to the special effect. Just ask the Flaming Lips and
Nine Inch Nails.</p>
<p>For me, the festival began in early October, when tickets went on
sale. I was at the box office on the first day to buy mine. In truth
there was no need for such desperation, since the tickets didn't sell out
(though nothing close to what happened in Rio, where scalpers were
selling 3 tickets for R$10). But I couldn't conceive of the possibility
of missing the Lips show. Note that, just as in the last Notícias, I
sound like a Backstreet Boys fangirl, or a fan of any other kind of band
that little girls adore as teenagers and are ashamed to admit to a few
years later. And I see no problem at all with that, first because you
can't compare one band with the other, and second because I really
believe music can be far more powerful than some people think. Can you
change the world with music? No, you hippie. Can you change people by
trying to convince them to listen to a certain kind of music? No, you
fascist, music can't be forced on anyone. But can music change people?
It can, and how. And so I see nothing strange, outlandish or ridiculous
in what comes next.</p>
<p>Two weeks ago, one week before the show, one of the guys who lives in
my shared student house came over all happy, asking me to go to the 89fm
radio site to listen to the show his band had been on. As I open the site
to download the program, I see a contest: "Get on stage with the Flaming
Lips!". I got excited. For those who have never heard of them: they're a
band from Oklahoma, USA, with some 20 years on the road. They have
always been known for anarchic shows. But on the tours for their last
two albums they turned their show into something comparable to
<del>a circus</del> a psychedelic birthday party: the singer walks over
the crowd inside a plastic bubble before the show, there are smoke
machines, bubbles, confetti and streamers and, to top it off, several
people in costume on the sides of the stage (more or less giant Parmalat
plush animals). I rushed to sign up and emailed a friend of mine from
São Paulo, <a href="http://palacehotel.blogspot.com">André</a>, who is the biggest fan of the
band I know (after me). I couldn't sign up (problems with the site), but
I didn't stress much: I knew the show would be spectacular either
way.</p>
<p>Saturday afternoon, I arrive at André's house to go to the show, and
he greets me saying he won the contest. He was going on stage! I was a
bit stunned, but everyone there celebrated a lot. There were at least 20
people at his place going to the show, most of them a crowd from here in
São Carlos who had been classmates in the image and sound program, but
there were also two guys from Santa Cruz do Sul. Who made me feel
ashamed of losing my accent. Damn.</p>
<p>A little after five in the afternoon the whole gang went to catch the
bus to the venue (which was far as hell). Packed bus, half an hour's
ride, and we arrived: the Good Charlotte show was on. Too bad it hadn't
finished yet. The meeting point with the 89fm organizers was a first-aid
post beside the stage, and we stayed nearby until they showed up and
André went with them. When they appeared, we started to leave, but
Perdido (visit
<a href="http://www.roqueealfredo.com.br/">www.roqueealfredo.com.br</a>, his creation) and I saw that they were short of
people. "What if we tried to get on stage?". Off the two of us went,
plus 4 other people who were with us. Namely: Pereira, Daniel, Tati,
Tati's friend (I don't know her name) and Tiago. And wouldn't you know
it, the 89 guys let us in?</p>
<p>Backstage, a skinny Santa Claus came to greet us. He was in charge of
the costumes. While we signed an image release with MTV (signing away
rights; I hope I didn't sign away my soul too, I didn't even read it),
Wayne Coyne, the band's singer, shows up. I nudged André, and our jaws
hit the floor. We had the typical drooling-fan conversation, and when
André asked what we should do on stage, he wrapped up with: "Act like
you were on drugs!".</p>
<p>A pause to explain the apparent endorsement: it really is hard to
believe that Wayne doesn't take acid, as he claims. He seems to be the
person who believes most in happiness in the world, but without getting
starry-eyed. Example: in an interview published before the show he said
it would be a "pre-end-of-the-world celebration". So, since the world is
going to end, "let it end in grand style". And in one of their best
songs, "Do You Realize??", there's a fantastic verse:</p>
<blockquote>
<p>"Do you realize? That you have the most beautiful face?<br>
Do you realize? We're floating in space?<br>
Do you realize? That everyone you know someday will die...</p>
<p>And instead of saying all your goodbyes, let them know<br>
You realize that life goes fast,<br>
It's hard to make the good things last,<br>
You realize the sun don't go down<br>
It's just an illusion caused by the world spinning round..."</p>
</blockquote>
<p>Sorry, but at least to me that makes all the sense in the world.
Moving on.</p>
<p>Wayne goes into the dressing room, and we go put on our costumes.
Tiago went as a gorilla. Daniel as an elephant. Pereira as a penguin.
Tati as an alligator. Fabi as a chick. Perdido as a zebra. André as a
giraffe. And me as a dalmatian. Just when our happiness couldn't get any
greater, someone shows up saying there's free beer and water. Which cost
4 reais a can and 3 reais a cup at the bars. Can it get any better? It
can.</p>
<p>The Fantomas show ends, and we're called up on stage. Behind the big
screen, waiting for the start, is Steven Drozd, multi-instrumentalist
and musical brain of the band. Dressed as Santa Claus, but with padding,
so this one really did look like Santa. I go over to talk to him, and
when I hold out my hand for him to shake, he has to hold the beer bottle
and the whisky bottle in the same hand. Well, for someone who was
addicted to heroin and managed to kick the habit, that's not so much. He
asked how I knew the band, and I had to explain that unfortunately their
CDs would only be released in December (they're out by now), and that I
had discovered them on the internet. Then a guy from the production crew
shows up and he heads for the stage.</p>
<p>I go back to my side (the left of the stage), and we all get ready.
Wayne comes on stage, gives a speech (with Portuguese translation) and
asks everyone to tell a little white lie the next day, saying that he
came from outer space. That's his cue to get into the Bubble. He gets in
and rolls out over the crowd, and you can see happiness blooming on
people's faces. Some have compared a Lips show to a child's first trip
to the circus. And I agree. Back on stage, "Race for the Prize" begins,
a song about two scientists searching for the cure for a disease, a cure
that will save humanity, and how their sacrifice affects their personal
lives. Wonderful. Right after, a cover of Queen's Bohemian Rhapsody,
which they recorded for a tribute album. The biggest karaoke in the
world, easily. The lyrics appeared on the big screen and even people who
were prejudiced against Queen, like me, sang at the top of their lungs,
louder than Wayne himself (the production's fault, for failing to
equalize the sound of the two stages the same way. Amateurs. Luckily I
was next to the speaker, so I heard everything).</p>
<p>Then, "Fight Test". This one is about how it isn't always possible to
run from life's fights, and that sometimes you have to face them without
being ready. And that's when the heat started to hit. It was 15ºC, but
inside a plush costume it was much more than that. I don't think I've
ever sweated so much in my life. Luckily there was water behind the big
screen. Grab the water, drink, and go back to jumping. The fun part is
that some heads were open and others, like mine, closed (you couldn't
see a thing). So we started swapping heads. There was dalmatian-giraffe,
elephant-penguin, zebra-monkey, a veritable transgenic zoo.</p>
<p>And, finally, my favorite: "The Gash". Its subtitle is
"Battle Hymn of the wounded mathematician". The chorus?</p>
<blockquote>
<p>"Will the fight for our sanity<br>
Be the fight of our lives?<br>
Now that we've lost all the reasons<br>
That we thought that we had"</p>
</blockquote>
<p>I only didn't cry because that's not my thing. But I don't think I've
ever seen my smile bigger. Then, the song that names the latest CD,
"Yoshimi Battles the Pink Robots". A love song? Who knows, but Wayne
tried to get everyone to sing along, with a nun hand puppet on his hand.
And the crowd sang, despite not knowing what was going on. When the song
ended, Wayne talked about a gift he got from a fan in Indiana, a little
children's keyboard with animal sounds. And so the "Cow Jam" happens,
basically Wayne alternately pressing the cow and duck buttons and Steven
shredding on guitar. By the way, Steven took a big swig of whisky at the
end of every song.</p>
<p>Near the end of the show, finally their biggest hit, "She Don't Use
Jelly", from 1993. It's about a girl who wants her hair to be really
orange, so she uses tangerines to dye it. Among other oddities. It's not
my favorite, but it has one hell of a guitar riff, catchy as can
be.</p>
<p>The climax: "Do You Realize??". All the animals hugging on stage and
singing along. At least those on the left side, because on the right
side there were plenty of people who had no idea what they were getting
into when they decided to enter the draw. And to think that so many
people wanted to be there and couldn't. Tsk.</p>
<p>To close the show, Wayne gives a speech criticizing Bush and they
launch into Black Sabbath's "War Pigs". First time in history a dalmatian
has played air guitar, I think. And, to everyone's sadness, Wayne says
thanks and the show ends. The audience asks for more, but the Stooges
are about to start on the other stage. We all leave, return the
costumes, and go watch the next show, all still strangely quiet. When it
finally sank in what we had just done, it was impossible not to notice
the satisfied smiles on everyone's faces. What a day.</p>
<p>And a lot of good stuff still happened. The Stooges are the most rock
'n' roll thing in the world. Iggy Pop, at the ripe age of 58, runs across
the whole stage, simulates sex with the amplifier, never stops jumping
and even causes chaos when he asks everyone to get on stage. He puts any
kid to shame, let alone the guys full of "attitude" who show up around
here. *COUGH*Chorão*COUGH*. They played all the classics, focusing the
set on the first two albums. In other words, there was "Raw Power",
there was "I wanna be your dog", there was "TV eye", there was "No fun".
And more. And he even cursed MTV, which was filming the show and kept
turning the lights on. "Turn the
fucking lights on! I don't care about MTV! FUCK MTV!". Brilliant.</p>
<p>Then, Sonic Youth. Whom I didn't know well, but who would have been
really impressive if the sound hadn't been so low. You could barely hear
it. Screw the production. And, finally, Nine Inch Nails. With the
loudest sound of the whole festival. Closing the night. 25 tons of
special effects on stage. And, quite simply, the heaviest thing I've
ever heard. Any heavy metal band is put to shame. And the songs are
something quite close to electronic music, but with distorted guitars.
It was almost impossible to stand still, the urge to jump around was
great, but exhaustion won. I was already wrecked. But, without a doubt,
a show worthy of closing the festival, and mind you the legendary
Stooges had played before them. Awesome.</p>
<p>And it ended. I got to my uncle's house at 6 in the morning, tired,
hungry, covered in mud (because the show was on a lawn, and it had
rained in the afternoon). But happy. Certainly one of the best days of
my life. And long live rock!</p>
<p>And that's enough. I'm tired of writing. If you want to know more,
André published his memories at these links: </p>
<p><a href="http://palacehotel.blogspot.com/2005/11/oh-my-gawd-its-flaming-lips-claro-que.html">http://palacehotel.blogspot.com/2005/11/oh-my-gawd-its-flaming-lips-claro-que.html</a> </p>
<p><a href="http://palacehotel.blogspot.com/2005/11/this-here-giraffe-laughed-claro-que_29.html">http://palacehotel.blogspot.com/2005/11/this-here-giraffe-laughed-claro-que_29.html</a></p>
<p>There are even photos. I ended up getting lucky: I appear in one, next to
the bassist, right when the spirit of Saturday Night Fever-era John Travolta
had taken over.</p>
<p>x-x-x-x-x-x-x-x-x-x-x-x-x-x-x</p>The opium of the people2006-11-21T22:55:00-02:002006-11-21T22:55:00-02:00luizirbertag:blog.luizirber.org,2006-11-21:/2006/11/21/o-opio-do-povo/<blockquote>
<p>As for the churches closed by Stalin and those closed in China
today: such suppression of religion was supposedly justified by
Karl Marx's statement that "religion is the opium of the people". Marx
said that in 1844, when opium and opium derivatives were the only
effective painkillers a person could take. Marx himself had
taken them. And he was grateful for the temporary relief they
gave him. He was simply observing, and certainly not condemning, the
fact that religion could also be comforting to people in
economic or social distress. It was a casual truism, not a
dictum.</p>
</blockquote>
<p>Kurt Vonnegut, "A Man Without a Country" ("Um Homem Sem Pátria"), 2006.</p>
<p>Genius. Besides having written Slaughterhouse-Five and Hocus Pocus, he
still gifts us with this fine book which, despite containing only somewhat
random remarks, is better than almost everything published in the world
today.</p>
<p>Hail Kurt Vonnegut!</p>Planeta Comp@UFSCAR2006-11-12T00:29:00-02:002006-11-12T00:29:00-02:00luizirbertag:blog.luizirber.org,2006-11-12:/2006/11/12/planeta-compufscar/<p>Way to go, Pacu! Now, if you want to know everything that goes on
behind the scenes of UFSCar computing life, you have a place to find it!</p>
<p><a href="http://www.comp.ufscar.br/~fcpn/planet/">Planeta Comp@UFSCAR</a>!</p>Orkut + GTalk?2006-11-07T23:25:00-02:002006-11-07T23:25:00-02:00luizirbertag:blog.luizirber.org,2006-11-07:/2006/11/07/orkut-gtalk/<p>I was browsing Orkut today and noticed, thanks to Flashblock, that
some Flash was being loaded. Wait, Flash on Orkut? That's new.</p>
<p>So I took a look at the page source and found this:<br>
<a href="http://www.orkut.com/GoogleTalkSignup.aspx">http://www.orkut.com/GoogleTalkSignup.aspx</a></p>
<p>For those who don't want to visit it, here's the screenshot:</p>
<p><img class="center" src="/images/screenshot-orkut-google-talk-firefox.png"></p>
<p>Now things are starting to get interesting...</p>Something doesn't smell right…2006-11-03T15:05:00-03:002006-11-03T15:05:00-03:00luizirbertag:blog.luizirber.org,2006-11-03:/2006/11/03/algo-nao-cheira-bem/<blockquote>
<p>“Novell is actually just a proxy for its customers, and it's only for
its customers. This does not apply to any forms of Linux other than
Novell's SUSE Linux. And if people want to have patent peace and
interoperability, they'll look at Novell's SUSE Linux. If they make
other choices, they have all of the compliance and intellectual
property issues that are associated with that,” commented Steve
Ballmer.</p>
</blockquote>
<p><a href="http://www.businessreviewonline.com/os/archives/2006/11/novell_and_micr.html">Source</a></p>
<p>Could it be that Microsoft embraced Novell (a bear hug, mind you) because
of Oracle's announcement last week?</p>Ubuntu Feisty Fawn2006-11-03T13:04:00-03:002006-11-03T13:04:00-03:00luizirbertag:blog.luizirber.org,2006-11-03:/2006/11/03/ubuntu-feisty-fawn/<p>Do you, like me, get anxious when a Debian-based distribution is
stabilized and stops updating most of its packages? Don't mind
possibly rendering your machine unusable?</p>
<p>Since yesterday the repositories for Feisty, the next version of Ubuntu,
have been up. Just replace "edgy" with "feisty" in sources.list.</p>
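<p>A minimal sketch of that substitution, working on the text of a sources.list rather than on the real file. The function name <code>switch_release</code> and the sample repository lines are illustrative assumptions, not taken from an actual system:</p>

```python
import re


def switch_release(sources_text: str, old: str = "edgy", new: str = "feisty") -> str:
    """Replace a release codename in the text of a sources.list file.

    Hypothetical helper: it only transforms a string and never touches
    /etc/apt/sources.list itself.
    """
    # Word boundaries let "edgy-updates" and "edgy-security" become
    # "feisty-updates" and "feisty-security" as well.
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    return pattern.sub(new, sources_text)


# Illustrative sample lines (assumed, not copied from a real machine).
sample = """\
deb http://archive.ubuntu.com/ubuntu edgy main restricted
deb http://archive.ubuntu.com/ubuntu edgy-updates main restricted
deb-src http://security.ubuntu.com/ubuntu edgy-security universe
"""

updated = switch_release(sample)
print(updated)
```

<p>Keeping a backup copy of the original file before writing the result back would be a sensible precaution, given the warning above about possibly breaking the machine.</p>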
<p>The best justification I found is: since many people complain that
upgrading between stable releases breaks some configurations, why not
upgrade now, while the Feisty repositories are practically
identical to Edgy's, and keep upgrading little by little? =D</p>Stevey's Blog Rants2006-11-02T21:49:00-03:002006-11-02T21:49:00-03:00luizirbertag:blog.luizirber.org,2006-11-02:/2006/11/02/stevey-blog-rants/<p>Nothing to do over the long holiday weekend?</p>
<p><a href="http://steve-yegge.blogspot.com/">Stevey's Blog Rants</a>.</p>
<p>Several interesting articles (especially these on <a href="http://steve-yegge.blogspot.com/2006/03/execution-in-kingdom-of-nouns.html">Java</a>,
<a href="http://steve-yegge.blogspot.com/2006/09/good-agile-bad-agile_27.html">agile methodologies</a> and the <a href="http://steve-yegge.blogspot.com/2006/10/egomania-itself.html">response</a> to that last post).</p>BugBrother2006-10-31T22:44:00-03:002006-10-31T22:44:00-03:00luizirbertag:blog.luizirber.org,2006-10-31:/2006/10/31/bugbrother/<p><a href="http://sourceforge.net/projects/bugbrother">BugBrother</a> is the prototype (or rather, the development version) of the
program I am developing at <a href="http://www.cnpdia.embrapa.br/">Embrapa</a>, <a href="http://repositorio.agrolivre.gov.br/projects/sacam">SACAM</a>, which
stands for Sistema de Análise Comportamental de Animais em Movimento
(Behavioral Analysis System for Moving Animals). Its
function, now that the acronym is known, is fairly obvious: it
captures images from a video input and applies motion detection
algorithms in order to store a track of the movement
made by the animal. From this track, and after defining areas on a
reference image, it can report statistics useful for
studies, usually carried out by entomologists. Their basic experiment
is performed with a Y-shaped olfactometer,
where the insect is released at the foot of the Y and pheromones or other
substances are released at the other two ends. The statistics generated
are useful in the study of new chemical traps for insects, which are much
less harmful than pesticides.</p>
<p>The basic operation of the program
goes through these steps: </p>
<ul>
<li>Capture the image from a video input (currently any
Video4Linux-compatible device) </li>
<li>Apply a motion detection algorithm and store the
(X,Y) coordinates generated by the detected movement. </li>
<li>From these coordinates, compute parameters such as tortuosity,
angular deviation and speed of the insect being analyzed. </li>
<li>Generate reports. </li>
</ul>
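<p>As a rough illustration of the second step, here is a sketch of motion detection by simple frame differencing. This is an assumption made for the example: the post does not say which detection algorithm BugBrother uses, and the names <code>detect_motion</code>, <code>blank</code> and <code>threshold</code> are hypothetical. Frames are plain lists of grayscale rows so the sketch needs no video library.</p>

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # grayscale image: rows of 0-255 intensities


def detect_motion(prev: Frame, curr: Frame, threshold: int = 30) -> Optional[Tuple[float, float]]:
    """Return the (x, y) centroid of the pixels that changed between two frames.

    A pixel counts as "changed" when its intensity differs by more than
    `threshold`. Returns None when nothing moved.
    """
    sum_x = 0
    sum_y = 0
    count = 0
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_prev, row_curr)):
            if abs(c - p) > threshold:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None
    return (sum_x / count, sum_y / count)


def blank(width: int, height: int) -> Frame:
    """Build an all-black frame."""
    return [[0] * width for _ in range(height)]


# Synthetic example: a one-pixel bright "insect" moves from (1, 1) to (3, 2).
frame_a = blank(5, 4)
frame_a[1][1] = 255
frame_b = blank(5, 4)
frame_b[2][3] = 255

# Both the vacated and the newly occupied pixel changed, so the centroid
# falls midway between the old and new positions.
position = detect_motion(frame_a, frame_b)
print(position)
```

<p>Differencing against a fixed background frame instead of the previous frame would report only the insect's current position, at the cost of being more sensitive to lighting changes.</p>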
<p>Currently the first two items are functional, the third is
close to being finished, and work on the last one will start
next week.</p>
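<p>The third step, turning stored coordinates into trajectory parameters, might be sketched as follows. The definitions are assumptions chosen for illustration, since the post does not give SACAM's formulas: tortuosity is taken as one minus the ratio of net displacement to path length, angular deviation as the mean absolute turning angle between consecutive steps, and a constant frame rate <code>fps</code> supplies the time base.</p>

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def track_metrics(track: List[Point], fps: float = 30.0) -> Dict[str, float]:
    """Compute illustrative trajectory statistics from (x, y) samples.

    Assumes one sample per video frame, captured at `fps` frames per second.
    Distances are in pixels, speed in pixels per second, angles in degrees.
    """
    if len(track) < 2:
        raise ValueError("a track needs at least two points")

    steps = [(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(track, track[1:])]
    path_length = sum(math.hypot(dx, dy) for dx, dy in steps)
    duration = (len(track) - 1) / fps
    net = math.hypot(track[-1][0] - track[0][0], track[-1][1] - track[0][1])

    # 0 for a perfectly straight path, approaching 1 for a very winding one.
    tortuosity = 1.0 - net / path_length if path_length > 0 else 0.0

    turns = []
    for (dx1, dy1), (dx2, dy2) in zip(steps, steps[1:]):
        if (dx1 == 0 and dy1 == 0) or (dx2 == 0 and dy2 == 0):
            continue  # heading is undefined while the insect stands still
        angle = math.atan2(dy2, dx2) - math.atan2(dy1, dx1)
        # Wrap so that a turn is never reported as more than 180 degrees.
        angle = (angle + math.pi) % (2 * math.pi) - math.pi
        turns.append(abs(angle))
    mean_turn = math.degrees(sum(turns) / len(turns)) if turns else 0.0

    return {
        "path_length": path_length,
        "mean_speed": path_length / duration,
        "tortuosity": tortuosity,
        "mean_turn_degrees": mean_turn,
    }


# An L-shaped walk sampled once per second: one unit right, then one unit up.
metrics = track_metrics([(0, 0), (1, 0), (1, 1)], fps=1.0)
print(metrics)
```

<p>Other definitions of tortuosity exist (for example path length divided by net displacement), so the numbers are only comparable between tracks measured the same way.</p>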
<p>A point that may have caught your interest: I call it the prototype because
at the moment it is being developed using <a href="http://www.python.org">Python</a> + <a href="http://www.pygtk.org">PyGTK</a> +
<a href="http://www.gstreamer.org">GStreamer</a>. Due to performance problems (mainly in
the detection algorithm) it will be reimplemented in C
+ <a href="http://www.gtk.org">GTK+</a> + <a href="http://www.gstreamer.org">GStreamer</a> once the prototype is finished and evaluated.
Which is a pity, because in terms of code clarity and
reuse it will be a step backwards. But life isn't always what
we want =D</p>GStreamer2006-10-25T02:11:00-03:002006-10-25T02:11:00-03:00luizirbertag:blog.luizirber.org,2006-10-25:/2006/10/25/gstreamer/<p>Last year I discovered one of the best books I have ever
read, <a href="http://fcpn.multiply.com/reviews/item/1">The Hitchhiker's Guide to the Galaxy</a>, by Douglas Adams. Shortly
after finishing it (twice) I searched for Douglas
Adams on Google. One of the links returned was a question-and-answer
session on <a href="http://slashdot.org/interviews/00/06/21/1217242.shtml">Ask Slashdot</a>, in which he talks about <a href="http://www.cycling74.com/twiki/bin/view/FAQs/MaxMSPHistory">MAX</a>, a
high-level object-oriented music programming language,
when asked about the obsession of the character Richard MacDuff, from the
book Dirk Gently's Holistic Detective Agency, with mapping natural
processes to music. I started looking into MAX and ended up finding that
its creator, <a href="http://www.crca.ucsd.edu/%7Emsp/">Miller Puckette</a>, also created PureData, which does the
same thing and is open source. </p>
<p>After some time I understood how it worked and made a few
simple little things, but I didn't use it for long. Then I
found out about GStreamer. Despite some differences,
I was amazed at how simple it was to do some things that were
quite hard in PureData. </p>
<p>In fact, I don't know how things work inside PureData, but I
really liked how GStreamer was designed: there are Elements, which
have a specific function, such as reading data from a file, decoding
data or sending data to a sound card; there are Bins, a container
for a collection of elements, and Pipelines, a special kind of Bin that
allows the contained elements to run; and there are Pads, which are
used to negotiate links and data flow between elements. And that's
all. With these simple parts, put together, very complex things
can be built, such as <a href="http://www.flumotion.net/">Flumotion</a>, a streaming server, and
<a href="http://pitivi.sourceforge.net/">PITIVI</a>, a non-linear video editor. And, of course, the great
<a href="http://sourceforge.net/projects/bugbrother/">BugBrother</a>, the prototype of the program I am building at
Embrapa. More about that one in another post.</p>
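<p>To make the idea of small parts linked into a pipeline concrete, here is a toy model in plain Python. It is deliberately not the GStreamer API: the classes <code>Element</code> and <code>Pipeline</code> below are hypothetical stand-ins, and pad negotiation is reduced to feeding one generator's output into the next.</p>

```python
from typing import Callable, Iterable, Iterator, List


class Element:
    """A processing stage with a single input and a single output."""

    def __init__(self, name: str, transform: Callable[[Iterable], Iterator]) -> None:
        self.name = name
        self.transform = transform


class Pipeline:
    """A container of elements that can also run them, in the order added."""

    def __init__(self) -> None:
        self.elements: List[Element] = []

    def add(self, element: Element) -> "Pipeline":
        self.elements.append(element)
        return self  # allow chained add() calls

    def run(self, source: Iterable) -> list:
        stream = source
        for element in self.elements:
            # "Linking": each element consumes what the previous one produces.
            stream = element.transform(stream)
        return list(stream)


def decode(chunks: Iterable[bytes]) -> Iterator[str]:
    """Toy decoder: turn raw bytes into text."""
    for chunk in chunks:
        yield chunk.decode("ascii")


def shout(texts: Iterable[str]) -> Iterator[str]:
    """Toy effect: upper-case every piece of text."""
    for text in texts:
        yield text.upper()


pipeline = Pipeline().add(Element("decoder", decode)).add(Element("shout", shout))
result = pipeline.run([b"hello", b"gstreamer"])
print(result)
```

<p>The real framework adds what this sketch leaves out, such as format negotiation between pads, state changes and threading, but the composition idea is the same: each element does one job and the pipeline wires them together.</p>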