Gabbleblotchits

Vogon Poetry, Computers and (some) biology

When life gives you lemons

Two weeks ago, during the CS graduate orientation at UC Davis, the IT person was showing that we have access to 50 GB on Box and UNLIMITED storage on Google Drive (I'm currently testing the limits of UNLIMITED, stay tuned). I promptly forgot about Box because I've been using Syncthing for file syncing and it's been working quite well, or at least way better than owncloud (which took infinite amounts of time to sync my stuff).

Anyway, today I was checking PaperShip (wonderful app, by the way) and noticed my Zotero free account was almost full (it is only 300 MB). I knew Zotero had WebDAV support, but I didn't want to maintain a WebDAV server. It turns out Box has WebDAV support, and I even found instructions on how to set everything up.

HOWTO: Moleculo raw reads coverage in reference genome

Since last month I've been a PhD student at MSU in the Titus lab, and my research is focused on building infrastructure for exploring and merging multiple read types and multiple assemblies.

Titus and all the labbies are awesome mentors, and I'm making some progress while learning how to deal with this brave new world.

One thing I'm doing is checking how good a Gallus gallus Moleculo sequencing dataset we have is; it is being used for the chicken genome sequence improvement project. The first questions were: how many of the Moleculo raw reads align to the reference genome, and how much of the genome do they cover?

To answer these questions we are using bwa-mem to do the alignments, samtools to work with the alignment data, and a mix of Bash and Python scripts to glue everything together and do the analysis.

First, let's download the reference genome.

In []:
!wget -c ftp://hgdownload.cse.ucsc.edu/goldenPath/galGal4/bigZips/galGal4.fa.masked.gz

With the reference genome available, we need to prepare it for the BWA algorithms by constructing its FM-index. The command to do this is

In []:
!bwa index galGal4.fa.masked.gz

SAMtools also requires preprocessing of the original FASTA file:

In []:
!samtools faidx galGal4.fa.masked.gz

I had 10 files with Moleculo reads, with lengths varying from 500 bp to 16 kbp. In this example let's assume all the reads are in one file, reads.fastq; in the original analysis I ran the next commands inside a Bash for-loop, once per file.

Let's align reference and reads using the BWA-MEM algorithm:

In []:
!bwa mem galGal4.fa.masked.gz reads.fastq > reads.fastq.sam

Next we convert the reads.fastq.sam file into the BAM format (a compressed binary version of SAM). We also sort it by leftmost coordinate and index the file for faster access:

In []:
!samtools import galGal4.fa.masked.gz.fai reads.fastq.sam reads.fastq.bam
!samtools sort reads.fastq.bam reads.fastq.sorted
!samtools index reads.fastq.sorted.bam

We can query our alignments using the view command from samtools. How many reads didn't align to the reference? (-f 4 selects records with the "read unmapped" flag set.)

In []:
!samtools view -c -f 4 reads.fastq.sorted.bam

In my case I got 7,985 reads that didn't align to the reference, from a total of 1,579,060 reads.

In []:
!samtools view -c reads.fastq.sorted.bam

In total there were 4,411,380 alignment records (a read can align to more than one location, which is why this is larger than the number of reads).
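Just as a sanity check, the unmapped fraction follows directly from these two counts (a quick back-of-the-envelope calculation, not a cell from the original notebook):

unmapped = 7985
total_reads = 1579060
print(unmapped / float(total_reads))  # ~0.005, so about 99.5% of the reads aligned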

But how well do these alignments cover the reference genome? To do this calculation I refactored calc-blast-cover, resulting in bam-coverage. The idea is to create, for each sequence in the reference, an array of zeros with the same length as that sequence. We then go over the alignments and set an array position to 1 whenever an alignment covers that position. After doing this we can calculate the coverage by summing the array and dividing by the sequence length.
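In Python, the core of the idea looks roughly like this. It is just a minimal sketch, assuming PySAM and the sorted, indexed BAM file produced above; it is not the actual bam_coverage implementation, and coverage_fraction and min_mapq are illustrative names:

import pysam

def coverage_fraction(bam_path, min_mapq=30):
    bam = pysam.AlignmentFile(bam_path, "rb")
    # one zero-initialized list per reference sequence, as long as that sequence
    covered = {name: [0] * length
               for name, length in zip(bam.references, bam.lengths)}
    for aln in bam.fetch():  # needs the .bai index created earlier
        if aln.is_unmapped or aln.mapping_quality < min_mapq:
            continue
        refcov = covered[aln.reference_name]
        for pos in range(aln.reference_start, aln.reference_end):
            refcov[pos] = 1  # mark each reference base touched by an alignment
    total = sum(bam.lengths)
    hit = sum(sum(c) for c in covered.values())
    return hit / float(total)

print(coverage_fraction("reads.fastq.sorted.bam"))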

To make it easier to use I've started a new project (oh no! Another bioinformatics scripts collection!) with my modifications of the original script. This project is called bioinfo and it is available on PyPI. Basic dependencies are just docopt (which is awesome, BTW) and tqdm (same). Additional dependencies are needed depending on which command you intend to run; for example, bam_coverage needs PySAM and screed. At first this seems counter-intuitive, because the user needs to explicitly install new packages, but it avoids another problem: installing every package in the world just to run a subset of the program. I intend to give an informative message when the user tries to run a command and its dependencies are missing.
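One way to do that (a sketch of the idea, not the current bioinfo code; the function name and message text are made up) is to import the optional dependencies only inside the command and fail with a helpful hint:

def run_bam_coverage(args):
    # optional dependencies are imported only when this command actually runs
    try:
        import pysam
        import screed
    except ImportError as exc:
        raise SystemExit(
            "bam_coverage needs the optional dependencies PySAM and screed.\n"
            "Install them with 'pip install pysam screed' and try again.\n"
            "(import error: %s)" % exc)
    # ... the actual coverage calculation would go here ...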

If you've never used Python before, a good way to have a working environment is to use Anaconda.

Since bioinfo is available on PyPI, you can install it with

In []:
!pip install bioinfo

To see available commands and options in bioinfo you can run

In []:
!bioinfo -h

Running the bam_coverage command over the alignment generated by BWA-MEM:

In []:
!bioinfo bam_coverage galGal4.fa.masked reads.fastq.sorted.bam 200 reads.fastq --mapq=30

The same command is available as a function in bioinfo and can be run inside a Python script:

In []:
from bioinfo import bam_coverage

## same call, using the module.
bam_coverage("galGal4.fa.masked", "reads.fastq.sorted.bam", 200, "reads.fastq", 30)

If you don't want to install bioinfo (why not?!??!), you can just download bam-coverage:

In []:
!wget -c https://raw.githubusercontent.com/luizirber/bioinfo/master/bioinfo/bam_coverage.py

And pass the options to the script:

In []:
!python bam_coverage.py galGal4.fa.masked reads.fastq.sorted.bam 200 reads.fastq 45

The result I got for my analysis was 82.3% coverage of the reference genome by the alignments generated with BWA-MEM.

reading query
1579060 [elapsed: 01:16, 20721.82 iters/sec]
creating empty lists
15932 [elapsed: 01:08, 232.67 iters/sec]
building coverage
4411380 [elapsed: 34:36, 2123.96 iters/sec]
Summing stats
|##########| 15932/15932 100% [elapsed: 00:07 left: 00:00, 2008.90 iters/sec]

total bases in reference: 1046932099
total ref bases covered : 861340070
fraction                : 0.822727730693
reference               : galGal4.fa.masked
alignment file          : ../moleculo/LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam
query sequences         : ../moleculo/LR6000017-DNA_A01-LRAAA-AllReads.fastq

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

Of ThinkPads and MacBooks

From 2009 on I was a Mac user. I was working with iOS development, and it made sense to have a MacBook for the SDK. I was curious too, because I had been using Linux distros (Debian, then Ubuntu, then Gentoo when Ubuntu was getting too heavy for my old laptop) for some time and was a bit tired of making everything work. Losing control was uncomfortable at first, but having so many things work out of the box (like sleep and hibernation!) was worth it. And Mac apps were much more polished (oh, GarageBand).

When I arrived at INPE I got a Linux workstation, the mighty Papera (all the computers there have Tupi names, Tupi being a language spoken by indigenous peoples here in Brazil). And I tried some new things, like using Awesome as a window manager, and loved it. But it only lasted a few months, because the machines were swapped for some iMacs and Papera was assigned to another person. I missed a tiling window manager, but I also found Homebrew, and it helped a lot in setting up a dev environment on OS X (I know MacPorts and Fink existed, but writing a Homebrew formula is pretty easy, and I even contributed one back), so there were no big problems in the transition.

But after some time I was getting uneasy. New OS X versions seemed to remove features instead of adding them (sigh, matrix-organized Spaces...). The lack of upgradability in the new laptops (despite the MacBook Air being an awesome computer) was pushing me away too, because a maxed-out one would cost way more than I was willing to pay. And I was spending most of my time in SSH sessions to other computers or using web apps, so why not go back to Linux?

At the end of 2012 I bought a used ThinkPad X220, with the dock and everything. When I was younger I always liked the look, with its black and red styling, and the durability (MacBooks are pretty, but they are easy to scratch and bend). The X220 was cheap and in perfect condition, and with a small upgrade when I went to PyCon (ahem, 16 GB of RAM and a 128 GB SSD) it is a BEAST now. And there are all these benefits too:

  • I have Awesome again!

  • Updated packages thanks to pacman (I installed Arch Linux and I'm loving it)

  • When I need a new package, writing a PKGBUILD file is as easy as writing a Homebrew formula was. I wrote some Debian packages in the past and they worked, but there were so many rules and moving parts that I don't think I want to write one again. I recognize that a lot of those rules make sense for a project as big as Debian (and Ubuntu and everyone else), but it could be simpler.

  • Sleep works! Hibernation works! Except when it doesn't because your EFI is half full after the kernel wrote some stacktraces and the chip refuses to wake up.

It isn't for the faint of heart, but I'm happy to be back =]

A Tale of Two Stadiums

Last weekend I went with some friends to Maracanã to watch Italy vs Mexico in the first round of the Confederations Cup. We got some cheap tickets (R$ 57, about US$ 25) in an area meant for Brazilians only (that's why they were so cheap; tickets usually cost at least double that). And it was so good to go to a stadium again; it's a very different experience from watching a game on TV, where the camera gives you a limited perspective. Our seats were behind one of the goals, and we were lucky: we saw two goals on our side, one from Mexico (a penalty kick) and one from Italy, a beautiful goal by Balotelli, who was a bit of a diva, by the way, always complaining and making drama.

It was the third time I had been to a stadium. The previous ones were Grêmio matches, one versus River Plate in 2002 and another versus Figueirense in 2008. Those games were at Grêmio's previous stadium, Olímpico Monumental, which gave way to a new one, the Arena, built at the entrance of the city (and far from the old one). This is happening a lot around here, given that next year we have the World Cup and at least ten new stadiums are being built or renovated for the competition. At first they were supposed to be funded privately, but as time passed and they all ran late, public funding came into the picture; in Maracanã alone more than one billion Brazilian reais were spent.

They are now much closer to the stadiums of developed countries, like the ones we used to see on TV. By Brazilian standards they aren't even stadiums anymore; they look more like theaters, where you just sit and watch the game. That shouldn't be a bad thing, but I couldn't avoid comparing this weekend's game with my previous experiences. OK, this time the crowd didn't have a preferred side and so it wasn't cheering as much as I had seen before, but it was worrisome how obvious it was that almost everyone at the match was someone who could pay for the expensive tickets, and sometimes didn't even have a strong connection with football, the thing that always gave a match that aura of catharsis.

Last week I tried to go with my family to see the new Grêmio stadium, and we weren't so lucky. The cheapest tickets cost almost R$ 100 each, and for the four of us that meant R$ 400, not out of our league but way too expensive. These ticket prices mean the usual fan can't go to the stadium anymore and is replaced by a new kind of fan, a consumer above everything else. Maybe it's just a romantic vision, and of course there were plenty of problems before, but the price paid for comfort might be too high in the long run.

I finally met Tupã

So, after two and a half years, I finally met Tupã.

Curiously, it was only after I left INPE. Arnaldo came to Cachoeira to visit us and I planned to show him the center, but I wasn't expecting to do the whole tour. When I began working there they only showed us the public part, because Tupã was off-limits at the time, and I thought it still was. Well, so did everybody else who joined the group after me... We went on a small tour to see Tupã, the generators, the UPS units, and all the electrical infrastructure needed to support the building.

And it was AMAZING! I've always liked software more than hardware, but it is awesome to see things you only knew from an SSH session. I could see the tape library, the disk arrays, the blinkenlights... It is, in some ways, the physical manifestation of your code, or at least it gives you a better notion of how many things are in motion when you execute something on the computer.

All in all, good times. It wasn't exactly what I expected at first, but I made some great friends there, I could bike to work every day, and I learned (by trial and error) that you need to be more careful about whom you trust. So it goes.