20 January, 2008

Bringing back the dead...

Richard posted the next release intentions here:

ensembl-dev archive

Lots of good stuff - orangutan and horse being released, the usual tweaks such as removing contamination (viral genes) from the gene sets, and other little details.

But one thing is quite a change. It is from Javier's Compara team, and it is simply stated as

"Generate the 7-way alignments using the new enredo-pecan-ortheus pipline"

To unpack this statement: it is a big change in how we're thinking about comparative genomics alignments. Enredo is a method to produce a set of co-linear regions, sometimes called a "synteny map", though that is a dreadful term. The key thing is that it handles duplications in the genome, allowing (say) two regions of human to be co-linear with one region of mouse. This is hard to do genome-wide in a scalable manner.

Pecan is the multiple aligner written by the brilliant Ben Paten (he used to be my student, and wrote Pecan whilst at the EBI; he is now at UCSC with Jim and David and co). Pecan is the best aligner: by both simulation testing and testing via ancient repeat alignability criteria, it has the highest sensitivity of alignment at the same specificity as the next best aligner.

Finally Ortheus, also from Ben, (potentially) realigns the sequences whilst simultaneously sampling correctly from a probabilistic model of sequence evolution, critically including insertions and deletions, and thus, as a side effect, produces likely ancestral sequences. This too has been stringently tested, using a hold-one-out criterion: basically, can we "predict" the marmoset sequence using only the other extant species? (Answer: not completely correctly, but better than any other method, e.g. taking the nearest sequence.)
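For anyone who wants to play with the hold-one-out idea, here is a minimal sketch of the scoring step only. It assumes you already have the held-out sequence, a reconstruction, and the nearest extant sequence aligned to the same length with '-' for gaps. The function name `column_identity` and the toy sequences are made up for illustration; this is not how Ortheus was actually benchmarked.

```python
def column_identity(predicted: str, actual: str) -> float:
    """Fraction of aligned columns where the two sequences agree.

    Both strings must already be aligned to the same length, with '-'
    marking gaps. Columns that are gaps in both sequences are skipped.
    """
    if len(predicted) != len(actual):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for p, a in zip(predicted, actual):
        if p == "-" and a == "-":
            continue  # nothing to compare in this column
        compared += 1
        if p == a:
            matches += 1
    return matches / compared if compared else 0.0


# Toy, made-up sequences (not real marmoset data).
held_out = "ACGTAC-GTA"       # the sequence we pretend not to know
reconstructed = "ACGTAC-GTT"  # hypothetical prediction from the other species
nearest = "ACGAACTGTT"        # hypothetical nearest extant sequence (baseline)

score_reconstructed = column_identity(reconstructed, held_out)
score_nearest = column_identity(nearest, held_out)
```

The comparison that matters is whether the reconstruction scores higher than the "just take the nearest sequence" baseline, not whether either score is perfect.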

So - what does this all mean? Basically there are two key things:

  1. Handling lineage-specific duplications. This is a headache, and we have a good solution, which aligns the paralogous and orthologous regions simultaneously (the paralogy is limited to relatively recent paralogy, i.e. within mammals).
  2. We can reliably predict ancestral sequences.

One headache is that some of the things we display, in particular the GERP continuous conservation score, need to be adapted to work on regions with paralogy. There is a fascinating piece of theory to work through here - what is the concept of the "neutral tree" when there has been a lineage-specific duplication? How should one treat paralogs? Currently this is ignored by virtue of the fact that the alignments don't allow it. Now the alignments do allow it, and we need to do something sensible, as well as stimulate evolutionary theory people to look at the data and work out new methods.
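To make the paralogy problem concrete, here is a small sketch of a co-linear block in which one species contributes more than one region. The `Region` class, the `duplicated_species` helper and the coordinates are all hypothetical, invented for illustration; they are not the Compara schema or API.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Region:
    """One genomic segment taking part in a co-linear block (hypothetical)."""
    species: str
    chrom: str
    start: int
    end: int
    strand: int


def duplicated_species(block: List[Region]) -> Dict[str, int]:
    """Return the species that appear more than once in a block.

    Any species listed here has paralogous copies in the block, so the
    block no longer maps one-to-one onto the leaves of a species tree.
    """
    counts = Counter(region.species for region in block)
    return {species: n for species, n in counts.items() if n > 1}


# Made-up coordinates: two human regions co-linear with one mouse region.
block = [
    Region("human", "1", 1_000_000, 1_050_000, 1),
    Region("human", "1", 2_400_000, 2_450_000, -1),
    Region("mouse", "4", 7_300_000, 7_348_000, 1),
]

paralogous = duplicated_species(block)
```

A score that assumes one sequence per species has nothing sensible to do with the second human copy here, which is exactly the question about the "neutral tree" raised above.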

The next headache: what do we do with the ancestral sequences? Dump them? Display them? Gene predict on them? If so, how?

The end result is that release 49, even the comparative genomics part, won't look very different, but it will have these new alignments, and over 2008 we will be working out how to present, analyse and leverage them more - so if you are interested, please do take them for a spin!

(Release 49 is due to be out sometime in mid-Feb)

