The Ensembl Weblog: January 2008

22 January, 2008

Ensembl and the 1000 Genomes

Today the 1000 Genomes project was announced. By any measure this is a big deal.

The goal is simple: to create the most comprehensive and medically useful collection of human variation ever assembled by producing approximately 6 terabases of sequence. To put this amount of data in perspective, 6 terabases is more than 60 times the amount of data currently available in the DDBJ/GenBank/EMBL archive, and that took more than 25 years to collect. At the peak production of the 1000 Genomes project, more than 8 billion basepairs per day will be sequenced. That is the data output of the entire Human Genome Project every week. All made publicly available.
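To make the scale concrete, here is a small back-of-the-envelope sketch that uses only the figures quoted above. The variable names are mine, and the "days at peak" number assumes the peak rate is held for the whole project, which it would not be; treat it as a rough lower bound rather than a schedule.

```python
# Back-of-the-envelope check using only the figures quoted in the post.
TOTAL_BASES = 6e12          # "approximately 6 terabases"
ARCHIVE_RATIO = 60          # "more than 60 times" the current archive
PEAK_BASES_PER_DAY = 8e9    # "more than 8 billion basepairs per day"

# If 6 terabases is more than 60x the archive, the archive holds
# less than this many bases.
archive_upper_bound = TOTAL_BASES / ARCHIVE_RATIO

# Sequence produced by one week at the peak rate.
bases_per_week = PEAK_BASES_PER_DAY * 7

# Days needed if the peak rate were sustained throughout (an assumption).
days_at_peak = TOTAL_BASES / PEAK_BASES_PER_DAY

print(f"archive size: under {archive_upper_bound / 1e9:.0f} gigabases")
print(f"one week at peak: {bases_per_week / 1e9:.0f} gigabases")
print(f"6 terabases at peak rate: {days_at_peak:.0f} days")
```

In other words, a single week at peak would produce more than half of what the archive has accumulated in 25 years.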

The data generation rate and the short read length mean that the bioinformatics requirements for the project are equally ambitious (or terrifying, depending on your point of view). The EBI and NCBI, working together, are creating a joint DCC (data coordination centre) to collect, organise and provide the data to the world. Steve Sherry at the NCBI and I are eager to take this on.

At Ensembl we've been expecting this development and built support for resequencing data into our variation database a couple of years ago. So far, we have data for about 6 humans, 5 mouse strains, and a smattering of rat data. Small stuff compared to six months from now, but large enough that we have both experience and confidence in dealing with large-scale resequencing data. We are probably going to need both.

Check out more at http://www.1000genomes.org

21 January, 2008

Upcoming Workshops - February

Ensembl Workshops in January took us to the West Coast (USA) and the Netherlands. Workshops in February:

Browser workshop (Institute for Animal Health) Compton, UK 12 Feb

Browser workshop (EURATools, University of Edinburgh) Edinburgh, UK 12 Feb

Browser workshop (Cambridge University, Dept of Genetics) Cambridge, UK 28-29 Feb

Picture: Nijmegen, the Netherlands. The NBIC was the site of an Ensembl Browser workshop on 16 Jan, 2008

Interested in hosting a workshop? Contact us!

20 January, 2008

Bringing back the dead...

Richard posted the next release intentions here:

ensembl-dev archive

Lots of good stuff: orangutan and horse being released, the usual tweaks such as removing contamination (viral genes) from the gene sets, and other little details.

But one thing is quite a change. It is from Javier's Compara team, and it is simply stated as

"Generate the 7-way alignments using the new enredo-pecan-ortheus pipeline"

Unpacking this statement, it is a big change in how we're thinking about comparative genomics alignments. Enredo is a method to produce a set of co-linear regions, sometimes called a "synteny map", though that is a dreadful term. The key thing is that it handles duplications in the genome, allowing (say) two regions of human to be co-linear with one region of mouse. This is hard to do genome-wide in a scalable manner.

Pecan is the multiple aligner written by the brilliant Ben Paten (he used to be my student and wrote Pecan whilst at the EBI; he is now at UCSC with Jim and David and co). Pecan is the best aligner: by both simulation testing and testing via ancient repeat alignability criteria, it has the highest sensitivity of alignment with the same specificity as the next best aligner.

Finally Ortheus, also from Ben, provides (potentially) realignment whilst simultaneously sampling correctly from a probabilistic model of sequence evolution, critically including insertions and deletions, and thus, as a side effect, producing likely ancestral sequences. This has also been stringently tested using a hold-one-out criterion: basically, can we "predict" the marmoset sequence using only the other extant species? (Answer: not completely correctly, but better than any other method, e.g. taking the nearest sequence.)
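For readers who want a feel for the hold-one-out idea, here is a toy sketch in Python. It is not Ortheus: the predictor is a plain per-column majority vote standing in for a proper probabilistic model, indels are ignored, and the species names and sequences are invented for illustration. The only point is the shape of the test: hide one sequence, predict it from the rest, and compare against the "nearest sequence" baseline.

```python
from collections import Counter


def identity(predicted: str, actual: str) -> float:
    """Fraction of aligned columns where the two sequences agree."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must be aligned to the same length")
    if not actual:
        return 0.0
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def consensus(rows: list[str]) -> str:
    """Toy predictor: the most common character in each alignment column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


# Hypothetical aligned sequences; "held_out" plays the role of marmoset.
held_out = "ACGTACGT"
others = {
    "species_a": "ACGTACGA",  # assumed to be the nearest relative
    "species_b": "ACGAACGT",
    "species_c": "ACGTTCGT",
}

predicted = consensus(list(others.values()))
model_score = identity(predicted, held_out)
baseline_score = identity(others["species_a"], held_out)

print(f"prediction from all species: {model_score:.3f}")
print(f"nearest-sequence baseline:   {baseline_score:.3f}")
```

A method passes this kind of test when its score beats the baseline on real data; the toy numbers here only show how the comparison is set up.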

So, what does this all mean? Basically there are two key things:

Handling lineage-specific duplications. This is a headache, and we have a good solution: the paralogous and orthologous regions are aligned simultaneously (the paralogy is limited to relatively recent paralogy, i.e. within mammals).
We can reliably predict ancestral sequences.
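To picture what "handling duplications" asks of the data model, here is a minimal Python sketch of a co-linear block that may hold more than one region per species. The class names, fields and coordinates are all hypothetical; this is not the Compara schema or API, just an illustration of two human regions sitting in the same block as one mouse region.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """One genomic interval (hypothetical model, 1-based inclusive)."""
    species: str
    chrom: str
    start: int
    end: int
    strand: int = 1


@dataclass
class ColinearBlock:
    """A set of regions considered co-linear with one another."""
    regions: list[Region] = field(default_factory=list)

    def copies(self, species: str) -> list[Region]:
        """All regions in this block that belong to one species."""
        return [r for r in self.regions if r.species == species]

    def has_duplication(self) -> bool:
        """True if any species contributes more than one region."""
        seen = set()
        for region in self.regions:
            if region.species in seen:
                return True
            seen.add(region.species)
        return False


# Made-up coordinates: a lineage-specific duplication in human.
block = ColinearBlock([
    Region("human", "1", 1_000_000, 1_050_000),
    Region("human", "1", 1_200_000, 1_250_000),
    Region("mouse", "4", 3_000_000, 3_050_000),
])

print(len(block.copies("human")), block.has_duplication())
```

An alignment model that allows only one region per species cannot represent this block at all, which is why the old alignments could simply ignore paralogy.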

One headache is that some of the things we display, in particular the GERP continuous conservation score, need to be adapted to work on regions with paralogy. There is a fascinating piece of theory to work through here: what is the concept of the "neutral tree" when there has been a lineage-specific duplication? How should one treat paralogs? Currently this is ignored by virtue of the fact that the alignments don't allow it. Now the alignments do allow it, and we need to do something sensible, as well as stimulate evolutionary theory people to look at the data and work out new methods.

The next headache is what do we do with the ancestral sequences? Dump them? Display them? Gene predict on them? If so, how?

The end result is that release 49, even the comparative genomics, won't look very different, but it will have these new alignments, and over 2008 we will be working out how to present, analyse and leverage them more. So if you are interested, please do take them for a spin!

(Release 49 is due to be out sometime in mid-Feb)

10 January, 2008

Locus Specific Databases and Diagnostic databases

At the beginning of this week Paul Flicek and I were in lovely Rotterdam at the Gen2Phen kick-off meeting, an EU project led by Tony Brookes from Leicester. Like all large European projects, the kick-off meeting is a chance to get to know everyone, have beers (very good ones in Holland) and get a feel for the project.

For me, the exciting thing was getting closer to the locus specific databases. In the project are Johan den Dunnen (from just down the road in Leiden, Holland) and Andy Devereau (from Manchester), who run locus specific databases and diagnostic databases respectively. Getting this valuable data coordinated with genome data (and the fiddly bit is about sequence coordinates, at least at first) is going to be a great thing to do.
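As an illustration of why the coordinates are fiddly, here is a minimal Python sketch that converts a position numbered along a transcript into a genomic coordinate. It assumes the database records positions relative to a reference transcript whose exon coordinates are known; the function name and the exon coordinates are made up for the example, and real data adds further wrinkles (untranslated regions, intronic offsets, different assembly versions) that this ignores.

```python
def transcript_to_genomic(pos: int, exons: list[tuple[int, int]], strand: int = 1) -> int:
    """Map a 1-based transcript position to a 1-based genomic coordinate.

    exons: genomic (start, end) pairs, 1-based inclusive, in any order.
    strand: 1 for forward, -1 for reverse.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    if pos < 1:
        raise ValueError("transcript positions are 1-based")

    # Walk the exons in transcription order.
    ordered = sorted(exons, reverse=(strand == -1))
    remaining = pos
    for start, end in ordered:
        length = end - start + 1
        if remaining <= length:
            if strand == 1:
                return start + remaining - 1
            return end - remaining + 1
        remaining -= length
    raise ValueError("position lies beyond the end of the transcript")


# Hypothetical two-exon transcript.
exons = [(100, 109), (200, 209)]
print(transcript_to_genomic(11, exons))             # first base of exon 2
print(transcript_to_genomic(1, exons, strand=-1))   # reverse strand start
```

Every variant has to go through a mapping of this kind before it can be shown alongside genome annotation, and the mapping is only as good as the agreement on which transcript and assembly were used.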

There's lots to do in this area. Certainly this is something that affects all the big browsers (UCSC, NCBI, ourselves) and has had a long history of complex systems and sociological tensions in getting things sorted. But my sense in this small room hidden away in the Erasmus medical centre was that we had good people in the room, committed to finding a good solution whilst understanding the complexity of the problem. Next up will be more technical meetings, but it was an excellent start. Don't expect anything tomorrow, but I think we can expect something towards the end of 2008 or in 2009.

And did I mention the beer was good as well?

09 January, 2008

News Flash - Pig updated on the Pre! site

A pig assembly that includes chromosomes 5 and 15 has been updated on our Pre! site. See here for further information.

07 January, 2008

Ensembl Blogging broadens

We're going to be experimenting with broader content generated by the Ensembl team in the Ensembl blog, at the very least by myself, Ewan Birney. So you can expect to read more about what we're doing, the things that are coming up in the pipeline, and our thoughts on how genomic infrastructure is going to evolve over time. Ensembl is a big team with a lot of components, so it is often hard to track what we're doing and why we've made some decisions. This blog will hopefully keep you up to date with our progress in an informal manner.

06 January, 2008

January Workshops

Happy New Year from Ensembl!

Our upcoming workshops this month are as follows:

Browser workshop (University of Nottingham) Nottingham, UK 8 Jan
Demo at the PAGXVI Conference, San Diego, CA, USA 13 Jan (8:00-11:20 AM)
* Tutorials for EBI resources presented at PAGXVI are here
Browser workshop (Netherlands Bioinformatics Center) Nijmegen, NL 16 Jan
Developers workshop: the core API (Netherlands Bioinformatics Center) Nijmegen, NL 17 Jan
Browser workshop (City of Hope) Duarte, CA, USA 18 Jan
Browser workshop (University of Oregon) Eugene, OR, USA 22 Jan
Browser workshop (University of California, San Francisco) San Francisco, CA, USA 24 Jan
Browser workshop (University of California, Santa Cruz) Santa Cruz, CA, USA 28 Jan
Browser workshop (University of California, Los Angeles) Los Angeles, CA, USA 30 Jan

Remember, the next Ensembl release is due 26 Feb, 2008.