The Ensembl Weblog: April 2008

23 April, 2008

Upcoming Workshops and Instructional Videos

Hello to our readers, I hope everyone is having a nice April. In the UK we are experiencing a long winter with some rain, but spring seems to be around the corner... as are these upcoming workshops...

Did you know? The EBI has released tutorial videos.

Have a look at the Ensembl browser videos for information and direction to some of its pages! Or, learn more about BioMart, a fast data mining tool.

Upcoming workshops- May

Browser workshop at the WHO in Cairo (12-13 May)
Module in the Open Door Workshop at the Sanger (12-14 May)
Ensembl in China: The Shanghai Center for Bioinformation Technology (14-16 May)
Ensembl in China: Center for Bioinformatics, Beijing (19-21 May)
Browser and API workshops at the GTPB in Oeiras, Portugal (27-30 May)
Presentation at the European Human Genetics Conference in Barcelona (30 May)

11 April, 2008

The gene love-in

We have four groups on campus interested in human genes: Ensembl, Havana, whos data forms the bulk of the Vega database, HGNC, the human gene nomenclature committee, and finally UniProt, which has a special initiative on human proteins. With all these groups on the hinxton campus, and with all of them reporting to (at least one) of myself (Ewan Birney), Rolf Apweiler or Tim Hubbard, who form the three-way coordination body now called "Hinxton Sequence Forum", HSF it should all work out well, right?

Which is sort of true; the main thing that has recently changed over the last year has been far, far closer coordination between these four groups than there was ever before, meaning we will be achieving an even closer coordination of our data, leaving us hopefully with the only differences being update cycle and genes which cannot be coordinated fully (eg, due to gaps in the assembly).

Each of these groups have a unique view point on the problem. Ensembl wants to create as best-as-possible geneset across the entire human genome, and its genesis back in 2000 was that this had to be largely automatic to be achievable in the time scale desired, being months (not years) after data was present. Havana wants to provide the best possible individual gene calls when they annotate a region, integrating both computational, high throughput and individual literature references together, UniProt wants to provide maximal functional information on the protein products of genes, using many literature references on protein function which are not directly informative on gene structure and finally HGNC wants to provide a single, unique, symbol for each gene to provide a framework for discussing genes, in particular between practicing scientists.

Three years ago, each group knew of the other's existence, often discussed things, was friendly enough but rarely tried to understand in depth why certain data items were causing conflicts as they moved between the different groups. Result: many coordinated genes but a rather persistent set of things which was not coordinated. Result of that: irritated users.

This year, this has already changed, and will change even more over 2008 and 2009. Ensembl is now using full length Havana genes in the gene build, such that when Havana has integrated the usually complex web of high throughput cDNAs, ESTs and literature information, these gene structures "lock down" this part of the genome. About one third of the genome has Havana annotation, and because of the ENCODE Scale up award to a consortium headed by Tim Hubbard, this will now both extend across the entire genome and be challenged and refined by some of the leading computational gene finders world wide (Michael Brent, Mark Diekhans and Manolis Kellis, please take a bow). Previously Ensembl brought in Havana on a one-off basis; now this process has been robustly engineered, and Steve Searle, the co-head of the Gene Build team, is confident this can work in a 4-monthly cycle. This means it seems possible that we can promise a worse-case response to a bad gene structure being fixed in six months, with the fixed gene structure also being present far faster on the Vega web site. It also means that the Ensembl "automated" system will be progressively replaced by this expert lead "manual" annotation over the next 3 years across the entire genome.

(An aside. I hate using the words "automated" and "manual" for these two processes. The Ensembl gene build is, in parts, very un-automated, with each gene build being precisely tailored to the genome of interest in a manual manner, by the so called "gene builder". In contrast "manual" annotation is an expert curator looking at the results of many computational tools, each usually using different experimental information mapped, in often sophisticated ways, onto the genome. Both use alot of human expertise and alot of computational expertise. The "Ensembl" approach is to use human expertise in the crafting of rules, parameters and choosing which evidence is the most reliable in the context of the genome of interest, but having the final decision executed on those rules systematically, whereas the "Havana" curation approach is to use human expertise inherently gene-by-gene to provide the decision making in each case, and have the computational expertise focus on making this decision making as efficient as possible. Both will continue as critical parts of what we do, with higher investment genomes (or gene regions in some genomes) deserving the more human-resource hungry per genome annotated "manual" curation whereas "automated" systems, which still have a considerable human resource, can be scaled across many more genomes easily).

This joint Havana/Ensembl build will, by construction, be both more correct and more stable over time due to the nature of the Havana annotation process. This means other groups interacting with Havana/Ensembl can work in a smoother, more predictable way. In particular on campus it provides a route for the UniProt team to both schedule their own curation in a smart way (basically, being post-Havana curation) and provide a feedback route for issues noticed in UniProt curation which can be fixed in a gene-by-gene manner. This coordination also helps drive down the issues with HGNC. HGNC always had a tight relationship with Havana, providing HGNC names to their structures, but the HGNC naming process did not coordinate so well with the Ensembl models, with gene names in complex cases becoming confused. This now can be untangled at the right levels - when it is an issue with gene structures, prioritise those for the manual route, when it is an issue with the transfer of the assignment of HGNC names (which primarily has individual sequences, with notes to provide disambiguation) to the final Havana/Ensembl gene models this can be triaged and fixed. HGNC will be providing new classifiers of gene names to deal with complex scenarios where there is just no consistent rule-based way of classifying the difference between "gene" "locus" and "transcript" in a way which can work genome-wide. The most extreme example are the ig loci, with a specialised naming scheme for the components of each locus, but there are other oddities in the genome, such as the proto-cadherin locus which is... just complex. By having these flags, we can warn users that they are looking at a complex scenario, and provide the ability for people who want to work only with cases that follow the "simple" rules (one gene, in one location, with multiple transcripts) the ability to work just in that genome space, without pretending that these parts of biology don't exist.

It also means our relationships to the other groups in this area; in particular NCBI and UCSC (via the CCDS collaboration), NCBI EntrezGenes (via the HGNC collaboration) and other places worldwide can (a) work better with us because we've got more of our shop in order and (b) we can provide a system where if we want to change information or a system, we have only one place we need to change it.

End result; far more synchrony of data, far less confusion for users, far better use of our own resources and better integration with other groups. Everyone's a winner. Although this is all fiddly, sometimes annoying, detail orientated work, it really makes me happy to see us on a path where we can see this resolved.

06 April, 2008

High dimensions, hetreogenity statistics

Last week I was a co-organiser of a Newton Institute workshop on high dimensional statistics in biology. It was a great meeting and there were lots of interesting discussions, in particular on chip-seq methods and protein-DNA binding array work. I also finally heard Peter Bickel talk about the "Genome Structure Correction" method (GSC), something which he developed for ENCODE statistics, which I now, finally, understand. It is a really important advance in the way we think about statistics on the genome.

The headache for genome analysis is that we know for sure that it is a heterogeneous place - lots of things vary, from gene density to GC content to ... nearly anything you name. This means that naive parametric statistical measures, for example, assuming everything is poisson, is will completely overestimate the significance. In contrast, naive randomisation experiments, to build some potential empirical distribution of the genome can easily lead to over-dispersed null distributions, ie, end up under estimating the significance (given a choice it is always better to underestimate). What's nice is that Peter has come up with a sampling method to give you the "right" empirical null distribution. This involves a segmented-block-bootstrap method where in effect you create "feasible" miniature genome samples by sampling the existing data. As well as being intuitively correct, Peter can show it is actually correct given only two assumptions; one that genome's heterogeneity is block-y at a suitably larger scale than the items being measured, and secondly that the genome has independence of structure once one samples from far enough way, a sort of mixing property. Finally Peter appeals the same ergodic theory used in physics to convert a sampling over space to being a sampling over time; in other words, that by sampling the single genome's heterogeneity from the genome we have, this produces a set of samples of "potential genomes" that evolution could have created. All these are justifiable, and certainly this is far fewer assumptions than other statistics. Using this method, empirical distributions (which in some cases can be safely assumed to be gaussian, so then far fewer points are needed to get the estimate) can be generated, and test statistics built off these distributions. (Peter prefers confidence limits of a null distribution).

End result - one can control, correctly, for heterogeneity (of certain sorts, but many of the class you want to, eg, gene density). Peter is part of the ENCODE DAC group I am putting together, and Peter and his postdoc, Ben Brown, are going to be making Perl, pseudo-code and R routines for this statistic. We in Ensembl will implement this I think in a web page, so that everyone can easily use this. Overall... it is a great step forward in handling genome-wide statistics.

It is also about as mathematical as I get.

The Ensembl Weblog