06 April, 2008

High dimensions, heterogeneity statistics

Last week I was a co-organiser of a Newton Institute workshop on high-dimensional statistics in biology. It was a great meeting and there were lots of interesting discussions, in particular on ChIP-seq methods and protein-DNA binding array work. I also finally heard Peter Bickel talk about the "Genome Structure Correction" (GSC) method, which he developed for the ENCODE statistics and which I now, finally, understand. It is a really important advance in the way we think about statistics on the genome.

The headache for genome analysis is that we know for sure it is a heterogeneous place - lots of things vary, from gene density to GC content to... nearly anything you can name. This means that naive parametric statistics, for example assuming everything is Poisson, will completely overestimate the significance. In contrast, naive randomisation experiments, which build some putative empirical distribution of the genome, can easily lead to over-dispersed null distributions, ie, end up underestimating the significance (given the choice it is always better to underestimate).

What's nice is that Peter has come up with a sampling method that gives you the "right" empirical null distribution. This is a segmented-block-bootstrap method where, in effect, you create "feasible" miniature genome samples by resampling the existing data. As well as being intuitively correct, Peter can show it is actually correct given only two assumptions: first, that the genome's heterogeneity is blocky at a suitably larger scale than the items being measured; and second, that the genome has independence of structure once one samples from far enough away, a sort of mixing property. Finally, Peter appeals to the same ergodic theory used in physics to convert sampling over space into sampling over time; in other words, by sampling the heterogeneity of the single genome we have, one produces a set of samples of "potential genomes" that evolution could have created. All of these assumptions are justifiable, and they are certainly fewer than other statistics require. Using this method, empirical distributions can be generated (in some cases they can safely be assumed to be Gaussian, so far fewer sampled points are needed to get the estimate), and test statistics built off these distributions. (Peter himself prefers confidence limits of a null distribution.)
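For the programmatically minded, here is a minimal sketch of the general block-bootstrap idea in Python/NumPy. It is not Peter's exact GSC construction - the binary-mask representation of the two feature tracks, the function names, the block length and the toy data are all my own illustrative choices. The observed excess overlap between two tracks is compared with its standard error, estimated by stitching pseudo-genomes together out of large contiguous blocks of the real genome.

```python
import numpy as np

rng = np.random.default_rng(0)

def excess_overlap(a_mask, b_mask):
    """Observed A-and-B coverage minus the coverage expected if the two
    tracks were placed independently (the null hypothesis value is zero)."""
    return (a_mask & b_mask).mean() - a_mask.mean() * b_mask.mean()

def block_bootstrap(a_mask, b_mask, block_len, n_boot=1000):
    """Resampled values of the statistic from a simple block bootstrap.

    Each replicate stitches randomly chosen contiguous blocks of the real
    genome into a pseudo-genome of roughly the same length, preserving
    block-scale heterogeneity (GC, gene density, ...) while mixing up which
    parts of the genome contribute. The spread of these replicates estimates
    the standard error of the observed statistic."""
    genome_len = len(a_mask)
    n_blocks = max(1, genome_len // block_len)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        starts = rng.integers(0, genome_len - block_len, size=n_blocks)
        a_boot = np.concatenate([a_mask[s:s + block_len] for s in starts])
        b_boot = np.concatenate([b_mask[s:s + block_len] for s in starts])
        stats[i] = excess_overlap(a_boot, b_boot)
    return stats

# Toy usage: two boolean coverage tracks over a 10 Mb "genome",
# each made of 2000 random 500 bp features.
genome_len = 10_000_000
a_mask = np.zeros(genome_len, dtype=bool)
b_mask = np.zeros(genome_len, dtype=bool)
a_mask[rng.integers(0, genome_len - 500, 2000)[:, None] + np.arange(500)] = True
b_mask[rng.integers(0, genome_len - 500, 2000)[:, None] + np.arange(500)] = True

obs = excess_overlap(a_mask, b_mask)                  # observed excess overlap
se = block_bootstrap(a_mask, b_mask, 100_000).std()   # block-bootstrap SE
z = obs / se  # compare to N(0, 1) if a Gaussian shape is a fair approximation
print(f"excess overlap = {obs:.2e}, z = {z:.2f}")
```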


End result - one can control, correctly, for heterogeneity (of certain sorts, but many of the kinds you want to control for, eg gene density). Peter is part of the ENCODE DAC group I am putting together, and Peter and his postdoc, Ben Brown, are going to be making Perl, pseudo-code and R routines for this statistic. We in Ensembl will, I think, implement this as a web page, so that everyone can use it easily. Overall... it is a great step forward in handling genome-wide statistics.


It is also about as mathematical as I get.

1 comment:

Anonymous said...

So when will this be implemented - available or published?