The Ensembl Weblog: 2008

23 December, 2008

Upcoming Workshops (Jan 2009)

Happy Holidays, and Happy New Year from Ensembl!

The new year will start with some workshops given by our Outreach team on how to use our new interface (and the data behind the scenes!). We hope you have had time to explore and learn the layout! Remember to send any questions to our helpdesk.

Upcoming workshops in January, 2009:

11 Jan Ensembl Demo at the PAG XVII conference, San Diego, CA, USA
13-14 Ensembl 2-day browser workshop at the Universidad de Chile, Santiago, Chile
15-16 Modules in the EBI Bioinformatics Roadshow, UCLA, USA
19-20 Modules in the EBI Bioinformatics Roadshow, City of Hope, USA
22-23 Modules in the EBI Bioinformatics Roadshow, UCSF, USA
24 Browser course in the Computational Biology Workshop, Sultan Qaboos University, Muscat, Oman
26 Browser course in the 9th BioSapiens European School of Bioinformatics, Brussels, Belgium

That's all for now!

11 December, 2008

GeneTrees: how do I read them? And can I view alignments using Jalview?

If you have clicked on the GeneTree link in Ensembl (for example, the gene tree for IL2), you may have noticed that we have a new way of displaying large GeneTrees. This time, if you have a large gene family with lots of genes that you want to look at, you won't need to ask the Miami Dolphins to let you plug your laptop into their huge screen...

This new feature in EnsemblCompara is called collapsible subtrees and allows for more compact, summarized views of interesting gene families like PAX2/PAX5/PAX8:

http://www.ensembl.org/Homo_sapiens/Gene/Compara_Tree?g=ENSG00000075891

If you check the legend at the bottom, you will see that "blue triangles" correspond to collapsed subtrees that have within-species paralogs of your gene. If you want to see all the within-species paralogs expanded, you can click on the option "View paralogs of current gene". You can even set that as a default if you want in the "Configure this page" options.

Jalview is a great way to view protein alignments in the tree. And where is my Jalview link now? Click on any internal node (square) in the tree, and you will be able to visualize the alignment (or subalignment) with the new Jalview applet by clicking on the Jalview link. You have to have Java installed though, or the link won't show. Two Jalview windows pop up: one shows the protein alignment and the other the underlying TreeBeST tree. You can now use Jalview's sorting feature to sort your sequences according to the tree with: Calculate->Sort->By Tree Order->URL. Having the tree associated with the alignment allows for a more phylo-centric visualization of sequence conservation: if you click at a point in the tree, a red vertical line will appear that divides the alignment into different groups. If you choose Colour->Percentage Identity, the shades of blue will be relative to the subgroups in your tree (e.g., fish versus placental mammals). This is also useful to spot segments in the alignment that don't look that good, or gaps created in a subpart that can now be collapsed in the subalignment (Edit->Remove Empty Columns), or sequences that stand out as long branches in the alignment (View->Overview Window).

For even more tree funkiness, you can use PhyloWidget to visualize our NHX trees. Use our NHX tree ("Configure this page->Output for normal tree->NHX->Save and Close->Gene Tree(text)") to copy+paste the representation of the GeneTree into PhyloWidget, with duplication/speciation events (red/blue), bootstrap values (greyscale) and taxonomy levels ("View->Rendering->Show clade labels"). Then use the "Zoom in/Zoom out" features or, after clicking on an internal node, "Tree Edit->collapse", and especially the "View->Branch lengths [x]" and "View->Layout->Options->Branch Scaling" options.
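If you would rather inspect the NHX text yourself, the annotations can be pulled out with a few lines of Python. This is only a rough sketch: the function name and the toy tree are made up for illustration, and it assumes the usual NHX convention of `[&&NHX:key=value:key=value]` blocks with `D` for duplication (Y/N), `B` for bootstrap and `S` for species. Values that themselves contain ":" are not handled.

```python
import re

# Matches one NHX comment block, e.g. [&&NHX:D=Y:B=100]
NHX_BLOCK = re.compile(r"\[&&NHX:([^\]]*)\]")


def nhx_annotations(tree_text):
    """Return one dict of tag -> value per NHX block, in order of appearance."""
    annotations = []
    for block in NHX_BLOCK.findall(tree_text):
        tags = {}
        for field in block.split(":"):
            if "=" in field:
                key, value = field.split("=", 1)
                tags[key] = value
        annotations.append(tags)
    return annotations


# Toy tree, invented for this example (not real Ensembl output).
toy_tree = (
    "((a:0.1[&&NHX:S=Homo_sapiens],b:0.2[&&NHX:S=Mus_musculus])"
    ":0.05[&&NHX:D=N:B=98],c:0.3[&&NHX:S=Danio_rerio])[&&NHX:D=Y:B=100];"
)

nodes = nhx_annotations(toy_tree)
duplications = sum(1 for tags in nodes if tags.get("D") == "Y")
print(len(nodes), "annotated nodes,", duplications, "duplication node(s)")
```

For real work a proper tree library is a better choice than a regular expression, since it keeps each annotation attached to its node.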

We hope these new features will help you in your research. We have some new ideas that we are currently testing to visualize even more phylogenetic information, and help make better judgement on the orthology and paralogy relationships in our EnsemblCompara GeneTrees. Stay tuned for more updates!

Ensembl 52

Hot on the heels of release 51 comes release 52 of Ensembl - the first revision of the new webcode... So what's new?

Data:

http://www.ensembl.org/Gorilla_gorilla/ - Gorilla (Gorilla gorilla gorilla) 2x assembly from the Wellcome Trust Sanger Institute and associated genebuild
http://www.ensembl.org/Homo_sapiens/ Human and http://www.ensembl.org/Mus_musculus/ Mouse - New gene build merging with the latest manually annotated gene set from the Vega project.

Web site:

Updated export: - Restored most of the functionality with the new Export wizard on Genes, Transcripts and Locations - to allow export of FASTA, EMBL, Genbank, GFF, TSV, Vista and PIP files.

Image export: - Restored an improved version of the image export functionality - all "Horizontal" generated images have an [Export image] button to allow the image to be exported in vector format (PDF, SVG, EPS) and scaled bitmap format (PNG x0.5, x1, x2, x5 and x10) to allow publication quality images to be exported.
The vector formats PDF, SVG and EPS can all be imported into vector image editors to be manipulated as well.

04 December, 2008

Ensembl and Amazon Web Services

We're happy to announce that Ensembl is one of the launch partners for Amazon's "Public Data Sets" initiative, so the MySQL data and index files for the current release of Ensembl can be accessed from within Amazon's Elastic Compute Cloud (EC2) service. From the Amazon website:

AWS Hosted Public Data Sets provide a convenient way to share, access, and use public domain or non-proprietary data within your Amazon EC2 environment. Select public data sets are hosted on AWS for free as an Amazon EBS snapshot. Any Amazon EC2 customer can access this data by creating their own personal Amazon EBS volume from a publicly shared Amazon EBS public data set snapshot. They can then access, modify, and perform computation on these data sets directly using an Amazon EC2 instance and just pay for the compute and storage resources that they use.

Details of how to access the data can be found at http://aws.amazon.com/publicdatasets .

We have plans to make much more use of AWS in the future. Stay tuned!

30 November, 2008

Linking in to Ensembl...

Due to the changes in the web interface there have been a number of changes to the URLs for pages. In most cases the web code catches these changes but there are a number of requests which due to the nature of the site have changed:

Configuring the way a page is rendered;
Changing the way tracks are rendered;
Adding DAS sources via a web-address and not via the web interface;
Attaching UCSC style external resources.

These are now all attached in a similar, systematic way:

To change global page settings: add a parameter config=key=value{,key=val}
e.g. to turn off the top image on Location > Region in detail

http://www.ensembl.org/Homo_sapiens/Location/View?r=1:1000-2000;config=view_top=off

e.g. to link directly to the Exon Intron markup panel (Transcript > Exons) and to show full introns and only 60bp flanking sequence AND turn the display to be 60bp wide

http://www.ensembl.org/Homo_sapiens/Transcript/Exons?t=ENST00000309255;config=flanking=60,seq_cols=60,fullseq=yes
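If you generate such links from a script, the global-settings pattern can be sketched in Python like this. The helper name `with_config` is hypothetical; only the `config=key=value{,key=val}` form and the ";" separator are taken from the examples above, and no URL escaping is attempted.

```python
def with_config(base_url, settings):
    """Append a config=key=value{,key=value} parameter to an Ensembl URL.

    `settings` is a dict of global page settings, e.g. {"view_top": "off"}.
    The ';' separator follows the example URLs in this post.
    """
    if not settings:
        return base_url
    pairs = ",".join(f"{key}={value}" for key, value in settings.items())
    separator = ";" if "?" in base_url else "?"
    return f"{base_url}{separator}config={pairs}"


url = with_config(
    "http://www.ensembl.org/Homo_sapiens/Transcript/Exons?t=ENST00000309255",
    {"flanking": 60, "seq_cols": 60, "fullseq": "yes"},
)
print(url)
```

This rebuilds the Exon Intron markup link shown above from its base URL and three settings.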

To change configuration for an individual panel add a parameter referring to the panel (this will be documented shortly on the website) e.g. for Location > Region in detail the two panels are contigviewtop and contigviewbottom, for Location > Region overview it is cytoview. This is again a comma separated list, where the left hand side of each "=" is the name of the track, and the right hand side is the name of the "renderer" to use - the latter depends on the type of track. Additionally the left hand side can be used to integrate external data.
Notes:
- Track names are now systematically named so will have changed from the values you may have been used to using - again we will shortly publish a list of these, but examples are: transcript_core_ensembl - the ensembl genes from the ensembl database.
- Renderers depend on the type of track, but e.g. for transcripts you have the option of "transcript_label", "transcript_nolabel", "collapsed_label" and "collapsed_nolabel", for alignment features (and also url attached data at the moment) "normal", "half_height", "stack", "unlimited" and "ungrouped", for DAS tracks "labels" (show labels if configured by the source) or "nolabels" - hide labels.
- At the moment two special parameters can be used:
  das:http://www.mydas.source/das/my_data=render
  - which attaches a DAS source to the session and selects the renderer
  url:http://www.myweb.server/my_data.format=render
For example:

http://www.ensembl.org/Homo_sapiens/Location/View?g=ENSG00000012048;config=panel_top=off;contigviewbottom=das:http://www.ensembl.org/das/Homo_sapiens.NCBI36.transcript=nolabels,transcript_core_ensembl=collapsed_nolabel

Turns on a das source (in this case the Ensembl transcripts) and collapses the standard ensembl track down to a single line per Gene AND also turns off the top panel!
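The same link can be assembled programmatically. The sketch below is illustrative only: `panel_param` and `das_source` are hypothetical helper names, the panel and track names are the ones quoted in this post, and the values are joined exactly as in the example (no URL escaping).

```python
def das_source(das_url):
    """Mark a DAS source URL with the das: prefix described above."""
    return f"das:{das_url}"


def panel_param(panel, tracks):
    """Build one panel parameter, e.g. contigviewbottom=track=renderer,...

    `tracks` is a list of (track_or_source, renderer) pairs.
    """
    pairs = ",".join(f"{name}={renderer}" for name, renderer in tracks)
    return f"{panel}={pairs}"


base = "http://www.ensembl.org/Homo_sapiens/Location/View?g=ENSG00000012048"
params = [
    "config=panel_top=off",
    panel_param(
        "contigviewbottom",
        [
            (
                das_source("http://www.ensembl.org/das/Homo_sapiens.NCBI36.transcript"),
                "nolabels",
            ),
            ("transcript_core_ensembl", "collapsed_nolabel"),
        ],
    ),
]
# Parameters are separated by ';' as in the example link.
url = ";".join([base] + params)
print(url)
```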

29 November, 2008

Ensembl 51

The web team can finally let out a quick sigh of relief now that the long awaited new web code has finally emerged kicking and screaming out of the web team office...

It is obvious to see the "cosmetic" changes to the site:

the colours,
fonts,
layout,
the unified configuration
the reduction in page sizes.

On top of this there have been a large number of underlying technical improvements to the way the pages are put together.

Streamlining the JavaScript and CSS to make sure that the transfers to and from the server to your browser are as fast as possible;
Using unobtrusive JavaScript throughout the new code so pages work with or without JavaScript or AJAX - although they are not quite as functional they still work!
Making the pages standards compliant to make them render in most browsers without issues (unless of course that browser is IE and there are lots of places where the "standards" approach fails)
Using a fast in-memory cache (a modified version of memcached which allows for the use of tags) to reduce the load on our user database and to store and serve temporary images, processed HTML etc.
Segregation of code into more modules to reduce the size of the very large modules we had (notably the breakdown of the Component modules into smaller chunks)
Configuration meta information contained in core databases making the site easier and more automatic to set up.
Optimisation of drawing and configuration code.
Transparent use of AJAX in many cases. Use of Perl's LWP::ParallelUserAgent where the user's browser doesn't support AJAX.
Further areas where the extensible plugin system is available - defining colours, configuring images.
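To give a feel for what a cache "with tags" can do, here is a toy Python sketch. It is not memcached and not the Ensembl implementation; the class and method names are hypothetical, and the idea that tags are used to drop related entries together is an assumption made for illustration.

```python
from collections import defaultdict


class TaggedCache:
    """Toy in-memory cache where each entry can carry tags (illustration only)."""

    def __init__(self):
        self._values = {}
        self._keys_by_tag = defaultdict(set)

    def set(self, key, value, tags=()):
        """Store a value and remember which tags it was stored under."""
        self._values[key] = value
        for tag in tags:
            self._keys_by_tag[tag].add(key)

    def get(self, key, default=None):
        return self._values.get(key, default)

    def delete_by_tag(self, tag):
        """Drop every entry stored under `tag`; return how many were removed."""
        removed = 0
        for key in self._keys_by_tag.pop(tag, set()):
            if key in self._values:
                del self._values[key]
                removed += 1
        return removed


cache = TaggedCache()
cache.set("img:1", b"png-bytes", tags=["image", "user:42"])
cache.set("html:1", "<p>panel</p>", tags=["user:42"])
cache.set("img:2", b"other-png", tags=["image"])

removed = cache.delete_by_tag("user:42")  # both entries tagged user:42 go
print(removed, cache.get("img:1"), cache.get("img:2"))
```

The sketch skips expiry, size limits and re-tagging of overwritten keys, all of which a real cache would need.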

19 November, 2008

Upcoming training events December

There are still a few more Ensembl training events before the end of the year.

Browser workshops:

UNAM, Mexico City, Mexico (1-2 Dec)
UNAM, Cuernavaca, Mexico (5 Dec) (+ departmental seminar 4 Dec)

Amsterdam, The Netherlands (19 Dec)

Developers workshop:

University of Cambridge, UK (1-3 Dec)

In addition, Ensembl will feature as part of the following courses:

Wellcome Trust Open Door Workshop 'Working with the Human Genome Sequence' (1-2 Dec, Hinxton, Cambridge, UK) and Genes en evolución, ecologia e conservación (8-9 Dec, La Paz, Baja California, Mexico)

For details of these workshops, please have a look at the complete list of Ensembl training events.

17 November, 2008

Accessing the Ensembl data with Perl

Do you know a bit of Perl? Ensembl hosts an API (Application Programmers Interface) which uses Object-Oriented Perl to extract data from Ensembl databases. This API is public and can be used for people to programmatically access the data in the Ensembl database. We understand that not everyone is used to Object-Oriented code, although people may have basic Perl skills and be interested in using our datasets. For that kind of bioinformaticist, I would recommend a recent short read in O'Reilly's Broadcast:

Beginners Introduction to Object-Oriented Programming with Perl - O'Reilly Broadcast

And for the more advanced readers, the classic reference book in OO-Perl would be Damian Conway's Object Oriented Perl, which apart from being very informative, has a really cool cover :-)

We are always trying to lower the barrier to entry for research communities interested in using the Ensembl database in programmatic ways that make use of all the complexity associated with the generation of our data. That's why our API is public and well-documented. You can learn about our API by attending one of our API workshops for free (e.g.: 1-3 December - Univ. Cambridge, UK). We are currently trying to smooth things out even more, working on ways to make it even easier to download all that's needed to use the API and have the example scripts running on your computer with the minimum number of steps. Stay tuned for news on this soon...

31 October, 2008

Ensembl at the IMGC

Ensembl is attending the 22nd International Mammalian Genome Conference taking place in Prague (Czech Republic) from 2-5 November 2008. This meeting starts with three bioinformatics workshops on Sunday (2nd November) at the Institute of Molecular Genetics (AS CR). One of these workshops will focus on Ensembl, discussing new developments and featuring a preview of our new interface. We will be starting at 9:00 (Seminar Room 3.102, on the third floor). You can download the workshop materials here (exercises and tutorials). As part of our commitment to the EURATools consortium, we'll be focusing on rat genomics, but if you work with any other species annotated in Ensembl, you are welcome.
See you in Prague!

30 October, 2008

Upcoming training events November

Only 3 continents to cover this time, but November will be even busier for the Ensembl trainers than October ....

Ensembl will feature as part of the following courses:

'Computational & Comparative Genomics' (5-11 Nov, Cold Spring Harbor Laboratory, New York, US)
Wellcome Trust Open Door Workshop 'Working with the Human Genome Sequence' (10-12 Nov, Wellcome Trust Genome Campus, Hinxton, UK)
Hands-on training at EBI 'Programmatic access in Java: webservices & work flows' (24-27 Nov, Wellcome Trust Genome Campus, Hinxton, UK)

Browser workshops will be given at the following locations:

Europe:
Prague, Czech Republic (2 Nov)
Madrid, Spain (5 Nov)
Newcastle, UK (13 Nov)
Cambridge, UK (13-14 Nov)
Naples, Italy (19 Nov)

North America:
Cambridge, Massachusetts, US (12 Nov & 14 Nov)
Boston, Massachusetts, US (13 Nov)

Asia:
Kuala Lumpur, Malaysia (24-25 Nov)
Sabah, Malaysia (27-28 Nov)

For details of these workshops, please have a look at the complete list of Ensembl training events.

14 October, 2008

Power outage

Ensembl is currently down due to a power outage in the data centre at the Sanger Institute last night. Power has been restored, but it will take some time to restore all of the services.

We are working to get things up and running and expect that Ensembl will be back mid to late morning UK time.

03 October, 2008

Ensembl 51 development

New design
You will already have seen a number of emails about the upcoming Ensembl 51 release - the web team are working hard to tidy up the loose ends of the release! We have got most of the major views ready, and are just working on some of the views you may have never found before. As a taster I'm posting a few screen shots from our development site; the first shows the new page layout for graphical display of genomic regions (the old contigview). You will see many of the new design decisions in this screen shot:

There are more views per object as we have broken up the large single pages into smaller components;
Tabs for the different focus objects - in this case Gene and Location. Transcript and Variation feature are the other tabs available;
A tree of all information available about the focus feature on the left hand side;
Left/right pagination buttons to allow you to navigate between all the information we have about the focus object.
"General" and "local" tools areas

Under the hood!

There have been a large number of changes under the hood of the web-site. Notable changes have been:

Use of modified version of memcached to store and retrieve cached images, static and dynamic content, user settings;
Re-writing the configuration code to automagically detect the contents of the databases and try and display the content appropriately;
Breaking up of the component code into separate modules;
Removing the need for a script per view - by using "routeing" style URL parsing to work out what objects are to be rendered and how... e.g. /Gene/Compara_Tree/Text displays the text version of a gene's homology tree.
More and easier to configure renderers for drawing code.
Striving for standards compliance in both XHTML and CSS, which should allow us to support modern web browsers more easily. We will be actively supporting Firefox 3+, Internet Explorer 7+ and Safari 3+ (and other similar browsers), while trying to make sure that the site is still workable in other browsers (the site appears to work in Opera 9.25+).
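The "routeing" idea in the list above can be illustrated with a small Python sketch. This is not the Ensembl web code (which is written in Perl); the `Route` fields and the `parse_route` function are hypothetical names, and the /Object/View[/Variant] shape is inferred from the single /Gene/Compara_Tree/Text example.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    object_type: str        # e.g. "Gene"
    view: str               # e.g. "Compara_Tree"
    variant: Optional[str]  # e.g. "Text", or None when absent


def parse_route(path):
    """Split a path such as /Gene/Compara_Tree/Text into its routing parts."""
    parts = [part for part in path.split("/") if part]
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"expected /Object/View[/Variant], got {path!r}")
    variant = parts[2] if len(parts) == 3 else None
    return Route(parts[0], parts[1], variant)


route = parse_route("/Gene/Compara_Tree/Text")
print(route)
```

One parser like this can hand the object type and view name to shared rendering code, instead of keeping a separate script for every view.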

New configuration panel

All configuration of the site and individual views has been moved to a common "Configuration dialog" box.

The old "yellow menus" are replaced by a more expansive and easier to navigate tree of features. This is important now that there are nearly 200 individual tracks in the Human Location view page.
There are more choices to display some tracks - rather than just turning them on and off, you can decide how you wish them to be displayed.
Configuration for other pages are loaded in a similar way.
The site has a common site-wide image width setting.
The configuration panel is also where you will: manage your accounts, upload data, attach DAS and URL based data

Different renderers

For different data types we now support different renderers - not just collapsed and expanded.
For example:

For genomic alignments we support ungrouped features (all on one line), normal grouped and bumped features at both full and half-height, and now also "stacked" features - "2 pixel" high glyphs.

We hope when you see the new interface that you will find it more intuitive, more discoverable and faster to use and most importantly more productive for the research work that you are doing.

Incredible !ndia

The Ensembl team has been involved in several activities in Hyderabad (India) during the last few days, making the most of HUGO's 13th Human Genome Meeting (HGM2008).

A satellite workshop has been organised within the Open Door Workshop framework at the Centre for Cellular and Molecular Biology (CCMB). Over 40 scientists from different countries had the opportunity to learn about different resources freely available on the Internet, providing us with useful feedback.

Following our presence at HGM2008 in the EBI booth, we had the opportunity to make several contacts that hopefully should allow us to organise a series of workshops around India next year. If you are interested in knowing more about this, or would like to ask about hosting one of our workshops, you can contact us.

Greetings from India भारत से नमस्ते

21 September, 2008

Upcoming training events October

As usual October is a busy month for the Ensembl trainers with workshops on 4(!) different continents.

From 1-3 Oct Ensembl will feature in the Wellcome Trust Open Door Workshop "Working with the Human Genome Sequence" in Hyderabad, India, and from 6-8 Oct in the EBI hands-on workshop "A two-day dip into the EBI’s data resources: Understanding your data" in Hinxton, UK.

Upcoming browser workshops:
9-10 Oct: J. Craig Venter Institute, Rockville, MD, US
14 Oct: National Human Genome Research Institute (NHGRI), Bethesda, MD, US
15 Oct: National Human Genome Research Institute (NHGRI), Bethesda, MD, US
16-17 Oct: University of the Free State, Bloemfontein, South Africa
20-21 Oct: University of the Witwatersrand, Johannesburg, South Africa
22 Oct: University of Nottingham, Nottingham, UK
23-24 Oct: University of the Western Cape, Cape Town, South Africa
29-30 Oct: EBI Roadshow, Dublin, Ireland

If you want to know to which locations we are coming after October, then have a look at the complete list of all upcoming training events.

Considering hosting an Ensembl workshop yourself? Please contact Xose Fernandez.

18 September, 2008

Websites and Guinness. Worth the wait.

Steve posted the news that we're delaying our new release for at least two more weeks. The message is pasted in here:

Hi all

In our Intentions Summary mail for release 51 we stated that the release was scheduled for early/mid September. The 51 release will include significant updates and improvements to the web interface. We are delaying release while we complete development on these. We are working to get the release out as soon as possible, and are now aiming for end September/early October. I apologise for this delay.


Steve


Dr Steve Searle
Ensembl Project Leader, Sanger

It is always so frustrating to delay, but of course, far more important to have a working site than something only part working. Welcome to delivering high end services.

We took on a lot of things to change in this web refresh. For most users the main thing people will notice is the entirely new web layout. This was driven by our surveys of users who mainly complained about being buried in too many displays and data. We then took around 4 months working with user groups and trialling different layouts (many thanks to those who participated) which in some cases made significant changes to our original designs (we now have a hybrid "tab and left-hand-side" approach, voted as best by ~60% of people, with the other three options splitting the rest of the vote). We're very excited about this new layout going live as it just looks cleaner and less cluttered, and yet provides more information. The other thing people will notice is that it is just faster. As the saying goes, you can't be too rich, too thin or have your websites go too fast.

Making a website go faster is harder than it might look. It involves all sorts of things - the bandwidth of your machines to us, the speed of the servers, the connectivity of servers to databases, the speed of the API, the database to disk, the management of the huge number of simultaneous users we have and then the size of the html returned and finally the render speed on your browser. All of these contribute to the overall perception of "speed". Under the hood we've been working on all these aspects - internally a big change is that we no longer need a common file system for our web farm to work off. Previously, when your browser asked for a contigview page, our servers generated html with an image and that image was written to the common disk; the browser parsed the image tag and asked for this image - and this is the critical bit - by sending a request which in all likelihood was served by a different server in our webfarm. That server then went to the common file system to pick up the file and send it back. Many times a critical bottleneck has been read/write on this shared filesystem. In the new system this has all gone, and the images are stored in a memory-based common store, meaning first that we remove this bottleneck (which will be the first big effect) and secondly that we will be able to cache a lot more - the hope is that many of the identical pictures for the common species will be entirely served from memory in the new system. Another important change has been aggressively slimming our html. Currently all sorts of files - often very small - are pinged by each page, just to see if they have changed. We've consolidated a lot of these files - and compressed them - and then also optimised them for render speed.

There is a variety of things not for this release but coming up end of 2008/early 2009 also on speed. Our API has a new concept, collections, which better handles the case of zoomed out views, where we know the renderers will not be able to render every object. Instead a collection - which may be rendered as a union or density or something - will be provided. The other thing on the horizon is us setting up a US mirror on the west coast. For the last year we have been extensively monitoring the speed of Ensembl from different sites, and there is a large increase in time to retrieve on the north-west coast of the US. We've been investigating quite why this is (and learning lots more about the backbone of the internet than we knew before) but it seems as if the simplest way to get good speed on the west coast is to just run a mirror over there. Probably 2009 for that to go live.
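To make the collections idea a little more concrete, here is a rough Python sketch of the "density" case: rather than drawing every feature in a zoomed out view, count how many features fall in each of a fixed number of windows. This is only an illustration under assumed conventions (inclusive coordinates, equal-width windows); the function name is hypothetical and it is not the API's implementation.

```python
def density(features, region_start, region_end, bins):
    """Count features overlapping each of `bins` equal-width windows.

    `features` is an iterable of (start, end) pairs with inclusive coordinates.
    """
    if bins <= 0 or region_end < region_start:
        raise ValueError("need bins > 0 and region_end >= region_start")
    length = region_end - region_start + 1
    counts = [0] * bins
    for start, end in features:
        # Clip the feature to the region; skip it if nothing is left.
        start = max(start, region_start)
        end = min(end, region_end)
        if start > end:
            continue
        first = (start - region_start) * bins // length
        last = (end - region_start) * bins // length
        for index in range(first, last + 1):
            counts[index] += 1
    return counts


# Toy coordinates, invented for this example.
toy_features = [(1, 10), (20, 30), (90, 100), (150, 200)]
print(density(toy_features, 1, 100, 4))
```

With the number of windows tied to the image width in pixels, the amount of data to draw stays bounded however many features the region contains.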

Back to the website. It looks so much better - and has much better hardware characteristics - (our shared file system is ... well ... rather 2004 technology and needs pretty constant care at the moment) that I can't wait until it comes out. But there is absolutely no point in having a crippled site in functionality even though we've got many of the user interface and technical issues right. The sticking point at the moment is the configuration panel. This comes up as a "modal" box on top of the page, allowing a lot of options to choose from, but not a bewildering set of options on each page. To cope with the 200 odd different tracks to switch on and off, the box has to have tabs and friendly, browseable hierarchies. To get all this to work in a nice, friendly, slick way... that's a lot of JavaScript.

And a lot of JavaScript is a lot of browser compatibility headaches. Even using JS libraries - prototype and script.aculo.us (I think - James Smith can tell you the details!) - there are all sorts of details that might not work in quite the same way on IE5 compared to IE6. Or Firefox. Or Safari. And it must degrade at least functionally without JS. And of course work, and render fast. This modal box is the last, complex thing to get sorted.

We're close. I've seen the box come up over James' screen. I hear Steve has seen it come up and tracks change, and seen the link of tracks to changes. The API for the configuration system was gutted and is much better. But it's got to work on all main browsers. For all our genomes, in particular Human and Mouse. And this is just tricky, fiddly work.

We're not quite there yet. We're really close, and so much is working it is just excruciating. But we need another couple of weeks. James is being shielded from other jobs by Steve and others; Eugene is torture testing memcachedb to stress test the system before it goes live; Xose, Bert and Giulietta are writing help; Beth and Anne are writing the additional pagelets inside of the new geneview and transcriptviews. And it all looks really good.

So - apologies - we thought we'd be launching in July. We thought we'd be launching in September. We still might just do that, but then again, it might well be October. If it goes any later I will have no hair.

But it does look really good.

It is definitely worth the wait. Like Guinness.

Ewan

21 August, 2008

Upcoming training events September

After the Summer break we are getting up to speed again with our training events:

14-16 Sep: Ensembl User Meeting, Hinxton, UK

17-19 Sep: Browser workshops and presentations, Erasmus MC Molecular Medicine Postgraduate School, Rotterdam, The Netherlands

22 Sep: Browser workshop, VIB Flanders Interuniversity Institute of Biotechnology, Antwerp, Belgium

We also have a complete list of all upcoming training events for the coming months available. Are we not coming to a location close to you? Why not host then an Ensembl workshop yourself? For more details, please contact Xose Fernandez.

18 August, 2008

GWAS Data in Ensembl

Ensembl has begun to incorporate data from genome-wide association studies. These data are being added in coordination with the European Genotype Archive, a new database resource at the EBI designed to provide a permanent archive for human variation data that is not available for unlimited public release because of ethical or individual privacy restrictions. The European Genotype Archive has recently launched with the raw data from the Wellcome Trust Case Control Consortium (WTCCC. 2007. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661-678). In the future the EGA will provide additional array-based genotype data as well as data from re-sequencing and CNV studies. The EGA will also contain phenotype data.

Ensembl is incorporating summary data from genome-wide association studies represented in the EGA. The data generally represent the p-value for each tested SNP (Single Nucleotide Polymorphism) associated with the given phenotype.

The WTCCC summary data is now available on Ensembl as DAS tracks selectable from the "DAS Sources" menu from the CytoView and ContigView pages. The following menu items provide access to data from bipolar disorder (BD), coronary artery disease (CAD), Crohn's disease (CD), hypertension (HT), type 1 diabetes (T1D) and type 2 diabetes (T2D):

WTCCC BD
WTCCC CAD
WTCCC CD
WTCCC HT
WTCCC T1D
WTCCC T2D

In future releases, GWAS data will be integrated into the Ensembl variation databases.

We will be adding additional data to both Ensembl and the European Genotype Archive as the data become available. We hope you find these new data resources useful.

15 August, 2008

Glitches in Ensembl

Ensembl is currently migrating to new hardware in conjunction with the development of new webcode for the next release (due in late September). During this period, and due to some technical issues, there might be some downtime of our website. We apologise for any problems this may cause and we are working to minimise the impact on Ensembl.

The Ensembl Team

11 August, 2008

Ensembl in China

The Ensembl team has recently run a series of workshops in China:

In Shanghai we were at Tongji University for a workshop organised by the Shanghai Center for Bioinformation Technology, and

In Beijing, we were hosted by Professor Jingchu Luo from the Center of Bioinformatics at Peking University where we also delivered some lectures in the Applied Bioinformatics Course.

Following this experience and due to the success of the tour, we are planning to go back to China. So if you are interested in hosting a workshop or you have a collaboration with a Chinese group who might be interested in knowing more about Ensembl, please contact us to discuss dates. We are trying to coordinate our next trip with different hosts.

The Ensembl Team

05 August, 2008

Ensembl User Meeting

Ensembl announces a workshop for developers that will take place in the Wellcome Trust Genome Campus in Hinxton (near Cambridge, UK) next September (14th-16th September), following the Genome Informatics meeting.

In this workshop we will be exploring Ensembl beyond the website. Participants will be expected to have experience in writing Perl programs and a background in object oriented programming techniques. Being familiar with databases (MySQL) and the Ensembl APIs would be an advantage.

Several Ensembl developers will present uses of our APIs (Application Programming Interfaces) as well as extensions of the Ensembl system. Note this is not a course about how to use the Ensembl APIs.

At the end of this course, attendees will:

• have a good understanding of Ensembl's annotation pipeline;
• know how to customise a local installation of the Ensembl website;
• and have hands-on experience with the annotation pipeline.

In late 2008, the Ensembl Genomes project at the EBI will leverage the Ensembl system to create consistent genome annotation resources focused on a wide variety of eukaryote, as well as prokaryote genomes and thereby continue the activities of the current EBI Integr8 and Genome Reviews projects.

Thus, there will be a session where the new divisions of Ensembl will be introduced and previewed; the initial data content and future directions will be discussed.

There is no registration fee to attend this course, but you may need to arrange accommodation (or extend your stay in Hinxton Hall: info@wtconference.org.uk). If you are planning to attend or would like more information, please send an eMail to xose@ebi.ac.uk.

The Ensembl Team

22 July, 2008

The final verdict ...

In a previous post I promised to do some more genome browser screenshot counting. So, that is what I did last week at the XX International Congress of Genetics 2008 in Berlin. I limited myself to the second poster session of the conference, which should have contained 675 posters. To my surprise a large number were missing, so I estimate that I looked at somewhere between 400 and 500. Compared to the Barcelona conference the result was poor; I identified only 4 posters with Ensembl screenshots as well as 4 posters with UCSC Genome Browser screenshots and none with NCBI Map Viewer screenshots. So, based on the combined results from the two genetics conferences (Ensembl 15, UCSC 12), it seems that the Ensembl and UCSC browsers are about equally popular amongst poster-making geneticists.

However, I had expected more genome browser screenshots in general. What can be the reason for these low numbers? Is there no need for screenshots at all? Or can people not get what they need for their poster from Ensembl or UCSC? We are curious about your thoughts and views on this and are welcoming any suggestions for improvements to Ensembl that will make preparing figures for your poster (or publication) more of a breeze!

16 July, 2008

Update on new release and new interface

The next release (50) will happen in just under a week's time. This will retain the old (classic) look, with the Ensembl interface you are all used to! The new interface will be released in August as a publicly accessible beta testing site alongside our usual Ensembl, in order to make sure everything is running smoothly before we switch over completely. This will give us time to collect feedback from you about the new interface, before we completely switch over to the new interface in release 51 (due in September).

What can you expect in release 50?

A new gene set for human, where UTRs (UnTranslated Regions) are based on ditags. An improved merge between the new human Ensembl gene set and the latest manually annotated gene set from Havana will be available. Also, new gene sets for tetraodon (genes from the Ensembl pipeline along with other genes from the genoscope set), C. elegans (WS190), and projection of the new human set against pika and cat.

Cow has a new assembly and geneset! The Ensembl automated pipeline was run on Btau 4.0 for this release.

New variation sets will be available for orangutan, tetraodon, cow and human.

We will keep you posted about the new interface, beta testing surveys, and upcoming organisms and annotation in release 51.

Thanks to all our users.

25 June, 2008

The big little things: the colour of genes, default tracks, words.

I was thinking about the web design process for e50 - our new web interface due out in July (definitely will be late July). We're at the stage now where Fiona is going to be asking users their preferences for all the "little things" which make no difference to technical aspects of the web site but make a pretty big difference to the usability. Like, for example, how do we colour our genes? This is a long standing debate where everyone has an opinion and everyone's opinion is right - at least for them. (Only 2 colours, and the colours should distinguish manually annotated genes from automatic ones, says one person. No - use the whole spectrum of colours, and make sure we distinguish non-coding RNA genes from pseudogenes from protein coding genes and indicate which ones have orthologs - to mouse. No - to rat. No - instead of that use GO functional categories to colour genes. Or the number of non coding SNPs. Or the gene-wide omega value from the dn/ds measurement.)

Sometimes people look at this debate and say that this is a clear area for user defined colours. Which is sort of true for 10 seconds, but - not really. Firstly most users are not going to get around to changing options - partly due to the fact they have better things to do (like design experiments and run them!), partly because this sort of configuration is just a bit too geeky and partly because, to be honest, if they are into configuring things we'd like them first off to work out which tracks they would like displayed (more on this below), and colouring genes should be low on their list. Secondly we want to provide a scheme which feels natural to the greatest number of people. Hence a rather long series of options to choose from currently being proposed.

The same argument goes for default tracks. (I can't imagine not having SNPs on my display! I can't imagine not having the ESTs switched on!). Everyone has an opinion and everyone is right. Here it is clear we've got to make sensible default decisions (which are also heavily, heavily speed optimised - sadly the new Collections framework won't be ready for SNPs for 50, which is annoying, as really we want SNP density these days in human, but all the other obvious default tracks are pretty well optimised, including some funky scaling stuff to get the continuous basepair comparative genomics measure to come back sensibly when you are zoomed out). But then our main task is to get the user to explore, as in "wouldn't it be nice to see xxxx, I wonder if Ensembl has it", with a configuration system which is very enticing, but not in the way, and importantly for the non-expert user, not completely overwhelming. In our e50 design this means more hierarchy in the options so they can be grouped (itself a bit of a pain to handle - we've got a lot of tracks), and a nice "light box" effect over the display which reassures you that (a) the thing that you were looking at won't disappear and (b) the display will come back quickly. I think we're on the right path here for the configuration, but we still have to decide on the default tracks (for me the only obvious one is "Genes").

Finally we've got the mundane business of which words we use for each of our "pagelet" displays. (Our new pagelets are very nice, and in our latest round of testing, >50% of the in-the-lab biologists liked not only the pagelets, but a specific layout of them. Less than 10% preferred the current Ensembl display). So - we need one or two words to describe "A graphical representation of a phylogenetic tree of a gene with duplication nodes marked". Hmmm. "Gene Tree". Or "Phylogenetic Tree"? (phylogenetic is a bit of a long word, and might get in the way of the menu...). What about "a text based alignment of resequenced individuals with the potential to mark up some features of interest". Is this - "resequencing alignment" or "individual alignment" or "individuals"?

If you'd like to take part in this, email survey@ebi.ac.uk (perhaps cc'd to Xose - xose@ebi.ac.uk) to make sure you are on our list. Ideally we'd like you to be wet-lab biologists. We have a lot of in-house or near-in-house opinions from bioinformaticians, and in any case, bioinformaticians are happier to explore configurations etc. It's the researcher who visits us - say - once or twice a month who we think is the main user to optimise for (again, more frequent users we hope will explore configuration to match things perfectly for them).

More on other e50 topics soon - speed, the importance of chocolate in bribing web developers and the end game for e50!

20 June, 2008

Ensembl 50 - technical requirements

Development for the new Ensembl 50 website is progressing well... some of you may have already seen the test sites when you signed up to be part of our testing team...

One of the complaints of the current site (hardware failures aside) is the performance of the webpages - we are addressing this in a number of ways in the Ensembl 50 web code.

Tuning the Apache web server configuration:
Compressing all HTML/Javascript/CSS files using mod_deflate;
Minimizing the number and size of Javascript/CSS files by stripping unnecessary white space and comments from the files and merging them together;
Setting headers to improve the browsers caching of content.
Aggressively caching content on the server side using a modified version of memcached (Linux users will need a 2.6.x kernel, as it uses the epoll technology).
Increased use of asynchronous HTTP requests (AJAX) to allow more immediate responses for the page while generating other content; and to minimize the content that is sent (can retrieve initially hidden content later)
Reducing page size - rather than having single pages containing lots of disparate information, having more pages each containing a smaller amount of information. This doesn't just help with the page size; it also increases the discoverability of content on the site that people do not find easily, especially comparative genomics, variational genomics and regulatory information.
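
As a rough illustration of the "strip, merge and compress" steps listed above, here is a minimal Python sketch. It is not the Ensembl web code (which is Perl running under Apache with mod_deflate); the function names and the sample stylesheets are made up for this example, and the whitespace stripping is deliberately naive (it would mangle whitespace inside quoted CSS strings).

```python
import gzip
import re


def minify_css(text: str) -> str:
    """Strip /* ... */ comments and collapse runs of whitespace (naive sketch)."""
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def merge_and_compress(sources: list[str]) -> tuple[str, bytes]:
    """Minify each stylesheet, join them into one file, and gzip the result."""
    merged = "\n".join(minify_css(s) for s in sources)
    return merged, gzip.compress(merged.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical stylesheets standing in for a site's many small CSS files.
    sheets = [
        "/* header styles */\nh1 {\n    color: navy;\n}\n",
        "/* track styles */\n.track {\n    margin: 0;\n    padding: 2px;\n}\n",
    ]
    merged, compressed = merge_and_compress(sheets)
    raw_size = sum(len(s) for s in sheets)
    print(f"{raw_size} bytes in {len(sheets)} files -> {len(merged)} bytes in 1 file")
    print(f"gzip payload: {len(compressed)} bytes")
```

Merging saves one HTTP request per file, and minifying shrinks what is sent. On inputs as tiny as these the gzip header can outweigh the saving, so compression only pays off on realistically sized files.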

For those who will be implementing local copies of Ensembl 50 code - additionally Ensembl 50 code will:

Make configuration easier - the pages will configure most of the tracks directly from the contents of the databases;
Make code more pluggable:
ConfigPacker - the SpeciesDefs database parsing; and
ImageConfig - replacement for UserConfig;
Make caching and AJAX implementation easier.

There are a number of changes to the code - so if you have written your own components or drawing code tracks there will be work to be done but in most cases these modifications are easy to implement (e.g. moving code between modules).

Finally, here are some additional system recommendations:

Perl 5.8.8 or newer;
MySQL 5.0 server;
64 bit architecture;
large memory machine;
you can compile our modified "memcached" code (e.g. for Linux you will need a 2.6.x kernel) to get significant speed up;

19 June, 2008

Technical Difficulties

For the past two days, Ensembl has been slow or has not returned the page (instead offering an 'Ensembl is down' yellow screen).

Be assured we are working on the problem. It is a hardware issue, but should be resolved soon.

From all of us in the Ensembl team, thanks for your patience!

12 June, 2008

Ensembl needs you!

As you know, we are working on a new website design for the Ensembl 50 release. We are currently seeking 'beta testers' who would be happy to take part in a survey and help us shape the look and feel of the new website.

If you could spare some time we would be very grateful if you could send an eMail to survey@ebi.ac.uk so we can add you to our list of testers.

We are looking forward to hearing from you.
The Ensembl Team

04 June, 2008

Upcoming Workshops - Summer

Hello all,

There are a few Ensembl training events taking place this summer:

(2-day) Browser workshop in the Dept. of Genetics, University of Cambridge, UK (5-6 June)

Module in a Wellcome Trust Mini-Open Door Workshop (ODW) for MalariaGEN in Hinxton, UK (20 June)

Module in a Mini-ODW at the ICG in Berlin (12 July)

Programmers' group at the ISMB meeting in Toronto, Canada (19-23 July)

As ever, email us with any questions (or comments) at helpdesk@ensembl.org

Best Wishes,
Helpdesk

03 June, 2008

Ensembl 11 - UCSC 8

For the past few days I was in Barcelona at the European Human Genetics Conference 2008. After giving my presentation on Ensembl in one of the 'Educational sessions' and listening to numerous talks about GWAS (genome-wide association studies), I had a look at the posters. Under the impression that the UCSC Genome Browser is the preferred browser amongst (human) geneticists and with Ewan's experience at the recent 'Biology of Genomes' meeting fresh in my mind, I decided to have a closer look at the posters in the Cytogenetics section. Out of 189 posters, I could positively identify 11 with Ensembl screenshots (mostly CytoView and ContigView, but also two times KaryoView), 8 with UCSC Genome Browser screenshots and none with NCBI Map Viewer screenshots. OK, I admit that I can recognise almost any pixel copied from our site and may have missed one or two UCSC screenshots, but all in all I thought this was a very encouraging result! Of course we should keep in mind that this was a European conference, mainly attended by European scientists .... I guess I have a bit more screenshot counting to do at the International Congress of Genetics 2008 in Berlin. So, let's say Ensembl 11 - UCSC 8 is the score at half time .... next month I'll report back with the final result!

28 May, 2008

Ensembl 50

Ensembl is now busy with preparations for our next release, Ensembl 50! We're working hard and we'll keep you updated on what's in store for this release. Our biggest new development will be our revamped website. As usual, we have updated some species and provided new data for other species. Keep reading for an outline of what we aim to provide in Ensembl 50.

New web interface:
The most exciting change in Ensembl 50 will be a new web interface: Simpler, Better, Faster is what we're aiming for. Not only will pages take less time to load, but they will also look a little different. We're hoping that we will have improved the navigability and discoverability of the site so that you can make the best use possible of the data we provide. We have taken into account your messages at helpdesk and your voices in courses. Let us know what you think by emailing helpdesk@ensembl.org !

Genebuild team:
In terms of new data for Ensembl 50, we have constructed new gene sets for tetraodon and cow. Vega/Havana (manual annotation) has released new gene sets for human and mouse so these will be displayed on our website alongside Ensembl genes.

For human, you may know that Ensembl and Havana merge identical transcripts. We have improved the Vega/Havana merge using the latest Havana gene set. Because untranslated regions are notoriously difficult to determine, we've used ditags when predicting UTRs for human. Finally, we have removed some dodgy-looking gene models that were highlighted by the Alpheus project.

For low-coverage genomes, gene models are predicted by projecting the human gene models down onto the 2x genomes. In this release, cat and pika have been updated by projecting current human gene models onto the existing assembly.

We've also updated the gene sets for C. elegans and chimp. Release notes for C. elegans can be found on the WormBase website. Chimp has an updated gene set to include more chimp-specific predictions, and genes projected from human onto chimp are updated.

The horse genomic assembly (EquCab2) has recently been updated such that chromosome 27 has been shortened. This is not a new genebuild as such, but we have modified our data to reflect this change. Zebrafish Agilent V2 Arrays have been mapped to cDNA and genomic sequences.

Canonical transcripts (the longest translations) have been labeled for all species in the database, though this will not appear in the browser. As usual, non-coding RNA genes have also been updated for most species, and cDNA alignments have been redone for human and mouse.
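
The "longest translation" rule for canonical transcripts can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the Ensembl API or pipeline code; the tuple layout, the IDs and the tie-breaking choice are all assumptions made for the example.

```python
from collections import defaultdict


def pick_canonical(transcripts):
    """Return {gene_id: transcript_id}, choosing the longest translation per gene.

    `transcripts` is an iterable of (gene_id, transcript_id, translation_length)
    tuples. Ties are broken by the lexicographically smallest transcript ID,
    which is an arbitrary choice made for this sketch.
    """
    by_gene = defaultdict(list)
    for gene_id, transcript_id, length in transcripts:
        by_gene[gene_id].append((transcript_id, length))
    return {
        gene_id: min(candidates, key=lambda c: (-c[1], c[0]))[0]
        for gene_id, candidates in by_gene.items()
    }


if __name__ == "__main__":
    # Hypothetical gene and transcript identifiers with translation lengths.
    example = [
        ("geneA", "tx1", 310),
        ("geneA", "tx2", 452),
        ("geneB", "tx3", 120),
    ]
    print(pick_canonical(example))
```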

Variation and Functional Genomics teams:
Our Variation team plans to provide updated single nucleotide polymorphisms (SNPs) for tetraodon, cow, human, chimp and orangutan. Our Functional Genomics team will provide promoter cis-regulatory motifs from here. They will also update the current regulatory build on human.

Comparative Genomics team:
Our Comparative Genomics team is extending their multiple alignments with new species and low-coverage (2x) genomes to include:
* 4-species: catarrhini primates EPO (Enredo-Pecan-Ortheus) alignments (human, chimp, orangutan, macaque)
* 12-species: amniote vertebrates Mercator-Pecan alignments (current 10-species alignments + Pongo pygmaeus and Equus caballus)
* 23-species: eutherian mammals EPO (Enredo-Pecan-Ortheus) alignments (all 2X genomes + current 7-species alignments + Pongo pygmaeus and Equus caballus)

GERP scores (% conservation on a basepair level for the 23-species eutherian mammals alignments) will be released.

The Compara (comparative genomics) team is working hard! They're also providing new pairwise alignments:
* All the pairwise (between two species), whole-genome alignments (using tBLAT) will be updated using a new pipeline that follows a best-in-genome approach to filter spurious hits.
* The pairwise alignments for more closely related species (using BLASTz-net) will be updated for the following species so that the reference species is human:
. human vs Pongo pygmaeus
. human vs Loxodonta africana
. human vs Echinops telfairi
. human vs Oryctolagus cuniculus
. human vs Dasypus novemcinctus
. human vs Myotis lucifugus
. human vs Bos taurus
. human vs Ochotona princeps
. human vs Felis catus
Sitewise dN/dS values will be provided in our gene trees to detect positions in the alignments that are under different evolutionary pressure.

Web team:
Last but not least, please note that from Release 50 we will no longer be providing the 'ssaha' sequence search. If you wish to run your own 'ssaha' sequence search you can download the files to generate the search hashes from our FTP site. Alternatively, use BLAT (the BLAST-like Alignment Tool) which is equally fast and also demands exact matches.

That's it for now! Any questions, just email helpdesk@ensembl.org. We will be posting more information as the release date gets closer (we are aiming for end of July!)

10 May, 2008

Coming back from CSHL

I'm in the airline lounge about to head back from "Biology of Genomes" at Cold Spring Harbor Laboratory. As always, it was a great meeting; a highlight for me was seeing the 1,000 genomes data starting to flow - it is clear that the shift in technology is going to change the way we think about population genomics - and for me, the best session was one on "non-traditional models" - Dogs, Horses and Cows, where the ability to do cost effective genotyping has completely revolutionised this field. Now the peculiarities of the breeding structures, with Dog breeds being selected for diverse phenotypes, Cows with the elite bulls siring thousands of offspring due to artificial insemination and Horses having obsessive trait fixation over the last 1,000 years can really bring power to genetics in different ways. Expect a lot more knowledge to come from these organisms and others (chickens, pigs, sheep...) over the coming years.

For my own group, Daniel Zerbino talked about Velvet, our new short read assembler which has also just been published in Genome Research. Velvet is now robust and capable of assembling "lower" eukaryotic genomes - certainly up to 300MB from short reads in read pair format. It is also being extensively used by other groups, often for partial, miniature de novo assemblies in regions. It went down well, and Daniel handled some pretty tricky questions in the Q&A afterwards. Next up - we get access to a 1.5TB real memory machine, and put a whole human genome WGS into memory. Alison (Meynert) and Michael (Hoffman) had great posters on cis-regulation and looked completely exhausted at the end of their poster session.

From Ensembl, Javier talked about the Enredo-Pecan-Ortheus pipeline (which we often nickname EPO). As some said afterwards to us "you've really solved the problem, haven't you" - Javier was able to show clear evidence that each component was working well, better than competitive methods, and having an impact on real biological problems, for example, derived allele frequency. Its ability to handle duplications is a key innovation. Javier and Kathryn are currently wrestling the "final" 2x genomes into this framework, from which point we will start to have a truly comprehensive grasp on mammalian DNA alignments. I also like it as Enredo is another "de Bruijn graph" like mechanism. Currently the joke is that about 10 minutes into any conversation I say "well, the right way to solve this problem is to put the DNA sequence into a de Bruijn graph".

Going to CSHL Biology of Genomes is always a little wince-making, though, as this field - high end genomics - really prefers to use the UCSC Genome Browser (which, as I've written before, is a good browser, and I take the use of it to be our challenge to make better interfaces for these users on our side). My informal counting of screen shots was > 20 UCSC, 4 Ensembl (sneaking one case of 'Ensembl genes' shown in the UCSC browser as a point for each side) and 0 NCBI shots. Well. It just shows the task ahead of us. e50! - our user interface relaunch - is coming together, and we will start focus-group testing soon - time for us to address our failings head on. I'll be blogging more about this as we start to head towards broader testing.

Lots more to write about potentially - Neanderthals, Francis Collins singing in the NHGRI band (quite an experience), reduced representation libraries with Elliott, genome wide association studies (of which, I just _love_ the basic phenotype measures, from groups like Manolis Dermitzakis) and structural variation... but for the moment I've got to persuade my body to feel as if it is 11.30 at night and see if I can get a good night's sleep on the plane.

02 May, 2008

Year of the rat...

Things are moving within the rat community as this month's Nature Genetics issue shows with a special on rat genetics exploring the latest developments.

Featuring:

ENU-induced gene targeting in rats;
A 'white paper' discussing progress and prospects in rat genetics;
A brief overview on rat genome resources online;
A contribution on dynamics of CNV in rat and their impact in phenotypes;
A survey of genetic variation from The STAR Consortium (over 3 million newly identified SNPs and over 20,000 SNPs genotyped across 167 distinct inbred rat strains);
and several papers focusing on the identification of genetic variants associated to rat models of human disease...

The driving force behind these outstanding achievements can be found in a well interlinked rat community bridging resources across the Atlantic: the RGD and EURATools Consortium (FP6 contract number LSHG-CT-2005-019015) collaborations are a good example.

EURATools investigators are developing integrated genome tools (Ensembl is one of the partners of this consortium). Integrating high-throughput sequencing and genotyping with informatics; intensive analysis of phenotypes, gene sequence and gene expression in congenic strains to identify genes and regulatory pathways for a wide range of rat disease phenotypes; and establishing optimised protocols for rat gene targeting are the goals of this ambitious EU funded project.

23 April, 2008

Upcoming Workshops and Instructional Videos

Hello to our readers, I hope everyone is having a nice April. In the UK we are experiencing a long winter with some rain, but spring seems to be around the corner... as are these upcoming workshops...

Did you know? The EBI has released tutorial videos.

Have a look at the Ensembl browser videos for information and direction to some of its pages! Or, learn more about BioMart, a fast data mining tool.

Upcoming workshops- May

Browser workshop at the WHO in Cairo (12-13 May)
Module in the Open Door Workshop at the Sanger (12-14 May)
Ensembl in China: The Shanghai Center for Bioinformation Technology (14-16 May)
Ensembl in China: Center for Bioinformatics, Beijing (19-21 May)
Browser and API workshops at the GTPB in Oeiras, Portugal (27-30 May)
Presentation at the European Human Genetics Conference in Barcelona (30 May)

11 April, 2008

The gene love-in

We have four groups on campus interested in human genes: Ensembl; Havana, whose data forms the bulk of the Vega database; HGNC, the human gene nomenclature committee; and finally UniProt, which has a special initiative on human proteins. With all these groups on the Hinxton campus, and with all of them reporting to (at least one) of myself (Ewan Birney), Rolf Apweiler or Tim Hubbard, who form the three-way coordination body now called "Hinxton Sequence Forum" (HSF), it should all work out well, right?

Which is sort of true; the main thing that has recently changed over the last year has been far, far closer coordination between these four groups than there was ever before, meaning we will be achieving an even closer coordination of our data, leaving us hopefully with the only differences being update cycle and genes which cannot be coordinated fully (eg, due to gaps in the assembly).

Each of these groups has a unique viewpoint on the problem. Ensembl wants to create the best possible gene set across the entire human genome, and its genesis back in 2000 was that this had to be largely automatic to be achievable in the time scale desired, being months (not years) after data was present. Havana wants to provide the best possible individual gene calls when they annotate a region, integrating computational, high throughput and individual literature references together. UniProt wants to provide maximal functional information on the protein products of genes, using many literature references on protein function which are not directly informative on gene structure. Finally, HGNC wants to provide a single, unique symbol for each gene to provide a framework for discussing genes, in particular between practicing scientists.

Three years ago, each group knew of the other's existence, often discussed things, was friendly enough but rarely tried to understand in depth why certain data items were causing conflicts as they moved between the different groups. Result: many coordinated genes but a rather persistent set of things which was not coordinated. Result of that: irritated users.

This year, this has already changed, and will change even more over 2008 and 2009. Ensembl is now using full length Havana genes in the gene build, such that when Havana has integrated the usually complex web of high throughput cDNAs, ESTs and literature information, these gene structures "lock down" this part of the genome. About one third of the genome has Havana annotation, and because of the ENCODE Scale up award to a consortium headed by Tim Hubbard, this will now both extend across the entire genome and be challenged and refined by some of the leading computational gene finders world wide (Michael Brent, Mark Diekhans and Manolis Kellis, please take a bow). Previously Ensembl brought in Havana on a one-off basis; now this process has been robustly engineered, and Steve Searle, the co-head of the Gene Build team, is confident this can work in a 4-monthly cycle. This means it seems possible that we can promise a worst-case response to a bad gene structure being fixed in six months, with the fixed gene structure also being present far faster on the Vega web site. It also means that the Ensembl "automated" system will be progressively replaced by this expert-led "manual" annotation over the next 3 years across the entire genome.

(An aside. I hate using the words "automated" and "manual" for these two processes. The Ensembl gene build is, in parts, very un-automated, with each gene build being precisely tailored to the genome of interest in a manual manner, by the so called "gene builder". In contrast "manual" annotation is an expert curator looking at the results of many computational tools, each usually using different experimental information mapped, in often sophisticated ways, onto the genome. Both use a lot of human expertise and a lot of computational expertise. The "Ensembl" approach is to use human expertise in the crafting of rules, parameters and choosing which evidence is the most reliable in the context of the genome of interest, but having the final decision executed on those rules systematically, whereas the "Havana" curation approach is to use human expertise inherently gene-by-gene to provide the decision making in each case, and have the computational expertise focus on making this decision making as efficient as possible. Both will continue as critical parts of what we do, with higher investment genomes (or gene regions in some genomes) deserving the more human-resource hungry per genome annotated "manual" curation whereas "automated" systems, which still have a considerable human resource, can be scaled across many more genomes easily).

This joint Havana/Ensembl build will, by construction, be both more correct and more stable over time due to the nature of the Havana annotation process. This means other groups interacting with Havana/Ensembl can work in a smoother, more predictable way. In particular on campus it provides a route for the UniProt team to both schedule their own curation in a smart way (basically, being post-Havana curation) and provide a feedback route for issues noticed in UniProt curation which can be fixed in a gene-by-gene manner. This coordination also helps drive down the issues with HGNC. HGNC always had a tight relationship with Havana, providing HGNC names to their structures, but the HGNC naming process did not coordinate so well with the Ensembl models, with gene names in complex cases becoming confused. This now can be untangled at the right levels - when it is an issue with gene structures, prioritise those for the manual route; when it is an issue with the transfer of the assignment of HGNC names (which primarily has individual sequences, with notes to provide disambiguation) to the final Havana/Ensembl gene models, this can be triaged and fixed. HGNC will be providing new classifiers of gene names to deal with complex scenarios where there is just no consistent rule-based way of classifying the difference between "gene", "locus" and "transcript" in a way which can work genome-wide. The most extreme examples are the Ig loci, with a specialised naming scheme for the components of each locus, but there are other oddities in the genome, such as the proto-cadherin locus which is... just complex. By having these flags, we can warn users that they are looking at a complex scenario, and allow people who want to work only with cases that follow the "simple" rules (one gene, in one location, with multiple transcripts) to work just in that genome space, without pretending that these parts of biology don't exist.

It also means that our relationships with the other groups in this area, in particular NCBI and UCSC (via the CCDS collaboration), NCBI EntrezGenes (via the HGNC collaboration) and other places worldwide, improve: (a) they can work better with us because we've got more of our shop in order, and (b) we can provide a system where, if we want to change information or a system, we have only one place we need to change it.

End result; far more synchrony of data, far less confusion for users, far better use of our own resources and better integration with other groups. Everyone's a winner. Although this is all fiddly, sometimes annoying, detail orientated work, it really makes me happy to see us on a path where we can see this resolved.

06 April, 2008

High dimensions, heterogeneity statistics

Last week I was a co-organiser of a Newton Institute workshop on high dimensional statistics in biology. It was a great meeting and there were lots of interesting discussions, in particular on chip-seq methods and protein-DNA binding array work. I also finally heard Peter Bickel talk about the "Genome Structure Correction" method (GSC), something which he developed for ENCODE statistics, which I now, finally, understand. It is a really important advance in the way we think about statistics on the genome.

The headache for genome analysis is that we know for sure that it is a heterogeneous place - lots of things vary, from gene density to GC content to ... nearly anything you name. This means that naive parametric statistical measures, for example, assuming everything is Poisson, will completely overestimate the significance. In contrast, naive randomisation experiments, to build some potential empirical distribution of the genome, can easily lead to over-dispersed null distributions, ie, end up underestimating the significance (given a choice it is always better to underestimate). What's nice is that Peter has come up with a sampling method to give you the "right" empirical null distribution. This involves a segmented-block-bootstrap method where in effect you create "feasible" miniature genome samples by sampling the existing data. As well as being intuitively correct, Peter can show it is actually correct given only two assumptions; one that the genome's heterogeneity is block-y at a suitably larger scale than the items being measured, and secondly that the genome has independence of structure once one samples from far enough away, a sort of mixing property. Finally Peter appeals to the same ergodic theory used in physics to convert a sampling over space to being a sampling over time; in other words, that by sampling the single genome's heterogeneity from the genome we have, this produces a set of samples of "potential genomes" that evolution could have created. All these are justifiable, and certainly this is far fewer assumptions than other statistics. Using this method, empirical distributions (which in some cases can be safely assumed to be Gaussian, so then far fewer points are needed to get the estimate) can be generated, and test statistics built off these distributions. (Peter prefers confidence limits of a null distribution).
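
To give a feel for why block sampling matters, here is a small Python sketch. It is emphatically not Peter's GSC method (which segments the genome first and comes with proofs); it is a plain moving-block bootstrap on a simulated track, and the track simulator, block length and densities are all assumptions picked for illustration. It compares the standard error of a feature density estimated naively (treating every base as independent) with the one obtained by resampling contiguous blocks.

```python
import random
import statistics


def block_bootstrap_means(values, block_len, n_resamples, rng):
    """Resample `values` in contiguous blocks; return the mean of each resample."""
    n = len(values)
    if not 0 < block_len <= n:
        raise ValueError("block_len must be between 1 and len(values)")
    n_blocks = -(-n // block_len)  # ceiling division
    means = []
    for _ in range(n_resamples):
        sample = []
        for _ in range(n_blocks):
            start = rng.randrange(0, n - block_len + 1)
            sample.extend(values[start:start + block_len])
        means.append(statistics.fmean(sample[:n]))
    return means


def make_blocky_track(n_segments, segment_len, rng):
    """Simulate a heterogeneous 0/1 track: each segment has its own feature density."""
    track = []
    for _ in range(n_segments):
        density = rng.choice([0.05, 0.5])
        track.extend(1 if rng.random() < density else 0 for _ in range(segment_len))
    return track


if __name__ == "__main__":
    rng = random.Random(0)
    track = make_blocky_track(n_segments=40, segment_len=250, rng=rng)
    p = statistics.fmean(track)
    # Naive standard error: pretends every position is an independent coin flip.
    naive_se = (p * (1 - p) / len(track)) ** 0.5
    # Block bootstrap: keeps the local structure inside each resampled block.
    block_se = statistics.stdev(block_bootstrap_means(track, 500, 200, rng))
    print(f"naive SE {naive_se:.4f}  block-bootstrap SE {block_se:.4f}")
```

On a blocky track like this one the block-based standard error should come out much larger than the naive one, which is the sense in which the naive calculation overstates significance.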

The end result: one can control, correctly, for heterogeneity (of certain sorts, but including many of the kinds you want to control for, e.g. gene density). Peter is part of the ENCODE DAC group I am putting together, and Peter and his postdoc, Ben Brown, are going to be making Perl, pseudo-code and R routines for this statistic. I think we in Ensembl will implement this as a web page, so that everyone can use it easily. Overall, it is a great step forward in handling genome-wide statistics.

It is also about as mathematical as I get.

19 March, 2008

Ensembl Release 49!

Yesterday, Ensembl released a new version of the browser and database (version 49). Along with new species, homologue predictions, and new code in our API, there have been changes in how the multiple alignments are done on the whole-genome scale. Have a look at the news for more details.

We are looking forward to release 50, as we are working on some new features. Keep an eye out for this next release in August. A reminder: we will not release another version between now and August. In the meantime, updates may appear on the Pre! site but not on the main site.

Please explore the features in release 49, such as BLAST, which is now configured to align queries against top-level sequences (i.e. chromosomes and scaffolds), and BLAT, a fast alignment program that is now the default selection.

Paralogues are shown in blue in GeneTreeView to make them easier to spot.

Upcoming workshops - April

(March workshops are listed in a previous post)

Browser workshops at the VIB Ghent and Leuven (31 Mar - 2 Apr)
Browser workshop (focus: rat) at the ULB Brussels (EURATools) (16 Apr)
Browser workshop at the BCB UCL/Birkbeck (21 Apr)
Module in the EBI roadshow in Poitiers (23, 24 Apr)
API workshop at the Dept. of Genetics, Cambridge (28, 29, 30 Apr)

13 March, 2008

New Release notes and BAC clones in mouse

Keep an eye out for Release 49, which is due on Tuesday 18 March. The delay is due to the scheduled downtime and maintenance at the Sanger and EBI this weekend, which has caused some trouble. However, Release 49 will soon be visible to the community!

New features in release 49 will include BLAST against top-level sequences on all species, updates to the GeneTreeView page that should make things easier to see, and new Ensembl gene sets for Orangutan, Horse and Takifugu. FlyBase 5.4 will be imported for Fruitfly. For API users, the regulatory features will be moved from the core API to the functional genomics API.

Also, a word of warning to those using our mouse clones under 'DAS sources', namely the MICER clones and the bMQ set (129S7/AB2.2 in the 'DAS Sources' menu of ContigView). The clones, originally mapped to NCBIM36, have been lifted over to the new assembly (NCBIM37) coordinates. The drawing indicates where each clone lifts over to in the new assembly. However, the pop-up box shows the coordinates of the original mappings. This is indicated in Ensembl by the 'NCBIM36' label above the coordinates.

Write to our helpdesk if you are confused! helpdesk@ensembl.org

26 February, 2008

Ensembl US East Coast Tour

After a very successful Ensembl US West Coast Tour last month, the Ensembl Outreach team is presently looking into the possibility of organising a similar tour on the US East Coast in the second half of 2008. At the moment we are mainly thinking of 1-day browser workshops, but if there is interest in an API workshop we can of course also consider this.
The participating institutions would only have to pay the instructor's expenses and would share the travel costs; we would not otherwise charge for the workshops. People who are potentially interested in hosting a workshop can contact me for more details.