Revision 1.4, Thu Sep 27 15:23:44 2007 UTC, by overbeek: fixes to the 1K update

<div align=center>
<h1>The Project to Annotate 1000 Genomes:</h1> 
<h1>An Update (Sept. 2007)</h1>
<h2>by Ross Overbeek</h2>

It has been almost exactly four years since <b>The Project to Annotate 1000 Genomes</b> was launched (see <a href="./1KG.html">the manifesto</a> written in early 2004 for details).  
It is certainly arguable that 1000 genomes already exist.  I believe that there are now about 600-700 in the public
archives marked as "complete", that another 100-200 are complete but not yet submitted to the public archives, 
and 300-500 are "essentially complete" (i.e., they have over 95% coverage).  So, the first comment that I would
make is that our prediction that we would reach 1000 genomes in 2007 was right on.
In fact, as I reread the original manifesto, I am very pleased with how well we formulated the
essential task, and how well we implemented it.  The salient points of that plan were as follows:
<li>FIG launched a cooperative effort to provide accurate, high-quality annotations that would lay the foundation
for exploiting the wealth of genomic data that would emerge during this decade.  We were quickly joined by
researchers from a number of institutions, including Argonne National Laboratory, the Computation Institute at the University of Chicago, the Burnham Institute, 
the University of Illinois at Urbana-Champaign, and San Diego State University.  Researchers from 
other institutions joined the
effort as we progressed, most notably scientists from Hope College and the University of Florida.
<li>We believed that the standard approaches of high-volume annotation based on protein families and automated 
pipelines (at least as commonly implemented) would be inadequate.  The "tough cases" would prove to be a major
hindrance, and many of the existing fully automated efforts would just propagate errors.  We still hold this opinion.
<li>As we put it in the original manifesto: <b>the key to development of high-throughput annotation technology is to have experts annotate single subsystems over the complete collection of genomes.</b>
That is, we formulated a precise notion of <i>subsystem</i>, implemented the software to support development and
exchange of subsystems, and argued that the key to complete automation was to first create a large body of accurate 
annotations using a technology that dramatically improved the productivity of <i>experts</i> with decades of 
experience in specific biological topics.  The development of a large and maintained library of
subsystems would become the foundation for eventually producing accurate automated annotations.
<li>We proposed a 3-stage schedule for development of the subsystem library, leading to a substantial,
curated collection by 2007.
<li>Finally, we planned on working closely with Bernhard Palsson's team at UCSD to develop 1000 
stoichiometric matrices as a foundation for supporting quantitative modeling. 
Palsson's team has continued to move rapidly forward with modeling, but the level of collaboration
envisioned in the manifesto never materialized.  Rather, the team at Hope College joined our effort and
developed the technology for creating and maintaining initial stoichiometric models for hundreds of organisms.
<h2>What Was Actually Accomplished?</h2>

It is now 2007, the 1000 genomes are here, and it is time to assess the situation.
The basic goal from the beginning was to substantially improve the available annotations for
the first 1000 sequenced genomes.  I believe that we have accomplished this task. 
We have developed a distributed and maintained collection of over 600 subsystems 
containing over 500,000 genes.  This collection
has been used
to manually annotate the existing collection of complete genomes.
We have designed and implemented the technology for using this collection of subsystems
as the foundation for rapid, accurate annotation of new genomes 
[see <a href="http://nar.oxfordjournals.org/cgi/content/abstract/33/17/5691">The Subsystems Approach to Genome Annotation and its Use in the Project to Annotate 1000 Genomes</a>].
The RAST (Rapid Annotation using Subsystems Technology) server implemented at Argonne National Laboratory is now capable
of producing relatively accurate annotations, and they continue to improve.  Over 200 
genomes from external users (i.e. from researchers who have no connection to the Project to Annotate 1000 
Genomes) have been
annotated by the RAST server in the last four months [manuscript submitted for publication].
The Argonne team grew with the addition of five researchers who had previously worked on GenDB at
the University of Bielefeld.  The look and feel of the RAST server, as well as the new SEED viewer,
owe a great deal to these new members.
The team at Hope College defined the notion of <i>scenario</i> 
[see <a href="http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1868769">Toward the automated generation of genome-scale metabolic networks in the SEED</a>]
and used it to formulate detailed reconstructions of metabolic networks for a number of organisms.

We have offered technical support for development of "boutique"
databases describing specific subsystems [see <a
Subsystem</a> and <a href="http://aropath.lanl.gov">the AroPath
site</a>].  Roy Jensen and Carol Bonner have invested considerable effort in
building these sites, and we now have other experts building similar
sites designed to cover specific subsystems in depth using the SEED.
In several cases, review publications reflecting the web
site contents have either been submitted or are in preparation.
Andrei Osterman, one of the FIG founding fellows, was
awarded a grant with Valerie de Crecy and Tadhg Begley to develop
specific subsystems (including wet lab verifications) throughout a
number of pathogens ("The Genomics of Coenzyme Metabolism in Bacterial
Pathogens").  The movement of subsystems from strictly bioinformatics efforts to
the core of integrated bioinformatics and wet lab efforts is just beginning,
but I do believe that it will gradually gain momentum.
Dmitry Rodionov of the Burnham Institute has made substantial progress in
integrating searches for regulatory sites with development of subsystems.
His papers <a href="http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=16857666&ordinalpos=3&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum">
Comparative genomics and experimental characterization of N-acetylglucosamine utilization pathway of Shewanella oneidensis</a> and 
<a href="http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=17360515">
Genomic identification and in vitro reconstitution of a complete biosynthetic pathway for the osmolyte di-myo-inositol-phosphate</a> with Andrei Osterman (and a number of others) illustrate the technique.  These papers reflect
a technology that may well bring rapid advances over the next few years.


<h2>Where Do We Go From Here?</h2>

First, we wish to make it clear that the subsystems, FIGfams,
annotations, and metabolic reconstructions generated by the Project to
Annotate 1000 Genomes are all freely available to anyone for any use.
We do allow groups to collaborate and withhold data, but the central
participants continue to enhance a body of data that we make publicly
available.  As the details of our effort bear fruit, I believe that
more and more new groups will build upon this data collection.  We
hope that new collaborations will emerge, but in many cases I would
assume that new teams will just use the data and build research
projects upon it (and that is fine with us).

<h3>The RAST Server</h3>

The <a href="http://rast.nmpdr.org/">RAST server</a> has the potential of making a huge impact.  It is certainly the
most visible outcome of the project.  By offering a free annotation service that produces
higher quality output (in identification of genes, annotation of gene function, and placement
of genes into metabolic reconstructions) than existing technologies, 
we believe that we lay the foundation for rapidly 
processing the 1000s of genomes that will be sequenced in the next five years.
We will steadily improve the quality of our annotations by
<li>adding subsystems,
<li>removing errors from existing subsystems,
<li>building and maintaining FIGfams, a set of protein families grounded in
subsystems technology,
<li>handling mobile elements and prophages in a separate stage of processing, 
<li>improving gene calls and start positions using comparative evidence, and
<li>offering the capability of removing frameshifts using bioinformatics tools (for genomes
of relatively low quality with systematic errors producing fragmented genes under the
current RAST).
I should also point out that after the RAST Server was released, we proceeded with the development
of the technology and implemented a <a href="http://metagenomics.nmpdr.org/">MetaGenomics RAST Server</a>.  This 
server is now completely operational and in widespread use.

<h3>The FIGfams</h3>

The FIGfams are yet another attempt to produce protein families designed to support
annotations.  They are grounded in the subsystems collection, but they do include numerous
families for which subsystems do not yet exist.  This effort has not yet been published, but a 
manuscript is in preparation.

<h3>Boutique Web Sites for Specific Subsystems</h3>

We will be offering support to a number of our subsystem curators building small
web sites focusing on specific subsystems of interest.  Normally, these efforts
are coupled with the production of review papers and are undertaken only with
biologists who have extensive backgrounds in the subsystems of interest.

<h3>Broadening Participation</h3>

As the benefit of our approach becomes increasingly apparent, I would anticipate that
a growing number of biological experts will wish to access and use the technology we are 
developing.  I would guess that this would proceed in steps.  First, an expert would participate
in the annotation clearinghouse (see next section).  Of those that do, a smaller number will wish
us to help clean up the annotations in their area of expertise by implementing new subsystems.
Relatively few 
experts will wish to implement their own subsystems and "publish the results" (a process that makes
the subsystems available to anyone worldwide that wishes to download them from a server maintained
at Argonne National Laboratory).  To do this they will normally utilize a <a href="http://theseed.uchicago.edu/FIG/index.cgi">
publicly available installation of the SEED</a> maintained at the University of Chicago.

<h3>The Annotation Clearinghouse</h3>

Although I have not discussed the <b>Annotation Clearinghouse</b> in
this document, <a href="./annotation_clearinghouse.html">I do discuss
it elsewhere</a>.  It offers a framework where experts can deposit relatively
reliable assertions of function for genes they have studied.  These assertions are
grouped with existing annotations from numerous annotation groups and form
a resource that can be used for a number of purposes.  The most obvious is to
support efforts to clean up existing annotation efforts (like our own).  A less
obvious outcome will be a growing collection of reliable assertions that can be
used by the bioinformatics community as a basis for testing and developing new tools.
Contributing to the annotation clearinghouse will be the most basic and common
way experts will interact with our project.

<h3>Alignments and Trees</h3>

Gary Olsen from the University of Illinois at Urbana-Champaign and I have been working on building alignments and trees (as well
as the tools needed to maintain them).  We are just beginning what I think will become a serious
attempt to integrate trees more deeply into the annotation process and the generation of FIGfams.
We have generated in excess of 20,000 alignments and trees, but the effort is still at an early stage.


I wrote this document because I felt that we are approaching the end of the Project
to Annotate 1000 Genomes.  Certainly, our collaborative effort will continue, but it seemed
time to assess how well we have done and what should be the next defining goal.
As to how well we have done, in my view we have succeeded in almost all of our major goals,
and in some cases surpassed them.  We now have genomes in which about 45% of the genes are in subsystems,
initial metabolic reconstructions have been developed, and we are beginning to significantly impact
the way genomes (at least bacterial and archaeal genomes) are annotated.  The RAST server is a major
development that will, I predict, become the basic annotation "workhorse" for the next 5,000-10,000 genomes.
So, I feel we have clearly succeeded.
It is now time to think about the next stage.  Clearly, we will continue executing our basic strategy,
and this will steadily improve the body of existing annotations.  However, I do believe that
it is useful to have a clear, compelling, short statement of purpose like "The Project to Annotate
1000 Genomes".  At this time, I am too wrapped up in the final stage of the existing project, and
I have nothing to suggest -- but I do think that we need to discuss this topic over the coming few 
