<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta name=Title
content="A Proposal to Annotate the First 1000 Sequenced Genomes, Develop Detailed Metabolic Reconstructions,  and Construct the Corresp">
<meta name=Keywords content="">
<meta http-equiv=Content-Type content="text/html; charset=macintosh">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 10">
<meta name=Originator content="Microsoft Word 10">
<link rel=File-List href="white_paper_files/filelist.xml">
<title>A Proposal to Annotate the First 1000 Sequenced Genomes, Develop
Detailed Metabolic Reconstructions,  and Construct the Corresp</title>
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>Trial User</o:Author>
  <o:Template>Normal</o:Template>
  <o:LastAuthor>Trial User</o:LastAuthor>
  <o:Revision>2</o:Revision>
  <o:Created>2004-10-25T20:41:00Z</o:Created>
  <o:LastSaved>2004-10-25T20:41:00Z</o:LastSaved>
  <o:Pages>3</o:Pages>
  <o:Words>3398</o:Words>
  <o:Characters>19369</o:Characters>
  <o:Lines>161</o:Lines>
  <o:Paragraphs>38</o:Paragraphs>
  <o:CharactersWithSpaces>23786</o:CharactersWithSpaces>
  <o:Version>10.1316</o:Version>
 </o:DocumentProperties>
</xml><![endif]--><!--[if gte mso 9]><xml>
 <w:WordDocument>
  <w:Zoom>150</w:Zoom>
  <w:DisplayHorizontalDrawingGridEvery>0</w:DisplayHorizontalDrawingGridEvery>
  <w:DisplayVerticalDrawingGridEvery>0</w:DisplayVerticalDrawingGridEvery>
  <w:UseMarginsForDrawingGridOrigin/>
 </w:WordDocument>
</xml><![endif]-->
<style>
<!--
 /* Font Definitions */
@font-face
	{font-family:"Times New Roman";
	panose-1:0 2 2 6 3 5 4 5 2 3;
	mso-font-charset:0;
	mso-generic-font-family:auto;
	mso-font-pitch:variable;
	mso-font-signature:50331648 0 0 0 1 0;}
 /* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{mso-style-parent:"";
	margin:0in;
	margin-bottom:.0001pt;
	mso-pagination:widow-orphan;
	font-size:12.0pt;
	font-family:Times;}
h1
	{mso-style-next:Normal;
	margin:0in;
	margin-bottom:.0001pt;
	mso-pagination:widow-orphan;
	page-break-after:avoid;
	mso-outline-level:1;
	font-size:14.0pt;
	font-family:Times;
	mso-font-kerning:0pt;}
h2
	{mso-style-next:Normal;
	margin:0in;
	margin-bottom:.0001pt;
	mso-pagination:widow-orphan;
	page-break-after:avoid;
	mso-outline-level:2;
	font-size:12.0pt;
	font-family:Times;}
h3
	{mso-style-next:Normal;
	margin:0in;
	margin-bottom:.0001pt;
	mso-pagination:widow-orphan;
	page-break-after:avoid;
	mso-outline-level:3;
	font-size:14.0pt;
	font-family:Times;
	font-weight:normal;}
p.MsoBodyText, li.MsoBodyText, div.MsoBodyText
	{margin:0in;
	margin-bottom:.0001pt;
	text-align:center;
	mso-pagination:widow-orphan;
	font-size:16.0pt;
	font-family:Times;}
@page Section1
	{size:8.5in 11.0in;
	margin:1.0in 1.25in 1.0in 1.25in;
	mso-header-margin:.5in;
	mso-footer-margin:.5in;
	mso-paper-source:0;}
div.Section1
	{page:Section1;}
 /* List Definitions */
@list l0
	{mso-list-id:103496843;
	mso-list-type:hybrid;
	mso-list-template-ids:-1661435914 984073 1639433 1770505 984073 1639433 1770505 984073 1639433 1770505;}
@list l0:level1
	{mso-level-tab-stop:.5in;
	mso-level-number-position:left;
	text-indent:-.25in;}
@list l1
	{mso-list-id:784008324;
	mso-list-type:hybrid;
	mso-list-template-ids:-92915524 984073 1639433 1770505 984073 1639433 1770505 984073 1639433 1770505;}
@list l1:level1
	{mso-level-tab-stop:.5in;
	mso-level-number-position:left;
	text-indent:-.25in;}
@list l2
	{mso-list-id:1578901290;
	mso-list-type:hybrid;
	mso-list-template-ids:-1329047020 984073 1639433 1770505 984073 1639433 1770505 984073 1639433 1770505;}
@list l2:level1
	{mso-level-tab-stop:.5in;
	mso-level-number-position:left;
	text-indent:-.25in;}
@list l3
	{mso-list-id:2032412648;
	mso-list-type:hybrid;
	mso-list-template-ids:1033775006 984073 1639433 1770505 984073 1639433 1770505 984073 1639433 1770505;}
@list l3:level1
	{mso-level-tab-stop:.5in;
	mso-level-number-position:left;
	text-indent:-.25in;}
ol
	{margin-bottom:0in;}
ul
	{margin-bottom:0in;}
-->
</style>
</head>

<body bgcolor=white lang=EN-US style='tab-interval:.5in'>

<div class=Section1>

<p class=MsoBodyText><b>The Project to Annotate the First 1000 Sequenced
Genomes, Develop Detailed Metabolic Reconstructions,<span style="mso-spacerun:
yes">&nbsp; </span>and Construct the Corresponding Stoichiometric Matrices<o:p></o:p></b></p>

<p class=MsoNormal><span style='font-size:16.0pt'><b><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></b></span></p>

<p class=MsoNormal align=center style='text-align:center'>by Ross Overbeek</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<h1>Introduction</h1>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>In December 2003, the Fellowship for Interpretation of
Genomes (FIG) initiated<span style="mso-spacerun: yes">&nbsp; </span><b>The
Project to Annotate 1000 Genomes</b><span style='font-weight:normal'>.<span
style="mso-spacerun: yes">&nbsp; </span>The explicit goal was to develop a
technology for more accurate, high-volume annotation of genomes and to use this
technology to provide superior annotations for the first 1000 sequenced
genomes.<span style="mso-spacerun: yes">&nbsp; </span>Members of FIG were
convinced that the current approaches for high-throughput annotation, based on
protein families and automated pipelines that processed genomes sequentially,
would ultimately fail to produce annotations of the desired accuracy.<span
style="mso-spacerun: yes">&nbsp; </span>We believe that</span><b> the key to
development of high-throughput annotation technology is to have experts
annotate single subsystems over the complete collection of genomes</b><span
style='font-weight:normal'>.<span style="mso-spacerun: yes">&nbsp; </span>The
existing annotation approaches, in which teams analyze a whole genome at a
time, ensure that annotators have no special expertise relating to the vast
majority of genes they annotate.<span style="mso-spacerun: yes">&nbsp;
</span>By having individuals annotate single subsystems over a large collection
of genomes, we allow individuals with expertise in specific pathways (or, more
generally, <i>subsystems</i></span>) to perform their task with relatively high
accuracy.<span style="mso-spacerun: yes">&nbsp; </span></p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>The early stages of the effort began at FIG, but quickly
spread to a number of cooperating institutions, most notably Argonne National
Lab.<span style="mso-spacerun: yes">&nbsp; </span>During the first year of the
project, we have developed detailed encodings of subsystems that together
include a majority of the genes that make up the core cellular
machinery.<span style="mso-spacerun: yes">&nbsp; </span>More importantly, we
have developed the initial versions of technology needed to support the
project.<span style="mso-spacerun: yes">&nbsp; </span></p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>The Project to Annotate 1000 Genomes has reached the stage
where it is clear that it will very shortly produce what we call <i>informal
metabolic reconstructions</i><span style='font-style:normal'> that cover the
majority of central metabolism as it is implemented in the close to 300
more-or-less complete genomes that are now available.<span style="mso-spacerun:
yes">&nbsp; </span>We think of an informal metabolic reconstruction as a
partitioning of the cellular machinery into subsystems, the specification of
the functional roles that make up each subsystem, and the inventory of which
genes in a specific organism implement the functional roles.<span
style="mso-spacerun: yes">&nbsp; </span>What is needed to support both
qualitative analysis and effective quantitative modeling is to convert these
informal metabolic reconstructions into </span><i>formal metabolic
reconstructions. </i><span style='font-style:normal'>By a formal
reconstruction, we mean an accurate encoding of the metabolic network.<span
style="mso-spacerun: yes">&nbsp; </span>The goal of such an encoding is to
construct a list of metabolites and a detailed reaction network that is </span><i>internally
consistent</i><span style='font-style:normal'> (in the sense that metabolites
that are produced by reactions are connected as substrates to other reactions
or to specific transporters,<span style="mso-spacerun: yes">&nbsp; </span>and
that all metabolites that act as substrates are produced by other reactions or
provided by transporters).<span style="mso-spacerun: yes">&nbsp;
</span>Perhaps a better way to put this is that all apparent anomalies are
highlighted as such, and the essential components of the metabolic network are
accurately encoded.<span style="mso-spacerun: yes">&nbsp; </span>The output of
such an effort is normally what is termed a </span><i>stoichiometric matrix</i><span style='font-style:normal'>,
the basic resource required to support stoichiometric modeling.<span
style="mso-spacerun: yes">&nbsp; </span>One of the central goals of this
enlarged effort is to develop accurate stoichiometric matrices for each of the
1000 genomes; we refer to this component of the effort as <b>The Project to
Produce 1000 Stoichiometric Matrices.<o:p></o:p></b></span></p>
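
<p class=MsoNormal>&nbsp;</p>

<p class=MsoNormal>A minimal sketch may make the notion of internal consistency
more concrete. It assumes a toy reaction network; the reaction names,
metabolites, and the helper functions <i>stoichiometric_matrix</i> and
<i>anomalies</i> are hypothetical and are not part of any project software. The
sketch builds a matrix with one row per metabolite and one column per reaction,
then flags metabolites that are produced but never consumed, or consumed but
never produced.</p>

<pre>
```python
# Hypothetical toy network: reaction names, metabolites, and coefficients
# are illustrative only and are not taken from any encoded subsystem.
# Negative coefficients mark substrates, positive coefficients mark products.
reactions = {
    "uptake_A": {"A": 1},         # transporter supplying A
    "R1": {"A": -1, "B": 1},      # A is converted to B
    "R2": {"B": -1, "C": 1},      # B is converted to C
    "R3": {"B": -1, "D": 1},      # B is converted to D; nothing consumes D
    "export_C": {"C": -1},        # transporter removing C
}


def stoichiometric_matrix(reactions):
    """Return (metabolites, reaction_names, matrix) with one row per
    metabolite and one column per reaction."""
    names = sorted(reactions)
    metabolites = sorted({m for coeffs in reactions.values() for m in coeffs})
    matrix = [[reactions[r].get(m, 0) for r in names] for m in metabolites]
    return metabolites, names, matrix


def anomalies(reactions):
    """Flag metabolites that are produced but never consumed, or consumed
    but never produced, so that they can be highlighted for review."""
    metabolites, _, matrix = stoichiometric_matrix(reactions)
    flagged = {}
    for metabolite, row in zip(metabolites, matrix):
        produced = any(c > 0 for c in row)
        consumed = any(-c > 0 for c in row)
        if produced and not consumed:
            flagged[metabolite] = "produced but never consumed"
        elif consumed and not produced:
            flagged[metabolite] = "consumed but never produced"
    return flagged


print(anomalies(reactions))
```
</pre>

<p class=MsoNormal>In this toy network only metabolite D is flagged. In a real
reconstruction such a flag would point either to a missing reaction or
transporter, or to an annotation that needs review.</p>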

<p class=MsoNormal><b><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></b></p>

<p class=MsoNormal>It is our belief that the development of the technology
required to mass-produce accurate genome annotations will ultimately allow
fully automated annotation pipelines to achieve relatively high accuracy.<span
style="mso-spacerun: yes">&nbsp; </span>Similarly, the existence of 1000
accurate formal metabolic reconstructions would constitute a resource that
would allow rapid and accurate development of stoichiometric matrices for
newly-sequenced genomes.<span style="mso-spacerun: yes">&nbsp; </span>That is,
besides producing accurate annotations, informal metabolic reconstructions,
formal metabolic reconstructions, and stoichiometric matrices for a large
collection of diverse genomes, we believe that the expanded project will
produce technology that will support nearly automatic, very rapid
characterization of new genomes.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>All of the encoded subsystems, metabolic reconstructions and
stoichiometric matrices will be made freely available on open web sites.<span
style="mso-spacerun: yes">&nbsp; </span>In addition, the software environments
used to develop the encoded subsystems and stoichiometric matrices will be
developed and supported as open source software.<span style="mso-spacerun:
yes">&nbsp; </span>By making the fundamental data items, the encoded subsystems
and stoichiometric matrices, freely available to the community, we expect to
stimulate development of alternative software systems to support curation and
maintenance of these items.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<h1>The Project to Annotate 1000 Genomes</h1>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>We have chosen to conceptually break the Project to Annotate
1000 Genomes into three stages.<span style="mso-spacerun: yes">&nbsp; </span>We
discuss these stages as if they will occur sequentially; in fact, all three
stages are now in progress.<span style="mso-spacerun: yes">&nbsp; </span>To
understand the three stages, the reader must have at least a rudimentary grasp
of what we mean by an <i>encoded subsystem</i><span style='font-style:normal'>
and an </span><i>informal metabolic reconstruction</i><span style='font-style:
normal'>.<span style="mso-spacerun: yes">&nbsp; </span>When we speak of a
subsystem, we think of a set of related </span><i>functional roles</i><span
style='font-style:normal'>.<span style="mso-spacerun: yes">&nbsp; </span>In a
specific organism, a set of genes implement these roles, and we think of those
genes as constituting the subsystem in that organism.<span style="mso-spacerun:
yes">&nbsp; </span>That is, we are really dealing with an abstract notion of
subsystem (in which the subsystem is a set of functional roles) and instances
of the subsystem in a specific organism (in which a set of genes implements the
abstract functional roles).<span style="mso-spacerun: yes">&nbsp;
</span>Precisely the same subsystem and functional roles exist in distinct
organisms, although obviously the genes are unique to each organism. </span></p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>Subsystems are thought of as possibly having multiple <i>variants</i><span
style='font-style:normal'>.<span style="mso-spacerun: yes">&nbsp;
</span>Organisms that have operational versions of a subsystem may well have
genes that implement slightly different subsets of the functional roles that
make up the subsystem.<span style="mso-spacerun: yes">&nbsp; </span>Each subset
of functional roles that exists in at least one organism with an operational
version of the subsystem constitutes an operational variant.</span></p>
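
<p class=MsoNormal>&nbsp;</p>

<p class=MsoNormal>The definition of an operational variant can be sketched in
a few lines. The example assumes that, for each genome, a curator has recorded
the set of functional roles with identified genes and a judgment of whether the
subsystem is operational; the genome names, role names, and the function
<i>operational_variants</i> are hypothetical.</p>

<pre>
```python
# Hypothetical curation data: genome names and role names are invented.
# Each genome maps to (roles with identified genes, operational flag);
# the flag is assumed to be a curator's judgment, not something computed.
subsystem_rows = {
    "genome_1": ({"role_A", "role_B", "role_C"}, True),
    "genome_2": ({"role_A", "role_B", "role_C"}, True),
    "genome_3": ({"role_A", "role_C"}, True),
    "genome_4": ({"role_B"}, False),   # fragment only, judged not operational
}


def operational_variants(subsystem_rows):
    """Map each distinct role subset found in an operational genome
    to the sorted list of genomes that carry it."""
    variants = {}
    for genome, (roles, operational) in subsystem_rows.items():
        if operational:
            variants.setdefault(frozenset(roles), []).append(genome)
    return {subset: sorted(genomes) for subset, genomes in variants.items()}


for subset, genomes in operational_variants(subsystem_rows).items():
    print(sorted(subset), genomes)
```
</pre>

<p class=MsoNormal>With this data there are two operational variants: the full
set of three roles (genome_1 and genome_2) and the subset lacking role_B
(genome_3). The fragment in genome_4 does not define a variant because the
subsystem is not considered operational there.</p>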

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>We think of an <i>informal metabolic reconstruction</i><span
style='font-style:normal'> for an organism as a set of operational variants of
subsystems that are believed to exist for the organism.<span
style="mso-spacerun: yes">&nbsp; </span>In this conceptualization, one does not
have a meaningful functional hierarchy or DAG; rather, we simply have an
inventory of functional roles that are implemented in the organism, along with
the variants of subsystems that they implement.<span style="mso-spacerun:
yes">&nbsp; </span>We do believe that the task of imposing an actual hierarchy
is relatively straightforward in comparison with the effort required to
construct the set of operational variants.<span style="mso-spacerun:
yes">&nbsp;&nbsp; </span>In some contexts, we have included a functional
overview in which the subsystems are embedded at the lowest levels.<span
style="mso-spacerun: yes">&nbsp; </span>It is clear that, given a diverse
collection of informal metabolic reconstructions, appropriate functional
hierarchies can be generated with relatively few
resources.</span></p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>Our encoding of a subsystem can now be reduced to</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<ol style='margin-top:0in' start=1 type=1>
 <li class=MsoNormal style='mso-list:l2 level1 lfo1;tab-stops:list .5in'>a
     specification of a set of functional roles (this amounts to the abstract
     subsystem) and</li>
 <li class=MsoNormal style='mso-list:l2 level1 lfo1;tab-stops:list .5in'>sets
     of genes which implement the operational variants in a number of
     genomes.<span style="mso-spacerun: yes">&nbsp; </span>These genes are
     given as a <i>subsystem spreadsheet</i><span style='font-style:normal'> in
     which each row corresponds to a single genome, each column corresponds to
     a single functional role, and each cell contains the set of genes in that
     genome that are believed to implement the given functional role.<span
     style="mso-spacerun: yes">&nbsp; </span></span></li>
</ol>
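
<p class=MsoNormal>&nbsp;</p>

<p class=MsoNormal>As an illustration only, the two parts of an encoding could
be represented as follows. The role names, genome names, gene identifiers, and
the helper functions <i>assigned_functions</i> and <i>empty_cells</i> are
hypothetical placeholders; this is a sketch of the data structure, not the
format used by the project software.</p>

<pre>
```python
# Hypothetical subsystem: role names, genome names, and gene identifiers
# are placeholders, not real annotations.
functional_roles = ["role_A", "role_B", "role_C"]   # the abstract subsystem

# One row per genome, one column per functional role, and a set of genes
# in each cell (an empty set means no gene has been identified).
spreadsheet = {
    "genome_1": {
        "role_A": {"g1.101"},
        "role_B": {"g1.102", "g1.417"},
        "role_C": {"g1.103"},
    },
    "genome_2": {
        "role_A": {"g2.55"},
        "role_B": set(),
        "role_C": {"g2.57"},
    },
}


def assigned_functions(spreadsheet):
    """Invert the spreadsheet: gene -> sorted list of roles it implements."""
    functions = {}
    for row in spreadsheet.values():
        for role, genes in row.items():
            for gene in genes:
                functions.setdefault(gene, []).append(role)
    return {gene: sorted(roles) for gene, roles in functions.items()}


def empty_cells(spreadsheet, functional_roles):
    """Return (genome, role) pairs for which no gene has been identified."""
    return [
        (genome, role)
        for genome in sorted(spreadsheet)
        for role in functional_roles
        if not spreadsheet[genome].get(role)
    ]


print(assigned_functions(spreadsheet))
print(empty_cells(spreadsheet, functional_roles))
```
</pre>

<p class=MsoNormal>Inverting the spreadsheet yields the function assigned to
each gene, and listing the empty cells yields the roles for which a genome has
no identified gene.</p>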

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>The Project to Annotate 1000 Genomes amounts to an effort to
produce detailed and comprehensive encodings of several hundred subsystems,
which will impose assigned functions on genes in each of the genomes.<span
style="mso-spacerun: yes">&nbsp; </span>The total percentage of genes that can be
assigned functions this way is probably on the order of 50-70% in most genomes
(in large eukaryotic genomes the total is obviously substantially lower).<span
style="mso-spacerun: yes">&nbsp;&nbsp; </span>The percentage will grow as our
understanding grows.<span style="mso-spacerun: yes">&nbsp; </span>What should
be noted is that the accuracy of these assignments will be substantially better
than that of current assignments, and the conserved cellular machinery almost
all falls within the projected subsystems.<span style="mso-spacerun:
yes">&nbsp; </span></p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>Once we have produced our initial set of annotations, we
believe that automated pipelines and protein families are excellent tools for
propagating them.<span style="mso-spacerun: yes">&nbsp; </span>Protein families
are, in fact, a key component of annotation and provide the fundamental
mechanism for projection of function between genes. The added dimension
provided by subsystems, along with the manual curation required to develop
accurate initial encodings of subsystems, is an essential technology for
increasing the accuracy and effectiveness of protein families.<span
style="mso-spacerun: yes">&nbsp; </span>Ultimately the encoded subsystems will
be used to make incremental, essential corrections to collections of protein
families (like those supported by UniProt and COGs), and a basis for much more
accurate annotation will emerge.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>We now proceed to describe the details of the three stages.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<h2>Stage 1: Development of Initial Encodings of Subsystems</h2>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>The initial stage of the project will involve development of
approximately 100-150 subsystems that will cover most of the conserved cellular
machinery in prokaryotes (and all of the central metabolic machinery in
eukaryotes).<span style="mso-spacerun: yes">&nbsp; </span>This work will be
done largely by trained annotators who achieve a limited mastery of specific
subsystems via review articles and detailed analysis of the collection of
genomes.<span style="mso-spacerun: yes">&nbsp; </span>These individuals can
define the abstract subsystems and add most genomes to the emerging
spreadsheets, but not without error.<span style="mso-spacerun: yes">&nbsp;
</span>They are necessarily far less skilled than experts who have invested
decades in the study of specific subsystems.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>These initial subsystems will have many uses.<span
style="mso-spacerun: yes">&nbsp; </span>They can be used to enhance sets of
curated protein families, to clarify identification of gene starts, and to
develop a consistent set of annotations.<span style="mso-spacerun: yes">&nbsp;
</span>They will form the basis of informal metabolic reconstructions, and will
be used to support the development of formal metabolic reconstructions.<span
style="mso-spacerun: yes">&nbsp; </span>However, given the relative lack of
expertise of these initial annotators and the fact that they will seldom have
access to the wet lab facilities needed to remove ambiguities in assignments,
errors will inevitably remain.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<h2>Stage 2: The Use of True Experts and the Wet Lab to Refine the Encodings</h2>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>The second stage will involve the gradual refinement and
enhancement of the original subsystem encodings by domain experts.<span
style="mso-spacerun: yes">&nbsp; </span>Almost every subsystem spreadsheet
makes it clear that numerous detailed questions remain to be answered.<span
style="mso-spacerun: yes">&nbsp;&nbsp; </span>These questions relate to
correcting gene calls, correcting frameshifts, refining function assignments,
and removing ambiguities (either via bioinformatics-based analysis or through
actual wet lab efforts).</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>The participation of domain experts will be critical, but it
seems most likely that only a relatively small set will choose to get involved
until the utility of the approach becomes obvious.<span style="mso-spacerun:
yes">&nbsp; </span>We already have some domain experts (in translation,
transcription, and<span style="mso-spacerun: yes">&nbsp; </span>a limited
number of metabolic subsystems) participating in the effort.<span
style="mso-spacerun: yes">&nbsp; </span>We believe that this number will grow
rapidly over the next 2-3 years.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>It should be emphasized that upon completion of Stage 2 we
will have accurate annotations and a solid foundation for the construction of
stoichiometric matrices.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<h2>Stage 3: Understanding the Evolutionary History of the Genes within the
Subsystem</h2>

<p class=MsoNormal><b><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></b></p>

<p class=MsoNormal>The third stage involves determination of the evolutionary
history of the genes within the subsystem.<span style="mso-spacerun:
yes">&nbsp; </span>To understand what this involves and the utility of this
type of analysis, we recommend two papers by the team led by Roy
Jensen:</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<ol style='margin-top:0in' start=1 type=1>
 <li class=MsoNormal style='mso-list:l0 level1 lfo2;tab-stops:list .5in'><b>Ancient
     origin of the tryptophan operon and the dynamics of evolutionary change</b><span
     style='font-weight:normal'> by Xie, Keyhani, Bonner, Jensen, Microbiol Mol
     Biol Rev. 2003 Sep;67(3):303-42</span></li>
 <li class=MsoNormal style='mso-list:l0 level1 lfo2;tab-stops:list .5in'><b>Inter-genomic
     displacement via lateral transfer of bacterial <i>trp</i></b><span
     style='font-style:normal'><b> operons in an overall context of vertical
     genealogy, </b></span>by Xie, Song, Keyhani, Bonner, Jensen, BMC Biology,
     2004, 2:15</li>
</ol>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>These papers elegantly display the exact style of analysis
required to uncover and clarify the evolutionary history of the relevant
genes.<span style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp; </span>Essentially,
trees must be built containing all of the genes implementing each specific
functional role (multiple trees may be needed for distinct forms).<span
style="mso-spacerun: yes">&nbsp; </span>Those trees that display a common
topology indicate which columns in the spreadsheet can be used to infer the
most probable vertical<span style="mso-spacerun: yes">&nbsp; </span>history of
the subsystem.<span style="mso-spacerun: yes">&nbsp; </span>Once the overall
history has been clarified, it becomes possible to attempt clarification of
horizontal transfers, to reconstruct the history of clusters on the chromosome,
and in some cases to tie the analysis to regulatory issues.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>The effort required to do this style of analysis well is
high.<span style="mso-spacerun: yes">&nbsp; </span>While we expect the initial
efforts to go slowly, we also expect experience and advances in tools to
dramatically reduce the required effort.<span style="mso-spacerun: yes">&nbsp;
</span>In any event, it is clear that this stage will not be completed in the
next few years, but will undoubtedly stimulate large amounts of related
research. </p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<h2>Filling in the Missing Pieces</h2>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>The encoded subsystems produced by the Project to Annotate
1000 Genomes offer a detailed picture of exactly what components have been
identified and are present in each genome.<span style="mso-spacerun:
yes">&nbsp; </span>Perhaps as significant, they vividly display exactly what is
missing or ambiguous, allowing one to arrive at an accurate inventory of gaps
in our understanding. </p>

<p class=MsoNormal>The issue of how best to address these gaps is an integral
part of the project.<span style="mso-spacerun: yes">&nbsp; </span>The
technology that is emerging is what we refer to as the <i>bioinformatics-driven
wet lab</i><span style='font-style:normal'>.<span style="mso-spacerun:
yes">&nbsp; </span>This concept refers to the development of a wet lab that
utilizes conventional biochemical and genetic techniques in a framework designed
to maximize the overall number of confirmations.<span style="mso-spacerun:
yes">&nbsp; </span>It is driven by predictions arising from the analysis of
subsystems, and it targets a prioritized list of conjectures.<span
style="mso-spacerun: yes">&nbsp; </span>That is, the explicit goal is to fill
in as many gaps and remove as many ambiguities as possible per unit of resources
consumed.</span></p>
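
<p class=MsoNormal>&nbsp;</p>

<p class=MsoNormal>One way such a prioritized list might be ordered is sketched
below. The scoring rule (estimated probability of confirmation times the number
of gaps a confirmation would fill, divided by cost) is our assumption for
illustration, not a method specified by the project, and every conjecture name
and number is invented.</p>

<pre>
```python
# Hypothetical conjectures: every name, probability, gap count, and cost
# below is invented to illustrate the ranking.
conjectures = [
    {"name": "conjecture_1", "p_confirm": 0.8, "gaps_filled": 12, "cost": 4.0},
    {"name": "conjecture_2", "p_confirm": 0.5, "gaps_filled": 30, "cost": 5.0},
    {"name": "conjecture_3", "p_confirm": 0.9, "gaps_filled": 2, "cost": 3.0},
]


def expected_gaps_per_cost(conjecture):
    """Expected number of gaps filled per unit of resources consumed
    (an assumed scoring rule, not one taken from the project)."""
    return (conjecture["p_confirm"] * conjecture["gaps_filled"]
            / conjecture["cost"])


def prioritize(conjectures):
    """Order conjectures from highest to lowest expected payoff per cost."""
    return sorted(conjectures, key=expected_gaps_per_cost, reverse=True)


for c in prioritize(conjectures):
    print(c["name"], round(expected_gaps_per_cost(c), 2))
```
</pre>

<p class=MsoNormal>Under this rule a conjecture that is less certain but would
resolve the same gap in many genomes can outrank a near-certain conjecture that
affects only a few.</p>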

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal style='mso-pagination:none;tab-stops:28.0pt 56.0pt 84.0pt 112.0pt 140.0pt 168.0pt 196.0pt 224.0pt 3.5in 280.0pt 308.0pt 336.0pt;
mso-layout-grid-align:none;text-autospace:none'>Although it is inconceivable
that one experimental group would be able to assess all of the functional
predictions, we believe that integrating an experimental component into our
annotation/modeling effort will directly support our main goal.<span
style="mso-spacerun: yes">&nbsp; </span>In addition to verification of key
predictions and removal of central ambiguities, it will validate the overall
approach and set an example for other groups worldwide.<span style='font-family:
Helvetica;color:black'> </span></p>

<p class=MsoNormal><b><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></b></p>

<h1>The Project to Produce 1000 Stoichiometric Matrices</h1>

<p class=MsoNormal><span style='font-size:14.0pt'><b><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></b></span></p>

<p class=MsoNormal>We believe that the informal metabolic reconstructions are
of substantial value by themselves.<span style="mso-spacerun:
yes">&nbsp;&nbsp;&nbsp; </span>Indeed, numerous applications are quite
obvious.<span style="mso-spacerun: yes">&nbsp; </span>However, they are not
enough to support quantitative modeling.<span style="mso-spacerun:
yes">&nbsp;&nbsp;&nbsp; </span>Whole genome modeling will require development
of stoichiometric matrices, an effort that will pay many dividends.<span
style="mso-spacerun: yes">&nbsp; </span>The most immediate payout is as quality
control on the informal metabolic reconstruction.<span style="mso-spacerun:
yes">&nbsp;&nbsp; </span>Just as the use of subsystems imposes a critical set
of consistency checks on the assignment of function to genes, an attempt to
develop an internally consistent reaction network imposes a strong consistency
check on both the annotations and assertions of the presence of specific<span
style="mso-spacerun: yes">&nbsp; </span>subsystems. </p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>Over the last 4-5 years, the success of stoichiometric
modeling has set the stage for large-scale employment of the technology.<span
style="mso-spacerun: yes">&nbsp;&nbsp; </span>The key limiting factor is the
development of the stoichiometric matrix itself.<span style="mso-spacerun:
yes">&nbsp; </span>This is a time-consuming task that frequently requires on
the order of a year for a skilled practitioner.<span style="mso-spacerun:
yes">&nbsp; </span>Many actual modeling efforts have foundered on just the
technical difficulties in producing this basic datum.<span style="mso-spacerun:
yes">&nbsp; </span>Bernhard Palsson has pioneered much of the key research that
has led to the recent successes.<span style="mso-spacerun: yes">&nbsp;
</span>At considerable effort, his team has built a small number of these
stoichiometric matrices, iteratively improving their accuracy.<span
style="mso-spacerun: yes">&nbsp; </span>They have successfully used these
matrices to support initial modeling efforts on the corresponding organisms,
and the results have
gained international recognition.<span style="mso-spacerun: yes">&nbsp; </span></p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>Palsson's team originated <b>The Project to Produce 1000
Stoichiometric Matrices</b><span style='font-weight:normal'>, and they will
play the lead role in converting the informal metabolic reconstructions into
formal reconstructions and producing the matrices.<span style="mso-spacerun:
yes">&nbsp;&nbsp; </span>The team at FIG and Argonne National Laboratory will
participate in the effort, coordinating closely with Palsson's team.<span
style="mso-spacerun: yes">&nbsp;&nbsp; </span>At this point, the Palsson team
and the teams at FIG, ANL, and The Burnham Institute are all working on issues
relating to tools to automate the generation of matrices from informal
metabolic reconstructions.</span></p>

<p class=MsoNormal><span style="mso-spacerun: yes">&nbsp;</span><span
style='font-size:14.0pt'><b><o:p></o:p></b></span></p>

<h1>The Participants</h1>

<p class=MsoNormal><span style='font-size:14.0pt'><b><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></b></span></p>

<p class=MsoNormal>We expect participants in both projects from many
institutions worldwide, probably with both academic and commercial
interests.<span style="mso-spacerun: yes">&nbsp;&nbsp; </span>Initially, it is
likely that the effort will be led from FIG, ANL and Palsson's team at
UCSD.<span style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp; </span>We are planning
on Roy Jensen playing a role relating to quality control and development of
tools to support Stage 3 analysis.<span style="mso-spacerun: yes">&nbsp;
</span>Andrei Osterman from the Burnham Institute will lead wet lab efforts to
challenge <i>in silico</i><span style='font-style:normal'> predictions.</span></p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>If the effort is successful, we hope it will stimulate
numerous research efforts worldwide, and we welcome broad participation.<span
style="mso-spacerun: yes">&nbsp; </span>Ultimately, we expect leadership and
participation to broaden rapidly.</p>

<p class=MsoNormal><span style='font-size:14.0pt'><b><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></b></span></p>

<h3><b>A Proposed Schedule<o:p></o:p></b></h3>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>Let us begin by estimating the point at which 1000 genomes
will become available.<span style="mso-spacerun: yes">&nbsp; </span>One simple
approach would go as follows:</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<ol style='margin-top:0in' start=1 type=1>
 <li class=MsoNormal style='mso-list:l1 level1 lfo3;tab-stops:list .5in'>The
     number of genomes will double approximately every 18 months.</li>
 <li class=MsoNormal style='mso-list:l1 level1 lfo3;tab-stops:list .5in'>We now
     have about 300 more-or-less complete genomes.</li>
 <li class=MsoNormal style='mso-list:l1 level1 lfo3;tab-stops:list .5in'>Therefore,
     we should have approximately 1000 genomes in just a bit under 3 years (by
     sometime in 2007).</li>
</ol>
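
<p class=MsoNormal>The arithmetic behind this estimate can be checked with a
short Python sketch. It assumes steady exponential growth at the stated
doubling time; the function name years_to_reach is hypothetical and used only
for illustration.</p>

<pre>
```python
import math


def years_to_reach(current, target, doubling_months=18.0):
    """Years until `current` grows to `target` under steady doubling."""
    # Number of doublings needed to get from current to target.
    doublings = math.log2(target / current)
    return doublings * doubling_months / 12.0


years = years_to_reach(300, 1000)
print(f"{years:.2f} years")
```
</pre>

<p class=MsoNormal>With 300 genomes and an 18-month doubling time, the result
is about 2.6 years, consistent with "just a bit under 3 years".</p>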

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>There is a great deal in this analysis that is far from
certain.<span style="mso-spacerun: yes">&nbsp; </span>However, let us use this
estimate as a working hypothesis.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal><b>2005<o:p></o:p></b></p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>During 2005, Stage 1 will be completed for the vast majority
of subsystems.<span style="mso-spacerun: yes">&nbsp; </span>Stage 2 will be
initiated for 30-50 subsystems.<span style="mso-spacerun: yes">&nbsp;
</span>Fewer than 10 will move deeply into Stage 3.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>We will actively attempt to produce 10-15 stoichiometric
matrices.<span style="mso-spacerun: yes">&nbsp; </span>We will focus on diverse
organisms of interest to DOE and a set of gram-positive pathogens.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>We will begin a detailed review for quality assurance by a
small number of expert biochemists and microbiologists.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal style='mso-pagination:none;tab-stops:28.0pt 56.0pt 84.0pt 112.0pt 140.0pt 168.0pt 196.0pt 224.0pt 3.5in 280.0pt 308.0pt 336.0pt;
mso-layout-grid-align:none;text-autospace:none'>We expect wet lab confirmations
to begin, but this is one area in which funding plays an essential role.<span
style="mso-spacerun: yes">&nbsp; </span>We expect funding to support targeted
confirmation/rejection of the numerous conjectures arising from the
bioinformatics to begin in 2005-2006.<span style="mso-spacerun: yes">&nbsp;
</span>It is possible to fairly accurately predict the potential flow of
confirmations, but we cannot predict available funding. We believe that the
bioinformatics-driven wet lab, in which conjectures are prioritized and
grouped, <span style='color:black'>would allow a relatively small group (3-4
postdocs and technicians) to characterize up to 50 novel gene families encoding
the most important functional roles in central metabolic subsystems of diverse
organisms per year.<o:p></o:p></span></p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal><b>2006<o:p></o:p></b></p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>During 2006,<span style="mso-spacerun: yes">&nbsp;
</span>the vast majority of subsystems will enter Stage 2.<span
style="mso-spacerun: yes">&nbsp; </span>We will attempt to move a large number
into Stage 3 (this is truly difficult to predict; it depends hugely on success
with the early attempts, our ability to reduce the required effort, and the
research aims of the participants).</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>We would plan on completing at least 200 more stoichiometric
matrices.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>If the wet lab component of the effort is fully functional,
we would expect a steady stream of confirmations, and (based on our past
experience) we would project that roughly 75-90% of the tested conjectures will
be validated.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal><b>2007<o:p></o:p></b></p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>During 2007 we would plan on pushing Stage 2 and 3 analysis
as far as possible.<span style="mso-spacerun: yes">&nbsp; </span>We believe
that we will have the subsystems needed to cover the vast majority of well
understood subsystems and many that are not well understood.<span
style="mso-spacerun: yes">&nbsp;&nbsp; </span></p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>We would plan on completing initial stoichiometric matrices
for several hundred more genomes. <span style="mso-spacerun:
yes">&nbsp;</span>Since the majority of the genomes will not become available
until 2007, many of the stoichiometric matrices will necessarily not be
reasonably complete before sometime in 2008 or 2009.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>If the wet lab component of the effort is fully functional,
we would expect the stream of successful conjectures to stimulate numerous labs
to join the effort.<span style="mso-spacerun: yes">&nbsp; </span>Ultimately,
the role of the wet lab component that is tightly coupled to the project is to
demonstrate the huge improvement in efficiency that can be attained by coupling
the wet lab effort to well-chosen, targeted conjectures generated from the
subsystems.<span style="mso-spacerun: yes">&nbsp; </span></p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<h1>A Short Note on the Analysis of Environmental Samples</h1>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>It is becoming clear that analysis of environmental samples
will become increasingly significant. <span style="mso-spacerun:
yes">&nbsp;&nbsp;</span>Consider a framework in which we have 1000 genomes and
detailed informal metabolic reconstructions for all of them.<span
style="mso-spacerun: yes">&nbsp; </span>We believe that, given a substantial
environmental sample,</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<ol style='margin-top:0in' start=1 type=1>
 <li class=MsoNormal style='mso-list:l3 level1 lfo4;tab-stops:list .5in'>it
     will be possible to produce accurate estimates of which organisms are present
     (where an "organism" in this context should probably be viewed as "some
     organism within a very constrained phylogenetic neighborhood"),</li>
 <li class=MsoNormal style='mso-list:l3 level1 lfo4;tab-stops:list .5in'>it
     will be possible to produce fairly precise estimates of the metabolism of
     the organisms believed to be present, and</li>
 <li class=MsoNormal style='mso-list:l3 level1 lfo4;tab-stops:list .5in'>it
     will be possible to compare the predicted metabolism with the actual
     enzymes detected in the environmental sample.</li>
</ol>
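
<p class=MsoNormal>The second and third steps can be sketched in a few lines
of Python. This is a toy illustration under stated assumptions: the genome and
role names are placeholders, the list of organisms believed to be present is
taken as given (step 1 is not modeled), and the function name compare_sample
is hypothetical.</p>

<pre>
```python
# Hypothetical reference data: each genome maps to the set of functional
# roles in its metabolic reconstruction. All names are placeholders.
reference_roles = {
    "genome_A": {"role_1", "role_2", "role_3"},
    "genome_B": {"role_2", "role_4"},
    "genome_C": {"role_5", "role_6"},
}


def compare_sample(organisms_present, detected_roles, reference_roles):
    """Compare the predicted metabolism of the organisms believed to be
    present with the roles actually detected in the sample."""
    # Step 2: predicted metabolism is the union of the roles of each organism.
    predicted = set()
    for organism in organisms_present:
        predicted.update(reference_roles[organism])
    # Step 3: split the roles into agreement and the two kinds of mismatch.
    return {
        "confirmed": predicted.intersection(detected_roles),
        "predicted_not_detected": predicted.difference(detected_roles),
        "detected_not_predicted": detected_roles.difference(predicted),
    }


result = compare_sample(
    ["genome_A", "genome_B"],
    {"role_1", "role_2", "role_4", "role_7"},
    reference_roles,
)
for label, roles in result.items():
    print(label, sorted(roles))
```
</pre>

<p class=MsoNormal>Roles that are detected but not predicted point to
organisms or functions missing from the reference collection; roles that are
predicted but not detected point to incomplete sampling or to an incorrect
estimate of which organisms are present.</p>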

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>The hope is clearly that we will be able to make accurate
estimates, given 1000 well-annotated genomes.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<h1>Summary</h1>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>The value of a collection of 1000 genomes depends directly
on the quality of the annotations, the corresponding metabolic reconstructions,
and the extent to which the foundations of modeling have been established.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>The Project to Annotate 1000 Genomes is based directly on the
notion of building a collection of carefully created and curated
subsystems.<span style="mso-spacerun: yes">&nbsp; </span>The fact that the
individuals who encode these subsystems annotate the same subsystem over a
broad collection of genomes allows them to gain an understanding of detailed
variation and at least a minimal grasp of the review literature.<span
style="mso-spacerun: yes">&nbsp; </span>They will be annotating genes for which
they develop some detailed familiarity.<span style="mso-spacerun: yes">&nbsp;
</span>We contrast this approach with the existing approaches, in which
individuals annotate complete genomes (assuring an almost complete
lack of familiarity with the majority of genes being annotated) and automated
pipelines are badly limited by the ambiguities and errors in existing
annotations.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>The Project to Produce 1000 Stoichiometric Matrices has the
potential to lay the foundations for quantitative modeling.<span
style="mso-spacerun: yes">&nbsp; </span>Many, if not most, existing modeling
efforts are seriously hampered by the fact that very few
stoichiometric matrices now exist,
and the cost of developing more using existing approaches is quite high.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>The development of a wet lab component that challenges a
carefully prioritized set of conjectures flowing from both the subsystems
analysis and the initial quantitative modeling is
essential.<span style="mso-spacerun: yes">&nbsp; </span>It will confirm the relative
efficiency of this approach (which might reasonably be characterized as
"picking the low-hanging fruit"), and in the process establish a paradigm that
directly challenges the more common approach to establishing priorities.</p>

<p class=MsoNormal><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

<p class=MsoNormal>We claim to understand the key technology needed for
high-throughput development of annotations, metabolic reconstructions, and
stoichiometric matrices.<span style="mso-spacerun: yes">&nbsp; </span>By the
summer of 2005, this should be completely obvious.</p>

</div>

</body>

</html>
