
Annotation of /FigTutorial/TheCycle.html



Revision 1.1

1 : overbeek 1.1 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
2 :     <html><head>
3 :     <meta content="text/html; charset=ISO-8859-1" http-equiv="content-type"><title>The Importance of the Annotation Cycle</title>
4 :    
5 :     </head>
6 :     <body>
7 :     <h1 style="text-align: center;">Reflections on Accurate Annotations:</h1>
8 :     <div style="text-align: center;">
9 :     <h2>The Basic Cycle and Its Significance<br>
10 :     </h2>
11 :     </div>
12 :     <h2 style="text-align: center;">by Ross Overbeek</h2>
13 :     <br>
14 :     I have recently been reflecting on the status of the&nbsp;<span style="font-weight: bold;">Project to Annotate 1000 Genomes,
15 :     </span>and in this short essay I will argue that it has been an
16 :     overwhelming success for reasons that became apparent only as the
17 :     project progressed. &nbsp;&nbsp; A
18 :     thousand more-or-less complete genomes now exist, a framework for
19 :     rapidly annotating new genomes with remarkable accuracy is now
20 :     functioning, and we are on the verge of another major shift in the
21 :     world of annotations. &nbsp; This reflection is based on an
22 :     informal note that I sent to friends on the last day of 2007, but my
23 :     thoughts have clarified somewhat since then.<br>
24 :     <br>
25 :     <h3>The Production of Accurate Annotations</h3>
26 :     The efforts required to establish a framework for high-volume, accurate
27 :     annotation are substantial. &nbsp;I believe that it is important
28 :     that we reflect on what we have learned about the factors that
29 :     determine productivity. &nbsp;So, what have we learned from the
30 :     project?<br>
31 :     <br>
32 :     First, <span style="font-weight: bold;">subsystem-based
33 :     annotation</span> <span style="font-weight: bold;">is
34 :     the key to accuracy.</span> &nbsp; While numerous efforts
35 :     still focus on annotating a single genome, two principles are
36 :     widely accepted at this stage: that comparative analysis is the
37 :     key to everything, and that accuracy comes from focusing on the
38 :     variations of a single component of cellular machinery as they
39 :     are manifested over the entire collection of existing genomes.
40 :     &nbsp; <span style="font-weight: bold;">Manually-based subsystem creation
41 :     and maintenance is the rate-limiting component</span> of
42 :     successful annotation efforts, and the factors that constrain this
43 :     process are at the heart of the matter. &nbsp;We have understood
44 :     this for some time now. &nbsp;<br>
45 :     <br>
46 :     However, I am going to argue a new position in this short essay: <br>
47 :     <br>
48 :     <ol>
49 :     <li>There are three distinct components that make up our
50 :     strategy for rapid accurate annotation: <span style="font-weight: bold;">subsystems-based annotation</span>,
51 :     <span style="font-weight: bold;">FIGfams</span>
52 :     as a framework for propagating the subsystems annotations, and <span style="font-weight: bold;">RAST</span> as a technology
53 :     for using FIGfams and subsystems to consistently propagate annotations
54 :     to newly-sequenced genomes.</li>
55 :     <br>
56 :     <li>These three components form a cycle (subsystems =&gt;
57 :     FIGfams =&gt; RAST technology =&gt; subsystems). &nbsp;This
58 :     cycle creates a feedback that rapidly accelerates the productivity
59 :     achievable in all three components. &nbsp;Further, failure in any
60 :     of these components impairs productivity dramatically in the others.
61 :     &nbsp;Understanding this cycle will be the key to supporting higher
62 :     productivity in&nbsp; subsystem maintenance and creation.</li>
63 :     <br>
64 :     <li>To understand the dependencies, we need to consider each of
65 :     the components:</li>
66 :     <ul>
67 :     <br>
68 :     <li><span style="font-weight: bold;">The key to
69 :     accurate FIGfam creation and maintenance is to couple it directly to
70 :     subsystem maintenance</span>.
71 :     &nbsp;Once the initial release
72 :     of the FIGfams was created, updates&nbsp;occur
73 :     automatically
74 :     based
75 :     on changes in the subsystem collection. &nbsp;Thus, FIGfams are
76 :     automatically split, merged&nbsp;and added as the subsystem
77 :     collection is maintained. &nbsp;There remains one area of
78 :     substantial cost in FIGfam development -- creation of family-dependent
79 :     decision procedures that are occasionally needed to achieve the
80 :     required accuracy. &nbsp;At this point we have approximately 10,000
81 :     subsystem-based FIGfams, although the overall collection contains over
82 :     100,000 families (the majority containing only 2-3 members).</li>
83 :     <br>
84 :     <li><span style="font-weight: bold;">RAST has a
85 :     central dependency on FIGfams</span> for assertion of function to
86 :     newly-recognized genes. &nbsp;In this sense, the main dependency of
87 :     RAST is on the FIGfam collection. &nbsp;The more accurate the
88 :     FIGfams and their associated decision procedures, the more accurate the
89 :     assignments of function made to genes in genomes processed by RAST.</li>
90 :     <br>
91 :     <li>Finally, the central costs of maintenance of subsystems
92 :     are cleaning up errors in existing subsystems (often indicated by
93 :     multiple genes having the same function) and adding new genomes to
94 :     existing subsystems. &nbsp;Once a subsystem has reached an
95 :     acceptable level of accuracy (and many are not there yet), <span style="font-weight: bold;">the central cost is integration
96 :     of new genomes after annotation by RAST.</span> &nbsp;The
97 :     speed with which new genomes can be added depends on how well RAST
98 :     assigns gene function (and, secondarily, on how accurately these
99 :     RAST-based annotations can be used to &nbsp;infer operational
100 :     variants of subsystems).</li>
101 :     </ul>
102 :     <br>
103 :     <li>The main costs of increasing the speed and accuracy of
104 :     annotations split into two categories: those relating to maintenance of
105 :     existing subsystems, and those relating to generation of new
106 :     subsystems.
107 :     The maintenance costs are containable, if the cycle is established and
108 :     functions smoothly. &nbsp;Otherwise, I suspect they inevitably grow
109 :     rapidly.</li>
110 :     </ol>
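One of the error signals mentioned in the list above, multiple genes in a genome carrying the same function, is easy to check mechanically. The following is a minimal Python sketch, assuming annotations are available as (genome, gene, functional role) triples. The identifiers, the data layout, and the function name are hypothetical and are not taken from the SEED code. A duplicate is only a hint for a curator, since genuine paralogs also occur.

```python
from collections import defaultdict

def duplicated_roles(annotations):
    """Return {(genome, role): [genes]} for roles assigned to more than one gene."""
    genes_by_role = defaultdict(list)
    for genome, gene, role in annotations:
        genes_by_role[(genome, role)].append(gene)
    # Keep only the roles that occur on two or more genes of the same genome.
    return {key: genes for key, genes in genes_by_role.items() if len(genes) > 1}

# Made-up example rows: (genome, gene, functional role)
annotations = [
    ("genome_1", "gene_10", "Thymidylate synthase (EC 2.1.1.45)"),
    ("genome_1", "gene_42", "Thymidylate synthase (EC 2.1.1.45)"),
    ("genome_1", "gene_77", "Dihydrofolate reductase (EC 1.5.1.3)"),
    ("genome_2", "gene_05", "Thymidylate synthase (EC 2.1.1.45)"),
]

suspects = duplicated_roles(annotations)
for (genome, role), genes in suspects.items():
    print(genome, role, genes)
```

In this toy data only genome_1 is flagged, because the same role in genome_2 appears on a single gene.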
111 :     Let me begin by depicting the cycle pictorially:<br>
112 :     <br>
113 :     <br>
114 :     <img src="cycle_files_image003.png" v:shapes="_x0000_i1025" height="647" width="810"><br>
115 :     <br>
116 :     <br>
117 :     I have argued that achieving rapid, accurate annotations
118 :     is
119 :     limited by the rate at which subsystems can be maintained and created.
120 :     &nbsp;I place the maintenance ahead of creation at this stage.
121 :     &nbsp;As the collection grows (it now contains over 600
122 :     subsystems with over 6800 distinct functional roles), costs of
123 :     maintenance will tend to dominate. &nbsp;The
124 :     creation of new subsystems will always be a critical activity, but each
125 :     new subsystem will impact smaller sets of genomes as we "move into the
126 :     tail of the distribution". &nbsp;<br>
127 :     <br>
128 :     The costs relating to subsystem maintenance, which will quickly
129 :     dominate, depend critically on how smoothly the cycle I described
130 :     functions. &nbsp;We have just established the complete cycle.<br>
131 :     <br>
132 :     The two central costs that cannot be avoided will be creation of
133 :     FIGfam-dependent decision procedures and the creation of new
134 :     subsystems. &nbsp;The manual work on FIGfams will be necessary to
135 :     achieve near-100% accuracy on annotation of seriously ambiguous
136 :     paralogs. &nbsp;However, in the vast majority of cases, this effort
137 :     will be restricted to specific curators who are willing to spend
138 :     massive effort to get things perfect. &nbsp; The more central cost
139 :     relates to manual curation of the subsystems.&nbsp;
140 :     <br>
141 :     <h3>More Effective Integration of Existing Annotation Efforts</h3>
142 :     In the section above, I reflected on the cycle that we shall depend
143 :     upon for supporting increased volume and accuracy of our own efforts.
144 :     &nbsp;Other groups are certainly experimenting with their own
145 :     solutions, and in some cases with clear successes. &nbsp;I have no
146 :     desire to rate these competing efforts. &nbsp; I sincerely believe
147 :     that cooperative activity is the key to
148 :     enhanced achievements by everyone. &nbsp;However, effective
149 :     cooperation is often elusive. &nbsp;I think that we have put in
150 :     place an extremely important mechanism for making cooperation much
151 :     easier, and the benefits more compelling.<br>
152 :     <br>
153 :     Anyone working for one of the main annotation efforts realizes that it
154 :     is not easy to really benefit from access to the annotation efforts of
155 :     other groups. &nbsp;The efforts required to characterize
156 :     discrepancies between local annotations and those produced externally
157 :     often outweigh any benefits that result.<br>
158 :     <br>
159 :     Two events of major importance have occurred:<br>
160 :     <br>
161 :     <ol>
162 :     <li>Both PIR and the SEED Project decided to build
163 :     correspondences between IDs used by different annotation projects.
164 :     &nbsp;The PIR effort produced <a href="http://pir.georgetown.edu/pirwww/iprolink/biothesaurus.shtml">BioThesaurus</a>
165 :     and the SEED effort produced <a href="http://clearinghouse.nmpdr.org/aclh.cgi">the
166 :     Annotation Clearing House</a>. &nbsp;The fact that it will
167 :     become trivial to reconcile IDs between the different annotation
168 :     efforts will undoubtedly support rapid increases in cross-linking
169 :     entries. &nbsp;The SEED is working with UniProt to cross-link
170 :     proteins from all of our complete genomes, and I am sure similar
171 :     efforts are happening between the other major annotation efforts.</li>
172 :     <br>
173 :     <li>Within the Annotation Clearing House, a project to allow
174 :     experts to assert that specific annotations are reliable (using
175 :     whatever IDs they wish) has been initiated. &nbsp;This has led to
176 :     many tens of thousands of assertions that specific annotations are
177 :     highly reliable. &nbsp;PIR is preparing a list of assertions that
178 :     they consider highly reliable, and both institutions are making these
179 :     lists openly available.</li>
180 :     </ol>
181 :     <br>
182 :     To see the utility of exchanging expert assertions in a framework in
183 :     which it is easy to compare the results, let me describe how we intend
184 :     to use these assertions:<br>
185 :     <br>
186 :     <ol>
187 :     <li>We begin with a 3-column table of reliable annotations
188 :     containing <span style="font-style: italic;">[ProteinID,AssertedFunction,IDofExpert]</span></li>
189 :     <br>
190 :     <li>We
191 :     then take our IDs and construct a 2-column table <span style="font-style: italic;">[FIG-function,AssertedFunction].</span>
192 :     &nbsp;This table gives a correspondence between each of our <span style="font-style: italic;">functional roles</span>
193 :     and the functional roles used by the expert making the assertion of
194 :     reliability.</li>
195 :     <br>
196 :     <li>Then, we go through this correspondence table (using both
197 :     tools and manual inspection) and split it into one set in which we
198 :     believe both columns are essentially identical and a second set that we
199 :     believe represents errors (either our own or those of the expert
200 :     asserting reliability). &nbsp;We anticipate that in most cases the
201 :     expert assertion will be accurate, which is what makes this exercise so
202 :     beneficial to us.</li>
203 :     <br>
204 :     <li>We take the table of "essentially the same" assertions and
205 :     distribute it as a table of synonyms (which we consider to be a very
206 :     useful resource).</li>
207 :     </ol>
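The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table rows, the fig_function_of lookup, and the normalize comparison (case- and punctuation-insensitive string matching) are hypothetical stand-ins. The real split relies on both tools and manual inspection, which no simple string test replaces.

```python
import re

# Step 1: 3-column table [ProteinID, AssertedFunction, IDofExpert].
# All rows are made-up examples.
expert_assertions = [
    ("prot|0001", "Thymidylate synthase (EC 2.1.1.45)", "expert_A"),
    ("prot|0002", "thymidylate synthase (ec 2.1.1.45)", "expert_B"),
    ("prot|0003", "Hypothetical protein", "expert_A"),
]

# Hypothetical lookup from protein ID to our own functional role.
fig_function_of = {
    "prot|0001": "Thymidylate synthase (EC 2.1.1.45)",
    "prot|0002": "Thymidylate synthase (EC 2.1.1.45)",
    "prot|0003": "Dihydrofolate reductase (EC 1.5.1.3)",
}

def normalize(function):
    """Lowercase and collapse punctuation; a crude stand-in for curation tools."""
    return re.sub(r"[^a-z0-9]+", " ", function.lower()).strip()

def build_correspondence(assertions, lookup):
    """Step 2: build the 2-column table [FIG-function, AssertedFunction]."""
    pairs = set()
    for protein_id, asserted, _expert in assertions:
        if protein_id in lookup:
            pairs.add((lookup[protein_id], asserted))
    return sorted(pairs)

def split_correspondence(pairs):
    """Step 3: separate essentially identical pairs from pairs needing review."""
    same, review = [], []
    for fig_function, asserted in pairs:
        if normalize(fig_function) == normalize(asserted):
            same.append((fig_function, asserted))
        else:
            review.append((fig_function, asserted))
    return same, review

pairs = build_correspondence(expert_assertions, fig_function_of)
synonyms, to_review = split_correspondence(pairs)
# Step 4: `synonyms` is the table that would be distributed;
# `to_review` holds the candidate errors (ours or the expert's).
```

Keeping the review set separate from the synonym set mirrors the split described in step 3: only the pairs judged essentially the same are published, while the remainder goes to a curator.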
208 :     <br>
209 :     We are strongly motivated to resolve differences between our
210 :     annotations and high-reliability assertions made by experts.
211 :     &nbsp;The production of the table of synonyms both reduces the
212 :     effort to redo such a comparison in the future and is a major
213 :     asset by itself. &nbsp;I am confident that any serious annotation
214 :     group that participates will benefit, and I believe that these
215 :     exchanges will accelerate in 2008 and 2009.<br>
216 :     <br>
217 :     <h2>Summary</h2>
218 :     I have tried to express the significance of the cycle depicted above,
219 :     but I think that I failed to really convey the epiphany, so let me
220 :     end by expressing it somewhat more emphatically. &nbsp;I believe
221 :     that there will be a very rapid acceleration in the sequencing of new,
222 :     complete genomes (although frequently the quality of the sequence will
223 :     be far from perfect, and I am willing to say that a genome in 100
224 :     contigs is "essentially complete"). &nbsp;Groups that now try to
225 :     provide accurate integrations of all (or most) complete genomes will be
226 :     strained heavily. &nbsp;The tendency will be to go in one of two
227 :     directions:<br>
228 :     <br>
229 :     <ol>
230 :     <li>Some will swing to completely automated approaches.
231 :     &nbsp;This will result in rapid propagation of errors (for those
232 :     portions of the cellular mechanisms that are not yet accurately
233 :     characterized -- which is quite a bit).</li>
234 :     <br>
235 :     <li>Others will give up any attempt at comprehensive annotation
236 :     and focus on accurate annotation of a slowly growing subset.</li>
237 :     </ol>
238 :     The problem with the second approach is that accurate annotation of new
239 :     cellular mechanisms (i.e., the introduction of new subsystems) will
240 :     increasingly depend on a comprehensive set of genomes (comparative
241 :     analysis is central to working out any of the serious difficulties, and
242 :     the larger the set of accurately annotated genomes, the better the
243 :     framework for careful correction).<br>
244 :     <br>
245 :     The cycle depicted above is the only viable strategy that I know of to
246 :     handle the deluge of genomes accurately. &nbsp;I claim that as time
247 :     goes by, the SEED effort to implement the above cycle will emerge in a
248 :     continuously strengthening position. &nbsp;Other groups will be
249 :     forced to rapidly copy it, but it really was not that easy to
250 :     establish, and I believe the odds are that the SEED effort will be the
251 :     only group standing in 2-3 years (i.e., it will be the only group
252 :     claiming both accuracy and comprehensive integration).<br>
253 :     <br>
254 :     <br>
255 :     </body></html>
