<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head><meta content="text/html; charset=ISO-8859-1" http-equiv="content-type"><title>How_to_annotate_a_genome</title></head><body><h1 style="text-align: center;">How to Annotate a Genome</h1><div style="text-align: center;"><h3>by Ross Overbeek</h3><br><div style="text-align: left;">We at the Fellowship for Interpretation of Genomes (FIG) have actively led the <span style="font-weight: bold;">Project to Annotate 1000 Genomes</span> since its inception in 2003. &nbsp;In that effort we pioneered what we called the <span style="font-weight: bold;">subsystems approach to annotation</span>,
in which experts annotated a single subsystem across the entire set of
genomes. &nbsp;This was a radically different approach from the more
usual one of attempting to annotate all of the genes in a single genome.
&nbsp;The effort to develop well-curated sets of subsystems has led to
a collection of 400-600 subsystems (depending on where you choose to
impose a threshold of acceptable quality). &nbsp;We believe that the
number will continue to grow for reasons that will become apparent in
this short note.<br><br>It is time to revisit the issue of how to
annotate a specific genome of interest, since numerous biologists are
now faced with that opportunity. &nbsp; For what it is worth, here is
our advice.<br><br><h2>Begin by Identifying the Recognizable Instances of Subsystems</h2>When
you are able to annotate a complete subsystem, the individual
assignments are all somewhat more reliable. &nbsp;Most of the common
machinery can easily be identified, and this establishes a starting
point for the more difficult remaining tasks. &nbsp;The easiest way to
perform this initial stage of analysis is to proceed through two tasks:<br><ol><li>Submit the genome sequence to the RAST server maintained at Argonne National Laboratory. &nbsp;This can be done by going to the <a href="http://rast.nmpdr.org/rast.cgi">RAST server</a>,
registering yourself as a user (anyone is welcome to use the site),
uploading your sequence, and getting an initial annotation back in
about 12 hours. &nbsp;You can then download the initial annotation to
your site and work on it using any tools you prefer. &nbsp;The initial
annotation from RAST gives you three things:<br><ul><li>protein-encoding genes (CDSs),</li><li>RNA-encoding genes (tRNAs and rRNAs), and</li><li>identified subsystems.</li></ul></li><li>Once
you have an initial set of identified subsystems, you should manually
go through them and see where RAST failed to identify active variants.
&nbsp;RAST is fairly conservative in its calls, so if there is a
mis-called gene (e.g., due to a frameshift) or an unusual form of a
gene (e.g., an unknown form of an enzyme), you will see almost all of
the genes in a subsystem accounted for, but not enough for RAST to say
that the subsystem is really there. &nbsp;If you do this analysis
within RAST, you can compare the metabolic reconstruction for your
genome against related genomes, focusing on the specific differences.</li></ol>If
your genome is close to a previously annotated and studied genome, we
suggest that you do a detailed analysis of what genes distinguish the
new genome from the previously annotated genome (or genomes). &nbsp;The
SEED provides a tool for easily doing such a comparison, and similar
tools are either available or becoming available from a number of
sources.<br><br>Note that this initial step can be done very rapidly -- in a few days.<br><br><h2>Fix Frameshifts, Annotate Insertion Sequences, and Process Pseudo-genes</h2>RAST
often fails to identify the functional role of a particular gene because of
frameshifts. &nbsp;This is very common in low-quality sequence or
sequence produced by 454 technology. &nbsp;It is not particularly
serious, but we do recommend that you post-process the gene calls to
clean up the frameshifts. &nbsp; Biologists are justifiably reluctant
to change sequence data without resequencing; hence, we recommend that
the actual DNA sequence remain unchanged, that the correction be
embodied in the proposed translation of the feature, and that the
discrepancy between the actual DNA sequence and the translation be
recorded with the feature. We note that you can automatically correct
obvious frameshifts using tools within the SEED environment, and we
anticipate that these will become increasingly important as larger
volumes of low-quality sequence data become available.<br><br>The
issue of detecting insertion sequences, mobile elements, prophages and
so forth is important for a number of reasons. &nbsp;Determining the
set of impacted genes (often pseudo-genes) is extremely time-consuming.
&nbsp;We would guess that tools to support this type of analysis will
appear soon, but for now you will need to determine how much effort you
are willing to expend on the task. &nbsp; So, this part of the effort
can take from a few days (to automatically detect and correct
frameshifts) to man-years (to characterize insertion sequences,
pseudo-genes, and prophages).<br><h2>Look at Identified Functions that are Not in Subsystems</h2>As
you scan through the genes identified by RAST that have not yet been
placed in subsystems, you will find that some correspond to FIGfams and some do not.
&nbsp;Some are closely similar to well-annotated proteins (e.g., to
Swiss-Prot entries), and some are not.<br><br>We recommend that you
scan through these focusing on those that correspond to functional
roles that should be encoded into subsystems. &nbsp; &nbsp;It is
particularly important to examine those for which "functional coupling"
information exists (RAST will give you this information). &nbsp;When
strong functional coupling data exists, and when the functional role
can be identified with reasonable certainty, you have a particularly
good candidate for a new subsystem. &nbsp;If you can connect any of the
genes in the cluster (in, say, genomes that are "close" and have been
actively studied) to literature, you need to get the relevant papers
before deciding how to proceed. &nbsp;We suggest making a rapid pass
through the set of genes that have not been assigned to subsystems,
prioritizing these genes for possible use in starting new subsystems.<br><br>We
urge you to develop new subsystems when possible and to publish these
subsystems (which makes them accessible to users working on other
versions of the SEED).<br><h2>Summary</h2>So, our approximate approach to annotating a new genome would be:<br><br><ol><li>Run the genome through RAST.</li><li>Do a detailed metabolic comparison (within RAST) between your new genome and one or more of its closest relatives. &nbsp;Follow this with a general comparison of what genes distinguish it from its closest relatives.</li><li>Correct obvious frameshifts.</li><li>Decide
whether or not you are willing to spend the effort needed to identify
IS elements, prophages and other mobile elements. &nbsp;Similarly,
decide whether or not you wish to expend the effort to carefully
identify pseudo-genes.</li><li>If you have substantially changed the
gene calls, rerun your genome through RAST (keeping the gene
calls that you have now established).</li><li>Go through the genes that
have not yet been placed into subsystems, and determine whether or not it
makes sense to construct a limited set of new subsystems (especially if
they capture aspects of the genome which may have motivated the
sequencing effort in the first place).</li></ol>This can be done either
very rapidly or over a longer period. &nbsp;It all depends on
the anticipated role of the genome. &nbsp; In many cases, these tasks
can be performed in a few weeks, and we believe that the overall time
will continue to drop as the quality of the RAST analysis improves (due to an
expanded library of subsystems).</div><div style="text-align: left;"></div></div></body></html>
