<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head><meta content="text/html; charset=ISO-8859-1" http-equiv="content-type"><title>How_to_annotate_a_genome</title></head><body><h1 style="text-align: center;">How to Annotate a Genome</h1><div style="text-align: center;"><h3>by Ross Overbeek</h3><br><div style="text-align: left;">We at the Fellowship for Interpretation of Genomes (FIG) have actively led the <span style="font-weight: bold;">Project to Annotate 1000 Genomes</span> since its inception in 2003. In that effort we pioneered what we called the <span style="font-weight: bold;">subsystems approach to annotation</span>
in which experts annotated a single subsystem across the entire set of
genomes. This was a radically different approach from the more
usual one of attempting to annotate all of the genes in a single genome.
The effort to develop well-curated sets of subsystems has led to
a collection of 400-600 subsystems (depending on where you choose to
impose a threshold of acceptable quality). We believe that the
number will continue to grow for reasons that will become apparent in
this short note.<br><br>It is time to revisit the issue of how to
annotate a specific genome of interest, since numerous biologists are
now faced with that opportunity. For what it is worth, here is
our advice.<br><br><h2>Begin by Identifying the Recognizable Instances of Subsystems</h2>When
you are able to annotate a complete subsystem, the individual
assignments are all somewhat more reliable. Most of the common
machinery can easily be identified, and this establishes a starting
point for the more difficult remaining tasks. The easiest way to
perform this initial stage of analysis is to proceed through two tasks:<br><ol><li>Submit the genome sequence to the RAST server maintained at Argonne National Laboratory. This can be done by going to the <a href="http://rast.nmpdr.org/rast.cgi">RAST server</a>,
registering yourself as a user (anyone is welcome to use the site),
uploading your sequence, and getting an initial annotation back in
about 12 hours. You can then download the initial annotation to
your site and work on it using any tools you prefer. The initial
annotation from RAST gives you three things:<br><ul><li>protein-encoding genes (CDSs),</li><li>RNA-encoding genes (tRNAs and rRNAs), and</li><li>identified subsystems.</li></ul></li><li>Once
you have an initial set of identified subsystems, you should manually
go through them and see where RAST failed to identify active variants.
RAST is fairly conservative in its calls, so if there is a
mis-called gene (e.g., due to a frameshift) or an unusual form of a
gene (e.g., an unknown form of an enzyme), you will see almost all of
the genes in a subsystem accounted for, but not enough for RAST to
assert that the subsystem is really there. If you do this analysis
within RAST, you can compare the metabolic reconstruction for your
genome against related genomes, focusing on the specific differences.</li></ol>If
your genome is close to a previously annotated and studied genome, we
suggest that you do a detailed analysis of which genes distinguish the
new genome from the previously annotated genome (or genomes). The
SEED provides a tool for easily doing such a comparison, and similar
tools are either available or becoming available from a number of
sources.<br><br>Note that this initial step can be done very rapidly -- in a few days.<br><br><h2>Fix Frameshifts, Annotate Insertion Sequences, and Process Pseudo-genes</h2>RAST
often fails to identify the functional role of a particular gene due to
frameshifts. This is very common in low-quality sequence or
sequence produced by 454 technology. It is not particularly
serious, but we do recommend that you post-process the gene calls to
clean up the frameshifts. Biologists are justifiably reluctant
to change sequence data without resequencing; hence, we recommend that
the actual DNA sequence remain unchanged, that the correction be
embodied in the proposed translation of the feature, and that the
discrepancy between the actual DNA sequence and the translation be
recorded with the feature. We note that you can automatically correct
obvious frameshifts using tools within the SEED environment, and we
anticipate that these will become increasingly important as larger
volumes of low-quality sequence data become available.<br><br>The
issue of detecting insertion sequences, mobile elements, prophages and
so forth is important for a number of reasons. Determining the
set of impacted genes (often pseudo-genes) is extremely time-consuming.
We would guess that tools to support this type of analysis will
appear soon, but for now you will need to determine how much effort you
are willing to expend on the task. So, this part of the effort
can take from a few days (to automatically detect and correct
frameshifts) to man-years (to characterize insertion sequences,
pseudo-genes, and prophages).<br><h2>Look at Identified Functions that are Not in Subsystems</h2>As
you scan through the genes that RAST identified but that have not yet
been placed in subsystems, some correspond to FIGfams, and some do not.
Some are closely similar to well-annotated proteins (e.g., to
Swiss-Prot entries), and some are not.<br><br>We recommend that you
scan through these, focusing on those that correspond to functional
roles that should be encoded into subsystems. It is
particularly important to examine those for which "functional coupling"
information exists (RAST will give you this information). When
strong functional coupling data exists, and when the functional role
can be identified with reasonable certainty, you have a particularly
good candidate for a new subsystem. If you can connect any of the
genes in the cluster (in, say, genomes that are "close" and have been
actively studied) to literature, you should obtain the relevant papers
before deciding how to proceed. We suggest making a rapid pass
through the set of genes that have not been assigned to subsystems,
prioritizing these genes for possible use in starting new subsystems.<br><br>We
urge you to develop new subsystems when possible and to publish these
subsystems (which makes them accessible to users working on other
versions of the SEED).<br><h2>Summary</h2>So, our approximate approach to annotating a new genome would be:<br><br><ol><li>Run the genome through RAST.</li><li>Do a detailed metabolic comparison (within RAST) between your new genome and one or more of its closest relatives. Follow this with a general comparison of the genes that distinguish it from its closest relatives.</li><li>Correct obvious frameshifts.</li><li>Decide
whether or not you are willing to spend the effort needed to identify
IS elements, prophages and other mobile elements. Similarly,
decide whether or not you wish to expend the effort to carefully
identify pseudo-genes.</li><li>If you have substantially changed the
gene calls, rerun your genome through RAST (keeping the gene
calls that you have now established).</li><li>Go through the genes that
have not yet been placed into subsystems, and determine whether or not it
makes sense to construct a limited set of new subsystems (especially if
they capture aspects of the genome which may have motivated the
sequencing effort in the first place).</li></ol>This can be done
very rapidly, or more time can be taken; it all depends on
the anticipated role of the genome. In many cases, these tasks
can be performed in a few weeks, and we believe that the overall time
will continue to drop as the quality of the RAST analysis (due to an
expanded library of subsystems) improves.</div><div style="text-align: left;"></div></div></body></html>