<h1>Annotating a Genome Using KBase Tools</h1>
<p><strong>Purpose: </strong>This tutorial demonstrates how to annotate a set of closely related
genomes using KBase tools. For example, the genus <i>Geobacter</i> can be used to study the pan-genome and construct
metabolic models to clarify differences between strains. </p>
<p><strong>Required Prerequisite Activities: </strong><a href="/developer-zone/tutorials/getting-started/getting-started-with-the-kbase/">Getting Started with KBase</a></p>
<p><strong>Suggested Prerequisite Activities: </strong><a href="/developer-zone/tutorials/getting-started/some-basic-exercises-using-kbase/">Basic Exercises Using KBase</a></p>
<p><strong>Related Tutorials: </strong>None</p>
<h2>Select a genome</h2>

To see which <i>Geobacter</i> genomes are in the <i>Central Store (CS)</i> of KBase, use<br>
<pre>
all_entities_Genome -f scientific_name | grep Geobacter
<br></pre>
<i>Geobacter sulfurreducens KN400</i> is a good example genome to begin with.
Its KBase ID is <i>kb|g.2860</i>. Locate it in the list of genomes,
then obtain a local copy of its contigs using the step below.
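<p>If the list is long, the ID can also be picked out by name. The following is a minimal sketch that assumes the listing is tab-separated with the genome ID in the first column and the scientific name in the second; that layout, the helper name <i>genome_id_for_name</i>, and the sample file are illustrative assumptions, not part of KBase.</p>

```shell
# Print the ID (column 1) of every row whose scientific name (column 2)
# matches exactly. The two-column, tab-separated layout is an assumption.
genome_id_for_name() {
    awk -F '\t' -v name="$1" '$2 == name { print $1 }' "$2"
}

# Sample listing built from the two genomes named in this tutorial.
listing=$(mktemp)
printf 'kb|g.2860\tGeobacter sulfurreducens KN400\n' > "$listing"
printf 'kb|g.9032\tGeobacter metallireducens GS-15\n' >> "$listing"

genome_id_for_name 'Geobacter sulfurreducens KN400' "$listing"
```

<p>With real data, the output of the listing command above would be saved to a file and passed as the second argument.</p>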

<h2>Extract a FASTA file of contigs from the CS</h2>

You can upload any desired set of contigs from a file on your
local machine to your IRIS work area. In this example, however, obtain the contigs from the CS.
To do that, use
<br>
<pre>
echo 'kb|g.2860' | genomes_to_contigs | contigs_to_sequences > g.2860.contigs
<br></pre>
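<p>A quick sanity check on the extracted file is to count its FASTA records. The sketch below assumes the file is standard FASTA, where each record begins with a header line starting with <i>&gt;</i>; the helper name <i>count_fasta_records</i> and the sample sequences are made up for illustration.</p>

```shell
# Count FASTA records by counting header lines (lines that start with '>').
count_fasta_records() {
    grep -c '^>' "$1"
}

# Tiny stand-in for a contigs file: two records, one of them wrapped over two lines.
sample=$(mktemp)
printf '>contig_1\nACGTACGT\nACGT\n>contig_2\nGGCC\n' > "$sample"

count_fasta_records "$sample"
```

<p>On the real data the call would be <i>count_fasta_records g.2860.contigs</i>; a count of zero suggests the extraction did not produce any sequences.</p>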
<h2>Create projects</h2>
Create subdirectories to contain
your separate projects. Make a subdirectory called <i>g.2860</i> where you will
annotate kb|g.2860. (Do not use special characters in your filenames; in this case, the 'kb|' prefix was left off.)

For example
<br>
<pre>
mkdir g.2860
mv g.2860.contigs g.2860
cd g.2860
ls
<br></pre>
The first command creates the subdirectory, the second moves the contigs into it,
the third changes the current directory to the subdirectory, and the last lists its contents.
We urge you to verify that it all works as described.

<h2>Create a genome object to annotate</h2>

Next, use the FASTA file of contigs to create a <i>genome object</i> using <br>
<pre>
fasta_to_genome 'Geobacter sulfurreducens KN400' Bacteria 11 < g.2860.contigs > genome
</pre>
The above command creates a "genome object" in the file <i>genome</i>. The object contains the contigs,
the scientific name of the organism, the domain (we specified <i>Bacteria</i>), and
the genetic code for the organism (here 11, the code most commonly used with prokaryotic genomes).
Use <em>ls</em> to see the file, and then click on it to see the encoded fields.
Note that we have named our project subdirectory <i>g.2860</i>, which reflects the name of the
genome whose contigs we copied. We are going to re-annotate the contigs, generating a whole new
genome, which is what <b>fasta_to_genome</b> does; it registers a new genome ID that will not be used
by anyone else.

<h2>Annotate a genome object</h2>

Now create an initial annotation for the genome using
<br>
<pre>
annotate_genome < genome > annotated.genome
</pre>

This generates an initial annotation. It may take several minutes for large genomes. You can issue other commands while you wait; a completion message is displayed below the command when the annotation is complete. When it completes, use <b>ls</b>
to see the generated file, and click on it to see the encoded annotations. Alternatively, to see the features
generated and placed in the annotated genome object (stored in <i>annotated.genome</i>), try
<br>
<pre>
genomeTO_to_feature_data < annotated.genome > features.txt
<br></pre>
which produces a tab-separated table containing
<br>
<pre>
[feature-id,location,type,assigned-function]
<br></pre>
Use <em>ls</em> to see it and explore the contents.
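<p>One way to explore the table is to tally features by type. The sketch below relies only on the column order given above (type in the third tab-separated column); the helper name <i>count_feature_types</i> and the sample rows, including their IDs, locations, and functions, are invented placeholders rather than real KBase output.</p>

```shell
# Tally the values in column 3 (feature type) of a tab-separated feature table.
count_feature_types() {
    awk -F '\t' '{ n[$3]++ } END { for (t in n) print t "\t" n[t] }' "$1" | LC_ALL=C sort
}

# Placeholder rows: feature-id, location, type, assigned-function.
features=$(mktemp)
printf 'example.CDS.1\tcontig_1_10+300\tCDS\thypothetical protein\n' > "$features"
printf 'example.CDS.2\tcontig_1_400+900\tCDS\texample function\n' >> "$features"
printf 'example.rna.1\tcontig_2_5+75\trna\texample tRNA\n' >> "$features"

count_feature_types "$features"
```

<p>For the file produced in this step, the call would be <i>count_feature_types features.txt</i>.</p>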
<h2>Build a metabolic reconstruction from an annotated genome</h2>
The term
"metabolic reconstruction" as used
here roughly means <i>a set of <a href="http://www.theseed.org/wiki/Glossary#Subsystem" target="_blank">
subsystems</a> that are believed to be present in the genome, along with
the relevant variant codes</i>.
Obtain a metabolic reconstruction for the annotated genome using <br>
<pre>
genomeTO_to_reconstructionTO < annotated.genome > reconstruction
<br></pre>
To see the roles that were found, use
<br>
<pre>
reconstructionTO_to_roles < reconstruction > roles
<br></pre>
To see the subsystems (along with the variant codes that seemed appropriate), use
<br>
<pre>
reconstructionTO_to_subsystems < reconstruction > subsystems
<br></pre>

<h2>Get roles that might impact metabolic models</h2>

How good are the annotations for your newly annotated genome?
One way to assess this is to focus on the <i>Roles</i> that might impact metabolic models.
We can look at the ones that were found and then compare them against those that exist in
a similar, manually curated genome. Begin by getting the subset of the Roles that
might impact the metabolic models:<br>
<pre>
all_roles_used_in_models > all.roles
a_and_b roles all.roles > roles.for.models
<br></pre>
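<p>Conceptually, this step is a set intersection over the lines of two files. The sketch below shows that operation with standard tools, assuming that <b>a_and_b</b> keeps the lines present in both inputs; it is not the KBase implementation, and the helper name <i>lines_in_both</i> and the placeholder role names are illustrative.</p>

```shell
# Print the lines of the first file that also occur, as whole lines, in the second.
# -F: treat patterns as fixed strings, -x: match whole lines, -f: read patterns from a file.
lines_in_both() {
    grep -Fxf "$2" "$1"
}

found=$(mktemp)
model_roles=$(mktemp)
printf 'role A\nrole B\nrole C\n' > "$found"
printf 'role B\nrole C\nrole D\n' > "$model_roles"

lines_in_both "$found" "$model_roles"
```

<p>Matching whole lines as fixed strings matters here because role names often contain characters, such as parentheses and periods, that would otherwise be read as regular-expression syntax.</p>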

<h2>Get roles that might have been missed</h2>

Now, compare this set of Roles against the set found in
<i>Geobacter metallireducens GS-15</i> (<i>kb|g.9032</i> in the CS). Obtain the roles for
that genome using
<br>
<pre>
echo 'kb|g.9032' | genomes_to_fids CDS | fids_to_roles 2> /dev/null | <br /> cut -f 3 > roles.in.g.9032
<br></pre>
Note that the command <b>fids_to_roles</b> writes error messages when it cannot match
a fid to any Roles; the <b>2&gt; /dev/null</b> redirection discards them.
<p>
Now use
<br>
<pre>
a_and_b roles.in.g.9032 all.roles > roles.for.models.g.9032
a_not_b roles.for.models.g.9032 roles.for.models > roles.to.search.for
<br></pre>

to create a file of Roles in <i>kb|g.9032</i> that are not yet found in the annotations we
got back for our new version of <i>g.2860</i>.
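<p>The comparison is a set difference: lines that appear in a reference list but not in a second list. A minimal sketch with standard tools follows, assuming that <b>a_not_b</b> prints the lines of its first file that are absent from its second; the helper name <i>lines_only_in_first</i>, the file variables, and the role names are placeholders, not KBase output.</p>

```shell
# Print the lines of the first file that do NOT occur, as whole lines, in the second.
# -v inverts the match; -F, -x and -f behave as in a fixed-string, whole-line lookup.
lines_only_in_first() {
    grep -Fvxf "$2" "$1"
}

reference_roles=$(mktemp)
new_roles=$(mktemp)
printf 'role A\nrole B\nrole C\n' > "$reference_roles"
printf 'role B\n' > "$new_roles"

lines_only_in_first "$reference_roles" "$new_roles"
```

<p>The order of the arguments determines the direction of the difference, so the reference list goes first when the goal is to find roles the new annotation may have missed.</p>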

<h2>Create an initial metabolic model</h2>

Now that we have an annotated genome, create an initial metabolic model using
<br>
<pre>
genome_to_fbamodel < annotated.genome > initial.model
</pre>

<p>&nbsp;</p>
<p>After a minute or two, a metabolic model object is stored in <i>initial.model</i>.</p>
<h2>View the model</h2>

We can convert this model into readable HTML to see what it contains:

<pre>
fbamodel_to_html < initial.model > initial.model.html
</pre>

<p>&nbsp;</p>
<p>After that command completes, use <em>ls</em> and click on the generated HTML file. </p>

<h2>Run flux balance analysis on the metabolic model</h2>

Flux balance analysis (FBA) is a mathematical approach in which our model is used to simulate
various cellular activities, typically the production of biomass or metabolites from
transportable nutrients. Now that we have a model, apply flux balance analysis to
determine whether our model can grow, and to discover which pathways are utilized during growth.

<pre>
runfba < initial.model > solution.html
</pre>

<p>&nbsp;</p>
<p>The results from the FBA are written as HTML to the output solution file. Run the <em>ls</em> command and click on the HTML file to view the solution. At the top of this file
is a table of the parameters used to run the FBA. Directly below this table is another
table showing the objective function and its value. In all likelihood, your model did not grow
at all, typically because the model is missing pathways needed for biomass production. When
a model fails to grow, the FBA command attempts to diagnose the problem by identifying
biomass components that cannot be produced. This is done by maximizing the production of
each individual biomass component, one at a time. You can see these analysis results in your
solution HTML file (see the metabolite production table). Note that some of your biomass components
have no numbers in this table. These are the components that cannot be produced. Now try
to fix the model by adding reactions to enable production of these components. </p>

<h2>Gapfill the model</h2>

We can run a gapfilling command on our model to automatically add reactions as needed to
enable the production of all biomass components:

<pre>
gapfill_fbamodel < initial.model > gapfilled.model
</pre>

<p>&nbsp;</p>
<p>Depending on the size and state of the genome, this could take minutes to hours. When
the analysis is complete, your model will contain additional reactions, reflecting
the solution identified by the gapfilling algorithm to enable growth. Then
rerun the flux balance analysis to determine the biomass production pathways of the organism.
</p>

<h2>Run flux balance analysis on the gapfilled model</h2>

Now that the model has been gapfilled, it should be possible to simulate biomass production
using flux balance analysis. Use the <i>runfba</i> command again (note that this overwrites the earlier <i>solution.html</i>):

<pre>
runfba < gapfilled.model > solution.html
</pre>

<p>&nbsp;</p>
<p>Once this command completes, use <em>ls</em> and click on the HTML file produced by the FBA
command. Your FBA solution should now include a nonzero objective value as well as numerous
compound and reaction fluxes.</p>

<h2>Export the model to external tools</h2>

Many other tools exist for analyzing genome-scale metabolic models (e.g.,
the COBRA Toolbox). Most of these tools read models in SBML format. Write
the gapfilled model in SBML format so that it can be used with these tools:

<pre>
fbamodel_to_sbml < gapfilled.model > gapfilled.model.sbml
</pre>

<p>&nbsp;</p>
<p>After the command completes, use <i>ls</i> and select the SBML file for download.</p>
