Annotating a Genome Using KBase Tools

Purpose: This tutorial demonstrates how to annotate a set of closely related genomes using KBase tools. For example, the genus Geobacter can be used to study the pan-genome and construct metabolic models to clarify differences between strains.

Required Prerequisite Activities: Getting Started with KBase

Suggested Prerequesite Activities: Basic Exercises Using KBase

Related Tutorials: None

Select a genome

To see which Geobacter genomes are in the Central Store (CS) of KBase, use
  all_entities_Genome -f scientific_name | grep Geobacter

Geobacter sulfurreducens KN400 is a good example genome to begin with. Its KBase id is kb|g.2860. Locate it in the list of genomes. Obtain a local copy of the contigs of that genome using the step below.

Extract a FASTA file of contigs from the CS

You can upload any desired set of contigs from a file in your local machine to your IRIS workarea. However, in this example, obtain the contigs from the CS. To do that, use
  echo 'kb|g.2860' | genomes_to_contigs | contigs_to_sequences > g.2860.contigs

Create projects

Create subdirectories to contain your separate projects. Make a subdirectory called g.2860 where you will annotate kb|g.2860. (Do not use any special characters in your filenames, e.g., in this case the 'kb|' prefix was left off). For example

  mkdir g.2860
  mv g.2860.contigs g.2860
  cd g.2860
  ls

The first command creates the subdirectory, the second moves our contigs into the subdirectory, the third moves our "position" to the subdirectory, and the last displays the contents of the subdirectory. We urge you to verify that it all works as described.

Create a genome object to annotate

Next, use the FASTA file of contigs to create a genome object using

  fasta_to_genome 'Geobacter sulfurreducens KN400' Bacteria 11 < g.2860.contigs > genome

The above command creates a "genome object" in the file genome. The object contains the contigs, the scientific name of the organism, the domain (we specified 'B' for Bacteria), and the genetic code for the organism (i.e., 11, which is the code most commonly used with prokaryotic genomes). Use ls to see the file, and then click on it to see the encoded fields). Note that we have named our project subdirectory g.2860, which reflects the name of the genome whose contigs we copied. We are going to re-annotate the contigs, generating a whole new genome, which is what fasta_to_genome does; it registers a new genome ID that will not be used by anyone else.

Annotate a genome object

Now create an initial annotation for the genome using

  annotate_genome < genome > annotated.genome

This causes an initial annotation to be generated. It may take several minutes for large genomes. You can issue other commands while you wait, and the completion message will display below the command when the annotation is complete. When it completes, use ls to see the generated file, and click on it to see the encoded annotations. Alternatively, to the the features generated and placed in the annotated genome object (stored in annotated.genome), try
  genomeTO_to_feature_data < annotated.genome > features.txt

which produces a tab-separated table containing
  [feature-id,location,type,assigned-function]
 
Use ls to see it and explore the contents.

Build a metabolic reconstruction from an annotated genome

The term "metabolic reconstruction" as used here roughly means a set of subsystems that are believed to be present in the genome, along with the relevant variant codes. Obtain a metabolic reconstruction for the annotated genome using
  genomeTO_to_reconstructionTO < annotated.genome > reconstruction

To see the roles that were found, use
  reconstructionTO_to_roles < reconstruction > roles

To see the subsystems (along with the variant codes that seemed appropriate), use
  reconstructionTO_to_subsystems < reconstruction > subsystems

Get roles that might impact metabolic models

How good are the annotations for your newly-annotated genome? One way to assess this is to focus on the Roles that might impact metabolic models. We can look at the ones that were found and then compare them against those that exist in a similar, manually-curated genome. Begin by getting the subset of the Roles that might impact the metabolic models:
  all_roles_used_in_models > all.roles
  a_and_b roles all.roles > roles.for.models

Get roles that might have been missed

Now, compare this set of Roles against the set found in Geobacter metallireducens GS-15 (kb|g.9032 in the CS). Obtain the roles for that genome using
  echo 'kb|g.9032' | genomes_to_fids CDS | fids_to_roles 2> /dev/null | 
cut -f 3 > roles.in.g.9032
Note that the command fids_to_roles writes error messages when it cannot match a fid to any Roles.

Now use

  a_and_b roles.in.g.9032 all.roles > roles.for.models.g.9032
  a_not_b roles.for.models roles.for.models.g.9032 > roles.to.search.for

to create a file of Roles in kb|g.9032 that are not yet found in the annotations we got back for our new version of g.2860.

Create an initial metabolic model

Now that we have an annotated genome, create an initial metabolic model using
  genome_to_fbamodel < annotated.genome > initial.model

 

After a minute or two, a metabolic model object is stored in initial.model.

View the model

We can convert this model into readable HTML to see what it contains:
  fbamodel_to_html < initial.model > initial.model.html

 

After that command completes, use ls and click on the generated html.

Run flux balance analysis on the metabolic model

Flux balance analysis (FBA) is a mathematical approach in which our model is used to simulate various cellular activities, typically the production of biomass or metabolites from transportable nutrients. Now that we have a model, apply flux balance analysis to determine whether our model can grow, and to discover which pathways are utilized during growth.
  runfba < initial.model > solution.html

 

The results from the FBA are printed in HTML in the output solution file. Simply run the ls command and click on the HTML file to view the solution. At the top of this file is a table of the parameters used to run the FBA. Directly below this table is another table showing the objective function and value. In all likelihood, your model did not grow at all. This is because the model is missing pathways needed for biomass production. When a model fails to grow, the FBA command attempst to diagnose the problem by identifying biomass components that cannot be produced. This is done by maximizing the production of each individual biomass component, one at a time. You can see these analysis results in your solution HTML file (see the metabolite production table). Note that some of your biomass components have no numbers in this table. These are the components that cannot be produced. Now try to fix the model by adding reactions to enable production of these components.

Gapfill the model

We can run a gapfilling command on our model to automatically add reactions as needed to enable the production of all biomass components:
  gapfill_fbamodel < initial.model > gapfilled.model

 

Depending on the size and state of the genome, this could take minutes to hours, but when the analysis is complete, your model will have additional reactions in it, reflecting the ideal solution identified by the gapfilling algorithm to enable growth. Then rerun the flux balance analysis to determine the biomass production pathways of the organism.

Run flux balance analysis on the gapfilled model

Now that the model has been gapfilled, it should be possible to simulate biomass production using flux balance analysis. Use the runfba command again:
  runfba < gapfilled.model > solution.html

 

Once this command completes, use ls and click on the HTML produced by the FBA command. Your FBA solution should now include a nonzero objective value as well as numerous compound and reaction fluxes.

Export the model to external tools

Many other tools now exist that enable the analysis of genome-scale metabolic models (e.g., the Cobra toolbox). Most of these tools read models printed in SBML format. Print the gapfilled model in SBML format so the model can be used with these tools:
  fbamodel_to_sbml < gapfilled.model > gapfilled.model.sbml

 

After the command completes, use ls and select the SBML file for download.