<!-- Revision 1.3, Mon Sep 20 13:35:25 2004 UTC, by overbeek: add some comments to assignment1 -->

<html><head>
<meta name="Title" content="Common Uses of the SEED">
<meta http-equiv="Content-Type" content="text/html; charset=macintosh"><title>Common Uses of the SEED</title>

<style>
<!--
.Section1
	{page:Section1;}
-->
</style></head>
<body bgcolor="#ffffff" class="Normal" lang="EN-US">
<div class="Section1">
  <h1><span style="font-size: 16pt;">Common Uses of the SEED</span></h1>
  <p>The SEED is designed to support comparative analysis of genomes. What
    does that mean? Rather than discuss the abstract issues involved in
    this goal, let us focus on how the SEED is intended to be used. In
    this short tutorial we discuss a few of the more common uses of the system:</p>
  <ol start="1" type="1">
    <li>helping a researcher study a specific subsystem (set of genes),</li>
    <li>supporting community-wide annotation of genomes,</li>
    <li>searching for specific missing genes.</li>
  </ol>
  <p><span style="font-size: 14pt;"><b>1. Studying a Specific Subsystem (Set
        of Genes)</b></span></p>
  <p>A researcher often wishes to study a specific molecular
    subsystem implemented by some more-or-less understood set of genes. The
    SEED currently handles metabolic subsystems, in which the functional roles
    are represented by EC numbers, better than it handles other subsystems (although we
    do try to support non-metabolic subsystems, and that support
    will improve during this coming year). In such a case, one
    might take the following approach:</p>
  <ul type="disc">
    <li>enumerate the set of functional roles required by the subsystem,</li>
    <li>build a spreadsheet showing which of the roles can be connected to genes
      in each of the completely sequenced organisms,</li>
    <li>clean up assignments a bit to make the spreadsheet more accurate,</li>
    <li>assess which organisms have which versions of the subsystem,
      which will reveal genes that should be there but cannot be located
      (we call these missing genes),</li>
    <li>attempt to locate candidates for the missing genes.</li>
  </ul>
  <p><b>1.1 Accessing the KEGG Maps to Get Metabolic Overviews</b></p>
  <p>To begin this process, you should get some idea of which functional roles
    occur in the subsystem. The easiest way to do this might be to go to
    the FIG search page and ask for a metabolic overview of an organism
    that you know has the machinery you are interested in. The SEED simply
    offers access to KEGG's capability of portraying metabolic maps. Once
    you have chosen a map to display and an organism, the SEED determines
    which enzymes are shown in the metabolic map and which of these can be connected
    to specific genes in the organism you have chosen, and then invokes
    KEGG to render the results. This leaves you positioned to explore the metabolism
    within the KEGG environment, which is certainly one of the best available
    presentations of metabolism.</p>
  <p>You can open a new browser window on the SEED search page by <a href="http://localhost/FIG/index.cgi">clicking
      here</a>. Please try it, and then ask for a summary of glycolysis
      in <i>Thermotoga maritima</i> (simply as an example). This is a straightforward
      way to get a rapid overview of the metabolic potential within the genome
      of any of the organisms stored within the SEED. In this case, the
      SEED is simply acting as a portal to the wonderful features implemented
      in KEGG. Using the KEGG maps, you can rapidly extract a set of functional
      roles (i.e., enzymes, which we usually represent with EC numbers).</p>
  <p><b>1.2 Creating a Spreadsheet of Occurrences of Functional Roles</b></p>
  <p>Once you have an idea of which functional roles you wish to study, the next
    task is to create a spreadsheet showing exactly which functional roles have
    been connected to genes in each of the sequenced organisms. To do this,
    pick one of the central genes in the metabolic process you are studying,
    go to the SEED search page, type in either the EC number or a keyword for
    the enzyme, and then perform the search. A search produces two tables:
    a table of specific genes that match the search criteria, and a second table
    showing enzymatic roles that match the search criteria. Click on the
    appropriate entry in the table of enzymes that were matched (the second table). This
    should take you to a page in which the functional role is displayed at the
    top (it is a link to the KEGG description of this enzyme, from which a great
    deal can be learned). You also get a text area showing a set of functional
    roles that occur close to the given enzyme (distance is defined in terms of
    the number of reactions separating the substrates of the given enzyme and the
    other enzymes listed). The listed functional
    roles represent a neighborhood around the enzyme you selected. You
    need to edit this set to include exactly the EC numbers that you wish to
    include in your spreadsheet. You can delete entries and add EC numbers
    until you have precisely the list you wish. Then click on <b>Occurrences</b>,
    which will produce the spreadsheet you are after. Each column represents
    occurrences of one of the enzymes, and each cell gives a count of the number
    of genes that can be connected to that enzyme in the organism corresponding
    to the row containing the cell. A nonzero value produces a link that
    can be used to see the specific genes, while a value of zero produces a link
    that can be used to attempt to locate a candidate for the functional role.</p>
  <p>To make sure that you can easily do this, we suggest that you pick the following
    functional roles, which make up the textbook version of glycolysis, and build a spreadsheet:</p>
  <ul type="disc">
    <li>2.7.1.2 - glucokinase</li>
    <li>5.3.1.9 - <span style="color: black;">glucose-6-phosphate isomerase</span></li>
    <li><span style="color: black;">2.7.1.11 - 6-phosphofructokinase</span></li>
    <li><span style="color: black;">4.1.2.13 - fructose-bisphosphate aldolase</span></li>
    <li><span style="color: black;">1.2.1.12 - glyceraldehyde-3-phosphate dehydrogenase (phosphorylating)</span></li>
    <li><span style="color: black;">2.7.2.3 - phosphoglycerate kinase</span></li>
    <li><span style="color: black;">5.4.2.1 - phosphoglycerate mutase</span></li>
    <li><span style="color: black;">4.2.1.11 - enolase</span></li>
    <li><span style="color: black;">2.7.1.40 - pyruvate kinase</span></li>
  </ul>
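  <p>The structure of the occurrence spreadsheet described in section 1.2 can be sketched in a few
    lines of Python. This is only an illustration under stated assumptions: the organism names,
    the gene ids, and the <code>occurrence_table</code> helper are hypothetical, and the SEED builds its
    table from its own database rather than from a list like this one.</p>

```python
from collections import defaultdict

# Hypothetical gene-to-role assignments: (organism, gene id, EC number).
# These rows are invented for illustration; they are not SEED data.
ASSIGNMENTS = [
    ("Organism A", "gene1", "2.7.1.2"),
    ("Organism A", "gene2", "5.3.1.9"),
    ("Organism A", "gene3", "5.4.2.1"),
    ("Organism A", "gene4", "5.4.2.1"),
    ("Organism B", "gene1", "2.7.1.2"),
]

# One spreadsheet column per functional role (a subset of the list above).
ROLES = ["2.7.1.2", "5.3.1.9", "2.7.2.3", "5.4.2.1"]


def occurrence_table(assignments, roles):
    """Return {organism: [gene count per role]}, one row per organism."""
    counts = defaultdict(lambda: defaultdict(int))
    for organism, _gene, role in assignments:
        counts[organism][role] += 1
    return {org: [counts[org][role] for role in roles] for org in sorted(counts)}


table = occurrence_table(ASSIGNMENTS, ROLES)
for organism, row in table.items():
    print(organism, row)
```

  <p>In this toy table a 0 marks a role with no gene assigned (a candidate missing gene), and a count
    above 1 marks a role that may have been assigned to too many genes.</p>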
  <p><b>1.3 Setting a User</b></p>
  <p>When you enter the SEED via the search page, you have the option of setting
    a user ID. You must set one if you intend to alter assignments of function
    or add annotations to genes. Otherwise, it is not necessary. For
    the casual user, we recommend that you either do not set an ID or set
    one to something like RossOverbeek. Do not embed blanks. Once you
    have set a user, you are free to alter assignments or annotations.</p>
  <p>If you set a plain user ID, your assignments are visible to other users, but
    they do not override the master assignments. If you wish to overwrite
    master assignments, you need to begin your user ID with master:. Thus,
    master:RossO would assert that I wish to override existing master assertions,
    and <i>RossO</i> would be the user reflected in annotations and log
    records.</p>
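  <p>The user ID convention can be summarized with a small Python sketch. The function name
    <code>parse_user_id</code> and its return shape are assumptions made for illustration; this is not
    the SEED's own code, only a restatement of the rules given above (no blanks, and a
    master: prefix to override master assignments).</p>

```python
MASTER_PREFIX = "master:"


def parse_user_id(user_id):
    """Split a user ID into (annotator name, may override master assignments).

    Hypothetical helper: it only restates the convention described in the text.
    """
    if " " in user_id:
        raise ValueError("user IDs must not contain blanks")
    if user_id.startswith(MASTER_PREFIX):
        # The name recorded in annotations and logs drops the prefix.
        return user_id[len(MASTER_PREFIX):], True
    return user_id, False


print(parse_user_id("master:RossO"))
print(parse_user_id("RossOverbeek"))
```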
  <p><b>1.4 Cleaning Up Assignments</b></p>
  <p>As you study a given subsystem, you may wish to correct or add assignments. To
    do this, you establish yourself with a user ID, and then you will probably
    deal with three types of cases:</p>
  <ol start="1" type="1">
    <li>cases in which you believe there must be a gene implementing
      a function, but none has yet been identified,</li>
    <li>cases in which it appears that too many genes have been assigned
      the same function, and</li>
    <li>cases in which the functional roles assigned to genes are obviously the same, but the syntax, punctuation,
      and/or capitalization must be edited to make the assignments identical.</li>
  </ol>
  <p>We call the first case looking for missing genes. There are really
    two forms: in the first, similarity can be used to locate and identify the
    gene or genes that need to be assigned the function, and in the second you
    probably have a new form of an enzyme (and use of similarity will not get
    you the desired answer). We cover the second form in detail in section 3 below. The
    first form (in which we just use similarity to find the gene) is invoked
    directly from the occurrence spreadsheet. When you click on an entry
    that contains <b>0</b>, the SEED will search
    for candidates using genes already believed to play the functional role in
    other organisms.</p>
  <p>For example, in the glycolysis spreadsheet you constructed above,
    note that the phosphoglycerate kinase (EC 2.7.2.3) is apparently not yet
    identified in <i>Streptococcus agalactiae</i> 2603V/R. By
    clicking on the 0, you should eventually see at least two candidates for
    the function displayed. If you follow the link or links, you should
    be able to locate where the bad assignment appears. Try to correct
    it.</p>
  <p>The second type of problem (too many genes with
      the function) is also illustrated nicely with the assignments for glycolytic
      enzymes in <i>Streptococcus agalactiae</i> 2603V/R. Look
      at the phosphoglycerate mutase (EC 5.4.2.1). Note that
      three distinct genes have all been given this assignment. Can you
      tell which is correct? If so, try to change the incorrect annotations.</p>
  <p><b>2. Community Annotation of a Genome</b></p>
  <p>One of the intended uses of the SEED is to support community-wide annotation
    efforts. In this case, we anticipate a few heavy users and many infrequent
    users (all examining genes of particular interest, correcting annotations,
    and adding assertions of function as they are determined in the lab). Some
    users will be working on a central server over the web, while others will use
    their laptops (synchronizing all annotations and assignments periodically).</p>
  <p><b>2.1 Choosing User IDs</b></p>
  <p>When a community annotation effort is initiated, it is important to decide
    exactly who is allowed to update master annotations, and who is not. Most
    serious users should establish user IDs of the form master:UserID, which
    allows them to overwrite master annotations. The non-master
    form of user IDs is supported to allow students and beginners to work with
    the SEED without introducing errors.</p>
  <p><b>2.2 Moving Through the Chromosome Sequentially</b></p>
  <p>The most straightforward way to examine the genes within a genome is to
    start at the first and move sequentially through the genome. This is
    not often done, but let us try it to see what happens. The first gene
    in <i>Escherichia coli</i> is fig|562.1.peg.1. Try
    typing this into the search field and go to look at the gene. We will
    cover the meanings of the information that you see a little later. For
    now, just note the graphical depiction of the genes, in which the leftmost
    gene (the one you are positioned on) is green and the rest are red. The
    meanings of the colors are as follows:</p>
  <ul type="disc">
    <li>you are positioned on the green gene,</li>
    <li>the red genes are apparently unrelated genes, and</li>
    <li>blue genes are genes that might be functionally related (there is some
      evidence based on co-occurrence close to the given gene in several chromosomes).</li>
  </ul>
  <p>You can move along the chromosome by simply clicking on a colored gene. For
    example, if you click on the second gene, you should see things change substantially. Now
    you see that your position has changed (to gene 2), but also that genes 3, 4,
    7, and 8 have turned blue. Try simply clicking on genes to watch
    your position change in the graphical bar.</p>
<p>
There is another, perhaps superior, way to proceed methodically down the chromosome.
To try this other approach, look for the link <b>To Compare Regions</b> and click on it.
This will show you not only the genes in the genome you are examining
but also the corresponding regions in closely related
genomes.  Note that the orientation of the chromosomes is determined
by the gene you are positioned upon.  If it is on the positive strand,
the genes to the right go "up" in coordinates; otherwise, they
descend.
If you click on any other gene in the graphical display, you will move
to that gene (and display the compared regions).  So, if you click on
genes from the same genome, you can effectively "walk the genome".
The only tricky aspect is direction: either stay positioned on genes on the
positive strand, or think about whether to click on a gene to the
right (if you are on a gene from the positive strand) or the left (if
you are positioned on a gene on the negative strand).
</p>
<p>
As a fun exercise, you might walk down a genome that has a number of very
closely related genomes in the SEED (a <i>Staph. aureus</i> or
<i>Strep. pyogenes</i>, for example) and look for genes that are
probably miscalled.
</p>

  <p><b>2.3 Searching for Specific Genes</b></p>
  <p>Normally you do not simply walk through the genes in the chromosome. Rather,
    you type a specific word or two into the search box and try to go directly
    to a gene of interest. Try typing in <b>arsenical pump coli</b> and
    see what happens. Find the occurrence for a gene with alias <i>arsB</i> and
    click on it.</p>
  <p><b>2.4 Examining a Gene (the Gene Page)</b></p>
  <p>Much of your time using the SEED will probably
    be spent looking at data on the gene page. In this section we
    comment briefly on what is available on the gene page.</p>
  <p><b>2.41 The Context</b></p>
  <p>The gene page begins with a table we call the context. It represents
    the region on the chromosome (or fragment of a chromosome that we often call
    a contig). The first column in this table has the label fid, which
    stands for <i>feature ID</i>. The feature
    IDs for protein-encoding genes are abbreviated. For example, the <i>arsB</i> gene
    mentioned above is abbreviated to 4308, which is short for <b>fig|562.2.peg.4308</b>. RNA-encoding
    genes are not abbreviated. The <i>start</i> and <i>end</i> columns give the exact coordinates of the gene on
    the contig (not including the stop codon). The size is in bases. The
    strand is <b>+</b> or <b>-</b>, and
    the gap is the distance between two genes (genes that overlap have negative
    values for the gap, which is something worth checking occasionally). The
    next two columns, <b>fc</b> and <b>neigh</b>, are important. The genes with a <b>*</b> in
    the fc column appear to have some evidence supporting the hypothesis that
    they tend to co-occur with the gene you are positioned on. Thus, if
    you look at the display while positioned on the <i>arsB</i> gene, you will see that the genes before and after
    it appear to be functionally coupled based on co-occurrence data. The <b>neigh</b> column will be marked for genes that are known to
    play closely related functional roles (e.g., in the same pathway).</p>
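  <p>Two of the context columns lend themselves to a short Python sketch: the abbreviated fid and
    the gap. Everything here is an assumption made for illustration, not the SEED's implementation:
    the helper names, the example RNA feature ID, and the convention that coordinates are
    1-based and inclusive, so that adjacent genes have a gap of 0 and overlapping genes have a
    negative gap.</p>

```python
def abbreviate_fid(fid):
    """Shorten a protein-encoding gene ID such as fig|562.2.peg.4308 to '4308'.

    Other feature types are returned unchanged, as the text says they are not abbreviated.
    """
    parts = fid.split(".")
    if len(parts) == 4 and parts[2] == "peg":
        return parts[3]
    return fid


def gap(prev_gene, next_gene):
    """Bases between two genes given as (start, end); negative when they overlap.

    Taking max/min makes the result independent of strand, since start may exceed end.
    """
    prev_right = max(prev_gene)
    next_left = min(next_gene)
    return next_left - prev_right - 1


print(abbreviate_fid("fig|562.2.peg.4308"))  # 4308
print(gap((100, 200), (205, 300)))           # 4 (bases 201-204 lie between the genes)
print(gap((100, 200), (198, 300)))           # -3 (bases 198-200 are shared)
```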
  <p><b>2.42 Current Assignments</b></p>
  <p>Below the table giving the context and the graphical depiction of the region,
    we have a table giving current assignments. The assignments in this
    table are for proteins that have essentially the same amino acid sequence. They
    may be external sequences from other sources of data, and occasionally they
    are from different SEED genomes, but they are not just closely related; they
    are virtually identical sequences. The assignments, therefore, should
    be taken seriously. If the current assignment seriously disagrees with
    any of these, then someone is very probably wrong. If you believe that
    the current assignment for the current gene is wrong, and if you established
    a user ID when you began, then little arrows show up under the <b>ASSIGN</b> column. If you click on one of these, the current
    assignment is changed to match that of the row in which you clicked.</p>
  <p><b>2.43 Viewing Annotations</b></p>
  <p>Below the current assignments is a link that will allow you to view annotations
    of the gene, if there are any. Whenever anyone assigns a function,
    an annotation is generated.</p>
  <p><b>2.44 Functional Coupling as Detected Via Chromosomal Clusters</b></p>
  <p>As we mentioned above, the context shows genes that are believed to co-occur
    with the given gene. The evidence is not kept up to date, so there
    may be functionally coupled genes that are not shown. To be sure, you
    can click on the link <b>To Get Detailed Functional Coupling Data</b>, which
    updates and retains the functional coupling scores and produces a table of related genes. You can then click on the
    numeric value to see the co-occurrences. A more visual way to see co-occurrences
    (and far more informative) is to click on the * in the fc column in the context
    table. This produces a visual depiction of the co-occurrences. It
    is one of the most important displays in the FIG. Make sure that you
    try it. After examining the visual display for evidence of clustering, click on the
    <b>Commentary</b> button. This will show
    a table with sets of genes within a possible cluster that may perform the same functional
    roles.</p>
  <p><b>2.45 The Similarities Table</b></p>
  <p>You may wish to check the similarities between the given gene and other
    sequences in the FIG non-redundant database of protein sequences. You
    get this table by clicking on the <b>Similarities</b>
    button at the bottom of the gene page. The similarities are precomputed using BLAST.
    Even so, it may take a bit to collect the results and display them.</p>
  <p><b>2.46 Aligning Sets of Genes</b></p>
  <p>Once you have the similarities table, you can check a set of genes and do
    things with the set. One thing you can do with the set is to align
    the checked sequences. Try aligning 4-5 sequences and verify that it
    works on your machine.</p>
  <p><b>2.47 Making Assignments to Sets of Checked Genes</b></p>
  <p>Once you start discovering errors in function assignments (and you will
    discover many, many errors), you will find that the errors propagate. Thus,
    when one must be changed it may well be that an entire set of related errors
    must all be corrected. You can do this by checking a set of genes, selecting
    the proper checkbox options below the assign/annotate button, and then clicking that button
    (see the "Help on Assignments,Rules and Checkboxes" link
    above the assign/annotate button for details on the options). Another way to look at this feature is in terms
    of generating a whole set of errors with one operation (so be careful)!</p>
  <p><b>2.48 Viewing Annotations of Checked Genes</b></p>
  <p>You can also retrieve the annotations for an entire set of genes, which
    is often helpful, by checking the pegs (protein-encoding genes) of interest and clicking on view annotations.</p>
  <p><b>2.49 Invoking External Tools</b></p>
  <p>You can invoke external tools, passing the given protein sequence on as the
    input. Currently, we have installed links to NCBI's PSI-BLAST and to
    the ISREC TMpred (which is used to predict transmembrane domains). These
    are two excellent tools, and we will hook in more on demand (we do not wish
    to add a huge number of basically useless tools, but we would like to add
    any you find truly useful, so let us know).</p>
  <p><b>2.5 The Goals of Community Annotation and the Goals of FIG</b></p>
  <p>FIG intends to convert the SEED into a far more powerful tool for supporting
    community-wide annotations. We support the capability of
    synchronizing distinct versions easily, which allows individuals to
    have versions on laptops and to use them even during periods in which connections
    to the network are impossible.</p>
  <p>An even more important capability involves adding new genomes rapidly as
    they become available. We plan on supporting these efforts (even when
    the data cannot be widely shared immediately).</p>
  <p><b>3. Finding Missing Genes</b></p>
  <p>Occasionally you know (through an accumulation of wet lab and <i>in silico</i> evidence)
      that a gene performing a given function must be present although you cannot
      identify it yet. Searching for such missing genes is one of the most
      exciting activities that you can do using the SEED; it is, perhaps, what
      it was really designed to do. We will be adding more tools to support
      this activity as rapidly as possible.</p>
  <p><b>3.1 Locating the Critical Clusters (or Finding the Mother Lode)</b></p>
  <p>We believe that the most effective way to locate missing genes exploits
    the fact that functionally related genes tend to cluster on the
    chromosome (in prokaryotes, and very occasionally in eukaryotes). Indeed,
    as many as 30-60% of the genes in most prokaryotic genomes are clustered
    (often in operons) with genes that have closely related functions.</p>
  <p>The first step in locating a missing gene is to figure out a functional
    neighborhood. To do this, we recommend using the KEGG maps as discussed
    in section 1.1 above, or using the search function to access a functional
    role (which then comes with a selected set of functional roles that constitute
    a neighborhood). To illustrate, suppose that we wished to locate clusters
    relating to chorismate biosynthesis. We could just take a neighborhood
    around chorismate synthase. When you get the page for the functional
    role (i.e., the one with the proposed neighborhood), pick a set of genes,
    and then click on <b>clusters</b>. This locates the largest clusters containing
    the functions you designate as a neighborhood. See if you can find
    the clusters that seem to suggest that genes assigned the function transketolase
    play a role in chorismate biosynthesis in the archaea.</p>
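  <p>The idea of a chromosomal cluster can be illustrated with a toy Python sketch. This is not the
    algorithm behind the <b>clusters</b> button, which the tutorial does not describe. It assumes a
    simplified, hypothetical model in which each gene has an integer position on a contig and one role,
    and a cluster is a run of neighborhood genes in which consecutive members are at most
    <code>max_gap</code> positions apart.</p>

```python
def largest_cluster(genes, neighborhood, max_gap=3):
    """Return positions of the longest run of genes whose roles are in the neighborhood.

    genes: list of (position_on_contig, role) pairs (hypothetical input format).
    Consecutive members of a run may be at most max_gap positions apart.
    """
    hits = sorted(pos for pos, role in genes if role in neighborhood)
    best, current = [], []
    for pos in hits:
        if current and pos - current[-1] > max_gap:
            current = []  # too far from the previous hit: start a new run
        current.append(pos)
        if len(current) > len(best):
            best = list(current)
    return best


# Invented contig: roles a, b, c form the neighborhood; x is unrelated.
GENES = [(1, "a"), (2, "x"), (3, "b"), (10, "a"), (11, "b"), (12, "c"), (30, "c")]
print(largest_cluster(GENES, {"a", "b", "c"}))  # [10, 11, 12]
```

  <p>A gene of unknown or unexpected function sitting inside such a run is the kind of candidate
    the exercise above asks you to look for.</p>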
  <p><b>3.2 The Missing Tools: More to Come</b></p>
  <p>While exploring clusters is a great way to locate missing genes, it is not
    the only way. More and more significant tools are emerging. The
    best involve the use of fusions, in which two genes that are separate in one organism
    are fused in another (this is extremely strong evidence that the genes play
    closely related roles); the use of regulatory sites (it is now possible, using
    comparative analysis with a set of closely related genomes, to clearly locate
    many regulatory sites in prokaryotes); and occurrence profiles. We
    will be adding these tools to the SEED as quickly as possible.</p>
  <h2>Summary</h2>
  <p>This short discussion was originally written by Ross Overbeek in a rush. He
    established the rule that anyone who seriously had problems with it should
    fix it, and FIG would use the result (until someone else added more corrections). We
    believe that functions will rapidly be added, and decent tutorials and examples
    can best be done as a cooperative effort. In any event, Overbeek is
    off adding more features.</p>
<br>
<b>If you want to make sure that you can actually navigate the SEED,
  try working on some assignments.  To see the proposed
  assignments, <a href="assignment1.html">click here</a>.</b>
</div>
</body></html>
