Revision 1.2, Mon Sep 19 11:25:19 2005 UTC, by overbeek

<h1>Locating Genes that Distinguish Two Sets of Genomes</h1>

<h2>The Basic Capability</h2>
It is fairly common to find researchers who wish to locate genes that might be associated
with some property of an organism.  Suppose, for example, that we wished to locate the genes
associated with photosynthesis.  One obvious approach is to gather two
sets of genomes,

one set (set 1) for photosynthetic organisms and
one set (set 2) containing genomes for organisms known not to use photosynthesis.
Then, if we were to go through the genes in one of the photosynthetic
genomes (i.e., one of the genomes in set 1), we could try to find
genes that occurred in all of the genomes in set 1, but did not occur
in any of the genomes in set 2.  This is basically what
<b>sigs.cgi</b> does, except that it tabulates genes that <i>tend to
occur in set 1, but not in genomes from set 2</i>.  The tool assigns a
score from 0 to 2.  The exact way the score is computed is a separate
topic, but essentially the score for each gene is formed by summing
two values:

a value from 0 to 1 is formed based on the number of genomes in set 1
that contain versions of the gene, and
a second value from 0 to 1 is formed based on the number of genomes in
set 2 that do not have the gene.

If you simply thought of the values being computed as the fraction of
genomes in set 1 with versions of the gene plus the fraction of
genomes in set 2 without the gene, you would not be too wrong.  The
actual value is a bit different, but the basic idea is the same -- we
produce a value from 0 to 2 that corresponds to how well the gene
discriminates between the two sets of genomes.
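The simplified view of the score described above can be sketched in a few lines of Python.  This is only the "not too wrong" approximation (fraction of set 1 with the gene plus fraction of set 2 without it), not the actual computation used by <b>sigs.cgi</b>.  The function name and the toy genome identifiers are assumptions made for illustration.

```python
def approximate_score(genomes_with_gene, set1, set2):
    """Approximate discrimination score in the range 0 to 2.

    genomes_with_gene: set of genome ids that contain a version of the gene
    set1, set2: lists of genome ids (both must be non-empty)
    NOTE: this is the simplified fraction-based view, not the real sigs.cgi score.
    """
    if not set1 or not set2:
        raise ValueError("both genome sets must be non-empty")
    # Fraction of set 1 genomes that contain the gene (0 to 1).
    present_in_set1 = sum(1 for g in set1 if g in genomes_with_gene) / len(set1)
    # Fraction of set 2 genomes that lack the gene (0 to 1).
    absent_from_set2 = sum(1 for g in set2 if g not in genomes_with_gene) / len(set2)
    return present_in_set1 + absent_from_set2


# Toy data (invented): four genomes per set.
set1 = ["p1", "p2", "p3", "p4"]
set2 = ["n1", "n2", "n3", "n4"]

perfect = approximate_score({"p1", "p2", "p3", "p4"}, set1, set2)  # 2.0
partial = approximate_score({"p1", "p2", "p3", "n1"}, set1, set2)  # 0.75 + 0.75 = 1.5
print(perfect, partial)
```

A gene found in every genome of set 1 and in no genome of set 2 gets the perfect score of 2 under this approximation.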
<h2>A Simple First Example: Finding Genes Involved in Photosynthesis</h2>
So, let us see how this works.  From the home page of the NMPDR, click
on <b>NMPDR Tools</b>, and then click on <b>To find the genes that
differentiate two sets of genomes</b>.  This should bring you to a
page with the title <b>Find Proteins that Discriminate Two Sets of
Organisms or Are Common to a Set of Organisms</b>.  This page is
designed for you to specify

<li>the genomes in <b>set 1</b>,
<li>the genome from set 1 from which we will select genes to be scored
(the <b>Given</b> genome), and
<li>the genomes in <b>set 2</b>.
To make this concrete, suppose that we place the following
photosynthetic organisms in <b>set 1</b>:
<li> <i>Chlorobium tepidum TLS</i>,
<li> <i>Nostoc sp. PCC 7120</i>,
<li> <i>Prochlorococcus marinus str. MIT 9313</i>, and
<li> <i>Synechocystis sp. PCC 6803</i>

By default, all organisms in the table are in neither set 1 nor set 2.
You need to go down the second column and click on the organisms you
wish to be in set 1.  Please do this now.
Then, you need to select one of the genomes from set 1 to be the
<b>given</b> genome.  Please select <i>Prochlorococcus marinus
str. MIT 9313</i>
by clicking on the button in the first column.
Now, let us select a few genomes for organisms that are known not 
to implement photosynthesis:
<li><i>Bacillus subtilis subsp. subtilis str. 168</i>,
<li><i>Clostridium acetobutylicum ATCC 824</i>,
<li><i>Escherichia coli K12</i>, and
<li><i>Thermotoga maritima MSB8</i>.

Please select these by clicking on entries in the fourth column of the table.
Once you have done this, go to the bottom of the page and click on
<b>Find the Discriminating Proteins from Given Organism</b>.  This
will bring up between 30 and 40 genes with scores of 2 (i.e., perfect
scores).  About 25% of these are clearly related to photosynthesis.
Others may be, but you would need to explore more data to decide.
You could easily get much more detailed data by expanding the number
of genomes in the two sets.  After all, it should be simple to add
30-40 non-photosynthetic genomes, and at least 10 more photosynthetic ones.

<h2>Finding Genes that Distinguish Sets of Pathogens</h2>

You can use this tool to find genes that distinguish any two sets of
organisms, including a set of pathogenic strains from a set of
closely-related nonpathogenic strains (or sets of pathogenic strains
that are believed to have different virulence determinants).
For example, suppose that you wished to find the genes in
<i>Staphylococcus aureus subsp. aureus MRSA252</i> that also occurred in
<li> <i>Staphylococcus aureus subsp. aureus MSSA476</i> and
<li> <i>Staphylococcus aureus subsp. aureus Mu50</i>,
but not in
<li><i>Staphylococcus aureus subsp. aureus MW2 </i> or
<li><i>Staphylococcus epidermidis ATCC 12228</i>.
Is it clear how to do this?  We suggest that you compute this set of
genes (set the <b>given genome</b> to <i>Staphylococcus aureus
subsp. aureus MRSA252</i>, put the three strains that you wish to
group in set 1, and place MW2 and <i>S. epidermidis</i> in set 2).
How many genes do you get?
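The strict form of this query (genes of the given genome that occur in every genome of set 1 and in no genome of set 2) amounts to simple set operations.  The following is a minimal sketch, assuming you already have, for each gene of the given genome, the set of other genomes in which a matching gene was found.  The function name, the <tt>matches</tt> structure and all identifiers are hypothetical and the data is invented; the tool itself reports scores rather than a strict yes/no answer.

```python
def strict_discriminators(matches, given, set1, set2):
    """Genes of the given genome found in all of set 1 and none of set 2.

    matches: {gene id in given genome: set of genome ids with a matching gene}
    given:   id of the given genome (it is a member of set 1)
    """
    required = set(set1)
    excluded = set(set2)
    selected = []
    for gene, genomes in matches.items():
        # The gene trivially occurs in the given genome itself.
        found_in = set(genomes) | {given}
        if required <= found_in and not (excluded & found_in):
            selected.append(gene)
    return sorted(selected)


# Toy data (invented): "g" is the given genome, s1 and s2 complete set 1.
toy_matches = {
    "geneA": {"s1", "s2"},        # in all of set 1, absent from set 2
    "geneB": {"s1", "s2", "x1"},  # also present in a set 2 genome
    "geneC": {"s1"},              # missing from one set 1 genome
}
result = strict_discriminators(toy_matches, "g", ["g", "s1", "s2"], ["x1", "x2"])
print(result)  # ['geneA']
```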

<h2>Finding Genes Common to a Set of Genomes</h2>

It is also easy to use this tool to compute the set of genes common to
a set of genomes.  To do this, select a <b>given</b> genome and
specify the set of genomes to be searched as set 1.  Then, specify a
minimum number of genomes that a gene must be in to be displayed.
Finally, click on <b>Find Genes from Checked Organism and Organisms
from Set 1</b>.  We suggest that you find the genes in common to all
of the following strains of <i>S. aureus</i>:
<li> <i>Staphylococcus aureus subsp. aureus MSSA476</i>,
<li> <i>Staphylococcus aureus subsp. aureus Mu50</i>,
<li><i>Staphylococcus aureus subsp. aureus MW2 </i>,
<li><i>Staphylococcus aureus subsp. aureus MRSA252</i>, and
<li><i>Staphylococcus aureus subsp. aureus N315</i>
How many do you get?  This amounts to the "core machinery" of
<i>Staphylococcus aureus</i>.
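The "common genes" computation with a minimum-genome threshold can be sketched the same way.  This is an illustration under assumptions, not the tool's code: the function name and the <tt>matches</tt> structure are hypothetical, the data is invented, and whether the tool counts the given genome toward the minimum is not stated here, so this sketch counts only the genomes listed in set 1.

```python
def common_genes(matches, set1, min_genomes):
    """Genes of the given genome that occur in at least min_genomes of set 1.

    matches: {gene id in given genome: set of genome ids with a matching gene}
    """
    searched = set(set1)
    return sorted(
        gene
        for gene, genomes in matches.items()
        if len(set(genomes) & searched) >= min_genomes
    )


# Toy data (invented): three genomes are searched.
toy_set1 = ["s1", "s2", "s3"]
toy_matches = {
    "geneA": {"s1", "s2", "s3"},
    "geneB": {"s1", "s3"},
    "geneC": set(),
}
core = common_genes(toy_matches, toy_set1, 3)     # present in all three
relaxed = common_genes(toy_matches, toy_set1, 2)  # present in at least two
print(core, relaxed)  # ['geneA'] ['geneA', 'geneB']
```

Setting the minimum equal to the size of set 1 gives the genes common to every genome, which is the "core machinery" case.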

<h2>BE WARNED: The Issue of Paralogs</h2>

The computation this tool uses (in the NMPDR) is based on bidirectional best hits.
In cases in which more than one copy of a gene exists in a genome
(i.e., there are paralogs of the gene present) and the copies are very
similar, the tool fails to match genes between organisms accurately.
This means that you should use this tool just to arrive at sets of
genes that will be subjected to closer analysis.  You will need to use
a copy of the SEED or an external service like the NCBI BLAST server
to verify the results.
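To see why very similar paralogs cause trouble, here is a minimal sketch of a bidirectional-best-hit computation between two genomes.  It is not the NMPDR implementation; the function names, gene identifiers and similarity scores are all invented for illustration.  Genome A has two near-identical copies (a1, a1p) and so does genome B (b1, b1p).

```python
def best_hits(scores):
    """Map each query gene to its single best-scoring target.

    scores: {(query gene, target gene): similarity score}
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(a_to_b, b_to_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_to_b)
    reverse = best_hits(b_to_a)
    return {a: b for a, b in forward.items() if reverse.get(b) == a}


# Invented scores: the paralogs differ only slightly, so both copies in
# each genome prefer the same partner in the other genome.
a_to_b = {("a1", "b1"): 500, ("a1", "b1p"): 499,
          ("a1p", "b1"): 498, ("a1p", "b1p"): 497}
b_to_a = {("b1", "a1"): 500, ("b1", "a1p"): 498,
          ("b1p", "a1"): 499, ("b1p", "a1p"): 497}

pairs = bidirectional_best_hits(a_to_b, b_to_a)
print(pairs)  # {'a1': 'b1'}
```

In this toy case a1p gets no partner even though genome B clearly contains a second copy, so a1p would wrongly look like a gene that is absent from genome B.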
