[Bio] / FigTutorial / DataMining.html Repository:
ViewVC logotype

View of /FigTutorial/DataMining.html

Parent Directory Parent Directory | Revision Log Revision Log

Revision 1.5 - (download) (as text) (annotate)
Thu Feb 16 22:03:22 2006 UTC (13 years, 9 months ago) by overbeek
Branch: MAIN
CVS Tags: rast_rel_2014_0912, rast_rel_2008_06_18, rast_rel_2008_06_16, rast_rel_2008_07_21, rast_rel_2010_0928, rast_2008_0924, rast_rel_2008_09_30, rast_rel_2010_0526, rast_rel_2014_0729, rast_rel_2009_05_18, rast_rel_2009_0925, rast_rel_2010_1206, rast_rel_2010_0118, rast_rel_2009_02_05, rast_rel_2011_0119, rast_rel_2008_12_18, rast_rel_2008_10_09, rast_release_2008_09_29, rast_rel_2008_04_23, rast_rel_2008_08_07, rast_rel_2009_07_09, rast_rel_2010_0827, myrast_33, rast_rel_2011_0928, rast_rel_2008_09_29, rast_rel_2008_10_29, rast_rel_2009_03_26, rast_rel_2008_11_24, HEAD
Changes since 1.4: +1 -1 lines
fix Data mining tutorial

<h1>Data Mining Tools in the SEED</h1>
<h2>A Sketchy Set of Notes Maintained by Ross Overbeek</h2>
This tutorial discusses a number of issues that you will need to know about
in order to use the data mining tools included in SEED.  It is organized as follows:

<li><A HREF="#Introduction: pegs, fid2dna, translate, and translation_of">Introduction: pegs, fid2dna, translate, and translation_of</A>
<li><A HREF="#A More Complex Example: Indirect Functional Coupling">A More Complex Example: Indirect Functional Coupling</A>
<li><A HREF="#Extracting Upstream Regions">Extracting Upstream Regions</A>
<li><A HREF="#Locating New Forms of Enzymes">Locating New Forms of Enzymes</A>
The first time through, I recommend reading it in order.

<h2 id="Introduction: pegs, fid2dna, translate, and translation_of" >Introduction: pegs, fid2dna, translate, and translation_of</h2>
The SEED is a loosely structured, evolving system.  It is by intent an
"executable prototype" in which new ideas are installed by numerous
participants.   The set of tools that I describe here are descendants
of an experiment that we did with the WIT system.  The basic idea is
to build a command-line capability to very easily extract data from an
integration like the SEED.  To extract desired data, you invoke small
programs that are strung together to produce the desired output.  It
is clearly a notion that derived from the original ideas developed in
the UNIX culture.  Although we will not elaborate on the point here,
most of the precise design grew out of work done on logic programming
systems in the late 1980's and early 1990's.  
To see how things work, let's consider a simple example:
	pegs 83333.1 | fid2dna | translate
is a sequence of three commands that are strung together to get
translations of the genes in <i>Escherichia coli</i>.  The way it
works is that
<li><b>pegs</b> is a command that takes as arguments a list of genome
IDs (in this case we used only one - 83333.1).  It writes out a stream
of the IDs of the PEGs (protein-encoding genes, often called CDSs) for
the set of genomes.  You can see this by just typing
pegs 83333.1 | head -5
which will show the first 5 lines written by <b>pegs</b>.
<li> <b>fid2dna</b> is a utility that takes as input a stream of
feature IDs and outputs a FASTA formatted file of the DNA for the
You might try
	pegs 83333.1 | fid2dna | head -50
to see what comes out.
<li>Finally, <b>translate</b> is a utility that translates a file of
DNA sequences (in FASTA format) into a file of protein sequences in
FASTA format.
This little example actually translates the DNA.  The SEED includes
translations for each PEG.  You could get them using
	pegs 83333.1 | translation_of 
As an exercise, we suggest that you run
	pegs 83333.1 | translation_of > tmp.SEED.translations
	pegs 83333.1 | fid2dna | translate > tmp.translated.SEED.DNA
and see how the two files compare.  Would you expect them to be the
same?  Why might they differ?

<h2 id="A More Complex Example: Indirect Functional Coupling">A More Complex Example: Indirect Functional Coupling </h2>

Functionally related genes tend to cluster on prokaryotic chromosomes.
This fact is the basis for a number of techniques that have been used
to gain clues relating to the function of "hypothetical proteins".  As
the number of sequenced genomes grows, the power of techniques based
on statistical analysis of chromosomal clusters grows dramatically (I
have argued elsewhere that the information grows approximately as the
square of the number of genomes, but that is a story best left for
another document).
Inevitably, the question arises: <b>"What can one do with a gene when it
does not appear to be in a chromosomal cluster?"</b>  The technique
that I will describe now has often been referred to as <i>indirect
Here is how it works:
You start with a gene in some genome (prokaryotic or eukaryotic).
Call this gene <i>X</i>.
Compute all bi-directional best hits (BBHs) of <i>X</i>.  These amount to
an attempt to find the corresponding genes in other genomes.  You need
to remember that such projections are not completely reliable, but
they are at the heart of many of the inferences we make relating to
Call this set of BBHs <i>S1</i>.
Now for each element of S1, we can check to see if it appears to be in
a chromosomal cluster, and if it is we can compute the genes that
appear to be <i>directly functional coupled</i>.  Call this set of
genes that are fucntionally coupled to BBHs of <i>X</i> <i>S2</i>.
Now we want to locate the genes in the original genome that correspond
to genes in </S2>.  Again, we estimate this correspondence using BBHs.
That is, we take BBHs of genes in <i>S2</i> that are in the same
genome as <i>X</i>.  Call this set of genes <i>S3</i>.  These are the
genes that are indirectly coupled to <i>X</i>.
Finally, look at the functions of the indirectly coupled genes.  This
set of functions can often be hugely informative.

To illustrate this process and how it can be implemented using the
tools we provide with the SEED, let us go through the computation one
step at a time.

The first step is to pick an initial gene.  As an example, let me
choose a gene from <i>Methanocaldococcus jannaschii</i>, one of the first
genomes I studied.  In the version of the SEED I am using at this
time, the peg is <i>fig|2190.1.peg.1537</i>.  It is also known by a
number of other ids: <i>NP_248444.1</i>, <i>gi|15669631</i>,
<i>sp|Q58835</i>, and kegg|mja:MJ1440</i>, for example.  You should
make a file containing just one line, and that line should contain one
of these ids.  Call this file <i>initial.peg</i>.
	bbhs < initial.peg > corresponding.genes.in.other.genomes
should construct a file containing the BBHs of the single gene in the
file.  If <i>initial.peg</i> had contained more genes, BBHs would have
been computed for all of the ids in the file.
You should make an initial file and run <b>bbhs</b> to make sure that
you understand this simple tool.  If you wished to restrict yourself
to BBHs that had better similarity scores than, say, 1.0e-50, you
would use
	bbhs 1.0e-50 < initial.peg > corresponding.genes.in.other.genomes
The tool <i>bbhs</i> computes BBHs for any id that occurs as a last
field in the tab-separated fields in lines from the input file.  If there are numerous
PEGs in the input file, execution of <i>bbhs</i> can go for hours.
The next step involved computing genes that were functionally coupled
to those in <i>corresponding.genes.in.other.genomes</i>.  This can be
done using

    functionally_coupled 5000 1.0e-10 3 fast < corresponding.genes.in.other.genomes > coupled.in.other.genomes
The tool <i>functionally_coupled</i> computes genes that are
functionally coupled (based on preserved contiguity) to those that
occur as last entries in the tab-separated fields of lines from the
input file. The parameters to the tool are as follows:
<li>the maximum distance separating a gene from another that is
putatively clustered with it,
the maximum p-score for similarities used to find corresponding genes
between two clusters,
the minimum functional coupling score (which is one less than the number of
clusters in "significantly different" genomes), and
the word <i>fast</i> which causes the tool to attempt a fast technique
that sometimes misses couplings (this argument is optional).  <i>Fast</i> is meant only relatively here:
execution can go for hours, even with the option set.
The tool writes out 3-tuples containing the two functionally-coupled
genes separated by their coupling score.
Again, I urge you to actually go through this step and examine the
output file.
Once you have created the file <i>coupled.in.other.genomes</i>, it is
time to search for BBHs from the coupled genes back in the original
genome. This can be done using
   bbhs < coupled.in.other.genomes | restrict_to_genome initial.genome > restricted.to.initial.genome
The <i>bbhs</i> tool is the same one we discussed above.  The simple
tool <i>restrict_to_genome</i> takes a single argument which is a file
containing a list of genome ids.  In this case the file contains just
a single line giving the id of the genome containing the initial gene
(which was, if you recall, 2190.1).
Finally, you should append the functions of the detected genes and
remove duplicates using
function_of < restricted.to.initial.genome | sort -u > indirectly.coupled.in.original.genome
The <i>function_of</i> tool takes the last field in each input line,
looks up the function (assuming the field was a PEG id), and writes
out the PEG id and function.
If you do all of this, you get a file containing
fig|2190.1.peg.1170	Methanocaldococcus jannaschii	Shikimate 5-dehydrogenase (EC
fig|2190.1.peg.1551	Methanocaldococcus jannaschii	3-dehydroquinate dehydratase (EC
fig|2190.1.peg.571	Methanocaldococcus jannaschii	3-phosphoshikimate 1-carboxyvinyltransferase (EC
For those of you who are familiar with aromatic amino acid
biosynthesis, these three functions do, in fact, perfectly bracket the
initial gene (fig|2190.1.peg.1537, which is the archael shikimate
kinase) in the pathway for chorismate biosynthesis.  A number of us
partiucipated in predicting the function of this gene a number of
years ago.  With the tools and data we have now, the predictions of function
for such genes is vastly simplified.

You could have produced the three lines of output by stringing all of the
commands together using
	% bbhs < initial.peg | 
	> functionally_coupled 5000 1.0e-10 3 fast | 
	> bbhs | 
	> restrict_to_genomes initial.genome | 
	> function_of | 
	> sort -u > indirectly.coupled.in.original.genome
So, there you have a pipeline of commands to compute indirect functional
coupling.  You can adjust the similarities required for the BBHs, and
you can alter the minimum coupling score required, but that is
essentially the computation.
You might wish to play with the different parameters.  In particular,
the "fast" parameter on <i>functionally_coupled</i> is optional, and
does alter the behaviour.  How well it performs depends critically on
how many <i>pins</i> and <i>chromosomal clusters</i> have accurately
been precomputed.  If you do not use the "fast" parameter, it can slow
things down by an order of magnitude.  We do try to keep pins and
clusters in good shape, but ....
Finally, you could use <i>similar_to</i> rather than <i>bbhs</i>.
However, unless you proceed incrementally, checking all intermediate
results, this can produce a lot of false positives (i.e., a lot of

<h2 id="Extracting Upstream Regions">Extracting Upstream Regions</h2>

Suppose that you wished to search upstream regions of genes for
control sites.  To do this, you need to get sets of genes that might
share a common control site, extract the upstream regions, and then
search these for a common regulatory site.  In this section we will
discuss how to extract the desired upstream regions.
Normally, one would select a set of potentially co-regulated genes.
This might be done using something like 
% pegs_in_subsystem "Histidine Biosynthesis" | grep 'fig|211586.1.peg' > related
% cat related
ATP phosphoribosyltransferase (EC	fig|211586.1.peg.1886
Phosphoribosyl-ATP pyrophosphatase (EC	fig|211586.1.peg.1879
Phosphoribosyl-AMP cyclohydrolase (EC	fig|211586.1.peg.1879
Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC	fig|211586.1.peg.1881
Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC	fig|211586.1.peg.3771
Imidazole glycerol phosphate synthase subunit hisH (EC 2.4.2.-)	fig|211586.1.peg.1882
Imidazole glycerol phosphate synthase subunit hisF (EC 4.1.3.-)	fig|211586.1.peg.1880
Imidazoleglycerol-phosphate dehydratase (EC	fig|211586.1.peg.1883
Histidinol-phosphate aminotransferase (EC	fig|211586.1.peg.1884
Histidinol-phosphatase (EC	fig|211586.1.peg.1883
Histidinol dehydrogenase (EC	fig|211586.1.peg.1885
The command <i>pegs_in_subsystem</i> can be used to extract all of the
genes from the spreadsheet encoding the designated subsystem (in this case,
<i>Histidine Biosynthesis</i>).  The command 
produces a 2-column (tab-separated) file in which the
first column is a functional role and the second is a PEG assigned
that function.  By using <i>grep</i> to pull out just those genes for
a given genome (in this case 211586.1, which is <i>Shewanella
oneidensis MR-1</i>), we get a potentially co-regulated set of genes.
This may or may not be exactly what you wish -- you must still be
careful of situations in which the genes occur in a single operon, but
that is a detail for later.
Once you have a set of potentially co-regulated genes, you might use
something like
	upstream upstream=200 plus=100 < related > upstream.fasta
which produces a file like
Here, the parameters to upstream have the following meanings:
<b>upstream=200</b> specifies that the output is to contain 200 bp of
upstream region, if possible (i.e., if the gene is not at the
end of a contig), and
<b>plus=100</b> specifies that you wish to include 100 bp at the start
of the gene.
The output contains one fasta entry for each gene.  A "-" separates the upstream sequence
from the start of the gene's coding sequence.  Lowercase is used for the
upstream sequence, until it overlaps another gene (and from there on you get uppercase).
The key to actually extracting accurate upstream operators in prokaryotes has turned out
to be the use of comparative analysis with corresponding genes from closely related organisms.
This requires determining a set of "corresponding genes from closely related organisms" for each gene in the
potentially co-regulated set.
One way to do this might be to use
pegs_in_subsystem "Histidine Biosynthesis" | grep "ATP phosphoribosyltransferase (EC"
which produces a file something like
ATP phosphoribosyltransferase (EC	fig|83333.1.peg.1994
ATP phosphoribosyltransferase (EC	fig|272626.1.peg.573
ATP phosphoribosyltransferase (EC	fig|224308.1.peg.3498
ATP phosphoribosyltransferase (EC	fig|169963.1.peg.563
Note that all we have done here is to extract a column (rather than a row) from the
subsystem spreadsheet.
You might, of course, wish to consider corresponding genes as defined by sequence similarity,
rather than using a subsystem spreadsheet.  In this case, you need to get all of the genes
from other organisms that pass some similarity threshhold.  To do this, you might use something like
	echo 'fig|211586.1.peg.1886' | similar_to > genes.similar.to.1886
This gives you the genes similar to <i>fig|211586.1.peg.1886</i>.  If you wished to get
the upstream regions for the original gene and its related genes, then
(echo 'fig|211586.1.peg.1886';  echo 'fig|211586.1.peg.1886' | similar_to) | upstream
would work (here we just use default parameters on <i>similar_to</i> and <i>upstream</i>).
To do this for an entire <i>Shewanella</i> genome I used
% for i in `pegs 211586.1`
> do
>     (echo "$i"; echo "$i" | similar_to | is_prokaryotic) | upstream upstream=500 plus=100 > "Output/$i"
>     echo $i
> done
while running in <i>bash</i>.  The <i>is_prokaryotic</i> filter just removes non-prokaryotic genes from the
input to <i>upstream</i>.  The <ii>is_eukaryotic</i>, <i>is_bacterial</i>, and <i>is_archaeal</i> filters now
work, as well.

<h2 id="Locating New Forms of Enzymes">Locating New Forms of Enzymes </h2>

Researchers have discovered that many enzymes have multiple,
nonhomologous forms.  The best way to recognize such situations is by
constructing a <i>subsystem spreadsheet</i>.  Once you have
constructed a spreadsheet for a subsystem, you often will find a set
of organisms in which the majority of functional roles can be matched
up with PEGs, but one or more columns remain empty.  These often
represent cases in which an alternative form of the enzyme exists, and
one of the more interesting tasks in annotation is to search for the gene that
has not yet been identified as playing the missing role.  In this
short section, we will illustrate one one to seek the missing gene
using the SEED data mining tools (you could, of course, also seek it
using the Web-based tools).
I will illustrate the technique using a search for the archaeal form
of the shikimate kinase.  This example induces pleasant memories for
me, since this was one of the first bioinformatics predictions that I
was involved in that was then confirmed in the wet lab.  At the time a
number of us worked out the prediction, our approachs were far from
systematic.  Now it is time to formulate the approaches that
work and apply them routinely.  
In this example, we had a known
version of the shikimate kinase, an enzyme used in aromatic
amino-acid biosynthesis.  The known version occurs throughout the
bacteria and eukaryotes.  However, while it was absolutely clear that
many archaea synthesize aromatic amino acids with essentially the same
pathway used by bacteria, locating the archaeal form of the shikimate
kinase had not yet been done.  So, how might one go about locating the
archaeal form, given only a number of occurrences of the known
bacterial form?
The basic approach that we will illustrate here uses the following
<li><b>Step 1</b>: You begin the approach using the known version of the
enzyme (any of the known versions).  We will take a version from
<i>Brucella abortus</i>, fig|235.1.peg.189, but any of the bacterial
occurrences would work.  We will call this the <i>initial
sequence</i>.  We will begin by computing bi-directional best hits
(BBHs) of the initial sequence.  Call the set composed of the initial
sequence and all of its BBHs, <b>Si: the set of known versions of the
enzyme</b>.  Here is how we propose computing <b>Si</b>:
	bbhs iterate < initial > bbhs
	cat initial bbhs > Si
The command <i>bbhs</i> was used in our last example.  In this case,
we have added an extra argument; by saying <i>iterate</i> we have
requested that the command compute BBHs of the PEGs in the file
initial, but then iterate by adding BBHs of any of the newly added
PEGs (and continue until no more new BBHs can be added).

<li><b>Step 2</b>: Now we compute the set of genes that are
functionally coupled to the known version using

	functionally_coupled 5000 1.0e-10 5 fast < Si > fc.Si
	bbhs length=0.8 iterate < fc.Si > known.machinery

The genes in <b>fc.Si</b> should all occur in genomes that contain the
known version of the enzyme.  This set represents "genes coupled to
the known version" and represents an estimate of the machinery that is
functionally related to the role played by the enzyme.  If we then
project this machinery to all of the genomes (including those that do
not contain the known version) using <i>bbhs</i> we have the file <i>known.machinery</i>.

<li><b>Step 3</b>: We then proceed by restricting the PEGs in <i>known.machinery</i> to those from
genomes that do not contain members of <i>Si</i>; that is, to those that might be from
genomes containing the alternative form.  To do this we use
    genomes_of < Si > genomes.of.Si
    restrict_to_genomes -v genomes.of.Si < known.machinery > known.in.genomes.without.orig.form
The tool <i>genome_of</i> takes as input a file containing lines that end with PEG ids.  It
outputs lines that contain the genome ids of all genomes containing those PEGs (i.e., those that occur
in the last column of the input file).
The tool <i>restrict_to_genomes</i> takes as input a file containing PEG ids in the
last column.  It restricts this input set based on the genome ids of these PEGs, writing out those lines
that pass a test.  The test is given by the command line arguments.  If the command line contains a single
argument, it should be a file containing genome ids.  In the case, the test amounts to "keep PEGs from
genomes corresponding to those in the given file".  If the command line arguments are of the form
<i>-v File</i>, then the test amounts to "keep only PEG ids from genomes other than those in the given file".

<li><b>step 4</b>: Now that we have a file on known machinery in genomes that probably do not have the original
form of the enzyme, we use
    functionally_coupled 5000 1.0e-10 5  fast < known.in.genomes.without.orig.form > potential.candidates
to get genes that are functionally coupled to PEGs in <i>known.in.genomes.without.orig.form</i>.  The trouble
with these "potential candidates" is that the file will contain instances of all components of the
machinery of the subsystem; what we really want is instances that bear no resemblance to either the original
form or to any other gene that was functionally related to the original form (or to any BBH of those!).
We achive this restriction using
    cat Si fc.Si known.machinery > things.to.ignore
    restrict_by_similarity -v things.to.ignore 1.0e-3 < potential.candidates > candidates
The <i>restrict_by_similarity</i> command can be used to restrict a set of PEGs to those that are
either similar to, or definitely not similar to, a set of given PEGs (in this case <i>things.to.ignore</i>).
In the invocation shown, we take those not similar to genes in <i>potential.candidates</i>, and we specify a similarity threshhold of 1.0e-3.
<li><b>Step 5</b>: Finally, we take the set of candidate pegs and cluster all that are BBHs of one another.
This gives us a sequence of clusters of BBHs, where all of the entries in each cluster were dredged up by
this fairly simple little procedure.  The command to produce the clusters is as follows:
    cluster_by_bbhs < candidates > data.to.look.at
This concludes my initial discussion of what I call the "sniffer algorithm" (with the image of a bloodhound in mind).
It is, perhaps, worth noting that we base this exploration almost entirely on projection (BBHs) and functional coupling on the chromosome.  If one wanted to include use of other forms of evidence of functional coupling, it can
be done easily.  For example, to add fusion data, one would simply change

    functionally_coupled 5000 1.0e-10 5  fast < known.in.genomes.without.orig.form > potential.candidates

    functionally_coupled 5000 1.0e-10 5  fast < known.in.genomes.without.orig.form > potential.candidates
    fusion_of < known.in.genomes.without.orig.form >> potential.candidates

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3