[Bio] / FigTutorial / DataMining.html Repository:
ViewVC logotype

Diff of /FigTutorial/DataMining.html

Parent Directory Parent Directory | Revision Log Revision Log | View Patch Patch

revision 1.1, Mon Jun 28 19:14:07 2004 UTC revision 1.2, Sat Aug 7 15:58:14 2004 UTC
# Line 220  Line 220 
220  results, this can produce a lot of false positives (i.e., a lot of  results, this can produce a lot of false positives (i.e., a lot of
221  garbage).  garbage).
222    
223    <h2>Extracting Upstream Regions</h2>
224    
225    Suppose that you wished to search upstream regions of genes for
226    control sites.  To do this, you need to get sets of genes that might
227    share a common control site, extract the upstream regions, and then
228    search these for a common regulatory site.  In this section we will
229    discuss how to extract the desired upstream regions.
230    <p>
231    Normally, one would select a set of potentially co-regulated genes.
232    This might be done using something like
233    <pre>
234    % pegs_in_subsystem "Histidine Biosynthesis" | grep 'fig|211586.1.peg' > related
235    % cat related
236    ATP phosphoribosyltransferase (EC 2.4.2.17)     fig|211586.1.peg.1886
237    Phosphoribosyl-ATP pyrophosphatase (EC 3.6.1.31)        fig|211586.1.peg.1879
238    Phosphoribosyl-AMP cyclohydrolase (EC 3.5.4.19) fig|211586.1.peg.1879
239    Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC 5.3.1.16)   fig|211586.1.peg.1881
240    Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC 5.3.1.16)   fig|211586.1.peg.3771
241    Imidazole glycerol phosphate synthase subunit hisH (EC 2.4.2.-) fig|211586.1.peg.1882
242    Imidazole glycerol phosphate synthase subunit hisF (EC 4.1.3.-) fig|211586.1.peg.1880
243    Imidazoleglycerol-phosphate dehydratase (EC 4.2.1.19)   fig|211586.1.peg.1883
244    Histidinol-phosphate aminotransferase (EC 2.6.1.9)      fig|211586.1.peg.1884
245    Histidinol-phosphatase (EC 3.1.3.15)    fig|211586.1.peg.1883
246    Histidinol dehydrogenase (EC 1.1.1.23)  fig|211586.1.peg.1885
247    %
248    </pre>
249    <br>
250    The command <i>pegs_in_subsystem</i> can be used to extract all of the
251    genes from the spreadsheet encoding the designated subsystem (in this case,
252    <i>Histidine Biosynthesis</i>).  The command produces
253    produces a 2-column (tab-separated) file in which the
254    first column is a functional role and the second is a PEG assigned
255    that function.  By using <i>grep</i> to pull out just those genes for
256    a given genome (in this case 211586.1, which is <i>Shewanella
257    oneidensis MR-1</i>), we get a potentially co-regulated set of genes.
258    This may or may not be exactly what you wish -- you must still be
259    careful of situations in which the genes occur in a single operon, but
260    that is a detail for later.
261    <p>
262    One you have a set of potentially co-regulated genes, you might use
263    something like
264    <pre>
265            upstream upstream=200 plus=100 < related > upstream.fasta
266    </pre>
267    <br>
268    which produces a file like
269    <pre>
270    >fig|211586.1.peg.1886
271    acccaccatcatcaccaccatatgcggtagccttcggacgttcactgccggaggactata
272    tccttctggcggactgattcaaagatgatcataaacccctggaaggaattccgggggttt
273    ttgcttttaggcttttaataaacgatttaacagctaattagtaattaatttcaatatgtt
274    aaccgccacatttactggcg-GTGCCGATAAGCCAGTTAAAGCGTGAAACGAAAGAATAC
275    CATTTTGAATTTGTTAACTTTGAGCGAGATAGAATTATGACCGAATCAAACCGTTTACGT
276    A
277    >fig|211586.1.peg.1879
278    GTTTATCGATGCGAAGGTGGATGCAGCACTGGCGGCCAGCGTGTTCCATAAAGCCATTAT
279    CAATATCGGCGAGTTAAAACAGTATTTAGCCGCCGAAGGTATTGCTATTAGACGCtagag
280    cggttctaggttctaggttctaggttctaggttctaggttctaggttctaggttctacaa
281    atagattaagagtgaatact-ATGTCAACGCAAACAAACACTAAGTCAGATTTTAGCGAG
282    CTGGATTGGGACAAACAGGATGGGCTTATCCCTGCTGTTATCCAAAACCACTTATCAGGC
283    A
284    >fig|211586.1.peg.1881
285    ATCGGTAAAGACAACTTTATGGGCGTGCAATTCCACCCTGAAAAAAGTGCCGCCGTGGGC
286    GCGCAAATCCTAGGTAACTTTTTAAAAATGCAAtaaatgcaatcgcttattaggtttggc
287    tgagccaccactatttttcagccacaccatacgacaataaagcgcggcagcaaatgcacc
288    aaaagaattaaggacaacag-ATGATCATTCCAGCGATTGATTTAATTGATGGCAAGGTA
289    GTCAGGCTCTACCAAGGGGATTATGGGCAGCAGACCACCTTCGATTTAAGCCCCCTCGCC
290    C
291    .
292    .
293    .
294    </pre>
295    Here, the parameters to upstream have the following meanings:
296    <ul>
297    <li>
298    <b>upstream=200</b> specifies that the output is to contain 200 bp of
299    upstream region, if possible (i.e., if the gene is not at the
300    end of a contig), and
301    <li>
302    <b>plus=100</b> specifies that you wish to include 100 bp at the start
303    of the gene.
304    </ul>
305    The output contains one fasta entry for each gene.  A "-" separates the upstream sequence
306    from the start of the gene's coding sequence.  Lowercase is used for the
307    upstream sequence, until it overlaps another gene (and from there on you get uppercase).
308    <p>
309    The key to actually extracting accurate upstream operators in prokaryotes has turned out
310    to be the use of comparative analysis with corresponding genes from closely related organisms.
311    This requires determining a set of "corresponding genes from closely related organisms" for each gene in the
312    potentially co-regulated set.
313    One way to do this might be to use
314    <pre>
315    pegs_in_subsystem "Histidine Biosynthesis" | grep "ATP phosphoribosyltransferase (EC 2.4.2.17)" >
316    </pre>
317    <br>
318    which produces a file something like
319    <br>
320    <pre>
321    ATP phosphoribosyltransferase (EC 2.4.2.17)     fig|83333.1.peg.1994
322    ATP phosphoribosyltransferase (EC 2.4.2.17)     fig|272626.1.peg.573
323    ATP phosphoribosyltransferase (EC 2.4.2.17)     fig|224308.1.peg.3498
324    ATP phosphoribosyltransferase (EC 2.4.2.17)     fig|169963.1.peg.563
325    .
326    .
327    .
328    </pre>
329    <br>
330    Note that all we have done here is to extract a column (rather than a row) from the
331    subsystem spreadsheet.
332    <p>
333    You might, of course, wish to consider corresponding genes as defined by sequence similarity,
334    rather than using a subsystem spreadsheet.  In this case, you need to get all of the genes
335    from other organisms that pass some similarity threshhold.  To do this, you might use something like
336    <br>
337    <pre>
338            echo 'fig|211586.1.peg.1886' | similar_to > genes.similar.to.1886
339    </pre>
340    <br>
341    This gives you the genes similar to <i>fig|211586.1.peg.1886</i>.  If you wished to get
342    the upstream regions for the original gene and its related genes, then
343    <pre>
344    (echo 'fig|211586.1.peg.1886';  echo 'fig|211586.1.peg.1886' | similar_to) | upstream
345    </pre>
346    would work (here we just use default parameters on <i>similar_to</i> and <i>upstream</i>).
347    To do this for an entire <i>Shewanella</i> genome I used
348    <pre>
349    % for i in `pegs 211586.1`
350    > do
351    > (echo "$i"; echo "$i" | similar_to | is_prokaryotic) | upstream upstream=500 plus=100 < tmp > "Output/$i"
352    > echo "$i"
353    > done
354    </pre>
355    while running in <i>bash</i>.  The <i>is_prokaryotic</i> filter just removes non-prokaryotic genes from the
356    input to <i>upstream</i>.
357    
358  <h2>Locating New Forms of Enzymes </h2>  <h2>Locating New Forms of Enzymes </h2>
359    
360  Researchers have discovered that many enzymes have multiple,  Researchers have discovered that many enzymes have multiple,

Legend:
Removed from v.1.1  
changed lines
  Added in v.1.2

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3