
Revision 1.5 - (download) (as text) (annotate)
Wed Jun 13 21:03:59 2012 UTC (7 years, 6 months ago) by salazar
Branch: MAIN
CVS Tags: HEAD
Changes since 1.4: +43 -37 lines
Created H2s

<h1>Towards a Controlled Vocabulary Part 3: Making the MOL Translation to the Abstract Function Vocabulary</h1>

<p>We have described how to define an abstract vocabulary of function using exemplars
  in <a href="exemplars.html">Part 1: Defining Exemplars</a>.
  Next, we described how to generate the SEED mapping from SEED IDs to the abstract
  vocabulary imposed by exemplars in
  <a href="mapping_to_exemplars.html">Part 2: Mapping to Exemplars</a>.
  
  We complete this discussion of how to reach consistent annotations in an abstract vocabulary by
  discussing how one might map MicrobesOnLine (MOL) features to the abstract vocabulary.
  <br><br>
The overall strategy follows two main steps, as explained in the sections below.</p>
<h2>Begin with the SEED Translation</h2>
<p>Begin with the SEED translation, represented as 4-tuples of the form <br>
</p>
<pre>
           [source,source-id,fid,exemplar]
<br>
  </pre>
where the <i>source</i> is <i>SEED</i>.  Here, the <i>fid</i> field is the KBase ID of
  the SEED <i>source-id</i>.  We begin by adding a fifth column containing the md5 values using
  
<br>
<pre>
           fids_to_proteins -c 3 < seed.translation.table > seed.translation.with.md5
<br>
  </pre>
and we sort them on the md5 value using
<br>
<pre>
            sort -u -k 5 seed.translation.with.md5 > sorted.seed.with.md5
<br> </pre>
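<p>To see what this sort does, the following minimal sketch runs it on three made-up rows
  in the 5-column layout (the IDs and md5 values are hypothetical, not real SEED data).
  Because <i>-u</i> compares only the sort key, a single row survives for each md5 value.
  The use of <i>LC_ALL=C</i> is our assumption, not part of the original command: it makes
  <i>sort</i> order the rows by character code, which is how the string comparisons in the
  Perl program order them unless a locale is in effect.</p>

```shell
# Toy rows in the [source, source-id, fid, exemplar, md5] layout.
# Every ID and md5 value here is invented for illustration.
tmp=$(mktemp -d)
printf 'SEED\tfig|1.1.peg.2\tkb|g.1.peg.2\tkb|g.9.peg.7\tbbbb\n'  > "$tmp/with.md5"
printf 'SEED\tfig|1.1.peg.1\tkb|g.1.peg.1\tkb|g.9.peg.5\taaaa\n' >> "$tmp/with.md5"
printf 'SEED\tfig|2.1.peg.4\tkb|g.2.peg.4\tkb|g.9.peg.5\taaaa\n' >> "$tmp/with.md5"

# Sort on column 5 (the md5); -u keeps one row per distinct key.
sorted=$(LC_ALL=C sort -u -k 5 "$tmp/with.md5")
printf '%s\n' "$sorted"

# The md5 column of the result, joined onto one line for inspection.
md5_column=$(printf '%s\n' "$sorted" | cut -f 5 | paste -s -d ' ' -)
printf '%s\n' "$md5_column"
rm -r "$tmp"
```

<p>Two of the three toy rows share the md5 <i>aaaa</i>, so only two rows remain after the sort.</p>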

<h2>Create a Table of Data Relating to the MOL Genomes</h2>
<p>Then, we create a table of data relating to the MOL genomes using <br>
</p>
<pre>
            echo MOL |
            get_relationship_Submitted -to id |
            genomes_to_fids peg |
            cut -f 3 |
            get_entity_Feature -f source_id |
            fids_to_proteins -c 1 2> /dev/null |
            fids_to_functions -c 1 |
            sort -k 3 > MOL.fid.source_id.md5.function
<br></pre>
<p>Now, we have two fairly large files, <i>sorted.seed.with.md5</i> and <i>MOL.fid.source_id.md5.function</i>,
  both sorted on md5 values.
  We can now make a single pass through the two files, compiling the data we will need to construct
  an initial translation file for MOL.  Essentially, as we pass through the two files, we detect
  cases in which an MOL fid has an md5 identical to that of a SEED fid; for each such case we take the corresponding
  exemplar and the MOL function and increment a count of the times this function occurred with
  that exemplar.  After the pass, we simply take, for each exemplar, the string that MOL most often
  used to describe the abstract function.  Exemplars for which more than one MOL function was counted
  are the inconsistent sets that could be written out to support
  MOL in detecting and correcting inconsistencies.
  The pass and the selection are accomplished by the following short Perl program:</p>
<h2>A Short Perl Program Supporting MOL in Correcting Inconsistencies</h2>
<pre>
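#
# make_MOL_translation.pl
#
use strict;
use warnings;

open(IN1,"&lt;","MOL.fid.source_id.md5.function") || die "could not open MOL.fid.source_id.md5.function: $!";
open(IN2,"&lt;","sorted.seed.with.md5")           || die "could not open sorted.seed.with.md5: $!";

# Both inputs are sorted on md5, so a single merge-style pass suffices.
my %counts;    # $counts{exemplar}->{MOL function} = number of md5 matches
my $row1 = &lt;IN1&gt;;
my $row2 = &lt;IN2&gt;;
while (defined($row1) and defined($row2))
{
    my($fid_MOL,$source_id_MOL,$md5_MOL,$function_MOL)
        = ($row1 =~ /^(\S+)\t(\S+)\t(\S+)\t(.*)/);
    my($source_id_SEED,$exemplar,$md5_SEED)
        = ($row2 =~ /^\S+\t(\S+)\t\S+\t(\S+)\t(\S+)/);

    # skip rows that do not have the expected columns
    if (! defined($md5_MOL))  { $row1 = &lt;IN1&gt;; next }
    if (! defined($md5_SEED)) { $row2 = &lt;IN2&gt;; next }

    if (($md5_MOL lt $md5_SEED) || hypo($function_MOL))
    {
        # MOL row is behind the SEED row, or its function is hypothetical
        $row1 = &lt;IN1&gt;;
    }
    elsif ($md5_MOL gt $md5_SEED)
    {
        $row2 = &lt;IN2&gt;;
    }
    else
    {
        # identical md5: credit this MOL function to the SEED exemplar
        $counts{$exemplar}->{$function_MOL}++;
        $row1 = &lt;IN1&gt;;
    }
}
close(IN1);
close(IN2);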

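# for each exemplar, print the MOL function seen most often;
# ties are broken alphabetically so that reruns give the same table
foreach my $exemplar (sort keys(%counts))
{
    my $c = $counts{$exemplar};
    my @funcs = sort { ($c->{$b} &lt;=&gt; $c->{$a}) or ($a cmp $b) } keys(%$c);
    print join("\t",($exemplar,$funcs[0])),"\n";
}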

# a modest attempt to catch most hypothetical roles
#
sub hypo {
    my ($func) = @_;
    if (! $func)                             { return 1 }
    $func =~ s/\s*\#.*$//;
    if ($func =~ /lmo\d+ protein/i)          { return 1 }
    if ($func =~ /hypoth/i)                  { return 1 }
    if ($func =~ /conserved protein/i)       { return 1 }
    if ($func =~ /gene product/i)            { return 1 }
    if ($func =~ /interpro/i)                { return 1 }
    if ($func =~ /B[sl][lr]\d/i)             { return 1 }
    if ($func =~ /^U\d/)                     { return 1 }
    if ($func =~ /^orf[^_]/i)                { return 1 }
    if ($func =~ /uncharacterized/i)         { return 1 }
    if ($func =~ /pseudogene/i)              { return 1 }
    if ($func =~ /^predicted/i)              { return 1 }
    if ($func =~ /AGR_/)                     { return 1 }
    if ($func =~ /similar to/i)              { return 1 }
    if ($func =~ /similarity/i)              { return 1 }
    if ($func =~ /glimmer/i)                 { return 1 }
    if ($func =~ /unknown/i)                 { return 1 }
    if (($func =~ /domain/i) ||
        ($func =~ /^y[a-z]{2,4}\b/i) ||
        ($func =~ /complete/i) ||
        ($func =~ /ensang/i) ||
        ($func =~ /unnamed/i) ||
        ($func =~ /EG:/) ||
        ($func =~ /orf\d+/i) ||
        ($func =~ /RIKEN/) ||
        ($func =~ /Expressed/i) ||
        ($func =~ /[a-zA-Z]{2,3}\|/) ||
        ($func =~ /predicted by Psort/) ||
        ($func =~ /^bh\d+/i) ||
        ($func =~ /cds_/i) ||
        ($func =~ /^[a-z]{2,3}\d+[^:\+\-0-9]/i) ||
        ($func =~ /similar to/i) ||
        ($func =~ / identi/i) ||
        ($func =~ /\bputative\b/i) ||
        ($func =~ /ortholog of/i) ||
        ($func =~ /structural feature/i))    { return 1 }
    return 0;
}

<br></pre>

The program outputs a set of 2-tuples:

<br><pre>
           [exemplar,MOL-function]
<br></pre>
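<p>As a cross-check on a small sample, a table of the same shape can be approximated with
  standard tools.  The sketch below is an alternative to the Perl program, not a replacement
  for it: it joins made-up MOL and SEED rows on md5 with <i>join</i>, counts each
  (exemplar, function) pair with <i>awk</i>, and keeps the most frequent function per exemplar.
  All IDs, md5 values, and function strings are hypothetical, and the filter for hypothetical
  roles is left out for brevity.</p>

```shell
tab=$(printf '\t')
tmp=$(mktemp -d)

# Toy MOL rows: [fid, source_id, md5, function] (all values invented).
{
  printf 'kb|g.5.peg.1\tVIMSS1\taaaa\tDNA polymerase I\n'
  printf 'kb|g.5.peg.2\tVIMSS2\tbbbb\tDNA polymerase I\n'
  printf 'kb|g.6.peg.9\tVIMSS3\tbbbb\tPolA\n'
  printf 'kb|g.6.peg.3\tVIMSS4\tcccc\tRibosomal protein L1\n'
  printf 'kb|g.6.peg.4\tVIMSS5\tdddd\tRibosomal protein S2\n'
} | LC_ALL=C sort -t "$tab" -k 3,3 > "$tmp/mol"

# Toy SEED rows: [source, source-id, fid, exemplar, md5] (all values invented).
{
  printf 'SEED\tfig|1.1.peg.1\tkb|g.1.peg.1\tkb|g.9.peg.5\taaaa\n'
  printf 'SEED\tfig|1.1.peg.2\tkb|g.1.peg.2\tkb|g.9.peg.5\tbbbb\n'
  printf 'SEED\tfig|1.1.peg.3\tkb|g.1.peg.3\tkb|g.9.peg.7\tdddd\n'
} | LC_ALL=C sort -t "$tab" -k 5,5 > "$tmp/seed"

# Join on md5 (MOL column 3, SEED column 5), emit [exemplar, MOL function],
# then count each pair and keep the most frequent function per exemplar.
table=$(LC_ALL=C join -t "$tab" -1 3 -2 5 -o 2.4,1.4 "$tmp/mol" "$tmp/seed" |
  awk -F '\t' '
    { n[$1 FS $2]++ }
    END {
      for (k in n) {
        split(k, p, FS)
        if (n[k] > best[p[1]]) { best[p[1]] = n[k]; top[p[1]] = p[2] }
      }
      for (e in top) print e FS top[e]
    }' | LC_ALL=C sort)
printf '%s\n' "$table"
rm -r "$tmp"
```

<p>In the toy data the MOL row with md5 <i>cccc</i> has no SEED counterpart and contributes nothing,
  just as unmatched rows are skipped in the single pass.</p>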

In our initial uses of the program, approximately 80% of the functional roles used in modelling
were included in the initial MOL translation table.
<br><br>
We have attempted to weed out hypothetical assignments from consideration.
We make no attempt to identify inconsistencies in the MOL assignments.  All we give is a table
that maps the abstract roles (exemplars) to an MOL function that is used at least as often as any other
to represent the role.
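<p>Someone who does want to look for inconsistencies could start from the same counts.
  The following is a minimal sketch under our own assumptions, not part of the program above:
  it takes a hypothetical file of (exemplar, MOL function) pairs, one per md5 match, and lists
  every exemplar that MOL describes with more than one function, most frequent function first.</p>

```shell
tmp=$(mktemp -d)

# Hypothetical [exemplar, MOL function] pairs, one line per md5 match.
printf 'kb|g.9.peg.5\tDNA polymerase I\n'      > "$tmp/pairs"
printf 'kb|g.9.peg.5\tDNA polymerase I\n'     >> "$tmp/pairs"
printf 'kb|g.9.peg.5\tPolA\n'                 >> "$tmp/pairs"
printf 'kb|g.9.peg.7\tRibosomal protein S2\n' >> "$tmp/pairs"

# Count each pair, track how many distinct functions each exemplar has,
# and print [exemplar, count, function] only for exemplars with several.
inconsistent=$(awk -F '\t' '
    { n[$1 FS $2]++; if (n[$1 FS $2] == 1) distinct[$1]++ }
    END {
      for (k in n) {
        split(k, p, FS)
        if (distinct[p[1]] > 1) print p[1] FS n[k] FS p[2]
      }
    }' "$tmp/pairs" |
  LC_ALL=C sort -t "$(printf '\t')" -k 1,1 -k 2,2nr)
printf '%s\n' "$inconsistent"
rm -r "$tmp"
```

<p>Only the first toy exemplar is reported, since the second is described by a single function.</p>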
