Towards a Controlled Vocabulary: Mapping to Exemplars (Step 2)

In the first tutorial relating to creating and maintaining a controlled vocabulary of function (The Issue of a Controlled Vocabular for Protein Function: Step 1) we discussed the creation of a set of exemplars. These exemplars allowed us to make statements like
The function of protein X is the same as that of exemplar E, where the exemplar is the ID of a Feature stored in KBase.
We now consider the issue of creating a translation table
           [source,source-id,fid,exemplar]

that maps fids from some sources of annotations into the exemplars. In these tuples, source_id is the ID used in the source database, while fid is the registered KBase ID. To be concrete we will construct these tables for both MicrobesOnLine (MOL) genomes and the SEED genomes. In each case we will also construct sets of inconsistencies that will need to be resolved.

Let us begin by creating the translation table for the SEED. The strategy here is as follows:
  1. For each exemplar E, locate all SEED fids that have the same function as the KBase function assigned to E. Call this set S.
  2. Then, for each SEED fid F in S, get all SEED fids that have identical md5 values. Call this set FS. Then, form a 2-tuple: [F,FS].
  3. For each two-tuple [F, FS], split FS into



    If a majority of genes with a common md5 have a function identical to that of E, write tuples
        [SEED,SEED-id,fid,E] 
    
    as part of the translation table, and for cases in which a fid has a distinct function from the exemplar, write entries of the form
        [SEED-id,fid,E] 
    
    as a 3-tuple to the file SEED.inconsistencies.1. Otherwise, write the entire set of inconsistent fids to the file SEED.inconsistencies.2.


This simple procedure constructs a mapping of the SEED fids to the exemplars, a set of SEED fids that should probably be automatically reassigned a function to match an exemplar (SEED.inconsistencies.1), and a set of clear inconsistencies that need to be resolved (SEED.inconsistencies.2).

Here is how we implement this strategy:
            cat exemplars.with.literature exemplars.for.no.lit.roles > exemplars

            cut -f1,2 exemplars |
            roles_to_fids -c 1 |
            fids_to_genomes | get_relationship_WasSubmittedBy -to id | grep "SEED$" | cut -f1,2,3 |
            fids_to_proteins |
            fids_to_functions -c 3 |
            get_entity_Feature -c 3 -f source_id > role.exemplar.fid.md5.function.source_id

            export TAB=`echo -e "\t"`
            sort -t "$TAB" -k 4 role.exemplar.fid.md5.function.source_id |
            perl make_seed_translation.pl > seed.translation.table

where make_seed_translation.pl program is given below. Let us go through this somewhat complex set of commands one step at a time.
            cat exemplars.with.literature exemplars.for.no.lit.roles > exemplars

just concatenates the two sets of exemplars into a single file. The lines in this exemplars file contain
            [role,exemplar-fid,genome_name]

These 3-tuples define our "abstract vocabulary of function". Then, look at
            cut -f1,2 exemplars |
            roles_to_fids -c 1 |
            fids_to_genomes | get_relationship_WasSubmittedBy -to id | grep "SEED$" | cut -f1,2,3 |

These three lines take the first two fields of the 3-tuples (dropping the genome_name), extend the table with fids that are believed to implement the role, and then the last line has the effect of keeping only entries that originated in the SEED. The output will be 3-tuples
            [role,exemplar-fid,KBase-id-of-SEED-fid]

Then, we add columns for the md5 of the SEED-fid, the function of the SEED-fid, and the SEED-id of the SEED-fid.
            fids_to_proteins |
            fids_to_functions -c 3 |
            get_entity_Feature -c 3 -f source_id > role.exemplar.fid.md5.function.source_id

This gives
            [role,exemplar-fid,
	     KBase-id-of-SEED-fid,
	     md5-SEED-fid,
	     function-SEED-fid,
	     SEED-id]

Finally, we sort the table on the md5 values and use a simple perl program to generate the SEED translations:
            export TAB=`echo -e "\t"`
            sort -t "$TAB" -k 4 role.exemplar.fid.md5.function.source_id |
            perl make_seed_translation.pl > seed.translation.table

The export is a minor ugliness needed to tell the sort command that tabs are being used to delimit fields (this assumes use of the bash shell). By sorting the tuples on md5 values, you group rows that represent the same protein sequence (and should, hence, be consistently annotated). The program make_seed_translation.pl just forms the groups of rows with the same md5 values, checks to verify if they are consistently annotated (or can easily be made to be consistent), and writes out the desired SEED translation as the 4-tuples
           [source,source-id,fid,exemplar]

where source will always be SEED.

The last time that we generated the SEED translations, the program produced somewhat over 1.8 million tuples. These impose a relatively consistent set of annotations on the SEED features.

Here is the program make_seed_translation.pl that actually generates the translation tuples.
# make_seed_translation.pl
#
open(OUT1,">","SEED.inconsistencies.1") || die "could not open SEED.inconsistencies.1";
open(OUT2,">","SEED.inconsistencies.2") || die "could not open SEED.inconsistencies.2";

my $last = <STDIN>;
while ($last && ($last =~ /(\S[^\t]*\S)\t(\S+)\t(\S+)\t(\S+)\t([^\t]*)\t(\S+)$/))
{
    my $role     = $1;
    my $exemplar = $2;
    my $md5      = $4;
    my @match;
    my @mismatch;
    while ($last && ($last =~ /(\S[^\t]*\S)\t(\S+)\t(\S+)\t(\S+)\t([^\t]*)\t(\S+)$/) && ($4 eq $md5))
    {
	my $fid        = $3;
	my $source_id  = $6;
	my $function   = $5;
	$function    =~ s/\s*#.*$//;
	if ($function eq $role) 
	{
	    push(@match,[$source_id,$fid]);
	}
	else
	{
	    push(@mismatch,[$source_id,$fid]);
	}
        $last = <STDIN>;
    }

    if (@match > @mismatch)
    {
	foreach my $_ (@match)
	{
	    print join("\t",('SEED',@$_,$exemplar)),"\n";
	}

	foreach $_ (@mismatch)
	{
	    print OUT1 join("\t",(@$_,$exemplar)),"\n";
	}
    }
    else
    {
	if (@match > 0)
	{
	    print OUT2 join("\t",map { @$_ } (@match,@mismatch)),"\n";
	}
    }
}
close(OUT1);
close(OUT2);