The Issue of a Controlled Vocabulary for Protein Function: Step 1

Much has been said about the desirability of a controlled vocabulary for protein function and how it might be achieved. One of the central goals of the KBase effort will be to develop detailed, predictive models of microbes. A consistent, controlled vocabulary of protein function will be needed to support automated generation of these models.

The KBase will include a major effort to automatically construct metabolic models of organisms directly from sequenced genomes. Much of the technology used by KBase will come directly from the SEED and Model SEED Projects which utilize the vocabulary established by the SEED Project. We wish to remove this dependency on the SEED voabulary as quickly as possible, and this short note sketches out the plan for achieving this and how to begin implementing it.

Limiting the Scope

Unifying the distinct vocabularies of function would require a major effort. However, if one circumscribes the goal to
a plan can be formulated that could be implemented by modest resources and good will.

So, how many functional roles are now used in the construction of metabolic models? It is clear that ultimately we wish to include regulatory models, and eventually all functional roles that occur in living organisms, but for now let us confine ourselves to the set of functional roles needed to support metabolic modeling. You can just use the command

to get the current list of functional roles used in metabolic models. Currently, there are about 2000-2500. This is a manageable number.

The Concept of Exemplars

The key to a rapid and straightforward path to supporting models built using differing vocabularies of function is the concept of exemplar. We choose a sequence for which Let's begin with connecting the roles used in models to literature. A simple, but somewhat slow, way to do this is using
         all_roles_used_in_models | 
	 roles_to_fids 2> roles.without.fids | 
	 fids_to_literature 2> > role.fid.pubmed 

This requires some explanation. First, many of the command-line scripts that cross from one entity type to another write input lines that could not be matched up to stderr. Thus, roles that cannot be connected to fids get written to roles.without.fids, and fids that cannot be lined to literature cause lines to be written to Roles that cannot be connected to fids result from functions that may have been renamed (without renaming the functional role) or roles that have been conjectured but simply have not yet been connect to specific genes. They represent a set of issues that need to be processed manually. When I last ran this command, I found 148 such roles.

If we now run,
      cut -f1,2 role.fid.pubmed | get_relationship_Produces -to id | sort -u > role.fid.md5

we get a table showing roles and fids, where each of the fids connnects to a least one pubmed reference. The cut picks up the first two fields (role and fid, dropping the pubmed id). Then the
        get_relationship_Produces -to id

is used to pick up the md5 value of the ProteinSequence corresponding to a fid. To understand how this works, you need to know that the relationship Produces connects Feature entities to ProteinSequence entities, and that the md5 hash values are used as IDs for the ProteinSequence entities. To see how this works, you might try running
        echo 'kb|g.0.peg.2659' | get_relationship_Produces -to id

and study what comes out. Now, let us look at a portion of the table representing fids that can be connected to a critical enzyme of glycolysis:
6-phosphofructokinase (EC	kb|g.0.peg.2659	        9f6606c2e93c6ac75fdc60dff2f54955
6-phosphofructokinase (EC	kb|g.1052.peg.2290	9f6606c2e93c6ac75fdc60dff2f54955
6-phosphofructokinase (EC	kb|g.1053.peg.2771	9f6606c2e93c6ac75fdc60dff2f54955
6-phosphofructokinase (EC	kb|g.1081.peg.3424	9f6606c2e93c6ac75fdc60dff2f54955
6-phosphofructokinase (EC     kb|g.1406.peg.856       1c183b0fa280f9dc25b4e88d234f10f6
6-phosphofructokinase (EC     kb|g.1445.peg.3274      9f6606c2e93c6ac75fdc60dff2f54955
6-phosphofructokinase (EC     kb|g.1478.peg.3901      9f6606c2e93c6ac75fdc60dff2f54955

We have over 200 distinct fids that all are believed to implement the functional role and can be linked to at least one pubmed reference. It is worth noting that the pubmed articles (Publication entities) are linked to ProteinSequence entities, and then through them to Features. This means that we may see many Fids that share identical protein sequence (note the md5 values in the initial output).

Now we are ready to illustrate the concept of an exemplar. Rather than saying
      the function of protein X is "6-phosphofructokinase (EC", 

we can say
      the function of protein X is the same as that of feature kb|g.0.peg.2659,
      which has a sequence with an md5 hash of 9f6606c2e93c6ac75fdc60dff2f54955.

Thus, we have an abstract function which we represent with a specific feature ID (kb|g.0.peg.2659). When we say "The function of feature X is the same as that of exemplar Y" anyone can look up the sequence of the exemplar in KBase, they can access any attached literature, and they can investigate the potential role of X in modelling. Further, they can do all this without arguing about how to precisely label the abstract function.

Now let us consider how we might use the data in the file role.fid.md5 to select an appropriate exemplar for each functional role. It really should not matter which ones we pick. However, we have chosen to use Escherichia coli and Bacillus subtilis features when possible. We consider these to be relatively stable. We begin by just getting a 3-column table

           cut -f1,2 role.fid.md5 | 
	   get_relationship_IsOwnedBy -to scientific_name | 
	   fids_to_functions -c 2 |
	   sort > role.fid.genome.function

We have tacked on the function of each fid because we wish to eliminate the use of multifunctional fids as exemplars. We do this by only looking at fids with functions that exactly match the role in the little program below. Now the simple program

my %exemplar;
while ($_ = <STDIN>)
    my($role,$fid,$genome_name,$function) = split(/\t/,$_);
    $function =~ s/\s*#.*$//;     # some annotators have appended comments
                                  # beginning with the hash; we remove these
                                  # before verifying that the function
                                  # matches the role (multifunctional proteins
                                  # should probably not be exemplars)
    if ($role eq $function)
        my $existing = $exemplar{$role};
        if ((! $existing) || &better($genome_name,$existing->[1]))
	    $exemplar{$role} = [$fid,$genome_name];

foreach my $role (sort keys(%exemplar))
    my($fid,$genome_name) = @{$exemplar{$role}};
    print join("\t",($role,$fid,$genome_name)),"\n";

sub better {
    my($x,$existing) = @_;

    return ((($x =~ /Escherichia coli/) && ($existing !~ /Escherichia coli/)) ||
	    (($x =~ /Bacillus subtilis/) && ($existing !~ /Bacillus subtilis/)));

can be used to select an initial set of exemplars.
          perl < role.fid.genome.function  > exemplars.with.literature

For each functional role that is used in construction of KBase models, and for which we can find a literature reference identifying a feature that implements the role, we have selected a feature to act as an exemplar. The feature has been the topic of at least one paper, and we believe that the paper supports the position that the exemplar feature implements the corresponding role.

We now have a set of functional roles that are represented by exemplars that are supported by literature references. We have captured the roles that cannot be connected to any fids in roles.without.fids. Finally, we are left with roles that are connected to fids, but not to any fid that we have connected to literature. The connections between roles and possible exemplars was captured in We need to delete from this file all roles for which exemplars have been chosen, and then select exemplars from those that are left. We can make fairly arbitrary choices to get initial exemplars. Perhaps the easiest way to make an initial choice is to just use
                perl exemplars.with.literature < | 
                get_relationship_IsOwnedBy -to scientific_name |
                fids_to_functions -c 2 | 
                sort | 
                perl >

Here the simple perl program to filter out roles that already have exemplars is just
my %roles_with_exemplars = map { $_ =~ /^([^\t\n]*)/; ($1 => 1) } `cat $ARGV[0]`;
while ($_ = )
    if (($_ =~ /^([^\t\n]+)/) && (! $roles_with_exemplars{$1}))
	print $_;

To get a list of the roles for which no exemplars could be chosen, we can use
    cat exemplars.with.literature > exemplars
    all_roles_used_in_models | perl exemplars > roles.without.exemplars

Out of the 2000-2500 roles used in the curent collection of models, there are often a small number for which no attachment to a sequence exists. These require manual curation and need to be continuously reviewed. In any event, we will have to curate the set of exemplars as new experiemental evidence becomes available.

The choice of exemplars constitutes step 1 of the process of reaching interchangable annotations via controlled vocabularies.