<h1>Towards a Controlled Vocabulary Part 1: Defining Exemplars</h1>

    Much has been said about the desirability of a controlled vocabulary for protein function
    and how it might be achieved. One of the central goals of the KBase effort will be
    to develop detailed, predictive models of microbes. A consistent, controlled vocabulary
    of protein function will be needed to support automated generation of these models.
    <br><br>
    The KBase will include a major effort to automatically construct metabolic models of
    organisms directly from sequenced genomes. Much of the technology used by
    KBase will come directly from the SEED and Model SEED Projects, which utilize the
    vocabulary established by the SEED Project. We wish to remove this dependency on
    the SEED vocabulary as quickly as possible, and this short note sketches out the plan
    for achieving this and how to begin
    implementing it.
    <br><br>
    <h2>Limiting the Scope</h2>
    Unifying the distinct vocabularies of function would require a major effort.
    However, if one circumscribes the goal to
    <ul>
    <li>identifying the functional roles needed to support modeling,
    <li>creating an abstract representation for each role in this limited set, and
    <li>building translation dictionaries to and from this set of abstract functions,
    </ul>
    <br>
    a plan can be formulated that could be implemented with modest resources and good will.
    <br><br>
    So, how many functional roles are now used in the construction of metabolic models?
    It is clear that ultimately we wish to include regulatory models, and eventually
    all functional roles that occur in living organisms, but for now let us confine
    ourselves to the set of functional roles needed to support metabolic modeling.
    You can just use the command

    <br><pre>
    all_roles_used_in_models
    <br></pre>
    to get the current list of functional roles used in metabolic models. Currently, there
    are about 2000-2500. This is a manageable number.

    <h2>The Concept of <i>Exemplars</i></h2>
    The key to a rapid and straightforward path to supporting models built using differing
    vocabularies of function is the concept of an <b>exemplar</b>. As an exemplar, we choose a sequence for which
    <ul>
    <li> we believe that we know the function of the sequence
    (ideally through experimentation reported in the literature), and
    <li> the sequence has been annotated by a number of groups (and we conjecture that they believe their
    functions to be reliable due to the literature).
    </ul>

    Let's begin by connecting the roles used in models to the literature. A simple, but somewhat slow,
    way to do this is
    <br><pre>
    all_roles_used_in_models |
    roles_to_fids 2> roles.without.fids |
    fids_to_literature 2> roles.fids.no.literature > role.fid.pubmed
    <br></pre>
    This requires some explanation. First, many of the command-line scripts that cross from
    one entity type to another write any input lines that could not be matched to <i>stderr</i>.
    Thus, roles that cannot be connected to fids get written to <i>roles.without.fids</i>, and
    fids that cannot be linked to literature cause lines to be written to <i>roles.fids.no.literature</i>.
    Roles that cannot be connected to fids result from
    functions that may have been renamed (without renaming the functional role) or
    roles that have been conjectured but simply have not yet been connected to specific genes. They represent
    a set of issues that need to be processed manually. When I last ran this command, I found
    148 such roles.
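    <br><br>
    The stdout/stderr convention just described can be illustrated with a small, self-contained
    shell sketch. The function <i>cross</i> below is a hypothetical stand-in, not a KBase command,
    and the lookup table, role names, and fid names are made-up placeholders. It also simplifies
    matters by allowing only one value per key.

```shell
# cross TABLE: hypothetical stand-in for a script that crosses from one
# entity type to another (it is NOT a real KBase command).
# TABLE is a tab-separated file of (key, value) pairs. Each stdin line whose
# last field is a key in TABLE goes to stdout with the value appended;
# every other stdin line goes to stderr unchanged.
cross() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR    { map[$1] = $2; next }     # first file: load the table
        ($NF in map) { print $0, map[$NF]; next }
                     { print | "cat 1>&2" }     # unmatched: send to stderr
    ' "$1" -
}

workdir=$(mktemp -d)
printf 'roleA\tfid1\nroleB\tfid2\n' > "$workdir/role2fid.tab"

# roleC has no entry in the table, so it ends up in the stderr file.
printf 'roleA\nroleC\nroleB\n' |
    cross "$workdir/role2fid.tab" > "$workdir/matched" 2> "$workdir/unmatched"

cat "$workdir/matched"
cat "$workdir/unmatched"
```

    Redirecting stderr with <i>2></i> at each stage of a pipeline therefore yields one file of
    unmatched lines per stage, while the matched lines flow on to the next command.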
    <br><br>
    If we now run
    <br><pre>
    cut -f1,2 role.fid.pubmed | get_relationship_Produces -to id | sort -u > role.fid.md5
    <br></pre>
    we get a table showing roles, fids, and md5 values, where each of the fids connects to at least
    one pubmed reference. The <i>cut</i> picks up the first two fields (<i>role</i> and
    <i>fid</i>, dropping the <i>pubmed id</i>). Then
    <br><pre>
    get_relationship_Produces -to id
    <br></pre>
    is used to pick up the md5 value of the ProteinSequence corresponding to a fid. To understand
    how this works, you need to know that the relationship <i>Produces</i> connects <i>Feature</i>
    entities to <i>ProteinSequence</i> entities, and that the md5 hash values are used as IDs for the
    <i>ProteinSequence</i> entities. To see this in practice, you might try running
    <br><pre>
    echo 'kb|g.0.peg.2659' | get_relationship_Produces -to id
    <br></pre>
    and study what comes out.

    Now, let us look at a portion of the table representing
    fids that can be connected to a critical enzyme of glycolysis:
    <br><pre>
    6-phosphofructokinase (EC 2.7.1.11) kb|g.0.peg.2659 9f6606c2e93c6ac75fdc60dff2f54955
    6-phosphofructokinase (EC 2.7.1.11) kb|g.1052.peg.2290 9f6606c2e93c6ac75fdc60dff2f54955
    6-phosphofructokinase (EC 2.7.1.11) kb|g.1053.peg.2771 9f6606c2e93c6ac75fdc60dff2f54955
    6-phosphofructokinase (EC 2.7.1.11) kb|g.1081.peg.3424 9f6606c2e93c6ac75fdc60dff2f54955
    6-phosphofructokinase (EC 2.7.1.11) kb|g.1406.peg.856 1c183b0fa280f9dc25b4e88d234f10f6
    6-phosphofructokinase (EC 2.7.1.11) kb|g.1445.peg.3274 9f6606c2e93c6ac75fdc60dff2f54955
    6-phosphofructokinase (EC 2.7.1.11) kb|g.1478.peg.3901 9f6606c2e93c6ac75fdc60dff2f54955
    .
    .
    .
    <br></pre>

    We have over 200 distinct fids, all of which are believed to implement the functional role
    and can be linked to at least one pubmed reference. It is worth noting that
    the pubmed articles (Publication entities) are linked to ProteinSequence entities, and then
    through them to Features. This means that we may see many fids that share an identical
    protein sequence (note the md5 values in the initial output).
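    <br><br>
    To see how many fids share each protein sequence, one could tally the rows of
    <i>role.fid.md5</i> by role and md5 value. The following is a minimal sketch using only
    standard tools. It assumes the file is tab-separated with columns [role,fid,md5]; the function
    name <i>count_fids_per_sequence</i> is hypothetical, and the sample file holds just three rows
    copied from the table above.

```shell
workdir=$(mktemp -d)
role='6-phosphofructokinase (EC 2.7.1.11)'

# Three sample rows (role, fid, md5) taken from the table shown above.
printf '%s\t%s\t%s\n' \
    "$role" 'kb|g.0.peg.2659'    9f6606c2e93c6ac75fdc60dff2f54955 \
    "$role" 'kb|g.1052.peg.2290' 9f6606c2e93c6ac75fdc60dff2f54955 \
    "$role" 'kb|g.1406.peg.856'  1c183b0fa280f9dc25b4e88d234f10f6 \
    > "$workdir/role.fid.md5"

# For each (role, md5) pair, print the role, the md5, and the number of
# fids that carry that exact protein sequence.
count_fids_per_sequence() {
    awk -F'\t' -v OFS='\t' '
        { n[$1 OFS $3]++ }
        END { for (k in n) print k, n[k] }
    ' "$1" | sort
}

count_fids_per_sequence "$workdir/role.fid.md5"
```

    With these three rows, one md5 value is shared by two fids and the other belongs to a single fid.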
    <br><br>
    Now we are ready to illustrate the concept of an exemplar. Rather than saying
    <br><pre>
    the function of protein X is "6-phosphofructokinase (EC 2.7.1.11)",
    <br></pre>
    we can say
    <br><pre>
    the function of protein X is the same as that of feature kb|g.0.peg.2659,
    which has a sequence with an md5 hash of 9f6606c2e93c6ac75fdc60dff2f54955.
    <br></pre>
    Thus, we have an <i>abstract function</i> which we represent with a specific
    feature ID (<i>kb|g.0.peg.2659</i>). When we say <i>"The function of feature X is the same as
    that of exemplar Y"</i>, anyone can look up the sequence of the exemplar in KBase, access any
    attached literature, and investigate the potential role of <i>X</i> in modelling. Further,
    they can do all this without arguing about how to precisely label the abstract function.
    <br><br>
    Now let us consider how we might use the data in the file <i>role.fid.md5</i> to select an appropriate
    exemplar for each functional role.
    It really should not matter which ones we pick. However, we have chosen to use <i>Escherichia coli</i> and
    <i>Bacillus subtilis</i> features when possible. We consider these to be relatively stable.
    We begin by just getting a 3-column table
    <br><pre>
    [role,fid,genome-name]
    <br></pre>
    using
    <br><pre>
    cut -f1,2 role.fid.md5 |
    get_relationship_IsOwnedBy -to scientific_name |
    fids_to_functions -c 2 |
    sort > role.fid.genome.function
    <br></pre>
    We have tacked on the function of each fid as a fourth column because we wish to avoid using
    multifunctional fids as exemplars. The little program below does this by considering only
    fids whose function exactly matches the role.
    Now the simple program
    <br><pre>

    # choose_exemplars.pl
    #
    # Reads tab-separated [role,fid,genome-name,function] lines from STDIN and
    # prints one tab-separated [role,fid,genome-name] line per role.
    use strict;
    use warnings;

    my %exemplar;
    while ($_ = &lt;STDIN&gt;)
    {
        chomp;
        my($role,$fid,$genome_name,$function) = split(/\t/,$_);
        next unless defined($function);   # skip lines with fewer than 4 fields
        $function =~ s/\s*#.*$//;   # some annotators have appended comments
                                    # beginning with the hash; we remove these
                                    # before verifying that the function
                                    # matches the role (multifunctional proteins
                                    # should probably not be exemplars)
        if ($role eq $function)
        {
            my $existing = $exemplar{$role};
            if ((! $existing) || better($genome_name,$existing->[1]))
            {
                $exemplar{$role} = [$fid,$genome_name];
            }
        }
    }

    foreach my $role (sort keys(%exemplar))
    {
        my($fid,$genome_name) = @{$exemplar{$role}};
        print join("\t",($role,$fid,$genome_name)),"\n";
    }

    # True if genome $x should displace the genome of the existing exemplar:
    # $x is E. coli and the existing one is not, or $x is B. subtilis and the
    # existing one is not. Note that an E. coli and a B. subtilis candidate can
    # each displace the other, so the last of them seen in the input wins.
    sub better {
        my($x,$existing) = @_;

        return ((($x =~ /Escherichia coli/) && ($existing !~ /Escherichia coli/)) ||
                (($x =~ /Bacillus subtilis/) && ($existing !~ /Bacillus subtilis/)));
    }
    <br></pre>
    can be used to select an initial set of exemplars:
    <br><pre>
    perl choose_exemplars.pl < role.fid.genome.function > exemplars.with.literature
    <br></pre>

    For each functional role that is
    used in the construction of KBase models, and for which we can find a literature reference
    identifying a feature that implements the role, we have selected a feature to act as an exemplar.
    The feature has been the topic of at least one paper, and we believe that the paper supports
    the position that the exemplar feature implements the corresponding role.
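    <br><br>
    Once such a table exists, it could serve as one half of a translation dictionary: a role label
    goes in and an exemplar fid comes out. The sketch below is only an illustration of that idea
    under stated assumptions. The function <i>translate_roles</i>, the input format
    [protein-id,role], the protein names, and the genome-name placeholder are all hypothetical; only
    the role and the fid <i>kb|g.0.peg.2659</i> come from the text above. Following the convention of
    the KBase scripts, lines whose role has no exemplar are written to stderr.

```shell
# translate_roles EXEMPLARS: hypothetical helper, not a KBase command.
# EXEMPLARS is a tab-separated [role,fid,genome-name] file.
# stdin holds tab-separated [protein-id,role] lines. For each line whose role
# has an exemplar, print [protein-id,exemplar-fid]; send the rest to stderr.
translate_roles() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR        { exemplar[$1] = $2; next }   # load the exemplar table
        ($2 in exemplar) { print $1, exemplar[$2]; next }
                         { print | "cat 1>&2" }
    ' "$1" -
}

workdir=$(mktemp -d)
printf '%s\t%s\t%s\n' \
    '6-phosphofructokinase (EC 2.7.1.11)' 'kb|g.0.peg.2659' 'placeholder genome name' \
    > "$workdir/exemplars.with.literature"

printf '%s\t%s\n' \
    'proteinX' '6-phosphofructokinase (EC 2.7.1.11)' \
    'proteinY' 'a role with no exemplar' |
    translate_roles "$workdir/exemplars.with.literature" \
        > "$workdir/translated" 2> "$workdir/untranslated"

cat "$workdir/translated"
```

    Here <i>proteinX</i> is reported as having the same function as <i>kb|g.0.peg.2659</i>, while the
    line for <i>proteinY</i> lands in the stderr file for later manual attention.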
    <br><br>
    We now have a set of functional roles that are represented by exemplars that are supported by
    literature references. We have captured the roles that cannot be connected to any fids in
    <i>roles.without.fids</i>. Finally, we are left with roles that are connected to fids, but
    not to any fid that we have connected to literature.
    The connections between these roles and possible exemplars were captured in <i>roles.fids.no.literature</i>.
    We need to delete from this file all roles for which exemplars have been chosen, and then
    select exemplars from those that are left.
    We can make fairly arbitrary choices to get initial exemplars.
    Perhaps the easiest way to make an initial choice is to just use
    <br><pre>
    perl delete_roles_with_exemplars.pl exemplars.with.literature < roles.fids.no.literature |
    get_relationship_IsOwnedBy -to scientific_name |
    fids_to_functions -c 2 |
    sort |
    perl choose_exemplars.pl > exemplars.for.no.lit.roles
    <br></pre>
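    Here the simple perl program to filter out roles that already have exemplars is just
    <br><pre>
    # delete_roles_with_exemplars.pl
    #
    # Usage: perl delete_roles_with_exemplars.pl exemplar-file &lt; input
    # Copies to STDOUT only the input lines whose role (first tab-separated
    # field) does not appear in the first column of exemplar-file.
    use strict;
    use warnings;

    open(my $fh, '&lt;', $ARGV[0]) or die "could not open $ARGV[0]: $!";
    my %roles_with_exemplars = map { /^([^\t\n]+)/ ? ($1 => 1) : () } &lt;$fh&gt;;
    close($fh);
    while ($_ = &lt;STDIN&gt;)
    {
        if (($_ =~ /^([^\t\n]+)/) && (! $roles_with_exemplars{$1}))
        {
            print $_;
        }
    }
    <br></pre>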
    To get a list of the roles for which no exemplars could be chosen, we can
    use
    <br><pre>
    cat exemplars.with.literature exemplars.for.no.lit.roles > exemplars
    all_roles_used_in_models | perl delete_roles_with_exemplars.pl exemplars > roles.without.exemplars
    <br></pre>
    Of the 2000-2500 roles used in the current collection of models, a small
    number typically have no attachment to any sequence. These require manual curation and need to be
    continuously reviewed.
    In any event, we will have to curate
    the set of exemplars as new experimental evidence becomes available.
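    <br><br>
    Because <i>exemplars</i> is formed by concatenating two files, it may be worth confirming that
    no role ended up with more than one exemplar. A minimal sketch using only standard tools
    follows; the function name <i>duplicate_roles</i>, the short file names, and the sample rows are
    hypothetical, and the sample deliberately contains one duplicated role.

```shell
# duplicate_roles FILE: print every role (first tab-separated column) that
# occurs on more than one line of FILE. Hypothetical helper, not a KBase command.
duplicate_roles() {
    cut -f1 "$1" | sort | uniq -d
}

workdir=$(mktemp -d)
printf 'roleA\tfid1\tgenome1\nroleB\tfid2\tgenome2\n' > "$workdir/with.lit"
printf 'roleB\tfid3\tgenome3\nroleC\tfid4\tgenome4\n' > "$workdir/no.lit"
cat "$workdir/with.lit" "$workdir/no.lit" > "$workdir/exemplars"

# roleB appears in both inputs, so it is reported.
duplicate_roles "$workdir/exemplars"
```

    When the second file has been produced through <i>delete_roles_with_exemplars.pl</i> as shown
    earlier, this check would be expected to print nothing.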
    <br><br>
    The choice of exemplars constitutes step 1 of the process of reaching
    interchangeable annotations via controlled vocabularies.
