<h1>Towards a Controlled Vocabulary Part 2: Mapping to Exemplars</h1>

In the first tutorial on creating and maintaining a controlled vocabulary of function
(<a href="exemplars.html">Part 1: Defining Exemplars</a>),
we discussed the creation of a set of exemplars. These exemplars allow us to make
statements like

<blockquote>
<i>The function of protein X is the same as that of exemplar E, where the exemplar
is the ID of a Feature stored in KBase.</i>
</blockquote>
<h2>Creating a Translation Table</h2>
<p>We now consider the issue of creating a translation table of 4-tuples <br>
</p>
<pre>
[source,source-id,fid,exemplar]
<br></pre>
that maps fids from some sources of annotations into the exemplars.
In these tuples, <i>source-id</i> is the ID used in the source database,
while <i>fid</i> is the registered KBase ID.
To be concrete,
we will construct these tables for both the MicrobesOnLine (MOL) genomes and the SEED genomes.
In each case we will also construct sets of inconsistencies that will need to be
resolved.
<br><br>
Let us begin by creating the translation table for the SEED. The strategy here is
as follows:
<ol>
<li>For each exemplar <b>E</b>, locate all SEED fids that have the same function as the KBase function
assigned to <b>E</b>. Call this set <b>S</b>.

<li>Then, for each SEED fid <b>F</b> in <b>S</b>, get all SEED fids whose md5 values are identical to that of <b>F</b>.
Call this set <b>FS</b>, and
form a 2-tuple: [<b>F</b>,<b>FS</b>].

<li>For each 2-tuple [<b>F</b>,<b>FS</b>], split <b>FS</b> into
<br><br>
<ul>
<li>those genes with a function identical to that of <b>E</b> and
<li>those genes with functions that differ from that of <b>E</b>.
</ul>
<br><br>
If a strict majority of the genes with a common md5 have a function identical to that of <b>E</b>, write tuples
<br><pre>
[SEED,SEED-id,fid,E]
<br></pre>

for the matching genes as part of the translation table,
and for each gene in the group whose function differs from that of the exemplar, write an entry of the form
<br><pre>
[SEED-id,fid,E]
<br></pre>
as a 3-tuple to the file <i>SEED.inconsistencies.1</i>. Otherwise, if at least one gene in the group matches, write
the entire group (matching and non-matching fids alike) as a single line to the file <i>SEED.inconsistencies.2</i>.
</ol>
<br><br>
This simple procedure constructs a mapping of the SEED fids to
the exemplars, a set of SEED fids that should probably be
automatically reassigned a function to match an exemplar (<i>SEED.inconsistencies.1</i>),
and a set of clear inconsistencies that need to be resolved (<i>SEED.inconsistencies.2</i>).
<br><br>
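The majority rule can be tried out on a toy group before running the real pipeline. The following is only a sketch under stated assumptions: the function name <i>classify_group</i>, the role name, and all of the IDs are made up for illustration, and the rows hold just [source-id, fid, function] for a single md5 group. Unlike <i>make_seed_translation.pl</i>, it does not strip comments from the function field.

```shell
# Toy illustration of the majority rule for ONE md5 group.
# All IDs and the role name below are hypothetical sample data.
# Input columns (tab-separated): source-id, fid, function
role="Alanine racemase (EC 5.1.1.1)"

classify_group() {
    # $1 = role of the exemplar; the rows of one md5 group arrive on stdin
    awk -F '\t' -v role="$1" '
        $3 == role { match_n++ }
        $3 != role { mismatch_n++ }
        END {
            if (match_n > mismatch_n) print "translate"          # strict majority matches
            else if (match_n > 0)     print "inconsistencies.2"  # tie or minority match
            else                      print "skip"               # nothing matched
        }'
}

verdict=$(printf '%s\t%s\t%s\n' \
    'fig|1.1.peg.1' 'kb|g.1.peg.1' 'Alanine racemase (EC 5.1.1.1)' \
    'fig|2.1.peg.7' 'kb|g.2.peg.7' 'Alanine racemase (EC 5.1.1.1)' \
    'fig|3.1.peg.9' 'kb|g.3.peg.9' 'hypothetical protein' | classify_group "$role")
echo "$verdict"
```

Here two of the three rows agree with the role, so the group would go to the translation table and the third row would be logged in <i>SEED.inconsistencies.1</i>. A one-to-one tie would instead send the whole group to <i>SEED.inconsistencies.2</i>.
<br><br>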
Here is how we implement this strategy:
<br><pre>
cat exemplars.with.literature exemplars.for.no.lit.roles > exemplars

cut -f1,2 exemplars |
roles_to_fids -c 1 |
fids_to_genomes | get_relationship_WasSubmittedBy -to id | grep "SEED$" | cut -f1,2,3 |
fids_to_proteins |
fids_to_functions -c 3 |
get_entity_Feature -c 3 -f source_id > role.exemplar.fid.md5.function.source_id

export TAB=`echo -e "\t"`
sort -t "$TAB" -k 4 role.exemplar.fid.md5.function.source_id |
perl make_seed_translation.pl > seed.translation.table
<br></pre>
where the program <i>make_seed_translation.pl</i> is given below.
Let us go through this somewhat complex set of commands one step at a time.
<br><pre>
cat exemplars.with.literature exemplars.for.no.lit.roles > exemplars
<br></pre>

This just concatenates the two sets of exemplars into a single file. Each line in this <i>exemplars</i>
file contains a 3-tuple
<br><pre>
[role,exemplar-fid,genome_name]
<br></pre>
These 3-tuples define our "abstract vocabulary of function".
Then, look at
<br><pre>
cut -f1,2 exemplars |
roles_to_fids -c 1 |
fids_to_genomes | get_relationship_WasSubmittedBy -to id | grep "SEED$" | cut -f1,2,3 |
<br></pre>
These three lines take the first two fields of the 3-tuples (dropping the <i>genome_name</i>) and
extend the table with fids that are believed to implement the role. The last line then appends
two helper columns (the genome of each fid and the source that submitted it), keeps only the
rows that end in <i>SEED</i>, and cuts the table back to its first three columns. This has
the effect of keeping only entries that originated in the SEED. The output will be 3-tuples
<br><pre>
[role,exemplar-fid,KBase-id-of-SEED-fid]
<br></pre>
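The filtering step uses only standard tools, so it can be checked on mock data without access to KBase. In this sketch the function name <i>keep_seed_rows</i>, the five-column layout [role, exemplar-fid, fid, genome, source], and every value in the rows are assumptions made for illustration; the real helper columns come from <i>fids_to_genomes</i> and <i>get_relationship_WasSubmittedBy</i>.

```shell
# Hypothetical 5-column rows: role, exemplar-fid, fid, genome, source
keep_seed_rows() {
    # keep rows whose last column ends in SEED, then drop the helper columns
    grep "SEED$" | cut -f1,2,3
}

filtered=$(printf '%s\t%s\t%s\t%s\t%s\n' \
    'RoleA' 'kb|g.0.peg.1' 'kb|g.1.peg.5' 'kb|g.1' 'SEED' \
    'RoleA' 'kb|g.0.peg.1' 'kb|g.2.peg.8' 'kb|g.2' 'MOL' \
    'RoleB' 'kb|g.0.peg.2' 'kb|g.3.peg.4' 'kb|g.3' 'SEED' | keep_seed_rows)
printf '%s\n' "$filtered"
```

Only the two rows submitted by SEED survive, each reduced to its first three columns. Note that the pattern is anchored to the end of the line, so it relies on the source being the final column.
<br><br>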
Then, we add columns for the md5 of the SEED-fid, the function of the SEED-fid, and
the SEED-id of the SEED-fid.
<br><pre>
fids_to_proteins |
fids_to_functions -c 3 |
get_entity_Feature -c 3 -f source_id > role.exemplar.fid.md5.function.source_id
<br></pre>
This gives
<br><pre>
[role,exemplar-fid,
KBase-id-of-SEED-fid,
md5-SEED-fid,
function-SEED-fid,
SEED-id]
<br></pre>
<h2>Generating the SEED Translations</h2>
<p>Finally, we sort the table on the md5 values and use a simple perl program to generate
the SEED translations: <br>
</p>
<pre>
export TAB=`echo -e "\t"`
sort -t "$TAB" -k 4 role.exemplar.fid.md5.function.source_id |
perl make_seed_translation.pl > seed.translation.table
<br></pre>
The <i>export</i> line is a minor ugliness needed to tell the sort command that tabs
are being used to delimit fields (this assumes use of the bash shell). By sorting the
tuples on md5 values (the fourth column), you group rows that represent the same protein sequence
(and should, hence, be consistently annotated). The program <i>make_seed_translation.pl</i>
just forms the groups of rows with the same md5 values, checks whether they are
consistently annotated (or can easily be made consistent), and writes out the
desired SEED translation as the 4-tuples

<br><pre>
[source,source-id,fid,exemplar]
<br></pre>
where <i>source</i> will always be <i>SEED</i>.
<br><br>
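The grouping effect of the sort is easy to see on a few mock rows. The sketch below is an illustration rather than the tutorial's own command: the function name <i>sort_by_md5</i> is hypothetical, the md5 values are shortened placeholders, and it makes two choices of its own, namely <i>-k4,4</i> to limit the key to the md5 column and <i>LC_ALL=C</i> for a locale-independent byte order.

```shell
# A tab character, obtained without relying on bash-specific echo -e
TAB=$(printf '\t')

sort_by_md5() {
    # -t sets the field delimiter; -k4,4 restricts the sort key to column 4 (md5)
    LC_ALL=C sort -t "$TAB" -k4,4
}

# Mock 6-column rows: role, exemplar, fid, md5, function, source_id
sorted_md5s=$(printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    'RoleA' 'ex1' 'fid1' 'ffff' 'RoleA' 'src1' \
    'RoleB' 'ex2' 'fid2' '0a0a' 'RoleB' 'src2' \
    'RoleA' 'ex1' 'fid3' 'ffff' 'RoleX' 'src3' \
    | sort_by_md5 | cut -f4)
printf '%s\n' "$sorted_md5s"
```

After sorting, the two rows with md5 <i>ffff</i> are adjacent, which is what lets a program read each group in a single pass. With <i>-k 4</i> alone the key runs from the fourth field to the end of the line; rows with equal md5 values still end up together, but <i>-k4,4</i> states the intent more directly.
<br><br>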
The last time that we generated the SEED translations, the program produced somewhat over
1.8 million tuples. These impose a relatively consistent set of annotations on the SEED
features.
<br><br>
Here is the program <i>make_seed_translation.pl</i> that actually generates the translation
tuples.
<br><pre>
# make_seed_translation.pl
#
# Reads rows sorted by md5 from STDIN; each row holds the tab-separated fields
# role, exemplar, fid, md5, function, source_id.
use strict;
use warnings;

open(OUT1,">","SEED.inconsistencies.1") || die "could not open SEED.inconsistencies.1: $!";
open(OUT2,">","SEED.inconsistencies.2") || die "could not open SEED.inconsistencies.2: $!";

# Pattern for one input row (the function field may be empty)
my $row = qr/(\S[^\t]*\S)\t(\S+)\t(\S+)\t(\S+)\t([^\t]*)\t(\S+)$/;

my $last = &lt;STDIN&gt;;
while ($last && ($last =~ $row))
{
    # The first row of each md5 group supplies the role and exemplar for the group
    my $role = $1;
    my $exemplar = $2;
    my $md5 = $4;
    my @match;
    my @mismatch;
    while ($last && ($last =~ $row) && ($4 eq $md5))
    {
        my $fid = $3;
        my $source_id = $6;
        my $function = $5;
        $function =~ s/\s*#.*$//;    # drop any trailing "# comment" from the function
        if ($function eq $role)
        {
            push(@match,[$source_id,$fid]);
        }
        else
        {
            push(@mismatch,[$source_id,$fid]);
        }
        $last = &lt;STDIN&gt;;
    }

    if (@match > @mismatch)
    {
        # A strict majority agrees with the exemplar: emit the translations
        # and log the minority as candidates for reassignment
        foreach my $pair (@match)
        {
            print join("\t",('SEED',@$pair,$exemplar)),"\n";
        }

        foreach my $pair (@mismatch)
        {
            print OUT1 join("\t",(@$pair,$exemplar)),"\n";
        }
    }
    else
    {
        # No majority: log the whole group on one line, but only if something matched
        if (@match > 0)
        {
            print OUT2 join("\t",map { @$_ } (@match,@mismatch)),"\n";
        }
    }
}
close(OUT1);
close(OUT2);
<br></pre>
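Once the program has run, a quick check of the output can catch malformed rows early. This is a minimal sketch, not part of the tutorial's tooling: the function name <i>check_translation_table</i> is hypothetical, the sample row uses made-up IDs, and the only properties tested are the ones stated above (four tab-separated fields, with <i>SEED</i> as the source).

```shell
check_translation_table() {
    # $1 = path to a translation table.
    # Succeeds only if every row has exactly 4 tab-separated fields
    # and the first field is SEED.
    awk -F '\t' 'NF != 4 || $1 != "SEED" { bad = 1 } END { exit (bad + 0) }' "$1"
}

# Try it on a one-row mock table (hypothetical IDs)
tmp=$(mktemp)
printf '%s\t%s\t%s\t%s\n' 'SEED' 'fig|1.1.peg.1' 'kb|g.1.peg.1' 'kb|g.0.peg.1' > "$tmp"
if check_translation_table "$tmp"; then status=ok; else status=bad; fi
rm -f "$tmp"
echo "$status"
```

In practice the same function could be pointed at <i>seed.translation.table</i>; a non-zero exit status would indicate that at least one row does not have the expected shape.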
