[Bio] / KBaseTutorials / Towards_a_controlled_vocabulary_of_function / mapping_to_exemplars.html Repository:
ViewVC logotype

Annotation of /KBaseTutorials/Towards_a_controlled_vocabulary_of_function/mapping_to_exemplars.html

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.1 - (view) (download) (as text)

1 : disz 1.1 <h1>Towards a Controlled Vocabulary: Mapping to Exemplars (Step 2)</h1>
2 :    
3 :     In the first tutorial relating to creating and maintaining a controlled vocabulary of function
4 :     (<a href="exemplars.html">The Issue of a Controlled Vocabular for Protein Function: Step 1</a>)
5 :     we discussed the creation of a set of exemplars. These exemplars allowed us to make
6 :     statements like
7 :    
8 :     <blockquote>
9 :     <i>The function of protein X is the same as that of exemplar E, where the exemplar
10 :     is the ID of a Feature stored in KBase.</i>
11 :     </blockquote>
12 :     We now consider the issue of creating a translation table
13 :     <br><pre>
14 :     [source,source-id,fid,exemplar]
15 :     <br></pre>
16 :     that maps fids from some sources of annotations into the exemplars.
17 :     In these tuples, <i>source_id</i> is the ID used in the source database,
18 :     while <i>fid</i> is the registered KBase ID.
19 :     To be concrete
20 :     we will construct these tables for both MicrobesOnLine (MOL) genomes and the SEED genomes.
21 :     In each case we will also construct sets of inconsistencies that will need to be
22 :     resolved.
23 :     <br><br>
24 :     Let us begin by creating the translation table for the SEED. The strategy here is
25 :     as follows:
26 :     <ol>
27 :     <li>For each exemplar <b>E</b>, locate all SEED fids that have the same function as the KBase function
28 :     assigned to <b>E</b>. Call this set <b>S</b>.
29 :    
30 :     <li>Then, for each SEED fid <b>F</b> in <b>S</b>, get all SEED fids that have identical md5 values.
31 :     Call this set <b>FS</b>.
32 :     Then,
33 :     form a 2-tuple: [<b>F</b>,<b>FS</b>].
34 :    
35 :     <li>For each two-tuple [<b>F</b>, <b>FS</b>], split <b>FS</b> into
36 :     <br><br>
37 :     <ul>
38 :     <li>those genes with function identical to that of <b>E</b> and
39 :     <li>those genes with functions that differ from <b>E</b>.
40 :     </ul>
41 :     <br><br>
42 :     If a majority of genes with a common md5 have a function identical to that of <b>E</b>, write tuples
43 :     <br><pre>
44 :     [SEED,SEED-id,fid,E]
45 :     <br></pre>
46 :    
47 :     as part of the translation table,
48 :     and for cases in which a fid has a distinct function from the exemplar, write entries of the form
49 :     <br><pre>
50 :     [SEED-id,fid,E]
51 :     <br></pre>
52 :     as a 3-tuple to the file <i>SEED.inconsistencies.1</i>. Otherwise, write
53 :     the entire set of inconsistent fids to the file <i>SEED.inconsistencies.2</i>.
54 :     </ol>
55 :     <br><br>
56 :     This simple procedure constructs a mapping of the SEED fids to
57 :     the exemplars, a set of SEED fids that should probably be
58 :     automatically reassigned a function to match an exemplar (<i>SEED.inconsistencies.1</i>),
59 :     and a set of clear inconsistencies that need to be resolved (<i>SEED.inconsistencies.2</i>).
60 :     <br><br>
61 :     Here is how we implement this strategy:
62 :     <br><pre>
63 :     cat exemplars.with.literature exemplars.for.no.lit.roles > exemplars
64 :    
65 :     cut -f1,2 exemplars |
66 :     roles_to_fids -c 1 |
67 :     fids_to_genomes | get_relationship_WasSubmittedBy -to id | grep "SEED$" | cut -f1,2,3 |
68 :     fids_to_proteins |
69 :     fids_to_functions -c 3 |
70 :     get_entity_Feature -c 3 -f source_id > role.exemplar.fid.md5.function.source_id
71 :    
72 :     export TAB=`echo -e "\t"`
73 :     sort -t "$TAB" -k 4 role.exemplar.fid.md5.function.source_id |
74 :     perl make_seed_translation.pl > seed.translation.table
75 :     <br></pre>
76 :     where <i>make_seed_translation.pl</i> program is given below.
77 :     Let us go through this somewhat complex set of commands one step at a time.
78 :     <br><pre>
79 :     cat exemplars.with.literature exemplars.for.no.lit.roles > exemplars
80 :     <br></pre>
81 :    
82 :     just concatenates the two sets of exemplars into a single file. The lines in this <i>exemplars</i>
83 :     file contain
84 :     <br><pre>
85 :     [role,exemplar-fid,genome_name]
86 :     <br></pre>
87 :     These 3-tuples define our "abstract vocabulary of function".
88 :     Then, look at
89 :     <br><pre>
90 :     cut -f1,2 exemplars |
91 :     roles_to_fids -c 1 |
92 :     fids_to_genomes | get_relationship_WasSubmittedBy -to id | grep "SEED$" | cut -f1,2,3 |
93 :     <br></pre>
94 :     These three lines take the first two fields of the 3-tuples (dropping the <i>genome_name</i>),
95 :     extend the table with fids that are believed to implement the role, and then the last line has
96 :     the effect of keeping only entries that originated in the SEED. The output will be 3-tuples
97 :     <br><pre>
98 :     [role,exemplar-fid,KBase-id-of-SEED-fid]
99 :     <br></pre>
100 :     Then, we add columns for the md5 of the SEED-fid, the function of the SEED-fid, and
101 :     the SEED-id of the SEED-fid.
102 :     <br><pre>
103 :     fids_to_proteins |
104 :     fids_to_functions -c 3 |
105 :     get_entity_Feature -c 3 -f source_id > role.exemplar.fid.md5.function.source_id
106 :     <br></pre>
107 :     This gives
108 :     <br><pre>
109 :     [role,exemplar-fid,
110 :     KBase-id-of-SEED-fid,
111 :     md5-SEED-fid,
112 :     function-SEED-fid,
113 :     SEED-id]
114 :     <br></pre>
115 :     Finally, we sort the table on the md5 values and use a simple perl program to generate
116 :     the SEED translations:
117 :    
118 :     <br><pre>
119 :     export TAB=`echo -e "\t"`
120 :     sort -t "$TAB" -k 4 role.exemplar.fid.md5.function.source_id |
121 :     perl make_seed_translation.pl > seed.translation.table
122 :     <br></pre>
123 :     The <i>export</i> is a minor ugliness needed to tell the sort command that tabs
124 :     are being used to delimit fields (this assumes use of the bash shell). By sorting the
125 :     tuples on md5 values, you group rows that represent the same protein sequence
126 :     (and should, hence, be consistently annotated). The program <i>make_seed_translation.pl</i>
127 :     just forms the groups of rows with the same md5 values, checks to verify if they are
128 :     consistently annotated (or can easily be made to be consistent), and writes out the
129 :     desired SEED translation as the 4-tuples
130 :    
131 :     <br><pre>
132 :     [source,source-id,fid,exemplar]
133 :     <br></pre>
134 :     where <i>source</i> will always be <i>SEED</i>.
135 :     <br><br>
136 :     The last time that we generated the SEED translations, the program produced somewhat over
137 :     1.8 million tuples. These impose a relatively consistent set of annotations on the SEED
138 :     features.
139 :     <br><br>
140 :     Here is the program <i>make_seed_translation.pl</i> that actually generates the translation
141 :     tuples.
142 :     <br><pre>
143 :     # make_seed_translation.pl
144 :     #
145 :     open(OUT1,">","SEED.inconsistencies.1") || die "could not open SEED.inconsistencies.1";
146 :     open(OUT2,">","SEED.inconsistencies.2") || die "could not open SEED.inconsistencies.2";
147 :    
148 :     my $last = &lt;STDIN&gt;;
149 :     while ($last && ($last =~ /(\S[^\t]*\S)\t(\S+)\t(\S+)\t(\S+)\t([^\t]*)\t(\S+)$/))
150 :     {
151 :     my $role = $1;
152 :     my $exemplar = $2;
153 :     my $md5 = $4;
154 :     my @match;
155 :     my @mismatch;
156 :     while ($last && ($last =~ /(\S[^\t]*\S)\t(\S+)\t(\S+)\t(\S+)\t([^\t]*)\t(\S+)$/) && ($4 eq $md5))
157 :     {
158 :     my $fid = $3;
159 :     my $source_id = $6;
160 :     my $function = $5;
161 :     $function =~ s/\s*#.*$//;
162 :     if ($function eq $role)
163 :     {
164 :     push(@match,[$source_id,$fid]);
165 :     }
166 :     else
167 :     {
168 :     push(@mismatch,[$source_id,$fid]);
169 :     }
170 :     $last = &lt;STDIN&gt;;
171 :     }
172 :    
173 :     if (@match > @mismatch)
174 :     {
175 :     foreach my $_ (@match)
176 :     {
177 :     print join("\t",('SEED',@$_,$exemplar)),"\n";
178 :     }
179 :    
180 :     foreach $_ (@mismatch)
181 :     {
182 :     print OUT1 join("\t",(@$_,$exemplar)),"\n";
183 :     }
184 :     }
185 :     else
186 :     {
187 :     if (@match > 0)
188 :     {
189 :     print OUT2 join("\t",map { @$_ } (@match,@mismatch)),"\n";
190 :     }
191 :     }
192 :     }
193 :     close(OUT1);
194 :     close(OUT2);
195 :    
196 :     <br></pre>

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3