Revision 1.20
1 : | parrello | 1.1 | #!/usr/bin/perl -w |
2 : | olson | 1.16 | # |
3 : | # Copyright (c) 2003-2006 University of Chicago and Fellowship | ||
4 : | # for Interpretations of Genomes. All Rights Reserved. | ||
5 : | # | ||
6 : | # This file is part of the SEED Toolkit. | ||
7 : | # | ||
8 : | # The SEED Toolkit is free software. You can redistribute | ||
9 : | # it and/or modify it under the terms of the SEED Toolkit | ||
10 : | # Public License. | ||
11 : | # | ||
12 : | # You should have received a copy of the SEED Toolkit Public License | ||
13 : | # along with this program; if not write to the University of Chicago | ||
14 : | # at info@ci.uchicago.edu or the Fellowship for Interpretation of | ||
15 : | # Genomes at veronika@thefig.info or download a copy from | ||
16 : | # http://www.theseed.org/LICENSE.TXT. | ||
17 : | # | ||
18 : | |||
19 : | parrello | 1.1 | |
20 : | =head1 Add / Delete / Change Features | ||
21 : | |||
22 : | This script runs through a set of transaction files, adding, deleting, and changing | ||
23 : | features in the FIG data store. The command takes three input parameters. The first is | ||
24 : | a command. The second specifies a directory full of transaction files. The third | ||
25 : | specifies a file that tells us which feature IDs are available for each organism. | ||
26 : | |||
27 : | parrello | 1.14 | C<TransactFeatures> [I<options>] I<command> I<transactionDirectory> I<idFile> |
28 : | parrello | 1.1 | |
29 : | The supported commands are | ||
30 : | |||
31 : | =over 4 | ||
32 : | |||
33 : | =item count | ||
34 : | |||
35 : | Count the number of IDs needed to process the ADD and CHANGE transactions. This | ||
36 : | parrello | 1.15 | will produce a listing of the number of feature IDs needed for each |
37 : | parrello | 1.1 | organism and feature type. This command is mostly a sanity check: it provides |
38 : | useful statistics without changing anything. | ||
39 : | |||
40 : | =item register | ||
41 : | |||
42 : | Create an ID file by requesting IDs from the clearinghouse. This performs the | ||
43 : | same function as B<count>, but takes the additional step of creating an ID | ||
44 : | file that can be used to process the transactions. | ||
45 : | |||
46 : | =item process | ||
47 : | |||
48 : | parrello | 1.14 | Process the transactions and update the FIG data store. This will also update |
49 : | the NR file and queue features for similarity generation. | ||
50 : | parrello | 1.1 | |
51 : | parrello | 1.14 | =item fudge |
52 : | parrello | 1.4 | |
53 : | parrello | 1.14 | Convert transactions that have already been applied to new transactions that can |
54 : | be used to test the transaction processor. | ||
55 : | parrello | 1.10 | |
56 : | parrello | 1.1 | =back |
57 : | |||
58 : | =head2 The Transaction File | ||
59 : | |||
60 : | Each transaction file is a standard tab-delimited file, one transaction per line. The | ||
61 : | name of the file is C<tbl_diff_>I<org> where I<org> is an organism ID. All records in | ||
62 : | the transaction file refer to transactions against the organism encoded in the file | ||
63 : | name. | ||
64 : | |||
65 : | The file must specify IDs for new features, but the real IDs cannot be known until | ||
66 : | they are requested from the SEED clearinghouse. Therefore, each new ID is specified | ||
67 : | in a special format consisting of the feature type (C<peg>, C<rna>, and so forth) | ||
68 : | followed by a dot and the 0-based ordinal number of the new ID within that | ||
69 : | feature type. So, for example, if the transaction file consists of a delete, | ||
70 : | a change, and two adds, it might look like this | ||
71 : | |||
72 : | delete fig|83333.1.peg.2 | ||
73 : | change fig|83333.1.peg.6 peg.0 ... | ||
74 : | add peg.1 ... | ||
75 : | add rna.0 ... | ||
76 : | |||
77 : | Note that the old feature IDs do not participate in the numbering process, and the RNA | ||
78 : | numbering is independent of the PEG numbering. In the discussion below of transaction | ||
79 : | types, a field named I<newID> will always indicate one of these type/number pairs. | ||
80 : | So, the field setup for the B<change> command is | ||
81 : | |||
82 : | change fid newID locations aliases translation | ||
83 : | |||
84 : | And the I<newID> corresponds to the C<peg.0> in the example above. | ||
85 : | |||
86 : | The first field of each record is the transaction type. The list of subsequent fields | ||
87 : | depends on this type. | ||
88 : | |||
89 : | =over 4 | ||
90 : | |||
91 : | =item DELETE fid | ||
92 : | |||
93 : | Deletes a feature. The feature is marked as deleted in the FIG database, which | ||
94 : | causes it to be skipped or ignored by most of the SEED software. The ID of the | ||
95 : | feature to be deleted is the second field (I<fid>). | ||
96 : | |||
97 : | =item ADD newID locations translation | ||
98 : | |||
99 : | Adds a new feature. The I<newID> indicates the feature type and its ordinal number. | ||
100 : | The location is a comma-separated list of location strings. The translation is the | ||
101 : | protein translation for the location. If the translation is omitted, then it will | ||
102 : | be generated from the location information in the normal way. | ||
103 : | |||
104 : | =item CHANGE fid newID locations aliases translation | ||
105 : | |||
106 : | Changes an existing feature. The current copy of the feature is marked as deleted, | ||
107 : | and a new feature is created with a new ID. All annotations and assignments are | ||
108 : | transferred from the deleted feature to the new one. The location is a | ||
109 : | comma-separated list of location strings. The aliases are specified as a comma-delimited | ||
110 : | list of alternate names for the feature. These replace any existing aliases for the | ||
111 : | old feature. If the alias list is omitted, no aliases will be assigned to the new | ||
112 : | feature. The translation is the protein translation for the location. If the | ||
113 : | translation is omitted, then it will be generated from the location information in the | ||
114 : | normal way. | ||
115 : | |||
116 : | =back | ||
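The field layouts above can be sketched as a small parser. This is only an illustration, not the code in C<TransactionProcessor>: the helper name C<parse_transaction>, the C<%LAYOUT> table, and the location string C<contigA_100_400> are hypothetical. The alias handling mirrors what the main loop of this script does when the B<noAlias> option is in effect.

```perl
use strict;
use warnings;

# Field names for each transaction type, in the order documented above.
my %LAYOUT = (
    delete => [qw(fid)],
    add    => [qw(newID locations translation)],
    change => [qw(fid newID locations aliases translation)],
);

# Hypothetical helper: turn one tab-delimited record into a hash of named
# fields. Pass a true $noAlias when CHANGE records carry no alias column.
sub parse_transaction {
    my ($line, $noAlias) = @_;
    chomp $line;
    my ($command, @fields) = split /\t/, $line;
    die "Empty transaction record\n" unless defined $command;
    $command = lc $command;
    my $layout = $LAYOUT{$command}
        or die "Invalid command $command\n";
    # Mirror the script: insert an empty alias field before the translation.
    splice @fields, 3, 0, "" if $command eq 'change' && $noAlias;
    my %record = (type => $command);
    for my $i (0 .. $#$layout) {
        # Trailing fields may be omitted; treat them as empty strings.
        $record{ $layout->[$i] } = defined $fields[$i] ? $fields[$i] : "";
    }
    return \%record;
}

# A CHANGE record without an alias column (placeholder location and translation).
my $rec = parse_transaction(
    "CHANGE\tfig|83333.1.peg.6\tpeg.0\tcontigA_100_400\tMKV", 1);
print "$rec->{type} $rec->{fid} -> $rec->{newID}, aliases='$rec->{aliases}'\n";
```

Keeping the layouts in one table means the transaction type is validated and the fields are named in a single step, so an omitted trailing field shows up as an empty string rather than an undefined value.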
117 : | |||
118 : | =head2 The ID File | ||
119 : | |||
120 : | The ID file is a tab-delimited file containing one record for each feature type | ||
121 : | of each organism that has a transaction file. Each record consists of three | ||
122 : | fields. | ||
123 : | |||
124 : | =over 4 | ||
125 : | |||
126 : | =item orgID | ||
127 : | |||
128 : | The ID of the organism being updated. | ||
129 : | |||
130 : | =item ftype | ||
131 : | |||
132 : | The relevant feature type. | ||
133 : | |||
134 : | =item firstNumber | ||
135 : | |||
136 : | The first available ID number for the organism and feature type. | ||
137 : | |||
138 : | =back | ||
139 : | |||
140 : | This file's primary purpose is to tell us how to create the feature IDs | ||
141 : | for features we'll be adding to the data store, whether via a straight | ||
142 : | B<add> or a B<change> that deletes an old ID and recreates the feature with a | ||
143 : | new ID. | ||
144 : | |||
145 : | If we need new IDs for an organism not listed in this ID file, an error will be | ||
146 : | thrown. | ||
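To make the relationship between the ID file and the I<newID> placeholders concrete, here is a minimal sketch. It ASSUMES that the real ID number is the record's I<firstNumber> plus the placeholder's 0-based ordinal; the text implies this but does not state it. The helper names C<load_id_records> and C<resolve_new_id> and the numbers 4500 and 120 are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: load ID-file records into $first->{$orgID}{$ftype}.
sub load_id_records {
    my (@lines) = @_;
    my %first;
    for my $line (@lines) {
        chomp $line;
        my ($orgID, $ftype, $firstNumber) = split /\t/, $line;
        $first{$orgID}{$ftype} = $firstNumber;
    }
    return \%first;
}

# Hypothetical helper: map a placeholder such as "peg.1" to a real feature ID.
# ASSUMPTION: real number = firstNumber + 0-based ordinal.
sub resolve_new_id {
    my ($first, $orgID, $newID) = @_;
    my ($ftype, $ordinal) = $newID =~ /^(\w+)\.(\d+)$/
        or die "Malformed new ID $newID\n";
    # Check the organism first so the lookup does not create an empty entry.
    my $base = exists $first->{$orgID} ? $first->{$orgID}{$ftype} : undef;
    die "No IDs available for $ftype in organism $orgID\n" unless defined $base;
    return "fig|$orgID.$ftype." . ($base + $ordinal);
}

# Illustrative ID-file records; the numbers are made up.
my $first = load_id_records("83333.1\tpeg\t4500", "83333.1\trna\t120");
print resolve_new_id($first, '83333.1', 'peg.1'), "\n";
print resolve_new_id($first, '83333.1', 'rna.0'), "\n";
```

As in the script, a request for an organism or feature type that has no record is treated as a fatal error rather than silently guessed.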
147 : | |||
148 : | =head2 Command-Line Options | ||
149 : | |||
150 : | The command-line options for this script are as follows. | ||
151 : | |||
152 : | =over 4 | ||
153 : | |||
154 : | =item trace | ||
155 : | |||
156 : | Numeric trace level. A higher trace level causes more messages to appear. The | ||
157 : | default trace level is 2. | ||
158 : | |||
159 : | parrello | 1.2 | =item safe |
160 : | |||
161 : | Wrap each organism's processing in a database transaction. This makes the process | ||
162 : | slightly more restartable than it would be otherwise. | ||
163 : | |||
164 : | parrello | 1.5 | =item noAlias |
165 : | |||
166 : | Assume that the transaction files do not contain aliases. This means that in CHANGE | ||
167 : | records the translation will immediately follow the location. | ||
168 : | |||
169 : | parrello | 1.8 | =item sql |
170 : | |||
171 : | Trace SQL commands. | ||
172 : | |||
173 : | parrello | 1.15 | =item tblFiles |
174 : | |||
175 : | Output TBL files containing the corrected IDs. (B<process> command only) | ||
176 : | |||
177 : | parrello | 1.12 | =item start |
178 : | |||
179 : | ID of the first genome to process. This allows restarting a transaction run that failed | ||
180 : | in the middle. The default is to run all transaction files. | ||
181 : | |||
182 : | parrello | 1.8 | =back |
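The effect of the B<start> option can be sketched as follows. The helper name C<select_genomes> is hypothetical, but the file-name pattern and the C<ge> comparison are the ones this script uses. Note that C<ge> compares strings, not numbers, so genome C<9.1> counts as coming after C<83333.1>. This matches the string sort the script uses to order the genomes.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the genome IDs to process from a list of file names.
sub select_genomes {
    my ($start, @fileNames) = @_;
    my @selected;
    for my $name (@fileNames) {
        # Only names of the form tbl_diff_<genome ID> are transaction files.
        next unless $name =~ /^tbl_diff_(\d+\.\d+)$/;
        # "ge" is a string comparison, as in the script. The default start
        # value of a single space sorts before every digit.
        push @selected, $1 if $1 ge $start;
    }
    return sort @selected;
}

# Default start value: every transaction file is selected.
my @all  = select_genomes(' ', qw(tbl_diff_83333.1 tbl_diff_9.1 README));
# Restart at 83333.1: genome 100226.1 sorts earlier as a string and is skipped.
my @some = select_genomes('83333.1',
    qw(tbl_diff_100226.1 tbl_diff_83333.1 tbl_diff_9.1));
print "@all\n";
print "@some\n";
```

When restarting a failed run, the value to pass is therefore the ID of the genome that failed, since every genome that sorts before it as a string has already been processed.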
183 : | |||
184 : | parrello | 1.1 | =cut |
185 : | |||
186 : | use strict; | ||
187 : | use Tracer; | ||
188 : | use Cwd; | ||
189 : | use File::Copy; | ||
190 : | use File::Path; | ||
191 : | use FIG; | ||
192 : | use Stats; | ||
193 : | parrello | 1.3 | use TransactionProcessor; |
194 : | use ApplyTransactions; | ||
195 : | use CountTransactions; | ||
196 : | parrello | 1.14 | use FudgeTransactions; |
197 : | parrello | 1.1 | |
198 : | # Get the command-line options. | ||
199 : | parrello | 1.20 | my ($options, @parameters) = StandardSetup(["FIG"], |
200 : | parrello | 1.18 | { safe => [0, "use database transactions"], |
201 : | parrello | 1.19 | trace => [2, "trace level"], |
202 : | parrello | 1.18 | noAlias => [0, "do not expect aliases in CHANGE transactions"], |
203 : | start => [' ', "start with this genome"], | ||
204 : | tblFiles => [0, "output TBL files containing the corrected IDs"] }, | ||
205 : | "command transactionDirectory IDfile", | ||
206 : | @ARGV); | ||
207 : | parrello | 1.8 | # Get the command. |
208 : | my $mainCommand = lc shift @parameters; | ||
209 : | parrello | 1.1 | # Get the FIG object. |
210 : | my $fig = FIG->new(); | ||
211 : | parrello | 1.3 | # Create the transaction object. |
212 : | my $controlBlock; | ||
213 : | if ($mainCommand eq 'count' || $mainCommand eq 'register') { | ||
214 : | $controlBlock = CountTransactions->new($options, $mainCommand, @parameters); | ||
215 : | } elsif ($mainCommand eq 'process') { | ||
216 : | $controlBlock = ApplyTransactions->new($options, $mainCommand, @parameters); | ||
217 : | parrello | 1.14 | } elsif ($mainCommand eq 'fudge') { |
218 : | $controlBlock = FudgeTransactions->new($options, $mainCommand, @parameters); | ||
219 : | parrello | 1.3 | } else { |
220 : | Confess("Invalid command \"$mainCommand\" specified on command line."); | ||
221 : | parrello | 1.1 | } |
222 : | parrello | 1.3 | # Setup the process. |
223 : | $controlBlock->Setup(); | ||
224 : | parrello | 1.1 | # Verify that the organism directory exists. |
225 : | if (! -d $parameters[0]) { | ||
226 : | Confess("Directory of genome files \"$parameters[0]\" not found."); | ||
227 : | } else { | ||
228 : | # Here we have a valid directory, so we need the list of transaction | ||
229 : | # files in it. | ||
230 : | my $orgsFound = 0; | ||
231 : | my %transFiles = (); | ||
232 : | my @transDirectory = OpenDir($parameters[0], 1); | ||
233 : | parrello | 1.12 | # Pull out the "start" option value. This will be a space if all genomes should |
234 : | # be processed, in which case it will always compare less than the genome ID. | ||
235 : | my $startGenome = $options->{start}; | ||
236 : | parrello | 1.1 | # The next step is to create a hash of organism IDs to file names. This |
237 : | # saves us some painful parsing later. | ||
238 : | for my $transFileName (@transDirectory) { | ||
239 : | parrello | 1.12 | # Parse the file name. This will only match if it's a real transaction file. |
240 : | parrello | 1.1 | if ($transFileName =~ /^tbl_diff_(\d+\.\d+)$/) { |
241 : | parrello | 1.12 | # Get the genome ID. |
242 : | my $genomeID = $1; | ||
243 : | # If we're skipping, only include this genome ID if it's equal to | ||
244 : | # or greater than the start value. | ||
245 : | parrello | 1.13 | if ($genomeID ge $startGenome) { |
246 : | parrello | 1.12 | $transFiles{$1} = "$parameters[0]/$transFileName"; |
247 : | $orgsFound++; | ||
248 : | } | ||
249 : | parrello | 1.1 | } |
250 : | } | ||
251 : | Trace("$orgsFound genome transaction files found in directory $parameters[0].") if T(2); | ||
252 : | if (! $orgsFound) { | ||
253 : | Confess("No \"tbl_diff\" files found in directory $parameters[0]."); | ||
254 : | } else { | ||
255 : | # Loop through the organisms. | ||
256 : | for my $genomeID (sort keys %transFiles) { | ||
257 : | parrello | 1.3 | # Start this organism. |
258 : | parrello | 1.1 | Trace("Processing changes for $genomeID.") if T(3); |
259 : | parrello | 1.3 | my $orgFileName = $transFiles{$genomeID}; |
260 : | $controlBlock->StartGenome($genomeID, $orgFileName); | ||
261 : | parrello | 1.1 | # Open the organism file. |
262 : | Open(\*TRANS, "<$orgFileName"); | ||
263 : | parrello | 1.3 | # Clear the transaction counter. |
264 : | parrello | 1.1 | my $tranCount = 0; |
265 : | # Loop through the organism's data. | ||
266 : | while (my $transaction = <TRANS>) { | ||
267 : | # Parse the record. | ||
268 : | chomp $transaction; | ||
269 : | my @fields = split /\t/, $transaction; | ||
270 : | $tranCount++; | ||
271 : | # Save the record number in the control block. | ||
272 : | $controlBlock->{line} = $tranCount; | ||
273 : | # Process according to the transaction type. | ||
274 : | my $command = lc shift @fields; | ||
275 : | if ($command eq 'add') { | ||
276 : | parrello | 1.3 | $controlBlock->Add(@fields); |
277 : | parrello | 1.1 | } elsif ($command eq 'delete') { |
278 : | parrello | 1.3 | $controlBlock->Delete(@fields); |
279 : | parrello | 1.1 | } elsif ($command eq 'change') { |
280 : | parrello | 1.5 | # Here we have a special case. If "noAlias" is in effect, we need |
281 : | # to splice an empty field in before the translation. | ||
282 : | if ($controlBlock->Option("noAlias")) { | ||
283 : | splice @fields, 3, 0, ""; | ||
284 : | } | ||
285 : | parrello | 1.3 | $controlBlock->Change(@fields); |
286 : | parrello | 1.1 | } else { |
287 : | parrello | 1.3 | $controlBlock->AddMessage("Invalid command $command in line $tranCount for genome $genomeID"); |
288 : | parrello | 1.1 | } |
289 : | parrello | 1.3 | $controlBlock->IncrementStat($command); |
290 : | parrello | 1.1 | } |
291 : | parrello | 1.14 | # Close the transaction input file. |
292 : | close TRANS; | ||
293 : | parrello | 1.3 | # Terminate processing for this genome. |
294 : | my $orgStats = $controlBlock->EndGenome(); | ||
295 : | parrello | 1.6 | Trace("Statistics for $genomeID\n\n" . $orgStats->Show() . "\n") if T(3); |
296 : | parrello | 1.1 | } |
297 : | } | ||
298 : | parrello | 1.3 | # Terminate processing. |
299 : | $controlBlock->Teardown(); | ||
300 : | parrello | 1.6 | Trace("Statistics for this run\n\n" . $controlBlock->Show() . "\n") if T(1); |
301 : | parrello | 1.1 | Trace("Processing complete.") if T(1); |
302 : | } | ||
303 : | |||
304 : | |||
305 : | golsen | 1.17 | 1; |