Revision 1.5
#!/usr/bin/perl -w

=head1 Add / Delete / Change Features

This script runs through a set of transaction files, adding, deleting, and changing
features in the FIG data store. The command takes three input parameters. The first is
a command. The second specifies a directory full of transaction files. The third
specifies a file that tells us which feature IDs are available for each organism.

C<TransactFeatures> I<[options]> I<command> I<transactionDirectory> I<idFile>

The supported commands are

=over 4

=item count

Count the number of IDs needed to process the ADD and CHANGE transactions. This
will produce a listing of the number of feature IDs needed for each
organism and feature type. This command is mostly a sanity check: it provides
useful statistics without changing anything.

=item register

Create an ID file by requesting IDs from the clearinghouse. This performs the
same function as B<count>, but takes the additional step of creating an ID
file that can be used to process the transactions.

=item process

Process the transactions and update the FIG data store. This will also create
a copy of each transaction file in which the pseudo-IDs have been replaced by
real IDs.

=item annotate

Annotate the features created by the transactions so as to indicate how they were
derived.

=item fix

Fix the locations of new features and verify the translations of new and changed
features.

=back

=head2 The Transaction File

Each transaction file is a standard tab-delimited file, one transaction per line. The
name of the file is C<tbl_diff_>I<org> where I<org> is an organism ID. All records in
the transaction file refer to transactions against the organism encoded in the file
name.

The file must specify IDs for new features, but the real IDs cannot be known until
they are requested from the SEED clearinghouse. Therefore, each new ID is specified
in a special format consisting of the feature type (C<peg>, C<rna>, and so forth)
followed by a dot and the 0-based ordinal number of the new ID within that
feature type. So, for example, if the transaction file consists of a delete,
a change, and two adds, it might look like this:

    delete fig|83333.1.peg.2
    change fig|83333.1.peg.6 peg.0 ...
    add peg.1 ...
    add rna.0 ...

Note that the old feature IDs do not participate in the numbering process, and the RNA
numbering is independent of the PEG numbering. In the discussion below of transaction
types, a field named I<newID> will always indicate one of these type/number pairs.
So, the field setup for the B<change> command is

    change fid newID locations aliases translation

And the I<newID> corresponds to the C<peg.0> in the example above.

The first field of each record is the transaction type. The list of subsequent fields
depends on this type.

=over 4

=item DELETE fid

Deletes a feature. The feature is marked as deleted in the FIG database, which
causes it to be skipped or ignored by most of the SEED software. The ID of the
feature to be deleted is the second field (I<fid>).

=item ADD newID locations translation

Adds a new feature. The I<newID> indicates the feature type and its ordinal number.
The location is a comma-separated list of location strings. The translation is the
protein translation for the location. If the translation is omitted, then it will
be generated from the location information in the normal way.

=item CHANGE fid newID locations aliases translation

Changes an existing feature. The current copy of the feature is marked as deleted,
and a new feature is created with a new ID. All annotations and assignments are
transferred from the deleted feature to the new one. The location is a
comma-separated list of location strings. The aliases are specified as a comma-delimited
list of alternate names for the feature. These replace any existing aliases for the
old feature. If the alias list is omitted, no aliases will be assigned to the new
feature. The translation is the protein translation for the location. If the
translation is omitted, then it will be generated from the location information in the
normal way.

=back

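As a rough illustration of the record layouts above, the following sketch splits one
tab-delimited record into named fields. The helper name C<ParseTransaction>, the
C<%layouts> table, and the sample location, alias, and translation values are
hypothetical; none of them is part of this script or its support modules.

```perl
use strict;
use warnings;

# Hypothetical table of field names per transaction type, taken from
# the record layouts documented above.
my %layouts = (
    delete => [qw(fid)],
    add    => [qw(newID locations translation)],
    change => [qw(fid newID locations aliases translation)],
);

# Hypothetical helper: split one transaction record into a hash of
# named fields. The transaction type is matched case-insensitively.
sub ParseTransaction {
    my ($line) = @_;
    chomp $line;
    # A limit of -1 keeps trailing empty fields.
    my ($type, @fields) = split /\t/, $line, -1;
    $type = lc $type;
    my $layout = $layouts{$type} or die "Unknown transaction type \"$type\".";
    my %record = (type => $type);
    # Fields missing from the end of the record are left undefined.
    @record{@$layout} = @fields[0 .. $#$layout];
    return \%record;
}

# Made-up sample record; the location, alias, and translation are placeholders.
my $record = ParseTransaction("change\tfig|83333.1.peg.6\tpeg.0\tc1_100_400\tb0001\tMKR");
print "$record->{fid} -> $record->{newID}\n";
```

An omitted trailing field, such as the translation of an ADD record, shows up here
as an undefined hash value, which a caller could test with C<defined>.
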
=head2 The ID File

The ID file is a tab-delimited file containing one record for each feature type
of each organism that has a transaction file. Each record consists of three
fields.

=over 4

=item orgID

The ID of the organism being updated.

=item ftype

The relevant feature type.

=item firstNumber

The first available ID number for the organism and feature type.

=back

This file's primary purpose is to tell us how to create the feature IDs
for features we'll be adding to the data store, whether via a straight
B<add> or a B<change> that deletes an old ID and recreates the feature with a
new ID.

If we need new IDs for an organism not listed in this ID file, an error will be
thrown.

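The sketch below shows one way an ID-file record could be combined with a pseudo-ID
to form a real feature ID. It assumes the real number is I<firstNumber> plus the
0-based ordinal and that IDs take the form C<fig|>I<orgID>C<.>I<ftype>C<.>I<number>
seen in the transaction example; the helper names C<LoadIdTable> and C<RealId> and the
sample numbers are hypothetical and are not taken from the support modules.

```perl
use strict;
use warnings;

# Hypothetical helper: load ID-file records (orgID, ftype, firstNumber)
# into a two-level hash keyed by organism ID and feature type.
sub LoadIdTable {
    my (@lines) = @_;
    my %table;
    for my $line (@lines) {
        chomp $line;
        my ($orgID, $ftype, $firstNumber) = split /\t/, $line;
        $table{$orgID}{$ftype} = $firstNumber;
    }
    return \%table;
}

# Hypothetical helper: convert a pseudo-ID such as "peg.1" to a real ID.
# Assumed mapping: real number = firstNumber + 0-based ordinal.
sub RealId {
    my ($table, $orgID, $pseudoID) = @_;
    my ($ftype, $ordinal) = $pseudoID =~ /^(\w+)\.(\d+)$/
        or die "Invalid pseudo-ID \"$pseudoID\".";
    my $first = $table->{$orgID}{$ftype};
    # Mirror the documented behavior: a missing entry is an error.
    die "No IDs available for $ftype in $orgID." unless defined $first;
    return "fig|$orgID.$ftype." . ($first + $ordinal);
}

# Made-up sample ID-file records.
my $table = LoadIdTable("83333.1\tpeg\t4500", "83333.1\trna\t120");
print RealId($table, "83333.1", "peg.1"), "\n";
```

Keeping a separate counter base per organism and feature type matches the note above
that RNA numbering is independent of PEG numbering.
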
=head2 Command-Line Options

The command-line options for this script are as follows.

=over 4

=item trace

Numeric trace level. A higher trace level causes more messages to appear. The
default trace level is 3.

=item safe

Wrap each organism's processing in a database transaction. This makes the process
slightly more restartable than it would be otherwise.

=item noAlias

Assume that the transaction files do not contain aliases. This means that in CHANGE
records the translation will immediately follow the location.

=cut

use strict;
use Tracer;
use DocUtils;
use TestUtils;
use Cwd;
use File::Copy;
use File::Path;
use FIG;
use Stats;
use TransactionProcessor;
use ApplyTransactions;
use CountTransactions;
use AnnotateTransactions;
use FixTransactions;

# Get the command-line options.
my ($options, @parameters) = Tracer::ParseCommand({ trace => 3, safe => 0, noAlias => 0 }, @ARGV);
# Set up tracing.
my $traceLevel = $options->{trace};
TSetup("$traceLevel Tracer DocUtils FIG", "TEXT");
# Get the FIG object.
my $fig = FIG->new();
# Get the command.
my $mainCommand = lc shift @parameters;
# Create the transaction object.
my $controlBlock;
if ($mainCommand eq 'count' || $mainCommand eq 'register') {
    $controlBlock = CountTransactions->new($options, $mainCommand, @parameters);
} elsif ($mainCommand eq 'process') {
    $controlBlock = ApplyTransactions->new($options, $mainCommand, @parameters);
} elsif ($mainCommand eq 'fix') {
    $controlBlock = FixTransactions->new($options, $mainCommand, @parameters);
} else {
    Confess("Invalid command \"$mainCommand\" specified on command line.");
}
# Set up the process.
$controlBlock->Setup();
# Verify that the organism directory exists.
if (! -d $parameters[0]) {
    Confess("Directory of genome files \"$parameters[0]\" not found.");
} else {
    # Here we have a valid directory, so we need the list of transaction
    # files in it.
    my $orgsFound = 0;
    my %transFiles = ();
    my @transDirectory = OpenDir($parameters[0], 1);
    # The next step is to create a hash of organism IDs to file names. This
    # saves us some painful parsing later.
    for my $transFileName (@transDirectory) {
        if ($transFileName =~ /^tbl_diff_(\d+\.\d+)$/) {
            $transFiles{$1} = "$parameters[0]/$transFileName";
            $orgsFound++;
        }
    }
    Trace("$orgsFound genome transaction files found in directory $parameters[0].") if T(2);
    if (! $orgsFound) {
        # Report the transaction directory, not the ID file.
        Confess("No \"tbl_diff\" files found in directory $parameters[0].");
    } else {
        # Loop through the organisms.
        for my $genomeID (sort keys %transFiles) {
            # Start this organism.
            Trace("Processing changes for $genomeID.") if T(3);
            my $orgFileName = $transFiles{$genomeID};
            $controlBlock->StartGenome($genomeID, $orgFileName);
            # Open the organism file.
            Open(\*TRANS, "<$orgFileName");
            # Clear the transaction counter.
            my $tranCount = 0;
            # Loop through the organism's data.
            while (my $transaction = <TRANS>) {
                # Parse the record.
                chomp $transaction;
                my @fields = split /\t/, $transaction;
                $tranCount++;
                # Save the record number in the control block.
                $controlBlock->{line} = $tranCount;
                # Process according to the transaction type.
                my $command = lc shift @fields;
                if ($command eq 'add') {
                    $controlBlock->Add(@fields);
                } elsif ($command eq 'delete') {
                    $controlBlock->Delete(@fields);
                } elsif ($command eq 'change') {
                    # Here we have a special case. If "noAlias" is in effect, we need
                    # to splice an empty field in before the translation.
                    if ($controlBlock->Option("noAlias")) {
                        splice @fields, 3, 0, "";
                    }
                    $controlBlock->Change(@fields);
                } else {
                    $controlBlock->AddMessage("Invalid command $command in line $tranCount for genome $genomeID");
                }
                $controlBlock->IncrementStat($command);
            }
            # Terminate processing for this genome.
            my $orgStats = $controlBlock->EndGenome();
            Trace("Statistics for $genomeID\n\n" . $orgStats->Show()) if T(3);
            # Close the transaction input file.
            close TRANS;
        }
    }
    # Terminate processing.
    $controlBlock->Teardown();
    Trace("Statistics for this run\n\n" . $controlBlock->Show()) if T(1);
    Trace("Processing complete.") if T(1);
}


1;