Annotation of /FigKernelScripts/TransactFeatures.pl

Revision 1.12

1 : parrello 1.1 #!/usr/bin/perl -w
2 :    
3 :     =head1 Add / Delete / Change Features
4 :    
5 :     This script runs through a set of transaction files, adding, deleting, and changing
6 :     features in the FIG data store. The command takes three input parameters. The first is
7 :     a command. The second specifies a directory full of transaction files. The third
8 :     specifies a file that tells us which feature IDs are available for each organism.
9 :    
10 :     C<TransactFeatures> I<[options]> I<command> I<transactionDirectory> I<idFile>
11 :    
12 :     The supported commands are
13 :    
14 :     =over 4
15 :    
16 :     =item count
17 :    
18 :     Count the number of IDs needed to process the ADD and CHANGE transactions. This
19 :     will produce a listing of the number of feature IDs needed for each
20 :     organism and feature type. This command is mostly a sanity check: it provides
21 :     useful statistics without changing anything.
22 :    
23 :     =item register
24 :    
25 :     Create an ID file by requesting IDs from the clearinghouse. This performs the
26 :     same function as B<count>, but takes the additional step of creating an ID
27 :     file that can be used to process the transactions.
28 :    
29 :     =item process
30 :    
31 :     Process the transactions and update the FIG data store. This will also create
32 :     a copy of each transaction file in which the pseudo-IDs have been replaced by
33 :     real IDs.
34 :    
35 : parrello 1.4 =item annotate
36 :    
37 :     Annotate the features created by the transactions so as to indicate how they were
38 :     derived.
39 :    
40 : parrello 1.7 =item check
41 :    
42 :     Verify that the locations and translations of the new and changed features are
43 :     correct.
44 :    
45 : parrello 1.4 =item fix
46 :    
47 : parrello 1.7 Fix the locations and translations of the new and changed features.
48 : parrello 1.4
49 : parrello 1.8 =item aliasMove
50 :    
51 :     Move the aliases from the old features to the ones that replaced them.
52 :    
53 : parrello 1.9 =item attribute
54 :    
55 :     Move the attributes from the old features to the ones that replaced them.
56 :    
57 : parrello 1.10 =item attributeCheck
58 :    
59 :     Same as C<attribute>, but no changes are made to the database.
60 :    
61 : parrello 1.1 =back
62 :    
63 :     =head2 The Transaction File
64 :    
65 :     Each transaction file is a standard tab-delimited file, one transaction per line. The
66 :     name of the file is C<tbl_diff_>I<org> where I<org> is an organism ID. All records in
67 :     the transaction file refer to transactions against the organism encoded in the file
68 :     name.
69 :    
70 :     The file must specify IDs for new features, but the real IDs cannot be known until
71 :     they are requested from the SEED clearing house. Therefore, each new ID is specified
72 :     in a special format consisting of the feature type (C<peg>, C<rna>, and so forth)
73 :     followed by a dot and the 0-based ordinal number of the new ID within that
74 :     feature type. So, for example, if the transaction file consists of a delete,
75 :     a change, and two adds, it might look like this
76 :    
77 :     delete fig|83333.1.peg.2
78 :     change fig|83333.1.peg.6 peg.0 ...
79 :     add peg.1 ...
80 :     add rna.0 ...
81 :    
82 :     Note that the old feature IDs do not participate in the numbering process, and the RNA
83 :     numbering is independent of the PEG numbering. In the discussion below of transaction
84 :     types, a field named I<newID> will always indicate one of these type/number pairs.
85 :     So, the field setup for the B<change> command is
86 :    
87 :     change fid newID locations aliases translation
88 :    
89 :     And the I<newID> corresponds to the C<peg.0> in the example above.
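
The numbering rule above implies that the number of new IDs a transaction file needs for a feature type equals the number of B<add> and B<change> records carrying a pseudo-ID of that type. The following is a minimal sketch of such a count, not the logic of C<CountTransactions> itself; the helper name C<count_new_ids> is hypothetical, and the sample records are the ones from the example above with the elided fields left out.

```perl
use strict;
use warnings;

# Hypothetical helper: count the new IDs needed per feature type in one
# organism's transaction records (tab-delimited lines).
sub count_new_ids {
    my (@lines) = @_;
    my %needed;
    for my $line (@lines) {
        chomp $line;
        my ($type, @fields) = split /\t/, $line;
        $type = lc $type;
        # ADD carries the pseudo-ID first; CHANGE carries it after the old fid.
        my $newID;
        if    ($type eq 'add')    { $newID = $fields[0] }
        elsif ($type eq 'change') { $newID = $fields[1] }
        else                      { next }
        # A pseudo-ID is a feature type, a dot, and an ordinal number.
        next unless defined $newID && $newID =~ /^(\w+)\.\d+$/;
        $needed{$1}++;
    }
    return \%needed;
}

my $counts = count_new_ids(
    "delete\tfig|83333.1.peg.2",
    "change\tfig|83333.1.peg.6\tpeg.0",
    "add\tpeg.1",
    "add\trna.0",
);
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

For the sample records this reports two C<peg> IDs and one C<rna> ID; the deleted feature does not contribute.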
90 :    
91 :     The first field of each record is the transaction type. The list of subsequent fields
92 :     depends on this type.
93 :    
94 :     =over 4
95 :    
96 :     =item DELETE fid
97 :    
98 :     Deletes a feature. The feature is marked as deleted in the FIG database, which
99 :     causes it to be skipped or ignored by most of the SEED software. The ID of the
100 :     feature to be deleted is the second field (I<fid>).
101 :    
102 :     =item ADD newID locations translation
103 :    
104 :     Adds a new feature. The I<newID> indicates the feature type and its ordinal number.
105 :     The location is a comma-separated list of location strings. The translation is the
106 :     protein translation for the location. If the translation is omitted, then it will
107 :     be generated from the location information in the normal way.
108 :    
109 :     =item CHANGE fid newID locations aliases translation
110 :    
111 :     Changes an existing feature. The current copy of the feature is marked as deleted,
112 :     and a new feature is created with a new ID. All annotations and assignments are
113 :     transferred from the deleted feature to the new one. The location is a
114 :     comma-separated list of location strings. The aliases are specified as a comma-delimited
115 :     list of alternate names for the feature. These replace any existing aliases for the
116 :     old feature. If the alias list is omitted, no aliases will be assigned to the new
117 :     feature. The translation is the protein translation for the location. If the
118 :     translation is omitted, then it will be generated from the location information in the
119 :     normal way.
120 :    
121 :     =back
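
The three record layouts above can be illustrated with a small parser. This is a sketch under assumptions, not code from the script: C<parse_transaction> is a hypothetical helper, and the location (C<loc1>) and translation (C<MKV>) values are placeholders. The empty alias field inserted for the no-alias case mirrors what the main loop does with C<splice> before calling C<Change>.

```perl
use strict;
use warnings;

# Hypothetical helper: split one tab-delimited transaction record into
# named fields according to its transaction type.
sub parse_transaction {
    my ($line, $noAlias) = @_;
    chomp $line;
    # The -1 limit keeps trailing empty fields.
    my ($type, @fields) = split /\t/, $line, -1;
    $type = lc $type;
    my %record = (type => $type);
    if ($type eq 'delete') {
        @record{qw(fid)} = @fields;
    } elsif ($type eq 'add') {
        @record{qw(newID locations translation)} = @fields;
    } elsif ($type eq 'change') {
        # Without an alias column, insert an empty alias list so the
        # translation lands in the same position either way.
        splice @fields, 3, 0, '' if $noAlias;
        @record{qw(fid newID locations aliases translation)} = @fields;
    } else {
        die "Invalid transaction type \"$type\".\n";
    }
    return \%record;
}

my $rec = parse_transaction("change\tfig|83333.1.peg.6\tpeg.0\tloc1\tMKV", 1);
print "$rec->{fid} becomes $rec->{newID} with translation $rec->{translation}\n";
```

Fields omitted at the end of a record simply come back undefined, which matches the "if the translation is omitted" wording above.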
122 :    
123 :     =head2 The ID File
124 :    
125 :     The ID file is a tab-delimited file containing one record for each feature type
126 :     of each organism that has a transaction file. Each record consists of three
127 :     fields.
128 :    
129 :     =over 4
130 :    
131 :     =item orgID
132 :    
133 :     The ID of the organism being updated.
134 :    
135 :     =item ftype
136 :    
137 :     The relevant feature type.
138 :    
139 :     =item firstNumber
140 :    
141 :     The first available ID number for the organism and feature type.
142 :    
143 :     =back
144 :    
145 :     This file's primary purpose is that it tells us how to create the feature IDs
146 :     for features we'll be adding to the data store, whether it be via a straight
147 :     B<add> or a B<change> that deletes an old ID and recreates the feature with a
148 :     new ID.
149 :    
150 :     If we need new IDs for an organism not listed in this ID file, an error will be
151 :     thrown.
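
As an illustration of how the ID file and the pseudo-IDs could fit together, the sketch below loads the three-field records and turns a pseudo-ID into a real feature ID. It assumes that the real ID number is I<firstNumber> plus the pseudo-ID's ordinal and that real IDs follow the C<fig|>I<org>C<.>I<type>C<.>I<number> pattern seen in the transaction example; the script's actual rule lives in its processing modules and may differ. The helper names C<load_id_table> and C<real_id> and the first-number values 4500 and 120 are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: load "orgID <tab> ftype <tab> firstNumber" records
# into a two-level hash keyed by organism and feature type.
sub load_id_table {
    my ($fh) = @_;
    my %firstNumber;
    while (my $line = <$fh>) {
        chomp $line;
        my ($orgID, $ftype, $first) = split /\t/, $line;
        $firstNumber{$orgID}{$ftype} = $first;
    }
    return \%firstNumber;
}

# Assumption: real ID = "fig|<orgID>.<ftype>.<firstNumber + ordinal>".
sub real_id {
    my ($table, $orgID, $pseudoID) = @_;
    my ($ftype, $ordinal) = $pseudoID =~ /^(\w+)\.(\d+)$/
        or die "Malformed pseudo-ID \"$pseudoID\".\n";
    my $first = $table->{$orgID}{$ftype};
    # Mirrors the documented behavior: missing entries are an error.
    die "No IDs available for $ftype in $orgID.\n" unless defined $first;
    return "fig|$orgID.$ftype." . ($first + $ordinal);
}

# Read the ID records from an in-memory file for the demonstration.
my $idData = "83333.1\tpeg\t4500\n83333.1\trna\t120\n";
open my $fh, '<', \$idData or die "Cannot open in-memory ID data: $!";
my $table = load_id_table($fh);
close $fh;
print real_id($table, '83333.1', 'peg.1'), "\n";
```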
152 :    
153 :     =head2 Command-Line Options
154 :    
155 :     The command-line options for this script are as follows.
156 :    
157 :     =over 4
158 :    
159 :     =item trace
160 :    
161 :     Numeric trace level. A higher trace level causes more messages to appear. The
162 :     default trace level is 3.
163 :    
164 : parrello 1.2 =item safe
165 :    
166 :     Wrap each organism's processing in a database transaction. This makes the process
167 :     slightly more restartable than it would be otherwise.
168 :    
169 : parrello 1.5 =item noAlias
170 :    
171 :     Assume that the transaction files do not contain aliases. This means that in CHANGE
172 :     records the translation will immediately follow the location.
173 :    
174 : parrello 1.8 =item sql
175 :    
176 :     Trace SQL commands.
177 :    
178 : parrello 1.12 =item start
179 :    
180 :     ID of the first genome to process. This allows restarting a transaction run that failed
181 :     in the middle. The default is to run all transaction files.
182 :    
183 : parrello 1.8 =back
184 :    
185 : parrello 1.1 =cut
186 :    
187 :     use strict;
188 :     use Tracer;
189 :     use DocUtils;
190 :     use TestUtils;
191 :     use Cwd;
192 :     use File::Copy;
193 :     use File::Path;
194 :     use FIG;
195 :     use Stats;
196 : parrello 1.3 use TransactionProcessor;
197 :     use ApplyTransactions;
198 :     use CountTransactions;
199 :     use AnnotateTransactions;
200 : parrello 1.9 use AttributeTransactions;
201 : parrello 1.3 use FixTransactions;
202 : parrello 1.8 use MoveAliases;
203 : parrello 1.1
204 :     # Get the command-line options.
205 : parrello 1.12 my ($options, @parameters) = Tracer::ParseCommand({ trace => 3, sql => 0, safe => 0, noAlias => 0,
206 :     start => ' '},
207 : parrello 1.8 @ARGV);
208 :     # Get the command.
209 :     my $mainCommand = lc shift @parameters;
210 : parrello 1.1 # Set up tracing.
211 :     my $traceLevel = $options->{trace};
212 : parrello 1.8 my $tracing = "$traceLevel Tracer DocUtils FIG";
213 :     if ($options->{sql}) {
214 :     $tracing .= " SQL";
215 :     }
216 :     TSetup($tracing, "TEXT");
217 : parrello 1.1 # Get the FIG object.
218 :     my $fig = FIG->new();
219 : parrello 1.3 # Create the transaction object.
220 :     my $controlBlock;
221 :     if ($mainCommand eq 'count' || $mainCommand eq 'register') {
222 :     $controlBlock = CountTransactions->new($options, $mainCommand, @parameters);
223 :     } elsif ($mainCommand eq 'process') {
224 :     $controlBlock = ApplyTransactions->new($options, $mainCommand, @parameters);
225 : parrello 1.6 } elsif ($mainCommand eq 'annotate') {
226 :     $controlBlock = AnnotateTransactions->new($options, $mainCommand, @parameters);
227 : parrello 1.7 } elsif ($mainCommand eq 'fix' || $mainCommand eq 'check') {
228 : parrello 1.3 $controlBlock = FixTransactions->new($options, $mainCommand, @parameters);
229 : parrello 1.8 } elsif ($mainCommand eq 'aliasmove') {
230 :     $controlBlock = MoveAliases->new($options, $mainCommand, @parameters);
231 : parrello 1.9 } elsif ($mainCommand eq 'attribute') {
232 :     $controlBlock = AttributeTransactions->new($options, $mainCommand, @parameters);
233 : parrello 1.11 } elsif ($mainCommand eq 'attributecheck') {
234 : parrello 1.10 $controlBlock = AttributeTransactions->new($options, $mainCommand, @parameters);
235 : parrello 1.3 } else {
236 :     Confess("Invalid command \"$mainCommand\" specified on command line.");
237 : parrello 1.1 }
238 : parrello 1.3 # Setup the process.
239 :     $controlBlock->Setup();
240 : parrello 1.1 # Verify that the organism directory exists.
241 :     if (! -d $parameters[0]) {
242 :     Confess("Directory of genome files \"$parameters[0]\" not found.");
243 :     } else {
244 :     # Here we have a valid directory, so we need the list of transaction
245 :     # files in it.
246 :     my $orgsFound = 0;
247 :     my %transFiles = ();
248 :     my @transDirectory = OpenDir($parameters[0], 1);
249 : parrello 1.12 # Pull out the "start" option value. This will be a space if all genomes should
250 :     # be processed, in which case it will always compare less than the genome ID.
251 :     my $startGenome = $options->{start};
252 : parrello 1.1 # The next step is to create a hash of organism IDs to file names. This
253 :     # saves us some painful parsing later.
254 :     for my $transFileName (@transDirectory) {
255 : parrello 1.12 # Parse the file name. This will only match if it's a real transaction file.
256 : parrello 1.1 if ($transFileName =~ /^tbl_diff_(\d+\.\d+)$/) {
257 : parrello 1.12 # Get the genome ID.
258 :     my $genomeID = $1;
259 :     # If we're skipping, only include this genome ID if it compares (as a
260 :     # string) equal to or greater than the start value.
261 :     if ($genomeID ge $startGenome) {
262 :     $transFiles{$genomeID} = "$parameters[0]/$transFileName";
263 :     $orgsFound++;
264 :     }
265 : parrello 1.1 }
266 :     }
267 :     Trace("$orgsFound genome transaction files found in directory $parameters[0].") if T(2);
268 :     if (! $orgsFound) {
269 :     Confess("No \"tbl_diff\" files found in directory $parameters[0].");
270 :     } else {
271 :     # Loop through the organisms.
272 :     for my $genomeID (sort keys %transFiles) {
273 : parrello 1.3 # Start this organism.
274 : parrello 1.1 Trace("Processing changes for $genomeID.") if T(3);
275 : parrello 1.3 my $orgFileName = $transFiles{$genomeID};
276 :     $controlBlock->StartGenome($genomeID, $orgFileName);
277 : parrello 1.1 # Open the organism file.
278 :     Open(\*TRANS, "<$orgFileName");
279 : parrello 1.3 # Clear the transaction counter.
280 : parrello 1.1 my $tranCount = 0;
281 :     # Loop through the organism's data.
282 :     while (my $transaction = <TRANS>) {
283 :     # Parse the record.
284 :     chomp $transaction;
285 :     my @fields = split /\t/, $transaction;
286 :     $tranCount++;
287 :     # Save the record number in the control block.
288 :     $controlBlock->{line} = $tranCount;
289 :     # Process according to the transaction type.
290 :     my $command = lc shift @fields;
291 :     if ($command eq 'add') {
292 : parrello 1.3 $controlBlock->Add(@fields);
293 : parrello 1.1 } elsif ($command eq 'delete') {
294 : parrello 1.3 $controlBlock->Delete(@fields);
295 : parrello 1.1 } elsif ($command eq 'change') {
296 : parrello 1.5 # Here we have a special case. If "noAlias" is in effect, we need
297 :     # to splice an empty field in before the translation.
298 :     if ($controlBlock->Option("noAlias")) {
299 :     splice @fields, 3, 0, "";
300 :     }
301 : parrello 1.3 $controlBlock->Change(@fields);
302 : parrello 1.1 } else {
303 : parrello 1.3 $controlBlock->AddMessage("Invalid command $command in line $tranCount for genome $genomeID");
304 : parrello 1.1 }
305 : parrello 1.3 $controlBlock->IncrementStat($command);
306 : parrello 1.1 }
307 : parrello 1.3 # Terminate processing for this genome.
308 :     my $orgStats = $controlBlock->EndGenome();
309 : parrello 1.6 Trace("Statistics for $genomeID\n\n" . $orgStats->Show() . "\n") if T(3);
310 : parrello 1.2 # Close the transaction input file.
311 : parrello 1.1 close TRANS;
312 :     }
313 :     }
314 : parrello 1.3 # Terminate processing.
315 :     $controlBlock->Teardown();
316 : parrello 1.6 Trace("Statistics for this run\n\n" . $controlBlock->Show() . "\n") if T(1);
317 : parrello 1.1 Trace("Processing complete.") if T(1);
318 :     }
319 :    
320 :    
321 :     1;
