
Annotation of /FigKernelScripts/TransactFeatures.pl



Revision 1.20

1 : parrello 1.1 #!/usr/bin/perl -w
2 : olson 1.16 #
3 :     # Copyright (c) 2003-2006 University of Chicago and Fellowship
4 :     # for Interpretations of Genomes. All Rights Reserved.
5 :     #
6 :     # This file is part of the SEED Toolkit.
7 :     #
8 :     # The SEED Toolkit is free software. You can redistribute
9 :     # it and/or modify it under the terms of the SEED Toolkit
10 :     # Public License.
11 :     #
12 :     # You should have received a copy of the SEED Toolkit Public License
13 :     # along with this program; if not write to the University of Chicago
14 :     # at info@ci.uchicago.edu or the Fellowship for Interpretation of
15 :     # Genomes at veronika@thefig.info or download a copy from
16 :     # http://www.theseed.org/LICENSE.TXT.
17 :     #
18 :    
19 : parrello 1.1
20 :     =head1 Add / Delete / Change Features
21 :    
22 :     This script runs through a set of transaction files, adding, deleting, and changing
23 :     features in the FIG data store. The command takes three input parameters. The first is
24 :     a command. The second specifies a directory full of transaction files. The third
25 :     specifies a file that tells us which feature IDs are available for each organism.
26 :    
27 : parrello 1.14 C<TransactFeatures> [I<options>] I<command> I<transactionDirectory> I<idFile>
28 : parrello 1.1
29 :     The supported commands are
30 :    
31 :     =over 4
32 :    
33 :     =item count
34 :    
35 :     Count the number of IDs needed to process the ADD and CHANGE transactions. This
36 : parrello 1.15 will produce a listing of the number of feature IDs needed for each
37 : parrello 1.1 organism and feature type. This command is mostly a sanity check: it provides
38 :     useful statistics without changing anything.
39 :    
40 :     =item register
41 :    
42 :     Create an ID file by requesting IDs from the clearinghouse. This performs the
43 :     same function as B<count>, but takes the additional step of creating an ID
44 :     file that can be used to process the transactions.
45 :    
46 :     =item process
47 :    
48 : parrello 1.14 Process the transactions and update the FIG data store. This will also update
49 :     the NR file and queue features for similarity generation.
50 : parrello 1.1
51 : parrello 1.14 =item fudge
52 : parrello 1.4
53 : parrello 1.14 Convert transactions that have already been applied to new transactions that can
54 :     be used to test the transaction processor.
55 : parrello 1.10
56 : parrello 1.1 =back
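
The tallying that B<count> describes can be sketched as follows. This is a minimal illustration and not the code in C<CountTransactions>; the helper name C<count_needed_ids> is hypothetical, and the sketch assumes only the record layouts documented below (the I<newID> is the first field after C<add> and the second field after C<change>).

```perl
use strict;
use warnings;

# Hypothetical helper: tally how many new IDs each feature type needs.
# Takes the tab-delimited transaction lines for a single organism.
sub count_needed_ids {
    my @lines = @_;
    my %needed;
    for my $line (@lines) {
        chomp $line;
        my ($command, @fields) = split /\t/, $line;
        next unless defined $command;
        $command = lc $command;
        # ADD carries the newID in field 0; CHANGE carries it in field 1,
        # after the ID of the feature being replaced. DELETE needs no ID.
        my $newID = $command eq 'add'    ? $fields[0]
                  : $command eq 'change' ? $fields[1]
                  :                        undef;
        next unless defined $newID;
        # A newID looks like "peg.0": feature type, dot, ordinal.
        if ($newID =~ /^(\w+)\.(\d+)$/) {
            $needed{$1}++;
        }
    }
    return \%needed;
}

# One delete, one change, and two adds for a single organism.
my $counts = count_needed_ids(
    "delete\tfig|83333.1.peg.2",
    "change\tfig|83333.1.peg.6\tpeg.0\t...",
    "add\tpeg.1\t...",
    "add\trna.0\t...",
);
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

For this input the tally is two C<peg> IDs and one C<rna> ID; the delete contributes nothing.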
57 :    
58 :     =head2 The Transaction File
59 :    
60 :     Each transaction file is a standard tab-delimited file, one transaction per line. The
61 :     name of the file is C<tbl_diff_>I<org> where I<org> is an organism ID. All records in
62 :     the transaction file refer to transactions against the organism encoded in the file
63 :     name.
64 :    
65 :     The file must specify IDs for new features, but the real IDs cannot be known until
66 :     they are requested from the SEED clearing house. Therefore, each new ID is specified
67 :     in a special format consisting of the feature type (C<peg>, C<rna>, and so forth)
68 :     followed by a dot and the 0-based ordinal number of the new ID within that
69 :     feature type. So, for example, if the transaction file consists of a delete,
70 :     a change, and two adds, it might look like this
71 :    
72 :     delete fig|83333.1.peg.2
73 :     change fig|83333.1.peg.6 peg.0 ...
74 :     add peg.1 ...
75 :     add rna.0 ...
76 :    
77 :     Note that the old feature IDs do not participate in the numbering process, and the RNA
78 :     numbering is independent of the PEG numbering. In the discussion below of transaction
79 :     types, a field named I<newID> will always indicate one of these type/number pairs.
80 :     So, the field setup for the B<change> command is
81 :    
82 :     change fid newID locations aliases translation
83 :    
84 :     And the I<newID> corresponds to the C<peg.0> in the example above.
85 :    
86 :     The first field of each record is the transaction type. The list of subsequent fields
87 :     depends on this type.
88 :    
89 :     =over 4
90 :    
91 :     =item DELETE fid
92 :    
93 :     Deletes a feature. The feature is marked as deleted in the FIG database, which
94 :     causes it to be skipped or ignored by most of the SEED software. The ID of the
95 :     feature to be deleted is the second field (I<fid>).
96 :    
97 :     =item ADD newID locations translation
98 :    
99 :     Adds a new feature. The I<newID> indicates the feature type and its ordinal number.
100 :     The location is a comma-separated list of location strings. The translation is the
101 :     protein translation for the location. If the translation is omitted, then it will
102 :     be generated from the location information in the normal way.
103 :    
104 :     =item CHANGE fid newID locations aliases translation
105 :    
106 :     Changes an existing feature. The current copy of the feature is marked as deleted,
107 :     and a new feature is created with a new ID. All annotations and assignments are
108 :     transferred from the deleted feature to the new one. The location is a
109 :     comma-separated list of location strings. The aliases are specified as a comma-delimited
110 :     list of alternate names for the feature. These replace any existing aliases for the
111 :     old feature. If the alias list is omitted, no aliases will be assigned to the new
112 :     feature. The translation is the protein translation for the location. If the
113 :     translation is omitted, then it will be generated from the location information in the
114 :     normal way.
115 :    
116 :     =back
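
The record layouts above can be turned into named fields with a short parser. The following is only a sketch: C<parse_transaction> and C<%LAYOUT> are hypothetical names, the location string C<LOC> and the translation C<MKT> are placeholders, and the handling of the missing alias column mirrors what the B<noAlias> option is documented to do rather than reproducing the script's own processing classes.

```perl
use strict;
use warnings;

# Field layouts copied from the record descriptions above.
my %LAYOUT = (
    delete => [qw(fid)],
    add    => [qw(newID locations translation)],
    change => [qw(fid newID locations aliases translation)],
);

# Hypothetical helper: split one transaction line into a hash of named fields.
sub parse_transaction {
    my ($line, $noAlias) = @_;
    chomp $line;
    my ($command, @fields) = split /\t/, $line;
    $command = defined $command ? lc $command : '';
    my $layout = $LAYOUT{$command}
        or die "Invalid command \"$command\"\n";
    # With no alias column, pad CHANGE records with an empty alias list
    # so the translation lands in its usual position.
    if ($command eq 'change' && $noAlias && @fields >= 3) {
        splice @fields, 3, 0, "";
    }
    my %record = (command => $command);
    @record{@$layout} = @fields;    # omitted trailing fields become undef
    return \%record;
}

# A CHANGE record written without aliases (placeholder location and protein).
my $rec = parse_transaction("change\tfig|83333.1.peg.6\tpeg.0\tLOC\tMKT", 1);
print "$rec->{fid} becomes $rec->{newID}; translation is $rec->{translation}\n";
```

An omitted translation shows up as an undefined C<translation> field, which a caller could treat as the signal to generate it from the location.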
117 :    
118 :     =head2 The ID File
119 :    
120 :     The ID file is a tab-delimited file containing one record for each feature type
121 :     of each organism that has a transaction file. Each record consists of three
122 :     fields.
123 :    
124 :     =over 4
125 :    
126 :     =item orgID
127 :    
128 :     The ID of the organism being updated.
129 :    
130 :     =item ftype
131 :    
132 :     The relevant feature type.
133 :    
134 :     =item firstNumber
135 :    
136 :     The first available ID number for the organism and feature type.
137 :    
138 :     =back
139 :    
140 :     This file's primary purpose is that it tells us how to create the feature IDs
141 :     for features we'll be adding to the data store, whether it be via a straight
142 :     B<add> or a B<change> that deletes an old ID and recreates the feature with a
143 :     new ID.
144 :    
145 :     If we need new IDs for an organism not listed in this ID file, an error will be
146 :     thrown.
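
One plausible way to turn an I<newID> placeholder into a real feature ID is sketched below. It rests on an assumption that this documentation does not state outright: that the real ID number is I<firstNumber> plus the 0-based ordinal in the placeholder. The helper names C<load_id_table> and C<resolve_new_id> are hypothetical, and the first-available numbers in the sample table are made up for illustration.

```perl
use strict;
use warnings;

# Parse ID-file records (orgID, ftype, firstNumber) into a nested hash.
sub load_id_table {
    my @lines = @_;
    my %table;
    for my $line (@lines) {
        chomp $line;
        my ($orgID, $ftype, $firstNumber) = split /\t/, $line;
        $table{$orgID}{$ftype} = $firstNumber;
    }
    return \%table;
}

# ASSUMPTION: real ID number = firstNumber + 0-based ordinal from the newID.
sub resolve_new_id {
    my ($table, $orgID, $newID) = @_;
    my ($ftype, $ordinal) = $newID =~ /^(\w+)\.(\d+)$/
        or die "Malformed newID \"$newID\"\n";
    my $first = $table->{$orgID}{$ftype};
    # Mirror the documented behavior: a missing entry is an error.
    die "No IDs available for $ftype in organism $orgID\n"
        unless defined $first;
    return "fig|$orgID.$ftype." . ($first + $ordinal);
}

# Illustrative table: the first-available numbers here are invented.
my $table = load_id_table("83333.1\tpeg\t4500", "83333.1\trna\t120");
print resolve_new_id($table, "83333.1", "peg.1"), "\n";
print resolve_new_id($table, "83333.1", "rna.0"), "\n";
```

Under that assumption, C<peg.1> resolves to C<fig|83333.1.peg.4501> and C<rna.0> to C<fig|83333.1.rna.120>.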
147 :    
148 :     =head2 Command-Line Options
149 :    
150 :     The command-line options for this script are as follows.
151 :    
152 :     =over 4
153 :    
154 :     =item trace
155 :    
156 :     Numeric trace level. A higher trace level causes more messages to appear. The
157 :     default trace level is 2.
158 :    
159 : parrello 1.2 =item safe
160 :    
161 :     Wrap each organism's processing in a database transaction. This makes the process
162 :     slightly more restartable than it would be otherwise.
163 :    
164 : parrello 1.5 =item noAlias
165 :    
166 :     Assume that the transaction files do not contain aliases. This means that in CHANGE
167 :     records the translation will immediately follow the location.
168 :    
169 : parrello 1.8 =item sql
170 :    
171 :     Trace SQL commands.
172 :    
173 : parrello 1.15 =item tblFiles
174 :    
175 :     Output TBL files containing the corrected IDs. (B<process> command only)
176 :    
177 : parrello 1.12 =item start
178 :    
179 :     ID of the first genome to process. This allows restarting a transaction run that failed
180 :     in the middle. The default is to run all transaction files.
181 :    
182 : parrello 1.8 =back
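
The B<start> option is applied with a string comparison (C<ge>), and genomes are processed in the order given by a plain C<sort>, so "first genome" means first in string order, not numeric order. The sketch below illustrates the consequence; C<genomes_from> is a hypothetical helper, not part of the script.

```perl
use strict;
use warnings;

# Hypothetical helper: keep genome IDs that are stringwise >= $start,
# returned in the same string order the script uses for processing.
sub genomes_from {
    my ($start, @genomeIDs) = @_;
    my @kept = grep { $_ ge $start } @genomeIDs;
    return sort @kept;
}

my @ids = ('83333.1', '100226.1', '9.1');

# The default start value is a single space, which sorts before any digit,
# so every genome is kept.
my @all = genomes_from(' ', @ids);
print "all: @all\n";

# Restarting at 83333.1 skips 100226.1, because "1..." sorts before "8...".
my @rest = genomes_from('83333.1', @ids);
print "restart: @rest\n";
```

Because a restart with C<start> uses the same ordering as the main loop, it resumes exactly where the earlier run would have reached that genome.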
183 :    
184 : parrello 1.1 =cut
185 :    
186 :     use strict;
187 :     use Tracer;
188 :     use Cwd;
189 :     use File::Copy;
190 :     use File::Path;
191 :     use FIG;
192 :     use Stats;
193 : parrello 1.3 use TransactionProcessor;
194 :     use ApplyTransactions;
195 :     use CountTransactions;
196 : parrello 1.14 use FudgeTransactions;
197 : parrello 1.1
198 :     # Get the command-line options.
199 : parrello 1.20 my ($options, @parameters) = StandardSetup(["FIG"],
200 : parrello 1.18 { safe => [0, "use database transactions"],
201 : parrello 1.19 trace => [2, "trace level"],
202 : parrello 1.18 noAlias => [0, "do not expect aliases in CHANGE transactions"],
203 :     start => [' ', "start with this genome"],
204 :     tblFiles => [0, "output TBL files containing the corrected IDs"] },
205 :     "command transactionDirectory IDfile",
206 :     @ARGV);
207 : parrello 1.8 # Get the command.
208 :     my $mainCommand = lc shift @parameters;
209 : parrello 1.1 # Get the FIG object.
210 :     my $fig = FIG->new();
211 : parrello 1.3 # Create the transaction object.
212 :     my $controlBlock;
213 :     if ($mainCommand eq 'count' || $mainCommand eq 'register') {
214 :     $controlBlock = CountTransactions->new($options, $mainCommand, @parameters);
215 :     } elsif ($mainCommand eq 'process') {
216 :     $controlBlock = ApplyTransactions->new($options, $mainCommand, @parameters);
217 : parrello 1.14 } elsif ($mainCommand eq 'fudge') {
218 :     $controlBlock = FudgeTransactions->new($options, $mainCommand, @parameters);
219 : parrello 1.3 } else {
220 :     Confess("Invalid command \"$mainCommand\" specified on command line.");
221 : parrello 1.1 }
222 : parrello 1.3 # Setup the process.
223 :     $controlBlock->Setup();
224 : parrello 1.1 # Verify that the organism directory exists.
225 :     if (! -d $parameters[0]) {
226 :     Confess("Directory of genome files \"$parameters[0]\" not found.");
227 :     } else {
228 :     # Here we have a valid directory, so we need the list of transaction
229 :     # files in it.
230 :     my $orgsFound = 0;
231 :     my %transFiles = ();
232 :     my @transDirectory = OpenDir($parameters[0], 1);
233 : parrello 1.12 # Pull out the "start" option value. This will be a space if all genomes should
234 :     # be processed, in which case it will always compare less than the genome ID.
235 :     my $startGenome = $options->{start};
236 : parrello 1.1 # The next step is to create a hash of organism IDs to file names. This
237 :     # saves us some painful parsing later.
238 :     for my $transFileName (@transDirectory) {
239 : parrello 1.12 # Parse the file name. This will only match if it's a real transaction file.
240 : parrello 1.1 if ($transFileName =~ /^tbl_diff_(\d+\.\d+)$/) {
241 : parrello 1.12 # Get the genome ID.
242 :     my $genomeID = $1;
243 :     # If we're skipping, only include this genome ID if it's equal to
244 :     # or greater than the start value.
245 : parrello 1.13 if ($genomeID ge $startGenome) {
246 : parrello 1.12 $transFiles{$genomeID} = "$parameters[0]/$transFileName";
247 :     $orgsFound++;
248 :     }
249 : parrello 1.1 }
250 :     }
251 :     Trace("$orgsFound genome transaction files found in directory $parameters[0].") if T(2);
252 :     if (! $orgsFound) {
253 :     Confess("No \"tbl_diff\" files found in directory $parameters[0].");
254 :     } else {
255 :     # Loop through the organisms.
256 :     for my $genomeID (sort keys %transFiles) {
257 : parrello 1.3 # Start this organism.
258 : parrello 1.1 Trace("Processing changes for $genomeID.") if T(3);
259 : parrello 1.3 my $orgFileName = $transFiles{$genomeID};
260 :     $controlBlock->StartGenome($genomeID, $orgFileName);
261 : parrello 1.1 # Open the organism file.
262 :     Open(\*TRANS, "<$orgFileName");
263 : parrello 1.3 # Clear the transaction counter.
264 : parrello 1.1 my $tranCount = 0;
265 :     # Loop through the organism's data.
266 :     while (my $transaction = <TRANS>) {
267 :     # Parse the record.
268 :     chomp $transaction;
269 :     my @fields = split /\t/, $transaction;
270 :     $tranCount++;
271 :     # Save the record number in the control block.
272 :     $controlBlock->{line} = $tranCount;
273 :     # Process according to the transaction type.
274 :     my $command = lc shift @fields;
275 :     if ($command eq 'add') {
276 : parrello 1.3 $controlBlock->Add(@fields);
277 : parrello 1.1 } elsif ($command eq 'delete') {
278 : parrello 1.3 $controlBlock->Delete(@fields);
279 : parrello 1.1 } elsif ($command eq 'change') {
280 : parrello 1.5 # Here we have a special case. If "noAlias" is in effect, we need
281 :     # to splice an empty field in before the translation.
282 :     if ($controlBlock->Option("noAlias")) {
283 :     splice @fields, 3, 0, "";
284 :     }
285 : parrello 1.3 $controlBlock->Change(@fields);
286 : parrello 1.1 } else {
287 : parrello 1.3 $controlBlock->AddMessage("Invalid command $command in line $tranCount for genome $genomeID");
288 : parrello 1.1 }
289 : parrello 1.3 $controlBlock->IncrementStat($command);
290 : parrello 1.1 }
291 : parrello 1.14 # Close the transaction input file.
292 :     close TRANS;
293 : parrello 1.3 # Terminate processing for this genome.
294 :     my $orgStats = $controlBlock->EndGenome();
295 : parrello 1.6 Trace("Statistics for $genomeID\n\n" . $orgStats->Show() . "\n") if T(3);
296 : parrello 1.1 }
297 :     }
298 : parrello 1.3 # Terminate processing.
299 :     $controlBlock->Teardown();
300 : parrello 1.6 Trace("Statistics for this run\n\n" . $controlBlock->Show() . "\n") if T(1);
301 : parrello 1.1 Trace("Processing complete.") if T(1);
302 :     }
303 :    
304 :    
305 : golsen 1.17 1;
