Annotation of /Sprout/ERDBLoadGroup.pm

Revision 1.1 (parrello)

#!/usr/bin/perl -w

#
# Copyright (c) 2003-2006 University of Chicago and Fellowship
# for Interpretations of Genomes. All Rights Reserved.
#
# This file is part of the SEED Toolkit.
#
# The SEED Toolkit is free software. You can redistribute
# it and/or modify it under the terms of the SEED Toolkit
# Public License.
#
# You should have received a copy of the SEED Toolkit Public License
# along with this program; if not write to the University of Chicago
# at info@ci.uchicago.edu or the Fellowship for Interpretation of
# Genomes at veronika@thefig.info or download a copy from
# http://www.theseed.org/LICENSE.TXT.
#

package ERDBLoadGroup;

use strict;
use Tracer;
use ERDB;
use Stats;
use Time::HiRes qw(time);
use ERDBGenerate;

=head1 ERDB Database Load Group Object

The process of loading an ERDB database can be a simple matter of creating some
sequential files from other sequential files, or it can be a complex web of
connected sub-processes involving multiple groups of tables being loaded in
parallel by multiple worker processes. The ERDB Database Load Group object
provides housekeeping functions to simplify the management of the more complex
load tasks.

When discussing an ERDB database load, there are two similar concepts we use to
break the load into pieces: I<sections> and I<groups>. A I<section> is a
partition of the data that can be processed in isolation from other sections. A
I<group> is a set of tables that should be loaded at the same time. An ERDB load
group is a request to generate load files for one or more sections of the data
targeting a single group of tables.

A certain amount of bookkeeping is required in order to handle parallelism. For
each table, a separate output file is generated for each section. If a section
does not complete successfully, then its load file is deleted and the section
must be loaded again. Because each section has its own load file, only the
particular sections that fail need to be reloaded.

Individual load groups should subclass this object, providing a virtual override
for the L</Generate> method.

The subclass name should consist of the group name followed by an arbitrary
suffix, all in capital case (each word capitalized). So, for example, the
subclass name for a group named C<Feature> would be C<FeatureSproutLoader> or
C<FeatureAttributeLoader> or something similar. The group name should contain
only letters, and only the first letter should be capitalized. This allows the
load script to be case-insensitive with regard to incoming group names.

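As a minimal sketch of why this naming convention matters, the snippet below
recovers a group name from a class name using the same pattern the constructor
applies. The helper name C<group_from_class> and the sample class names are
hypothetical, and the C<ucfirst lc> normalization is only an assumption about
how a calling script might make incoming names case-insensitive.

```perl
use strict;
use warnings;

# Derive a group name from a loader class name: take the leading
# capitalized word, or fall back to the whole class name.
sub group_from_class {
    my ($class) = @_;
    return ($class =~ /^([A-Z][a-z]+)/ ? $1 : $class);
}

# Hypothetical subclass names, used only for illustration.
for my $class (qw(FeatureSproutLoader FeatureAttributeLoader GenomeSproutLoader)) {
    printf "%-24s => %s\n", $class, group_from_class($class);
}

# Assumed normalization of a name typed by a user.
my $requested = ucfirst lc 'FEATURE';
print "Requested group: $requested\n";
```

Because only the first letter of the group name is capitalized, the pattern
stops at the start of the suffix, whatever the suffix happens to be.
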
Any working or status files generated by a subclass should have a name with the
prefix C<dt>. This ensures they are deleted by the C<clear> option of
[[ERDBGeneratorPl]].

The fields in this object are as follows.

=over 4

=item db

[[ErdbPm]] object for accessing the target database

=item directory

directory into which the load files should be placed

=item group

name of this load group

=item lastKey

ID of the last major object processed

=item loaders

hash mapping the names of the group's tables to [[ERDBGeneratePm]] objects

=item stats

statistics object that can be used to track the progress of the load

=item section

name of this data section

=item source

object used to access the data from which the load files are to be generated

=item tables

reference to a list of the names of the tables in this group

=item options

hash containing the options originally passed in to the constructor

=back

=cut

=head3 new

    my $edbl = ERDBLoadGroup->new($source, $db, $directory, $options, @tables);

Construct a new ERDBLoadGroup object. The following parameters are expected:

=over 4

=item source

The object to be used by the subclass to access the source data.

=item db

The [[ErdbPm]] object for the database being loaded.

=item directory

Name of the directory to contain the load files.

=item options

Reference to a hash of options. At the current time, no options are needed
by this object, but they may be important to subclass objects.

=item tables

A list of the names for the tables in this load group.

=back

This constructor is deliberately kept lightweight in order to ensure that
L</GetGroupHash> is high-performance. For this reason, the [[ERDBGeneratePm]]
objects in the loaders hash are not created until L</ProcessSection>.

=cut

sub new {
    # Get the parameters.
    my ($class, $source, $db, $directory, $options, @tables) = @_;
    # Create a statistics object
    my $stats = Stats->new();
    # Compute the group name from the class name. It is the first word in
    # a name that is presumably capital case.
    my $group = ($class =~ /^([A-Z][a-z]+)/ ? $1 : $class);
    # Validate the directory.
    Confess("Load directory \"$directory\" not found or invalid.") if ! -d $directory;
    # Create the ERDBLoadGroup object. Note that so far we don't have any loaders
    # defined and the section has not yet been assigned. The "ProcessSection"
    # method is used to assign the section, and the loaders are created the first
    # time it's called.
    my $retVal = {
        db => $db,
        directory => $directory,
        group => $group,
        stats => $stats,
        source => $source,
        lastKey => undef,
        loaders => {},
        tables => \@tables,
        section => undef,
        options => $options
    };
    # Bless and return it.
    bless $retVal, $class;
    return $retVal;
}

=head2 Subclass Methods

=head3 Put

    $edbl->Put($table, %fields);

Place a table record in a load file. This method is the workhorse of the
file generation phase of a load.

=over 4

=item table

Name of the table being loaded.

=item fields

Hash of field names to field values for the fields in the table.

=back

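The method also keeps per-table counters: a record that produces no output
bytes is counted as a discard, and any other record adds to the record and
byte counts. The following is a rough, standalone sketch of that bookkeeping
only. The C<%stats> hash and the C<count_put> helper are hypothetical
stand-ins for the statistics object, and the byte counts are made up.

```perl
use strict;
use warnings;

my %stats;

# Record the outcome of writing one record. $bytes is what the table's
# loader reported; a false value means the record was discarded.
sub count_put {
    my ($table, $bytes) = @_;
    if (! $bytes) {
        $stats{"$table-discards"}++;
    } else {
        $stats{"$table-records"}++;
        $stats{"$table-bytes"} += $bytes;
    }
}

# Hypothetical byte counts for four attempted Feature records.
count_put(Feature => $_) for (120, 0, 85, 97);
print "$_: $stats{$_}\n" for sort keys %stats;
```

Keeping discards separate from records makes it easy to see from the final
statistics whether a table lost data during generation.
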
=cut

sub Put {
    # Get the parameters.
    my ($self, $table, %fields) = @_;
    # Get the loader for this table.
    my $loader = $self->{loaders}->{$table};
    # Complain if it doesn't exist.
    Confess("Table $table not found in load group $self->{group}.") if ! defined $loader;
    # Put this record to the loader's output file.
    my $bytes = $loader->Put(%fields);
    # Count the record and the bytes of data. If no bytes were output, the record
    # was discarded.
    if (! $bytes) {
        $self->Add("$table-discards" => 1);
    } else {
        $self->Add("$table-records" => 1);
        $self->Add("$table-bytes" => $bytes);
    }
}

=head3 Add

    $edbl->Add($statName => $count);

Add the specified count to the named statistical counter. The statistical
counts are kept in an internal statistics object whose contents are
displayed when the group is finished.

=over 4

=item statName

Name of the statistic to increment.

=item count

Value by which to increment it.

=back

=cut

sub Add {
    # Get the parameters.
    my ($self, $statName, $count) = @_;
    # Update the statistic.
    $self->{stats}->Add($statName => $count);
}

=head3 Track

    $edbl->Track($statName => $key, $period);

Save the specified key as the one currently in progress. If an error
occurs, the key value will appear in the output log. The named statistic
will also be incremented, and if the count is an exact multiple of the stated
period, a trace message will be output at level 3.

Most load groups have a primary object type that drives the main loop. When
something goes wrong, we want to know the ID of the offending object. When
things go right, we want to know how far we've progressed toward completion.
This method can be used to record each occurrence of a primary object, and
provide a log of the progress or our current position in times of stress.

=over 4

=item statName

Name of the statistic to be incremented. This should be a plural noun
describing the object whose key is coming in.

=item key

Key value to be displayed if something goes wrong.

=item period (optional)

If specified, should be the number of objects to be counted between each
level-3 trace message.

=back

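The effect of the period can be illustrated with a small standalone sketch.
It assumes a plain hash in place of the statistics object and collects the
progress messages in an array instead of calling C<Trace>. The C<track>
helper and the feature IDs are hypothetical.

```perl
use strict;
use warnings;

my %stats;
my @messages;

# Count one occurrence of $statName and record a progress message
# whenever the running count is a multiple of $period.
sub track {
    my ($statName, $key, $period) = @_;
    my $newValue = ++$stats{$statName};
    if ($period && $newValue % $period == 0) {
        push @messages, "$newValue $statName processed (last key: $key).";
    }
    return $newValue;
}

# Hypothetical feature IDs, tracked with a period of 100.
track(features => "fig|83333.1.peg.$_", 100) for 1 .. 250;
print "$_\n" for @messages;
```

With 250 objects and a period of 100, messages are recorded only at counts
100 and 200. Omitting the period suppresses progress messages entirely.
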
=cut

sub Track {
    # Get the parameters.
    my ($self, $statName, $key, $period) = @_;
    # Save the key.
    $self->{lastKey} = $key;
    # Count it.
    my $newValue = $self->{stats}->Add($statName => 1);
    # Do we need to output a progress message?
    if ($period && T(3) && ($newValue % $period == 0)) {
        # Yes.
        Trace("$newValue $statName processed for $self->{group} group.");
    }
}

=head3 section

    my $sectionID = $edbl->section();

Return the ID of the current section.

=cut

sub section {
    # Get the parameters.
    my ($self) = @_;
    # Return the result.
    return $self->{section};
}

=head3 source

    my $sourceObject = $edbl->source();

Return the source object used to get the data needed for creating
the load files.

=cut

sub source {
    # Get the parameters.
    my ($self) = @_;
    # Return the result.
    return $self->{source};
}

=head3 db

    my $erdbObject = $edbl->db();

Return the database object for the target database.

=cut

sub db {
    # Get the parameters.
    my ($self) = @_;
    # Return the result.
    return $self->{db};
}

=head2 Internal Methods

=head3 ProcessSection

    my $flag = $edbl->ProcessSection($section);

Generate the load file for a particular data section. This method calls
the virtual method L</Generate> to actually put the data into the load
files, and is responsible for assigning the section and finalizing the
load files if the load is successful.

=over 4

=item section

ID of the section to load.

=item RETURN

Returns TRUE if successful, FALSE if an error prevented loading the section.

=back

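The control flow can be sketched in isolation as follows. This is an
illustration under stated assumptions, not the module's code:
C<DemoTableLoader> is a hypothetical stand-in for the per-table loader
objects, C<process_section> is a hypothetical free function, and the
generator is passed in as a code reference rather than called as a method.

```perl
use strict;
use warnings;

# Stand-in for a per-table loader; it only records the calls it receives.
package DemoTableLoader;
sub new    { my ($class, $table) = @_; return bless { table => $table, log => [] }, $class }
sub Start  { my ($self, $section) = @_; push @{$self->{log}}, "start $section" }
sub Finish { my ($self) = @_; push @{$self->{log}}, "finish" }
sub Abort  { my ($self) = @_; push @{$self->{log}}, "abort" }

package main;

# Run $generate for one section. Loaders are created on first use, and
# every loader is finished on success or aborted on failure.
sub process_section {
    my ($loaders, $tables, $section, $generate) = @_;
    my $retVal = 0;
    eval {
        for my $table (@$tables) {
            $loaders->{$table} ||= DemoTableLoader->new($table);
            $loaders->{$table}->Start($section);
        }
        $generate->($section);
    };
    if ($@) {
        $_->Abort() for values %$loaders;
    } else {
        $_->Finish() for values %$loaders;
        $retVal = 1;
    }
    return $retVal;
}

my %loaders;
my @tables = qw(Genome Feature);
my $good = process_section(\%loaders, \@tables, 'sectionA', sub { 1 });
my $bad  = process_section(\%loaders, \@tables, 'sectionB', sub { die "bad data\n" });
print "sectionA ok: $good, sectionB ok: $bad\n";
print "Genome loader calls: @{$loaders{Genome}{log}}\n";
```

Wrapping both loader start-up and generation in a single C<eval> means a
failure at either stage leads to the same cleanup path, so no partially
written load file survives a failed section.
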
=cut

sub ProcessSection {
    # Get the parameters.
    my ($self, $section) = @_;
    # Declare the return variable. We'll set it to 1 if we succeed.
    my $retVal = 0;
    # Save the section ID.
    $self->{section} = $section;
    # Get the database object.
    my $db = $self->db();
    # Start a timer and protect ourselves from errors.
    my $startTime = time();
    eval {
        # Get the list of tables for this group.
        my @tables = @{$self->{tables}};
        # Get the loader hash.
        my $loaderHash = $self->{loaders};
        # Initialize the loaders for the necessary tables.
        for my $table (@tables) {
            # Get this table's loader.
            my $loader = $loaderHash->{$table};
            # If it doesn't exist yet, create it.
            if (! defined $loader) {
                $loader = ERDBGenerate->new($db, $self->{directory}, $table);
                # Save it for future use.
                $loaderHash->{$table} = $loader;
                # Count it.
                $self->Add(tables => 1);
            }
            $loader->Start($section);
        }
        # Generate the data to put in the newly-created load files.
        Trace("Calling generator.") if T(3);
        $self->Generate();
    };
    # Did it work?
    if ($@) {
        # No, so emit an error message and abort all the loaders.
        $self->{stats}->AddMessage("Error loading section $section: $@");
        if (defined $self->{lastKey}) {
            $self->{stats}->AddMessage("Error occurred while processing \"$self->{lastKey}\".");
        }
        $self->Add("section-errors" => 1);
        for my $loader (values %{$self->{loaders}}) {
            $loader->Abort();
        }
    } else {
        # Yes! Finish all the loaders.
        for my $loader (values %{$self->{loaders}}) {
            $loader->Finish();
        }
        # Update the load count and the timer.
        $self->Add("section-loads" => 1);
        $self->Add(duration => (time() - $startTime));
        # Denote that the section loaded successfully.
        $retVal = 1;
    }
    # Return the success indicator.
    return $retVal;
}

=head3 DisplayStats

    my $text = $edbl->DisplayStats();

Display the statistics for this load group.

=cut

sub DisplayStats {
    # Get the parameters.
    my ($self) = @_;
    # Return the result.
    return $self->{stats}->Show();
}

=head3 GetGroupHash

    my $groupHash = ERDBLoadGroup::GetGroupHash($erdb);

Return a hash that maps each load group in the specified database to its
constituent tables. This is useful when checking for problems with a load
or performing finishing tasks.

=over 4

=item erdb

[[ErdbPm]] database whose load information is desired.

=item RETURN

Returns a reference to a hash that maps each group name to a list of
table names.

=back

=cut

sub GetGroupHash {
    # Get the parameters.
    my ($erdb) = @_;
    # Initialize the return variable.
    my $retVal = {};
    # Loop through the list of load groups.
    for my $group ($erdb->LoadGroupList()) {
        # Get a loader for this group.
        my $loader = $erdb->Loader($group, {});
        # Stash the loader's tables in the output hash.
        $retVal->{$group} = $loader->{tables};
    }
    # Return the result.
    return $retVal;
}

=head3 ComputeGroups

    my @groupList = ERDBLoadGroup::ComputeGroups($erdb, $options, \@groups);

Compute the actual list of groups determined by the incoming options and
group list. If the list is an asterisk (C<*>), this method returns a list
of all the groups. If the options include C<resume>, this method returns
the first specified group and all the groups after it in the standard
ordering.

=over 4

=item erdb

[[ErdbPm]] object for the database being loaded.

=item options

Reference to a hash of command-line options for the command that started
this load operation.

=item groups

Reference to a list of group names specified on the command line.

=item RETURN

Returns the actual list of groups to be processed by the calling command. The
names will have been normalized to capital case.

=back

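The three selection modes can be sketched without a database as shown here.
The C<@ALL_GROUPS> list is a hypothetical standard ordering standing in for
what the database object would report, C<compute_groups> is a hypothetical
free function, and plain C<die> replaces the module's error handling.

```perl
use strict;
use warnings;

# Hypothetical standard ordering of load groups.
my @ALL_GROUPS = qw(Genome Feature Subsystem Reaction);

sub compute_groups {
    my ($options, $groups) = @_;
    my @retVal;
    if (@$groups && $groups->[0] eq '*') {
        # Asterisk: every group, in the standard order.
        @retVal = @ALL_GROUPS;
    } elsif ($options->{resume}) {
        # Resume: drop groups until the named starting group is in front.
        my $starter = $groups->[0];
        @retVal = @ALL_GROUPS;
        shift @retVal until (! @retVal) || $retVal[0] eq $starter;
        die "Invalid group name \"$starter\".\n" if ! @retVal;
    } else {
        # Explicit list: validate each name, keep the caller's order.
        my %checker = map { $_ => 1 } @ALL_GROUPS;
        for my $group (@$groups) {
            die "Invalid group name \"$group\".\n" if ! $checker{$group};
        }
        @retVal = @$groups;
    }
    return @retVal;
}

print join(' ', compute_groups({}, ['*'])), "\n";
print join(' ', compute_groups({ resume => 1 }, ['Feature'])), "\n";
print join(' ', compute_groups({}, ['Reaction', 'Genome'])), "\n";
```

Note that in resume mode only the first name on the command line is
consulted, while an explicit list is returned in the order given rather
than in the standard ordering.
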
=cut

sub ComputeGroups {
    # Get the parameters.
    my ($erdb, $options, $groups) = @_;
    # Declare the return variable.
    my @retVal;
    # Check the group list.
    if ($groups->[0] eq '*') {
        # Load all groups.
        @retVal = $erdb->LoadGroupList();
    } elsif ($options->{resume}) {
        # Load all groups after and including the specified one.
        my $starter = $groups->[0];
        @retVal = $erdb->LoadGroupList();
        shift @retVal until (! @retVal) || $retVal[0] eq $starter;
        # If we didn't find the specified group, it's an error.
        Confess("Invalid group name \"$starter\" in parameter list.") if (! @retVal);
    } else {
        # Here the groups are all on the command line. Stuff them in the return
        # list.
        @retVal = @{$groups};
        # Verify that they're all valid.
        my %checker = map { $_ => 1 } $erdb->LoadGroupList();
        for my $group (@retVal) {
            Confess("Invalid group name \"$group\" in parameter list.")
                if ! $checker{$group};
        }
    }
    # Normalize the group names and return them.
    return map { ucfirst $_ } @retVal;
}

=head2 Virtual Methods

=head3 Generate

    $edbl->Generate();

Generate the data for this load group with respect to the current
section. This method must be overridden by the subclass and should call
the L</Put> method to put data into the tables.

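A rough sketch of the override pattern follows. To stay self-contained it
does not use this module: C<DemoLoadGroup> is a hypothetical stand-in whose
C<Put> simply collects records in memory, and C<GenomeDemoLoader>, the
C<Genome> table, and its fields are invented for illustration.

```perl
use strict;
use warnings;

# Stand-in base class: Generate is "pure virtual" and Put collects
# records in memory. This is not the real load group class.
package DemoLoadGroup;

sub new {
    my ($class, $source) = @_;
    return bless { source => $source, rows => [] }, $class;
}

sub Put {
    my ($self, $table, %fields) = @_;
    push @{$self->{rows}}, [$table, {%fields}];
}

sub Generate {
    die "Pure virtual method Generate called.\n";
}

# Hypothetical subclass: one Genome record per entry in the source hash.
package GenomeDemoLoader;
our @ISA = ('DemoLoadGroup');

sub Generate {
    my ($self) = @_;
    for my $id (sort keys %{$self->{source}}) {
        $self->Put('Genome', id => $id, name => $self->{source}{$id});
    }
}

package main;

my $loader = GenomeDemoLoader->new({ g1 => 'Genome one', g2 => 'Genome two' });
$loader->Generate();
print scalar(@{$loader->{rows}}), " records generated\n";

# Calling the base-class method directly fails, as intended.
my $ok = eval { DemoLoadGroup->new({})->Generate(); 1 };
print "Base Generate died: $@" if ! $ok;
```

The subclass only decides which records exist; where and how they are
written stays in the base class behind C<Put>.
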
=cut

sub Generate {
    Confess("Pure virtual method Generate called.");
}

1;
