#!/usr/bin/perl -w
# Sprout/ERDBLoadGroup.pm, revision 1.3

#
# Copyright (c) 2003-2006 University of Chicago and Fellowship
# for Interpretations of Genomes. All Rights Reserved.
#
# This file is part of the SEED Toolkit.
#
# The SEED Toolkit is free software. You can redistribute
# it and/or modify it under the terms of the SEED Toolkit
# Public License.
#
# You should have received a copy of the SEED Toolkit Public License
# along with this program; if not write to the University of Chicago
# at info@ci.uchicago.edu or the Fellowship for Interpretation of
# Genomes at veronika@thefig.info or download a copy from
# http://www.theseed.org/LICENSE.TXT.
#

package ERDBLoadGroup;

use strict;
use Tracer;
use ERDB;
use Stats;
use Time::HiRes qw(time);
use ERDBGenerate;

=head1 ERDB Database Load Group Object

The process of loading an ERDB database can be a simple matter of creating some
sequential files from other sequential files, or it can be a complex web of
connected sub-processes involving multiple groups of tables being loaded in
parallel by multiple worker processes. The ERDB Database Load Group object
provides housekeeping functions to simplify the management of the more complex
load tasks.

When discussing an ERDB database load, there are two similar concepts we use to
break the load into pieces: I<sections> and I<groups>. A I<section> is a
partition of the data that can be processed in isolation from other sections. A
I<group> is a set of tables that should be loaded at the same time. An ERDB load
group is a request to generate load files for one or more sections of the data
targeting a single group of tables.

A certain amount of bookkeeping is required in order to handle parallelism. For
each table, a separate output file is generated for each section. If a section
does not complete successfully, then its load file is deleted and the section
must be loaded again. Because each section has its own load file, only the
particular sections that fail need to be reloaded.

Individual load groups should subclass this object, providing a virtual override
for the L</Generate> method.

The subclass name should consist of the group name followed by noise in capital
case. So, for example, the subclass name for a group named C<Feature> would be
C<FeatureSproutLoader> or C<FeatureAttributeLoader> or something similar. The
group name should only be letters, and only the first letter should be capitalized.
This allows the load script to be case-insensitive with regard to incoming group
names.

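As an illustration of the naming rule, the following sketch applies the same
pattern the constructor uses to extract the group name from a subclass name.
The class names are examples only and need not exist in any real database.
Both of them yield the group name C<Feature>.

    for my $class (qw(FeatureSproutLoader FeatureAttributeLoader)) {
        # The group name is the leading capitalized word of the class name.
        my $group = ($class =~ /^([A-Z][a-z]+)/ ? $1 : $class);
        print "$class belongs to group $group\n";
    }
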
Any working or status files generated by a subclass should have a prefix of C<dt>-something.
This will ensure they are deleted by the C<clear> option of [[ERDBGeneratorPl]].

The fields in this object are as follows.

=over 4

=item db

[[ErdbPm]] object for accessing the target database

=item directory

Directory into which the load files should be placed.

=item group

name of this load group

=item lastKey

ID of the last major object processed

=item loaders

hash mapping the names of the group's tables to [[ERDBGeneratePm]] objects

=item stats

statistics object that can be used to track the progress of the load

=item section

name of this data section

=item source

object used to access the data from which the load files are to be generated

=item tables

reference to a list of the names of the tables in this group

=item options

hash containing the options originally passed in to the constructor

=back

=cut

=head3 new

    my $edbl = ERDBLoadGroup->new($source, $db, $options, @tables);

Construct a new ERDBLoadGroup object. The load directory is not passed in; it
is obtained from the database object. The following parameters are expected:

=over 4

=item source

The object to be used by the subclass to access the source data. If this parameter
is undefined, the source object will be retrieved from the database object as soon
as the client calls the L</source> method.

=item db

The [[ErdbPm]] object for the database being loaded.

=item options

Reference to a hash of options. At the current time, no options are needed
by this object, but they may be important to subclass objects.

=item tables

A list of the names for the tables in this load group.

=back

=cut

sub new {
    # Get the parameters.
    my ($class, $source, $db, $options, @tables) = @_;
    # Create a statistics object
    my $stats = Stats->new();
    # Compute the group name from the class name. It is the first word in
    # a name that is presumably capital case.
    my $group = ($class =~ /^([A-Z][a-z]+)/ ? $1 : $class);
    # Get the directory.
    my $directory = $db->LoadDirectory();
    Confess("Load directory \"$directory\" not found or invalid.") if ! -d $directory;
    # Create the ERDBLoadGroup object. Note that so far we don't have any loaders
    # defined and the section has not yet been assigned. The "ProcessSection"
    # method is used to assign the section, and the loaders are created the first
    # time it's called.
    my $retVal = {
        db => $db,
        directory => $directory,
        group => $group,
        stats => $stats,
        source => $source,
        lastKey => undef,
        loaders => {},
        tables => \@tables,
        section => undef,
        options => $options
    };
    # Bless and return it.
    bless $retVal, $class;
    return $retVal;
}

=head2 Subclass Methods

=head3 Put

    $edbl->Put($table, %fields);

Place a table record in a load file. This method is the workhorse of the
file generation phase of a load.

=over 4

=item table

Name of the table being loaded.

=item fields

Hash of field names to field values for the fields in the table.

=back

=cut

sub Put {
    # Get the parameters.
    my ($self, $table, %fields) = @_;
    # Get the loader for this table.
    my $loader = $self->{loaders}->{$table};
    # Complain if it doesn't exist.
    Confess("Table $table not found in load group $self->{group}.") if ! defined $loader;
    # Put this record to the loader's output file.
    my $bytes = $loader->Put(%fields);
    # Count the record and the bytes of data. If no bytes were output, the record
    # was discarded.
    if (! $bytes) {
        $self->Add("$table-discards" => 1);
    } else {
        $self->Add("$table-records" => 1);
        $self->Add("$table-bytes" => $bytes);
    }
}

=head3 Add

    $edbl->Add($statName => $count);

Add the specified count to the named statistical counter. The statistical
counts are kept in an internal statistics object whose contents are
displayed when the group is finished.

=over 4

=item statName

Name of the statistic to increment.

=item count

Value by which to increment it.

=back

=cut

sub Add {
    # Get the parameters.
    my ($self, $statName, $count) = @_;
    # Update the statistic.
    $self->{stats}->Add($statName => $count);
}

=head3 Track

    $edbl->Track($statName => $key, $period);

Save the specified key as the one currently in progress. If an error
occurs, the key value will appear in the output log. The named statistic
will also be incremented, and if the count is an exact multiple of the stated
period, a trace message will be output at level 3.

Most load groups have a primary object type that drives the main loop. When
something goes wrong, we want to know the ID of the offending object. When
things go right, we want to know how far we've progressed toward completion.
This method can be used to record each occurrence of a primary object, and
provide a log of the progress or our current position in times of stress.

=over 4

=item statName

Name of the statistic to be incremented. This should be a plural noun
describing the object whose key is coming in.

=item key

Key value to be displayed if something goes wrong.

=item period (optional)

If specified, should be the number of objects to be counted between each
level-3 trace message.

=back

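A typical call might look like the sketch below. The C<@genomeIDs> list and
the C<genomes> statistic name are illustrative assumptions, not part of this
module. With a period of 100, a progress message is traced after every 100th
genome, provided level-3 tracing is active.

    for my $genomeID (@genomeIDs) {
        # Remember the current genome and count it.
        $self->Track(genomes => $genomeID, 100);
        # ... generate the records for this genome ...
    }
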
=cut

sub Track {
    # Get the parameters.
    my ($self, $statName, $key, $period) = @_;
    # Save the key.
    $self->{lastKey} = $key;
    # Count it.
    my $newValue = $self->{stats}->Add($statName => 1);
    # Do we need to output a progress message?
    if ($period && T(3) && ($newValue % $period == 0)) {
        # Yes.
        Trace("$newValue $statName processed for $self->{group} group.");
    }
}

=head3 section

    my $sectionID = $edbl->section();

Return the ID of the current section.

=cut

sub section {
    # Get the parameters.
    my ($self) = @_;
    # Return the result.
    return $self->{section};
}

=head3 source

    my $sourceObject = $edbl->source();

Return the source object used to get the data needed for creating
the load files.

=cut

sub source {
    # Get the parameters.
    my ($self) = @_;
    # If we do not have a source object, retrieve it.
    if (! defined $self->{source}) {
        $self->{source} = $self->{db}->GetSourceObject();
    }
    # Return the result.
    return $self->{source};
}

=head3 db

    my $erdbObject = $edbl->db();

Return the database object for the target database.

=cut

sub db {
    # Get the parameters.
    my ($self) = @_;
    # Return the result.
    return $self->{db};
}

=head2 Internal Methods

=head3 ProcessSection

    my $flag = $edbl->ProcessSection($section);

Generate the load file for a particular data section. This method calls
the virtual method L</Generate> to actually put the data into the load
files, and is responsible for assigning the section and finalizing the
load files if the load is successful.

=over 4

=item section

ID of the section to load.

=item RETURN

Returns TRUE if successful, FALSE if an error prevented loading the section.

=back

=cut

sub ProcessSection {
    # Get the parameters.
    my ($self, $section) = @_;
    # Declare the return variable. We'll set it to 1 if we succeed.
    my $retVal = 0;
    # Save the section ID.
    $self->{section} = $section;
    # Get the database object.
    my $db = $self->db();
    # Start a timer and protect ourselves from errors.
    my $startTime = time();
    eval {
        # Get the list of tables for this group.
        my @tables = @{$self->{tables}};
        # Get the loader hash.
        my $loaderHash = $self->{loaders};
        # Initialize the loaders for the necessary tables.
        for my $table (@tables) {
            # Get this table's loader.
            my $loader = $loaderHash->{$table};
            # If it doesn't exist yet, create it.
            if (! defined $loader) {
                $loader = ERDBGenerate->new($db, $self->{directory}, $table);
                # Save it for future use.
                $loaderHash->{$table} = $loader;
                # Count it.
                $self->Add(tables => 1);
            }
            $loader->Start($section);
        }
        # Generate the data to put in the newly-created load files.
        Trace("Calling generator.") if T(3);
        $self->Generate();
    };
    # Did it work?
    if ($@) {
        # No, so emit an error message and abort all the loaders.
        $self->{stats}->AddMessage("Error loading section $section: $@");
        if (defined $self->{lastKey}) {
            $self->{stats}->AddMessage("Error occurred while processing \"$self->{lastKey}\".");
        }
        $self->Add("section-errors" => 1);
        for my $loader (values %{$self->{loaders}}) {
            $loader->Abort();
        }
    } else {
        # Yes! Finish all the loaders.
        for my $loader (values %{$self->{loaders}}) {
            $loader->Finish();
        }
        # Update the load count and the timer.
        $self->Add("section-loads" => 1);
        $self->Add(duration => (time() - $startTime));
        # Denote that the section loaded successfully.
        $retVal = 1;
    }
    # Return the success indicator promised by the documentation.
    return $retVal;
}

=head3 DisplayStats

    my $text = $edbl->DisplayStats();

Display the statistics for this load group.

=cut

sub DisplayStats {
    # Get the parameters.
    my ($self) = @_;
    # Return the result.
    return $self->{stats}->Show();
}

=head3 GetGroupHash

    my $groupHash = ERDBLoadGroup::GetGroupHash($erdb);

Return a hash that maps each load group in the specified database to its
constituent tables. This is useful when checking for problems with a load
or performing finishing tasks.

=over 4

=item erdb

[[ErdbPm]] database whose load information is desired.

=item RETURN

Returns a reference to a hash that maps each group name to a reference to
a list of table names.

=back

=cut

sub GetGroupHash {
    # Get the parameters.
    my ($erdb) = @_;
    # Initialize the return variable.
    my $retVal = {};
    # Loop through the list of load groups.
    for my $group ($erdb->LoadGroupList()) {
        # Stash the loader's tables in the output hash. GetTables returns a
        # list, so we wrap it in a list reference; a bare scalar assignment
        # would store only the table count.
        $retVal->{$group} = [GetTables($erdb, $group)];
    }
    # Return the result.
    return $retVal;
}

=head3 GetTables

    my @tables = ERDBLoadGroup::GetTables($erdb, $group);

Return the list of tables belonging to the specified load group.

=over 4

=item erdb

[[ErdbPm]] object for the database being loaded.

=item group

Name of relevant group.

=item RETURN

Returns a list of the tables loaded by the specified group.

=back

=cut

sub GetTables {
    # Get the parameters.
    my ($erdb, $group) = @_;
    # Create a loader for the specified group.
    my $loader = $erdb->Loader($group, undef, {});
    # Extract the list of tables.
    my @retVal = @{$loader->{tables}};
    # Return the result.
    return @retVal;
}


=head3 ComputeGroups

    my @groupList = ERDBLoadGroup::ComputeGroups($erdb, \@groups);

Compute the actual list of groups determined by the incoming group list.

=over 4

=item erdb

[[ErdbPm]] object for the database being loaded.

=item groups

Reference to a list of group names specified on the command line. A plus sign
(C<+>) stands for every group that follows the previously listed group in the
standard order. If no group precedes it, it stands for all of the groups.

=item RETURN

Returns the actual list of groups to be processed by the calling command. The
names will have been normalized to capital case.

=back

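The following sketch shows how the plus sign expands. It assumes a database
whose C<LoadGroupList> method returns the hypothetical groups C<Genome>,
C<Feature>, and C<Subsystem>, in that order. The C<StubDB> class is invented
for this example and merely stands in for a real [[ErdbPm]] object. Running
it requires the toolkit modules to be on the Perl library path.

    package StubDB;
    sub new { return bless {}, shift }
    # Hypothetical group list in standard order.
    sub LoadGroupList { return qw(Genome Feature Subsystem) }

    package main;
    use ERDBLoadGroup;

    my $erdb = StubDB->new();
    # A name followed by a plus sign selects that group and all later ones.
    my @fromFeature = ERDBLoadGroup::ComputeGroups($erdb, ['Feature', '+']);
    # @fromFeature is now ('Feature', 'Subsystem').
    # A plus sign by itself selects every group.
    my @everything = ERDBLoadGroup::ComputeGroups($erdb, ['+']);
    # @everything is now ('Genome', 'Feature', 'Subsystem').
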
=cut

sub ComputeGroups {
    # Get the parameters.
    my ($erdb, $groups) = @_;
    # Get the complete group list in standard order.
    my @allGroups = $erdb->LoadGroupList();
    # Create a hash for validation purposes. This will map each valid group
    # name to its position in the standard order.
    my %allGroupHash;
    for (my $i = 0; $i <= $#allGroups; $i++) {
        $allGroupHash{$allGroups[$i]} = $i;
    }
    # This variable will be the index of the last-processed group in
    # the standard order. We start it before the first group in the list.
    my $lastI = -1;
    # The listed groups will be put in here.
    my @retVal;
    # Process the group list.
    for my $group (@$groups) {
        # Process this group.
        if ($group eq '+') {
            # Here we have a plus sign. Push in everything after the previous
            # group processed. Note that we'll be ending at the last position.
            # A second "+" after this one will generate no entries in the result
            # list.
            my $firstI = $lastI + 1;
            $lastI = $#allGroups;
            push @retVal, @allGroups[$firstI..$lastI];
        } elsif (exists $allGroupHash{$group}) {
            # Here we have a valid group name. Push it into the list.
            push @retVal, $group;
            # Remember its location in case there's a plus sign.
            $lastI = $allGroupHash{$group};
        } else {
            # This is an error.
            Confess("Invalid load group name $group.");
        }
    }
    # Normalize the group names and return them.
    return map { ucfirst $_ } @retVal;
}

=head3 KillFileName

    my $fileName = ERDBLoadGroup::KillFileName($erdb, $directory);

Compute the kill file name for the specified database in the specified
directory. When the [[ERDBGeneratorPl]] script sees the kill file, it will
terminate itself at the end of the current section.

=over 4

=item erdb

[[ErdbPm]] object for the database. Its class name, in lower case, becomes
part of the file name.

=item directory (optional)

Load directory for the database.

=item RETURN

Returns the specified database's kill file name. If a directory is specified,
it is prefixed to the name with an intervening slash.

=back

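For example, an operator's script could request a graceful stop roughly as
follows. This is a sketch only: it assumes that the mere existence of the file
is what the generator script checks for, and that C<$erdb> is the database
object for the load in progress.

    my $killFile = ERDBLoadGroup::KillFileName($erdb, $erdb->LoadDirectory());
    # Create an empty kill file. Running generators should stop after
    # finishing their current section.
    open(my $fh, '>', $killFile) or die "Could not create $killFile: $!";
    close $fh;
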
=cut

sub KillFileName {
    # Get the parameters.
    my ($erdb, $directory) = @_;
    # Compute the kill file name. We start with the database name in
    # lower case, then prefix it with "kill_";
    my $dbName = lc ref $erdb;
    my $retVal = ERDBGenerate::CreateFileName("kill_$dbName", undef, 'control', $directory);
    # Return the result.
    return $retVal;
}


=head2 Virtual Methods

=head3 Generate

    $edbl->Generate();

Generate the data for this load group with respect to the current
section. This method must be overridden by the subclass and should call
the L</Put> method to put data into the tables.

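A minimal subclass might be sketched as follows. The class name, the
C<Feature> table, its field names, and the C<GetFeatures> method of the source
object are all hypothetical; only C<section>, C<source>, C<Track>, and C<Put>
come from this module. The C<Feature> table must be in the table list given to
the constructor, or C<Put> will reject it.

    package FeatureSproutLoader;

    use strict;
    use base 'ERDBLoadGroup';

    sub Generate {
        my ($self) = @_;
        # Find out which section we are loading and where the data comes from.
        my $section = $self->section();
        my $source = $self->source();
        # GetFeatures is an assumed method returning [id, type] pairs.
        for my $feature ($source->GetFeatures($section)) {
            my ($id, $type) = @$feature;
            # Record our position; trace progress every 1000 features.
            $self->Track(features => $id, 1000);
            # Write the record to the Feature table's load file.
            $self->Put(Feature => id => $id, 'feature-type' => $type);
        }
    }

    1;
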
=cut

sub Generate {
    Confess("Pure virtual method Generate called.");
}

1;
