#!/usr/bin/perl -w

# Sprout/ERDBLoadGroup.pm, revision 1.4

#
# Copyright (c) 2003-2006 University of Chicago and Fellowship
# for Interpretations of Genomes. All Rights Reserved.
#
# This file is part of the SEED Toolkit.
#
# The SEED Toolkit is free software. You can redistribute
# it and/or modify it under the terms of the SEED Toolkit
# Public License.
#
# You should have received a copy of the SEED Toolkit Public License
# along with this program; if not write to the University of Chicago
# at info@ci.uchicago.edu or the Fellowship for Interpretation of
# Genomes at veronika@thefig.info or download a copy from
# http://www.theseed.org/LICENSE.TXT.
#

package ERDBLoadGroup;

use strict;
use Tracer;
use ERDB;
use Stats;
use Time::HiRes qw(time);
use ERDBGenerate;

=head1 ERDB Database Load Group Object

The process of loading an ERDB database can be a simple matter of creating some
sequential files from other sequential files, or it can be a complex web of
connected sub-processes involving multiple groups of tables being loaded in
parallel by multiple worker processes. The ERDB Database Load Group object
provides housekeeping functions to simplify the management of the more complex
load tasks.

When discussing an ERDB database load, there are two similar concepts we use to
break the load into pieces: I<sections> and I<groups>. A I<section> is a
partition of the data that can be processed in isolation from other sections. A
I<group> is a set of tables that should be loaded at the same time. An ERDB load
group is a request to generate load files for one or more sections of the data
targeting a single group of tables.

A certain amount of bookkeeping is required in order to handle parallelism. For
each table, a separate output file is generated for each section. If a section
does not complete successfully, then its load file is deleted and the section
must be loaded again. Because each section has its own load file, only the
particular sections that fail need to be reloaded.

Individual load groups should subclass this object, providing a virtual override
for the L</Generate> method.

The subclass name should consist of the group name followed by an arbitrary
suffix, all in capital case. So, for example, the subclass name for a group
named C<Feature> would be C<FeatureSproutLoader> or C<FeatureAttributeLoader> or
something similar. The group name should contain only letters, and only the
first letter should be capitalized, because the constructor takes the group name
to be the leading capitalized word of the class name. This allows the load
script to be case-insensitive with regard to incoming group names.

Any working or status files generated by a subclass should have a prefix of C<dt>-something.
This will ensure they are deleted by the C<clear> option of [[ERDBGeneratorPl]].

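As an illustration of these conventions, a load group subclass might look like
the following sketch. The class name C<FeatureSproutLoader> comes from the
example above; the table names (C<Feature>, C<IsLocatedIn>) and the
source-object method C<GetFeatures> are hypothetical and are not part of this
module.

    package FeatureSproutLoader;

    use strict;
    use base 'ERDBLoadGroup';

    sub new {
        my ($class, $source, $db, $options) = @_;
        # Hypothetical list of the tables in the Feature group.
        my @tables = qw(Feature IsLocatedIn);
        # The base constructor blesses the object into $class and derives
        # the group name "Feature" from the class name.
        return ERDBLoadGroup::new($class, $source, $db, $options, @tables);
    }

    sub Generate {
        my ($self) = @_;
        # The section to process was assigned by ProcessSection.
        my $section = $self->section();
        # GetFeatures is an assumed method of the source object that
        # returns a list of [featureID, contigID] pairs.
        for my $feature ($self->source()->GetFeatures($section)) {
            my ($fid, $contigID) = @$feature;
            # Remember our position and trace progress every 1000 features.
            $self->Track(features => $fid, 1000);
            $self->PutE(Feature => $fid);
            $self->PutR(IsLocatedIn => $fid, $contigID);
        }
    }

    1;
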
The fields in this object are as follows.

=over 4

=item db

[[ErdbPm]] object for accessing the target database

=item directory

Directory into which the load files should be placed.

=item group

name of this load group

=item lastKey

ID of the last major object processed

=item loaders

hash mapping the names of the group's tables to [[ERDBGeneratePm]] objects

=item stats

statistics object that can be used to track the progress of the load

=item section

name of this data section

=item source

object used to access the data from which the load files are to be generated

=item tables

reference to a list of the names of the tables in this group

=item options

hash containing the options originally passed in to the constructor

=back

=cut

=head3 new

    my $edbl = ERDBLoadGroup->new($source, $db, $options, @tables);

Construct a new ERDBLoadGroup object. The load directory is not a parameter; it
is obtained from the database object's C<LoadDirectory> method. The following
parameters are expected:

=over 4

=item source

The object to be used by the subclass to access the source data. If this parameter
is undefined, the source object will be retrieved from the database object as soon
as the client calls the L</source> method.

=item db

The [[ErdbPm]] object for the database being loaded.

=item options

Reference to a hash of options. At the current time, no options are needed
by this object, but they may be important to subclass objects.

=item tables

A list of the names for the tables in this load group.

=back

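The group name is not passed in; the constructor derives it from the class
name. The following stand-alone sketch applies the same pattern match to a few
class names to show what is extracted. The third name is a made-up
counterexample with no leading capital.

    use strict;
    use warnings;

    for my $class (qw(FeatureSproutLoader FeatureAttributeLoader featureloader)) {
        # Take the leading capitalized word, or the whole name if there is none.
        my $group = ($class =~ /^([A-Z][a-z]+)/ ? $1 : $class);
        print "$class => $group\n";
    }
    # Prints:
    #   FeatureSproutLoader => Feature
    #   FeatureAttributeLoader => Feature
    #   featureloader => featureloader
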
=cut

sub new {
    # Get the parameters.
    my ($class, $source, $db, $options, @tables) = @_;
    # Create a statistics object
    my $stats = Stats->new();
    # Compute the group name from the class name. It is the first word in
    # a name that is presumably capital case.
    my $group = ($class =~ /^([A-Z][a-z]+)/ ? $1 : $class);
    # Get the directory.
    my $directory = $db->LoadDirectory();
    Confess("Load directory \"$directory\" not found or invalid.") if ! -d $directory;
    # Create the ERDBLoadGroup object. Note that so far we don't have any loaders
    # defined and the section has not yet been assigned. The "ProcessSection"
    # method is used to assign the section, and the loaders are created the first
    # time it's called.
    my $retVal = {
        db => $db,
        directory => $directory,
        group => $group,
        stats => $stats,
        source => $source,
        lastKey => undef,
        loaders => {},
        tables => \@tables,
        section => undef,
        options => $options
    };
    # Bless and return it.
    bless $retVal, $class;
    return $retVal;
}

=head2 Subclass Methods

=head3 Put

    $edbl->Put($table, %fields);

Place a table record in a load file. This method is the workhorse of the
file generation phase of a load.

=over 4

=item table

Name of the table being loaded.

=item fields

Hash of field names to field values for the fields in the table.

=back

=cut

sub Put {
    # Get the parameters.
    my ($self, $table, %fields) = @_;
    # Get the loader for this table.
    my $loader = $self->{loaders}->{$table};
    # Complain if it doesn't exist.
    Confess("Table $table not found in load group $self->{group}.") if ! defined $loader;
    # Put this record to the loader's output file.
    my $bytes = $loader->Put(%fields);
    # Count the record and the bytes of data. If no bytes were output, the record
    # was discarded.
    if (! $bytes) {
        $self->Add("$table-discards" => 1);
    } else {
        $self->Add("$table-records" => 1);
        $self->Add("$table-bytes" => $bytes);
    }
}

=head3 PutE

    $edbl->PutE($table => $id, %fields);

Place an entity-based table record in a load file. The first field
specified after the table name is the ID.

=over 4

=item table

Name of the relevant table.

=item id

ID of the relevant entity.

=item fields

Hash mapping field names to values.

=back

=cut

sub PutE {
    # Get the parameters.
    my ($self, $table, $id, %fields) = @_;
    # Put the record.
    $self->Put($table, id => $id, %fields);
    # Record that we've done a PutE.
    $self->Add(putE => 1);
}

=head3 PutR

    $edbl->PutR($table => $from, $to, %fields);

Place a relationship record in a load file. The first two fields
specified after the table name are the from-link and the to-link,
respectively.

=over 4

=item table

Name of the relevant relationship.

=item from

ID of the from-entity.

=item to

ID of the to-entity.

=item fields

Hash mapping field names to field values.

=back

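L</PutE> and this method are thin wrappers that turn their positional
arguments into named fields before calling L</Put>. The following stand-alone
sketch mimics that mapping with plain functions that print the resulting
record instead of writing it to a load file. The table names, field names, and
IDs are hypothetical.

    use strict;
    use warnings;

    # Stand-in for Put: show the table and its fields in sorted order.
    sub Put {
        my ($table, %fields) = @_;
        print "$table: " . join(", ", map { "$_=$fields{$_}" } sort keys %fields) . "\n";
    }

    # The entity ID becomes the "id" field.
    sub PutE {
        my ($table, $id, %fields) = @_;
        Put($table, id => $id, %fields);
    }

    # The two entity IDs become the "from-link" and "to-link" fields.
    sub PutR {
        my ($table, $from, $to, %fields) = @_;
        Put($table, 'from-link' => $from, 'to-link' => $to, %fields);
    }

    PutE(Feature => 'peg.1', 'feature-type' => 'peg');
    PutR(IsLocatedIn => 'peg.1', 'contig.7', beg => 100, len => 300);
    # Prints:
    #   Feature: feature-type=peg, id=peg.1
    #   IsLocatedIn: beg=100, from-link=peg.1, len=300, to-link=contig.7
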
=cut

sub PutR {
    # Get the parameters.
    my ($self, $table, $from, $to, %fields) = @_;
    # Put the record.
    $self->Put($table, 'from-link' => $from, 'to-link' => $to, %fields);
    # Record that we've done a PutR.
    $self->Add(putR => 1);
}

=head3 Add

    $edbl->Add($statName => $count);

Add the specified count to the named statistical counter. The statistical
counts are kept in an internal statistics object whose contents are
displayed when the group is finished.

=over 4

=item statName

Name of the statistic to increment.

=item count

Value by which to increment it.

=back

=cut

sub Add {
    # Get the parameters.
    my ($self, $statName, $count) = @_;
    # Update the statistic.
    $self->{stats}->Add($statName => $count);
}

=head3 AddWarning

    $edbl->AddWarning($errorType => $message);

Record a warning. Warnings indicate possible errors in the incoming data.
The first warning of a specified type is added as a message to the load
statistics. All warnings are also traced at level 3.

=over 4

=item errorType

Type of error indicated by the warning. This is used as the label when the
warning is counted in the statistics object.

=item message

Message describing the reason for the warning.

=back

=cut

sub AddWarning {
    # Get the parameters.
    my ($self, $errorType, $message) = @_;
    # Count the warning. The count is passed explicitly rather than relying
    # on the statistics object to supply a default.
    my $count = $self->Add($errorType => 1);
    # Is this the first one of this type?
    if ($count == 1) {
        # Yes, add it to the messages for the end.
        $self->{stats}->AddMessage("$errorType: $message");
    }
    # Trace every warning, including the first.
    Trace("Data warning: $message") if T(3);
}

=head3 Track

    $edbl->Track($statName => $key, $period);

Save the specified key as the one currently in progress. If an error
occurs, the key value will appear in the output log. The named statistic
will also be incremented, and if the count is an exact multiple of the stated
period, a trace message will be output at level 3.

Most load groups have a primary object type that drives the main loop. When
something goes wrong, we want to know the ID of the offending object. When
things go right, we want to know how far we've progressed toward completion.
This method can be used to record each occurrence of a primary object, so that
we have a log of our progress and, if the load fails, a record of the object
that was being processed at the time.

=over 4

=item statName

Name of the statistic to be incremented. This should be a plural noun
describing the object whose key is coming in.

=item key

Key value to be displayed if something goes wrong.

=item period (optional)

If specified, should be the number of objects to be counted between each
level-3 trace message.

=back

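The effect of the period can be seen in the following stand-alone sketch, which
replaces the statistics object with a plain hash and the trace call with
C<print>. The statistic name and key values are hypothetical.

    use strict;
    use warnings;

    my %stats;
    my $lastKey;

    sub Track {
        my ($statName, $key, $period) = @_;
        # Remember the object currently in progress.
        $lastKey = $key;
        # Count it.
        my $newValue = ++$stats{$statName};
        # Report only when the count is an exact multiple of the period.
        if ($period && $newValue % $period == 0) {
            print "$newValue $statName processed.\n";
        }
    }

    Track(genomes => "genome$_", 100) for 1 .. 250;
    print "Last key: $lastKey\n";
    # Prints:
    #   100 genomes processed.
    #   200 genomes processed.
    #   Last key: genome250
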
=cut

sub Track {
    # Get the parameters.
    my ($self, $statName, $key, $period) = @_;
    # Save the key.
    $self->{lastKey} = $key;
    # Count it.
    my $newValue = $self->{stats}->Add($statName => 1);
    # Do we need to output a progress message?
    if ($period && T(3) && ($newValue % $period == 0)) {
        # Yes.
        Trace("$newValue $statName processed for $self->{group} group.");
    }
}

=head3 section

    my $sectionID = $edbl->section();

Return the ID of the current section.

=cut

sub section {
    # Get the parameters.
    my ($self) = @_;
    # Return the result.
    return $self->{section};
}

=head3 source

    my $sourceObject = $edbl->source();

Return the source object used to get the data needed for creating
the load files.

=cut

sub source {
    # Get the parameters.
    my ($self) = @_;
    # If we do not have a source object, retrieve it.
    if (! defined $self->{source}) {
        $self->{source} = $self->{db}->GetSourceObject();
    }
    # Return the result.
    return $self->{source};
}

=head3 db

    my $erdbObject = $edbl->db();

Return the database object for the target database.

=cut

sub db {
    # Get the parameters.
    my ($self) = @_;
    # Return the result.
    return $self->{db};
}

=head2 Internal Methods

=head3 ProcessSection

    my $flag = $edbl->ProcessSection($section);

Generate the load files for a particular data section. This method calls
the virtual method L</Generate> to actually put the data into the load
files, and is responsible for assigning the section, finalizing the
load files if the load is successful, and discarding them if it is not.

=over 4

=item section

ID of the section to load.

=item RETURN

Returns TRUE if successful, FALSE if an error prevented loading the section.

=back

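The error handling described above can be sketched in isolation. In the
following stand-alone example, C<process_section> and C<generate> are
hypothetical stand-ins for this method and for L</Generate>, and the C<print>
statements take the place of the calls that finish or abort the loaders.

    use strict;
    use warnings;

    # Stand-in for Generate: section B contains bad data.
    sub generate {
        my ($section) = @_;
        die "bad data\n" if $section eq 'B';
    }

    sub process_section {
        my ($section) = @_;
        # Assume failure until the section completes.
        my $retVal = 0;
        eval {
            generate($section);
        };
        if ($@) {
            my $error = $@;
            chomp $error;
            print "Section $section failed ($error); load files discarded.\n";
        } else {
            print "Section $section finished; load files kept.\n";
            $retVal = 1;
        }
        return $retVal;
    }

    # Only the sections that fail need to be processed again.
    my @failed = grep { ! process_section($_) } qw(A B C);
    print "Sections to reload: @failed\n";
    # Prints:
    #   Section A finished; load files kept.
    #   Section B failed (bad data); load files discarded.
    #   Section C finished; load files kept.
    #   Sections to reload: B
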
=cut

sub ProcessSection {
    # Get the parameters.
    my ($self, $section) = @_;
    # Declare the return variable. We'll set it to 1 if we succeed.
    my $retVal = 0;
    # Save the section ID.
    $self->{section} = $section;
    # Get the database object.
    my $db = $self->db();
    # Start a timer and protect ourselves from errors.
    my $startTime = time();
    eval {
        # Get the list of tables for this group.
        my @tables = @{$self->{tables}};
        # Get the loader hash.
        my $loaderHash = $self->{loaders};
        # Initialize the loaders for the necessary tables.
        for my $table (@tables) {
            # Get this table's loader.
            my $loader = $loaderHash->{$table};
            # If it doesn't exist yet, create it.
            if (! defined $loader) {
                $loader = ERDBGenerate->new($db, $self->{directory}, $table, $self->{stats});
                # Save it for future use.
                $loaderHash->{$table} = $loader;
                # Count it.
                $self->Add(tables => 1);
            }
            $loader->Start($section);
        }
        # Generate the data to put in the newly-created load files.
        $self->Generate();
    };
    # Did it work?
    if ($@) {
        # No, so emit an error message and abort all the loaders.
        $self->{stats}->AddMessage("Error loading section $section: $@");
        if (defined $self->{lastKey}) {
            $self->{stats}->AddMessage("Error occurred while processing \"$self->{lastKey}\".");
        }
        $self->Add("section-errors" => 1);
        for my $loader (values %{$self->{loaders}}) {
            $loader->Abort();
        }
    } else {
        # Yes! Finish all the loaders.
        for my $loader (values %{$self->{loaders}}) {
            $loader->Finish();
        }
        # Update the load count and the timer.
        $self->Add("section-loads" => 1);
        $self->Add(duration => (time() - $startTime));
        # Denote that the section loaded successfully.
        $retVal = 1;
    }
    # Return the success indicator.
    return $retVal;
}

=head3 DisplayStats

    my $text = $edbl->DisplayStats();

Return the statistics for this load group as displayable text.

=cut

sub DisplayStats {
    # Get the parameters.
    my ($self) = @_;
    # Return the result.
    return $self->{stats}->Show();
}

=head3 GetGroupHash

    my $groupHash = ERDBLoadGroup::GetGroupHash($erdb);

Return a hash that maps each load group in the specified database to its
constituent tables. This is useful when checking for problems with a load
or performing finishing tasks.

=over 4

=item erdb

[[ErdbPm]] database whose load information is desired.

=item RETURN

Returns a reference to a hash that maps each group name to a list of
table names.

=back

=cut

sub GetGroupHash {
    # Get the parameters.
    my ($erdb) = @_;
    # Initialize the return variable.
    my $retVal = {};
    # Loop through the list of load groups.
    for my $group ($erdb->LoadGroupList()) {
        # Stash the loader's tables in the output hash.
        $retVal->{$group} = [ GetTables($erdb, $group) ];
    }
    # Return the result.
    return $retVal;
}

=head3 GetTables

    my @tables = ERDBLoadGroup::GetTables($erdb, $group);

Return the list of tables belonging to the specified load group.

=over 4

=item erdb

[[ErdbPm]] object for the database being loaded.

=item group

Name of the relevant group.

=item RETURN

Returns a list of the tables loaded by the specified group.

=back

=cut

sub GetTables {
    # Get the parameters.
    my ($erdb, $group) = @_;
    # Create a loader for the specified group.
    my $loader = $erdb->Loader($group, undef, {});
    # Extract the list of tables.
    my @retVal = @{$loader->{tables}};
    # Return the result.
    return @retVal;
}


=head3 ComputeGroups

    my @groupList = ERDBLoadGroup::ComputeGroups($erdb, \@groups);

Compute the actual list of groups determined by the incoming group list.

=over 4

=item erdb

[[ErdbPm]] object for the database being loaded.

=item groups

Reference to a list of group names specified on the command line. A plus sign
(C<+>) stands for every group that follows the previously named group in the
standard order. If no group name precedes it, it stands for all the groups.

=item RETURN

Returns the actual list of groups to be processed by the calling command. The
first letter of each name will have been converted to upper case.

=back

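The expansion of the plus sign can be tried out with the following stand-alone
sketch, which follows the same logic as this method but uses a hard-coded group
list. The group names are hypothetical; the real list comes from the database
object's C<LoadGroupList> method.

    use strict;
    use warnings;

    # Hypothetical standard group order.
    my @allGroups = qw(Genome Feature Subsystem Reaction);
    # Map each group name to its position in the standard order.
    my %position = map { $allGroups[$_] => $_ } 0 .. $#allGroups;

    sub expand {
        my @retVal;
        # Position of the last group processed; start before the first one.
        my $lastI = -1;
        for my $group (@_) {
            if ($group eq '+') {
                # Everything after the previous group, through the end.
                push @retVal, @allGroups[$lastI + 1 .. $#allGroups];
                $lastI = $#allGroups;
            } elsif (exists $position{$group}) {
                push @retVal, $group;
                $lastI = $position{$group};
            } else {
                die "Invalid load group name $group.\n";
            }
        }
        return @retVal;
    }

    print join(" ", expand('Feature', '+')), "\n";
    print join(" ", expand('+')), "\n";
    # Prints:
    #   Feature Subsystem Reaction
    #   Genome Feature Subsystem Reaction
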
=cut

sub ComputeGroups {
    # Get the parameters.
    my ($erdb, $groups) = @_;
    # Get the complete group list in standard order.
    my @allGroups = $erdb->LoadGroupList();
    # Create a hash for validation purposes. This will map each valid group
    # name to its position in the standard order.
    my %allGroupHash;
    for (my $i = 0; $i <= $#allGroups; $i++) {
        $allGroupHash{$allGroups[$i]} = $i;
    }
    # This variable will be the index of the last-processed group in
    # the standard order. We start it before the first group in the list.
    my $lastI = -1;
    # The listed groups will be put in here.
    my @retVal;
    # Process the group list.
    for my $group (@$groups) {
        # Process this group.
        if ($group eq '+') {
            # Here we have a plus sign. Push in everything after the previous
            # group processed. Note that we'll be ending at the last position.
            # A second "+" after this one will generate no entries in the result
            # list.
            my $firstI = $lastI + 1;
            $lastI = $#allGroups;
            push @retVal, @allGroups[$firstI..$lastI];
        } elsif (exists $allGroupHash{$group}) {
            # Here we have a valid group name. Push it into the list.
            push @retVal, $group;
            # Remember its location in case there's a plus sign.
            $lastI = $allGroupHash{$group};
        } else {
            # This is an error.
            Confess("Invalid load group name $group.");
        }
    }
    # Normalize the group names and return them.
    @retVal = map { ucfirst $_ } @retVal;
    Trace("Final group list is " . join(" ", @retVal) . ".") if T(2);
    return @retVal;
}

=head3 KillFileName

    my $fileName = ERDBLoadGroup::KillFileName($erdb, $directory);

Compute the kill file name for the specified database in the specified
directory. When the [[ERDBGeneratorPl]] script sees the kill file, it will
terminate itself at the end of the current section.

=over 4

=item erdb

[[ErdbPm]] object for the relevant database. Its class name, converted to
lower case, is used as the database name in the file name.

=item directory (optional)

Load directory for the database.

=item RETURN

Returns the specified database's kill file name. If a directory is specified,
it is prefixed to the name with an intervening slash.

=back

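The way a load script might react to the kill file can be sketched as follows.
This is a stand-alone illustration, not the [[ERDBGeneratorPl]] code: the file
name C<kill_sprout.control> is a made-up placeholder for the value this method
would return, and the section names are hypothetical.

    use strict;
    use warnings;
    use File::Temp qw(tempdir);

    # Work in a scratch directory that is removed when the script exits.
    my $directory = tempdir(CLEANUP => 1);
    # Placeholder for the name computed by ERDBLoadGroup::KillFileName.
    my $killFile = "$directory/kill_sprout.control";

    for my $section (qw(A B C)) {
        print "Processing section $section.\n";
        # Simulate an operator creating the kill file during section B.
        if ($section eq 'B') {
            open(my $fh, '>', $killFile) or die "Cannot create $killFile: $!";
            close $fh;
        }
        # The check happens only after the current section is complete.
        if (-e $killFile) {
            print "Kill file found; stopping after section $section.\n";
            last;
        }
    }
    # Prints:
    #   Processing section A.
    #   Processing section B.
    #   Kill file found; stopping after section B.
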
=cut

sub KillFileName {
    # Get the parameters.
    my ($erdb, $directory) = @_;
    # Compute the kill file name. We start with the database name in
    # lower case, then prefix it with "kill_".
    my $dbName = lc ref $erdb;
    my $retVal = ERDBGenerate::CreateFileName("kill_$dbName", undef, 'control', $directory);
    # Return the result.
    return $retVal;
}


=head2 Virtual Methods

=head3 Generate

    $edbl->Generate();

Generate the data for this load group with respect to the current
section. This method must be overridden by the subclass and should call
the L</Put> method to put data into the tables.

=cut

sub Generate {
    Confess("Pure virtual method Generate called.");
}

1;
