Annotation of /Sprout/ERDBLoadGroup.pm

Revision 1.9

1 : parrello 1.1 #!/usr/bin/perl -w
2 :    
3 :     #
4 :     # Copyright (c) 2003-2006 University of Chicago and Fellowship
5 :     # for Interpretations of Genomes. All Rights Reserved.
6 :     #
7 :     # This file is part of the SEED Toolkit.
8 :     #
9 :     # The SEED Toolkit is free software. You can redistribute
10 :     # it and/or modify it under the terms of the SEED Toolkit
11 :     # Public License.
12 :     #
13 :     # You should have received a copy of the SEED Toolkit Public License
14 :     # along with this program; if not write to the University of Chicago
15 :     # at info@ci.uchicago.edu or the Fellowship for Interpretation of
16 :     # Genomes at veronika@thefig.info or download a copy from
17 :     # http://www.theseed.org/LICENSE.TXT.
18 :     #
19 :    
20 :     package ERDBLoadGroup;
21 :    
22 :     use strict;
23 :     use Tracer;
24 :     use ERDB;
25 :     use Stats;
26 :     use Time::HiRes qw(time);
27 :     use ERDBGenerate;
28 :    
29 :     =head1 ERDB Database Load Group Object
30 :    
31 :     The process of loading an ERDB database can be a simple matter of creating some
32 :     sequential files from other sequential files, or it can be a complex web of
33 :     connected sub-processes involving multiple groups of tables being loaded in
34 :     parallel by multiple worker processes. The ERDB Database Load Group object
35 :     provides housekeeping functions to simplify the management of the more complex
36 :     load tasks.
37 :    
38 :     When discussing an ERDB database load, there are two similar concepts we use to
39 :     break the load into pieces: I<sections> and I<groups>. A I<section> is a
40 :     partition of the data that can be processed in isolation from other sections. A
41 :     I<group> is a set of tables that should be loaded at the same time. An ERDB load
42 :     group is a request to generate load files for one or more sections of the data
43 :     targeting a single group of tables.
44 :    
45 :     A certain amount of bookkeeping is required in order to handle parallelism. For
46 :     each table, a separate output file is generated for each section. If a section
47 :     does not complete successfully, then its load file is deleted and the section
48 :     must be loaded again. Because each section has its own load file, only the
49 :     particular sections that fail need to be reloaded.
50 :    
51 :     Individual load groups should subclass this object, providing a virtual override
52 :     for the L</Generate> method.
53 :    
54 :     The subclass name should consist of the group name followed by noise in capital
55 :     case. So, for example, the subclass name for a group named C<Feature> would be
56 :     C<FeatureSproutLoader> or C<FeatureAttributeLoader> or something similar. The
57 :     group name should only be letters, and only the first letter should be capitalized.
58 :     This allows the load script to be case-insensitive with regard to incoming group
59 :     names.
60 :    
61 :     Any working or status files generated by a subclass should have a prefix of C<dt>-something.
62 : parrello 1.8 This will ensure they are deleted by the C<clear> option of L<ERDBGenerator.pl>.
63 : parrello 1.1
64 :     The fields in this object are as follows.
65 :    
66 :     =over 4
67 :    
68 :     =item db
69 :    
70 : parrello 1.8 L<ERDB> object for accessing the target database
71 : parrello 1.1
72 :     =item directory
73 :    
74 :     Directory into which the load files should be placed.
75 :    
76 :     =item group
77 :    
78 :     name of this load group
79 :    
80 : parrello 1.5 =item label
81 :    
82 :     name of this worker process
83 :    
84 : parrello 1.1 =item lastKey
85 :    
86 :     ID of the last major object processed
87 :    
88 :     =item loaders
89 :    
90 : parrello 1.8 hash mapping the names of the group's tables to L<ERDBGenerate> objects
91 : parrello 1.1
92 :     =item stats
93 :    
94 :     statistics object that can be used to track the progress of the load
95 :    
96 :     =item section
97 :    
98 :     name of this data section
99 :    
100 :     =item source
101 :    
102 :     object used to access the data from which the load files are to be generated
103 :    
104 :     =item tables
105 :    
106 :     reference to a list of the names of the tables in this group
107 :    
108 :     =item options
109 :    
110 :     hash containing the options originally passed in to the constructor
111 :    
112 :     =back
113 :    
114 :     =cut
115 :    
116 :     =head3 new
117 :    
118 : parrello 1.5 my $edbl = ERDBLoadGroup->new($db, $options, @tables);
119 : parrello 1.1
120 :     Construct a new ERDBLoadGroup object. The following parameters are expected:
121 :    
122 :     =over 4
123 :    
124 :     =item db
125 :    
126 : parrello 1.8 The L<ERDB> object for the database being loaded.
127 : parrello 1.1
128 :     =item options
129 :    
130 :     Reference to a hash of options. At the current time, no options are needed
131 :     by this object, but they may be important to subclass objects.
132 :    
133 :     =item tables
134 :    
135 :     A list of the names for the tables in this load group.
136 :    
137 :     =back
138 :    
139 :     =cut
140 :    
141 :     sub new {
142 :     # Get the parameters.
143 : parrello 1.5 my ($class, $db, $options, @tables) = @_;
144 : parrello 1.1 # Create a statistics object
145 :     my $stats = Stats->new();
146 :     # Compute the group name from the class name. It is the first word in
147 :     # a name that is presumably capital case.
148 :     my $group = ($class =~ /^([A-Z][a-z]+)/ ? $1 : $class);
149 : parrello 1.3 # Get the directory.
150 :     my $directory = $db->LoadDirectory();
151 : parrello 1.1 Confess("Load directory \"$directory\" not found or invalid.") if ! -d $directory;
152 :     # Create the ERDBLoadGroup object. Note that so far we don't have any loaders
153 :     # defined and the section has not yet been assigned. The "ProcessSection"
154 :     # method is used to assign the section, and the loaders are created the first
155 :     # time it's called.
156 :     my $retVal = {
157 :     db => $db,
158 :     directory => $directory,
159 :     group => $group,
160 :     stats => $stats,
161 : parrello 1.5 source => undef,
162 :     label => ($options->{label} || $$),
163 : parrello 1.1 lastKey => undef,
164 :     loaders => {},
165 :     tables => \@tables,
166 :     section => undef,
167 :     options => $options
168 :     };
169 :     # Bless and return it.
170 :     bless $retVal, $class;
171 :     return $retVal;
172 :     }
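The group-name rule used by the constructor can be tried in isolation. The following is a minimal sketch, not part of the module: C<group_from_class> and the sample class names are hypothetical, and only the regular expression is taken from the code above. It also shows why the group name should have only its first letter capitalized: a class name that starts with several capitals does not match and is used whole.

```perl
use strict;
use warnings;

# Derive a load group name from a loader class name: the leading
# capitalized word (one upper-case letter followed by lower-case letters).
# If the class name does not start that way, fall back to the whole name.
sub group_from_class {
    my ($class) = @_;
    return ($class =~ /^([A-Z][a-z]+)/ ? $1 : $class);
}

# Hypothetical class names, used only for illustration.
for my $class (qw(FeatureSproutLoader GenomeAttributeLoader DNALoader)) {
    print "$class => ", group_from_class($class), "\n";
}
```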
173 :    
174 : parrello 1.7 =head3 TRAILER
175 :    
176 :     This is a string constant that always compares high against real data.
177 :    
178 :     =cut
179 :    
180 :     use constant TRAILER => "\xFF";
181 :    
182 : parrello 1.1 =head2 Subclass Methods
183 :    
184 :     =head3 Put
185 :    
186 :     $edbl->Put($table, %fields);
187 :    
188 :     Place a table record in a load file. This method is the workhorse of the
189 :     file generation phase of a load.
190 :    
191 :     =over 4
192 :    
193 :     =item table
194 :    
195 :     Name of the table being loaded.
196 :    
197 :     =item fields
198 :    
199 :     Hash of field names to field values for the fields in the table.
200 :    
201 :     =back
202 :    
203 :     =cut
204 :    
205 :     sub Put {
206 :     # Get the parameters.
207 :     my ($self, $table, %fields) = @_;
208 :     # Get the loader for this table.
209 :     my $loader = $self->{loaders}->{$table};
210 :     # Complain if it doesn't exist.
211 :     Confess("Table $table not found in load group $self->{group}.") if ! defined $loader;
212 :     # Put this record to the loader's output file.
213 :     my $bytes = $loader->Put(%fields);
214 :     # Count the record and the bytes of data. If no bytes were output, the record
215 :     # was discarded.
216 :     if (! $bytes) {
217 :     $self->Add("$table-discards" => 1);
218 :     } else {
219 :     $self->Add("$table-records" => 1);
220 :     $self->Add("$table-bytes" => $bytes);
221 :     }
222 :     }
223 :    
224 : parrello 1.4 =head3 PutE
225 :    
226 :     $edbl->PutE($table => $id, %fields);
227 :    
228 :     Place an entity-based table record in a load file. The first field
229 :     specified after the table name is the ID.
230 :    
231 :     =over 4
232 :    
233 :     =item table
234 :    
235 :     Name of the relevant table.
236 :    
237 :     =item id
238 :    
239 :     ID of the relevant entity.
240 :    
241 :     =item fields
242 :    
243 :     Hash mapping field names to values.
244 :    
245 :     =back
246 :    
247 :     =cut
248 :    
249 :     sub PutE {
250 :     # Get the parameters.
251 :     my ($self, $table, $id, %fields) = @_;
252 :     # Put the record.
253 :     $self->Put($table, id => $id, %fields);
254 :     # Record that we've done a putE.
255 :     $self->Add(putE => 1);
256 :     }
257 :    
258 :     =head3 PutR
259 :    
260 :     $edbl->PutR($table => $from, $to, %fields);
261 :    
262 :     Place a relationship record in a load file. The first two fields
263 :     specified after the table name are the from-link and the to-link,
264 :     respectively.
265 :    
266 :     =over 4
267 :    
268 :     =item table
269 :    
270 :     Name of the relevant relationship.
271 :    
272 :     =item from
273 :    
274 :     ID of the from-entity.
275 :    
276 :     =item to
277 :    
278 :     ID of the to-entity.
279 :    
280 :     =item fields
281 :    
282 :     Hash mapping field names to field values.
283 :    
284 :     =back
285 :    
286 :     =cut
287 :    
288 :     sub PutR {
289 :     # Get the parameters.
290 :     my ($self, $table, $from, $to, %fields) = @_;
291 :     # Put the record.
292 :     $self->Put($table, 'from-link' => $from, 'to-link' => $to, %fields);
293 :     # Record that we've done a PutR.
294 :     $self->Add(putR => 1);
295 :     }
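How PutE and PutR fold their positional arguments into the field hash can be sketched without a database. In this illustration C<put>, C<put_e>, C<put_r> and C<describe> are stand-ins written for the example, and the table and field names are hypothetical; only the C<id>, C<from-link> and C<to-link> field names come from the code above.

```perl
use strict;
use warnings;

# Stand-in for Put: return the table name and the field hash it was given.
sub put {
    my ($table, %fields) = @_;
    return ($table, \%fields);
}

# Entity form: the ID becomes the "id" field.
sub put_e {
    my ($table, $id, %fields) = @_;
    return put($table, id => $id, %fields);
}

# Relationship form: the two IDs become "from-link" and "to-link".
sub put_r {
    my ($table, $from, $to, %fields) = @_;
    return put($table, 'from-link' => $from, 'to-link' => $to, %fields);
}

# Format a field hash as "name=value" pairs in name order.
sub describe {
    my ($fields) = @_;
    return join(", ", map { $_ . "=" . $fields->{$_} } sort keys %$fields);
}

# Hypothetical table and field names.
my (undef, $entity) = put_e(Genome => 'g1', name => 'Example genome');
my (undef, $link) = put_r(HasFeature => 'g1', 'f1', position => 3);
print describe($entity), "\n";
print describe($link), "\n";
```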
296 :    
297 :    
298 : parrello 1.1 =head3 Add
299 :    
300 :     $edbl->Add($statName => $count);
301 :    
302 :     Add the specified count to the named statistical counter. The statistical
303 :     counts are kept in an internal statistics object whose contents are
304 :     displayed when the group is finished.
305 :    
306 :     =over 4
307 :    
308 :     =item statName
309 :    
310 :     Name of the statistic to increment.
311 :    
312 :     =item count
313 :    
314 :     Value by which to increment it.
315 :    
316 :     =back
317 :    
318 :     =cut
319 :    
320 :     sub Add {
321 :     # Get the parameters.
322 :     my ($self, $statName, $count) = @_;
323 :     # Update the statistic.
324 :     $self->{stats}->Add($statName => $count);
325 :     }
326 :    
327 : parrello 1.4 =head3 AddWarning
328 :    
329 :     $edbl->AddWarning($errorType => $message);
330 :    
331 :     Record a warning. Warnings indicate possible errors in the incoming data.
332 :     The first warning of a given type adds the type name as a message to the load
333 :     statistics. Later warnings of the same type are only traced, at level 3.
334 :    
335 :     =over 4
336 :    
337 :     =item errorType
338 :    
339 :     Type of error indicated by the warning. This is used as the label when the
340 :     warning is counted in the statistics object.
341 :    
342 :     =item message
343 :    
344 :     Message describing the reason for the warning.
345 :    
346 :     =back
347 :    
348 :     =cut
349 :    
350 :     sub AddWarning {
351 :     # Get the parameters.
352 :     my ($self, $errorType, $message) = @_;
353 :     # Count the warning.
354 :     my $count = $self->Add($errorType => 1);
355 :     # Is this the first one of this type?
356 :     if ($count == 1) {
357 :     # Yes, add it to the messages for the end.
358 :     $self->{stats}->AddMessage($errorType);
359 :     } else {
360 :     # No, just trace it.
361 :     Trace("Data warning: $message") if T(3);
362 :     }
363 :     }
364 :    
365 : parrello 1.1 =head3 Track
366 :    
367 :     $edbl->Track($statName => $key, $period);
368 :    
369 :     Save the specified key as the one currently in progress. If an error
370 :     occurs, the key value will appear in the output log. The named statistic
371 :     will also be incremented, and if the count is an exact multiple of the stated
372 :     period, a trace message will be output at level 3.
373 :    
374 :     Most load groups have a primary object type that drives the main loop. When
375 :     something goes wrong, we want to know the ID of the offending object. When
376 :     things go right, we want to know how far we've progressed toward completion.
377 :     This method can be used to record each occurrence of a primary object, and
378 :     provide a log of the progress or our current position in times of stress.
379 :    
380 :     =over 4
381 :    
382 :     =item statName
383 :    
384 :     Name of the statistic to be incremented. This should be a plural noun
385 : parrello 1.5 describing the object whose key is coming in.
386 : parrello 1.1
387 :     =item key
388 :    
389 :     Key value to be displayed if something goes wrong.
390 :    
391 :     =item period (optional)
392 :    
393 :     If specified, should be the number of objects to be counted between each
394 :     level-3 trace message.
395 :    
396 :     =back
397 :    
398 :     =cut
399 :    
400 :     sub Track {
401 :     # Get the parameters.
402 :     my ($self, $statName, $key, $period) = @_;
403 :     # Save the key.
404 :     $self->{lastKey} = $key;
405 :     # Count it.
406 :     my $newValue = $self->{stats}->Add($statName => 1);
407 :     # Do we need to output a progress message?
408 :     if ($period && T(3) && ($newValue % $period == 0)) {
409 :     # Yes.
410 : parrello 1.9 Trace("$newValue $statName processed by $self->{label} for $self->{group} group.");
411 : parrello 1.1 }
412 :     }
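The periodic progress message in Track comes down to a counter and a modulus test. The sketch below uses a plain hash in place of the statistics object and returns the message instead of tracing it; C<track>, the statistic name and the keys are hypothetical.

```perl
use strict;
use warnings;

# Stand-in for the statistics object: a plain hash of counters.
my %counters;
my $lastKey;

# Count one object, remember its key, and return a progress message
# whenever the running count is an exact multiple of the period.
sub track {
    my ($statName, $key, $period) = @_;
    $lastKey = $key;
    my $newValue = ++$counters{$statName};
    if ($period && $newValue % $period == 0) {
        return "$newValue $statName processed.";
    }
    return;
}

my @messages;
for my $id (1 .. 250) {
    my $message = track(genomes => "genome$id", 100);
    push @messages, $message if defined $message;
}
print "$_\n" for @messages;
print "Last key seen: $lastKey\n";
```

Because the key is saved before anything else happens, it is still available to an error handler if processing that object fails.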
413 :    
414 :     =head3 section
415 :    
416 :     my $sectionID = $edbl->section();
417 :    
418 :     Return the ID of the current section.
419 :    
420 :     =cut
421 :    
422 :     sub section {
423 :     # Get the parameters.
424 :     my ($self) = @_;
425 :     # Return the result.
426 :     return $self->{section};
427 :     }
428 :    
429 :     =head3 source
430 :    
431 :     my $sourceObject = $edbl->source();
432 :    
433 :     Return the source object used to get the data needed for creating
434 :     the load files.
435 :    
436 :     =cut
437 :    
438 :     sub source {
439 :     # Get the parameters.
440 :     my ($self) = @_;
441 : parrello 1.3 # If we do not have a source object, retrieve it.
442 :     if (! defined $self->{source}) {
443 :     $self->{source} = $self->{db}->GetSourceObject();
444 :     }
445 : parrello 1.1 # Return the result.
446 :     return $self->{source};
447 :     }
448 :    
449 :     =head3 db
450 :    
451 :     my $erdbObject = $edbl->db();
452 :    
453 :     Return the database object for the target database.
454 :    
455 :     =cut
456 :    
457 :     sub db {
458 :     # Get the parameters.
459 :     my ($self) = @_;
460 :     # Return the result.
461 :     return $self->{db};
462 :     }
463 :    
464 : parrello 1.7 =head3 FilterRelationship
465 :    
466 :     my $stats = $edbl->FilterRelationship($type => $relationshipName);
467 :    
468 :     This method will compare a relationship's load file to a target entity
469 :     file and remove rows for which no target entity exists. This is useful
470 :     when a relationship and entity are created by different load groups, so
471 :     there is no opportunity in the generator to verify that the relationship
472 :     records are relevant to this database. Typically, this method is called
473 : parrello 1.8 during post-processing, between generation by L<ERDBGenerator.pl> and the
474 : parrello 1.7 actual database table loads.
475 :    
476 :     =over 4
477 :    
478 :     =item type
479 :    
480 :     Relevant relationship direction-- C<from> or C<to>.
481 :    
482 :     =item relationshipName
483 :    
484 :     Name of the relationship whose load file is to be filtered.
485 :    
486 :     =item RETURN
487 :    
488 :     Returns a L<Stats> object counting the relationship records kept and rejected.
489 :    
490 :     =back
491 :    
492 :     =cut
493 :    
494 :     sub FilterRelationship {
495 :     # Get the parameters.
496 :     my ($self, $type, $relationshipName) = @_;
497 :     # Declare the return variable.
498 :     my $retVal = Stats->new();
499 :     # Get the database object.
500 :     my $erdb = $self->db();
501 :     # Get the relationship's descriptor. We need this to find the relevant entity.
502 :     my $relData = $erdb->FindRelationship($relationshipName);
503 :     if (! defined $relData) {
504 :     Confess("Relationship $relationshipName not found in this database.");
505 :     } else {
506 :     # We have the relationship, so get the name of the target entity.
507 :     my $entityName = $relData->{$type};
508 :     # We need to find where the entity's ID will be in the relationship's
509 :     # load file. FROM is always first, then TO.
510 :     my $fieldPos = ($type eq 'from' ? 1 : 2);
511 :     Trace("Filtering relationship $relationshipName against $entityName using field $type($fieldPos).") if T(3);
512 :     # We will be reading from the entity and relationship load files in
513 :     # parallel, with both sorted by the entity ID. The output will be
514 :     # sort-piped to a temporary file.
515 :     my $relationshipFileName = ERDBGenerate::CreateFileName($relationshipName,
516 :     undef, 'data');
517 :     my $relationshipTempName = ERDBGenerate::CreateFileName($relationshipName,
518 :     undef, 'temp');
519 :     my $entityFileName = ERDBGenerate::CreateFileName($entityName,
520 :     undef, 'data');
521 :     # Get the desired sort for the relationship file. We use this for
522 :     # the relationship output.
523 :     my $sortOut = $erdb->SortNeeded($relationshipName);
524 :     # Now we can open our files.
525 :     my $rih = Open(undef, "sort -k$fieldPos,$fieldPos <$relationshipFileName |");
526 :     my $eih = Open(undef, "sort -k1,1 <$entityFileName |");
527 :     my $roh = Open(undef, "| $sortOut >$relationshipTempName");
528 :     # Convert the field position from 1-based (for the sort) to 0-based (for PERL).
529 :     $fieldPos--;
530 :     # Get the first record in each file.
531 :     my ($rKey, $relRecord) = GetRecord($rih, $fieldPos);
532 :     my ($eKey) = GetRecord($eih, 0);
533 :     # Loop until we run out of records in the relationship file.
534 :     while ($rKey lt TRAILER) {
535 :     # Roll the entity file forward until we find the spot for
536 :     # this relationship.
537 :     while ($rKey gt $eKey) {
538 :     ($eKey) = GetRecord($eih, 0);
539 :     }
540 :     # If we have a match, we output the relationship record.
541 :     # At this point eKey could be TRAILER, but rKey cannot, because
542 :     # it hasn't changed since the while condition was evaluated.
543 :     if ($eKey eq $rKey) {
544 :     Tracer::PutLine($roh, $relRecord);
545 :     $retVal->Add("kept-$relationshipName" => 1);
546 :     } else {
547 :     $retVal->Add("rejected-$relationshipName" => 1);
548 :     }
549 :     # Get the next relationship record.
550 :     ($rKey, $relRecord) = GetRecord($rih, $fieldPos);
551 :     }
552 :     # Now we close everything and move the temp file over the top of the
553 :     # real relationship file.
554 :     Trace("Closing files.") if T(3);
555 :     close $rih;
556 :     close $eih;
557 :     close $roh;
558 :     Trace("Renaming filtered relationship file for $relationshipName.") if T(3);
559 :     unlink $relationshipFileName;
560 :     rename $relationshipTempName, $relationshipFileName;
561 :     }
562 :     # Return the result.
563 :     return $retVal;
564 :     }
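The filtering loop in FilterRelationship is a merge of two sorted streams, with TRAILER as an end-of-data sentinel. The sketch below runs the same loop over in-memory lists rather than sorted pipes. C<next_record> and C<filter_sorted> are stand-ins written for this example, and the data is hypothetical. It assumes keys are plain ASCII strings, so every real key compares lower than C<"\xFF">.

```perl
use strict;
use warnings;
use constant TRAILER => "\xFF";

# Return the next (key, record) pair from a list of records, or TRAILER
# and an empty record once the list is exhausted.
sub next_record {
    my ($queue, $fieldPos) = @_;
    my $record = shift @$queue;
    return (TRAILER, []) if ! defined $record;
    return ($record->[$fieldPos], $record);
}

# Keep only the relationship records whose key field matches an entity ID.
# Both lists must already be sorted by that key.
sub filter_sorted {
    my ($relRecords, $entityRecords, $fieldPos) = @_;
    my @rel = @$relRecords;
    my @ent = @$entityRecords;
    my @kept;
    my $rejected = 0;
    my ($rKey, $relRecord) = next_record(\@rel, $fieldPos);
    my ($eKey) = next_record(\@ent, 0);
    while ($rKey lt TRAILER) {
        # Advance the entity side until it catches up with this key.
        while ($rKey gt $eKey) {
            ($eKey) = next_record(\@ent, 0);
        }
        if ($eKey eq $rKey) {
            push @kept, $relRecord;
        } else {
            $rejected++;
        }
        ($rKey, $relRecord) = next_record(\@rel, $fieldPos);
    }
    return (\@kept, $rejected);
}

# Hypothetical data: entity rows keyed by ID, relationship rows as
# [from-link, to-link], filtered on the to-link (0-based field 1).
my @entities = (['g1'], ['g3']);
my @relationships = (['s1', 'g1'], ['s2', 'g2'], ['s3', 'g3'], ['s4', 'g4']);
my ($kept, $rejected) = filter_sorted(\@relationships, \@entities, 1);
print scalar(@$kept), " kept, $rejected rejected\n";
```

Each list is read once from front to back, which is why both inputs are sorted on the join key first.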
565 :    
566 :     =head3 GetTables
567 :    
568 :     my @tables = ERDBLoadGroup::GetTables($erdb, $group);
569 :    
570 :     or
571 :    
572 :     my @tables = $edbl->GetTables();
573 :    
574 :     Return the list of tables belonging to the specified load group.
575 :    
576 :     =over 4
577 :    
578 :     =item erdb
579 :    
580 : parrello 1.8 L<ERDB> subclass object for the relevant database.
581 : parrello 1.7
582 :     =item group
583 :    
584 :     Name of the relevant group.
585 :    
586 :     =item RETURN
587 :    
588 :     Returns a list of the tables loaded by the specified group.
589 :    
590 :     =back
591 :    
592 :     =cut
593 :    
594 :     sub GetTables {
595 :     # Get the parameters.
596 :     my ($self, $group) = @_;
597 :     # We need a loader. If the caller gave us an ERDB object instead, we need to
598 :     # convert it.
599 :     if (! $self->isa(__PACKAGE__)) {
600 :     $self = $self->Loader($group, undef, {});
601 :     }
602 :     # Extract the list of tables.
603 :     my @retVal = @{$self->{tables}};
604 :     # Return the result.
605 :     return @retVal;
606 :     }
607 :    
608 : parrello 1.1 =head2 Internal Methods
609 :    
610 :     =head3 ProcessSection
611 :    
612 :     my $flag = $edbl->ProcessSection($section);
613 :    
614 :     Generate the load file for a particular data section. This method calls
615 :     the virtual method L</Generate> to actually put the data into the load
616 :     files, and is responsible for assigning the section and finalizing the
617 :     load files if the load is successful.
618 :    
619 :     =over 4
620 :    
621 :     =item section
622 :    
623 :     ID of the section to load.
624 :    
625 :     =item RETURN
626 :    
627 :     Returns TRUE if successful, FALSE if an error prevented loading the section.
628 :    
629 :     =back
630 :    
631 :     =cut
632 :    
633 :     sub ProcessSection {
634 :     # Get the parameters.
635 :     my ($self, $section) = @_;
636 :     # Declare the return variable. We'll set it to 1 if we succeed.
637 :     # Save the section ID.
638 :     $self->{section} = $section;
639 :     # Get the database object.
640 :     my $db = $self->db();
641 : parrello 1.5 # Get the list of tables for this group.
642 :     my @tables = @{$self->{tables}};
643 :     # Should we skip this section?
644 :     if ($self->SkipIndicated($section, \@tables)) {
645 :     Trace("Resume mode: section $section skipped for group $self->{group}.") if T(3);
646 :     $self->Add("section-skips" => 1);
647 :     } else {
648 :     # Not skipping. Start a timer and protect ourselves from errors.
649 :     my $startTime = time();
650 :     eval {
651 :     # Get the loader hash.
652 :     my $loaderHash = $self->{loaders};
653 :     # Initialize the loaders for the necessary tables.
654 :     for my $table (@tables) {
655 :     # Get this table's loader.
656 :     my $loader = $loaderHash->{$table};
657 :     # If it doesn't exist yet, create it.
658 :     if (! defined $loader) {
659 :     $loader = ERDBGenerate->new($db, $self->{directory}, $table, $self->{stats});
660 :     # Save it for future use.
661 :     $loaderHash->{$table} = $loader;
662 :     # Count it.
663 :     $self->Add(tables => 1);
664 :     }
665 :     $loader->Start($section);
666 :     }
667 :     # Generate the data to put in the newly-created load files.
668 :     $self->Generate();
669 : parrello 1.6 # Release our hold on the source object. This allows the database object to
670 :     # decide whether or not we need a new one.
671 :     delete $self->{source};
672 :     # Clean up the database object.
673 :     $db->Cleanup();
674 : parrello 1.5 };
675 :     # Did it work?
676 :     if ($@) {
677 : parrello 1.7 # No, so we need to emit an error message and abort all the loaders.
678 :     # First, we need to clean the new-line from the message (if any).
679 :     my $msg = $@;
680 :     chomp $msg;
681 :     # Figure out what we were doing at the time of the error.
682 :     my $place = "Error in section $section";
683 : parrello 1.5 if (defined $self->{lastKey}) {
684 : parrello 1.7 $place .= "($self->{lastKey})";
685 : parrello 1.5 }
686 : parrello 1.7 # Format the message and denote we have a section failure.
687 :     $self->{stats}->AddMessage("$place: $msg");
688 : parrello 1.5 $self->Add("section-errors" => 1);
689 : parrello 1.7 # Abort the loaders.
690 : parrello 1.5 for my $loader (values %{$self->{loaders}}) {
691 :     $loader->Abort();
692 :     }
693 :     } else {
694 : parrello 1.7 # It did work! Finish all the loaders.
695 : parrello 1.5 for my $loader (values %{$self->{loaders}}) {
696 :     $loader->Finish();
697 : parrello 1.1 }
698 : parrello 1.5 # Update the load count.
699 :     $self->Add("section-loads" => 1);
700 : parrello 1.1 }
701 : parrello 1.5 # Update the timer.
702 : parrello 1.1 $self->Add(duration => (time() - $startTime));
703 :     }
704 :     }
705 :    
706 :     =head3 DisplayStats
707 :    
708 :     my $text = $edbl->DisplayStats();
709 :    
710 :     Display the statistics for this load group.
711 :    
712 :     =cut
713 :    
714 :     sub DisplayStats {
715 :     # Get the parameters.
716 :     my ($self) = @_;
717 :     # Return the result.
718 :     return $self->{stats}->Show();
719 :     }
720 :    
721 : parrello 1.7 =head3 AccumulateStats
722 :    
723 :     $edbl->AccumulateStats($stats);
724 :    
725 :     Add this load's statistics into the caller-specified statistics object.
726 :    
727 :     =over 4
728 :    
729 :     =item stats
730 :    
731 : parrello 1.8 L<Stats> object into which this load's statistics will be accumulated.
732 : parrello 1.7
733 :     =back
734 :    
735 :     =cut
736 :    
737 :     sub AccumulateStats {
738 :     # Get the parameters.
739 :     my ($self, $stats) = @_;
740 :     # Roll up our statistics in the caller's object.
741 :     $stats->Accumulate($self->{stats});
742 :     }
743 :    
744 :    
745 : parrello 1.1 =head3 GetGroupHash
746 :    
747 :     my $groupHash = ERDBLoadGroup::GetGroupHash($erdb);
748 :    
749 :     Return a hash that maps each load group in the specified database to its
750 :     constituent tables. This is useful when checking for problems with a load
751 :     or performing finishing tasks.
752 :    
753 :     =over 4
754 :    
755 :     =item erdb
756 :    
757 : parrello 1.8 L<ERDB> database whose load information is desired.
758 : parrello 1.1
759 :     =item RETURN
760 :    
761 :     Returns a reference to a hash that maps each group name to a list of
762 :     table names.
763 :    
764 :     =back
765 :    
766 :     =cut
767 :    
768 :     sub GetGroupHash {
769 :     # Get the parameters.
770 :     my ($erdb) = @_;
771 :     # Initialize the return variable.
772 :     my $retVal = {};
773 :     # Loop through the list of load groups.
774 :     for my $group ($erdb->LoadGroupList()) {
775 :     # Stash the loader's tables in the output hash.
776 : parrello 1.4 $retVal->{$group} = [ GetTables($erdb, $group) ];
777 : parrello 1.1 }
778 :     # Return the result.
779 :     return $retVal;
780 :     }
781 :    
782 :     =head3 ComputeGroups
783 :    
784 : parrello 1.2 my @groupList = ERDBLoadGroup::ComputeGroups($erdb, \@groups);
785 : parrello 1.1
786 : parrello 1.2 Compute the actual list of groups determined by the incoming group list.
787 : parrello 1.1
788 :     =over 4
789 :    
790 :     =item erdb
791 :    
792 : parrello 1.8 L<ERDB> object for the database being loaded.
793 : parrello 1.1
794 :     =item groups
795 :    
796 : parrello 1.2 Reference to a list of group names specified on the command line. A plus sign
797 :     (C<+>) has special meaning.
798 : parrello 1.1
799 :     =item RETURN
800 :    
801 :     Returns the actual list of groups to be processed by the calling command. The
802 :     names will have been normalized to capital case.
803 :    
804 :     =back
805 :    
806 :     =cut
807 :    
808 :     sub ComputeGroups {
809 :     # Get the parameters.
810 : parrello 1.2 my ($erdb, $groups) = @_;
811 :     # Get the complete group list in standard order.
812 :     my @allGroups = $erdb->LoadGroupList();
813 :     # Create a hash for validation purposes. This will map each valid group
814 :     # name to its position in the standard order.
815 :     my %allGroupHash;
816 :     for (my $i = 0; $i <= $#allGroups; $i++) {
817 :     $allGroupHash{$allGroups[$i]} = $i;
818 :     }
819 :     # This variable will be the index of the last-processed group in
820 :     # the standard order. We start it before the first group in the list.
821 :     my $lastI = -1;
822 :     # The listed groups will be put in here.
823 : parrello 1.1 my @retVal;
824 : parrello 1.2 # Process the group list.
825 :     for my $group (@$groups) {
826 :     # Process this group.
827 :     if ($group eq '+') {
828 :     # Here we have a plus sign. Push in everything after the previous
829 :     # group processed. Note that we'll be ending at the last position.
830 :     # A second "+" after this one will generate no entries in the result
831 :     # list.
832 :     my $firstI = $lastI + 1;
833 :     $lastI = $#allGroups;
834 :     push @retVal, @allGroups[$firstI..$lastI];
835 :     } elsif (exists $allGroupHash{$group}) {
836 :     # Here we have a valid group name. Push it into the list.
837 :     push @retVal, $group;
838 :     # Remember its location in case there's a plus sign.
839 :     $lastI = $allGroupHash{$group};
840 :     } else {
841 :     # This is an error.
842 :     Confess("Invalid load group name $group.");
843 : parrello 1.1 }
844 :     }
845 :     # Normalize the group names and return them.
846 : parrello 1.4 @retVal = map { ucfirst $_ } @retVal;
847 :     Trace("Final group list is " . join(" ", @retVal) . ".") if T(2);
848 :     return @retVal;
849 : parrello 1.1 }
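The plus-sign expansion can be exercised without an ERDB object. In the sketch below, C<expand_groups> is a stand-in that takes the full group list as an array reference instead of calling C<LoadGroupList>, and it dies on a bad name rather than calling C<Confess>. The group names and their order are hypothetical.

```perl
use strict;
use warnings;

# Expand a requested group list against the full list in standard order.
# A "+" stands for every group after the most recently named one.
sub expand_groups {
    my ($allGroups, $requested) = @_;
    # Map each valid group name to its position in the standard order.
    my %position;
    @position{@$allGroups} = (0 .. $#$allGroups);
    my $lastI = -1;
    my @retVal;
    for my $group (@$requested) {
        if ($group eq '+') {
            # Everything after the previous group, through the end.
            my $firstI = $lastI + 1;
            $lastI = $#$allGroups;
            push @retVal, @{$allGroups}[$firstI .. $lastI];
        } elsif (exists $position{$group}) {
            push @retVal, $group;
            $lastI = $position{$group};
        } else {
            die "Invalid load group name $group.\n";
        }
    }
    return @retVal;
}

# Hypothetical group names in a hypothetical standard order.
my @all = qw(Genome Feature Subsystem Reaction);
my @result = expand_groups(\@all, ['Feature', '+']);
print join(" ", @result), "\n";
```

A lone C<+> selects every group, and a second C<+> directly after the first adds nothing because the position is already at the end of the list.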
850 :    
851 : parrello 1.2 =head3 KillFileName
852 :    
853 :     my $fileName = ERDBLoadGroup::KillFileName($erdb, $directory);
854 :    
855 :     Compute the kill file name for the specified database in the specified
856 : parrello 1.8 directory. When the L<ERDBGenerator.pl> script sees the kill file, it will
857 : parrello 1.2 terminate itself at the end of the current section.
858 :    
859 :     =over 4
860 :    
861 :     =item erdb
862 :    
863 :     L<ERDB> object for the relevant database.
864 :    
865 :     =item directory (optional)
866 :    
867 :     Load directory for the database.
868 :    
869 :     =item RETURN
870 :    
871 :     Returns the specified database's kill file name. If a directory is specified,
872 :     it is prefixed to the name with an intervening slash.
873 :    
874 :    
875 :     =back
876 :    
877 :     =cut
878 :    
879 :     sub KillFileName {
880 :     # Get the parameters.
881 :     my ($erdb, $directory) = @_;
882 :     # Compute the kill file name. We start with the database name in
883 :     # lower case, then prefix it with "kill_";
884 :     my $dbName = lc ref $erdb;
885 :     my $retVal = ERDBGenerate::CreateFileName("kill_$dbName", undef, 'control', $directory);
886 :     # Return the result.
887 :     return $retVal;
888 :     }
889 :    
890 : parrello 1.5 =head3 SkipIndicated
891 :    
892 :     my $flag = $edbl->SkipIndicated($section, \@tables);
893 :    
894 :     Return FALSE if the current group should be run for the current section.
895 :     If the C<resume> option is not set, this method always returns FALSE;
896 :     otherwise, it will look at the files currently in the load directory and
897 :     if enough of them are present, it will return TRUE, indicating there's
898 :     no point in generating data for the indicated tables with respect to the
899 :     current section. In other words, it will return TRUE if, for every table,
900 :     there is either a load file for that table or a load file for the
901 :     specified section of that table.
902 :    
903 :     =over 4
904 :    
905 :     =item section
906 :    
907 :     ID of the relevant section.
908 :    
909 :     =item tables
910 :    
911 :     List of tables to check.
912 :    
913 :     =item RETURN
914 :    
915 :     Returns TRUE if load files are already generated for the specified section, else FALSE.
916 :    
917 :     =back
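
The resume check described above can be sketched in isolation. This is a minimal illustration, not the module's implementation: C<file_name> and its C<.dtx> naming scheme are hypothetical stand-ins for C<ERDBGenerate::CreateFileName>, the table names are invented, and the C<resume> option is passed as a plain argument rather than read from the object.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical naming scheme, for illustration only. The real names come
# from ERDBGenerate::CreateFileName, whose format is not shown here.
sub file_name {
    my ($table, $section, $dir) = @_;
    my $base = defined $section ? "$table-$section" : $table;
    return "$dir/$base.dtx";
}

# Return 1 only if resume is on and every table has either a whole-table
# load file or a load file for the given section.
sub skip_indicated {
    my ($resume, $section, $tables, $dir) = @_;
    return 0 if ! $resume;
    for my $table (@$tables) {
        my @files = map { file_name($table, $_, $dir) } (undef, $section);
        return 0 unless grep { -f $_ } @files;
    }
    return 1;
}

my $dir = tempdir(CLEANUP => 1);
# Create a section file for Feature and a whole-table file for Genome.
for my $name (file_name('Feature', 'sec1', $dir), file_name('Genome', undef, $dir)) {
    open(my $oh, '>', $name) or die "Cannot create $name: $!";
    close $oh;
}

my $both     = skip_indicated(1, 'sec1', ['Feature', 'Genome'], $dir);
my $missing  = skip_indicated(1, 'sec1', ['Feature', 'Genome', 'Role'], $dir);
my $noResume = skip_indicated(0, 'sec1', ['Feature', 'Genome'], $dir);
print "both=$both missing=$missing noResume=$noResume\n";
```

A single table with neither file (C<Role> here) is enough to force regeneration, and with resume off the directory is never consulted.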
918 :    
919 :     =cut
920 :    
921 :     sub SkipIndicated {
922 :     # Get the parameters.
923 :     my ($self, $section, $tables) = @_;
924 :     # Declare the return variable. It's FALSE if there's no resume parameter.
925 :     my $retVal = $self->{options}->{resume};
926 :     # Loop through the table names while $retVal is TRUE.
927 :     for my $table (@$tables) { last if ! $retVal;
928 :     # Compute the file names.
929 :     my @files = map { ERDBGenerate::CreateFileName($table, $_, data => $self->{directory}) }
930 :     (undef, $section);
931 :     # If neither is present, we can't skip. So, if the grep below returns an empty
932 :     # list, we set $retVal FALSE, which stops the loop.
933 :     if (scalar(grep { -f $_ } @files) == 0) {
934 :     $retVal = 0;
935 :     Trace("Section $section not found for $table in $self->{group}. Regeneration required.") if T(3);
936 :     }
937 :     }
938 :     # Return the result.
939 :     return $retVal;
940 :     }
941 :    
942 : parrello 1.7 =head3 GetRecord
943 :    
944 :     my ($key, $record) = ERDBLoadGroup::GetRecord($ih, $fieldPos);
945 :    
946 :     Read the next record from a tab-delimited file, returning the key field
947 :     in the specified position and a reference to a list of all the fields. If
948 :     end-of-file has been reached, the value TRAILER and an empty list
949 :     reference will be returned.
950 :    
951 :     =over 4
952 :    
953 :     =item ih
954 :    
955 :     Open handle of the input file containing the records.
956 :    
957 :     =item fieldPos
958 :    
959 :     Ordinal position in the record of the desired key field. This should be
960 :     C<0> for the first field, C<1> for the second, and so forth.
961 :    
962 :     =item RETURN
963 :    
964 :     Returns a two-element list, the first of which contains the indicated key
965 :     field and the second of which is a reference to a list of all fields in the
966 :     record (including the key). If end-of-file is reached, the returned key will
967 :     be TRAILER and the returned list will be empty.
968 :    
969 :     =back
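
Returning a sentinel key at end-of-file keeps merge-style loops over sorted files short. The sketch below is self-contained and makes two assumptions: C<get_record> is a stand-in that uses a plain C<split> where the module calls C<Tracer::GetLine>, and the C<TRAILER> value is invented for the example (the module's actual constant is not shown in this section).

```perl
use strict;
use warnings;

# Assumption: a sentinel that sorts after every real key in this example.
use constant TRAILER => "\xFF\xFF\xFF";

# Stand-in for GetRecord; chomp/split replaces Tracer::GetLine.
sub get_record {
    my ($ih, $fieldPos) = @_;
    my ($key, $record) = (TRAILER, []);
    if (! eof $ih) {
        my $line = <$ih>;
        chomp $line;
        my @fields = split /\t/, $line, -1;
        ($key, $record) = ($fields[$fieldPos], \@fields);
    }
    return ($key, $record);
}

# Two in-memory "files", each sorted by its key field.
my $left  = "a\t1\nc\t3\n";
my $right = "x\ta\ny\tb\nz\tc\n";
open(my $lh, '<', \$left)  or die "Cannot open left: $!";
open(my $rh, '<', \$right) or die "Cannot open right: $!";

# Merge-join: advance whichever side currently has the smaller key.
my @matches;
my ($lKey, $lRec) = get_record($lh, 0);
my ($rKey, $rRec) = get_record($rh, 1);
while ($lKey ne TRAILER && $rKey ne TRAILER) {
    if ($lKey lt $rKey) {
        ($lKey, $lRec) = get_record($lh, 0);
    } elsif ($lKey gt $rKey) {
        ($rKey, $rRec) = get_record($rh, 1);
    } else {
        push @matches, "$lKey: $lRec->[1] / $rRec->[0]";
        ($rKey, $rRec) = get_record($rh, 1);
    }
}
print "$_\n" for @matches;
```

The left file is keyed on field 0 and the right file on field 1, so the loop pairs the C<a> and C<c> records and skips C<b>.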
970 :    
971 :     =cut
972 :    
973 :     sub GetRecord {
974 :     # Get the parameters.
975 :     my ($ih, $fieldPos) = @_;
976 :     # Declare the return variables.
977 :     my ($key, $record) = (TRAILER, []);
978 :     # Only proceed if we're NOT at end of file.
979 :     if (! eof $ih) {
980 :     # Read the record.
981 :     my @fields = Tracer::GetLine($ih);
982 :     # Extract the key and form the list.
983 :     $key = $fields[$fieldPos];
984 :     $record = \@fields;
985 :     }
986 :     # Return the results.
987 :     return ($key, $record);
988 :     }
989 : parrello 1.2
990 : parrello 1.1 =head2 Virtual Methods
991 :    
992 :     =head3 Generate
993 :    
994 :     $edbl->Generate();
995 :    
996 :     Generate the data for this load group with respect to the current
997 :     section. This method must be overridden by the subclass and should call
998 :     the L</Put> method to put data into the tables.
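
A subclass override might look like the sketch below. Everything in it is illustrative: C<MockLoadGroup> is a minimal stand-in for this class so the example runs on its own, the C<Put($table, %fields)> signature and the C<section> accessor are assumptions, and the group, table, and field names are invented.

```perl
use strict;
use warnings;

# Minimal stand-in for the base class. The real constructor and Put method
# do far more; only the shape of the override is the point here.
package MockLoadGroup;

sub new {
    my ($class, %args) = @_;
    return bless { section => $args{section}, rows => [] }, $class;
}

sub section {
    my ($self) = @_;
    return $self->{section};
}

# Assumed signature: table name followed by field name/value pairs.
sub Put {
    my ($self, $table, %fields) = @_;
    push @{$self->{rows}}, [$table, +{ %fields }];
}

sub Generate {
    die "Pure virtual method Generate called.\n";
}

# Hypothetical load group that emits one Genome record per section.
package GenomeMockGroup;
our @ISA = ('MockLoadGroup');

sub Generate {
    my ($self) = @_;
    my $section = $self->section();
    $self->Put('Genome', id => $section, name => "Genome $section");
}

package main;

my $group = GenomeMockGroup->new(section => '83333.1');
$group->Generate();
# Calling Generate on the base class itself fails, as with Confess above.
my $baseOk = eval { MockLoadGroup->new(section => 'x')->Generate(); 1 };
print "rows: ", scalar @{$group->{rows}}, "\n";
print "base class died: ", ($baseOk ? 'no' : 'yes'), "\n";
```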
999 :    
1000 :     =cut
1001 :    
1002 :     sub Generate {
1003 :     Confess("Pure virtual method Generate called.");
1004 :     }
1005 :    
1006 : parrello 1.7 =head3 PostProcess
1007 :    
1008 :     my $stats = $edbl->PostProcess();
1009 :    
1010 :     Post-process the load files for this group. This method is called after all
1011 :     of the load files have been assembled, but before anything is actually loaded.
1012 :     It allows a final pass through the data to do filtering between groups or to
1013 :     accumulate totals and counters. The default is to do nothing.
1014 :    
1015 :     This method returns a statistics object describing the post-processing activity,
1016 :     or an undefined value if nothing happened.
1017 :    
1018 :     =cut
1019 :    
1020 :     sub PostProcess { }
1021 :    
1022 : parrello 1.1 1;
