[Bio] / Sprout / AttrDBRefresh.pl Repository:
ViewVC logotype

View of /Sprout/AttrDBRefresh.pl

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.25 - (download) (as text) (annotate)
Wed Sep 3 20:50:24 2008 UTC (11 years, 2 months ago) by parrello
Branch: MAIN
CVS Tags: rast_rel_2008_12_18, rast_2008_0924, rast_rel_2008_09_30, mgrast_rel_2008_0924, mgrast_rel_2008_1110_v2, mgrast_rel_2008_0625, rast_rel_2008_10_09, rast_release_2008_09_29, mgrast_rel_2008_0923, mgrast_rel_2008_0919, mgrast_rel_2008_1110, rast_rel_2008_09_29, mgrast_rel_2008_0917, rast_rel_2008_10_29, rast_rel_2008_11_24
Changes since 1.24: +133 -2 lines
Added a new function to clean out duplicate entries.

#!/usr/bin/perl -w

=head1 AttrDBRefresh

This script performs useful function on the custom attributes database.

The currently-supported command-line options are as follows.

=over 4

=item user

Name suffix to be used for log files. If omitted, the PID is used.

=item trace

Numeric trace level. A higher trace level causes more messages to appear. The
default trace level is C<3>.

=item sql

If specified, turns on tracing of SQL activity.

=item background

Save the standard and error output to files. The files will be created
in the FIG temporary directory and will be named C<err>I<User>C<.log> and
C<out>I<User>C<.log>, respectively, where I<User> is the value of the
B<user> option above.

=item h

Display this command's parameters and options.

=item phone

Phone number to message when the script is complete.

=item migrate

If specified, a comma-delimited list of attributes to be migrated from the old system to the
new one. The attributes will be erased before migration.

=item initializeAndClear

If specified, then the tables in the attribute database are dropped and re-created.

=item replace

If specified, existing keys will be erased before loading the attribute data. This
option only makes sense if C<load> is specified.

=item load

If specified, the name of a file containing attribute data to be loaded into the
system. The file is presumed to be tab-delimited. The first column must be the
object ID, the second the attribute key name, and the remaining columns the
attribute values. Existing attributes will be unchanged unless the C<replace>
option is specified.

=item loadKeys

If specified, the name of a tab-delimited file containing attribute key data. For each key,
there is a pair of lines. The first line contains the ID, value table name, and
description of the key. The second line contains the marker C<#GROUPS> followed by zero or
more group names. The attribute will be connected to all the specified groups.

=item backup

If specified, the name of a file into which all the attribute data should be
dumped. The file itself will receive the attribute data in the format expected
by C<load>.

=item backupKeys

If specified, the name of a file into which all the attribute key data should be
dumped. The file will receive the attribute key data in the format expected by C<loadKey>.

=item compare

If specified, the name of a file to contain a comparision report. The comparison report
contains a table of the attribute keys and the number of values of the key in the new
and old systems, along with a list of the attributes and values not in the new system.
The report is formatted as a web page.

=item summary

If specified, the name of a file to contain a summary report. The summary report
contains a table of the attribute keys and the number of values of each. The report
is formatted as a web page.

=item trimSpaces

If specified, the name of an attribute value backup file. The file will be processed to
remove excess spaces. The fixed file will have the same name as the incoming backup
file with the extension <.fixed>. This new file can then be reloaded using the
C<load> option.

=item showKeyDef

If specified, the name of an attribute key. The key's descriptive data will be displayed.

=item mapSubkey

Subkey mapping rule. Consists of a key name, followed by a substitution command enclosed in
slashes. For example, to remove the word C<ZINC> from the beginning of C<docking_results>
subkeys, you would code

    -mapSubkey=docking_results/^ZINC//

=item mapObjectID

Object ID mapping rule. Consists of a key name, followed by a substitution command enclosed in
slashes. For example, to add the prefix C<fig|> to all the object IDs for the C<PRODOM> key,
you would code

    -mapObjectID=PRODOM/^/fig\|/

=item dockClean

If specified, Predicted docking results will be removed from the attribute database.

=item dupClean

If specified, duplicate attribute values will be removed from the database. The
parameter should be an attribute key. All attribute keys whose names are greater than
or equal to the specified value will be processed. (This is to allow restarting.)

=item resume

If specified, key-value pairs already in the database will not be reinserted.
Specify a number to start checking after the specified number of lines and
then admit everything after the first line not yet loaded. Specify C<careful>
to check every single line. Specify C<none> to ignore this option. The default
is C<none>. So, if you believe that a previous load failed somewhere after 50000
lines, a resume value of C<50000> would skip 50000 lines in the file, then
check each line after that until it finds one not already in the database. The
first such line found and all lines after that will be loaded. On the other
hand, if you have a file of 100000 records, and some have been loaded and some
not, you would use the word C<careful>, so that every line would be checked before
it is inserted. A resume of C<0> will start checking the first line of the
input file and then begin loading once it finds a line not in the database.

=item chunkSize

Number of lines to load in each burst. The default is 10,000. This option
is only used if C<load> is specified.

=item mode

C<concurrent> to use concurrent loading in MySQL or C<low_priority> to use
low-priority loading in MySQL. If C<normal>, normal loading will be used. The
default is C<concurrent>. This option is only used if C<load> is specified.

=back

=cut

use strict;
use Tracer;
use Cwd;
use File::Copy;
use File::Path;
use CustomAttributes;
use ERDBLoad;
use Stats;
use CGI;

# Get the command-line options and parameters.
my ($options, @parameters) = StandardSetup([qw(CustomAttributes DBKernel) ],
                                           {
                                              trace => [3, "trace level"],
                                              initializeAndClear => ["", "if specified, the tables of the attribute database will be re-created"],
                                              phone => ["", "phone number (international format) to call when load finishes"],
                                              load => ["", "file from which to load attribute data"],
                                              loadKeys => ["", "file from which to load attribute key data"],
                                              backup => ["", "file to which attribute data should be dumped"],
                                              backupKeys => ["", "file to which attribute key data should be dumped"],
                                              compare => ["", "name of a file into which a comparison report will be written"],
                                              summary => ["", "name of a file into which a summary report will be written"],
                                              trimSpaces => ["", "if specified, the name of a backup file, which will be processed to remove excess spaces"],
                                              replace => ["", "if specified, data will be erased before loading from the load file"],
                                              showKeyDef => ["", "if specified, the name of a key whose descriptive data is to be displayed"],
                                              mapSubkey => ["", "instructions for fixing subkey values"],
                                              mapObjectID => ["", "instructions for fixing object ID values"],
                                              dockClean => ["", "if specified, Predicted docking results will be removed from the database"],
                                              resume => ["", "if specified, key-value pairs already in the database will not be inserted when loading from the load file"],
                                              mode => ["concurrent", "MySQL load mode to use"],
                                              chunksize => ["", "number of attributes to load in each burst"],
                                              dupClean => ["", "clean duplicate attributes"]
                                           },
                                           "",
                                           @ARGV);
# Set a variable to contain return type information.
my $rtype;
# Create a CGI object.
my $cgi = CGI->new();
# Insure we catch errors.
eval {
    # Insure we don't use the new attribute system for accessing the old attributes.
    $FIG_Config::attrOld = 1;
    # Get the attribute database.
    Trace("Connecting to local attribute database.") if T(2);
    my $ca = CustomAttributes->new();
    # Process according to the options selected.
    if ($options->{backup}) {
        # Back up the attributes to the specified file.
        my $backupFileName = $options->{backup};
        Trace("Backing up attribute data.") if T(2);
        my $stats = $ca->BackupAllAttributes($backupFileName);
        Trace("Attribute backup statistics:\n" . $stats->Show()) if T(2);
    }
    if ($options->{backupKeys}) {
        # Back up the attribute key data to the specified file.
        Trace("Backing up key data.") if T(2);
        my $backupFileName = $options->{backupKeys};
        my $stats = $ca->BackupKeys($backupFileName);
        Trace("Key backup statistics:\n" . $stats->Show()) if T(2);
    }
    if ($options->{initializeAndClear}) {
        # Create the tables.
        $ca->CreateTables();
        Trace("Tables recreated.") if T(2);
    }
    if ($options->{dupClean}) {
        # Clean out duplicates. Determine the point at which we should start.
        # The default is at the beginning of the key list.
        my $startPoint = " ";
        # If the user specified a start value, start from there. An unspecified
        # value defaults to 1.
        if ($options->{dupClean} ne "1") {
            $startPoint = $options->{dupClean};
        }
        CleanDuplicates($ca, $startPoint);
    }
    if ($options->{trimSpaces}) {
        # Here we need to remove unnecessary spaces from an attribute values backup
        # file. First, we open the input backup file.
        my $fileName = $options->{trimSpaces};
        my $ih = Open(undef, "<$fileName");
        # Now we open the proposed output file.
        my $oh = Open(undef, ">$fileName.fixed");
        # Create a statistics object to track our progress.
        my $stats = Stats->new('lines', 'trims');
        Trace("Cleaning $fileName and copying to $fileName.fixed.") if T(2);
        # Loop through the input file.
        while (! eof $ih) {
            # Get the next record in the input file.
            $stats->Add(lines => 1);
            my ($id, $key, @values) = Tracer::GetLine($ih);
            # Trim the values.
            for my $value (@values) {
                if ($value =~ /(\S.+\S)\s+/) {
                    $value = $1;
                    $stats->Add(trims => 1);
                }
            }
            # Write the result to the output file.
            Tracer::PutLine($oh, [$id, $key, @values]);
        }
        # Close the files.
        close $ih;
        close $oh;
        Trace("$fileName.fixed is now a cleaned backup.\n" . $stats->Show()) if T(2);
    }
    if ($options->{loadKeys}) {
        # We want to load the attribute data from the specified file, but
        # first we need to verify that the file exists.
        my $loadFileName = $options->{loadKeys};
        if (! -f $loadFileName) {
            Confess("Cannot load keys: file \"$loadFileName\" is not found or not a file.");
        } else {
            Trace("Loading key data from $loadFileName.") if T(2);
            my $stats = $ca->RestoreKeys($loadFileName);
            Trace("Load statistics:\n" . $stats->Show()) if T(2);
        }
    }
    if ($options->{showKeyDef}) {
        # We want to display the identified key's description. Get the key name.
        my $name = $options->{showKeyDef};
        # Look for keys with the same name.
        my %keys = $ca->GetAttributeData('find', $options->{showKeyDef});
        # See if we found the key.
        if (! $keys{$name}) {
            print "Key $name not found.\n";
        } else {
            print "Description for $name.\n\n";
            print $keys{$name}->[1];
            print "\n\n";
        }
    }
    if ($options->{load}) {
        # We want to load the attribute data from the specified file, but
        # first we need to verify that the file exists.
        my $loadFileName = $options->{load};
        if (! -f $loadFileName) {
            Confess("Cannot load: file \"$loadFileName\" is not found or not a file.");
        } else {
            # Set up options. We may need to specify the append and resume options, and
            # we need to archive.
            my %loadOptions;
            if ($options->{replace}) {
                $loadOptions{append} = 0;
            } else {
                $loadOptions{append} = 1;
            }
            if ($options->{resume}) {
                $loadOptions{resume} = 1;
            }
            if ($options->{mode}) {
                $loadOptions{mode} = $options->{mode};
            }
            # Insure we have the archive directory available.
            my $archiveDirectory = "$FIG_Config::fig/AttribData";
            Tracer::Insure($archiveDirectory, 0777);
            # Create an archive file name from the current time and the PID.
            my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time());
            $loadOptions{archive} = "$archiveDirectory/attrSave$$.$mon.$mday.$year.$hour.$min.$sec.log";
            # Now we can load.
            Trace("Loading attribute data from $loadFileName.") if T(2);
            my $stats = $ca->LoadAttributesFrom($loadFileName, %loadOptions);
            Trace("Load statistics:\n" . $stats->Show()) if T(2);
        }
    }
    if ($options->{summary}) {
        # Here we have a summary report. The value of the option is the name of a file that is to
        # contain an html-formatted report. We start by getting a complete list of the keys and
        # the associated counts.
        my $keyCounts = GetAllKeys($ca);
        # Buffer the lines in the following list.
        my @lines = ();
        # Start the table.
        push @lines, $cgi->start_table({border => 2});
        push @lines, $cgi->Tr($cgi->th({align => 'left'}, 'Key Name'), $cgi->th({align => 'right'}, 'Values'));
        # Loop through the key hash, building row data.
        for my $key (sort keys %{$keyCounts}) {
            push @lines, $cgi->Tr($cgi->td({align => 'left'}, $key), $cgi->td({align => 'right'}, $keyCounts->{$key}));
        }
        # Close off the table.
        push @lines, $cgi->end_table();
        # Output the page.
        WritePage($options->{summary}, 'Attribute Key Summary', \@lines);
    }
    if ($options->{compare}) {
        # Here we have a comparison report. The value of the option is the name of a file that is
        # to contain an html-formatted report. We need access to the SEED database to pull this
        # off.
        Trace("Connecting to FIG object.") if T(2);
        require FIG;
        my $fig = FIG->new();
        my $dbh = $fig->db_handle();
        Trace("Retrieving key data.") if T(2);
        # Get counts for all the keys in the old system.
        my $oldKeys = $dbh->SQL("SELECT tag, COUNT(*) FROM attribute GROUP BY tag");
        # Insure the query worked.
        if (! defined($oldKeys)) {
            Trace("Database error retrieving old system keys:" . $dbh->errstr) if T(0);
        } else {
            # Convert the key data into a hash.
            my %oldKeyHash = map { $_->[0] => $_->[1] } @{$oldKeys};
            # Get the counts for all the keys in the new system.
            my $newKeyHashRef = GetNewKeyCounts($ca);
            # We've got our data, so the next step is to start accumulating the lines of the web page.
            Trace("Processing key table for new system.") if T(2);
            my @lines = ();
            # Start the first table.
            push @lines, $cgi->h3("New System Keys");
            push @lines, $cgi->start_table({border => 2});
            push @lines, $cgi->Tr($cgi->th({align => 'left'}, 'Key Name'), $cgi->th({align => 'right'}, 'New Values'),
                                  $cgi->th({align => 'right'}, 'Old Values'));
            # Now we process the new keys. As we run through them, we'll delete matching keys from
            # the old key hash. The remaining keys will be output as a missing-keys table.
            for my $key (sort keys %{$newKeyHashRef}) {
                # Get the new system count.
                my $newCount = $newKeyHashRef->{$key};
                # Default to a blank in the old system count column.
                my $oldCount = '&nbsp;';
                # Check to see if this key exists in the old system.
                if (exists $oldKeyHash{$key}) {
                    # If it does, save its count and delete it from the old system hash.
                    $oldCount = $oldKeyHash{$key};
                    delete $oldKeyHash{$key};
                }
                # Output this table row.
                push @lines, $cgi->Tr($cgi->td({align => 'left'}, $key), $cgi->td({align => 'right'}, $newCount),
                                      $cgi->td({align => 'right'}, $oldCount));
            }
            # Close the table.
            push @lines, $cgi->end_table();
            # Now the remaining keys in the old key hash are missing from the new system. We create a new table
            # to display them.
            my @missingKeys = sort keys %oldKeyHash;
            # Only do this if there's at least one missing key.
            if (@missingKeys == 0) {
                push @lines, $cgi->p("No missing keys found.");
            } else {
                Trace("Processing missing key table.") if T(2);
                # Start the second table.
                push @lines, $cgi->h3("Keys Missing from New System");
                push @lines, $cgi->start_table({border => 2});
                push @lines, $cgi->Tr($cgi->th({align => 'left'}, 'Key Name'), $cgi->td({align => 'right'}, 'Values'));
                # Loop through the missing keys, writing them to the table.
                for my $key (@missingKeys) {
                    push @lines, $cgi->Tr($cgi->td({align => 'left'}, $key), $cgi->td({align => 'right'}, $oldKeyHash{$key}));
                }
                # Close the table.
                push @lines, $cgi->end_table();
            }
            # Write the web page.
            WritePage($options->{compare}, 'Attribute System Comparison', \@lines);
        }
    }
    if ($options->{mapSubkey} || $options->{mapObjectID}) {
        # Parse out the main key.
        my $type = ($options->{mapSubkey} ? 'mapSubkey' : 'mapObjectID');
        my $mapThing = $options->{$type};
        my $field = ($options->{mapSubkey} ? 'subkey' : 'to-link');
        if ($mapThing =~ m#([^/]+)(/.+)#) {
            my ($keyName, $pattern) = ($1, $2);
            Trace("Processing $type mapping for $keyName with pattern s$pattern.") if T(2);
            # Create a statistics object.
            my $results = Stats->new();
            # Get all the field values.
            my %things = map { $_ => 1 } $ca->GetFlat(['HasValueFor'], "HasValueFor(from-link) = ?", [$keyName],
                                                       "HasValueFor($field)");
            my $totalThings = scalar keys %things;
            Trace("$totalThings ${field}s found.") if T(2);
            # Loop through the values, doing updates where necessary.
            for my $thing (keys %things) {
                # Count this subkey.
                my $thingCount = $results->Add("${field}s" => 1);
                # Apply the substitution.
                my $newThing = $thing;
                eval("\$newThing =~ s$pattern");
                # If the evaluation resulted in an error, stop immediately.
                if ($@) {
                    Confess("Error in substitution pattern: $@");
                } elsif ($newThing ne $thing) {
                    # Here the substitution worked and it changed the key value.
                    # We need to update the database.
                    $results->Add(updates => 1);
                    my $count = $ca->UpdateField("HasValueFor($field)", $thing, $newThing,
                                                 "HasValueFor(from-link) = ?", [$keyName]);
                    $results->Add(rowsChanged => $count);
                }
                # Trace our progress.
                if ($thingCount % 100 == 0) {
                    Trace("$thingCount processed.") if T(3);
                }
            }
            # Display the statistics.
            Trace("Statistics from $field update:\n" . $results->Show()) if T(2);
        } else {
            # Here the incoming parameter was in the wrong format. Mostly this means there
            # was nothing before the slash or no slash was found.
            Confess("Invalid substitution syntax in map option.");
        }
    }
    if ($options->{dockClean}) {
        # Get the list of PDBs with results.
        my @pdbList = sort map { $_->[0] } $ca->GetAttributes(undef, 'has_results');
        # Loop through the PDB IDs.
        for my $pdbID (@pdbList) {
            Trace("Processing $pdbID.") if T(3);
            # Loop until we run out of rows to delete.
            my $thisCount = 1;
            my $totalCount = 0;
            while ($thisCount) {
                # Delete a bunch of rows. To avoid a timeout, we limit the results.
                $thisCount = $ca->DeleteLike('HasValueFor', 'HasValueFor(to-link) = ? AND HasValueFor(value) LIKE ? LIMIT 10000',
                                           [$pdbID, '%Predicted']);
                $totalCount += $thisCount;
                Trace("$thisCount rows deleted in batch. $totalCount total deletions for pdb $pdbID.") if T(3);
            }
        }
    }
    Trace("Processing complete.") if T(2);
};
if ($@) {
    Trace("Script failed with error: $@") if T(0);
    $rtype = "error";
} else {
    Trace("Script complete.") if T(2);
    $rtype = "no error";
}
if ($options->{phone}) {
    my $msgID = Tracer::SendSMS($options->{phone}, "RefreshAttrDB terminated with $rtype.");
    if ($msgID) {
        Trace("Phone message sent with ID $msgID.") if T(2);
    } else {
        Trace("Phone message not sent.") if T(2);
    }
}

=head3 MigrateAttributes

    my $stats = MigrateAttributes($ca, $fig, $preserve);

Migrate all the attributes data from the specified FIG instance. This is a long, slow
method used to convert the old attribute data to the new system. Only attribute
keys that are already in the database will be loaded, and they will completely
replace the existing values for those keys. Therefore, it is very important that the
FIG instance not be connected to the attribute database.

=over 4

=item ca

B<CustomAttributes> object used to access the attribute database.

=item fig

A FIG object that can be used to retrieve attributes for migration purposes.

=item preserve (optional)

A comma-delimited list of attributes that are not to be migrated.

=item RETURN

Returns a statistical object for the load process.

=back

=cut

sub MigrateAttributes {
    # Get the parameters.
    my ($ca, $fig, $preserve) = @_;
    # Create the return value.
    my $retVal = Stats->new('keysIn');
    # Create a loader for the value table.
    my $hasValueFor = ERDBLoad->new($ca, 'HasValueFor', $FIG_Config::temp);
    # Create a hash for the target objects.
    my %targetObjectHash = ();
    # Get a list of the attributes we're to preserve.
    my %preserve = ();
    if (defined $preserve) {
        %preserve = map { $_ => 1 } split /\s*,\s*/, $preserve;
    }
    # Put the preserved keys into the load file.
    for my $key (keys %preserve) {
        Trace("Preserving key $key.") if T(3);
        my @newValues = $ca->GetAttributes(undef, $key);
        Trace(scalar(@newValues) . " values of $key will be preserved.");
        # Put the values into the load file.
        PutValue($hasValueFor, $ca, @newValues);
    }
    # Get a list of all our attribute keys.
    my @allKeys = $ca->GetFlat(['AttributeKey'], "", [], 'AttributeKey(id)');
    # Delete the preserved keys.
    my @keys = grep { ! $preserve{$_} } @allKeys;
    # Loop through the reset, building the load files.
    for my $key (@keys) {
        Trace("Migrating key $key.") if T(3);
        $retVal->Add(keysIn => 1);
        # Get all the values of the specified key.
        my @oldValues = $fig->get_attributes(undef, $key);
        my $count = scalar(@oldValues);
        Trace("$count values found for $key in source system.") if T(3);
        # Put the values into the load file.
        PutValue($hasValueFor, $ca, @oldValues);
    }
    # Close and finish the loads to upload the data.
    Trace("Closing value table.") if T(2);
    my $hvfStats = $hasValueFor->FinishAndLoad();
    Trace("Statistics from value table load:\n" . $hvfStats->Show()) if T(2);
    # Merge the statistics.
    $retVal->Accumulate($hvfStats);
    # Return the statistics object.
    return $retVal;
}

=head3 PutValue

    PutValue($hasValueFor, $ca, @values);

Put the values from an attribute value list into a HasValueFor load file.

=over 4

=item hasValueFor

Load object for the HasValueFor table.

=item ca

A CustomAttribute object. We get the splitter value from it.

=item value

A list of tuples, each consisting of an object ID, a key name, and one or more values.

=back

=cut

sub PutValue {
    # Get the parameters.
    my ($hasValueFor, $ca, @values) = @_;
    # Loop through the value rows.
    for my $row (@values) {
        # Get this row's data.
        my ($id, $key, @values) = @{$row};
        # Format the values.
        my $valueString = join($ca->{splitter}, @values);
        # Add the value.
        $hasValueFor->Put($key, $id, $valueString);
    }
}

=head3 MigrateCollections

    my $stats = MigrateCollections($ca, $fig);

This method copies the collection data from the specified FIG object and stores it as values
of the C<collection> attribute in the specified custom attribute database.

=over 4

=item ca

Custom attribute database into which the collections are to be stored.

=item fig

FIG object from which the collection attributes are to be harvested.

=item RETURN

Returns a statistics object with informatino about the migration.

=back

=cut

sub MigrateCollections {
    # Get the parameters.
    my ($ca, $fig) = @_;
    # Declare the return variable.
    my $retVal = Stats->new();
    # Get the collection names.
    my @collections = qw(higher_plants eukaryotic_ps nonoxygenic_ps hundred_hundred functional_coupling_paper ecoli_essentiality_paper);
    # Erase the current collection date.
    $ca->EraseAttribute('collection');
    # Loop through the collection attributes.
    for my $cname (@collections) {
        $retVal->Add(collection => 1);
        # Get this attribute from the old system.
        my @rows = $fig->get_attributes(undef, $cname);
        # Loop through its values.
        for my $row (@rows) {
            $retVal->Add($cname => 1);
            # Determine the object key.
            my $objectID = ($row->[0] eq 'Subsystem' ? $row->[2] : $row->[0]);
            $ca->AddAttribute($objectID, 'collection', $cname);
        }
    }
    # Return the statistics.
    return $retVal;
}

=head3 GetAllKeys

    my @keys = GetAllKeys($ca);

Return a sorted list of the attribute keys.

=over 4

=item ca

CustomAttributes object used to access the database.

=item RETURN

Returns a sorted list of all the attribute keys.

=back

=cut

sub GetAllKeys {
    # Get the parameters.
    my ($ca) = @_;
    # Get the attribute data.
    my %keyData = $ca->GetAttributeData('name', '');
    # Sort the keys.
    my @retVal = sort keys %keyData;
    # Return the result.
    return @retVal;
}

=head3 OpenPage

    my $fh = OpenPage($fileName, $title);

Start writing an HTML page to a file and return the file handle.

=over 4

=item fileName

Name of the file to which the page will be written.

=item title

Title for the page.

=item RETURN

Returns the file handle for writing the rest of the page.

=back

=cut

sub OpenPage {
    # Get the parameters.
    my ($fileName, $title) = @_;
    # Open the file.
    my $retVal = Open(undef, ">$fileName");
    # Write out the HTML headers.
    print $retVal $cgi->start_html(-title => $title, -BGCOLOR => 'silver');
    print $retVal "\n";
    # Return the file handle.
    return $retVal;
}

=head3 GetNewKeyCounts

    my %counts = GetNewKeyCounts($ca);

Return a hash mapping attribute key names to counts.

=over 4

=item ca

CustomAttributes object for accessing the attribute database.

=item RETURN

Returns a reference to a hash mapping each key name to a count of the key's values.

=back

=cut

sub GetNewKeyCounts {
    # Get the parameters.
    my ($ca) = @_;
    # Declare the return variable.
    my $retVal = {};
    # Get all of the keys.
    my @keys = GetAllKeys($ca);
    # Loop through the list, filling the hash.
    for my $key (@keys) {
        my $count = $ca->GetCount(['HasValueFor'], "HasValueFor(from-link) = ?", [$key]);
        $retVal->{$key} = $count;
    }
    # Return the result.
    return $retVal;
}

=head3 WritePage

    WritePage($fileName, $title, \@lines);

Write the specified lines to the specified file as a web page. The lines are assumed to be raw
HTML body text. They will be preceded by a standard HTML header and followed by a standard
HTML footer.

=over 4

=item fileName

Name of the output file.

=item title

Title for the web page.

=item lines

Reference to a list of lines of HTML.

=back

=cut

sub WritePage {
    # Get the parameters.
    my ($fileName, $title, $lines) = @_;
    # Open the file and write the header to it. The header includes everything up to and including
    # the BODY tag.
    Trace("Writing web page to $fileName.") if T(2);
    my $oh = OpenPage($fileName, $title);
    # Write the lines one at a time.
    for my $line (@{$lines}) {
        print $oh "$line\n";
    }
    # Write the HTML footer.
    print $oh $cgi->end_html();
    # Close the output file.
    close $oh;
    Trace("Web page created in $fileName.") if T(2);
}

=head3 CleanDuplicates

    CleanDuplicates($ca, $startPoint);

Remove duplicate attribute values from the attribute database, starting
with the specified key. This is a long, slow process. We look through all
the values for a particular key. If duplicate values are found, we delete
all the matching values and re-insert.

=over 4

=item ca

[[CustomAttributesPm]] object for accessing the attribute database.

=item startPoint

Name of the first key to process. All keys that are lexically equal to or greater than this
value will be processed.

=back

=cut

sub CleanDuplicates {
    # Get the parameters.
    my ($ca, $startPoint) = @_;
    # Get a statistics object.
    my $stats = Stats->new();
    # Get the attribute keys we'll be wanting to process. For each key we get the
    # key ID and the relevant relationship name.
    my %keyList = map { $_->[0] => $_->[1] } $ca->GetAll(['AttributeKey'],
                                                         "AttributeKey(id) >= ? ORDER BY AttributeKey(id)",
                                                         [$startPoint],
                                                         ['AttributeKey(id)', 'AttributeKey(relationship-name)']);
    # Form the actual keys into a sorted list. We do this so we can more easily trace the number of
    # keys we have to process.
    my @keys = sort keys %keyList;
    my $n = scalar(@keys);
    Trace("$n will be cleaned for duplicates.") if T(2);
    # Loop through the keys.
    for my $key (@keys) {
        Trace("Processing key " . $stats->Add(keys => 1) . " of $n: $key.") if T(3);
        # Get the key's table.
        my $table = $keyList{$key};
        # Now we will loop through the table's values in sequence, checking for duplicates.
        # we will read the values in clumps, one clump for each target object ID. In general
        # the clumps will be small, and we roll them into a hash to identify the duplicates.
        # This next variable holds the current object ID.
        my $objectID = "";
        # This will be the hash used to check for duplicate values.
        my %valueHash;
        # Duplicates found will be put in this list.
        my @dupList = ();
        # Count the values for this key.
        my $keyVals = 0;
        # Now loop through all the entries for this key.
        my $query = $ca->Get([$table], "$table(from-link) = ? ORDER BY $table(from-link), $table(to-link)",
                             [$key]);
        while (my $value = $query->Fetch()) {
            # Get the fields for this value.
            my ($myID, $subKey, $value) = $value->Values(["$table(to-link)", "$table(subkey)",
                                                          "$table(value)"]);
            # Count it.
            Trace($stats->Ask('values') . " total values processed.") if $stats->Check(values => 500) && T(3);
            $keyVals++;
            # Is this a new clump?
            if ($myID ne $objectID) {
                # Yes it is. Clear the value hash and save the new object ID.
                %valueHash = ();
                $objectID = $myID;
                $stats->Add(clumps => 1);
            }
            # Now determine if we have a duplicate.
            my $valueKey = "$subKey::$value";
            if (! $valueHash{$valueKey}) {
                # No. Record it for future use.
                $valueHash{$valueKey} = 1;
            } else {
                # Yes. Count it as a duplicate.
                my $count = $valueHash{$valueKey}++;
                $stats->Add(duplicates => 1);
                # Is this our first time for it?
                if ($count == 1) {
                    # Yes. Save it in the duplicates list.
                    push @dupList, [$key, $objectID, $subKey, $value];
                }
            }
        }
        Trace(scalar(@dupList) . " duplicates found for $key out of $keyVals.") if T(3);
        # Now we've processed the key. Go through deleting and restoring the values found.
        # This next variable contains the filter clause to use.
        my $filter = "$table(from-link) = ? AND $table(to-link) = ? AND $table(subkey) = ? AND $table(value) = ?";
        # This is a counter for tracing.
        my $dupCount = 0;
        # Loop through the duplicates.
        for my $dup (@dupList) {
            # Delete all copies of this duplicate.
            my $count = $ca->DeleteLike($table => $filter, $dup);
            $stats->Add(deleted => $count - 1);
            # Put a single instance back in.
            $ca->InsertObject($table, {'from-link' => $dup->[0], 'to-link' => $dup->[1], subkey => $dup->[2],
                                       value => $dup->[3]});
            # Count this.
            $dupCount++;
            Trace("$dupCount duplicates processed for $key.") if ($dupCount % 100 == 0) && T(3);
        }
        Trace("Key $key finished. $dupCount duplicates removed.") if T(3);
    }
    Trace("Processing complete:\n" . $stats->Show()) if T(2);
}


1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3