
View of /Sprout/AttrDBRefresh.pl



Revision 1.26
Mon Jan 19 21:46:21 2009 UTC by parrello
Branch: MAIN
CVS Tags: mgrast_dev_08112011, mgrast_dev_08022011, rast_rel_2014_0912, mgrast_dev_04082011, rast_rel_2010_0928, mgrast_version_3_2, mgrast_dev_12152011, mgrast_dev_06072011, rast_rel_2009_0925, rast_rel_2010_0526, rast_rel_2014_0729, rast_rel_2009_05_18, rast_rel_2010_1206, mgrast_release_3_0, mgrast_dev_03252011, rast_rel_2010_0118, rast_rel_2009_02_05, rast_rel_2011_0119, mgrast_release_3_0_4, mgrast_release_3_0_2, mgrast_release_3_0_3, mgrast_release_3_0_1, mgrast_dev_03312011, mgrast_release_3_1_2, mgrast_release_3_1_1, mgrast_release_3_1_0, mgrast_dev_04132011, mgrast_dev_04012011, rast_rel_2009_07_09, rast_rel_2010_0827, myrast_33, rast_rel_2011_0928, mgrast_dev_04052011, rast_rel_2009_03_26, mgrast_dev_10262011, HEAD
Changes since 1.25: +7 -252 lines
ERDB 2.0 support

#!/usr/bin/perl -w

=head1 AttrDBRefresh

This script performs a variety of maintenance functions on the custom attributes database.

The currently-supported command-line options are as follows.

=over 4

=item user

Name suffix to be used for log files. If omitted, the PID is used.

=item trace

Numeric trace level. A higher trace level causes more messages to appear. The
default trace level is C<3>.

=item sql

If specified, turns on tracing of SQL activity.

=item background

Save the standard and error output to files. The files will be created
in the FIG temporary directory and will be named C<err>I<User>C<.log> and
C<out>I<User>C<.log>, respectively, where I<User> is the value of the
B<user> option above.

=item h

Display this command's parameters and options.

=item phone

Phone number to message when the script is complete.

=item migrate

If specified, a comma-delimited list of attributes to be migrated from the old system to the
new one. The attributes will be erased before migration.

=item initializeAndClear

If specified, then the tables in the attribute database are dropped and re-created.

=item replace

If specified, existing keys will be erased before loading the attribute data. This
option only makes sense if C<load> is specified.

=item load

If specified, the name of a file containing attribute data to be loaded into the
system. The file is presumed to be tab-delimited. The first column must be the
object ID, the second the attribute key name, and the remaining columns the
attribute values. Existing attributes will be unchanged unless the C<replace>
option is specified.
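
For example, a single line of the load file might look like this, with the
fields separated by tabs (the object ID, key name, and value shown here are
purely illustrative):

    fig|83333.1.peg.4   essential   YES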

=item loadKeys

If specified, the name of a tab-delimited file containing attribute key data. For each key,
there is a pair of lines. The first line contains the ID, value table name, and
description of the key. The second line contains the marker C<#GROUPS> followed by zero or
more group names. The attribute will be connected to all the specified groups.
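
A hypothetical pair of lines might look like the following (the key name, value
table name, description, and group shown are illustrative only; in practice this
file is produced by the C<backupKeys> option):

    essential   HasValueFor   Indicates that the feature is essential to the organism.
    #GROUPS   annotations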

=item backup

If specified, the name of a file into which all the attribute data should be
dumped. The file itself will receive the attribute data in the format expected
by C<load>.

=item backupKeys

If specified, the name of a file into which all the attribute key data should be
dumped. The file will receive the attribute key data in the format expected by C<loadKeys>.

=item summary

If specified, the name of a file to contain a summary report. The summary report
contains a table of the attribute keys and the number of values of each. The report
is formatted as a web page.

=item trimSpaces

If specified, the name of an attribute value backup file. The file will be processed to
remove excess spaces. The fixed file will have the same name as the incoming backup
file with the extension C<.fixed>. This new file can then be reloaded using the
C<load> option.

=item showKeyDef

If specified, the name of an attribute key. The key's descriptive data will be displayed.

=item mapSubkey

Subkey mapping rule. Consists of a key name, followed by a substitution command enclosed in
slashes. For example, to remove the word C<ZINC> from the beginning of C<docking_results>
subkeys, you would code

    -mapSubkey=docking_results/^ZINC//

=item mapObjectID

Object ID mapping rule. Consists of a key name, followed by a substitution command enclosed in
slashes. For example, to add the prefix C<fig|> to all the object IDs for the C<PRODOM> key,
you would code

    -mapObjectID=PRODOM/^/fig\|/

=item dockClean

If specified, Predicted docking results will be removed from the attribute database.

=item dupClean

If specified, duplicate attribute values will be removed from the database. The
parameter should be an attribute key. All attribute keys whose names are greater than
or equal to the specified value will be processed. (This is to allow restarting.)

=item resume

If specified, key-value pairs already in the database will not be reinserted.
Specify a number to skip that many lines of the input file, check each line
after that until one is found that is not already in the database, and then
load that line and everything after it. Specify C<careful> to check every
single line before inserting it. Specify C<none> to ignore this option. The
default is C<none>. So, if you believe that a previous load failed somewhere
after 50000 lines, a resume value of C<50000> would skip 50000 lines in the
file and then check each subsequent line until it finds one not already in the
database; that line and all lines after it will be loaded. On the other hand,
if you have a file of 100000 records of which some have been loaded and some
have not, you would use the word C<careful>, so that every line is checked
before it is inserted. A resume value of C<0> starts checking at the first
line of the input file and begins loading as soon as it finds a line not in
the database.

=item chunkSize

Number of lines to load in each burst. The default is 10,000. This option
is only used if C<load> is specified.

=item mode

C<concurrent> to use concurrent loading in MySQL or C<low_priority> to use
low-priority loading in MySQL. If C<normal>, normal loading will be used. The
default is C<concurrent>. This option is only used if C<load> is specified.

=back
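
For example, assuming the script is invoked by name in the usual way, backing up
the attribute data and then reloading it with replacement might look like this
(the file name is hypothetical):

    AttrDBRefresh.pl -backup=attrData.tbl
    AttrDBRefresh.pl -replace=1 -load=attrData.tbl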

=cut

use strict;
use Tracer;
use Cwd;
use File::Copy;
use File::Path;
use CustomAttributes;
use ERDBLoad;
use Stats;
use CGI qw(-nosticky);

# Get the command-line options and parameters.
my ($options, @parameters) = StandardSetup([qw(CustomAttributes DBKernel) ],
                                           {
                                              trace => [3, "trace level"],
                                              initializeAndClear => ["", "if specified, the tables of the attribute database will be re-created"],
                                              phone => ["", "phone number (international format) to call when load finishes"],
                                              load => ["", "file from which to load attribute data"],
                                              loadKeys => ["", "file from which to load attribute key data"],
                                              backup => ["", "file to which attribute data should be dumped"],
                                              backupKeys => ["", "file to which attribute key data should be dumped"],
                                              compare => ["", "name of a file into which a comparison report will be written"],
                                              summary => ["", "name of a file into which a summary report will be written"],
                                              trimSpaces => ["", "if specified, the name of a backup file, which will be processed to remove excess spaces"],
                                              replace => ["", "if specified, data will be erased before loading from the load file"],
                                              showKeyDef => ["", "if specified, the name of a key whose descriptive data is to be displayed"],
                                              mapSubkey => ["", "instructions for fixing subkey values"],
                                              mapObjectID => ["", "instructions for fixing object ID values"],
                                              dockClean => ["", "if specified, Predicted docking results will be removed from the database"],
                                              resume => ["", "if specified, key-value pairs already in the database will not be inserted when loading from the load file"],
                                              mode => ["concurrent", "MySQL load mode to use"],
                                              chunksize => ["", "number of attributes to load in each burst"],
                                              dupClean => ["", "clean duplicate attributes"]
                                           },
                                           "",
                                           @ARGV);
# Set a variable to contain return type information.
my $rtype;
# Create a CGI object.
my $cgi = CGI->new();
# Insure we catch errors.
eval {
    # Get the attribute database.
    Trace("Connecting to local attribute database.") if T(2);
    my $ca = CustomAttributes->new();
    # Process according to the options selected.
    if ($options->{backup}) {
        # Back up the attributes to the specified file.
        my $backupFileName = $options->{backup};
        Trace("Backing up attribute data.") if T(2);
        my $stats = $ca->BackupAllAttributes($backupFileName);
        Trace("Attribute backup statistics:\n" . $stats->Show()) if T(2);
    }
    if ($options->{backupKeys}) {
        # Back up the attribute key data to the specified file.
        Trace("Backing up key data.") if T(2);
        my $backupFileName = $options->{backupKeys};
        my $stats = $ca->BackupKeys($backupFileName);
        Trace("Key backup statistics:\n" . $stats->Show()) if T(2);
    }
    if ($options->{initializeAndClear}) {
        # Create the tables.
        $ca->CreateTables();
        Trace("Tables recreated.") if T(2);
    }
    if ($options->{dupClean}) {
        # Clean out duplicates. Determine the point at which we should start.
        # The default is at the beginning of the key list.
        my $startPoint = " ";
        # If the user specified a start value, start from there. An unspecified
        # value defaults to 1.
        if ($options->{dupClean} ne "1") {
            $startPoint = $options->{dupClean};
        }
        CleanDuplicates($ca, $startPoint);
    }
    if ($options->{trimSpaces}) {
        # Here we need to remove unnecessary spaces from an attribute values backup
        # file. First, we open the input backup file.
        my $fileName = $options->{trimSpaces};
        my $ih = Open(undef, "<$fileName");
        # Now we open the proposed output file.
        my $oh = Open(undef, ">$fileName.fixed");
        # Create a statistics object to track our progress.
        my $stats = Stats->new('lines', 'trims');
        Trace("Cleaning $fileName and copying to $fileName.fixed.") if T(2);
        # Loop through the input file.
        while (! eof $ih) {
            # Get the next record in the input file.
            $stats->Add(lines => 1);
            my ($id, $key, @values) = Tracer::GetLine($ih);
            # Trim the values.
            for my $value (@values) {
                if ($value =~ /(\S.+\S)\s+/) {
                    $value = $1;
                    $stats->Add(trims => 1);
                }
            }
            # Write the result to the output file.
            Tracer::PutLine($oh, [$id, $key, @values]);
        }
        # Close the files.
        close $ih;
        close $oh;
        Trace("$fileName.fixed is now a cleaned backup.\n" . $stats->Show()) if T(2);
    }
    if ($options->{loadKeys}) {
        # We want to load the attribute data from the specified file, but
        # first we need to verify that the file exists.
        my $loadFileName = $options->{loadKeys};
        if (! -f $loadFileName) {
            Confess("Cannot load keys: file \"$loadFileName\" is not found or not a file.");
        } else {
            Trace("Loading key data from $loadFileName.") if T(2);
            my $stats = $ca->RestoreKeys($loadFileName);
            Trace("Load statistics:\n" . $stats->Show()) if T(2);
        }
    }
    if ($options->{showKeyDef}) {
        # We want to display the identified key's description. Get the key name.
        my $name = $options->{showKeyDef};
        # Look for keys with the same name.
        my %keys = $ca->GetAttributeData('find', $options->{showKeyDef});
        # See if we found the key.
        if (! $keys{$name}) {
            print "Key $name not found.\n";
        } else {
            print "Description for $name.\n\n";
            print $keys{$name}->[1];
            print "\n\n";
        }
    }
    if ($options->{load}) {
        # We want to load the attribute data from the specified file, but
        # first we need to verify that the file exists.
        my $loadFileName = $options->{load};
        if (! -f $loadFileName) {
            Confess("Cannot load: file \"$loadFileName\" is not found or not a file.");
        } else {
            # Set up options. We may need to specify the append and resume options, and
            # we need to archive.
            my %loadOptions;
            if ($options->{replace}) {
                $loadOptions{append} = 0;
            } else {
                $loadOptions{append} = 1;
            }
            if ($options->{resume}) {
                $loadOptions{resume} = 1;
            }
            if ($options->{mode}) {
                $loadOptions{mode} = $options->{mode};
            }
            # Insure we have the archive directory available.
            my $archiveDirectory = "$FIG_Config::fig/AttribData";
            Tracer::Insure($archiveDirectory, 0777);
            # Create an archive file name from the current time and the PID.
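            # Note that localtime returns a zero-based month and a year offset from 1900,
            # so those raw values appear in the archive file name as-is.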
            my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time());
            $loadOptions{archive} = "$archiveDirectory/attrSave$$.$mon.$mday.$year.$hour.$min.$sec.log";
            # Now we can load.
            Trace("Loading attribute data from $loadFileName.") if T(2);
            my $stats = $ca->LoadAttributesFrom($loadFileName, %loadOptions);
            Trace("Load statistics:\n" . $stats->Show()) if T(2);
        }
    }
    if ($options->{summary}) {
        # Here we have a summary report. The value of the option is the name of a file that is to
        # contain an html-formatted report. We start by getting a complete list of the keys and
        # the associated counts.
        my $keyCounts = GetNewKeyCounts($ca);
        # Buffer the lines in the following list.
        my @lines = ();
        # Start the table.
        push @lines, CGI::start_table({border => 2});
        push @lines, CGI::Tr(CGI::th({align => 'left'}, 'Key Name'), CGI::th({align => 'right'}, 'Values'));
        # Loop through the key hash, building row data.
        for my $key (sort keys %{$keyCounts}) {
            push @lines, CGI::Tr(CGI::td({align => 'left'}, $key), CGI::td({align => 'right'}, $keyCounts->{$key}));
        }
        # Close off the table.
        push @lines, CGI::end_table();
        # Output the page.
        WritePage($options->{summary}, 'Attribute Key Summary', \@lines);
    }
    if ($options->{mapSubkey} || $options->{mapObjectID}) {
        # Parse out the main key.
        my $type = ($options->{mapSubkey} ? 'mapSubkey' : 'mapObjectID');
        my $mapThing = $options->{$type};
        my $field = ($options->{mapSubkey} ? 'subkey' : 'to-link');
        if ($mapThing =~ m#([^/]+)(/.+)#) {
            my ($keyName, $pattern) = ($1, $2);
            Trace("Processing $type mapping for $keyName with pattern s$pattern.") if T(2);
            # Create a statistics object.
            my $results = Stats->new();
            # Get all the field values.
            my %things = map { $_ => 1 } $ca->GetFlat(['HasValueFor'], "HasValueFor(from-link) = ?", [$keyName],
                                                       "HasValueFor($field)");
            my $totalThings = scalar keys %things;
            Trace("$totalThings ${field}s found.") if T(2);
            # Loop through the values, doing updates where necessary.
            for my $thing (keys %things) {
                # Count this subkey.
                my $thingCount = $results->Add("${field}s" => 1);
                # Apply the substitution.
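                # The substitution command text comes straight from the command-line
                # option, so we assemble it into an s/// expression and run it with eval.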
                my $newThing = $thing;
                eval("\$newThing =~ s$pattern");
                # If the evaluation resulted in an error, stop immediately.
                if ($@) {
                    Confess("Error in substitution pattern: $@");
                } elsif ($newThing ne $thing) {
                    # Here the substitution worked and it changed the key value.
                    # We need to update the database.
                    $results->Add(updates => 1);
                    my $count = $ca->UpdateField("HasValueFor($field)", $thing, $newThing,
                                                 "HasValueFor(from-link) = ?", [$keyName]);
                    $results->Add(rowsChanged => $count);
                }
                # Trace our progress.
                if ($thingCount % 100 == 0) {
                    Trace("$thingCount processed.") if T(3);
                }
            }
            # Display the statistics.
            Trace("Statistics from $field update:\n" . $results->Show()) if T(2);
        } else {
            # Here the incoming parameter was in the wrong format. Mostly this means there
            # was nothing before the slash or no slash was found.
            Confess("Invalid substitution syntax in map option.");
        }
    }
    if ($options->{dockClean}) {
        # Get the list of PDBs with results.
        my @pdbList = sort map { $_->[0] } $ca->GetAttributes(undef, 'has_results');
        # Loop through the PDB IDs.
        for my $pdbID (@pdbList) {
            Trace("Processing $pdbID.") if T(3);
            # Loop until we run out of rows to delete.
            my $thisCount = 1;
            my $totalCount = 0;
            while ($thisCount) {
                # Delete a bunch of rows. To avoid a timeout, we limit the results.
                $thisCount = $ca->DeleteLike('HasValueFor', 'HasValueFor(to-link) = ? AND HasValueFor(value) LIKE ? LIMIT 10000',
                                           [$pdbID, '%Predicted']);
                $totalCount += $thisCount;
                Trace("$thisCount rows deleted in batch. $totalCount total deletions for pdb $pdbID.") if T(3);
            }
        }
    }
    Trace("Processing complete.") if T(2);
};
if ($@) {
    Trace("Script failed with error: $@") if T(0);
    $rtype = "error";
} else {
    Trace("Script complete.") if T(2);
    $rtype = "no error";
}
if ($options->{phone}) {
    my $msgID = Tracer::SendSMS($options->{phone}, "RefreshAttrDB terminated with $rtype.");
    if ($msgID) {
        Trace("Phone message sent with ID $msgID.") if T(2);
    } else {
        Trace("Phone message not sent.") if T(2);
    }
}


=head3 GetAllKeys

    my @keys = GetAllKeys($ca);

Return a sorted list of the attribute keys.

=over 4

=item ca

CustomAttributes object used to access the database.

=item RETURN

Returns a sorted list of all the attribute keys.

=back

=cut

sub GetAllKeys {
    # Get the parameters.
    my ($ca) = @_;
    # Get the attribute data.
    my %keyData = $ca->GetAttributeData('name', '');
    # Sort the keys.
    my @retVal = sort keys %keyData;
    # Return the result.
    return @retVal;
}

=head3 OpenPage

    my $fh = OpenPage($fileName, $title);

Start writing an HTML page to a file and return the file handle.

=over 4

=item fileName

Name of the file to which the page will be written.

=item title

Title for the page.

=item RETURN

Returns the file handle for writing the rest of the page.

=back

=cut

sub OpenPage {
    # Get the parameters.
    my ($fileName, $title) = @_;
    # Open the file.
    my $retVal = Open(undef, ">$fileName");
    # Write out the HTML headers.
    print $retVal CGI::start_html(-title => $title, -BGCOLOR => 'silver');
    print $retVal "\n";
    # Return the file handle.
    return $retVal;
}

=head3 GetNewKeyCounts

    my $counts = GetNewKeyCounts($ca);

Return a reference to a hash mapping attribute key names to counts of their values.

=over 4

=item ca

CustomAttributes object for accessing the attribute database.

=item RETURN

Returns a reference to a hash mapping each key name to a count of the key's values.

=back

=cut

sub GetNewKeyCounts {
    # Get the parameters.
    my ($ca) = @_;
    # Declare the return variable.
    my $retVal = {};
    # Get all of the keys.
    my @keys = GetAllKeys($ca);
    # Loop through the list, filling the hash.
    for my $key (@keys) {
        my $count = $ca->GetCount(['HasValueFor'], "HasValueFor(from-link) = ?", [$key]);
        $retVal->{$key} = $count;
    }
    # Return the result.
    return $retVal;
}

=head3 WritePage

    WritePage($fileName, $title, \@lines);

Write the specified lines to the specified file as a web page. The lines are assumed to be raw
HTML body text. They will be preceded by a standard HTML header and followed by a standard
HTML footer.

=over 4

=item fileName

Name of the output file.

=item title

Title for the web page.

=item lines

Reference to a list of lines of HTML.

=back

=cut

sub WritePage {
    # Get the parameters.
    my ($fileName, $title, $lines) = @_;
    # Open the file and write the header to it. The header includes everything up to and including
    # the BODY tag.
    Trace("Writing web page to $fileName.") if T(2);
    my $oh = OpenPage($fileName, $title);
    # Write the lines one at a time.
    for my $line (@{$lines}) {
        print $oh "$line\n";
    }
    # Write the HTML footer.
    print $oh CGI::end_html();
    # Close the output file.
    close $oh;
    Trace("Web page created in $fileName.") if T(2);
}

=head3 CleanDuplicates

    CleanDuplicates($ca, $startPoint);

Remove duplicate attribute values from the attribute database, starting
with the specified key. This is a long, slow process. We look through all
the values for a particular key. If duplicate values are found, we delete
all the matching values and re-insert a single copy.

=over 4

=item ca

CustomAttributes object for accessing the attribute database.

=item startPoint

Name of the first key to process. All keys that are lexically equal to or greater than this
value will be processed.

=back

=cut

sub CleanDuplicates {
    # Get the parameters.
    my ($ca, $startPoint) = @_;
    # Get a statistics object.
    my $stats = Stats->new();
    # Get the attribute keys we'll be wanting to process. For each key we get the
    # key ID and the relevant relationship name.
    my %keyList = map { $_->[0] => $_->[1] } $ca->GetAll(['AttributeKey'],
                                                         "AttributeKey(id) >= ? ORDER BY AttributeKey(id)",
                                                         [$startPoint],
                                                         ['AttributeKey(id)', 'AttributeKey(relationship-name)']);
    # Form the actual keys into a sorted list. We do this so we can more easily trace the number of
    # keys we have to process.
    my @keys = sort keys %keyList;
    my $n = scalar(@keys);
    Trace("$n will be cleaned for duplicates.") if T(2);
    # Loop through the keys.
    for my $key (@keys) {
        Trace("Processing key " . $stats->Add(keys => 1) . " of $n: $key.") if T(3);
        # Get the key's table.
        my $table = $keyList{$key};
        # Now we will loop through the table's values in sequence, checking for duplicates.
        # We will read the values in clumps, one clump for each target object ID. In general
        # the clumps will be small, and we roll each one into a hash to identify the duplicates.
        # This next variable holds the current object ID.
        my $objectID = "";
        # This will be the hash used to check for duplicate values.
        my %valueHash;
        # Duplicates found will be put in this list.
        my @dupList = ();
        # Count the values for this key.
        my $keyVals = 0;
        # Now loop through all the entries for this key.
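        # The ORDER BY clause keeps all the values for a single object adjacent in the
        # result stream, which is what makes the clump-at-a-time duplicate check work.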
        my $query = $ca->Get([$table], "$table(from-link) = ? ORDER BY $table(from-link), $table(to-link)",
                             [$key]);
        while (my $row = $query->Fetch()) {
            # Get the fields for this value.
            my ($myID, $subKey, $value) = $row->Values(["$table(to-link)", "$table(subkey)",
                                                        "$table(value)"]);
            # Count it.
            Trace($stats->Ask('values') . " total values processed.") if $stats->Check(values => 500) && T(3);
            $keyVals++;
            # Is this a new clump?
            if ($myID ne $objectID) {
                # Yes it is. Clear the value hash and save the new object ID.
                %valueHash = ();
                $objectID = $myID;
                $stats->Add(clumps => 1);
            }
            # Now determine if we have a duplicate.
            my $valueKey = "$subKey::$value";
            if (! $valueHash{$valueKey}) {
                # No. Record it for future use.
                $valueHash{$valueKey} = 1;
            } else {
                # Yes. Count it as a duplicate.
                my $count = $valueHash{$valueKey}++;
                $stats->Add(duplicates => 1);
                # Is this our first time for it?
                if ($count == 1) {
                    # Yes. Save it in the duplicates list.
                    push @dupList, [$key, $objectID, $subKey, $value];
                }
            }
        }
        Trace(scalar(@dupList) . " duplicates found for $key out of $keyVals values.") if T(3);
        # Now we've processed the key. Go through deleting and restoring the values found.
        # This next variable contains the filter clause to use.
        my $filter = "$table(from-link) = ? AND $table(to-link) = ? AND $table(subkey) = ? AND $table(value) = ?";
        # This is a counter for tracing.
        my $dupCount = 0;
        # Loop through the duplicates.
        for my $dup (@dupList) {
            # Delete all copies of this duplicate.
            my $count = $ca->DeleteLike($table => $filter, $dup);
            $stats->Add(deleted => $count - 1);
            # Put a single instance back in.
            $ca->InsertObject($table, {'from-link' => $dup->[0], 'to-link' => $dup->[1], subkey => $dup->[2],
                                       value => $dup->[3]});
            # Count this.
            $dupCount++;
            Trace("$dupCount duplicates processed for $key.") if ($dupCount % 100 == 0) && T(3);
        }
        Trace("Key $key finished. $dupCount duplicates removed.") if T(3);
    }
    Trace("Processing complete:\n" . $stats->Show()) if T(2);
}


1;
