[Bio] / FigKernelScripts / TransactFeatures.pl Repository:
ViewVC logotype

View of /FigKernelScripts/TransactFeatures.pl

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.1 - (download) (as text) (annotate)
Mon Aug 8 20:13:08 2005 UTC (14 years, 4 months ago) by parrello
Branch: MAIN
New script to process transaction files containing commands to add, delete, and replace features.

#!/usr/bin/perl -w

=head1 Add / Delete / Change Features

This method will run through a set of transaction files, adding, deleting, and changing
features in the FIG data store. The command takes three input parameters. The first is
a command. The second specifies a directory full of transaction files. The third
specifies a file that tells us which feature IDs are available for each organism.

C<TransactFeatures> I<[options]> I<command> I<transactionDirectory> I<idFile>

The supported commands are

=over 4

=item count

Count the number of IDs needed to process the ADD and CHANGE transactions. This
will produce an listing of the number of feature IDs needed for each
organism and feature type. This command is mostly a sanity check: it provides
useful statistics without changing anything.

=item register

Create an ID file by requesting IDs from the clearinghouse. This performs the
same function as B<count>, but takes the additional step of creating an ID
file that can be used to process the transactions.

=item process

Process the transactions and update the FIG data store. This will also create
a copy of each transaction file in which the pseudo-IDs have been replaced by
real IDs.

=back

=head2 The Transaction File

Each transaction file is a standard tab-delimited file, one transaction per line. The
name of the file is C<tbl_diff_>I<org> where I<org> is an organism ID. All records in
the transaction file refer to transactions against the organism encoded in the file
name.

The file must specify IDs for new features, but the real IDs cannot be known until
they are requested from the SEED clearing house. Therefore, each new ID is specified
in a special format consisting of the feature type (C<peg>, C<rna>, and so forth)
followed by a dot and the 0-based ordinal number of the new ID within that
feature type. So, for example, if the transaction file consists of a delete,
a change, and two adds, it might look like this

    delete fig|83333.1.peg.2
    change fig|83333.1.peg.6 peg.0 ...
    add peg.1 ...
    add rna.0 ...

Note that the old feature IDs do not participate in the numbering process, and the RNA
numbering is independent of the PEG numbering. In the discussion below of transaction
types, a field named I<newID> will always indicate one of these type/number pairs.
So, the field setup for the B<chang> command is

    change fid newID locations aliases translation

And the I<newID> corresponds to the C<peg.6> in the example above.

The first field of each record is the transaction type. The list of subsequent fields
depends on this type.

=over 4

=item DELETE fid

Deletes a feature. The feature is marked as deleted in the FIG database, which
causes it to be skipped or ignored by most of the SEED software. The ID of the
feature to be deleted is the second field (I<fid>).

=item ADD newID locations translation

Adds a new feature. The I<newID> indicates the feature type and its ordinal number.
The location is a comma-separated list of location strings. The translation is the
protein translation for the location. If the translation is omitted, then it will
be generated from the location information in the normal way.

=item CHANGE fid newID locations aliases translation

Changes an existing feature. The current copy of the feature is marked as deleted,
and a new feature is created with a new ID. All annotations and assignments are
transferred from the deleted feature to the new one. The location is a
comma-separated list of location strings. The aliases are specified as a comma-delimited
list of alternate names for the feature. These replace any existing aliases for the
old feature. If the alias list is omitted, no aliases will be assigned to the new
feature. The translation is the protein translation for the location. If the
translation is omitted, then it will be generated from the location information in the
normal way.

=back

=head2 The ID File

The ID file is a tab-delimited file containing one record for each feature type
of each organism that has a transaction file. Each record consists of three
fields.

=over 4

=item orgID

The ID of the organism being updated.

=item ftype

The relevant feature type.

=item firstNumber

The first available ID number for the organism and feature type.

=back

This file's primary purpose is that it tells us how to create the feature IDs
for features we'll be adding to the data store, whether it be via a straight
B<add> or a B<chang> that deletes an old ID and recreates the feature with a
new ID.

If we need new IDs for an organism not listed in this ID file, an error will be
thrown.

=head2 Command-Line Options

The command-line options for this script are as follows.

=over 4

=item trace

Numeric trace level. A higher trace level causes more messages to appear. The
default trace level is 3.

=cut

use strict;
use Tracer;
use DocUtils;
use TestUtils;
use Cwd;
use File::Copy;
use File::Path;
use FIG;
use Stats;

# Get the command-line options.
my ($options, @parameters) = Tracer::ParseCommand({ trace => 3 }, @ARGV);
# Set up tracing.
my $traceLevel = $options->{trace};
TSetup("$traceLevel Tracer DocUtils FIG", "TEXT");
# Get the FIG object.
my $fig = FIG->new();
# Get the command.
my $mainCommand = lc shift @parameters;
Trace("$mainCommand command specified.") if T(2);

# Create the ID table. This maps each organism/ftype pair to the currently-
# available ID number. If we're counting, we leave it empty. If we're not
# counting, we need to read it in.
my %idHash = ();
if ($mainCommand eq 'process') {
    my $inCount = 0;
    Open(\*IDFILE, "<$parameters[1]");
    while (my $idRecord = <IDFILE>) {
        chomp $idRecord;
        my ($orgID, $ftype, $firstNumber) = split /\t/, $idRecord;
        $idHash{"$orgID.$ftype"} = $firstNumber;
        $inCount++;
    }
    Trace("$inCount ID ranges read in from $parameters[1].") if T(2);
}

# Create some counters we can use for statistical purposes.
my $stats = Stats->new("genomes", "add", "change", "delete");
# Verify that the organism directory exists.
if (! -d $parameters[0]) {
    Confess("Directory of genome files \"$parameters[0]\" not found.");
} else {
    # Here we have a valid directory, so we need the list of transaction
    # files in it.
    my $orgsFound = 0;
    my %transFiles = ();
    my @transDirectory = OpenDir($parameters[0], 1);
    # The next step is to create a hash of organism IDs to file names. This
    # saves us some painful parsing later.
    for my $transFileName (@transDirectory) {
        if ($transFileName =~ /^tbl_diff_(\d+\.\d+)$/) {
            $transFiles{$1} = "$parameters[0]/$transFileName";
            $orgsFound++;
        }
    }
    Trace("$orgsFound genome transaction files found in directory $parameters[0].") if T(2);
    if (! $orgsFound) {
        Confess("No \"tbl_diff\" files found in directory $parameters[1].");
    } else {
        # Loop through the organisms.
        for my $genomeID (sort keys %transFiles) {
            Trace("Processing changes for $genomeID.") if T(3);
            # Create a statistics object for this organism.
            my $orgStats = Stats->new("add", "change", "delete");
            # Create a control block for passing around our key data.
            my $controlBlock = { stats => $orgStats, genomeID => $genomeID,
                                 idHash => \%idHash, options => $options,
                                 fig => $fig, command => $mainCommand };
            # Open the organism file.
            my $orgFileName = $transFiles{$genomeID};
            Open(\*TRANS, "<$orgFileName");
            my $tranCount = 0;
            # If we're processing rather than counting, open a file for
            # writing out corrected transactions.
            if ($mainCommand eq 'process') {
                Open(\*TRANSOUT, ">$orgFileName.tbl");
            }
            # Loop through the organism's data.
            while (my $transaction = <TRANS>) {
                # Parse the record.
                chomp $transaction;
                my @fields = split /\t/, $transaction;
                $tranCount++;
                # Save the record number in the control block.
                $controlBlock->{line} = $tranCount;
                # Process according to the transaction type.
                my $command = lc shift @fields;
                if ($command eq 'add') {
                    Add($controlBlock, @fields);
                } elsif ($command eq 'delete') {
                    Delete($controlBlock, @fields);
                } elsif ($command eq 'change') {
                    Change($controlBlock, @fields);
                } else {
                    $orgStats->AddMessage("Invalid command $command in line $tranCount for genome $genomeID");
                }
                $orgStats->Add($command, 1);
            }
            Trace("Statistics for $genomeID\n\n" . $orgStats->Show()) if T(3);
            # Merge the statistics for this run into the globals statistics object.
            $stats->Accumulate($orgStats);
            $stats->Add("genomes", 1);
            # Close the transaction files.
            close TRANS;
            if ($mainCommand eq 'process') {
                close TRANSOUT;
            }
        }
    }
    Trace("Statistics for this run\n\n" . $stats->Show()) if T(1);
    # If we're counting, we need to write out the counts file or allocate IDs
    # from the clearinghouse.
    if ($mainCommand ne "process") {
        # Loop through the ID hash, printing the counts. We will also write them
        # to a file called "counts.tbl".
        my $countfile = "$parameters[0]/counts.tbl";
        Open(\*COUNTFILE, ">$countfile");
        print "\nTable of Counts\n";
        for my $idKey (keys %idHash) {
            $idKey =~ /^(\d+\.\d+)\.([a-z]+)$/;
            my ($org, $ftype) = ($1, $2);
            my $count = $idHash{$idKey};
            print "$idKey\t$count\n";
            print COUNTFILE "$org\t$ftype\t$count\n";
        }
        close COUNTFILE;
        if ($mainCommand eq "register") {
            # Here we are registering as well as counting. This process also produces
            # the ID file.
            Trace("Submitting ID file to clearing house.") if T(2);
            system("register_features_batch <$countfile >$parameters[1]");
            Trace("Clearing house request complete.") if T(2);
        }
    }
    Trace("Processing complete.") if T(1);
}

=head2 Utility Methods

=head3 Add

C<< Add($controlBlock, $newID, $locations, $translation); >>

Add a new feature to the data store.

=over 4

=item controlBlock

Reference to a hash containing the data structures required to manage feature
transactions.

=item newID

ID to give to the new feature.

=item locations

Location of the new feature, in the form of a comma-separated list of location
strings in SEED format.

=item translation (optional)

Protein translation string for the new feature. If this field is omitted and
the feature is a peg, the translation will be generated by normal means.

=back

=cut

sub Add {
    my ($controlBlock, $newID, $locations, $translation) = @_;
    my $fig = $controlBlock->{fig};
    # Extract the feature type and ordinal number from the new ID.
    my ($ftype, $ordinal, $key) = ParseNewID($controlBlock, $newID);
    # If we're counting, we need to count the ID. Otherwise, we need to
    # add the new feature.
    if ($controlBlock->{command} ne 'process') {
        $controlBlock->{idHash}->{$key}++;
    } else {
        # Here we need to add the new feature.
        my $realID = AddFeature($controlBlock, $ordinal, $key, $ftype,
                                "", $locations, $translation);
        Trace("Feature $realID added for pseudo-ID $newID.") if T(4);
        # Write a corrected transaction to the transaction output file.
        print TRANSOUT "add\t$realID\t$locations\t$translation\n";
    }
}

=head3 Change

C<< Change($controlBlock, $fid, $newID, $locations, $aliases, $translation); >>

Replace a feature to the data store. The feature will be marked for deletion and
a new feature will be put in its place.

This is a much more complicated process than adding a feature. In addition to
the add, we have to create new aliases and transfer across the assignment and
the annotations.

=over 4

=item controlBlock

Reference to a hash containing the data structures required to manage feature
transactions.

=item fid

ID of the feature being changed.

=item newID

New ID to give to the feature.

=item locations

New location to give to the feature, in the form of a comma-separated list of location
strings in SEED format.

=item aliases (optional)

A new list of alias names for the feature.

=item translation (optional)

New protein translation string for the feature. If this field is omitted and
the feature is a peg, the translation will be generated by normal means.

=back

=cut

sub Change {
    my ($controlBlock, $fid, $newID, $locations, $aliases, $translation) = @_;
    my $fig = $controlBlock->{fig};
    # Extract the feature type and ordinal number from the new ID.
    my ($ftype, $ordinal, $key) = ParseNewID($controlBlock, $newID);
    # If we're counting, we need to count the ID. Otherwise, we need to
    # replace the feature.
    if ($controlBlock->{command} ne 'process') {
        $controlBlock->{idHash}->{$key}++;
    } else {
        # Here we can go ahead and change the feature. First, we must
        # get the old feature's assignment and annotations. Note that
        # for the annotations we ask for the time in its raw format.
        my @functions = $fig->function_of($fid);
        my @annotations = $fig->feature_annotations($fid, 1);
        # Create some counters.
        my ($assignCount, $annotateCount) = (0, 0);
        # Add the new version of the feature and get its ID.
        my $realID = AddFeature($controlBlock, $ordinal, $key, $ftype, $locations,
                                $aliases, $translation);
        # Copy over the assignments.
        for my $assignment (@functions) {
            my ($user, $function) = @{$assignment};
            $fig->assign_function($realID, $user, $function);
            $assignCount++;
        }
        # Copy over the annotations.
        for my $annotation (@annotations) {
            my ($oldID, $timestamp, $user, $annotation) = @{$annotation};
            $fig->add_annotation($realID, $user, $annotation, $timestamp);
            $controlBlock->{stats}->Add("annotation", 1);
            $annotateCount++;
        }
        # Mark the old feature for deletion.
        $fig->delete_feature($fid);
        # Tell the user what we did.
        $controlBlock->{stat}->Add("assignments", $assignCount);
        $controlBlock->{stat}->Add("annotations", $annotateCount);
        Trace("Feature $realID created from $fid. $assignCount assignments and $annotateCount annotations copied.") if T(4);
        # Write a corrected transaction to the transaction output file.
        print TRANSOUT "change\t$fid\t$realID\t$locations\t$aliases\t$translation\n";
    }
}

=head3 Delete

C<< Delete($controlBlock, $fid); >>

Delete a feature from the data store. The feature will be marked as deleted,
which will remove it from consideration by most FIG methods. A garbage
collection job will be run later to permanently delete the feature.

=over 4

=item controlBlock

Reference to a hash containing the data structures required to manage feature
transactions.

=item fid

ID of the feature to delete.

=back

=cut

sub Delete {
    my ($controlBlock, $fid) = @_;
    my $fig = $controlBlock->{fig};
    # Extract the feature type and count it.
    my $ftype = FIG::ftype($fid);
    $controlBlock->{stats}->Add($ftype, 1);
    # If we're not counting, delete the feature.
    if ($controlBlock->{command} eq 'process') {
        # Mark the feature for deletion.
        $fig->delete_feature($fid);
        # Echo the transaction to the transaction output file.
        print TRANSOUT "del\t$fid\n";
    }
}

=head3 ParseNewID

C<< my ($ftype, $ordinal, $key) = ParseNewID($controlBlock, $newID); >>

Extract the feature type and ordinal number from an incoming new ID.

=over 4

=item controlBlock

Reference to a hash containing the data structures needed to manage transactions.

=item newID

New ID specification taken from a transaction input record. This contains the
feature type followed by a period and then the ordinal number of the ID.

=item RETURN

Returna a three-element list. If successful, the list will contain the feature
type followed by the ordinal number and the key to use in the ID hash to find
the feature's true ID number. If the incoming ID is invalid, the list
will contain three C<undef>s.

=back

=cut

sub ParseNewID {
    # Get the parameters.
    my ($controlBlock, $newID) = @_;
    my ($ftype, $ordinal, $key);
    # Parse the ID.
    if ($newID =~ /^([a-z]+)\.(\d+)$/) {
        # Here we have a valid ID.
        ($ftype, $ordinal) = ($1, $2);
        $key = $controlBlock->{genomeID} . ".$ftype";
        # Update the feature type count in the statistics.
        $controlBlock->{stats}->Add($ftype, 1);
    } else {
        # Here we have an invalid ID.
        $controlBlock->{stats}->AddMessage("Invalid ID $newID found in line " .
                                           $controlBlock->{line} . " for genome " .
                                           $controlBlock->{genomeID} . ".");
    }
    # Return the result.
    return ($ftype, $ordinal, $key);
}

=head3 GetRealID

C<< my $realID = GetRealID($controlBlock, $ftype, $ordinal, $key); >>

Compute the real ID of a new feature. This involves interrogating the ID hash and
formatting a full-blown ID out of little bits of information.

=over 4

=item controlBlock

Reference to a hash containing data used to manage the transaction process.

=item ordinal

Zero-based ordinal number of this feature. The ordinal number is added to the value
stored in the control block's ID hash to compute the real feature number.

=item key

Key in the ID hash relevant to this feature.

=item RETURN

Returns a fully-formatted FIG ID for the new feature.

=back

=cut

sub GetRealID {
    # Get the parameters.
    my ($controlBlock, $ordinal, $key) = @_;
    #Declare the return value.
    my $retVal;
    # Get the base value for the feature ID number.
    my $base = $controlBlock->{idHash}->{$key};
    # If it didn't exist, we have an error.
    if (! defined $base) {
        Confess("No ID range found for genome ID and feature type $key.");
    } else {
        # Now we have enough data to format the ID.
        my $num = $base + $ordinal;
        $retVal = "fig|$key.$num";
    }
    # Return the result.
    return $retVal;
}

=head3 CheckTranslation

C<< my $actualTranslation = CheckTranslation($controlBlock, $ftype, $locations, $translation); >>

If we are processing a PEG, insure we have a translation for the peg's locations.

This method checks the feature type and the incoming translation string. If the
translation string is empty and the feature type is C<peg>, it will generate
a translation string using the specified locations for the genome currently
being processed.

=over 4

=item controlBlock

Reference to a hash containing data used to manage the transaction process.

=item ftype

Feature type (C<peg>, C<rna>, etc.)

=item locations

Comma-delimited list of location strings for the feature in question.

=item translation (optional)

If specified, will be returned to the caller as the result.

=item RETURN

Returns the protein translation string for the specified locations, or C<undef>
if no translation is warranted.

=back

=cut

sub CheckTranslation {
    # Get the parameters.
    my ($controlBlock, $ftype, $locations, $translation) = @_;
    my $fig = $controlBlock->{fig};
    # Declare the return variable.
    my $retVal;
    if ($ftype eq 'peg') {
        # Here we have a protein encoding gene. Check to see if we already have
        # a translation.
        if (defined $translation) {
            # Pass it back unmodified.
            $retVal = $translation;
        } else {
            # Here we need to compute the translation.
            my $dna = $fig->dna_seq($controlBlock->{genomeID}, $locations);
            $retVal = FIG::translate($dna);
        }
    }
    # Return the result.
    return $retVal;
}

=head3 AddFeature

C<< my $realID = AddFeature($controlBlock, $ordinal, $key, $ftype, $locations, $translation); >>

Add the specified feature to the FIG data store. This involves generating the new feature's
ID, creating the translation (if needed), adding the feature to the data store, and
queueing a request to update the similarities. The generated ID will be returned to the
caller.

=over 4

=item controlBlock

Reference to a hash containing the data structures required to manage feature
transactions.

=item ordinal

Zero-based ordinal number of the proposed feature in the ID space. This is added to the
base ID number to get the real ID number.

=item key

Key to use for getting the base ID number from the ID hash.

=item ftype

Proposed feature type (C<peg>, C<rna>, etc.)

=item locations

Location of the new feature, in the form of a comma-separated list of location
strings in SEED format.

=item aliases (optional)

A new list of alias names for the feature.

=item translation (optional)

Protein translation string for the new feature. If this field is omitted and
the feature is a peg, the translation will be generated by normal means.

=back

=cut

sub AddFeature {
    # Get the parameters.
    my ($controlBlock, $ordinal, $key, $ftype, $locations, $aliases, $translation) = @_;
    my $fig = $controlBlock->{fig};
    # We want to add a new feature using the information provided. First, we
    # generate its ID.
    my $retVal = GetRealID($controlBlock, $ordinal, $key);
    # Next, we insure that we have a translation.
    my $actualTranslation = CheckTranslation($controlBlock, $ftype,
                                             $locations, $translation);
    # Now we add it to FIG.
    $fig->add_feature($controlBlock->{genomeID}, $ftype, $locations, "",
                      $actualTranslation, $retVal);
    # Tell FIG to recompute the similarities.
    $fig->enqueue_similarities([$retVal]);
    # Return the ID we generated.
    return $retVal;
}

1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3