View of /FigKernelScripts/TransactFeatures.pl

Revision 1.9
Mon Aug 15 20:29:10 2005 UTC by parrello
Branch: MAIN
Changes since 1.8: +7 -0 lines
Added a command for copying attributes from old features to their replacements.

#!/usr/bin/perl -w

=head1 Add / Delete / Change Features

This script will run through a set of transaction files, adding, deleting, and changing
features in the FIG data store. The command takes three input parameters. The first is
a command. The second specifies a directory full of transaction files. The third
specifies a file that tells us which feature IDs are available for each organism.

C<TransactFeatures> I<[options]> I<command> I<transactionDirectory> I<idFile>

The supported commands are

=over 4

=item count

Count the number of IDs needed to process the ADD and CHANGE transactions. This
will produce a listing of the number of feature IDs needed for each
organism and feature type. This command is mostly a sanity check: it provides
useful statistics without changing anything.

=item register

Create an ID file by requesting IDs from the clearinghouse. This performs the
same function as B<count>, but takes the additional step of creating an ID
file that can be used to process the transactions.

=item process

Process the transactions and update the FIG data store. This will also create
a copy of each transaction file in which the pseudo-IDs have been replaced by
real IDs.

=item annotate

Annotate the features created by the transactions so as to indicate how they were
derived.

=item check

Verify that the locations and translations of the new and changed features are
correct.

=item fix

Fix the locations and translations of the new and changed features.

=item aliasMove

Move the aliases from the old features to the ones that replaced them.

=item attribute

Move the attributes from the old features to the ones that replaced them.

=back

=head2 The Transaction File

Each transaction file is a standard tab-delimited file, one transaction per line. The
name of the file is C<tbl_diff_>I<org> where I<org> is an organism ID. All records in
the transaction file refer to transactions against the organism encoded in the file
name.

The file must specify IDs for new features, but the real IDs cannot be known until
they are requested from the SEED clearing house. Therefore, each new ID is specified
in a special format consisting of the feature type (C<peg>, C<rna>, and so forth)
followed by a dot and the 0-based ordinal number of the new ID within that
feature type. So, for example, if the transaction file consists of a delete,
a change, and two adds, it might look like this:

    delete fig|83333.1.peg.2
    change fig|83333.1.peg.6 peg.0 ...
    add peg.1 ...
    add rna.0 ...

Note that the old feature IDs do not participate in the numbering process, and the RNA
numbering is independent of the PEG numbering. In the discussion below of transaction
types, a field named I<newID> will always indicate one of these type/number pairs.
So, the field setup for the B<change> command is

    change fid newID locations aliases translation

and the I<newID> corresponds to the C<peg.0> in the example above.
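
The numbering scheme can be illustrated with a short sketch that tallies how many
new IDs each feature type needs, in the spirit of the B<count> command. This is
not the code in C<CountTransactions>; the helper name C<count_needed_ids> is
hypothetical, and it assumes every ADD and CHANGE record consumes exactly one new ID.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of TransactFeatures.pl: tally how many new IDs
# each feature type needs, given the lines of one transaction file.
sub count_needed_ids {
    my (@lines) = @_;
    my %needed;
    for my $line (@lines) {
        my ($command, @fields) = split /\t/, $line;
        $command = lc $command;
        # ADD carries the newID first; CHANGE carries it after the old feature ID.
        my $newID = $command eq 'add'    ? $fields[0]
                  : $command eq 'change' ? $fields[1]
                  :                        undef;
        # DELETE records (and unknown commands) need no new ID.
        next unless defined $newID;
        # A pseudo-ID is a feature type, a dot, and a 0-based ordinal.
        if ($newID =~ /^(\w+)\.\d+$/) {
            $needed{$1}++;
        }
    }
    return \%needed;
}

# The four-line example from the text above, with tabs between fields.
my $needed = count_needed_ids(
    "delete\tfig|83333.1.peg.2",
    "change\tfig|83333.1.peg.6\tpeg.0\t...",
    "add\tpeg.1\t...",
    "add\trna.0\t...",
);
print "$_: $needed->{$_}\n" for sort keys %$needed;
```

For the example file this reports two C<peg> IDs and one C<rna> ID, since the
deleted feature and the old ID in the change record do not take part in the numbering.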

The first field of each record is the transaction type. The list of subsequent fields
depends on this type.

=over 4

=item DELETE fid

Deletes a feature. The feature is marked as deleted in the FIG database, which
causes it to be skipped or ignored by most of the SEED software. The ID of the
feature to be deleted is the second field (I<fid>).

=item ADD newID locations translation

Adds a new feature. The I<newID> indicates the feature type and its ordinal number.
The location is a comma-separated list of location strings. The translation is the
protein translation for the location. If the translation is omitted, then it will
be generated from the location information in the normal way.

=item CHANGE fid newID locations aliases translation

Changes an existing feature. The current copy of the feature is marked as deleted,
and a new feature is created with a new ID. All annotations and assignments are
transferred from the deleted feature to the new one. The location is a
comma-separated list of location strings. The aliases are specified as a comma-delimited
list of alternate names for the feature. These replace any existing aliases for the
old feature. If the alias list is omitted, no aliases will be assigned to the new
feature. The translation is the protein translation for the location. If the
translation is omitted, then it will be generated from the location information in the
normal way.

=back
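
The record layouts above can be sketched as a small parser that names the fields
of one transaction line. The helper C<parse_transaction> is hypothetical and is
not how C<TransactionProcessor> stores its data; the location and translation
values in the usage lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split one tab-delimited transaction line into named
# fields, following the DELETE, ADD, and CHANGE layouts documented above.
sub parse_transaction {
    my ($line, $noAlias) = @_;
    chomp $line;
    my ($type, @fields) = split /\t/, $line;
    $type = lc $type;
    my %record = (type => $type);
    if ($type eq 'delete') {
        @record{qw(fid)} = @fields;
    } elsif ($type eq 'add') {
        @record{qw(newID locations translation)} = @fields;
    } elsif ($type eq 'change') {
        # Without an alias column, insert an empty one so that the
        # translation still lands in the last position.
        splice @fields, 3, 0, "" if $noAlias;
        @record{qw(fid newID locations aliases translation)} = @fields;
    } else {
        die "Invalid transaction type \"$type\"\n";
    }
    # Fields omitted from the line (such as the translation) stay undefined.
    return \%record;
}

# Usage: a CHANGE record from a file that carries no alias column.
my $rec = parse_transaction("change\tfig|83333.1.peg.6\tpeg.0\tcontigA_100_400\tMKV", 1);
print "newID=$rec->{newID} aliases='$rec->{aliases}' translation=$rec->{translation}\n";
```

Returning a hash keeps the optional trailing fields explicit: a missing
translation is simply undefined, which a caller can use as the signal to
generate it from the location.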

=head2 The ID File

The ID file is a tab-delimited file containing one record for each feature type
of each organism that has a transaction file. Each record consists of three
fields.

=over 4

=item orgID

The ID of the organism being updated.

=item ftype

The relevant feature type.

=item firstNumber

The first available ID number for the organism and feature type.

=back

The primary purpose of this file is to tell us how to create the feature IDs
for features we'll be adding to the data store, whether via a straight
B<add> or a B<change> that deletes an old ID and recreates the feature with a
new ID.

If we need new IDs for an organism not listed in this ID file, an error will be
thrown.
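
As a rough sketch of how such a file could be used, the following reads the
three-field records and turns a pseudo-ID into a real feature ID. Both helper
names are hypothetical, the numbers in the sample data are made up, and the rule
that the real number equals I<firstNumber> plus the 0-based ordinal is an
assumption, not something confirmed by this script.

```perl
use strict;
use warnings;

# Hypothetical helper: read ID-file records (orgID, ftype, firstNumber) from
# an open filehandle into a two-level hash keyed by organism and feature type.
sub load_id_file {
    my ($fh) = @_;
    my %firstNumber;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';
        my ($orgID, $ftype, $first) = split /\t/, $line;
        $firstNumber{$orgID}{$ftype} = $first;
    }
    return \%firstNumber;
}

# Hypothetical helper: turn a pseudo-ID such as "peg.1" into a real feature ID.
# ASSUMPTION: the real number is firstNumber plus the 0-based ordinal.
sub real_id {
    my ($firstNumber, $orgID, $newID) = @_;
    my ($ftype, $ordinal) = $newID =~ /^(\w+)\.(\d+)$/
        or die "Malformed pseudo-ID \"$newID\"\n";
    my $first = $firstNumber->{$orgID}{$ftype};
    # Mirror the documented behavior: a missing entry is an error.
    die "No IDs available for $ftype in organism $orgID\n" unless defined $first;
    return "fig|$orgID.$ftype." . ($first + $ordinal);
}

# Usage with in-memory sample data (values are illustrative only).
my $idData = "83333.1\tpeg\t4500\n83333.1\trna\t120\n";
open(my $fh, '<', \$idData) or die "Cannot open in-memory ID data: $!";
my $ids = load_id_file($fh);
close $fh;
print real_id($ids, '83333.1', 'peg.1'), "\n";
```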

=head2 Command-Line Options

The command-line options for this script are as follows.

=over 4

=item trace

Numeric trace level. A higher trace level causes more messages to appear. The
default trace level is 3.

=item safe

Wrap each organism's processing in a database transaction. This makes the process
slightly more restartable than it would be otherwise.

=item noAlias

Assume that the transaction files do not contain aliases. This means that in CHANGE
records the translation will immediately follow the location.

=item sql

Trace SQL commands.

=back

=cut

use strict;
use Tracer;
use DocUtils;
use TestUtils;
use Cwd;
use File::Copy;
use File::Path;
use FIG;
use Stats;
use TransactionProcessor;
use ApplyTransactions;
use CountTransactions;
use AnnotateTransactions;
use AttributeTransactions;
use FixTransactions;
use MoveAliases;

# Get the command-line options.
my ($options, @parameters) = Tracer::ParseCommand({ trace => 3, sql => 0, safe => 0, noAlias => 0 },
                                                  @ARGV);
# Get the command.
my $mainCommand = lc shift @parameters;
# Set up tracing.
my $traceLevel = $options->{trace};
my $tracing = "$traceLevel Tracer DocUtils FIG";
if ($options->{sql}) {
    $tracing .= " SQL";
}
TSetup($tracing, "TEXT");
# Get the FIG object.
my $fig = FIG->new();
# Create the transaction object.
my $controlBlock;
if ($mainCommand eq 'count' || $mainCommand eq 'register') {
    $controlBlock = CountTransactions->new($options, $mainCommand, @parameters);
} elsif ($mainCommand eq 'process') {
    $controlBlock = ApplyTransactions->new($options, $mainCommand, @parameters);
} elsif ($mainCommand eq 'annotate') {
    $controlBlock = AnnotateTransactions->new($options, $mainCommand, @parameters);
} elsif ($mainCommand eq 'fix' || $mainCommand eq 'check') {
    $controlBlock = FixTransactions->new($options, $mainCommand, @parameters);
} elsif ($mainCommand eq 'aliasmove') {
    $controlBlock = MoveAliases->new($options, $mainCommand, @parameters);
} elsif ($mainCommand eq 'attribute') {
    $controlBlock = AttributeTransactions->new($options, $mainCommand, @parameters);
} else {
    Confess("Invalid command \"$mainCommand\" specified on command line.");
}
# Setup the process.
$controlBlock->Setup();
# Verify that the organism directory exists.
if (! -d $parameters[0]) {
    Confess("Directory of genome files \"$parameters[0]\" not found.");
} else {
    # Here we have a valid directory, so we need the list of transaction
    # files in it.
    my $orgsFound = 0;
    my %transFiles = ();
    my @transDirectory = OpenDir($parameters[0], 1);
    # The next step is to create a hash of organism IDs to file names. This
    # saves us some painful parsing later.
    for my $transFileName (@transDirectory) {
        if ($transFileName =~ /^tbl_diff_(\d+\.\d+)$/) {
            $transFiles{$1} = "$parameters[0]/$transFileName";
            $orgsFound++;
        }
    }
    Trace("$orgsFound genome transaction files found in directory $parameters[0].") if T(2);
    if (! $orgsFound) {
        Confess("No \"tbl_diff\" files found in directory $parameters[0].");
    } else {
        # Loop through the organisms.
        for my $genomeID (sort keys %transFiles) {
            # Start this organism.
            Trace("Processing changes for $genomeID.") if T(3);
            my $orgFileName = $transFiles{$genomeID};
            $controlBlock->StartGenome($genomeID, $orgFileName);
            # Open the organism file.
            Open(\*TRANS, "<$orgFileName");
            # Clear the transaction counter.
            my $tranCount = 0;
            # Loop through the organism's data.
            while (my $transaction = <TRANS>) {
                # Parse the record.
                chomp $transaction;
                my @fields = split /\t/, $transaction;
                $tranCount++;
                # Save the record number in the control block.
                $controlBlock->{line} = $tranCount;
                # Process according to the transaction type.
                my $command = lc shift @fields;
                if ($command eq 'add') {
                    $controlBlock->Add(@fields);
                } elsif ($command eq 'delete') {
                    $controlBlock->Delete(@fields);
                } elsif ($command eq 'change') {
                    # Here we have a special case. If "noalias" is in effect, we need
                    # to splice an empty field in before the translation.
                    if ($controlBlock->Option("noAlias")) {
                        splice @fields, 3, 0, "";
                    }
                    $controlBlock->Change(@fields);
                } else {
                    $controlBlock->AddMessage("Invalid command $command in line $tranCount for genome $genomeID");
                }
                $controlBlock->IncrementStat($command);
            }
            # Terminate processing for this genome.
            my $orgStats = $controlBlock->EndGenome();
            Trace("Statistics for $genomeID\n\n" . $orgStats->Show() . "\n") if T(3);
            # Close the transaction input file.
            close TRANS;
        }
    }
    # Terminate processing.
    $controlBlock->Teardown();
    Trace("Statistics for this run\n\n" . $controlBlock->Show() . "\n") if T(1);
    Trace("Processing complete.") if T(1);
}


1;
