View of /FigKernelScripts/TransactFeatures.pl

Revision 1.20
Tue Feb 5 03:54:42 2008 UTC (11 years, 10 months ago) by parrello
Branch: MAIN
CVS Tags: mgrast_dev_08112011, rast_rel_2009_05_18, mgrast_dev_08022011, rast_rel_2014_0912, rast_rel_2008_06_18, myrast_rel40, rast_rel_2008_06_16, mgrast_dev_05262011, rast_rel_2008_12_18, mgrast_dev_04082011, rast_rel_2008_07_21, rast_rel_2010_0928, rast_2008_0924, mgrast_version_3_2, mgrast_dev_12152011, rast_rel_2008_04_23, mgrast_dev_06072011, rast_rel_2008_09_30, rast_rel_2009_0925, rast_rel_2010_0526, rast_rel_2014_0729, mgrast_dev_02212011, rast_rel_2010_1206, mgrast_release_3_0, mgrast_dev_03252011, rast_rel_2010_0118, mgrast_rel_2008_0924, mgrast_rel_2008_1110_v2, rast_rel_2009_02_05, rast_rel_2011_0119, mgrast_rel_2008_0625, mgrast_release_3_0_4, mgrast_release_3_0_2, mgrast_release_3_0_3, mgrast_release_3_0_1, mgrast_dev_03312011, mgrast_release_3_1_2, mgrast_release_3_1_1, mgrast_release_3_1_0, mgrast_dev_04132011, rast_rel_2008_10_09, mgrast_dev_04012011, rast_release_2008_09_29, mgrast_rel_2008_0806, mgrast_rel_2008_0923, mgrast_rel_2008_0919, rast_rel_2009_07_09, rast_rel_2010_0827, mgrast_rel_2008_1110, myrast_33, rast_rel_2011_0928, rast_rel_2008_09_29, mgrast_rel_2008_0917, rast_rel_2008_10_29, mgrast_dev_04052011, mgrast_dev_02222011, rast_rel_2009_03_26, mgrast_dev_10262011, rast_rel_2008_11_24, rast_rel_2008_08_07, HEAD
Changes since 1.19: +1 -2 lines
Removed obsolete use clause.

#!/usr/bin/perl -w
#
# Copyright (c) 2003-2006 University of Chicago and Fellowship
# for Interpretations of Genomes. All Rights Reserved.
#
# This file is part of the SEED Toolkit.
# 
# The SEED Toolkit is free software. You can redistribute
# it and/or modify it under the terms of the SEED Toolkit
# Public License. 
#
# You should have received a copy of the SEED Toolkit Public License
# along with this program; if not write to the University of Chicago
# at info@ci.uchicago.edu or the Fellowship for Interpretation of
# Genomes at veronika@thefig.info or download a copy from
# http://www.theseed.org/LICENSE.TXT.
#


=head1 Add / Delete / Change Features

This script runs through a set of transaction files, adding, deleting, and changing
features in the FIG data store. It takes three positional parameters. The first is
a command. The second specifies a directory full of transaction files. The third
specifies a file that tells us which feature IDs are available for each organism.

C<TransactFeatures> [I<options>] I<command> I<transactionDirectory> I<idFile>

The supported commands are

=over 4

=item count

Count the number of IDs needed to process the ADD and CHANGE transactions. This
will produce a listing of the number of feature IDs needed for each
organism and feature type. This command is mostly a sanity check: it provides
useful statistics without changing anything.

=item register

Create an ID file by requesting IDs from the clearinghouse. This performs the
same function as B<count>, but takes the additional step of creating an ID
file that can be used to process the transactions.

=item process

Process the transactions and update the FIG data store. This will also update
the NR file and queue features for similarity generation.

=item fudge

Convert transactions that have already been applied to new transactions that can
be used to test the transaction processor.

=back

=head2 The Transaction File

Each transaction file is a standard tab-delimited file, one transaction per line. The
name of the file is C<tbl_diff_>I<org> where I<org> is an organism ID. All records in
the transaction file refer to transactions against the organism encoded in the file
name.

The file must specify IDs for new features, but the real IDs cannot be known until
they are requested from the SEED clearing house. Therefore, each new ID is specified
in a special format consisting of the feature type (C<peg>, C<rna>, and so forth)
followed by a dot and the 0-based ordinal number of the new ID within that
feature type. So, for example, if the transaction file consists of a delete,
a change, and two adds, it might look like this:

    delete fig|83333.1.peg.2
    change fig|83333.1.peg.6 peg.0 ...
    add peg.1 ...
    add rna.0 ...

Note that the old feature IDs do not participate in the numbering process, and the RNA
numbering is independent of the PEG numbering. In the discussion below of transaction
types, a field named I<newID> will always indicate one of these type/number pairs.
So, the field setup for the B<change> command is

    change fid newID locations aliases translation

And the I<newID> corresponds to the C<peg.0> in the example above.
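
As an illustration only, the type/number pairs can be split apart and tallied
with a few lines of Perl. This is a minimal sketch, not the code used by the
transaction processor: the helper names C<ParseNewID> and C<CountNewIDs> are
hypothetical, and the sketch assumes that every new ID has the form
I<type>.I<number>.

    use strict;
    use warnings;

    # Hypothetical helper: split a new-ID field such as "peg.1" into its
    # feature type and 0-based ordinal. Returns an empty list if malformed.
    sub ParseNewID {
        my ($newID) = @_;
        if ($newID =~ /^([a-z]+)\.(\d+)$/i) {
            return ($1, $2);
        }
        return;
    }

    # Hypothetical helper: given tab-delimited transaction lines, count how
    # many new IDs each feature type needs. DELETE records need none.
    sub CountNewIDs {
        my (@lines) = @_;
        my %needed;
        for my $line (@lines) {
            my ($command, @fields) = split /\t/, $line;
            next unless defined $command;
            $command = lc $command;
            # ADD carries the new ID first; CHANGE carries it after the old ID.
            my $newID = $command eq 'add'    ? $fields[0]
                      : $command eq 'change' ? $fields[1]
                      :                        undef;
            next unless defined $newID;
            my ($type) = ParseNewID($newID);
            $needed{$type}++ if defined $type;
        }
        return \%needed;
    }

    # The four-line example above needs two PEG IDs and one RNA ID.
    my $counts = CountNewIDs("delete\tfig|83333.1.peg.2",
                             "change\tfig|83333.1.peg.6\tpeg.0\t...",
                             "add\tpeg.1\t...",
                             "add\trna.0\t...");
    print "$_: $counts->{$_}\n" for sort keys %$counts;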

The first field of each record is the transaction type. The list of subsequent fields
depends on this type.

=over 4

=item DELETE fid

Deletes a feature. The feature is marked as deleted in the FIG database, which
causes it to be skipped or ignored by most of the SEED software. The ID of the
feature to be deleted is the second field (I<fid>).

=item ADD newID locations translation

Adds a new feature. The I<newID> indicates the feature type and its ordinal number.
The location is a comma-separated list of location strings. The translation is the
protein translation for the location. If the translation is omitted, then it will
be generated from the location information in the normal way.

=item CHANGE fid newID locations aliases translation

Changes an existing feature. The current copy of the feature is marked as deleted,
and a new feature is created with a new ID. All annotations and assignments are
transferred from the deleted feature to the new one. The location is a
comma-separated list of location strings. The aliases are specified as a comma-delimited
list of alternate names for the feature. These replace any existing aliases for the
old feature. If the alias list is omitted, no aliases will be assigned to the new
feature. The translation is the protein translation for the location. If the
translation is omitted, then it will be generated from the location information in the
normal way.

=back
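
The field layouts above can be captured in a small parsing routine. The
following is a sketch under stated assumptions rather than the logic in
C<TransactionProcessor>: the function names C<SplitList> and
C<ParseTransaction> and the hash keys are hypothetical, and the location and
translation values in the usage line are made up. When the caller indicates
that the file has no alias field, an empty one is spliced in so that the
translation always lands in the same position.

    use strict;
    use warnings;

    # Hypothetical helper: split a comma-separated field into a list
    # reference, treating a missing field as an empty list.
    sub SplitList {
        my ($text) = @_;
        return defined $text ? [split /,/, $text] : [];
    }

    # Hypothetical helper: turn one tab-delimited transaction line into a
    # hash of named fields. Returns undef for an unknown transaction type.
    sub ParseTransaction {
        my ($line, $noAlias) = @_;
        chomp $line;
        my ($type, @fields) = split /\t/, $line;
        $type = defined $type ? lc $type : '';
        if ($type eq 'delete') {
            return { type => $type, fid => $fields[0] };
        } elsif ($type eq 'add') {
            my ($newID, $locations, $translation) = @fields;
            return { type => $type, newID => $newID,
                     locations => SplitList($locations),
                     translation => $translation };
        } elsif ($type eq 'change') {
            # Without aliases the translation directly follows the locations.
            splice @fields, 3, 0, '' if $noAlias;
            my ($fid, $newID, $locations, $aliases, $translation) = @fields;
            return { type => $type, fid => $fid, newID => $newID,
                     locations => SplitList($locations),
                     aliases => SplitList($aliases),
                     translation => $translation };
        }
        return undef;
    }

    # Illustrative values only: a CHANGE record with no alias field.
    my $record = ParseTransaction(
        "change\tfig|83333.1.peg.6\tpeg.0\tcontigA_100_400\tMKV", 1);
    print "New ID: $record->{newID}\n";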

=head2 The ID File

The ID file is a tab-delimited file containing one record for each feature type
of each organism that has a transaction file. Each record consists of three
fields.

=over 4

=item orgID

The ID of the organism being updated.

=item ftype

The relevant feature type.

=item firstNumber

The first available ID number for the organism and feature type.

=back

This file's primary purpose is to tell us how to create the feature IDs
for features we'll be adding to the data store, whether via a straight
B<add> or a B<change> that deletes an old ID and recreates the feature with a
new ID.

If we need new IDs for an organism not listed in this ID file, an error will be
thrown.
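
To make the role of the ID file concrete, here is a hedged sketch of how a
I<newID> might be turned into a real feature ID. It assumes that the real ID
number is the record's I<firstNumber> plus the 0-based ordinal from the
transaction file, and that feature IDs have the form
C<fig|>I<orgID>C<.>I<ftype>C<.>I<number> seen in the transaction example.
Neither assumption is confirmed by this documentation, and the function names
C<LoadIDTable> and C<RealFeatureID> are hypothetical.

    use strict;
    use warnings;

    # Hypothetical helper: build a lookup table from ID-file lines, each of
    # which holds orgID, ftype and firstNumber separated by tabs.
    sub LoadIDTable {
        my (@lines) = @_;
        my %table;
        for my $line (@lines) {
            chomp $line;
            my ($orgID, $ftype, $firstNumber) = split /\t/, $line;
            $table{$orgID}{$ftype} = $firstNumber;
        }
        return \%table;
    }

    # Hypothetical helper: compute the real ID for a new feature.
    # ASSUMPTION: real number = firstNumber + ordinal.
    # Dies if the ID table has no entry for the organism and feature type.
    sub RealFeatureID {
        my ($table, $orgID, $newID) = @_;
        my ($ftype, $ordinal) = $newID =~ /^([a-z]+)\.(\d+)$/i
            or die "Malformed new ID \"$newID\".\n";
        die "No $ftype IDs available for organism $orgID.\n"
            unless exists $table->{$orgID}
                && defined $table->{$orgID}{$ftype};
        my $number = $table->{$orgID}{$ftype} + $ordinal;
        return "fig|$orgID.$ftype.$number";
    }

    # Made-up ID-file contents for illustration.
    my $table = LoadIDTable("83333.1\tpeg\t4500", "83333.1\trna\t120");
    print RealFeatureID($table, "83333.1", "peg.1"), "\n";

With the made-up table above, C<peg.1> resolves to C<fig|83333.1.peg.4501>
under the stated assumption, and a request for an organism missing from the
table dies with an error, mirroring the behavior described for the ID file.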

=head2 Command-Line Options

The command-line options for this script are as follows.

=over 4

=item trace

Numeric trace level. A higher trace level causes more messages to appear. The
default trace level is 2.

=item safe

Wrap each organism's processing in a database transaction. This makes the process
slightly more restartable than it would be otherwise.

=item noAlias

Assume that the transaction files do not contain aliases. This means that in CHANGE
records the translation will immediately follow the location.

=item sql

Trace SQL commands.

=item tblFiles

Output TBL files containing the corrected IDs. (B<process> command only)

=item start

ID of the first genome to process. This allows restarting a transaction run that failed
in the middle. The default is to run all transaction files.

=back

=cut

use strict;
use Tracer;
use Cwd;
use File::Copy;
use File::Path;
use FIG;
use Stats;
use TransactionProcessor;
use ApplyTransactions;
use CountTransactions;
use FudgeTransactions;

# Get the command-line options.
my ($options, @parameters) = StandardSetup(["FIG"],
                    { safe => [0, "use database transactions"],
                      trace => [2, "trace level"],
                      noAlias => [0, "do not expect aliases in CHANGE transactions"],
                      start => [' ', "start with this genome"],
                      tblFiles => [0, "output TBL files containing the corrected IDs"] },
                    "command transactionDirectory IDfile",
                  @ARGV);
# Get the command.
my $mainCommand = lc shift @parameters;
# Get the FIG object.
my $fig = FIG->new();
# Create the transaction object.
my $controlBlock;
if ($mainCommand eq 'count' || $mainCommand eq 'register') {
    $controlBlock = CountTransactions->new($options, $mainCommand, @parameters);
} elsif ($mainCommand eq 'process') {
    $controlBlock = ApplyTransactions->new($options, $mainCommand, @parameters);
} elsif ($mainCommand eq 'fudge') {
    $controlBlock = FudgeTransactions->new($options, $mainCommand, @parameters);
} else {
    Confess("Invalid command \"$mainCommand\" specified on command line.");
}
# Setup the process.
$controlBlock->Setup();
# Verify that the organism directory exists.
if (! -d $parameters[0]) {
    Confess("Directory of genome files \"$parameters[0]\" not found.");
} else {
    # Here we have a valid directory, so we need the list of transaction
    # files in it.
    my $orgsFound = 0;
    my %transFiles = ();
    my @transDirectory = OpenDir($parameters[0], 1);
    # Pull out the "start" option value. This will be a space if all genomes should
    # be processed, in which case it will always compare less than the genome ID.
    my $startGenome = $options->{start};
    # The next step is to create a hash of organism IDs to file names. This
    # saves us some painful parsing later.
    for my $transFileName (@transDirectory) {
        # Parse the file name. This will only match if it's a real transaction file.
        if ($transFileName =~ /^tbl_diff_(\d+\.\d+)$/) {
            # Get the genome ID.
            my $genomeID = $1;
            # If we're skipping, only include this genome ID if it's equal to
            # or greater than the start value.
            if ($genomeID ge $startGenome) {
                $transFiles{$1} = "$parameters[0]/$transFileName";
                $orgsFound++;
            }
        }
    }
    Trace("$orgsFound genome transaction files found in directory $parameters[0].") if T(2);
    if (! $orgsFound) {
        Confess("No \"tbl_diff\" files found in directory $parameters[0].");
    } else {
        # Loop through the organisms.
        for my $genomeID (sort keys %transFiles) {
            # Start this organism.
            Trace("Processing changes for $genomeID.") if T(3);
            my $orgFileName = $transFiles{$genomeID};
            $controlBlock->StartGenome($genomeID, $orgFileName);
            # Open the organism file.
            Open(\*TRANS, "<$orgFileName");
            # Clear the transaction counter.
            my $tranCount = 0;
            # Loop through the organism's data.
            while (my $transaction = <TRANS>) {
                # Parse the record.
                chomp $transaction;
                my @fields = split /\t/, $transaction;
                $tranCount++;
                # Save the record number in the control block.
                $controlBlock->{line} = $tranCount;
                # Process according to the transaction type.
                my $command = lc shift @fields;
                if ($command eq 'add') {
                    $controlBlock->Add(@fields);
                } elsif ($command eq 'delete') {
                    $controlBlock->Delete(@fields);
                } elsif ($command eq 'change') {
                    # Here we have a special case. If "noAlias" is in effect, the
                    # CHANGE record has no alias field, so we splice an empty one
                    # in before the translation.
                    if ($controlBlock->Option("noAlias")) {
                        splice @fields, 3, 0, "";
                    }
                    $controlBlock->Change(@fields);
                } else {
                    $controlBlock->AddMessage("Invalid command $command in line $tranCount for genome $genomeID");
                }
                $controlBlock->IncrementStat($command);
            }
            # Close the transaction input file.
            close TRANS;
            # Terminate processing for this genome.
            my $orgStats = $controlBlock->EndGenome();
            Trace("Statistics for $genomeID\n\n" . $orgStats->Show() . "\n") if T(3);
        }
    }
    # Terminate processing.
    $controlBlock->Teardown();
    Trace("Statistics for this run\n\n" . $controlBlock->Show() . "\n") if T(1);
    Trace("Processing complete.") if T(1);
}


1;
