View of /Sprout/SimLoad.pl

Revision 1.10
Mon Feb 13 15:42:48 2006 UTC by parrello
Branch: MAIN
Changes since 1.9: +7 -2 lines
Fixed to work even if the config variables are undefined.

#!/usr/bin/perl -w

use strict;
use CGI;
use Tracer;
use Genome;
use ERDBLoad;
use DBKernel;
use SimBlocks;
use File::Path;
use BasicLocation;
use Cwd;

=head1 Similarity Block Loader

This script loads the similarity block database from
the input files. The load process involves two steps:
converting the input files into C<dtx> load files
(B<generate>), and loading the C<dtx> files into the
database (B<load>).

The script takes a single parameter: a directory name.
The default directory name is C<$FIG_Config::data/CloseStrainSets>.
The input files should be in subdirectories called
C<Blocks> under the subdirectories of the input directory.
The subdirectory names themselves are considered the name
of the close-strain set. So, for example,

    Data/CloseStrainSets/Vibrio/Blocks

would be presumed to contain the Vibrio strains.

The output files will be produced in the similarity block
data directory C<$FIG_Config::simBlocksData>, which will be
created if it does not exist. If that configuration variable
is undefined, C<$FIG_Config::fig/BlockData> is used instead.
The input directory and all its subdirectories will be
processed for input files.

In addition to the directory name, the following
command-line options are supported.

=over 4

=item trace

Trace level for output messages. A higher number means more
messages. The default is C<3>. Trace messages are sent to
the file C<trace.log> in the C<$FIG_Config::tmp>
directory.

=item sql

If specified, SQL activity will be traced at the specified
trace level.

=item load

C<yes> to load the data into the database, else C<no>.
The default is C<yes>.

=item generate

C<yes> to generate output files from input files, else
C<no>. The default is C<yes>.

=item createDB

If specified, the database will be dropped and
re-created before loading.

=back

For example, the following command line will process the
input files in the C</Users/fig/BlastData> directory tree
and run at a trace level of 3.

C<< SimLoad -trace=3 /Users/fig/BlastData >>

The following command line converts the input files in
the default directory into load files but does not load
the database and runs at a trace level of 2.

C<< SimLoad -load=no -trace=2 >>

=head2 Input Directory

The following files must exist in each C<Blocks> directory
under the input directory.

=over 4

=item genome.tbl

This is a tab-delimited file that contains the ID of each
genome followed by a description string.

=item block.tbl, intergenic_block.tbl

These are tab-delimited files that associate a gene name
with each block. The InterGenic file is optional.

=item region.tbl, intergenic_region.tbl

These are tab-delimited files that describe each region
of a block. The InterGenic file is optional.

=back
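
As an illustration, a typical C<Blocks> directory (using the
C<Vibrio> set from the example above) might contain

    Data/CloseStrainSets/Vibrio/Blocks/genome.tbl
    Data/CloseStrainSets/Vibrio/Blocks/block.tbl
    Data/CloseStrainSets/Vibrio/Blocks/region.tbl
    Data/CloseStrainSets/Vibrio/Blocks/intergenic_block.tbl
    Data/CloseStrainSets/Vibrio/Blocks/intergenic_region.tbl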

The format of each file is given below.

=head3 genome.tbl

The C<genome.tbl> file is copied almost unmodified to the
load file for the B<Genome> entity. Each record
represents a single genome. It has the following
fields.

=over 4

=item genomeID

The ID of the genome.

=item description

A text description of the genome (usually the species name with
a strain ID).

=item groupName

The name of the group to which the genome belongs. This field
is not present in the input file; the loader supplies it from
the name of the close-strain set subdirectory.

=back
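
For example, a hypothetical input record (tab-delimited; the ID
and name shown here are invented) might look like this:

    83333.1    Escherichia coli K12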

=head3 block.tbl, intergenic_block.tbl

These files produce most of the data found in the B<GroupBlock>
entity. Each record represents a single block. Blocks either
correspond to genes or to inter-genic regions. Both types
of blocks may appear in multiple locations in multiple
contigs. The files should be sorted by block ID.

=over 4

=item blockID

The unique ID of the block. This ID is also used in the
C<region.tbl> file.

=item blockName

The name of the block. For a gene block, this is the gene
name. For an inter-genic block, it is a name computed
from the names of the genes that are commonly nearby.

=back
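
For example, a hypothetical record for a gene block (the values
are invented) might look like this:

    100    thrB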

=head3 region.tbl, intergenic_region.tbl

These files describe the regions in each block. They are
used to derive the relationships between genomes and
contigs (B<ConsistsOf>), the contigs themselves
(B<Contig>), the relationships between blocks and
contigs (B<ContainsRegionIn>), and the derived
relationship between genomes and blocks
(B<HasInstanceOf>). The files must be sorted by block
ID, and each record in a file represents a single
region in a contig. Each region belongs to a
single block. Note that the C<region.tbl> file contains
the regions for the C<block.tbl> file, and the
C<intergenic_region.tbl> file contains the regions for
the C<intergenic_block.tbl> file. No mixing is allowed.

=over 4

=item regionPEG

PEG ID for this region. If the region is in an
inter-genic block, this field will be composed of
the IDs for the neighboring genes.

=item genomeID

ID of the relevant genome.

=item contigID

ID of the contig containing this region. This is a standard contig
ID that does not include the genome ID. It will be converted to
a Sprout-style contig ID (which includes the genome ID) before
it is written to the output files.

=item begin

The start point of the region. For a forward region this is the
left endpoint; for a reverse region it is the right endpoint. It
is a 1-based offset (which is consistent with Sprout usage), and
the identified location is inside the region.

=item end

The end point of the region. For a forward region this is the
right endpoint; for a reverse region it is the left endpoint. It
is a 1-based offset (which is consistent with Sprout usage), and
the identified location is inside the region.

=item blockID

The ID of the block containing this region.

=item snippet

A DNA snippet representing the contents of the region. The region
may be shorter than the block length. If that is the case, the
snippet will contain insertion characters (C<->). So, while it
is not the case that every region in a block must be the same
length, all of the snippets for a block must be the same length.
The snippets will be in alignment form. In other words, if the
region is reversed, the nucleotide letters will be the complement
in transcription order. (For example, if positions 20 through 25
of contig B<XXX> are C<AGCCTT>, then the snippet for C<XXX_25_20>
will be C<AAGGCT>.)

=back
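
For example, a hypothetical record (tab-delimited, shown here
with invented values) might look like this:

    fig|83333.1.peg.4    83333.1    NC_000913    20    25    100    AGCCTT

On output, the contig ID would become the Sprout-style
C<83333.1:NC_000913>, and the corresponding region string would
be C<83333.1:NC_000913_20_25>.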

=head2 Output File Notes

=over 4

=item Genome.dtx

This file is a direct copy of the C<genome.tbl> file; however, we
also use it to create a hash of genome IDs (C<%genomes>). The hash
is useful when validating the C<region.tbl> file.

=item Contig.dtx

This file contains nothing but contig IDs. As contigs are
discovered from the C<region.tbl> file their IDs are put
into the C<%contigs> hash. This hash maps contig IDs to
their parent genome IDs. When processing is complete,
this file is generated from the hash.

=item GroupBlock.dtx

This file describes the blocks. As records come in from
C<region.tbl>, we build a hash called C<%blockData> that
contains our latest estimate of all the C<GroupBlock.dtx>
columns for the current block (with the exception of
B<variance>, which is computed by dividing the B<snip-count>
by the length, B<len>).
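
The computation itself is a single line in the script below:

    my $variance = $blockData{"snip-count"} / $blockData{len};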

=item ConsistsOf.dtx

This file maps genomes to contigs, and is generated from
the C<%contigs> hash built while reading the C<region.tbl>
file.

=item HasInstanceOf.dtx

This file lists the genomes containing each block. The
C<region.tbl> file is sorted by block. While inside a
block's section of the file, we use a hash called
C<%genomesFound> that contains the ID of every genome
found for the block. When we finish with a block,
we run through the C<%genomesFound> hash to produce
the block's B<HasInstanceOf> data.
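
The loop that produces this data, excerpted from the script below:

    for my $genomeID (keys %genomesFound) {
        $HasInstanceOfLoad->Put($genomeID, "$groupName:$blockID");
    }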

=item Region.dtx

This file describes the contig regions in the blocks.
As the C<region.tbl> file is read in, we build a
hash called C<%regionMap> that maps a region's
SEED-style location string to the DNA content.
When we finish with a block, the DNA content is
converted into an alignment by comparing it to
the block's pattern in C<%blockData>. (Essentially,
we only include the region's content for the
positions that vary between regions in the block.)
From this and the region string itself, we have
enough data to create the B<Region>
data.
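
A condensed excerpt of that extraction from the script below,
where C<@positions> lists the variance points in the block
pattern and C<$source> is the region's DNA content from
C<%regionMap>:

    my @positions = SimBlocks::ParsePattern($blockData{pattern});
    my $content = "";
    for my $pos (@positions) {
        $content .= substr $source, $pos, 1;
    }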

=item IncludesRegion.dtx

This file maps group blocks to regions. The first column
is the block ID and the second column is the SEED-style
region string for the target region. This file is built
in parallel with C<Region.dtx>. It will have one record
for each region.

=item ContainsRegion.dtx

This file maps contigs to regions. The first column is
the contig ID and the second column is the SEED-style
location string for the region. It contains two redundant
columns used for sorting: the region length (column 3)
and the left-most region position (column 4). This
file is built in parallel with C<Region.dtx>. It will
have one record for each region.
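
A hypothetical record (with invented values; the columns are
contig ID, region string, length, and left-most position):

    83333.1:NC_000913    83333.1:NC_000913_20_25    6    20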

=back

=cut

# Create a huge number we can use for an end-of-file
# indicator in the block ID.
my $TRAILER = 999999999;

# Parse the command line.
my ($options, @arguments) = StandardSetup(['SimBlocks'],
                                          { load => ['yes', "'no' to suppress loading the database"],
                                            generate => ['yes', "'no' to suppress generating the load files"],
                                            createDB => [0, 'drop and create the database before loading']
                                          },
                                          "directoryName",
                                          @ARGV);
# Extract the directory name from the argument array.
my $inDirectoryTree = ($arguments[0] ? Cwd::abs_path($arguments[0]) : "$FIG_Config::data/CloseStrainSets");
# Check to see if we need to create the database.
if ($options->{createDB}) {
    my $dbname = SimBlocks::DBName();
    Trace("Creating database $dbname.") if T(2);
    DBKernel::CreateDB($dbname);
}
# Get the output directory.
my $outDirectory = $FIG_Config::simBlocksData;
# Default it if it isn't present.
if (! $outDirectory) {
    $outDirectory = "$FIG_Config::fig/BlockData";
}
# Insure that it exists.
if (! -d $outDirectory) {
    Trace("Creating output directory $outDirectory.") if T(2);
    mkpath($outDirectory);
} elsif ($options->{generate} eq 'yes') {
    # Here we have an output directory already and are going to generate new
    # load files. Clear any leftover data from previous runs.
    my @files = grep { $_ =~ /\.dtx$/ } Tracer::OpenDir($outDirectory);
    my $numFiles = @files;
    if ($numFiles > 0) {
        Trace("Deleting $numFiles old dtx files from $outDirectory.") if T(2);
        unlink map { "$outDirectory/$_" } @files;
    }
}
# Create an error counter and a directory counter.
my $errorCount = 0;
my $dirCount = 0;
# Get a similarity block object. Note that we've waited until the database
# was already created before doing this. The tables, however, may not yet
# exist, so we can't do very much with it.
my $simBlocks = SimBlocks->new();
# Determine if this is a load-only run.
my $loadOnly = ($options->{generate} eq 'no');
# Create the loader objects for the tables. These are global to the whole
# script.
my $GenomeLoad = ERDBLoad->new($simBlocks, 'Genome', $outDirectory, $loadOnly);
my $ContigLoad = ERDBLoad->new($simBlocks, 'Contig', $outDirectory, $loadOnly);
my $GroupBlockLoad = ERDBLoad->new($simBlocks, 'GroupBlock', $outDirectory, $loadOnly);
my $RegionLoad = ERDBLoad->new($simBlocks, 'Region', $outDirectory, $loadOnly);
my $ConsistsOfLoad = ERDBLoad->new($simBlocks, 'ConsistsOf', $outDirectory, $loadOnly);
my $ContainsRegionLoad = ERDBLoad->new($simBlocks, 'ContainsRegion', $outDirectory, $loadOnly);
my $HasInstanceOfLoad = ERDBLoad->new($simBlocks, 'HasInstanceOf', $outDirectory, $loadOnly);
my $IncludesRegionLoad = ERDBLoad->new($simBlocks, 'IncludesRegion', $outDirectory, $loadOnly);
# This hash helps us to clean up when we're done.
my $loaderList = { Genome => $GenomeLoad, Contig => $ContigLoad,
                   GroupBlock => $GroupBlockLoad, Region => $RegionLoad,
                   ConsistsOf => $ConsistsOfLoad, ContainsRegion => $ContainsRegionLoad,
                   HasInstanceOf => $HasInstanceOfLoad,
                   IncludesRegion => $IncludesRegionLoad};
# Check to see if we should generate the output files.
if ($loadOnly) {
    # Here we are to use existing output files.
    Trace("Existing database load files will be used.") if T(2);
} else {
    # Here we need to produce new output files.
    # Verify that the input directory exists.
    if (! -d $inDirectoryTree) {
        Confess("Input directory \"$inDirectoryTree\" not found.");
    }
    # Loop through the subdirectories.
    for my $inputDirectory (Tracer::OpenDir($inDirectoryTree, 1)) {
        # Verify that this is a directory.
        my $inDirectory = "$inDirectoryTree/$inputDirectory/Blocks";
        if (-d $inDirectory) {
            # Here we have a directory to process. Check for a genome
            # file.
            my $genomeFileName = "$inDirectory/genome.tbl";
            if (! -e $genomeFileName) {
                Trace("$genomeFileName not found. Directory skipped.") if T(1);
            } else {
                # Now we can process the directory and accumulate the error
                # count.
                $errorCount += ProcessDirectory($inDirectory, $outDirectory, $inputDirectory);
                $dirCount++;
            }
        }
    }
    Trace("Load files generated from $dirCount directories.") if T(2);
    # Now finish up the loading.
    my $stats = Stats->new();
    for my $table (keys %{$loaderList}) {
        Trace("Finishing $table.") if T(2);
        my $tableThing = $loaderList->{$table};
        my $newStats = $tableThing->Finish();
        $stats->Accumulate($newStats);
    }
    Trace("Statistics for generate.\n" . $stats->Show()) if T(2);
}
# Check for errors.
if ($errorCount > 0) {
    Trace("$errorCount errors found in input files.") if T(0);
} else {
    # No errors, so it's okay to load the database.
    if ($options->{load} eq 'yes') {
        # Here we have no outstanding errors and the user wants us to load
        # the database, so we use the similarity block object created
        # earlier. Note we specify that the tables are to be dropped and
        # rebuilt.
        $simBlocks->LoadTables($outDirectory, 1);
        Trace("Database loaded.") if T(2);
    }
}

# Process a single input directory.
sub ProcessDirectory {
    my ($inDirectory, $outDirectory, $groupName) = @_;
    Trace("Processing directory $inDirectory.") if T(2);
    # Our first task is to copy the genome data to the output
    # and add the genomes to the genome list.
    my %genomes = ();
    Open(\*GENOMESIN, "<$inDirectory/genome.tbl");
    # Count the genomes read and errors found.
    my $genomeCount = 0;
    my $errorCount = 0;
    # Loop through the input.
    while (my $genomeData = <GENOMESIN>) {
        # Echo the genome record to the output.
        my @fields = Tracer::ParseRecord($genomeData);
        $GenomeLoad->Put(@fields, $groupName);
        # Extract the genome ID.
        my $genomeID = $fields[0];
        # Store it in the genomes hash. We start with a value of 0. If
        # contig information for the genome is found, we change the value
        # to 1. When we're all done with the regions, we can check the
        # hash to insure all the genomes were represented in the input.
        $genomes{$genomeID} = 0;
        # Count this genome.
        $genomeCount++;
    }
    Trace("$genomeCount genomes found.") if T(2);
    # Close the genome file.
    close GENOMESIN;
    # Create the contig hash used to associate contigs to their parent
    # genomes.
    my %contigs = ();
    # Now we begin to read the Block and Region files in parallel. Both
    # are sorted by block ID, so all processing for this section of the
    # script is done a block at a time. First, we determine which file
    # sets we'll be processing. There are at most two: regular and
    # intergenic.
    my @fileSets = ();
    my @prefixes = ("", "intergenic_");
    for my $prefix (@prefixes) {
        if (-e "$inDirectory/${prefix}block.tbl") {
            push @fileSets, $prefix;
        }
    }
    # Set up the duplicate-region check.
    my %allRegions = ();
    # Set up some counters.
    my ($blocksCount, $regionsCount) = (0, 0);
    # Loop through the useful file sets.
    for my $fileSet (@fileSets) {
        Open(\*BLOCKSIN, "<$inDirectory/${fileSet}block.tbl");
        Open(\*REGIONSIN, "<$inDirectory/${fileSet}region.tbl");
        Trace("Processing ${fileSet}Blocks.") if T(2);
        # The outer loop processes blocks. This is accomplished by reading
        # through the block file. We prime the loop by reading the first
        # region record. This is because we finish processing a block when
        # the first record of the next block is found in the region file.
        my %regionRecord = GetRegionRecord();
        $regionsCount++;
        while (my $blockRecord = <BLOCKSIN>) {
            $blocksCount++;
            # Parse the block record.
            my ($blockID, $blockName, $pegID) = Tracer::ParseRecord($blockRecord);
            # Create the block data for this block.
            my %blockData = ( id => $blockID, description => $blockName );
            # Initialize the tracking hashes. "genomesFound" tracks the
            # genomes whose contigs are represented by the block,
            # "regionMap" maps each region to its contents, and
            # "regionPeg" maps each region to its PEG (if any).
            my %genomesFound = ();
            my %regionMap = ();
            my %regionPeg = ();
            # Count the number of regions found in this block.
            my $regionCounter = 0;
            # Loop through the regions in the block. Because of the way
            # "GetRegionRecord" works, the "blockID" field will have an
            # impossibly high value if we've hit end-of-file in the
            # region input file.
            while ($regionRecord{blockID} <= $blockID) {
                # If this region's block ID is invalid, complain
                # and skip it.
                if ($regionRecord{blockID} != $blockID) {
                    Trace("Block $regionRecord{blockID} in region record $regionsCount not found in block input file at record $blocksCount.") if T(0);
                    $errorCount++;
                } else {
                    # Here both files are in sync, which is good. The next step is
                    # to connect with the Genome and the Contig.
                    my $genomeID = $regionRecord{genomeID};
                    my $contigID = "$genomeID:$regionRecord{contigID}";
                    if (! exists $genomes{$genomeID}) {
                        Trace("Genome $genomeID in region record $regionsCount not found in genome input file.") if T(0);
                        $errorCount++;
                    } else {
                        # Denote this genome has an instance of this block.
                        $genomesFound{$genomeID} = 1;
                        # Denote this genome has occurred in the region file.
                        $genomes{$genomeID} = 1;
                        # Connect the contig to the genome.
                        $contigs{$contigID} = $genomeID;
                        # Now we need to process the snippet. First, we create a
                        # region string.
                        my $regionString = "${contigID}_$regionRecord{begin}_$regionRecord{end}";
                        # Next, we stuff the snippet and PEG in the region's hash entries.
                        my $snippet = $regionRecord{snippet};
                        $regionMap{$regionString} = $snippet;
                        $regionPeg{$regionString} = $regionRecord{peg};
                        # Check to see if this is the block's first snippet.
                        if (! exists $blockData{pattern}) {
                            # Here it is, so store the snippet as the pattern.
                            $blockData{pattern} = $snippet;
                            $blockData{"snip-count"} = 0;
                            $blockData{len} = length $snippet;
                        } elsif ($blockData{len} != length $snippet) {
                            # Here it is not the first, but the lengths do not match.
                            Trace("Snippet for region record $regionsCount does not match block length $blockData{len}.") if T(0);
                            $errorCount++;
                        } else {
                            # Here everything is legitimate, so we merge the new
                            # snippet into the pattern.
                            ($blockData{pattern}, $blockData{"snip-count"}) =
                                SimBlocks::MergeDNA($blockData{pattern}, $snippet);
                        }
                    }
                    # Count this region.
                    $regionCounter++;
                }
                # Get the next region record.
                %regionRecord = GetRegionRecord();
            }
            # We have now processed all the regions in the block. Insure we found at least
            # one.
            if (! $regionCounter) {
                Trace("No regions found for block $blockID at $blocksCount in block input file.") if T(0);
                $errorCount++;
            } else {
                Trace("$regionCounter regions found in block $blockID.") if T(4);
                # Write the block record. Note that the block ID is prefixed by the group name to
                # make it unique.
                my $variance = $blockData{"snip-count"} / $blockData{len};
                $GroupBlockLoad->Put("$groupName:$blockID", $blockData{description}, $blockData{len},
                                     $blockData{"snip-count"}, $variance, $blockData{pattern});
                # Find all the variance points in the block pattern. We'll use them to create
                # the content strings for each region.
                my @positions = SimBlocks::ParsePattern($blockData{pattern});
                # Loop through the regions, writing them out to the region output file.
                for my $region (keys %regionMap) {
                    if (length($region) > 80) {
                        Trace("Invalid region key \"$region\".") if T(1);
                    }
                    # Get the region's snips.
                    my $source = $regionMap{$region};
                    my $content = "";
                    for my $pos (@positions) {
                        $content .= substr $source, $pos, 1;
                    }
                    # Get the region's location data.
                    my $location = BasicLocation->new($region);
                    # Write this region to the output files.
                    $RegionLoad->Put($region, $location->Contig, $location->Dir,
                                     $location->Right, $location->Length,
                                     $regionPeg{$region}, $location->Left, $content);
                    $ContainsRegionLoad->Put($location->Contig, $region, $location->Length,
                                             $location->Left);
                    $IncludesRegionLoad->Put("$groupName:$blockID", $region);
                }
                # Finally, we need to connect this block to the genomes in which it occurs.
                for my $genomeID (keys %genomesFound) {
                    $HasInstanceOfLoad->Put($genomeID, "$groupName:$blockID");
                }
                # Count this block's regions.
                $regionsCount += $regionCounter;
            }
        }
        # Close the input files.
        close BLOCKSIN;
        close REGIONSIN;
    }
    # Close the output files.
    # All the block data has been written. Tell the user what we found.
    Trace("$blocksCount blocks processed, $regionsCount regions processed.") if T(2);
    # The next task is to write the genome/contig data. This is provided by the
    # "%contigs" hash.
    my $contigsCount = 0;
    for my $contigID (keys %contigs) {
        $ContigLoad->Put("$contigID");
        $ConsistsOfLoad->Put($contigs{$contigID}, $contigID);
        $contigsCount++;
    }
    Trace("$contigsCount contigs found.") if T(2);
    # Now warn the user about all the genomes that didn't have blocks.
    for my $genomeID (keys %genomes) {
        if (! $genomes{$genomeID}) {
            Trace("Genome $genomeID did not have any regions.") if T(1);
            $errorCount++;
        }
    }
    return $errorCount;
}
# Tell the user we're done.
Trace("Processing complete.") if T(0);

# Read a region record from the file and parse it into a hash
# for return to the caller. If we reach end-of-file, the
# hash returned will have $TRAILER in the blockID field.
sub GetRegionRecord {
    # Create the return hash.
    my %retVal = ();
    # Read the record.
    my $regionData = <REGIONSIN>;
    # Check for end-of-file.
    if (!defined $regionData) {
        # Here we have end-of-file, so stuff in a trailer
        # value for the block ID.
        $retVal{blockID} = $TRAILER;
    } else {
        # Here we have a real record.
        ($retVal{peg}, $retVal{genomeID}, $retVal{contigID},
            $retVal{begin}, $retVal{end}, $retVal{blockID},
            $retVal{snippet}) = Tracer::ParseRecord($regionData);
    }
    # Return the hash created.
    return %retVal;
}

1;
