[Bio] / Sprout / SimBlocks.pm Repository:
ViewVC logotype

View of /Sprout/SimBlocks.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.11 - (download) (as text) (annotate)
Thu Dec 6 14:58:03 2007 UTC (11 years, 10 months ago) by parrello
Branch: MAIN
CVS Tags: mgrast_dev_08112011, mgrast_dev_08022011, rast_rel_2008_06_18, rast_rel_2008_06_16, rast_rel_2008_12_18, mgrast_dev_04082011, rast_rel_2008_07_21, rast_rel_2010_0928, rast_2008_0924, mgrast_dev_06072011, rast_rel_2008_09_30, rast_rel_2009_0925, rast_rel_2010_0526, rast_rel_2009_05_18, mgrast_dev_03252011, mgrast_release_3_0, rast_rel_2010_1206, rast_rel_2010_0118, mgrast_rel_2008_0924, mgrast_rel_2008_1110_v2, rast_rel_2009_02_05, rast_rel_2011_0119, mgrast_rel_2008_0625, mgrast_release_3_0_4, mgrast_release_3_0_2, mgrast_release_3_0_3, mgrast_release_3_0_1, mgrast_dev_03312011, mgrast_release_3_1_1, mgrast_release_3_1_0, mgrast_dev_04132011, rast_rel_2008_10_09, mgrast_dev_04012011, rast_release_2008_09_29, mgrast_rel_2008_0806, mgrast_rel_2008_0923, mgrast_rel_2008_0919, rast_rel_2009_07_09, rast_rel_2010_0827, mgrast_rel_2008_1110, myrast_33, rast_rel_2008_04_23, rast_rel_2008_09_29, mgrast_rel_2008_0917, rast_rel_2008_10_29, mgrast_dev_04052011, rast_rel_2009_03_26, rast_rel_2008_11_24, rast_rel_2008_08_07
Changes since 1.10: +26 -26 lines
Changed POD format for better compatability with Wiki.

package SimBlocks;

    require Exporter;
    use ERDB;
    @ISA = qw(Exporter ERDB);
    @EXPORT = qw(DefaultDistances);
    @EXPORT_OK = qw(BlocksInSet GetBlock GetBlockPieces CompareGenomes DBLoad DistanceMatrix GetAlignment GetRegions ParsePattern RemoveBlocks SequenceDistance SetNumber SnipScan);

    use strict;
    use Tracer;
    use PageBuilder;
    use Genome;
    use BasicLocation;
    use Overlap;
    use FIG_Config;
    use FIG;

=head1 Similarity Block Database

=head2 Introduction

A I<similarity Block Database> describes the similarities amongst a group
of contigs. The load files for the database are produced by the
C<SimBlast.pl> script from genome FASTA files. The database is described
by the C<SimBlocksDBD.xml> file.

The primary entities of the database are B<Genome>, which represents a specified
organism, B<GroupBlock>, which represents a set of similar regions, and B<Contig>,
which represents a contiguous segment of DNA. The key relationship is
B<ContainsRegionIn>, which maps blocks to contigs. The mapping information
includes the DNA nucleotides as well as the location in the contig.

=head2 Formal Definitions

Let {B<G(1)>, B<G(2)>, ..., B<G(>I<n>B<)>} be a set of genomes
and let {B<g(1)>, B<g(2)>, ..., B<g(>I<m>B<)>} be the set of
genes called on these genomes.

The primitive notion from which similarity blocks are derived is the
I<correspondence mapping>. Roughly, this mapping connects genes on
one genome to corresponding genes on other genomes. We will treat
the mapping as an equivalence class. In practice, the entire mapping
will not be available when we are building to load files for the
similarity block database. Instead, the user will provide a subset
of the mapping from which the entire partitioning can be deduced
via commutativity, reflexivity, and transitivity. This subset is
called the I<correspondence relation>.

In actual practice, the correspondence relation will contain some
false positives. The process of developing the correspondence
relation usually involves analyzing similarities above a certain
threshhold. (This type of similarity is not necessarily transitive;
however, the true, underlying correspondence mapping we want I<is>
transitive.) Often, a sequence of correspondence pairs that maps
a gene on one particular genome to another gene on the same genome
is an indication of a false positive.

In addition to false positives, we may have invalid genes, missing
genes, frame shifts, and incorrect start locations. Detecting and
cleaning these problems is an important part of getting a good
correspondence relation.

Given a full-blown correspondence mapping, we define the following
concepts.

=over 4

=item corresponding gene

The I<corresponding genes> of a given gene B<g(>I<i>B<)> are those
genes mapped to it by the correspondence mapping.

=item gene block

A I<gene block> is a maximal set of corresponding genes.

=item intergenic region

An I<intergenic region> is a section of the contig between
two adjacent genes. The adjacent genes are called the
I<left gene> and the I<right gene>, depending on whether they
precede or follow the region. For example, if C<G2_100+50>
and C<G2_175+100> are adjacent, C<G2_150+25> is their
intergenic region, the left gene is C<G2_100+50>, and
the right gene is C<G2_175+100>.

=item intergenic block

An I<intergenic block> is a maximal set of intergenic regions such
that (1) the left genes all correspond and have the same orientation,
(2) the right genes all correspond and have the same orientation, and
(3) the region lengths differ by no more than a specified threshhold
(usually 10%).

=item similarity block

A I<similarity block> is either a gene block or an intergenic block.
A similarity block is therefore a set of regions (either gene regions
or intergenic regions) that are tied together by the correspondence
mapping.

=item common block

A I<common block> with respect to a subset of the genomes
{B<G(>I<i1>B<)>, B<G(>I<i2>B<)>, ... B<G(>I<ik>B<)>} is a
similarity block that has exactly one region in each
of the genomes of the subset. There is a lesser notion of
a block that occurs in most genomes in the subset. The actual
program allows the notion of I<most> to be defined externally.
The default is all of the genomes in the subset.

=item unique sequence

A I<unique sequence> is a region that does not appear in any
similarity block with another region. In other words, a unique
sequence is a region that does not correspond to any other region.

=back

=head2 Usage

The similarity block database must be able to answer two questions, taking
as input two genome sets. (These would be subsets of the total set of genomes
represented in the correspondence mapping.)

=over 4

=item (1) Which blocks are common in one set and do not occur in the other.

These are considered major variations. They result from prophage insertions,
transposition events, and recombinations. A major variation results in a
different set of proteins in an operating cell.

=item (2) What individual variations in the shared common blocks distinguish the two sets?

These are considered minor variations, and are the most common result of
single-nucleotide mutations. Rather than changing the set of proteins,
minor variations change the proteins themselves.

=back

This class is a subclass of B<ERDB>, and inherits all of its public methods.

=head2 Setup

To begin working with similarity blocks, you must first create a database to
contain the block data. Next, you must specify the various similarity block
parameters in C<FIG_Config>. These are as follows. (Of course, the values
you specify will be different.)

    $simBlocksDB     = "conser5_blocks";    # SimBlocks database schema name
    $simBlocksData   = "$data/BlockData";   # directory containing SimBlocks data files
                                            # (used to load the Similarity database)
    $simBlastData    = "$data/BlastData";   # directory containing SimBlast data files
                                            # (input to SimLoad.pl)

Run the C<SimLoad> script to load the database.

The C<SimBlockForm.cgi> service enables you to run a similarity analysis between genome
sets in the database. The C<Html/ERDBTest.html> page enables you to run general queries
against the database and execute simple scripts (however, this latter capability is
disabled unless you have

    $debug_mode      = 1;                   # TRUE to enable the debugging scripts

The Similarity Block methods and scripts work in the Windows version of the SEED (see
C<WinBuild>).

=cut

#: Constructor SimBlocks->new();

=head2 Public Methods

=head3 new

    my $simdb = SimBlocks->new($dbname, $dbType, $port);

Construct a new SimBlocks object connected to the specified database. If no
database is specified, the default database indicated by C<FIG_Config::simBlocksDB>
will be used. Similarly, if the database type is omitted, the type will be
taken from C<FIG_Config::dbms>, and if the port is omitted, C<FIG_Config::dbport>
will be used. The user name and password for connecting are taken from
C<FIG_Config::dbuser> and C<FIG_Config::dbpass>.

In almost every case, you will be calling

    my $simdb = SimBlocks->new();

=over 4

=item dbname (optional)

Database schema name. This is how we choose the desired database on the server.
If omitted, the value of C<$FIG_Config::simBlocksDB> is used.

=item dbType

Database type (e.g. C<mysql> for MySQL or C<pg> for Postgres). If omitted, the
value of C<$FIG_Config::dbms> will be used.

=item port

Port number for connecting to the database server. If omitted, the value of
C<$FIG_Config::dbport> will be used.

=back

=cut

sub new {
    # Get the parameters.
    my ($class, $dbname, $dbType, $port) = @_;
    # Plug in the default values.
    if (! $dbname) {
        $dbname = DBName();
    }
    if (! $dbType) {
        $dbType = $FIG_Config::dbms;
    }
    if (! $port) {
        $port = $FIG_Config::dbport;
    }
    # Connect to the database.
    my $dbh = DBKernel->new($dbType, $dbname, $FIG_Config::dbuser, $FIG_Config::dbpass, $port);
    # Get the data directory name.
    my $directory = $FIG_Config::simBlocksData;
    # Create and bless the ERDB object.
    my $retVal = ERDB::new($class, $dbh, "$FIG_Config::fig/SimBlocksDBD.xml");
    # Return it.
    return $retVal;
}

=head3 DBName

    my $name = SimBlocks::DBName;

Return the name of the database. This is set from a config variable, but if the
variable is undefined a default value is used.

=cut

sub DBName {
    my $retVal;
    if (defined $FIG_Config::simBlocksDB) {
        $retVal = $FIG_Config::simBlocksDB;
    } else {
        $retVal = "Blocks";
    }
    return $retVal;
}

=head3 DefaultDistances

    my $distances = DefaultDistances();

Return the default distance matrix for computing the alignment distances (see
also L</DistanceMatrix>. This matrix returns a distance of 0 between an insertion
character (C<->) and any other nucleotide, a distance of 0.5 between nucleotides
of the same type, and a distance of 1 between nucleotides of different types.

=cut

my $DefaultDistanceMatrix = { AA  => 0,      AC  => 1,     AG  => 0.5,   AT  => 1,
                              AN  => 0.625,  AR  => 0.25,  AY  => 1,    'A-' => 0,
                              CA  => 1,      CC  => 0,     CG  => 1,     CT  => 0.5,
                              CN  => 0.625,  CR  => 1,     CY  => 0.25, 'C-' => 0,
                              GA  => 0.5,    GC  => 1,     GG  => 0,     GT  => 1,
                              GN  => 0,      GR  => 0.25,  GY  => 1,    'G-' => 0,
                              TA  => 1,      TC  => 0.5,   TG  => 1,     TT  => 0,
                              TN  => 0,      TR  => 1,     TY  => 0.25, 'T-' => 0,
                              RA  => 0.25,   RC  => 1,     RG  => 0.25,  RT  => 1,
                              RN  => 0.625,  RR  => 0.25,  RY  => 1,    'R-' => 0,
                              YA  => 1,      YC  => 0.25,  YG  => 1,     YT  => 0.25,
                              YN  => 0.625,  YR  => 1,     YY  => 0.25, 'Y-' => 0,
                              NA  => 0.625,  NC  => 0.625, NG  => 0.625, NT  => 0.625,
                              NN  => 0.625,  NR  => 0.625, NY  => 0.625,'N-' => 0,
                             '-A' => 0,     '-C' => 0,    '-G' => 0,    '-T' => 0,
                             '-N' => 0,     '-R' => 0,    '-Y' => 0,    '--' => 0 };

sub DefaultDistances {
    return $DefaultDistanceMatrix;
}

=head3 DBLoad

    my $stats = $simBlocks->DBLoad($rebuild);

Load the database from the default directory. This is essentially identical to
a B<LoadTables> call with the default directory used instead of a caller-specified
directory.

=over 4

=item rebuild

TRUE if the tables should be re-created; FALSE if the tables should be cleared and
reloaded. The default is FALSE; however, TRUE is considerably faster. You would use
FALSE if you had added non-standard indexes to the database and didn't want to lose
them.

=item RETURN

Returns a statistics object describing the duration of the load, the number of
records loaded, and a list of error messages.

=back

=cut
#: Return Type $%@;
sub DBLoad {
    # Get the parameters.
    my ($self, $rebuild) = @_;
    # Load the tables.
    my $retVal = $self->LoadTables($FIG_Config::simBlocksData, $rebuild);
    # Return the result.
    return $retVal;
}

=head3 CompareGenomes

    my (\%set0Blocks, \%set1Blocks, \%bothBlocks) = $simBlocks->CompareGenomes(\@set0, \@set1);

Analyze two sets of genomes for commonalities. The group blocks returned will be divided
into three hashes: one for those common to set 0 and not occurring at all in set 1, one
for those common to set 1 and not occurring at all in set 0, and one for those common
to both sets. Each hash is keyed by group ID and will contain B<ERDBObject>s for
B<GroupBlock> records with B<HasInstanceOf> data attached, though the genome ID in
the B<HasInstanceOf> section is not generally predictable.

=over 4

=item set0

Reference to a list of genome IDs for the genomes in the first set (set 0).

=item set1

Reference to a list of genome IDs for the genomes in the second set (set 1).

=item RETURN

Returns a triple of hashes. Each hash is keyed by group ID, and will contain
B<ERDBObject>s for records in the B<GroupBlock> table. Groups found in all of the
genomes in set 0 but in none of the genomes of set 1 will be in the first hash,
groups found in all of the genomes in set 1 but in none of the genomes of set 0
will be in the second hash, and groups found in all of the genomes of both sets
will be in the third hash.

=back

=cut
#: Return Type @%;
sub CompareGenomes {
    # Get the parameters.
    my ($self, $set0, $set1) = @_;
    # Declare the three hashes.
    my (%set0Blocks, %set1Blocks, %bothBlocks);
    # Our first task is to find the common groups in each genome set.
    my %groups0 = $self->BlocksInSet($set0);
    my %groups1 = $self->BlocksInSet($set1);
    # Create a trailer key to help tighten the loops.
    my $trailer = "\xFF";
    # The groups are hashed by group ID. We will move through them in key order
    # to get the "set0", "set1", and "both" groups.
    my @blockIDs0 = (sort keys %groups0, $trailer);
    my @blockIDs1 = (sort keys %groups1, $trailer);
    # Prime the loop by getting the first ID from each list. Thanks to the trailers
    # we know this process can't fail.
    my $blockID0 = shift @blockIDs0;
    my $blockID1 = shift @blockIDs1;
    while ($blockID0 lt $trailer || $blockID1 lt $trailer) {
        # Compare the block IDs.
        if ($blockID0 lt $blockID1) {
            # Block 0 is only in the first set.
            Trace("Block IDs $blockID0 vs. $blockID1. Set 0 selected.") if T(SimCompare => 3);
            $set0Blocks{$blockID0} = $groups0{$blockID0};
            $blockID0 = shift @blockIDs0;
        } elsif ($blockID0 gt $blockID1) {
            # Block 1 is only in the second set.
            Trace("Block IDs $blockID0 vs. $blockID1. Set 1 selected.") if T(SimCompare => 3);
            $set1Blocks{$blockID1} = $groups1{$blockID1};
            $blockID1 = shift @blockIDs1;
        } else {
            # We have a block in both sets.
            Trace("Block IDs $blockID0 vs. $blockID1. Both selected.") if T(SimCompare => 3);
            $bothBlocks{$blockID1} = $groups1{$blockID1};
            $blockID0 = shift @blockIDs0;
            $blockID1 = shift @blockIDs1;
        }
    }
    # The set1 and set2 lists are too big. They contain groups that are common in their
    # respective sets but not common in both sets. We want groups that are common in
    # their respective sets but completely absent in the other set. We do this by getting
    # a complete list of the blocks in each set and deleting anything we find.
    $self->RemoveBlocks(\%set0Blocks, $set1);
    $self->RemoveBlocks(\%set1Blocks, $set0);
    # Return the result.
    return (\%set0Blocks, \%set1Blocks, \%bothBlocks);
}

=head3 RemoveBlocks

    $simBlocks->RemoveBlocks(\%blockMap, \@set);

Remove from the specified block map any blocks that occur in the specified set of genomes.
The block map can contain any data, but it must be keyed by block ID.

=over 4

=item blockMap

Reference to a hash of block IDs to block data. The character of the block data is
irrelevant to this method.

=item set

Reference to a list of genome IDs. Any block that occurs in any one of the specified
genomes will be removed from the block map.

=back

=cut
#: Return Type ;
sub RemoveBlocks {
    # Get the parameters.
    my ($self, $blockMap, $set) = @_;
    # Get a list of the blocks in the genome set.
    my %setBlocks = $self->BlocksInSet($set, 1);
    # Loop through the incoming block map, deleting any blocks found
    # in the new block set. Note we make the assumption that the
    # incoming block map is smaller.
    my @blockList = keys %{$blockMap};
    for my $blockID (@blockList) {
        if (exists $setBlocks{$blockID}) {
            delete $blockMap->{$blockID};
        }
    }
}

=head3 BlocksInSet

    my %blockList = $simBlocks->BlocksInSet($set, $count);

Return a list of the group blocks found in a given number of the genomes in a given
set. The list returned will be a hash of B<ERDBObject>s, each corresponding to a single
B<GroupBlock> record, with a B<HasInstanceOf> record attached, though the content of
the B<HasInstanceOf> record is not predictable. The hash will be keyed by block ID.

=over 4

=item set

Reference to a list of genome IDs.

=item count (optional)

Minimum number of genomes from the set in which the block must appear. If C<1> is
specified, the list will include any block that occurs in at least one genome.
If C<5> is specified, the list will only include blocks that occur in at least
5 genomes from the set. If the set consists of only 4 genomes in this latter
case, no blocks will be returned. The default is to return blocks that occur in
all genomes of the set.

=item RETURN

Returns a hash of B<ERDBObject>s corresponding to the group blocks found in the
genomes of the set.

=back

=cut
#: Return Type %%;
sub BlocksInSet {
    # Get the parameters.
    my ($self, $set, $count) = @_;
    # If the count was not specified, set it t
    if (!defined $count) {
        $count = @{$set};
    }
    Trace("BlocksInSet called for $count genomes of set (" . join(", ", @{$set}) . ").") if T(2);
    # Create a hash to hold the groups found. The hash will be keyed by group ID
    # and contain the relevant DB object.
    my %retVal = ();
    # Create a hash to contain the number of genomes in which each block was found.
    my %counters = ();
    # Loop through the genome IDs in the specified set.
    for my $genomeID (@{$set}) {
        # Get a list of group blocks for this genome.
        my @blocks = $self->GetList(['HasInstanceOf', 'GroupBlock'],
                                    "HasInstanceOf(from-link) = ?", [$genomeID]);
        # Loop through the blocks, storing any new ones in the hash.
        for my $block (@blocks) {
            # Get the ID of this block.
            my ($blockID) = $block->Value('GroupBlock(id)');
            # Determine whether this block is new or has been found in a previous
            # genome of the set.
            if (exists $counters{$blockID}) {
                # Here it's old. Increment its counter.
                $counters{$blockID}++;
            } else {
                # Here it's new. Create a counter for it.
                $counters{$blockID} = 1;
            }
            # If it meets the threshhold, put it in the return hash. If it's already
            # there, we'll simply overwrite the old copy, which makes no difference
            # to the caller.
            Trace("Block $blockID has $counters{$blockID} occurrences as of genome $genomeID.") if T(SimSet => 4);
            if ($counters{$blockID} >= $count) {
                $retVal{$blockID} = $block;
                Trace("Block $blockID added to set.") if T(SimSet => 3);
            }
        }
    }
    # Return the result.
    return %retVal;
}

=head3 GetRegions

    my %regions = $simBlocks->GetRegions($blockID, \@genomes);

Return the regions of the specified block that occur in the contigs of
the specified genomes. The return value is a hash of DNA strings keyed
on the region's location descriptor.

=over 4

=item blockID

ID of the group block whose regions are to be returned.

=item genomes

Reference to a list of genome IDs. Only regions in one of the listed
genomes will be returned.

=item RETURN

Returns a hash keyed by region location. The objects in the hash are
the DNA strings for the specified region.

=back

=cut
#: Return Type %$;
sub GetRegions {
    # Get the parameters.
    my ($self, $blockID, $genomes) = @_;
    # Create the return hash.
    my %retVal = ();
    # Count the regions found.
    my $regionCount = 0;
    # Query all the regions for the specified block.
    my $query = $self->Get(['IncludesRegion', 'Region'], "IncludesRegion(from-link) = ?",
                           [$blockID]);
    # Loop through the query.
    while (my $region = $query->Fetch) {
        # Get this region's data.
        my ($location, $dna) =
            $region->Values(['Region(id)', 'Region(content)']);
        # Insure it belongs to one of our genomes.
        my $found = SetNumber($location, $genomes);
        # If it does, put it in the hash.
        if ($found >= 0) {
            $retVal{$location} = $dna;
            $regionCount++;
        }
    }
    Trace("$regionCount regions found by GetRegions.") if T(2);
    # Return the result.
    return %retVal;
}

=head3 SetNumber

    my $setNumber = SimBlocks::SetNumber($contigRegion, \@set0, \@set1, ..., \@setN);

Examine a region string, contig ID, or genome ID, and return the number of the genome
set to which it belongs.

=over 4

=item contigRegion

A region string, contig ID, or genome ID. Because the contig ID is prefixed by the genome ID
delimited by a colon symbol (C<:>), we split at the first colon to get the desired ID.

=item set0, set1, ..., setN

A list of references to genome ID lists. Essentially, we will return the index of the
first incoming list that has the specified genome ID in it.

=item RETURN

The index in the list of sets of the set containing the relevant genome, or -1 if
the genome was not found.

=back

=cut
#: Return Type $
sub SetNumber {
    # Get the parameters.
    my ($contigRegion, @sets) = @_;
    # Extract the genome ID.
    my ($genomeID) = split /:/, $contigRegion;
    # Clear the return variable. A negative number means nothing found.
    my $retVal = -1;
    # Loop until we find the genome.
    for (my $i = 0; $retVal < 0 && $i <= $#sets; $i++) {
        my $set = $sets[$i];
        if (grep /^$genomeID$/, @{$set}) {
            $retVal = $i;
        }
    }
    # Return the result.
    return $retVal;
}

=head3 TagDNA

    my $taggedDNA = SimBlocks::TagDNA($pattern, $dnaString, $prefix, $suffix);

Convert a DNA string from the B<Region> relation to the actual DNA.
The incoming DNA string will contain only the values corresponding to the
question marks in the pattern. The pattern should be taken from the DNA
string's parent B<GroupBlock>. The resulting DNA sequence is built by
copying the DNA string data into the pattern positions that have question
marks. If the string is intended for display in an HTML page, the
I<$prefix> and I<$suffix> parameters can be used to surround the
question mark positions with HTML markers that specify a different style,
weight, or font.

=over 4

=item pattern

DNA pattern from which the positions of variance are to be taken.

=item dnaString

String containing DNA string to be marked.

=item prefix (optional)

String to be inserted before each position of variance.

=item suffix (optional)

String to be inserted after each position of variance.

=item RETURN

Returns a copy of the pattern string with the positions of variance filled
in from the DNA string. If a prefix and suffix are specified, they are
placed before and after each character copied into the pattern string.

=back

=cut
#: Return Type $;
sub TagDNA {
    # Get the parameters.
    my ($pattern, $dnaString, $prefix, $suffix) = @_;
    # Insure the prefix and suffix behave properly.
    if (!defined $prefix) {
        $prefix = "";
    }
    if (!defined $suffix) {
        $suffix = "";
    }
    # Initialize the return variable.
    my $retVal = "";
    # Convert the DNA string to a list with the prefixes and suffixes
    # inserted.
    my @dnaList = map { $prefix . $_ . $suffix } split //, $dnaString;
    # Loop through the DNA string.
    my ($start, $position) = (0, 0);
    while (($position = index $pattern, "?", $start) >= 0) {
        # If we have invariant text, copy it over.
        if ($position > $start) {
            $retVal .= substr $pattern, $start, $position - $start;
        }
        # Pop off the next position from the DNA string.
        $retVal .= shift @dnaList;
        # Start past the question mark.
        $start = $position + 1;
    }
    # Tack on the residual.
    $retVal .= substr $pattern, $start;
    # Return the result.
    return $retVal;
}

=head3 SnipScan

    my %positions = $simBlocks->SnipScan($blockObject, \@set0, \@set1);

Examine the specified block and return a list of the positions at which the
nucleotide values for regions in the first genome set differ from the values
for regions in the second genome set. This conditions is only satisfied if
the nucleotides occurring in the first set's regions never occur in the
second set's regions. This is a very strict condition. If, for example, C<A>
occurs in one of the regions for the first set, it cannot occur in any of the
regions for the second set.

=over 4

=item blockObject

A C<ERDBObject> representing the B<GroupBlock> record for the desired block or
the actual ID of the block whose regions are to be examined. It is expected that the
block will have regions in all of the genomes for both sets, but this is not
required by the algorithm.

=item set0

Reference to a list of the IDs for the genomes in the first set (0).

=item set1

Reference to a list of the IDs for the genomes in the second set (1).

=item RETURN

A hash that maps positions to contents. Each position will be relative to the
start of the block. The mapped value will list the nucleotides found in the
first set and the nucleotides found in the second set. So, for example,
C<['AC', 'G']> means that both C<A> and C<C> are found in the genomes of the
first set (0), but only C<G> is found in genomes of the second set (1).

=back

=cut
#: Return Type %;
sub SnipScan {
    # Get the parameters.
    my ($self, $blockObject, $set0, $set1) = @_;
    # Convert an incoming block ID to a block object.
    if (ref $blockObject ne "ERDBObject") {
        $blockObject = $self->GetEntity('GroupBlock', $blockObject);
    }
    # Get the ID and length of this block.
    my ($blockID) = $blockObject->Value('GroupBlock(id)');
    my ($len) = $blockObject->Value('GroupBlock(snip-count)');
    # Create a hash that can be used to identify the genomes in each set.
    my %setMap = ();
    for my $genomeID (@{$set0}) {
        $setMap{$genomeID} = 0;
    }
    for my $genomeID (@{$set1}) {
        $setMap{$genomeID} = 1;
    }
    # Create the snip hash. We'll create empty lists for each snip position.
    # As soon as a position becomes ineligible we delete it from the hash.
    my %snipHash = ();
    for (my $i = 0; $i < $len; $i++) {
        $snipHash{$i} = ['', ''];
    }
    # Ask for the regions in the block.
    my $query = $self->Get(['IncludesRegion', 'Region'], "IncludesRegion(from-link) = ?",
                           [$blockID]);
    # Loop through the regions.
    while (my $region = $query->Fetch) {
        # Determine this region's genome set. We only continue if the region is in
        # one of the sets.
        my ($location) = $region->Value('Region(id)');
        my $genomeID = Genome::FromContig($location);
        if (exists $setMap{$genomeID}) {
            my $setIndex = $setMap{$genomeID};
            # Get the region's snip values.
            my ($regionValue) = $region->Value('Region(content)');
            # Loop through the positions that are still active.
            my @positions = keys %snipHash;
            for my $position (@positions) {
                # Get the nucleotide at the current position in this region.
                my $letter = substr $regionValue, $position, 1;
                # Find it in the appropriate list.
                my $letterList = $snipHash{$position};
                if ((index $letterList->[1-$setIndex], $letter) >= 0) {
                    # It's in the other set, so this position is no
                    # longer valid.
                    delete $snipHash{$position};
                } elsif ((index $letterList->[$setIndex], $letter) < 0) {
                    # It's in neither set, so we add it to the list we're
                    # accumulating for this set at this position.
                    $letterList->[$setIndex] .= $letter;
                }
            }
        }
    }
    # Now %snipHash contains all of the information we need, but we must relocate the
    # positions so they are relative to the block address instead of the snip
    # positions. We need the block pattern to do this.
    my ($blockPattern) = $blockObject->Value('GroupBlock(pattern)');
    my @positions = ParsePattern($blockPattern);
    # Create the return hash.
    my %retVal = ();
    # Relocate the positions.
    for my $inPosition (keys %snipHash) {
        $retVal{$positions[$inPosition]} = $snipHash{$inPosition};
    }
    # Return the result.
    return %retVal;
}

=head3 ParsePattern

    my @positions = SimBlocks::ParsePattern($pattern);

Get a list of the positions of variance inside a block pattern. The
positions of variance are marked by question marks, so all we need to
do is return the location of every question mark in the incoming
patter string.

=over 4

=item pattern

DNA pattern for a group block.

=item RETURN

Returns a list of position numbers. These are coded as offsets, with 0 indicating
the first character of the pattern.

=back

=cut
#: Return Type @;
sub ParsePattern {
    # Get the parameters.
    my ($pattern) = @_;
    # Create the return list.
    my @retVal = ();
    # Loop through the pattern, looking for question marks.
    my $position = 0;
    while (($position = index($pattern, "?", $position)) >= 0) {
        push @retVal, $position;
        $position++;
    }
    # Return the result.
    return @retVal;
}

=head3 MergeDNA

    my ($groupSequence, $variance) = SimBlocks::MergeDNA($groupSequence, $newSequence);

Merge the DNA for a region into the group representation of similar DNA, returning the
result and the positions of variance. Positions of variance in the group representation
are indicated by question marks. This method essentially compares the new DNA to the
group DNA and adds question marks wherever the new DNA does not match.

=over 4

=item groupSequence

Group representation of the DNA into which the new DNA is to be merged.

=item newSequence

New DNA to be merged into the group representation.

=item RETURN

Returns the merged group representation followed by an updated count of the number
of question marks in the group.

=back

=cut
sub MergeDNA {
    # Get the parameters.
    my ($groupSequence, $newSequence) = @_;
    # Declare the return variables.
    my ($retVal, $varianceCount) = ("", 0);
    # If there is no group representation, or if the group representation is the
    # same as the new DNA, we return the new DNA; otherwise, we have to merge.
    # Note that the new DNA cannot contain any question marks, so if either of
    # the degenerate cases is true, we return a variance of 0.
    if ((! $groupSequence) || ($groupSequence eq $newSequence)) {
        $retVal = $newSequence;
    } else {
        # Here the group representation and the new DNA are different, so we
        # have to find all the points of difference and replace them with
        # question marks. We do this using a trick: by XOR-ing the two
        # strings together, we create a new string with nulls everywhere
        # except where there's a mismatch. We then find the non-null characters
        # and replace the corresponding positions in the return string with
        # question marks. First, we get a copy of the group representation to
        # play with.
        $retVal = $groupSequence;
        # Next we compute the XOR string.
        my $hexor = $retVal ^ $newSequence;
        # This loop finds all non-null string positions.
        while ($hexor =~ m/[^\x00]/g) {
            # Here we get the position past the mismatch point.
            my $pos = pos $hexor;
            # Replace the real mismatch point with a question mark.
            substr $retVal, $pos - 1, 1, "?";
            $varianceCount++;
        }
    }
    # Return the result.
    return ($retVal, $varianceCount);
}

=head3 GetAlignment

    my %sequences = $$simBlocks->GetAlignment(\@blockIDs, \@genomeIDs, $indels);

Return an alignment of the specified genomes relative to the specified block
IDs. Only blocks in which all of the genomes occur will produce output for
the alignment.

=over 4

=item blockIDs

Reference to a list of the IDs of the blocks from which the alignment
should be constructed.

=item genomeIDs

Reference to a list of the genome IDs participating in the alignment.

=item indels (optional)

C<1> if columns containing insertion characters (C<->) should be included
in the alignment, else C<0>. The default is C<0>.

=item RETURN

Returns a hash that maps each genome ID to its sequence in the alignment.

=back

=cut
#: Return Type %;
sub GetAlignment {
    # Get the parameters.
    my ($self, $blockIDs, $genomeIDs, $indels) = @_;
    # Create the return hash. We will start with a null string for each
    # genome, and add the alignment data one block at a time.
    my %retVal = ();
    for my $genomeID (@{$genomeIDs}) {
        $retVal{$genomeID} = "";
    }
    # Get the number of genomes. This will help us in determining which
    # blocks to keep.
    my $totalGenomes = @{$genomeIDs};
    # Loop through the blocks.
    for my $blockID (@{$blockIDs}) {
        # Get the block's data.
        my $blockData = $self->GetEntity('GroupBlock', $blockID);
        my ($blockName) = $blockData->Value('GroupBlock(description)');
        Trace("Processing common block $blockID ($blockName).") if T(Alignment => 4);
        # Only proceed if the block has real variance in it.
        my ($snipCount) = $blockData->Value('GroupBlock(snip-count)');
        if ($snipCount > 0) {
            # Get this block's regions in the specified genomes.
            my %regionHash = $self->GetRegions($blockID, $genomeIDs);
            # Verify that each genome is represented.
            my %blockHash = ();
            my $genomeCount = 0;
            for my $region (keys %regionHash) {
                # Get the genome ID from the region string.
                $region =~ m/^([^:]+):/;
                # If it's new, count it.
                if (! exists $blockHash{$1}) {
                    $blockHash{$1} = $regionHash{$region};
                    $genomeCount++;
                }
            }
            # At this point, if the number of genomes found matches the
            # size of the list, this block is good.
            if ($genomeCount == $totalGenomes) {
                if (! $indels) {
                    # If indels are off, we need to remove some of the columns. We do
                    # this by finding C<-> characters in each genome's DNA sequence,
                    # and deleting all instances of the relevant columns.
                    for my $genomeID (@{$genomeIDs}) {
                        Trace("Hyphen check for genome $genomeID.") if T(Alignment => 3);
                        while ((my $pos = index($blockHash{$genomeID},"-")) >= 0) {
                            # Here we found an insertion character. We delete its column
                            # from all the genomes in the alignment.
                            Trace("Hyphen found at position $pos in genome $genomeID.") if T(Alignment => 4);
                            for my $genomeID2 (@{$genomeIDs}) {
                                substr $blockHash{$genomeID2}, $pos, 1, "";
                            }
                        }
                    }
                }
                Trace("Common block $blockID ($blockName) added to alignment.") if T(Alignment => 3);
                for my $genomeID (@{$genomeIDs}) {
                    $retVal{$genomeID} .= $blockHash{$genomeID};
                }
            }
        }
    }
    # Return the result.
    return %retVal;
}

=head3 DistanceMatrix

    my %distances = SimBlocks::DistanceMatrix(\%sequences, \%distances);

Compute the distances between the sequences in an alignment. the L</SequenceDistance>
method is used to compute the individual distances.

=over 4

=item sequences

Reference to a hash of genome IDs to alignment sequences. The alignment sequences may
only contain the characters C<AGCT->, and must all be the same length.

=item distances

Reference to a hash that maps pairs of letters to distance values. Each letter pair
must be represented as a two-character string. Dual strings such as "AG" and "GA" must
map to the same distance. If no distance matrix is specified, the default distance
matrix is used.

=item RETURN

Returns a hash that maps genome ID pairs to distance values. The ID pairs are
formed using a slash. For example, the distance from C<158878.1> to C<282459.1>
would be found attached to the key C<158878.1/282459.1>.

=back

=cut
#: Return Type %
sub DistanceMatrix {
    # Get the parameters.
    my ($sequences, $distances) = @_;
    # Declare the return variable.
    my %retVal = ();
    # Create a sorted list of the genome IDs.
    my @genomes = sort keys %{$sequences};
    # Loop through the genome IDs.
    for (my $i = 0; $i <= $#genomes; $i++) {
        my $genomei = $genomes[$i];
        # Create the 0-distance pair.
        $retVal{"$genomei/$genomei"} = 0;
        # Loop through the genomes lexically after this one. We'll
        # generate one distance and use it for both orderings.
        for (my $j = $i + 1; $j <= $#genomes; $j++) {
            my $genomej = $genomes[$j];
            # Compute the distance.
            my $distance = SequenceDistance($sequences->{$genomei}, $sequences->{$genomej},
                                            $distances);
            Trace("Distance from $genomei to $genomej is $distance.") if T(Distance => 3);
            # Store it in the return hash.
            $retVal{"$genomei/$genomej"} = $distance;
            $retVal{"$genomej/$genomei"} = $distance;
        }
    }
    # Return the result matrix.
    return %retVal;
}

=head3 SequenceDistance

    my $dist = SimBlocks::SequenceDistance($seq1, $seq2, $distances);

Return the distance between two sequences. The distance presumes that
each alignment is a vector of sorts, with purines (C<A>/C<T>) and pyrmidines (C<G>/C<C>)
being a half-unit from themselves and a full unit from each other. The distance is normalized
by the number of positions in the alignment. Thus, C<AATGCTT> is 0.6267 from C<ATTTGCA>.
The fourth and sixth positions are a full-unit jump and the second, fifth, and seventh
positions are half-unit jumps. This works out to

    sqrt(0 + 0.5^2 + 0 + 1^2 + 0.5^2 + 1^2 + 0.5^2)/sqrt(7)

which is approximately 0.6267. Insertion characters (C<->) are considered a distance of
0 from everything. This distance model can be overridden by supplying a custom distance
matrix as the third parameter.

=over 4

=item seq1

Origin sequence for computing the distance.

=item seq2

End sequence for computing the distance. Both sequences must be the same length.

=item distances

Reference to a hash that maps pairs of letters to distance values. Each letter pair
must be represented as a two-character string. Dual strings such as "AG" and "GA" must
map to the same distance. If no distance matrix is specified, the default distance
matrix is used.

=back

=cut
#: Return Type $;
sub SequenceDistance {
    # Get the parameters.
    my ($seq1, $seq2, $distances) = @_;
    if (!defined $distances) {
        $distances = DefaultDistances();
    }
    # Declare the return value. Initially, this will be a sum of squares. We'll
    # take the square root and divide out the length later.
    my $retVal = 0;
    # Loop through the sequences, character by character, comparing.
    my $n = length $seq1;
    for (my $i = 0; $i < $n; $i++) {
        my $s1 = substr $seq1, $i, 1;
        my $s2 = substr $seq2, $i, 1;
        my $step = $distances->{"$s1$s2"};
        Trace("No step distance found for \"$s1$s2\".") if !defined $step && T(Distance => 1);
        $retVal += $step * $step;
    }
    # Divide by the length and take the square root.
    $retVal = sqrt($retVal / $n);
    # Return the result.
    return $retVal;
}

=head3 GetBlock

    my $blockData = $simBlocks->GetBlock($blockID);

Return a B<ERDBObject> for a specified group block.

=over 4

=item blockID

ID of the desired block.

=item RETURN

Returns a B<ERDBObject> for the group block in question. The object allows access to
all of the fields in the C<GroupBlock> relation of the database.

=back

=cut
#: Return Type $%;
sub GetBlock {
    # Get the parameters.
    my ($self, $blockID) = @_;
    # Get the group block.
    my $retVal = $self->GetEntity('GroupBlock', $blockID);
    # Return the result.
    return $retVal;
}

=head3 GetBlockPieces

    my @blocks = $blockData->GetBlockPieces($location);

Return a map of the block pieces inside the specified location. The return
value will be a list of block locations. A block location is essentially a
location string with a block ID in place of the contig ID.

This routine depends upon the fact that there is no overlap between blocks.
A given nucleotide on a Contig is in exactly one block. Therefore, to find
all the blocks that overlap a specific section of the Contig, we find the
first block that ends after the start of the section and proceed until we
find the last block that begins before the end of the section. The only
complexity that occurs is if the Contig folds back in upon itself and
appears twice in the same block. In this case, we might get two segments from
the same block, and they may or may not overlap.

=over 4

=item location

A basic location object or location string defining the range whose block pieces
are desired.

=item RETURN

Returns a list of block location strings. Each string consists of a block ID
followed by a start location and an end location. The three pieces of the
string are separated by underscores (C<_>).

=back

=cut
#: Return Type %;
sub GetBlockPieces {
    # Get the parameters.
    my ($self, $location) = @_;
    # Declare the return variable.
    my @retVal = ();
    # Parse the incoming location.
    my $loc = BasicLocation->new($location);
    # Determine the parameters needed to get the desired list of regions.
    my ($left, $right, $contigID) = ($loc->Left, $loc->Right, $loc->Contig);
    Trace("Searching for regions near " . $loc->String) if T(4);
    # Ask for the regions in the section we want.
    my @regions = $self->GetAll(['Region', 'IncludesRegion', 'GroupBlock'],
                                'Region(endpoint) >= ? AND Region(position) <= ? AND Region(contigID) = ?',
                                [$left, $right, $contigID],
                                ['GroupBlock(id)', 'GroupBlock(pattern)',
                                 'GroupBlock(len)',
                                 'Region(position)', 'Region(endpoint)',
                                 'Region(content)']);
    # Loop through the regions found. For each region we will output a location string.
    for my $regionData (@regions) {
        # Declare the variable to contain the formatted location.
        my $blockLoc;
        # Separate the fields from the region data.
        my ($blockID, $pattern, $len, $position, $end, $content) = @{$regionData};
        # Determine whether we're using the full region or only part of it.
        if ($position >= $left && $end <= $right) {
            # Here we're using the full region, so the location is simple.
            $blockLoc = "${blockID}_1_${len}";
        } else {
            # This is the hard part. We need to convert the contig positions to
            # block positions. Since the block contains insertion characters and
            # the regions don't, this requires fancy dancing. First, we get the
            # region's DNA string.
            my $dnaSequence = TagDNA($pattern, $content);
            # Let's call the "area of interest" the part of the Contig that
            # is in the incoming location AND in the current region. This could
            # be at either end of the region or somewhere in the middle. We
            # need to walk the block DNA to find what positions in the block
            # correspond to the desired positions in the contig. First, we
            # put ourselves at the beginning of the block and the region,
            # and walk to the start point. Note that our position arithmetic
            # is 1-based, in keeping with the conventions of biologists.
            my $blockPos = 1;
            my $contigPos = $position;
            # Check to see if we can use the current position as are starting
            # point or we need to find it.
            if ($left > $position) {
                # Here we need to find the left edge of the area of interest.
                $blockPos = WalkDNA($blockPos, $contigPos, $dnaSequence, $left);
                # Update the contig position to match.
                $contigPos = $left;
            }
            # Save the left edge.
            my $blockLeft = $blockPos;
            # Now find the right edge.
            if ($right >= $end) {
                # Here the right edge is the end of the block.
                $blockPos = $len;
            } else {
                # Here we have to find the right edge of the area of interest.
                $blockPos = WalkDNA($blockPos, $contigPos, $dnaSequence, $right);
            }
            my $blockRight = $blockPos;
            # Format the location.
            $blockLoc = "${blockID}_${blockLeft}_${blockRight}";
        }
        # Push the block location onto the return list.
        push @retVal, $blockLoc;
    }
    # Return the result.
    return @retVal;
}

=head3 GetFeatureBlockPieces

    my @pieces = $simBlocks->GetFeatureBlockPieces($fig, \@featureIDs, $distance);

Return a list of the block pieces within the specified distance of the
specified features. This method essentially computes locations from
the distance and the feature locations, then passes them to L/GetBlockPieces>.
That method will return the sections of various similarity blocks that
occur inside the the locations computed.

=over 4

=item fig

A FIG-like object that can be used to convert features into locations.

=item featureIDs

Reference to a list of feature IDs. Blocks in the vicinity of any of the
features are returned.

=item distance

Distance to search from the feature's actual location.

=item RETURN

Returns a list of block piece location objects. A block piece location object
is a standard B<BasicLocation> object with a block ID instead of a contig ID.

=back

=cut
#: Return Type @;
sub GetFeatureBlockPieces {
    # Get the parameters.
    my ($self, $fig, $featureIDs, $distance) = @_;
    # Declare a hash for keeping the return data. This will weed out
    # duplicates, of which we expect plenty. We'll need to handle
    # overlaps later, however.
    my %retHash = ();
    # Loop through the features.
    for my $featureID (@{$featureIDs}) {
        # Get the feature's genome ID.
        my $genomeID = FIG::genome_of($featureID);
        # Get the feature's locations.
        my @locations = $fig->feature_location($featureID);
        # Loop through the locations, sticking the resulting pieces in the return
        # hash.
        for my $loc (@locations) {
            my $locObject = BasicLocation->new($loc);
            # Get the length of the contig in question.
            my $len = $fig->contig_ln($genomeID, $locObject->Contig);
            # Expand the location.
            $locObject->Widen($distance, $len);
            # Insure the location is Sprout-style;
            $locObject->FixContig($genomeID);
            # Get the desired block pieces.
            my @pieces = $self->GetBlockPieces($locObject);
            Trace(scalar(@pieces) . " pieces found for location $loc.") if T(4);
            # Put them in the hash.
            for my $piece (@pieces) {
                $retHash{$piece} = 1;
            }
        }
    }
    # Now we have all the block pieces that occur in any one of our regions. The
    # next step is to merge adjacent and overlapping regions. First we convert
    # the pieces to location objects so we can sort them.
    my @retVal = ();
    for my $piece (keys %retHash) {
        my $loc = BasicLocation->new($piece);
        push @retVal, $loc;
    }
    Trace("Beginning sort.") if T(3);
    @retVal = sort { BasicLocation::Cmp($a,$b) } @retVal;
    Trace(scalar(@retVal) . " pieces found before overlap check.") if T(3);
    # Now the locations are sorted by block ID, start position, and descending
    # length. This means that if there's an overlap, the two overlapping
    # pieces will be next to each other.
    my $i = 0;
    while ($i < $#retVal) {
        # Check for overlap between this location and the next.
        my ($type, $len) = Overlap::CheckOverlap($retVal[$i], $retVal[$i+1]);
        if ($len == 0) {
            # Here there is no overlap, so we push ahead.
            $i++;
        } elsif ($type eq 'embedded') {
            # Here the second piece is inside the first, so we delete the
            # second.
            delete $retVal[$i+1];
        } else {
            # Here we have a normal overlap. Because all of our pieces
            # are forward locations, this means the second piece starts
            # before the end of the of the first. We set the end point
            # if the first to the end point of the second and delete the
            # second.
            $retVal[$i]->SetEnd($retVal[$i+1]->EndPoint);
            delete $retVal[$i+1];
        }
    }
    # Return the result.
    return @retVal;
}

=head3 WalkDNA

    my $blockPos = SimBlocks::WalkDNA($blockPos, $contigPos, $dna, $loc);

Location the desired position within a block of a specified location in a contig.

The idea here is that a block contains insertion characters and a contig doesn't.
The incoming DNA sequence is for the block. On entry, we have a current position
in the block and a current position in the contig. We advance the two pointers
in parallel until the contig pointer equals the desired location.

=over 4

=item blockPos

Current position in the block (1-based).

=item contigPos

Current position in the contig.

=item dna

DNA string for the block with the points of variance filled in with data from
the contig. In other word's the block's version of the current contig region.

=item loc

Desired location in the contig.

=item RETURN

Returns the position in the block corresponding to the desired position in the
contig.

=back

=cut
#: Return Type $;
sub WalkDNA {
    # Get the parameters.
    my ($blockPos, $contigPos, $dna, $loc) = @_;
    # Copy the incoming positions into variables with which we can play. While
    # we're at it, we'll convert the block position from 1-based to 0-based.
    my ($retVal, $contigPtr) = ($blockPos - 1, $contigPos);
    # Loop until we find our destination point.
    while ($contigPtr < $loc) {
        # Find the next insertion character in the DNA string.
        my $nextHyphen = index $dna, '-', $retVal;
        # At this point, "nextHyphen" is either -1 (indicating no hyphen
        # found) or it is positioned on a hyphen. If it's -1, move it after
        # the end of the dna string.
        if ($nextHyphen < 0) {
            $nextHyphen = length $dna;
        }
        # Now, if we subtract $retVal from $nextHyphen, we have the largest
        # value by which we can advance the contig pointer without getting
        # out of sync. The first step is to see if this would move us past
        # our desired location.
        my $offset = $nextHyphen - $retVal;
        my $contigNext = $contigPtr + $offset;
        if ($contigNext >= $loc) {
            # We now know that everything between the current position and
            # the desired position is real nucleotides. We simply push both
            # pointers forward by the same amount so that the contig pointer
            # becomes $loc.
            $retVal += $loc - $contigPtr;
            $contigPtr = $loc
        } else {
            # Here we need to go the whole distance to the next hyphen and
            # keep going.
            $contigPtr += $offset;
            $retVal += $offset;
            $retVal++;
        }
    }
    # Return the result. Note we convert back to 1-based.
    return $retVal + 1;
}

1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3