[Bio] / Sprout / SimBlocks.pm Repository:
ViewVC logotype

View of /Sprout/SimBlocks.pm

Parent Directory Parent Directory | Revision Log Revision Log

Revision 1.1 - (download) (as text) (annotate)
Wed May 4 03:24:43 2005 UTC (14 years, 9 months ago) by parrello
Branch: MAIN
*** empty log message ***

package SimBlocks;

    require Exporter;
    use ERDB;
    @ISA = qw(Exporter ERDB);
    @EXPORT = qw();
    @EXPORT_OK = qw();

    use strict;
    use Tracer;
    use PageBuilder;
    use Genome;
    use FIG_Config;

=head1 Similarity Block Database

=head2 Introduction

A I<similarity Block Database> describes the similarities amongst a group
of contigs. The load files for the database are produced by the
C<SimBlast.pl> script from genome FASTA files. The database is described
by the C<SimBlocksDBD.xml> file.

The primary entities of the database are B<Genome>, which represents a specified
organism, B<Group>, which represents a set of similar regions, and B<Contig>,
which represents a contiguous segment of DNA. The key relationship is
B<ContainsRegionIn>, which maps groups to contigs. The mapping information
includes the DNA nucleotides as well as the location in the contig.

For a set of genomes, we will define a I<common group> as a group that
is present in most of the genomes. (The definition of I<most> is
determined by a parameter.)

The similarity block database must be able to answer two questions, taking
as input two sets of genomes.

=over 4

=item (1) Which groups are common in both sets and which are common in only one or the other?

=item (2) For each group that is common in both sets, how do the individual regions differ?


This class is a subclass of B<ERDB>, and inherits all of its public methods.


#: Constructor SimBlocks->new();

=head2 Public Methods

=head3 new

C<< my $simdb = SimBlocks->new($dbname, $dbType, $port); >>

Construct a new SimBlocks object connected to the specified database. If no
database is specified, the default database indicated by C<FIG_Config::simBlocksDB>
will be used. Similarly, if the database type is omitted, the type will be
taken from C<FIG_Config::dbms>, and if the port is omitted, C<FIG_Config::dbport>
will be used. The user name and password for connecting are taken from
C<FIG_Config::dbuser> and C<FIG_Config::dbpass>.

=over 4

=item dbname (optional)

Database schema name. This is how we choose the desired database on the server.
If omitted, the value of C<$FIG_Config::simBlocksDB> is used.

=item dbType

Database type (e.g. C<mysql> for MySQL or C<pg> for Postgres). If omitted, the
value of C<$FIG_Config::dbms> will be used.

=item port

Port number for connecting to the database server. If omitted, the value of
C<$FIG_Config::dbport> will be used.



sub new {
    # Get the parameters.
    my ($class, $dbname, $dbType, $port) = @_;
    # Plug in the default values.
    if (! $dbname) {
        $dbname = $FIG_Config::simBlocksDB;
    if (! $dbType) {
        $dbType = $FIG_Config::dbms;
    if (! $port) {
        $port = $FIG_Config::dbport;
    # Connect to the database.
    my $dbh = DBKernel->new($dbType, $dbname, $FIG_Config::dbuser, $FIG_Config::dbpass, $port);
    # Get the data directory name.
    my $directory = $FIG_Config::simBlocksData;
    # Create and bless the ERDB object.
    my $retVal = ERDB::new($class, $dbh, "$directory/SimBlocksDBD.xml");
    # Return it.
    return $retVal;

=head3 DBLoad

C<< my $stats = $simBlocks->DBLoad($rebuild); >>

Load the database from the default directory. This is essentially identical to
a B<LoadTables> call with the default directory used instead of a caller-specified

=over 4

=item rebuild

TRUE if the tables should be re-created; FALSE if the tables should be cleared and
reloaded. The default is FALSE; however, TRUE is considerably faster.

=item RETURN

Returns a statistics object describing the duration of the load, the number of
records loaded, and a list of error messages.


#: Return Type $%@;
sub DBLoad {
    # Get the parameters.
    my ($self, $rebuild) = @_;
    # Load the tables.
    my $retVal = $self->LoadTables($FIG_Config::simBlocksData, $rebuild);
    # Return the result.
    return $retVal;

=head3 CompareGenomes

C<< my (\%set1Blocks, \%set2Blocks, \%bothBlocks) = $simBlocks->CompareGenomes(\@set1, \@set2); >>

Analyze two sets of genomes for commonalities. The group blocks returned will be divided
into three hashes: one for those found only in set 1, one for those found only in set 2,
and one for those found in both sets. Each hash is keyed by group ID and will contain
B<DBObject>s for B<GroupBlock> records with B<HasInstanceOf> data attached, though the
genome ID in the B<HasInstanceOf> section is not generally predictable.

=over 4

=item set1

Reference to a list of genome IDs for the genomes in the first set.

=item set2

Reference to a list of genome IDs for the genomes in the second set.

=item RETURN

Returns a list of hashes of lists. Each hash is keyed by group ID, and will contain
B<DBObject>s for records in the B<GroupBlock> table. Groups found only in set
1 will be in the first hash, groups found only in set 2 will be in the second
hash, and groups found in both sets will be in the third hash.


#: Return Type @%;
sub CompareGenomes {
    # Get the parameters.
    my ($self, $set1, $set2) = @_;
    # Declare the three hashes.
    my (%set1Blocks, %set2Blocks, %bothBlocks);
    # Our first task is to find the groups in each genome set.
    my %groups1 = $self->BlocksInSet($set1);
    my %groups2 = $self->BlocksInSet($set2);
    # Create a trailer key to help tighten the loops.
    my $trailer = "\xFF";
    # The groups are hashed by group ID. We will move through them in key order
    # to get the "set1", "set2", and "both" groups.
    my @blockIDs1 = (sort keys %groups1, $trailer);
    my @blockIDs2 = (sort keys %groups2, $trailer);
    # Prime the loop by getting the first ID from each list. Thanks to the trailers
    # we know this process can't fail.
    my $blockID1 = shift @blockIDs1;
    my $blockID2 = shift @blockIDs2;
    while ($blockID1 lt $trailer || $blockID2 lt $trailer) {
        # Compare the block IDs.
        if ($blockID1 lt $blockID2) {
            # Block 1 is only in the first set.
            $set1Blocks{$blockID1} = $groups1{$blockID1};
            $blockID1 = shift @blockIDs1;
        } elsif ($blockID1 gt $blockID2) {
            # Block 2 is only in the second set.
            $set2Blocks{$blockID2} = $groups2{$blockID2};
            $blockID2 = shift @blockIDs2;
        } else {
            # We have a block in both sets.
            $bothBlocks{$blockID1} = $groups1{$blockID1};
            $blockID1 = shift @blockIDs1;
            $blockID2 = shift @blockIDs2;
    # Return the result.
    return (\%set1Blocks, \%set2Blocks, \%bothBlocks);

=head3 BlocksInSet

C<< my %blockList = $simBlocks->BlocksInSet($set); >>

Return a list of the group blocks found in a given set of genomes. The list returned
will be a hash of B<DBObject>s, each corresponding to a single B<GroupBlock> record,
with a B<HasInstanceOf> record attached, though the content of the B<HasInstanceOf>
record is not predictable. The hash will be keyed by group ID.

=over 4

=item set

Reference to a list of genome IDs. All blocks appearing in any one of the genome
IDs will be in the list returned.

=item RETURN

Returns a hash of B<DBObject>s corresponding to the group blocks found in the
genomes of the set.


#: Return Type %%;
sub BlocksInSet {
    # Get the parameters.
    my ($self, $set) = @_;
    # Create a hash to hold the groups found. The hash will be keyed by group ID
    # and contain the relevant DB object.
    my %retVal = ();
    # Loop through the genome IDs in the specified set.
    for my $genomeID (@{$set}) {
        # Get a list of group blocks for this genome.
        my @blocks = $self->GetList(['HasInstanceOf', 'GroupBlock'],
                                    "HasInstanceOf(from-link) = ?", $genomeID);
        # Loop through the blocks, storing any new ones in the hash.
        for my $block (@blocks) {
            # Get the ID of this block.
            my ($blockID) = $block->Value('GroupBlock(id)');
            # Store it in the hash. If it's a duplicate, it will erase the
            # old one.
            $retVal{$blockID} = $block;
    # Return the result.
    return %retVal;

=head3 GetRegions

C<< my %regions = $simBlocks->GetRegions($blockID, \@genomes); >>

Return the regions of the specified block that occur in the contigs of
the specified genomes. The return value is a hash of DNA strings keyed
on the region's location descriptor.

=over 4

=item blockID

ID of the group block whose regions are to be returned.

=item genomes

Reference to a list of genome IDs. Only regions in one of the listed
genomes will be returned.

=item RETURN

Returns a hash keyed by region location. The objects in the hash are
the DNA strings for the specified region.


#: Return Type %$;
sub GetRegions {
    # Get the parameters.
    my ($self, $blockID, $genomes) = @_;
    # Create the return hash.
    my %retVal = ();
    # Query all the regions for the specified block.
    my $query = $self->Get(['ContainsRegionIn'], "ContainsRegionIn(from-link) = ?", $blockID);
    # Loop through the query.
    while (my $region = $query->Fetch) {
        # Get this region's data.
        my ($contigID, $start, $dir, $len, $dna) =
            $region->Values(['ContainsRegionIn(to-link)', 'ContainsRegionIn(position)',
                             'ContainsRegionIn(direction)', 'ContainsRegionIn(len)',
        # Insure it belongs to one of our genomes.
        my $genomeID = Genome::FromContig($contigID);
        my $found = grep($_ eq $genomeID, @{$genomes});
        # If it does, put it in the hash.
        if ($found) {
            my $location = "${contigID}_$start$dir$len";
            $retVal{$location} = $dna;
    # Return the result.
    return %retVal;

=head3 TagDNA

C<< my $taggedDNA = SimBlocks::TagDNA($pattern, $dnaString, $prefix, $suffix); >>

Convert a DNA string from the B<ContainsRegionIn> relationship to the actual
DNA, optionally with markings surrounding them. The DNA string will contain
only the values corresponding to the question marks in the pattern, 
which should be taken from the DNA string's parent B<GroupBlock>. The
resulting DNA sequence is built by copying 

=over 4

=item pattern

DNA pattern from which the positions of variance are to be taken.

=item dnaString

String containing DNA string to be marked.

=item prefix

String to be inserted before each position of variance.

=item suffix

String to be inserted after each position of variance.

=item RETURN

Returns a copy of the pattern string with the positions of variance filled
in from the DNA string. If a prefix and suffix are specified, they are
placed before and after each character copied into the pattern string.


#: Return Type $;
sub TagDNA {
    # Get the parameters.
    my ($pattern, $dnaString, $prefix, $suffix) = @_;
    # Insure the prefix and suffix behave properly.
    if (!defined $prefix) {
        $prefix = "";
    if (!defined $suffix) {
        $suffix = "";
    # Initialize the return variable.
    my $retVal = "";
    # Convert the DNA string to a list with the prefixes and suffixes
    # inserted.
    my @dnaList = map { $prefix . $_ . $suffix } split //, $dnaString;
    # Loop through the DNA string.
    my ($start, $position) = (0, 0);
    while (($position = index $pattern, "?", $start) >= 0) {
        # If we have invariant text, copy it over.
        if ($position > $start) {
            $retVal .= substr $pattern, $start, $position - $start;
        # Pop off the next position from the DNA string.
        $retVal .= shift @dnaList;
        # Start past the question mark.
        $start = $position + 1;
    # Return the result.
    return $retVal;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3