View of /Sprout/Sapling.pm

Revision 1.6
Tue Apr 21 21:22:10 2009 UTC by olson
Branch: MAIN
Changes since 1.5: +1 -1 lines
add $FIG_Config::saplingDB

#!/usr/bin/perl -w

#
# Copyright (c) 2003-2006 University of Chicago and Fellowship
# for Interpretations of Genomes. All Rights Reserved.
#
# This file is part of the SEED Toolkit.
#
# The SEED Toolkit is free software. You can redistribute
# it and/or modify it under the terms of the SEED Toolkit
# Public License.
#
# You should have received a copy of the SEED Toolkit Public License
# along with this program; if not write to the University of Chicago
# at info@ci.uchicago.edu or the Fellowship for Interpretation of
# Genomes at veronika@thefig.info or download a copy from
# http://www.theseed.org/LICENSE.TXT.
#

package Sapling;

    use strict;
    use Tracer;
    use DBKernel;
    use base 'ERDB';
    use Stats;
    use XML::Simple;

=head1 Sapling Package

Sapling Database Access Methods

=head2 Introduction

The Sapling database is a new [[ErdbPm]] database that attempts to encapsulate
our data in a portable form for distribution. It is loaded directly from the
complete genomes and trusted subsystems of the SEED. This object has minimal
capabilities: in essence, it's just enough to get the database loaded and
working. As with the earlier Sprout database, most of the work required to use
the database can be performed using the base-class methods.

The fields in this object are as follows.

=over 4

=item loadDirectory

Name of the directory containing the files used by the loaders.

=item source

Source object for the loaders (a [[FigPm]] in our case).

=item genomeHash

Reference to a hash of the genomes to include when loading.

=item subHash

Reference to a hash of the subsystems to include when loading.

=item tuning

Reference to a hash of tuning parameters.

=back

=head2 Configuration

The default loading profile for the Sapling database is to include all complete
genomes and all usable subsystems. This can be overridden by specifying a list of
genomes and subsystems in an XML configuration file named C<SaplingConfig.xml> in
the load directory. The document element should be C<Sapling>, with C<Genomes> and
C<Subsystems> sub-elements. The C<Genomes> element should contain as its text a
list of genome IDs separated by white space. The C<Subsystems> element should
contain a list of subsystem names, one per line. If either element is missing,
the default list is used for that section.

=head3 Example

The following configuration file specifies 10 genomes and 6 subsystems.

    <Sapling>
      <Genomes>
        100226.1 31033.3 31964.1 36873.1 126740.4
        155864.1 349307.7 350058.5 351348.5 412694.5
      </Genomes>
      <Subsystems>
        Sugar_utilization_in_Thermotogales
        Coenzyme_F420_hydrogenase
        Ribosome_activity_modulation
        prophage_tails
        CBSS-393130.3.peg.794
        Apigenin_derivatives
      </Subsystems>
    </Sapling>
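
As a rough illustration of how the two lists are interpreted, the sketch below
splits the element text into genome IDs and subsystem names. It assumes the
parsed configuration is a hash whose C<Genomes> and C<Subsystems> keys hold the
raw element text. The sample hash is built by hand rather than read from a file.

```perl
use strict;
use warnings;

# Hand-built stand-in for the parsed configuration (an assumption made for
# illustration; the real structure comes from the XML file).
my $config = {
    Genomes    => "\n    100226.1 31033.3 31964.1\n    155864.1 349307.7\n  ",
    Subsystems => "\n    Coenzyme_F420_hydrogenase\n\n    prophage_tails\n  ",
};

# Genome IDs are separated by any run of white space, including line breaks.
my @genomes = grep { /\S/ } split /\s+/, $config->{Genomes};

# Subsystem names come one per line; trim each line and skip blank ones.
my @subsystems = map { /^\s*(\S.*?)\s*$/ ? $1 : () }
                 split /\n/, $config->{Subsystems};

print "Genomes: @genomes\n";
print "Subsystems: @subsystems\n";
```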

The XML file also contains tuning parameters that affect the way the data
is loaded. These are specified as attributes in the C<TuningParameters> element,
as follows.

=over 4

=item maxLocationLength

The maximum number of base pairs allowed in a single location. B<IsLocatedIn>
records are split into sections of at most this length. When you are looking
for all the features in a particular neighborhood, it is therefore enough to
look for locations within this distance of the neighborhood: even a huge operon
containing tens of thousands of base pairs will have a section close enough to
be found.

=item maxSequenceLength

The maximum number of base pairs allowed in a single DNA sequence. DNA sequences
are broken into segments to prevent excessively large genomes from clogging
memory during sequence resolution.

=back
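
The splitting implied by C<maxLocationLength> can be sketched as follows. This
is only an illustration: the function name C<SplitLocation> and the
C<[start, length]> representation are assumptions, and the sketch ignores
strand direction. It is not the loader's actual code.

```perl
use strict;
use warnings;

# Split a region into consecutive pieces no longer than $maxLen base pairs.
# Returns a list of [start, length] pairs. (Hypothetical helper.)
sub SplitLocation {
    my ($start, $len, $maxLen) = @_;
    my @pieces;
    while ($len > 0) {
        # Take as much as allowed, but no more than what is left.
        my $pieceLen = ($len > $maxLen ? $maxLen : $len);
        push @pieces, [$start, $pieceLen];
        $start += $pieceLen;
        $len   -= $pieceLen;
    }
    return @pieces;
}

# A 9500-bp region split with a maximum piece length of 4000.
my @pieces = SplitLocation(1001, 9500, 4000);
# Prints "1001+4000, 5001+4000, 9001+1500".
print join(", ", map { "$_->[0]+$_->[1]" } @pieces), "\n";
```

The same chunking idea applies to C<maxSequenceLength>, with DNA strings in
place of locations.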

=head2 Special Methods

=head3 Global Section Constant

Each section of the database used by the loader corresponds to a single genome.
The global section is loaded after all the others, and is concerned with data
not related to a particular genome.

=cut

    # Name of the global section
    use constant GLOBAL => 'Globals';

=head3 Tuning Parameter Defaults

Each tuning parameter must have a default value, in case it is not present in
the XML configuration file. The defaults are specified in a constant hash
reference called C<TUNING_DEFAULTS>.
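
A minimal sketch of how defaults can be merged into configured values is shown
here. The module itself delegates this to C<Tracer::MergeOptions>; the plain
loop below is a stand-in under that assumption, and the configured value of
2500 is hypothetical.

```perl
use strict;
use warnings;

# Defaults copied from TUNING_DEFAULTS.
use constant TUNING_DEFAULTS => {
    maxLocationLength => 4000,
    maxSequenceLength => 1000000,
};

# Values that might have come from the configuration file (hypothetical).
my $tuning = { maxLocationLength => 2500 };

# Fill in any parameter the configuration did not supply.
for my $name (keys %{ TUNING_DEFAULTS() }) {
    $tuning->{$name} = TUNING_DEFAULTS->{$name} if ! exists $tuning->{$name};
}

print "$_ = $tuning->{$_}\n" for sort keys %$tuning;
```

Configured values win over defaults, so only C<maxSequenceLength> is filled in
here.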

=cut

    use constant TUNING_DEFAULTS => {
        maxLocationLength => 4000,
        maxSequenceLength => 1000000,
    };

=head3 new

    my $sap = Sapling->new(%options);

Construct a new Sapling object. The following options are supported.

=over 4

=item loadDirectory

Data directory to be used by the loaders.

=item dbd

XML database definition file.

=item dbName

Name of the database to use.

=item sock

Socket for accessing the database.

=item userData

Name and password used to log on to the database, separated by a slash.

=item dbhost

Database host name.

=back
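
The C<userData> option is split on its first slash only. The sketch below
shows the effect on a few sample strings. C<SplitUserData> is a hypothetical
helper that mirrors the two relevant lines of the constructor, and the sample
credentials are invented.

```perl
use strict;
use warnings;

# Split a "name/password" string. The limit of 2 keeps any further
# slashes inside the password. (Hypothetical helper.)
sub SplitUserData {
    my ($userData) = @_;
    my ($user, $pass) = split '/', $userData, 2;
    # A missing password is treated as an empty one.
    $pass = "" if ! defined $pass;
    return ($user, $pass);
}

for my $userData ("seed/", "seed", "loader/pa/ss") {
    my ($user, $pass) = SplitUserData($userData);
    print "user=$user pass=\"$pass\"\n";
}
```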

=cut

sub new {
    # Get the parameters.
    my ($class, %options) = @_;
    # Get the options.
    my $loadDirectory = $options{loadDirectory} || $FIG_Config::saplingData ||
                        "$FIG_Config::fig/SaplingData";
    my $dbd = $options{dbd} || "$loadDirectory/SaplingDBD.xml";
    my $dbName = $options{dbName} || $FIG_Config::saplingDB || "nmpdr_sapling";
    my $sock = $options{sock} || $FIG_Config::sproutSock || "";
    my $userData = $options{userData} || "seed/";
    my $dbhost = $options{dbhost} || $FIG_Config::saplingHost || "localhost";
    # Compute the user name and password.
    my ($user, $pass) = split '/', $userData, 2;
    $pass = "" if ! defined $pass;
    # Connect to the database.
    my $dbh = DBKernel->new('mysql', $dbName, $user, $pass, 3306, $dbhost, $sock);
    # Create the ERDB object.
    my $retVal = ERDB::new($class, $dbh, $dbd, %options);
    # Add the load directory pointer.
    $retVal->{loadDirectory} = $loadDirectory;
    # Set up the spaces for the loader source object, the subsystem hash, the
    # genome hash, and the tuning parameters.
    $retVal->{source} = undef;
    $retVal->{genomeHash} = undef;
    $retVal->{subHash} = undef;
    $retVal->{tuning} = undef;
    # Return it.
    return $retVal;
}


=head2 Public Methods

=head3 Taxonomy

    my @taxonomy = $sap->Taxonomy($genomeID);

Return the full taxonomy of the specified genome, starting from the
domain downward. The returned values will be primary names, not taxonomy
IDs.

=over 4

=item genomeID

ID of the genome whose taxonomy is desired. The genome does not need to exist
in the database: the version number will be lopped off and the result used as
an entry point into the taxonomy tree.

=item RETURN

Returns a list of taxonomy names, starting from the domain and moving
down to the node where the genome is attached.

=back
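
The walk from a genome up to its domain can be sketched without a database.
The in-memory hash below stands in for the taxonomy tables. Its IDs and names
are invented for illustration, and C<ToyTaxonomy> is a hypothetical function
that only mimics the loop structure.

```perl
use strict;
use warnings;

# Toy stand-in for the taxonomy data: ID => [domain flag, name, parent ID].
# All IDs and names are invented for this sketch.
my %taxonomy = (
    10 => [1, 'DomainX',  undef],
    20 => [0, 'GenusY',   10],
    30 => [0, 'SpeciesZ', 20],
);

sub ToyTaxonomy {
    my ($genomeID) = @_;
    # Drop the version number to get the starting node.
    my ($taxon) = split /\./, $genomeID, 2;
    my @retVal;
    my $domainFlag;
    while (! $domainFlag) {
        my $node = $taxonomy{$taxon};
        if (! $node) {
            # Unknown node: stop and return whatever we have so far.
            $domainFlag = 1;
        } else {
            my $name;
            # Replace the current ID with the parent's for the next pass.
            ($domainFlag, $name, $taxon) = @$node;
            # Prepend, so the domain ends up first.
            unshift @retVal, $name;
        }
    }
    return @retVal;
}

print join(" > ", ToyTaxonomy("30.1")), "\n";
```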

=cut

sub Taxonomy {
    # Get the parameters.
    my ($self, $genomeID) = @_;
    # Get the genome's taxonomic group.
    my ($taxon) = split /\./, $genomeID, 2;
    # We'll put the return data in here.
    my @retVal;
    # Loop until we hit a domain.
    my $domainFlag;
    while (! $domainFlag) {
        # Get the data we need for this taxonomic group.
        my ($taxonData) = $self->GetAll('TaxonomicGrouping IsInGroup',
                                        'TaxonomicGrouping(id) = ?', [$taxon],
                                        'domain scientific-name IsInGroup(to-link)');
        # If we didn't find what we're looking for, then we have a problem. This
        # would indicate a node below the domain level that doesn't have a parent
        # or (more likely) an invalid input string.
        if (! $taxonData) {
            # Terminate the loop and trace a warning.
            $domainFlag = 1;
            Trace("Could not find node or parent for \"$taxon\".") if T(1);
        } else {
            # Extract the data for the current group. Note we overwrite our
            # taxonomy ID with the ID of our parent, priming the next iteration
            # of the loop.
            my $name;
            ($domainFlag, $name, $taxon) = @$taxonData;
            # Put the current group's name in the return list.
            unshift @retVal, $name;
        }
    }
    # Return the result.
    return @retVal;
}


=head3 GenomeHash

    my $genomeHash = $sap->GenomeHash();

Return a reference to a hash of the genomes configured to be in this database.
The list is taken from the configuration file in the data directory if one is
present and has a C<Genomes> element; otherwise it is taken from the active
SEED database. The hash maps genome IDs to TRUE.

=cut

sub GenomeHash {
    # Get the parameters.
    my ($self) = @_;
    # We'll build the hash in here.
    my %genomeHash;
    # Do we already have a list?
    if (! defined $self->{genomeHash}) {
        # No, check for a configuration file.
        my $xml = $self->ReadConfigFile();
        if (defined $xml && $xml->{Genomes}) {
            # We found one and it has a genome list, so extract the genomes.
            %genomeHash = map { $_ => 1 } grep { $_ =~ /\S/ } split /\s+/, $xml->{Genomes};
        } else {
            # No, so get the genome list.
            my $fig = $self->GetSourceObject();
            my @genomes = $fig->genomes(1);
            # Verify the genome list to ensure every genome has an organism
            # directory.
            for my $genome (@genomes) {
                if (-d "$FIG_Config::organisms/$genome") {
                    $genomeHash{$genome} = 1;
                }
            }
        }
        # Store the genomes in this object.
        $self->{genomeHash} = \%genomeHash;
    }
    # Return the result.
    return $self->{genomeHash};
}

=head3 SubsystemID

    my $subID = $sap->SubsystemID($subName);

Return the ID of the subsystem with the specified name.

=over 4

=item subName

Name of the relevant subsystem. A subsystem name with underscores for spaces
will return the same ID as a subsystem name with the spaces still in it.

=item RETURN

Returns a normalized subsystem name.

=back
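
For example, the two spellings of a subsystem name below normalize to the same
ID. The sketch uses a plain function, C<NormalizeSubsystemName> (a hypothetical
name), so that it runs without a database connection.

```perl
use strict;
use warnings;

# Same normalization as SubsystemID: underscores become spaces.
sub NormalizeSubsystemName {
    my ($subName) = @_;
    (my $retVal = $subName) =~ s/_/ /g;
    return $retVal;
}

my $withUnderscores = NormalizeSubsystemName("Coenzyme_F420_hydrogenase");
my $withSpaces      = NormalizeSubsystemName("Coenzyme F420 hydrogenase");
print "Same ID\n" if $withUnderscores eq $withSpaces;
```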

=cut

sub SubsystemID {
    # Get the parameters.
    my ($self, $subName) = @_;
    # Normalize the subsystem name by converting underscores to spaces.
    my $retVal = $subName;
    $retVal =~ s/_/ /g;
    # Return the result.
    return $retVal;
}

=head3 SubsystemHash

    my $subHash = $sap->SubsystemHash();

Return a reference to a hash of the subsystems configured to be in this
database. The list is taken from the configuration file in the data directory
if one is present and has a C<Subsystems> element; otherwise it is taken from
the active SEED database. The hash maps subsystem names to TRUE.

=cut

sub SubsystemHash {
    # Get the parameters.
    my ($self) = @_;
    # We'll build the hash in here.
    my %subHash;
    # Do we already have a list?
    if (! defined $self->{subHash}) {
        # No, check for a configuration file.
        my $xml = $self->ReadConfigFile();
        if (defined $xml && $xml->{Subsystems}) {
            # We found one, and it has subsystems, so we extract them.
            # Trim leading and trailing white space from each line. A line
            # with no name on it contributes nothing to the list.
            my @subs = map { /^\s*(\S.*?)\s*$/ ? $1 : () } split /\n/, $xml->{Subsystems};
            # Here we need to clear out any null subsystem names resulting from
            # blank lines in the file.
            %subHash = map { $_ => 1 } grep { $_ } @subs;
        } else {
            # No config file, so we ask the FIG object.
            my $fig = $self->GetSourceObject();
            my @subs = map { $self->SubsystemID($_) } $fig->all_subsystems();
            %subHash = map { $_ => 1 } grep { $fig->usable_subsystem($_) } @subs;
        }
        # Store the subsystems in this object.
        $self->{subHash} = \%subHash;
    }
    # Return the result.
    return $self->{subHash};
}

=head3 TuningParameter

    my $parm = $sap->TuningParameter($parmName);

Return the value of the specified tuning parameter. Tuning parameters are
read from the XML configuration file.

=over 4

=item parmName

Name of the parameter whose value is desired.

=item RETURN

Returns the parameter value.

=back

=cut

sub TuningParameter {
    # Get the parameters.
    my ($self, $parmName) = @_;
    # Ensure we have the parameters in memory.
    if (! defined $self->{tuning}) {
        # Read the configuration file.
        my $configFile = $self->ReadConfigFile();
        # Get the tuning parameters (if any).
        my $tuning;
        if (! defined $configFile || ! exists $configFile->{TuningParameters}) {
            $tuning = {};
        } else {
            $tuning = $configFile->{TuningParameters};
        }
        # Merge in the default option values.
        Tracer::MergeOptions($tuning, TUNING_DEFAULTS);
        # Save the result in our object.
        $self->{tuning} = $tuning;
    }
    # Extract the tuning parameter.
    my $retVal = $self->{tuning}{$parmName};
    # Throw an error if it does not exist.
    Confess("Invalid tuning parameter \"$parmName\".") if ! defined $retVal;
    # Return the result.
    return $retVal;
}


=head3 ReadConfigFile

    my $xmlObject = $sap->ReadConfigFile();

Return the hash structure created from reading the configuration file, or
an undefined value if the file is not found.

=cut

sub ReadConfigFile {
    my ($self) = @_;
    # Declare the return variable.
    my $retVal;
    # Compute the configuration file name.
    my $fileName = "$self->{loadDirectory}/SaplingConfig.xml";
    # Did we find it?
    if (-f $fileName) {
        # Yes, read it in.
        $retVal = XMLin($fileName);
    }
    # Return the result.
    return $retVal;
}

=head3 GlobalSection

    my $flag = $sap->GlobalSection($name);

Return TRUE if the specified section name is the global section, FALSE
otherwise.

=over 4

=item name

Section name to test.

=item RETURN

Returns TRUE if the parameter matches the GLOBAL constant, else FALSE.

=back

=cut

sub GlobalSection {
    # Get the parameters.
    my ($self, $name) = @_;
    # Return the result.
    return ($name eq GLOBAL);
}


=head2 Virtual Methods

=head3 GetSourceObject

    my $source = $erdb->GetSourceObject();

Return the object to be used in creating load files for this database. This is
only the default source object. Loaders have the option of overriding the chosen
source object when constructing the [[ERDBLoadGroupPm]] objects.

=cut

sub GetSourceObject {
    my ($self) = @_;
    # Ensure the source object exists in our internal cache.
    if (! defined $self->{source}) {
        # We require the FIG object. If the user has no intention of
        # doing a load, this method won't be used, so he won't need to
        # have the FIG object on his system.
        require FIG;
        $self->{source} = FIG->new();
    }
    # Return it to the caller.
    return $self->{source};
}

=head3 SectionList

    my @sections = $erdb->SectionList();

Return a list of the names for the different data sections used when loading this database.
For Sapling there is one section per genome, named by genome ID and sorted as strings,
followed by the global section.
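
A small sketch of the list this method builds for Sapling, using a hand-made
genome hash in the shape returned by L</GenomeHash>. The three genome IDs are
taken from the configuration example; everything else is illustrative.

```perl
use strict;
use warnings;

use constant GLOBAL => 'Globals';

# Hypothetical genome hash, in the shape returned by GenomeHash.
my %genomes = map { $_ => 1 } qw(31033.3 100226.1 349307.7);

# One section per genome, in string-sorted order.
my @sections = sort keys %genomes;
# The global section always comes last.
push @sections, GLOBAL;

# Prints "100226.1 31033.3 349307.7 Globals".
print "@sections\n";
```

Note that the sort is a string comparison, so C<100226.1> precedes C<31033.3>.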

=cut

sub SectionList {
    # Get the parameters.
    my ($self) = @_;
    # Get the genome hash.
    my $genomes = $self->GenomeHash();
    # Create one section per genome.
    my @retVal = sort keys %$genomes;
    # Append the global section.
    push @retVal, GLOBAL;
    # Return the section list.
    return @retVal;
}

=head3 Loader

    my $groupLoader = $erdb->Loader($groupName, $options);

Return an [[ERDBLoadGroupPm]] object for the specified load group. This method is used
by [[ERDBGeneratorPl]] to create the load group objects. If you are not using
[[ERDBGeneratorPl]], you don't need to override this method.

=over 4

=item groupName

Name of the load group whose object is to be returned. The group name is
guaranteed to be a single word with only the first letter capitalized.

=item options

Reference to a hash of command-line options.

=item RETURN

Returns an [[ERDBLoadGroupPm]] object that can be used to process the specified load group
for this database.

=back
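
The name-based construction can be sketched as below. The inline
C<GenomeSaplingLoader> package is a hypothetical stub standing in for a real
loader class, which would live in its own file and be pulled in with
C<require>. C<MakeLoader> is likewise an illustrative name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a loader class.
package GenomeSaplingLoader;
sub new {
    my ($class, $erdb, $options) = @_;
    return bless { erdb => $erdb, options => $options }, $class;
}

package main;

# Build the class name from the group name and construct the object.
# The block form of eval traps errors raised by the constructor call.
sub MakeLoader {
    my ($erdb, $groupName, $options) = @_;
    my $loaderClass = "${groupName}SaplingLoader";
    my $retVal = eval { $loaderClass->new($erdb, $options) };
    die "Could not create $loaderClass object: $@" if $@;
    return $retVal;
}

my $loader = MakeLoader({}, 'Genome', { trace => 2 });
print ref($loader), "\n";
```

A group name with no matching class makes the constructor call fail, and the
error is reported with the computed class name.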

=cut

sub Loader {
    # Get the parameters.
    my ($self, $groupName, $options) = @_;
    # Compute the loader name.
    my $loaderClass = "${groupName}SaplingLoader";
    # Pull in its definition.
    require "$loaderClass.pm";
    # Create an object for it. The block form of eval traps any error
    # without compiling a string at run time.
    my $retVal = eval { $loaderClass->new($self, $options) };
    # Ensure it worked.
    Confess("Could not create $loaderClass object: $@") if $@;
    # Return it to the caller.
    return $retVal;
}

=head3 LoadGroupList

    my @groups = $erdb->LoadGroupList();

Returns a list of the names for this database's load groups. This method is used
by [[ERDBGeneratorPl]] when the user wishes to load all table groups. For Sapling
the groups are C<Genome>, C<Feature>, C<Subsystem>, C<Family>, C<Scenario>, and C<Model>.

=cut

sub LoadGroupList {
    # Return the list.
    return qw(Genome Feature Subsystem Family Scenario Model); # ##TODO Drug, Protein
}

=head3 LoadDirectory

    my $dirName = $erdb->LoadDirectory();

Return the name of the directory in which load files are kept. For Sapling this
is the load directory established by the constructor.

=cut

sub LoadDirectory {
    # Get the parameters.
    my ($self) = @_;
    # Return the directory name.
    return $self->{loadDirectory};
}


1;
