
View of /Sprout/SproutLoad.pm



Revision 1.9
Wed Sep 14 11:21:24 2005 UTC (14 years, 2 months ago) by parrello
Branch: MAIN
Changes since 1.8: +1 -1 lines
*** empty log message ***

#!/usr/bin/perl -w

package SproutLoad;

    use strict;
    use Tracer;
    use PageBuilder;
    use ERDBLoad;
    use FIG;
    use Sprout;
    use Stats;
    use BasicLocation;

=head1 Sprout Load Methods

=head2 Introduction

This object contains the methods needed to copy data from the FIG data store to the
Sprout database. It makes heavy use of the ERDBLoad object to manage the load into
individual tables. The client can create an instance of this object and then
call methods for each group of tables to load. For example, the following code will
load the Genome- and Feature-related tables. (It is presumed the first command line
parameter contains the name of a file specifying the genomes.)

    my $fig = FIG->new();
    my $sprout = SFXlate->new_sprout_only();
    my $spl = SproutLoad->new($sprout, $fig, $ARGV[0]);
    my $stats = $spl->LoadGenomeData();
    $stats->Accumulate($spl->LoadFeatureData());
    print $stats->Show();

This module makes use of the internal Sprout property C<_erdb>.

It is worth noting that the FIG object does not need to be a real one. Any object
that implements the FIG methods for data retrieval could be used. So, for example,
this object could be used to copy data from one Sprout database to another, or
from any FIG-compliant data store implemented in the future.

To ensure that this is possible, each time the FIG object is used, it will be via
a variable called C<$fig>. This makes it fairly straightforward to determine which
FIG methods are required to load the Sprout database.

This object creates the load files; however, the tables are not created until it
is time to actually do the load from the files into the target database.
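
The point about substitute FIG objects can be illustrated with a small sketch. The
C<MockFIG> class below is a hypothetical stand-in invented for this example, and its
genome data is made up. It implements only three of the retrieval methods this module
calls through C<$fig>; a usable substitute would have to supply all of them.

    package MockFIG;
    use strict;
    use warnings;

    # Made-up data: one genome with a single contig.
    my %GENOMES = ('99999.1' => { name    => 'Examplus fictus strain X',
                                  contigs => ['contig1'] });

    sub new { my ($class) = @_; return bless {}, $class; }
    # Return the known genome IDs (the "complete only" flag is ignored here).
    sub genomes { return sort keys %GENOMES; }
    # Return the scientific name of a genome.
    sub genus_species { my ($self, $id) = @_; return $GENOMES{$id}{name}; }
    # Return the contig IDs of a genome.
    sub all_contigs { my ($self, $id) = @_; return @{$GENOMES{$id}{contigs}}; }

    package main;

    my $fig = MockFIG->new();
    for my $genomeID ($fig->genomes(1)) {
        my ($genus, $species, @extra) = split / /, $fig->genus_species($genomeID);
        my @contigs = $fig->all_contigs($genomeID);
        print "$genomeID: $genus $species, ", scalar(@contigs), " contig(s)\n";
    }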

=cut

#: Constructor SproutLoad->new();

=head2 Public Methods

=head3 new

C<< my $spl = SproutLoad->new($sprout, $fig, $genomeFile, $subsysFile, $options); >>

Construct a new Sprout Loader object, specifying the two participating databases and
the names of the files containing the lists of genomes and subsystems to use.

=over 4

=item sprout

Sprout object representing the target database. This also specifies the directory to
be used for creating the load files.

=item fig

FIG object representing the source data store from which the data is to be taken.

=item genomeFile

Either the name of the file containing the list of genomes to load or a reference to
a hash of genome IDs to access codes. If nothing is specified, all complete genomes
will be loaded and the access code will default to 1. The genome list is presumed
to be all-inclusive. In other words, all existing data in the target database will
be deleted and replaced with the data on the specified genomes. If a file is specified,
it should contain one genome ID and access code per line, tab-separated.

=item subsysFile

Either the name of the file containing the list of trusted subsystems or a reference
to a list of subsystem names. If nothing is specified, all known subsystems will be
considered trusted. Only subsystem data related to the trusted subsystems is loaded.

=item options

Reference to a hash of command-line options.

=back
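
The genome file format described above (one genome ID per line, optionally followed
by a tab and an access code) could be parsed roughly as follows. This is a stand-alone
sketch: C<ParseGenomeLines> is a hypothetical helper that is not part of this module,
and it assumes the lines have already been chomped.

    use strict;
    use warnings;

    # Hypothetical helper: convert chomped "genomeID<TAB>accessCode" lines
    # into a reference to a hash of genome IDs to access codes.
    sub ParseGenomeLines {
        my (@lines) = @_;
        my %genomes;
        for my $line (@lines) {
            my ($genomeID, $accessCode) = split /\t/, $line;
            # An omitted access code defaults to 1.
            $accessCode = 1 if ! defined $accessCode;
            $genomes{$genomeID} = $accessCode;
        }
        return \%genomes;
    }

    # The second line has no access code, so it receives the default.
    my $genomes = ParseGenomeLines("83333.1\t2", "224308.1");
    print "$_ has access code $genomes->{$_}\n" for sort keys %{$genomes};

A hash built this way has the same shape as the hash reference that the constructor
accepts in place of a file name.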

=cut

sub new {
    # Get the parameters.
    my ($class, $sprout, $fig, $genomeFile, $subsysFile, $options) = @_;
    # Load the list of genomes into a hash.
    my %genomes;
    if (! defined($genomeFile) || $genomeFile eq '') {
        # Here we want all the complete genomes and an access code of 1.
        my @genomeList = $fig->genomes(1);
        %genomes = map { $_ => 1 } @genomeList;
    } else {
        my $type = ref $genomeFile;
        Trace("Genome file parameter type is \"$type\".") if T(3);
        if ($type eq 'HASH') {
            # Here the user specified a hash of genome IDs to access codes, which is
            # exactly what we want.
            %genomes = %{$genomeFile};
        } elsif (! $type || $type eq 'SCALAR' ) {
            # The caller specified a file, so read the genomes from the file. (Note
            # that "ref" returns an empty string for a plain file name.)
            my @genomeList = Tracer::GetFile($genomeFile);
            if (! @genomeList) {
                # It's an error if the genome file is empty or not found.
                Confess("No genomes found in file \"$genomeFile\".");
            } else {
                # We build the genome Hash using a loop rather than "map" so that
                # an omitted access code can be defaulted to 1.
                for my $genomeLine (@genomeList) {
                    my ($genomeID, $accessCode) = split("\t", $genomeLine);
                    if (! defined $accessCode) {
                        $accessCode = 1;
                    }
                    $genomes{$genomeID} = $accessCode;
                }
            }
        } else {
            Confess("Invalid genome parameter ($type) in SproutLoad constructor.");
        }
    }
    # Load the list of trusted subsystems.
    my %subsystems = ();
    if (! defined $subsysFile || $subsysFile eq '') {
        # Here we want all the subsystems.
        %subsystems = map { $_ => 1 } $fig->all_subsystems();
    } else {
        my $type = ref $subsysFile;
        if ($type eq 'ARRAY') {
            # Here the user passed in a list of subsystems.
            %subsystems = map { $_ => 1 } @{$subsysFile};
        } elsif (! $type || $type eq 'SCALAR') {
            # Here the list of subsystems is in a file.
            if (! -e $subsysFile) {
                # It's an error if the file does not exist.
                Confess("Trusted subsystem file not found.");
            } else {
                # GetFile automatically chomps end-of-line characters, so this
                # is an easy task.
                %subsystems = map { $_ => 1 } Tracer::GetFile($subsysFile);
            }
        } else {
            Confess("Invalid subsystem parameter in SproutLoad constructor.");
        }
    }
    # Get the data directory from the Sprout object.
    my ($directory) = $sprout->LoadInfo();
    # Create the Sprout load object.
    my $retVal = {
                  fig => $fig,
                  genomes => \%genomes,
                  subsystems => \%subsystems,
                  sprout => $sprout,
                  loadDirectory => $directory,
                  erdb => $sprout->{_erdb},
                  loaders => [],
                  options => $options
                 };
    # Bless and return it.
    bless $retVal, $class;
    return $retVal;
}

=head3 LoadGenomeData

C<< my $stats = $spl->LoadGenomeData(); >>

Load the Genome, Contig, and Sequence data from FIG into Sprout.

The Sequence table is the largest single relation in the Sprout database, so this
method is expected to be slow and clumsy. At some point we will need to make it
restartable, since an error 10 gigabytes through a 20-gigabyte load is bound to be
very annoying otherwise.
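
Each contig is written to the Sequence table in chunks no longer than the maximum
sequence size. The arithmetic can be sketched as follows. C<ChunkBounds> is a
hypothetical helper written for this illustration; in the method itself the chunk size
comes from the Sprout object's C<MaxSequence> method. Positions are 1-based.

    use strict;
    use warnings;

    # Hypothetical helper: return [start, end, length] triples that cover
    # a contig of $contigLen bases in chunks of at most $chunkSize bases.
    sub ChunkBounds {
        my ($contigLen, $chunkSize) = @_;
        my @chunks;
        for (my $i = 1; $i <= $contigLen; $i += $chunkSize) {
            # The last chunk is cut short at the end of the contig.
            my $end = $i + $chunkSize - 1;
            $end = $contigLen if $end > $contigLen;
            push @chunks, [$i, $end, $end + 1 - $i];
        }
        return @chunks;
    }

    # A 10-base contig in 4-base chunks is covered by 1-4, 5-8, and 9-10.
    print join(", ", map { "$_->[0]-$_->[1]" } ChunkBounds(10, 4)), "\n";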

The following relations are loaded by this method.

    Genome
    HasContig
    Contig
    IsMadeUpOf
    Sequence

=over 4

=item RETURNS

Returns a statistics object for the loads.

=back

B<TO DO>

Real quality vectors instead of C<unknown> for everything.

GenomeGroup relation. (The original script took group information from the C<NMPDR> file
in each genome's main directory, but no such file exists anywhere in my version of the
data store.)

=cut
#: Return Type $%;
sub LoadGenomeData {
    # Get this object instance.
    my ($self) = @_;
    # Get the FIG object.
    my $fig = $self->{fig};
    # Get the genome count.
    my $genomeHash = $self->{genomes};
    my $genomeCount = (keys %{$genomeHash});
    Trace("Beginning genome data load.") if T(2);
    # Create load objects for each of the tables we're loading.
    my $loadGenome = $self->_TableLoader('Genome', $genomeCount);
    my $loadHasContig = $self->_TableLoader('HasContig', $genomeCount * 300);
    my $loadContig = $self->_TableLoader('Contig', $genomeCount * 300);
    my $loadIsMadeUpOf = $self->_TableLoader('IsMadeUpOf', $genomeCount * 60000);
    my $loadSequence = $self->_TableLoader('Sequence', $genomeCount * 60000);
    # Now we loop through the genomes, generating the data for each one.
    for my $genomeID (sort keys %{$genomeHash}) {
        Trace("Loading data for genome $genomeID.") if T(3);
        $loadGenome->Add("genomeIn");
        # The access code comes in via the genome hash.
        my $accessCode = $genomeHash->{$genomeID};
        # Get the genus, species, and strain from the scientific name. Note that we append
        # the genome ID to the strain. In some cases this is the totality of the strain name.
        my ($genus, $species, @extraData) = split / /, $fig->genus_species($genomeID);
        my $extra = join " ", @extraData, "[$genomeID]";
        # Get the full taxonomy.
        my $taxonomy = $fig->taxonomy_of($genomeID);
        # Output the genome record.
        $loadGenome->Put($genomeID, $accessCode, $fig->is_complete($genomeID), $genus,
                         $species, $extra, $taxonomy);
        # Now we loop through each of the genome's contigs.
        my @contigs = $fig->all_contigs($genomeID);
        for my $contigID (@contigs) {
            Trace("Processing contig $contigID for $genomeID.") if T(4);
            $loadContig->Add("contigIn");
            $loadSequence->Add("contigIn");
            # Create the contig ID.
            my $sproutContigID = "$genomeID:$contigID";
            # Create the contig record and relate it to the genome.
            $loadContig->Put($sproutContigID);
            $loadHasContig->Put($genomeID, $sproutContigID);
            # Now we need to split the contig into sequences. The maximum sequence size is
            # a property of the Sprout object.
            my $chunkSize = $self->{sprout}->MaxSequence();
            # Now we get the sequence a chunk at a time.
            my $contigLen = $fig->contig_ln($genomeID, $contigID);
            for (my $i = 1; $i <= $contigLen; $i += $chunkSize) {
                $loadSequence->Add("chunkIn");
                # Compute the endpoint of this chunk.
                my $end = FIG::min($i + $chunkSize - 1, $contigLen);
                # Get the actual DNA.
                my $dna = $fig->get_dna($genomeID, $contigID, $i, $end);
                # Compute the sequenceID.
                my $seqID = "$sproutContigID.$i";
                # Write out the data. For now, the quality vector is always "unknown".
                $loadIsMadeUpOf->Put($sproutContigID, $seqID, $end + 1 - $i, $i);
                $loadSequence->Put($seqID, "unknown", $dna);
            }
        }
    }
    # Finish the loads.
    my $retVal = $self->_FinishAll();
    # Return the result.
    return $retVal;
}

=head3 LoadCouplingData

C<< my $stats = $spl->LoadCouplingData(); >>

Load the coupling and evidence data from FIG into Sprout.

The coupling data specifies which genome features are functionally coupled. The
evidence data explains why the coupling is functional.
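
Because a coupling between A and B is reported from both sides, each pair should be
stored only once. A minimal sketch of one way to do that: build an ID that does not
depend on the order of the pair and remember the IDs already seen. The C<PairID>
helper is a hypothetical stand-in that sorts the two feature IDs and joins them with
a space; that format is an assumption and may not match what C<Sprout::CouplingID>
produces. The coupling data is made up.

    use strict;
    use warnings;

    # Hypothetical order-independent ID for a pair of features.
    sub PairID {
        my ($peg1, $peg2) = @_;
        return join(" ", sort($peg1, $peg2));
    }

    # Each entry is [peg1, peg2, score]. The second entry repeats the first
    # in the opposite direction.
    my @couplings = (['fig|11111.1.peg.1', 'fig|11111.1.peg.2', 30],
                     ['fig|11111.1.peg.2', 'fig|11111.1.peg.1', 30],
                     ['fig|11111.1.peg.1', 'fig|11111.1.peg.3', 12]);
    my %dupHash;
    my @kept;
    for my $coupling (@couplings) {
        my ($peg1, $peg2, $score) = @{$coupling};
        my $id = PairID($peg1, $peg2);
        # Skip a pair that has already been recorded.
        next if exists $dupHash{$id};
        $dupHash{$id} = $score;
        push @kept, $id;
    }
    print scalar(@kept), " unique couplings\n";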

The following relations are loaded by this method.

    Coupling
    IsEvidencedBy
    PCH
    ParticipatesInCoupling
    UsesAsEvidence

=over 4

=item RETURNS

Returns a statistics object for the loads.

=back

=cut
#: Return Type $%;
sub LoadCouplingData {
    # Get this object instance.
    my ($self) = @_;
    # Get the FIG object.
    my $fig = $self->{fig};
    # Get the genome hash.
    my $genomeFilter = $self->{genomes};
    my $genomeCount = (keys %{$genomeFilter});
    my $featureCount = $genomeCount * 4000;
    # Start the loads.
    my $loadCoupling = $self->_TableLoader('Coupling', $featureCount * $genomeCount);
    my $loadIsEvidencedBy = $self->_TableLoader('IsEvidencedBy', $featureCount * 8000);
    my $loadPCH = $self->_TableLoader('PCH', $featureCount * 2000);
    my $loadParticipatesInCoupling = $self->_TableLoader('ParticipatesInCoupling', $featureCount * 2000);
    my $loadUsesAsEvidence = $self->_TableLoader('UsesAsEvidence', $featureCount * 8000);
    Trace("Beginning coupling data load.") if T(2);
    # Loop through the genomes found.
    for my $genome (sort keys %{$genomeFilter}) {
        Trace("Generating coupling data for $genome.") if T(3);
        $loadCoupling->Add("genomeIn");
        # Create a hash table for holding coupled pairs. We use this to prevent
        # duplicates. For example, if A is coupled to B, we don't want to also
        # assert that B is coupled to A, because we already know it. Fortunately,
        # all couplings occur within a genome, so we can keep the hash table
        # size reasonably small.
        my %dupHash = ();
        # Get all of the genome's PEGs.
        my @pegs = $fig->pegs_of($genome);
        # Loop through the PEGs.
        for my $peg1 (@pegs) {
            $loadCoupling->Add("pegIn");
            Trace("Processing PEG $peg1 for $genome.") if T(4);
            # Get a list of the coupled PEGs.
            my @couplings = $fig->coupled_to($peg1);
            # For each coupled PEG, we check whether the coupling has already
            # been recorded. If not, we create it.
            for my $coupleData (@couplings) {
                my ($peg2, $score) = @{$coupleData};
                # Compute the coupling ID.
                my $coupleID = Sprout::CouplingID($peg1, $peg2);
                if (! exists $dupHash{$coupleID}) {
                    $loadCoupling->Add("couplingIn");
                    # Here we have a new coupling to store in the load files.
                    Trace("Storing coupling ($coupleID) with score $score.") if T(4);
                    # Ensure we don't do this again.
                    $dupHash{$coupleID} = $score;
                    # Write the coupling record.
                    $loadCoupling->Put($coupleID, $score);
                    # Connect it to the coupled PEGs.
                    $loadParticipatesInCoupling->Put($peg1, $coupleID, 1);
                    $loadParticipatesInCoupling->Put($peg2, $coupleID, 2);
                    # Get the evidence for this coupling.
                    my @evidence = $fig->coupling_evidence($peg1, $peg2);
                    # Organize the evidence into a hash table.
                    my %evidenceMap = ();
                    # Process each evidence item.
                    for my $evidenceData (@evidence) {
                        $loadPCH->Add("evidenceIn");
                        my ($peg3, $peg4, $usage) = @{$evidenceData};
                        # Only proceed if the evidence is from a Sprout
                        # genome.
                        if ($genomeFilter->{$fig->genome_of($peg3)}) {
                            $loadUsesAsEvidence->Add("evidenceChosen");
                            my $evidenceKey = "$coupleID $peg3 $peg4";
                            # We store this evidence in the hash if the usage
                            # is nonzero or no prior evidence has been found. This
                            # ensures that if there is duplicate evidence, we
                            # at least keep the meaningful ones. Only evidence in
                            # the hash makes it to the output.
                            if ($usage || ! exists $evidenceMap{$evidenceKey}) {
                                $evidenceMap{$evidenceKey} = $evidenceData;
                            }
                        }
                    }
                    for my $evidenceID (keys %evidenceMap) {
                        # Create the evidence record.
                        my ($peg3, $peg4, $usage) = @{$evidenceMap{$evidenceID}};
                        $loadPCH->Put($evidenceID, $usage);
                        # Connect it to the coupling.
                        $loadIsEvidencedBy->Put($coupleID, $evidenceID);
                        # Connect it to the features.
                        $loadUsesAsEvidence->Put($evidenceID, $peg3, 1);
                        $loadUsesAsEvidence->Put($evidenceID, $peg4, 2);
                    }
                }
            }
        }
    }
    # All done. Finish the load.
    my $retVal = $self->_FinishAll();
    return $retVal;
}

=head3 LoadFeatureData

C<< my $stats = $spl->LoadFeatureData(); >>

Load the feature data from FIG into Sprout.

Features represent annotated genes, and are therefore the heart of the data store.
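
When features are related to contig locations, each location is split so that no
piece exceeds the maximum segment size. The sketch below shows the idea on a plain
start-and-length region. C<SplitRegion> is a hypothetical helper invented for this
example; it ignores strand direction, which the module leaves to the BasicLocation
object, and the contig ID is made up.

    use strict;
    use warnings;

    # Hypothetical helper: split a region of $length bases starting at $left
    # into [contigID, left, length] pieces of at most $maxSegment bases.
    sub SplitRegion {
        my ($contigID, $left, $length, $maxSegment) = @_;
        my @pieces;
        # Peel full-sized pieces off the front until the remainder fits.
        while ($length > $maxSegment) {
            push @pieces, [$contigID, $left, $maxSegment];
            $left += $maxSegment;
            $length -= $maxSegment;
        }
        push @pieces, [$contigID, $left, $length];
        return @pieces;
    }

    # A 2500-base region with a 1000-base limit yields three pieces. The
    # running index mirrors the location index stored with each piece.
    my $i = 1;
    for my $piece (SplitRegion('99999.1:contig1', 401, 2500, 1000)) {
        print join("\t", $i++, @{$piece}), "\n";
    }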

The following relations are loaded by this method.

    Feature
    FeatureAlias
    FeatureLink
    FeatureTranslation
    FeatureUpstream
    IsLocatedIn

=over 4

=item RETURNS

Returns a statistics object for the loads.

=back

=cut
#: Return Type $%;
sub LoadFeatureData {
    # Get this object instance.
    my ($self) = @_;
    # Get the FIG object.
    my $fig = $self->{fig};
    # Find out if this is a limited run.
    my $limited = $self->{options}->{limitedFeatures};
    # Get the table of genome IDs.
    my $genomeHash = $self->{genomes};
    my $genomeCount = (keys %{$genomeHash});
    my $featureCount = $genomeCount * 4000;
    # Create load objects for each of the tables we're loading.
    my $loadFeature = $self->_TableLoader('Feature', $featureCount);
    my $loadIsLocatedIn = $self->_TableLoader('IsLocatedIn', $featureCount);
    my ($loadFeatureAlias, $loadFeatureLink, $loadFeatureTranslation, $loadFeatureUpstream);
    if (! $limited) {
        $loadFeatureAlias = $self->_TableLoader('FeatureAlias', $featureCount * 6);
        $loadFeatureLink = $self->_TableLoader('FeatureLink', $featureCount * 10);
        $loadFeatureTranslation = $self->_TableLoader('FeatureTranslation', $featureCount);
        $loadFeatureUpstream = $self->_TableLoader('FeatureUpstream', $featureCount);
    }
    # Get the maximum segment size. We need this later for splitting up the
    # locations.
    my $chunkSize = $self->{sprout}->MaxSegment();
    Trace("Beginning feature data load.") if T(2);
    # Now we loop through the genomes, generating the data for each one.
    for my $genomeID (sort keys %{$genomeHash}) {
        Trace("Loading features for genome $genomeID.") if T(3);
        $loadFeature->Add("genomeIn");
        # Get the feature list for this genome.
        my $features = $fig->all_features_detailed($genomeID);
        # Loop through the features.
        for my $featureData (@{$features}) {
            $loadFeature->Add("featureIn");
            # Split the tuple.
            my ($featureID, $locations, $aliases, $type) = @{$featureData};
            # Create the feature record.
            $loadFeature->Put($featureID, 1, $type);
            # The next stuff is for a full load only.
            if (! $limited) {
                # Create the aliases.
                for my $alias (split /\s*,\s*/, $aliases) {
                    $loadFeatureAlias->Put($featureID, $alias);
                }
                # Get the links.
                my @links = $fig->fid_links($featureID);
                for my $link (@links) {
                    $loadFeatureLink->Put($featureID, $link);
                }
                # If this is a peg, generate the translation and the upstream.
                if ($type eq 'peg') {
                    $loadFeatureTranslation->Add("pegIn");
                    my $translation = $fig->get_translation($featureID);
                    if ($translation) {
                        $loadFeatureTranslation->Put($featureID, $translation);
                    }
                    # We use the default upstream values of u=200 and c=100.
                    my $upstream = $fig->upstream_of($featureID, 200, 100);
                    if ($upstream) {
                        $loadFeatureUpstream->Put($featureID, $upstream);
                    }
                }
            }
            # This part is the roughest. We need to relate the features to contig
            # locations, and the locations must be split so that none of them exceed
            # the maximum segment size. This simplifies the genes_in_region processing
            # for Sprout.
            my @locationList = map { "$genomeID:$_" } split /\s*,\s*/, $locations;
            # Create the location position indicator.
            my $i = 1;
            # Loop through the locations.
            for my $location (@locationList) {
                # Parse the location.
                my $locObject = BasicLocation->new($location);
                # Split it into a list of chunks.
                my @locOList = ();
                while (my $peeling = $locObject->Peel($chunkSize)) {
                    $loadIsLocatedIn->Add("peeling");
                    push @locOList, $peeling;
                }
                push @locOList, $locObject;
                # Loop through the chunks, creating IsLocatedIn records. The variable
                # "$i" will be used to keep the location index.
                for my $locChunk (@locOList) {
                    $loadIsLocatedIn->Put($featureID, $locChunk->Contig, $locChunk->Left,
                                          $locChunk->Dir, $locChunk->Length, $i);
                    $i++;
                }
            }
        }
    }
    # Finish the loads.
    my $retVal = $self->_FinishAll();
    return $retVal;
}

=head3 LoadBBHData

C<< my $stats = $spl->LoadBBHData(); >>

Load the bidirectional best hit data from FIG into Sprout.

Sprout does not store information on similarities. Instead, it has only the
bi-directional best hits. Even so, the BBH table is one of the largest in
the database.
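
Only hits whose target feature belongs to one of the genomes being loaded are kept.
A rough sketch of that filter follows. C<GenomeOf> is a hypothetical stand-in for the
FIG C<genome_of> method and assumes feature IDs of the form
C<fig|genomeID.type.number>. The genome IDs, hits, and scores are made up.

    use strict;
    use warnings;

    # Hypothetical stand-in for genome_of: extract the genome ID from a
    # feature ID, or return undef if the ID has an unexpected format.
    sub GenomeOf {
        my ($featureID) = @_;
        my ($genomeID) = $featureID =~ /^fig\|(\d+\.\d+)\./;
        return $genomeID;
    }

    # Genomes being loaded, mapped to access codes.
    my %genomeHash = ('11111.1' => 1, '22222.1' => 1);
    # Made-up best hits for one feature: [targetID, score].
    my @bbhList = (['fig|22222.1.peg.7', 1e-40],
                   ['fig|33333.1.peg.9', 1e-35]);
    # Keep only the hits whose target genome is in the load set.
    my @kept = grep {
        my $genomeID = GenomeOf($_->[0]);
        defined $genomeID && $genomeHash{$genomeID};
    } @bbhList;
    print "$_->[0]\n" for @kept;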

The following relations are loaded by this method.

    IsBidirectionalBestHitOf

=over 4

=item RETURNS

Returns a statistics object for the loads.

=back

=cut
#: Return Type $%;
sub LoadBBHData {
    # Get this object instance.
    my ($self) = @_;
    # Get the FIG object.
    my $fig = $self->{fig};
    # Get the table of genome IDs.
    my $genomeHash = $self->{genomes};
    my $genomeCount = (keys %{$genomeHash});
    my $featureCount = $genomeCount * 4000;
    # Create load objects for each of the tables we're loading.
    my $loadIsBidirectionalBestHitOf = $self->_TableLoader('IsBidirectionalBestHitOf',
                                                           $featureCount * $genomeCount);
    Trace("Beginning BBH load.") if T(2);
    # Now we loop through the genomes, generating the data for each one.
    for my $genomeID (sort keys %{$genomeHash}) {
        $loadIsBidirectionalBestHitOf->Add("genomeIn");
        Trace("Processing features for genome $genomeID.") if T(3);
        # Get the feature list for this genome.
        my $features = $fig->all_features_detailed($genomeID);
        # Loop through the features.
        for my $featureData (@{$features}) {
            # Split the tuple.
            my ($featureID, $locations, $aliases, $type) = @{$featureData};
            # Get the bi-directional best hits.
            my @bbhList = $fig->bbhs($featureID);
            for my $bbhEntry (@bbhList) {
                # Get the target feature ID and the score.
                my ($targetID, $score) = @{$bbhEntry};
                # Check the target feature's genome.
                my $targetGenomeID = $fig->genome_of($targetID);
                # Only proceed if it's one of our genomes.
                if ($genomeHash->{$targetGenomeID}) {
                    $loadIsBidirectionalBestHitOf->Put($featureID, $targetID, $targetGenomeID,
                                                       $score);
                }
            }
        }
    }
    # Finish the loads.
    my $retVal = $self->_FinishAll();
    return $retVal;
}

=head3 LoadSubsystemData

C<< my $stats = $spl->LoadSubsystemData(); >>

Load the subsystem data from FIG into Sprout.

Subsystems are groupings of genetic roles that work together to effect a specific
chemical reaction. Similar organisms require similar subsystems. To curate a subsystem,
a spreadsheet is created with genomes on one axis and subsystem roles on the other
axis. Similar features are then mapped into the cells, allowing the annotation of one
genome's roles to be used to assist in the annotation of others.
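
The spreadsheet can be pictured as one cell per genome and role, with a cell record
created only when the cell contains features. The following sketch builds cell IDs
from the subsystem ID, the genome ID, and the role's column index. The subsystem
name, roles, and features are made up for the example.

    use strict;
    use warnings;

    my $subsysID = 'Example_subsystem';
    my @roles = ('Role A', 'Role B', 'Role C');
    # Made-up spreadsheet: genome ID to one feature list per role column.
    my %sheet = ('11111.1' => [['fig|11111.1.peg.1'],
                               [],
                               ['fig|11111.1.peg.5', 'fig|11111.1.peg.6']]);
    my @cells;
    for my $genomeID (sort keys %sheet) {
        for (my $i = 0; $i <= $#roles; $i++) {
            my @pegs = @{$sheet{$genomeID}[$i]};
            # Empty cells produce no record.
            next if ! @pegs;
            # Record the cell ID, its role, and the number of features in it.
            push @cells, ["$subsysID:$genomeID:$i", $roles[$i], scalar @pegs];
        }
    }
    print join("\t", @{$_}), "\n" for @cells;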

The following relations are loaded by this method.

    Subsystem
    Role
    SSCell
    ContainsFeature
    IsGenomeOf
    IsRoleOf
    OccursInSubsystem
    ParticipatesIn
    HasSSCell

=over 4

=item RETURNS

Returns a statistics object for the loads.

=back

B<TO DO>

Generate RoleName table?

=cut
#: Return Type $%;
sub LoadSubsystemData {
    # Get this object instance.
    my ($self) = @_;
    # Get the FIG object.
    my $fig = $self->{fig};
    # Get the genome hash. We'll use it to filter the genomes in each
    # spreadsheet.
    my $genomeHash = $self->{genomes};
    # Get the subsystem hash. This lists the subsystems we'll process.
    my $subsysHash = $self->{subsystems};
    my @subsysIDs = sort keys %{$subsysHash};
    my $subsysCount = @subsysIDs;
    my $genomeCount = (keys %{$genomeHash});
    my $featureCount = $genomeCount * 4000;
    # Create load objects for each of the tables we're loading.
    my $loadSubsystem = $self->_TableLoader('Subsystem', $subsysCount);
    my $loadRole = $self->_TableLoader('Role', $featureCount * 6);
    my $loadSSCell = $self->_TableLoader('SSCell', $featureCount * $genomeCount);
    my $loadContainsFeature = $self->_TableLoader('ContainsFeature', $featureCount * $subsysCount);
    my $loadIsGenomeOf = $self->_TableLoader('IsGenomeOf', $featureCount * $genomeCount);
    my $loadIsRoleOf = $self->_TableLoader('IsRoleOf', $featureCount * $genomeCount);
    my $loadOccursInSubsystem = $self->_TableLoader('OccursInSubsystem', $featureCount * 6);
    my $loadParticipatesIn = $self->_TableLoader('ParticipatesIn', $subsysCount * $genomeCount);
    my $loadHasSSCell = $self->_TableLoader('HasSSCell', $featureCount * $genomeCount);
    Trace("Beginning subsystem data load.") if T(2);
    # Loop through the subsystems. Our first task will be to create the
    # roles. We do this by looping through the subsystems and creating a
    # role hash. The hash tracks each role ID so that we don't create
    # duplicates. As we move along, we'll connect the roles and subsystems.
    my %roleData = ();
    for my $subsysID (@subsysIDs) {
        Trace("Creating subsystem $subsysID.") if T(3);
        $loadSubsystem->Add("subsystemIn");
        # Create the subsystem record.
        $loadSubsystem->Put($subsysID);
        # Get the subsystem's roles.
        my @roles = $fig->subsystem_to_roles($subsysID);
        # Connect the roles to the subsystem. If a role is new, we create
        # a role record for it.
        for my $roleID (@roles) {
            $loadOccursInSubsystem->Add("roleIn");
            $loadOccursInSubsystem->Put($roleID, $subsysID);
            if (! exists $roleData{$roleID}) {
                $loadRole->Put($roleID);
                $roleData{$roleID} = 1;
            }
        }
        # Now all roles for this subsystem have been filled in. We create the
        # spreadsheet by matching roles to genomes. To do this, we need to
        # get the genomes on the sheet.
        Trace("Creating subsystem $subsysID spreadsheet.") if T(3);
        my @genomes = map { $_->[0] } @{$fig->subsystem_genomes($subsysID)};
        for my $genomeID (@genomes) {
            # Only process this genome if it's one of ours.
            if (exists $genomeHash->{$genomeID}) {
                # Connect the genome to the subsystem.
                $loadParticipatesIn->Put($genomeID, $subsysID);
                # Loop through the subsystem's roles. We use an index because it is
                # part of the spreadsheet cell ID.
                for (my $i = 0; $i <= $#roles; $i++) {
                    my $role = $roles[$i];
                    # Get the features in the spreadsheet cell for this genome and role.
                    my @pegs = $fig->pegs_in_subsystem_cell($subsysID, $genomeID, $i);
                    # Only proceed if features exist.
                    if (@pegs > 0) {
                        # Create the spreadsheet cell.
                        my $cellID = "$subsysID:$genomeID:$i";
                        $loadSSCell->Put($cellID);
                        $loadIsGenomeOf->Put($genomeID, $cellID);
                        $loadIsRoleOf->Put($role, $cellID);
                        $loadHasSSCell->Put($subsysID, $cellID);
                        # Attach the features to it.
                        for my $pegID (@pegs) {
                            $loadContainsFeature->Put($cellID, $pegID);
                        }
                    }
                }
            }
        }
    }
    # Finish the load.
    my $retVal = $self->_FinishAll();
    return $retVal;
}

=head3 LoadDiagramData

C<< my $stats = $spl->LoadDiagramData(); >>

Load the diagram data from FIG into Sprout.

Diagrams are used to organize functional roles. The diagram shows the
connections between chemicals that interact with a subsystem.

The following relations are loaded by this method.

    Diagram
    RoleOccursIn

=over 4

=item RETURNS

Returns a statistics object for the loads.

=back

=cut
#: Return Type $%;
sub LoadDiagramData {
    # Get this object instance.
    my ($self) = @_;
    # Get the FIG object.
    my $fig = $self->{fig};
    # Get the map list.
    my @maps = $fig->all_maps;
    my $mapCount = @maps;
    my $genomeCount = (keys %{$self->{genomes}});
    my $featureCount = $genomeCount * 4000;
    # Create load objects for each of the tables we're loading.
    my $loadDiagram = $self->_TableLoader('Diagram', $mapCount);
    my $loadRoleOccursIn = $self->_TableLoader('RoleOccursIn', $featureCount * 6);
    Trace("Beginning diagram data load.") if T(2);
    # Loop through the diagrams.
    for my $map (@maps) {
        Trace("Loading diagram $map.") if T(3);
        # Get the diagram's descriptive name.
        my $name = $fig->map_name($map);
        $loadDiagram->Put($map, $name);
        # Now we need to link all the map's roles to it.
        # A hash is used to prevent duplicates.
        my %roleHash = ();
        for my $role ($fig->map_to_ecs($map)) {
            if (! $roleHash{$role}) {
                $loadRoleOccursIn->Put($role, $map);
                $roleHash{$role} = 1;
            }
        }
    }
    # Finish the load.
    my $retVal = $self->_FinishAll();
    return $retVal;
}

=head3 LoadPropertyData

C<< my $stats = $spl->LoadPropertyData(); >>

Load the attribute data from FIG into Sprout.

Attribute data in FIG corresponds to the Sprout concept of Property. As currently
implemented, each key-value attribute combination in the SEED corresponds to a
record in the B<Property> table. The B<HasProperty> relationship links the
features to the properties.

The SEED also allows attributes to be assigned to genomes, but this is not yet
supported by Sprout.
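
The mapping from key-value combinations to property records can be sketched as
follows: each distinct key-value pair receives the next sequential ID the first time
it is seen, and every feature carrying that pair is linked to the same ID. The
attribute tuples and URLs are made up for the example.

    use strict;
    use warnings;

    # Made-up attribute tuples: [featureID, key, value, url].
    my @attributes = (['fig|11111.1.peg.1', 'essential', 'yes', 'http://example.org/a'],
                      ['fig|11111.1.peg.2', 'essential', 'yes', 'http://example.org/b'],
                      ['fig|11111.1.peg.2', 'virulent',  'no',  'http://example.org/c']);
    my %propertyKeys;
    my $nextID = 1;
    my (@properties, @hasProperty);
    for my $tuple (@attributes) {
        my ($fid, $key, $value, $url) = @{$tuple};
        # A tab separates the key from the value in the lookup string.
        my $propertyKey = "$key\t$value";
        if (! exists $propertyKeys{$propertyKey}) {
            # First sighting of this key-value pair: assign the next ID.
            $propertyKeys{$propertyKey} = $nextID++;
            push @properties, [$propertyKeys{$propertyKey}, $key, $value];
        }
        # Link the feature to the property.
        push @hasProperty, [$fid, $propertyKeys{$propertyKey}, $url];
    }
    print scalar(@properties), " properties, ", scalar(@hasProperty), " links\n";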

The following relations are loaded by this method.

    HasProperty
    Property

=over 4

=item RETURNS

Returns a statistics object for the loads.

=back

=cut
#: Return Type $%;
sub LoadPropertyData {
    # Get this object instance.
    my ($self) = @_;
    # Get the FIG object.
    my $fig = $self->{fig};
    # Get the genome hash.
    my $genomeHash = $self->{genomes};
    my $genomeCount = (keys %{$genomeHash});
    # Create load objects for each of the tables we're loading.
    my $loadProperty = $self->_TableLoader('Property', $genomeCount * 1500);
    my $loadHasProperty = $self->_TableLoader('HasProperty', $genomeCount * 1500);
    Trace("Beginning property data load.") if T(2);
    # Create a hash for storing property IDs.
    my %propertyKeys = ();
    my $nextID = 1;
    # Loop through the genomes.
    for my $genomeID (keys %{$genomeHash}) {
        $loadProperty->Add("genomeIn");
        # Get the genome's features. The feature ID is the first field in the
        # tuples returned by "all_features_detailed". We use "all_features_detailed"
        # rather than "all_features" because we want all features regardless of type.
        my @features = map { $_->[0] } @{$fig->all_features_detailed($genomeID)};
        # Loop through the features, creating HasProperty records.
        for my $fid (@features) {
            $loadProperty->Add("featureIn");
            # Get all attributes for this feature. We do this one feature at a time
            # to ensure we do not get any genome attributes.
            my @attributeList = $fig->get_attributes($fid, '', '', '');
            # Loop through the attributes.
            for my $tuple (@attributeList) {
                # Get this attribute value's data. Note that we throw away the FID,
                # since it will always be the same as the value of "$fid".
                my (undef, $key, $value, $url) = @{$tuple};
                # Concatenate the key and value and check the "propertyKeys" hash to
                # see if we already have an ID for it. We use a tab for the separator
                # character.
                my $propertyKey = "$key\t$value";
                # Use the concatenated value to check for an ID. If no ID exists, we
                # create one.
                my $propertyID = $propertyKeys{$propertyKey};
                if (! $propertyID) {
                    # Here we need to create a new property ID for this key/value pair.
                    $propertyKeys{$propertyKey} = $nextID;
                    $propertyID = $nextID;
                    $nextID++;
                    $loadProperty->Put($propertyID, $key, $value);
                }
                # Create the HasProperty entry for this feature/property association.
                $loadHasProperty->Put($fid, $propertyID, $url);
            }
        }
    }
    # Finish the load.
    my $retVal = $self->_FinishAll();
    return $retVal;
}

=head3 LoadAnnotationData

C<< my $stats = $spl->LoadAnnotationData(); >>

Load the annotation data from FIG into Sprout.

Sprout annotations encompass both the assignments and the annotations in SEED.
These describe the function performed by a PEG as well as any other useful
information that may aid in identifying its purpose.
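
Each annotation is stored under an ID of the form C<peg:timestamp>, so the
time stamps must be unique within a PEG and the text must not contain raw tabs
or new-lines. The following is a minimal standalone sketch of those two steps.
The helper names C<EscapeAnnotation> and C<UniqueTimestamp> are hypothetical
and the sample values are made up.

```perl

    use strict;
    use warnings;

    # Hypothetical helper: strip carriage returns and escape tabs and new-lines
    # so the text fits on one line of a tab-delimited load file.
    sub EscapeAnnotation {
        my ($text) = @_;
        $text =~ s/\r//g;
        $text =~ s/\t/\\t/g;
        $text =~ s/\n/\\n/g;
        return $text;
    }

    # Hypothetical helper: bump a time stamp until it is unused for this PEG,
    # then record it in the caller's hash.
    sub UniqueTimestamp {
        my ($seen, $timestamp) = @_;
        $timestamp++ while $seen->{$timestamp};
        $seen->{$timestamp} = 1;
        return $timestamp;
    }

    my %seen;
    my @stamps = map { UniqueTimestamp(\%seen, $_) } (1000, 1000, 1001);
    print join(",", @stamps), "\n";
    print EscapeAnnotation("line one\r\nline\ttwo"), "\n";

```

Colliding time stamps are pushed forward one second at a time, so the stored
value can differ slightly from the one recorded in the SEED.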

The following relations are loaded by this method.

    Annotation
    IsTargetOfAnnotation
    SproutUser
    MadeAnnotation

=over 4

=item RETURNS

Returns a statistics object for the loads.

=back

=cut
#: Return Type $%;
sub LoadAnnotationData {
    # Get this object instance.
    my ($self) = @_;
    # Get the FIG object.
    my $fig = $self->{fig};
    # Get the genome hash.
    my $genomeHash = $self->{genomes};
    my $genomeCount = (keys %{$genomeHash});
    # Create load objects for each of the tables we're loading.
    my $loadAnnotation = $self->_TableLoader('Annotation', $genomeCount * 4000);
    my $loadIsTargetOfAnnotation = $self->_TableLoader('IsTargetOfAnnotation', $genomeCount * 4000);
    my $loadSproutUser = $self->_TableLoader('SproutUser', 100);
    my $loadUserAccess = $self->_TableLoader('UserAccess', 1000);
    my $loadMadeAnnotation = $self->_TableLoader('MadeAnnotation', $genomeCount * 4000);
    Trace("Beginning annotation data load.") if T(2);
    # Create a hash of user names. We'll use this to prevent us from generating duplicate
    # user records.
    my %users = ( FIG => 1, master => 1 );
    # Put in FIG and "master".
    $loadSproutUser->Put("FIG", "Fellowship for Interpretation of Genomes");
    $loadUserAccess->Put("FIG", 1);
    $loadSproutUser->Put("master", "Master User");
    $loadUserAccess->Put("master", 1);
    # Get the current time.
    my $time = time();
    # Loop through the genomes.
    for my $genomeID (sort keys %{$genomeHash}) {
        Trace("Processing $genomeID.") if T(3);
        # Get the genome's PEGs.
        my @pegs = $fig->pegs_of($genomeID);
        for my $peg (@pegs) {
            Trace("Processing $peg.") if T(4);
            # Create a hash of timestamps. We use this to prevent duplicate time stamps
            # from showing up for a single PEG's annotations.
            my %seenTimestamps = ();
            # Check for a functional assignment.
            my $func = $fig->function_of($peg);
            if ($func) {
                # If this is NOT a hypothetical assignment, we create an
                # assignment annotation for it. The hypothetical check is made
                # against the function text, not the PEG ID.
                if (! FIG::hypo($func)) {
                    # Note that we double the backslashes so that what goes into the database is
                    # a new-line escape sequence rather than an actual new-line.
                    $loadAnnotation->Put("$peg:$time", $time, "FIG\\nSet function to\\n$func");
                    $loadIsTargetOfAnnotation->Put($peg, "$peg:$time");
                    $loadMadeAnnotation->Put("FIG", "$peg:$time");
                    # Denote we've seen this timestamp.
                    $seenTimestamps{$time} = 1;
                }
                # Now loop through the real annotations.
                for my $tuple ($fig->feature_annotations($peg, "raw")) {
                    my ($fid, $timestamp, $user, $text) = @{$tuple};
                    # Here we fix up the annotation text. "\r" is removed,
                    # and "\t" and "\n" are escaped. Note we use the "s"
                    # modifier so that new-lines inside the text do not
                    # stop the substitution search.
                    $text =~ s/\r//gs;
                    $text =~ s/\t/\\t/gs;
                    $text =~ s/\n/\\n/gs;
                    # Change assignments by the master user to FIG assignments.
                    $text =~ s/Set master function/Set FIG function/s;
                    # Ensure the time stamp is valid.
                    if ($timestamp =~ /^\d+$/) {
                        # Here it's a number. We need to ensure it's unique.
                        while ($seenTimestamps{$timestamp}) {
                            $timestamp++;
                        }
                        $seenTimestamps{$timestamp} = 1;
                        my $annotationID = "$peg:$timestamp";
                        # Ensure the user exists.
                        if (! $users{$user}) {
                            $loadSproutUser->Put($user, "SEED user");
                            $loadUserAccess->Put($user, 1);
                            $users{$user} = 1;
                        }
                        # Generate the annotation.
                        $loadAnnotation->Put($annotationID, $timestamp, "$user\\n$text");
                        $loadIsTargetOfAnnotation->Put($peg, $annotationID);
                        $loadMadeAnnotation->Put($user, $annotationID);
                    } else {
                        # Here we have an invalid time stamp.
                        Trace("Invalid time stamp \"$timestamp\" in annotations for $peg.") if T(1);
                    }
                }
            }
        }
    }
    # Finish the load.
    my $retVal = $self->_FinishAll();
    return $retVal;
}

=head3 LoadSourceData

C<< my $stats = $spl->LoadSourceData(); >>

Load the source data from FIG into Sprout.

Source data links each genome to information about the organization that
mapped it.

The following relations are loaded by this method.

    ComesFrom
    Source
    SourceURL

There is no direct support for source attribution in FIG, so we access the SEED
files directly.
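
As the method reads it, the first line of a genome's C<PROJECT> file holds
three tab-delimited fields: a source ID, a description, and a URL. That layout
is inferred from the code in this module rather than from a published format.
A standalone sketch, using an in-memory file and the hypothetical helper
C<ReadProjectLine>:

```perl

    use strict;
    use warnings;

    # Hypothetical helper: read the first line of an open handle and split it
    # into the three tab-delimited fields LoadSourceData expects.
    sub ReadProjectLine {
        my ($fh) = @_;
        my $line = <$fh>;
        return unless defined $line;
        chomp $line;
        my ($sourceID, $desc, $url) = split /\t/, $line;
        return ($sourceID, $desc, $url);
    }

    # Made-up content standing in for a genome's PROJECT file.
    my $sample = "ExampleLab\tExample Sequencing Lab\thttp://example.org\n";
    open(my $fh, '<', \$sample) or die "Cannot open in-memory file: $!";
    my ($sourceID, $desc, $url) = ReadProjectLine($fh);
    close $fh;
    print "$sourceID | $desc | $url\n";

```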

=over 4

=item RETURNS

Returns a statistics object for the loads.

=back

=cut
#: Return Type $%;
sub LoadSourceData {
    # Get this object instance.
    my ($self) = @_;
    # Get the FIG object.
    my $fig = $self->{fig};
    # Get the genome hash.
    my $genomeHash = $self->{genomes};
    my $genomeCount = (keys %{$genomeHash});
    # Create load objects for each of the tables we're loading.
    my $loadComesFrom = $self->_TableLoader('ComesFrom', $genomeCount * 4);
    my $loadSource = $self->_TableLoader('Source', $genomeCount * 4);
    my $loadSourceURL = $self->_TableLoader('SourceURL', $genomeCount * 8);
    Trace("Beginning source data load.") if T(2);
    # Create hashes to collect the Source information.
    my %sourceURL = ();
    my %sourceDesc = ();
    # Loop through the genomes.
    my $line;
    for my $genomeID (sort keys %{$genomeHash}) {
        Trace("Processing $genomeID.") if T(3);
        # Open the project file.
        if ((open(TMP, "<$FIG_Config::organisms/$genomeID/PROJECT")) &&
            defined($line = <TMP>)) {
            chomp $line;
            my($sourceID, $desc, $url) = split(/\t/,$line);
            $loadComesFrom->Put($genomeID, $sourceID);
            if ($url && ! exists $sourceURL{$sourceID}) {
                $loadSourceURL->Put($sourceID, $url);
                $sourceURL{$sourceID} = 1;
            }
            if ($desc && ! exists $sourceDesc{$sourceID}) {
                $loadSource->Put($sourceID, $desc);
                $sourceDesc{$sourceID} = 1;
            }
        }
        close TMP;
    }
    # Finish the load.
    my $retVal = $self->_FinishAll();
    return $retVal;
}

=head3 LoadExternalData

C<< my $stats = $spl->LoadExternalData(); >>

Load the external data from FIG into Sprout.

External data contains information about external feature IDs.

The following relations are loaded by this method.

    ExternalAliasFunc
    ExternalAliasOrg

The support for external IDs in FIG is hidden beneath layers of other data, so
we access the SEED files directly to create these tables. This is also one of
the few load methods that does not proceed genome by genome.
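
The handling of one line from the function table can be sketched on its own.
The sample line below is made up, C<ParseFuncLine> is a hypothetical helper,
and the three-column layout (ID, description, optional EC number) is inferred
from the code in this method.

```perl

    use strict;
    use warnings;

    # Hypothetical helper: turn one line of the function table into an
    # (ID, description) pair, folding a third-column EC number into the text.
    sub ParseFuncLine {
        my ($funcLine) = @_;
        chomp $funcLine;
        # Blank lines produce nothing.
        return unless $funcLine;
        my @funcFields = split /\s*\t\s*/, $funcLine;
        if (@funcFields >= 3 && $funcFields[2] =~ /^(EC .*\S)/) {
            $funcFields[1] .= " $1";
        }
        return @funcFields[0,1];
    }

    # Made-up sample line: ID, description, EC number with a trailing space.
    my ($id, $desc) = ParseFuncLine("sp|P00001\tExample kinase\tEC 2.7.1.1 \n");
    print "$id => $desc\n";

```

The capture stops at the last non-blank character, so trailing white space
after the EC number is not carried into the description.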

=over 4

=item RETURNS

Returns a statistics object for the loads.

=back

=cut
#: Return Type $%;
sub LoadExternalData {
    # Get this object instance.
    my ($self) = @_;
    # Get the FIG object.
    my $fig = $self->{fig};
    # Get the genome hash.
    my $genomeHash = $self->{genomes};
    my $genomeCount = (keys %{$genomeHash});
    # Convert the genome hash. We'll get the genus and species for each genome and make
    # it the key.
    my %speciesHash = map { $fig->genus_species($_) => $_ } (keys %{$genomeHash});
    # Create load objects for each of the tables we're loading.
    my $loadExternalAliasFunc = $self->_TableLoader('ExternalAliasFunc', $genomeCount * 4000);
    my $loadExternalAliasOrg = $self->_TableLoader('ExternalAliasOrg', $genomeCount * 4000);
    Trace("Beginning external data load.") if T(2);
    # We loop through the files one at a time. First, the organism file.
    Open(\*ORGS, "<$FIG_Config::global/ext_org.table");
    my $orgLine;
    while (defined($orgLine = <ORGS>)) {
        # Clean the input line.
        chomp $orgLine;
        # Parse the organism name.
        my ($protID, $name) = split /\s*\t\s*/, $orgLine;
        $loadExternalAliasOrg->Put($protID, $name);
    }
    close ORGS;
    # Now the function file.
    my $funcLine;
    Open(\*FUNCS, "<$FIG_Config::global/ext_func.table");
    while (defined($funcLine = <FUNCS>)) {
        # Clean the line ending.
        chomp $funcLine;
        # Only proceed if the line is non-blank.
        if ($funcLine) {
            # Split it into fields.
            my @funcFields = split /\s*\t\s*/, $funcLine;
            # If there's an EC number, append it to the description.
            if ($#funcFields >= 2 && $funcFields[2] =~ /^(EC .*\S)/) {
                $funcFields[1] .= " $1";
            }
            # Output the function line.
            $loadExternalAliasFunc->Put(@funcFields[0,1]);
        }
    }
    close FUNCS;
    # Finish the load.
    my $retVal = $self->_FinishAll();
    return $retVal;
}

=head3 LoadGroupData

C<< my $stats = $spl->LoadGroupData(); >>

Load the genome Groups into Sprout.

The following relations are loaded by this method.

    GenomeGroups

There is no direct support for genome groups in FIG, so we access the SEED
files directly.

=over 4

=item RETURNS

Returns a statistics object for the loads.

=back

=cut
#: Return Type $%;
sub LoadGroupData {
    # Get this object instance.
    my ($self) = @_;
    # Get the FIG object.
    my $fig = $self->{fig};
    # Get the genome hash.
    my $genomeHash = $self->{genomes};
    my $genomeCount = (keys %{$genomeHash});
    # Create a load object for the table we're loading.
    my $loadGenomeGroups = $self->_TableLoader('GenomeGroups', $genomeCount * 4);
    Trace("Beginning group data load.") if T(2);
    # Loop through the genomes.
    my $line;
    for my $genomeID (keys %{$genomeHash}) {
        Trace("Processing $genomeID.") if T(3);
        # Open the NMPDR group file for this genome.
        if (open(TMP, "<$FIG_Config::organisms/$genomeID/NMPDR") &&
            defined($line = <TMP>)) {
            # Clean the line ending.
            chomp $line;
            # Add the group to the table. Note that there can only be one group
            # per genome.
            $loadGenomeGroups->Put($genomeID, $line);
        }
        close TMP;
    }
    # Finish the load.
    my $retVal = $self->_FinishAll();
    return $retVal;
}

=head2 Internal Utility Methods

=head3 TableLoader

Create an ERDBLoad object for the specified table. The object is also added to
the internal list in the C<loaders> property of this object. That enables the
L</FinishAll> method to terminate all the active loads.

This is an instance method.

=over 4

=item tableName

Name of the table (relation) being loaded.

=item rowCount (optional)

Estimated maximum number of rows in the table.

=item RETURN

Returns an ERDBLoad object for loading the specified table.

=back

=cut

sub _TableLoader {
    # Get the parameters.
    my ($self, $tableName, $rowCount) = @_;
    # Create the load object.
    my $retVal = ERDBLoad->new($self->{erdb}, $tableName, $self->{loadDirectory}, $rowCount);
    # Cache it in the loader list.
    push @{$self->{loaders}}, $retVal;
    # Return it to the caller.
    return $retVal;
}

=head3 FinishAll

Finish all the active loads on this object.

When a load is started by L</TableLoader>, the controlling B<ERDBLoad> object is cached in
the list pointed to by the C<loaders> property of this object. This method pops the loaders
off the list and finishes them to flush out any accumulated residue.
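
The interplay between the loader list and this method can be shown with mock
objects. In the sketch below, C<MockLoader> and C<MockLoad> are hypothetical
stand-ins for B<ERDBLoad> and this object, and a plain row count stands in for
the statistics object that the real method returns.

```perl

    use strict;
    use warnings;

    # Hypothetical stand-in for ERDBLoad: it only counts the rows it is given.
    package MockLoader;
    sub new     { my ($class, $name) = @_; return bless { name => $name, rows => 0 }, $class; }
    sub Put     { my ($self) = @_; $self->{rows}++; return; }
    sub Finish  { my ($self) = @_; return $self->{rows}; }
    sub RelName { my ($self) = @_; return $self->{name}; }

    # Hypothetical stand-in for the load object that owns the loader list.
    package MockLoad;
    sub new { my ($class) = @_; return bless { loaders => [] }, $class; }
    sub _TableLoader {
        my ($self, $tableName) = @_;
        my $retVal = MockLoader->new($tableName);
        # Remember the loader so _FinishAll can find it later.
        push @{$self->{loaders}}, $retVal;
        return $retVal;
    }
    sub _FinishAll {
        my ($self) = @_;
        my $total = 0;
        # Popping empties the list, so a second call finds nothing to finish.
        while (my $loader = pop @{$self->{loaders}}) {
            $total += $loader->Finish();
        }
        return $total;
    }

    package main;
    my $load = MockLoad->new();
    my $genome = $load->_TableLoader('Genome');
    my $feature = $load->_TableLoader('Feature');
    $genome->Put('a');
    $feature->Put('b');
    $feature->Put('c');
    my $first = $load->_FinishAll();
    my $second = $load->_FinishAll();
    print "first=$first second=$second\n";

```

Because the list is emptied as it is processed, each load method starts with a
clean list and its statistics cover only the tables it created.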

This is an instance method.

=over 4

=item RETURN

Returns a statistics object containing the accumulated statistics for the load.

=back

=cut

sub _FinishAll {
    # Get this object instance.
    my ($self) = @_;
    # Create the statistics object.
    my $retVal = Stats->new();
    # Get the loader list.
    my $loadList = $self->{loaders};
    # Loop through the list, finishing the loads. Note that if the finish fails, we die
    # ignominiously. At some future point, we want to make the loads restartable.
    while (my $loader = pop @{$loadList}) {
        my $stats = $loader->Finish();
        $retVal->Accumulate($stats);
        my $relName = $loader->RelName;
        Trace("Statistics for $relName:\n" . $stats->Show()) if T(2);
    }
    # Return the load statistics.
    return $retVal;
}

1;
