[Bio] / Sprout / Sapling.pm Repository:
ViewVC logotype

Annotation of /Sprout/Sapling.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.50 - (view) (download) (as text)

1 : parrello 1.1 #!/usr/bin/perl -w
2 :    
3 :     #
4 :     # Copyright (c) 2003-2006 University of Chicago and Fellowship
5 :     # for Interpretations of Genomes. All Rights Reserved.
6 :     #
7 :     # This file is part of the SEED Toolkit.
8 :     #
9 :     # The SEED Toolkit is free software. You can redistribute
10 :     # it and/or modify it under the terms of the SEED Toolkit
11 :     # Public License.
12 :     #
13 :     # You should have received a copy of the SEED Toolkit Public License
14 :     # along with this program; if not write to the University of Chicago
15 :     # at info@ci.uchicago.edu or the Fellowship for Interpretation of
16 :     # Genomes at veronika@thefig.info or download a copy from
17 :     # http://www.theseed.org/LICENSE.TXT.
18 :     #
19 :    
20 : olson 1.44 #
21 :     # This is a SAS component.
22 :     #
23 :    
24 : parrello 1.1 package Sapling;
25 :    
26 :     use strict;
27 :     use Tracer;
28 : parrello 1.11 use base qw(ERDB);
29 : parrello 1.1 use Stats;
30 : parrello 1.9 use DBKernel;
31 : parrello 1.11 use SeedUtils;
32 :     use BasicLocation;
33 : parrello 1.41 use ERDBGenerate;
34 : parrello 1.1 use XML::Simple;
35 : parrello 1.27 use Digest::MD5;
36 : parrello 1.1
37 :     =head1 Sapling Package
38 :    
39 :     Sapling Database Access Methods
40 :    
41 :     =head2 Introduction
42 :    
43 : parrello 1.11 The Sapling database is a new Entity-Relationship Database that attempts to
44 :     encapsulate our data in a portable form for distribution. It is loaded directly
45 :     from the genomes and subsystems of the SEED. This object has minimal
46 :     capabilities: most of its power comes the L<ERDB> base class.
47 : parrello 1.1
48 :     The fields in this object are as follows.
49 :    
50 :     =over 4
51 :    
52 :     =item loadDirectory
53 :    
54 :     Name of the directory containing the files used by the loaders.
55 :    
56 :     =item loaderSource
57 :    
58 : parrello 1.11 Source object for the loaders (a L<FIG> in our case).
59 : parrello 1.1
60 :     =item genomeHash
61 :    
62 :     Reference to a hash of the genomes to include when loading.
63 :    
64 :     =item subHash
65 :    
66 :     Reference to a hash of the subsystems to include when loading.
67 :    
68 :     =item tuning
69 :    
70 :     Reference to a hash of tuning parameters.
71 :    
72 : parrello 1.16 =item otuHash
73 :    
74 :     Reference to a hash that maps genome IDs to genome set names.
75 :    
76 : parrello 1.1 =back
77 :    
78 : parrello 1.8 =head2 Configuration and Construction
79 : parrello 1.1
80 :     The default loading profile for the Sapling database is to include all complete
81 :     genomes and all usable subsystems. This can be overridden by specifying a list of
82 :     genomes and subsystems in an XML configuration file. The file name should be
83 :     C<SaplingConfig.xml> in the specified data directory. The document element should
84 :     be C<Sapling>, and it has two sub-elements. The C<Genomes> element should contain as
85 :     its text a space-delimited list of genome IDs. The <Subsystems> element should contain
86 :     a list of subsystem names, one per line. If a particular section is missing, the
87 :     default list will be used.
88 :    
89 :     =head3 Example
90 :    
91 :     The following configuration file specifies 10 genomes and 6 subsystems.
92 :    
93 :     <Sapling>
94 :     <Genomes>
95 :     100226.1 31033.3 31964.1 36873.1 126740.4
96 :     155864.1 349307.7 350058.5 351348.5 412694.5
97 :     </Genomes>
98 :     <Subsystems>
99 :     Sugar_utilization_in_Thermotogales
100 :     Coenzyme_F420_hydrogenase
101 :     Ribosome_activity_modulation
102 :     prophage_tails
103 :     CBSS-393130.3.peg.794
104 :     Apigenin_derivatives
105 :     </Subsystems>
106 :     </Sapling>
107 :    
108 :     The XML file also contains tuning parameters that affect the way the data
109 :     is loaded. These are specified as attributes in the TuningParameters element,
110 :     as follows.
111 :    
112 :     =over 4
113 :    
114 :     =item maxLocationLength
115 :    
116 :     The maximum number of base pairs allowed in a single location. B<IsLocatedIn>
117 :     records are split into sections based on this length, so when you are looking
118 :     for all the features in a particular neighborhood, you can look for locations
119 :     within the maximum location distance from the neighborhood, and even if you have
120 :     a huge operon that contains tens of thousands of base pairs, you'll still be
121 :     able to find it.
122 :    
123 : parrello 1.4 =item maxSequenceLength
124 :    
125 :     The maximum number of base pairs allowed in a single DNA sequence. DNA sequences
126 :     are broken into segments to prevent excessively large genomes from clogging
127 :     memory during sequence resolution.
128 :    
129 : parrello 1.1 =back
130 :    
131 :     =head3 Global Section Constant
132 :    
133 :     Each section of the database used by the loader corresponds to a single genome.
134 :     The global section is loaded after all the others, and is concerned with data
135 :     not related to a particular genome.
136 :    
137 :     =cut
138 :    
139 :     # Name of the global section
140 :     use constant GLOBAL => 'Globals';
141 :    
142 :     =head3 Tuning Parameter Defaults
143 :    
144 :     Each tuning parameter must have a default value, in case it is not present in
145 :     the XML configuration file. The defaults are specified in a constant hash
146 :     reference called C<TUNING_DEFAULTS>.
147 :    
148 :     =cut
149 :    
150 :     use constant TUNING_DEFAULTS => {
151 : parrello 1.4 maxLocationLength => 4000,
152 : parrello 1.27 maxSequenceLength => 10000,
153 : parrello 1.1 };
154 :    
155 :     =head3 new
156 :    
157 :     my $sap = Sapling->new(%options);
158 :    
159 :     Construct a new Sapling object. The following options are supported.
160 :    
161 :     =over 4
162 :    
163 :     =item loadDirectory
164 :    
165 :     Data directory to be used by the loaders.
166 :    
167 : parrello 1.7 =item DBD
168 : parrello 1.1
169 :     XML database definition file.
170 :    
171 :     =item dbName
172 :    
173 :     Name of the database to use.
174 :    
175 :     =item sock
176 :    
177 :     Socket for accessing the database.
178 :    
179 :     =item userData
180 :    
181 :     Name and password used to log on to the database, separated by a slash.
182 :    
183 :     =item dbhost
184 :    
185 :     Database host name.
186 :    
187 : parrello 1.25 =item port
188 :    
189 : parrello 1.35 MYSQL port number to use (MySQL only).
190 :    
191 :     =item dbms
192 :    
193 :     Database management system to use (e.g. C<SQLite> or C<postgres>, default C<mysql>).
194 : parrello 1.25
195 : parrello 1.1 =back
196 :    
197 :     =cut
198 :    
199 :     sub new {
200 :     # Get the parameters.
201 :     my ($class, %options) = @_;
202 :     # Get the options.
203 : parrello 1.48 if (! $options{loadDirectory}) {
204 :     $options{loadDirectory} = $FIG_Config::saplingData ||
205 :     "$FIG_Config::fig/SaplingData";
206 :     }
207 :     my $dbd = $options{DBD} || "$options{loadDirectory}/SaplingDBD.xml";
208 : olson 1.6 my $dbName = $options{dbName} || $FIG_Config::saplingDB || "nmpdr_sapling";
209 : parrello 1.3 my $userData = $options{userData} || "seed/";
210 :     my $dbhost = $options{dbhost} || $FIG_Config::saplingHost || "localhost";
211 : parrello 1.25 my $port = $options{port} || 3306;
212 : parrello 1.35 my $dbms = $options{dbms} || 'mysql';
213 : parrello 1.45 # Insure that if the user specified a DBD, it overrides the internal one.
214 :     if ($options{DBD} && ! defined $options{externalDBD}) {
215 :     $options{externalDBD} = 1;
216 :     }
217 : parrello 1.25 # Compute the socket. An empty string is a valid override here.
218 :     my $sock = $options{sock};
219 :     if (! defined $sock) {
220 :     $sock = $FIG_Config::sproutSock || "";
221 :     }
222 : parrello 1.1 # Compute the user name and password.
223 :     my ($user, $pass) = split '/', $userData, 2;
224 :     $pass = "" if ! defined $pass;
225 : parrello 1.24 Trace("Connecting to sapling database.") if T(2);
226 : parrello 1.1 # Connect to the database.
227 : parrello 1.35 my $dbh = DBKernel->new($dbms, $dbName, $user, $pass, $port, $dbhost, $sock);
228 : parrello 1.1 # Create the ERDB object.
229 : parrello 1.2 my $retVal = ERDB::new($class, $dbh, $dbd, %options);
230 : parrello 1.1 # Set up the spaces for the loader source object, the subsystem hash, the
231 :     # genome hash, and the tuning parameters.
232 :     $retVal->{source} = undef;
233 :     $retVal->{genomeHash} = undef;
234 :     $retVal->{subHash} = undef;
235 :     $retVal->{tuning} = undef;
236 : parrello 1.16 # Set up the hash of genome IDs to OTUs.
237 :     $retVal->{otuHash} = {};
238 : parrello 1.1 # Return it.
239 :     return $retVal;
240 :     }
241 :    
242 : parrello 1.8 =head2 Public Methods
243 :    
244 : parrello 1.16 =head3 OTU
245 :    
246 :     my $otu = $sap->OTU($genomeID);
247 :    
248 :     Return the name of the Organism Taxonomic Unit (GenomeSet) for the
249 :     specified genome ID. OTU information is cached in memory, so that once it
250 :     is known, it does not need to be re-fetched from the database.
251 :    
252 :     =over 4
253 :    
254 :     =item genomeID
255 :    
256 :     ID of a genome or feature. If a feature ID is specified, the genome ID will be
257 :     extracted from it.
258 :    
259 :     =item RETURN
260 :    
261 :     Returns the name of the genome set for the specified genome, or C<undef> if the
262 :     genome is not in the
263 :    
264 :     =back
265 :    
266 :     =cut
267 :    
268 :     sub OTU {
269 :     # Get the parameters.
270 :     my ($self, $genomeID) = @_;
271 :     # Get the OTU hash.
272 :     my $otuHash = $self->{otuHash};
273 :     # Compute the real genome ID.
274 :     my $realGenomeID = ($genomeID =~ /^fig\|(\d+\.\d+)/ ? $1 : $genomeID);
275 :     # Look it up in the hash.
276 :     my $retVal = $otuHash->{$realGenomeID};
277 :     # Was it found?
278 :     if (! defined $retVal) {
279 :     # No, get the OTU from the database.
280 :     ($retVal) = $self->GetFlat("IsCollectedInto", "IsCollectedInto(from-link) = ?",
281 :     [$realGenomeID], "to-link");
282 :     # Save it in the hash for future use.
283 :     $otuHash->{$realGenomeID} = $retVal;
284 :     }
285 :     # Return the result.
286 :     return $retVal;
287 :     }
288 :    
289 : parrello 1.27 =head3 ProteinID
290 :    
291 :     my $key = $sap->ProteinID($sequence);
292 :    
293 :     Return the protein sequence ID that would be associated with a specific
294 :     protein sequence.
295 :    
296 :     =over 4
297 :    
298 :     =item sequence
299 :    
300 :     String containing the protein sequence in question.
301 :    
302 :     =item RETURN
303 :    
304 :     Returns the ID value for the specified protein sequence. If the sequence exists
305 :     in the database, it will have this ID in the B<ProteinSequence> table.
306 :    
307 :     =back
308 :    
309 :     =cut
310 :    
311 :     sub ProteinID {
312 :     # Get the parameters.
313 :     my ($self, $sequence) = @_;
314 :     # Compute the MD5 hash.
315 :     my $retVal = Digest::MD5::md5_hex($sequence);
316 :     # Return the result.
317 :     return $retVal;
318 :     }
319 :    
320 :    
321 : parrello 1.28 =head3 IsProteinID
322 :    
323 :     my $md5 = $sap->IsProteinID($identifier);
324 :    
325 :     Check for a protein identifier. If a protein identifier is found, the
326 :     corresponding protein sequence ID will be returned; otherwise, an
327 :     undefined value will be returned. A protein identifier is either a
328 :     raw protein sequence ID, an ID preceded by C<md5|>, or an ID preceded by
329 :     C<gnl|md5|>
330 :    
331 :     =over 4
332 :    
333 :     =item identifier
334 :    
335 :     Identifier to test.
336 :    
337 :     =item RETURN
338 :    
339 :     Returns the MD5 code from the protein identifier, or C<undef> if the incoming
340 :     string is not a protein identifier.
341 :    
342 :     =back
343 :    
344 :     =cut
345 :    
346 :     sub IsProteinID {
347 :     # Get the parameters.
348 :     my ($self, $identifier) = @_;
349 :     # Declare the return variable.
350 :     my $retVal;
351 :     # Check the input.
352 :     if ($identifier =~ /^(?:gnl\|)?(?:md5\|)?([0-9a-f]{32})$/) {
353 :     $retVal = $1;
354 :     }
355 :     # Return the result.
356 :     return $retVal;
357 :     }
358 :    
359 :    
360 : parrello 1.18 =head3 Assignment
361 :    
362 :     my $assignment = $sapling->Assignment($fid);
363 :    
364 :     Return the functional assignment for the specified feature.
365 :    
366 :     =over 4
367 :    
368 :     =item fid
369 :    
370 :     FIG ID of the desired feature.
371 :    
372 :     =item RETURN
373 :    
374 :     Returns the functional assignment of the specified feature, or C<undef>
375 :     if the feature does not exist.
376 :    
377 :     =back
378 :    
379 :     =cut
380 :    
381 :     sub Assignment {
382 :     # Get the parameters.
383 :     my ($self, $fid) = @_;
384 :     # Get the functional assignment.
385 :     my ($retVal) = $self->GetFlat("Feature", "Feature(id) = ?", [$fid], 'function');
386 :     # Return the result.
387 :     return $retVal;
388 :     }
389 :    
390 : parrello 1.27 =head3 IdsForProtein
391 :    
392 :     my @ids = $sap->IdsForProtein($protID);
393 :    
394 :     Return a list of all the identifiers associated with the specified
395 :     protein.
396 :    
397 :     =over 4
398 :    
399 :     =item protID
400 :    
401 :     ID of the protein of interest.
402 :    
403 :     =item RETURN
404 :    
405 :     Returns a list of the Identifiers for the specific protein or for genes that
406 :     produce the specific protein.
407 :    
408 :     =back
409 :    
410 :     =cut
411 :    
412 :     sub IdsForProtein {
413 :     # Get the parameters.
414 :     my ($self, $protID) = @_;
415 :     # We'll put the identifiers found in here.
416 :     my %retVal;
417 :     # Ask for identifiers that directly name the protein.
418 :     for my $id ($self->GetFlat("ProteinSequence IsNamedBy Identifier",
419 :     "ProteinSequence(id) = ?", [$protID],
420 :     'Identifier(id)')) {
421 :     $retVal{$id} = 1;
422 :     }
423 :     # Add identifiers that name genes producing the protein.
424 :     for my $id ($self->GetFlat("ProteinSequence IsProteinFor Feature IsIdentifiedBy Identifier",
425 :     "ProteinSequence(id) = ?", [$protID],
426 :     'Identifier(id)')) {
427 :     $retVal{$id} = 1;
428 :     }
429 :     # Return the results found.
430 :     return sort keys %retVal;
431 :     }
432 :    
433 : parrello 1.16
434 : parrello 1.11 =head3 ComputeDNA
435 :    
436 :     my $dna = $sap->ComputeDNA($location);
437 :    
438 :     Return the DNA sequence for the specified location.
439 :    
440 :     =over 4
441 :    
442 :     =item location
443 :    
444 :     A L<BasicLocation> object indicating the contig, start location, direction, and
445 :     length of the desired DNA segment.
446 :    
447 :     =item RETURN
448 :    
449 : parrello 1.13 Returns a string containing the desired DNA. The DNA comes back in pure lower-case.
450 : parrello 1.11
451 :     =back
452 :    
453 :     =cut
454 :    
455 :     sub ComputeDNA {
456 :     # Get the parameters.
457 :     my ($self, $location) = @_;
458 : parrello 1.20 # Get the contig, left end, and right end of the location. Note we subtract
459 :     # 1 to convert contig positions to string offsets.
460 : parrello 1.11 my $contig = $location->Contig;
461 : parrello 1.20 my $left = $location->Left - 1;
462 :     my $right = $location->Right - 1;
463 : parrello 1.31 # Insure the left location is valid.
464 :     if ($left < 0) {
465 :     $left = 0;
466 :     }
467 : parrello 1.11 # Get the DNA segment length.
468 :     my $maxSequenceLength = $self->TuningParameter("maxSequenceLength");
469 :     # Compute the key of the first segment of our DNA and the starting
470 :     # point in that segment.
471 :     my $leftOffset = $left % $maxSequenceLength;
472 :     my $leftKey = "$contig:" . Tracer::Pad(($left - $leftOffset)/$maxSequenceLength,
473 :     7, 1, '0');
474 :     # Compute the key of the last segment containing our DNA.
475 :     my $rightKey = "$contig:" . Tracer::Pad(int($right/$maxSequenceLength), 7, 1, '0');
476 :     my @results = $self->GetFlat("DNASequence",
477 : parrello 1.26 'DNASequence(id) >= ? AND DNASequence(id) <= ?',
478 : parrello 1.11 [$leftKey, $rightKey], 'sequence');
479 :     # Form all the DNA into a string and extract our piece.
480 :     my $retVal = substr(join("", @results), $leftOffset, $location->Length);
481 :     # If this is a backwards string, we need the reverse complement.
482 :     rev_comp(\$retVal) if $location->Dir eq '-';
483 :     # Return the result.
484 :     return $retVal;
485 :     }
486 :    
487 : parrello 1.12 =head3 FilterByGenome
488 :    
489 :     my @filteredFids = $sapling->FilterByGenome(\@fids, $genomeFilter);
490 :    
491 :     Filter the features using the specified genome-based criterion. The
492 :     criterion can be either a comma-separated list of genome IDs, or a
493 :     partial organism name.
494 :    
495 :     =over 4
496 :    
497 :     =item fids
498 :    
499 :     Reference to a list of feature IDs.
500 :    
501 :     =item genomeFilter
502 :    
503 :     A string specifying the filtering criterion. If undefined or blank, then
504 :     no filter is applied. If a name, then only features from genomes with a
505 :     matching name will be returned. A name is a match if the filter is an
506 :     exact match for some prefix of the organism name. Thus, C<Listeria> would
507 :     get all Listerias, while C<Listeria monocytogenes EGD-e> would match only
508 :     the specific EGD-e strain. For a more precise match, you can specify
509 :     instead a comma-delimited list of genome IDs. In this latter case, only
510 :     features for the listed genomes will be included in the results.
511 :    
512 :     =item RETURN
513 :    
514 :     Returns the features from the incoming list that match the filter condition.
515 :    
516 :     =back
517 :    
518 :     =cut
519 :    
520 :     sub FilterByGenome {
521 :     # Get the parameters.
522 :     my ($self, $fids, $genomeFilter) = @_;
523 :     # Declare the return variable.
524 :     my @retVal;
525 :     # Check the type of filter.
526 :     if (! $genomeFilter) {
527 :     # No filter, so copy the input directly to the result.
528 :     @retVal = @$fids;
529 :     } else {
530 :     # Trim edge spaces from the filter.
531 :     $genomeFilter = Tracer::Trim($genomeFilter);
532 :     # This hash will contain the permissible genome IDs.
533 :     my %genomeIDs;
534 :     # Check for a name. We assume we have a name if there's an
535 :     # alphabetic letter anywhere in the filter string.
536 :     if ($genomeFilter =~ /[a-zA-Z]/) {
537 :     # The filter contains something that does not look like a genome
538 :     # ID, so it is treated as a genome name. We get the IDs of the
539 :     # genomes with that name and put them in the hash.
540 :     %genomeIDs = map { $_ => 1 } $self->GetFlat("Genome",
541 :     'Genome(scientific-name) LIKE ?',
542 :     ["$genomeFilter%"], 'id');
543 :     } else {
544 :     # We are expecting a comma-delimited list of genome IDs, so we
545 :     # put these in our hash.
546 :     %genomeIDs = map { $_ => 1 } split(/\s*,\s*/, $genomeFilter);
547 :     }
548 :     # Now we loop through the features, keeping the ones whose genome ID
549 :     # matches something in the hash.
550 :     @retVal = grep { $genomeIDs{genome_of($_)} } @$fids;
551 :     }
552 :     # Return the result.
553 :     return @retVal;
554 :     }
555 :    
556 :    
557 : parrello 1.11 =head3 GetLocations
558 :    
559 :     my @locs = $sapling->GetLocations($fid);
560 :    
561 :     Return the locations of the DNA for the specified feature.
562 :    
563 :     =over 4
564 :    
565 :     =item fid
566 :    
567 :     ID of the feature whose location is desired.
568 :    
569 :     =item RETURN
570 :    
571 :     Returns a list of L<BasicLocation> objects for the locations containing the
572 :     feature's DNA.
573 :    
574 :     =back
575 :    
576 :     =cut
577 :    
578 :     sub GetLocations {
579 :     # Get the parameters.
580 :     my ($self, $fid) = @_;
581 :     # Declare the return variable.
582 :     my @retVal;
583 : parrello 1.23 # This will contain the last location found.
584 :     my $lastLoc;
585 : parrello 1.11 # Get this feature's locations.
586 :     my $qh = $self->Get("IsLocatedIn",
587 :     'IsLocatedIn(from-link) = ? ORDER BY IsLocatedIn(ordinal)',
588 :     [$fid]);
589 :     while (my $resultRow = $qh->Fetch()) {
590 :     # Compute the contig ID and other information.
591 :     my $contig = $resultRow->PrimaryValue('to-link');
592 :     my $begin = $resultRow->PrimaryValue('begin');
593 :     my $dir = $resultRow->PrimaryValue('dir');
594 :     my $len = $resultRow->PrimaryValue('len');
595 :     # Create a location from the location information.
596 :     my $start = ($dir eq '+' ? $begin : $begin + $len - 1);
597 : parrello 1.23 my $loc = BasicLocation->new($contig, $start, $dir, $len);
598 :     # Check to see if this location is adjacent to the previous one.
599 :     if ($lastLoc && $lastLoc->Adjacent($loc)) {
600 :     # It is, so merge it in.
601 :     $lastLoc->Merge($loc);
602 :     } else {
603 :     # It isn't, so push the new one on the list.
604 :     $lastLoc = $loc;
605 :     push @retVal, $loc;
606 :     }
607 : parrello 1.11 }
608 :     # Return the result.
609 :     return @retVal;
610 :     }
611 :    
612 :    
613 : parrello 1.8 =head3 IdentifiedProtein
614 :    
615 :     my $proteinID = $sap->IdentifiedProtein($id);
616 :    
617 :     Compute the protein for a specified identifier. If the identifier does
618 :     not exist or does not identify a protein, this method will return
619 :     C<undef>.
620 :    
621 :     =over 4
622 :    
623 :     =item id
624 :    
625 :     Identifier whose protein is desired.
626 :    
627 :     =item RETURN
628 :    
629 :     Returns the protein ID corresponding to the incoming identifier,
630 :     or C<undef> if the identifier does not exist or is not for a protein.
631 :    
632 :     =back
633 :    
634 :     =cut
635 :    
636 :     sub IdentifiedProtein {
637 :     # Get the parameters.
638 :     my ($self, $id) = @_;
639 :     # Declare the return variable.
640 :     my $retVal;
641 :     # Try to find a protein for this ID.
642 :     my ($proteinID) = $self->GetFlat("Identifier Names ProteinSequence",
643 :     "Identifier(id) = ?", [$id],
644 :     'ProteinSequence(id)');
645 :     if (defined $proteinID) {
646 :     # We found one, so we're done.
647 :     $retVal = $proteinID;
648 :     } else {
649 :     # Not a protein ID. See if it's the ID of a feature that has a
650 :     # protein connected. Note that it's possible to find more than one,
651 :     # but we're going to punt and pick the first.
652 :     ($proteinID) = $self->GetFlat("Identifier Identifies Feature Produces ProteinSequence",
653 :     "Identifier(id) = ? LIMIT 1", [$id],
654 :     'ProteinSequence(id)');
655 :     if (defined $proteinID) {
656 :     # We found a protein ID, so return it.
657 :     $retVal = $proteinID;
658 :     }
659 :     }
660 :     # Return the result.
661 :     return $retVal;
662 :     }
663 :    
664 : parrello 1.9 =head3 FeaturesByID
665 :    
666 :     my @fids = $sapling->FeaturesByID($id);
667 :    
668 :     Return all the features corresponding to the specified identifier. Only features
669 :     that represent the same locus will be returned.
670 :    
671 :     =over 4
672 :    
673 :     =item id
674 :    
675 :     Identifier of interest.
676 :    
677 :     =item RETURN
678 :    
679 :     Returns a list of all the features in the database that match the given
680 :     identifier.
681 :    
682 :     =back
683 :    
684 :     =cut
685 :    
686 :     sub FeaturesByID {
687 :     # Get the parameters.
688 :     my ($self, $id) = @_;
689 :     # Ask for features from the database.
690 :     my @retVal = $self->GetFlat("Identifies", "Identifies(from-link) = ?", [$id],
691 :     'to-link');
692 :     # Return the result.
693 :     return @retVal;
694 :     }
695 :    
696 :     =head3 ProteinsByID
697 :    
698 :     my @fids = $sapling->ProteinsByID($id);
699 :    
700 :     Return all the features that have the same protein sequence as the
701 : parrello 1.10 identified feature. The returned features mar or may not have the same locus. If
702 :     the identifier is not for a protein encoding gene, no result will be returned.
703 : parrello 1.9
704 :     =over 4
705 :    
706 :     =item id
707 :    
708 : parrello 1.10 Identifier of interest. This can be any alias identifier from the B<Identifier>
709 :     table (which includes the FIG ID).
710 : parrello 1.9
711 :     =item RETURN
712 :    
713 : parrello 1.10 Returns a list of FIG IDs for features having the same protein sequence. If the
714 :     identifier does not specify a protein-encoding gene, the list will be empty.
715 : parrello 1.9
716 :     =back
717 :    
718 :     =cut
719 :    
720 :     sub ProteinsByID {
721 :     # Get the parameters.
722 :     my ($self, $id) = @_;
723 :     # Declare the return variable.
724 :     my @retVal;
725 : parrello 1.10 # Compute the protein for this identifier.
726 :     my $protID = $self->IdentifiedProtein($id);
727 :     # Only proceed if a protein was found. If no protein was found, we're
728 :     # already set up to return an empty list.
729 :     if (defined $protID) {
730 :     # Get all the features connected to the identified protein.
731 :     @retVal = $self->GetFlat("IsProteinFor", "IsProteinFor(from-link) = ?",
732 :     [$protID], "IsProteinFor(to-link)");
733 :     }
734 : parrello 1.9 # Return the result.
735 :     return @retVal;
736 :     }
737 :    
738 :     =head3 GetSubsystem
739 :    
740 :     my $ssData = $sapling->GetSubsystem($ssName);
741 :    
742 : parrello 1.11 Return a L<SaplingSubsys> object for the named subsystem.
743 : parrello 1.9
744 :     =over 4
745 : parrello 1.8
746 : parrello 1.9 =item ssName
747 :    
748 :     Name of the desired subsystem.
749 :    
750 :     =item RETURN
751 :    
752 :     Returns an object that defines multiple useful methods for manipulating the
753 :     named subsystem.
754 :    
755 :     =back
756 :    
757 :     =cut
758 :    
759 :     sub GetSubsystem {
760 :     # Get the parameters.
761 :     my ($self, $ssName) = @_;
762 :     # Declare the return variable.
763 :     require SaplingSubsys;
764 :     my $retVal = SaplingSubsys->new($ssName, $self);
765 :     # Return the result.
766 :     return $retVal;
767 :     }
768 : parrello 1.8
769 :    
770 :     =head3 GenesInRegion
771 :    
772 :     my @pegs = $sap->GenesInRegion($location);
773 :    
774 :     Return a list of the IDs for the features that overlap the specified
775 :     region on a contig.
776 :    
777 :     =over 4
778 :    
779 :     =item location
780 :    
781 :     Location of interest, either in the form of a location string (e.g.
782 : parrello 1.11 C<360108.3:NZ_AANK01000002_264528_264007>) or a L<BasicLocation>
783 : parrello 1.8 object.
784 :    
785 :     =item RETURN
786 :    
787 :     Returns a list of feature IDs. The features in the list will be all
788 :     those that overlap or occur inside the location of interest.
789 :    
790 :     =back
791 :    
792 :     =cut
793 :    
794 :     sub GenesInRegion {
795 :     # Get the parameters.
796 :     my ($self, $location) = @_;
797 :     # Insure we have a location object.
798 :     my $locObject = (ref $location ? $location : BasicLocation->new($location));
799 :     # Get the beginning and the end of the location of interest.
800 :     my $begin = $locObject->Left();
801 :     my $end = $locObject->Right();
802 :     # For performance reasons, we limit the possible starting location, using the
803 :     # tuning parameter for maximum location length.
804 :     my $limit = $begin - $self->TuningParameter('maxLocationLength');
805 :     # Perform the query. Note we use a hash to eliminate duplicates.
806 :     my %retVal = map { $_ => 1 } $self->GetFlat('Contig IsLocusFor Feature',
807 :     "Contig(id) = ? AND IsLocusFor(begin) <= ? AND " .
808 :     "IsLocusFor(begin) > ? AND " .
809 :     "IsLocusFor(begin) + IsLocusFor(len) >= ?",
810 :     [$locObject->Contig(), $end, $limit, $begin],
811 :     'Feature(id)');
812 :     # Return the result.
813 :     return sort keys %retVal;
814 :     }
815 :    
816 :     =head3 GetFasta
817 :    
818 : parrello 1.9 my $fasta = $sapling->GetFasta($proteinID, $id, $comment);
819 : parrello 1.8
820 :     Return a FASTA sequence for the specified protein. An optional identifier
821 :     can be provided to be used as the identification string.
822 :    
823 :     =over 4
824 :    
825 :     =item proteinID
826 :    
827 :     Protein sequence identifier.
828 :    
829 :     =item id (optional)
830 :    
831 :     The identifier to be used in the FASTA output. If omitted, the protein ID
832 :     is used.
833 :    
834 : parrello 1.9 =item comment (optional)
835 :    
836 :     The comment string to be used in the identification line of the FASTA output.
837 :     If omitted, no comment will be present.
838 :    
839 : parrello 1.8 =item RETURN
840 :    
841 :     Returns a FASTA string for the protein. This includes the identification
842 : parrello 1.14 line and the protein letters themselves.
843 : parrello 1.8
844 :     =back
845 :    
846 :     =cut
847 :    
848 :     sub GetFasta {
849 :     # Get the parameters.
850 : parrello 1.9 my ($self, $proteinID, $id, $comment) = @_;
851 : parrello 1.11 # Compute the identifier.
852 : parrello 1.8 my $realID = $id || "md5|$proteinID";
853 :     # Declare the return variable.
854 :     my $retVal;
855 :     # Get the protein sequence.
856 :     my ($sequence) = $self->GetFlat("ProteinSequence", "ProteinSequence(id) = ?",
857 :     [$proteinID], "sequence");
858 :     # It's an error if the sequence was not found.
859 :     if (! defined $sequence) {
860 :     Confess("No protein found with the sequence identifier $proteinID.");
861 :     } else {
862 : parrello 1.11 # Create a FASTA string for the protein.
863 :     $retVal = SeedUtils::create_fasta_record($realID, $comment, $sequence);
864 : parrello 1.8 }
865 :     # Return the result.
866 :     return $retVal;
867 :     }
868 : parrello 1.1
869 :    
870 : parrello 1.4 =head3 Taxonomy
871 :    
872 : parrello 1.14 my @taxonomy = $sap->Taxonomy($genomeID, $format);
873 : parrello 1.4
874 :     Return the full taxonomy of the specified genome, starting from the
875 : parrello 1.14 domain downward.
876 : parrello 1.4
877 :     =over 4
878 :    
879 :     =item genomeID
880 :    
881 :     ID of the genome whose taxonomy is desired. The genome does not need to exist
882 :     in the database: the version number will be lopped off and the result used as
883 :     an entry point into the taxonomy tree.
884 :    
885 : parrello 1.14 =item format (optional)
886 :    
887 :     Format of the taxonomy. C<names> will return primary names, C<numbers> will
888 :     return taxonomy numbers, and C<both> will return taxonomy number followed by
889 :     primary name. The default is C<names>.
890 :    
891 : parrello 1.4 =item RETURN
892 :    
893 :     Returns a list of taxonomy names, starting from the domain and moving
894 :     down to the node where the genome is attached.
895 :    
896 :     =back
897 :    
898 :     =cut
899 :    
900 :     sub Taxonomy {
901 :     # Get the parameters.
902 : parrello 1.14 my ($self, $genomeID, $format) = @_;
903 : parrello 1.4 # Get the genome's taxonomic group.
904 : parrello 1.49 my ($taxon) = split m/\./, $genomeID, 2;
905 : parrello 1.4 # We'll put the return data in here.
906 :     my @retVal;
907 :     # Loop until we hit a domain.
908 :     my $domainFlag;
909 :     while (! $domainFlag) {
910 :     # Get the data we need for this taxonomic group.
911 :     my ($taxonData) = $self->GetAll('TaxonomicGrouping IsInGroup',
912 :     'TaxonomicGrouping(id) = ?', [$taxon],
913 :     'domain scientific-name IsInGroup(to-link)');
914 :     # If we didn't find what we're looking for, then we have a problem. This
915 :     # would indicate a node below the domain level that doesn't have a parent
916 :     # or (more likely) an invalid input string.
917 :     if (! $taxonData) {
918 :     # Terminate the loop and trace a warning.
919 :     $domainFlag = 1;
920 :     Trace("Could not find node or parent for \"$taxon\".") if T(1);
921 :     } else {
922 :     # Extract the data for the current group. Note we overwrite our
923 :     # taxonomy ID with the ID of our parent, priming the next iteration
924 :     # of the loop.
925 :     my $name;
926 : parrello 1.14 my $oldTaxon = $taxon;
927 : parrello 1.4 ($domainFlag, $name, $taxon) = @$taxonData;
928 : parrello 1.14 # Compute the value we want to put in the output list.
929 :     my $value;
930 :     if ($format eq 'numbers') {
931 :     $value = $oldTaxon;
932 :     } elsif ($format eq 'both') {
933 :     $value = "$oldTaxon $name";
934 :     } else {
935 :     $value = $name;
936 :     }
937 :     # Put the current group's data in the return list.
938 :     unshift @retVal, $value;
939 : parrello 1.4 }
940 :     }
941 :     # Return the result.
942 :     return @retVal;
943 :     }
944 :    
945 : parrello 1.18 =head3 IsDeletedFid
946 :    
947 :     my $flag = $sapling->IsDeletedFid($fid);
948 :    
949 :     Return TRUE if the specified feature is B<not> in the database, else
950 :     FALSE.
951 :    
952 :     =over 4
953 :    
954 :     =item fid
955 :    
956 :     FIG ID of the relevant feature.
957 :    
958 :     =item RETURN
959 :    
960 :     Returns TRUE if the specified feature is in the database, else FALSE.
961 :    
962 :     =back
963 :    
964 :     =cut
965 :    
966 :     sub IsDeletedFid {
967 :     # Get the parameters.
968 :     my ($self, $fid) = @_;
969 :     # Check for the feature. If the feature does not exist, we'll get an
970 :     # undefined value (FALSE). If it does, we'll get the feature ID itself
971 :     # (TRUE).
972 :     my ($retVal) = $self->GetFlat("Feature", "Feature(id) = ?", [$fid], 'id');
973 :     # Return FALSE if the feature was found, TRUE if it was not found.
974 :     return ($retVal ? 0 : 1);
975 :     }
976 :    
977 : parrello 1.4
978 : parrello 1.1 =head3 GenomeHash
979 :    
980 :     my $genomeHash = $sap->GenomeHash();
981 :    
982 :     Return a hash of the genomes configured to be in this database. The list
983 :     is either taken from the active SEED database or from a configuration
984 :     file in the data directory. The hash maps genome IDs to TRUE.
985 :    
986 :     =cut
987 :    
988 :     sub GenomeHash {
989 :     # Get the parameters.
990 :     my ($self) = @_;
991 :     # We'll build the hash in here.
992 :     my %genomeHash;
993 :     # Do we already have a list?
994 :     if (! defined $self->{genomeHash}) {
995 :     # No, check for a configuration file.
996 :     my $xml = $self->ReadConfigFile();
997 :     if (defined $xml && $xml->{Genomes}) {
998 :     # We found one and it has a genome list, so extract the genomes.
999 :     %genomeHash = map { $_ => 1 } grep { $_ =~ /\S/ } split /\s+/, $xml->{Genomes};
1000 :     } else {
1001 :     # No, so get the genome list.
1002 :     my $fig = $self->GetSourceObject();
1003 : parrello 1.8 my @genomes = $fig->genomes();
1004 : parrello 1.1 # Verify the genome list to insure every genome has an organism
1005 :     # directory.
1006 :     for my $genome (@genomes) {
1007 :     if (-d "$FIG_Config::organisms/$genome") {
1008 :     $genomeHash{$genome} = 1;
1009 :     }
1010 :     }
1011 :     }
1012 :     # Store the genomes in this object.
1013 :     $self->{genomeHash} = \%genomeHash;
1014 :     }
1015 :     # Return the result.
1016 :     return $self->{genomeHash};
1017 :     }
1018 :    
1019 :     =head3 SubsystemID
1020 :    
1021 :     my $subID = $sap->SubsystemID($subName);
1022 :    
1023 :     Return the ID of the subsystem with the specified name.
1024 :    
1025 :     =over 4
1026 :    
1027 :     =item subName
1028 :    
1029 :     Name of the relevant subsystem. A subsystem name with underscores for spaces
1030 :     will return the same ID as a subsystem name with the spaces still in it.
1031 :    
1032 :     =item RETURN
1033 :    
1034 : parrello 1.4 Returns a normalized subsystem name.
1035 : parrello 1.1
1036 :     =back
1037 :    
1038 :     =cut
1039 :    
1040 :     sub SubsystemID {
1041 :     # Get the parameters.
1042 :     my ($self, $subName) = @_;
1043 : parrello 1.4 # Normalize the subsystem name by converting underscores to spaces.
1044 : parrello 1.50 # Underscores at the beginning and end are not converted.
1045 : parrello 1.4 my $retVal = $subName;
1046 : parrello 1.38 my $trailer = chop $retVal;
1047 : parrello 1.50 my $prefix = substr($retVal,0,1);
1048 :     $retVal = substr($retVal, 1);
1049 : parrello 1.38 $retVal =~ tr/_/ /;
1050 : parrello 1.50 $retVal = $prefix . $retVal . $trailer;
1051 : parrello 1.1 # Return the result.
1052 :     return $retVal;
1053 :     }
1054 :    
1055 : parrello 1.14 =head3 Alias
1056 :    
1057 :     my $translatedID = $sap->Alias($fid, $source);
1058 :    
1059 :     Return an alternate ID of the specified type for the specified feature.
1060 :     If no alternate ID of that type exists, the incoming value will be
1061 :     returned unchanged.
1062 :    
1063 :     =over 4
1064 :    
1065 :     =item fid
1066 :    
1067 :     FIG ID of the feature whose alias identifier is desired.
1068 :    
1069 :     =item source
1070 :    
1071 :     Database type for the alternate ID (e.g. C<LocusTag>, C<NCBI>, C<RefSeq>). If
1072 :     C<SEED> is specified, the ID will be returned unchanged and no database lookup
1073 :     will occur.
1074 :    
1075 :     =item RETURN
1076 :    
1077 :     Returns an equivalent ID for the specified feature that belongs to the specified
1078 :     database (that is, has the specified source). If no such ID exists, returns the
1079 :     incoming ID.
1080 :    
1081 :     =back
1082 :    
1083 :     =cut
1084 :    
1085 :     sub Alias {
1086 :     # Get the parameters.
1087 :     my ($self, $fid, $source) = @_;
1088 :     # Default to the incoming value.
1089 :     my $retVal = $fid;
1090 :     # We only have work to do if the database type isn't "SEED".
1091 :     if ($source ne 'SEED') {
1092 :     # Look for an alias.
1093 :     my ($alias) = $self->GetFlat("IsIdentifiedBy Identifier",
1094 : parrello 1.15 'IsIdentifiedBy(from-link) = ? AND Identifier(source) = ?',
1095 : parrello 1.14 [$fid, $source], 'Identifier(natural-form)');
1096 :     # If we found one, return it.
1097 :     if (defined $alias) {
1098 :     $retVal = $alias;
1099 :     }
1100 :     }
1101 :     # Return the result.
1102 :     return $retVal;
1103 :     }
1104 :    
1105 :    
1106 :     =head3 ContigLength
1107 :    
1108 :     my $contigLen = $sap->ContigLength($contigID);
1109 :    
1110 :     Return the number of base pairs in the specified contig.
1111 :    
1112 :     =over 4
1113 :    
1114 :     =item contigID
1115 :    
1116 :     ID of the contig of interest.
1117 :    
1118 :     =item RETURN
1119 :    
1120 :     Returns the number of base pairs in the specified contig, or 0 if the contig
1121 :     does not exist.
1122 :    
1123 :     =back
1124 :    
1125 :     =cut
1126 :    
1127 :     sub ContigLength {
1128 :     # Get the parameters.
1129 :     my ($self, $contigID) = @_;
1130 :     # Try to find the length.
1131 :     my ($retVal) = $self->GetEntityValues(Contig => $contigID, 'length');
1132 :     # Convert not-found to 0.
1133 :     $retVal = 0 if ! defined $retVal;
1134 :     # Return the result.
1135 :     return $retVal;
1136 :     }
1137 :    
1138 : parrello 1.46 =head3 ReactionRoles
1139 :    
1140 :     my @roles = $sap->ReactionRoles($rxnID);
1141 :    
1142 :     Return a list of all the roles for a single reaction. The reactions are connected
1143 :     to roles through the complexes, so an extra step is required to sort out
1144 :     duplicates from the results.
1145 :    
1146 :     =over 4
1147 :    
1148 :     =item rxnID
1149 :    
1150 :     ID of the reaction whose roles are desired.
1151 :    
1152 :     =item RETURN
1153 :    
1154 :     Returns a list of the roles associated with the reaction.
1155 :    
1156 :     =back
1157 :    
1158 :     =cut
1159 :    
1160 :     sub ReactionRoles {
1161 :     # Get the parameters.
1162 :     my ($self, $rxnID) = @_;
1163 :     # Get the roles for this reaction, using a hash to filter out the
1164 :     # duplicates.
1165 :     my %retVal = map { $_ => 1 } $self->GetFlat("IsElementOf IsTriggeredBy",
1166 :     "IsElementOf(from-link) = ?", [$rxnID], "IsTriggeredBy(to-link)");
1167 :     # Sort and return the results.
1168 :     return sort keys %retVal;
1169 :     }
1170 :    
1171 :     =head3 RoleReactions
1172 :    
1173 :     my @rxns = $sap->RoleReactions($roleID);
1174 :    
1175 :     Return a list of all the reactions for a single role. The reactions are connected
1176 :     to roles through the complexes, so an extra step is required to sort out
1177 :     duplicates from the results.
1178 :    
1179 :     =over 4
1180 :    
1181 :     =item roleID
1182 :    
1183 :     ID of the role whose reactions are desired.
1184 :    
1185 :     =item RETURN
1186 :    
1187 :     Returns a list of the IDs for the reactions associated with the role.
1188 :    
1189 :     =back
1190 :    
1191 :     =cut
1192 :    
1193 :     sub RoleReactions {
1194 :     # Get the parameters.
1195 :     my ($self, $roleID) = @_;
1196 :     # Get the roles for this reaction, using a hash to filter out the
1197 :     # duplicates.
1198 :     my %retVal = map { $_ => 1 } $self->GetFlat("Triggers IsSetOf",
1199 :     "Triggers(from-link) = ?", [$roleID], "IsSetOf(to-link)");
1200 :     # Sort and return the results.
1201 :     return sort keys %retVal;
1202 :     }
1203 :    
1204 : parrello 1.14
1205 :     =head2 Configuration-Related Methods
1206 :    
1207 : parrello 1.1 =head3 SubsystemHash
1208 :    
1209 :     my $subHash = $sap->SubsystemHash();
1210 :    
1211 :     Return a hash of the subsystems configured to be in this database. The
1212 :     list is either taken from the active SEED database or from a
1213 :     configuration file in the data directory. The hash maps subsystem names
1214 :     to TRUE.
1215 :    
1216 :     =cut
1217 :    
1218 :     sub SubsystemHash {
1219 :     # Get the parameters.
1220 :     my ($self) = @_;
1221 :     # We'll build the hash in here.
1222 :     my %subHash;
1223 :     # Do we already have a list?
1224 :     if (! defined $self->{subHash}) {
1225 :     # No, check for a configuration file.
1226 :     my $xml = $self->ReadConfigFile();
1227 :     if (defined $xml && $xml->{Subsystems}) {
1228 :     # We found one, and it has subsystems, so we extract them.
1229 :     # A little dancing is necessary to trim spaces.
1230 :     my @subs = map { $_ =~ /\s*(\S.+\S)/; $1 } split /\n/, $xml->{Subsystems};
1231 :     # Here we need to clear out any null subsystem names resulting from
1232 :     # blank lines in the file.
1233 :     %subHash = map { $_ => 1 } grep { $_ } @subs;
1234 :     } else {
1235 :     # No config file, so we ask the FIG object.
1236 :     my $fig = $self->GetSourceObject();
1237 : parrello 1.36 for my $subsystem ($fig->all_subsystems()) {
1238 :     my $subsysID = $self->SubsystemID($subsystem);
1239 : parrello 1.37 $subHash{$subsysID} = 1;
1240 :     }
1241 : parrello 1.1 }
1242 :     # Store the subsystems in this object.
1243 :     $self->{subHash} = \%subHash;
1244 :     }
1245 :     # Return the result.
1246 :     return $self->{subHash};
1247 :     }
1248 :    
1249 :     =head3 TuningParameter
1250 :    
1251 :     my $parm = $erdb->TuningParameter($parmName);
1252 :    
1253 :     Return the value of the specified tuning parameter. Tuning parameters are
1254 :     read from the XML configuration file.
1255 :    
1256 :     =over 4
1257 :    
1258 :     =item parmName
1259 :    
1260 :     Name of the parameter whose value is desired.
1261 :    
1262 :     =item RETURN
1263 :    
1264 :     Returns the paramter value.
1265 :    
1266 :     =back
1267 :    
1268 :     =cut
1269 :    
1270 :     sub TuningParameter {
1271 :     # Get the parameters.
1272 :     my ($self, $parmName) = @_;
1273 :     # Insure we have the parameters in memory.
1274 :     if (! defined $self->{tuning}) {
1275 :     # Read the configuration file.
1276 :     my $configFile = $self->ReadConfigFile();
1277 :     # Get the tuning parameters (if any).
1278 :     my $tuning;
1279 :     if (! defined $configFile || ! exists $configFile->{TuningParameters}) {
1280 :     $tuning = {};
1281 :     } else {
1282 :     $tuning = $configFile->{TuningParameters};
1283 :     }
1284 :     # Merge in the default option values.
1285 :     Tracer::MergeOptions($tuning, TUNING_DEFAULTS);
1286 :     # Save the result in our object.
1287 :     $self->{tuning} = $tuning;
1288 :     }
1289 :     # Extract the tuning paramter.
1290 :     my $retVal = $self->{tuning}{$parmName};
1291 :     # Throw an error if it does not exist.
1292 :     Confess("Invalid tuning parameter \"$parmName\".") if ! defined $retVal;
1293 :     # Return the result.
1294 :     return $retVal;
1295 :     }
1296 :    
1297 :    
1298 :     =head3 ReadConfigFile
1299 :    
1300 :     my $xmlObject = $sap->ReadConfigFile();
1301 :    
1302 :     Return the hash structure created from reading the configuration file, or
1303 :     an undefined value if the file is not found.
1304 :    
1305 :     =cut
1306 :    
1307 :     sub ReadConfigFile {
1308 :     my ($self) = @_;
1309 :     # Declare the return variable.
1310 :     my $retVal;
1311 :     # Compute the configuration file name.
1312 :     my $fileName = "$self->{loadDirectory}/SaplingConfig.xml";
1313 :     # Did we find it?
1314 :     if (-f $fileName) {
1315 :     # Yes, read it in.
1316 :     $retVal = XMLin($fileName);
1317 :     }
1318 :     # Return the result.
1319 :     return $retVal;
1320 :     }
1321 :    
1322 :     =head3 GlobalSection
1323 :    
1324 :     my $flag = $sap->GlobalSection($name);
1325 :    
1326 :     Return TRUE if the specified section name is the global section, FALSE
1327 :     otherwise.
1328 :    
1329 :     =over 4
1330 :    
1331 :     =item name
1332 :    
1333 :     Section name to test.
1334 :    
1335 :     =item RETURN
1336 :    
1337 : parrello 1.4 Returns TRUE if the parameter matches the GLOBAL constant, else FALSE.
1338 : parrello 1.1
1339 :     =back
1340 :    
1341 :     =cut
1342 :    
1343 :     sub GlobalSection {
1344 :     # Get the parameters.
1345 :     my ($self, $name) = @_;
1346 :     # Return the result.
1347 :     return ($name eq GLOBAL);
1348 :     }
1349 :    
1350 : parrello 1.35 =head3 LoadGenome
1351 :    
1352 :     my $stats = $sap->LoadGenome($genome, $directory);
1353 :    
1354 :     Load the specified genome directory into the database. The genome's DNA, features,
1355 :     protein sequences, and other supporting information will be inserted. If the
1356 :     genome already exists, numerous errors will occur; therefore, it is recommended
1357 :     that the genome be deleted first using the L<ERDB/Delete> method.
1358 :    
1359 :     =over 4
1360 :    
1361 :     =item genom
1362 :    
1363 :     The ID of the genome being loaded.
1364 :    
1365 :     =item directory
1366 :    
1367 :     Name of the genome directory.
1368 :    
1369 :     =item RETURN
1370 :    
1371 :     Returns a statistics object describing the load activity.
1372 :    
1373 :     =back
1374 :    
1375 :     =cut
1376 :    
1377 :     sub LoadGenome {
1378 :     # Get the parameters.
1379 :     my ($self, $genome, $directory) = @_;
1380 :     # Verify that the directory exists.
1381 :     Confess ("Genome directory $directory not found.") if ! -d $directory;
1382 :     # Verify that the ID is valid.
1383 :     Confess("Invalid genome ID $genome.") if $genome !~ /^\d+\.\d+$/;
1384 :     # Import the loader and call it.
1385 :     require SaplingGenomeLoader;
1386 :     my $retVal = SaplingGenomeLoader::Load($self, $genome, $directory);
1387 :     # Return the statistics object.
1388 :     return $retVal;
1389 :     }
1390 :    
1391 : parrello 1.14 =head2 Special-Purpose Methods
1392 :    
1393 :     =head3 ComputeFeatureFilter
1394 :    
1395 : parrello 1.30 my ($objects, $filter, @parms) = $sap->ComputeFeatureFilter($source, $genome);
1396 : parrello 1.14
1397 :     Compute the initial object name list, filter string, and parameter list
1398 :     for a query by feature ID. The object name list will always end with the
1399 :     B<Feature> entity, and the combination of the filter string and parameter
1400 :     list will translate the incoming ID from the specified format to a real
1401 :     FIG feature ID. If the specified format B<is> FIG feature IDs, then the
1402 :     query will start on the B<Feature> entity; otherwise, it will start with
1403 :     the B<Identifier> entity. This is a special-purpose method that performs
1404 :     the task of intelligently modifying queries to allow for external ID
1405 :     types.
1406 :    
1407 :     =over 4
1408 :    
1409 :     =item source (optional)
1410 :    
1411 :     Database source of the IDs specified-- C<SEED> for FIG IDs, C<GENE> for standard
1412 :     gene identifiers, or C<LocusTag> for locus tags. In addition, you may specify
1413 :     C<RefSeq>, C<CMR>, C<NCBI>, C<Trembl>, or C<UniProt> for IDs from those databases.
1414 :     Use C<mixed> to allow mixed ID types (though this may cause problems when the same
1415 : parrello 1.22 ID has different meanings in different databases). Use C<prefixed> to allow IDs with
1416 :     prefixing indicating the ID type (e.g. C<uni|P00934> for a UniProt ID, C<gi|135813> for
1417 :     an NCBI identifier, and so forth). The default is C<SEED>.
1418 : parrello 1.14
1419 : parrello 1.30 =item genome (optional)
1420 :    
1421 :     ID of a genome. If specified, only features from the specified genome will be
1422 :     accepted by the filter. This is important for IDs that are ambiguous between
1423 :     genomes (like Locus Tags). If omitted, no genome filtering will take place.
1424 :    
1425 : parrello 1.14 =item RETURN
1426 :    
1427 :     Returns a list containing parameters to the desired query call. The first element
1428 :     is the prefix for the object name list, the second is the prefix for the filter
1429 :     string, and the subsequent elements form the prefix for the parameter value list.
1430 :    
1431 :     =back
1432 :    
1433 :     =cut
1434 :    
1435 :     sub ComputeFeatureFilter {
1436 :     # Get the parameters.
1437 : parrello 1.30 my ($self, $source, $genome) = @_;
1438 : parrello 1.14 # Declare the return variables.
1439 :     my ($objects, $filter, @parms);
1440 : parrello 1.30 # This will be set to TRUE if we are directly processing FIG IDs.
1441 :     my $figOnly = 0;
1442 : parrello 1.14 # Determine the source type.
1443 :     if (! defined $source || $source eq 'SEED') {
1444 :     # Here we're directly processing FIG IDs.
1445 :     $objects = 'Feature';
1446 : parrello 1.30 $filter = 'Feature(id) = ?';
1447 :     $figOnly = 1;
1448 : parrello 1.14 } elsif ($source eq 'mixed') {
1449 :     # Here we're processing mixed IDs of unknown type.
1450 :     $objects = 'Identifier Identifies Feature';
1451 :     $filter = 'Identifier(natural-form) = ?';
1452 : parrello 1.22 } elsif ($source eq 'prefixed') {
1453 :     # Here we're processing mixed IDs with prefixes. This is the internal form
1454 :     # of the ID.
1455 :     $objects = 'Identifier Identifies Feature';
1456 :     $filter = 'Identifier(id) = ?';
1457 : parrello 1.14 } else {
1458 :     # Here we're processing a fixed ID type from an external database.
1459 :     # This is the case that requires an additional parameter. Note that
1460 :     # we insist that the additional parameter matches the first parameter
1461 :     # mark.
1462 :     $objects = 'Identifier Identifies Feature';
1463 :     $filter = 'Identifier(source) = ? AND Identifier(natural-form) = ?';
1464 :     push @parms, $source;
1465 :     }
1466 : parrello 1.30 # Was a genome ID specified?
1467 :     if ($genome) {
1468 :     # Yes. Add genome filtering.
1469 :     if ($figOnly) {
1470 :     # In a FIG ID situation, we can simply add the genome filtering to the front
1471 :     # of the object list.
1472 :     $objects = "Genome IsOwnerOf $objects";
1473 :     } else {
1474 :     # Otherwise, we need to do an AND thing.
1475 :     $objects = "Genome IsOwnerOf Feature AND $objects";
1476 :     }
1477 :     # Add the genome ID to the filter clause.
1478 :     $filter = "Genome(id) = ? AND $filter";
1479 :     # Add it to the parameter list.
1480 :     unshift @parms, $genome;
1481 :     }
1482 : parrello 1.14 # Return the results.
1483 :     return ($objects, $filter, @parms);
1484 :     }
1485 :    
1486 :    
1487 :     =head3 FindGapLeft
1488 :    
1489 :     my @operonData = $sap->FindGapLeft($loc, $maxGap, $interval, \%redundancyHash, \$redundancyFlag);
1490 :    
1491 :     This method performs a rather arcane task: searching for a gap to the
1492 :     left of a location in the contig. The search will proceed from the
1493 :     starting point to the left, and will stop when a gap between occupied
1494 :     locations is found that is larger than the specified maximum. The caller
1495 :     has the option of specifying a hash of feature IDs that are redundant. If
1496 :     any feature in the hash is found, the search will stop early and the
1497 :     provided redundancy flag will be set. In addition, an interval size can
1498 :     be specified to tune the process of retrieving data from the database.
1499 :    
1500 :     =over 4
1501 :    
1502 :     =item loc
1503 :    
1504 :     L<BasicLocation> object for the location from which the search is to start.
1505 :     This gives us the contig ID, the strand of interest (forward or backward),
1506 :     and the starting point of the search.
1507 :    
1508 :     =item maxGap
1509 :    
1510 :     The maximum allowable gap. The search will stop at the left end of the contig
1511 :     or the first gap larger than this amount.
1512 :    
1513 :     =item interval (optional)
1514 :    
1515 :     Interval to use for retrieving data from the database. This is the size of
1516 :     the contig segments being retrieved. The default is C<10000>
1517 :    
1518 :     =item redundancyHash (optional)
1519 :    
1520 :     A hash of feature IDs. If any feature present in this hash is found during
1521 :     the search, the search will stop and no data will be returned. The default
1522 :     is an empty hash (no check).
1523 :    
1524 :     =item redundancyFlag (optional)
1525 :    
1526 :     A reference to a scalar flag. If present, the entire method will be bypassed
1527 :     if the flag is TRUE. If a redundancy hash is specified and a redundant feature
1528 :     is found, this flag will be set to TRUE by the method.
1529 :    
1530 :     =item RETURN
1531 :    
1532 :     Returns a list of 4-tuples. Each tuple will contain a feature ID, a begin
1533 :     offset, a direction (C<+> or C<->), and a length, representing an occupied
1534 :     location on the contig and the feature to which it belongs. The complete
1535 :     list of locations will be to the left of the starting location and relatively
1536 :     close together, with no gap larger than the caller-specified maximum.
1537 :    
1538 :     =back
1539 :    
1540 :     =cut
1541 :    
1542 :     sub FindGapLeft {
1543 :     # Get the parameters.
1544 :     my ($self, $loc, $maxGap, $interval, $redundancyHash, $redundancyFlag) = @_;
1545 :     # Declare the return variable.
1546 :     my @retVal;
1547 :     # Fix up defaults for the missing parameters.
1548 :     $interval ||= 10000;
1549 :     if (! defined $redundancyHash) {
1550 :     $redundancyHash = {};
1551 :     }
1552 :     my $fakeFlag = 0;
1553 :     if (! defined $redundancyFlag) {
1554 :     $redundancyFlag = \$fakeFlag;
1555 :     }
1556 :     # This flag will be set to TRUE if we run out of locations or find a gap.
1557 :     my $gapFound = 0;
1558 :     # This will be used to store tuples found. If we are successful, it will
1559 :     # be copied to the return list.
1560 :     my @operonData;
1561 :     # Now we need to set up some data for the loop. In particular, the contig
1562 :     # ID, the strand (direction), and the starting point. We add one to the
1563 :     # starting current position to insure that the starting point is included
1564 :     # in the first search.
1565 :     my $currentPosition = $loc->Left + 1;
1566 :     my $contigID = $loc->Contig;
1567 :     my $strand = $loc->Dir;
1568 :     # This variable keeps the leftmost begin location found.
1569 :     my $begin = $loc->Left;
1570 :     # Loop until we find a redundancy or a gap.
1571 :     while (! $$redundancyFlag && ! $gapFound && $currentPosition >= 0) {
1572 :     # Compute the limits of the search interval for this iteration.
1573 :     my $nextPosition = $currentPosition - $interval;
1574 :     # Get all the locations in the interval.
1575 :     my @rows = $self->GetAll("IsLocatedIn",
1576 :     'IsLocatedIn(to-link) = ? AND IsLocatedIn(dir) = ? AND IsLocatedIn(begin) >= ? AND IsLocatedIn(begin) < ?',
1577 :     [$contigID, $strand, $nextPosition, $currentPosition],
1578 :     [qw(from-link begin dir len)]);
1579 :     # If nothing was found, it's a gap.
1580 :     if (! @rows) {
1581 :     $gapFound = 1;
1582 :     } else {
1583 :     # We got something, so we can loop through looking for gaps. The search
1584 :     # requires we sort by right point.
1585 :     my @sortableTuples;
1586 :     for my $tuple (@rows) {
1587 :     my ($fid, $left, $dir, $len) = @$tuple;
1588 :     push @sortableTuples, [$left + $len, $tuple];
1589 :     }
1590 :     my @sortedTuples = map { $_->[1] } sort { -($a->[0] <=> $b->[0]) } @sortableTuples;
1591 :     # Loop through the tuples, stopping at the first redundancy or gap.
1592 :     for my $tuple (@sortedTuples) { last if $gapFound || $$redundancyFlag;
1593 :     # Get this tuple's data.
1594 :     my ($fid, $left, $dir, $len) = @$tuple;
1595 :     # Is it close enough to be counted?
1596 :     if ($begin - ($left + $len) <= $maxGap) {
1597 :     # Yes. We can include this tuple.
1598 :     push @operonData, $tuple;
1599 :     # Update the begin point.
1600 :     $begin = $left;
1601 :     # Is it redundant? It's only reasonable to ask this if it's
1602 :     # an included feature.
1603 :     if ($redundancyHash->{$fid}) {
1604 :     $$redundancyFlag = 1;
1605 :     }
1606 :     } else {
1607 :     # No, it's not close enough. We've found a gap.
1608 :     $gapFound = 1;
1609 :     }
1610 :     }
1611 :     }
1612 :     # Set up for the next interval.
1613 :     $currentPosition = $nextPosition;
1614 :     }
1615 :     # If we're nonredundant, save our results.
1616 :     if (! $$redundancyFlag) {
1617 :     @retVal = @operonData;
1618 :     }
1619 :     # Return the result.
1620 :     return @retVal;
1621 :     }
1622 :    
1623 :     =head3 FindGapRight
1624 : parrello 1.13
1625 : parrello 1.14 my @operonData = $sap->FindGapRight($loc, $maxGap, $interval, \%redundancyHash, \$redundancyFlag);
1626 : parrello 1.13
1627 : parrello 1.14 This method is the dual of L</FindGapLeft>: it searches for a gap to the
1628 :     right of a location in the contig. The search will proceed from the
1629 :     starting point to the right, and will stop when a gap between occupied
1630 :     locations is found that is larger than the specified maximum. The caller
1631 :     has the option of specifying a hash of feature IDs that are redundant. If
1632 :     any feature in the hash is found, the search will stop early and the
1633 :     provided redundancy flag will be set. In addition, an interval size can
1634 :     be specified to tune the process of retrieving data from the database.
1635 : parrello 1.13
1636 :     =over 4
1637 :    
1638 : parrello 1.14 =item loc
1639 :    
1640 :     L<BasicLocation> object for the location from which the search is to start.
1641 :     This gives us the contig ID, the strand of interest (forward or backward),
1642 :     and the starting point of the search.
1643 :    
1644 :     =item maxGap
1645 :    
1646 :     The maximum allowable gap. The search will stop at the right end of the contig
1647 :     or the first gap larger than this amount.
1648 :    
1649 :     =item interval (optional)
1650 :    
1651 :     Interval to use for retrieving data from the database. This is the size of
1652 :     the contig segments being retrieved. The default is C<10000>
1653 :    
1654 :     =item redundancyHash (optional)
1655 :    
1656 :     A hash of feature IDs. If any feature present in this hash is found during
1657 :     the search, the search will stop and no data will be returned. The default
1658 :     is an empty hash (no check).
1659 :    
1660 :     =item redundancyFlag (optional)
1661 : parrello 1.13
1662 : parrello 1.14 A reference to a scalar flag. If present, the entire method will be bypassed
1663 :     if the flag is TRUE. If a redundancy hash is specified and a redundant feature
1664 :     is found, this flag will be set to TRUE by the method.
1665 : parrello 1.13
1666 :     =item RETURN
1667 :    
1668 : parrello 1.14 Returns a list of 4-tuples. Each tuple will contain a feature ID, a begin
1669 :     offset, a direction (C<+> or C<->), and a length, representing an occupied
1670 :     location on the contig and the feature to which it belongs. The complete
1671 :     list of locations will be to the right of the starting location and relatively
1672 :     close together, with no gap larger than the caller-specified maximum.
1673 : parrello 1.13
1674 :     =back
1675 :    
1676 :     =cut
1677 :    
1678 : parrello 1.14 sub FindGapRight {
1679 : parrello 1.13 # Get the parameters.
1680 : parrello 1.14 my ($self, $loc, $maxGap, $interval, $redundancyHash, $redundancyFlag) = @_;
1681 :     # Declare the return variable.
1682 :     my @retVal;
1683 :     # Fix up defaults for the missing parameters.
1684 :     $interval ||= 10000;
1685 :     if (! defined $redundancyHash) {
1686 :     $redundancyHash = {};
1687 :     }
1688 :     my $fakeFlag = 0;
1689 :     if (! defined $redundancyFlag) {
1690 :     $redundancyFlag = \$fakeFlag;
1691 :     }
1692 :     # This flag will be set to TRUE if we run out of locations or find a gap.
1693 :     my $gapFound = 0;
1694 :     # This will be used to store tuples found. If we are successful, it will
1695 :     # be copied to the return list.
1696 :     my @operonData;
1697 :     # Now we need to set up some data for the loop. In particular, the contig
1698 :     # ID, the strand (direction), and the starting point. We subtract one from the
1699 :     # starting current position to insure that the starting point is included
1700 :     # in the first search.
1701 :     my $currentPosition = $loc->Left - 1;
1702 :     my $contigID = $loc->Contig;
1703 :     my $strand = $loc->Dir;
1704 :     # Get the length of the contig.
1705 :     my $contigLen = $self->ContigLength($contigID);
1706 : parrello 1.19 Trace("Contig length is $contigLen. Starting at $currentPosition.") if T(3);
1707 : parrello 1.14 # This variable keeps the rightmost end location found.
1708 :     my $endPoint = $loc->Left;
1709 :     # Loop until we find a redundancy or a gap.
1710 :     while (! $$redundancyFlag && ! $gapFound && $currentPosition <= $contigLen) {
1711 : parrello 1.19 Trace("Checking at $currentPosition.") if T(3);
1712 : parrello 1.14 # Compute the limits of the search interval for this iteration.
1713 :     my $nextPosition = $currentPosition + $interval;
1714 :     # Get all the locations in the interval.
1715 :     my @rows = $self->GetAll("IsLocatedIn",
1716 :     'IsLocatedIn(to-link) = ? AND IsLocatedIn(dir) = ? AND IsLocatedIn(begin) >= ? AND IsLocatedIn(begin) < ?',
1717 : parrello 1.19 [$contigID, $strand, $currentPosition, $nextPosition],
1718 : parrello 1.14 [qw(from-link begin dir len)]);
1719 :     # If nothing was found, it's a gap.
1720 :     if (! @rows) {
1721 :     $gapFound = 1;
1722 : parrello 1.19 Trace("No result. Gap found.") if T(3);
1723 : parrello 1.14 } else {
1724 :     # We got something, so we can loop through looking for gaps. The search
1725 :     # requires we sort by left point.
1726 :     my @sortedTuples = sort { $a->[1] <=> $b->[1] } @rows;
1727 :     # Loop through the tuples, stopping at the first redundancy or gap.
1728 :     for my $tuple (@sortedTuples) { last if $gapFound || $$redundancyFlag;
1729 :     # Get this tuple's data.
1730 :     my ($fid, $left, $dir, $len) = @$tuple;
1731 :     # Is it close enough to be counted?
1732 :     if ($left - $endPoint <= $maxGap) {
1733 :     # Yes. We can include this tuple.
1734 :     push @operonData, $tuple;
1735 :     # Update the end point.
1736 :     $endPoint = $left + $len;
1737 :     # Is it redundant? It's only reasonable to ask this if it's
1738 :     # an included feature.
1739 :     if ($redundancyHash->{$fid}) {
1740 :     $$redundancyFlag = 1;
1741 :     }
1742 :     } else {
1743 :     # No, it's not close enough. We've found a gap.
1744 :     $gapFound = 1;
1745 : parrello 1.19 Trace("Long distance. Gap found.") if T(3);
1746 : parrello 1.14 }
1747 :     }
1748 :     }
1749 :     # Set up for the next interval.
1750 :     $currentPosition = $nextPosition;
1751 :     }
1752 :     # If we're nonredundant, save our results.
1753 :     if (! $$redundancyFlag) {
1754 :     @retVal = @operonData;
1755 :     }
1756 : parrello 1.13 # Return the result.
1757 : parrello 1.14 return @retVal;
1758 : parrello 1.13 }
1759 :    
1760 : parrello 1.17 =head3 GenomesInPairSet
1761 :    
1762 :     my @genomes = $sap->GenomesInPairSet($pairSetID);
1763 :    
1764 :     Return a list of the IDs for all of the genomes represented in the
1765 :     specified pair set. This is useful when analyzing what data is missing
1766 :     from the coupling tables.
1767 :    
1768 :     =over 4
1769 :    
1770 :     =item pairSetID
1771 :    
1772 :     ID of the pair set to examine.
1773 :    
1774 :     =item RETURN
1775 :    
1776 :     Returns a list of the IDs for the genomes represented in the specified pair set.
1777 :    
1778 :     =back
1779 :    
1780 :     =cut
1781 :    
1782 :     sub GenomesInPairSet {
1783 :     # Get the parameters.
1784 :     my ($self, $pairSetID) = @_;
1785 :     # We'll use this hash to isolate the genome IDs.
1786 :     my %retVal;
1787 :     # Get all the pairs in this set.
1788 :     my $query = $self->Get("IsDeterminedBy", "IsDeterminedBy(from-link) = ?",
1789 :     [$pairSetID]);
1790 :     while (my $pairData = $query->Fetch()) {
1791 :     # Record the genomes for the pegs in the pair. The pegs can be found
1792 :     # separated by a colon in the pairing ID.
1793 : parrello 1.49 for my $peg (split m/:/, $pairData->PrimaryValue('to-link')) {
1794 : parrello 1.17 $retVal{genome_of($peg)} = 1;
1795 :     }
1796 :     }
1797 :     # Return the genome IDs.
1798 :     return keys %retVal;
1799 :     }
1800 :    
1801 : parrello 1.1
1802 :     =head2 Virtual Methods
1803 :    
1804 : parrello 1.11 =head3 PreferredName
1805 :    
1806 :     my $name = $erdb->PreferredName();
1807 :    
1808 :     Return the variable name to use for this database when generating code.
1809 :    
1810 :     =cut
1811 :    
1812 :     sub PreferredName {
1813 :     return 'sap';
1814 :     }
1815 :    
1816 : parrello 1.1 =head3 GetSourceObject
1817 :    
1818 :     my $source = $erdb->GetSourceObject();
1819 :    
1820 :     Return the object to be used in creating load files for this database. This is
1821 :     only the default source object. Loaders have the option of overriding the chosen
1822 : parrello 1.11 source object when constructing the L<ERDBLoadGroup> objects.
1823 : parrello 1.1
1824 :     =cut
1825 :    
1826 :     sub GetSourceObject {
1827 :     my ($self) = @_;
1828 :     # Insure the source object exists in our internal cache.
1829 :     if (! defined $self->{source}) {
1830 :     # We require the FIG object. If the user has no intention of
1831 :     # doing a load, this method won't be used, so he won't need to
1832 :     # have the FIG object on his system.
1833 :     require FIG;
1834 :     $self->{source} = FIG->new();
1835 :     }
1836 :     # Return it to the caller.
1837 :     return $self->{source};
1838 :     }
1839 :    
1840 :     =head3 SectionList
1841 :    
1842 :     my @sections = $erdb->SectionList();
1843 :    
1844 :     Return a list of the names for the different data sections used when loading this database.
1845 :     The default is a single string, in which case there is only one section representing the
1846 :     entire database.
1847 :    
1848 :     =cut
1849 :    
1850 :     sub SectionList {
1851 :     # Get the parameters.
1852 :     my ($self) = @_;
1853 : parrello 1.39 # The section names will be put in here.
1854 :     my @retVal;
1855 : parrello 1.41 # Get the name of the section control file.
1856 : parrello 1.43 my $controlFileName = ERDBGenerate::CreateFileName("SectionList", undef, 'control', $self->LoadDirectory());
1857 : parrello 1.41 # Check to see if it exists.
1858 :     if (-f $controlFileName) {
1859 :     # Yes. Pull out the sections from it.
1860 : parrello 1.42 Trace("Reading section list from $controlFileName.") if T(ERDBGenerate => 2);
1861 : parrello 1.41 @retVal = Tracer::GetFile($controlFileName);
1862 :     } else {
1863 :     # No, so we have to create it. Get the genome hash.
1864 :     my $genomes = $self->GenomeHash();
1865 :     @retVal = sort keys %$genomes;
1866 :     # Append the global section.
1867 :     push @retVal, GLOBAL;
1868 :     # Write out the control file with the new sections.
1869 : parrello 1.42 Trace("Writing section list to $controlFileName.") if T(ERDBGenerate => 2);
1870 : parrello 1.41 Tracer::PutFile($controlFileName, \@retVal);
1871 :     }
1872 : parrello 1.1 # Return the section list.
1873 :     return @retVal;
1874 :     }
1875 :    
1876 :     =head3 Loader
1877 :    
1878 :     my $groupLoader = $erdb->Loader($groupName, $source, $options);
1879 :    
1880 : parrello 1.11 Return an L<ERDBLoadGroup> object for the specified load group. This method is used
1881 :     by L<ERDBGenerator.pl> to create the load group objects. If you are not using
1882 :     L<ERDBGenerator.pl>, you don't need to override this method.
1883 : parrello 1.1
1884 :     =over 4
1885 :    
1886 :     =item groupName
1887 :    
1888 :     Name of the load group whose object is to be returned. The group name is
1889 :     guaranteed to be a single word with only the first letter capitalized.
1890 :    
1891 :     =item source
1892 :    
1893 :     The source object used to access the data from which the load file is derived. This
1894 :     is the same object returned by L</GetSourceObject>; however, we allow the caller to pass
1895 :     it in as a parameter so that we don't end up creating multiple copies of a potentially
1896 :     expensive data structure. It is permissible for this value to be undefined, in which
1897 :     case the source will be retrieved the first time the client asks for it.
1898 :    
1899 :     =item options
1900 :    
1901 :     Reference to a hash of command-line options.
1902 :    
1903 :     =item RETURN
1904 :    
1905 : parrello 1.11 Returns an L<ERDBLoadGroup> object that can be used to process the specified load group
1906 : parrello 1.1 for this database.
1907 :    
1908 :     =back
1909 :    
1910 :     =cut
1911 :    
1912 :     sub Loader {
1913 :     # Get the parameters.
1914 :     my ($self, $groupName, $options) = @_;
1915 :     # Compute the loader name.
1916 :     my $loaderClass = "${groupName}SaplingLoader";
1917 :     # Pull in its definition.
1918 :     require "$loaderClass.pm";
1919 :     # Create an object for it.
1920 :     my $retVal = eval("$loaderClass->new(\$self, \$options)");
1921 :     # Insure it worked.
1922 :     Confess("Could not create $loaderClass object: $@") if $@;
1923 :     # Return it to the caller.
1924 :     return $retVal;
1925 :     }
1926 :    
1927 :     =head3 LoadGroupList
1928 :    
1929 :     my @groups = $erdb->LoadGroupList();
1930 :    
1931 :     Returns a list of the names for this database's load groups. This method is used
1932 : parrello 1.11 by L<ERDBGenerator.pl> when the user wishes to load all table groups. The default
1933 : parrello 1.1 is a single group called 'All' that loads everything.
1934 :    
1935 :     =cut
1936 :    
1937 :     sub LoadGroupList {
1938 :     # Return the list.
1939 : parrello 1.47 return qw(Model Alignment Expression Subsystem Family Feature Protein Genome Scenario);
1940 : parrello 1.1 }
1941 :    
1942 :     =head3 LoadDirectory
1943 :    
1944 :     my $dirName = $erdb->LoadDirectory();
1945 :    
1946 :     Return the name of the directory in which load files are kept. The default is
1947 :     the FIG temporary directory, which is a really bad choice, but it's always there.
1948 :    
1949 :     =cut
1950 :    
1951 :     sub LoadDirectory {
1952 :     # Get the parameters.
1953 :     my ($self) = @_;
1954 :     # Return the directory name.
1955 :     return $self->{loadDirectory};
1956 :     }
1957 :    
1958 : parrello 1.21 =head3 UseInternalDBD
1959 :    
1960 :     my $flag = $erdb->UseInternalDBD();
1961 :    
1962 :     Return TRUE if this database should be allowed to use an internal DBD.
1963 :     The internal DBD is stored in the C<_metadata> table, which is created
1964 :     when the database is loaded. The Sapling uses an internal DBD.
1965 :    
1966 :     =cut
1967 :    
1968 :     sub UseInternalDBD {
1969 :     return 1;
1970 :     }
1971 : parrello 1.1
1972 :     1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3