[Bio] / Sprout / Sapling.pm Repository:
ViewVC logotype

Annotation of /Sprout/Sapling.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.24 - (view) (download) (as text)

1 : parrello 1.1 #!/usr/bin/perl -w
2 :    
3 :     #
4 :     # Copyright (c) 2003-2006 University of Chicago and Fellowship
5 :     # for Interpretations of Genomes. All Rights Reserved.
6 :     #
7 :     # This file is part of the SEED Toolkit.
8 :     #
9 :     # The SEED Toolkit is free software. You can redistribute
10 :     # it and/or modify it under the terms of the SEED Toolkit
11 :     # Public License.
12 :     #
13 :     # You should have received a copy of the SEED Toolkit Public License
14 :     # along with this program; if not write to the University of Chicago
15 :     # at info@ci.uchicago.edu or the Fellowship for Interpretation of
16 :     # Genomes at veronika@thefig.info or download a copy from
17 :     # http://www.theseed.org/LICENSE.TXT.
18 :     #
19 :    
20 :     package Sapling;
21 :    
22 :     use strict;
23 :     use Tracer;
24 : parrello 1.11 use base qw(ERDB);
25 : parrello 1.1 use Stats;
26 : parrello 1.9 use DBKernel;
27 : parrello 1.11 use SeedUtils;
28 :     use BasicLocation;
29 : parrello 1.1 use XML::Simple;
30 :    
31 :     =head1 Sapling Package
32 :    
33 :     Sapling Database Access Methods
34 :    
35 :     =head2 Introduction
36 :    
37 : parrello 1.11 The Sapling database is a new Entity-Relationship Database that attempts to
38 :     encapsulate our data in a portable form for distribution. It is loaded directly
39 :     from the genomes and subsystems of the SEED. This object has minimal
40 :     capabilities: most of its power comes the L<ERDB> base class.
41 : parrello 1.1
42 :     The fields in this object are as follows.
43 :    
44 :     =over 4
45 :    
46 :     =item loadDirectory
47 :    
48 :     Name of the directory containing the files used by the loaders.
49 :    
50 :     =item loaderSource
51 :    
52 : parrello 1.11 Source object for the loaders (a L<FIG> in our case).
53 : parrello 1.1
54 :     =item genomeHash
55 :    
56 :     Reference to a hash of the genomes to include when loading.
57 :    
58 :     =item subHash
59 :    
60 :     Reference to a hash of the subsystems to include when loading.
61 :    
62 :     =item tuning
63 :    
64 :     Reference to a hash of tuning parameters.
65 :    
66 : parrello 1.16 =item otuHash
67 :    
68 :     Reference to a hash that maps genome IDs to genome set names.
69 :    
70 : parrello 1.1 =back
71 :    
72 : parrello 1.8 =head2 Configuration and Construction
73 : parrello 1.1
74 :     The default loading profile for the Sapling database is to include all complete
75 :     genomes and all usable subsystems. This can be overridden by specifying a list of
76 :     genomes and subsystems in an XML configuration file. The file name should be
77 :     C<SaplingConfig.xml> in the specified data directory. The document element should
78 :     be C<Sapling>, and it has two sub-elements. The C<Genomes> element should contain as
79 :     its text a space-delimited list of genome IDs. The <Subsystems> element should contain
80 :     a list of subsystem names, one per line. If a particular section is missing, the
81 :     default list will be used.
82 :    
83 :     =head3 Example
84 :    
85 :     The following configuration file specifies 10 genomes and 6 subsystems.
86 :    
87 :     <Sapling>
88 :     <Genomes>
89 :     100226.1 31033.3 31964.1 36873.1 126740.4
90 :     155864.1 349307.7 350058.5 351348.5 412694.5
91 :     </Genomes>
92 :     <Subsystems>
93 :     Sugar_utilization_in_Thermotogales
94 :     Coenzyme_F420_hydrogenase
95 :     Ribosome_activity_modulation
96 :     prophage_tails
97 :     CBSS-393130.3.peg.794
98 :     Apigenin_derivatives
99 :     </Subsystems>
100 :     </Sapling>
101 :    
102 :     The XML file also contains tuning parameters that affect the way the data
103 :     is loaded. These are specified as attributes in the TuningParameters element,
104 :     as follows.
105 :    
106 :     =over 4
107 :    
108 :     =item maxLocationLength
109 :    
110 :     The maximum number of base pairs allowed in a single location. B<IsLocatedIn>
111 :     records are split into sections based on this length, so when you are looking
112 :     for all the features in a particular neighborhood, you can look for locations
113 :     within the maximum location distance from the neighborhood, and even if you have
114 :     a huge operon that contains tens of thousands of base pairs, you'll still be
115 :     able to find it.
116 :    
117 : parrello 1.4 =item maxSequenceLength
118 :    
119 :     The maximum number of base pairs allowed in a single DNA sequence. DNA sequences
120 :     are broken into segments to prevent excessively large genomes from clogging
121 :     memory during sequence resolution.
122 :    
123 : parrello 1.1 =back
124 :    
125 :     =head3 Global Section Constant
126 :    
127 :     Each section of the database used by the loader corresponds to a single genome.
128 :     The global section is loaded after all the others, and is concerned with data
129 :     not related to a particular genome.
130 :    
131 :     =cut
132 :    
133 :     # Name of the global section
134 :     use constant GLOBAL => 'Globals';
135 :    
136 :     =head3 Tuning Parameter Defaults
137 :    
138 :     Each tuning parameter must have a default value, in case it is not present in
139 :     the XML configuration file. The defaults are specified in a constant hash
140 :     reference called C<TUNING_DEFAULTS>.
141 :    
142 :     =cut
143 :    
144 :     use constant TUNING_DEFAULTS => {
145 : parrello 1.4 maxLocationLength => 4000,
146 :     maxSequenceLength => 1000000,
147 : parrello 1.1 };
148 :    
149 :     =head3 new
150 :    
151 :     my $sap = Sapling->new(%options);
152 :    
153 :     Construct a new Sapling object. The following options are supported.
154 :    
155 :     =over 4
156 :    
157 :     =item loadDirectory
158 :    
159 :     Data directory to be used by the loaders.
160 :    
161 : parrello 1.7 =item DBD
162 : parrello 1.1
163 :     XML database definition file.
164 :    
165 :     =item dbName
166 :    
167 :     Name of the database to use.
168 :    
169 :     =item sock
170 :    
171 :     Socket for accessing the database.
172 :    
173 :     =item userData
174 :    
175 :     Name and password used to log on to the database, separated by a slash.
176 :    
177 :     =item dbhost
178 :    
179 :     Database host name.
180 :    
181 :     =back
182 :    
183 :     =cut
184 :    
185 :     sub new {
186 :     # Get the parameters.
187 :     my ($class, %options) = @_;
188 :     # Get the options.
189 : parrello 1.5 my $loadDirectory = $options{loadDirectory} || $FIG_Config::saplingData ||
190 :     "$FIG_Config::fig/SaplingData";
191 : parrello 1.7 my $dbd = $options{DBD} || "$loadDirectory/SaplingDBD.xml";
192 : olson 1.6 my $dbName = $options{dbName} || $FIG_Config::saplingDB || "nmpdr_sapling";
193 : parrello 1.7 my $sock = $options{sock} || $FIG_Config::sproutSock || "";
194 : parrello 1.3 my $userData = $options{userData} || "seed/";
195 :     my $dbhost = $options{dbhost} || $FIG_Config::saplingHost || "localhost";
196 : parrello 1.1 # Compute the user name and password.
197 :     my ($user, $pass) = split '/', $userData, 2;
198 :     $pass = "" if ! defined $pass;
199 : parrello 1.24 Trace("Connecting to sapling database.") if T(2);
200 : parrello 1.1 # Connect to the database.
201 :     my $dbh = DBKernel->new('mysql', $dbName, $user, $pass, 3306, $dbhost, $sock);
202 :     # Create the ERDB object.
203 : parrello 1.2 my $retVal = ERDB::new($class, $dbh, $dbd, %options);
204 : parrello 1.1 # Add the load directory pointer.
205 :     $retVal->{loadDirectory} = $loadDirectory;
206 :     # Set up the spaces for the loader source object, the subsystem hash, the
207 :     # genome hash, and the tuning parameters.
208 :     $retVal->{source} = undef;
209 :     $retVal->{genomeHash} = undef;
210 :     $retVal->{subHash} = undef;
211 :     $retVal->{tuning} = undef;
212 : parrello 1.16 # Set up the hash of genome IDs to OTUs.
213 :     $retVal->{otuHash} = {};
214 : parrello 1.1 # Return it.
215 :     return $retVal;
216 :     }
217 :    
218 : parrello 1.8 =head2 Public Methods
219 :    
220 : parrello 1.16 =head3 OTU
221 :    
222 :     my $otu = $sap->OTU($genomeID);
223 :    
224 :     Return the name of the Organism Taxonomic Unit (GenomeSet) for the
225 :     specified genome ID. OTU information is cached in memory, so that once it
226 :     is known, it does not need to be re-fetched from the database.
227 :    
228 :     =over 4
229 :    
230 :     =item genomeID
231 :    
232 :     ID of a genome or feature. If a feature ID is specified, the genome ID will be
233 :     extracted from it.
234 :    
235 :     =item RETURN
236 :    
237 :     Returns the name of the genome set for the specified genome, or C<undef> if the
238 :     genome is not in the
239 :    
240 :     =back
241 :    
242 :     =cut
243 :    
244 :     sub OTU {
245 :     # Get the parameters.
246 :     my ($self, $genomeID) = @_;
247 :     # Get the OTU hash.
248 :     my $otuHash = $self->{otuHash};
249 :     # Compute the real genome ID.
250 :     my $realGenomeID = ($genomeID =~ /^fig\|(\d+\.\d+)/ ? $1 : $genomeID);
251 :     # Look it up in the hash.
252 :     my $retVal = $otuHash->{$realGenomeID};
253 :     # Was it found?
254 :     if (! defined $retVal) {
255 :     # No, get the OTU from the database.
256 :     ($retVal) = $self->GetFlat("IsCollectedInto", "IsCollectedInto(from-link) = ?",
257 :     [$realGenomeID], "to-link");
258 :     # Save it in the hash for future use.
259 :     $otuHash->{$realGenomeID} = $retVal;
260 :     }
261 :     # Return the result.
262 :     return $retVal;
263 :     }
264 :    
265 : parrello 1.18 =head3 Assignment
266 :    
267 :     my $assignment = $sapling->Assignment($fid);
268 :    
269 :     Return the functional assignment for the specified feature.
270 :    
271 :     =over 4
272 :    
273 :     =item fid
274 :    
275 :     FIG ID of the desired feature.
276 :    
277 :     =item RETURN
278 :    
279 :     Returns the functional assignment of the specified feature, or C<undef>
280 :     if the feature does not exist.
281 :    
282 :     =back
283 :    
284 :     =cut
285 :    
286 :     sub Assignment {
287 :     # Get the parameters.
288 :     my ($self, $fid) = @_;
289 :     # Get the functional assignment.
290 :     my ($retVal) = $self->GetFlat("Feature", "Feature(id) = ?", [$fid], 'function');
291 :     # Return the result.
292 :     return $retVal;
293 :     }
294 :    
295 : parrello 1.16
296 : parrello 1.11 =head3 ComputeDNA
297 :    
298 :     my $dna = $sap->ComputeDNA($location);
299 :    
300 :     Return the DNA sequence for the specified location.
301 :    
302 :     =over 4
303 :    
304 :     =item location
305 :    
306 :     A L<BasicLocation> object indicating the contig, start location, direction, and
307 :     length of the desired DNA segment.
308 :    
309 :     =item RETURN
310 :    
311 : parrello 1.13 Returns a string containing the desired DNA. The DNA comes back in pure lower-case.
312 : parrello 1.11
313 :     =back
314 :    
315 :     =cut
316 :    
317 :     sub ComputeDNA {
318 :     # Get the parameters.
319 :     my ($self, $location) = @_;
320 : parrello 1.20 # Get the contig, left end, and right end of the location. Note we subtract
321 :     # 1 to convert contig positions to string offsets.
322 : parrello 1.11 my $contig = $location->Contig;
323 : parrello 1.20 my $left = $location->Left - 1;
324 :     my $right = $location->Right - 1;
325 : parrello 1.11 # Get the DNA segment length.
326 :     my $maxSequenceLength = $self->TuningParameter("maxSequenceLength");
327 :     # Compute the key of the first segment of our DNA and the starting
328 :     # point in that segment.
329 :     my $leftOffset = $left % $maxSequenceLength;
330 :     my $leftKey = "$contig:" . Tracer::Pad(($left - $leftOffset)/$maxSequenceLength,
331 :     7, 1, '0');
332 :     # Compute the key of the last segment containing our DNA.
333 :     my $rightKey = "$contig:" . Tracer::Pad(int($right/$maxSequenceLength), 7, 1, '0');
334 :     my @results = $self->GetFlat("DNASequence",
335 :     'DNASequence(id) <= ? AND DNASequence(id) >= ?',
336 :     [$leftKey, $rightKey], 'sequence');
337 :     # Form all the DNA into a string and extract our piece.
338 :     my $retVal = substr(join("", @results), $leftOffset, $location->Length);
339 :     # If this is a backwards string, we need the reverse complement.
340 :     rev_comp(\$retVal) if $location->Dir eq '-';
341 :     # Return the result.
342 :     return $retVal;
343 :     }
344 :    
345 : parrello 1.12 =head3 FilterByGenome
346 :    
347 :     my @filteredFids = $sapling->FilterByGenome(\@fids, $genomeFilter);
348 :    
349 :     Filter the features using the specified genome-based criterion. The
350 :     criterion can be either a comma-separated list of genome IDs, or a
351 :     partial organism name.
352 :    
353 :     =over 4
354 :    
355 :     =item fids
356 :    
357 :     Reference to a list of feature IDs.
358 :    
359 :     =item genomeFilter
360 :    
361 :     A string specifying the filtering criterion. If undefined or blank, then
362 :     no filter is applied. If a name, then only features from genomes with a
363 :     matching name will be returned. A name is a match if the filter is an
364 :     exact match for some prefix of the organism name. Thus, C<Listeria> would
365 :     get all Listerias, while C<Listeria monocytogenes EGD-e> would match only
366 :     the specific EGD-e strain. For a more precise match, you can specify
367 :     instead a comma-delimited list of genome IDs. In this latter case, only
368 :     features for the listed genomes will be included in the results.
369 :    
370 :     =item RETURN
371 :    
372 :     Returns the features from the incoming list that match the filter condition.
373 :    
374 :     =back
375 :    
376 :     =cut
377 :    
378 :     sub FilterByGenome {
379 :     # Get the parameters.
380 :     my ($self, $fids, $genomeFilter) = @_;
381 :     # Declare the return variable.
382 :     my @retVal;
383 :     # Check the type of filter.
384 :     if (! $genomeFilter) {
385 :     # No filter, so copy the input directly to the result.
386 :     @retVal = @$fids;
387 :     } else {
388 :     # Trim edge spaces from the filter.
389 :     $genomeFilter = Tracer::Trim($genomeFilter);
390 :     # This hash will contain the permissible genome IDs.
391 :     my %genomeIDs;
392 :     # Check for a name. We assume we have a name if there's an
393 :     # alphabetic letter anywhere in the filter string.
394 :     if ($genomeFilter =~ /[a-zA-Z]/) {
395 :     # The filter contains something that does not look like a genome
396 :     # ID, so it is treated as a genome name. We get the IDs of the
397 :     # genomes with that name and put them in the hash.
398 :     %genomeIDs = map { $_ => 1 } $self->GetFlat("Genome",
399 :     'Genome(scientific-name) LIKE ?',
400 :     ["$genomeFilter%"], 'id');
401 :     } else {
402 :     # We are expecting a comma-delimited list of genome IDs, so we
403 :     # put these in our hash.
404 :     %genomeIDs = map { $_ => 1 } split(/\s*,\s*/, $genomeFilter);
405 :     }
406 :     # Now we loop through the features, keeping the ones whose genome ID
407 :     # matches something in the hash.
408 :     @retVal = grep { $genomeIDs{genome_of($_)} } @$fids;
409 :     }
410 :     # Return the result.
411 :     return @retVal;
412 :     }
413 :    
414 :    
415 : parrello 1.11 =head3 GetLocations
416 :    
417 :     my @locs = $sapling->GetLocations($fid);
418 :    
419 :     Return the locations of the DNA for the specified feature.
420 :    
421 :     =over 4
422 :    
423 :     =item fid
424 :    
425 :     ID of the feature whose location is desired.
426 :    
427 :     =item RETURN
428 :    
429 :     Returns a list of L<BasicLocation> objects for the locations containing the
430 :     feature's DNA.
431 :    
432 :     =back
433 :    
434 :     =cut
435 :    
436 :     sub GetLocations {
437 :     # Get the parameters.
438 :     my ($self, $fid) = @_;
439 :     # Declare the return variable.
440 :     my @retVal;
441 : parrello 1.23 # This will contain the last location found.
442 :     my $lastLoc;
443 : parrello 1.11 # Get this feature's locations.
444 :     my $qh = $self->Get("IsLocatedIn",
445 :     'IsLocatedIn(from-link) = ? ORDER BY IsLocatedIn(ordinal)',
446 :     [$fid]);
447 :     while (my $resultRow = $qh->Fetch()) {
448 :     # Compute the contig ID and other information.
449 :     my $contig = $resultRow->PrimaryValue('to-link');
450 :     my $begin = $resultRow->PrimaryValue('begin');
451 :     my $dir = $resultRow->PrimaryValue('dir');
452 :     my $len = $resultRow->PrimaryValue('len');
453 :     # Create a location from the location information.
454 :     my $start = ($dir eq '+' ? $begin : $begin + $len - 1);
455 : parrello 1.23 my $loc = BasicLocation->new($contig, $start, $dir, $len);
456 :     # Check to see if this location is adjacent to the previous one.
457 :     if ($lastLoc && $lastLoc->Adjacent($loc)) {
458 :     # It is, so merge it in.
459 :     $lastLoc->Merge($loc);
460 :     } else {
461 :     # It isn't, so push the new one on the list.
462 :     $lastLoc = $loc;
463 :     push @retVal, $loc;
464 :     }
465 : parrello 1.11 }
466 :     # Return the result.
467 :     return @retVal;
468 :     }
469 :    
470 :    
471 : parrello 1.8 =head3 IdentifiedProtein
472 :    
473 :     my $proteinID = $sap->IdentifiedProtein($id);
474 :    
475 :     Compute the protein for a specified identifier. If the identifier does
476 :     not exist or does not identify a protein, this method will return
477 :     C<undef>.
478 :    
479 :     =over 4
480 :    
481 :     =item id
482 :    
483 :     Identifier whose protein is desired.
484 :    
485 :     =item RETURN
486 :    
487 :     Returns the protein ID corresponding to the incoming identifier,
488 :     or C<undef> if the identifier does not exist or is not for a protein.
489 :    
490 :     =back
491 :    
492 :     =cut
493 :    
494 :     sub IdentifiedProtein {
495 :     # Get the parameters.
496 :     my ($self, $id) = @_;
497 :     # Declare the return variable.
498 :     my $retVal;
499 :     # Try to find a protein for this ID.
500 :     my ($proteinID) = $self->GetFlat("Identifier Names ProteinSequence",
501 :     "Identifier(id) = ?", [$id],
502 :     'ProteinSequence(id)');
503 :     if (defined $proteinID) {
504 :     # We found one, so we're done.
505 :     $retVal = $proteinID;
506 :     } else {
507 :     # Not a protein ID. See if it's the ID of a feature that has a
508 :     # protein connected. Note that it's possible to find more than one,
509 :     # but we're going to punt and pick the first.
510 :     ($proteinID) = $self->GetFlat("Identifier Identifies Feature Produces ProteinSequence",
511 :     "Identifier(id) = ? LIMIT 1", [$id],
512 :     'ProteinSequence(id)');
513 :     if (defined $proteinID) {
514 :     # We found a protein ID, so return it.
515 :     $retVal = $proteinID;
516 :     }
517 :     }
518 :     # Return the result.
519 :     return $retVal;
520 :     }
521 :    
522 : parrello 1.9 =head3 FeaturesByID
523 :    
524 :     my @fids = $sapling->FeaturesByID($id);
525 :    
526 :     Return all the features corresponding to the specified identifier. Only features
527 :     that represent the same locus will be returned.
528 :    
529 :     =over 4
530 :    
531 :     =item id
532 :    
533 :     Identifier of interest.
534 :    
535 :     =item RETURN
536 :    
537 :     Returns a list of all the features in the database that match the given
538 :     identifier.
539 :    
540 :     =back
541 :    
542 :     =cut
543 :    
544 :     sub FeaturesByID {
545 :     # Get the parameters.
546 :     my ($self, $id) = @_;
547 :     # Ask for features from the database.
548 :     my @retVal = $self->GetFlat("Identifies", "Identifies(from-link) = ?", [$id],
549 :     'to-link');
550 :     # Return the result.
551 :     return @retVal;
552 :     }
553 :    
554 :     =head3 ProteinsByID
555 :    
556 :     my @fids = $sapling->ProteinsByID($id);
557 :    
558 :     Return all the features that have the same protein sequence as the
559 : parrello 1.10 identified feature. The returned features mar or may not have the same locus. If
560 :     the identifier is not for a protein encoding gene, no result will be returned.
561 : parrello 1.9
562 :     =over 4
563 :    
564 :     =item id
565 :    
566 : parrello 1.10 Identifier of interest. This can be any alias identifier from the B<Identifier>
567 :     table (which includes the FIG ID).
568 : parrello 1.9
569 :     =item RETURN
570 :    
571 : parrello 1.10 Returns a list of FIG IDs for features having the same protein sequence. If the
572 :     identifier does not specify a protein-encoding gene, the list will be empty.
573 : parrello 1.9
574 :     =back
575 :    
576 :     =cut
577 :    
578 :     sub ProteinsByID {
579 :     # Get the parameters.
580 :     my ($self, $id) = @_;
581 :     # Declare the return variable.
582 :     my @retVal;
583 : parrello 1.10 # Compute the protein for this identifier.
584 :     my $protID = $self->IdentifiedProtein($id);
585 :     # Only proceed if a protein was found. If no protein was found, we're
586 :     # already set up to return an empty list.
587 :     if (defined $protID) {
588 :     # Get all the features connected to the identified protein.
589 :     @retVal = $self->GetFlat("IsProteinFor", "IsProteinFor(from-link) = ?",
590 :     [$protID], "IsProteinFor(to-link)");
591 :     }
592 : parrello 1.9 # Return the result.
593 :     return @retVal;
594 :     }
595 :    
596 :     =head3 GetSubsystem
597 :    
598 :     my $ssData = $sapling->GetSubsystem($ssName);
599 :    
600 : parrello 1.11 Return a L<SaplingSubsys> object for the named subsystem.
601 : parrello 1.9
602 :     =over 4
603 : parrello 1.8
604 : parrello 1.9 =item ssName
605 :    
606 :     Name of the desired subsystem.
607 :    
608 :     =item RETURN
609 :    
610 :     Returns an object that defines multiple useful methods for manipulating the
611 :     named subsystem.
612 :    
613 :     =back
614 :    
615 :     =cut
616 :    
617 :     sub GetSubsystem {
618 :     # Get the parameters.
619 :     my ($self, $ssName) = @_;
620 :     # Declare the return variable.
621 :     require SaplingSubsys;
622 :     my $retVal = SaplingSubsys->new($ssName, $self);
623 :     # Return the result.
624 :     return $retVal;
625 :     }
626 : parrello 1.8
627 :    
628 :     =head3 GenesInRegion
629 :    
630 :     my @pegs = $sap->GenesInRegion($location);
631 :    
632 :     Return a list of the IDs for the features that overlap the specified
633 :     region on a contig.
634 :    
635 :     =over 4
636 :    
637 :     =item location
638 :    
639 :     Location of interest, either in the form of a location string (e.g.
640 : parrello 1.11 C<360108.3:NZ_AANK01000002_264528_264007>) or a L<BasicLocation>
641 : parrello 1.8 object.
642 :    
643 :     =item RETURN
644 :    
645 :     Returns a list of feature IDs. The features in the list will be all
646 :     those that overlap or occur inside the location of interest.
647 :    
648 :     =back
649 :    
650 :     =cut
651 :    
652 :     sub GenesInRegion {
653 :     # Get the parameters.
654 :     my ($self, $location) = @_;
655 :     # Insure we have a location object.
656 :     my $locObject = (ref $location ? $location : BasicLocation->new($location));
657 :     # Get the beginning and the end of the location of interest.
658 :     my $begin = $locObject->Left();
659 :     my $end = $locObject->Right();
660 :     # For performance reasons, we limit the possible starting location, using the
661 :     # tuning parameter for maximum location length.
662 :     my $limit = $begin - $self->TuningParameter('maxLocationLength');
663 :     # Perform the query. Note we use a hash to eliminate duplicates.
664 :     my %retVal = map { $_ => 1 } $self->GetFlat('Contig IsLocusFor Feature',
665 :     "Contig(id) = ? AND IsLocusFor(begin) <= ? AND " .
666 :     "IsLocusFor(begin) > ? AND " .
667 :     "IsLocusFor(begin) + IsLocusFor(len) >= ?",
668 :     [$locObject->Contig(), $end, $limit, $begin],
669 :     'Feature(id)');
670 :     # Return the result.
671 :     return sort keys %retVal;
672 :     }
673 :    
674 :     =head3 GetFasta
675 :    
676 : parrello 1.9 my $fasta = $sapling->GetFasta($proteinID, $id, $comment);
677 : parrello 1.8
678 :     Return a FASTA sequence for the specified protein. An optional identifier
679 :     can be provided to be used as the identification string.
680 :    
681 :     =over 4
682 :    
683 :     =item proteinID
684 :    
685 :     Protein sequence identifier.
686 :    
687 :     =item id (optional)
688 :    
689 :     The identifier to be used in the FASTA output. If omitted, the protein ID
690 :     is used.
691 :    
692 : parrello 1.9 =item comment (optional)
693 :    
694 :     The comment string to be used in the identification line of the FASTA output.
695 :     If omitted, no comment will be present.
696 :    
697 : parrello 1.8 =item RETURN
698 :    
699 :     Returns a FASTA string for the protein. This includes the identification
700 : parrello 1.14 line and the protein letters themselves.
701 : parrello 1.8
702 :     =back
703 :    
704 :     =cut
705 :    
706 :     sub GetFasta {
707 :     # Get the parameters.
708 : parrello 1.9 my ($self, $proteinID, $id, $comment) = @_;
709 : parrello 1.11 # Compute the identifier.
710 : parrello 1.8 my $realID = $id || "md5|$proteinID";
711 :     # Declare the return variable.
712 :     my $retVal;
713 :     # Get the protein sequence.
714 :     my ($sequence) = $self->GetFlat("ProteinSequence", "ProteinSequence(id) = ?",
715 :     [$proteinID], "sequence");
716 :     # It's an error if the sequence was not found.
717 :     if (! defined $sequence) {
718 :     Confess("No protein found with the sequence identifier $proteinID.");
719 :     } else {
720 : parrello 1.11 # Create a FASTA string for the protein.
721 :     $retVal = SeedUtils::create_fasta_record($realID, $comment, $sequence);
722 : parrello 1.8 }
723 :     # Return the result.
724 :     return $retVal;
725 :     }
726 : parrello 1.1
727 :    
728 : parrello 1.4 =head3 Taxonomy
729 :    
730 : parrello 1.14 my @taxonomy = $sap->Taxonomy($genomeID, $format);
731 : parrello 1.4
732 :     Return the full taxonomy of the specified genome, starting from the
733 : parrello 1.14 domain downward.
734 : parrello 1.4
735 :     =over 4
736 :    
737 :     =item genomeID
738 :    
739 :     ID of the genome whose taxonomy is desired. The genome does not need to exist
740 :     in the database: the version number will be lopped off and the result used as
741 :     an entry point into the taxonomy tree.
742 :    
743 : parrello 1.14 =item format (optional)
744 :    
745 :     Format of the taxonomy. C<names> will return primary names, C<numbers> will
746 :     return taxonomy numbers, and C<both> will return taxonomy number followed by
747 :     primary name. The default is C<names>.
748 :    
749 : parrello 1.4 =item RETURN
750 :    
751 :     Returns a list of taxonomy names, starting from the domain and moving
752 :     down to the node where the genome is attached.
753 :    
754 :     =back
755 :    
756 :     =cut
757 :    
758 :     sub Taxonomy {
759 :     # Get the parameters.
760 : parrello 1.14 my ($self, $genomeID, $format) = @_;
761 : parrello 1.4 # Get the genome's taxonomic group.
762 :     my ($taxon) = split /\./, $genomeID, 2;
763 :     # We'll put the return data in here.
764 :     my @retVal;
765 :     # Loop until we hit a domain.
766 :     my $domainFlag;
767 :     while (! $domainFlag) {
768 :     # Get the data we need for this taxonomic group.
769 :     my ($taxonData) = $self->GetAll('TaxonomicGrouping IsInGroup',
770 :     'TaxonomicGrouping(id) = ?', [$taxon],
771 :     'domain scientific-name IsInGroup(to-link)');
772 :     # If we didn't find what we're looking for, then we have a problem. This
773 :     # would indicate a node below the domain level that doesn't have a parent
774 :     # or (more likely) an invalid input string.
775 :     if (! $taxonData) {
776 :     # Terminate the loop and trace a warning.
777 :     $domainFlag = 1;
778 :     Trace("Could not find node or parent for \"$taxon\".") if T(1);
779 :     } else {
780 :     # Extract the data for the current group. Note we overwrite our
781 :     # taxonomy ID with the ID of our parent, priming the next iteration
782 :     # of the loop.
783 :     my $name;
784 : parrello 1.14 my $oldTaxon = $taxon;
785 : parrello 1.4 ($domainFlag, $name, $taxon) = @$taxonData;
786 : parrello 1.14 # Compute the value we want to put in the output list.
787 :     my $value;
788 :     if ($format eq 'numbers') {
789 :     $value = $oldTaxon;
790 :     } elsif ($format eq 'both') {
791 :     $value = "$oldTaxon $name";
792 :     } else {
793 :     $value = $name;
794 :     }
795 :     # Put the current group's data in the return list.
796 :     unshift @retVal, $value;
797 : parrello 1.4 }
798 :     }
799 :     # Return the result.
800 :     return @retVal;
801 :     }
802 :    
803 : parrello 1.18 =head3 IsDeletedFid
804 :    
805 :     my $flag = $sapling->IsDeletedFid($fid);
806 :    
807 :     Return TRUE if the specified feature is B<not> in the database, else
808 :     FALSE.
809 :    
810 :     =over 4
811 :    
812 :     =item fid
813 :    
814 :     FIG ID of the relevant feature.
815 :    
816 :     =item RETURN
817 :    
818 :     Returns TRUE if the specified feature is in the database, else FALSE.
819 :    
820 :     =back
821 :    
822 :     =cut
823 :    
824 :     sub IsDeletedFid {
825 :     # Get the parameters.
826 :     my ($self, $fid) = @_;
827 :     # Check for the feature. If the feature does not exist, we'll get an
828 :     # undefined value (FALSE). If it does, we'll get the feature ID itself
829 :     # (TRUE).
830 :     my ($retVal) = $self->GetFlat("Feature", "Feature(id) = ?", [$fid], 'id');
831 :     # Return FALSE if the feature was found, TRUE if it was not found.
832 :     return ($retVal ? 0 : 1);
833 :     }
834 :    
835 : parrello 1.4
836 : parrello 1.1 =head3 GenomeHash
837 :    
838 :     my $genomeHash = $sap->GenomeHash();
839 :    
840 :     Return a hash of the genomes configured to be in this database. The list
841 :     is either taken from the active SEED database or from a configuration
842 :     file in the data directory. The hash maps genome IDs to TRUE.
843 :    
844 :     =cut
845 :    
846 :     sub GenomeHash {
847 :     # Get the parameters.
848 :     my ($self) = @_;
849 :     # We'll build the hash in here.
850 :     my %genomeHash;
851 :     # Do we already have a list?
852 :     if (! defined $self->{genomeHash}) {
853 :     # No, check for a configuration file.
854 :     my $xml = $self->ReadConfigFile();
855 :     if (defined $xml && $xml->{Genomes}) {
856 :     # We found one and it has a genome list, so extract the genomes.
857 :     %genomeHash = map { $_ => 1 } grep { $_ =~ /\S/ } split /\s+/, $xml->{Genomes};
858 :     } else {
859 :     # No, so get the genome list.
860 :     my $fig = $self->GetSourceObject();
861 : parrello 1.8 my @genomes = $fig->genomes();
862 : parrello 1.1 # Verify the genome list to insure every genome has an organism
863 :     # directory.
864 :     for my $genome (@genomes) {
865 :     if (-d "$FIG_Config::organisms/$genome") {
866 :     $genomeHash{$genome} = 1;
867 :     }
868 :     }
869 :     }
870 :     # Store the genomes in this object.
871 :     $self->{genomeHash} = \%genomeHash;
872 :     }
873 :     # Return the result.
874 :     return $self->{genomeHash};
875 :     }
876 :    
877 :     =head3 SubsystemID
878 :    
879 :     my $subID = $sap->SubsystemID($subName);
880 :    
881 :     Return the ID of the subsystem with the specified name.
882 :    
883 :     =over 4
884 :    
885 :     =item subName
886 :    
887 :     Name of the relevant subsystem. A subsystem name with underscores for spaces
888 :     will return the same ID as a subsystem name with the spaces still in it.
889 :    
890 :     =item RETURN
891 :    
892 : parrello 1.4 Returns a normalized subsystem name.
893 : parrello 1.1
894 :     =back
895 :    
896 :     =cut
897 :    
898 :     sub SubsystemID {
899 :     # Get the parameters.
900 :     my ($self, $subName) = @_;
901 : parrello 1.4 # Normalize the subsystem name by converting underscores to spaces.
902 :     my $retVal = $subName;
903 :     $retVal =~ s/_/ /g;
904 : parrello 1.1 # Return the result.
905 :     return $retVal;
906 :     }
907 :    
908 : parrello 1.14 =head3 Alias
909 :    
910 :     my $translatedID = $sap->Alias($fid, $source);
911 :    
912 :     Return an alternate ID of the specified type for the specified feature.
913 :     If no alternate ID of that type exists, the incoming value will be
914 :     returned unchanged.
915 :    
916 :     =over 4
917 :    
918 :     =item fid
919 :    
920 :     FIG ID of the feature whose alias identifier is desired.
921 :    
922 :     =item source
923 :    
924 :     Database type for the alternate ID (e.g. C<LocusTag>, C<NCBI>, C<RefSeq>). If
925 :     C<SEED> is specified, the ID will be returned unchanged and no database lookup
926 :     will occur.
927 :    
928 :     =item RETURN
929 :    
930 :     Returns an equivalent ID for the specified feature that belongs to the specified
931 :     database (that is, has the specified source). If no such ID exists, returns the
932 :     incoming ID.
933 :    
934 :     =back
935 :    
936 :     =cut
937 :    
938 :     sub Alias {
939 :     # Get the parameters.
940 :     my ($self, $fid, $source) = @_;
941 :     # Default to the incoming value.
942 :     my $retVal = $fid;
943 :     # We only have work to do if the database type isn't "SEED".
944 :     if ($source ne 'SEED') {
945 :     # Look for an alias.
946 :     my ($alias) = $self->GetFlat("IsIdentifiedBy Identifier",
947 : parrello 1.15 'IsIdentifiedBy(from-link) = ? AND Identifier(source) = ?',
948 : parrello 1.14 [$fid, $source], 'Identifier(natural-form)');
949 :     # If we found one, return it.
950 :     if (defined $alias) {
951 :     $retVal = $alias;
952 :     }
953 :     }
954 :     # Return the result.
955 :     return $retVal;
956 :     }
957 :    
958 :    
959 :     =head3 ContigLength
960 :    
961 :     my $contigLen = $sap->ContigLength($contigID);
962 :    
963 :     Return the number of base pairs in the specified contig.
964 :    
965 :     =over 4
966 :    
967 :     =item contigID
968 :    
969 :     ID of the contig of interest.
970 :    
971 :     =item RETURN
972 :    
973 :     Returns the number of base pairs in the specified contig, or 0 if the contig
974 :     does not exist.
975 :    
976 :     =back
977 :    
978 :     =cut
979 :    
980 :     sub ContigLength {
981 :     # Get the parameters.
982 :     my ($self, $contigID) = @_;
983 :     # Try to find the length.
984 :     my ($retVal) = $self->GetEntityValues(Contig => $contigID, 'length');
985 :     # Convert not-found to 0.
986 :     $retVal = 0 if ! defined $retVal;
987 :     # Return the result.
988 :     return $retVal;
989 :     }
990 :    
991 :    
992 :     =head2 Configuration-Related Methods
993 :    
994 : parrello 1.1 =head3 SubsystemHash
995 :    
996 :     my $subHash = $sap->SubsystemHash();
997 :    
998 :     Return a hash of the subsystems configured to be in this database. The
999 :     list is either taken from the active SEED database or from a
1000 :     configuration file in the data directory. The hash maps subsystem names
1001 :     to TRUE.
1002 :    
1003 :     =cut
1004 :    
1005 :     sub SubsystemHash {
1006 :     # Get the parameters.
1007 :     my ($self) = @_;
1008 :     # We'll build the hash in here.
1009 :     my %subHash;
1010 :     # Do we already have a list?
1011 :     if (! defined $self->{subHash}) {
1012 :     # No, check for a configuration file.
1013 :     my $xml = $self->ReadConfigFile();
1014 :     if (defined $xml && $xml->{Subsystems}) {
1015 :     # We found one, and it has subsystems, so we extract them.
1016 :     # A little dancing is necessary to trim spaces.
1017 :     my @subs = map { $_ =~ /\s*(\S.+\S)/; $1 } split /\n/, $xml->{Subsystems};
1018 :     # Here we need to clear out any null subsystem names resulting from
1019 :     # blank lines in the file.
1020 :     %subHash = map { $_ => 1 } grep { $_ } @subs;
1021 :     } else {
1022 :     # No config file, so we ask the FIG object.
1023 :     my $fig = $self->GetSourceObject();
1024 : parrello 1.8 %subHash = map { $self->SubsystemID($_) => 1 } $fig->all_subsystems();
1025 : parrello 1.1 }
1026 :     # Store the subsystems in this object.
1027 :     $self->{subHash} = \%subHash;
1028 :     }
1029 :     # Return the result.
1030 :     return $self->{subHash};
1031 :     }
1032 :    
1033 :     =head3 TuningParameter
1034 :    
1035 :     my $parm = $erdb->TuningParameter($parmName);
1036 :    
1037 :     Return the value of the specified tuning parameter. Tuning parameters are
1038 :     read from the XML configuration file.
1039 :    
1040 :     =over 4
1041 :    
1042 :     =item parmName
1043 :    
1044 :     Name of the parameter whose value is desired.
1045 :    
1046 :     =item RETURN
1047 :    
1048 :     Returns the paramter value.
1049 :    
1050 :     =back
1051 :    
1052 :     =cut
1053 :    
1054 :     sub TuningParameter {
1055 :     # Get the parameters.
1056 :     my ($self, $parmName) = @_;
1057 :     # Insure we have the parameters in memory.
1058 :     if (! defined $self->{tuning}) {
1059 :     # Read the configuration file.
1060 :     my $configFile = $self->ReadConfigFile();
1061 :     # Get the tuning parameters (if any).
1062 :     my $tuning;
1063 :     if (! defined $configFile || ! exists $configFile->{TuningParameters}) {
1064 :     $tuning = {};
1065 :     } else {
1066 :     $tuning = $configFile->{TuningParameters};
1067 :     }
1068 :     # Merge in the default option values.
1069 :     Tracer::MergeOptions($tuning, TUNING_DEFAULTS);
1070 :     # Save the result in our object.
1071 :     $self->{tuning} = $tuning;
1072 :     }
1073 :     # Extract the tuning paramter.
1074 :     my $retVal = $self->{tuning}{$parmName};
1075 :     # Throw an error if it does not exist.
1076 :     Confess("Invalid tuning parameter \"$parmName\".") if ! defined $retVal;
1077 :     # Return the result.
1078 :     return $retVal;
1079 :     }
1080 :    
1081 :    
1082 :     =head3 ReadConfigFile
1083 :    
1084 :     my $xmlObject = $sap->ReadConfigFile();
1085 :    
1086 :     Return the hash structure created from reading the configuration file, or
1087 :     an undefined value if the file is not found.
1088 :    
1089 :     =cut
1090 :    
1091 :     sub ReadConfigFile {
1092 :     my ($self) = @_;
1093 :     # Declare the return variable.
1094 :     my $retVal;
1095 :     # Compute the configuration file name.
1096 :     my $fileName = "$self->{loadDirectory}/SaplingConfig.xml";
1097 :     # Did we find it?
1098 :     if (-f $fileName) {
1099 :     # Yes, read it in.
1100 :     $retVal = XMLin($fileName);
1101 :     }
1102 :     # Return the result.
1103 :     return $retVal;
1104 :     }
1105 :    
1106 :     =head3 GlobalSection
1107 :    
1108 :     my $flag = $sap->GlobalSection($name);
1109 :    
1110 :     Return TRUE if the specified section name is the global section, FALSE
1111 :     otherwise.
1112 :    
1113 :     =over 4
1114 :    
1115 :     =item name
1116 :    
1117 :     Section name to test.
1118 :    
1119 :     =item RETURN
1120 :    
1121 : parrello 1.4 Returns TRUE if the parameter matches the GLOBAL constant, else FALSE.
1122 : parrello 1.1
1123 :     =back
1124 :    
1125 :     =cut
1126 :    
1127 :     sub GlobalSection {
1128 :     # Get the parameters.
1129 :     my ($self, $name) = @_;
1130 :     # Return the result.
1131 :     return ($name eq GLOBAL);
1132 :     }
1133 :    
1134 : parrello 1.14 =head2 Special-Purpose Methods
1135 :    
1136 :     =head3 ComputeFeatureFilter
1137 :    
1138 :     my ($objects, $filter, @parms) = $sap->ComputeFeatureFilter($source);
1139 :    
1140 :     Compute the initial object name list, filter string, and parameter list
1141 :     for a query by feature ID. The object name list will always end with the
1142 :     B<Feature> entity, and the combination of the filter string and parameter
1143 :     list will translate the incoming ID from the specified format to a real
1144 :     FIG feature ID. If the specified format B<is> FIG feature IDs, then the
1145 :     query will start on the B<Feature> entity; otherwise, it will start with
1146 :     the B<Identifier> entity. This is a special-purpose method that performs
1147 :     the task of intelligently modifying queries to allow for external ID
1148 :     types.
1149 :    
1150 :     =over 4
1151 :    
1152 :     =item source (optional)
1153 :    
1154 :     Database source of the IDs specified-- C<SEED> for FIG IDs, C<GENE> for standard
1155 :     gene identifiers, or C<LocusTag> for locus tags. In addition, you may specify
1156 :     C<RefSeq>, C<CMR>, C<NCBI>, C<Trembl>, or C<UniProt> for IDs from those databases.
1157 :     Use C<mixed> to allow mixed ID types (though this may cause problems when the same
1158 : parrello 1.22 ID has different meanings in different databases). Use C<prefixed> to allow IDs with
1159 :     prefixing indicating the ID type (e.g. C<uni|P00934> for a UniProt ID, C<gi|135813> for
1160 :     an NCBI identifier, and so forth). The default is C<SEED>.
1161 : parrello 1.14
1162 :     =item RETURN
1163 :    
1164 :     Returns a list containing parameters to the desired query call. The first element
1165 :     is the prefix for the object name list, the second is the prefix for the filter
1166 :     string, and the subsequent elements form the prefix for the parameter value list.
1167 :    
1168 :     =back
1169 :    
1170 :     =cut
1171 :    
1172 :     sub ComputeFeatureFilter {
1173 :     # Get the parameters.
1174 :     my ($self, $source) = @_;
1175 :     # Declare the return variables.
1176 :     my ($objects, $filter, @parms);
1177 :     # Determine the source type.
1178 :     if (! defined $source || $source eq 'SEED') {
1179 :     # Here we're directly processing FIG IDs.
1180 :     $objects = 'Feature';
1181 :     $filter = 'Feature(id) = ?'
1182 :     } elsif ($source eq 'mixed') {
1183 :     # Here we're processing mixed IDs of unknown type.
1184 :     $objects = 'Identifier Identifies Feature';
1185 :     $filter = 'Identifier(natural-form) = ?';
1186 : parrello 1.22 } elsif ($source eq 'prefixed') {
1187 :     # Here we're processing mixed IDs with prefixes. This is the internal form
1188 :     # of the ID.
1189 :     $objects = 'Identifier Identifies Feature';
1190 :     $filter = 'Identifier(id) = ?';
1191 : parrello 1.14 } else {
1192 :     # Here we're processing a fixed ID type from an external database.
1193 :     # This is the case that requires an additional parameter. Note that
1194 :     # we insist that the additional parameter matches the first parameter
1195 :     # mark.
1196 :     $objects = 'Identifier Identifies Feature';
1197 :     $filter = 'Identifier(source) = ? AND Identifier(natural-form) = ?';
1198 :     push @parms, $source;
1199 :     }
1200 :     # Return the results.
1201 :     return ($objects, $filter, @parms);
1202 :     }
1203 :    
1204 :    
1205 :     =head3 FindGapLeft
1206 :    
1207 :     my @operonData = $sap->FindGapLeft($loc, $maxGap, $interval, \%redundancyHash, \$redundancyFlag);
1208 :    
1209 :     This method performs a rather arcane task: searching for a gap to the
1210 :     left of a location in the contig. The search will proceed from the
1211 :     starting point to the left, and will stop when a gap between occupied
1212 :     locations is found that is larger than the specified maximum. The caller
1213 :     has the option of specifying a hash of feature IDs that are redundant. If
1214 :     any feature in the hash is found, the search will stop early and the
1215 :     provided redundancy flag will be set. In addition, an interval size can
1216 :     be specified to tune the process of retrieving data from the database.
1217 :    
1218 :     =over 4
1219 :    
1220 :     =item loc
1221 :    
1222 :     L<BasicLocation> object for the location from which the search is to start.
1223 :     This gives us the contig ID, the strand of interest (forward or backward),
1224 :     and the starting point of the search.
1225 :    
1226 :     =item maxGap
1227 :    
1228 :     The maximum allowable gap. The search will stop at the left end of the contig
1229 :     or the first gap larger than this amount.
1230 :    
1231 :     =item interval (optional)
1232 :    
1233 :     Interval to use for retrieving data from the database. This is the size of
1234 :     the contig segments being retrieved. The default is C<10000>
1235 :    
1236 :     =item redundancyHash (optional)
1237 :    
1238 :     A hash of feature IDs. If any feature present in this hash is found during
1239 :     the search, the search will stop and no data will be returned. The default
1240 :     is an empty hash (no check).
1241 :    
1242 :     =item redundancyFlag (optional)
1243 :    
1244 :     A reference to a scalar flag. If present, the entire method will be bypassed
1245 :     if the flag is TRUE. If a redundancy hash is specified and a redundant feature
1246 :     is found, this flag will be set to TRUE by the method.
1247 :    
1248 :     =item RETURN
1249 :    
1250 :     Returns a list of 4-tuples. Each tuple will contain a feature ID, a begin
1251 :     offset, a direction (C<+> or C<->), and a length, representing an occupied
1252 :     location on the contig and the feature to which it belongs. The complete
1253 :     list of locations will be to the left of the starting location and relatively
1254 :     close together, with no gap larger than the caller-specified maximum.
1255 :    
1256 :     =back
1257 :    
1258 :     =cut
1259 :    
1260 :     sub FindGapLeft {
1261 :     # Get the parameters.
1262 :     my ($self, $loc, $maxGap, $interval, $redundancyHash, $redundancyFlag) = @_;
1263 :     # Declare the return variable.
1264 :     my @retVal;
1265 :     # Fix up defaults for the missing parameters.
1266 :     $interval ||= 10000;
1267 :     if (! defined $redundancyHash) {
1268 :     $redundancyHash = {};
1269 :     }
1270 :     my $fakeFlag = 0;
1271 :     if (! defined $redundancyFlag) {
1272 :     $redundancyFlag = \$fakeFlag;
1273 :     }
1274 :     # This flag will be set to TRUE if we run out of locations or find a gap.
1275 :     my $gapFound = 0;
1276 :     # This will be used to store tuples found. If we are successful, it will
1277 :     # be copied to the return list.
1278 :     my @operonData;
1279 :     # Now we need to set up some data for the loop. In particular, the contig
1280 :     # ID, the strand (direction), and the starting point. We add one to the
1281 :     # starting current position to insure that the starting point is included
1282 :     # in the first search.
1283 :     my $currentPosition = $loc->Left + 1;
1284 :     my $contigID = $loc->Contig;
1285 :     my $strand = $loc->Dir;
1286 :     # This variable keeps the leftmost begin location found.
1287 :     my $begin = $loc->Left;
1288 :     # Loop until we find a redundancy or a gap.
1289 :     while (! $$redundancyFlag && ! $gapFound && $currentPosition >= 0) {
1290 :     # Compute the limits of the search interval for this iteration.
1291 :     my $nextPosition = $currentPosition - $interval;
1292 :     # Get all the locations in the interval.
1293 :     my @rows = $self->GetAll("IsLocatedIn",
1294 :     'IsLocatedIn(to-link) = ? AND IsLocatedIn(dir) = ? AND IsLocatedIn(begin) >= ? AND IsLocatedIn(begin) < ?',
1295 :     [$contigID, $strand, $nextPosition, $currentPosition],
1296 :     [qw(from-link begin dir len)]);
1297 :     # If nothing was found, it's a gap.
1298 :     if (! @rows) {
1299 :     $gapFound = 1;
1300 :     } else {
1301 :     # We got something, so we can loop through looking for gaps. The search
1302 :     # requires we sort by right point.
1303 :     my @sortableTuples;
1304 :     for my $tuple (@rows) {
1305 :     my ($fid, $left, $dir, $len) = @$tuple;
1306 :     push @sortableTuples, [$left + $len, $tuple];
1307 :     }
1308 :     my @sortedTuples = map { $_->[1] } sort { -($a->[0] <=> $b->[0]) } @sortableTuples;
1309 :     # Loop through the tuples, stopping at the first redundancy or gap.
1310 :     for my $tuple (@sortedTuples) { last if $gapFound || $$redundancyFlag;
1311 :     # Get this tuple's data.
1312 :     my ($fid, $left, $dir, $len) = @$tuple;
1313 :     # Is it close enough to be counted?
1314 :     if ($begin - ($left + $len) <= $maxGap) {
1315 :     # Yes. We can include this tuple.
1316 :     push @operonData, $tuple;
1317 :     # Update the begin point.
1318 :     $begin = $left;
1319 :     # Is it redundant? It's only reasonable to ask this if it's
1320 :     # an included feature.
1321 :     if ($redundancyHash->{$fid}) {
1322 :     $$redundancyFlag = 1;
1323 :     }
1324 :     } else {
1325 :     # No, it's not close enough. We've found a gap.
1326 :     $gapFound = 1;
1327 :     }
1328 :     }
1329 :     }
1330 :     # Set up for the next interval.
1331 :     $currentPosition = $nextPosition;
1332 :     }
1333 :     # If we're nonredundant, save our results.
1334 :     if (! $$redundancyFlag) {
1335 :     @retVal = @operonData;
1336 :     }
1337 :     # Return the result.
1338 :     return @retVal;
1339 :     }
1340 :    
1341 :     =head3 FindGapRight
1342 : parrello 1.13
1343 : parrello 1.14 my @operonData = $sap->FindGapRight($loc, $maxGap, $interval, \%redundancyHash, \$redundancyFlag);
1344 : parrello 1.13
1345 : parrello 1.14 This method is the dual of L</FindGapLeft>: it searches for a gap to the
1346 :     right of a location in the contig. The search will proceed from the
1347 :     starting point to the right, and will stop when a gap between occupied
1348 :     locations is found that is larger than the specified maximum. The caller
1349 :     has the option of specifying a hash of feature IDs that are redundant. If
1350 :     any feature in the hash is found, the search will stop early and the
1351 :     provided redundancy flag will be set. In addition, an interval size can
1352 :     be specified to tune the process of retrieving data from the database.
1353 : parrello 1.13
1354 :     =over 4
1355 :    
1356 : parrello 1.14 =item loc
1357 :    
1358 :     L<BasicLocation> object for the location from which the search is to start.
1359 :     This gives us the contig ID, the strand of interest (forward or backward),
1360 :     and the starting point of the search.
1361 :    
1362 :     =item maxGap
1363 :    
1364 :     The maximum allowable gap. The search will stop at the right end of the contig
1365 :     or the first gap larger than this amount.
1366 :    
1367 :     =item interval (optional)
1368 :    
1369 :     Interval to use for retrieving data from the database. This is the size of
1370 :     the contig segments being retrieved. The default is C<10000>
1371 :    
1372 :     =item redundancyHash (optional)
1373 :    
1374 :     A hash of feature IDs. If any feature present in this hash is found during
1375 :     the search, the search will stop and no data will be returned. The default
1376 :     is an empty hash (no check).
1377 :    
1378 :     =item redundancyFlag (optional)
1379 : parrello 1.13
1380 : parrello 1.14 A reference to a scalar flag. If present, the entire method will be bypassed
1381 :     if the flag is TRUE. If a redundancy hash is specified and a redundant feature
1382 :     is found, this flag will be set to TRUE by the method.
1383 : parrello 1.13
1384 :     =item RETURN
1385 :    
1386 : parrello 1.14 Returns a list of 4-tuples. Each tuple will contain a feature ID, a begin
1387 :     offset, a direction (C<+> or C<->), and a length, representing an occupied
1388 :     location on the contig and the feature to which it belongs. The complete
1389 :     list of locations will be to the right of the starting location and relatively
1390 :     close together, with no gap larger than the caller-specified maximum.
1391 : parrello 1.13
1392 :     =back
1393 :    
1394 :     =cut
1395 :    
1396 : parrello 1.14 sub FindGapRight {
1397 : parrello 1.13 # Get the parameters.
1398 : parrello 1.14 my ($self, $loc, $maxGap, $interval, $redundancyHash, $redundancyFlag) = @_;
1399 :     # Declare the return variable.
1400 :     my @retVal;
1401 :     # Fix up defaults for the missing parameters.
1402 :     $interval ||= 10000;
1403 :     if (! defined $redundancyHash) {
1404 :     $redundancyHash = {};
1405 :     }
1406 :     my $fakeFlag = 0;
1407 :     if (! defined $redundancyFlag) {
1408 :     $redundancyFlag = \$fakeFlag;
1409 :     }
1410 :     # This flag will be set to TRUE if we run out of locations or find a gap.
1411 :     my $gapFound = 0;
1412 :     # This will be used to store tuples found. If we are successful, it will
1413 :     # be copied to the return list.
1414 :     my @operonData;
1415 :     # Now we need to set up some data for the loop. In particular, the contig
1416 :     # ID, the strand (direction), and the starting point. We subtract one from the
1417 :     # starting current position to insure that the starting point is included
1418 :     # in the first search.
1419 :     my $currentPosition = $loc->Left - 1;
1420 :     my $contigID = $loc->Contig;
1421 :     my $strand = $loc->Dir;
1422 :     # Get the length of the contig.
1423 :     my $contigLen = $self->ContigLength($contigID);
1424 : parrello 1.19 Trace("Contig length is $contigLen. Starting at $currentPosition.") if T(3);
1425 : parrello 1.14 # This variable keeps the rightmost end location found.
1426 :     my $endPoint = $loc->Left;
1427 :     # Loop until we find a redundancy or a gap.
1428 :     while (! $$redundancyFlag && ! $gapFound && $currentPosition <= $contigLen) {
1429 : parrello 1.19 Trace("Checking at $currentPosition.") if T(3);
1430 : parrello 1.14 # Compute the limits of the search interval for this iteration.
1431 :     my $nextPosition = $currentPosition + $interval;
1432 :     # Get all the locations in the interval.
1433 :     my @rows = $self->GetAll("IsLocatedIn",
1434 :     'IsLocatedIn(to-link) = ? AND IsLocatedIn(dir) = ? AND IsLocatedIn(begin) >= ? AND IsLocatedIn(begin) < ?',
1435 : parrello 1.19 [$contigID, $strand, $currentPosition, $nextPosition],
1436 : parrello 1.14 [qw(from-link begin dir len)]);
1437 :     # If nothing was found, it's a gap.
1438 :     if (! @rows) {
1439 :     $gapFound = 1;
1440 : parrello 1.19 Trace("No result. Gap found.") if T(3);
1441 : parrello 1.14 } else {
1442 :     # We got something, so we can loop through looking for gaps. The search
1443 :     # requires we sort by left point.
1444 :     my @sortedTuples = sort { $a->[1] <=> $b->[1] } @rows;
1445 :     # Loop through the tuples, stopping at the first redundancy or gap.
1446 :     for my $tuple (@sortedTuples) { last if $gapFound || $$redundancyFlag;
1447 :     # Get this tuple's data.
1448 :     my ($fid, $left, $dir, $len) = @$tuple;
1449 :     # Is it close enough to be counted?
1450 :     if ($left - $endPoint <= $maxGap) {
1451 :     # Yes. We can include this tuple.
1452 :     push @operonData, $tuple;
1453 :     # Update the end point.
1454 :     $endPoint = $left + $len;
1455 :     # Is it redundant? It's only reasonable to ask this if it's
1456 :     # an included feature.
1457 :     if ($redundancyHash->{$fid}) {
1458 :     $$redundancyFlag = 1;
1459 :     }
1460 :     } else {
1461 :     # No, it's not close enough. We've found a gap.
1462 :     $gapFound = 1;
1463 : parrello 1.19 Trace("Long distance. Gap found.") if T(3);
1464 : parrello 1.14 }
1465 :     }
1466 :     }
1467 :     # Set up for the next interval.
1468 :     $currentPosition = $nextPosition;
1469 :     }
1470 :     # If we're nonredundant, save our results.
1471 :     if (! $$redundancyFlag) {
1472 :     @retVal = @operonData;
1473 :     }
1474 : parrello 1.13 # Return the result.
1475 : parrello 1.14 return @retVal;
1476 : parrello 1.13 }
1477 :    
1478 : parrello 1.17 =head3 GenomesInPairSet
1479 :    
1480 :     my @genomes = $sap->GenomesInPairSet($pairSetID);
1481 :    
1482 :     Return a list of the IDs for all of the genomes represented in the
1483 :     specified pair set. This is useful when analyzing what data is missing
1484 :     from the coupling tables.
1485 :    
1486 :     =over 4
1487 :    
1488 :     =item pairSetID
1489 :    
1490 :     ID of the pair set to examine.
1491 :    
1492 :     =item RETURN
1493 :    
1494 :     Returns a list of the IDs for the genomes represented in the specified pair set.
1495 :    
1496 :     =back
1497 :    
1498 :     =cut
1499 :    
1500 :     sub GenomesInPairSet {
1501 :     # Get the parameters.
1502 :     my ($self, $pairSetID) = @_;
1503 :     # We'll use this hash to isolate the genome IDs.
1504 :     my %retVal;
1505 :     # Get all the pairs in this set.
1506 :     my $query = $self->Get("IsDeterminedBy", "IsDeterminedBy(from-link) = ?",
1507 :     [$pairSetID]);
1508 :     while (my $pairData = $query->Fetch()) {
1509 :     # Record the genomes for the pegs in the pair. The pegs can be found
1510 :     # separated by a colon in the pairing ID.
1511 :     for my $peg (split /:/, $pairData->PrimaryValue('to-link')) {
1512 :     $retVal{genome_of($peg)} = 1;
1513 :     }
1514 :     }
1515 :     # Return the genome IDs.
1516 :     return keys %retVal;
1517 :     }
1518 :    
1519 : parrello 1.1
1520 :     =head2 Virtual Methods
1521 :    
1522 : parrello 1.11 =head3 PreferredName
1523 :    
1524 :     my $name = $erdb->PreferredName();
1525 :    
1526 :     Return the variable name to use for this database when generating code.
1527 :    
1528 :     =cut
1529 :    
1530 :     sub PreferredName {
1531 :     return 'sap';
1532 :     }
1533 :    
1534 : parrello 1.1 =head3 GetSourceObject
1535 :    
1536 :     my $source = $erdb->GetSourceObject();
1537 :    
1538 :     Return the object to be used in creating load files for this database. This is
1539 :     only the default source object. Loaders have the option of overriding the chosen
1540 : parrello 1.11 source object when constructing the L<ERDBLoadGroup> objects.
1541 : parrello 1.1
1542 :     =cut
1543 :    
1544 :     sub GetSourceObject {
1545 :     my ($self) = @_;
1546 :     # Insure the source object exists in our internal cache.
1547 :     if (! defined $self->{source}) {
1548 :     # We require the FIG object. If the user has no intention of
1549 :     # doing a load, this method won't be used, so he won't need to
1550 :     # have the FIG object on his system.
1551 :     require FIG;
1552 :     $self->{source} = FIG->new();
1553 :     }
1554 :     # Return it to the caller.
1555 :     return $self->{source};
1556 :     }
1557 :    
1558 :     =head3 SectionList
1559 :    
1560 :     my @sections = $erdb->SectionList();
1561 :    
1562 :     Return a list of the names for the different data sections used when loading this database.
1563 :     The default is a single string, in which case there is only one section representing the
1564 :     entire database.
1565 :    
1566 :     =cut
1567 :    
1568 :     sub SectionList {
1569 :     # Get the parameters.
1570 :     my ($self) = @_;
1571 :     # Get the genome hash.
1572 :     my $genomes = $self->GenomeHash();
1573 :     # Create one section per genome.
1574 :     my @retVal = sort keys %$genomes;
1575 :     # Append the global section.
1576 :     push @retVal, GLOBAL;
1577 :     # Return the section list.
1578 :     return @retVal;
1579 :     }
1580 :    
1581 :     =head3 Loader
1582 :    
1583 :     my $groupLoader = $erdb->Loader($groupName, $source, $options);
1584 :    
1585 : parrello 1.11 Return an L<ERDBLoadGroup> object for the specified load group. This method is used
1586 :     by L<ERDBGenerator.pl> to create the load group objects. If you are not using
1587 :     L<ERDBGenerator.pl>, you don't need to override this method.
1588 : parrello 1.1
1589 :     =over 4
1590 :    
1591 :     =item groupName
1592 :    
1593 :     Name of the load group whose object is to be returned. The group name is
1594 :     guaranteed to be a single word with only the first letter capitalized.
1595 :    
1596 :     =item source
1597 :    
1598 :     The source object used to access the data from which the load file is derived. This
1599 :     is the same object returned by L</GetSourceObject>; however, we allow the caller to pass
1600 :     it in as a parameter so that we don't end up creating multiple copies of a potentially
1601 :     expensive data structure. It is permissible for this value to be undefined, in which
1602 :     case the source will be retrieved the first time the client asks for it.
1603 :    
1604 :     =item options
1605 :    
1606 :     Reference to a hash of command-line options.
1607 :    
1608 :     =item RETURN
1609 :    
1610 : parrello 1.11 Returns an L<ERDBLoadGroup> object that can be used to process the specified load group
1611 : parrello 1.1 for this database.
1612 :    
1613 :     =back
1614 :    
1615 :     =cut
1616 :    
1617 :     sub Loader {
1618 :     # Get the parameters.
1619 :     my ($self, $groupName, $options) = @_;
1620 :     # Compute the loader name.
1621 :     my $loaderClass = "${groupName}SaplingLoader";
1622 :     # Pull in its definition.
1623 :     require "$loaderClass.pm";
1624 :     # Create an object for it.
1625 :     my $retVal = eval("$loaderClass->new(\$self, \$options)");
1626 :     # Insure it worked.
1627 :     Confess("Could not create $loaderClass object: $@") if $@;
1628 :     # Return it to the caller.
1629 :     return $retVal;
1630 :     }
1631 :    
1632 :     =head3 LoadGroupList
1633 :    
1634 :     my @groups = $erdb->LoadGroupList();
1635 :    
1636 :     Returns a list of the names for this database's load groups. This method is used
1637 : parrello 1.11 by L<ERDBGenerator.pl> when the user wishes to load all table groups. The default
1638 : parrello 1.1 is a single group called 'All' that loads everything.
1639 :    
1640 :     =cut
1641 :    
1642 :     sub LoadGroupList {
1643 :     # Return the list.
1644 : parrello 1.10 return qw(Protein Genome Feature Subsystem Family Scenario Model);
1645 : parrello 1.1 }
1646 :    
1647 :     =head3 LoadDirectory
1648 :    
1649 :     my $dirName = $erdb->LoadDirectory();
1650 :    
1651 :     Return the name of the directory in which load files are kept. The default is
1652 :     the FIG temporary directory, which is a really bad choice, but it's always there.
1653 :    
1654 :     =cut
1655 :    
1656 :     sub LoadDirectory {
1657 :     # Get the parameters.
1658 :     my ($self) = @_;
1659 :     # Return the directory name.
1660 :     return $self->{loadDirectory};
1661 :     }
1662 :    
1663 : parrello 1.21 =head3 UseInternalDBD
1664 :    
1665 :     my $flag = $erdb->UseInternalDBD();
1666 :    
1667 :     Return TRUE if this database should be allowed to use an internal DBD.
1668 :     The internal DBD is stored in the C<_metadata> table, which is created
1669 :     when the database is loaded. The Sapling uses an internal DBD.
1670 :    
1671 :     =cut
1672 :    
1673 :     sub UseInternalDBD {
1674 :     return 1;
1675 :     }
1676 : parrello 1.1
1677 :     1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3