Annotation of /Sprout/Sapling.pm


Revision 1.5

1 : parrello 1.1 #!/usr/bin/perl -w
2 :    
3 :     #
4 :     # Copyright (c) 2003-2006 University of Chicago and Fellowship
5 :     # for Interpretations of Genomes. All Rights Reserved.
6 :     #
7 :     # This file is part of the SEED Toolkit.
8 :     #
9 :     # The SEED Toolkit is free software. You can redistribute
10 :     # it and/or modify it under the terms of the SEED Toolkit
11 :     # Public License.
12 :     #
13 :     # You should have received a copy of the SEED Toolkit Public License
14 :     # along with this program; if not write to the University of Chicago
15 :     # at info@ci.uchicago.edu or the Fellowship for Interpretation of
16 :     # Genomes at veronika@thefig.info or download a copy from
17 :     # http://www.theseed.org/LICENSE.TXT.
18 :     #
19 :    
20 :     package Sapling;
21 :    
22 :     use strict;
23 :     use Tracer;
24 :     use DBKernel;
25 :     use base 'ERDB';
26 :     use Stats;
27 :     use XML::Simple;
28 :    
29 :     =head1 Sapling Package
30 :    
31 :     Sapling Database Access Methods
32 :    
33 :     =head2 Introduction
34 :    
35 :     The Sapling database is a new [[ErdbPm]] database that attempts to encapsulate
36 :     our data in a portable form for distribution. It is loaded directly from the
37 :     complete genomes and trusted subsystems of the SEED. This object has minimal
38 :     capabilities: in essence, it's just enough to get the database loaded and
39 :     working. As with the earlier Sprout database, most of the work required to use
40 :     the database can be performed using the base-class methods.
41 :    
42 :     The fields in this object are as follows.
43 :    
44 :     =over 4
45 :    
46 :     =item loadDirectory
47 :    
48 :     Name of the directory containing the files used by the loaders.
49 :    
50 :     =item source
51 :    
52 :     Source object for the loaders (a [[FigPm]] in our case).
53 :    
54 :     =item genomeHash
55 :    
56 :     Reference to a hash of the genomes to include when loading.
57 :    
58 :     =item subHash
59 :    
60 :     Reference to a hash of the subsystems to include when loading.
61 :    
62 :     =item tuning
63 :    
64 :     Reference to a hash of tuning parameters.
65 :    
66 :     =back
67 :    
68 :     =head2 Configuration
69 :    
70 :     The default loading profile for the Sapling database is to include all complete
71 :     genomes and all usable subsystems. This can be overridden by specifying a list of
72 :     genomes and subsystems in an XML configuration file. The file name should be
73 :     C<SaplingConfig.xml> in the specified data directory. The document element should
74 :     be C<Sapling>, and it has two sub-elements. The C<Genomes> element should contain as
75 :     its text a space-delimited list of genome IDs. The C<Subsystems> element should contain
76 :     a list of subsystem names, one per line. If a particular section is missing, the
77 :     default list will be used.
78 :    
79 :     =head3 Example
80 :    
81 :     The following configuration file specifies 10 genomes and 6 subsystems.
82 :    
83 :     <Sapling>
84 :     <Genomes>
85 :     100226.1 31033.3 31964.1 36873.1 126740.4
86 :     155864.1 349307.7 350058.5 351348.5 412694.5
87 :     </Genomes>
88 :     <Subsystems>
89 :     Sugar_utilization_in_Thermotogales
90 :     Coenzyme_F420_hydrogenase
91 :     Ribosome_activity_modulation
92 :     prophage_tails
93 :     CBSS-393130.3.peg.794
94 :     Apigenin_derivatives
95 :     </Subsystems>
96 :     </Sapling>
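
The two list formats above can be handled with ordinary string operations. The following is a minimal sketch, not the module's own code path (which reads the file through XML::Simple). It assumes the text of each element has already been extracted into the hypothetical variables C<$genomeText> and C<$subsystemText>.

```perl
use strict;
use warnings;

# Assumed inputs: the raw text of the Genomes and Subsystems elements.
my $genomeText = "\n100226.1 31033.3 31964.1\n155864.1 349307.7\n";
my $subsystemText = "\nCoenzyme_F420_hydrogenase\n  prophage_tails  \n\n";

# Genome IDs are separated by any run of white space. The grep discards
# the empty field produced by leading white space.
my %genomeHash = map { $_ => 1 } grep { /\S/ } split /\s+/, $genomeText;

# Subsystem names come one per line. Trim each line and skip blank ones.
my %subHash;
for my $line (split /\n/, $subsystemText) {
    if ($line =~ /^\s*(\S(?:.*\S)?)\s*$/) {
        $subHash{$1} = 1;
    }
}

print scalar(keys %genomeHash), " genomes, ", scalar(keys %subHash), " subsystems\n";
```

Both results are hashes that map each name to a true value, which is the shape the loaders expect from L</GenomeHash> and L</SubsystemHash>.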
97 :    
98 :     The XML file may also contain tuning parameters that affect the way the data
99 :     is loaded. These are specified as attributes of the C<TuningParameters> element,
100 :     as follows.
101 :    
102 :     =over 4
103 :    
104 :     =item maxLocationLength
105 :    
106 :     The maximum number of base pairs allowed in a single location. B<IsLocatedIn>
107 :     records are split into sections of at most this length. As a result, when you
108 :     are looking for all the features in a particular neighborhood, you only need to
109 :     look for locations within the maximum location length of the neighborhood.
110 :     Even a huge operon that contains tens of thousands of base pairs will still
111 :     be found.
112 :    
113 : parrello 1.4 =item maxSequenceLength
114 :    
115 :     The maximum number of base pairs allowed in a single DNA sequence. DNA sequences
116 :     are broken into segments to prevent excessively large genomes from clogging
117 :     memory during sequence resolution.
118 :    
119 : parrello 1.1 =back
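
As a rough illustration of the C<maxLocationLength> rule, the sketch below splits one location into consecutive segments no longer than a given maximum. The sub name C<split_location> and the (start, length) representation are assumptions made for this example; strand direction is ignored, and the actual loaders may represent and split locations differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Sapling.pm): split a location, given as a
# start position and a length in base pairs, into consecutive segments that
# are each at most $maxLen base pairs long.
sub split_location {
    my ($start, $length, $maxLen) = @_;
    die "Maximum length must be positive.\n" if $maxLen <= 0;
    my @segments;
    while ($length > 0) {
        # Take as much as is allowed for this segment.
        my $segLen = ($length < $maxLen ? $length : $maxLen);
        push @segments, [$start, $segLen];
        # Move past the segment just produced.
        $start += $segLen;
        $length -= $segLen;
    }
    return @segments;
}

# Example: a 9500-base-pair location with the default maximum of 4000.
my @parts = split_location(1, 9500, 4000);
print join(", ", map { "$_->[0]+$_->[1]" } @parts), "\n";
# prints "1+4000, 4001+4000, 8001+1500"
```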
120 :    
121 :     =head2 Special Methods
122 :    
123 :     =head3 Global Section Constant
124 :    
125 :     Each section of the database used by the loader corresponds to a single genome.
126 :     The global section is loaded after all the others, and is concerned with data
127 :     not related to a particular genome.
128 :    
129 :     =cut
130 :    
131 :     # Name of the global section
132 :     use constant GLOBAL => 'Globals';
133 :    
134 :     =head3 Tuning Parameter Defaults
135 :    
136 :     Each tuning parameter must have a default value, in case it is not present in
137 :     the XML configuration file. The defaults are specified in a constant hash
138 :     reference called C<TUNING_DEFAULTS>.
139 :    
140 :     =cut
141 :    
142 :     use constant TUNING_DEFAULTS => {
143 : parrello 1.4 maxLocationLength => 4000,
144 :     maxSequenceLength => 1000000,
145 : parrello 1.1 };
146 :    
147 :     =head3 new
148 :    
149 :     my $sap = Sapling->new(%options);
150 :    
151 :     Construct a new Sapling object. The following options are supported.
152 :    
153 :     =over 4
154 :    
155 :     =item loadDirectory
156 :    
157 :     Data directory to be used by the loaders.
158 :    
159 :     =item dbd
160 :    
161 :     XML database definition file.
162 :    
163 :     =item dbName
164 :    
165 :     Name of the database to use.
166 :    
167 :     =item sock
168 :    
169 :     Socket for accessing the database.
170 :    
171 :     =item userData
172 :    
173 :     Name and password used to log on to the database, separated by a slash.
174 :    
175 :     =item dbhost
176 :    
177 :     Database host name.
178 :    
179 :     =back
180 :    
181 :     =cut
182 :    
183 :     sub new {
184 :     # Get the parameters.
185 :     my ($class, %options) = @_;
186 :     # Get the options.
187 : parrello 1.5 my $loadDirectory = $options{loadDirectory} || $FIG_Config::saplingData ||
188 :     "$FIG_Config::fig/SaplingData";
189 :     my $dbd = $options{dbd} || "$loadDirectory/SaplingDBD.xml";
190 : parrello 1.1 my $dbName = $options{dbName} || "nmpdr_sapling";
191 : parrello 1.5 my $sock = $options{sock} || $FIG_Config::sproutSock || "";
192 : parrello 1.3 my $userData = $options{userData} || "seed/";
193 :     my $dbhost = $options{dbhost} || $FIG_Config::saplingHost || "localhost";
194 : parrello 1.1 # Compute the user name and password.
195 :     my ($user, $pass) = split '/', $userData, 2;
196 :     $pass = "" if ! defined $pass;
197 :     # Connect to the database.
198 :     my $dbh = DBKernel->new('mysql', $dbName, $user, $pass, 3306, $dbhost, $sock);
199 :     # Create the ERDB object.
200 : parrello 1.2 my $retVal = ERDB::new($class, $dbh, $dbd, %options);
201 : parrello 1.1 # Add the load directory pointer.
202 :     $retVal->{loadDirectory} = $loadDirectory;
203 :     # Set up the spaces for the loader source object, the subsystem hash, the
204 :     # genome hash, and the tuning parameters.
205 :     $retVal->{source} = undef;
206 :     $retVal->{genomeHash} = undef;
207 :     $retVal->{subHash} = undef;
208 :     $retVal->{tuning} = undef;
209 :     # Return it.
210 :     return $retVal;
211 :     }
212 :    
213 :    
214 :     =head2 Public Methods
215 :    
216 : parrello 1.4 =head3 Taxonomy
217 :    
218 :     my @taxonomy = $sap->Taxonomy($genomeID);
219 :    
220 :     Return the full taxonomy of the specified genome, starting from the
221 :     domain downward. The returned values will be primary names, not taxonomy
222 :     IDs.
223 :    
224 :     =over 4
225 :    
226 :     =item genomeID
227 :    
228 :     ID of the genome whose taxonomy is desired. The genome does not need to exist
229 :     in the database: the version number will be lopped off and the result used as
230 :     an entry point into the taxonomy tree.
231 :    
232 :     =item RETURN
233 :    
234 :     Returns a list of taxonomy names, starting from the domain and moving
235 :     down to the node where the genome is attached.
236 :    
237 :     =back
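
The walk from a genome up to its domain can be sketched without a database. The example below replaces the B<TaxonomicGrouping> and B<IsInGroup> query with a small in-memory hash; the table contents, the IDs, and the sub name C<taxonomy_names> are all made up for illustration.

```perl
use strict;
use warnings;

# Toy taxonomy table (made-up IDs and names). Each node has a name, a
# parent ID, and a flag that is true only for domain-level nodes.
my %taxa = (
    '10' => { name => 'Domain A',  parent => undef, domain => 1 },
    '20' => { name => 'Phylum B',  parent => '10',  domain => 0 },
    '30' => { name => 'Species C', parent => '20',  domain => 0 },
);

sub taxonomy_names {
    my ($genomeID) = @_;
    # The taxonomy ID is the part of the genome ID before the first period.
    my ($taxon) = split /\./, $genomeID, 2;
    my @names;
    my $done;
    while (! $done) {
        my $node = (defined $taxon ? $taxa{$taxon} : undef);
        if (! $node) {
            # Unknown node or missing parent: stop with what we have.
            $done = 1;
        } else {
            # Names are collected leaf-first, so add each one at the front.
            unshift @names, $node->{name};
            $done = $node->{domain};
            $taxon = $node->{parent};
        }
    }
    return @names;
}

print join(" > ", taxonomy_names('30.1')), "\n";
# prints "Domain A > Phylum B > Species C"
```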
238 :    
239 :     =cut
240 :    
241 :     sub Taxonomy {
242 :     # Get the parameters.
243 :     my ($self, $genomeID) = @_;
244 :     # Get the genome's taxonomic group.
245 :     my ($taxon) = split /\./, $genomeID, 2;
246 :     # We'll put the return data in here.
247 :     my @retVal;
248 :     # Loop until we hit a domain.
249 :     my $domainFlag;
250 :     while (! $domainFlag) {
251 :     # Get the data we need for this taxonomic group.
252 :     my ($taxonData) = $self->GetAll('TaxonomicGrouping IsInGroup',
253 :     'TaxonomicGrouping(id) = ?', [$taxon],
254 :     'domain scientific-name IsInGroup(to-link)');
255 :     # If we didn't find what we're looking for, then we have a problem. This
256 :     # would indicate a node below the domain level that doesn't have a parent
257 :     # or (more likely) an invalid input string.
258 :     if (! $taxonData) {
259 :     # Terminate the loop and trace a warning.
260 :     $domainFlag = 1;
261 :     Trace("Could not find node or parent for \"$taxon\".") if T(1);
262 :     } else {
263 :     # Extract the data for the current group. Note we overwrite our
264 :     # taxonomy ID with the ID of our parent, priming the next iteration
265 :     # of the loop.
266 :     my $name;
267 :     ($domainFlag, $name, $taxon) = @$taxonData;
268 :     # Put the current group's name in the return list.
269 :     unshift @retVal, $name;
270 :     }
271 :     }
272 :     # Return the result.
273 :     return @retVal;
274 :     }
275 :    
276 :    
277 : parrello 1.1 =head3 GenomeHash
278 :    
279 :     my $genomeHash = $sap->GenomeHash();
280 :    
281 :     Return a reference to a hash of the genomes configured to be in this database.
282 :     The list is taken from the configuration file in the data directory if it specifies
283 :     one, and otherwise from the active SEED database. The hash maps genome IDs to TRUE.
284 :    
285 :     =cut
286 :    
287 :     sub GenomeHash {
288 :     # Get the parameters.
289 :     my ($self) = @_;
290 :     # We'll build the hash in here.
291 :     my %genomeHash;
292 :     # Do we already have a list?
293 :     if (! defined $self->{genomeHash}) {
294 :     # No, check for a configuration file.
295 :     my $xml = $self->ReadConfigFile();
296 :     if (defined $xml && $xml->{Genomes}) {
297 :     # We found one and it has a genome list, so extract the genomes.
298 :     %genomeHash = map { $_ => 1 } grep { $_ =~ /\S/ } split /\s+/, $xml->{Genomes};
299 :     } else {
300 :     # No, so get the genome list.
301 :     my $fig = $self->GetSourceObject();
302 :     my @genomes = $fig->genomes(1);
303 :     # Verify the genome list to ensure every genome has an organism
304 :     # directory.
305 :     for my $genome (@genomes) {
306 :     if (-d "$FIG_Config::organisms/$genome") {
307 :     $genomeHash{$genome} = 1;
308 :     }
309 :     }
310 :     }
311 :     # Store the genomes in this object.
312 :     $self->{genomeHash} = \%genomeHash;
313 :     }
314 :     # Return the result.
315 :     return $self->{genomeHash};
316 :     }
317 :    
318 :     =head3 SubsystemID
319 :    
320 :     my $subID = $sap->SubsystemID($subName);
321 :    
322 :     Return the ID of the subsystem with the specified name.
323 :    
324 :     =over 4
325 :    
326 :     =item subName
327 :    
328 :     Name of the relevant subsystem. A subsystem name with underscores for spaces
329 :     will return the same ID as a subsystem name with the spaces still in it.
330 :    
331 :     =item RETURN
332 :    
333 : parrello 1.4 Returns a normalized subsystem name.
334 : parrello 1.1
335 :     =back
336 :    
337 :     =cut
338 :    
339 :     sub SubsystemID {
340 :     # Get the parameters.
341 :     my ($self, $subName) = @_;
342 : parrello 1.4 # Normalize the subsystem name by converting underscores to spaces.
343 :     my $retVal = $subName;
344 :     $retVal =~ s/_/ /g;
345 : parrello 1.1 # Return the result.
346 :     return $retVal;
347 :     }
348 :    
349 :     =head3 SubsystemHash
350 :    
351 :     my $subHash = $sap->SubsystemHash();
352 :    
353 :     Return a reference to a hash of the subsystems configured to be in this
354 :     database. The list is taken from the configuration file in the data directory
355 :     if it specifies one, and otherwise from the active SEED database. The hash
356 :     maps subsystem names to TRUE.
357 :    
358 :     =cut
359 :    
360 :     sub SubsystemHash {
361 :     # Get the parameters.
362 :     my ($self) = @_;
363 :     # We'll build the hash in here.
364 :     my %subHash;
365 :     # Do we already have a list?
366 :     if (! defined $self->{subHash}) {
367 :     # No, check for a configuration file.
368 :     my $xml = $self->ReadConfigFile();
369 :     if (defined $xml && $xml->{Subsystems}) {
370 :     # We found one, and it has subsystems, so we extract them.
371 :     # A little dancing is necessary to trim spaces.
372 :     my @subs = map { /^\s*(\S(?:.*\S)?)\s*$/ ? $1 : () } split /\n/, $xml->{Subsystems};
373 :     # Here we need to clear out any null subsystem names resulting from
374 :     # blank lines in the file.
375 :     %subHash = map { $_ => 1 } grep { $_ } @subs;
376 :     } else {
377 :     # No config file, so we ask the FIG object.
378 :     my $fig = $self->GetSourceObject();
379 : parrello 1.4 my @subs = map { $self->SubsystemID($_) } $fig->all_subsystems();
380 : parrello 1.1 %subHash = map { $_ => 1 } grep { $fig->usable_subsystem($_) } @subs;
381 :     }
382 :     # Store the subsystems in this object.
383 :     $self->{subHash} = \%subHash;
384 :     }
385 :     # Return the result.
386 :     return $self->{subHash};
387 :     }
388 :    
389 :     =head3 TuningParameter
390 :    
391 :     my $parm = $sap->TuningParameter($parmName);
392 :    
393 :     Return the value of the specified tuning parameter. Tuning parameters are
394 :     read from the XML configuration file.
395 :    
396 :     =over 4
397 :    
398 :     =item parmName
399 :    
400 :     Name of the parameter whose value is desired.
401 :    
402 :     =item RETURN
403 :    
404 :     Returns the parameter value.
405 :    
406 :     =back
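
The effect of merging defaults into the configured values can be sketched with plain hashes. The helper C<merge_defaults> below is a hypothetical stand-in written for this example; it is assumed, not confirmed, to behave like the C<Tracer::MergeOptions> call used by this method.

```perl
use strict;
use warnings;

# Defaults copied from the TUNING_DEFAULTS constant in this module.
my %defaults = (maxLocationLength => 4000, maxSequenceLength => 1000000);

# Hypothetical helper: copy each default into $tuning unless the
# configuration already supplied a value for that parameter.
sub merge_defaults {
    my ($tuning, $defaults) = @_;
    for my $key (keys %$defaults) {
        $tuning->{$key} = $defaults->{$key} if ! exists $tuning->{$key};
    }
    return $tuning;
}

# Suppose the configuration file only overrides maxLocationLength.
my $tuning = merge_defaults({ maxLocationLength => 2000 }, \%defaults);
print "$_ = $tuning->{$_}\n" for sort keys %$tuning;
```

With these inputs, the configured value of C<maxLocationLength> is kept and C<maxSequenceLength> falls back to its default.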
407 :    
408 :     =cut
409 :    
410 :     sub TuningParameter {
411 :     # Get the parameters.
412 :     my ($self, $parmName) = @_;
413 :     # Ensure we have the parameters in memory.
414 :     if (! defined $self->{tuning}) {
415 :     # Read the configuration file.
416 :     my $configFile = $self->ReadConfigFile();
417 :     # Get the tuning parameters (if any).
418 :     my $tuning;
419 :     if (! defined $configFile || ! exists $configFile->{TuningParameters}) {
420 :     $tuning = {};
421 :     } else {
422 :     $tuning = $configFile->{TuningParameters};
423 :     }
424 :     # Merge in the default option values.
425 :     Tracer::MergeOptions($tuning, TUNING_DEFAULTS);
426 :     # Save the result in our object.
427 :     $self->{tuning} = $tuning;
428 :     }
429 :     # Extract the tuning parameter.
430 :     my $retVal = $self->{tuning}{$parmName};
431 :     # Throw an error if it does not exist.
432 :     Confess("Invalid tuning parameter \"$parmName\".") if ! defined $retVal;
433 :     # Return the result.
434 :     return $retVal;
435 :     }
436 :    
437 :    
438 :     =head3 ReadConfigFile
439 :    
440 :     my $xmlObject = $sap->ReadConfigFile();
441 :    
442 :     Return the hash structure created from reading the configuration file, or
443 :     an undefined value if the file is not found.
444 :    
445 :     =cut
446 :    
447 :     sub ReadConfigFile {
448 :     my ($self) = @_;
449 :     # Declare the return variable.
450 :     my $retVal;
451 :     # Compute the configuration file name.
452 :     my $fileName = "$self->{loadDirectory}/SaplingConfig.xml";
453 :     # Did we find it?
454 :     if (-f $fileName) {
455 :     # Yes, read it in.
456 :     $retVal = XMLin($fileName);
457 :     }
458 :     # Return the result.
459 :     return $retVal;
460 :     }
461 :    
462 :     =head3 GlobalSection
463 :    
464 :     my $flag = $sap->GlobalSection($name);
465 :    
466 :     Return TRUE if the specified section name is the global section, FALSE
467 :     otherwise.
468 :    
469 :     =over 4
470 :    
471 :     =item name
472 :    
473 :     Section name to test.
474 :    
475 :     =item RETURN
476 :    
477 : parrello 1.4 Returns TRUE if the parameter matches the GLOBAL constant, else FALSE.
478 : parrello 1.1
479 :     =back
480 :    
481 :     =cut
482 :    
483 :     sub GlobalSection {
484 :     # Get the parameters.
485 :     my ($self, $name) = @_;
486 :     # Return the result.
487 :     return ($name eq GLOBAL);
488 :     }
489 :    
490 :    
491 :     =head2 Virtual Methods
492 :    
493 :     =head3 GetSourceObject
494 :    
495 :     my $source = $erdb->GetSourceObject();
496 :    
497 :     Return the object to be used in creating load files for this database. This is
498 :     only the default source object. Loaders have the option of overriding the chosen
499 :     source object when constructing the [[ERDBLoadGroupPm]] objects.
500 :    
501 :     =cut
502 :    
503 :     sub GetSourceObject {
504 :     my ($self) = @_;
505 :     # Ensure the source object exists in our internal cache.
506 :     if (! defined $self->{source}) {
507 :     # We require the FIG object. If the user has no intention of
508 :     # doing a load, this method won't be used, so he won't need to
509 :     # have the FIG object on his system.
510 :     require FIG;
511 :     $self->{source} = FIG->new();
512 :     }
513 :     # Return it to the caller.
514 :     return $self->{source};
515 :     }
516 :    
517 :     =head3 SectionList
518 :    
519 :     my @sections = $erdb->SectionList();
520 :    
521 :     Return a list of the names for the different data sections used when loading this database.
522 :     The default is a single section representing the entire database. The Sapling version
523 :     returns one section per configured genome, sorted by ID, followed by the global section.
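
A minimal sketch of the resulting section order, using a made-up genome hash in the shape returned by L</GenomeHash>. Note that the sort is a plain string sort, so IDs are ordered character by character rather than numerically.

```perl
use strict;
use warnings;

use constant GLOBAL => 'Globals';

# Made-up genome hash for illustration only.
my $genomes = { '31033.3' => 1, '100226.1' => 1, '36873.1' => 1 };

# One section per genome, in string-sorted order, then the global section.
my @sections = sort keys %$genomes;
push @sections, GLOBAL;

print join(", ", @sections), "\n";
# prints "100226.1, 31033.3, 36873.1, Globals"
```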
524 :    
525 :     =cut
526 :    
527 :     sub SectionList {
528 :     # Get the parameters.
529 :     my ($self) = @_;
530 :     # Get the genome hash.
531 :     my $genomes = $self->GenomeHash();
532 :     # Create one section per genome.
533 :     my @retVal = sort keys %$genomes;
534 :     # Append the global section.
535 :     push @retVal, GLOBAL;
536 :     # Return the section list.
537 :     return @retVal;
538 :     }
539 :    
540 :     =head3 Loader
541 :    
542 :     my $groupLoader = $erdb->Loader($groupName, $source, $options);
543 :    
544 :     Return an [[ERDBLoadGroupPm]] object for the specified load group. This method is used
545 :     by [[ERDBGeneratorPl]] to create the load group objects. If you are not using
546 :     [[ERDBGeneratorPl]], you don't need to override this method.
547 :    
548 :     =over 4
549 :    
550 :     =item groupName
551 :    
552 :     Name of the load group whose object is to be returned. The group name is
553 :     guaranteed to be a single word with only the first letter capitalized.
554 :    
555 :     =item source
556 :    
557 :     The source object used to access the data from which the load file is derived. This
558 :     is the same object returned by L</GetSourceObject>; however, we allow the caller to pass
559 :     it in as a parameter so that we don't end up creating multiple copies of a potentially
560 :     expensive data structure. It is permissible for this value to be undefined, in which
561 :     case the source will be retrieved the first time the client asks for it.
562 :    
563 :     =item options
564 :    
565 :     Reference to a hash of command-line options.
566 :    
567 :     =item RETURN
568 :    
569 :     Returns an [[ERDBLoadGroupPm]] object that can be used to process the specified load group
570 :     for this database.
571 :    
572 :     =back
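
The idea of choosing a loader class from the group name can be sketched as follows. This is an alternative formulation for illustration, not the code used by this method: the class C<GenomeDemoLoader> and the sub C<make_loader> are hypothetical, the class is defined inline so that no C<require> is needed, and the constructor is called through the class-name string inside a block C<eval>.

```perl
use strict;
use warnings;

# Stand-in loader class, defined inline so the example runs on its own.
package GenomeDemoLoader;

sub new {
    my ($class, $db, $options) = @_;
    return bless { db => $db, options => $options }, $class;
}

package main;

# Hypothetical factory: build the class name from the group name and call
# its constructor through the string holding that name.
sub make_loader {
    my ($groupName, $db, $options) = @_;
    my $loaderClass = "${groupName}DemoLoader";
    my $loader = eval { $loaderClass->new($db, $options) };
    die "Could not create $loaderClass object: $@" if $@;
    return $loader;
}

my $loader = make_loader('Genome', 'fake-db', { verbose => 1 });
print ref($loader), "\n";
```

A group name with no matching class makes C<make_loader> die with a message naming the class that could not be created.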
573 :    
574 :     =cut
575 :    
576 :     sub Loader {
577 :     # Get the parameters.
578 :     my ($self, $groupName, $options) = @_;
579 :     # Compute the loader name.
580 :     my $loaderClass = "${groupName}SaplingLoader";
581 :     # Pull in its definition.
582 :     require "$loaderClass.pm";
583 :     # Create an object for it.
584 :     my $retVal = eval("$loaderClass->new(\$self, \$options)");
585 :     # Ensure it worked.
586 :     Confess("Could not create $loaderClass object: $@") if $@;
587 :     # Return it to the caller.
588 :     return $retVal;
589 :     }
590 :    
591 :     =head3 LoadGroupList
592 :    
593 :     my @groups = $erdb->LoadGroupList();
594 :    
595 :     Returns a list of the names for this database's load groups. This method is used
596 :     by [[ERDBGeneratorPl]] when the user wishes to load all table groups. The default
597 :     is a single group called 'All' that loads everything; Sapling defines its own groups.
598 :    
599 :     =cut
600 :    
601 :     sub LoadGroupList {
602 :     # Return the list.
603 : parrello 1.5 return qw(Genome Feature Subsystem Family Scenario Model); # ##TODO Drug, Protein
604 : parrello 1.1 }
605 :    
606 :     =head3 LoadDirectory
607 :    
608 :     my $dirName = $erdb->LoadDirectory();
609 :    
610 :     Return the name of the directory in which load files are kept. The default is the FIG
611 :     temporary directory, which is a really bad choice, but it's always there. Sapling returns its own load directory.
612 :    
613 :     =cut
614 :    
615 :     sub LoadDirectory {
616 :     # Get the parameters.
617 :     my ($self) = @_;
618 :     # Return the directory name.
619 :     return $self->{loadDirectory};
620 :     }
621 :    
622 :    
623 :     1;
