Annotation of /Sprout/Sapling.pm


Revision 1.4

1 : parrello 1.1 #!/usr/bin/perl -w
2 :    
3 :     #
4 :     # Copyright (c) 2003-2006 University of Chicago and Fellowship
5 :     # for Interpretations of Genomes. All Rights Reserved.
6 :     #
7 :     # This file is part of the SEED Toolkit.
8 :     #
9 :     # The SEED Toolkit is free software. You can redistribute
10 :     # it and/or modify it under the terms of the SEED Toolkit
11 :     # Public License.
12 :     #
13 :     # You should have received a copy of the SEED Toolkit Public License
14 :     # along with this program; if not write to the University of Chicago
15 :     # at info@ci.uchicago.edu or the Fellowship for Interpretation of
16 :     # Genomes at veronika@thefig.info or download a copy from
17 :     # http://www.theseed.org/LICENSE.TXT.
18 :     #
19 :    
20 :     package Sapling;
21 :    
22 :     use strict;
23 :     use Tracer;
24 :     use DBKernel;
25 :     use base 'ERDB';
26 :     use Stats;
27 :     use XML::Simple;
28 :    
29 :     =head1 Sapling Package
30 :    
31 :     Sapling Database Access Methods
32 :    
33 :     =head2 Introduction
34 :    
35 :     The Sapling database is a new [[ErdbPm]] database that attempts to encapsulate
36 :     our data in a portable form for distribution. It is loaded directly from the
37 :     complete genomes and trusted subsystems of the SEED. This object has minimal
38 :     capabilities: in essence, it's just enough to get the database loaded and
39 :     working. As with the earlier Sprout database, most of the work required to use
40 :     the database can be performed using the base-class methods.
41 :    
42 :     The fields in this object are as follows.
43 :    
44 :     =over 4
45 :    
46 :     =item loadDirectory
47 :    
48 :     Name of the directory containing the files used by the loaders.
49 :    
50 :     =item source
51 :    
52 :     Source object for the loaders (a [[FigPm]] in our case).
53 :    
54 :     =item genomeHash
55 :    
56 :     Reference to a hash of the genomes to include when loading.
57 :    
58 :     =item subHash
59 :    
60 :     Reference to a hash of the subsystems to include when loading.
61 :    
62 :     =item tuning
63 :    
64 :     Reference to a hash of tuning parameters.
65 :    
66 :     =back
67 :    
68 :     =head2 Configuration
69 :    
70 :     The default loading profile for the Sapling database is to include all complete
71 :     genomes and all usable subsystems. This can be overridden by specifying a list of
72 :     genomes and subsystems in an XML configuration file. The file name should be
73 :     C<SaplingConfig.xml> in the specified data directory. The document element should
74 :     be C<Sapling>, and it has two sub-elements. The C<Genomes> element should contain as
75 :     its text a whitespace-delimited list of genome IDs. The C<Subsystems> element should contain
76 :     a list of subsystem names, one per line. If a particular section is missing, the
77 :     default list will be used.
78 :    
79 :     =head3 Example
80 :    
81 :     The following configuration file specifies 10 genomes and 6 subsystems.
82 :    
83 :         <Sapling>
84 :             <Genomes>
85 :                 100226.1 31033.3 31964.1 36873.1 126740.4
86 :                 155864.1 349307.7 350058.5 351348.5 412694.5
87 :             </Genomes>
88 :             <Subsystems>
89 :                 Sugar_utilization_in_Thermotogales
90 :                 Coenzyme_F420_hydrogenase
91 :                 Ribosome_activity_modulation
92 :                 prophage_tails
93 :                 CBSS-393130.3.peg.794
94 :                 Apigenin_derivatives
95 :             </Subsystems>
96 :         </Sapling>
97 :    
98 :     The XML file can also contain tuning parameters that affect the way the data
99 :     is loaded. These are specified as attributes of the C<TuningParameters> element,
100 :     as follows.
101 :    
102 :     =over 4
103 :    
104 :     =item maxLocationLength
105 :    
106 :     The maximum number of base pairs allowed in a single location. B<IsLocatedIn>
107 :     records are split into sections of at most this length. As a result, when you
108 :     are looking for all the features in a particular neighborhood, you only need to
109 :     look for locations within the maximum location length of the neighborhood.
110 :     Even a huge operon that contains tens of thousands of base pairs will still
111 :     be found.
112 :    
113 : parrello 1.4 =item maxSequenceLength
114 :    
115 :     The maximum number of base pairs allowed in a single DNA sequence. DNA sequences
116 :     are broken into segments to prevent excessively large genomes from clogging
117 :     memory during sequence resolution.
118 :    
119 : parrello 1.1 =back
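
As a rough illustration of the C<maxLocationLength> splitting described above, the following sketch breaks one location into sections of at most that many base pairs. The C<split_location> helper and its (start, length) representation are hypothetical and cover the forward strand only; this is not the loader's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: split a forward-strand location, given as a
# 1-based start and a length in base pairs, into sections that are
# each at most $maxLen base pairs long.
sub split_location {
    my ($start, $len, $maxLen) = @_;
    die "maxLen must be positive\n" if $maxLen <= 0;
    my @sections;
    while ($len > 0) {
        # Take as much as the limit allows.
        my $chunk = ($len > $maxLen ? $maxLen : $len);
        push @sections, [$start, $chunk];
        $start += $chunk;
        $len -= $chunk;
    }
    return @sections;
}

# A 9000-bp feature with the default limit of 4000.
my @sections = split_location(1001, 9000, 4000);
# Prints: 1001+4000, 5001+4000, 9001+1000
print join(", ", map { "$_->[0]+$_->[1]" } @sections), "\n";
```

Because no section is longer than the limit, a neighborhood query only has to widen its search window by that limit to catch every section that overlaps it.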
120 :    
121 :     =head2 Special Methods
122 :    
123 :     =head3 Global Section Constant
124 :    
125 :     Each section of the database used by the loader corresponds to a single genome.
126 :     The global section is loaded after all the others, and is concerned with data
127 :     not related to a particular genome.
128 :    
129 :     =cut
130 :    
131 :     # Name of the global section
132 :     use constant GLOBAL => 'Globals';
133 :    
134 :     =head3 Tuning Parameter Defaults
135 :    
136 :     Each tuning parameter must have a default value, in case it is not present in
137 :     the XML configuration file. The defaults are specified in a constant hash
138 :     reference called C<TUNING_DEFAULTS>.
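
The effect of the defaults can be sketched in plain Perl as follows. The C<merge_defaults> function is a hypothetical stand-in for the merge step and is not the C<Tracer> implementation; the default values are copied from C<TUNING_DEFAULTS>.

```perl
use strict;
use warnings;

# Defaults copied from TUNING_DEFAULTS.
my %defaults = (maxLocationLength => 4000, maxSequenceLength => 1000000);

# Hypothetical stand-in for the merge step: copy each default into the
# tuning hash unless the configuration already supplied a value.
sub merge_defaults {
    my ($tuning, $defaults) = @_;
    for my $key (keys %$defaults) {
        $tuning->{$key} = $defaults->{$key} if ! exists $tuning->{$key};
    }
    return $tuning;
}

# Pretend the configuration file only overrode one parameter.
my $tuning = merge_defaults({ maxLocationLength => 2000 }, \%defaults);
print "$_ = $tuning->{$_}\n" for sort keys %$tuning;
```

Here the overridden C<maxLocationLength> survives and C<maxSequenceLength> falls back to its default.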
139 :    
140 :     =cut
141 :    
142 :     use constant TUNING_DEFAULTS => {
143 : parrello 1.4     maxLocationLength => 4000,
144 :         maxSequenceLength => 1000000,
145 : parrello 1.1 };
146 :    
147 :     =head3 new
148 :    
149 :     my $sap = Sapling->new(%options);
150 :    
151 :     Construct a new Sapling object. The following options are supported.
152 :    
153 :     =over 4
154 :    
155 :     =item loadDirectory
156 :    
157 :     Data directory to be used by the loaders.
158 :    
159 :     =item dbd
160 :    
161 :     XML database definition file.
162 :    
163 :     =item dbName
164 :    
165 :     Name of the database to use.
166 :    
167 :     =item sock
168 :    
169 :     Socket for accessing the database.
170 :    
171 :     =item userData
172 :    
173 :     Name and password used to log on to the database, separated by a slash.
174 :    
175 :     =item dbhost
176 :    
177 :     Database host name.
178 :    
179 :     =back
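
The C<userData> convention can be tried out in isolation with the sketch below. The C<parse_user_data> helper is hypothetical; it only mirrors the name/password split and does not connect to a database.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the name/password convention: split on
# the first slash only, so a password may itself contain slashes.
sub parse_user_data {
    my ($userData) = @_;
    my ($user, $pass) = split '/', $userData, 2;
    # A missing password becomes the empty string.
    $pass = "" if ! defined $pass;
    return ($user, $pass);
}

for my $userData ("seed/", "seed", "loader/pa/ss") {
    my ($user, $pass) = parse_user_data($userData);
    print "user=[$user] pass=[$pass]\n";
}
```

The first two inputs both yield the user C<seed> with an empty password; the third keeps C<pa/ss> intact as the password.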
180 :    
181 :     =cut
182 :    
183 :     sub new {
184 :         # Get the parameters.
185 :         my ($class, %options) = @_;
186 :         # Get the options.
187 :         my $loadDirectory = $options{loadDirectory} || $FIG_Config::saplingData;
188 :         my $dbd = $options{dbd} || "$FIG_Config::fig/SaplingDBD.xml";
189 :         my $dbName = $options{dbName} || "nmpdr_sapling";
190 :         my $sock = $options{sock} || "$FIG_Config::sproutSock";
191 : parrello 1.3     my $userData = $options{userData} || "seed/";
192 :         my $dbhost = $options{dbhost} || $FIG_Config::saplingHost || "localhost";
193 : parrello 1.1     # Compute the user name and password.
194 :         my ($user, $pass) = split '/', $userData, 2;
195 :         $pass = "" if ! defined $pass;
196 :         # Connect to the database.
197 :         my $dbh = DBKernel->new('mysql', $dbName, $user, $pass, 3306, $dbhost, $sock);
198 :         # Create the ERDB object.
199 : parrello 1.2     my $retVal = ERDB::new($class, $dbh, $dbd, %options);
200 : parrello 1.1     # Add the load directory pointer.
201 :         $retVal->{loadDirectory} = $loadDirectory;
202 :         # Set up the spaces for the loader source object, the subsystem hash, the
203 :         # genome hash, and the tuning parameters.
204 :         $retVal->{source} = undef;
205 :         $retVal->{genomeHash} = undef;
206 :         $retVal->{subHash} = undef;
207 :         $retVal->{tuning} = undef;
208 :         # Return it.
209 :         return $retVal;
210 :     }
211 :    
212 :    
213 :     =head2 Public Methods
214 :    
215 : parrello 1.4 =head3 Taxonomy
216 :    
217 :     my @taxonomy = $sap->Taxonomy($genomeID);
218 :    
219 :     Return the full taxonomy of the specified genome, starting from the
220 :     domain downward. The returned values will be primary names, not taxonomy
221 :     IDs.
222 :    
223 :     =over 4
224 :    
225 :     =item genomeID
226 :    
227 :     ID of the genome whose taxonomy is desired. The genome does not need to exist
228 :     in the database: the version number will be lopped off and the result used as
229 :     an entry point into the taxonomy tree.
230 :    
231 :     =item RETURN
232 :    
233 :     Returns a list of taxonomy names, starting from the domain and moving
234 :     down to the node where the genome is attached.
235 :    
236 :     =back
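
The upward walk through the taxonomy tree can be sketched without a database as shown below. The C<%tree> hash, its IDs, and its names are invented for illustration; the real method reads the same three values per node from the database.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for the taxonomy tables: each taxon ID
# maps to [domain flag, name, parent ID]. The IDs and names are invented.
my %tree = (
    '1001' => [0, 'Example species', '1000'],
    '1000' => [0, 'Example genus',   '10'],
    '10'   => [1, 'Example domain',  '1'],
);

sub taxonomy_of {
    my ($genomeID) = @_;
    # Lop off the version number to get the entry point.
    my ($taxon) = split /\./, $genomeID, 2;
    my @names;
    my $domainFlag;
    while (! $domainFlag) {
        my $node = $tree{$taxon};
        # Stop if the node is unknown.
        last if ! $node;
        my $name;
        # Replace the current ID with the parent's ID for the next pass.
        ($domainFlag, $name, $taxon) = @$node;
        # Parents go in front, so the domain ends up first.
        unshift @names, $name;
    }
    return @names;
}

# Prints: Example domain > Example genus > Example species
print join(" > ", taxonomy_of('1001.7')), "\n";
```

An unknown genome ID simply produces an empty list in this sketch, whereas the real method also traces a warning.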
237 :    
238 :     =cut
239 :    
240 :     sub Taxonomy {
241 :         # Get the parameters.
242 :         my ($self, $genomeID) = @_;
243 :         # Get the genome's taxonomic group.
244 :         my ($taxon) = split /\./, $genomeID, 2;
245 :         # We'll put the return data in here.
246 :         my @retVal;
247 :         # Loop until we hit a domain.
248 :         my $domainFlag;
249 :         while (! $domainFlag) {
250 :             # Get the data we need for this taxonomic group.
251 :             my ($taxonData) = $self->GetAll('TaxonomicGrouping IsInGroup',
252 :                                             'TaxonomicGrouping(id) = ?', [$taxon],
253 :                                             'domain scientific-name IsInGroup(to-link)');
254 :             # If we didn't find what we're looking for, then we have a problem. This
255 :             # would indicate a node below the domain level that doesn't have a parent
256 :             # or (more likely) an invalid input string.
257 :             if (! $taxonData) {
258 :                 # Terminate the loop and trace a warning.
259 :                 $domainFlag = 1;
260 :                 Trace("Could not find node or parent for \"$taxon\".") if T(1);
261 :             } else {
262 :                 # Extract the data for the current group. Note we overwrite our
263 :                 # taxonomy ID with the ID of our parent, priming the next iteration
264 :                 # of the loop.
265 :                 my $name;
266 :                 ($domainFlag, $name, $taxon) = @$taxonData;
267 :                 # Put the current group's name in the return list.
268 :                 unshift @retVal, $name;
269 :             }
270 :         }
271 :         # Return the result.
272 :         return @retVal;
273 :     }
274 :    
275 :    
276 : parrello 1.1 =head3 GenomeHash
277 :    
278 :     my $genomeHash = $sap->GenomeHash();
279 :    
280 :     Return a hash of the genomes configured to be in this database. The list
281 :     is either taken from the active SEED database or from a configuration
282 :     file in the data directory. The hash maps genome IDs to TRUE.
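
When the list comes from the configuration file, the genome IDs arrive as one block of text. The sketch below shows how such text can be turned into a hash; the C<$genomesText> value is a made-up sample that uses IDs from the configuration example.

```perl
use strict;
use warnings;

# Text of a Genomes element, as it might come back from the XML parser,
# including the leading and trailing line breaks.
my $genomesText = "\n100226.1 31033.3 31964.1\n155864.1 349307.7\n";

# Split on any whitespace and drop the empty leading field.
my %genomeHash = map { $_ => 1 } grep { /\S/ } split /\s+/, $genomesText;

print join(" ", sort keys %genomeHash), "\n";
print scalar(keys %genomeHash), " genomes\n";
```

The C<grep> matters because splitting text that starts with whitespace produces an empty first field, which would otherwise become a bogus genome ID.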
283 :    
284 :     =cut
285 :    
286 :     sub GenomeHash {
287 :         # Get the parameters.
288 :         my ($self) = @_;
289 :         # We'll build the hash in here.
290 :         my %genomeHash;
291 :         # Do we already have a list?
292 :         if (! defined $self->{genomeHash}) {
293 :             # No, check for a configuration file.
294 :             my $xml = $self->ReadConfigFile();
295 :             if (defined $xml && $xml->{Genomes}) {
296 :                 # We found one and it has a genome list, so extract the genomes.
297 :                 %genomeHash = map { $_ => 1 } grep { $_ =~ /\S/ } split /\s+/, $xml->{Genomes};
298 :             } else {
299 :                 # No, so get the list of complete genomes from the source object.
300 :                 my $fig = $self->GetSourceObject();
301 :                 my @genomes = $fig->genomes(1);
302 :                 # Verify the genome list to ensure every genome has an organism
303 :                 # directory.
304 :                 for my $genome (@genomes) {
305 :                     if (-d "$FIG_Config::organisms/$genome") {
306 :                         $genomeHash{$genome} = 1;
307 :                     }
308 :                 }
309 :             }
310 :             # Store the genomes in this object.
311 :             $self->{genomeHash} = \%genomeHash;
312 :         }
313 :         # Return the result.
314 :         return $self->{genomeHash};
315 :     }
316 :    
317 :     =head3 SubsystemID
318 :    
319 :     my $subID = $sap->SubsystemID($subName);
320 :    
321 :     Return the ID of the subsystem with the specified name.
322 :    
323 :     =over 4
324 :    
325 :     =item subName
326 :    
327 :     Name of the relevant subsystem. A subsystem name with underscores for spaces
328 :     will return the same ID as a subsystem name with the spaces still in it.
329 :    
330 :     =item RETURN
331 :    
332 : parrello 1.4 Returns a normalized subsystem name.
333 : parrello 1.1
334 :     =back
335 :    
336 :     =cut
337 :    
338 :     sub SubsystemID {
339 :         # Get the parameters.
340 :         my ($self, $subName) = @_;
341 : parrello 1.4     # Normalize the subsystem name by converting underscores to spaces.
342 :         my $retVal = $subName;
343 :         $retVal =~ s/_/ /g;
344 : parrello 1.1     # Return the result.
345 :         return $retVal;
346 :     }
347 :    
348 :     =head3 SubsystemHash
349 :    
350 :     my $subHash = $sap->SubsystemHash();
351 :    
352 :     Return a hash of the subsystems configured to be in this database. The
353 :     list is either taken from the active SEED database or from a
354 :     configuration file in the data directory. The hash maps subsystem names
355 :     to TRUE.
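
Reading a one-name-per-line subsystem list can be sketched as follows. The C<$subsText> sample is made up, and folding underscores to spaces at this point is an assumption borrowed from L</SubsystemID>; the method itself may store the names exactly as written in the file.

```perl
use strict;
use warnings;

# Text of a Subsystems element: one name per line, with stray
# indentation and a blank line.
my $subsText = "\n  Coenzyme_F420_hydrogenase\n\n  prophage_tails  \n";

my %subHash;
for my $line (split /\n/, $subsText) {
    # Trim the line and skip it if nothing is left.
    next unless $line =~ /^\s*(\S.*?)\s*$/;
    my $name = $1;
    # Assumption: fold underscores to spaces, as SubsystemID does.
    $name =~ s/_/ /g;
    $subHash{$name} = 1;
}

print "[$_]\n" for sort keys %subHash;
```

Testing the match before using the capture keeps blank lines from contributing a name.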
356 :    
357 :     =cut
358 :    
359 :     sub SubsystemHash {
360 :         # Get the parameters.
361 :         my ($self) = @_;
362 :         # We'll build the hash in here.
363 :         my %subHash;
364 :         # Do we already have a list?
365 :         if (! defined $self->{subHash}) {
366 :             # No, check for a configuration file.
367 :             my $xml = $self->ReadConfigFile();
368 :             if (defined $xml && $xml->{Subsystems}) {
369 :                 # We found one, and it has subsystems, so we extract them.
370 :                 # Trim each line, and return the capture only when the match
371 :                 # succeeds so that a blank line can never pick up a stale $1.
372 :                 my @subs = map { /^\s*(\S.*?)\s*$/ ? $1 : () } split /\n/, $xml->{Subsystems};
373 :                 # Blank lines have already been dropped, so every name goes in the hash.
374 :                 %subHash = map { $_ => 1 } @subs;
375 :             } else {
376 :                 # No config file, so we ask the FIG object.
377 :                 my $fig = $self->GetSourceObject();
378 : parrello 1.4         my @subs = map { $self->SubsystemID($_) } $fig->all_subsystems();
379 : parrello 1.1         %subHash = map { $_ => 1 } grep { $fig->usable_subsystem($_) } @subs;
380 :             }
381 :             # Store the subsystems in this object.
382 :             $self->{subHash} = \%subHash;
383 :         }
384 :         # Return the result.
385 :         return $self->{subHash};
386 :     }
387 :    
388 :     =head3 TuningParameter
389 :    
390 :     my $parm = $erdb->TuningParameter($parmName);
391 :    
392 :     Return the value of the specified tuning parameter. Tuning parameters are
393 :     read from the XML configuration file.
394 :    
395 :     =over 4
396 :    
397 :     =item parmName
398 :    
399 :     Name of the parameter whose value is desired.
400 :    
401 :     =item RETURN
402 :    
403 :     Returns the parameter value.
404 :    
405 :     =back
406 :    
407 :     =cut
408 :    
409 :     sub TuningParameter {
410 :         # Get the parameters.
411 :         my ($self, $parmName) = @_;
412 :         # Ensure we have the parameters in memory.
413 :         if (! defined $self->{tuning}) {
414 :             # Read the configuration file.
415 :             my $configFile = $self->ReadConfigFile();
416 :             # Get the tuning parameters (if any).
417 :             my $tuning;
418 :             if (! defined $configFile || ! exists $configFile->{TuningParameters}) {
419 :                 $tuning = {};
420 :             } else {
421 :                 $tuning = $configFile->{TuningParameters};
422 :             }
423 :             # Merge in the default option values.
424 :             Tracer::MergeOptions($tuning, TUNING_DEFAULTS);
425 :             # Save the result in our object.
426 :             $self->{tuning} = $tuning;
427 :         }
428 :         # Extract the tuning parameter.
429 :         my $retVal = $self->{tuning}{$parmName};
430 :         # Throw an error if it does not exist.
431 :         Confess("Invalid tuning parameter \"$parmName\".") if ! defined $retVal;
432 :         # Return the result.
433 :         return $retVal;
434 :     }
435 :    
436 :    
437 :     =head3 ReadConfigFile
438 :    
439 :     my $xmlObject = $sap->ReadConfigFile();
440 :    
441 :     Return the hash structure created from reading the configuration file, or
442 :     an undefined value if the file is not found.
443 :    
444 :     =cut
445 :    
446 :     sub ReadConfigFile {
447 :         my ($self) = @_;
448 :         # Declare the return variable.
449 :         my $retVal;
450 :         # Compute the configuration file name.
451 :         my $fileName = "$self->{loadDirectory}/SaplingConfig.xml";
452 :         # Did we find it?
453 :         if (-f $fileName) {
454 :             # Yes, read it in.
455 :             $retVal = XMLin($fileName);
456 :         }
457 :         # Return the result.
458 :         return $retVal;
459 :     }
460 :    
461 :     =head3 GlobalSection
462 :    
463 :     my $flag = $sap->GlobalSection($name);
464 :    
465 :     Return TRUE if the specified section name is the global section, FALSE
466 :     otherwise.
467 :    
468 :     =over 4
469 :    
470 :     =item name
471 :    
472 :     Section name to test.
473 :    
474 :     =item RETURN
475 :    
476 : parrello 1.4 Returns TRUE if the parameter matches the GLOBAL constant, else FALSE.
477 : parrello 1.1
478 :     =back
479 :    
480 :     =cut
481 :    
482 :     sub GlobalSection {
483 :         # Get the parameters.
484 :         my ($self, $name) = @_;
485 :         # Return the result.
486 :         return ($name eq GLOBAL);
487 :     }
488 :    
489 :    
490 :     =head2 Virtual Methods
491 :    
492 :     =head3 GetSourceObject
493 :    
494 :     my $source = $erdb->GetSourceObject();
495 :    
496 :     Return the object to be used in creating load files for this database. This is
497 :     only the default source object. Loaders have the option of overriding the chosen
498 :     source object when constructing the [[ERDBLoadGroupPm]] objects.
499 :    
500 :     =cut
501 :    
502 :     sub GetSourceObject {
503 :         my ($self) = @_;
504 :         # Ensure the source object exists in our internal cache.
505 :         if (! defined $self->{source}) {
506 :             # We require the FIG object. If the user has no intention of
507 :             # doing a load, this method won't be used, so he won't need to
508 :             # have the FIG object on his system.
509 :             require FIG;
510 :             $self->{source} = FIG->new();
511 :         }
512 :         # Return it to the caller.
513 :         return $self->{source};
514 :     }
515 :    
516 :     =head3 SectionList
517 :    
518 :     my @sections = $erdb->SectionList();
519 :    
520 :     Return a list of the names for the different data sections used when loading this database.
521 :     For Sapling, there is one section per genome (named by the genome ID, in sorted order),
522 :     followed by the global section.
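
A minimal sketch of the section list that the Sapling override builds is shown below. The C<$genomes> hash is a made-up sample standing in for the result of L</GenomeHash>.

```perl
use strict;
use warnings;

use constant GLOBAL => 'Globals';

# Hypothetical genome hash, as GenomeHash might return it.
my $genomes = { '31033.3' => 1, '100226.1' => 1, '155864.1' => 1 };

# One section per genome in sorted order, then the global section last.
my @sections = sort keys %$genomes;
push @sections, GLOBAL;

# Prints: 100226.1 155864.1 31033.3 Globals
print join(" ", @sections), "\n";
```

Note that the sort is a string sort, so C<155864.1> comes before C<31033.3>.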
523 :    
524 :     =cut
525 :    
526 :     sub SectionList {
527 :         # Get the parameters.
528 :         my ($self) = @_;
529 :         # Get the genome hash.
530 :         my $genomes = $self->GenomeHash();
531 :         # Create one section per genome.
532 :         my @retVal = sort keys %$genomes;
533 :         # Append the global section.
534 :         push @retVal, GLOBAL;
535 :         # Return the section list.
536 :         return @retVal;
537 :     }
538 :    
539 :     =head3 Loader
540 :    
541 :     my $groupLoader = $erdb->Loader($groupName, $source, $options);
542 :    
543 :     Return an [[ERDBLoadGroupPm]] object for the specified load group. This method is used
544 :     by [[ERDBGeneratorPl]] to create the load group objects. If you are not using
545 :     [[ERDBGeneratorPl]], you don't need to override this method.
546 :    
547 :     =over 4
548 :    
549 :     =item groupName
550 :    
551 :     Name of the load group whose object is to be returned. The group name is
552 :     guaranteed to be a single word with only the first letter capitalized.
553 :    
554 :     =item source
555 :    
556 :     The source object used to access the data from which the load file is derived. This
557 :     is the same object returned by L</GetSourceObject>; however, we allow the caller to pass
558 :     it in as a parameter so that we don't end up creating multiple copies of a potentially
559 :     expensive data structure. It is permissible for this value to be undefined, in which
560 :     case the source will be retrieved the first time the client asks for it.
561 :    
562 :     =item options
563 :    
564 :     Reference to a hash of command-line options.
565 :    
566 :     =item RETURN
567 :    
568 :     Returns an [[ERDBLoadGroupPm]] object that can be used to process the specified load group
569 :     for this database.
570 :    
571 :     =back
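
The idea of deriving a loader class from the group name can be sketched as follows. The C<GenomeSaplingLoader> package and the C<make_loader> function here are hypothetical stand-ins defined inline so that the example runs on its own; real loader classes live in their own module files and take whatever constructor arguments they define.

```perl
use strict;
use warnings;

# Hypothetical loader class, defined inline so the sketch runs without
# a separate GenomeSaplingLoader.pm file on disk.
package GenomeSaplingLoader;
sub new {
    my ($class, $db, $options) = @_;
    return bless { db => $db, options => $options }, $class;
}

package main;

sub make_loader {
    my ($db, $groupName, $options) = @_;
    # Compute the class name from the group name.
    my $loaderClass = "${groupName}SaplingLoader";
    # Load the class file unless the class is already defined.
    if (! $loaderClass->can('new')) {
        (my $file = "$loaderClass.pm") =~ s{::}{/}g;
        require $file;
    }
    # A string holding a class name can be used directly as the invocant.
    return $loaderClass->new($db, $options);
}

my $loader = make_loader({}, 'Genome', { trace => 2 });
# Prints: GenomeSaplingLoader
print ref($loader), "\n";
```

Calling the constructor through the class-name string avoids building and evaluating a string of Perl code.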
572 :    
573 :     =cut
574 :    
575 :     sub Loader {
576 :         # Get the parameters.
577 :         my ($self, $groupName, $options) = @_;
578 :         # Compute the loader name.
579 :         my $loaderClass = "${groupName}SaplingLoader";
580 :         # Pull in its definition.
581 :         require "$loaderClass.pm";
582 :         # Create an object for it, trapping any error from the constructor.
583 :         my $retVal = eval { $loaderClass->new($self, $options) };
584 :         # Ensure it worked.
585 :         Confess("Could not create $loaderClass object: $@") if $@;
586 :         # Return it to the caller.
587 :         return $retVal;
588 :     }
589 :    
590 :     =head3 LoadGroupList
591 :    
592 :     my @groups = $erdb->LoadGroupList();
593 :    
594 :     Returns a list of the names for this database's load groups. This method is used
595 :     by [[ERDBGeneratorPl]] when the user wishes to load all table groups. For Sapling,
596 :     the groups are C<Genome>, C<Feature>, C<Subsystem>, C<Family>, and C<Scenario>.
597 :    
598 :     =cut
599 :    
600 :     sub LoadGroupList {
601 :         # Return the list.
602 : parrello 1.4     return qw(Genome Feature Subsystem Family Scenario); # ##TODO Model, Drug, Protein
603 : parrello 1.1 }
604 :    
605 :     =head3 LoadDirectory
606 :    
607 :     my $dirName = $erdb->LoadDirectory();
608 :    
609 :     Return the name of the directory in which load files are kept. For Sapling, this
610 :     is the C<loadDirectory> value established by the constructor.
611 :    
612 :     =cut
613 :    
614 :     sub LoadDirectory {
615 :         # Get the parameters.
616 :         my ($self) = @_;
617 :         # Return the directory name.
618 :         return $self->{loadDirectory};
619 :     }
620 :    
621 :    
622 :     1;
