Annotation of /Sprout/Sapling.pm

Revision 1.2

1 : parrello 1.1 #!/usr/bin/perl -w
2 :    
3 :     #
4 :     # Copyright (c) 2003-2006 University of Chicago and Fellowship
5 :     # for Interpretations of Genomes. All Rights Reserved.
6 :     #
7 :     # This file is part of the SEED Toolkit.
8 :     #
9 :     # The SEED Toolkit is free software. You can redistribute
10 :     # it and/or modify it under the terms of the SEED Toolkit
11 :     # Public License.
12 :     #
13 :     # You should have received a copy of the SEED Toolkit Public License
14 :     # along with this program; if not write to the University of Chicago
15 :     # at info@ci.uchicago.edu or the Fellowship for Interpretation of
16 :     # Genomes at veronika@thefig.info or download a copy from
17 :     # http://www.theseed.org/LICENSE.TXT.
18 :     #
19 :    
20 :     package Sapling;
21 :    
22 :     use strict;
23 :     use Tracer;
24 :     use DBKernel;
25 :     use base 'ERDB';
26 :     use Stats;
27 :     use XML::Simple;
28 :    
29 :     =head1 Sapling Package
30 :    
31 :     Sapling Database Access Methods
32 :    
33 :     =head2 Introduction
34 :    
35 :     The Sapling database is a new [[ErdbPm]] database that attempts to encapsulate
36 :     our data in a portable form for distribution. It is loaded directly from the
37 :     complete genomes and trusted subsystems of the SEED. This object has minimal
38 :     capabilities: in essence, it's just enough to get the database loaded and
39 :     working. As with the earlier Sprout database, most of the work required to use
40 :     the database can be performed using the base-class methods.
41 :    
42 :     The fields in this object are as follows.
43 :    
44 :     =over 4
45 :    
46 :     =item loadDirectory
47 :    
48 :     Name of the directory containing the files used by the loaders.
49 :    
50 :     =item source
51 :    
52 :     Source object for the loaders (a [[FigPm]] in our case).
53 :    
54 :     =item genomeHash
55 :    
56 :     Reference to a hash of the genomes to include when loading.
57 :    
58 :     =item subHash
59 :    
60 :     Reference to a hash of the subsystems to include when loading.
61 :    
62 :     =item tuning
63 :    
64 :     Reference to a hash of tuning parameters.
65 :    
66 :     =back
67 :    
68 :     =head2 Configuration
69 :    
70 :     The default loading profile for the Sapling database is to include all complete
71 :     genomes and all usable subsystems. This can be overridden by specifying a list of
72 :     genomes and subsystems in an XML configuration file. The file name should be
73 :     C<SaplingConfig.xml> in the specified data directory. The document element should
74 :     be C<Sapling>, and it has two sub-elements. The C<Genomes> element should contain as
75 :     its text a space-delimited list of genome IDs. The C<Subsystems> element should contain
76 :     a list of subsystem names, one per line. If a particular section is missing, the
77 :     default list will be used.
78 :    
79 :     =head3 Example
80 :    
81 :     The following configuration file specifies 10 genomes and 6 subsystems.
82 :    
83 :     <Sapling>
84 :     <Genomes>
85 :     100226.1 31033.3 31964.1 36873.1 126740.4
86 :     155864.1 349307.7 350058.5 351348.5 412694.5
87 :     </Genomes>
88 :     <Subsystems>
89 :     Sugar_utilization_in_Thermotogales
90 :     Coenzyme_F420_hydrogenase
91 :     Ribosome_activity_modulation
92 :     prophage_tails
93 :     CBSS-393130.3.peg.794
94 :     Apigenin_derivatives
95 :     </Subsystems>
96 :     </Sapling>
97 :    
98 :     The XML file also contains tuning parameters that affect the way the data
99 :     is loaded. These are specified as attributes in the TuningParameters element,
100 :     as follows.
101 :    
102 :     =over 4
103 :    
104 :     =item maxLocationLength
105 :    
106 :     The maximum number of base pairs allowed in a single location. B<IsLocatedIn>
107 :     records are split into sections of at most this length. As a result, when you
108 :     are looking for all the features in a particular neighborhood, you only need to
109 :     look for locations within the maximum location length of the neighborhood.
110 :     Even a huge operon that contains tens of thousands of base pairs will still
111 :     be found.
112 :    
113 :     =back
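
The splitting described for C<maxLocationLength> can be illustrated with a
minimal sketch. This is not the loader's actual code: the function name
C<SplitLocation> and the (start, length) representation of a forward-strand
location are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: split a location that starts at $start and covers
# $len base pairs into pieces no longer than $maxLen base pairs.
sub SplitLocation {
    my ($start, $len, $maxLen) = @_;
    my @pieces;
    while ($len > 0) {
        # Each piece is as long as allowed, except possibly the last.
        my $pieceLen = ($len > $maxLen ? $maxLen : $len);
        push @pieces, [$start, $pieceLen];
        $start += $pieceLen;
        $len -= $pieceLen;
    }
    return @pieces;
}

# A 10000-bp location with the default limit of 4000 yields three pieces.
my @pieces = SplitLocation(1, 10000, 4000);
print join(", ", map { "$_->[0]+$_->[1]" } @pieces), "\n";
```

Because no piece exceeds the limit, a search that widens the neighborhood by
the limit on each side cannot miss a piece that overlaps the neighborhood.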
114 :    
115 :     =head2 Special Methods
116 :    
117 :     =head3 Global Section Constant
118 :    
119 :     Each section of the database used by the loader corresponds to a single genome.
120 :     The global section is loaded after all the others, and is concerned with data
121 :     not related to a particular genome.
122 :    
123 :     =cut
124 :    
125 :     # Name of the global section
126 :     use constant GLOBAL => 'Globals';
127 :    
128 :     =head3 Tuning Parameter Defaults
129 :    
130 :     Each tuning parameter must have a default value, in case it is not present in
131 :     the XML configuration file. The defaults are specified in a constant hash
132 :     reference called C<TUNING_DEFAULTS>.
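
As a rough sketch of how such defaults might be applied, the example below
fills in any parameter the configuration did not supply. The helper
C<ApplyDefaults> is hypothetical; it only approximates what the merge step is
assumed to do and is not the Tracer implementation.

```perl
use strict;
use warnings;

use constant TUNING_DEFAULTS => { maxLocationLength => 4000 };

# Hypothetical stand-in for the merge step: copy each default into the
# tuning hash unless the configuration already supplied a value.
sub ApplyDefaults {
    my ($tuning, $defaults) = @_;
    for my $key (keys %$defaults) {
        $tuning->{$key} = $defaults->{$key} if ! defined $tuning->{$key};
    }
    return $tuning;
}

# A value from the configuration wins; a missing value falls back.
my $fromConfig = ApplyDefaults({ maxLocationLength => 2500 }, TUNING_DEFAULTS);
my $noConfig = ApplyDefaults({}, TUNING_DEFAULTS);
print "$fromConfig->{maxLocationLength} $noConfig->{maxLocationLength}\n";
```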
133 :    
134 :     =cut
135 :    
136 :     use constant TUNING_DEFAULTS => {
137 :     maxLocationLength => 4000
138 :     };
139 :    
140 :     =head3 new
141 :    
142 :     my $sap = Sapling->new(%options);
143 :    
144 :     Construct a new Sapling object. The following options are supported.
145 :    
146 :     =over 4
147 :    
148 :     =item loadDirectory
149 :    
150 :     Data directory to be used by the loaders.
151 :    
152 :     =item dbd
153 :    
154 :     XML database definition file.
155 :    
156 :     =item dbName
157 :    
158 :     Name of the database to use.
159 :    
160 :     =item sock
161 :    
162 :     Socket for accessing the database.
163 :    
164 :     =item userData
165 :    
166 :     Name and password used to log on to the database, separated by a slash.
167 :    
168 :     =item dbhost
169 :    
170 :     Database host name.
171 :    
172 :     =back
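
The C<userData> format can be illustrated with a small sketch that mirrors the
split performed in the constructor. The function name C<ParseUserData> and the
sample strings are invented for the example.

```perl
use strict;
use warnings;

# Split a "name/password" string the same way the constructor does.
sub ParseUserData {
    my ($userData) = @_;
    # The limit of 2 keeps any later slashes inside the password.
    my ($user, $pass) = split '/', $userData, 2;
    $pass = "" if ! defined $pass;
    return ($user, $pass);
}

for my $userData ("root/", "seed/pa/ss", "guest") {
    my ($user, $pass) = ParseUserData($userData);
    print "user=$user pass='$pass'\n";
}
```

A string with no slash at all is treated as a user name with an empty
password.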
173 :    
174 :     =cut
175 :    
176 :     sub new {
177 :     # Get the parameters.
178 :     my ($class, %options) = @_;
179 :     # Get the options.
180 :     my $loadDirectory = $options{loadDirectory} || $FIG_Config::saplingData;
181 :     my $dbd = $options{dbd} || "$FIG_Config::fig/SaplingDBD.xml";
182 :     my $dbName = $options{dbName} || "nmpdr_sapling";
183 :     my $sock = $options{sock} || "$FIG_Config::sproutSock";
184 :     my $userData = $options{userData} || "root/";
185 :     my $dbhost = $options{dbhost} || "localhost";
186 :     # Compute the user name and password.
187 :     my ($user, $pass) = split '/', $userData, 2;
188 :     $pass = "" if ! defined $pass;
189 :     # Connect to the database.
190 :     my $dbh = DBKernel->new('mysql', $dbName, $user, $pass, 3306, $dbhost, $sock);
191 :     # Create the ERDB object.
192 : parrello 1.2 my $retVal = ERDB::new($class, $dbh, $dbd, %options);
193 : parrello 1.1 # Add the load directory pointer.
194 :     $retVal->{loadDirectory} = $loadDirectory;
195 :     # Set up the spaces for the loader source object, the subsystem hash, the
196 :     # genome hash, and the tuning parameters.
197 :     $retVal->{source} = undef;
198 :     $retVal->{genomeHash} = undef;
199 :     $retVal->{subHash} = undef;
200 :     $retVal->{tuning} = undef;
201 :     # Return it.
202 :     return $retVal;
203 :     }
204 :    
205 :    
206 :     =head2 Public Methods
207 :    
208 :     =head3 GenomeHash
209 :    
210 :     my $genomeHash = $sap->GenomeHash();
211 :    
212 :     Return a hash of the genomes configured to be in this database. The list
213 :     is either taken from the active SEED database or from a configuration
214 :     file in the data directory. The hash maps genome IDs to TRUE.
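
A minimal sketch of the configuration-file case follows. It assumes the text
of the C<Genomes> element arrives as a single string with its original line
breaks; the sample text is invented for the example.

```perl
use strict;
use warnings;

# Text of a hypothetical <Genomes> element, including the line breaks
# and leading whitespace an XML parser would typically preserve.
my $genomeText = "\n  100226.1 31033.3 31964.1\n  155864.1 349307.7\n";

# Split on whitespace and discard the empty leading field.
my %genomeHash = map { $_ => 1 } grep { /\S/ } split /\s+/, $genomeText;

print join(" ", sort keys %genomeHash), "\n";
```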
215 :    
216 :     =cut
217 :    
218 :     sub GenomeHash {
219 :     # Get the parameters.
220 :     my ($self) = @_;
221 :     # We'll build the hash in here.
222 :     my %genomeHash;
223 :     # Do we already have a list?
224 :     if (! defined $self->{genomeHash}) {
225 :     # No, check for a configuration file.
226 :     my $xml = $self->ReadConfigFile();
227 :     if (defined $xml && $xml->{Genomes}) {
228 :     # We found one and it has a genome list, so extract the genomes.
229 :     %genomeHash = map { $_ => 1 } grep { $_ =~ /\S/ } split /\s+/, $xml->{Genomes};
230 :     } else {
231 :     # No, so get the genome list.
232 :     my $fig = $self->GetSourceObject();
233 :     my @genomes = $fig->genomes(1);
234 :     # Verify the genome list to ensure every genome has an organism
235 :     # directory.
236 :     for my $genome (@genomes) {
237 :     if (-d "$FIG_Config::organisms/$genome") {
238 :     $genomeHash{$genome} = 1;
239 :     }
240 :     }
241 :     }
242 :     # Store the genomes in this object.
243 :     $self->{genomeHash} = \%genomeHash;
244 :     }
245 :     # Return the result.
246 :     return $self->{genomeHash};
247 :     }
248 :    
249 :     =head3 SubsystemID
250 :    
251 :     my $subID = $sap->SubsystemID($subName);
252 :    
253 :     Return the ID of the subsystem with the specified name.
254 :    
255 :     =over 4
256 :    
257 :     =item subName
258 :    
259 :     Name of the relevant subsystem. A subsystem name with underscores for spaces
260 :     will return the same ID as a subsystem name with the spaces still in it.
261 :    
262 :     =item RETURN
263 :    
264 :     Returns an MD5 hash of the normalized subsystem name.
265 :    
266 :     =back
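
The normalization can be sketched as follows. The hex MD5 digest from
C<Digest::MD5> is only a stand-in for C<ERDB::DigestKey>, whose exact output
format is not shown here, and the names C<NormalizeSubsystemName> and
C<SketchSubsystemID> are hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Normalize as SubsystemID does: lower-case, whitespace runs to underscores.
sub NormalizeSubsystemName {
    my ($subName) = @_;
    my $retVal = lc $subName;
    $retVal =~ s/\s+/_/g;
    return $retVal;
}

# Hypothetical stand-in for ERDB::DigestKey, using a hex MD5 digest.
sub SketchSubsystemID {
    my ($subName) = @_;
    return md5_hex(NormalizeSubsystemName($subName));
}

# The spaced and underscored forms of a name produce the same ID.
my $id1 = SketchSubsystemID("Ribosome activity modulation");
my $id2 = SketchSubsystemID("Ribosome_activity_modulation");
print(($id1 eq $id2 ? "same" : "different"), "\n");
```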
267 :    
268 :     =cut
269 :    
270 :     sub SubsystemID {
271 :     # Get the parameters.
272 :     my ($self, $subName) = @_;
273 :     # Normalize the subsystem name. Spaces are converted to underscores,
274 :     # and all letters are lower-cased.
275 :     my $subNormalized = lc $subName;
276 :     $subNormalized =~ s/\s+/_/g;
277 :     # Compute a hash of the normalized name.
278 :     my $retVal = ERDB::DigestKey($subNormalized);
279 :     # Return the result.
280 :     return $retVal;
281 :     }
282 :    
283 :     =head3 SubsystemHash
284 :    
285 :     my $subHash = $sap->SubsystemHash();
286 :    
287 :     Return a hash of the subsystems configured to be in this database. The
288 :     list is either taken from the active SEED database or from a
289 :     configuration file in the data directory. The hash maps subsystem names
290 :     to TRUE.
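
A minimal sketch of the configuration-file case follows. It assumes the text
of the C<Subsystems> element arrives as a single string with one name per
line; the sample text is invented for the example.

```perl
use strict;
use warnings;

# Text of a hypothetical <Subsystems> element: one name per line, with
# indentation and a blank line that must be ignored.
my $subText = "\n  Coenzyme_F420_hydrogenase\n\n  prophage_tails  \n";

# Trim each line and keep only the lines that still contain a name.
my %subHash;
for my $line (split /\n/, $subText) {
    if ($line =~ /^\s*(\S(?:.*\S)?)\s*$/) {
        $subHash{$1} = 1;
    }
}

print join(", ", sort keys %subHash), "\n";
```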
291 :    
292 :     =cut
293 :    
294 :     sub SubsystemHash {
295 :     # Get the parameters.
296 :     my ($self) = @_;
297 :     # We'll build the hash in here.
298 :     my %subHash;
299 :     # Do we already have a list?
300 :     if (! defined $self->{subHash}) {
301 :     # No, check for a configuration file.
302 :     my $xml = $self->ReadConfigFile();
303 :     if (defined $xml && $xml->{Subsystems}) {
304 :     # We found one, and it has subsystems, so we extract them.
305 :     # A little dancing is necessary to trim spaces.
306 :     my @subs = map { /^\s*(\S(?:.*\S)?)\s*$/ ? $1 : () } split /\n/, $xml->{Subsystems};
307 :     # Here we need to clear out any null subsystem names resulting from
308 :     # blank lines in the file.
309 :     %subHash = map { $_ => 1 } grep { $_ } @subs;
310 :     } else {
311 :     # No config file, so we ask the FIG object.
312 :     my $fig = $self->GetSourceObject();
313 :     my @subs = $fig->all_subsystems();
314 :     %subHash = map { $_ => 1 } grep { $fig->usable_subsystem($_) } @subs;
315 :     }
316 :     # Store the subsystems in this object.
317 :     $self->{subHash} = \%subHash;
318 :     }
319 :     # Return the result.
320 :     return $self->{subHash};
321 :     }
322 :    
323 :     =head3 TuningParameter
324 :    
325 :     my $parm = $sap->TuningParameter($parmName);
326 :    
327 :     Return the value of the specified tuning parameter. Tuning parameters are
328 :     read from the XML configuration file.
329 :    
330 :     =over 4
331 :    
332 :     =item parmName
333 :    
334 :     Name of the parameter whose value is desired.
335 :    
336 :     =item RETURN
337 :    
338 :     Returns the parameter value.
339 :    
340 :     =back
341 :    
342 :     =cut
343 :    
344 :     sub TuningParameter {
345 :     # Get the parameters.
346 :     my ($self, $parmName) = @_;
347 :     # Ensure we have the parameters in memory.
348 :     if (! defined $self->{tuning}) {
349 :     # Read the configuration file.
350 :     my $configFile = $self->ReadConfigFile();
351 :     # Get the tuning parameters (if any).
352 :     my $tuning;
353 :     if (! defined $configFile || ! exists $configFile->{TuningParameters}) {
354 :     $tuning = {};
355 :     } else {
356 :     $tuning = $configFile->{TuningParameters};
357 :     }
358 :     # Merge in the default option values.
359 :     Tracer::MergeOptions($tuning, TUNING_DEFAULTS);
360 :     # Save the result in our object.
361 :     $self->{tuning} = $tuning;
362 :     }
363 :     # Extract the tuning parameter.
364 :     my $retVal = $self->{tuning}{$parmName};
365 :     # Throw an error if it does not exist.
366 :     Confess("Invalid tuning parameter \"$parmName\".") if ! defined $retVal;
367 :     # Return the result.
368 :     return $retVal;
369 :     }
370 :    
371 :    
372 :     =head3 ReadConfigFile
373 :    
374 :     my $xmlObject = $sap->ReadConfigFile();
375 :    
376 :     Return the hash structure created from reading the configuration file, or
377 :     an undefined value if the file is not found.
378 :    
379 :     =cut
380 :    
381 :     sub ReadConfigFile {
382 :     my ($self) = @_;
383 :     # Declare the return variable.
384 :     my $retVal;
385 :     # Compute the configuration file name.
386 :     my $fileName = "$self->{loadDirectory}/SaplingConfig.xml";
387 :     # Did we find it?
388 :     if (-f $fileName) {
389 :     # Yes, read it in.
390 :     $retVal = XMLin($fileName);
391 :     }
392 :     # Return the result.
393 :     return $retVal;
394 :     }
395 :    
396 :     =head3 GlobalSection
397 :    
398 :     my $flag = $sap->GlobalSection($name);
399 :    
400 :     Return TRUE if the specified section name is the global section, FALSE
401 :     otherwise.
402 :    
403 :     =over 4
404 :    
405 :     =item name
406 :    
407 :     Section name to test.
408 :    
409 :     =item RETURN
410 :    
411 :     Returns TRUE if the parameter matches the GLOBAL constant, else FALSE.
412 :    
413 :     =back
414 :    
415 :     =cut
416 :    
417 :     sub GlobalSection {
418 :     # Get the parameters.
419 :     my ($self, $name) = @_;
420 :     # Return the result.
421 :     return ($name eq GLOBAL);
422 :     }
423 :    
424 :    
425 :     =head2 Virtual Methods
426 :    
427 :     =head3 GetSourceObject
428 :    
429 :     my $source = $erdb->GetSourceObject();
430 :    
431 :     Return the object to be used in creating load files for this database. This is
432 :     only the default source object. Loaders have the option of overriding the chosen
433 :     source object when constructing the [[ERDBLoadGroupPm]] objects.
434 :    
435 :     =cut
436 :    
437 :     sub GetSourceObject {
438 :     my ($self) = @_;
439 :     # Ensure the source object exists in our internal cache.
440 :     if (! defined $self->{source}) {
441 :     # We require the FIG object. If the user has no intention of
442 :     # doing a load, this method won't be used, so he won't need to
443 :     # have the FIG object on his system.
444 :     require FIG;
445 :     $self->{source} = FIG->new();
446 :     }
447 :     # Return it to the caller.
448 :     return $self->{source};
449 :     }
450 :    
451 :     =head3 SectionList
452 :    
453 :     my @sections = $erdb->SectionList();
454 :    
455 :     Return a list of the names for the different data sections used when loading this database.
456 :     The base-class default is a single string, meaning one section for the entire database. This
457 :     override returns one section per genome, in sorted order, followed by the global section.
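
The section list built by this method can be sketched as follows, using an
invented two-genome hash in place of the real result of L</GenomeHash>.

```perl
use strict;
use warnings;
use constant GLOBAL => 'Globals';

# Hypothetical genome hash; the real one comes from GenomeHash().
my %genomeHash = ('31033.3' => 1, '100226.1' => 1);

# One section per genome in sorted order, then the global section last.
my @sections = (sort(keys %genomeHash), GLOBAL);
print join(" ", @sections), "\n";
```

Note that the sort is a string sort, so genome IDs are ordered by their
characters and not by numeric value.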
458 :    
459 :     =cut
460 :    
461 :     sub SectionList {
462 :     # Get the parameters.
463 :     my ($self) = @_;
464 :     # Get the genome hash.
465 :     my $genomes = $self->GenomeHash();
466 :     # Create one section per genome.
467 :     my @retVal = sort keys %$genomes;
468 :     # Append the global section.
469 :     push @retVal, GLOBAL;
470 :     # Return the section list.
471 :     return @retVal;
472 :     }
473 :    
474 :     =head3 Loader
475 :    
476 :     my $groupLoader = $erdb->Loader($groupName, $source, $options);
477 :    
478 :     Return an [[ERDBLoadGroupPm]] object for the specified load group. This method is used
479 :     by [[ERDBGeneratorPl]] to create the load group objects. If you are not using
480 :     [[ERDBGeneratorPl]], you don't need to override this method.
481 :    
482 :     =over 4
483 :    
484 :     =item groupName
485 :    
486 :     Name of the load group whose object is to be returned. The group name is
487 :     guaranteed to be a single word with only the first letter capitalized.
488 :    
489 :     =item source
490 :    
491 :     The source object used to access the data from which the load file is derived. This
492 :     is the same object returned by L</GetSourceObject>; however, we allow the caller to pass
493 :     it in as a parameter so that we don't end up creating multiple copies of a potentially
494 :     expensive data structure. It is permissible for this value to be undefined, in which
495 :     case the source will be retrieved the first time the client asks for it.
496 :    
497 :     =item options
498 :    
499 :     Reference to a hash of command-line options.
500 :    
501 :     =item RETURN
502 :    
503 :     Returns an [[ERDBLoadGroupPm]] object that can be used to process the specified load group
504 :     for this database.
505 :    
506 :     =back
507 :    
508 :     =cut
509 :    
510 :     sub Loader {
511 :     # Get the parameters.
512 :     my ($self, $groupName, $options) = @_;
513 :     # Compute the loader name.
514 :     my $loaderClass = "${groupName}SaplingLoader";
515 :     # Pull in its definition.
516 :     require "$loaderClass.pm";
517 :     # Create an object for it.
518 :     my $retVal = eval { $loaderClass->new($self, $options) };
519 :     # Ensure it worked.
520 :     Confess("Could not create $loaderClass object: $@") if $@;
521 :     # Return it to the caller.
522 :     return $retVal;
523 :     }
524 :    
525 :     =head3 LoadGroupList
526 :    
527 :     my @groups = $erdb->LoadGroupList();
528 :    
529 :     Returns a list of the names for this database's load groups. This method is used
530 :     by [[ERDBGeneratorPl]] when the user wishes to load all table groups. The base-class default
531 :     is a single group called 'All'; this override returns C<Genome>, C<Feature>, and C<Subsystem>.
532 :    
533 :     =cut
534 :    
535 :     sub LoadGroupList {
536 :     # Return the list.
537 :     return qw(Genome Feature Subsystem); ##TODO more sections
538 :     }
539 :    
540 :     =head3 LoadDirectory
541 :    
542 :     my $dirName = $erdb->LoadDirectory();
543 :    
544 :     Return the name of the directory in which load files are kept. The base-class default is
545 :     the FIG temporary directory, which is a poor choice but always exists; this override returns the load directory set up by the constructor.
546 :    
547 :     =cut
548 :    
549 :     sub LoadDirectory {
550 :     # Get the parameters.
551 :     my ($self) = @_;
552 :     # Return the directory name.
553 :     return $self->{loadDirectory};
554 :     }
555 :    
556 :    
557 :     1;
