[Bio] / Sprout / SproutLoad.pm Repository:
ViewVC logotype

Annotation of /Sprout/SproutLoad.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.82 - (view) (download) (as text)

1 : parrello 1.1 #!/usr/bin/perl -w
2 :    
3 :     package SproutLoad;
4 :    
5 :     use strict;
6 :     use Tracer;
7 :     use PageBuilder;
8 :     use ERDBLoad;
9 :     use FIG;
10 :     use Sprout;
11 :     use Stats;
12 :     use BasicLocation;
13 : parrello 1.18 use HTML;
14 : parrello 1.1
15 :     =head1 Sprout Load Methods
16 :    
17 :     =head2 Introduction
18 :    
19 :     This object contains the methods needed to copy data from the FIG data store to the
20 :     Sprout database. It makes heavy use of the ERDBLoad object to manage the load into
21 :     individual tables. The client can create an instance of this object and then
22 :     call methods for each group of tables to load. For example, the following code will
23 :     load the Genome- and Feature-related tables. (It is presumed the first command line
24 :     parameter contains the name of a file specifying the genomes.)
25 :    
26 :     my $fig = FIG->new();
27 :     my $sprout = SFXlate->new_sprout_only();
28 :     my $spl = SproutLoad->new($sprout, $fig, $ARGV[0]);
29 :     my $stats = $spl->LoadGenomeData();
30 :     $stats->Accumulate($spl->LoadFeatureData());
31 :     print $stats->Show();
32 :    
33 :     It is worth noting that the FIG object does not need to be a real one. Any object
34 :     that implements the FIG methods for data retrieval could be used. So, for example,
35 :     this object could be used to copy data from one Sprout database to another, or
36 :     from any FIG-compliant data story implemented in the future.
37 :    
38 :     To insure that this is possible, each time the FIG object is used, it will be via
39 :     a variable called C<$fig>. This makes it fairly straightforward to determine which
40 :     FIG methods are required to load the Sprout database.
41 :    
42 : parrello 1.5 This object creates the load files; however, the tables are not created until it
43 :     is time to actually do the load from the files into the target database.
44 :    
45 : parrello 1.1 =cut
46 :    
47 :     #: Constructor SproutLoad->new();
48 :    
49 :     =head2 Public Methods
50 :    
51 :     =head3 new
52 :    
53 : parrello 1.8 C<< my $spl = SproutLoad->new($sprout, $fig, $genomeFile, $subsysFile, $options); >>
54 : parrello 1.1
55 :     Construct a new Sprout Loader object, specifying the two participating databases and
56 :     the name of the files containing the list of genomes and subsystems to use.
57 :    
58 :     =over 4
59 :    
60 :     =item sprout
61 :    
62 :     Sprout object representing the target database. This also specifies the directory to
63 :     be used for creating the load files.
64 :    
65 :     =item fig
66 :    
67 :     FIG object representing the source data store from which the data is to be taken.
68 :    
69 :     =item genomeFile
70 :    
71 :     Either the name of the file containing the list of genomes to load or a reference to
72 :     a hash of genome IDs to access codes. If nothing is specified, all complete genomes
73 :     will be loaded and the access code will default to 1. The genome list is presumed
74 :     to be all-inclusive. In other words, all existing data in the target database will
75 :     be deleted and replaced with the data on the specified genes. If a file is specified,
76 :     it should contain one genome ID and access code per line, tab-separated.
77 :    
78 :     =item subsysFile
79 :    
80 :     Either the name of the file containing the list of trusted subsystems or a reference
81 : parrello 1.34 to a list of subsystem names. If nothing is specified, all NMPDR subsystems will be
82 :     considered trusted. (A subsystem is considered NMPDR if it has a file named C<NMPDR>
83 : parrello 1.76 in its data directory.) Only subsystem data related to the NMPDR subsystems is loaded.
84 : parrello 1.1
85 : parrello 1.8 =item options
86 :    
87 :     Reference to a hash of command-line options.
88 :    
89 : parrello 1.1 =back
90 :    
91 :     =cut
92 :    
93 :     sub new {
94 :     # Get the parameters.
95 : parrello 1.8 my ($class, $sprout, $fig, $genomeFile, $subsysFile, $options) = @_;
96 : parrello 1.35 # Create the genome hash.
97 :     my %genomes = ();
98 :     # We only need it if load-only is NOT specified.
99 :     if (! $options->{loadOnly}) {
100 :     if (! defined($genomeFile) || $genomeFile eq '') {
101 :     # Here we want all the complete genomes and an access code of 1.
102 :     my @genomeList = $fig->genomes(1);
103 :     %genomes = map { $_ => 1 } @genomeList;
104 :     } else {
105 :     my $type = ref $genomeFile;
106 :     Trace("Genome file parameter type is \"$type\".") if T(3);
107 :     if ($type eq 'HASH') {
108 :     # Here the user specified a hash of genome IDs to access codes, which is
109 :     # exactly what we want.
110 :     %genomes = %{$genomeFile};
111 :     } elsif (! $type || $type eq 'SCALAR' ) {
112 :     # The caller specified a file, so read the genomes from the file. (Note
113 :     # that some PERLs return an empty string rather than SCALAR.)
114 :     my @genomeList = Tracer::GetFile($genomeFile);
115 :     if (! @genomeList) {
116 :     # It's an error if the genome file is empty or not found.
117 :     Confess("No genomes found in file \"$genomeFile\".");
118 :     } else {
119 :     # We build the genome Hash using a loop rather than "map" so that
120 :     # an omitted access code can be defaulted to 1.
121 :     for my $genomeLine (@genomeList) {
122 :     my ($genomeID, $accessCode) = split("\t", $genomeLine);
123 : parrello 1.65 if (! defined($accessCode)) {
124 : parrello 1.35 $accessCode = 1;
125 :     }
126 :     $genomes{$genomeID} = $accessCode;
127 : parrello 1.3 }
128 : parrello 1.1 }
129 : parrello 1.35 } else {
130 :     Confess("Invalid genome parameter ($type) in SproutLoad constructor.");
131 : parrello 1.1 }
132 :     }
133 :     }
134 :     # Load the list of trusted subsystems.
135 :     my %subsystems = ();
136 : parrello 1.35 # We only need it if load-only is NOT specified.
137 :     if (! $options->{loadOnly}) {
138 :     if (! defined $subsysFile || $subsysFile eq '') {
139 : parrello 1.55 # Here we want all the usable subsystems. First we get the whole list.
140 : parrello 1.35 my @subs = $fig->all_subsystems();
141 : parrello 1.76 # Loop through, checking for the NMPDR file.
142 : parrello 1.35 for my $sub (@subs) {
143 : parrello 1.76 if ($fig->nmpdr_subsystem($sub)) {
144 : parrello 1.35 $subsystems{$sub} = 1;
145 :     }
146 : parrello 1.33 }
147 : parrello 1.35 } else {
148 :     my $type = ref $subsysFile;
149 :     if ($type eq 'ARRAY') {
150 :     # Here the user passed in a list of subsystems.
151 :     %subsystems = map { $_ => 1 } @{$subsysFile};
152 :     } elsif (! $type || $type eq 'SCALAR') {
153 :     # Here the list of subsystems is in a file.
154 :     if (! -e $subsysFile) {
155 :     # It's an error if the file does not exist.
156 :     Confess("Trusted subsystem file not found.");
157 :     } else {
158 :     # GetFile automatically chomps end-of-line characters, so this
159 :     # is an easy task.
160 :     %subsystems = map { $_ => 1 } Tracer::GetFile($subsysFile);
161 :     }
162 : parrello 1.4 } else {
163 : parrello 1.35 Confess("Invalid subsystem parameter in SproutLoad constructor.");
164 : parrello 1.4 }
165 : parrello 1.1 }
166 : parrello 1.72 # Go through the subsys hash again, creating the keyword list for each subsystem.
167 :     for my $subsystem (keys %subsystems) {
168 :     my $name = $subsystem;
169 :     $name =~ s/_/ /g;
170 :     my $classes = $fig->subsystem_classification($subsystem);
171 : parrello 1.76 $name .= " " . join(" ", @{$classes});
172 : parrello 1.72 $subsystems{$subsystem} = $name;
173 :     }
174 : parrello 1.1 }
175 :     # Get the data directory from the Sprout object.
176 :     my ($directory) = $sprout->LoadInfo();
177 :     # Create the Sprout load object.
178 :     my $retVal = {
179 :     fig => $fig,
180 :     genomes => \%genomes,
181 :     subsystems => \%subsystems,
182 :     sprout => $sprout,
183 :     loadDirectory => $directory,
184 : parrello 1.39 erdb => $sprout,
185 : parrello 1.8 loaders => [],
186 :     options => $options
187 : parrello 1.1 };
188 :     # Bless and return it.
189 :     bless $retVal, $class;
190 :     return $retVal;
191 :     }
192 :    
193 : parrello 1.23 =head3 LoadOnly
194 :    
195 :     C<< my $flag = $spl->LoadOnly; >>
196 :    
197 :     Return TRUE if we are in load-only mode, else FALSE.
198 :    
199 :     =cut
200 :    
201 :     sub LoadOnly {
202 :     my ($self) = @_;
203 :     return $self->{options}->{loadOnly};
204 :     }
205 :    
206 : parrello 1.25 =head3 PrimaryOnly
207 :    
208 :     C<< my $flag = $spl->PrimaryOnly; >>
209 :    
210 :     Return TRUE if only the main entity is to be loaded, else FALSE.
211 :    
212 :     =cut
213 :    
214 :     sub PrimaryOnly {
215 :     my ($self) = @_;
216 :     return $self->{options}->{primaryOnly};
217 :     }
218 :    
219 : parrello 1.1 =head3 LoadGenomeData
220 :    
221 :     C<< my $stats = $spl->LoadGenomeData(); >>
222 :    
223 :     Load the Genome, Contig, and Sequence data from FIG into Sprout.
224 :    
225 :     The Sequence table is the largest single relation in the Sprout database, so this
226 :     method is expected to be slow and clumsy. At some point we will need to make it
227 :     restartable, since an error 10 gigabytes through a 20-gigabyte load is bound to be
228 :     very annoying otherwise.
229 :    
230 :     The following relations are loaded by this method.
231 :    
232 :     Genome
233 :     HasContig
234 :     Contig
235 :     IsMadeUpOf
236 :     Sequence
237 :    
238 :     =over 4
239 :    
240 :     =item RETURNS
241 :    
242 :     Returns a statistics object for the loads.
243 :    
244 :     =back
245 :    
246 :     =cut
247 :     #: Return Type $%;
248 :     sub LoadGenomeData {
249 :     # Get this object instance.
250 :     my ($self) = @_;
251 :     # Get the FIG object.
252 :     my $fig = $self->{fig};
253 :     # Get the genome count.
254 :     my $genomeHash = $self->{genomes};
255 :     my $genomeCount = (keys %{$genomeHash});
256 :     # Create load objects for each of the tables we're loading.
257 : parrello 1.23 my $loadGenome = $self->_TableLoader('Genome');
258 : parrello 1.25 my $loadHasContig = $self->_TableLoader('HasContig', $self->PrimaryOnly);
259 :     my $loadContig = $self->_TableLoader('Contig', $self->PrimaryOnly);
260 :     my $loadIsMadeUpOf = $self->_TableLoader('IsMadeUpOf', $self->PrimaryOnly);
261 :     my $loadSequence = $self->_TableLoader('Sequence', $self->PrimaryOnly);
262 : parrello 1.23 if ($self->{options}->{loadOnly}) {
263 :     Trace("Loading from existing files.") if T(2);
264 :     } else {
265 :     Trace("Generating genome data.") if T(2);
266 :     # Now we loop through the genomes, generating the data for each one.
267 :     for my $genomeID (sort keys %{$genomeHash}) {
268 :     Trace("Generating data for genome $genomeID.") if T(3);
269 :     $loadGenome->Add("genomeIn");
270 :     # The access code comes in via the genome hash.
271 :     my $accessCode = $genomeHash->{$genomeID};
272 : parrello 1.28 # Get the genus, species, and strain from the scientific name.
273 : parrello 1.23 my ($genus, $species, @extraData) = split / /, $self->{fig}->genus_species($genomeID);
274 : parrello 1.28 my $extra = join " ", @extraData;
275 : parrello 1.23 # Get the full taxonomy.
276 :     my $taxonomy = $fig->taxonomy_of($genomeID);
277 : parrello 1.82 # Get the version. If no version is specified, we default to the genome ID by itself.
278 :     my $version = $fig->genome_version($genomeID);
279 :     if (! defined($version)) {
280 :     $version = $genomeID;
281 :     }
282 :     # Get the DNA size.
283 :     my $dnaSize = $fig->genome_szdna($genomeID);
284 : parrello 1.68 # Open the NMPDR group file for this genome.
285 :     my $group;
286 :     if (open(TMP, "<$FIG_Config::organisms/$genomeID/NMPDR") &&
287 :     defined($group = <TMP>)) {
288 :     # Clean the line ending.
289 :     chomp $group;
290 :     } else {
291 :     # No group, so use the default.
292 :     $group = $FIG_Config::otherGroup;
293 :     }
294 :     close TMP;
295 : parrello 1.23 # Output the genome record.
296 : parrello 1.82 $loadGenome->Put($genomeID, $accessCode, $fig->is_complete($genomeID),
297 :     $dnaSize, $genus, $group, $species, $extra, $version, $taxonomy);
298 : parrello 1.23 # Now we loop through each of the genome's contigs.
299 :     my @contigs = $fig->all_contigs($genomeID);
300 :     for my $contigID (@contigs) {
301 :     Trace("Processing contig $contigID for $genomeID.") if T(4);
302 :     $loadContig->Add("contigIn");
303 :     $loadSequence->Add("contigIn");
304 :     # Create the contig ID.
305 :     my $sproutContigID = "$genomeID:$contigID";
306 :     # Create the contig record and relate it to the genome.
307 :     $loadContig->Put($sproutContigID);
308 :     $loadHasContig->Put($genomeID, $sproutContigID);
309 :     # Now we need to split the contig into sequences. The maximum sequence size is
310 :     # a property of the Sprout object.
311 :     my $chunkSize = $self->{sprout}->MaxSequence();
312 :     # Now we get the sequence a chunk at a time.
313 :     my $contigLen = $fig->contig_ln($genomeID, $contigID);
314 :     for (my $i = 1; $i <= $contigLen; $i += $chunkSize) {
315 :     $loadSequence->Add("chunkIn");
316 :     # Compute the endpoint of this chunk.
317 :     my $end = FIG::min($i + $chunkSize - 1, $contigLen);
318 :     # Get the actual DNA.
319 :     my $dna = $fig->get_dna($genomeID, $contigID, $i, $end);
320 :     # Compute the sequenceID.
321 :     my $seqID = "$sproutContigID.$i";
322 :     # Write out the data. For now, the quality vector is always "unknown".
323 :     $loadIsMadeUpOf->Put($sproutContigID, $seqID, $end + 1 - $i, $i);
324 :     $loadSequence->Put($seqID, "unknown", $dna);
325 :     }
326 : parrello 1.1 }
327 :     }
328 :     }
329 :     # Finish the loads.
330 :     my $retVal = $self->_FinishAll();
331 :     # Return the result.
332 :     return $retVal;
333 :     }
334 :    
335 :     =head3 LoadCouplingData
336 :    
337 :     C<< my $stats = $spl->LoadCouplingData(); >>
338 :    
339 :     Load the coupling and evidence data from FIG into Sprout.
340 :    
341 :     The coupling data specifies which genome features are functionally coupled. The
342 :     evidence data explains why the coupling is functional.
343 :    
344 :     The following relations are loaded by this method.
345 :    
346 :     Coupling
347 :     IsEvidencedBy
348 :     PCH
349 :     ParticipatesInCoupling
350 :     UsesAsEvidence
351 :    
352 :     =over 4
353 :    
354 :     =item RETURNS
355 :    
356 :     Returns a statistics object for the loads.
357 :    
358 :     =back
359 :    
360 :     =cut
361 :     #: Return Type $%;
362 :     sub LoadCouplingData {
363 :     # Get this object instance.
364 :     my ($self) = @_;
365 :     # Get the FIG object.
366 :     my $fig = $self->{fig};
367 :     # Get the genome hash.
368 :     my $genomeFilter = $self->{genomes};
369 : parrello 1.50 # Set up an ID counter for the PCHs.
370 :     my $pchID = 0;
371 : parrello 1.1 # Start the loads.
372 : parrello 1.23 my $loadCoupling = $self->_TableLoader('Coupling');
373 : parrello 1.25 my $loadIsEvidencedBy = $self->_TableLoader('IsEvidencedBy', $self->PrimaryOnly);
374 :     my $loadPCH = $self->_TableLoader('PCH', $self->PrimaryOnly);
375 :     my $loadParticipatesInCoupling = $self->_TableLoader('ParticipatesInCoupling', $self->PrimaryOnly);
376 :     my $loadUsesAsEvidence = $self->_TableLoader('UsesAsEvidence', $self->PrimaryOnly);
377 : parrello 1.23 if ($self->{options}->{loadOnly}) {
378 :     Trace("Loading from existing files.") if T(2);
379 :     } else {
380 :     Trace("Generating coupling data.") if T(2);
381 :     # Loop through the genomes found.
382 :     for my $genome (sort keys %{$genomeFilter}) {
383 :     Trace("Generating coupling data for $genome.") if T(3);
384 :     $loadCoupling->Add("genomeIn");
385 :     # Create a hash table for holding coupled pairs. We use this to prevent
386 :     # duplicates. For example, if A is coupled to B, we don't want to also
387 :     # assert that B is coupled to A, because we already know it. Fortunately,
388 :     # all couplings occur within a genome, so we can keep the hash table
389 :     # size reasonably small.
390 :     my %dupHash = ();
391 :     # Get all of the genome's PEGs.
392 :     my @pegs = $fig->pegs_of($genome);
393 :     # Loop through the PEGs.
394 :     for my $peg1 (@pegs) {
395 :     $loadCoupling->Add("pegIn");
396 :     Trace("Processing PEG $peg1 for $genome.") if T(4);
397 :     # Get a list of the coupled PEGs.
398 :     my @couplings = $fig->coupled_to($peg1);
399 :     # For each coupled PEG, we need to verify that a coupling already
400 :     # exists. If not, we have to create one.
401 :     for my $coupleData (@couplings) {
402 :     my ($peg2, $score) = @{$coupleData};
403 :     # Compute the coupling ID.
404 : parrello 1.47 my $coupleID = $self->{erdb}->CouplingID($peg1, $peg2);
405 : parrello 1.23 if (! exists $dupHash{$coupleID}) {
406 :     $loadCoupling->Add("couplingIn");
407 :     # Here we have a new coupling to store in the load files.
408 :     Trace("Storing coupling ($coupleID) with score $score.") if T(4);
409 :     # Ensure we don't do this again.
410 :     $dupHash{$coupleID} = $score;
411 :     # Write the coupling record.
412 :     $loadCoupling->Put($coupleID, $score);
413 :     # Connect it to the coupled PEGs.
414 :     $loadParticipatesInCoupling->Put($peg1, $coupleID, 1);
415 :     $loadParticipatesInCoupling->Put($peg2, $coupleID, 2);
416 :     # Get the evidence for this coupling.
417 :     my @evidence = $fig->coupling_evidence($peg1, $peg2);
418 :     # Organize the evidence into a hash table.
419 :     my %evidenceMap = ();
420 :     # Process each evidence item.
421 :     for my $evidenceData (@evidence) {
422 :     $loadPCH->Add("evidenceIn");
423 :     my ($peg3, $peg4, $usage) = @{$evidenceData};
424 :     # Only proceed if the evidence is from a Sprout
425 :     # genome.
426 :     if ($genomeFilter->{$fig->genome_of($peg3)}) {
427 :     $loadUsesAsEvidence->Add("evidenceChosen");
428 :     my $evidenceKey = "$coupleID $peg3 $peg4";
429 :     # We store this evidence in the hash if the usage
430 :     # is nonzero or no prior evidence has been found. This
431 :     # insures that if there is duplicate evidence, we
432 :     # at least keep the meaningful ones. Only evidence in
433 :     # the hash makes it to the output.
434 :     if ($usage || ! exists $evidenceMap{$evidenceKey}) {
435 :     $evidenceMap{$evidenceKey} = $evidenceData;
436 :     }
437 : parrello 1.1 }
438 :     }
439 : parrello 1.23 for my $evidenceID (keys %evidenceMap) {
440 : parrello 1.50 # Get the ID for this evidence.
441 :     $pchID++;
442 : parrello 1.23 # Create the evidence record.
443 :     my ($peg3, $peg4, $usage) = @{$evidenceMap{$evidenceID}};
444 : parrello 1.50 $loadPCH->Put($pchID, $usage);
445 : parrello 1.23 # Connect it to the coupling.
446 : parrello 1.50 $loadIsEvidencedBy->Put($coupleID, $pchID);
447 : parrello 1.23 # Connect it to the features.
448 : parrello 1.50 $loadUsesAsEvidence->Put($pchID, $peg3, 1);
449 :     $loadUsesAsEvidence->Put($pchID, $peg4, 2);
450 : parrello 1.23 }
451 : parrello 1.1 }
452 :     }
453 :     }
454 :     }
455 :     }
456 :     # All done. Finish the load.
457 :     my $retVal = $self->_FinishAll();
458 :     return $retVal;
459 :     }
460 :    
461 :     =head3 LoadFeatureData
462 :    
463 :     C<< my $stats = $spl->LoadFeatureData(); >>
464 :    
465 :     Load the feature data from FIG into Sprout.
466 :    
467 :     Features represent annotated genes, and are therefore the heart of the data store.
468 :    
469 :     The following relations are loaded by this method.
470 :    
471 :     Feature
472 :     FeatureAlias
473 :     FeatureLink
474 :     FeatureTranslation
475 :     FeatureUpstream
476 :     IsLocatedIn
477 : parrello 1.30 HasFeature
478 : parrello 1.69 HasRoleInSubsystem
479 : parrello 1.76 FeatureEssential
480 :     FeatureVirulent
481 :     FeatureIEDB
482 : parrello 1.1
483 :     =over 4
484 :    
485 :     =item RETURNS
486 :    
487 :     Returns a statistics object for the loads.
488 :    
489 :     =back
490 :    
491 :     =cut
492 :     #: Return Type $%;
493 :     sub LoadFeatureData {
494 :     # Get this object instance.
495 :     my ($self) = @_;
496 : parrello 1.72 # Get the FIG and Sprout objects.
497 : parrello 1.1 my $fig = $self->{fig};
498 : parrello 1.72 my $sprout = $self->{sprout};
499 : parrello 1.1 # Get the table of genome IDs.
500 :     my $genomeHash = $self->{genomes};
501 :     # Create load objects for each of the tables we're loading.
502 : parrello 1.23 my $loadFeature = $self->_TableLoader('Feature');
503 : parrello 1.25 my $loadIsLocatedIn = $self->_TableLoader('IsLocatedIn', $self->PrimaryOnly);
504 : parrello 1.23 my $loadFeatureAlias = $self->_TableLoader('FeatureAlias');
505 :     my $loadFeatureLink = $self->_TableLoader('FeatureLink');
506 :     my $loadFeatureTranslation = $self->_TableLoader('FeatureTranslation');
507 :     my $loadFeatureUpstream = $self->_TableLoader('FeatureUpstream');
508 : parrello 1.69 my $loadHasFeature = $self->_TableLoader('HasFeature', $self->PrimaryOnly);
509 :     my $loadHasRoleInSubsystem = $self->_TableLoader('HasRoleInSubsystem', $self->PrimaryOnly);
510 : parrello 1.76 my $loadFeatureEssential = $self->_TableLoader('FeatureEssential');
511 :     my $loadFeatureVirulent = $self->_TableLoader('FeatureVirulent');
512 :     my $loadFeatureIEDB = $self->_TableLoader('FeatureIEDB');
513 : parrello 1.72 # Get the subsystem hash.
514 :     my $subHash = $self->{subsystems};
515 : parrello 1.1 # Get the maximum sequence size. We need this later for splitting up the
516 :     # locations.
517 :     my $chunkSize = $self->{sprout}->MaxSegment();
518 : parrello 1.23 if ($self->{options}->{loadOnly}) {
519 :     Trace("Loading from existing files.") if T(2);
520 :     } else {
521 :     Trace("Generating feature data.") if T(2);
522 :     # Now we loop through the genomes, generating the data for each one.
523 :     for my $genomeID (sort keys %{$genomeHash}) {
524 :     Trace("Loading features for genome $genomeID.") if T(3);
525 :     $loadFeature->Add("genomeIn");
526 :     # Get the feature list for this genome.
527 : parrello 1.82 my $features = $fig->all_features_detailed_fast($genomeID);
528 : parrello 1.56 # Sort and count the list.
529 : parrello 1.57 my @featureTuples = sort { $a->[0] cmp $b->[0] } @{$features};
530 :     my $count = scalar @featureTuples;
531 : parrello 1.80 my @fids = map { $_->[0] } @featureTuples;
532 : parrello 1.54 Trace("$count features found for genome $genomeID.") if T(3);
533 : parrello 1.80 # Get the attributes for this genome and put them in a hash by feature ID.
534 :     my $attributes = GetGenomeAttributes($fig, $genomeID, \@fids);
535 : parrello 1.56 # Set up for our duplicate-feature check.
536 :     my $oldFeatureID = "";
537 : parrello 1.23 # Loop through the features.
538 : parrello 1.57 for my $featureTuple (@featureTuples) {
539 : parrello 1.23 # Split the tuple.
540 : parrello 1.82 my ($featureID, $locations, undef, $type, $minloc, $maxloc, $assignment, $user, $quality) = @{$featureTuple};
541 : parrello 1.56 # Check for duplicates.
542 :     if ($featureID eq $oldFeatureID) {
543 :     Trace("Duplicate feature $featureID found.") if T(1);
544 :     } else {
545 :     $oldFeatureID = $featureID;
546 :     # Count this feature.
547 :     $loadFeature->Add("featureIn");
548 : parrello 1.82 # Fix the quality. It is almost always a space, but some odd stuff might sneak through, and the
549 :     # Sprout database requires a single character.
550 :     if (! defined($quality) || $quality eq "") {
551 :     $quality = " ";
552 :     }
553 : parrello 1.76 # Begin building the keywords. We start with the genome ID, the
554 : parrello 1.79 # feature ID, the taxonomy, and the organism name.
555 :     my @keywords = ($genomeID, $featureID, $fig->genus_species($genomeID),
556 :     $fig->taxonomy_of($genomeID));
557 : parrello 1.81 # Create the aliases.
558 :     for my $alias ($fig->feature_aliases($featureID)) {
559 :     $loadFeatureAlias->Put($featureID, $alias);
560 :     push @keywords, $alias;
561 : parrello 1.75 }
562 :     Trace("Assignment for $featureID is: $assignment") if T(4);
563 :     # Break the assignment into words and shove it onto the
564 :     # keyword list.
565 :     push @keywords, split(/\s+/, $assignment);
566 : parrello 1.72 # Link this feature to the parent genome.
567 : parrello 1.56 $loadHasFeature->Put($genomeID, $featureID, $type);
568 :     # Get the links.
569 :     my @links = $fig->fid_links($featureID);
570 :     for my $link (@links) {
571 :     $loadFeatureLink->Put($featureID, $link);
572 : parrello 1.8 }
573 : parrello 1.56 # If this is a peg, generate the translation and the upstream.
574 :     if ($type eq 'peg') {
575 :     $loadFeatureTranslation->Add("pegIn");
576 :     my $translation = $fig->get_translation($featureID);
577 :     if ($translation) {
578 :     $loadFeatureTranslation->Put($featureID, $translation);
579 :     }
580 :     # We use the default upstream values of u=200 and c=100.
581 :     my $upstream = $fig->upstream_of($featureID, 200, 100);
582 :     if ($upstream) {
583 :     $loadFeatureUpstream->Put($featureID, $upstream);
584 :     }
585 : parrello 1.23 }
586 : parrello 1.69 # Now we need to find the subsystems this feature participates in.
587 : parrello 1.72 # We also add the subsystems to the keyword list. Before we do that,
588 :     # we must convert underscores to spaces and tack on the classifications.
589 : parrello 1.69 my @subsystems = $fig->peg_to_subsystems($featureID);
590 :     for my $subsystem (@subsystems) {
591 : parrello 1.72 # Only proceed if we like this subsystem.
592 :     if (exists $subHash->{$subsystem}) {
593 :     # Store the has-role link.
594 :     $loadHasRoleInSubsystem->Put($featureID, $subsystem, $genomeID, $type);
595 :     # Save the subsystem's keyword data.
596 :     my $subKeywords = $subHash->{$subsystem};
597 : parrello 1.75 push @keywords, split /\s+/, $subKeywords;
598 :     # Now we need to get this feature's role in the subsystem.
599 :     my $subObject = $fig->get_subsystem($subsystem);
600 :     my @roleColumns = $subObject->get_peg_roles($featureID);
601 :     my @allRoles = $subObject->get_roles();
602 :     for my $col (@roleColumns) {
603 :     my $role = $allRoles[$col];
604 :     push @keywords, split /\s+/, $role;
605 :     push @keywords, $subObject->get_role_abbr($col);
606 :     }
607 : parrello 1.72 }
608 :     }
609 : parrello 1.76 # There are three special attributes computed from property
610 :     # data that we build next. If the special attribute is non-empty,
611 :     # its name will be added to the keyword list. First, we get all
612 :     # the attributes for this feature. They will come back as
613 :     # 4-tuples: [peg, name, value, URL]. We use a 3-tuple instead:
614 :     # [name, value, value with URL]. (We don't need the PEG, since
615 :     # we already know it.)
616 :     my @attributes = map { [$_->[1], $_->[2], Tracer::CombineURL($_->[2], $_->[3])] }
617 : parrello 1.80 @{$attributes->{$featureID}};
618 : parrello 1.76 # Now we process each of the special attributes.
619 :     if (SpecialAttribute($featureID, \@attributes,
620 : parrello 1.77 1, [0,2], '^(essential|potential_essential)$',
621 : parrello 1.76 $loadFeatureEssential)) {
622 :     push @keywords, 'essential';
623 :     $loadFeature->Add('essential');
624 : parrello 1.72 }
625 : parrello 1.76 if (SpecialAttribute($featureID, \@attributes,
626 : parrello 1.77 0, [2], '^virulen',
627 : parrello 1.76 $loadFeatureVirulent)) {
628 :     push @keywords, 'virulent';
629 :     $loadFeature->Add('virulent');
630 :     }
631 :     if (SpecialAttribute($featureID, \@attributes,
632 : parrello 1.77 0, [0,2], '^iedb_',
633 : parrello 1.76 $loadFeatureIEDB)) {
634 :     push @keywords, 'iedb';
635 :     $loadFeature->Add('iedb');
636 : parrello 1.75 }
637 :     # Now we need to bust up hyphenated words in the keyword
638 : parrello 1.79 # list. We keep them separate and put them at the end so
639 :     # the original word order is available.
640 : parrello 1.75 my $keywordString = "";
641 : parrello 1.79 my $bustedString = "";
642 : parrello 1.75 for my $keyword (@keywords) {
643 : parrello 1.80 if (length $keyword >= 3) {
644 : parrello 1.75 $keywordString .= " $keyword";
645 :     if ($keyword =~ /-/) {
646 : parrello 1.80 my @words = split /-/, $keyword;
647 : parrello 1.79 $bustedString .= join(" ", "", @words);
648 : parrello 1.75 }
649 :     }
650 : parrello 1.69 }
651 : parrello 1.79 $keywordString .= $bustedString;
652 :     # Get rid of annoying punctuation.
653 :     $keywordString =~ s/[();]//g;
654 : parrello 1.72 # Clean the keyword list.
655 : parrello 1.75 my $cleanWords = $sprout->CleanKeywords($keywordString);
656 :     Trace("Keyword string for $featureID: $cleanWords") if T(4);
657 : parrello 1.72 # Create the feature record.
658 : parrello 1.82 $loadFeature->Put($featureID, 1, $user, $quality, $type, $assignment, $cleanWords);
659 : parrello 1.56 # This part is the roughest. We need to relate the features to contig
660 :     # locations, and the locations must be split so that none of them exceed
661 :     # the maximum segment size. This simplifies the genes_in_region processing
662 :     # for Sprout.
663 :     my @locationList = split /\s*,\s*/, $locations;
664 :     # Create the location position indicator.
665 :     my $i = 1;
666 :     # Loop through the locations.
667 :     for my $location (@locationList) {
668 :     # Parse the location.
669 :     my $locObject = BasicLocation->new("$genomeID:$location");
670 :     # Split it into a list of chunks.
671 :     my @locOList = ();
672 :     while (my $peeling = $locObject->Peel($chunkSize)) {
673 :     $loadIsLocatedIn->Add("peeling");
674 :     push @locOList, $peeling;
675 :     }
676 :     push @locOList, $locObject;
677 :     # Loop through the chunks, creating IsLocatedIn records. The variable
678 :     # "$i" will be used to keep the location index.
679 :     for my $locChunk (@locOList) {
680 :     $loadIsLocatedIn->Put($featureID, $locChunk->Contig, $locChunk->Left,
681 :     $locChunk->Dir, $locChunk->Length, $i);
682 :     $i++;
683 :     }
684 : parrello 1.23 }
685 : parrello 1.1 }
686 :     }
687 :     }
688 :     }
689 :     # Finish the loads.
690 :     my $retVal = $self->_FinishAll();
691 :     return $retVal;
692 :     }
693 :    
694 :     =head3 LoadSubsystemData
695 :    
696 :     C<< my $stats = $spl->LoadSubsystemData(); >>
697 :    
698 :     Load the subsystem data from FIG into Sprout.
699 :    
700 :     Subsystems are groupings of genetic roles that work together to effect a specific
701 :     chemical reaction. Similar organisms require similar subsystems. To curate a subsystem,
702 :     a spreadsheet is created with genomes on one axis and subsystem roles on the other
703 :     axis. Similar features are then mapped into the cells, allowing the annotation of one
704 :     genome's roles to be used to assist in the annotation of others.
705 :    
706 :     The following relations are loaded by this method.
707 :    
708 :     Subsystem
709 : parrello 1.46 SubsystemClass
710 : parrello 1.1 Role
711 : parrello 1.19 RoleEC
712 : parrello 1.1 SSCell
713 :     ContainsFeature
714 :     IsGenomeOf
715 :     IsRoleOf
716 :     OccursInSubsystem
717 :     ParticipatesIn
718 :     HasSSCell
719 : parrello 1.18 ConsistsOfRoles
720 :     RoleSubset
721 :     HasRoleSubset
722 :     ConsistsOfGenomes
723 :     GenomeSubset
724 :     HasGenomeSubset
725 : parrello 1.20 Catalyzes
726 : parrello 1.21 Diagram
727 :     RoleOccursIn
728 : parrello 1.1
729 :     =over 4
730 :    
731 :     =item RETURNS
732 :    
733 :     Returns a statistics object for the loads.
734 :    
735 :     =back
736 :    
737 :     =cut
738 :     #: Return Type $%;
739 :     sub LoadSubsystemData {
740 :     # Get this object instance.
741 :     my ($self) = @_;
742 :     # Get the FIG object.
743 :     my $fig = $self->{fig};
744 :     # Get the genome hash. We'll use it to filter the genomes in each
745 :     # spreadsheet.
746 :     my $genomeHash = $self->{genomes};
747 :     # Get the subsystem hash. This lists the subsystems we'll process.
748 :     my $subsysHash = $self->{subsystems};
749 :     my @subsysIDs = sort keys %{$subsysHash};
750 : parrello 1.21 # Get the map list.
751 :     my @maps = $fig->all_maps;
752 : parrello 1.1 # Create load objects for each of the tables we're loading.
753 : parrello 1.25 my $loadDiagram = $self->_TableLoader('Diagram', $self->PrimaryOnly);
754 :     my $loadRoleOccursIn = $self->_TableLoader('RoleOccursIn', $self->PrimaryOnly);
755 : parrello 1.23 my $loadSubsystem = $self->_TableLoader('Subsystem');
756 : parrello 1.25 my $loadRole = $self->_TableLoader('Role', $self->PrimaryOnly);
757 :     my $loadRoleEC = $self->_TableLoader('RoleEC', $self->PrimaryOnly);
758 :     my $loadCatalyzes = $self->_TableLoader('Catalyzes', $self->PrimaryOnly);
759 :     my $loadSSCell = $self->_TableLoader('SSCell', $self->PrimaryOnly);
760 :     my $loadContainsFeature = $self->_TableLoader('ContainsFeature', $self->PrimaryOnly);
761 :     my $loadIsGenomeOf = $self->_TableLoader('IsGenomeOf', $self->PrimaryOnly);
762 :     my $loadIsRoleOf = $self->_TableLoader('IsRoleOf', $self->PrimaryOnly);
763 :     my $loadOccursInSubsystem = $self->_TableLoader('OccursInSubsystem', $self->PrimaryOnly);
764 :     my $loadParticipatesIn = $self->_TableLoader('ParticipatesIn', $self->PrimaryOnly);
765 :     my $loadHasSSCell = $self->_TableLoader('HasSSCell', $self->PrimaryOnly);
766 :     my $loadRoleSubset = $self->_TableLoader('RoleSubset', $self->PrimaryOnly);
767 :     my $loadGenomeSubset = $self->_TableLoader('GenomeSubset', $self->PrimaryOnly);
768 :     my $loadConsistsOfRoles = $self->_TableLoader('ConsistsOfRoles', $self->PrimaryOnly);
769 :     my $loadConsistsOfGenomes = $self->_TableLoader('ConsistsOfGenomes', $self->PrimaryOnly);
770 :     my $loadHasRoleSubset = $self->_TableLoader('HasRoleSubset', $self->PrimaryOnly);
771 :     my $loadHasGenomeSubset = $self->_TableLoader('HasGenomeSubset', $self->PrimaryOnly);
772 : parrello 1.46 my $loadSubsystemClass = $self->_TableLoader('SubsystemClass', $self->PrimaryOnly);
773 : parrello 1.23 if ($self->{options}->{loadOnly}) {
774 :     Trace("Loading from existing files.") if T(2);
775 :     } else {
776 :     Trace("Generating subsystem data.") if T(2);
777 :     # This hash will contain the role for each EC. When we're done, this
778 :     # information will be used to generate the Catalyzes table.
779 :     my %ecToRoles = ();
780 :     # Loop through the subsystems. Our first task will be to create the
781 :     # roles. We do this by looping through the subsystems and creating a
782 :     # role hash. The hash tracks each role ID so that we don't create
783 :     # duplicates. As we move along, we'll connect the roles and subsystems
784 :     # and memorize up the reactions.
785 :     my ($genomeID, $roleID);
786 :     my %roleData = ();
787 :     for my $subsysID (@subsysIDs) {
788 :     # Get the subsystem object.
789 :     my $sub = $fig->get_subsystem($subsysID);
790 : parrello 1.32 # Only proceed if the subsystem has a spreadsheet.
791 :     if (! $sub->{empty_ss}) {
792 : parrello 1.31 Trace("Creating subsystem $subsysID.") if T(3);
793 :     $loadSubsystem->Add("subsystemIn");
794 :     # Create the subsystem record.
795 :     my $curator = $sub->get_curator();
796 :     my $notes = $sub->get_notes();
797 :     $loadSubsystem->Put($subsysID, $curator, $notes);
798 : parrello 1.72 # Now for the classification string. This comes back as a list
799 :     # reference and we convert it to a space-delimited string.
800 : parrello 1.64 my $classList = $fig->subsystem_classification($subsysID);
801 : parrello 1.78 my $classString = join($FIG_Config::splitter, grep { $_ } @$classList);
802 : parrello 1.72 $loadSubsystemClass->Put($subsysID, $classString);
803 : parrello 1.31 # Connect it to its roles. Each role is a column in the subsystem spreadsheet.
804 :     for (my $col = 0; defined($roleID = $sub->get_role($col)); $col++) {
805 :     # Connect to this role.
806 :     $loadOccursInSubsystem->Add("roleIn");
807 :     $loadOccursInSubsystem->Put($roleID, $subsysID, $col);
808 :     # If it's a new role, add it to the role table.
809 :     if (! exists $roleData{$roleID}) {
810 :     # Get the role's abbreviation.
811 :     my $abbr = $sub->get_role_abbr($col);
812 :     # Add the role.
813 :     $loadRole->Put($roleID, $abbr);
814 :     $roleData{$roleID} = 1;
815 :     # Check for an EC number.
816 :     if ($roleID =~ /\(EC ([^.]+\.[^.]+\.[^.]+\.[^)]+)\)\s*$/) {
817 :     my $ec = $1;
818 :     $loadRoleEC->Put($roleID, $ec);
819 :     $ecToRoles{$ec} = $roleID;
820 :     }
821 : parrello 1.23 }
822 : parrello 1.18 }
823 : parrello 1.31 # Now we create the spreadsheet for the subsystem by matching roles to
824 :     # genomes. Each genome is a row and each role is a column. We may need
825 :     # to actually create the roles as we find them.
826 :     Trace("Creating subsystem $subsysID spreadsheet.") if T(3);
827 :     for (my $row = 0; defined($genomeID = $sub->get_genome($row)); $row++) {
828 :     # Only proceed if this is one of our genomes.
829 :     if (exists $genomeHash->{$genomeID}) {
830 :     # Count the PEGs and cells found for verification purposes.
831 :     my $pegCount = 0;
832 :     my $cellCount = 0;
833 :     # Create a list for the PEGs we find. This list will be used
834 :     # to generate cluster numbers.
835 :     my @pegsFound = ();
836 :     # Create a hash that maps spreadsheet IDs to PEGs. We will
837 :     # use this to generate the ContainsFeature data after we have
838 :     # the cluster numbers.
839 :     my %cellPegs = ();
840 :     # Get the genome's variant code for this subsystem.
841 :     my $variantCode = $sub->get_variant_code($row);
842 :     # Loop through the subsystem's roles. We use an index because it is
843 :     # part of the spreadsheet cell ID.
844 :     for (my $col = 0; defined($roleID = $sub->get_role($col)); $col++) {
845 :     # Get the features in the spreadsheet cell for this genome and role.
846 : parrello 1.37 my @pegs = grep { !$fig->is_deleted_fid($_) } $sub->get_pegs_from_cell($row, $col);
847 : parrello 1.31 # Only proceed if features exist.
848 :     if (@pegs > 0) {
849 :     # Create the spreadsheet cell.
850 :     $cellCount++;
851 :     my $cellID = "$subsysID:$genomeID:$col";
852 :     $loadSSCell->Put($cellID);
853 :     $loadIsGenomeOf->Put($genomeID, $cellID);
854 :     $loadIsRoleOf->Put($roleID, $cellID);
855 :     $loadHasSSCell->Put($subsysID, $cellID);
856 :     # Remember its features.
857 :     push @pegsFound, @pegs;
858 :     $cellPegs{$cellID} = \@pegs;
859 :     $pegCount += @pegs;
860 :     }
861 : parrello 1.23 }
862 : parrello 1.31 # If we found some cells for this genome, we need to compute clusters and
863 :     # denote it participates in the subsystem.
864 :     if ($pegCount > 0) {
865 :     Trace("$pegCount PEGs in $cellCount cells for $genomeID.") if T(3);
866 :     $loadParticipatesIn->Put($genomeID, $subsysID, $variantCode);
867 :     # Create a hash mapping PEG IDs to cluster numbers.
868 :     # We default to -1 for all of them.
869 :     my %clusterOf = map { $_ => -1 } @pegsFound;
870 : parrello 1.41 # Partition the PEGs found into clusters.
871 :     my @clusters = $fig->compute_clusters([keys %clusterOf], $sub);
872 : parrello 1.31 for (my $i = 0; $i <= $#clusters; $i++) {
873 :     my $subList = $clusters[$i];
874 :     for my $peg (@{$subList}) {
875 :     $clusterOf{$peg} = $i;
876 :     }
877 : parrello 1.23 }
878 : parrello 1.31 # Create the ContainsFeature data.
879 :     for my $cellID (keys %cellPegs) {
880 :     my $cellList = $cellPegs{$cellID};
881 :     for my $cellPeg (@$cellList) {
882 :     $loadContainsFeature->Put($cellID, $cellPeg, $clusterOf{$cellPeg});
883 :     }
884 : parrello 1.23 }
885 : parrello 1.18 }
886 :     }
887 : parrello 1.15 }
888 : parrello 1.31 # Now we need to generate the subsets. The subset names must be concatenated to
889 :     # the subsystem name to make them unique keys. There are two types of subsets:
890 :     # genome subsets and role subsets. We do the role subsets first.
891 :     my @subsetNames = $sub->get_subset_names();
892 :     for my $subsetID (@subsetNames) {
893 :     # Create the subset record.
894 :     my $actualID = "$subsysID:$subsetID";
895 :     $loadRoleSubset->Put($actualID);
896 :     # Connect the subset to the subsystem.
897 :     $loadHasRoleSubset->Put($subsysID, $actualID);
898 :     # Connect the subset to its roles.
899 :     my @roles = $sub->get_subsetC_roles($subsetID);
900 :     for my $roleID (@roles) {
901 :     $loadConsistsOfRoles->Put($actualID, $roleID);
902 :     }
903 :     }
904 :     # Next the genome subsets.
905 :     @subsetNames = $sub->get_subset_namesR();
906 :     for my $subsetID (@subsetNames) {
907 :     # Create the subset record.
908 :     my $actualID = "$subsysID:$subsetID";
909 :     $loadGenomeSubset->Put($actualID);
910 :     # Connect the subset to the subsystem.
911 :     $loadHasGenomeSubset->Put($subsysID, $actualID);
912 :     # Connect the subset to its genomes.
913 :     my @genomes = $sub->get_subsetR($subsetID);
914 :     for my $genomeID (@genomes) {
915 :     $loadConsistsOfGenomes->Put($actualID, $genomeID);
916 :     }
917 : parrello 1.23 }
918 : parrello 1.18 }
919 : parrello 1.57 }
920 :     # Now we loop through the diagrams. We need to create the diagram records
921 :     # and link each diagram to its roles. Note that only roles which occur
922 :     # in subsystems (and therefore appear in the %ecToRoles hash) are
923 :     # included.
924 :     for my $map (@maps) {
925 :     Trace("Loading diagram $map.") if T(3);
926 :     # Get the diagram's descriptive name.
927 :     my $name = $fig->map_name($map);
928 :     $loadDiagram->Put($map, $name);
929 :     # Now we need to link all the map's roles to it.
930 :     # A hash is used to prevent duplicates.
931 :     my %roleHash = ();
932 :     for my $role ($fig->map_to_ecs($map)) {
933 :     if (exists $ecToRoles{$role} && ! $roleHash{$role}) {
934 :     $loadRoleOccursIn->Put($ecToRoles{$role}, $map);
935 :     $roleHash{$role} = 1;
936 : parrello 1.23 }
937 : parrello 1.21 }
938 : parrello 1.57 }
939 :     # Before we leave, we must create the Catalyzes table. We start with the reactions,
940 :     # then use the "ecToRoles" table to convert EC numbers to role IDs.
941 :     my @reactions = $fig->all_reactions();
942 :     for my $reactionID (@reactions) {
943 :     # Get this reaction's list of roles. The results will be EC numbers.
944 :     my @roles = $fig->catalyzed_by($reactionID);
945 :     # Loop through the roles, creating catalyzation records.
946 :     for my $thisRole (@roles) {
947 :     if (exists $ecToRoles{$thisRole}) {
948 :     $loadCatalyzes->Put($ecToRoles{$thisRole}, $reactionID);
949 : parrello 1.23 }
950 : parrello 1.18 }
951 :     }
952 : parrello 1.1 }
953 :     # Finish the load.
954 :     my $retVal = $self->_FinishAll();
955 :     return $retVal;
956 :     }
957 :    
958 :     =head3 LoadPropertyData
959 :    
960 :     C<< my $stats = $spl->LoadPropertyData(); >>
961 :    
962 :     Load the attribute data from FIG into Sprout.
963 :    
964 :     Attribute data in FIG corresponds to the Sprout concept of Property. As currently
965 :     implemented, each key-value attribute combination in the SEED corresponds to a
966 :     record in the B<Property> table. The B<HasProperty> relationship links the
967 :     features to the properties.
968 :    
969 :     The SEED also allows attributes to be assigned to genomes, but this is not yet
970 :     supported by Sprout.
971 :    
972 :     The following relations are loaded by this method.
973 :    
974 :     HasProperty
975 :     Property
976 :    
977 :     =over 4
978 :    
979 :     =item RETURNS
980 :    
981 :     Returns a statistics object for the loads.
982 :    
983 :     =back
984 :    
985 :     =cut
986 :     #: Return Type $%;
987 :     sub LoadPropertyData {
988 :     # Get this object instance.
989 :     my ($self) = @_;
990 :     # Get the FIG object.
991 :     my $fig = $self->{fig};
992 :     # Get the genome hash.
993 :     my $genomeHash = $self->{genomes};
994 :     # Create load objects for each of the tables we're loading.
995 : parrello 1.23 my $loadProperty = $self->_TableLoader('Property');
996 : parrello 1.25 my $loadHasProperty = $self->_TableLoader('HasProperty', $self->PrimaryOnly);
997 : parrello 1.23 if ($self->{options}->{loadOnly}) {
998 :     Trace("Loading from existing files.") if T(2);
999 :     } else {
1000 :     Trace("Generating property data.") if T(2);
1001 :     # Create a hash for storing property IDs.
1002 :     my %propertyKeys = ();
1003 :     my $nextID = 1;
1004 :     # Loop through the genomes.
1005 : parrello 1.66 for my $genomeID (sort keys %{$genomeHash}) {
1006 : parrello 1.23 $loadProperty->Add("genomeIn");
1007 : parrello 1.24 Trace("Generating properties for $genomeID.") if T(3);
1008 : parrello 1.23 # Get the genome's features. The feature ID is the first field in the
1009 :     # tuples returned by "all_features_detailed". We use "all_features_detailed"
1010 :     # rather than "all_features" because we want all features regardless of type.
1011 :     my @features = map { $_->[0] } @{$fig->all_features_detailed($genomeID)};
1012 : parrello 1.24 my $featureCount = 0;
1013 :     my $propertyCount = 0;
1014 : parrello 1.80 # Get the properties for this genome's features.
1015 :     my $attributes = GetGenomeAttributes($fig, $genomeID, \@features);
1016 :     Trace("Property hash built for $genomeID.") if T(3);
1017 : parrello 1.23 # Loop through the features, creating HasProperty records.
1018 :     for my $fid (@features) {
1019 :     # Get all attributes for this feature. We do this one feature at a time
1020 :     # to insure we do not get any genome attributes.
1021 : parrello 1.80 my @attributeList = @{$attributes->{$fid}};
1022 : parrello 1.24 if (scalar @attributeList) {
1023 :     $featureCount++;
1024 :     }
1025 : parrello 1.23 # Loop through the attributes.
1026 :     for my $tuple (@attributeList) {
1027 : parrello 1.24 $propertyCount++;
1028 : parrello 1.23 # Get this attribute value's data. Note that we throw away the FID,
1029 :     # since it will always be the same as the value if "$fid".
1030 :     my (undef, $key, $value, $url) = @{$tuple};
1031 :     # Concatenate the key and value and check the "propertyKeys" hash to
1032 :     # see if we already have an ID for it. We use a tab for the separator
1033 :     # character.
1034 :     my $propertyKey = "$key\t$value";
1035 :     # Use the concatenated value to check for an ID. If no ID exists, we
1036 :     # create one.
1037 :     my $propertyID = $propertyKeys{$propertyKey};
1038 :     if (! $propertyID) {
1039 :     # Here we need to create a new property ID for this key/value pair.
1040 :     $propertyKeys{$propertyKey} = $nextID;
1041 :     $propertyID = $nextID;
1042 :     $nextID++;
1043 :     $loadProperty->Put($propertyID, $key, $value);
1044 :     }
1045 :     # Create the HasProperty entry for this feature/property association.
1046 :     $loadHasProperty->Put($fid, $propertyID, $url);
1047 : parrello 1.1 }
1048 :     }
1049 : parrello 1.24 # Update the statistics.
1050 :     Trace("$propertyCount attributes processed for $featureCount features.") if T(3);
1051 :     $loadHasProperty->Add("featuresIn", $featureCount);
1052 :     $loadHasProperty->Add("propertiesIn", $propertyCount);
1053 : parrello 1.1 }
1054 :     }
1055 :     # Finish the load.
1056 :     my $retVal = $self->_FinishAll();
1057 :     return $retVal;
1058 :     }
1059 :    
1060 :     =head3 LoadAnnotationData
1061 :    
1062 :     C<< my $stats = $spl->LoadAnnotationData(); >>
1063 :    
1064 :     Load the annotation data from FIG into Sprout.
1065 :    
1066 :     Sprout annotations encompass both the assignments and the annotations in SEED.
1067 :     These describe the function performed by a PEG as well as any other useful
1068 :     information that may aid in identifying its purpose.
1069 :    
1070 :     The following relations are loaded by this method.
1071 :    
1072 :     Annotation
1073 :     IsTargetOfAnnotation
1074 :     SproutUser
1075 :     MadeAnnotation
1076 :    
1077 :     =over 4
1078 :    
1079 :     =item RETURNS
1080 :    
1081 :     Returns a statistics object for the loads.
1082 :    
1083 :     =back
1084 :    
1085 :     =cut
1086 :     #: Return Type $%;
1087 :     sub LoadAnnotationData {
1088 :     # Get this object instance.
1089 :     my ($self) = @_;
1090 :     # Get the FIG object.
1091 :     my $fig = $self->{fig};
1092 :     # Get the genome hash.
1093 :     my $genomeHash = $self->{genomes};
1094 :     # Create load objects for each of the tables we're loading.
1095 : parrello 1.23 my $loadAnnotation = $self->_TableLoader('Annotation');
1096 : parrello 1.25 my $loadIsTargetOfAnnotation = $self->_TableLoader('IsTargetOfAnnotation', $self->PrimaryOnly);
1097 :     my $loadSproutUser = $self->_TableLoader('SproutUser', $self->PrimaryOnly);
1098 :     my $loadUserAccess = $self->_TableLoader('UserAccess', $self->PrimaryOnly);
1099 :     my $loadMadeAnnotation = $self->_TableLoader('MadeAnnotation', $self->PrimaryOnly);
1100 : parrello 1.23 if ($self->{options}->{loadOnly}) {
1101 :     Trace("Loading from existing files.") if T(2);
1102 :     } else {
1103 :     Trace("Generating annotation data.") if T(2);
1104 :     # Create a hash of user names. We'll use this to prevent us from generating duplicate
1105 :     # user records.
1106 :     my %users = ( FIG => 1, master => 1 );
1107 :     # Put in FIG and "master".
1108 :     $loadSproutUser->Put("FIG", "Fellowship for Interpretation of Genomes");
1109 :     $loadUserAccess->Put("FIG", 1);
1110 :     $loadSproutUser->Put("master", "Master User");
1111 :     $loadUserAccess->Put("master", 1);
1112 :     # Get the current time.
1113 :     my $time = time();
1114 :     # Loop through the genomes.
1115 :     for my $genomeID (sort keys %{$genomeHash}) {
1116 :     Trace("Processing $genomeID.") if T(3);
1117 : parrello 1.38 # Create a hash of timestamps. We use this to prevent duplicate time stamps
1118 :     # from showing up for a single PEG's annotations.
1119 :     my %seenTimestamps = ();
1120 : parrello 1.36 # Get the genome's annotations.
1121 :     my @annotations = $fig->read_all_annotations($genomeID);
1122 :     Trace("Processing annotations.") if T(2);
1123 :     for my $tuple (@annotations) {
1124 :     # Get the annotation tuple.
1125 :     my ($peg, $timestamp, $user, $text) = @{$tuple};
1126 :     # Here we fix up the annotation text. "\r" is removed,
1127 : parrello 1.42 # and "\t" and "\n" are escaped. Note we use the "gs"
1128 : parrello 1.36 # modifier so that new-lines inside the text do not
1129 :     # stop the substitution search.
1130 :     $text =~ s/\r//gs;
1131 :     $text =~ s/\t/\\t/gs;
1132 :     $text =~ s/\n/\\n/gs;
1133 :     # Change assignments by the master user to FIG assignments.
1134 :     $text =~ s/Set master function/Set FIG function/s;
1135 :     # Insure the time stamp is valid.
1136 :     if ($timestamp =~ /^\d+$/) {
1137 :     # Here it's a number. We need to insure the one we use to form
1138 :     # the key is unique.
1139 :     my $keyStamp = $timestamp;
1140 :     while ($seenTimestamps{"$peg:$keyStamp"}) {
1141 :     $keyStamp++;
1142 : parrello 1.1 }
1143 : parrello 1.36 my $annotationID = "$peg:$keyStamp";
1144 :     $seenTimestamps{$annotationID} = 1;
1145 :     # Insure the user exists.
1146 :     if (! $users{$user}) {
1147 :     $loadSproutUser->Put($user, "SEED user");
1148 :     $loadUserAccess->Put($user, 1);
1149 :     $users{$user} = 1;
1150 :     }
1151 :     # Generate the annotation.
1152 :     $loadAnnotation->Put($annotationID, $timestamp, $text);
1153 :     $loadIsTargetOfAnnotation->Put($peg, $annotationID);
1154 :     $loadMadeAnnotation->Put($user, $annotationID);
1155 :     } else {
1156 :     # Here we have an invalid time stamp.
1157 :     Trace("Invalid time stamp \"$timestamp\" in annotations for $peg.") if T(1);
1158 : parrello 1.1 }
1159 :     }
1160 :     }
1161 :     }
1162 :     # Finish the load.
1163 :     my $retVal = $self->_FinishAll();
1164 :     return $retVal;
1165 :     }
1166 :    
1167 : parrello 1.5 =head3 LoadSourceData
1168 :    
1169 :     C<< my $stats = $spl->LoadSourceData(); >>
1170 :    
1171 :     Load the source data from FIG into Sprout.
1172 :    
1173 :     Source data links genomes to information about the organizations that
1174 :     mapped it.
1175 :    
1176 :     The following relations are loaded by this method.
1177 :    
1178 :     ComesFrom
1179 :     Source
1180 :     SourceURL
1181 :    
1182 :     There is no direct support for source attribution in FIG, so we access the SEED
1183 :     files directly.
1184 :    
1185 :     =over 4
1186 :    
1187 :     =item RETURNS
1188 :    
1189 :     Returns a statistics object for the loads.
1190 :    
1191 :     =back
1192 :    
1193 :     =cut
1194 :     #: Return Type $%;
1195 :     sub LoadSourceData {
1196 :     # Get this object instance.
1197 :     my ($self) = @_;
1198 :     # Get the FIG object.
1199 :     my $fig = $self->{fig};
1200 :     # Get the genome hash.
1201 :     my $genomeHash = $self->{genomes};
1202 :     # Create load objects for each of the tables we're loading.
1203 : parrello 1.25 my $loadComesFrom = $self->_TableLoader('ComesFrom', $self->PrimaryOnly);
1204 : parrello 1.23 my $loadSource = $self->_TableLoader('Source');
1205 :     my $loadSourceURL = $self->_TableLoader('SourceURL');
1206 :     if ($self->{options}->{loadOnly}) {
1207 :     Trace("Loading from existing files.") if T(2);
1208 :     } else {
1209 :     Trace("Generating annotation data.") if T(2);
1210 :     # Create hashes to collect the Source information.
1211 :     my %sourceURL = ();
1212 :     my %sourceDesc = ();
1213 :     # Loop through the genomes.
1214 :     my $line;
1215 :     for my $genomeID (sort keys %{$genomeHash}) {
1216 :     Trace("Processing $genomeID.") if T(3);
1217 :     # Open the project file.
1218 :     if ((open(TMP, "<$FIG_Config::organisms/$genomeID/PROJECT")) &&
1219 :     defined($line = <TMP>)) {
1220 :     chomp $line;
1221 :     my($sourceID, $desc, $url) = split(/\t/,$line);
1222 :     $loadComesFrom->Put($genomeID, $sourceID);
1223 :     if ($url && ! exists $sourceURL{$sourceID}) {
1224 :     $loadSourceURL->Put($sourceID, $url);
1225 :     $sourceURL{$sourceID} = 1;
1226 :     }
1227 :     if ($desc) {
1228 :     $sourceDesc{$sourceID} = $desc;
1229 :     } elsif (! exists $sourceDesc{$sourceID}) {
1230 :     $sourceDesc{$sourceID} = $sourceID;
1231 :     }
1232 : parrello 1.5 }
1233 : parrello 1.23 close TMP;
1234 :     }
1235 :     # Write the source descriptions.
1236 :     for my $sourceID (keys %sourceDesc) {
1237 :     $loadSource->Put($sourceID, $sourceDesc{$sourceID});
1238 : parrello 1.5 }
1239 : parrello 1.16 }
1240 : parrello 1.5 # Finish the load.
1241 :     my $retVal = $self->_FinishAll();
1242 :     return $retVal;
1243 :     }
1244 :    
1245 : parrello 1.6 =head3 LoadExternalData
1246 :    
1247 :     C<< my $stats = $spl->LoadExternalData(); >>
1248 :    
1249 :     Load the external data from FIG into Sprout.
1250 :    
1251 :     External data contains information about external feature IDs.
1252 :    
1253 :     The following relations are loaded by this method.
1254 :    
1255 :     ExternalAliasFunc
1256 :     ExternalAliasOrg
1257 :    
1258 :     The support for external IDs in FIG is hidden beneath layers of other data, so
1259 :     we access the SEED files directly to create these tables. This is also one of
1260 :     the few load methods that does not proceed genome by genome.
1261 :    
1262 :     =over 4
1263 :    
1264 :     =item RETURNS
1265 :    
1266 :     Returns a statistics object for the loads.
1267 :    
1268 :     =back
1269 :    
1270 :     =cut
1271 :     #: Return Type $%;
1272 :     sub LoadExternalData {
1273 :     # Get this object instance.
1274 :     my ($self) = @_;
1275 :     # Get the FIG object.
1276 :     my $fig = $self->{fig};
1277 :     # Get the genome hash.
1278 :     my $genomeHash = $self->{genomes};
1279 :     # Convert the genome hash. We'll get the genus and species for each genome and make
1280 :     # it the key.
1281 :     my %speciesHash = map { $fig->genus_species($_) => $_ } (keys %{$genomeHash});
1282 :     # Create load objects for each of the tables we're loading.
1283 : parrello 1.23 my $loadExternalAliasFunc = $self->_TableLoader('ExternalAliasFunc');
1284 :     my $loadExternalAliasOrg = $self->_TableLoader('ExternalAliasOrg');
1285 :     if ($self->{options}->{loadOnly}) {
1286 :     Trace("Loading from existing files.") if T(2);
1287 :     } else {
1288 :     Trace("Generating external data.") if T(2);
1289 :     # We loop through the files one at a time. First, the organism file.
1290 : parrello 1.58 Open(\*ORGS, "sort +0 -1 -u -t\"\t\" $FIG_Config::global/ext_org.table |");
1291 : parrello 1.23 my $orgLine;
1292 :     while (defined($orgLine = <ORGS>)) {
1293 :     # Clean the input line.
1294 :     chomp $orgLine;
1295 :     # Parse the organism name.
1296 :     my ($protID, $name) = split /\s*\t\s*/, $orgLine;
1297 :     $loadExternalAliasOrg->Put($protID, $name);
1298 :     }
1299 :     close ORGS;
1300 :     # Now the function file.
1301 :     my $funcLine;
1302 : parrello 1.58 Open(\*FUNCS, "sort +0 -1 -u -t\"\t\" $FIG_Config::global/ext_func.table |");
1303 : parrello 1.23 while (defined($funcLine = <FUNCS>)) {
1304 :     # Clean the line ending.
1305 :     chomp $funcLine;
1306 :     # Only proceed if the line is non-blank.
1307 :     if ($funcLine) {
1308 :     # Split it into fields.
1309 :     my @funcFields = split /\s*\t\s*/, $funcLine;
1310 :     # If there's an EC number, append it to the description.
1311 :     if ($#funcFields >= 2 && $funcFields[2] =~ /^(EC .*\S)/) {
1312 :     $funcFields[1] .= " $1";
1313 :     }
1314 :     # Output the function line.
1315 :     $loadExternalAliasFunc->Put(@funcFields[0,1]);
1316 : parrello 1.6 }
1317 :     }
1318 :     }
1319 :     # Finish the load.
1320 :     my $retVal = $self->_FinishAll();
1321 :     return $retVal;
1322 :     }
1323 : parrello 1.5
1324 : parrello 1.18
1325 :     =head3 LoadReactionData
1326 :    
1327 :     C<< my $stats = $spl->LoadReactionData(); >>
1328 :    
1329 :     Load the reaction data from FIG into Sprout.
1330 :    
1331 :     Reaction data connects reactions to the compounds that participate in them.
1332 :    
1333 :     The following relations are loaded by this method.
1334 :    
1335 : parrello 1.20 Reaction
1336 : parrello 1.18 ReactionURL
1337 :     Compound
1338 :     CompoundName
1339 :     CompoundCAS
1340 :     IsAComponentOf
1341 :    
1342 :     This method proceeds reaction by reaction rather than genome by genome.
1343 :    
1344 :     =over 4
1345 :    
1346 :     =item RETURNS
1347 :    
1348 :     Returns a statistics object for the loads.
1349 :    
1350 :     =back
1351 :    
1352 :     =cut
1353 :     #: Return Type $%;
1354 :     sub LoadReactionData {
1355 :     # Get this object instance.
1356 :     my ($self) = @_;
1357 :     # Get the FIG object.
1358 :     my $fig = $self->{fig};
1359 :     # Create load objects for each of the tables we're loading.
1360 : parrello 1.23 my $loadReaction = $self->_TableLoader('Reaction');
1361 : parrello 1.25 my $loadReactionURL = $self->_TableLoader('ReactionURL', $self->PrimaryOnly);
1362 :     my $loadCompound = $self->_TableLoader('Compound', $self->PrimaryOnly);
1363 :     my $loadCompoundName = $self->_TableLoader('CompoundName', $self->PrimaryOnly);
1364 :     my $loadCompoundCAS = $self->_TableLoader('CompoundCAS', $self->PrimaryOnly);
1365 :     my $loadIsAComponentOf = $self->_TableLoader('IsAComponentOf', $self->PrimaryOnly);
1366 : parrello 1.23 if ($self->{options}->{loadOnly}) {
1367 :     Trace("Loading from existing files.") if T(2);
1368 :     } else {
1369 :     Trace("Generating annotation data.") if T(2);
1370 :     # First we create the compounds.
1371 :     my @compounds = $fig->all_compounds();
1372 :     for my $cid (@compounds) {
1373 :     # Check for names.
1374 :     my @names = $fig->names_of_compound($cid);
1375 :     # Each name will be given a priority number, starting with 1.
1376 :     my $prio = 1;
1377 :     for my $name (@names) {
1378 :     $loadCompoundName->Put($cid, $name, $prio++);
1379 :     }
1380 :     # Create the main compound record. Note that the first name
1381 :     # becomes the label.
1382 :     my $label = (@names > 0 ? $names[0] : $cid);
1383 :     $loadCompound->Put($cid, $label);
1384 :     # Check for a CAS ID.
1385 :     my $cas = $fig->cas($cid);
1386 :     if ($cas) {
1387 :     $loadCompoundCAS->Put($cid, $cas);
1388 :     }
1389 : parrello 1.20 }
1390 : parrello 1.23 # All the compounds are set up, so we need to loop through the reactions next. First,
1391 :     # we initialize the discriminator index. This is a single integer used to insure
1392 :     # duplicate elements in a reaction are not accidentally collapsed.
1393 :     my $discrim = 0;
1394 :     my @reactions = $fig->all_reactions();
1395 :     for my $reactionID (@reactions) {
1396 :     # Create the reaction record.
1397 :     $loadReaction->Put($reactionID, $fig->reversible($reactionID));
1398 :     # Compute the reaction's URL.
1399 :     my $url = HTML::reaction_link($reactionID);
1400 :     # Put it in the ReactionURL table.
1401 :     $loadReactionURL->Put($reactionID, $url);
1402 :     # Now we need all of the reaction's compounds. We get these in two phases,
1403 :     # substrates first and then products.
1404 :     for my $product (0, 1) {
1405 :     # Get the compounds of the current type for the current reaction. FIG will
1406 :     # give us 3-tuples: [ID, stoichiometry, main-flag]. At this time we do not
1407 :     # have location data in SEED, so it defaults to the empty string.
1408 :     my @compounds = $fig->reaction2comp($reactionID, $product);
1409 :     for my $compData (@compounds) {
1410 :     # Extract the compound data from the current tuple.
1411 :     my ($cid, $stoich, $main) = @{$compData};
1412 :     # Link the compound to the reaction.
1413 :     $loadIsAComponentOf->Put($cid, $reactionID, $discrim++, "", $main,
1414 :     $product, $stoich);
1415 :     }
1416 : parrello 1.18 }
1417 :     }
1418 :     }
1419 :     # Finish the load.
1420 :     my $retVal = $self->_FinishAll();
1421 :     return $retVal;
1422 :     }
1423 :    
1424 : parrello 1.5 =head3 LoadGroupData
1425 :    
1426 :     C<< my $stats = $spl->LoadGroupData(); >>
1427 :    
1428 :     Load the genome Groups into Sprout.
1429 :    
1430 :     The following relations are loaded by this method.
1431 :    
1432 :     GenomeGroups
1433 :    
1434 : parrello 1.68 Currently, we do not use groups. We used to use them for NMPDR groups,
1435 :     butThere is no direct support for genome groups in FIG, so we access the SEED
1436 : parrello 1.5 files directly.
1437 :    
1438 :     =over 4
1439 :    
1440 :     =item RETURNS
1441 :    
1442 :     Returns a statistics object for the loads.
1443 :    
1444 :     =back
1445 :    
1446 :     =cut
1447 :     #: Return Type $%;
1448 :     sub LoadGroupData {
1449 :     # Get this object instance.
1450 :     my ($self) = @_;
1451 :     # Get the FIG object.
1452 :     my $fig = $self->{fig};
1453 :     # Get the genome hash.
1454 :     my $genomeHash = $self->{genomes};
1455 :     # Create a load object for the table we're loading.
1456 : parrello 1.23 my $loadGenomeGroups = $self->_TableLoader('GenomeGroups');
1457 :     if ($self->{options}->{loadOnly}) {
1458 :     Trace("Loading from existing files.") if T(2);
1459 :     } else {
1460 :     Trace("Generating group data.") if T(2);
1461 : parrello 1.68 # Currently there are no groups.
1462 : parrello 1.5 }
1463 :     # Finish the load.
1464 :     my $retVal = $self->_FinishAll();
1465 :     return $retVal;
1466 :     }
1467 :    
1468 : parrello 1.43 =head3 LoadSynonymData
1469 :    
1470 :     C<< my $stats = $spl->LoadSynonymData(); >>
1471 :    
1472 :     Load the synonym groups into Sprout.
1473 :    
1474 :     The following relations are loaded by this method.
1475 :    
1476 :     SynonymGroup
1477 :     IsSynonymGroupFor
1478 :    
1479 :     The source information for these relations is taken from the C<maps_to_id> method
1480 : parrello 1.56 of the B<FIG> object. Unfortunately, to make this work, we need to use direct
1481 :     SQL against the FIG database.
1482 : parrello 1.43
1483 :     =over 4
1484 :    
1485 :     =item RETURNS
1486 :    
1487 :     Returns a statistics object for the loads.
1488 :    
1489 :     =back
1490 :    
1491 :     =cut
1492 :     #: Return Type $%;
1493 :     sub LoadSynonymData {
1494 :     # Get this object instance.
1495 :     my ($self) = @_;
1496 :     # Get the FIG object.
1497 :     my $fig = $self->{fig};
1498 :     # Get the genome hash.
1499 :     my $genomeHash = $self->{genomes};
1500 :     # Create a load object for the table we're loading.
1501 :     my $loadSynonymGroup = $self->_TableLoader('SynonymGroup');
1502 :     my $loadIsSynonymGroupFor = $self->_TableLoader('IsSynonymGroupFor');
1503 :     if ($self->{options}->{loadOnly}) {
1504 :     Trace("Loading from existing files.") if T(2);
1505 :     } else {
1506 :     Trace("Generating synonym group data.") if T(2);
1507 : parrello 1.56 # Get the database handle.
1508 :     my $dbh = $fig->db_handle();
1509 :     # Ask for the synonyms.
1510 : parrello 1.59 my $sth = $dbh->prepare_command("SELECT maps_to, syn_id FROM peg_synonyms ORDER BY maps_to");
1511 : parrello 1.56 my $result = $sth->execute();
1512 :     if (! defined($result)) {
1513 :     Confess("Database error in Synonym load: " . $sth->errstr());
1514 :     } else {
1515 :     # Remember the current synonym.
1516 :     my $current_syn = "";
1517 :     # Count the features.
1518 :     my $featureCount = 0;
1519 :     # Loop through the synonym/peg pairs.
1520 :     while (my @row = $sth->fetchrow()) {
1521 :     # Get the synonym ID and feature ID.
1522 :     my ($syn_id, $peg) = @row;
1523 :     # Insure it's for one of our genomes.
1524 :     my $genomeID = FIG::genome_of($peg);
1525 :     if (exists $genomeHash->{$genomeID}) {
1526 :     # Verify the synonym.
1527 :     if ($syn_id ne $current_syn) {
1528 :     # It's new, so put it in the group table.
1529 :     $loadSynonymGroup->Put($syn_id);
1530 :     $current_syn = $syn_id;
1531 :     }
1532 :     # Connect the synonym to the peg.
1533 :     $loadIsSynonymGroupFor->Put($syn_id, $peg);
1534 :     # Count this feature.
1535 :     $featureCount++;
1536 :     if ($featureCount % 1000 == 0) {
1537 :     Trace("$featureCount features processed.") if T(3);
1538 :     }
1539 : parrello 1.43 }
1540 :     }
1541 :     }
1542 :     }
1543 :     # Finish the load.
1544 :     my $retVal = $self->_FinishAll();
1545 :     return $retVal;
1546 :     }
1547 :    
1548 : parrello 1.60 =head3 LoadFamilyData
1549 :    
1550 :     C<< my $stats = $spl->LoadFamilyData(); >>
1551 :    
1552 :     Load the protein families into Sprout.
1553 :    
1554 :     The following relations are loaded by this method.
1555 :    
1556 :     Family
1557 : parrello 1.63 IsFamilyForFeature
1558 : parrello 1.60
1559 :     The source information for these relations is taken from the C<families_for_protein>,
1560 :     C<family_function>, and C<sz_family> methods of the B<FIG> object.
1561 :    
1562 :     =over 4
1563 :    
1564 :     =item RETURNS
1565 :    
1566 :     Returns a statistics object for the loads.
1567 :    
1568 :     =back
1569 :    
1570 :     =cut
1571 :     #: Return Type $%;
1572 :     sub LoadFamilyData {
1573 :     # Get this object instance.
1574 :     my ($self) = @_;
1575 :     # Get the FIG object.
1576 :     my $fig = $self->{fig};
1577 :     # Get the genome hash.
1578 :     my $genomeHash = $self->{genomes};
1579 :     # Create load objects for the tables we're loading.
1580 :     my $loadFamily = $self->_TableLoader('Family');
1581 : parrello 1.63 my $loadIsFamilyForFeature = $self->_TableLoader('IsFamilyForFeature');
1582 : parrello 1.60 if ($self->{options}->{loadOnly}) {
1583 :     Trace("Loading from existing files.") if T(2);
1584 :     } else {
1585 :     Trace("Generating family data.") if T(2);
1586 :     # Create a hash for the family IDs.
1587 :     my %familyHash = ();
1588 :     # Loop through the genomes.
1589 :     for my $genomeID (sort keys %{$genomeHash}) {
1590 :     Trace("Processing features for $genomeID.") if T(2);
1591 :     # Loop through this genome's PEGs.
1592 :     for my $fid ($fig->all_features($genomeID, "peg")) {
1593 : parrello 1.63 $loadIsFamilyForFeature->Add("features", 1);
1594 : parrello 1.60 # Get this feature's families.
1595 :     my @families = $fig->families_for_protein($fid);
1596 :     # Loop through the families, connecting them to the feature.
1597 :     for my $family (@families) {
1598 : parrello 1.63 $loadIsFamilyForFeature->Put($family, $fid);
1599 : parrello 1.60 # If this is a new family, create a record for it.
1600 :     if (! exists $familyHash{$family}) {
1601 : parrello 1.62 $familyHash{$family} = 1;
1602 : parrello 1.60 $loadFamily->Add("families", 1);
1603 :     my $size = $fig->sz_family($family);
1604 :     my $func = $fig->family_function($family);
1605 : parrello 1.61 $loadFamily->Put($family, $size, $func);
1606 : parrello 1.60 }
1607 :     }
1608 :     }
1609 :     }
1610 :     }
1611 :     # Finish the load.
1612 :     my $retVal = $self->_FinishAll();
1613 :     return $retVal;
1614 :     }
1615 : parrello 1.43
1616 : parrello 1.76 =head3 LoadDrugData
1617 :    
1618 :     C<< my $stats = $spl->LoadDrugData(); >>
1619 :    
1620 :     Load the drug target data into Sprout.
1621 :    
1622 :     The following relations are loaded by this method.
1623 :    
1624 :     DrugProject
1625 :     ContainsTopic
1626 :     DrugTopic
1627 :     ContainsAnalysisOf
1628 :     PDB
1629 :     IncludesBound
1630 :     IsBoundIn
1631 :     BindsWith
1632 :     Ligand
1633 :     DescribesProteinForFeature
1634 :     FeatureConservation
1635 :    
1636 :     The source information for these relations is taken from flat files in the
1637 :     C<$FIG_Config::drug_directory>. The file C<master_tables.list> contains
1638 :     a list of drug project names paired with file names. The named file (in the
1639 :     same directory) contains all the data for the project.
1640 :    
1641 :     =over 4
1642 :    
1643 :     =item RETURNS
1644 :    
1645 :     Returns a statistics object for the loads.
1646 :    
1647 :     =back
1648 :    
1649 :     =cut
1650 :     #: Return Type $%;
1651 :     sub LoadDrugData {
1652 :     # Get this object instance.
1653 :     my ($self) = @_;
1654 :     # Get the FIG object.
1655 :     my $fig = $self->{fig};
1656 :     # Get the genome hash.
1657 :     my $genomeHash = $self->{genomes};
1658 :     # Create load objects for the tables we're loading.
1659 :     my $loadDrugProject = $self->_TableLoader('DrugProject');
1660 :     my $loadContainsTopic = $self->_TableLoader('ContainsTopic');
1661 :     my $loadDrugTopic = $self->_TableLoader('DrugTopic');
1662 :     my $loadContainsAnalysisOf = $self->_TableLoader('ContainsAnalysisOf');
1663 :     my $loadPDB = $self->_TableLoader('PDB');
1664 :     my $loadIncludesBound = $self->_TableLoader('IncludesBound');
1665 :     my $loadIsBoundIn = $self->_TableLoader('IsBoundIn');
1666 :     my $loadBindsWith = $self->_TableLoader('BindsWith');
1667 :     my $loadLigand = $self->_TableLoader('Ligand');
1668 :     my $loadDescribesProteinForFeature = $self->_TableLoader('DescribesProteinForFeature');
1669 :     my $loadFeatureConservation = $self->_TableLoader('FeatureConservation');
1670 :     if ($self->{options}->{loadOnly}) {
1671 :     Trace("Loading from existing files.") if T(2);
1672 :     } else {
1673 :     Trace("Generating drug target data.") if T(2);
1674 :     # Load the project list. The file comes in as a list of chomped lines,
1675 :     # and we split them on the TAB character to make the project name the
1676 :     # key and the file name the value of the resulting hash.
1677 :     my %projects = map { split /\t/, $_ } Tracer::GetFile("$FIG_Config::drug_directory/master_tables.list");
1678 :     # Create hashes for the derived objects: PDBs, Features, and Ligands. These objects
1679 :     # may occur multiple times in a single project file or even in multiple project
1680 :     # files.
1681 :     my %ligands = ();
1682 :     my %pdbs = ();
1683 :     my %features = ();
1684 :     my %bindings = ();
1685 :     # Set up a counter for drug topics. This will be used as the key.
1686 :     my $topicCounter = 0;
1687 :     # Loop through the projects. We sort the keys not because we need them sorted, but
1688 :     # because it makes it easier to infer our progress from trace messages.
1689 :     for my $project (sort keys %projects) {
1690 :     Trace("Processing project $project.") if T(3);
1691 :     # Only proceed if the download file exists.
1692 :     my $projectFile = "$FIG_Config::drug_directory/$projects{$project}";
1693 :     if (! -f $projectFile) {
1694 :     Trace("Project file $projectFile not found.") if T(0);
1695 :     } else {
1696 :     # Create the project record.
1697 :     $loadDrugProject->Put($project);
1698 :     # Create a hash for the topics. Each project has one or more topics. The
1699 :     # topic is identified by a URL, a category, and an identifier.
1700 :     my %topics = ();
1701 :     # Now we can open the project file.
1702 :     Trace("Reading project file $projectFile.") if T(3);
1703 :     Open(\*PROJECT, "<$projectFile");
1704 :     # Get the first record, which is a list of column headers. We don't use this
1705 :     # for anything, but it may be useful for debugging.
1706 :     my $headerLine = <PROJECT>;
1707 :     # Loop through the rest of the records.
1708 :     while (! eof PROJECT) {
1709 :     # Get the current line of data. Note that not all lines will have all
1710 :     # the fields. In particular, the CLIBE data is fairly rare.
1711 :     my ($authorOrganism, $category, $tag, $refURL, $peg, $conservation,
1712 :     $pdbBound, $pdbBoundEval, $pdbFree, $pdbFreeEval, $pdbFreeTitle,
1713 :     $protDistInfo, $passAspInfo, $passAspFile, $passWeightInfo,
1714 :     $passWeightFile, $clibeInfo, $clibeURL, $clibeTotalEnergy,
1715 :     $clibeVanderwaals, $clibeHBonds, $clibeEI, $clibeSolvationE)
1716 :     = Tracer::GetLine(\*PROJECT);
1717 :     # The tag contains an identifier for the current line of data followed
1718 :     # by a text statement that generally matches a property name in the
1719 :     # main database. We split it up, since the identifier goes with
1720 :     # the PDB data and the text statement is part of the topic.
1721 :     my ($lineID, $topicTag) = split /\s*,\s*/, $tag;
1722 :     $loadDrugProject->Add("data line");
1723 :     # Check for a new topic.
1724 :     my $topicData = "$category\t$topicTag\t$refURL";
1725 :     if (! exists $topics{$topicData}) {
1726 :     # Here we have a new topic. Compute its ID.
1727 :     $topicCounter++;
1728 :     $topics{$topicData} = $topicCounter;
1729 :     # Create its database record.
1730 :     $loadDrugTopic->Put($topicCounter, $refURL, $category, $authorOrganism,
1731 :     $topicTag);
1732 :     # Connect it to the project.
1733 :     $loadContainsTopic->Put($project, $topicCounter);
1734 :     $loadDrugTopic->Add("topic");
1735 :     }
1736 :     # Now we know the topic ID exists in the hash and the topic will
1737 :     # appear in the database, so we get this topic's ID.
1738 :     my $topicID = $topics{$topicData};
1739 :     # If the feature in this line is new, we need to save its conservation
1740 :     # number.
1741 :     if (! exists $features{$peg}) {
1742 :     $loadFeatureConservation->Put($peg, $conservation);
1743 :     $features{$peg} = 1;
1744 :     }
1745 :     # Now we have two PDBs to deal with-- a bound PDB and a free PDB.
1746 :     # The free PDB will have data about docking points; the bound PDB
1747 :     # will have data about docking. We store both types as PDBs, and
1748 :     # the special data comes from relationships. First we process the
1749 :     # bound PDB.
1750 :     if ($pdbBound) {
1751 :     $loadPDB->Add("bound line");
1752 :     # Insure this PDB is in the database.
1753 :     $self->CreatePDB($pdbBound, lc "$pdbFreeTitle (bound)", "bound", \%pdbs, $loadPDB);
1754 :     # Connect it to this topic.
1755 :     $loadIncludesBound->Put($topicID, $pdbBound);
1756 :     # Check for CLIBE data.
1757 :     if ($clibeInfo) {
1758 :     $loadLigand->Add("clibes");
1759 :     # We have CLIBE data, so we create a ligand and relate it to the PDB.
1760 :     if (! exists $ligands{$clibeInfo}) {
1761 :     # This is a new ligand, so create its record.
1762 :     $loadLigand->Put($clibeInfo);
1763 :     $loadLigand->Add("ligand");
1764 :     # Make sure we know this ligand already exists.
1765 :     $ligands{$clibeInfo} = 1;
1766 :     }
1767 :     # Now connect the PDB to the ligand using the CLIBE data.
1768 :     $loadBindsWith->Put($pdbBound, $clibeInfo, $clibeURL, $clibeHBonds, $clibeEI,
1769 :     $clibeSolvationE, $clibeVanderwaals);
1770 :     }
1771 :     # Connect this PDB to the feature.
1772 :     $loadDescribesProteinForFeature->Put($pdbBound, $peg, $protDistInfo, $pdbBoundEval);
1773 :     }
1774 :     # Next is the free PDB.
1775 :     if ($pdbFree) {
1776 :     $loadPDB->Add("free line");
1777 :     # Insure this PDB is in the database.
1778 :     $self->CreatePDB($pdbFree, lc $pdbFreeTitle, "free", \%pdbs, $loadPDB);
1779 :     # Connect it to this topic.
1780 :     $loadContainsAnalysisOf->Put($topicID, $pdbFree, $passAspInfo,
1781 :     $passWeightFile, $passWeightInfo, $passAspFile);
1782 :     # Connect this PDB to the feature.
1783 :     $loadDescribesProteinForFeature->Put($pdbFree, $peg, $protDistInfo, $pdbFreeEval);
1784 :     }
1785 :     # If we have both PDBs, we may need to link them.
1786 :     if ($pdbFree && $pdbBound) {
1787 :     $loadIsBoundIn->Add("connection");
1788 :     # Insure we only link them once.
1789 :     my $bindingKey = "$pdbFree\t$pdbBound";
1790 :     if (! exists $bindings{$bindingKey}) {
1791 :     $loadIsBoundIn->Add("newConnection");
1792 :     $loadIsBoundIn->Put($pdbFree, $pdbBound);
1793 :     $bindings{$bindingKey} = 1;
1794 :     }
1795 :     }
1796 :     }
1797 :     # Close off this project.
1798 :     close PROJECT;
1799 :     }
1800 :     }
1801 :     }
1802 :     # Finish the load.
1803 :     my $retVal = $self->_FinishAll();
1804 :     return $retVal;
1805 :     }
1806 : parrello 1.69
1807 :    
1808 : parrello 1.1 =head2 Internal Utility Methods
1809 :    
1810 : parrello 1.76 =head3 SpecialAttribute
1811 :    
1812 : parrello 1.77 C<< my $count = SproutLoad::SpecialAttribute($id, \@attributes, $idxMatch, \@idxValues, $pattern, $loader); >>
1813 : parrello 1.76
1814 :     Look for special attributes of a given type. A special attribute is found by comparing one of
1815 :     the columns of the incoming attribute list to a search pattern. If a match is found, then
1816 : parrello 1.77 a set of columns is put into an output table connected to the specified ID.
1817 : parrello 1.76
1818 :     For example, when processing features, the attribute list we look at has three columns: attribute
1819 :     name, attribute value, and attribute value HTML. The IEDB attribute exists if the attribute name
1820 :     begins with C<iedb_>. The call signature is therefore
1821 :    
1822 : parrello 1.77 my $found = SpecialAttribute($fid, \@attributeList, 0, [0,2], '^iedb_', $loadFeatureIEDB);
1823 : parrello 1.76
1824 :     The pattern is matched against column 0, and if we have a match, then column 2's value is put
1825 :     to the output along with the specified feature ID.
1826 :    
1827 :     =over 4
1828 :    
1829 :     =item id
1830 :    
1831 :     ID of the object whose special attributes are being loaded. This forms the first column of the
1832 :     output.
1833 :    
1834 :     =item attributes
1835 :    
1836 :     Reference to a list of tuples.
1837 :    
1838 :     =item idxMatch
1839 :    
1840 :     Index in each tuple of the column to be matched against the pattern. If the match is
1841 :     successful, an output record will be generated.
1842 :    
1843 : parrello 1.77 =item idxValues
1844 : parrello 1.76
1845 : parrello 1.77 Reference to a list containing the indexes in each tuple of the columns to be put as
1846 :     the second column of the output.
1847 : parrello 1.76
1848 :     =item pattern
1849 :    
1850 :     Pattern to be matched against the specified column. The match will be case-insensitive.
1851 :    
1852 :     =item loader
1853 :    
1854 :     An object to which each output record will be put. Usually this is an B<ERDBLoad> object,
1855 :     but technically it could be anything with a C<Put> method.
1856 :    
1857 :     =item RETURN
1858 :    
1859 :     Returns a count of the matches found.
1860 :    
1861 :     =item
1862 :    
1863 :     =back
1864 :    
1865 :     =cut
1866 :    
1867 :     sub SpecialAttribute {
1868 :     # Get the parameters.
1869 : parrello 1.77 my ($id, $attributes, $idxMatch, $idxValues, $pattern, $loader) = @_;
1870 : parrello 1.76 # Declare the return variable.
1871 :     my $retVal = 0;
1872 :     # Loop through the attribute rows.
1873 :     for my $row (@{$attributes}) {
1874 :     # Check for a match.
1875 :     if ($row->[$idxMatch] =~ m/$pattern/i) {
1876 : parrello 1.77 # We have a match, so output a row. This is a bit tricky, since we may
1877 :     # be putting out multiple columns of data from the input.
1878 :     my $value = join(" ", map { $row->[$_] } @{$idxValues});
1879 :     $loader->Put($id, $value);
1880 : parrello 1.76 $retVal++;
1881 :     }
1882 :     }
1883 :     Trace("$retVal special attributes found for $id and loader " . $loader->RelName() . ".") if T(4) && $retVal;
1884 :     # Return the number of matches.
1885 :     return $retVal;
1886 :     }
1887 :    
1888 :     =head3 CreatePDB
1889 :    
1890 :     C<< $loader->CreatePDB($pdbID, $title, $type, \%pdbHash); >>
1891 :    
1892 :     Insure that a PDB record exists for the identified PDB. If one does not exist, it will be
1893 :     created.
1894 :    
1895 :     =over 4
1896 :    
1897 :     =item pdbID
1898 :    
1899 :     ID string (usually an unqualified file name) for the desired PDB.
1900 :    
1901 :     =item title
1902 :    
1903 :     Title to use if the PDB must be created.
1904 :    
1905 :     =item type
1906 :    
1907 :     Type of PDB: C<free> or C<bound>
1908 :    
1909 :     =item pdbHash
1910 :    
1911 :     Hash containing the IDs of PDBs that have already been created.
1912 :    
1913 :     =item pdbLoader
1914 :    
1915 :     Load object for the PDB table.
1916 :    
1917 :     =back
1918 :    
1919 :     =cut
1920 :    
1921 :     sub CreatePDB {
1922 :     # Get the parameters.
1923 :     my ($self, $pdbID, $title, $type, $pdbHash, $pdbLoader) = @_;
1924 :     $pdbLoader->Add("PDB check");
1925 :     # Check to see if this is a new PDB.
1926 :     if (! exists $pdbHash->{$pdbID}) {
1927 :     # It is, so we create it.
1928 :     $pdbLoader->Put($pdbID, $title, $type);
1929 :     $pdbHash->{$pdbID} = 1;
1930 :     # Count it.
1931 :     $pdbLoader->Add("PDB-$type");
1932 :     }
1933 :     }
1934 :    
1935 : parrello 1.1 =head3 TableLoader
1936 :    
1937 :     Create an ERDBLoad object for the specified table. The object is also added to
1938 :     the internal list in the C<loaders> property of this object. That enables the
1939 :     L</FinishAll> method to terminate all the active loads.
1940 :    
1941 :     This is an instance method.
1942 :    
1943 :     =over 4
1944 :    
1945 :     =item tableName
1946 :    
1947 :     Name of the table (relation) being loaded.
1948 :    
1949 : parrello 1.25 =item ignore
1950 :    
1951 :     TRUE if the table should be ignored entirely, else FALSE.
1952 :    
1953 : parrello 1.1 =item RETURN
1954 :    
1955 :     Returns an ERDBLoad object for loading the specified table.
1956 :    
1957 :     =back
1958 :    
1959 :     =cut
1960 :    
1961 :     sub _TableLoader {
1962 :     # Get the parameters.
1963 : parrello 1.25 my ($self, $tableName, $ignore) = @_;
1964 : parrello 1.1 # Create the load object.
1965 : parrello 1.25 my $retVal = ERDBLoad->new($self->{erdb}, $tableName, $self->{loadDirectory}, $self->LoadOnly,
1966 :     $ignore);
1967 : parrello 1.1 # Cache it in the loader list.
1968 :     push @{$self->{loaders}}, $retVal;
1969 :     # Return it to the caller.
1970 :     return $retVal;
1971 :     }
1972 :    
1973 :     =head3 FinishAll
1974 :    
1975 :     Finish all the active loads on this object.
1976 :    
1977 :     When a load is started by L</TableLoader>, the controlling B<ERDBLoad> object is cached in
1978 :     the list pointed to be the C<loaders> property of this object. This method pops the loaders
1979 :     off the list and finishes them to flush out any accumulated residue.
1980 :    
1981 :     This is an instance method.
1982 :    
1983 :     =over 4
1984 :    
1985 :     =item RETURN
1986 :    
1987 :     Returns a statistics object containing the accumulated statistics for the load.
1988 :    
1989 :     =back
1990 :    
1991 :     =cut
1992 :    
1993 :     sub _FinishAll {
1994 :     # Get this object instance.
1995 :     my ($self) = @_;
1996 :     # Create the statistics object.
1997 :     my $retVal = Stats->new();
1998 :     # Get the loader list.
1999 :     my $loadList = $self->{loaders};
2000 : parrello 1.48 # Create a hash to hold the statistics objects, keyed on relation name.
2001 :     my %loaderHash = ();
2002 : parrello 1.1 # Loop through the list, finishing the loads. Note that if the finish fails, we die
2003 : parrello 1.48 # ignominiously. At some future point, we want to make the loads more restartable.
2004 : parrello 1.1 while (my $loader = pop @{$loadList}) {
2005 : parrello 1.26 # Get the relation name.
2006 : parrello 1.19 my $relName = $loader->RelName;
2007 : parrello 1.26 # Check the ignore flag.
2008 :     if ($loader->Ignore) {
2009 :     Trace("Relation $relName not loaded.") if T(2);
2010 :     } else {
2011 :     # Here we really need to finish.
2012 :     Trace("Finishing $relName.") if T(2);
2013 :     my $stats = $loader->Finish();
2014 : parrello 1.48 $loaderHash{$relName} = $stats;
2015 :     }
2016 :     }
2017 :     # Now we loop through again, actually loading the tables. We want to finish before
2018 :     # loading so that if something goes wrong at this point, all the load files are usable
2019 :     # and we don't have to redo all that work.
2020 :     for my $relName (sort keys %loaderHash) {
2021 :     # Get the statistics for this relation.
2022 :     my $stats = $loaderHash{$relName};
2023 :     # Check for a database load.
2024 :     if ($self->{options}->{dbLoad}) {
2025 :     # Here we want to use the load file just created to load the database.
2026 :     Trace("Loading relation $relName.") if T(2);
2027 :     my $newStats = $self->{sprout}->LoadUpdate(1, [$relName]);
2028 :     # Accumulate the statistics from the DB load.
2029 :     $stats->Accumulate($newStats);
2030 :     }
2031 :     $retVal->Accumulate($stats);
2032 :     Trace("Statistics for $relName:\n" . $stats->Show()) if T(2);
2033 : parrello 1.1 }
2034 :     # Return the load statistics.
2035 :     return $retVal;
2036 :     }
2037 : parrello 1.80 =head3 GetGenomeAttributes
2038 :    
2039 :     C<< my $aHashRef = GetGenomeAttributes($fig, $genomeID, \@fids); >>
2040 :    
2041 :     Return a hash of attributes keyed on feature ID. This method gets all the attributes
2042 :     for all the features of a genome in a single call, then organizes them into a hash.
2043 :    
2044 :     =over 4
2045 :    
2046 :     =item fig
2047 :    
2048 :     FIG-like object for accessing attributes.
2049 :    
2050 :     =item genomeID
2051 :    
2052 :     ID of the genome who's attributes are desired.
2053 :    
2054 :     =item fids
2055 :    
2056 :     Reference to a list of the feature IDs whose attributes are to be kept.
2057 :    
2058 :     =item RETURN
2059 :    
2060 :     Returns a reference to a hash. The key of the hash is the feature ID. The value is the
2061 :     reference to a list of the feature's attribute tuples. Each tuple contains the feature ID,
2062 :     the attribute key, and one or more attribute values.
2063 :    
2064 :     =back
2065 :    
2066 :     =cut
2067 :    
2068 :     sub GetGenomeAttributes {
2069 :     # Get the parameters.
2070 :     my ($fig, $genomeID, $fids) = @_;
2071 :     # Declare the return variable.
2072 :     my $retVal = {};
2073 :     # Get the attributes.
2074 :     my @aList = $fig->get_attributes("fig|$genomeID%");
2075 :     # Initialize the hash. This not only enables us to easily determine which FIDs to
2076 :     # keep, it insures that the caller sees a list reference for every known fid,
2077 :     # simplifying the logic.
2078 :     for my $fid (@{$fids}) {
2079 :     $retVal->{$fid} = [];
2080 :     }
2081 :     # Populate the hash.
2082 :     for my $aListEntry (@aList) {
2083 :     my $fid = $aListEntry->[0];
2084 :     if (exists $retVal->{$fid}) {
2085 :     push @{$retVal->{$fid}}, $aListEntry;
2086 :     }
2087 :     }
2088 :     # Return the result.
2089 :     return $retVal;
2090 :     }
2091 : parrello 1.1
2092 :     1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3