Revision 1.34
1 : | parrello | 1.1 | #!/usr/bin/perl -w |
2 : | |||
3 : | =head1 Load Sprout Tables | ||
4 : | |||
5 : | parrello | 1.12 | =head2 Introduction |
6 : | |||
7 : | parrello | 1.14 | The Sprout database reflects a snapshot of the SEED taken at a particular point in |
8 : | time. At some point in the future, it will be possible to add annotations to the | ||
9 : | Sprout data. All records added to Sprout after the snapshot is taken are | ||
10 : | specially marked so that the changes can be copied to the SEED. The SEED remains | ||
11 : | the live version of the data. | ||
12 : | |||
13 : | The snapshot is produced by reading the SEED data and writing it to sequential | ||
14 : | files. There is one file per Sprout table, and each such file's name consists of | ||
15 : | the table name with the suffix C<dtx>. Thus, the file for the C<Genome> table | ||
16 : | would be named C<Genome.dtx>. These files are used to load the actual Sprout | ||
17 : | database and to generate Glimpse indices. | ||
18 : | |||
19 : | To load all the Sprout tables and then validate the result, you need to issue three | ||
20 : | commands. | ||
21 : | |||
22 : | LoadSproutTables -dbLoad -dbCreate "*" | ||
23 : | parrello | 1.27 | TestSproutLoad [genomeID] ... |
24 : | index_sprout_lucene | ||
25 : | |||
26 : | where I<[genomeID]> is zero or more genome IDs. These genomes will be tested more | ||
27 : | thoroughly than the others. | ||
28 : | parrello | 1.14 | |
29 : | All three commands send output to the console. In addition, C<LoadSproutTables> and | ||
30 : | parrello | 1.27 | C<TestSproutLoad> write tracing information to a trace log in the FIG temporary |
31 : | parrello | 1.14 | directory (B<$FIG_Config::Tmp>). At the bottom of the log file will be a complete |
32 : | list of errors. If errors occur in C<LoadSproutTables>, then the data must be corrected | ||
33 : | and the offending table group reloaded. So, for example, if there are errors in the | ||
34 : | load of the B<MadeAnnotation> and B<Compound> tables, you would need to run | ||
35 : | |||
36 : | LoadSproutTables -dbLoad Annotation Reaction | ||
37 : | |||
38 : | because B<MadeAnnotation> is in the C<Annotation> group, and B<Compound> is in the | ||
39 : | C<Reaction> group. A list of the groups is given below. | ||
40 : | |||
41 : | You can omit the C<dbLoad> option to create the load files without | ||
42 : | loading the database, and you can add a C<trace> option to change the trace level. | ||
43 : | The command below creates the Genome-related load files with a trace level of 3 and | ||
44 : | does not load them into the Sprout database. | ||
45 : | |||
46 : | LoadSproutTables -trace=3 Genome | ||
47 : | |||
48 : | C<LoadSproutTables> takes a long time to run, so setting the trace level to 3 helps | ||
49 : | to give you an idea of the progress. | ||
50 : | |||
51 : | Once the Sprout database is loaded, B<TestSproutLoad> can be used to verify the load | ||
52 : | parrello | 1.27 | against the FIG data. The end of the trace log file will contain statistics on |
53 : | the errors found. Like C<LoadSproutTables>, C<TestSproutLoad> is a time-consuming | ||
54 : | parrello | 1.14 | script, so you may want to set the trace level to 3 to see visible progress. |
55 : | |||
56 : | parrello | 1.27 | TestSproutLoad -trace=3 [genomeID] ... |
57 : | |||
58 : | The I<[genomeID]> specifies zero or more IDs of genomes to receive more thorough | ||
59 : | testing. So, for example, | ||
60 : | |||
61 : | TestSproutLoad -trace=3 100226.1 83333.1 | ||
62 : | |||
63 : | would do thorough testing of I<Streptomyces coelicolor A3(2)> (100226.1) and | ||
64 : | I<Escherichia coli K12> (83333.1). | ||
65 : | parrello | 1.14 | |
66 : | Unlike C<LoadSproutTables>, in C<TestSproutLoad>, the individual errors found are | ||
67 : | mixed in with the trace messages. They are all, however, marked with a trace type | ||
68 : | of B<Problem>, as shown in the fragment below. | ||
69 : | |||
70 : | 11/02/2005 19:15:16 <main>: Processing feature fig|100226.1.peg.7742. | ||
71 : | 11/02/2005 19:15:17 <main>: Processing feature fig|100226.1.peg.7741. | ||
72 : | 11/02/2005 19:15:17 <Problem>: assignment "Short-chain dehydrodenase ... | ||
73 : | 11/02/2005 19:15:17 <Problem>: assignment "putative oxidoreductase." ... | ||
74 : | 11/02/2005 19:15:17 <Problem>: Incorrect assignment for fig|100226.1.peg.7741... | ||
75 : | 11/02/2005 19:15:17 <Problem>: Incorrect number of annotations found in ... | ||
76 : | 11/02/2005 19:15:17 <main>: Processing feature fig|100226.1.peg.7740. | ||
77 : | 11/02/2005 19:15:18 <main>: Processing feature fig|100226.1.peg.7739. | ||
78 : | |||
79 : | The test may reveal that some tables need to be reloaded, or that a software | ||
80 : | problem has crept into the Sprout. | ||
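Because C<TestSproutLoad> mixes its errors in with ordinary trace messages, it can help to pull the B<Problem> entries out of the log before reading it. The following is a minimal Perl sketch, assuming the line layout shown in the fragment above (date, time, bracketed trace type, colon, message). The helper name C<ExtractProblems> and the sample lines are illustrative only and are not part of the Sprout code.

```perl
use strict;
use warnings;

# Hypothetical helper: return the message text of every trace line whose
# trace type is "Problem". Assumes the layout shown in the fragment above:
# "MM/DD/YYYY HH:MM:SS <type>: message".
sub ExtractProblems {
    my (@lines) = @_;
    my @problems;
    for my $line (@lines) {
        if ($line =~ m{^\s*\d\d/\d\d/\d{4}\s+\d\d:\d\d:\d\d\s+<Problem>:\s*(.*?)\s*$}) {
            push @problems, $1;
        }
    }
    return @problems;
}

# Sample lines copied from the trace fragment above.
my @sample = (
    '11/02/2005 19:15:17 <main>: Processing feature fig|100226.1.peg.7741.',
    '11/02/2005 19:15:17 <Problem>: Incorrect assignment for fig|100226.1.peg.7741...',
    '11/02/2005 19:15:17 <Problem>: Incorrect number of annotations found in ...',
    '11/02/2005 19:15:17 <main>: Processing feature fig|100226.1.peg.7740.',
);
my @found = ExtractProblems(@sample);
print scalar(@found), " problem(s) found\n";
print "  $_\n" for @found;
```

In practice the lines would be read from the trace log in the FIG temporary directory rather than from an in-memory list.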
81 : | |||
82 : | parrello | 1.27 | Once all the tables have the correct data, C<index_sprout_lucene> can be run to create the |
83 : | Lucene search indexes. Lucene is a full-text search engine library produced by the Apache project. | ||
84 : | It is written in Java, and in order to run it you must have the B<LuceneSearch> and | ||
85 : | B<NmpdrConfigs> projects checked out from CVS and made. | ||
86 : | parrello | 1.14 | |
87 : | parrello | 1.28 | =head2 The NMPDR Web Site |
88 : | |||
89 : | Sprout is the database engine for the NMPDR web site. The NMPDR web site consists of two | ||
90 : | pieces that run on two different machines. The B<WEB> machine contains HTML pages | ||
91 : | generated by a Content Management Tool. | ||
92 : | |||
93 : | parrello | 1.14 | =head2 Procedure For Loading Sprout |
94 : | |||
95 : | parrello | 1.27 | In order to load the Sprout, you need to have the B<Sprout>, B<NmpdrConfigs>, and |
96 : | B<LuceneSearch> projects checked out from CVS in addition to the standard FIG | ||
97 : | projects. You must also set up the following B<FIG_Config.pm> variables in addition | ||
98 : | to the normal ones. | ||
99 : | |||
100 : | =over 4 | ||
101 : | |||
102 : | =item sproutData | ||
103 : | |||
104 : | Name of the data directory for the Sprout load files. | ||
105 : | |||
106 : | =item var | ||
107 : | |||
108 : | Name of the directory to contain cached NMPDR pages. The most important file in | ||
109 : | this directory is C<nmpdr_page_template.html>, which contains a skeleton page | ||
110 : | from the main NMPDR web site. This skeleton page is used to generate output | ||
111 : | pages that look like the other NMPDR pages. | ||
112 : | |||
113 : | =item java | ||
114 : | |||
115 : | Path to the Java runtime environment. | ||
116 : | |||
117 : | =item sproutDB | ||
118 : | |||
119 : | Name of the Sprout database. | ||
120 : | |||
121 : | =item dbuser | ||
122 : | |||
123 : | User name for logging into the Sprout database. | ||
124 : | |||
125 : | =item dbpass | ||
126 : | |||
127 : | Password for logging into the Sprout database. | ||
128 : | |||
129 : | =item nmpdr_site_url | ||
130 : | |||
131 : | URL for the NMPDR cover pages. The NMPDR cover pages are informational and text | ||
132 : | pages that serve as the entry point to the NMPDR web site. They are generated by | ||
133 : | a Content Management tool, and some Sprout scripts need to know where to find | ||
134 : | them. | ||
135 : | |||
136 : | =item nmpdr_site_template_id | ||
137 : | |||
138 : | Page number for the template page used to generate results that look like they're | ||
139 : | part of the NMPDR web site. | ||
140 : | |||
141 : | =back | ||
142 : | |||
143 : | parrello | 1.14 | =over 4 |
144 : | |||
145 : | parrello | 1.27 | The procedure for loading Sprout is as follows. |
146 : | |||
147 : | parrello | 1.14 | =item 1 |
148 : | |||
149 : | parrello | 1.25 | Type |
150 : | |||
151 : | nohup LoadSproutTables -dbLoad -dbCreate -user=you -background "*" >null & | ||
152 : | |||
153 : | parrello | 1.26 | where C<you> is your user ID, and press ENTER. This will create the C<dtx> files |
154 : | parrello | 1.25 | and load them. You may be asked for a password. If this is the case, simply |
155 : | press ENTER. If that does not work, use the C<dbpass> value specified in | ||
156 : | your C<FIG_Config.pm> file. | ||
157 : | |||
158 : | The above command line runs the load in the background. The standard output, | ||
159 : | standard error, and trace output will be directed to files in the FIG temporary | ||
160 : | directory. If your user name is C<Bruce> then the files will be named | ||
161 : | C<outBruce.log>, C<errBruce.log>, and C<traceBruce.log> respectively. | ||
162 : | |||
163 : | If the load fails at some point and you are able to correct the problem, use the | ||
164 : | C<resume> option to restart it. For example, if the load failed while doing the | ||
165 : | Feature load group, you would resume it using | ||
166 : | |||
167 : | nohup LoadSproutTables -dbLoad -dbCreate -user=you -resume -background Feature >null & | ||
168 : | parrello | 1.14 | |
169 : | =item 2 | ||
170 : | |||
171 : | parrello | 1.27 | Type |
172 : | |||
173 : | nohup TestSproutLoad -user=you -background 100226.1 83333.1 >null & | ||
174 : | |||
175 : | and press ENTER. This will validate the Sprout database against the SEED data. | ||
176 : | parrello | 1.14 | |
177 : | =item 3 | ||
178 : | |||
179 : | If any errors are detected in step (2), they are most likely due to a change in | ||
180 : | SEED that did not make it to Sprout. Contact Bruce Parrello or Robert Olson | ||
181 : | to get the code updated properly. | ||
182 : | |||
183 : | =item 4 | ||
184 : | |||
185 : | parrello | 1.27 | Type |
186 : | |||
187 : | index_sprout_lucene | ||
188 : | |||
189 : | and press ENTER. This will create the Lucene indexes for the Sprout data. | ||
190 : | |||
191 : | =item 5 | ||
192 : | |||
193 : | Change to the B<SproutData/Indexes> directory under B<FIGdisk> and look for the | ||
194 : | directory created by C<index_sprout_lucene>. The directory name will be | ||
195 : | something like C<Lucene.20060412-154112>. The numbers indicate the date and time | ||
196 : | the index was created. In this case it was 04/12/2006 03:41:12pm. Type | ||
197 : | |||
198 : | ln -sf directory Lucene | ||
199 : | |||
200 : | where C<directory> is the new directory name, to point the C<Lucene> directory to the | ||
201 : | new search index. | ||
202 : | parrello | 1.14 | |
203 : | =back | ||
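The index directory name described in step 5 encodes its creation time, so it can be decoded mechanically when deciding which directory to link. This is a small Perl sketch, assuming the name always has the form C<Lucene.YYYYMMDD-HHMMSS> as in the example above. The helper name C<IndexTimestamp> is hypothetical and does not exist in the Sprout code.

```perl
use strict;
use warnings;

# Hypothetical helper: decode the timestamp embedded in an index directory
# name of the form "Lucene.YYYYMMDD-HHMMSS". Returns nothing (undef in
# scalar context) if the name does not have that form.
sub IndexTimestamp {
    my ($dirName) = @_;
    my ($year, $mon, $day, $hour, $min, $sec) =
        $dirName =~ /^Lucene\.(\d{4})(\d\d)(\d\d)-(\d\d)(\d\d)(\d\d)$/
        or return;
    # Convert the 24-hour clock to the 12-hour form used in the text.
    my $suffix = ($hour >= 12 ? 'pm' : 'am');
    my $hour12 = $hour % 12 || 12;
    return sprintf("%02d/%02d/%04d %02d:%02d:%02d%s",
                   $mon, $day, $year, $hour12, $min, $sec, $suffix);
}

print IndexTimestamp('Lucene.20060412-154112'), "\n";   # 04/12/2006 03:41:12pm
```

Because the timestamp fields are fixed-width, a plain string sort of the directory names also orders them from oldest to newest.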
204 : | |||
205 : | =head2 LoadSproutTables Command | ||
206 : | |||
207 : | C<LoadSproutTables> creates the load files for Sprout tables and optionally loads them. | ||
208 : | parrello | 1.12 | The parameters are the names of the table groups whose data is to be created. |
209 : | The legal table group names are given below. | ||
210 : | parrello | 1.1 | |
211 : | =over 4 | ||
212 : | |||
213 : | =item Genome | ||
214 : | |||
215 : | Loads B<Genome>, B<HasContig>, B<Contig>, B<IsMadeUpOf>, and B<Sequence>. | ||
216 : | |||
217 : | parrello | 1.30 | =item Feature |
218 : | |||
219 : | Loads B<Feature>, B<FeatureAlias>, B<FeatureTranslation>, B<FeatureUpstream>, | ||
220 : | B<IsLocatedIn>, B<FeatureLink>. | ||
221 : | |||
222 : | parrello | 1.1 | =item Coupling |
223 : | |||
224 : | Loads B<Coupling>, B<IsEvidencedBy>, B<PCH>, B<ParticipatesInCoupling>, | ||
225 : | B<UsesAsEvidence>. | ||
226 : | |||
227 : | =item Subsystem | ||
228 : | |||
229 : | parrello | 1.2 | Loads B<Subsystem>, B<Role>, B<SSCell>, B<ContainsFeature>, B<IsGenomeOf>, |
230 : | parrello | 1.8 | B<IsRoleOf>, B<OccursInSubsystem>, B<ParticipatesIn>, B<HasSSCell>, |
231 : | parrello | 1.11 | B<Catalyzes>, B<ConsistsOfRoles>, B<RoleSubset>, B<HasRoleSubset>, |
232 : | parrello | 1.13 | B<ConsistsOfGenomes>, B<GenomeSubset>, B<HasGenomeSubset>, B<Diagram>, |
233 : | B<RoleOccursIn>. | ||
234 : | parrello | 1.1 | |
235 : | parrello | 1.2 | =item Annotation |
236 : | |||
237 : | Loads B<SproutUser>, B<UserAccess>, B<Annotation>, B<IsTargetOfAnnotation>, | ||
238 : | B<MadeAnnotation>. | ||
239 : | |||
240 : | =item Property | ||
241 : | |||
242 : | Loads B<Property>, B<HasProperty>. | ||
243 : | |||
244 : | =item BBH | ||
245 : | |||
246 : | Loads B<IsBidirectionalBestHitOf>. | ||
247 : | |||
248 : | parrello | 1.3 | =item Group |
249 : | |||
250 : | Loads B<GenomeGroups>. | ||
251 : | |||
252 : | =item Source | ||
253 : | |||
254 : | Loads B<Source>, B<ComesFrom>, B<SourceURL>. | ||
255 : | |||
256 : | parrello | 1.4 | =item External |
257 : | |||
258 : | Loads B<ExternalAliasOrg>, B<ExternalAliasFunc>. | ||
259 : | |||
260 : | parrello | 1.8 | =item Reaction |
261 : | |||
262 : | Loads B<ReactionURL>, B<Compound>, B<CompoundName>, | ||
263 : | parrello | 1.11 | B<CompoundCAS>, B<IsAComponentOf>, B<Reaction>. |
264 : | parrello | 1.8 | |
265 : | parrello | 1.31 | =item Synonym |
266 : | |||
267 : | Loads B<SynonymGroup> and B<IsSynonymGroupFor>. | ||
268 : | |||
269 : | parrello | 1.3 | =item * |
270 : | |||
271 : | Loads all of the above tables. | ||
272 : | |||
273 : | parrello | 1.1 | =back |
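When a load reports errors for particular tables, the groups to reload can be worked out from the list above. The sketch below copies that list into a Perl hash and inverts it. The helper name C<GroupsToReload> is hypothetical, and the hash is only as current as the list in this document.

```perl
use strict;
use warnings;

# Table groups, copied from the list above.
my %groupTables = (
    Genome     => [qw(Genome HasContig Contig IsMadeUpOf Sequence)],
    Feature    => [qw(Feature FeatureAlias FeatureTranslation FeatureUpstream
                      IsLocatedIn FeatureLink)],
    Coupling   => [qw(Coupling IsEvidencedBy PCH ParticipatesInCoupling
                      UsesAsEvidence)],
    Subsystem  => [qw(Subsystem Role SSCell ContainsFeature IsGenomeOf IsRoleOf
                      OccursInSubsystem ParticipatesIn HasSSCell Catalyzes
                      ConsistsOfRoles RoleSubset HasRoleSubset ConsistsOfGenomes
                      GenomeSubset HasGenomeSubset Diagram RoleOccursIn)],
    Annotation => [qw(SproutUser UserAccess Annotation IsTargetOfAnnotation
                      MadeAnnotation)],
    Property   => [qw(Property HasProperty)],
    BBH        => [qw(IsBidirectionalBestHitOf)],
    Group      => [qw(GenomeGroups)],
    Source     => [qw(Source ComesFrom SourceURL)],
    External   => [qw(ExternalAliasOrg ExternalAliasFunc)],
    Reaction   => [qw(ReactionURL Compound CompoundName CompoundCAS
                      IsAComponentOf Reaction)],
    Synonym    => [qw(SynonymGroup IsSynonymGroupFor)],
);

# Invert the map: table name => group name.
my %tableGroup;
for my $group (keys %groupTables) {
    $tableGroup{$_} = $group for @{ $groupTables{$group} };
}

# Hypothetical helper: list (sorted, without duplicates) the groups that
# must be reloaded for a set of failed tables. Unknown tables are ignored.
sub GroupsToReload {
    my (@tables) = @_;
    my %seen;
    my @groups = sort grep { defined $_ && !$seen{$_}++ }
                      map  { $tableGroup{$_} } @tables;
    return @groups;
}

print join(" ", "LoadSproutTables -dbLoad",
           GroupsToReload('MadeAnnotation', 'Compound')), "\n";
```

For the B<MadeAnnotation> and B<Compound> example from the introduction, this prints the same C<LoadSproutTables -dbLoad Annotation Reaction> command shown there.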
274 : | |||
275 : | parrello | 1.7 | The command-line options are given below. |
276 : | parrello | 1.1 | |
277 : | =over 4 | ||
278 : | |||
279 : | =item geneFile | ||
280 : | |||
281 : | The name of the file containing the genomes and their associated access codes. The | ||
282 : | file should have one line per genome, each line consisting of the genome ID followed | ||
283 : | by the access code, separated by a tab. If no file is specified, all complete genomes | ||
284 : | will be processed and the access code will be 1. | ||
285 : | |||
286 : | =item subsysFile | ||
287 : | |||
288 : | The name of the file containing the trusted subsystems. The file should have one line | ||
289 : | per trusted subsystem. If no file is specified, all subsystems will be trusted. | ||
290 : | |||
291 : | =item trace | ||
292 : | |||
293 : | Desired tracing level. The default is 3. | ||
294 : | |||
295 : | parrello | 1.25 | =item user |
296 : | |||
297 : | Suffix to use for trace, output, and error files created in the FIG temporary directory. | ||
298 : | |||
299 : | parrello | 1.10 | =item dbLoad |
300 : | |||
301 : | If TRUE, the database tables will be loaded automatically from the load files created. | ||
302 : | |||
303 : | parrello | 1.14 | =item dbCreate |
304 : | parrello | 1.1 | |
305 : | parrello | 1.14 | If TRUE, the database will be created. If the database exists already, it will be |
306 : | dropped. Use this option with caution. | ||
307 : | parrello | 1.12 | |
308 : | parrello | 1.17 | =item loadOnly |
309 : | |||
310 : | If TRUE, the database tables will be loaded from existing load files. Load files | ||
311 : | will not be created. This option is useful if you are setting up a copy of Sprout | ||
312 : | and have load files already set up from the original version. | ||
313 : | |||
314 : | parrello | 1.19 | =item primaryOnly |
315 : | |||
316 : | If TRUE, only the group's primary entity will be loaded. | ||
317 : | |||
318 : | parrello | 1.25 | =item background |
319 : | |||
320 : | Redirect the standard output and standard error to files in the FIG temporary directory. | ||
321 : | |||
322 : | =item resume | ||
323 : | |||
324 : | Resume an interrupted load, starting with the load group specified in the first | ||
325 : | positional parameter. | ||
326 : | |||
327 : | =item sql | ||
328 : | |||
329 : | Trace SQL statements. | ||
330 : | |||
331 : | parrello | 1.32 | =item phone |
332 : | |||
333 : | Phone number to message when the load finishes. | ||
334 : | |||
335 : | parrello | 1.14 | =back |
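The C<geneFile> format described above (one genome per line, genome ID and access code separated by a tab) is simple enough to check before starting a long load. The following Perl sketch shows one way to parse such lines. The helper name C<ParseGenomeLines>, the skipping of blank lines, the error on malformed lines, and the sample access codes are all assumptions for illustration. The real parsing is done inside C<SproutLoad> and may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: turn genome-list lines into a hash reference of
# genome ID => access code. Blank lines are skipped (an assumption), and a
# line without a tab-separated access code is treated as an error.
sub ParseGenomeLines {
    my (@lines) = @_;
    my %accessCodes;
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*$/;
        my ($genomeID, $accessCode) = split /\t/, $line;
        die "Malformed genome line: '$line'\n" unless defined $accessCode;
        $accessCodes{$genomeID} = $accessCode;
    }
    return \%accessCodes;
}

# Sample data: the genome IDs come from the examples in this document,
# the access codes are made up.
my $codes = ParseGenomeLines("100226.1\t1\n", "\n", "83333.1\t2\n");
print "$_ => $codes->{$_}\n" for sort keys %$codes;
```

To check a real file, the lines would be read from the path passed in the C<geneFile> option instead of the in-memory sample.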
336 : | parrello | 1.12 | |
337 : | parrello | 1.1 | =cut |
338 : | |||
339 : | use strict; | ||
340 : | use Tracer; | ||
341 : | use DocUtils; | ||
342 : | use Cwd; | ||
343 : | use FIG; | ||
344 : | use SFXlate; | ||
345 : | use File::Copy; | ||
346 : | use File::Path; | ||
347 : | use SproutLoad; | ||
348 : | use Stats; | ||
349 : | parrello | 1.9 | use SFXlate; |
350 : | parrello | 1.1 | |
351 : | # Get the command-line parameters and options. | ||
352 : | parrello | 1.17 | my ($options, @parameters) = StandardSetup(['SproutLoad', 'ERDBLoad', 'Stats', |
353 : | parrello | 1.26 | 'ERDB', 'Load', 'Sprout', 'Subsystem'], |
354 : | parrello | 1.18 | { geneFile => ["", "name of the genome list file"], |
355 : | subsysFile => ["", "name of the trusted subsystem file"], | ||
356 : | dbLoad => [0, "load the database from generated files"], | ||
357 : | dbCreate => [0, "drop and re-create the database"], | ||
358 : | parrello | 1.19 | loadOnly => [0, "load the database from previously generated files"], |
359 : | parrello | 1.23 | primaryOnly => [0, "only process the group's main entity"], |
360 : | resume => [0, "resume a complete load starting with the first group specified in the parameter list"], | ||
361 : | parrello | 1.32 | phone => ["", "phone number (international format) to call when load finishes"], |
362 : | parrello | 1.18 | }, |
363 : | "<group1> <group2> ...", | ||
364 : | parrello | 1.17 | @ARGV); |
365 : | # If we're doing a load-only, turn on loading. | ||
366 : | if ($options->{loadOnly}) { | ||
367 : | $options->{dbLoad} = 1 | ||
368 : | } | ||
369 : | parrello | 1.14 | if ($options->{dbCreate}) { |
370 : | # Here we want to drop and re-create the database. | ||
371 : | my $db = $FIG_Config::sproutDB; | ||
372 : | parrello | 1.20 | DBKernel::CreateDB($db); |
373 : | parrello | 1.14 | } |
374 : | parrello | 1.9 | # Create the sprout loader object. Note that the Sprout object does not |
375 : | parrello | 1.10 | # open the database unless the "dbLoad" option is turned on. |
376 : | parrello | 1.1 | my $fig = FIG->new(); |
377 : | parrello | 1.10 | my $sprout = SFXlate->new_sprout_only(undef, undef, undef, ! $options->{dbLoad}); |
378 : | parrello | 1.7 | my $spl = SproutLoad->new($sprout, $fig, $options->{geneFile}, $options->{subsysFile}, $options); |
379 : | parrello | 1.15 | # Ensure we have an output directory. |
380 : | FIG::verify_dir($FIG_Config::sproutData); | ||
381 : | parrello | 1.23 | # If we're resuming, we only want to have 1 parameter. |
382 : | my $resume = $options->{resume}; | ||
383 : | if ($resume && @parameters > 1) { | ||
384 : | Confess("If resume=1, only one load group can be specified."); | ||
385 : | } elsif (! @parameters) { | ||
386 : | parrello | 1.34 | Trace("No load groups were specified.") if T(0); |
387 : | parrello | 1.23 | } |
388 : | parrello | 1.32 | # Set a variable to contain return type information. |
389 : | my $rtype; | ||
390 : | # Ensure we catch errors. | ||
391 : | eval { | ||
392 : | # Process the parameters. | ||
393 : | for my $group (@parameters) { | ||
394 : | Trace("Processing load group $group.") if T(2); | ||
395 : | my $stats; | ||
396 : | if ($group eq 'Genome' || $group eq '*') { | ||
397 : | $spl->LoadGenomeData(); | ||
398 : | $group = ResumeCheck($resume, $group); | ||
399 : | } | ||
400 : | if ($group eq 'Feature' || $group eq '*') { | ||
401 : | $spl->LoadFeatureData(); | ||
402 : | $group = ResumeCheck($resume, $group); | ||
403 : | } | ||
404 : | if ($group eq 'Coupling' || $group eq '*') { | ||
405 : | $spl->LoadCouplingData(); | ||
406 : | $group = ResumeCheck($resume, $group); | ||
407 : | } | ||
408 : | if ($group eq 'Subsystem' || $group eq '*') { | ||
409 : | $spl->LoadSubsystemData(); | ||
410 : | $group = ResumeCheck($resume, $group); | ||
411 : | } | ||
412 : | if ($group eq 'Property' || $group eq '*') { | ||
413 : | $spl->LoadPropertyData(); | ||
414 : | $group = ResumeCheck($resume, $group); | ||
415 : | } | ||
416 : | if ($group eq 'Annotation' || $group eq '*') { | ||
417 : | $spl->LoadAnnotationData(); | ||
418 : | $group = ResumeCheck($resume, $group); | ||
419 : | } | ||
420 : | if ($group eq 'BBH' || $group eq '*') { | ||
421 : | $spl->LoadBBHData(); | ||
422 : | $group = ResumeCheck($resume, $group); | ||
423 : | } | ||
424 : | if ($group eq 'Group' || $group eq '*') { | ||
425 : | $spl->LoadGroupData(); | ||
426 : | $group = ResumeCheck($resume, $group); | ||
427 : | } | ||
428 : | if ($group eq 'Source' || $group eq '*') { | ||
429 : | $spl->LoadSourceData(); | ||
430 : | $group = ResumeCheck($resume, $group); | ||
431 : | } | ||
432 : | if ($group eq 'External' || $group eq '*') { | ||
433 : | $spl->LoadExternalData(); | ||
434 : | $group = ResumeCheck($resume, $group); | ||
435 : | } | ||
436 : | if ($group eq 'Reaction' || $group eq '*') { | ||
437 : | $spl->LoadReactionData(); | ||
438 : | $group = ResumeCheck($resume, $group); | ||
439 : | } | ||
440 : | if ($group eq 'Synonym' || $group eq '*') { | ||
441 : | $spl->LoadSynonymData(); | ||
442 : | $group = ResumeCheck($resume, $group); | ||
443 : | } | ||
444 : | } | ||
445 : | }; | ||
446 : | if ($@) { | ||
447 : | Trace("Load failed with error: $@") if T(0); | ||
448 : | $rtype = "error"; | ||
449 : | } else { | ||
450 : | Trace("Load complete.") if T(2); | ||
451 : | $rtype = "no error"; | ||
452 : | } | ||
453 : | parrello | 1.33 | if ($options->{phone}) { |
454 : | parrello | 1.32 | my $msgID = Tracer::SendSMS($options->{phone}, "Sprout load terminated with $rtype."); |
455 : | if ($msgID) { | ||
456 : | Trace("Phone message sent with ID $msgID.") if T(2); | ||
457 : | } else { | ||
458 : | Trace("Phone message not sent.") if T(2); | ||
459 : | parrello | 1.31 | } |
460 : | parrello | 1.1 | } |
461 : | parrello | 1.23 | # If the resume flag is set, return "*", else return the group name unchanged. |
462 : | sub ResumeCheck { | ||
463 : | parrello | 1.29 | my ($resume, $group) = @_; |
464 : | return ($resume ? "*" : $group); | ||
465 : | parrello | 1.23 | } |
466 : | |||
467 : | parrello | 1.1 | 1; |
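The effect of C<ResumeCheck> is easiest to see in isolation: once one group has been loaded in resume mode, the group name becomes C<*>, so every later test in the chain also matches. The sketch below simulates that behavior without touching the database. The helper name C<GroupsLoaded> is hypothetical, and the group order is copied from the chain of C<if> statements in the script above.

```perl
use strict;
use warnings;

# Group order, copied from the chain of "if" tests in the script above.
my @ORDER = qw(Genome Feature Coupling Subsystem Property Annotation
               BBH Group Source External Reaction Synonym);

# Hypothetical helper: list the groups that would be loaded for a single
# positional parameter, mimicking the if-chain plus ResumeCheck.
sub GroupsLoaded {
    my ($resume, $group) = @_;
    my @loaded;
    for my $name (@ORDER) {
        if ($group eq $name || $group eq '*') {
            push @loaded, $name;
            # Same rule as ResumeCheck: in resume mode switch to "*" so
            # that every later group also matches; otherwise leave the
            # group name unchanged.
            $group = ($resume ? '*' : $group);
        }
    }
    return @loaded;
}

print "resume:    @{[ GroupsLoaded(1, 'Feature') ]}\n";
print "no resume: @{[ GroupsLoaded(0, 'Feature') ]}\n";
```

With C<resume> set, starting at C<Feature> runs every group from C<Feature> through C<Synonym>. Without it, only C<Feature> is loaded.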