
Annotation of /Sprout/LoadSproutTables.pl



Revision 1.31

1 : parrello 1.1 #!/usr/bin/perl -w
2 :    
3 :     =head1 Load Sprout Tables
4 :    
5 : parrello 1.12 =head2 Introduction
6 :    
7 : parrello 1.14 The Sprout database reflects a snapshot of the SEED taken at a particular point in
8 :     time. At some point in the future, it will be possible to add annotations to the
9 :     Sprout data. All records added to Sprout after the snapshot is taken are
10 :     specially-marked so that the changes can be copied to the SEED. The SEED remains
11 :     the live version of the data.
12 :    
13 :     The snapshot is produced by reading the SEED data and writing it to sequential
14 :     files. There is one file per Sprout table, and each such file's name consists of
15 :     the table name with the suffix C<dtx>. Thus, the file for the C<Genome> table
16 :     would be named C<Genome.dtx>. These files are used to load the actual Sprout
17 :     database and to generate Glimpse indices.
18 :    
19 :     To load all the Sprout tables and then validate the result, you need to issue three
20 :     commands.
21 :    
22 :     LoadSproutTables -dbLoad -dbCreate "*"
23 : parrello 1.27 TestSproutLoad [genomeID] ...
24 :     index_sprout_lucene
25 :    
26 :     where I<[genomeID]> is zero or more genome IDs. These genomes will be tested more
27 :     thoroughly than the others.
28 : parrello 1.14
29 :     All three commands send output to the console. In addition, C<LoadSproutTables> and
30 : parrello 1.27 C<TestSproutLoad> write tracing information to a trace log in the FIG temporary
31 : parrello 1.14 directory (B<$FIG_Config::Tmp>). At the bottom of the log file will be a complete
32 :     list of errors. If errors occur in C<LoadSproutTables>, then the data must be corrected
33 :     and the offending table group reloaded. So, for example, if there are errors in the
34 :     load of the B<MadeAnnotation> and B<Compound> tables, you would need to run
35 :    
36 :     LoadSproutTables -dbLoad Annotation Reaction
37 :    
38 :     because B<MadeAnnotation> is in the C<Annotation> group, and B<Compound> is in the
39 :     C<Reaction> group. A list of the groups is given below.
40 :    
41 :     You can omit the C<dbLoad> option to create the load files without
42 :     loading the database, and you can add a C<trace> option to change the trace level.
43 :     The command below creates the Genome-related load files with a trace level of 3 and
44 :     does not load them into the Sprout database.
45 :    
46 :     LoadSproutTables -trace=3 Genome
47 :    
48 :     C<LoadSproutTables> takes a long time to run, so setting the trace level to 3 helps
49 :     to give you an idea of the progress.
50 :    
51 :     Once the Sprout database is loaded, B<TestSproutLoad> can be used to verify the load
52 : parrello 1.27 against the FIG data. The end of the trace log file will contain statistics on
53 :     the errors found. Like C<LoadSproutTables>, C<TestSproutLoad> is a time-consuming
54 : parrello 1.14 script, so you may want to set the trace level to 3 to see visible progress.
55 :    
56 : parrello 1.27 TestSproutLoad -trace=3 [genomeID] ...
57 :    
58 :     The I<[genomeID]> specifies zero or more IDs of genomes to receive more thorough
59 :     testing. So, for example,
60 :    
61 :     TestSproutLoad -trace=3 100226.1 83333.1
62 :    
63 :     would do thorough testing of I<Streptomyces coelicolor A3-2> (100226.1) and
64 :     I<Escherichia coli K12> (83333.1).
65 : parrello 1.14
66 :     Unlike C<LoadSproutTables>, in C<TestSproutLoad>, the individual errors found are
67 :     mixed in with the trace messages. They are all, however, marked with a trace type
68 :     of B<Problem>, as shown in the fragment below.
69 :    
70 :     11/02/2005 19:15:16 <main>: Processing feature fig|100226.1.peg.7742.
71 :     11/02/2005 19:15:17 <main>: Processing feature fig|100226.1.peg.7741.
72 :     11/02/2005 19:15:17 <Problem>: assignment "Short-chain dehydrodenase ...
73 :     11/02/2005 19:15:17 <Problem>: assignment "putative oxidoreductase." ...
74 :     11/02/2005 19:15:17 <Problem>: Incorrect assignment for fig|100226.1.peg.7741...
75 :     11/02/2005 19:15:17 <Problem>: Incorrect number of annotations found in ...
76 :     11/02/2005 19:15:17 <main>: Processing feature fig|100226.1.peg.7740.
77 :     11/02/2005 19:15:18 <main>: Processing feature fig|100226.1.peg.7739.
78 :    
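Because the individual errors are mixed in with the other trace messages, a quick tally by trace type can help when scanning a long log. The following is a minimal sketch, not part of C<TestSproutLoad>: the helper name C<count_trace_types> is hypothetical, and the line layout (date, time, trace type in angle brackets, colon) is assumed from the fragment above rather than taken from the Tracer source.

```perl
use strict;
use warnings;

# Tally trace messages by type. The expected line shape is an assumption
# based on the log fragment: "MM/DD/YYYY HH:MM:SS <Type>: message".
sub count_trace_types {
    my (@lines) = @_;
    my %counts;
    for my $line (@lines) {
        # Capture the trace type between the angle brackets.
        if ($line =~ m!^\s*\d\d/\d\d/\d{4} \d\d:\d\d:\d\d <([^>]+)>:!) {
            $counts{$1}++;
        }
    }
    return \%counts;
}

# Sample lines copied from the fragment above.
my @sample = (
    '11/02/2005 19:15:16 <main>: Processing feature fig|100226.1.peg.7742.',
    '11/02/2005 19:15:17 <Problem>: Incorrect assignment for fig|100226.1.peg.7741...',
    '11/02/2005 19:15:17 <Problem>: Incorrect number of annotations found in ...',
);
my $counts = count_trace_types(@sample);
print "Problems: $counts->{Problem}\n";
```

In practice the lines would be read from the trace log in the FIG temporary directory instead of a hard-coded list.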
79 :     The test may reveal that some tables need to be reloaded, or that a software
80 :     problem has crept into the Sprout.
81 :    
82 : parrello 1.27 Once all the tables have the correct data, C<index_sprout_lucene> can be run to create the
83 :     Lucene search indexes. Lucene is a full-text search engine library produced by the Apache project.
84 :     It is written in Java, and in order to run it you must have the B<LuceneSearch> and
85 :     B<NmpdrConfigs> projects checked out from CVS and made.
86 : parrello 1.14
87 : parrello 1.28 =head2 The NMPDR Web Site
88 :    
89 :     Sprout is the database engine for the NMPDR web site. The NMPDR web site consists of two
90 :     pieces that run on two different machines. The B<WEB> machine contains HTML pages
91 :     generated by a Content Management Tool.
92 :    
93 : parrello 1.14 =head2 Procedure For Loading Sprout
94 :    
95 : parrello 1.27 In order to load the Sprout, you need to have the B<Sprout>, B<NmpdrConfigs>, and
96 :     B<LuceneSearch> projects checked out from CVS in addition to the standard FIG
97 :     projects. You must also set up the following B<FIG_Config.pm> variables in addition
98 :     to the normal ones.
99 :    
100 :     =over 4
101 :    
102 :     =item sproutData
103 :    
104 :     Name of the data directory for the Sprout load files.
105 :    
106 :     =item var
107 :    
108 :     Name of the directory to contain cached NMPDR pages. The most important file in
109 :     this directory is C<nmpdr_page_template.html>, which contains a skeleton page
110 :     from the main NMPDR web site. This skeleton page is used to generate output
111 :     pages that look like the other NMPDR pages.
112 :    
113 :     =item java
114 :    
115 :     Path to the Java runtime environment.
116 :    
117 :     =item sproutDB
118 :    
119 :     Name of the Sprout database.
120 :    
121 :     =item dbuser
122 :    
123 :     User name for logging into the Sprout database.
124 :    
125 :     =item dbpass
126 :    
127 :     Password for logging into the Sprout database.
128 :    
129 :     =item nmpdr_site_url
130 :    
131 :     URL for the NMPDR cover pages. The NMPDR cover pages are informational and text
132 :     pages that serve as the entry point to the NMPDR web site. They are generated by
133 :     a Content Management tool, and some Sprout scripts need to know where to find
134 :     them.
135 :    
136 :     =item nmpdr_site_template_id
137 :    
138 :     Page number for the template page used to generate results that look like they're
139 :     part of the NMPDR web site.
140 :    
141 :     =back
142 :    
143 : parrello 1.14 =over 4
144 :    
145 : parrello 1.27 The procedure for loading Sprout is as follows.
146 :    
147 : parrello 1.14 =item 1
148 :    
149 : parrello 1.25 Type
150 :    
151 :     nohup LoadSproutTables -dbLoad -dbCreate -user=you -background "*" >null &
152 :    
153 : parrello 1.26 where C<you> is your user ID, and press ENTER. This will create the C<dtx> files
154 : parrello 1.25 and load them. You may be asked for a password. If this is the case, simply
155 :     press ENTER. If that does not work, use the C<dbpass> value specified in
156 :     your C<FIG_Config.pm> file.
157 :    
158 :     The above command line runs the load in the background. The standard output,
159 :     standard error, and trace output will be directed to files in the FIG temporary
160 :     directory. If your user name is C<Bruce> then the files will be named
161 :     C<outBruce.log>, C<errBruce.log>, and C<traceBruce.log> respectively.
162 :    
163 :     If the load fails at some point and you are able to correct the problem, use the
164 :     C<resume> option to restart it. For example, if the load failed while doing the
165 :     Feature load group, you would resume it using
166 :    
167 :     nohup LoadSproutTables -dbLoad -dbCreate -user=you -resume -background Feature >null &
168 : parrello 1.14
169 :     =item 2
170 :    
171 : parrello 1.27 Type
172 :    
173 :     nohup TestSproutLoad -user=you -background 100226.1 83333.1 >null &
174 :    
175 :     and press ENTER. This will validate the Sprout database against the SEED data.
176 : parrello 1.14
177 :     =item 3
178 :    
179 :     If any errors are detected in step (2), they are most likely due to a change in
180 :     SEED that did not make it to Sprout. Contact Bruce Parrello or Robert Olson
181 :     to get the code updated properly.
182 :    
183 :     =item 4
184 :    
185 : parrello 1.27 Type
186 :    
187 :     index_sprout_lucene
188 :    
189 :     and press ENTER. This will create the Lucene indexes for the Sprout data.
190 :    
191 :     =item 5
192 :    
193 :     Change to the B<SproutData/Indexes> directory under B<FIGdisk> and look for the
194 :     directory created by C<index_sprout_lucene>. The directory name will be
195 :     something like C<Lucene.20060412-154112>. The numbers indicate the date and time
196 :     the index was created. In this case it was 04/12/2006 03:41:12pm. Type
197 :    
198 :     ln -sf directory Lucene
199 :    
200 :     where C<directory> is the new directory name, to point the C<Lucene> directory to the
201 :     new search index.
202 : parrello 1.14
203 :     =back
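The index directory name in step 5 encodes its creation date and time, so the newest index can be picked out mechanically. The sketch below assumes the name always has the form C<Lucene.YYYYMMDD-HHMMSS>, as in the example C<Lucene.20060412-154112>; the helper names C<parse_index_dir_name> and C<newest_index_dir> are hypothetical and are not part of C<index_sprout_lucene>.

```perl
use strict;
use warnings;

# Split a name like "Lucene.20060412-154112" into its date and time parts.
# Returns undef (in scalar context) if the name does not match the pattern.
sub parse_index_dir_name {
    my ($name) = @_;
    if ($name =~ /^Lucene\.(\d{4})(\d\d)(\d\d)-(\d\d)(\d\d)(\d\d)$/) {
        return { year => $1, month => $2, day => $3,
                 hour => $4, minute => $5, second => $6 };
    }
    return;
}

# Pick the most recent index directory from a list of names. Because the
# timestamp is fixed-width, a plain string sort orders the names by time.
sub newest_index_dir {
    my @valid = grep { defined parse_index_dir_name($_) } @_;
    my @sorted = sort @valid;
    return $sorted[-1];
}

my $stamp = parse_index_dir_name('Lucene.20060412-154112');
print "$stamp->{month}/$stamp->{day}/$stamp->{year} ",
      "$stamp->{hour}:$stamp->{minute}:$stamp->{second}\n";
```

The name returned by C<newest_index_dir> is what would be passed as C<directory> to the C<ln -sf> command in step 5.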
204 :    
205 :     =head2 LoadSproutTables Command
206 :    
207 :     C<LoadSproutTables> creates the load files for Sprout tables and optionally loads them.
208 : parrello 1.12 The parameters are the names of the table groups whose data is to be created.
209 :     The legal table group names are given below.
210 : parrello 1.1
211 :     =over 4
212 :    
213 :     =item Genome
214 :    
215 :     Loads B<Genome>, B<HasContig>, B<Contig>, B<IsMadeUpOf>, and B<Sequence>.
216 :    
217 : parrello 1.30 =item Feature
218 :    
219 :     Loads B<Feature>, B<FeatureAlias>, B<FeatureTranslation>, B<FeatureUpstream>,
220 :     B<IsLocatedIn>, B<FeatureLink>.
221 :    
222 : parrello 1.1 =item Coupling
223 :    
224 :     Loads B<Coupling>, B<IsEvidencedBy>, B<PCH>, B<ParticipatesInCoupling>,
225 :     B<UsesAsEvidence>.
226 :    
227 :     =item Subsystem
228 :    
229 : parrello 1.2 Loads B<Subsystem>, B<Role>, B<SSCell>, B<ContainsFeature>, B<IsGenomeOf>,
230 : parrello 1.8 B<IsRoleOf>, B<OccursInSubsystem>, B<ParticipatesIn>, B<HasSSCell>,
231 : parrello 1.11 B<Catalyzes>, B<ConsistsOfRoles>, B<RoleSubset>, B<HasRoleSubset>,
232 : parrello 1.13 B<ConsistsOfGenomes>, B<GenomeSubset>, B<HasGenomeSubset>, B<Diagram>,
233 :     B<RoleOccursIn>.
234 : parrello 1.1
235 : parrello 1.2 =item Annotation
236 :    
237 :     Loads B<SproutUser>, B<UserAccess>, B<Annotation>, B<IsTargetOfAnnotation>,
238 :     B<MadeAnnotation>.
239 :    
240 :     =item Property
241 :    
242 :     Loads B<Property>, B<HasProperty>.
243 :    
244 :     =item BBH
245 :    
246 :     Loads B<IsBidirectionalBestHitOf>.
247 :    
248 : parrello 1.3 =item Group
249 :    
250 :     Loads B<GenomeGroups>.
251 :    
252 :     =item Source
253 :    
254 :     Loads B<Source>, B<ComesFrom>, B<SourceURL>.
255 :    
256 : parrello 1.4 =item External
257 :    
258 :     Loads B<ExternalAliasOrg>, B<ExternalAliasFunc>.
259 :    
260 : parrello 1.8 =item Reaction
261 :    
262 :     Loads B<ReactionURL>, B<Compound>, B<CompoundName>,
263 : parrello 1.11 B<CompoundCAS>, B<IsAComponentOf>, B<Reaction>.
264 : parrello 1.8
265 : parrello 1.31 =item Synonym
266 :    
267 :     Loads B<SynonymGroup> and B<IsSynonymGroupFor>.
268 :    
269 : parrello 1.3 =item *
270 :    
271 :     Loads all of the above tables.
272 :    
273 : parrello 1.1 =back
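When a load reports errors for specific tables, the list above determines which groups must be reloaded. The following sketch turns that list into a lookup; the hash C<%TABLES_IN_GROUP> and the helper C<groups_for_tables> are hypothetical names introduced for illustration, and the table names are copied from the group list above, not read from the Sprout database definition.

```perl
use strict;
use warnings;

# Table groups as documented above.
my %TABLES_IN_GROUP = (
    Genome     => [qw(Genome HasContig Contig IsMadeUpOf Sequence)],
    Feature    => [qw(Feature FeatureAlias FeatureTranslation FeatureUpstream
                      IsLocatedIn FeatureLink)],
    Coupling   => [qw(Coupling IsEvidencedBy PCH ParticipatesInCoupling
                      UsesAsEvidence)],
    Subsystem  => [qw(Subsystem Role SSCell ContainsFeature IsGenomeOf IsRoleOf
                      OccursInSubsystem ParticipatesIn HasSSCell Catalyzes
                      ConsistsOfRoles RoleSubset HasRoleSubset ConsistsOfGenomes
                      GenomeSubset HasGenomeSubset Diagram RoleOccursIn)],
    Annotation => [qw(SproutUser UserAccess Annotation IsTargetOfAnnotation
                      MadeAnnotation)],
    Property   => [qw(Property HasProperty)],
    BBH        => [qw(IsBidirectionalBestHitOf)],
    Group      => [qw(GenomeGroups)],
    Source     => [qw(Source ComesFrom SourceURL)],
    External   => [qw(ExternalAliasOrg ExternalAliasFunc)],
    Reaction   => [qw(ReactionURL Compound CompoundName CompoundCAS
                      IsAComponentOf Reaction)],
    Synonym    => [qw(SynonymGroup IsSynonymGroupFor)],
);

# Invert the map so each table name points at its group.
my %group_of;
for my $group (keys %TABLES_IN_GROUP) {
    $group_of{$_} = $group for @{ $TABLES_IN_GROUP{$group} };
}

# Return the sorted, de-duplicated groups that contain the given tables.
sub groups_for_tables {
    my %seen;
    for my $table (@_) {
        my $group = $group_of{$table};
        die "Unknown table: $table\n" unless defined $group;
        $seen{$group} = 1;
    }
    return sort keys %seen;
}

# Reproduces the example from the introduction.
my @groups = groups_for_tables('MadeAnnotation', 'Compound');
print "LoadSproutTables -dbLoad @groups\n";
```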
274 :    
275 : parrello 1.7 The command-line options are given below.
276 : parrello 1.1
277 :     =over 4
278 :    
279 :     =item geneFile
280 :    
281 :     The name of the file containing the genomes and their associated access codes. The
282 :     file should have one line per genome, each line consisting of the genome ID followed
283 :     by the access code, separated by a tab. If no file is specified, all complete genomes
284 :     will be processed and the access code will be 1.
285 :    
286 :     =item subsysFile
287 :    
288 :     The name of the file containing the trusted subsystems. The file should have one line
289 :     per trusted subsystem. If no file is specified, all subsystems will be trusted.
290 :    
291 :     =item trace
292 :    
293 :     Desired tracing level. The default is 3.
294 :    
295 : parrello 1.25 =item user
296 :    
297 :     Suffix to use for trace, output, and error files created in
298 :    
299 : parrello 1.10 =item dbLoad
300 :    
301 :     If TRUE, the database tables will be loaded automatically from the load files created.
302 :    
303 : parrello 1.14 =item dbCreate
304 : parrello 1.1
305 : parrello 1.14 If TRUE, the database will be created. If the database exists already, it will be
306 :     dropped. Use this option with caution.
307 : parrello 1.12
308 : parrello 1.17 =item loadOnly
309 :    
310 :     If TRUE, the database tables will be loaded from existing load files. Load files
311 :     will not be created. This option is useful if you are setting up a copy of Sprout
312 :     and have load files already set up from the original version.
313 :    
314 : parrello 1.19 =item primaryOnly
315 :    
316 :     If TRUE, only the group's primary entity will be loaded.
317 :    
318 : parrello 1.25 =item background
319 :    
320 :     Redirect the standard and error output to files in the FIG temporary directory.
321 :    
322 :     =item resume
323 :    
324 :     Resume an interrupted load, starting with the load group specified in the first
325 :     positional parameter.
326 :    
327 :     =item sql
328 :    
329 :     Trace SQL statements.
330 :    
331 : parrello 1.14 =back
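The C<geneFile> format described above (one genome per line, genome ID and access code separated by a tab) is simple to read. The sketch below shows one way to parse it; the function C<read_genome_list> is hypothetical and is not how C<SproutLoad> necessarily reads the file, and the access codes in the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Read "genomeID<TAB>accessCode" lines from an open file handle and
# return a hash reference mapping genome ID to access code.
sub read_genome_list {
    my ($fh) = @_;
    my %access;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;    # skip blank lines
        my ($genomeID, $code) = split /\t/, $line;
        $access{$genomeID} = $code;
    }
    return \%access;
}

# Demonstrate with an in-memory file; the access codes are sample values.
my $text = "100226.1\t1\n83333.1\t2\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
my $access = read_genome_list($fh);
close $fh;
print "$_ => $access->{$_}\n" for sort keys %$access;
```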
332 : parrello 1.12
333 : parrello 1.1 =cut
334 :    
335 :     use strict;
336 :     use Tracer;
337 :     use DocUtils;
338 :     use Cwd;
339 :     use FIG;
340 :     use SFXlate;
341 :     use File::Copy;
342 :     use File::Path;
343 :     use SproutLoad;
344 :     use Stats;
345 : parrello 1.9 use SFXlate;
346 : parrello 1.1
347 :     # Get the command-line parameters and options.
348 : parrello 1.17 my ($options, @parameters) = StandardSetup(['SproutLoad', 'ERDBLoad', 'Stats',
349 : parrello 1.26 'ERDB', 'Load', 'Sprout', 'Subsystem'],
350 : parrello 1.18 { geneFile => ["", "name of the genome list file"],
351 :     subsysFile => ["", "name of the trusted subsystem file"],
352 :     dbLoad => [0, "load the database from generated files"],
353 :     dbCreate => [0, "drop and re-create the database"],
354 : parrello 1.19 loadOnly => [0, "load the database from previously generated files"],
355 : parrello 1.23 primaryOnly => [0, "only process the group's main entity"],
356 :     resume => [0, "resume a complete load starting with the first group specified in the parameter list"],
357 : parrello 1.18 },
358 :     "<group1> <group2> ...",
359 : parrello 1.17 @ARGV);
360 :     # If we're doing a load-only, turn on loading.
361 :     if ($options->{loadOnly}) {
362 :     $options->{dbLoad} = 1;
363 :     }
364 : parrello 1.14 if ($options->{dbCreate}) {
365 :     # Here we want to drop and re-create the database.
366 :     my $db = $FIG_Config::sproutDB;
367 : parrello 1.20 DBKernel::CreateDB($db);
368 : parrello 1.14 }
369 : parrello 1.9 # Create the sprout loader object. Note that the Sprout object does not
370 : parrello 1.10 # open the database unless the "dbLoad" option is turned on.
371 : parrello 1.1 my $fig = FIG->new();
372 : parrello 1.10 my $sprout = SFXlate->new_sprout_only(undef, undef, undef, ! $options->{dbLoad});
373 : parrello 1.7 my $spl = SproutLoad->new($sprout, $fig, $options->{geneFile}, $options->{subsysFile}, $options);
374 : parrello 1.15 # Ensure we have an output directory.
375 :     FIG::verify_dir($FIG_Config::sproutData);
376 : parrello 1.23 # If we're resuming, we only want to have 1 parameter.
377 :     my $resume = $options->{resume};
378 :     if ($resume && @parameters > 1) {
379 :     Confess("If resume=1, only one load group can be specified.");
380 :     } elsif (! @parameters) {
381 :     Confess("No load groups were specified.");
382 :     }
383 : parrello 1.1 # Process the parameters.
384 :     for my $group (@parameters) {
385 :     Trace("Processing load group $group.") if T(2);
386 :     my $stats;
387 : parrello 1.3 if ($group eq 'Genome' || $group eq '*') {
388 : parrello 1.1 $spl->LoadGenomeData();
389 : parrello 1.29 $group = ResumeCheck($resume, $group);
390 : parrello 1.3 }
391 :     if ($group eq 'Feature' || $group eq '*') {
392 : parrello 1.1 $spl->LoadFeatureData();
393 : parrello 1.29 $group = ResumeCheck($resume, $group);
394 : parrello 1.3 }
395 :     if ($group eq 'Coupling' || $group eq '*') {
396 : parrello 1.1 $spl->LoadCouplingData();
397 : parrello 1.29 $group = ResumeCheck($resume, $group);
398 : parrello 1.3 }
399 :     if ($group eq 'Subsystem' || $group eq '*') {
400 : parrello 1.1 $spl->LoadSubsystemData();
401 : parrello 1.29 $group = ResumeCheck($resume, $group);
402 : parrello 1.3 }
403 :     if ($group eq 'Property' || $group eq '*') {
404 : parrello 1.1 $spl->LoadPropertyData();
405 : parrello 1.29 $group = ResumeCheck($resume, $group);
406 : parrello 1.3 }
407 :     if ($group eq 'Annotation' || $group eq '*') {
408 : parrello 1.2 $spl->LoadAnnotationData();
409 : parrello 1.29 $group = ResumeCheck($resume, $group);
410 : parrello 1.3 }
411 :     if ($group eq 'BBH' || $group eq '*') {
412 : parrello 1.2 $spl->LoadBBHData();
413 : parrello 1.29 $group = ResumeCheck($resume, $group);
414 : parrello 1.1 }
415 : parrello 1.4 if ($group eq 'Group' || $group eq '*') {
416 : parrello 1.3 $spl->LoadGroupData();
417 : parrello 1.29 $group = ResumeCheck($resume, $group);
418 : parrello 1.3 }
419 :     if ($group eq 'Source' || $group eq '*') {
420 :     $spl->LoadSourceData();
421 : parrello 1.29 $group = ResumeCheck($resume, $group);
422 : parrello 1.3 }
423 : parrello 1.4 if ($group eq 'External' || $group eq '*') {
424 :     $spl->LoadExternalData();
425 : parrello 1.29 $group = ResumeCheck($resume, $group);
426 : parrello 1.4 }
427 : parrello 1.8 if ($group eq 'Reaction' || $group eq '*') {
428 :     $spl->LoadReactionData();
429 : parrello 1.29 $group = ResumeCheck($resume, $group);
430 : parrello 1.8 }
431 : parrello 1.31 if ($group eq 'Synonym' || $group eq '*') {
432 :     $spl->LoadSynonymData();
433 :     $group = ResumeCheck($resume, $group);
434 :     }
435 : parrello 1.1 }
436 :     Trace("Load complete.") if T(2);
437 :    
438 : parrello 1.23 # If the resume flag is set, return "*", else return the group name unchanged.
439 :     sub ResumeCheck {
440 : parrello 1.29 my ($resume, $group) = @_;
441 :     return ($resume ? "*" : $group);
442 : parrello 1.23 }
443 :    
444 : parrello 1.1 1;
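The C<resume> behavior follows from the chain of C<if> tests above: once the named group has been loaded, C<ResumeCheck> replaces the group name with C<*>, so every later test in the chain also matches. The stand-alone sketch below simulates this with only the first four groups; C<groups_run> is a hypothetical helper that records group names instead of calling the C<SproutLoad> methods.

```perl
use strict;
use warnings;

# Same logic as ResumeCheck in the script above.
sub ResumeCheck {
    my ($resume, $group) = @_;
    return ($resume ? "*" : $group);
}

# Return the groups that would be loaded, in order, for one parameter.
# Only the first four groups are listed to keep the sketch short.
sub groups_run {
    my ($resume, $group) = @_;
    my @order = qw(Genome Feature Coupling Subsystem);
    my @ran;
    for my $name (@order) {
        if ($group eq $name || $group eq '*') {
            push @ran, $name;
            # With resume on, the group becomes "*" from here onward.
            $group = ResumeCheck($resume, $group);
        }
    }
    return @ran;
}

print join(",", groups_run(1, 'Feature')), "\n";
print join(",", groups_run(0, 'Feature')), "\n";
```

With C<resume> on, the run starting at C<Feature> continues through the remaining groups; with it off, only C<Feature> is processed.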
