
Annotation of /Sprout/LoadSproutTables.pl



Revision 1.43

#!/usr/bin/perl -w

=head1 Load Sprout Tables

=head2 Introduction

The Sprout database reflects a snapshot of the SEED taken at a particular point in
time. At some point in the future, it will be possible to add annotations to the
Sprout data. All records added to Sprout after the snapshot is taken are
specially marked so that the changes can be copied to the SEED. The SEED remains
the live version of the data.

The snapshot is produced by reading the SEED data and writing it to sequential
files. There is one file per Sprout table, and each such file's name consists of
the table name with the suffix C<dtx>. Thus, the file for the C<Genome> table
would be named C<Genome.dtx>. These files are used to load the actual Sprout
database and to generate Glimpse indices.

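As a small illustration of the naming rule above, the following sketch maps a
table name to the path of its load file. The helper name C<load_file_name> and
the directory F</tmp/sproutData> are hypothetical and exist only for this
example; the script itself takes the load directory from its configuration.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: build the load file path for a Sprout table.
# The ".dtx" suffix comes from the text above; the directory is an assumption.
sub load_file_name {
    my ($directory, $table) = @_;
    return File::Spec->catfile($directory, "$table.dtx");
}

# Example: the load file for the Genome table.
print load_file_name('/tmp/sproutData', 'Genome'), "\n";
```

C<File::Spec> is used so that the path separator matches the platform.
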
To load all the Sprout tables and then validate the result, you need to issue three
commands.

    LoadSproutTables -dbLoad -dbCreate "*"
    TestSproutLoad [genomeID] ...
    index_sprout_lucene

where I<[genomeID]> is one or more genome IDs. These genomes will be tested more
thoroughly than the others.

All three commands send output to the console. In addition, C<LoadSproutTables> and
C<TestSproutLoad> write tracing information to a trace log in the FIG temporary
directory (B<$FIG_Config::Tmp>). A complete list of errors appears at the bottom of
the log file. If errors occur in C<LoadSproutTables>, then the data must be corrected
and the offending table group reloaded. So, for example, if there are errors in the
load of the B<MadeAnnotation> and B<Compound> tables, you would need to run

    LoadSproutTables -dbLoad Annotation Reaction

because B<MadeAnnotation> is in the C<Annotation> group, and B<Compound> is in the
C<Reaction> group. A list of the groups is given below.

You can omit the C<dbLoad> option to create the load files without
loading the database, and you can add a C<trace> option to change the trace level.
The command below creates the Genome-related load files with a trace level of 3 and
does not load them into the Sprout database.

    LoadSproutTables -trace=3 Genome

C<LoadSproutTables> takes a long time to run, so setting the trace level to 3 helps
to give you an idea of the progress.

=head2 The NMPDR Web Site

Sprout is the database engine for the NMPDR web site. The NMPDR web site consists of two
pieces that run on two different machines. The B<WEB> machine contains HTML pages
generated by a Content Management Tool.

=head2 Procedure For Loading Sprout

To load Sprout, you need to have the B<Sprout>, B<NmpdrConfigs>, and
B<LuceneSearch> projects checked out from CVS in addition to the standard FIG
projects. You must also set up the following B<FIG_Config.pm> variables in addition
to the normal ones.

=over 4

=item sproutData

Name of the data directory for the Sprout load files.

=item var

Name of the directory to contain cached NMPDR pages. The most important file in
this directory is C<nmpdr_page_template.html>, which contains a skeleton page
from the main NMPDR web site. This skeleton page is used to generate output
pages that look like the other NMPDR pages.

=item java

Path to the Java runtime environment.

=item sproutDB

Name of the Sprout database.

=item dbuser

User name for logging into the Sprout database.

=item dbpass

Password for logging into the Sprout database.

=item nmpdr_site_url

URL for the NMPDR cover pages. The NMPDR cover pages are informational and text
pages that serve as the entry point to the NMPDR web site. They are generated by
a Content Management tool, and some Sprout scripts need to know where to find
them.

=item nmpdr_site_template_id

Page number for the template page used to generate results that look like they're
part of the NMPDR web site.

=back

Most of the above preparation is performed by the B<NMPDRSetup> utility.
NMPDRSetup prints the instructions for completing the process, including
loading the Sprout database. The specific procedure for loading
the Sprout data, however, is as follows.

=over 4

=item 1

Type

    nohup LoadSproutTables -dbLoad -user=you -background "*" >/dev/null &

where C<you> is your user ID, and press ENTER.

The above command line runs the load in the background. The standard output,
standard error, and trace output will be directed to files in the FIG temporary
directory. If your user name is C<Bruce> then the files will be named
C<outBruce.log>, C<errBruce.log>, and C<traceBruce.log> respectively.

If the load fails at some point and you are able to correct the problem, use the
C<resume> option to restart it. For example, if the load failed while doing the
Feature load group, you would resume it using

    nohup LoadSproutTables -dbLoad -user=you -resume -background Feature >/dev/null &

Omit C<dbCreate> when resuming: that option drops the existing database, including
the groups that were already loaded.

=item 2

Type

    index_sprout_lucene

and press ENTER. This will create the Lucene indexes for the Sprout data.

=back

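The effect of the C<resume> option can be sketched as follows. This is a
simplified model, not the script's own code: the helper name C<groups_to_load>
is hypothetical, and the group order is transcribed from the chain of tests in
the script body. In resume mode the first matching group switches the selector
to C<*>, so every later group in the chain is processed as well.

```perl
use strict;
use warnings;

# Load group order, transcribed from the script's chain of "if" tests.
my @GROUP_ORDER = qw(Genome Feature Coupling Subsystem Property Annotation
                     Group Source External Reaction Synonym Family Drug);

# Hypothetical helper: list the groups that would be processed for one
# positional parameter, with or without the resume option.
sub groups_to_load {
    my ($resume, $group) = @_;
    my @loaded;
    for my $candidate (@GROUP_ORDER) {
        if ($group eq $candidate || $group eq '*') {
            push @loaded, $candidate;
            # In resume mode, switch to "*" after the first match so that
            # every later group in the chain also runs.
            $group = '*' if $resume;
        }
    }
    return @loaded;
}

print "resume:    ", join(' ', groups_to_load(1, 'Feature')), "\n";
print "no resume: ", join(' ', groups_to_load(0, 'Feature')), "\n";
```

With C<resume> set, C<Feature> expands to C<Feature> and every group after it;
without it, only C<Feature> is processed.
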
=head2 LoadSproutTables Command

C<LoadSproutTables> creates the load files for Sprout tables and optionally loads them.
The parameters are the names of the table groups whose data is to be created.
The legal table group names are given below.

=over 4

=item Genome

Loads B<Genome>, B<HasContig>, B<Contig>, B<IsMadeUpOf>, and B<Sequence>.

=item Feature

Loads B<Feature>, B<FeatureAlias>, B<FeatureTranslation>, B<FeatureUpstream>,
B<IsLocatedIn>, and B<FeatureLink>.

=item Coupling

Loads B<Coupling>, B<IsEvidencedBy>, B<PCH>, B<ParticipatesInCoupling>, and
B<UsesAsEvidence>.

=item Subsystem

Loads B<Subsystem>, B<Role>, B<SSCell>, B<ContainsFeature>, B<IsGenomeOf>,
B<IsRoleOf>, B<OccursInSubsystem>, B<ParticipatesIn>, B<HasSSCell>,
B<Catalyzes>, B<ConsistsOfRoles>, B<RoleSubset>, B<HasRoleSubset>,
B<ConsistsOfGenomes>, B<GenomeSubset>, B<HasGenomeSubset>, B<Diagram>, and
B<RoleOccursIn>.

=item Annotation

Loads B<SproutUser>, B<UserAccess>, B<Annotation>, B<IsTargetOfAnnotation>, and
B<MadeAnnotation>.

=item Property

Loads B<Property> and B<HasProperty>.

=item Group

Loads B<GenomeGroups>.

=item Source

Loads B<Source>, B<ComesFrom>, and B<SourceURL>.

=item External

Loads B<ExternalAliasOrg> and B<ExternalAliasFunc>.

=item Reaction

Loads B<ReactionURL>, B<Compound>, B<CompoundName>,
B<CompoundCAS>, B<IsAComponentOf>, and B<Reaction>.

=item Synonym

Loads B<SynonymGroup> and B<IsSynonymGroupFor>.

=item Family

Loads B<Family> and B<IsFamilyForFeature>.

=item Drug

Loads B<DrugProject>, B<ContainsTopic>, B<DrugTopic>, B<ContainsAnalysisOf>,
B<PDB>, B<IncludesBound>, B<IsBoundIn>, B<BindsWith>, B<Ligand>,
B<DescribesProteinForFeature>, and B<FeatureConservation>.

=item *

Loads all of the above tables.

=back

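When the trace log reports errors for particular tables, the group list above
tells you which groups to reload. The sketch below transcribes that list into a
lookup table. The helper name C<groups_for_tables> is hypothetical; nothing like
it is known to exist in the Sprout code, so treat it as an illustration only.

```perl
use strict;
use warnings;

# Group-to-table mapping transcribed from the list above.
my %GROUP_TABLES = (
    Genome     => [qw(Genome HasContig Contig IsMadeUpOf Sequence)],
    Feature    => [qw(Feature FeatureAlias FeatureTranslation FeatureUpstream
                      IsLocatedIn FeatureLink)],
    Coupling   => [qw(Coupling IsEvidencedBy PCH ParticipatesInCoupling
                      UsesAsEvidence)],
    Subsystem  => [qw(Subsystem Role SSCell ContainsFeature IsGenomeOf IsRoleOf
                      OccursInSubsystem ParticipatesIn HasSSCell Catalyzes
                      ConsistsOfRoles RoleSubset HasRoleSubset ConsistsOfGenomes
                      GenomeSubset HasGenomeSubset Diagram RoleOccursIn)],
    Annotation => [qw(SproutUser UserAccess Annotation IsTargetOfAnnotation
                      MadeAnnotation)],
    Property   => [qw(Property HasProperty)],
    Group      => [qw(GenomeGroups)],
    Source     => [qw(Source ComesFrom SourceURL)],
    External   => [qw(ExternalAliasOrg ExternalAliasFunc)],
    Reaction   => [qw(ReactionURL Compound CompoundName CompoundCAS
                      IsAComponentOf Reaction)],
    Synonym    => [qw(SynonymGroup IsSynonymGroupFor)],
    Family     => [qw(Family IsFamilyForFeature)],
    Drug       => [qw(DrugProject ContainsTopic DrugTopic ContainsAnalysisOf PDB
                      IncludesBound IsBoundIn BindsWith Ligand
                      DescribesProteinForFeature FeatureConservation)],
);

# Invert the mapping so that each table name points to its group.
my %TABLE_GROUP;
for my $group (keys %GROUP_TABLES) {
    $TABLE_GROUP{$_} = $group for @{ $GROUP_TABLES{$group} };
}

# Hypothetical helper: return the sorted, de-duplicated groups to reload
# for a list of failed tables. Dies on an unknown table name.
sub groups_for_tables {
    my @tables = @_;
    my %groups;
    for my $table (@tables) {
        my $group = $TABLE_GROUP{$table}
            or die "Unknown Sprout table: $table\n";
        $groups{$group} = 1;
    }
    return sort keys %groups;
}

print join(' ', groups_for_tables('MadeAnnotation', 'Compound')), "\n";
```

For the failed tables used in the introduction's example, this prints
C<Annotation Reaction>, which matches the reload command shown there.
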
The command-line options are given below.

=over 4

=item geneFile

The name of the file containing the genomes and their associated access codes. The
file should have one line per genome, each line consisting of the genome ID followed
by the access code, separated by a tab. If no file is specified, all complete genomes
will be processed and the access code will be 1. Specify C<default> to use the
default gene file, C<genes.tbl> in the Sprout data directory
(B<$FIG_Config::sproutData>).

=item subsysFile

The name of the file containing the trusted subsystems. The file should have one line
per trusted subsystem. If no file is specified, all subsystems will be trusted.

=item trace

Desired tracing level. The default is 3.

=item user

Suffix to use for trace, output, and error files created.

=item dbLoad

If TRUE, the database tables will be loaded automatically from the load files created.

=item dbCreate

If TRUE, the database will be created. If the database exists already, it will be
dropped. Use this option with caution.

=item loadOnly

If TRUE, the database tables will be loaded from existing load files. Load files
will not be created. This option is useful if you are setting up a copy of Sprout
and have load files already set up from the original version.

=item primaryOnly

If TRUE, only the group's primary entity will be loaded.

=item background

Redirect the standard and error output to files in the FIG temporary directory.

=item resume

Resume an interrupted load, starting with the load group specified in the first
positional parameter.

=item sql

Trace SQL statements.

=item phone

Phone number to message when the load finishes.

=back

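The gene file layout described under C<geneFile> (one genome per line, genome ID
and access code separated by a tab) could be read along the following lines.
This is a minimal sketch, not the loader's actual parser: the helper name
C<parse_gene_file> is hypothetical, the genome IDs are sample values, and
skipping blank lines is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: read gene-file lines from a filehandle and return a
# hash reference that maps each genome ID to its access code.
sub parse_gene_file {
    my ($fh) = @_;
    my %access;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;    # skip blank lines (an assumption)
        my ($genomeID, $code) = split /\t/, $line;
        die "Missing access code on line $.\n" unless defined $code;
        $access{$genomeID} = $code;
    }
    return \%access;
}

# Usage with an in-memory file; the genome IDs are sample values.
my $sample = "100226.1\t1\n83333.1\t1\n";
open(my $fh, '<', \$sample) or die "Cannot open sample: $!";
my $access = parse_gene_file($fh);
close $fh;
print "$_ => $access->{$_}\n" for sort keys %$access;
```

Reading from a filehandle rather than a file name keeps the helper easy to try
out on in-memory data, as the usage lines show.
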
=cut

use strict;
use Tracer;
use DocUtils;
use Cwd;
use FIG;
use SFXlate;
use File::Copy;
use File::Path;
use SproutLoad;
use Stats;

# Get the command-line parameters and options.
my ($options, @parameters) = StandardSetup(['SproutLoad', 'ERDBLoad', 'Stats',
                                            'ERDB', 'Load', 'Sprout', 'Subsystem'],
                                            { geneFile => ["", "name of the genome list file"],
                                              subsysFile => ["", "name of the trusted subsystem file"],
                                              dbLoad => [0, "load the database from generated files"],
                                              dbCreate => [0, "drop and re-create the database"],
                                              loadOnly => [0, "load the database from previously generated files"],
                                              primaryOnly => [0, "only process the group's main entity"],
                                              resume => [0, "resume a complete load starting with the first group specified in the parameter list"],
                                              phone => ["", "phone number (international format) to call when load finishes"],
                                            },
                                            "<group1> <group2> ...",
                                            @ARGV);
# If we're doing a load-only, turn on loading.
if ($options->{loadOnly}) {
    $options->{dbLoad} = 1;
}
if ($options->{dbCreate}) {
    # Here we want to drop and re-create the database.
    my $db = $FIG_Config::sproutDB;
    DBKernel::CreateDB($db);
}
# Compute the gene file name.
my $geneFile = $options->{geneFile};
if ($geneFile eq 'default') {
    $geneFile = "$FIG_Config::sproutData/genes.tbl";
}
# Create the sprout loader object. Note that the Sprout object does not
# open the database unless the "dbLoad" option is turned on.
my $fig = FIG->new();
my $sprout = SFXlate->new_sprout_only(undef, undef, undef, ! $options->{dbLoad});
my $spl = SproutLoad->new($sprout, $fig, $geneFile, $options->{subsysFile}, $options);
# Ensure we have an output directory.
FIG::verify_dir($FIG_Config::sproutData);
# If we're resuming, we only want to have 1 parameter.
my $resume = $options->{resume};
if ($resume && @parameters > 1) {
    Confess("If resume=1, only one load group can be specified.");
} elsif (! @parameters) {
    Trace("No load groups were specified.") if T(0);
}
# Set a variable to contain return type information.
my $rtype;
# Ensure we catch errors.
eval {
    # Process the parameters.
    for my $group (@parameters) {
        Trace("Processing load group $group.") if T(2);
        my $stats;
        if ($group eq 'Genome' || $group eq '*') {
            $spl->LoadGenomeData();
            $group = ResumeCheck($resume, $group);
        }
        if ($group eq 'Feature' || $group eq '*') {
            $spl->LoadFeatureData();
            $group = ResumeCheck($resume, $group);
        }
        if ($group eq 'Coupling' || $group eq '*') {
            $spl->LoadCouplingData();
            $group = ResumeCheck($resume, $group);
        }
        if ($group eq 'Subsystem' || $group eq '*') {
            $spl->LoadSubsystemData();
            $group = ResumeCheck($resume, $group);
        }
        if ($group eq 'Property' || $group eq '*') {
            $spl->LoadPropertyData();
            $group = ResumeCheck($resume, $group);
        }
        if ($group eq 'Annotation' || $group eq '*') {
            $spl->LoadAnnotationData();
            $group = ResumeCheck($resume, $group);
        }
        if ($group eq 'Group' || $group eq '*') {
            $spl->LoadGroupData();
            $group = ResumeCheck($resume, $group);
        }
        if ($group eq 'Source' || $group eq '*') {
            $spl->LoadSourceData();
            $group = ResumeCheck($resume, $group);
        }
        if ($group eq 'External' || $group eq '*') {
            $spl->LoadExternalData();
            $group = ResumeCheck($resume, $group);
        }
        if ($group eq 'Reaction' || $group eq '*') {
            $spl->LoadReactionData();
            $group = ResumeCheck($resume, $group);
        }
        if ($group eq 'Synonym' || $group eq '*') {
            $spl->LoadSynonymData();
            $group = ResumeCheck($resume, $group);
        }
        if ($group eq 'Family' || $group eq '*') {
            $spl->LoadFamilyData();
            $group = ResumeCheck($resume, $group);
        }
        if ($group eq 'Drug' || $group eq '*') {
            $spl->LoadDrugData();
            $group = ResumeCheck($resume, $group);
        }
    }
};
if ($@) {
    Trace("Load failed with error: $@") if T(0);
    $rtype = "error";
} else {
    Trace("Load complete.") if T(2);
    $rtype = "no error";
}
if ($options->{phone}) {
    my $msgID = Tracer::SendSMS($options->{phone}, "Sprout load terminated with $rtype.");
    if ($msgID) {
        Trace("Phone message sent with ID $msgID.") if T(2);
    } else {
        Trace("Phone message not sent.") if T(2);
    }
}

# If the resume flag is set, return "*"; otherwise return the group unchanged.
sub ResumeCheck {
    my ($resume, $group) = @_;
    return ($resume ? "*" : $group);
}

1;
