
Annotation of /Sprout/LoadSproutTables.pl



Revision 1.24

#!/usr/bin/perl -w

=head1 Load Sprout Tables

=head2 Introduction

The Sprout database reflects a snapshot of the SEED taken at a particular point in
time. At some point in the future, it will be possible to add annotations to the
Sprout data. All records added to Sprout after the snapshot is taken will be
specially marked so that the changes can be copied to the SEED. The SEED remains
the live version of the data.

The snapshot is produced by reading the SEED data and writing it to sequential
files. There is one file per Sprout table, and each file's name consists of
the table name with the suffix C<dtx>. Thus, the file for the C<Genome> table
is named C<Genome.dtx>. These files are used to load the actual Sprout
database and to generate Glimpse indexes.

To load all the Sprout tables, validate the result, and build the indexes, you
need to issue three commands.

    LoadSproutTables -dbLoad -dbCreate "*"
    TestSproutLoad
    index_sprout

All three commands send output to the console. In addition, C<LoadSproutTables> and
C<TestSproutLoad> write tracing information to C<trace.log> in the FIG temporary
directory (B<$FIG_Config::Tmp>). A complete list of errors appears at the bottom of
the log file. If errors occur in C<LoadSproutTables>, the data must be corrected
and the offending table group reloaded. For example, if there are errors in the
load of the B<MadeAnnotation> and B<Compound> tables, you would need to run

    LoadSproutTables -dbLoad Annotation Reaction

because B<MadeAnnotation> is in the C<Annotation> group, and B<Compound> is in the
C<Reaction> group. A list of the groups is given below.

You can omit the C<dbLoad> option to create the load files without
loading the database, and you can add a C<trace> option to change the trace level.
The command below creates the Genome-related load files with a trace level of 3 and
does not load them into the Sprout database.

    LoadSproutTables -trace=3 Genome

C<LoadSproutTables> takes a long time to run, so setting the trace level to 3 helps
to give you an idea of the progress.

Once the Sprout database is loaded, C<TestSproutLoad> can be used to verify the load
against the FIG data. Again, the end of the C<trace.log> file will contain a summary
of the errors found. Like C<LoadSproutTables>, C<TestSproutLoad> is a time-consuming
script, so you may want to set the trace level to 3 to see visible progress.

    TestSproutLoad -trace=3

Unlike C<LoadSproutTables>, C<TestSproutLoad> mixes the individual errors it finds
in with the trace messages. They are all, however, marked with a trace type
of B<Problem>, as shown in the fragment below.

    11/02/2005 19:15:16 <main>: Processing feature fig|100226.1.peg.7742.
    11/02/2005 19:15:17 <main>: Processing feature fig|100226.1.peg.7741.
    11/02/2005 19:15:17 <Problem>: assignment "Short-chain dehydrodenase ...
    11/02/2005 19:15:17 <Problem>: assignment "putative oxidoreductase." ...
    11/02/2005 19:15:17 <Problem>: Incorrect assignment for fig|100226.1.peg.7741...
    11/02/2005 19:15:17 <Problem>: Incorrect number of annotations found in ...
    11/02/2005 19:15:17 <main>: Processing feature fig|100226.1.peg.7740.
    11/02/2005 19:15:18 <main>: Processing feature fig|100226.1.peg.7739.

The test may reveal that some tables need to be reloaded, or that a software
problem has crept into the Sprout code.
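
Because the problem messages are interleaved with ordinary trace output, it can help to pull them out of the log. The following is a minimal sketch, assuming every log line has the date, time, trace type, and message layout shown in the fragment above. The helper name C<problem_lines> is hypothetical and is not part of the Sprout code.

```perl
use strict;
use warnings;

# Hypothetical helper: return the message text of every trace line whose
# trace type is "Problem". Assumes the "MM/DD/YYYY HH:MM:SS <type>: text"
# layout shown in the log fragment.
sub problem_lines {
    my (@lines) = @_;
    my @problems;
    for my $line (@lines) {
        if ($line =~ m!^\d\d/\d\d/\d{4} \d\d:\d\d:\d\d <Problem>: (.*)$!) {
            push @problems, $1;
        }
    }
    return @problems;
}

# Two sample lines copied from the fragment above.
my @sample = (
    '11/02/2005 19:15:17 <main>: Processing feature fig|100226.1.peg.7741.',
    '11/02/2005 19:15:17 <Problem>: Incorrect assignment for fig|100226.1.peg.7741...',
);
print "$_\n" for problem_lines(@sample);
```

In practice the lines would be read from C<trace.log> rather than from an in-memory list.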

Once all the tables have the correct data, C<index_sprout> can be run to create the
Glimpse indexes.

=head2 Procedure For Loading Sprout

=over 4

=item 1

Type C<LoadSproutTables -dbLoad -dbCreate "*"> and press ENTER. This will create
the C<dtx> files and load them. You may be asked for a password. If so, simply
press ENTER. If that does not work, use the C<dbpass> value specified
in your C<FIG_Config.pm> file.

=item 2

Type C<TestSproutLoad 100226.1 83333.1> and press ENTER. This will validate
the Sprout database against the SEED data.

=item 3

If any errors are detected in step (2), they are most likely due to a change in
SEED that did not make it to Sprout. Contact Bruce Parrello or Robert Olson
to get the code updated properly.

=item 4

Type C<index_sprout> and press ENTER. This will create the Glimpse indexes
for the Sprout data.

=back

=head2 LoadSproutTables Command

C<LoadSproutTables> creates the load files for Sprout tables and optionally loads them.
The parameters are the names of the table groups whose data is to be created.
The legal table group names are given below.

=over 4

=item Genome

Loads B<Genome>, B<HasContig>, B<Contig>, B<IsMadeUpOf>, and B<Sequence>.

=item Coupling

Loads B<Coupling>, B<IsEvidencedBy>, B<PCH>, B<ParticipatesInCoupling>,
B<UsesAsEvidence>.

=item Feature

Loads B<Feature>, B<FeatureAlias>, B<FeatureTranslation>, B<FeatureUpstream>,
B<IsLocatedIn>, B<FeatureLink>.

=item Subsystem

Loads B<Subsystem>, B<Role>, B<SSCell>, B<ContainsFeature>, B<IsGenomeOf>,
B<IsRoleOf>, B<OccursInSubsystem>, B<ParticipatesIn>, B<HasSSCell>,
B<Catalyzes>, B<ConsistsOfRoles>, B<RoleSubset>, B<HasRoleSubset>,
B<ConsistsOfGenomes>, B<GenomeSubset>, B<HasGenomeSubset>, B<Diagram>,
B<RoleOccursIn>.

=item Annotation

Loads B<SproutUser>, B<UserAccess>, B<Annotation>, B<IsTargetOfAnnotation>,
B<MadeAnnotation>.

=item Property

Loads B<Property>, B<HasProperty>.

=item BBH

Loads B<IsBidirectionalBestHitOf>.

=item Group

Loads B<GenomeGroups>.

=item Source

Loads B<Source>, B<ComesFrom>, B<SourceURL>.

=item External

Loads B<ExternalAliasOrg>, B<ExternalAliasFunc>.

=item Reaction

Loads B<ReactionURL>, B<Compound>, B<CompoundName>,
B<CompoundCAS>, B<IsAComponentOf>, B<Reaction>.

=item *

Loads all of the above tables.

=back
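
When the log reports errors for particular tables, the groups to reload can be looked up from the list above. The following is a minimal sketch of such a lookup. The hash C<%GROUP_TABLES> is transcribed from the group list, while the helper name C<groups_for_tables> is hypothetical and is not part of the Sprout code.

```perl
use strict;
use warnings;

# Table groups, transcribed from the list above.
my %GROUP_TABLES = (
    Genome     => [qw(Genome HasContig Contig IsMadeUpOf Sequence)],
    Coupling   => [qw(Coupling IsEvidencedBy PCH ParticipatesInCoupling UsesAsEvidence)],
    Feature    => [qw(Feature FeatureAlias FeatureTranslation FeatureUpstream
                      IsLocatedIn FeatureLink)],
    Subsystem  => [qw(Subsystem Role SSCell ContainsFeature IsGenomeOf IsRoleOf
                      OccursInSubsystem ParticipatesIn HasSSCell Catalyzes
                      ConsistsOfRoles RoleSubset HasRoleSubset ConsistsOfGenomes
                      GenomeSubset HasGenomeSubset Diagram RoleOccursIn)],
    Annotation => [qw(SproutUser UserAccess Annotation IsTargetOfAnnotation
                      MadeAnnotation)],
    Property   => [qw(Property HasProperty)],
    BBH        => [qw(IsBidirectionalBestHitOf)],
    Group      => [qw(GenomeGroups)],
    Source     => [qw(Source ComesFrom SourceURL)],
    External   => [qw(ExternalAliasOrg ExternalAliasFunc)],
    Reaction   => [qw(ReactionURL Compound CompoundName CompoundCAS
                      IsAComponentOf Reaction)],
);

# Invert the mapping so each table name points at its group.
my %TABLE_GROUP;
for my $group (keys %GROUP_TABLES) {
    $TABLE_GROUP{$_} = $group for @{ $GROUP_TABLES{$group} };
}

# Hypothetical helper: return the sorted, de-duplicated list of groups
# that contain the given tables.
sub groups_for_tables {
    my (@tables) = @_;
    my %seen;
    for my $table (@tables) {
        my $group = $TABLE_GROUP{$table};
        die "Unknown table: $table\n" unless defined $group;
        $seen{$group} = 1;
    }
    return sort keys %seen;
}

# Build the reload command for the example used in the introduction.
print join(' ', 'LoadSproutTables', '-dbLoad',
           groups_for_tables(qw(MadeAnnotation Compound))), "\n";
```

For the two tables in the example this prints the same reload command that the introduction gives.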

The command-line options are given below.

=over 4

=item geneFile

The name of the file containing the genomes and their associated access codes. The
file should have one line per genome, each line consisting of the genome ID followed
by the access code, separated by a tab. If no file is specified, all complete genomes
will be processed and the access code will be 1.

=item subsysFile

The name of the file containing the trusted subsystems. The file should have one line
per trusted subsystem. If no file is specified, all subsystems will be trusted.

=item trace

Desired tracing level. The default is 3.

=item dbLoad

If TRUE, the database tables will be loaded automatically from the load files created.

=item dbCreate

If TRUE, the database will be created. If the database exists already, it will be
dropped. Use this option with caution.

=item loadOnly

If TRUE, the database tables will be loaded from existing load files. Load files
will not be created. This option is useful if you are setting up a copy of Sprout
and have load files already set up from the original version.

=item primaryOnly

If TRUE, only the group's primary entity will be loaded.

=item resume

If TRUE, a complete load is resumed starting with the group specified in the
parameter list: that group and every group processed after it will be loaded.
Only one group may be specified with this option. The groups are processed in
the order B<Genome>, B<Feature>, B<Coupling>, B<Subsystem>, B<Property>,
B<Annotation>, B<BBH>, B<Group>, B<Source>, B<External>, B<Reaction>.

=back
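
The C<geneFile> layout described above (genome ID, a tab, then the access code) can be read with a few lines of Perl. The following is a minimal sketch. The helper name C<parse_genome_lines> and the access codes in the sample data are hypothetical, and the real parsing done by C<SproutLoad> may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: turn geneFile-style lines into a hash that maps
# each genome ID to its access code. Blank lines are skipped, and a line
# without a tab-separated access code is treated as an error.
sub parse_genome_lines {
    my (@lines) = @_;
    my %access;
    for my $line (@lines) {
        chomp(my $copy = $line);
        next if $copy eq '';
        my ($genome, $code) = split /\t/, $copy;
        die "Malformed genome line: $copy\n" unless defined $code;
        $access{$genome} = $code;
    }
    return %access;
}

# Sample data: the genome IDs come from the procedure above, the access
# codes are made up for illustration.
my %access = parse_genome_lines("100226.1\t1\n", "83333.1\t2\n");
print "$_ => $access{$_}\n" for sort keys %access;
```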

=cut

use strict;
use Tracer;
use DocUtils;
use Cwd;
use FIG;
use SFXlate;
use File::Copy;
use File::Path;
use SproutLoad;
use Stats;

# Get the command-line parameters and options.
my ($options, @parameters) = StandardSetup(['SproutLoad', 'ERDBLoad', 'Stats',
                                            'ERDB', 'Load', 'Sprout'],
                                           { geneFile => ["", "name of the genome list file"],
                                             subsysFile => ["", "name of the trusted subsystem file"],
                                             dbLoad => [0, "load the database from generated files"],
                                             dbCreate => [0, "drop and re-create the database"],
                                             loadOnly => [0, "load the database from previously generated files"],
                                             primaryOnly => [0, "only process the group's main entity"],
                                             resume => [0, "resume a complete load starting with the first group specified in the parameter list"],
                                           },
                                           "<group1> <group2> ...",
                                           @ARGV);
# If we're doing a load-only, turn on loading.
if ($options->{loadOnly}) {
    $options->{dbLoad} = 1;
}
if ($options->{dbCreate}) {
    # Here we want to drop and re-create the database.
    my $db = $FIG_Config::sproutDB;
    DBKernel::CreateDB($db);
}
# Create the sprout loader object. Note that the Sprout object does not
# open the database unless the "dbLoad" option is turned on.
my $fig = FIG->new();
my $sprout = SFXlate->new_sprout_only(undef, undef, undef, ! $options->{dbLoad});
my $spl = SproutLoad->new($sprout, $fig, $options->{geneFile}, $options->{subsysFile}, $options);
# Ensure we have an output directory.
FIG::verify_dir($FIG_Config::sproutData);
# If we're resuming, we only want to have 1 parameter.
my $resume = $options->{resume};
if ($resume && @parameters > 1) {
    Confess("If resume=1, only one load group can be specified.");
} elsif (! @parameters) {
    Confess("No load groups were specified.");
}
# Process the parameters. After each group is loaded, ResumeCheck decides
# whether the groups that follow it should be loaded as well.
for my $group (@parameters) {
    Trace("Processing load group $group.") if T(2);
    if ($group eq 'Genome' || $group eq '*') {
        $spl->LoadGenomeData();
        $group = ResumeCheck($resume, $group);
    }
    if ($group eq 'Feature' || $group eq '*') {
        $spl->LoadFeatureData();
        $group = ResumeCheck($resume, $group);
    }
    if ($group eq 'Coupling' || $group eq '*') {
        $spl->LoadCouplingData();
        $group = ResumeCheck($resume, $group);
    }
    if ($group eq 'Subsystem' || $group eq '*') {
        $spl->LoadSubsystemData();
        $group = ResumeCheck($resume, $group);
    }
    if ($group eq 'Property' || $group eq '*') {
        $spl->LoadPropertyData();
        $group = ResumeCheck($resume, $group);
    }
    if ($group eq 'Annotation' || $group eq '*') {
        $spl->LoadAnnotationData();
        $group = ResumeCheck($resume, $group);
    }
    if ($group eq 'BBH' || $group eq '*') {
        $spl->LoadBBHData();
        $group = ResumeCheck($resume, $group);
    }
    if ($group eq 'Group' || $group eq '*') {
        $spl->LoadGroupData();
        $group = ResumeCheck($resume, $group);
    }
    if ($group eq 'Source' || $group eq '*') {
        $spl->LoadSourceData();
        $group = ResumeCheck($resume, $group);
    }
    if ($group eq 'External' || $group eq '*') {
        $spl->LoadExternalData();
        $group = ResumeCheck($resume, $group);
    }
    if ($group eq 'Reaction' || $group eq '*') {
        $spl->LoadReactionData();
        $group = ResumeCheck($resume, $group);
    }
}
Trace("Load complete.") if T(2);

# If the resume flag is set, return "*" so that every later group is loaded.
# Otherwise return the group unchanged, so that a named group matches only
# its own block and "*" keeps matching every block.
sub ResumeCheck {
    my ($resume, $group) = @_;
    return ($resume ? "*" : $group);
}

1;
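
The group selection in the loop above can be summarised as a small pure function, which makes the effect of the `resume` option easier to see. This is a minimal sketch of the behaviour the documentation describes, not the script's own code. The helper name `groups_to_load` is hypothetical, and the group order is taken from the order of the `if` blocks in the script.

```perl
use strict;
use warnings;

# Group names in the order the script processes them.
my @ORDER = qw(Genome Feature Coupling Subsystem Property Annotation
               BBH Group Source External Reaction);

# Hypothetical helper: return the groups that would be loaded for one
# command-line parameter, with or without the resume option.
sub groups_to_load {
    my ($param, $resume) = @_;
    # "*" always means every group.
    return @ORDER if $param eq '*';
    # Find the position of the named group; an unknown name loads nothing.
    my ($start) = grep { $ORDER[$_] eq $param } 0 .. $#ORDER;
    return () unless defined $start;
    # With resume, load the named group and everything after it.
    return $resume ? @ORDER[$start .. $#ORDER] : ($ORDER[$start]);
}

print join(' ', groups_to_load('External', 1)), "\n";
```

With `resume` set, naming `External` selects `External` and `Reaction`; without it, only `External` is selected.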
