[Bio] / Sprout / SimBlocks.pm Repository:
ViewVC logotype

Annotation of /Sprout/SimBlocks.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.2 - (view) (download) (as text)

1 : parrello 1.1 package SimBlocks;
2 :    
3 :     require Exporter;
4 :     use ERDB;
5 :     @ISA = qw(Exporter ERDB);
6 : parrello 1.2 @EXPORT = qw(DefaultDistances);
7 :     @EXPORT_OK = qw(BlocksInSet GetBlock CompareGenomes DBLoad DistanceMatrix GetAlignment GetRegions ParsePattern RemoveBlocks SequenceDistance SetNumber SnipScan);
8 : parrello 1.1
9 :     use strict;
10 :     use Tracer;
11 :     use PageBuilder;
12 :     use Genome;
13 :    
14 :     use FIG_Config;
15 :    
16 :     =head1 Similarity Block Database
17 :    
18 :     =head2 Introduction
19 :    
20 :     A I<similarity Block Database> describes the similarities amongst a group
21 :     of contigs. The load files for the database are produced by the
22 :     C<SimBlast.pl> script from genome FASTA files. The database is described
23 :     by the C<SimBlocksDBD.xml> file.
24 :    
25 :     The primary entities of the database are B<Genome>, which represents a specified
26 : parrello 1.2 organism, B<GroupBlock>, which represents a set of similar regions, and B<Contig>,
27 : parrello 1.1 which represents a contiguous segment of DNA. The key relationship is
28 : parrello 1.2 B<ContainsRegionIn>, which maps blocks to contigs. The mapping information
29 : parrello 1.1 includes the DNA nucleotides as well as the location in the contig.
30 :    
31 : parrello 1.2 =head2 Formal Definitions
32 :    
33 :     Let {B<G(1)>, B<G(2)>, ..., B<G(>I<n>B<)>} be a set of genomes
34 :     and let {B<g(1)>, B<g(2)>, ..., B<g(>I<m>B<)>} be the set of
35 :     genes called on these genomes.
36 :    
37 :     The primitive notion from which similarity blocks are derived is the
38 :     I<correspondence mapping>. Roughly, this mapping connects genes on
39 :     one genome to corresponding genes on other genomes. We will treat
40 :     the mapping as an equivalence class. In practice, the entire mapping
41 :     will not be available when we are building to load files for the
42 :     similarity block database. Instead, the user will provide a subset
43 :     of the mapping from which the entire partitioning can be deduced
44 :     via commutativity, reflexivity, and transitivity. This subset is
45 :     called the I<correspondence relation>.
46 :    
47 :     In actual practice, the correspondence relation will contain some
48 :     false positives. The process of developing the correspondence
49 :     relation usually involves analyzing similarities above a certain
50 :     threshhold. (This type of similarity is not necessarily transitive;
51 :     however, the true, underlying correspondence mapping we want I<is>
52 :     transitive.) Often, a sequence of correspondence pairs that maps
53 :     a gene on one particular genome to another gene on the same genome
54 :     is an indication of a false positive.
55 :    
56 :     In addition to false positives, we may have invalid genes, missing
57 :     genes, frame shifts, and incorrect start locations. Detecting and
58 :     cleaning these problems is an important part of getting a good
59 :     correspondence relation.
60 :    
61 :     Given a full-blown correspondence mapping, we define the following
62 :     concepts.
63 :    
64 :     =over 4
65 :    
66 :     =item corresponding gene
67 :    
68 :     The I<corresponding genes> of a given gene B<g(>I<i>B<)> are those
69 :     genes mapped to it by the correspondence mapping.
70 :    
71 :     =item gene block
72 :    
73 :     A I<gene block> is a maximal set of corresponding genes.
74 :    
75 :     =item intergenic region
76 :    
77 :     An I<intergenic region> is a section of the contig between
78 :     two adjacent genes. The adjacent genes are called the
79 :     I<left gene> and the I<right gene>, depending on whether they
80 :     precede or follow the region. For example, if C<G2_100+50>
81 :     and C<G2_175+100> are adjacent, C<G2_150+25> is their
82 :     intergenic region, the left gene is C<G2_100+50>, and
83 :     the right gene is C<G2_175+100>.
84 :    
85 :     =item intergenic block
86 :    
87 :     An I<intergenic block> is a maximal set of intergenic regions such
88 :     that (1) the left genes all correspond and have the same orientation,
89 :     (2) the right genes all correspond and have the same orientation, and
90 :     (3) the region lengths differ by no more than a specified threshhold
91 :     (usually 10%).
92 :    
93 :     =item similarity block
94 :    
95 :     A I<similarity block> is either a gene block or an intergenic block.
96 :     A similarity block is therefore a set of regions (either gene regions
97 :     or intergenic regions) that are tied together by the correspondence
98 :     mapping.
99 :    
100 :     =item common block
101 :    
102 :     A I<common block> with respect to a subset of the genomes
103 :     {B<G(>I<i1>B<)>, B<G(>I<i2>B<)>, ... B<G(>I<ik>B<)>} is a
104 :     similarity block that has exactly one region in each
105 :     of the genomes of the subset. There is a lesser notion of
106 :     a block that occurs in most genomes in the subset. The actual
107 :     program allows the notion of I<most> to be defined externally.
108 :     The default is all of the genomes in the subset.
109 :    
110 :     =item unique sequence
111 :    
112 :     A I<unique sequence> is a region that does not appear in any
113 :     similarity block with another region. In other words, a unique
114 :     sequence is a region that does not correspond to any other region.
115 :    
116 :     =back
117 :    
118 :     =head2 Usage
119 : parrello 1.1
120 :     The similarity block database must be able to answer two questions, taking
121 : parrello 1.2 as input two genome sets. (These would be subsets of the total set of genomes
122 :     represented in the correspondence mapping.)
123 : parrello 1.1
124 :     =over 4
125 :    
126 : parrello 1.2 =item (1) Which blocks are common in one set and do not occur in the other.
127 :    
128 :     These are considered major variations. They result from prophage insertions,
129 :     transposition events, and recombinations. A major variation results in a
130 :     different set of proteins in an operating cell.
131 : parrello 1.1
132 : parrello 1.2 =item (2) What individual variations in the shared common blocks distinguish the two sets?
133 :    
134 :     These are considered minor variations, and are the most common result of
135 :     single-nucleotide mutations. Rather than changing the set of proteins,
136 :     minor variations change the proteins themselves.
137 : parrello 1.1
138 :     =back
139 :    
140 :     This class is a subclass of B<ERDB>, and inherits all of its public methods.
141 :    
142 : parrello 1.2 =head2 Setup
143 :    
144 :     To begin working with similarity blocks, you must first create a database to
145 :     contain the block data. Next, you must specify the various similarity block
146 :     parameters in C<FIG_Config>. These are as follows. (Of course, the values
147 :     you specify will be different.)
148 :    
149 :     $simBlocksDB = "conser5_blocks"; # SimBlocks database schema name
150 :     $simBlocksData = "$data/BlockData"; # directory containing SimBlocks data files
151 :     # (used to load the Similarity database)
152 :     $simBlastData = "$data/BlastData"; # directory containing SimBlast data files
153 :     # (input to SimBlast.pl)
154 :    
155 :     Run the C<SimBlast> script to load the database.
156 :    
157 :     The C<SimBlockForm.cgi> service enables you to run a similarity analysis between genome
158 :     sets in the database. The C<Html/ERDBTest.html> page enables you to run general queries
159 :     against the database and execute simple scripts (however, this latter capability is
160 :     disabled unless you have
161 :    
162 :     $debug_mode = 1; # TRUE to enable the debugging scripts
163 :    
164 :     The Similarity Block methods and scripts work in the Windows version of the SEED (see
165 :     C<WinBuild>).
166 :    
167 : parrello 1.1 =cut
168 :    
169 :     #: Constructor SimBlocks->new();
170 :    
171 :     =head2 Public Methods
172 :    
173 :     =head3 new
174 :    
175 :     C<< my $simdb = SimBlocks->new($dbname, $dbType, $port); >>
176 :    
177 :     Construct a new SimBlocks object connected to the specified database. If no
178 :     database is specified, the default database indicated by C<FIG_Config::simBlocksDB>
179 :     will be used. Similarly, if the database type is omitted, the type will be
180 :     taken from C<FIG_Config::dbms>, and if the port is omitted, C<FIG_Config::dbport>
181 :     will be used. The user name and password for connecting are taken from
182 :     C<FIG_Config::dbuser> and C<FIG_Config::dbpass>.
183 :    
184 : parrello 1.2 In almost every case, you will be calling
185 :    
186 :     C<< my $simdb = SimBlocks->new(); >>
187 :    
188 : parrello 1.1 =over 4
189 :    
190 :     =item dbname (optional)
191 :    
192 :     Database schema name. This is how we choose the desired database on the server.
193 :     If omitted, the value of C<$FIG_Config::simBlocksDB> is used.
194 :    
195 :     =item dbType
196 :    
197 :     Database type (e.g. C<mysql> for MySQL or C<pg> for Postgres). If omitted, the
198 :     value of C<$FIG_Config::dbms> will be used.
199 :    
200 :     =item port
201 :    
202 :     Port number for connecting to the database server. If omitted, the value of
203 :     C<$FIG_Config::dbport> will be used.
204 :    
205 :     =back
206 :    
207 :     =cut
208 :    
209 :     sub new {
210 :     # Get the parameters.
211 :     my ($class, $dbname, $dbType, $port) = @_;
212 :     # Plug in the default values.
213 :     if (! $dbname) {
214 :     $dbname = $FIG_Config::simBlocksDB;
215 :     }
216 :     if (! $dbType) {
217 :     $dbType = $FIG_Config::dbms;
218 :     }
219 :     if (! $port) {
220 :     $port = $FIG_Config::dbport;
221 :     }
222 :     # Connect to the database.
223 :     my $dbh = DBKernel->new($dbType, $dbname, $FIG_Config::dbuser, $FIG_Config::dbpass, $port);
224 :     # Get the data directory name.
225 :     my $directory = $FIG_Config::simBlocksData;
226 :     # Create and bless the ERDB object.
227 :     my $retVal = ERDB::new($class, $dbh, "$directory/SimBlocksDBD.xml");
228 :     # Return it.
229 :     return $retVal;
230 :     }
231 :    
232 : parrello 1.2 =head3 DefaultDistances
233 :    
234 :     C<< my $distances = DefaultDistances(); >>
235 :    
236 :     Return the default distance matrix for computing the alignment distances (see
237 :     also L</DistanceMatrix>. This matrix returns a distance of 0 between an insertion
238 :     character (C<->) and any other nucleotide, a distance of 0.5 between nucleotides
239 :     of the same type, and a distance of 1 between nucleotides of different types.
240 :    
241 :     =cut
242 :    
243 :     my $DefaultDistanceMatrix = { AA => 0, AC => 1, AG => 0.5, AT => 1,
244 :     AN => 0.625, AR => 0.25, AY => 1, 'A-' => 0,
245 :     CA => 1, CC => 0, CG => 1, CT => 0.5,
246 :     CN => 0.625, CR => 1, CY => 0.25, 'C-' => 0,
247 :     GA => 0.5, GC => 1, GG => 0, GT => 1,
248 :     GN => 0, GR => 0.25, GY => 1, 'G-' => 0,
249 :     TA => 1, TC => 0.5, TG => 1, TT => 0,
250 :     TN => 0, TR => 1, TY => 0.25, 'T-' => 0,
251 :     RA => 0.25, RC => 1, RG => 0.25, RT => 1,
252 :     RN => 0.625, RR => 0.25, RY => 1, 'R-' => 0,
253 :     YA => 1, YC => 0.25, YG => 1, YT => 0.25,
254 :     YN => 0.625, YR => 1, YY => 0.25, 'Y-' => 0,
255 :     NA => 0.625, NC => 0.625, NG => 0.625, NT => 0.625,
256 :     NN => 0.625, NR => 0.625, NY => 0.625,'N-' => 0,
257 :     '-A' => 0, '-C' => 0, '-G' => 0, '-T' => 0,
258 :     '-N' => 0, '-R' => 0, '-Y' => 0, '--' => 0 };
259 :    
260 :     sub DefaultDistances {
261 :     return $DefaultDistanceMatrix;
262 :     }
263 :    
264 : parrello 1.1 =head3 DBLoad
265 :    
266 :     C<< my $stats = $simBlocks->DBLoad($rebuild); >>
267 :    
268 :     Load the database from the default directory. This is essentially identical to
269 :     a B<LoadTables> call with the default directory used instead of a caller-specified
270 :     directory.
271 :    
272 :     =over 4
273 :    
274 :     =item rebuild
275 :    
276 :     TRUE if the tables should be re-created; FALSE if the tables should be cleared and
277 : parrello 1.2 reloaded. The default is FALSE; however, TRUE is considerably faster. You would use
278 :     FALSE if you had added non-standard indexes to the database and didn't want to lose
279 :     them.
280 : parrello 1.1
281 :     =item RETURN
282 :    
283 :     Returns a statistics object describing the duration of the load, the number of
284 :     records loaded, and a list of error messages.
285 :    
286 :     =back
287 :    
288 :     =cut
289 :     #: Return Type $%@;
290 :     sub DBLoad {
291 :     # Get the parameters.
292 :     my ($self, $rebuild) = @_;
293 :     # Load the tables.
294 :     my $retVal = $self->LoadTables($FIG_Config::simBlocksData, $rebuild);
295 :     # Return the result.
296 :     return $retVal;
297 :     }
298 :    
299 :     =head3 CompareGenomes
300 :    
301 : parrello 1.2 C<< my (\%set0Blocks, \%set1Blocks, \%bothBlocks) = $simBlocks->CompareGenomes(\@set0, \@set1); >>
302 : parrello 1.1
303 :     Analyze two sets of genomes for commonalities. The group blocks returned will be divided
304 : parrello 1.2 into three hashes: one for those common to set 0 and not occurring at all in set 1, one
305 :     for those common to set 1 and not occurring at all in set 0, and one for those common
306 :     to both sets. Each hash is keyed by group ID and will contain B<DBObject>s for
307 :     B<GroupBlock> records with B<HasInstanceOf> data attached, though the genome ID in
308 :     the B<HasInstanceOf> section is not generally predictable.
309 : parrello 1.1
310 :     =over 4
311 :    
312 : parrello 1.2 =item set0
313 : parrello 1.1
314 : parrello 1.2 Reference to a list of genome IDs for the genomes in the first set (set 0).
315 : parrello 1.1
316 : parrello 1.2 =item set1
317 : parrello 1.1
318 : parrello 1.2 Reference to a list of genome IDs for the genomes in the second set (set 1).
319 : parrello 1.1
320 :     =item RETURN
321 :    
322 : parrello 1.2 Returns a triple of hashes. Each hash is keyed by group ID, and will contain
323 :     B<DBObject>s for records in the B<GroupBlock> table. Groups found in all of the
324 :     genomes in set 0 but in none of the genomes of set 1 will be in the first hash,
325 :     groups found in all of the genomes in set 1 but in none of the genomes of set 0
326 :     will be in the second hash, and groups found in all of the genomes of both sets
327 :     will be in the third hash.
328 : parrello 1.1
329 :     =back
330 :    
331 :     =cut
332 :     #: Return Type @%;
333 :     sub CompareGenomes {
334 :     # Get the parameters.
335 : parrello 1.2 my ($self, $set0, $set1) = @_;
336 : parrello 1.1 # Declare the three hashes.
337 : parrello 1.2 my (%set0Blocks, %set1Blocks, %bothBlocks);
338 :     # Our first task is to find the common groups in each genome set.
339 :     my %groups0 = $self->BlocksInSet($set0);
340 : parrello 1.1 my %groups1 = $self->BlocksInSet($set1);
341 :     # Create a trailer key to help tighten the loops.
342 :     my $trailer = "\xFF";
343 :     # The groups are hashed by group ID. We will move through them in key order
344 : parrello 1.2 # to get the "set0", "set1", and "both" groups.
345 :     my @blockIDs0 = (sort keys %groups0, $trailer);
346 : parrello 1.1 my @blockIDs1 = (sort keys %groups1, $trailer);
347 :     # Prime the loop by getting the first ID from each list. Thanks to the trailers
348 :     # we know this process can't fail.
349 : parrello 1.2 my $blockID0 = shift @blockIDs0;
350 : parrello 1.1 my $blockID1 = shift @blockIDs1;
351 : parrello 1.2 while ($blockID0 lt $trailer || $blockID1 lt $trailer) {
352 : parrello 1.1 # Compare the block IDs.
353 : parrello 1.2 if ($blockID0 lt $blockID1) {
354 :     # Block 0 is only in the first set.
355 :     Trace("Block IDs $blockID0 vs. $blockID1. Set 0 selected.") if T(SimCompare => 3);
356 :     $set0Blocks{$blockID0} = $groups0{$blockID0};
357 :     $blockID0 = shift @blockIDs0;
358 :     } elsif ($blockID0 gt $blockID1) {
359 :     # Block 1 is only in the second set.
360 :     Trace("Block IDs $blockID0 vs. $blockID1. Set 1 selected.") if T(SimCompare => 3);
361 : parrello 1.1 $set1Blocks{$blockID1} = $groups1{$blockID1};
362 :     $blockID1 = shift @blockIDs1;
363 :     } else {
364 :     # We have a block in both sets.
365 : parrello 1.2 Trace("Block IDs $blockID0 vs. $blockID1. Both selected.") if T(SimCompare => 3);
366 : parrello 1.1 $bothBlocks{$blockID1} = $groups1{$blockID1};
367 : parrello 1.2 $blockID0 = shift @blockIDs0;
368 : parrello 1.1 $blockID1 = shift @blockIDs1;
369 :     }
370 :     }
371 : parrello 1.2 # The set1 and set2 lists are too big. They contain groups that are common in their
372 :     # respective sets but not common in both sets. We want groups that are common in
373 :     # their respective sets but completely absent in the other set. We do this by getting
374 :     # a complete list of the blocks in each set and deleting anything we find.
375 :     $self->RemoveBlocks(\%set0Blocks, $set1);
376 :     $self->RemoveBlocks(\%set1Blocks, $set0);
377 : parrello 1.1 # Return the result.
378 : parrello 1.2 return (\%set0Blocks, \%set1Blocks, \%bothBlocks);
379 :     }
380 :    
381 :     =head3 RemoveBlocks
382 :    
383 :     C<< $simBlocks->RemoveBlocks(\%blockMap, \@set); >>
384 :    
385 :     Remove from the specified block map any blocks that occur in the specified set of genomes.
386 :     The block map can contain any data, but it must be keyed by block ID.
387 :    
388 :     =over 4
389 :    
390 :     =item blockMap
391 :    
392 :     Reference to a hash of block IDs to block data. The character of the block data is
393 :     irrelevant to this method.
394 :    
395 :     =item set
396 :    
397 :     Reference to a list of genome IDs. Any block that occurs in any one of the specified
398 :     genomes will be removed from the block map.
399 :    
400 :     =back
401 :    
402 :     =cut
403 :     #: Return Type ;
404 :     sub RemoveBlocks {
405 :     # Get the parameters.
406 :     my ($self, $blockMap, $set) = @_;
407 :     # Get a list of the blocks in the genome set.
408 :     my %setBlocks = $self->BlocksInSet($set, 1);
409 :     # Loop through the incoming block map, deleting any blocks found
410 :     # in the new block set. Note we make the assumption that the
411 :     # incoming block map is smaller.
412 :     my @blockList = keys %{$blockMap};
413 :     for my $blockID (@blockList) {
414 :     if (exists $setBlocks{$blockID}) {
415 :     delete $blockMap->{$blockID};
416 :     }
417 :     }
418 : parrello 1.1 }
419 :    
420 :     =head3 BlocksInSet
421 :    
422 : parrello 1.2 C<< my %blockList = $simBlocks->BlocksInSet($set, $count); >>
423 : parrello 1.1
424 : parrello 1.2 Return a list of the group blocks found in a given number of the genomes in a given
425 :     set. The list returned will be a hash of B<DBObject>s, each corresponding to a single
426 :     B<GroupBlock> record, with a B<HasInstanceOf> record attached, though the content of
427 :     the B<HasInstanceOf> record is not predictable. The hash will be keyed by block ID.
428 : parrello 1.1
429 :     =over 4
430 :    
431 :     =item set
432 :    
433 : parrello 1.2 Reference to a list of genome IDs.
434 :    
435 :     =item count (optional)
436 :    
437 :     Minimum number of genomes from the set in which the block must appear. If C<1> is
438 :     specified, the list will include any block that occurs in at least one genome.
439 :     If C<5> is specified, the list will only include blocks that occur in at least
440 :     5 genomes from the set. If the set consists of only 4 genomes in this latter
441 :     case, no blocks will be returned. The default is to return blocks that occur in
442 :     all genomes of the set.
443 : parrello 1.1
444 :     =item RETURN
445 :    
446 :     Returns a hash of B<DBObject>s corresponding to the group blocks found in the
447 :     genomes of the set.
448 :    
449 :     =back
450 :    
451 :     =cut
452 :     #: Return Type %%;
453 :     sub BlocksInSet {
454 :     # Get the parameters.
455 : parrello 1.2 my ($self, $set, $count) = @_;
456 :     # If the count was not specified, set it t
457 :     if (!defined $count) {
458 :     $count = @{$set};
459 :     }
460 :     Trace("BlocksInSet called for $count genomes of set (" . join(", ", @{$set}) . ").") if T(2);
461 : parrello 1.1 # Create a hash to hold the groups found. The hash will be keyed by group ID
462 :     # and contain the relevant DB object.
463 :     my %retVal = ();
464 : parrello 1.2 # Create a hash to contain the number of genomes in which each block was found.
465 :     my %counters = ();
466 : parrello 1.1 # Loop through the genome IDs in the specified set.
467 :     for my $genomeID (@{$set}) {
468 :     # Get a list of group blocks for this genome.
469 :     my @blocks = $self->GetList(['HasInstanceOf', 'GroupBlock'],
470 :     "HasInstanceOf(from-link) = ?", $genomeID);
471 :     # Loop through the blocks, storing any new ones in the hash.
472 :     for my $block (@blocks) {
473 :     # Get the ID of this block.
474 :     my ($blockID) = $block->Value('GroupBlock(id)');
475 : parrello 1.2 # Determine whether this block is new or has been found in a previous
476 :     # genome of the set.
477 :     if (exists $counters{$blockID}) {
478 :     # Here it's old. Increment its counter.
479 :     $counters{$blockID}++;
480 :     } else {
481 :     # Here it's new. Create a counter for it.
482 :     $counters{$blockID} = 1;
483 :     }
484 :     # If it meets the threshhold, put it in the return hash. If it's already
485 :     # there, we'll simply overwrite the old copy, which makes no difference
486 :     # to the caller.
487 :     Trace("Block $blockID has $counters{$blockID} occurrences as of genome $genomeID.") if T(SimSet => 4);
488 :     if ($counters{$blockID} >= $count) {
489 :     $retVal{$blockID} = $block;
490 :     Trace("Block $blockID added to set.") if T(SimSet => 3);
491 :     }
492 : parrello 1.1 }
493 :     }
494 :     # Return the result.
495 :     return %retVal;
496 :     }
497 :    
498 :     =head3 GetRegions
499 :    
500 :     C<< my %regions = $simBlocks->GetRegions($blockID, \@genomes); >>
501 :    
502 :     Return the regions of the specified block that occur in the contigs of
503 :     the specified genomes. The return value is a hash of DNA strings keyed
504 :     on the region's location descriptor.
505 :    
506 :     =over 4
507 :    
508 :     =item blockID
509 :    
510 :     ID of the group block whose regions are to be returned.
511 :    
512 :     =item genomes
513 :    
514 :     Reference to a list of genome IDs. Only regions in one of the listed
515 :     genomes will be returned.
516 :    
517 :     =item RETURN
518 :    
519 :     Returns a hash keyed by region location. The objects in the hash are
520 :     the DNA strings for the specified region.
521 :    
522 :     =back
523 :    
524 :     =cut
525 :     #: Return Type %$;
526 :     sub GetRegions {
527 :     # Get the parameters.
528 :     my ($self, $blockID, $genomes) = @_;
529 :     # Create the return hash.
530 :     my %retVal = ();
531 : parrello 1.2 # Count the regions found.
532 :     my $regionCount = 0;
533 : parrello 1.1 # Query all the regions for the specified block.
534 : parrello 1.2 my $query = $self->Get(['IncludesRegion', 'Region'], "IncludesRegion(from-link) = ?",
535 :     $blockID);
536 : parrello 1.1 # Loop through the query.
537 :     while (my $region = $query->Fetch) {
538 :     # Get this region's data.
539 : parrello 1.2 my ($location, $dna) =
540 :     $region->Values(['Region(id)', 'Region(content)']);
541 : parrello 1.1 # Insure it belongs to one of our genomes.
542 : parrello 1.2 my $found = SetNumber($location, $genomes);
543 : parrello 1.1 # If it does, put it in the hash.
544 : parrello 1.2 if ($found >= 0) {
545 : parrello 1.1 $retVal{$location} = $dna;
546 : parrello 1.2 $regionCount++;
547 : parrello 1.1 }
548 :     }
549 : parrello 1.2 Trace("$regionCount regions found by GetRegions.") if T(2);
550 : parrello 1.1 # Return the result.
551 :     return %retVal;
552 :     }
553 :    
554 : parrello 1.2 =head3 SetNumber
555 :    
556 :     C<< my $setNumber = SimBlocks::SetNumber($contigRegion, \@set0, \@set1, ..., \@setN); >>
557 :    
558 :     Examine a region string, contig ID, or genome ID, and return the number of the genome
559 :     set to which it belongs.
560 :    
561 :     =over 4
562 :    
563 :     =item contigRegion
564 :    
565 :     A region string, contig ID, or genome ID. Because the contig ID is prefixed by the genome ID
566 :     delimited by a colon symbol (C<:>), we split at the first colon to get the desired ID.
567 :    
568 :     =item set0, set1, ..., setN
569 :    
570 :     A list of references to genome ID lists. Essentially, we will return the index of the
571 :     first incoming list that has the specified genome ID in it.
572 :    
573 :     =item RETURN
574 :    
575 :     The index in the list of sets of the set containing the relevant genome, or -1 if
576 :     the genome was not found.
577 :    
578 :     =back
579 :    
580 :     =cut
581 :     #: Return Type $
582 :     sub SetNumber {
583 :     # Get the parameters.
584 :     my ($contigRegion, @sets) = @_;
585 :     # Extract the genome ID.
586 :     my ($genomeID) = split /:/, $contigRegion;
587 :     # Clear the return variable. A negative number means nothing found.
588 :     my $retVal = -1;
589 :     # Loop until we find the genome.
590 :     for (my $i = 0; $retVal < 0 && $i <= $#sets; $i++) {
591 :     my $set = $sets[$i];
592 :     if (grep /^$genomeID$/, @{$set}) {
593 :     $retVal = $i;
594 :     }
595 :     }
596 :     # Return the result.
597 :     return $retVal;
598 :     }
599 :    
600 : parrello 1.1 =head3 TagDNA
601 :    
602 :     C<< my $taggedDNA = SimBlocks::TagDNA($pattern, $dnaString, $prefix, $suffix); >>
603 :    
604 :     Convert a DNA string from the B<ContainsRegionIn> relationship to the actual
605 :     DNA, optionally with markings surrounding them. The DNA string will contain
606 :     only the values corresponding to the question marks in the pattern,
607 :     which should be taken from the DNA string's parent B<GroupBlock>. The
608 :     resulting DNA sequence is built by copying
609 :    
610 :     =over 4
611 :    
612 :     =item pattern
613 :    
614 :     DNA pattern from which the positions of variance are to be taken.
615 :    
616 :     =item dnaString
617 :    
618 :     String containing DNA string to be marked.
619 :    
620 :     =item prefix
621 :    
622 :     String to be inserted before each position of variance.
623 :    
624 :     =item suffix
625 :    
626 :     String to be inserted after each position of variance.
627 :    
628 :     =item RETURN
629 :    
630 :     Returns a copy of the pattern string with the positions of variance filled
631 :     in from the DNA string. If a prefix and suffix are specified, they are
632 :     placed before and after each character copied into the pattern string.
633 :    
634 :     =back
635 :    
636 :     =cut
637 :     #: Return Type $;
638 :     sub TagDNA {
639 :     # Get the parameters.
640 :     my ($pattern, $dnaString, $prefix, $suffix) = @_;
641 :     # Insure the prefix and suffix behave properly.
642 :     if (!defined $prefix) {
643 :     $prefix = "";
644 :     }
645 :     if (!defined $suffix) {
646 :     $suffix = "";
647 :     }
648 :     # Initialize the return variable.
649 :     my $retVal = "";
650 :     # Convert the DNA string to a list with the prefixes and suffixes
651 :     # inserted.
652 :     my @dnaList = map { $prefix . $_ . $suffix } split //, $dnaString;
653 :     # Loop through the DNA string.
654 :     my ($start, $position) = (0, 0);
655 :     while (($position = index $pattern, "?", $start) >= 0) {
656 :     # If we have invariant text, copy it over.
657 :     if ($position > $start) {
658 :     $retVal .= substr $pattern, $start, $position - $start;
659 :     }
660 :     # Pop off the next position from the DNA string.
661 :     $retVal .= shift @dnaList;
662 :     # Start past the question mark.
663 :     $start = $position + 1;
664 :     }
665 : parrello 1.2 # Tack on the residual.
666 :     $retVal .= substr $pattern, $start;
667 : parrello 1.1 # Return the result.
668 :     return $retVal;
669 :     }
670 : parrello 1.2
671 :     =head3 SnipScan
672 :    
673 :     C<< my %positions = $simBlocks->SnipScan($blockObject, \@set0, \@set1); >>
674 :    
675 :     Examine the specified block and return a list of the positions at which the
676 :     nucleotide values for regions in the first genome set differ from the values
677 :     for regions in the second genome set. This conditions is only satisfied if
678 :     the nucleotides occurring in the first set's regions never occur in the
679 :     second set's regions. This is a very strict condition. If, for example, C<A>
680 :     occurs in one of the regions for the first set, it cannot occur in any of the
681 :     regions for the second set.
682 :    
683 :     =over 4
684 :    
685 :     =item blockObject
686 :    
687 :     A C<DBObject> representing the B<GroupBlock> record for the desired block or
688 :     the actual ID of the block whose regions are to be examined. It is expected that the
689 :     block will have regions in all of the genomes for both sets, but this is not
690 :     required by the algorithm.
691 :    
692 :     =item set0
693 :    
694 :     Reference to a list of the IDs for the genomes in the first set (0).
695 :    
696 :     =item set1
697 :    
698 :     Reference to a list of the IDs for the genomes in the second set (1).
699 :    
700 :     =item RETURN
701 :    
702 :     A hash that maps positions to contents. Each position will be relative to the
703 :     start of the block. The mapped value will list the nucleotides found in the
704 :     first set and the nucleotides found in the second set. So, for example,
705 :     C<['AC', 'G']> means that both C<A> and C<C> are found in the genomes of the
706 :     first set (0), but only C<G> is found in genomes of the second set (1).
707 :    
708 :     =back
709 :    
710 :     =cut
711 :     #: Return Type %;
712 :     sub SnipScan {
713 :     # Get the parameters.
714 :     my ($self, $blockObject, $set0, $set1) = @_;
715 :     # Convert an incoming block ID to a block object.
716 :     if (ref $blockObject ne "DBObject") {
717 :     $blockObject = $self->GetEntity('GroupBlock', $blockObject);
718 :     }
719 :     # Get the ID and length of this block.
720 :     my ($blockID) = $blockObject->Value('GroupBlock(id)');
721 :     my ($len) = $blockObject->Value('GroupBlock(snip-count)');
722 :     # Create a hash that can be used to identify the genomes in each set.
723 :     my %setMap = ();
724 :     for my $genomeID (@{$set0}) {
725 :     $setMap{$genomeID} = 0;
726 :     }
727 :     for my $genomeID (@{$set1}) {
728 :     $setMap{$genomeID} = 1;
729 :     }
730 :     # Create the snip hash. We'll create empty lists for each snip position.
731 :     # As soon as a position becomes ineligible we delete it from the hash.
732 :     my %snipHash = ();
733 :     for (my $i = 0; $i < $len; $i++) {
734 :     $snipHash{$i} = ['', ''];
735 :     }
736 :     # Ask for the regions in the block.
737 :     my $query = $self->Get(['IncludesRegion', 'Region'], "IncludesRegion(from-link) = ?",
738 :     $blockID);
739 :     # Loop through the regions.
740 :     while (my $region = $query->Fetch) {
741 :     # Determine this region's genome set. We only continue if the region is in
742 :     # one of the sets.
743 :     my ($location) = $region->Value('Region(from-link)');
744 :     my $genomeID = Genome::FromContig($location);
745 :     if (exists $setMap{$genomeID}) {
746 :     my $setIndex = $setMap{$genomeID};
747 :     # Get the region's snip values.
748 :     my ($regionValue) = $region->Value('Region(content)');
749 :     # Loop through the positions that are still active.
750 :     my @positions = keys %snipHash;
751 :     for my $position (@positions) {
752 :     # Get the nucleotide at the current position in this region.
753 :     my $letter = substr $regionValue, $position, 1;
754 :     # Find it in the appropriate list.
755 :     my $letterList = $snipHash{$position};
756 :     if ((index $letterList->[1-$setIndex], $letter) >= 0) {
757 :     # It's in the other set, so this position is no
758 :     # longer valid.
759 :     delete $snipHash{$position};
760 :     } elsif ((index $letterList->[$setIndex], $letter) < 0) {
761 :     # It's in neither set, so we add it to the list we're
762 :     # accumulating for this set at this position.
763 :     $letterList->[$setIndex] .= $letter;
764 :     }
765 :     }
766 :     }
767 :     }
768 :     # Now %snipHash contains all of the information we need, but we must relocate the
769 :     # positions so they are relative to the block address instead of the snip
770 :     # positions. We need the block pattern to do this.
771 :     my ($blockPattern) = $blockObject->Value('GroupBlock(pattern)');
772 :     my @positions = ParsePattern($blockPattern);
773 :     # Create the return hash.
774 :     my %retVal = ();
775 :     # Relocate the positions.
776 :     for my $inPosition (keys %snipHash) {
777 :     $retVal{$positions[$inPosition]} = $snipHash{$inPosition};
778 :     }
779 :     # Return the result.
780 :     return %retVal;
781 :     }
782 :    
783 :     =head3 ParsePattern
784 :    
785 :     C<< my @positions = SimBlocks::ParsePattern($pattern); >>
786 :    
787 :     Get a list of the positions of variance inside a block pattern. The
788 :     positions of variance are marked by question marks, so all we need to
789 :     do is return the location of every question mark in the incoming
790 :     patter string.
791 :    
792 :     =over 4
793 :    
794 :     =item pattern
795 :    
796 :     DNA pattern for a group block.
797 :    
798 :     =item RETURN
799 :    
800 :     Returns a list of position numbers. These are coded as offsets, with 0 indicating
801 :     the first character of the pattern.
802 :    
803 :     =back
804 :    
805 :     =cut
806 :     #: Return Type @;
807 :     sub ParsePattern {
808 :     # Get the parameters.
809 :     my ($pattern) = @_;
810 :     # Create the return list.
811 :     my @retVal = ();
812 :     # Loop through the pattern, looking for question marks.
813 :     my $position = 0;
814 :     while (($position = index($pattern, "?", $position)) >= 0) {
815 :     push @retVal, $position;
816 :     $position++;
817 :     }
818 :     # Return the result.
819 :     return @retVal;
820 :     }
821 :    
822 :    
823 :    
824 :     =head3 MergeDNA
825 :    
826 :     C<< my ($groupSequence, $variance) = SimBlocks::MergeDNA($groupSequence, $newSequence); >>
827 :    
828 :     Merge the DNA for a region into the group representation of similar DNA, returning the
829 :     result and the positions of variance. Positions of variance in the group representation
830 :     are indicated by question marks. This method essentially compares the new DNA to the
831 :     group DNA and adds question marks wherever the new DNA does not match.
832 :    
833 :     =over 4
834 :    
835 :     =item groupSequence
836 :    
837 :     Group representation of the DNA into which the new DNA is to be merged.
838 :    
839 :     =item newSequence
840 :    
841 :     New DNA to be merged into the group representation.
842 :    
843 :     =item RETURN
844 :    
845 :     Returns the merged group representation followed by an updated count of the number
846 :     of question marks in the group.
847 :    
848 :     =back
849 :    
850 :     =cut
851 :     sub MergeDNA {
852 :     # Get the parameters.
853 :     my ($groupSequence, $newSequence) = @_;
854 :     # Declare the return variables.
855 :     my ($retVal, $varianceCount) = ("", 0);
856 :     # If there is no group representation, or if the group representation is the
857 :     # same as the new DNA, we return the new DNA; otherwise, we have to merge.
858 :     # Note that the new DNA cannot contain any question marks, so if either of
859 :     # the degenerate cases is true, we return a variance of 0.
860 :     if ((! $groupSequence) || ($groupSequence eq $newSequence)) {
861 :     $retVal = $newSequence;
862 :     } else {
863 :     # Here the group representation and the new DNA are different, so we
864 :     # have to find all the points of difference and replace them with
865 :     # question marks. We do this using a trick: by XOR-ing the two
866 :     # strings together, we create a new string with nulls everywhere
867 :     # except where there's a mismatch. We then find the non-null characters
868 :     # and replace the corresponding positions in the return string with
869 :     # question marks. First, we get a copy of the group representation to
870 :     # play with.
871 :     $retVal = $groupSequence;
872 :     # Next we compute the XOR string.
873 :     my $hexor = $retVal ^ $newSequence;
874 :     # This loop finds all non-null string positions.
875 :     while ($hexor =~ m/[^\x00]/g) {
876 :     # Here we get the position past the mismatch point.
877 :     my $pos = pos $hexor;
878 :     # Replace the real mismatch point with a question mark.
879 :     substr $retVal, $pos - 1, 1, "?";
880 :     $varianceCount++;
881 :     }
882 :     }
883 :     # Return the result.
884 :     return ($retVal, $varianceCount);
885 :     }
886 :    
887 :     =head3 GetAlignment
888 :    
889 :     C<< my %sequences = $$simBlocks->GetAlignment(\@blockIDs, \@genomeIDs, $indels); >>
890 :    
891 :     Return an alignment of the specified genomes relative to the specified block
892 :     IDs. Only blocks in which all of the genomes occur will produce output for
893 :     the alignment.
894 :    
895 :     =over 4
896 :    
897 :     =item blockIDs
898 :    
899 :     Reference to a list of the IDs of the blocks from which the alignment
900 :     should be constructed.
901 :    
902 :     =item genomeIDs
903 :    
904 :     Reference to a list of the genome IDs participating in the alignment.
905 :    
906 :     =item indels (optional)
907 :    
908 :     C<1> if columns containing insertion characters (C<->) should be included
909 :     in the alignment, else C<0>. The default is C<0>.
910 :    
911 :     =item RETURN
912 :    
913 :     Returns a hash that maps each genome ID to its sequence in the alignment.
914 :    
915 :     =back
916 :    
917 :     =cut
918 :     #: Return Type %;
919 :     sub GetAlignment {
920 :     # Get the parameters.
921 :     my ($self, $blockIDs, $genomeIDs, $indels) = @_;
922 :     # Create the return hash. We will start with a null string for each
923 :     # genome, and add the alignment data one block at a time.
924 :     my %retVal = ();
925 :     for my $genomeID (@{$genomeIDs}) {
926 :     $retVal{$genomeID} = "";
927 :     }
928 :     # Get the number of genomes. This will help us in determining which
929 :     # blocks to keep.
930 :     my $totalGenomes = @{$genomeIDs};
931 :     # Loop through the blocks.
932 :     for my $blockID (@{$blockIDs}) {
933 :     # Get the block's data.
934 :     my $blockData = $self->GetEntity('GroupBlock', $blockID);
935 :     my ($blockName) = $blockData->Value('GroupBlock(description)');
936 :     Trace("Processing common block $blockID ($blockName).") if T(Alignment => 4);
937 :     # Only proceed if the block has real variance in it.
938 :     my ($snipCount) = $blockData->Value('GroupBlock(snip-count)');
939 :     if ($snipCount > 0) {
940 :     # Get this block's regions in the specified genomes.
941 :     my %regionHash = $self->GetRegions($blockID, $genomeIDs);
942 :     # Verify that each genome is represented.
943 :     my %blockHash = ();
944 :     my $genomeCount = 0;
945 :     for my $region (keys %regionHash) {
946 :     # Get the genome ID from the region string.
947 :     $region =~ m/^([^:]+):/;
948 :     # If it's new, count it.
949 :     if (! exists $blockHash{$1}) {
950 :     $blockHash{$1} = $regionHash{$region};
951 :     $genomeCount++;
952 :     }
953 :     }
954 :     # At this point, if the number of genomes found matches the
955 :     # size of the list, this block is good.
956 :     if ($genomeCount == $totalGenomes) {
957 :     if (! $indels) {
958 :     # If indels are off, we need to remove some of the columns. We do
959 :     # this by finding C<-> characters in each genome's DNA sequence,
960 :     # and deleting all instances of the relevant columns.
961 :     for my $genomeID (@{$genomeIDs}) {
962 :     Trace("Hyphen check for genome $genomeID.") if T(Alignment => 3);
963 :     while ((my $pos = index($blockHash{$genomeID},"-")) >= 0) {
964 :     # Here we found an insertion character. We delete its column
965 :     # from all the genomes in the alignment.
966 :     Trace("Hyphen found at position $pos in genome $genomeID.") if T(Alignment => 4);
967 :     for my $genomeID2 (@{$genomeIDs}) {
968 :     substr $blockHash{$genomeID2}, $pos, 1, "";
969 :     }
970 :     }
971 :     }
972 :     }
973 :     Trace("Common block $blockID ($blockName) added to alignment.") if T(Alignment => 3);
974 :     for my $genomeID (@{$genomeIDs}) {
975 :     $retVal{$genomeID} .= $blockHash{$genomeID};
976 :     }
977 :     }
978 :     }
979 :     }
980 :     # Return the result.
981 :     return %retVal;
982 :     }
983 :    
984 :     =head3 DistanceMatrix
985 :    
986 :     C<< my %distances = SimBlocks::DistanceMatrix(\%sequences, \%distances); >>
987 :    
988 :     Compute the distances between the sequences in an alignment. the L</SequenceDistance>
989 :     method is used to compute the individual distances.
990 :    
991 :     =over 4
992 :    
993 :     =item sequences
994 :    
995 :     Reference to a hash of genome IDs to alignment sequences. The alignment sequences may
996 :     only contain the characters C<AGCT->, and must all be the same length.
997 :    
998 :     =item distances
999 :    
1000 :     Reference to a hash that maps pairs of letters to distance values. Each letter pair
1001 :     must be represented as a two-character string. Dual strings such as "AG" and "GA" must
1002 :     map to the same distance. If no distance matrix is specified, the default distance
1003 :     matrix is used.
1004 :    
1005 :     =item RETURN
1006 :    
1007 :     Returns a hash that maps genome ID pairs to distance values. The ID pairs are
1008 :     formed using a slash. For example, the distance from C<158878.1> to C<282459.1>
1009 :     would be found attached to the key C<158878.1/282459.1>.
1010 :    
1011 :     =back
1012 :    
1013 :     =cut
1014 :     #: Return Type %
1015 :     sub DistanceMatrix {
1016 :     # Get the parameters.
1017 :     my ($sequences, $distances) = @_;
1018 :     # Declare the return variable.
1019 :     my %retVal = ();
1020 :     # Create a sorted list of the genome IDs.
1021 :     my @genomes = sort keys %{$sequences};
1022 :     # Loop through the genome IDs.
1023 :     for (my $i = 0; $i <= $#genomes; $i++) {
1024 :     my $genomei = $genomes[$i];
1025 :     # Create the 0-distance pair.
1026 :     $retVal{"$genomei/$genomei"} = 0;
1027 :     # Loop through the genomes lexically after this one. We'll
1028 :     # generate one distance and use it for both orderings.
1029 :     for (my $j = $i + 1; $j <= $#genomes; $j++) {
1030 :     my $genomej = $genomes[$j];
1031 :     # Compute the distance.
1032 :     my $distance = SequenceDistance($sequences->{$genomei}, $sequences->{$genomej},
1033 :     $distances);
1034 :     Trace("Distance from $genomei to $genomej is $distance.") if T(Distance => 3);
1035 :     # Store it in the return hash.
1036 :     $retVal{"$genomei/$genomej"} = $distance;
1037 :     $retVal{"$genomej/$genomei"} = $distance;
1038 :     }
1039 :     }
1040 :     # Return the result matrix.
1041 :     return %retVal;
1042 :     }
1043 :    
1044 :     =head3 SequenceDistance
1045 :    
1046 :     C<< my $dist = SimBlocks::SequenceDistance($seq1, $seq2, $distances); >>
1047 :    
1048 :     Return the distance between two sequences. The distance presumes that
1049 :     each alignment is a vector of sorts, with purines (C<A>/C<T>) and pyrmidines (C<G>/C<C>)
1050 :     being a half-unit from themselves and a full unit from each other. The distance is normalized
1051 :     by the number of positions in the alignment. Thus, C<AATGCTT> is 0.6267 from C<ATTTGCA>.
1052 :     The fourth and sixth positions are a full-unit jump and the second, fifth, and seventh
1053 :     positions are half-unit jumps. This works out to
1054 :    
1055 :     sqrt(0 + 0.5^2 + 0 + 1^2 + 0.5^2 + 1^2 + 0.5^2)/sqrt(7)
1056 :    
1057 :     which is approximately 0.6267. Insertion characters (C<->) are considered a distance of
1058 :     0 from everything. This distance model can be overridden by supplying a custom distance
1059 :     matrix as the third parameter.
1060 :    
1061 :     =over 4
1062 :    
1063 :     =item seq1
1064 :    
1065 :     Origin sequence for computing the distance.
1066 :    
1067 :     =item seq2
1068 :    
1069 :     End sequence for computing the distance. Both sequences must be the same length.
1070 :    
1071 :     =item distances
1072 :    
1073 :     Reference to a hash that maps pairs of letters to distance values. Each letter pair
1074 :     must be represented as a two-character string. Dual strings such as "AG" and "GA" must
1075 :     map to the same distance. If no distance matrix is specified, the default distance
1076 :     matrix is used.
1077 :    
1078 :     =back
1079 :    
1080 :     =cut
1081 :     #: Return Type $;
1082 :     sub SequenceDistance {
1083 :     # Get the parameters.
1084 :     my ($seq1, $seq2, $distances) = @_;
1085 :     if (!defined $distances) {
1086 :     $distances = DefaultDistances();
1087 :     }
1088 :     # Declare the return value. Initially, this will be a sum of squared. We'll
1089 :     # take the square root and divide out the length later.
1090 :     my $retVal = 0;
1091 :     # Loop through the sequences, character by character, comparing.
1092 :     my $n = length $seq1;
1093 :     for (my $i = 0; $i < $n; $i++) {
1094 :     my $s1 = substr $seq1, $i, 1;
1095 :     my $s2 = substr $seq2, $i, 1;
1096 :     my $step = $distances->{"$s1$s2"};
1097 :     Trace("No step distance found for \"$s1$s2\".") if !defined $step && T(Distance => 1);
1098 :     $retVal += $step * $step;
1099 :     }
1100 :     # Divide by the length and take the square root.
1101 :     $retVal = sqrt($retVal / $n);
1102 :     # Return the result.
1103 :     return $retVal;
1104 :     }
1105 :    
1106 :     =head3 GetBlock
1107 :    
1108 :     C<< my $blockData = $simBlocks->GetBlock($blockID); >>
1109 :    
1110 :     Return a B<DBObject> for a specified group block.
1111 :    
1112 :     =over 4
1113 :    
1114 :     =item blockID
1115 :    
1116 :     ID of the desired block.
1117 :    
1118 :     =item RETURN
1119 :    
1120 :     Returns a B<DBObject> for the group block in question. The object allows access to
1121 :     all of the fields in the C<GroupBlock> relation of the database.
1122 :    
1123 :     =back
1124 :    
1125 :     =cut
1126 :     #: Return Type $%;
1127 :     sub GetBlock {
1128 :     # Get the parameters.
1129 :     my ($self, $blockID) = @_;
1130 :     # Get the group block.
1131 :     my $retVal = $self->GetEntity('GroupBlock', $blockID);
1132 :     # Return the result.
1133 :     return $retVal;
1134 :     }
1135 :    
1136 : parrello 1.1 1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3