[Bio] / Sprout / SimBlocks.pm Repository:
ViewVC logotype

Annotation of /Sprout/SimBlocks.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.5 - (view) (download) (as text)

1 : parrello 1.1 package SimBlocks;
2 :    
3 :     require Exporter;
4 :     use ERDB;
5 :     @ISA = qw(Exporter ERDB);
6 : parrello 1.2 @EXPORT = qw(DefaultDistances);
7 : parrello 1.3 @EXPORT_OK = qw(BlocksInSet GetBlock GetBlockPieces CompareGenomes DBLoad DistanceMatrix GetAlignment GetRegions ParsePattern RemoveBlocks SequenceDistance SetNumber SnipScan);
8 : parrello 1.1
9 :     use strict;
10 :     use Tracer;
11 :     use PageBuilder;
12 :     use Genome;
13 : parrello 1.3 use BasicLocation;
14 : parrello 1.4 use Overlap;
15 : parrello 1.1 use FIG_Config;
16 : parrello 1.4 use FIG;
17 : parrello 1.1
18 :     =head1 Similarity Block Database
19 :    
20 :     =head2 Introduction
21 :    
22 :     A I<similarity Block Database> describes the similarities amongst a group
23 :     of contigs. The load files for the database are produced by the
24 :     C<SimBlast.pl> script from genome FASTA files. The database is described
25 :     by the C<SimBlocksDBD.xml> file.
26 :    
27 :     The primary entities of the database are B<Genome>, which represents a specified
28 : parrello 1.2 organism, B<GroupBlock>, which represents a set of similar regions, and B<Contig>,
29 : parrello 1.1 which represents a contiguous segment of DNA. The key relationship is
30 : parrello 1.2 B<ContainsRegionIn>, which maps blocks to contigs. The mapping information
31 : parrello 1.1 includes the DNA nucleotides as well as the location in the contig.
32 :    
33 : parrello 1.2 =head2 Formal Definitions
34 :    
35 :     Let {B<G(1)>, B<G(2)>, ..., B<G(>I<n>B<)>} be a set of genomes
36 :     and let {B<g(1)>, B<g(2)>, ..., B<g(>I<m>B<)>} be the set of
37 :     genes called on these genomes.
38 :    
39 :     The primitive notion from which similarity blocks are derived is the
40 :     I<correspondence mapping>. Roughly, this mapping connects genes on
41 :     one genome to corresponding genes on other genomes. We will treat
42 :     the mapping as an equivalence class. In practice, the entire mapping
43 :     will not be available when we are building to load files for the
44 :     similarity block database. Instead, the user will provide a subset
45 :     of the mapping from which the entire partitioning can be deduced
46 :     via commutativity, reflexivity, and transitivity. This subset is
47 :     called the I<correspondence relation>.
48 :    
49 :     In actual practice, the correspondence relation will contain some
50 :     false positives. The process of developing the correspondence
51 :     relation usually involves analyzing similarities above a certain
52 :     threshhold. (This type of similarity is not necessarily transitive;
53 :     however, the true, underlying correspondence mapping we want I<is>
54 :     transitive.) Often, a sequence of correspondence pairs that maps
55 :     a gene on one particular genome to another gene on the same genome
56 :     is an indication of a false positive.
57 :    
58 :     In addition to false positives, we may have invalid genes, missing
59 :     genes, frame shifts, and incorrect start locations. Detecting and
60 :     cleaning these problems is an important part of getting a good
61 :     correspondence relation.
62 :    
63 :     Given a full-blown correspondence mapping, we define the following
64 :     concepts.
65 :    
66 :     =over 4
67 :    
68 :     =item corresponding gene
69 :    
70 :     The I<corresponding genes> of a given gene B<g(>I<i>B<)> are those
71 :     genes mapped to it by the correspondence mapping.
72 :    
73 :     =item gene block
74 :    
75 :     A I<gene block> is a maximal set of corresponding genes.
76 :    
77 :     =item intergenic region
78 :    
79 :     An I<intergenic region> is a section of the contig between
80 :     two adjacent genes. The adjacent genes are called the
81 :     I<left gene> and the I<right gene>, depending on whether they
82 :     precede or follow the region. For example, if C<G2_100+50>
83 :     and C<G2_175+100> are adjacent, C<G2_150+25> is their
84 :     intergenic region, the left gene is C<G2_100+50>, and
85 :     the right gene is C<G2_175+100>.
86 :    
87 :     =item intergenic block
88 :    
89 :     An I<intergenic block> is a maximal set of intergenic regions such
90 :     that (1) the left genes all correspond and have the same orientation,
91 :     (2) the right genes all correspond and have the same orientation, and
92 :     (3) the region lengths differ by no more than a specified threshhold
93 :     (usually 10%).
94 :    
95 :     =item similarity block
96 :    
97 :     A I<similarity block> is either a gene block or an intergenic block.
98 :     A similarity block is therefore a set of regions (either gene regions
99 :     or intergenic regions) that are tied together by the correspondence
100 :     mapping.
101 :    
102 :     =item common block
103 :    
104 :     A I<common block> with respect to a subset of the genomes
105 :     {B<G(>I<i1>B<)>, B<G(>I<i2>B<)>, ... B<G(>I<ik>B<)>} is a
106 :     similarity block that has exactly one region in each
107 :     of the genomes of the subset. There is a lesser notion of
108 :     a block that occurs in most genomes in the subset. The actual
109 :     program allows the notion of I<most> to be defined externally.
110 :     The default is all of the genomes in the subset.
111 :    
112 :     =item unique sequence
113 :    
114 :     A I<unique sequence> is a region that does not appear in any
115 :     similarity block with another region. In other words, a unique
116 :     sequence is a region that does not correspond to any other region.
117 :    
118 :     =back
119 :    
120 :     =head2 Usage
121 : parrello 1.1
122 :     The similarity block database must be able to answer two questions, taking
123 : parrello 1.2 as input two genome sets. (These would be subsets of the total set of genomes
124 :     represented in the correspondence mapping.)
125 : parrello 1.1
126 :     =over 4
127 :    
128 : parrello 1.2 =item (1) Which blocks are common in one set and do not occur in the other.
129 :    
130 :     These are considered major variations. They result from prophage insertions,
131 :     transposition events, and recombinations. A major variation results in a
132 :     different set of proteins in an operating cell.
133 : parrello 1.1
134 : parrello 1.2 =item (2) What individual variations in the shared common blocks distinguish the two sets?
135 :    
136 :     These are considered minor variations, and are the most common result of
137 :     single-nucleotide mutations. Rather than changing the set of proteins,
138 :     minor variations change the proteins themselves.
139 : parrello 1.1
140 :     =back
141 :    
142 :     This class is a subclass of B<ERDB>, and inherits all of its public methods.
143 :    
144 : parrello 1.2 =head2 Setup
145 :    
146 :     To begin working with similarity blocks, you must first create a database to
147 :     contain the block data. Next, you must specify the various similarity block
148 :     parameters in C<FIG_Config>. These are as follows. (Of course, the values
149 :     you specify will be different.)
150 :    
151 :     $simBlocksDB = "conser5_blocks"; # SimBlocks database schema name
152 :     $simBlocksData = "$data/BlockData"; # directory containing SimBlocks data files
153 :     # (used to load the Similarity database)
154 :     $simBlastData = "$data/BlastData"; # directory containing SimBlast data files
155 : parrello 1.3 # (input to SimLoad.pl)
156 : parrello 1.2
157 : parrello 1.3 Run the C<SimLoad> script to load the database.
158 : parrello 1.2
159 :     The C<SimBlockForm.cgi> service enables you to run a similarity analysis between genome
160 :     sets in the database. The C<Html/ERDBTest.html> page enables you to run general queries
161 :     against the database and execute simple scripts (however, this latter capability is
162 :     disabled unless you have
163 :    
164 :     $debug_mode = 1; # TRUE to enable the debugging scripts
165 :    
166 :     The Similarity Block methods and scripts work in the Windows version of the SEED (see
167 :     C<WinBuild>).
168 :    
169 : parrello 1.1 =cut
170 :    
171 :     #: Constructor SimBlocks->new();
172 :    
173 :     =head2 Public Methods
174 :    
175 :     =head3 new
176 :    
177 :     C<< my $simdb = SimBlocks->new($dbname, $dbType, $port); >>
178 :    
179 :     Construct a new SimBlocks object connected to the specified database. If no
180 :     database is specified, the default database indicated by C<FIG_Config::simBlocksDB>
181 :     will be used. Similarly, if the database type is omitted, the type will be
182 :     taken from C<FIG_Config::dbms>, and if the port is omitted, C<FIG_Config::dbport>
183 :     will be used. The user name and password for connecting are taken from
184 :     C<FIG_Config::dbuser> and C<FIG_Config::dbpass>.
185 :    
186 : parrello 1.2 In almost every case, you will be calling
187 :    
188 :     C<< my $simdb = SimBlocks->new(); >>
189 :    
190 : parrello 1.1 =over 4
191 :    
192 :     =item dbname (optional)
193 :    
194 :     Database schema name. This is how we choose the desired database on the server.
195 :     If omitted, the value of C<$FIG_Config::simBlocksDB> is used.
196 :    
197 :     =item dbType
198 :    
199 :     Database type (e.g. C<mysql> for MySQL or C<pg> for Postgres). If omitted, the
200 :     value of C<$FIG_Config::dbms> will be used.
201 :    
202 :     =item port
203 :    
204 :     Port number for connecting to the database server. If omitted, the value of
205 :     C<$FIG_Config::dbport> will be used.
206 :    
207 :     =back
208 :    
209 :     =cut
210 :    
211 :     sub new {
212 :     # Get the parameters.
213 :     my ($class, $dbname, $dbType, $port) = @_;
214 :     # Plug in the default values.
215 :     if (! $dbname) {
216 :     $dbname = $FIG_Config::simBlocksDB;
217 :     }
218 :     if (! $dbType) {
219 :     $dbType = $FIG_Config::dbms;
220 :     }
221 :     if (! $port) {
222 :     $port = $FIG_Config::dbport;
223 :     }
224 :     # Connect to the database.
225 :     my $dbh = DBKernel->new($dbType, $dbname, $FIG_Config::dbuser, $FIG_Config::dbpass, $port);
226 :     # Get the data directory name.
227 :     my $directory = $FIG_Config::simBlocksData;
228 :     # Create and bless the ERDB object.
229 :     my $retVal = ERDB::new($class, $dbh, "$directory/SimBlocksDBD.xml");
230 :     # Return it.
231 :     return $retVal;
232 :     }
233 :    
234 : parrello 1.2 =head3 DefaultDistances
235 :    
236 :     C<< my $distances = DefaultDistances(); >>
237 :    
238 :     Return the default distance matrix for computing the alignment distances (see
239 :     also L</DistanceMatrix>. This matrix returns a distance of 0 between an insertion
240 :     character (C<->) and any other nucleotide, a distance of 0.5 between nucleotides
241 :     of the same type, and a distance of 1 between nucleotides of different types.
242 :    
243 :     =cut
244 :    
245 :     my $DefaultDistanceMatrix = { AA => 0, AC => 1, AG => 0.5, AT => 1,
246 :     AN => 0.625, AR => 0.25, AY => 1, 'A-' => 0,
247 :     CA => 1, CC => 0, CG => 1, CT => 0.5,
248 :     CN => 0.625, CR => 1, CY => 0.25, 'C-' => 0,
249 :     GA => 0.5, GC => 1, GG => 0, GT => 1,
250 :     GN => 0, GR => 0.25, GY => 1, 'G-' => 0,
251 :     TA => 1, TC => 0.5, TG => 1, TT => 0,
252 :     TN => 0, TR => 1, TY => 0.25, 'T-' => 0,
253 :     RA => 0.25, RC => 1, RG => 0.25, RT => 1,
254 :     RN => 0.625, RR => 0.25, RY => 1, 'R-' => 0,
255 :     YA => 1, YC => 0.25, YG => 1, YT => 0.25,
256 :     YN => 0.625, YR => 1, YY => 0.25, 'Y-' => 0,
257 :     NA => 0.625, NC => 0.625, NG => 0.625, NT => 0.625,
258 :     NN => 0.625, NR => 0.625, NY => 0.625,'N-' => 0,
259 :     '-A' => 0, '-C' => 0, '-G' => 0, '-T' => 0,
260 :     '-N' => 0, '-R' => 0, '-Y' => 0, '--' => 0 };
261 :    
262 :     sub DefaultDistances {
263 :     return $DefaultDistanceMatrix;
264 :     }
265 :    
266 : parrello 1.1 =head3 DBLoad
267 :    
268 :     C<< my $stats = $simBlocks->DBLoad($rebuild); >>
269 :    
270 :     Load the database from the default directory. This is essentially identical to
271 :     a B<LoadTables> call with the default directory used instead of a caller-specified
272 :     directory.
273 :    
274 :     =over 4
275 :    
276 :     =item rebuild
277 :    
278 :     TRUE if the tables should be re-created; FALSE if the tables should be cleared and
279 : parrello 1.2 reloaded. The default is FALSE; however, TRUE is considerably faster. You would use
280 :     FALSE if you had added non-standard indexes to the database and didn't want to lose
281 :     them.
282 : parrello 1.1
283 :     =item RETURN
284 :    
285 :     Returns a statistics object describing the duration of the load, the number of
286 :     records loaded, and a list of error messages.
287 :    
288 :     =back
289 :    
290 :     =cut
291 :     #: Return Type $%@;
292 :     sub DBLoad {
293 :     # Get the parameters.
294 :     my ($self, $rebuild) = @_;
295 :     # Load the tables.
296 :     my $retVal = $self->LoadTables($FIG_Config::simBlocksData, $rebuild);
297 :     # Return the result.
298 :     return $retVal;
299 :     }
300 :    
301 :     =head3 CompareGenomes
302 :    
303 : parrello 1.2 C<< my (\%set0Blocks, \%set1Blocks, \%bothBlocks) = $simBlocks->CompareGenomes(\@set0, \@set1); >>
304 : parrello 1.1
305 :     Analyze two sets of genomes for commonalities. The group blocks returned will be divided
306 : parrello 1.2 into three hashes: one for those common to set 0 and not occurring at all in set 1, one
307 :     for those common to set 1 and not occurring at all in set 0, and one for those common
308 :     to both sets. Each hash is keyed by group ID and will contain B<DBObject>s for
309 :     B<GroupBlock> records with B<HasInstanceOf> data attached, though the genome ID in
310 :     the B<HasInstanceOf> section is not generally predictable.
311 : parrello 1.1
312 :     =over 4
313 :    
314 : parrello 1.2 =item set0
315 : parrello 1.1
316 : parrello 1.2 Reference to a list of genome IDs for the genomes in the first set (set 0).
317 : parrello 1.1
318 : parrello 1.2 =item set1
319 : parrello 1.1
320 : parrello 1.2 Reference to a list of genome IDs for the genomes in the second set (set 1).
321 : parrello 1.1
322 :     =item RETURN
323 :    
324 : parrello 1.2 Returns a triple of hashes. Each hash is keyed by group ID, and will contain
325 :     B<DBObject>s for records in the B<GroupBlock> table. Groups found in all of the
326 :     genomes in set 0 but in none of the genomes of set 1 will be in the first hash,
327 :     groups found in all of the genomes in set 1 but in none of the genomes of set 0
328 :     will be in the second hash, and groups found in all of the genomes of both sets
329 :     will be in the third hash.
330 : parrello 1.1
331 :     =back
332 :    
333 :     =cut
334 :     #: Return Type @%;
335 :     sub CompareGenomes {
336 :     # Get the parameters.
337 : parrello 1.2 my ($self, $set0, $set1) = @_;
338 : parrello 1.1 # Declare the three hashes.
339 : parrello 1.2 my (%set0Blocks, %set1Blocks, %bothBlocks);
340 :     # Our first task is to find the common groups in each genome set.
341 :     my %groups0 = $self->BlocksInSet($set0);
342 : parrello 1.1 my %groups1 = $self->BlocksInSet($set1);
343 :     # Create a trailer key to help tighten the loops.
344 :     my $trailer = "\xFF";
345 :     # The groups are hashed by group ID. We will move through them in key order
346 : parrello 1.2 # to get the "set0", "set1", and "both" groups.
347 :     my @blockIDs0 = (sort keys %groups0, $trailer);
348 : parrello 1.1 my @blockIDs1 = (sort keys %groups1, $trailer);
349 :     # Prime the loop by getting the first ID from each list. Thanks to the trailers
350 :     # we know this process can't fail.
351 : parrello 1.2 my $blockID0 = shift @blockIDs0;
352 : parrello 1.1 my $blockID1 = shift @blockIDs1;
353 : parrello 1.2 while ($blockID0 lt $trailer || $blockID1 lt $trailer) {
354 : parrello 1.1 # Compare the block IDs.
355 : parrello 1.2 if ($blockID0 lt $blockID1) {
356 :     # Block 0 is only in the first set.
357 :     Trace("Block IDs $blockID0 vs. $blockID1. Set 0 selected.") if T(SimCompare => 3);
358 :     $set0Blocks{$blockID0} = $groups0{$blockID0};
359 :     $blockID0 = shift @blockIDs0;
360 :     } elsif ($blockID0 gt $blockID1) {
361 :     # Block 1 is only in the second set.
362 :     Trace("Block IDs $blockID0 vs. $blockID1. Set 1 selected.") if T(SimCompare => 3);
363 : parrello 1.1 $set1Blocks{$blockID1} = $groups1{$blockID1};
364 :     $blockID1 = shift @blockIDs1;
365 :     } else {
366 :     # We have a block in both sets.
367 : parrello 1.2 Trace("Block IDs $blockID0 vs. $blockID1. Both selected.") if T(SimCompare => 3);
368 : parrello 1.1 $bothBlocks{$blockID1} = $groups1{$blockID1};
369 : parrello 1.2 $blockID0 = shift @blockIDs0;
370 : parrello 1.1 $blockID1 = shift @blockIDs1;
371 :     }
372 :     }
373 : parrello 1.2 # The set1 and set2 lists are too big. They contain groups that are common in their
374 :     # respective sets but not common in both sets. We want groups that are common in
375 :     # their respective sets but completely absent in the other set. We do this by getting
376 :     # a complete list of the blocks in each set and deleting anything we find.
377 :     $self->RemoveBlocks(\%set0Blocks, $set1);
378 :     $self->RemoveBlocks(\%set1Blocks, $set0);
379 : parrello 1.1 # Return the result.
380 : parrello 1.2 return (\%set0Blocks, \%set1Blocks, \%bothBlocks);
381 :     }
382 :    
383 :     =head3 RemoveBlocks
384 :    
385 :     C<< $simBlocks->RemoveBlocks(\%blockMap, \@set); >>
386 :    
387 :     Remove from the specified block map any blocks that occur in the specified set of genomes.
388 :     The block map can contain any data, but it must be keyed by block ID.
389 :    
390 :     =over 4
391 :    
392 :     =item blockMap
393 :    
394 :     Reference to a hash of block IDs to block data. The character of the block data is
395 :     irrelevant to this method.
396 :    
397 :     =item set
398 :    
399 :     Reference to a list of genome IDs. Any block that occurs in any one of the specified
400 :     genomes will be removed from the block map.
401 :    
402 :     =back
403 :    
404 :     =cut
405 :     #: Return Type ;
406 :     sub RemoveBlocks {
407 :     # Get the parameters.
408 :     my ($self, $blockMap, $set) = @_;
409 :     # Get a list of the blocks in the genome set.
410 :     my %setBlocks = $self->BlocksInSet($set, 1);
411 :     # Loop through the incoming block map, deleting any blocks found
412 :     # in the new block set. Note we make the assumption that the
413 :     # incoming block map is smaller.
414 :     my @blockList = keys %{$blockMap};
415 :     for my $blockID (@blockList) {
416 :     if (exists $setBlocks{$blockID}) {
417 :     delete $blockMap->{$blockID};
418 :     }
419 :     }
420 : parrello 1.1 }
421 :    
422 :     =head3 BlocksInSet
423 :    
424 : parrello 1.2 C<< my %blockList = $simBlocks->BlocksInSet($set, $count); >>
425 : parrello 1.1
426 : parrello 1.2 Return a list of the group blocks found in a given number of the genomes in a given
427 :     set. The list returned will be a hash of B<DBObject>s, each corresponding to a single
428 :     B<GroupBlock> record, with a B<HasInstanceOf> record attached, though the content of
429 :     the B<HasInstanceOf> record is not predictable. The hash will be keyed by block ID.
430 : parrello 1.1
431 :     =over 4
432 :    
433 :     =item set
434 :    
435 : parrello 1.2 Reference to a list of genome IDs.
436 :    
437 :     =item count (optional)
438 :    
439 :     Minimum number of genomes from the set in which the block must appear. If C<1> is
440 :     specified, the list will include any block that occurs in at least one genome.
441 :     If C<5> is specified, the list will only include blocks that occur in at least
442 :     5 genomes from the set. If the set consists of only 4 genomes in this latter
443 :     case, no blocks will be returned. The default is to return blocks that occur in
444 :     all genomes of the set.
445 : parrello 1.1
446 :     =item RETURN
447 :    
448 :     Returns a hash of B<DBObject>s corresponding to the group blocks found in the
449 :     genomes of the set.
450 :    
451 :     =back
452 :    
453 :     =cut
454 :     #: Return Type %%;
455 :     sub BlocksInSet {
456 :     # Get the parameters.
457 : parrello 1.2 my ($self, $set, $count) = @_;
458 :     # If the count was not specified, set it t
459 :     if (!defined $count) {
460 :     $count = @{$set};
461 :     }
462 :     Trace("BlocksInSet called for $count genomes of set (" . join(", ", @{$set}) . ").") if T(2);
463 : parrello 1.1 # Create a hash to hold the groups found. The hash will be keyed by group ID
464 :     # and contain the relevant DB object.
465 :     my %retVal = ();
466 : parrello 1.2 # Create a hash to contain the number of genomes in which each block was found.
467 :     my %counters = ();
468 : parrello 1.1 # Loop through the genome IDs in the specified set.
469 :     for my $genomeID (@{$set}) {
470 :     # Get a list of group blocks for this genome.
471 :     my @blocks = $self->GetList(['HasInstanceOf', 'GroupBlock'],
472 :     "HasInstanceOf(from-link) = ?", $genomeID);
473 :     # Loop through the blocks, storing any new ones in the hash.
474 :     for my $block (@blocks) {
475 :     # Get the ID of this block.
476 :     my ($blockID) = $block->Value('GroupBlock(id)');
477 : parrello 1.2 # Determine whether this block is new or has been found in a previous
478 :     # genome of the set.
479 :     if (exists $counters{$blockID}) {
480 :     # Here it's old. Increment its counter.
481 :     $counters{$blockID}++;
482 :     } else {
483 :     # Here it's new. Create a counter for it.
484 :     $counters{$blockID} = 1;
485 :     }
486 :     # If it meets the threshhold, put it in the return hash. If it's already
487 :     # there, we'll simply overwrite the old copy, which makes no difference
488 :     # to the caller.
489 :     Trace("Block $blockID has $counters{$blockID} occurrences as of genome $genomeID.") if T(SimSet => 4);
490 :     if ($counters{$blockID} >= $count) {
491 :     $retVal{$blockID} = $block;
492 :     Trace("Block $blockID added to set.") if T(SimSet => 3);
493 :     }
494 : parrello 1.1 }
495 :     }
496 :     # Return the result.
497 :     return %retVal;
498 :     }
499 :    
500 :     =head3 GetRegions
501 :    
502 :     C<< my %regions = $simBlocks->GetRegions($blockID, \@genomes); >>
503 :    
504 :     Return the regions of the specified block that occur in the contigs of
505 :     the specified genomes. The return value is a hash of DNA strings keyed
506 :     on the region's location descriptor.
507 :    
508 :     =over 4
509 :    
510 :     =item blockID
511 :    
512 :     ID of the group block whose regions are to be returned.
513 :    
514 :     =item genomes
515 :    
516 :     Reference to a list of genome IDs. Only regions in one of the listed
517 :     genomes will be returned.
518 :    
519 :     =item RETURN
520 :    
521 :     Returns a hash keyed by region location. The objects in the hash are
522 :     the DNA strings for the specified region.
523 :    
524 :     =back
525 :    
526 :     =cut
527 :     #: Return Type %$;
528 :     sub GetRegions {
529 :     # Get the parameters.
530 :     my ($self, $blockID, $genomes) = @_;
531 :     # Create the return hash.
532 :     my %retVal = ();
533 : parrello 1.2 # Count the regions found.
534 :     my $regionCount = 0;
535 : parrello 1.1 # Query all the regions for the specified block.
536 : parrello 1.2 my $query = $self->Get(['IncludesRegion', 'Region'], "IncludesRegion(from-link) = ?",
537 :     $blockID);
538 : parrello 1.1 # Loop through the query.
539 :     while (my $region = $query->Fetch) {
540 :     # Get this region's data.
541 : parrello 1.2 my ($location, $dna) =
542 :     $region->Values(['Region(id)', 'Region(content)']);
543 : parrello 1.1 # Insure it belongs to one of our genomes.
544 : parrello 1.2 my $found = SetNumber($location, $genomes);
545 : parrello 1.1 # If it does, put it in the hash.
546 : parrello 1.2 if ($found >= 0) {
547 : parrello 1.1 $retVal{$location} = $dna;
548 : parrello 1.2 $regionCount++;
549 : parrello 1.1 }
550 :     }
551 : parrello 1.2 Trace("$regionCount regions found by GetRegions.") if T(2);
552 : parrello 1.1 # Return the result.
553 :     return %retVal;
554 :     }
555 :    
556 : parrello 1.2 =head3 SetNumber
557 :    
558 :     C<< my $setNumber = SimBlocks::SetNumber($contigRegion, \@set0, \@set1, ..., \@setN); >>
559 :    
560 :     Examine a region string, contig ID, or genome ID, and return the number of the genome
561 :     set to which it belongs.
562 :    
563 :     =over 4
564 :    
565 :     =item contigRegion
566 :    
567 :     A region string, contig ID, or genome ID. Because the contig ID is prefixed by the genome ID
568 :     delimited by a colon symbol (C<:>), we split at the first colon to get the desired ID.
569 :    
570 :     =item set0, set1, ..., setN
571 :    
572 :     A list of references to genome ID lists. Essentially, we will return the index of the
573 :     first incoming list that has the specified genome ID in it.
574 :    
575 :     =item RETURN
576 :    
577 :     The index in the list of sets of the set containing the relevant genome, or -1 if
578 :     the genome was not found.
579 :    
580 :     =back
581 :    
582 :     =cut
583 :     #: Return Type $
584 :     sub SetNumber {
585 :     # Get the parameters.
586 :     my ($contigRegion, @sets) = @_;
587 :     # Extract the genome ID.
588 :     my ($genomeID) = split /:/, $contigRegion;
589 :     # Clear the return variable. A negative number means nothing found.
590 :     my $retVal = -1;
591 :     # Loop until we find the genome.
592 :     for (my $i = 0; $retVal < 0 && $i <= $#sets; $i++) {
593 :     my $set = $sets[$i];
594 :     if (grep /^$genomeID$/, @{$set}) {
595 :     $retVal = $i;
596 :     }
597 :     }
598 :     # Return the result.
599 :     return $retVal;
600 :     }
601 :    
602 : parrello 1.1 =head3 TagDNA
603 :    
604 :     C<< my $taggedDNA = SimBlocks::TagDNA($pattern, $dnaString, $prefix, $suffix); >>
605 :    
606 : parrello 1.3 Convert a DNA string from the B<Region> relation to the actual DNA.
607 :     The incoming DNA string will contain only the values corresponding to the
608 :     question marks in the pattern. The pattern should be taken from the DNA
609 :     string's parent B<GroupBlock>. The resulting DNA sequence is built by
610 :     copying the DNA string data into the pattern positions that have question
611 :     marks. If the string is intended for display in an HTML page, the
612 :     I<$prefix> and I<$suffix> parameters can be used to surround the
613 :     question mark positions with HTML markers that specify a different style,
614 :     weight, or font.
615 : parrello 1.1
616 :     =over 4
617 :    
618 :     =item pattern
619 :    
620 :     DNA pattern from which the positions of variance are to be taken.
621 :    
622 :     =item dnaString
623 :    
624 :     String containing DNA string to be marked.
625 :    
626 : parrello 1.3 =item prefix (optional)
627 : parrello 1.1
628 :     String to be inserted before each position of variance.
629 :    
630 : parrello 1.3 =item suffix (optional)
631 : parrello 1.1
632 :     String to be inserted after each position of variance.
633 :    
634 :     =item RETURN
635 :    
636 :     Returns a copy of the pattern string with the positions of variance filled
637 :     in from the DNA string. If a prefix and suffix are specified, they are
638 :     placed before and after each character copied into the pattern string.
639 :    
640 :     =back
641 :    
642 :     =cut
643 :     #: Return Type $;
644 :     sub TagDNA {
645 :     # Get the parameters.
646 :     my ($pattern, $dnaString, $prefix, $suffix) = @_;
647 :     # Insure the prefix and suffix behave properly.
648 :     if (!defined $prefix) {
649 :     $prefix = "";
650 :     }
651 :     if (!defined $suffix) {
652 :     $suffix = "";
653 :     }
654 :     # Initialize the return variable.
655 :     my $retVal = "";
656 :     # Convert the DNA string to a list with the prefixes and suffixes
657 :     # inserted.
658 :     my @dnaList = map { $prefix . $_ . $suffix } split //, $dnaString;
659 :     # Loop through the DNA string.
660 :     my ($start, $position) = (0, 0);
661 :     while (($position = index $pattern, "?", $start) >= 0) {
662 :     # If we have invariant text, copy it over.
663 :     if ($position > $start) {
664 :     $retVal .= substr $pattern, $start, $position - $start;
665 :     }
666 :     # Pop off the next position from the DNA string.
667 :     $retVal .= shift @dnaList;
668 :     # Start past the question mark.
669 :     $start = $position + 1;
670 :     }
671 : parrello 1.2 # Tack on the residual.
672 :     $retVal .= substr $pattern, $start;
673 : parrello 1.1 # Return the result.
674 :     return $retVal;
675 :     }
676 : parrello 1.2
677 :     =head3 SnipScan
678 :    
679 :     C<< my %positions = $simBlocks->SnipScan($blockObject, \@set0, \@set1); >>
680 :    
681 :     Examine the specified block and return a list of the positions at which the
682 :     nucleotide values for regions in the first genome set differ from the values
683 :     for regions in the second genome set. This conditions is only satisfied if
684 :     the nucleotides occurring in the first set's regions never occur in the
685 :     second set's regions. This is a very strict condition. If, for example, C<A>
686 :     occurs in one of the regions for the first set, it cannot occur in any of the
687 :     regions for the second set.
688 :    
689 :     =over 4
690 :    
691 :     =item blockObject
692 :    
693 :     A C<DBObject> representing the B<GroupBlock> record for the desired block or
694 :     the actual ID of the block whose regions are to be examined. It is expected that the
695 :     block will have regions in all of the genomes for both sets, but this is not
696 :     required by the algorithm.
697 :    
698 :     =item set0
699 :    
700 :     Reference to a list of the IDs for the genomes in the first set (0).
701 :    
702 :     =item set1
703 :    
704 :     Reference to a list of the IDs for the genomes in the second set (1).
705 :    
706 :     =item RETURN
707 :    
708 :     A hash that maps positions to contents. Each position will be relative to the
709 :     start of the block. The mapped value will list the nucleotides found in the
710 :     first set and the nucleotides found in the second set. So, for example,
711 :     C<['AC', 'G']> means that both C<A> and C<C> are found in the genomes of the
712 :     first set (0), but only C<G> is found in genomes of the second set (1).
713 :    
714 :     =back
715 :    
716 :     =cut
717 :     #: Return Type %;
718 :     sub SnipScan {
719 :     # Get the parameters.
720 :     my ($self, $blockObject, $set0, $set1) = @_;
721 :     # Convert an incoming block ID to a block object.
722 :     if (ref $blockObject ne "DBObject") {
723 :     $blockObject = $self->GetEntity('GroupBlock', $blockObject);
724 :     }
725 :     # Get the ID and length of this block.
726 :     my ($blockID) = $blockObject->Value('GroupBlock(id)');
727 :     my ($len) = $blockObject->Value('GroupBlock(snip-count)');
728 :     # Create a hash that can be used to identify the genomes in each set.
729 :     my %setMap = ();
730 :     for my $genomeID (@{$set0}) {
731 :     $setMap{$genomeID} = 0;
732 :     }
733 :     for my $genomeID (@{$set1}) {
734 :     $setMap{$genomeID} = 1;
735 :     }
736 :     # Create the snip hash. We'll create empty lists for each snip position.
737 :     # As soon as a position becomes ineligible we delete it from the hash.
738 :     my %snipHash = ();
739 :     for (my $i = 0; $i < $len; $i++) {
740 :     $snipHash{$i} = ['', ''];
741 :     }
742 :     # Ask for the regions in the block.
743 :     my $query = $self->Get(['IncludesRegion', 'Region'], "IncludesRegion(from-link) = ?",
744 :     $blockID);
745 :     # Loop through the regions.
746 :     while (my $region = $query->Fetch) {
747 :     # Determine this region's genome set. We only continue if the region is in
748 :     # one of the sets.
749 : parrello 1.4 my ($location) = $region->Value('Region(id)');
750 : parrello 1.2 my $genomeID = Genome::FromContig($location);
751 :     if (exists $setMap{$genomeID}) {
752 :     my $setIndex = $setMap{$genomeID};
753 :     # Get the region's snip values.
754 :     my ($regionValue) = $region->Value('Region(content)');
755 :     # Loop through the positions that are still active.
756 :     my @positions = keys %snipHash;
757 :     for my $position (@positions) {
758 :     # Get the nucleotide at the current position in this region.
759 :     my $letter = substr $regionValue, $position, 1;
760 :     # Find it in the appropriate list.
761 :     my $letterList = $snipHash{$position};
762 :     if ((index $letterList->[1-$setIndex], $letter) >= 0) {
763 :     # It's in the other set, so this position is no
764 :     # longer valid.
765 :     delete $snipHash{$position};
766 :     } elsif ((index $letterList->[$setIndex], $letter) < 0) {
767 :     # It's in neither set, so we add it to the list we're
768 :     # accumulating for this set at this position.
769 :     $letterList->[$setIndex] .= $letter;
770 :     }
771 :     }
772 :     }
773 :     }
774 :     # Now %snipHash contains all of the information we need, but we must relocate the
775 :     # positions so they are relative to the block address instead of the snip
776 :     # positions. We need the block pattern to do this.
777 :     my ($blockPattern) = $blockObject->Value('GroupBlock(pattern)');
778 :     my @positions = ParsePattern($blockPattern);
779 :     # Create the return hash.
780 :     my %retVal = ();
781 :     # Relocate the positions.
782 :     for my $inPosition (keys %snipHash) {
783 :     $retVal{$positions[$inPosition]} = $snipHash{$inPosition};
784 :     }
785 :     # Return the result.
786 :     return %retVal;
787 :     }
788 :    
789 :     =head3 ParsePattern
790 :    
791 :     C<< my @positions = SimBlocks::ParsePattern($pattern); >>
792 :    
793 :     Get a list of the positions of variance inside a block pattern. The
794 :     positions of variance are marked by question marks, so all we need to
795 :     do is return the location of every question mark in the incoming
796 :     patter string.
797 :    
798 :     =over 4
799 :    
800 :     =item pattern
801 :    
802 :     DNA pattern for a group block.
803 :    
804 :     =item RETURN
805 :    
806 :     Returns a list of position numbers. These are coded as offsets, with 0 indicating
807 :     the first character of the pattern.
808 :    
809 :     =back
810 :    
811 :     =cut
812 :     #: Return Type @;
813 :     sub ParsePattern {
814 :     # Get the parameters.
815 :     my ($pattern) = @_;
816 :     # Create the return list.
817 :     my @retVal = ();
818 :     # Loop through the pattern, looking for question marks.
819 :     my $position = 0;
820 :     while (($position = index($pattern, "?", $position)) >= 0) {
821 :     push @retVal, $position;
822 :     $position++;
823 :     }
824 :     # Return the result.
825 :     return @retVal;
826 :     }
827 :    
828 :     =head3 MergeDNA
829 :    
830 :     C<< my ($groupSequence, $variance) = SimBlocks::MergeDNA($groupSequence, $newSequence); >>
831 :    
832 :     Merge the DNA for a region into the group representation of similar DNA, returning the
833 :     result and the positions of variance. Positions of variance in the group representation
834 :     are indicated by question marks. This method essentially compares the new DNA to the
835 :     group DNA and adds question marks wherever the new DNA does not match.
836 :    
837 :     =over 4
838 :    
839 :     =item groupSequence
840 :    
841 :     Group representation of the DNA into which the new DNA is to be merged.
842 :    
843 :     =item newSequence
844 :    
845 :     New DNA to be merged into the group representation.
846 :    
847 :     =item RETURN
848 :    
849 :     Returns the merged group representation followed by an updated count of the number
850 :     of question marks in the group.
851 :    
852 :     =back
853 :    
854 :     =cut
855 :     sub MergeDNA {
856 :     # Get the parameters.
857 :     my ($groupSequence, $newSequence) = @_;
858 :     # Declare the return variables.
859 :     my ($retVal, $varianceCount) = ("", 0);
860 :     # If there is no group representation, or if the group representation is the
861 :     # same as the new DNA, we return the new DNA; otherwise, we have to merge.
862 :     # Note that the new DNA cannot contain any question marks, so if either of
863 :     # the degenerate cases is true, we return a variance of 0.
864 :     if ((! $groupSequence) || ($groupSequence eq $newSequence)) {
865 :     $retVal = $newSequence;
866 :     } else {
867 :     # Here the group representation and the new DNA are different, so we
868 :     # have to find all the points of difference and replace them with
869 :     # question marks. We do this using a trick: by XOR-ing the two
870 :     # strings together, we create a new string with nulls everywhere
871 :     # except where there's a mismatch. We then find the non-null characters
872 :     # and replace the corresponding positions in the return string with
873 :     # question marks. First, we get a copy of the group representation to
874 :     # play with.
875 :     $retVal = $groupSequence;
876 :     # Next we compute the XOR string.
877 :     my $hexor = $retVal ^ $newSequence;
878 :     # This loop finds all non-null string positions.
879 :     while ($hexor =~ m/[^\x00]/g) {
880 :     # Here we get the position past the mismatch point.
881 :     my $pos = pos $hexor;
882 :     # Replace the real mismatch point with a question mark.
883 :     substr $retVal, $pos - 1, 1, "?";
884 :     $varianceCount++;
885 :     }
886 :     }
887 :     # Return the result.
888 :     return ($retVal, $varianceCount);
889 :     }
890 :    
891 :     =head3 GetAlignment
892 :    
893 :     C<< my %sequences = $$simBlocks->GetAlignment(\@blockIDs, \@genomeIDs, $indels); >>
894 :    
895 :     Return an alignment of the specified genomes relative to the specified block
896 :     IDs. Only blocks in which all of the genomes occur will produce output for
897 :     the alignment.
898 :    
899 :     =over 4
900 :    
901 :     =item blockIDs
902 :    
903 :     Reference to a list of the IDs of the blocks from which the alignment
904 :     should be constructed.
905 :    
906 :     =item genomeIDs
907 :    
908 :     Reference to a list of the genome IDs participating in the alignment.
909 :    
910 :     =item indels (optional)
911 :    
912 :     C<1> if columns containing insertion characters (C<->) should be included
913 :     in the alignment, else C<0>. The default is C<0>.
914 :    
915 :     =item RETURN
916 :    
917 :     Returns a hash that maps each genome ID to its sequence in the alignment.
918 :    
919 :     =back
920 :    
921 :     =cut
922 :     #: Return Type %;
923 :     sub GetAlignment {
924 :     # Get the parameters.
925 :     my ($self, $blockIDs, $genomeIDs, $indels) = @_;
926 :     # Create the return hash. We will start with a null string for each
927 :     # genome, and add the alignment data one block at a time.
928 :     my %retVal = ();
929 :     for my $genomeID (@{$genomeIDs}) {
930 :     $retVal{$genomeID} = "";
931 :     }
932 :     # Get the number of genomes. This will help us in determining which
933 :     # blocks to keep.
934 :     my $totalGenomes = @{$genomeIDs};
935 :     # Loop through the blocks.
936 :     for my $blockID (@{$blockIDs}) {
937 :     # Get the block's data.
938 :     my $blockData = $self->GetEntity('GroupBlock', $blockID);
939 :     my ($blockName) = $blockData->Value('GroupBlock(description)');
940 :     Trace("Processing common block $blockID ($blockName).") if T(Alignment => 4);
941 :     # Only proceed if the block has real variance in it.
942 :     my ($snipCount) = $blockData->Value('GroupBlock(snip-count)');
943 :     if ($snipCount > 0) {
944 :     # Get this block's regions in the specified genomes.
945 :     my %regionHash = $self->GetRegions($blockID, $genomeIDs);
946 :     # Verify that each genome is represented.
947 :     my %blockHash = ();
948 :     my $genomeCount = 0;
949 :     for my $region (keys %regionHash) {
950 :     # Get the genome ID from the region string.
951 :     $region =~ m/^([^:]+):/;
952 :     # If it's new, count it.
953 :     if (! exists $blockHash{$1}) {
954 :     $blockHash{$1} = $regionHash{$region};
955 :     $genomeCount++;
956 :     }
957 :     }
958 :     # At this point, if the number of genomes found matches the
959 :     # size of the list, this block is good.
960 :     if ($genomeCount == $totalGenomes) {
961 :     if (! $indels) {
962 :     # If indels are off, we need to remove some of the columns. We do
963 :     # this by finding C<-> characters in each genome's DNA sequence,
964 :     # and deleting all instances of the relevant columns.
965 :     for my $genomeID (@{$genomeIDs}) {
966 :     Trace("Hyphen check for genome $genomeID.") if T(Alignment => 3);
967 :     while ((my $pos = index($blockHash{$genomeID},"-")) >= 0) {
968 :     # Here we found an insertion character. We delete its column
969 :     # from all the genomes in the alignment.
970 :     Trace("Hyphen found at position $pos in genome $genomeID.") if T(Alignment => 4);
971 :     for my $genomeID2 (@{$genomeIDs}) {
972 :     substr $blockHash{$genomeID2}, $pos, 1, "";
973 :     }
974 :     }
975 :     }
976 :     }
977 :     Trace("Common block $blockID ($blockName) added to alignment.") if T(Alignment => 3);
978 :     for my $genomeID (@{$genomeIDs}) {
979 :     $retVal{$genomeID} .= $blockHash{$genomeID};
980 :     }
981 :     }
982 :     }
983 :     }
984 :     # Return the result.
985 :     return %retVal;
986 :     }
987 :    
988 :     =head3 DistanceMatrix
989 :    
990 :     C<< my %distances = SimBlocks::DistanceMatrix(\%sequences, \%distances); >>
991 :    
992 :     Compute the distances between the sequences in an alignment. the L</SequenceDistance>
993 :     method is used to compute the individual distances.
994 :    
995 :     =over 4
996 :    
997 :     =item sequences
998 :    
999 :     Reference to a hash of genome IDs to alignment sequences. The alignment sequences may
1000 :     only contain the characters C<AGCT->, and must all be the same length.
1001 :    
1002 :     =item distances
1003 :    
1004 :     Reference to a hash that maps pairs of letters to distance values. Each letter pair
1005 :     must be represented as a two-character string. Dual strings such as "AG" and "GA" must
1006 :     map to the same distance. If no distance matrix is specified, the default distance
1007 :     matrix is used.
1008 :    
1009 :     =item RETURN
1010 :    
1011 :     Returns a hash that maps genome ID pairs to distance values. The ID pairs are
1012 :     formed using a slash. For example, the distance from C<158878.1> to C<282459.1>
1013 :     would be found attached to the key C<158878.1/282459.1>.
1014 :    
1015 :     =back
1016 :    
1017 :     =cut
1018 :     #: Return Type %
1019 :     sub DistanceMatrix {
1020 :     # Get the parameters.
1021 :     my ($sequences, $distances) = @_;
1022 :     # Declare the return variable.
1023 :     my %retVal = ();
1024 :     # Create a sorted list of the genome IDs.
1025 :     my @genomes = sort keys %{$sequences};
1026 :     # Loop through the genome IDs.
1027 :     for (my $i = 0; $i <= $#genomes; $i++) {
1028 :     my $genomei = $genomes[$i];
1029 :     # Create the 0-distance pair.
1030 :     $retVal{"$genomei/$genomei"} = 0;
1031 :     # Loop through the genomes lexically after this one. We'll
1032 :     # generate one distance and use it for both orderings.
1033 :     for (my $j = $i + 1; $j <= $#genomes; $j++) {
1034 :     my $genomej = $genomes[$j];
1035 :     # Compute the distance.
1036 :     my $distance = SequenceDistance($sequences->{$genomei}, $sequences->{$genomej},
1037 :     $distances);
1038 :     Trace("Distance from $genomei to $genomej is $distance.") if T(Distance => 3);
1039 :     # Store it in the return hash.
1040 :     $retVal{"$genomei/$genomej"} = $distance;
1041 :     $retVal{"$genomej/$genomei"} = $distance;
1042 :     }
1043 :     }
1044 :     # Return the result matrix.
1045 :     return %retVal;
1046 :     }
1047 :    
1048 :     =head3 SequenceDistance
1049 :    
1050 :     C<< my $dist = SimBlocks::SequenceDistance($seq1, $seq2, $distances); >>
1051 :    
1052 :     Return the distance between two sequences. The distance presumes that
1053 :     each alignment is a vector of sorts, with purines (C<A>/C<T>) and pyrmidines (C<G>/C<C>)
1054 :     being a half-unit from themselves and a full unit from each other. The distance is normalized
1055 :     by the number of positions in the alignment. Thus, C<AATGCTT> is 0.6267 from C<ATTTGCA>.
1056 :     The fourth and sixth positions are a full-unit jump and the second, fifth, and seventh
1057 :     positions are half-unit jumps. This works out to
1058 :    
1059 :     sqrt(0 + 0.5^2 + 0 + 1^2 + 0.5^2 + 1^2 + 0.5^2)/sqrt(7)
1060 :    
1061 :     which is approximately 0.6267. Insertion characters (C<->) are considered a distance of
1062 :     0 from everything. This distance model can be overridden by supplying a custom distance
1063 :     matrix as the third parameter.
1064 :    
1065 :     =over 4
1066 :    
1067 :     =item seq1
1068 :    
1069 :     Origin sequence for computing the distance.
1070 :    
1071 :     =item seq2
1072 :    
1073 :     End sequence for computing the distance. Both sequences must be the same length.
1074 :    
1075 :     =item distances
1076 :    
1077 :     Reference to a hash that maps pairs of letters to distance values. Each letter pair
1078 :     must be represented as a two-character string. Dual strings such as "AG" and "GA" must
1079 :     map to the same distance. If no distance matrix is specified, the default distance
1080 :     matrix is used.
1081 :    
1082 :     =back
1083 :    
1084 :     =cut
1085 :     #: Return Type $;
1086 :     sub SequenceDistance {
1087 :     # Get the parameters.
1088 :     my ($seq1, $seq2, $distances) = @_;
1089 :     if (!defined $distances) {
1090 :     $distances = DefaultDistances();
1091 :     }
1092 : parrello 1.3 # Declare the return value. Initially, this will be a sum of squares. We'll
1093 : parrello 1.2 # take the square root and divide out the length later.
1094 :     my $retVal = 0;
1095 :     # Loop through the sequences, character by character, comparing.
1096 :     my $n = length $seq1;
1097 :     for (my $i = 0; $i < $n; $i++) {
1098 :     my $s1 = substr $seq1, $i, 1;
1099 :     my $s2 = substr $seq2, $i, 1;
1100 :     my $step = $distances->{"$s1$s2"};
1101 :     Trace("No step distance found for \"$s1$s2\".") if !defined $step && T(Distance => 1);
1102 :     $retVal += $step * $step;
1103 :     }
1104 :     # Divide by the length and take the square root.
1105 :     $retVal = sqrt($retVal / $n);
1106 :     # Return the result.
1107 :     return $retVal;
1108 :     }
1109 :    
1110 :     =head3 GetBlock
1111 :    
1112 :     C<< my $blockData = $simBlocks->GetBlock($blockID); >>
1113 :    
1114 :     Return a B<DBObject> for a specified group block.
1115 :    
1116 :     =over 4
1117 :    
1118 :     =item blockID
1119 :    
1120 :     ID of the desired block.
1121 :    
1122 :     =item RETURN
1123 :    
1124 :     Returns a B<DBObject> for the group block in question. The object allows access to
1125 :     all of the fields in the C<GroupBlock> relation of the database.
1126 :    
1127 :     =back
1128 :    
1129 :     =cut
1130 :     #: Return Type $%;
1131 :     sub GetBlock {
1132 :     # Get the parameters.
1133 :     my ($self, $blockID) = @_;
1134 :     # Get the group block.
1135 :     my $retVal = $self->GetEntity('GroupBlock', $blockID);
1136 :     # Return the result.
1137 :     return $retVal;
1138 :     }
1139 :    
1140 : parrello 1.3 =head3 GetBlockPieces
1141 :    
1142 :     C<< my @blocks = $blockData->GetBlockPieces($location); >>
1143 :    
1144 :     Return a map of the block pieces inside the specified location. The return
1145 :     value will be a list of block locations. A block location is essentially a
1146 :     location string with a block ID in place of the contig ID.
1147 :    
1148 :     This routine depends upon the fact that there is no overlap between blocks.
1149 :     A given nucleotide on a Contig is in exactly one block. Therefore, to find
1150 :     all the blocks that overlap a specific section of the Contig, we find the
1151 :     first block that ends after the start of the section and proceed until we
1152 :     find the last block that begins before the end of the section. The only
1153 :     complexity that occurs is if the Contig folds back in upon itself and
1154 :     appears twice in the same block. In this case, we might get two segments from
1155 :     the same block, and they may or may not overlap.
1156 :    
1157 :     =over 4
1158 :    
1159 :     =item location
1160 :    
1161 :     A basic location object or location string defining the range whose block pieces
1162 :     are desired.
1163 :    
1164 :     =item RETURN
1165 :    
1166 :     Returns a list of block location strings. Each string consists of a block ID
1167 :     followed by a start location and an end location. The three pieces of the
1168 :     string are separated by underscores (C<_>).
1169 :    
1170 :     =back
1171 :    
1172 :     =cut
1173 :     #: Return Type %;
1174 :     sub GetBlockPieces {
1175 :     # Get the parameters.
1176 :     my ($self, $location) = @_;
1177 :     # Declare the return variable.
1178 :     my @retVal = ();
1179 :     # Parse the incoming location.
1180 :     my $loc = BasicLocation->new($location);
1181 :     # Determine the parameters needed to get the desired list of regions.
1182 :     my ($left, $right, $contigID) = ($loc->Left, $loc->Right, $loc->Contig);
1183 : parrello 1.5 Trace("Searching for regions near " . $loc->String) if T(4);
1184 : parrello 1.3 # Ask for the regions in the section we want.
1185 :     my @regions = $self->GetAll(['Region', 'IncludesRegion', 'GroupBlock'],
1186 :     'Region(end) >= ? AND Region(position) <= ? AND Region(contigID) = ?',
1187 :     [$left, $right, $contigID],
1188 :     ['GroupBlock(id)', 'GroupBlock(pattern)',
1189 :     'GroupBlock(len)',
1190 :     'Region(position)', 'Region(end)',
1191 :     'Region(content)']);
1192 :     # Loop through the regions found. For each region we will output a location string.
1193 :     for my $regionData (@regions) {
1194 :     # Declare the variable to contain the formatted location.
1195 :     my $blockLoc;
1196 :     # Separate the fields from the region data.
1197 :     my ($blockID, $pattern, $len, $position, $end, $content) = @{$regionData};
1198 :     # Determine whether we're using the full region or only part of it.
1199 :     if ($position >= $left && $end <= $right) {
1200 :     # Here we're using the full region, so the location is simple.
1201 :     $blockLoc = "${blockID}_1_${len}";
1202 :     } else {
1203 :     # This is the hard part. We need to convert the contig positions to
1204 :     # block positions. Since the block contains insertion characters and
1205 :     # the regions don't, this requires fancy dancing. First, we get the
1206 :     # region's DNA string.
1207 :     my $dnaSequence = TagDNA($pattern, $content);
1208 :     # Let's call the "area of interest" the part of the Contig that
1209 :     # is in the incoming location AND in the current region. This could
1210 :     # be at either end of the region or somewhere in the middle. We
1211 :     # need to walk the block DNA to find what positions in the block
1212 :     # correspond to the desired positions in the contig. First, we
1213 :     # put ourselves at the beginning of the block and the region,
1214 :     # and walk to the start point. Note that our position arithmetic
1215 :     # is 1-based, in keeping with the conventions of biologists.
1216 :     my $blockPos = 1;
1217 :     my $contigPos = $position;
1218 :     # Check to see if we can use the current position as are starting
1219 :     # point or we need to find it.
1220 :     if ($left > $position) {
1221 :     # Here we need to find the left edge of the area of interest.
1222 :     $blockPos = WalkDNA($blockPos, $contigPos, $dnaSequence, $left);
1223 :     # Update the contig position to match.
1224 :     $contigPos = $left;
1225 :     }
1226 :     # Save the left edge.
1227 :     my $blockLeft = $blockPos;
1228 :     # Now find the right edge.
1229 :     if ($right >= $end) {
1230 :     # Here the right edge is the end of the block.
1231 :     $blockPos = $len;
1232 :     } else {
1233 :     # Here we have to find the right edge of the area of interest.
1234 :     $blockPos = WalkDNA($blockPos, $contigPos, $dnaSequence, $right);
1235 :     }
1236 :     my $blockRight = $blockPos;
1237 :     # Format the location.
1238 :     $blockLoc = "${blockID}_${blockLeft}_${blockRight}";
1239 :     }
1240 :     # Push the block location onto the return list.
1241 :     push @retVal, $blockLoc;
1242 :     }
1243 :     # Return the result.
1244 :     return @retVal;
1245 :     }
1246 :    
1247 : parrello 1.4 =head3 GetFeatureBlockPieces
1248 :    
1249 :     C<< my @pieces = $simBlocks->GetFeatureBlockPieces($fig, \@featureIDs, $distance); >>
1250 :    
1251 :     Return a list of the block pieces within the specified distance of the
1252 :     specified features. This method essentially computes locations from
1253 :     the distance and the feature locations, then passes them to L/GetBlockPieces>.
1254 :     That method will return the sections of various similarity blocks that
1255 :     occur inside the the locations computed.
1256 :    
1257 :     =over 4
1258 :    
1259 :     =item fig
1260 :    
1261 :     A FIG-like object that can be used to convert features into locations.
1262 :    
1263 :     =item featureIDs
1264 :    
1265 :     Reference to a list of feature IDs. Blocks in the vicinity of any of the
1266 :     features are returned.
1267 :    
1268 :     =item distance
1269 :    
1270 :     Distance to search from the feature's actual location.
1271 :    
1272 :     =item RETURN
1273 :    
1274 :     Returns a list of block piece location objects. A block piece location object
1275 :     is a standard B<BasicLocation> object with a block ID instead of a contig ID.
1276 :    
1277 :     =back
1278 :    
1279 :     =cut
1280 :     #: Return Type @;
1281 :     sub GetFeatureBlockPieces {
1282 :     # Get the parameters.
1283 :     my ($self, $fig, $featureIDs, $distance) = @_;
1284 :     # Declare a hash for keeping the return data. This will weed out
1285 :     # duplicates, of which we expect plenty. We'll need to handle
1286 :     # overlaps later, however.
1287 :     my %retHash = ();
1288 :     # Loop through the features.
1289 :     for my $featureID (@{$featureIDs}) {
1290 :     # Get the feature's genome ID.
1291 :     my $genomeID = FIG::genome_of($featureID);
1292 :     # Get the feature's locations.
1293 :     my @locations = $fig->feature_location($featureID);
1294 :     # Loop through the locations, sticking the resulting pieces in the return
1295 :     # hash.
1296 :     for my $loc (@locations) {
1297 :     my $locObject = BasicLocation->new($loc);
1298 :     # Get the length of the contig in question.
1299 :     my $len = $fig->contig_ln($genomeID, $locObject->Contig);
1300 :     # Expand the location.
1301 : parrello 1.5 $locObject->Widen($distance, $len);
1302 :     # Insure the location is Sprout-style;
1303 :     $locObject->FixContig($genomeID);
1304 : parrello 1.4 # Get the desired block pieces.
1305 :     my @pieces = $self->GetBlockPieces($locObject);
1306 : parrello 1.5 Trace(scalar(@pieces) . " pieces found for location $loc.") if T(4);
1307 : parrello 1.4 # Put them in the hash.
1308 :     for my $piece (@pieces) {
1309 :     $retHash{$piece} = 1;
1310 :     }
1311 :     }
1312 :     }
1313 :     # Now we have all the block pieces that occur in any one of our regions. The
1314 :     # next step is to merge adjacent and overlapping regions. First we convert
1315 :     # the pieces to location objects so we can sort them.
1316 :     my @retVal = ();
1317 :     for my $piece (keys %retHash) {
1318 :     my $loc = BasicLocation->new($piece);
1319 : parrello 1.5 push @retVal, $loc;
1320 : parrello 1.4 }
1321 : parrello 1.5 Trace("Beginning sort.") if T(3);
1322 : parrello 1.4 @retVal = sort { BasicLocation::Cmp($a,$b) } @retVal;
1323 : parrello 1.5 Trace(scalar(@retVal) . " pieces found before overlap check.") if T(3);
1324 : parrello 1.4 # Now the locations are sorted by block ID, start position, and descending
1325 :     # length. This means that if there's an overlap, the two overlapping
1326 :     # pieces will be next to each other.
1327 :     my $i = 0;
1328 :     while ($i < $#retVal) {
1329 :     # Check for overlap between this location and the next.
1330 :     my ($type, $len) = Overlap::CheckOverlap($retVal[$i], $retVal[$i+1]);
1331 :     if ($len == 0) {
1332 :     # Here there is no overlap, so we push ahead.
1333 :     $i++;
1334 :     } elsif ($type eq 'embedded') {
1335 :     # Here the second piece is inside the first, so we delete the
1336 :     # second.
1337 :     delete $retVal[$i+1];
1338 :     } else {
1339 :     # Here we have a normal overlap. Because all of our pieces
1340 :     # are forward locations, this means the second piece starts
1341 :     # before the end of the of the first. We set the end point
1342 :     # if the first to the end point of the second and delete the
1343 :     # second.
1344 :     $retVal[$i]->SetEnd($retVal[$i+1]->EndPoint);
1345 :     delete $retVal[$i+1];
1346 :     }
1347 :     }
1348 :     # Return the result.
1349 :     return @retVal;
1350 :     }
1351 :    
1352 : parrello 1.3 =head3 WalkDNA
1353 :    
1354 :     C<< my $blockPos = SimBlocks::WalkDNA($blockPos, $contigPos, $dna, $loc); >>
1355 :    
1356 :     Location the desired position within a block of a specified location in a contig.
1357 :    
1358 :     The idea here is that a block contains insertion characters and a contig doesn't.
1359 :     The incoming DNA sequence is for the block. On entry, we have a current position
1360 :     in the block and a current position in the contig. We advance the two pointers
1361 :     in parallel until the contig pointer equals the desired location.
1362 :    
1363 :     =over 4
1364 :    
1365 :     =item blockPos
1366 :    
1367 :     Current position in the block (1-based).
1368 :    
1369 :     =item contigPos
1370 :    
1371 :     Current position in the contig.
1372 :    
1373 :     =item dna
1374 :    
1375 :     DNA string for the block with the points of variance filled in with data from
1376 :     the contig. In other word's the block's version of the current contig region.
1377 :    
1378 :     =item loc
1379 :    
1380 :     Desired location in the contig.
1381 :    
1382 :     =item RETURN
1383 :    
1384 :     Returns the position in the block corresponding to the desired position in the
1385 :     contig.
1386 :    
1387 :     =back
1388 :    
1389 :     =cut
1390 :     #: Return Type $;
1391 :     sub WalkDNA {
1392 :     # Get the parameters.
1393 :     my ($blockPos, $contigPos, $dna, $loc) = @_;
1394 :     # Copy the incoming positions into variables with which we can play. While
1395 :     # we're at it, we'll convert the block position from 1-based to 0-based.
1396 :     my ($retVal, $contigPtr) = ($blockPos - 1, $contigPos);
1397 :     # Loop until we find our destination point.
1398 :     while ($contigPtr < $loc) {
1399 :     # Find the next insertion character in the DNA string.
1400 :     my $nextHyphen = index $dna, '-', $retVal;
1401 :     # At this point, "nextHyphen" is either -1 (indicating no hyphen
1402 :     # found) or it is positioned on a hyphen. If it's -1, move it after
1403 :     # the end of the dna string.
1404 :     if ($nextHyphen < 0) {
1405 :     $nextHyphen = length $dna;
1406 :     }
1407 :     # Now, if we subtract $retVal from $nextHyphen, we have the largest
1408 :     # value by which we can advance the contig pointer without getting
1409 :     # out of sync. The first step is to see if this would move us past
1410 :     # our desired location.
1411 :     my $offset = $nextHyphen - $retVal;
1412 :     my $contigNext = $contigPtr + $offset;
1413 :     if ($contigNext >= $loc) {
1414 :     # We now know that everything between the current position and
1415 :     # the desired position is real nucleotides. We simply push both
1416 :     # pointers forward by the same amount so that the contig pointer
1417 :     # becomes $loc.
1418 :     $retVal += $loc - $contigPtr;
1419 :     $contigPtr = $loc
1420 :     } else {
1421 :     # Here we need to go the whole distance to the next hyphen and
1422 :     # keep going.
1423 :     $contigPtr += $offset;
1424 :     $retVal += $offset;
1425 :     $retVal++;
1426 :     }
1427 :     }
1428 :     # Return the result. Note we convert back to 1-based.
1429 :     return $retVal + 1;
1430 :     }
1431 :    
1432 : parrello 1.1 1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3