[Bio] / Sprout / SimBlocks.pm Repository:
ViewVC logotype

Annotation of /Sprout/SimBlocks.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.9 - (view) (download) (as text)

1 : parrello 1.1 package SimBlocks;
2 :    
3 :     require Exporter;
4 :     use ERDB;
5 :     @ISA = qw(Exporter ERDB);
6 : parrello 1.2 @EXPORT = qw(DefaultDistances);
7 : parrello 1.3 @EXPORT_OK = qw(BlocksInSet GetBlock GetBlockPieces CompareGenomes DBLoad DistanceMatrix GetAlignment GetRegions ParsePattern RemoveBlocks SequenceDistance SetNumber SnipScan);
8 : parrello 1.1
9 :     use strict;
10 :     use Tracer;
11 :     use PageBuilder;
12 :     use Genome;
13 : parrello 1.3 use BasicLocation;
14 : parrello 1.4 use Overlap;
15 : parrello 1.1 use FIG_Config;
16 : parrello 1.4 use FIG;
17 : parrello 1.1
18 :     =head1 Similarity Block Database
19 :    
20 :     =head2 Introduction
21 :    
22 :     A I<similarity Block Database> describes the similarities amongst a group
23 :     of contigs. The load files for the database are produced by the
24 :     C<SimBlast.pl> script from genome FASTA files. The database is described
25 :     by the C<SimBlocksDBD.xml> file.
26 :    
27 :     The primary entities of the database are B<Genome>, which represents a specified
28 : parrello 1.2 organism, B<GroupBlock>, which represents a set of similar regions, and B<Contig>,
29 : parrello 1.1 which represents a contiguous segment of DNA. The key relationship is
30 : parrello 1.2 B<ContainsRegionIn>, which maps blocks to contigs. The mapping information
31 : parrello 1.1 includes the DNA nucleotides as well as the location in the contig.
32 :    
33 : parrello 1.2 =head2 Formal Definitions
34 :    
35 :     Let {B<G(1)>, B<G(2)>, ..., B<G(>I<n>B<)>} be a set of genomes
36 :     and let {B<g(1)>, B<g(2)>, ..., B<g(>I<m>B<)>} be the set of
37 :     genes called on these genomes.
38 :    
39 :     The primitive notion from which similarity blocks are derived is the
40 :     I<correspondence mapping>. Roughly, this mapping connects genes on
41 :     one genome to corresponding genes on other genomes. We will treat
42 :     the mapping as an equivalence class. In practice, the entire mapping
43 :     will not be available when we are building to load files for the
44 :     similarity block database. Instead, the user will provide a subset
45 :     of the mapping from which the entire partitioning can be deduced
46 :     via commutativity, reflexivity, and transitivity. This subset is
47 :     called the I<correspondence relation>.
48 :    
49 :     In actual practice, the correspondence relation will contain some
50 :     false positives. The process of developing the correspondence
51 :     relation usually involves analyzing similarities above a certain
52 :     threshhold. (This type of similarity is not necessarily transitive;
53 :     however, the true, underlying correspondence mapping we want I<is>
54 :     transitive.) Often, a sequence of correspondence pairs that maps
55 :     a gene on one particular genome to another gene on the same genome
56 :     is an indication of a false positive.
57 :    
58 :     In addition to false positives, we may have invalid genes, missing
59 :     genes, frame shifts, and incorrect start locations. Detecting and
60 :     cleaning these problems is an important part of getting a good
61 :     correspondence relation.
62 :    
63 :     Given a full-blown correspondence mapping, we define the following
64 :     concepts.
65 :    
66 :     =over 4
67 :    
68 :     =item corresponding gene
69 :    
70 :     The I<corresponding genes> of a given gene B<g(>I<i>B<)> are those
71 :     genes mapped to it by the correspondence mapping.
72 :    
73 :     =item gene block
74 :    
75 :     A I<gene block> is a maximal set of corresponding genes.
76 :    
77 :     =item intergenic region
78 :    
79 :     An I<intergenic region> is a section of the contig between
80 :     two adjacent genes. The adjacent genes are called the
81 :     I<left gene> and the I<right gene>, depending on whether they
82 :     precede or follow the region. For example, if C<G2_100+50>
83 :     and C<G2_175+100> are adjacent, C<G2_150+25> is their
84 :     intergenic region, the left gene is C<G2_100+50>, and
85 :     the right gene is C<G2_175+100>.
86 :    
87 :     =item intergenic block
88 :    
89 :     An I<intergenic block> is a maximal set of intergenic regions such
90 :     that (1) the left genes all correspond and have the same orientation,
91 :     (2) the right genes all correspond and have the same orientation, and
92 :     (3) the region lengths differ by no more than a specified threshhold
93 :     (usually 10%).
94 :    
95 :     =item similarity block
96 :    
97 :     A I<similarity block> is either a gene block or an intergenic block.
98 :     A similarity block is therefore a set of regions (either gene regions
99 :     or intergenic regions) that are tied together by the correspondence
100 :     mapping.
101 :    
102 :     =item common block
103 :    
104 :     A I<common block> with respect to a subset of the genomes
105 :     {B<G(>I<i1>B<)>, B<G(>I<i2>B<)>, ... B<G(>I<ik>B<)>} is a
106 :     similarity block that has exactly one region in each
107 :     of the genomes of the subset. There is a lesser notion of
108 :     a block that occurs in most genomes in the subset. The actual
109 :     program allows the notion of I<most> to be defined externally.
110 :     The default is all of the genomes in the subset.
111 :    
112 :     =item unique sequence
113 :    
114 :     A I<unique sequence> is a region that does not appear in any
115 :     similarity block with another region. In other words, a unique
116 :     sequence is a region that does not correspond to any other region.
117 :    
118 :     =back
119 :    
120 :     =head2 Usage
121 : parrello 1.1
122 :     The similarity block database must be able to answer two questions, taking
123 : parrello 1.2 as input two genome sets. (These would be subsets of the total set of genomes
124 :     represented in the correspondence mapping.)
125 : parrello 1.1
126 :     =over 4
127 :    
128 : parrello 1.2 =item (1) Which blocks are common in one set and do not occur in the other.
129 :    
130 :     These are considered major variations. They result from prophage insertions,
131 :     transposition events, and recombinations. A major variation results in a
132 :     different set of proteins in an operating cell.
133 : parrello 1.1
134 : parrello 1.2 =item (2) What individual variations in the shared common blocks distinguish the two sets?
135 :    
136 :     These are considered minor variations, and are the most common result of
137 :     single-nucleotide mutations. Rather than changing the set of proteins,
138 :     minor variations change the proteins themselves.
139 : parrello 1.1
140 :     =back
141 :    
142 :     This class is a subclass of B<ERDB>, and inherits all of its public methods.
143 :    
144 : parrello 1.2 =head2 Setup
145 :    
146 :     To begin working with similarity blocks, you must first create a database to
147 :     contain the block data. Next, you must specify the various similarity block
148 :     parameters in C<FIG_Config>. These are as follows. (Of course, the values
149 :     you specify will be different.)
150 :    
151 :     $simBlocksDB = "conser5_blocks"; # SimBlocks database schema name
152 :     $simBlocksData = "$data/BlockData"; # directory containing SimBlocks data files
153 :     # (used to load the Similarity database)
154 :     $simBlastData = "$data/BlastData"; # directory containing SimBlast data files
155 : parrello 1.3 # (input to SimLoad.pl)
156 : parrello 1.2
157 : parrello 1.3 Run the C<SimLoad> script to load the database.
158 : parrello 1.2
159 :     The C<SimBlockForm.cgi> service enables you to run a similarity analysis between genome
160 :     sets in the database. The C<Html/ERDBTest.html> page enables you to run general queries
161 :     against the database and execute simple scripts (however, this latter capability is
162 :     disabled unless you have
163 :    
164 :     $debug_mode = 1; # TRUE to enable the debugging scripts
165 :    
166 :     The Similarity Block methods and scripts work in the Windows version of the SEED (see
167 :     C<WinBuild>).
168 :    
169 : parrello 1.1 =cut
170 :    
171 :     #: Constructor SimBlocks->new();
172 :    
173 :     =head2 Public Methods
174 :    
175 :     =head3 new
176 :    
177 :     C<< my $simdb = SimBlocks->new($dbname, $dbType, $port); >>
178 :    
179 :     Construct a new SimBlocks object connected to the specified database. If no
180 :     database is specified, the default database indicated by C<FIG_Config::simBlocksDB>
181 :     will be used. Similarly, if the database type is omitted, the type will be
182 :     taken from C<FIG_Config::dbms>, and if the port is omitted, C<FIG_Config::dbport>
183 :     will be used. The user name and password for connecting are taken from
184 :     C<FIG_Config::dbuser> and C<FIG_Config::dbpass>.
185 :    
186 : parrello 1.2 In almost every case, you will be calling
187 :    
188 :     C<< my $simdb = SimBlocks->new(); >>
189 :    
190 : parrello 1.1 =over 4
191 :    
192 :     =item dbname (optional)
193 :    
194 :     Database schema name. This is how we choose the desired database on the server.
195 :     If omitted, the value of C<$FIG_Config::simBlocksDB> is used.
196 :    
197 :     =item dbType
198 :    
199 :     Database type (e.g. C<mysql> for MySQL or C<pg> for Postgres). If omitted, the
200 :     value of C<$FIG_Config::dbms> will be used.
201 :    
202 :     =item port
203 :    
204 :     Port number for connecting to the database server. If omitted, the value of
205 :     C<$FIG_Config::dbport> will be used.
206 :    
207 :     =back
208 :    
209 :     =cut
210 :    
211 :     sub new {
212 :     # Get the parameters.
213 :     my ($class, $dbname, $dbType, $port) = @_;
214 :     # Plug in the default values.
215 :     if (! $dbname) {
216 : parrello 1.7 $dbname = DBName();
217 : parrello 1.1 }
218 :     if (! $dbType) {
219 :     $dbType = $FIG_Config::dbms;
220 :     }
221 :     if (! $port) {
222 :     $port = $FIG_Config::dbport;
223 :     }
224 :     # Connect to the database.
225 :     my $dbh = DBKernel->new($dbType, $dbname, $FIG_Config::dbuser, $FIG_Config::dbpass, $port);
226 :     # Get the data directory name.
227 :     my $directory = $FIG_Config::simBlocksData;
228 :     # Create and bless the ERDB object.
229 : parrello 1.8 my $retVal = ERDB::new($class, $dbh, "$FIG_Config::fig/SimBlocksDBD.xml");
230 : parrello 1.1 # Return it.
231 :     return $retVal;
232 :     }
233 :    
234 : parrello 1.7 =head3 DBName
235 :    
236 :     C<< my $name = SimBlocks::DBName; >>
237 :    
238 :     Return the name of the database. This is set from a config variable, but if the
239 :     variable is undefined a default value is used.
240 :    
241 :     =cut
242 :    
243 :     sub DBName {
244 :     my $retVal;
245 :     if (defined $FIG_Config::simBlocksDB) {
246 :     $retVal = $FIG_Config::simBlocksDB;
247 :     } else {
248 :     $retVal = "Blocks";
249 :     }
250 :     return $retVal;
251 :     }
252 :    
253 : parrello 1.2 =head3 DefaultDistances
254 :    
255 :     C<< my $distances = DefaultDistances(); >>
256 :    
257 :     Return the default distance matrix for computing the alignment distances (see
258 :     also L</DistanceMatrix>. This matrix returns a distance of 0 between an insertion
259 :     character (C<->) and any other nucleotide, a distance of 0.5 between nucleotides
260 :     of the same type, and a distance of 1 between nucleotides of different types.
261 :    
262 :     =cut
263 :    
264 :     my $DefaultDistanceMatrix = { AA => 0, AC => 1, AG => 0.5, AT => 1,
265 :     AN => 0.625, AR => 0.25, AY => 1, 'A-' => 0,
266 :     CA => 1, CC => 0, CG => 1, CT => 0.5,
267 :     CN => 0.625, CR => 1, CY => 0.25, 'C-' => 0,
268 :     GA => 0.5, GC => 1, GG => 0, GT => 1,
269 :     GN => 0, GR => 0.25, GY => 1, 'G-' => 0,
270 :     TA => 1, TC => 0.5, TG => 1, TT => 0,
271 :     TN => 0, TR => 1, TY => 0.25, 'T-' => 0,
272 :     RA => 0.25, RC => 1, RG => 0.25, RT => 1,
273 :     RN => 0.625, RR => 0.25, RY => 1, 'R-' => 0,
274 :     YA => 1, YC => 0.25, YG => 1, YT => 0.25,
275 :     YN => 0.625, YR => 1, YY => 0.25, 'Y-' => 0,
276 :     NA => 0.625, NC => 0.625, NG => 0.625, NT => 0.625,
277 :     NN => 0.625, NR => 0.625, NY => 0.625,'N-' => 0,
278 :     '-A' => 0, '-C' => 0, '-G' => 0, '-T' => 0,
279 :     '-N' => 0, '-R' => 0, '-Y' => 0, '--' => 0 };
280 :    
281 :     sub DefaultDistances {
282 :     return $DefaultDistanceMatrix;
283 :     }
284 :    
285 : parrello 1.1 =head3 DBLoad
286 :    
287 :     C<< my $stats = $simBlocks->DBLoad($rebuild); >>
288 :    
289 :     Load the database from the default directory. This is essentially identical to
290 :     a B<LoadTables> call with the default directory used instead of a caller-specified
291 :     directory.
292 :    
293 :     =over 4
294 :    
295 :     =item rebuild
296 :    
297 :     TRUE if the tables should be re-created; FALSE if the tables should be cleared and
298 : parrello 1.2 reloaded. The default is FALSE; however, TRUE is considerably faster. You would use
299 :     FALSE if you had added non-standard indexes to the database and didn't want to lose
300 :     them.
301 : parrello 1.1
302 :     =item RETURN
303 :    
304 :     Returns a statistics object describing the duration of the load, the number of
305 :     records loaded, and a list of error messages.
306 :    
307 :     =back
308 :    
309 :     =cut
310 :     #: Return Type $%@;
311 :     sub DBLoad {
312 :     # Get the parameters.
313 :     my ($self, $rebuild) = @_;
314 :     # Load the tables.
315 :     my $retVal = $self->LoadTables($FIG_Config::simBlocksData, $rebuild);
316 :     # Return the result.
317 :     return $retVal;
318 :     }
319 :    
320 :     =head3 CompareGenomes
321 :    
322 : parrello 1.2 C<< my (\%set0Blocks, \%set1Blocks, \%bothBlocks) = $simBlocks->CompareGenomes(\@set0, \@set1); >>
323 : parrello 1.1
324 :     Analyze two sets of genomes for commonalities. The group blocks returned will be divided
325 : parrello 1.2 into three hashes: one for those common to set 0 and not occurring at all in set 1, one
326 :     for those common to set 1 and not occurring at all in set 0, and one for those common
327 :     to both sets. Each hash is keyed by group ID and will contain B<DBObject>s for
328 :     B<GroupBlock> records with B<HasInstanceOf> data attached, though the genome ID in
329 :     the B<HasInstanceOf> section is not generally predictable.
330 : parrello 1.1
331 :     =over 4
332 :    
333 : parrello 1.2 =item set0
334 : parrello 1.1
335 : parrello 1.2 Reference to a list of genome IDs for the genomes in the first set (set 0).
336 : parrello 1.1
337 : parrello 1.2 =item set1
338 : parrello 1.1
339 : parrello 1.2 Reference to a list of genome IDs for the genomes in the second set (set 1).
340 : parrello 1.1
341 :     =item RETURN
342 :    
343 : parrello 1.2 Returns a triple of hashes. Each hash is keyed by group ID, and will contain
344 :     B<DBObject>s for records in the B<GroupBlock> table. Groups found in all of the
345 :     genomes in set 0 but in none of the genomes of set 1 will be in the first hash,
346 :     groups found in all of the genomes in set 1 but in none of the genomes of set 0
347 :     will be in the second hash, and groups found in all of the genomes of both sets
348 :     will be in the third hash.
349 : parrello 1.1
350 :     =back
351 :    
352 :     =cut
353 :     #: Return Type @%;
354 :     sub CompareGenomes {
355 :     # Get the parameters.
356 : parrello 1.2 my ($self, $set0, $set1) = @_;
357 : parrello 1.1 # Declare the three hashes.
358 : parrello 1.2 my (%set0Blocks, %set1Blocks, %bothBlocks);
359 :     # Our first task is to find the common groups in each genome set.
360 :     my %groups0 = $self->BlocksInSet($set0);
361 : parrello 1.1 my %groups1 = $self->BlocksInSet($set1);
362 :     # Create a trailer key to help tighten the loops.
363 :     my $trailer = "\xFF";
364 :     # The groups are hashed by group ID. We will move through them in key order
365 : parrello 1.2 # to get the "set0", "set1", and "both" groups.
366 :     my @blockIDs0 = (sort keys %groups0, $trailer);
367 : parrello 1.1 my @blockIDs1 = (sort keys %groups1, $trailer);
368 :     # Prime the loop by getting the first ID from each list. Thanks to the trailers
369 :     # we know this process can't fail.
370 : parrello 1.2 my $blockID0 = shift @blockIDs0;
371 : parrello 1.1 my $blockID1 = shift @blockIDs1;
372 : parrello 1.2 while ($blockID0 lt $trailer || $blockID1 lt $trailer) {
373 : parrello 1.1 # Compare the block IDs.
374 : parrello 1.2 if ($blockID0 lt $blockID1) {
375 :     # Block 0 is only in the first set.
376 :     Trace("Block IDs $blockID0 vs. $blockID1. Set 0 selected.") if T(SimCompare => 3);
377 :     $set0Blocks{$blockID0} = $groups0{$blockID0};
378 :     $blockID0 = shift @blockIDs0;
379 :     } elsif ($blockID0 gt $blockID1) {
380 :     # Block 1 is only in the second set.
381 :     Trace("Block IDs $blockID0 vs. $blockID1. Set 1 selected.") if T(SimCompare => 3);
382 : parrello 1.1 $set1Blocks{$blockID1} = $groups1{$blockID1};
383 :     $blockID1 = shift @blockIDs1;
384 :     } else {
385 :     # We have a block in both sets.
386 : parrello 1.2 Trace("Block IDs $blockID0 vs. $blockID1. Both selected.") if T(SimCompare => 3);
387 : parrello 1.1 $bothBlocks{$blockID1} = $groups1{$blockID1};
388 : parrello 1.2 $blockID0 = shift @blockIDs0;
389 : parrello 1.1 $blockID1 = shift @blockIDs1;
390 :     }
391 :     }
392 : parrello 1.2 # The set1 and set2 lists are too big. They contain groups that are common in their
393 :     # respective sets but not common in both sets. We want groups that are common in
394 :     # their respective sets but completely absent in the other set. We do this by getting
395 :     # a complete list of the blocks in each set and deleting anything we find.
396 :     $self->RemoveBlocks(\%set0Blocks, $set1);
397 :     $self->RemoveBlocks(\%set1Blocks, $set0);
398 : parrello 1.1 # Return the result.
399 : parrello 1.2 return (\%set0Blocks, \%set1Blocks, \%bothBlocks);
400 :     }
401 :    
402 :     =head3 RemoveBlocks
403 :    
404 :     C<< $simBlocks->RemoveBlocks(\%blockMap, \@set); >>
405 :    
406 :     Remove from the specified block map any blocks that occur in the specified set of genomes.
407 :     The block map can contain any data, but it must be keyed by block ID.
408 :    
409 :     =over 4
410 :    
411 :     =item blockMap
412 :    
413 :     Reference to a hash of block IDs to block data. The character of the block data is
414 :     irrelevant to this method.
415 :    
416 :     =item set
417 :    
418 :     Reference to a list of genome IDs. Any block that occurs in any one of the specified
419 :     genomes will be removed from the block map.
420 :    
421 :     =back
422 :    
423 :     =cut
424 :     #: Return Type ;
425 :     sub RemoveBlocks {
426 :     # Get the parameters.
427 :     my ($self, $blockMap, $set) = @_;
428 :     # Get a list of the blocks in the genome set.
429 :     my %setBlocks = $self->BlocksInSet($set, 1);
430 :     # Loop through the incoming block map, deleting any blocks found
431 :     # in the new block set. Note we make the assumption that the
432 :     # incoming block map is smaller.
433 :     my @blockList = keys %{$blockMap};
434 :     for my $blockID (@blockList) {
435 :     if (exists $setBlocks{$blockID}) {
436 :     delete $blockMap->{$blockID};
437 :     }
438 :     }
439 : parrello 1.1 }
440 :    
441 :     =head3 BlocksInSet
442 :    
443 : parrello 1.2 C<< my %blockList = $simBlocks->BlocksInSet($set, $count); >>
444 : parrello 1.1
445 : parrello 1.2 Return a list of the group blocks found in a given number of the genomes in a given
446 :     set. The list returned will be a hash of B<DBObject>s, each corresponding to a single
447 :     B<GroupBlock> record, with a B<HasInstanceOf> record attached, though the content of
448 :     the B<HasInstanceOf> record is not predictable. The hash will be keyed by block ID.
449 : parrello 1.1
450 :     =over 4
451 :    
452 :     =item set
453 :    
454 : parrello 1.2 Reference to a list of genome IDs.
455 :    
456 :     =item count (optional)
457 :    
458 :     Minimum number of genomes from the set in which the block must appear. If C<1> is
459 :     specified, the list will include any block that occurs in at least one genome.
460 :     If C<5> is specified, the list will only include blocks that occur in at least
461 :     5 genomes from the set. If the set consists of only 4 genomes in this latter
462 :     case, no blocks will be returned. The default is to return blocks that occur in
463 :     all genomes of the set.
464 : parrello 1.1
465 :     =item RETURN
466 :    
467 :     Returns a hash of B<DBObject>s corresponding to the group blocks found in the
468 :     genomes of the set.
469 :    
470 :     =back
471 :    
472 :     =cut
473 :     #: Return Type %%;
474 :     sub BlocksInSet {
475 :     # Get the parameters.
476 : parrello 1.2 my ($self, $set, $count) = @_;
477 :     # If the count was not specified, set it t
478 :     if (!defined $count) {
479 :     $count = @{$set};
480 :     }
481 : parrello 1.8 Trace("BlocksInSet called for $count genomes of set (" . join(", ", @{$set}) . ").") if T(2);
482 : parrello 1.1 # Create a hash to hold the groups found. The hash will be keyed by group ID
483 :     # and contain the relevant DB object.
484 :     my %retVal = ();
485 : parrello 1.2 # Create a hash to contain the number of genomes in which each block was found.
486 :     my %counters = ();
487 : parrello 1.1 # Loop through the genome IDs in the specified set.
488 :     for my $genomeID (@{$set}) {
489 :     # Get a list of group blocks for this genome.
490 :     my @blocks = $self->GetList(['HasInstanceOf', 'GroupBlock'],
491 : parrello 1.9 "HasInstanceOf(from-link) = ?", [$genomeID]);
492 : parrello 1.1 # Loop through the blocks, storing any new ones in the hash.
493 :     for my $block (@blocks) {
494 :     # Get the ID of this block.
495 :     my ($blockID) = $block->Value('GroupBlock(id)');
496 : parrello 1.2 # Determine whether this block is new or has been found in a previous
497 :     # genome of the set.
498 :     if (exists $counters{$blockID}) {
499 :     # Here it's old. Increment its counter.
500 :     $counters{$blockID}++;
501 :     } else {
502 :     # Here it's new. Create a counter for it.
503 :     $counters{$blockID} = 1;
504 :     }
505 :     # If it meets the threshhold, put it in the return hash. If it's already
506 :     # there, we'll simply overwrite the old copy, which makes no difference
507 :     # to the caller.
508 :     Trace("Block $blockID has $counters{$blockID} occurrences as of genome $genomeID.") if T(SimSet => 4);
509 :     if ($counters{$blockID} >= $count) {
510 :     $retVal{$blockID} = $block;
511 :     Trace("Block $blockID added to set.") if T(SimSet => 3);
512 :     }
513 : parrello 1.1 }
514 :     }
515 :     # Return the result.
516 :     return %retVal;
517 :     }
518 :    
519 :     =head3 GetRegions
520 :    
521 :     C<< my %regions = $simBlocks->GetRegions($blockID, \@genomes); >>
522 :    
523 :     Return the regions of the specified block that occur in the contigs of
524 :     the specified genomes. The return value is a hash of DNA strings keyed
525 :     on the region's location descriptor.
526 :    
527 :     =over 4
528 :    
529 :     =item blockID
530 :    
531 :     ID of the group block whose regions are to be returned.
532 :    
533 :     =item genomes
534 :    
535 :     Reference to a list of genome IDs. Only regions in one of the listed
536 :     genomes will be returned.
537 :    
538 :     =item RETURN
539 :    
540 :     Returns a hash keyed by region location. The objects in the hash are
541 :     the DNA strings for the specified region.
542 :    
543 :     =back
544 :    
545 :     =cut
546 :     #: Return Type %$;
547 :     sub GetRegions {
548 :     # Get the parameters.
549 :     my ($self, $blockID, $genomes) = @_;
550 :     # Create the return hash.
551 :     my %retVal = ();
552 : parrello 1.2 # Count the regions found.
553 :     my $regionCount = 0;
554 : parrello 1.1 # Query all the regions for the specified block.
555 : parrello 1.2 my $query = $self->Get(['IncludesRegion', 'Region'], "IncludesRegion(from-link) = ?",
556 : parrello 1.9 [$blockID]);
557 : parrello 1.1 # Loop through the query.
558 :     while (my $region = $query->Fetch) {
559 :     # Get this region's data.
560 : parrello 1.2 my ($location, $dna) =
561 :     $region->Values(['Region(id)', 'Region(content)']);
562 : parrello 1.1 # Insure it belongs to one of our genomes.
563 : parrello 1.2 my $found = SetNumber($location, $genomes);
564 : parrello 1.1 # If it does, put it in the hash.
565 : parrello 1.2 if ($found >= 0) {
566 : parrello 1.1 $retVal{$location} = $dna;
567 : parrello 1.2 $regionCount++;
568 : parrello 1.1 }
569 :     }
570 : parrello 1.2 Trace("$regionCount regions found by GetRegions.") if T(2);
571 : parrello 1.1 # Return the result.
572 :     return %retVal;
573 :     }
574 :    
575 : parrello 1.2 =head3 SetNumber
576 :    
577 :     C<< my $setNumber = SimBlocks::SetNumber($contigRegion, \@set0, \@set1, ..., \@setN); >>
578 :    
579 :     Examine a region string, contig ID, or genome ID, and return the number of the genome
580 :     set to which it belongs.
581 :    
582 :     =over 4
583 :    
584 :     =item contigRegion
585 :    
586 :     A region string, contig ID, or genome ID. Because the contig ID is prefixed by the genome ID
587 :     delimited by a colon symbol (C<:>), we split at the first colon to get the desired ID.
588 :    
589 :     =item set0, set1, ..., setN
590 :    
591 :     A list of references to genome ID lists. Essentially, we will return the index of the
592 :     first incoming list that has the specified genome ID in it.
593 :    
594 :     =item RETURN
595 :    
596 :     The index in the list of sets of the set containing the relevant genome, or -1 if
597 :     the genome was not found.
598 :    
599 :     =back
600 :    
601 :     =cut
602 :     #: Return Type $
603 :     sub SetNumber {
604 :     # Get the parameters.
605 :     my ($contigRegion, @sets) = @_;
606 :     # Extract the genome ID.
607 :     my ($genomeID) = split /:/, $contigRegion;
608 :     # Clear the return variable. A negative number means nothing found.
609 :     my $retVal = -1;
610 :     # Loop until we find the genome.
611 :     for (my $i = 0; $retVal < 0 && $i <= $#sets; $i++) {
612 :     my $set = $sets[$i];
613 :     if (grep /^$genomeID$/, @{$set}) {
614 :     $retVal = $i;
615 :     }
616 :     }
617 :     # Return the result.
618 :     return $retVal;
619 :     }
620 :    
621 : parrello 1.1 =head3 TagDNA
622 :    
623 :     C<< my $taggedDNA = SimBlocks::TagDNA($pattern, $dnaString, $prefix, $suffix); >>
624 :    
625 : parrello 1.3 Convert a DNA string from the B<Region> relation to the actual DNA.
626 :     The incoming DNA string will contain only the values corresponding to the
627 :     question marks in the pattern. The pattern should be taken from the DNA
628 :     string's parent B<GroupBlock>. The resulting DNA sequence is built by
629 :     copying the DNA string data into the pattern positions that have question
630 :     marks. If the string is intended for display in an HTML page, the
631 :     I<$prefix> and I<$suffix> parameters can be used to surround the
632 :     question mark positions with HTML markers that specify a different style,
633 :     weight, or font.
634 : parrello 1.1
635 :     =over 4
636 :    
637 :     =item pattern
638 :    
639 :     DNA pattern from which the positions of variance are to be taken.
640 :    
641 :     =item dnaString
642 :    
643 :     String containing DNA string to be marked.
644 :    
645 : parrello 1.3 =item prefix (optional)
646 : parrello 1.1
647 :     String to be inserted before each position of variance.
648 :    
649 : parrello 1.3 =item suffix (optional)
650 : parrello 1.1
651 :     String to be inserted after each position of variance.
652 :    
653 :     =item RETURN
654 :    
655 :     Returns a copy of the pattern string with the positions of variance filled
656 :     in from the DNA string. If a prefix and suffix are specified, they are
657 :     placed before and after each character copied into the pattern string.
658 :    
659 :     =back
660 :    
661 :     =cut
662 :     #: Return Type $;
663 :     sub TagDNA {
664 :     # Get the parameters.
665 :     my ($pattern, $dnaString, $prefix, $suffix) = @_;
666 :     # Insure the prefix and suffix behave properly.
667 :     if (!defined $prefix) {
668 :     $prefix = "";
669 :     }
670 :     if (!defined $suffix) {
671 :     $suffix = "";
672 :     }
673 :     # Initialize the return variable.
674 :     my $retVal = "";
675 :     # Convert the DNA string to a list with the prefixes and suffixes
676 :     # inserted.
677 :     my @dnaList = map { $prefix . $_ . $suffix } split //, $dnaString;
678 :     # Loop through the DNA string.
679 :     my ($start, $position) = (0, 0);
680 :     while (($position = index $pattern, "?", $start) >= 0) {
681 :     # If we have invariant text, copy it over.
682 :     if ($position > $start) {
683 :     $retVal .= substr $pattern, $start, $position - $start;
684 :     }
685 :     # Pop off the next position from the DNA string.
686 :     $retVal .= shift @dnaList;
687 :     # Start past the question mark.
688 :     $start = $position + 1;
689 :     }
690 : parrello 1.2 # Tack on the residual.
691 :     $retVal .= substr $pattern, $start;
692 : parrello 1.1 # Return the result.
693 :     return $retVal;
694 :     }
695 : parrello 1.2
696 :     =head3 SnipScan
697 :    
698 :     C<< my %positions = $simBlocks->SnipScan($blockObject, \@set0, \@set1); >>
699 :    
700 :     Examine the specified block and return a list of the positions at which the
701 :     nucleotide values for regions in the first genome set differ from the values
702 :     for regions in the second genome set. This conditions is only satisfied if
703 :     the nucleotides occurring in the first set's regions never occur in the
704 :     second set's regions. This is a very strict condition. If, for example, C<A>
705 :     occurs in one of the regions for the first set, it cannot occur in any of the
706 :     regions for the second set.
707 :    
708 :     =over 4
709 :    
710 :     =item blockObject
711 :    
712 :     A C<DBObject> representing the B<GroupBlock> record for the desired block or
713 :     the actual ID of the block whose regions are to be examined. It is expected that the
714 :     block will have regions in all of the genomes for both sets, but this is not
715 :     required by the algorithm.
716 :    
717 :     =item set0
718 :    
719 :     Reference to a list of the IDs for the genomes in the first set (0).
720 :    
721 :     =item set1
722 :    
723 :     Reference to a list of the IDs for the genomes in the second set (1).
724 :    
725 :     =item RETURN
726 :    
727 :     A hash that maps positions to contents. Each position will be relative to the
728 :     start of the block. The mapped value will list the nucleotides found in the
729 :     first set and the nucleotides found in the second set. So, for example,
730 :     C<['AC', 'G']> means that both C<A> and C<C> are found in the genomes of the
731 :     first set (0), but only C<G> is found in genomes of the second set (1).
732 :    
733 :     =back
734 :    
735 :     =cut
736 :     #: Return Type %;
737 :     sub SnipScan {
738 :     # Get the parameters.
739 :     my ($self, $blockObject, $set0, $set1) = @_;
740 :     # Convert an incoming block ID to a block object.
741 :     if (ref $blockObject ne "DBObject") {
742 :     $blockObject = $self->GetEntity('GroupBlock', $blockObject);
743 :     }
744 :     # Get the ID and length of this block.
745 :     my ($blockID) = $blockObject->Value('GroupBlock(id)');
746 :     my ($len) = $blockObject->Value('GroupBlock(snip-count)');
747 :     # Create a hash that can be used to identify the genomes in each set.
748 :     my %setMap = ();
749 :     for my $genomeID (@{$set0}) {
750 :     $setMap{$genomeID} = 0;
751 :     }
752 :     for my $genomeID (@{$set1}) {
753 :     $setMap{$genomeID} = 1;
754 :     }
755 :     # Create the snip hash. We'll create empty lists for each snip position.
756 :     # As soon as a position becomes ineligible we delete it from the hash.
757 :     my %snipHash = ();
758 :     for (my $i = 0; $i < $len; $i++) {
759 :     $snipHash{$i} = ['', ''];
760 :     }
761 :     # Ask for the regions in the block.
762 :     my $query = $self->Get(['IncludesRegion', 'Region'], "IncludesRegion(from-link) = ?",
763 : parrello 1.9 [$blockID]);
764 : parrello 1.2 # Loop through the regions.
765 :     while (my $region = $query->Fetch) {
766 :     # Determine this region's genome set. We only continue if the region is in
767 :     # one of the sets.
768 : parrello 1.4 my ($location) = $region->Value('Region(id)');
769 : parrello 1.2 my $genomeID = Genome::FromContig($location);
770 :     if (exists $setMap{$genomeID}) {
771 :     my $setIndex = $setMap{$genomeID};
772 :     # Get the region's snip values.
773 :     my ($regionValue) = $region->Value('Region(content)');
774 :     # Loop through the positions that are still active.
775 :     my @positions = keys %snipHash;
776 :     for my $position (@positions) {
777 :     # Get the nucleotide at the current position in this region.
778 :     my $letter = substr $regionValue, $position, 1;
779 :     # Find it in the appropriate list.
780 :     my $letterList = $snipHash{$position};
781 :     if ((index $letterList->[1-$setIndex], $letter) >= 0) {
782 :     # It's in the other set, so this position is no
783 :     # longer valid.
784 :     delete $snipHash{$position};
785 :     } elsif ((index $letterList->[$setIndex], $letter) < 0) {
786 :     # It's in neither set, so we add it to the list we're
787 :     # accumulating for this set at this position.
788 :     $letterList->[$setIndex] .= $letter;
789 :     }
790 :     }
791 :     }
792 :     }
793 :     # Now %snipHash contains all of the information we need, but we must relocate the
794 :     # positions so they are relative to the block address instead of the snip
795 :     # positions. We need the block pattern to do this.
796 :     my ($blockPattern) = $blockObject->Value('GroupBlock(pattern)');
797 :     my @positions = ParsePattern($blockPattern);
798 :     # Create the return hash.
799 :     my %retVal = ();
800 :     # Relocate the positions.
801 :     for my $inPosition (keys %snipHash) {
802 :     $retVal{$positions[$inPosition]} = $snipHash{$inPosition};
803 :     }
804 :     # Return the result.
805 :     return %retVal;
806 :     }
807 :    
808 :     =head3 ParsePattern
809 :    
810 :     C<< my @positions = SimBlocks::ParsePattern($pattern); >>
811 :    
812 :     Get a list of the positions of variance inside a block pattern. The
813 :     positions of variance are marked by question marks, so all we need to
814 :     do is return the location of every question mark in the incoming
815 :     patter string.
816 :    
817 :     =over 4
818 :    
819 :     =item pattern
820 :    
821 :     DNA pattern for a group block.
822 :    
823 :     =item RETURN
824 :    
825 :     Returns a list of position numbers. These are coded as offsets, with 0 indicating
826 :     the first character of the pattern.
827 :    
828 :     =back
829 :    
830 :     =cut
831 :     #: Return Type @;
832 :     sub ParsePattern {
833 :     # Get the parameters.
834 :     my ($pattern) = @_;
835 :     # Create the return list.
836 :     my @retVal = ();
837 :     # Loop through the pattern, looking for question marks.
838 :     my $position = 0;
839 :     while (($position = index($pattern, "?", $position)) >= 0) {
840 :     push @retVal, $position;
841 :     $position++;
842 :     }
843 :     # Return the result.
844 :     return @retVal;
845 :     }
846 :    
847 :     =head3 MergeDNA
848 :    
849 :     C<< my ($groupSequence, $variance) = SimBlocks::MergeDNA($groupSequence, $newSequence); >>
850 :    
851 :     Merge the DNA for a region into the group representation of similar DNA, returning the
852 :     result and the positions of variance. Positions of variance in the group representation
853 :     are indicated by question marks. This method essentially compares the new DNA to the
854 :     group DNA and adds question marks wherever the new DNA does not match.
855 :    
856 :     =over 4
857 :    
858 :     =item groupSequence
859 :    
860 :     Group representation of the DNA into which the new DNA is to be merged.
861 :    
862 :     =item newSequence
863 :    
864 :     New DNA to be merged into the group representation.
865 :    
866 :     =item RETURN
867 :    
868 :     Returns the merged group representation followed by an updated count of the number
869 :     of question marks in the group.
870 :    
871 :     =back
872 :    
873 :     =cut
874 :     sub MergeDNA {
875 :     # Get the parameters.
876 :     my ($groupSequence, $newSequence) = @_;
877 :     # Declare the return variables.
878 :     my ($retVal, $varianceCount) = ("", 0);
879 :     # If there is no group representation, or if the group representation is the
880 :     # same as the new DNA, we return the new DNA; otherwise, we have to merge.
881 :     # Note that the new DNA cannot contain any question marks, so if either of
882 :     # the degenerate cases is true, we return a variance of 0.
883 :     if ((! $groupSequence) || ($groupSequence eq $newSequence)) {
884 :     $retVal = $newSequence;
885 :     } else {
886 :     # Here the group representation and the new DNA are different, so we
887 :     # have to find all the points of difference and replace them with
888 :     # question marks. We do this using a trick: by XOR-ing the two
889 :     # strings together, we create a new string with nulls everywhere
890 :     # except where there's a mismatch. We then find the non-null characters
891 :     # and replace the corresponding positions in the return string with
892 :     # question marks. First, we get a copy of the group representation to
893 :     # play with.
894 :     $retVal = $groupSequence;
895 :     # Next we compute the XOR string.
896 :     my $hexor = $retVal ^ $newSequence;
897 :     # This loop finds all non-null string positions.
898 :     while ($hexor =~ m/[^\x00]/g) {
899 :     # Here we get the position past the mismatch point.
900 :     my $pos = pos $hexor;
901 :     # Replace the real mismatch point with a question mark.
902 :     substr $retVal, $pos - 1, 1, "?";
903 :     $varianceCount++;
904 :     }
905 :     }
906 :     # Return the result.
907 :     return ($retVal, $varianceCount);
908 :     }
909 :    
910 :     =head3 GetAlignment
911 :    
912 :     C<< my %sequences = $$simBlocks->GetAlignment(\@blockIDs, \@genomeIDs, $indels); >>
913 :    
914 :     Return an alignment of the specified genomes relative to the specified block
915 :     IDs. Only blocks in which all of the genomes occur will produce output for
916 :     the alignment.
917 :    
918 :     =over 4
919 :    
920 :     =item blockIDs
921 :    
922 :     Reference to a list of the IDs of the blocks from which the alignment
923 :     should be constructed.
924 :    
925 :     =item genomeIDs
926 :    
927 :     Reference to a list of the genome IDs participating in the alignment.
928 :    
929 :     =item indels (optional)
930 :    
931 :     C<1> if columns containing insertion characters (C<->) should be included
932 :     in the alignment, else C<0>. The default is C<0>.
933 :    
934 :     =item RETURN
935 :    
936 :     Returns a hash that maps each genome ID to its sequence in the alignment.
937 :    
938 :     =back
939 :    
940 :     =cut
941 :     #: Return Type %;
942 :     sub GetAlignment {
943 :     # Get the parameters.
944 :     my ($self, $blockIDs, $genomeIDs, $indels) = @_;
945 :     # Create the return hash. We will start with a null string for each
946 :     # genome, and add the alignment data one block at a time.
947 :     my %retVal = ();
948 :     for my $genomeID (@{$genomeIDs}) {
949 :     $retVal{$genomeID} = "";
950 :     }
951 :     # Get the number of genomes. This will help us in determining which
952 :     # blocks to keep.
953 :     my $totalGenomes = @{$genomeIDs};
954 :     # Loop through the blocks.
955 :     for my $blockID (@{$blockIDs}) {
956 :     # Get the block's data.
957 :     my $blockData = $self->GetEntity('GroupBlock', $blockID);
958 :     my ($blockName) = $blockData->Value('GroupBlock(description)');
959 :     Trace("Processing common block $blockID ($blockName).") if T(Alignment => 4);
960 :     # Only proceed if the block has real variance in it.
961 :     my ($snipCount) = $blockData->Value('GroupBlock(snip-count)');
962 :     if ($snipCount > 0) {
963 :     # Get this block's regions in the specified genomes.
964 :     my %regionHash = $self->GetRegions($blockID, $genomeIDs);
965 :     # Verify that each genome is represented.
966 :     my %blockHash = ();
967 :     my $genomeCount = 0;
968 :     for my $region (keys %regionHash) {
969 :     # Get the genome ID from the region string.
970 :     $region =~ m/^([^:]+):/;
971 :     # If it's new, count it.
972 :     if (! exists $blockHash{$1}) {
973 :     $blockHash{$1} = $regionHash{$region};
974 :     $genomeCount++;
975 :     }
976 :     }
977 :     # At this point, if the number of genomes found matches the
978 :     # size of the list, this block is good.
979 :     if ($genomeCount == $totalGenomes) {
980 :     if (! $indels) {
981 :     # If indels are off, we need to remove some of the columns. We do
982 :     # this by finding C<-> characters in each genome's DNA sequence,
983 :     # and deleting all instances of the relevant columns.
984 :     for my $genomeID (@{$genomeIDs}) {
985 :     Trace("Hyphen check for genome $genomeID.") if T(Alignment => 3);
986 :     while ((my $pos = index($blockHash{$genomeID},"-")) >= 0) {
987 :     # Here we found an insertion character. We delete its column
988 :     # from all the genomes in the alignment.
989 :     Trace("Hyphen found at position $pos in genome $genomeID.") if T(Alignment => 4);
990 :     for my $genomeID2 (@{$genomeIDs}) {
991 :     substr $blockHash{$genomeID2}, $pos, 1, "";
992 :     }
993 :     }
994 :     }
995 :     }
996 :     Trace("Common block $blockID ($blockName) added to alignment.") if T(Alignment => 3);
997 :     for my $genomeID (@{$genomeIDs}) {
998 :     $retVal{$genomeID} .= $blockHash{$genomeID};
999 :     }
1000 :     }
1001 :     }
1002 :     }
1003 :     # Return the result.
1004 :     return %retVal;
1005 :     }
1006 :    
1007 :     =head3 DistanceMatrix
1008 :    
1009 :     C<< my %distances = SimBlocks::DistanceMatrix(\%sequences, \%distances); >>
1010 :    
1011 :     Compute the distances between the sequences in an alignment. the L</SequenceDistance>
1012 :     method is used to compute the individual distances.
1013 :    
1014 :     =over 4
1015 :    
1016 :     =item sequences
1017 :    
1018 :     Reference to a hash of genome IDs to alignment sequences. The alignment sequences may
1019 :     only contain the characters C<AGCT->, and must all be the same length.
1020 :    
1021 :     =item distances
1022 :    
1023 :     Reference to a hash that maps pairs of letters to distance values. Each letter pair
1024 :     must be represented as a two-character string. Dual strings such as "AG" and "GA" must
1025 :     map to the same distance. If no distance matrix is specified, the default distance
1026 :     matrix is used.
1027 :    
1028 :     =item RETURN
1029 :    
1030 :     Returns a hash that maps genome ID pairs to distance values. The ID pairs are
1031 :     formed using a slash. For example, the distance from C<158878.1> to C<282459.1>
1032 :     would be found attached to the key C<158878.1/282459.1>.
1033 :    
1034 :     =back
1035 :    
1036 :     =cut
1037 :     #: Return Type %
1038 :     sub DistanceMatrix {
1039 :     # Get the parameters.
1040 :     my ($sequences, $distances) = @_;
1041 :     # Declare the return variable.
1042 :     my %retVal = ();
1043 :     # Create a sorted list of the genome IDs.
1044 :     my @genomes = sort keys %{$sequences};
1045 :     # Loop through the genome IDs.
1046 :     for (my $i = 0; $i <= $#genomes; $i++) {
1047 :     my $genomei = $genomes[$i];
1048 :     # Create the 0-distance pair.
1049 :     $retVal{"$genomei/$genomei"} = 0;
1050 :     # Loop through the genomes lexically after this one. We'll
1051 :     # generate one distance and use it for both orderings.
1052 :     for (my $j = $i + 1; $j <= $#genomes; $j++) {
1053 :     my $genomej = $genomes[$j];
1054 :     # Compute the distance.
1055 :     my $distance = SequenceDistance($sequences->{$genomei}, $sequences->{$genomej},
1056 :     $distances);
1057 :     Trace("Distance from $genomei to $genomej is $distance.") if T(Distance => 3);
1058 :     # Store it in the return hash.
1059 :     $retVal{"$genomei/$genomej"} = $distance;
1060 :     $retVal{"$genomej/$genomei"} = $distance;
1061 :     }
1062 :     }
1063 :     # Return the result matrix.
1064 :     return %retVal;
1065 :     }
1066 :    
1067 :     =head3 SequenceDistance
1068 :    
1069 :     C<< my $dist = SimBlocks::SequenceDistance($seq1, $seq2, $distances); >>
1070 :    
1071 :     Return the distance between two sequences. The distance presumes that
1072 :     each alignment is a vector of sorts, with purines (C<A>/C<T>) and pyrmidines (C<G>/C<C>)
1073 :     being a half-unit from themselves and a full unit from each other. The distance is normalized
1074 :     by the number of positions in the alignment. Thus, C<AATGCTT> is 0.6267 from C<ATTTGCA>.
1075 :     The fourth and sixth positions are a full-unit jump and the second, fifth, and seventh
1076 :     positions are half-unit jumps. This works out to
1077 :    
1078 :     sqrt(0 + 0.5^2 + 0 + 1^2 + 0.5^2 + 1^2 + 0.5^2)/sqrt(7)
1079 :    
1080 :     which is approximately 0.6267. Insertion characters (C<->) are considered a distance of
1081 :     0 from everything. This distance model can be overridden by supplying a custom distance
1082 :     matrix as the third parameter.
1083 :    
1084 :     =over 4
1085 :    
1086 :     =item seq1
1087 :    
1088 :     Origin sequence for computing the distance.
1089 :    
1090 :     =item seq2
1091 :    
1092 :     End sequence for computing the distance. Both sequences must be the same length.
1093 :    
1094 :     =item distances
1095 :    
1096 :     Reference to a hash that maps pairs of letters to distance values. Each letter pair
1097 :     must be represented as a two-character string. Dual strings such as "AG" and "GA" must
1098 :     map to the same distance. If no distance matrix is specified, the default distance
1099 :     matrix is used.
1100 :    
1101 :     =back
1102 :    
1103 :     =cut
1104 :     #: Return Type $;
1105 :     sub SequenceDistance {
1106 :     # Get the parameters.
1107 :     my ($seq1, $seq2, $distances) = @_;
1108 :     if (!defined $distances) {
1109 :     $distances = DefaultDistances();
1110 :     }
1111 : parrello 1.3 # Declare the return value. Initially, this will be a sum of squares. We'll
1112 : parrello 1.2 # take the square root and divide out the length later.
1113 :     my $retVal = 0;
1114 :     # Loop through the sequences, character by character, comparing.
1115 :     my $n = length $seq1;
1116 :     for (my $i = 0; $i < $n; $i++) {
1117 :     my $s1 = substr $seq1, $i, 1;
1118 :     my $s2 = substr $seq2, $i, 1;
1119 :     my $step = $distances->{"$s1$s2"};
1120 :     Trace("No step distance found for \"$s1$s2\".") if !defined $step && T(Distance => 1);
1121 :     $retVal += $step * $step;
1122 :     }
1123 :     # Divide by the length and take the square root.
1124 :     $retVal = sqrt($retVal / $n);
1125 :     # Return the result.
1126 :     return $retVal;
1127 :     }
1128 :    
1129 :     =head3 GetBlock
1130 :    
1131 :     C<< my $blockData = $simBlocks->GetBlock($blockID); >>
1132 :    
1133 :     Return a B<DBObject> for a specified group block.
1134 :    
1135 :     =over 4
1136 :    
1137 :     =item blockID
1138 :    
1139 :     ID of the desired block.
1140 :    
1141 :     =item RETURN
1142 :    
1143 :     Returns a B<DBObject> for the group block in question. The object allows access to
1144 :     all of the fields in the C<GroupBlock> relation of the database.
1145 :    
1146 :     =back
1147 :    
1148 :     =cut
1149 :     #: Return Type $%;
1150 :     sub GetBlock {
1151 :     # Get the parameters.
1152 :     my ($self, $blockID) = @_;
1153 :     # Get the group block.
1154 :     my $retVal = $self->GetEntity('GroupBlock', $blockID);
1155 :     # Return the result.
1156 :     return $retVal;
1157 :     }
1158 :    
1159 : parrello 1.3 =head3 GetBlockPieces
1160 :    
1161 :     C<< my @blocks = $blockData->GetBlockPieces($location); >>
1162 :    
1163 :     Return a map of the block pieces inside the specified location. The return
1164 :     value will be a list of block locations. A block location is essentially a
1165 :     location string with a block ID in place of the contig ID.
1166 :    
1167 :     This routine depends upon the fact that there is no overlap between blocks.
1168 :     A given nucleotide on a Contig is in exactly one block. Therefore, to find
1169 :     all the blocks that overlap a specific section of the Contig, we find the
1170 :     first block that ends after the start of the section and proceed until we
1171 :     find the last block that begins before the end of the section. The only
1172 :     complexity that occurs is if the Contig folds back in upon itself and
1173 :     appears twice in the same block. In this case, we might get two segments from
1174 :     the same block, and they may or may not overlap.
1175 :    
1176 :     =over 4
1177 :    
1178 :     =item location
1179 :    
1180 :     A basic location object or location string defining the range whose block pieces
1181 :     are desired.
1182 :    
1183 :     =item RETURN
1184 :    
1185 :     Returns a list of block location strings. Each string consists of a block ID
1186 :     followed by a start location and an end location. The three pieces of the
1187 :     string are separated by underscores (C<_>).
1188 :    
1189 :     =back
1190 :    
1191 :     =cut
1192 :     #: Return Type %;
1193 :     sub GetBlockPieces {
1194 :     # Get the parameters.
1195 :     my ($self, $location) = @_;
1196 :     # Declare the return variable.
1197 :     my @retVal = ();
1198 :     # Parse the incoming location.
1199 :     my $loc = BasicLocation->new($location);
1200 :     # Determine the parameters needed to get the desired list of regions.
1201 :     my ($left, $right, $contigID) = ($loc->Left, $loc->Right, $loc->Contig);
1202 : parrello 1.5 Trace("Searching for regions near " . $loc->String) if T(4);
1203 : parrello 1.3 # Ask for the regions in the section we want.
1204 :     my @regions = $self->GetAll(['Region', 'IncludesRegion', 'GroupBlock'],
1205 : parrello 1.6 'Region(endpoint) >= ? AND Region(position) <= ? AND Region(contigID) = ?',
1206 : parrello 1.3 [$left, $right, $contigID],
1207 :     ['GroupBlock(id)', 'GroupBlock(pattern)',
1208 :     'GroupBlock(len)',
1209 : parrello 1.6 'Region(position)', 'Region(endpoint)',
1210 : parrello 1.3 'Region(content)']);
1211 :     # Loop through the regions found. For each region we will output a location string.
1212 :     for my $regionData (@regions) {
1213 :     # Declare the variable to contain the formatted location.
1214 :     my $blockLoc;
1215 :     # Separate the fields from the region data.
1216 :     my ($blockID, $pattern, $len, $position, $end, $content) = @{$regionData};
1217 :     # Determine whether we're using the full region or only part of it.
1218 :     if ($position >= $left && $end <= $right) {
1219 :     # Here we're using the full region, so the location is simple.
1220 :     $blockLoc = "${blockID}_1_${len}";
1221 :     } else {
1222 :     # This is the hard part. We need to convert the contig positions to
1223 :     # block positions. Since the block contains insertion characters and
1224 :     # the regions don't, this requires fancy dancing. First, we get the
1225 :     # region's DNA string.
1226 :     my $dnaSequence = TagDNA($pattern, $content);
1227 :     # Let's call the "area of interest" the part of the Contig that
1228 :     # is in the incoming location AND in the current region. This could
1229 :     # be at either end of the region or somewhere in the middle. We
1230 :     # need to walk the block DNA to find what positions in the block
1231 :     # correspond to the desired positions in the contig. First, we
1232 :     # put ourselves at the beginning of the block and the region,
1233 :     # and walk to the start point. Note that our position arithmetic
1234 :     # is 1-based, in keeping with the conventions of biologists.
1235 :     my $blockPos = 1;
1236 :     my $contigPos = $position;
1237 :     # Check to see if we can use the current position as are starting
1238 :     # point or we need to find it.
1239 :     if ($left > $position) {
1240 :     # Here we need to find the left edge of the area of interest.
1241 :     $blockPos = WalkDNA($blockPos, $contigPos, $dnaSequence, $left);
1242 :     # Update the contig position to match.
1243 :     $contigPos = $left;
1244 :     }
1245 :     # Save the left edge.
1246 :     my $blockLeft = $blockPos;
1247 :     # Now find the right edge.
1248 :     if ($right >= $end) {
1249 :     # Here the right edge is the end of the block.
1250 :     $blockPos = $len;
1251 :     } else {
1252 :     # Here we have to find the right edge of the area of interest.
1253 :     $blockPos = WalkDNA($blockPos, $contigPos, $dnaSequence, $right);
1254 :     }
1255 :     my $blockRight = $blockPos;
1256 :     # Format the location.
1257 :     $blockLoc = "${blockID}_${blockLeft}_${blockRight}";
1258 :     }
1259 :     # Push the block location onto the return list.
1260 :     push @retVal, $blockLoc;
1261 :     }
1262 :     # Return the result.
1263 :     return @retVal;
1264 :     }
1265 :    
1266 : parrello 1.4 =head3 GetFeatureBlockPieces
1267 :    
1268 :     C<< my @pieces = $simBlocks->GetFeatureBlockPieces($fig, \@featureIDs, $distance); >>
1269 :    
1270 :     Return a list of the block pieces within the specified distance of the
1271 :     specified features. This method essentially computes locations from
1272 :     the distance and the feature locations, then passes them to L/GetBlockPieces>.
1273 :     That method will return the sections of various similarity blocks that
1274 :     occur inside the the locations computed.
1275 :    
1276 :     =over 4
1277 :    
1278 :     =item fig
1279 :    
1280 :     A FIG-like object that can be used to convert features into locations.
1281 :    
1282 :     =item featureIDs
1283 :    
1284 :     Reference to a list of feature IDs. Blocks in the vicinity of any of the
1285 :     features are returned.
1286 :    
1287 :     =item distance
1288 :    
1289 :     Distance to search from the feature's actual location.
1290 :    
1291 :     =item RETURN
1292 :    
1293 :     Returns a list of block piece location objects. A block piece location object
1294 :     is a standard B<BasicLocation> object with a block ID instead of a contig ID.
1295 :    
1296 :     =back
1297 :    
1298 :     =cut
1299 :     #: Return Type @;
1300 :     sub GetFeatureBlockPieces {
1301 :     # Get the parameters.
1302 :     my ($self, $fig, $featureIDs, $distance) = @_;
1303 :     # Declare a hash for keeping the return data. This will weed out
1304 :     # duplicates, of which we expect plenty. We'll need to handle
1305 :     # overlaps later, however.
1306 :     my %retHash = ();
1307 :     # Loop through the features.
1308 :     for my $featureID (@{$featureIDs}) {
1309 :     # Get the feature's genome ID.
1310 :     my $genomeID = FIG::genome_of($featureID);
1311 :     # Get the feature's locations.
1312 :     my @locations = $fig->feature_location($featureID);
1313 :     # Loop through the locations, sticking the resulting pieces in the return
1314 :     # hash.
1315 :     for my $loc (@locations) {
1316 :     my $locObject = BasicLocation->new($loc);
1317 :     # Get the length of the contig in question.
1318 :     my $len = $fig->contig_ln($genomeID, $locObject->Contig);
1319 :     # Expand the location.
1320 : parrello 1.5 $locObject->Widen($distance, $len);
1321 :     # Insure the location is Sprout-style;
1322 :     $locObject->FixContig($genomeID);
1323 : parrello 1.4 # Get the desired block pieces.
1324 :     my @pieces = $self->GetBlockPieces($locObject);
1325 : parrello 1.5 Trace(scalar(@pieces) . " pieces found for location $loc.") if T(4);
1326 : parrello 1.4 # Put them in the hash.
1327 :     for my $piece (@pieces) {
1328 :     $retHash{$piece} = 1;
1329 :     }
1330 :     }
1331 :     }
1332 :     # Now we have all the block pieces that occur in any one of our regions. The
1333 :     # next step is to merge adjacent and overlapping regions. First we convert
1334 :     # the pieces to location objects so we can sort them.
1335 :     my @retVal = ();
1336 :     for my $piece (keys %retHash) {
1337 :     my $loc = BasicLocation->new($piece);
1338 : parrello 1.5 push @retVal, $loc;
1339 : parrello 1.4 }
1340 : parrello 1.5 Trace("Beginning sort.") if T(3);
1341 : parrello 1.4 @retVal = sort { BasicLocation::Cmp($a,$b) } @retVal;
1342 : parrello 1.5 Trace(scalar(@retVal) . " pieces found before overlap check.") if T(3);
1343 : parrello 1.4 # Now the locations are sorted by block ID, start position, and descending
1344 :     # length. This means that if there's an overlap, the two overlapping
1345 :     # pieces will be next to each other.
1346 :     my $i = 0;
1347 :     while ($i < $#retVal) {
1348 :     # Check for overlap between this location and the next.
1349 :     my ($type, $len) = Overlap::CheckOverlap($retVal[$i], $retVal[$i+1]);
1350 :     if ($len == 0) {
1351 :     # Here there is no overlap, so we push ahead.
1352 :     $i++;
1353 :     } elsif ($type eq 'embedded') {
1354 :     # Here the second piece is inside the first, so we delete the
1355 :     # second.
1356 :     delete $retVal[$i+1];
1357 :     } else {
1358 :     # Here we have a normal overlap. Because all of our pieces
1359 :     # are forward locations, this means the second piece starts
1360 :     # before the end of the of the first. We set the end point
1361 :     # if the first to the end point of the second and delete the
1362 :     # second.
1363 :     $retVal[$i]->SetEnd($retVal[$i+1]->EndPoint);
1364 :     delete $retVal[$i+1];
1365 :     }
1366 :     }
1367 :     # Return the result.
1368 :     return @retVal;
1369 :     }
1370 :    
1371 : parrello 1.3 =head3 WalkDNA
1372 :    
1373 :     C<< my $blockPos = SimBlocks::WalkDNA($blockPos, $contigPos, $dna, $loc); >>
1374 :    
1375 :     Location the desired position within a block of a specified location in a contig.
1376 :    
1377 :     The idea here is that a block contains insertion characters and a contig doesn't.
1378 :     The incoming DNA sequence is for the block. On entry, we have a current position
1379 :     in the block and a current position in the contig. We advance the two pointers
1380 :     in parallel until the contig pointer equals the desired location.
1381 :    
1382 :     =over 4
1383 :    
1384 :     =item blockPos
1385 :    
1386 :     Current position in the block (1-based).
1387 :    
1388 :     =item contigPos
1389 :    
1390 :     Current position in the contig.
1391 :    
1392 :     =item dna
1393 :    
1394 :     DNA string for the block with the points of variance filled in with data from
1395 :     the contig. In other word's the block's version of the current contig region.
1396 :    
1397 :     =item loc
1398 :    
1399 :     Desired location in the contig.
1400 :    
1401 :     =item RETURN
1402 :    
1403 :     Returns the position in the block corresponding to the desired position in the
1404 :     contig.
1405 :    
1406 :     =back
1407 :    
1408 :     =cut
1409 :     #: Return Type $;
1410 :     sub WalkDNA {
1411 :     # Get the parameters.
1412 :     my ($blockPos, $contigPos, $dna, $loc) = @_;
1413 :     # Copy the incoming positions into variables with which we can play. While
1414 :     # we're at it, we'll convert the block position from 1-based to 0-based.
1415 :     my ($retVal, $contigPtr) = ($blockPos - 1, $contigPos);
1416 :     # Loop until we find our destination point.
1417 :     while ($contigPtr < $loc) {
1418 :     # Find the next insertion character in the DNA string.
1419 :     my $nextHyphen = index $dna, '-', $retVal;
1420 :     # At this point, "nextHyphen" is either -1 (indicating no hyphen
1421 :     # found) or it is positioned on a hyphen. If it's -1, move it after
1422 :     # the end of the dna string.
1423 :     if ($nextHyphen < 0) {
1424 :     $nextHyphen = length $dna;
1425 :     }
1426 :     # Now, if we subtract $retVal from $nextHyphen, we have the largest
1427 :     # value by which we can advance the contig pointer without getting
1428 :     # out of sync. The first step is to see if this would move us past
1429 :     # our desired location.
1430 :     my $offset = $nextHyphen - $retVal;
1431 :     my $contigNext = $contigPtr + $offset;
1432 :     if ($contigNext >= $loc) {
1433 :     # We now know that everything between the current position and
1434 :     # the desired position is real nucleotides. We simply push both
1435 :     # pointers forward by the same amount so that the contig pointer
1436 :     # becomes $loc.
1437 :     $retVal += $loc - $contigPtr;
1438 :     $contigPtr = $loc
1439 :     } else {
1440 :     # Here we need to go the whole distance to the next hyphen and
1441 :     # keep going.
1442 :     $contigPtr += $offset;
1443 :     $retVal += $offset;
1444 :     $retVal++;
1445 :     }
1446 :     }
1447 :     # Return the result. Note we convert back to 1-based.
1448 :     return $retVal + 1;
1449 :     }
1450 :    
1451 : parrello 1.1 1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3