[Bio] / Sprout / SimBlocks.pm Repository:
ViewVC logotype

Diff of /Sprout/SimBlocks.pm

Parent Directory Parent Directory | Revision Log Revision Log | View Patch Patch

revision 1.1, Wed May 4 03:24:43 2005 UTC revision 1.5, Fri Jan 13 07:00:28 2006 UTC
# Line 3  Line 3 
3      require Exporter;      require Exporter;
4      use ERDB;      use ERDB;
5      @ISA = qw(Exporter ERDB);      @ISA = qw(Exporter ERDB);
6      @EXPORT = qw();      @EXPORT = qw(DefaultDistances);
7      @EXPORT_OK = qw();      @EXPORT_OK = qw(BlocksInSet GetBlock GetBlockPieces CompareGenomes DBLoad DistanceMatrix GetAlignment GetRegions ParsePattern RemoveBlocks SequenceDistance SetNumber SnipScan);
8    
9      use strict;      use strict;
10      use Tracer;      use Tracer;
11      use PageBuilder;      use PageBuilder;
12      use Genome;      use Genome;
13        use BasicLocation;
14        use Overlap;
15      use FIG_Config;      use FIG_Config;
16        use FIG;
17    
18  =head1 Similarity Block Database  =head1 Similarity Block Database
19    
# Line 23  Line 25 
25  by the C<SimBlocksDBD.xml> file.  by the C<SimBlocksDBD.xml> file.
26    
27  The primary entities of the database are B<Genome>, which represents a specified  The primary entities of the database are B<Genome>, which represents a specified
28  organism, B<Group>, which represents a set of similar regions, and B<Contig>,  organism, B<GroupBlock>, which represents a set of similar regions, and B<Contig>,
29  which represents a contiguous segment of DNA. The key relationship is  which represents a contiguous segment of DNA. The key relationship is
30  B<ContainsRegionIn>, which maps groups to contigs. The mapping information  B<ContainsRegionIn>, which maps blocks to contigs. The mapping information
31  includes the DNA nucleotides as well as the location in the contig.  includes the DNA nucleotides as well as the location in the contig.
32    
33  For a set of genomes, we will define a I<common group> as a group that  =head2 Formal Definitions
34  is present in most of the genomes. (The definition of I<most> is  
35  determined by a parameter.)  Let {B<G(1)>, B<G(2)>, ..., B<G(>I<n>B<)>} be a set of genomes
36    and let {B<g(1)>, B<g(2)>, ..., B<g(>I<m>B<)>} be the set of
37    genes called on these genomes.
38    
39    The primitive notion from which similarity blocks are derived is the
40    I<correspondence mapping>. Roughly, this mapping connects genes on
41    one genome to corresponding genes on other genomes. We will treat
42    the mapping as an equivalence class. In practice, the entire mapping
43    will not be available when we are building to load files for the
44    similarity block database. Instead, the user will provide a subset
45    of the mapping from which the entire partitioning can be deduced
46    via commutativity, reflexivity, and transitivity. This subset is
47    called the I<correspondence relation>.
48    
49    In actual practice, the correspondence relation will contain some
50    false positives. The process of developing the correspondence
51    relation usually involves analyzing similarities above a certain
52    threshhold. (This type of similarity is not necessarily transitive;
53    however, the true, underlying correspondence mapping we want I<is>
54    transitive.) Often, a sequence of correspondence pairs that maps
55    a gene on one particular genome to another gene on the same genome
56    is an indication of a false positive.
57    
58    In addition to false positives, we may have invalid genes, missing
59    genes, frame shifts, and incorrect start locations. Detecting and
60    cleaning these problems is an important part of getting a good
61    correspondence relation.
62    
63    Given a full-blown correspondence mapping, we define the following
64    concepts.
65    
66    =over 4
67    
68    =item corresponding gene
69    
70    The I<corresponding genes> of a given gene B<g(>I<i>B<)> are those
71    genes mapped to it by the correspondence mapping.
72    
73    =item gene block
74    
75    A I<gene block> is a maximal set of corresponding genes.
76    
77    =item intergenic region
78    
79    An I<intergenic region> is a section of the contig between
80    two adjacent genes. The adjacent genes are called the
81    I<left gene> and the I<right gene>, depending on whether they
82    precede or follow the region. For example, if C<G2_100+50>
83    and C<G2_175+100> are adjacent, C<G2_150+25> is their
84    intergenic region, the left gene is C<G2_100+50>, and
85    the right gene is C<G2_175+100>.
86    
87    =item intergenic block
88    
89    An I<intergenic block> is a maximal set of intergenic regions such
90    that (1) the left genes all correspond and have the same orientation,
91    (2) the right genes all correspond and have the same orientation, and
92    (3) the region lengths differ by no more than a specified threshhold
93    (usually 10%).
94    
95    =item similarity block
96    
97    A I<similarity block> is either a gene block or an intergenic block.
98    A similarity block is therefore a set of regions (either gene regions
99    or intergenic regions) that are tied together by the correspondence
100    mapping.
101    
102    =item common block
103    
104    A I<common block> with respect to a subset of the genomes
105    {B<G(>I<i1>B<)>, B<G(>I<i2>B<)>, ... B<G(>I<ik>B<)>} is a
106    similarity block that has exactly one region in each
107    of the genomes of the subset. There is a lesser notion of
108    a block that occurs in most genomes in the subset. The actual
109    program allows the notion of I<most> to be defined externally.
110    The default is all of the genomes in the subset.
111    
112    =item unique sequence
113    
114    A I<unique sequence> is a region that does not appear in any
115    similarity block with another region. In other words, a unique
116    sequence is a region that does not correspond to any other region.
117    
118    =back
119    
120    =head2 Usage
121    
122  The similarity block database must be able to answer two questions, taking  The similarity block database must be able to answer two questions, taking
123  as input two sets of genomes.  as input two genome sets. (These would be subsets of the total set of genomes
124    represented in the correspondence mapping.)
125    
126  =over 4  =over 4
127    
128  =item (1) Which groups are common in both sets and which are common in only one or the other?  =item (1) Which blocks are common in one set and do not occur in the other.
129    
130  =item (2) For each group that is common in both sets, how do the individual regions differ?  These are considered major variations. They result from prophage insertions,
131    transposition events, and recombinations. A major variation results in a
132    different set of proteins in an operating cell.
133    
134    =item (2) What individual variations in the shared common blocks distinguish the two sets?
135    
136    These are considered minor variations, and are the most common result of
137    single-nucleotide mutations. Rather than changing the set of proteins,
138    minor variations change the proteins themselves.
139    
140  =back  =back
141    
142  This class is a subclass of B<ERDB>, and inherits all of its public methods.  This class is a subclass of B<ERDB>, and inherits all of its public methods.
143    
144    =head2 Setup
145    
146    To begin working with similarity blocks, you must first create a database to
147    contain the block data. Next, you must specify the various similarity block
148    parameters in C<FIG_Config>. These are as follows. (Of course, the values
149    you specify will be different.)
150    
151        $simBlocksDB     = "conser5_blocks";    # SimBlocks database schema name
152        $simBlocksData   = "$data/BlockData";   # directory containing SimBlocks data files
153                                                # (used to load the Similarity database)
154        $simBlastData    = "$data/BlastData";   # directory containing SimBlast data files
155                                                # (input to SimLoad.pl)
156    
157    Run the C<SimLoad> script to load the database.
158    
159    The C<SimBlockForm.cgi> service enables you to run a similarity analysis between genome
160    sets in the database. The C<Html/ERDBTest.html> page enables you to run general queries
161    against the database and execute simple scripts (however, this latter capability is
162    disabled unless you have
163    
164        $debug_mode      = 1;                   # TRUE to enable the debugging scripts
165    
166    The Similarity Block methods and scripts work in the Windows version of the SEED (see
167    C<WinBuild>).
168    
169  =cut  =cut
170    
171  #: Constructor SimBlocks->new();  #: Constructor SimBlocks->new();
# Line 62  Line 183 
183  will be used. The user name and password for connecting are taken from  will be used. The user name and password for connecting are taken from
184  C<FIG_Config::dbuser> and C<FIG_Config::dbpass>.  C<FIG_Config::dbuser> and C<FIG_Config::dbpass>.
185    
186    In almost every case, you will be calling
187    
188    C<< my $simdb = SimBlocks->new(); >>
189    
190  =over 4  =over 4
191    
192  =item dbname (optional)  =item dbname (optional)
# Line 106  Line 231 
231      return $retVal;      return $retVal;
232  }  }
233    
234    =head3 DefaultDistances
235    
236    C<< my $distances = DefaultDistances(); >>
237    
238    Return the default distance matrix for computing the alignment distances (see
239    also L</DistanceMatrix>. This matrix returns a distance of 0 between an insertion
240    character (C<->) and any other nucleotide, a distance of 0.5 between nucleotides
241    of the same type, and a distance of 1 between nucleotides of different types.
242    
243    =cut
244    
245    my $DefaultDistanceMatrix = { AA  => 0,      AC  => 1,     AG  => 0.5,   AT  => 1,
246                                  AN  => 0.625,  AR  => 0.25,  AY  => 1,    'A-' => 0,
247                                  CA  => 1,      CC  => 0,     CG  => 1,     CT  => 0.5,
248                                  CN  => 0.625,  CR  => 1,     CY  => 0.25, 'C-' => 0,
249                                  GA  => 0.5,    GC  => 1,     GG  => 0,     GT  => 1,
250                                  GN  => 0,      GR  => 0.25,  GY  => 1,    'G-' => 0,
251                                  TA  => 1,      TC  => 0.5,   TG  => 1,     TT  => 0,
252                                  TN  => 0,      TR  => 1,     TY  => 0.25, 'T-' => 0,
253                                  RA  => 0.25,   RC  => 1,     RG  => 0.25,  RT  => 1,
254                                  RN  => 0.625,  RR  => 0.25,  RY  => 1,    'R-' => 0,
255                                  YA  => 1,      YC  => 0.25,  YG  => 1,     YT  => 0.25,
256                                  YN  => 0.625,  YR  => 1,     YY  => 0.25, 'Y-' => 0,
257                                  NA  => 0.625,  NC  => 0.625, NG  => 0.625, NT  => 0.625,
258                                  NN  => 0.625,  NR  => 0.625, NY  => 0.625,'N-' => 0,
259                                 '-A' => 0,     '-C' => 0,    '-G' => 0,    '-T' => 0,
260                                 '-N' => 0,     '-R' => 0,    '-Y' => 0,    '--' => 0 };
261    
262    sub DefaultDistances {
263        return $DefaultDistanceMatrix;
264    }
265    
266  =head3 DBLoad  =head3 DBLoad
267    
268  C<< my $stats = $simBlocks->DBLoad($rebuild); >>  C<< my $stats = $simBlocks->DBLoad($rebuild); >>
# Line 119  Line 276 
276  =item rebuild  =item rebuild
277    
278  TRUE if the tables should be re-created; FALSE if the tables should be cleared and  TRUE if the tables should be re-created; FALSE if the tables should be cleared and
279  reloaded. The default is FALSE; however, TRUE is considerably faster.  reloaded. The default is FALSE; however, TRUE is considerably faster. You would use
280    FALSE if you had added non-standard indexes to the database and didn't want to lose
281    them.
282    
283  =item RETURN  =item RETURN
284    
# Line 141  Line 300 
300    
301  =head3 CompareGenomes  =head3 CompareGenomes
302    
303  C<< my (\%set1Blocks, \%set2Blocks, \%bothBlocks) = $simBlocks->CompareGenomes(\@set1, \@set2); >>  C<< my (\%set0Blocks, \%set1Blocks, \%bothBlocks) = $simBlocks->CompareGenomes(\@set0, \@set1); >>
304    
305  Analyze two sets of genomes for commonalities. The group blocks returned will be divided  Analyze two sets of genomes for commonalities. The group blocks returned will be divided
306  into three hashes: one for those found only in set 1, one for those found only in set 2,  into three hashes: one for those common to set 0 and not occurring at all in set 1, one
307  and one for those found in both sets. Each hash is keyed by group ID and will contain  for those common to set 1 and not occurring at all in set 0, and one for those common
308  B<DBObject>s for B<GroupBlock> records with B<HasInstanceOf> data attached, though the  to both sets. Each hash is keyed by group ID and will contain B<DBObject>s for
309  genome ID in the B<HasInstanceOf> section is not generally predictable.  B<GroupBlock> records with B<HasInstanceOf> data attached, though the genome ID in
310    the B<HasInstanceOf> section is not generally predictable.
311    
312  =over 4  =over 4
313    
314  =item set1  =item set0
315    
316  Reference to a list of genome IDs for the genomes in the first set.  Reference to a list of genome IDs for the genomes in the first set (set 0).
317    
318  =item set2  =item set1
319    
320  Reference to a list of genome IDs for the genomes in the second set.  Reference to a list of genome IDs for the genomes in the second set (set 1).
321    
322  =item RETURN  =item RETURN
323    
324  Returns a list of hashes of lists. Each hash is keyed by group ID, and will contain  Returns a triple of hashes. Each hash is keyed by group ID, and will contain
325  B<DBObject>s for records in the B<GroupBlock> table. Groups found only in set  B<DBObject>s for records in the B<GroupBlock> table. Groups found in all of the
326  1 will be in the first hash, groups found only in set 2 will be in the second  genomes in set 0 but in none of the genomes of set 1 will be in the first hash,
327  hash, and groups found in both sets will be in the third hash.  groups found in all of the genomes in set 1 but in none of the genomes of set 0
328    will be in the second hash, and groups found in all of the genomes of both sets
329    will be in the third hash.
330    
331  =back  =back
332    
# Line 172  Line 334 
334  #: Return Type @%;  #: Return Type @%;
335  sub CompareGenomes {  sub CompareGenomes {
336      # Get the parameters.      # Get the parameters.
337      my ($self, $set1, $set2) = @_;      my ($self, $set0, $set1) = @_;
338      # Declare the three hashes.      # Declare the three hashes.
339      my (%set1Blocks, %set2Blocks, %bothBlocks);      my (%set0Blocks, %set1Blocks, %bothBlocks);
340      # Our first task is to find the groups in each genome set.      # Our first task is to find the common groups in each genome set.
341        my %groups0 = $self->BlocksInSet($set0);
342      my %groups1 = $self->BlocksInSet($set1);      my %groups1 = $self->BlocksInSet($set1);
     my %groups2 = $self->BlocksInSet($set2);  
343      # Create a trailer key to help tighten the loops.      # Create a trailer key to help tighten the loops.
344      my $trailer = "\xFF";      my $trailer = "\xFF";
345      # The groups are hashed by group ID. We will move through them in key order      # The groups are hashed by group ID. We will move through them in key order
346      # to get the "set1", "set2", and "both" groups.      # to get the "set0", "set1", and "both" groups.
347        my @blockIDs0 = (sort keys %groups0, $trailer);
348      my @blockIDs1 = (sort keys %groups1, $trailer);      my @blockIDs1 = (sort keys %groups1, $trailer);
     my @blockIDs2 = (sort keys %groups2, $trailer);  
349      # Prime the loop by getting the first ID from each list. Thanks to the trailers      # Prime the loop by getting the first ID from each list. Thanks to the trailers
350      # we know this process can't fail.      # we know this process can't fail.
351        my $blockID0 = shift @blockIDs0;
352      my $blockID1 = shift @blockIDs1;      my $blockID1 = shift @blockIDs1;
353      my $blockID2 = shift @blockIDs2;      while ($blockID0 lt $trailer || $blockID1 lt $trailer) {
     while ($blockID1 lt $trailer || $blockID2 lt $trailer) {  
354          # Compare the block IDs.          # Compare the block IDs.
355          if ($blockID1 lt $blockID2) {          if ($blockID0 lt $blockID1) {
356              # Block 1 is only in the first set.              # Block 0 is only in the first set.
357                Trace("Block IDs $blockID0 vs. $blockID1. Set 0 selected.") if T(SimCompare => 3);
358                $set0Blocks{$blockID0} = $groups0{$blockID0};
359                $blockID0 = shift @blockIDs0;
360            } elsif ($blockID0 gt $blockID1) {
361                # Block 1 is only in the second set.
362                Trace("Block IDs $blockID0 vs. $blockID1. Set 1 selected.") if T(SimCompare => 3);
363              $set1Blocks{$blockID1} = $groups1{$blockID1};              $set1Blocks{$blockID1} = $groups1{$blockID1};
364              $blockID1 = shift @blockIDs1;              $blockID1 = shift @blockIDs1;
         } elsif ($blockID1 gt $blockID2) {  
             # Block 2 is only in the second set.  
             $set2Blocks{$blockID2} = $groups2{$blockID2};  
             $blockID2 = shift @blockIDs2;  
365          } else {          } else {
366              # We have a block in both sets.              # We have a block in both sets.
367                Trace("Block IDs $blockID0 vs. $blockID1. Both selected.") if T(SimCompare => 3);
368              $bothBlocks{$blockID1} = $groups1{$blockID1};              $bothBlocks{$blockID1} = $groups1{$blockID1};
369                $blockID0 = shift @blockIDs0;
370              $blockID1 = shift @blockIDs1;              $blockID1 = shift @blockIDs1;
             $blockID2 = shift @blockIDs2;  
371          }          }
372      }      }
373        # The set1 and set2 lists are too big. They contain groups that are common in their
374        # respective sets but not common in both sets. We want groups that are common in
375        # their respective sets but completely absent in the other set. We do this by getting
376        # a complete list of the blocks in each set and deleting anything we find.
377        $self->RemoveBlocks(\%set0Blocks, $set1);
378        $self->RemoveBlocks(\%set1Blocks, $set0);
379      # Return the result.      # Return the result.
380      return (\%set1Blocks, \%set2Blocks, \%bothBlocks);      return (\%set0Blocks, \%set1Blocks, \%bothBlocks);
381    }
382    
383    =head3 RemoveBlocks
384    
385    C<< $simBlocks->RemoveBlocks(\%blockMap, \@set); >>
386    
387    Remove from the specified block map any blocks that occur in the specified set of genomes.
388    The block map can contain any data, but it must be keyed by block ID.
389    
390    =over 4
391    
392    =item blockMap
393    
394    Reference to a hash of block IDs to block data. The character of the block data is
395    irrelevant to this method.
396    
397    =item set
398    
399    Reference to a list of genome IDs. Any block that occurs in any one of the specified
400    genomes will be removed from the block map.
401    
402    =back
403    
404    =cut
405    #: Return Type ;
406    sub RemoveBlocks {
407        # Get the parameters.
408        my ($self, $blockMap, $set) = @_;
409        # Get a list of the blocks in the genome set.
410        my %setBlocks = $self->BlocksInSet($set, 1);
411        # Loop through the incoming block map, deleting any blocks found
412        # in the new block set. Note we make the assumption that the
413        # incoming block map is smaller.
414        my @blockList = keys %{$blockMap};
415        for my $blockID (@blockList) {
416            if (exists $setBlocks{$blockID}) {
417                delete $blockMap->{$blockID};
418            }
419        }
420  }  }
421    
422  =head3 BlocksInSet  =head3 BlocksInSet
423    
424  C<< my %blockList = $simBlocks->BlocksInSet($set); >>  C<< my %blockList = $simBlocks->BlocksInSet($set, $count); >>
425    
426  Return a list of the group blocks found in a given set of genomes. The list returned  Return a list of the group blocks found in a given number of the genomes in a given
427  will be a hash of B<DBObject>s, each corresponding to a single B<GroupBlock> record,  set. The list returned will be a hash of B<DBObject>s, each corresponding to a single
428  with a B<HasInstanceOf> record attached, though the content of the B<HasInstanceOf>  B<GroupBlock> record, with a B<HasInstanceOf> record attached, though the content of
429  record is not predictable. The hash will be keyed by group ID.  the B<HasInstanceOf> record is not predictable. The hash will be keyed by block ID.
430    
431  =over 4  =over 4
432    
433  =item set  =item set
434    
435  Reference to a list of genome IDs. All blocks appearing in any one of the genome  Reference to a list of genome IDs.
436  IDs will be in the list returned.  
437    =item count (optional)
438    
439    Minimum number of genomes from the set in which the block must appear. If C<1> is
440    specified, the list will include any block that occurs in at least one genome.
441    If C<5> is specified, the list will only include blocks that occur in at least
442    5 genomes from the set. If the set consists of only 4 genomes in this latter
443    case, no blocks will be returned. The default is to return blocks that occur in
444    all genomes of the set.
445    
446  =item RETURN  =item RETURN
447    
# Line 236  Line 454 
454  #: Return Type %%;  #: Return Type %%;
455  sub BlocksInSet {  sub BlocksInSet {
456      # Get the parameters.      # Get the parameters.
457      my ($self, $set) = @_;      my ($self, $set, $count) = @_;
458        # If the count was not specified, set it t
459        if (!defined $count) {
460            $count = @{$set};
461        }
462        Trace("BlocksInSet called for $count genomes of set (" . join(", ", @{$set}) . ").") if T(2);
463      # Create a hash to hold the groups found. The hash will be keyed by group ID      # Create a hash to hold the groups found. The hash will be keyed by group ID
464      # and contain the relevant DB object.      # and contain the relevant DB object.
465      my %retVal = ();      my %retVal = ();
466        # Create a hash to contain the number of genomes in which each block was found.
467        my %counters = ();
468      # Loop through the genome IDs in the specified set.      # Loop through the genome IDs in the specified set.
469      for my $genomeID (@{$set}) {      for my $genomeID (@{$set}) {
470          # Get a list of group blocks for this genome.          # Get a list of group blocks for this genome.
# Line 249  Line 474 
474          for my $block (@blocks) {          for my $block (@blocks) {
475              # Get the ID of this block.              # Get the ID of this block.
476              my ($blockID) = $block->Value('GroupBlock(id)');              my ($blockID) = $block->Value('GroupBlock(id)');
477              # Store it in the hash. If it's a duplicate, it will erase the              # Determine whether this block is new or has been found in a previous
478              # old one.              # genome of the set.
479                if (exists $counters{$blockID}) {
480                    # Here it's old. Increment its counter.
481                    $counters{$blockID}++;
482                } else {
483                    # Here it's new. Create a counter for it.
484                    $counters{$blockID} = 1;
485                }
486                # If it meets the threshhold, put it in the return hash. If it's already
487                # there, we'll simply overwrite the old copy, which makes no difference
488                # to the caller.
489                Trace("Block $blockID has $counters{$blockID} occurrences as of genome $genomeID.") if T(SimSet => 4);
490                if ($counters{$blockID} >= $count) {
491              $retVal{$blockID} = $block;              $retVal{$blockID} = $block;
492                    Trace("Block $blockID added to set.") if T(SimSet => 3);
493                }
494          }          }
495      }      }
496      # Return the result.      # Return the result.
# Line 291  Line 530 
530      my ($self, $blockID, $genomes) = @_;      my ($self, $blockID, $genomes) = @_;
531      # Create the return hash.      # Create the return hash.
532      my %retVal = ();      my %retVal = ();
533        # Count the regions found.
534        my $regionCount = 0;
535      # Query all the regions for the specified block.      # Query all the regions for the specified block.
536      my $query = $self->Get(['ContainsRegionIn'], "ContainsRegionIn(from-link) = ?", $blockID);      my $query = $self->Get(['IncludesRegion', 'Region'], "IncludesRegion(from-link) = ?",
537                               $blockID);
538      # Loop through the query.      # Loop through the query.
539      while (my $region = $query->Fetch) {      while (my $region = $query->Fetch) {
540          # Get this region's data.          # Get this region's data.
541          my ($contigID, $start, $dir, $len, $dna) =          my ($location, $dna) =
542              $region->Values(['ContainsRegionIn(to-link)', 'ContainsRegionIn(position)',              $region->Values(['Region(id)', 'Region(content)']);
                              'ContainsRegionIn(direction)', 'ContainsRegionIn(len)',  
                              'ContainsRegionIn(content)']);  
543          # Insure it belongs to one of our genomes.          # Insure it belongs to one of our genomes.
544          my $genomeID = Genome::FromContig($contigID);          my $found = SetNumber($location, $genomes);
         my $found = grep($_ eq $genomeID, @{$genomes});  
545          # If it does, put it in the hash.          # If it does, put it in the hash.
546          if ($found) {          if ($found >= 0) {
             my $location = "${contigID}_$start$dir$len";  
547              $retVal{$location} = $dna;              $retVal{$location} = $dna;
548                $regionCount++;
549          }          }
550      }      }
551        Trace("$regionCount regions found by GetRegions.") if T(2);
552      # Return the result.      # Return the result.
553      return %retVal;      return %retVal;
554  }  }
555    
556    =head3 SetNumber
557    
558    C<< my $setNumber = SimBlocks::SetNumber($contigRegion, \@set0, \@set1, ..., \@setN); >>
559    
560    Examine a region string, contig ID, or genome ID, and return the number of the genome
561    set to which it belongs.
562    
563    =over 4
564    
565    =item contigRegion
566    
567    A region string, contig ID, or genome ID. Because the contig ID is prefixed by the genome ID
568    delimited by a colon symbol (C<:>), we split at the first colon to get the desired ID.
569    
570    =item set0, set1, ..., setN
571    
572    A list of references to genome ID lists. Essentially, we will return the index of the
573    first incoming list that has the specified genome ID in it.
574    
575    =item RETURN
576    
577    The index in the list of sets of the set containing the relevant genome, or -1 if
578    the genome was not found.
579    
580    =back
581    
582    =cut
583    #: Return Type $
584    sub SetNumber {
585        # Get the parameters.
586        my ($contigRegion, @sets) = @_;
587        # Extract the genome ID.
588        my ($genomeID) = split /:/, $contigRegion;
589        # Clear the return variable. A negative number means nothing found.
590        my $retVal = -1;
591        # Loop until we find the genome.
592        for (my $i = 0; $retVal < 0 && $i <= $#sets; $i++) {
593            my $set = $sets[$i];
594            if (grep /^$genomeID$/, @{$set}) {
595                $retVal = $i;
596            }
597        }
598        # Return the result.
599        return $retVal;
600    }
601    
602  =head3 TagDNA  =head3 TagDNA
603    
604  C<< my $taggedDNA = SimBlocks::TagDNA($pattern, $dnaString, $prefix, $suffix); >>  C<< my $taggedDNA = SimBlocks::TagDNA($pattern, $dnaString, $prefix, $suffix); >>
605    
606  Convert a DNA string from the B<ContainsRegionIn> relationship to the actual  Convert a DNA string from the B<Region> relation to the actual DNA.
607  DNA, optionally with markings surrounding them. The DNA string will contain  The incoming DNA string will contain only the values corresponding to the
608  only the values corresponding to the question marks in the pattern,  question marks in the pattern. The pattern should be taken from the DNA
609  which should be taken from the DNA string's parent B<GroupBlock>. The  string's parent B<GroupBlock>. The resulting DNA sequence is built by
610  resulting DNA sequence is built by copying  copying the DNA string data into the pattern positions that have question
611    marks. If the string is intended for display in an HTML page, the
612    I<$prefix> and I<$suffix> parameters can be used to surround the
613    question mark positions with HTML markers that specify a different style,
614    weight, or font.
615    
616  =over 4  =over 4
617    
# Line 333  Line 623 
623    
624  String containing DNA string to be marked.  String containing DNA string to be marked.
625    
626  =item prefix  =item prefix (optional)
627    
628  String to be inserted before each position of variance.  String to be inserted before each position of variance.
629    
630  =item suffix  =item suffix (optional)
631    
632  String to be inserted after each position of variance.  String to be inserted after each position of variance.
633    
# Line 378  Line 668 
668          # Start past the question mark.          # Start past the question mark.
669          $start = $position + 1;          $start = $position + 1;
670      }      }
671        # Tack on the residual.
672        $retVal .= substr $pattern, $start;
673      # Return the result.      # Return the result.
674      return $retVal;      return $retVal;
675  }  }
676    
677    =head3 SnipScan
678    
679    C<< my %positions = $simBlocks->SnipScan($blockObject, \@set0, \@set1); >>
680    
681    Examine the specified block and return a list of the positions at which the
682    nucleotide values for regions in the first genome set differ from the values
683    for regions in the second genome set. This conditions is only satisfied if
684    the nucleotides occurring in the first set's regions never occur in the
685    second set's regions. This is a very strict condition. If, for example, C<A>
686    occurs in one of the regions for the first set, it cannot occur in any of the
687    regions for the second set.
688    
689    =over 4
690    
691    =item blockObject
692    
693    A C<DBObject> representing the B<GroupBlock> record for the desired block or
694    the actual ID of the block whose regions are to be examined. It is expected that the
695    block will have regions in all of the genomes for both sets, but this is not
696    required by the algorithm.
697    
698    =item set0
699    
700    Reference to a list of the IDs for the genomes in the first set (0).
701    
702    =item set1
703    
704    Reference to a list of the IDs for the genomes in the second set (1).
705    
706    =item RETURN
707    
708    A hash that maps positions to contents. Each position will be relative to the
709    start of the block. The mapped value will list the nucleotides found in the
710    first set and the nucleotides found in the second set. So, for example,
711    C<['AC', 'G']> means that both C<A> and C<C> are found in the genomes of the
712    first set (0), but only C<G> is found in genomes of the second set (1).
713    
714    =back
715    
716    =cut
717    #: Return Type %;
718    sub SnipScan {
719        # Get the parameters.
720        my ($self, $blockObject, $set0, $set1) = @_;
721        # Convert an incoming block ID to a block object.
722        if (ref $blockObject ne "DBObject") {
723            $blockObject = $self->GetEntity('GroupBlock', $blockObject);
724        }
725        # Get the ID and length of this block.
726        my ($blockID) = $blockObject->Value('GroupBlock(id)');
727        my ($len) = $blockObject->Value('GroupBlock(snip-count)');
728        # Create a hash that can be used to identify the genomes in each set.
729        my %setMap = ();
730        for my $genomeID (@{$set0}) {
731            $setMap{$genomeID} = 0;
732        }
733        for my $genomeID (@{$set1}) {
734            $setMap{$genomeID} = 1;
735        }
736        # Create the snip hash. We'll create empty lists for each snip position.
737        # As soon as a position becomes ineligible we delete it from the hash.
738        my %snipHash = ();
739        for (my $i = 0; $i < $len; $i++) {
740            $snipHash{$i} = ['', ''];
741        }
742        # Ask for the regions in the block.
743        my $query = $self->Get(['IncludesRegion', 'Region'], "IncludesRegion(from-link) = ?",
744                               $blockID);
745        # Loop through the regions.
746        while (my $region = $query->Fetch) {
747            # Determine this region's genome set. We only continue if the region is in
748            # one of the sets.
749            my ($location) = $region->Value('Region(id)');
750            my $genomeID = Genome::FromContig($location);
751            if (exists $setMap{$genomeID}) {
752                my $setIndex = $setMap{$genomeID};
753                # Get the region's snip values.
754                my ($regionValue) = $region->Value('Region(content)');
755                # Loop through the positions that are still active.
756                my @positions = keys %snipHash;
757                for my $position (@positions) {
758                    # Get the nucleotide at the current position in this region.
759                    my $letter = substr $regionValue, $position, 1;
760                    # Find it in the appropriate list.
761                    my $letterList = $snipHash{$position};
762                    if ((index $letterList->[1-$setIndex], $letter) >= 0) {
763                        # It's in the other set, so this position is no
764                        # longer valid.
765                        delete $snipHash{$position};
766                    } elsif ((index $letterList->[$setIndex], $letter) < 0) {
767                        # It's in neither set, so we add it to the list we're
768                        # accumulating for this set at this position.
769                        $letterList->[$setIndex] .= $letter;
770                    }
771                }
772            }
773        }
774        # Now %snipHash contains all of the information we need, but we must relocate the
775        # positions so they are relative to the block address instead of the snip
776        # positions. We need the block pattern to do this.
777        my ($blockPattern) = $blockObject->Value('GroupBlock(pattern)');
778        my @positions = ParsePattern($blockPattern);
779        # Create the return hash.
780        my %retVal = ();
781        # Relocate the positions.
782        for my $inPosition (keys %snipHash) {
783            $retVal{$positions[$inPosition]} = $snipHash{$inPosition};
784        }
785        # Return the result.
786        return %retVal;
787    }
788    
789    =head3 ParsePattern
790    
791    C<< my @positions = SimBlocks::ParsePattern($pattern); >>
792    
793    Get a list of the positions of variance inside a block pattern. The
794    positions of variance are marked by question marks, so all we need to
795    do is return the location of every question mark in the incoming
796    patter string.
797    
798    =over 4
799    
800    =item pattern
801    
802    DNA pattern for a group block.
803    
804    =item RETURN
805    
806    Returns a list of position numbers. These are coded as offsets, with 0 indicating
807    the first character of the pattern.
808    
809    =back
810    
811    =cut
812    #: Return Type @;
813    sub ParsePattern {
814        # Get the parameters.
815        my ($pattern) = @_;
816        # Create the return list.
817        my @retVal = ();
818        # Loop through the pattern, looking for question marks.
819        my $position = 0;
820        while (($position = index($pattern, "?", $position)) >= 0) {
821            push @retVal, $position;
822            $position++;
823        }
824        # Return the result.
825        return @retVal;
826    }
827    
828    =head3 MergeDNA
829    
830    C<< my ($groupSequence, $variance) = SimBlocks::MergeDNA($groupSequence, $newSequence); >>
831    
832    Merge the DNA for a region into the group representation of similar DNA, returning the
833    result and the positions of variance. Positions of variance in the group representation
834    are indicated by question marks. This method essentially compares the new DNA to the
835    group DNA and adds question marks wherever the new DNA does not match.
836    
837    =over 4
838    
839    =item groupSequence
840    
841    Group representation of the DNA into which the new DNA is to be merged.
842    
843    =item newSequence
844    
845    New DNA to be merged into the group representation.
846    
847    =item RETURN
848    
849    Returns the merged group representation followed by an updated count of the number
850    of question marks in the group.
851    
852    =back
853    
854    =cut
855    sub MergeDNA {
856        # Get the parameters.
857        my ($groupSequence, $newSequence) = @_;
858        # Declare the return variables.
859        my ($retVal, $varianceCount) = ("", 0);
860        # If there is no group representation, or if the group representation is the
861        # same as the new DNA, we return the new DNA; otherwise, we have to merge.
862        # Note that the new DNA cannot contain any question marks, so if either of
863        # the degenerate cases is true, we return a variance of 0.
864        if ((! $groupSequence) || ($groupSequence eq $newSequence)) {
865            $retVal = $newSequence;
866        } else {
867            # Here the group representation and the new DNA are different, so we
868            # have to find all the points of difference and replace them with
869            # question marks. We do this using a trick: by XOR-ing the two
870            # strings together, we create a new string with nulls everywhere
871            # except where there's a mismatch. We then find the non-null characters
872            # and replace the corresponding positions in the return string with
873            # question marks. First, we get a copy of the group representation to
874            # play with.
875            $retVal = $groupSequence;
876            # Next we compute the XOR string.
877            my $hexor = $retVal ^ $newSequence;
878            # This loop finds all non-null string positions.
879            while ($hexor =~ m/[^\x00]/g) {
880                # Here we get the position past the mismatch point.
881                my $pos = pos $hexor;
882                # Replace the real mismatch point with a question mark.
883                substr $retVal, $pos - 1, 1, "?";
884                $varianceCount++;
885            }
886        }
887        # Return the result.
888        return ($retVal, $varianceCount);
889    }
890    
891    =head3 GetAlignment
892    
893    C<< my %sequences = $$simBlocks->GetAlignment(\@blockIDs, \@genomeIDs, $indels); >>
894    
895    Return an alignment of the specified genomes relative to the specified block
896    IDs. Only blocks in which all of the genomes occur will produce output for
897    the alignment.
898    
899    =over 4
900    
901    =item blockIDs
902    
903    Reference to a list of the IDs of the blocks from which the alignment
904    should be constructed.
905    
906    =item genomeIDs
907    
908    Reference to a list of the genome IDs participating in the alignment.
909    
910    =item indels (optional)
911    
912    C<1> if columns containing insertion characters (C<->) should be included
913    in the alignment, else C<0>. The default is C<0>.
914    
915    =item RETURN
916    
917    Returns a hash that maps each genome ID to its sequence in the alignment.
918    
919    =back
920    
921    =cut
922    #: Return Type %;
923    sub GetAlignment {
924        # Get the parameters.
925        my ($self, $blockIDs, $genomeIDs, $indels) = @_;
926        # Create the return hash. We will start with a null string for each
927        # genome, and add the alignment data one block at a time.
928        my %retVal = ();
929        for my $genomeID (@{$genomeIDs}) {
930            $retVal{$genomeID} = "";
931        }
932        # Get the number of genomes. This will help us in determining which
933        # blocks to keep.
934        my $totalGenomes = @{$genomeIDs};
935        # Loop through the blocks.
936        for my $blockID (@{$blockIDs}) {
937            # Get the block's data.
938            my $blockData = $self->GetEntity('GroupBlock', $blockID);
939            my ($blockName) = $blockData->Value('GroupBlock(description)');
940            Trace("Processing common block $blockID ($blockName).") if T(Alignment => 4);
941            # Only proceed if the block has real variance in it.
942            my ($snipCount) = $blockData->Value('GroupBlock(snip-count)');
943            if ($snipCount > 0) {
944                # Get this block's regions in the specified genomes.
945                my %regionHash = $self->GetRegions($blockID, $genomeIDs);
946                # Verify that each genome is represented.
947                my %blockHash = ();
948                my $genomeCount = 0;
949                for my $region (keys %regionHash) {
950                    # Get the genome ID from the region string.
951                    $region =~ m/^([^:]+):/;
952                    # If it's new, count it.
953                    if (! exists $blockHash{$1}) {
954                        $blockHash{$1} = $regionHash{$region};
955                        $genomeCount++;
956                    }
957                }
958                # At this point, if the number of genomes found matches the
959                # size of the list, this block is good.
960                if ($genomeCount == $totalGenomes) {
961                    if (! $indels) {
962                        # If indels are off, we need to remove some of the columns. We do
963                        # this by finding C<-> characters in each genome's DNA sequence,
964                        # and deleting all instances of the relevant columns.
965                        for my $genomeID (@{$genomeIDs}) {
966                            Trace("Hyphen check for genome $genomeID.") if T(Alignment => 3);
967                            while ((my $pos = index($blockHash{$genomeID},"-")) >= 0) {
968                                # Here we found an insertion character. We delete its column
969                                # from all the genomes in the alignment.
970                                Trace("Hyphen found at position $pos in genome $genomeID.") if T(Alignment => 4);
971                                for my $genomeID2 (@{$genomeIDs}) {
972                                    substr $blockHash{$genomeID2}, $pos, 1, "";
973                                }
974                            }
975                        }
976                    }
977                    Trace("Common block $blockID ($blockName) added to alignment.") if T(Alignment => 3);
978                    for my $genomeID (@{$genomeIDs}) {
979                        $retVal{$genomeID} .= $blockHash{$genomeID};
980                    }
981                }
982            }
983        }
984        # Return the result.
985        return %retVal;
986    }
987    
988    =head3 DistanceMatrix
989    
990    C<< my %distances = SimBlocks::DistanceMatrix(\%sequences, \%distances); >>
991    
992    Compute the distances between the sequences in an alignment. the L</SequenceDistance>
993    method is used to compute the individual distances.
994    
995    =over 4
996    
997    =item sequences
998    
999    Reference to a hash of genome IDs to alignment sequences. The alignment sequences may
1000    only contain the characters C<AGCT->, and must all be the same length.
1001    
1002    =item distances
1003    
1004    Reference to a hash that maps pairs of letters to distance values. Each letter pair
1005    must be represented as a two-character string. Dual strings such as "AG" and "GA" must
1006    map to the same distance. If no distance matrix is specified, the default distance
1007    matrix is used.
1008    
1009    =item RETURN
1010    
1011    Returns a hash that maps genome ID pairs to distance values. The ID pairs are
1012    formed using a slash. For example, the distance from C<158878.1> to C<282459.1>
1013    would be found attached to the key C<158878.1/282459.1>.
1014    
1015    =back
1016    
1017    =cut
1018    #: Return Type %
1019    sub DistanceMatrix {
1020        # Get the parameters.
1021        my ($sequences, $distances) = @_;
1022        # Declare the return variable.
1023        my %retVal = ();
1024        # Create a sorted list of the genome IDs.
1025        my @genomes = sort keys %{$sequences};
1026        # Loop through the genome IDs.
1027        for (my $i = 0; $i <= $#genomes; $i++) {
1028            my $genomei = $genomes[$i];
1029            # Create the 0-distance pair.
1030            $retVal{"$genomei/$genomei"} = 0;
1031            # Loop through the genomes lexically after this one. We'll
1032            # generate one distance and use it for both orderings.
1033            for (my $j = $i + 1; $j <= $#genomes; $j++) {
1034                my $genomej = $genomes[$j];
1035                # Compute the distance.
1036                my $distance = SequenceDistance($sequences->{$genomei}, $sequences->{$genomej},
1037                                                $distances);
1038                Trace("Distance from $genomei to $genomej is $distance.") if T(Distance => 3);
1039                # Store it in the return hash.
1040                $retVal{"$genomei/$genomej"} = $distance;
1041                $retVal{"$genomej/$genomei"} = $distance;
1042            }
1043        }
1044        # Return the result matrix.
1045        return %retVal;
1046    }
1047    
1048    =head3 SequenceDistance
1049    
1050    C<< my $dist = SimBlocks::SequenceDistance($seq1, $seq2, $distances); >>
1051    
1052    Return the distance between two sequences. The distance presumes that
1053    each alignment is a vector of sorts, with purines (C<A>/C<T>) and pyrmidines (C<G>/C<C>)
1054    being a half-unit from themselves and a full unit from each other. The distance is normalized
1055    by the number of positions in the alignment. Thus, C<AATGCTT> is 0.6267 from C<ATTTGCA>.
1056    The fourth and sixth positions are a full-unit jump and the second, fifth, and seventh
1057    positions are half-unit jumps. This works out to
1058    
1059        sqrt(0 + 0.5^2 + 0 + 1^2 + 0.5^2 + 1^2 + 0.5^2)/sqrt(7)
1060    
1061    which is approximately 0.6267. Insertion characters (C<->) are considered a distance of
1062    0 from everything. This distance model can be overridden by supplying a custom distance
1063    matrix as the third parameter.
1064    
1065    =over 4
1066    
1067    =item seq1
1068    
1069    Origin sequence for computing the distance.
1070    
1071    =item seq2
1072    
1073    End sequence for computing the distance. Both sequences must be the same length.
1074    
1075    =item distances
1076    
1077    Reference to a hash that maps pairs of letters to distance values. Each letter pair
1078    must be represented as a two-character string. Dual strings such as "AG" and "GA" must
1079    map to the same distance. If no distance matrix is specified, the default distance
1080    matrix is used.
1081    
1082    =back
1083    
1084    =cut
1085    #: Return Type $;
1086    sub SequenceDistance {
1087        # Get the parameters.
1088        my ($seq1, $seq2, $distances) = @_;
1089        if (!defined $distances) {
1090            $distances = DefaultDistances();
1091        }
1092        # Declare the return value. Initially, this will be a sum of squares. We'll
1093        # take the square root and divide out the length later.
1094        my $retVal = 0;
1095        # Loop through the sequences, character by character, comparing.
1096        my $n = length $seq1;
1097        for (my $i = 0; $i < $n; $i++) {
1098            my $s1 = substr $seq1, $i, 1;
1099            my $s2 = substr $seq2, $i, 1;
1100            my $step = $distances->{"$s1$s2"};
1101            Trace("No step distance found for \"$s1$s2\".") if !defined $step && T(Distance => 1);
1102            $retVal += $step * $step;
1103        }
1104        # Divide by the length and take the square root.
1105        $retVal = sqrt($retVal / $n);
1106        # Return the result.
1107        return $retVal;
1108    }
1109    
1110    =head3 GetBlock
1111    
1112    C<< my $blockData = $simBlocks->GetBlock($blockID); >>
1113    
1114    Return a B<DBObject> for a specified group block.
1115    
1116    =over 4
1117    
1118    =item blockID
1119    
1120    ID of the desired block.
1121    
1122    =item RETURN
1123    
1124    Returns a B<DBObject> for the group block in question. The object allows access to
1125    all of the fields in the C<GroupBlock> relation of the database.
1126    
1127    =back
1128    
1129    =cut
1130    #: Return Type $%;
1131    sub GetBlock {
1132        # Get the parameters.
1133        my ($self, $blockID) = @_;
1134        # Get the group block.
1135        my $retVal = $self->GetEntity('GroupBlock', $blockID);
1136        # Return the result.
1137        return $retVal;
1138    }
1139    
1140    =head3 GetBlockPieces
1141    
1142    C<< my @blocks = $blockData->GetBlockPieces($location); >>
1143    
1144    Return a map of the block pieces inside the specified location. The return
1145    value will be a list of block locations. A block location is essentially a
1146    location string with a block ID in place of the contig ID.
1147    
1148    This routine depends upon the fact that there is no overlap between blocks.
1149    A given nucleotide on a Contig is in exactly one block. Therefore, to find
1150    all the blocks that overlap a specific section of the Contig, we find the
1151    first block that ends after the start of the section and proceed until we
1152    find the last block that begins before the end of the section. The only
1153    complexity that occurs is if the Contig folds back in upon itself and
1154    appears twice in the same block. In this case, we might get two segments from
1155    the same block, and they may or may not overlap.
1156    
1157    =over 4
1158    
1159    =item location
1160    
1161    A basic location object or location string defining the range whose block pieces
1162    are desired.
1163    
1164    =item RETURN
1165    
1166    Returns a list of block location strings. Each string consists of a block ID
1167    followed by a start location and an end location. The three pieces of the
1168    string are separated by underscores (C<_>).
1169    
1170    =back
1171    
1172    =cut
1173    #: Return Type %;
1174    sub GetBlockPieces {
1175        # Get the parameters.
1176        my ($self, $location) = @_;
1177        # Declare the return variable.
1178        my @retVal = ();
1179        # Parse the incoming location.
1180        my $loc = BasicLocation->new($location);
1181        # Determine the parameters needed to get the desired list of regions.
1182        my ($left, $right, $contigID) = ($loc->Left, $loc->Right, $loc->Contig);
1183        Trace("Searching for regions near " . $loc->String) if T(4);
1184        # Ask for the regions in the section we want.
1185        my @regions = $self->GetAll(['Region', 'IncludesRegion', 'GroupBlock'],
1186                                    'Region(end) >= ? AND Region(position) <= ? AND Region(contigID) = ?',
1187                                    [$left, $right, $contigID],
1188                                    ['GroupBlock(id)', 'GroupBlock(pattern)',
1189                                     'GroupBlock(len)',
1190                                     'Region(position)', 'Region(end)',
1191                                     'Region(content)']);
1192        # Loop through the regions found. For each region we will output a location string.
1193        for my $regionData (@regions) {
1194            # Declare the variable to contain the formatted location.
1195            my $blockLoc;
1196            # Separate the fields from the region data.
1197            my ($blockID, $pattern, $len, $position, $end, $content) = @{$regionData};
1198            # Determine whether we're using the full region or only part of it.
1199            if ($position >= $left && $end <= $right) {
1200                # Here we're using the full region, so the location is simple.
1201                $blockLoc = "${blockID}_1_${len}";
1202            } else {
1203                # This is the hard part. We need to convert the contig positions to
1204                # block positions. Since the block contains insertion characters and
1205                # the regions don't, this requires fancy dancing. First, we get the
1206                # region's DNA string.
1207                my $dnaSequence = TagDNA($pattern, $content);
1208                # Let's call the "area of interest" the part of the Contig that
1209                # is in the incoming location AND in the current region. This could
1210                # be at either end of the region or somewhere in the middle. We
1211                # need to walk the block DNA to find what positions in the block
1212                # correspond to the desired positions in the contig. First, we
1213                # put ourselves at the beginning of the block and the region,
1214                # and walk to the start point. Note that our position arithmetic
1215                # is 1-based, in keeping with the conventions of biologists.
1216                my $blockPos = 1;
1217                my $contigPos = $position;
1218                # Check to see if we can use the current position as are starting
1219                # point or we need to find it.
1220                if ($left > $position) {
1221                    # Here we need to find the left edge of the area of interest.
1222                    $blockPos = WalkDNA($blockPos, $contigPos, $dnaSequence, $left);
1223                    # Update the contig position to match.
1224                    $contigPos = $left;
1225                }
1226                # Save the left edge.
1227                my $blockLeft = $blockPos;
1228                # Now find the right edge.
1229                if ($right >= $end) {
1230                    # Here the right edge is the end of the block.
1231                    $blockPos = $len;
1232                } else {
1233                    # Here we have to find the right edge of the area of interest.
1234                    $blockPos = WalkDNA($blockPos, $contigPos, $dnaSequence, $right);
1235                }
1236                my $blockRight = $blockPos;
1237                # Format the location.
1238                $blockLoc = "${blockID}_${blockLeft}_${blockRight}";
1239            }
1240            # Push the block location onto the return list.
1241            push @retVal, $blockLoc;
1242        }
1243        # Return the result.
1244        return @retVal;
1245    }
1246    
1247    =head3 GetFeatureBlockPieces
1248    
1249    C<< my @pieces = $simBlocks->GetFeatureBlockPieces($fig, \@featureIDs, $distance); >>
1250    
1251    Return a list of the block pieces within the specified distance of the
1252    specified features. This method essentially computes locations from
1253    the distance and the feature locations, then passes them to L/GetBlockPieces>.
1254    That method will return the sections of various similarity blocks that
1255    occur inside the the locations computed.
1256    
1257    =over 4
1258    
1259    =item fig
1260    
1261    A FIG-like object that can be used to convert features into locations.
1262    
1263    =item featureIDs
1264    
1265    Reference to a list of feature IDs. Blocks in the vicinity of any of the
1266    features are returned.
1267    
1268    =item distance
1269    
1270    Distance to search from the feature's actual location.
1271    
1272    =item RETURN
1273    
1274    Returns a list of block piece location objects. A block piece location object
1275    is a standard B<BasicLocation> object with a block ID instead of a contig ID.
1276    
1277    =back
1278    
1279    =cut
1280    #: Return Type @;
1281    sub GetFeatureBlockPieces {
1282        # Get the parameters.
1283        my ($self, $fig, $featureIDs, $distance) = @_;
1284        # Declare a hash for keeping the return data. This will weed out
1285        # duplicates, of which we expect plenty. We'll need to handle
1286        # overlaps later, however.
1287        my %retHash = ();
1288        # Loop through the features.
1289        for my $featureID (@{$featureIDs}) {
1290            # Get the feature's genome ID.
1291            my $genomeID = FIG::genome_of($featureID);
1292            # Get the feature's locations.
1293            my @locations = $fig->feature_location($featureID);
1294            # Loop through the locations, sticking the resulting pieces in the return
1295            # hash.
1296            for my $loc (@locations) {
1297                my $locObject = BasicLocation->new($loc);
1298                # Get the length of the contig in question.
1299                my $len = $fig->contig_ln($genomeID, $locObject->Contig);
1300                # Expand the location.
1301                $locObject->Widen($distance, $len);
1302                # Insure the location is Sprout-style;
1303                $locObject->FixContig($genomeID);
1304                # Get the desired block pieces.
1305                my @pieces = $self->GetBlockPieces($locObject);
1306                Trace(scalar(@pieces) . " pieces found for location $loc.") if T(4);
1307                # Put them in the hash.
1308                for my $piece (@pieces) {
1309                    $retHash{$piece} = 1;
1310                }
1311            }
1312        }
1313        # Now we have all the block pieces that occur in any one of our regions. The
1314        # next step is to merge adjacent and overlapping regions. First we convert
1315        # the pieces to location objects so we can sort them.
1316        my @retVal = ();
1317        for my $piece (keys %retHash) {
1318            my $loc = BasicLocation->new($piece);
1319            push @retVal, $loc;
1320        }
1321        Trace("Beginning sort.") if T(3);
1322        @retVal = sort { BasicLocation::Cmp($a,$b) } @retVal;
1323        Trace(scalar(@retVal) . " pieces found before overlap check.") if T(3);
1324        # Now the locations are sorted by block ID, start position, and descending
1325        # length. This means that if there's an overlap, the two overlapping
1326        # pieces will be next to each other.
1327        my $i = 0;
1328        while ($i < $#retVal) {
1329            # Check for overlap between this location and the next.
1330            my ($type, $len) = Overlap::CheckOverlap($retVal[$i], $retVal[$i+1]);
1331            if ($len == 0) {
1332                # Here there is no overlap, so we push ahead.
1333                $i++;
1334            } elsif ($type eq 'embedded') {
1335                # Here the second piece is inside the first, so we delete the
1336                # second.
1337                delete $retVal[$i+1];
1338            } else {
1339                # Here we have a normal overlap. Because all of our pieces
1340                # are forward locations, this means the second piece starts
1341                # before the end of the of the first. We set the end point
1342                # if the first to the end point of the second and delete the
1343                # second.
1344                $retVal[$i]->SetEnd($retVal[$i+1]->EndPoint);
1345                delete $retVal[$i+1];
1346            }
1347        }
1348        # Return the result.
1349        return @retVal;
1350    }
1351    
1352    =head3 WalkDNA
1353    
1354    C<< my $blockPos = SimBlocks::WalkDNA($blockPos, $contigPos, $dna, $loc); >>
1355    
1356    Location the desired position within a block of a specified location in a contig.
1357    
1358    The idea here is that a block contains insertion characters and a contig doesn't.
1359    The incoming DNA sequence is for the block. On entry, we have a current position
1360    in the block and a current position in the contig. We advance the two pointers
1361    in parallel until the contig pointer equals the desired location.
1362    
1363    =over 4
1364    
1365    =item blockPos
1366    
1367    Current position in the block (1-based).
1368    
1369    =item contigPos
1370    
1371    Current position in the contig.
1372    
1373    =item dna
1374    
1375    DNA string for the block with the points of variance filled in with data from
1376    the contig. In other word's the block's version of the current contig region.
1377    
1378    =item loc
1379    
1380    Desired location in the contig.
1381    
1382    =item RETURN
1383    
1384    Returns the position in the block corresponding to the desired position in the
1385    contig.
1386    
1387    =back
1388    
1389    =cut
1390    #: Return Type $;
1391    sub WalkDNA {
1392        # Get the parameters.
1393        my ($blockPos, $contigPos, $dna, $loc) = @_;
1394        # Copy the incoming positions into variables with which we can play. While
1395        # we're at it, we'll convert the block position from 1-based to 0-based.
1396        my ($retVal, $contigPtr) = ($blockPos - 1, $contigPos);
1397        # Loop until we find our destination point.
1398        while ($contigPtr < $loc) {
1399            # Find the next insertion character in the DNA string.
1400            my $nextHyphen = index $dna, '-', $retVal;
1401            # At this point, "nextHyphen" is either -1 (indicating no hyphen
1402            # found) or it is positioned on a hyphen. If it's -1, move it after
1403            # the end of the dna string.
1404            if ($nextHyphen < 0) {
1405                $nextHyphen = length $dna;
1406            }
1407            # Now, if we subtract $retVal from $nextHyphen, we have the largest
1408            # value by which we can advance the contig pointer without getting
1409            # out of sync. The first step is to see if this would move us past
1410            # our desired location.
1411            my $offset = $nextHyphen - $retVal;
1412            my $contigNext = $contigPtr + $offset;
1413            if ($contigNext >= $loc) {
1414                # We now know that everything between the current position and
1415                # the desired position is real nucleotides. We simply push both
1416                # pointers forward by the same amount so that the contig pointer
1417                # becomes $loc.
1418                $retVal += $loc - $contigPtr;
1419                $contigPtr = $loc
1420            } else {
1421                # Here we need to go the whole distance to the next hyphen and
1422                # keep going.
1423                $contigPtr += $offset;
1424                $retVal += $offset;
1425                $retVal++;
1426            }
1427        }
1428        # Return the result. Note we convert back to 1-based.
1429        return $retVal + 1;
1430    }
1431    
1432  1;  1;

Legend:
Removed from v.1.1  
changed lines
  Added in v.1.5

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3