[Bio] / Sprout / SimBlocks.pm Repository:
ViewVC logotype

Diff of /Sprout/SimBlocks.pm

Parent Directory Parent Directory | Revision Log Revision Log | View Patch Patch

revision 1.1, Wed May 4 03:24:43 2005 UTC revision 1.7, Mon Feb 13 15:42:48 2006 UTC
# Line 3  Line 3 
3      require Exporter;      require Exporter;
4      use ERDB;      use ERDB;
5      @ISA = qw(Exporter ERDB);      @ISA = qw(Exporter ERDB);
6      @EXPORT = qw();      @EXPORT = qw(DefaultDistances);
7      @EXPORT_OK = qw();      @EXPORT_OK = qw(BlocksInSet GetBlock GetBlockPieces CompareGenomes DBLoad DistanceMatrix GetAlignment GetRegions ParsePattern RemoveBlocks SequenceDistance SetNumber SnipScan);
8    
9      use strict;      use strict;
10      use Tracer;      use Tracer;
11      use PageBuilder;      use PageBuilder;
12      use Genome;      use Genome;
13        use BasicLocation;
14        use Overlap;
15      use FIG_Config;      use FIG_Config;
16        use FIG;
17    
18  =head1 Similarity Block Database  =head1 Similarity Block Database
19    
# Line 23  Line 25 
25  by the C<SimBlocksDBD.xml> file.  by the C<SimBlocksDBD.xml> file.
26    
27  The primary entities of the database are B<Genome>, which represents a specified  The primary entities of the database are B<Genome>, which represents a specified
28  organism, B<Group>, which represents a set of similar regions, and B<Contig>,  organism, B<GroupBlock>, which represents a set of similar regions, and B<Contig>,
29  which represents a contiguous segment of DNA. The key relationship is  which represents a contiguous segment of DNA. The key relationship is
30  B<ContainsRegionIn>, which maps groups to contigs. The mapping information  B<ContainsRegionIn>, which maps blocks to contigs. The mapping information
31  includes the DNA nucleotides as well as the location in the contig.  includes the DNA nucleotides as well as the location in the contig.
32    
33  For a set of genomes, we will define a I<common group> as a group that  =head2 Formal Definitions
34  is present in most of the genomes. (The definition of I<most> is  
35  determined by a parameter.)  Let {B<G(1)>, B<G(2)>, ..., B<G(>I<n>B<)>} be a set of genomes
36    and let {B<g(1)>, B<g(2)>, ..., B<g(>I<m>B<)>} be the set of
37    genes called on these genomes.
38    
39    The primitive notion from which similarity blocks are derived is the
40    I<correspondence mapping>. Roughly, this mapping connects genes on
41    one genome to corresponding genes on other genomes. We will treat
42    the mapping as an equivalence class. In practice, the entire mapping
43    will not be available when we are building to load files for the
44    similarity block database. Instead, the user will provide a subset
45    of the mapping from which the entire partitioning can be deduced
46    via commutativity, reflexivity, and transitivity. This subset is
47    called the I<correspondence relation>.
48    
49    In actual practice, the correspondence relation will contain some
50    false positives. The process of developing the correspondence
51    relation usually involves analyzing similarities above a certain
52    threshhold. (This type of similarity is not necessarily transitive;
53    however, the true, underlying correspondence mapping we want I<is>
54    transitive.) Often, a sequence of correspondence pairs that maps
55    a gene on one particular genome to another gene on the same genome
56    is an indication of a false positive.
57    
58    In addition to false positives, we may have invalid genes, missing
59    genes, frame shifts, and incorrect start locations. Detecting and
60    cleaning these problems is an important part of getting a good
61    correspondence relation.
62    
63    Given a full-blown correspondence mapping, we define the following
64    concepts.
65    
66    =over 4
67    
68    =item corresponding gene
69    
70    The I<corresponding genes> of a given gene B<g(>I<i>B<)> are those
71    genes mapped to it by the correspondence mapping.
72    
73    =item gene block
74    
75    A I<gene block> is a maximal set of corresponding genes.
76    
77    =item intergenic region
78    
79    An I<intergenic region> is a section of the contig between
80    two adjacent genes. The adjacent genes are called the
81    I<left gene> and the I<right gene>, depending on whether they
82    precede or follow the region. For example, if C<G2_100+50>
83    and C<G2_175+100> are adjacent, C<G2_150+25> is their
84    intergenic region, the left gene is C<G2_100+50>, and
85    the right gene is C<G2_175+100>.
86    
87    =item intergenic block
88    
89    An I<intergenic block> is a maximal set of intergenic regions such
90    that (1) the left genes all correspond and have the same orientation,
91    (2) the right genes all correspond and have the same orientation, and
92    (3) the region lengths differ by no more than a specified threshhold
93    (usually 10%).
94    
95    =item similarity block
96    
97    A I<similarity block> is either a gene block or an intergenic block.
98    A similarity block is therefore a set of regions (either gene regions
99    or intergenic regions) that are tied together by the correspondence
100    mapping.
101    
102    =item common block
103    
104    A I<common block> with respect to a subset of the genomes
105    {B<G(>I<i1>B<)>, B<G(>I<i2>B<)>, ... B<G(>I<ik>B<)>} is a
106    similarity block that has exactly one region in each
107    of the genomes of the subset. There is a lesser notion of
108    a block that occurs in most genomes in the subset. The actual
109    program allows the notion of I<most> to be defined externally.
110    The default is all of the genomes in the subset.
111    
112    =item unique sequence
113    
114    A I<unique sequence> is a region that does not appear in any
115    similarity block with another region. In other words, a unique
116    sequence is a region that does not correspond to any other region.
117    
118    =back
119    
120    =head2 Usage
121    
122  The similarity block database must be able to answer two questions, taking  The similarity block database must be able to answer two questions, taking
123  as input two sets of genomes.  as input two genome sets. (These would be subsets of the total set of genomes
124    represented in the correspondence mapping.)
125    
126  =over 4  =over 4
127    
128  =item (1) Which groups are common in both sets and which are common in only one or the other?  =item (1) Which blocks are common in one set and do not occur in the other.
129    
130  =item (2) For each group that is common in both sets, how do the individual regions differ?  These are considered major variations. They result from prophage insertions,
131    transposition events, and recombinations. A major variation results in a
132    different set of proteins in an operating cell.
133    
134    =item (2) What individual variations in the shared common blocks distinguish the two sets?
135    
136    These are considered minor variations, and are the most common result of
137    single-nucleotide mutations. Rather than changing the set of proteins,
138    minor variations change the proteins themselves.
139    
140  =back  =back
141    
142  This class is a subclass of B<ERDB>, and inherits all of its public methods.  This class is a subclass of B<ERDB>, and inherits all of its public methods.
143    
144    =head2 Setup
145    
146    To begin working with similarity blocks, you must first create a database to
147    contain the block data. Next, you must specify the various similarity block
148    parameters in C<FIG_Config>. These are as follows. (Of course, the values
149    you specify will be different.)
150    
151        $simBlocksDB     = "conser5_blocks";    # SimBlocks database schema name
152        $simBlocksData   = "$data/BlockData";   # directory containing SimBlocks data files
153                                                # (used to load the Similarity database)
154        $simBlastData    = "$data/BlastData";   # directory containing SimBlast data files
155                                                # (input to SimLoad.pl)
156    
157    Run the C<SimLoad> script to load the database.
158    
159    The C<SimBlockForm.cgi> service enables you to run a similarity analysis between genome
160    sets in the database. The C<Html/ERDBTest.html> page enables you to run general queries
161    against the database and execute simple scripts (however, this latter capability is
162    disabled unless you have
163    
164        $debug_mode      = 1;                   # TRUE to enable the debugging scripts
165    
166    The Similarity Block methods and scripts work in the Windows version of the SEED (see
167    C<WinBuild>).
168    
169  =cut  =cut
170    
171  #: Constructor SimBlocks->new();  #: Constructor SimBlocks->new();
# Line 62  Line 183 
183  will be used. The user name and password for connecting are taken from  will be used. The user name and password for connecting are taken from
184  C<FIG_Config::dbuser> and C<FIG_Config::dbpass>.  C<FIG_Config::dbuser> and C<FIG_Config::dbpass>.
185    
186    In almost every case, you will be calling
187    
188    C<< my $simdb = SimBlocks->new(); >>
189    
190  =over 4  =over 4
191    
192  =item dbname (optional)  =item dbname (optional)
# Line 88  Line 213 
213      my ($class, $dbname, $dbType, $port) = @_;      my ($class, $dbname, $dbType, $port) = @_;
214      # Plug in the default values.      # Plug in the default values.
215      if (! $dbname) {      if (! $dbname) {
216          $dbname = $FIG_Config::simBlocksDB;          $dbname = DBName();
217      }      }
218      if (! $dbType) {      if (! $dbType) {
219          $dbType = $FIG_Config::dbms;          $dbType = $FIG_Config::dbms;
# Line 106  Line 231 
231      return $retVal;      return $retVal;
232  }  }
233    
234    =head3 DBName
235    
236    C<< my $name = SimBlocks::DBName; >>
237    
238    Return the name of the database. This is set from a config variable, but if the
239    variable is undefined a default value is used.
240    
241    =cut
242    
243    sub DBName {
244        my $retVal;
245        if (defined $FIG_Config::simBlocksDB) {
246            $retVal = $FIG_Config::simBlocksDB;
247        } else {
248            $retVal = "Blocks";
249        }
250        return $retVal;
251    }
252    
253    =head3 DefaultDistances
254    
255    C<< my $distances = DefaultDistances(); >>
256    
257    Return the default distance matrix for computing the alignment distances (see
258    also L</DistanceMatrix>. This matrix returns a distance of 0 between an insertion
259    character (C<->) and any other nucleotide, a distance of 0.5 between nucleotides
260    of the same type, and a distance of 1 between nucleotides of different types.
261    
262    =cut
263    
264    my $DefaultDistanceMatrix = { AA  => 0,      AC  => 1,     AG  => 0.5,   AT  => 1,
265                                  AN  => 0.625,  AR  => 0.25,  AY  => 1,    'A-' => 0,
266                                  CA  => 1,      CC  => 0,     CG  => 1,     CT  => 0.5,
267                                  CN  => 0.625,  CR  => 1,     CY  => 0.25, 'C-' => 0,
268                                  GA  => 0.5,    GC  => 1,     GG  => 0,     GT  => 1,
269                                  GN  => 0,      GR  => 0.25,  GY  => 1,    'G-' => 0,
270                                  TA  => 1,      TC  => 0.5,   TG  => 1,     TT  => 0,
271                                  TN  => 0,      TR  => 1,     TY  => 0.25, 'T-' => 0,
272                                  RA  => 0.25,   RC  => 1,     RG  => 0.25,  RT  => 1,
273                                  RN  => 0.625,  RR  => 0.25,  RY  => 1,    'R-' => 0,
274                                  YA  => 1,      YC  => 0.25,  YG  => 1,     YT  => 0.25,
275                                  YN  => 0.625,  YR  => 1,     YY  => 0.25, 'Y-' => 0,
276                                  NA  => 0.625,  NC  => 0.625, NG  => 0.625, NT  => 0.625,
277                                  NN  => 0.625,  NR  => 0.625, NY  => 0.625,'N-' => 0,
278                                 '-A' => 0,     '-C' => 0,    '-G' => 0,    '-T' => 0,
279                                 '-N' => 0,     '-R' => 0,    '-Y' => 0,    '--' => 0 };
280    
281    sub DefaultDistances {
282        return $DefaultDistanceMatrix;
283    }
284    
285  =head3 DBLoad  =head3 DBLoad
286    
287  C<< my $stats = $simBlocks->DBLoad($rebuild); >>  C<< my $stats = $simBlocks->DBLoad($rebuild); >>
# Line 119  Line 295 
295  =item rebuild  =item rebuild
296    
297  TRUE if the tables should be re-created; FALSE if the tables should be cleared and  TRUE if the tables should be re-created; FALSE if the tables should be cleared and
298  reloaded. The default is FALSE; however, TRUE is considerably faster.  reloaded. The default is FALSE; however, TRUE is considerably faster. You would use
299    FALSE if you had added non-standard indexes to the database and didn't want to lose
300    them.
301    
302  =item RETURN  =item RETURN
303    
# Line 141  Line 319 
319    
320  =head3 CompareGenomes  =head3 CompareGenomes
321    
322  C<< my (\%set1Blocks, \%set2Blocks, \%bothBlocks) = $simBlocks->CompareGenomes(\@set1, \@set2); >>  C<< my (\%set0Blocks, \%set1Blocks, \%bothBlocks) = $simBlocks->CompareGenomes(\@set0, \@set1); >>
323    
324  Analyze two sets of genomes for commonalities. The group blocks returned will be divided  Analyze two sets of genomes for commonalities. The group blocks returned will be divided
325  into three hashes: one for those found only in set 1, one for those found only in set 2,  into three hashes: one for those common to set 0 and not occurring at all in set 1, one
326  and one for those found in both sets. Each hash is keyed by group ID and will contain  for those common to set 1 and not occurring at all in set 0, and one for those common
327  B<DBObject>s for B<GroupBlock> records with B<HasInstanceOf> data attached, though the  to both sets. Each hash is keyed by group ID and will contain B<DBObject>s for
328  genome ID in the B<HasInstanceOf> section is not generally predictable.  B<GroupBlock> records with B<HasInstanceOf> data attached, though the genome ID in
329    the B<HasInstanceOf> section is not generally predictable.
330    
331  =over 4  =over 4
332    
333  =item set1  =item set0
334    
335  Reference to a list of genome IDs for the genomes in the first set.  Reference to a list of genome IDs for the genomes in the first set (set 0).
336    
337  =item set2  =item set1
338    
339  Reference to a list of genome IDs for the genomes in the second set.  Reference to a list of genome IDs for the genomes in the second set (set 1).
340    
341  =item RETURN  =item RETURN
342    
343  Returns a list of hashes of lists. Each hash is keyed by group ID, and will contain  Returns a triple of hashes. Each hash is keyed by group ID, and will contain
344  B<DBObject>s for records in the B<GroupBlock> table. Groups found only in set  B<DBObject>s for records in the B<GroupBlock> table. Groups found in all of the
345  1 will be in the first hash, groups found only in set 2 will be in the second  genomes in set 0 but in none of the genomes of set 1 will be in the first hash,
346  hash, and groups found in both sets will be in the third hash.  groups found in all of the genomes in set 1 but in none of the genomes of set 0
347    will be in the second hash, and groups found in all of the genomes of both sets
348    will be in the third hash.
349    
350  =back  =back
351    
# Line 172  Line 353 
353  #: Return Type @%;  #: Return Type @%;
354  sub CompareGenomes {  sub CompareGenomes {
355      # Get the parameters.      # Get the parameters.
356      my ($self, $set1, $set2) = @_;      my ($self, $set0, $set1) = @_;
357      # Declare the three hashes.      # Declare the three hashes.
358      my (%set1Blocks, %set2Blocks, %bothBlocks);      my (%set0Blocks, %set1Blocks, %bothBlocks);
359      # Our first task is to find the groups in each genome set.      # Our first task is to find the common groups in each genome set.
360        my %groups0 = $self->BlocksInSet($set0);
361      my %groups1 = $self->BlocksInSet($set1);      my %groups1 = $self->BlocksInSet($set1);
     my %groups2 = $self->BlocksInSet($set2);  
362      # Create a trailer key to help tighten the loops.      # Create a trailer key to help tighten the loops.
363      my $trailer = "\xFF";      my $trailer = "\xFF";
364      # The groups are hashed by group ID. We will move through them in key order      # The groups are hashed by group ID. We will move through them in key order
365      # to get the "set1", "set2", and "both" groups.      # to get the "set0", "set1", and "both" groups.
366        my @blockIDs0 = (sort keys %groups0, $trailer);
367      my @blockIDs1 = (sort keys %groups1, $trailer);      my @blockIDs1 = (sort keys %groups1, $trailer);
     my @blockIDs2 = (sort keys %groups2, $trailer);  
368      # Prime the loop by getting the first ID from each list. Thanks to the trailers      # Prime the loop by getting the first ID from each list. Thanks to the trailers
369      # we know this process can't fail.      # we know this process can't fail.
370        my $blockID0 = shift @blockIDs0;
371      my $blockID1 = shift @blockIDs1;      my $blockID1 = shift @blockIDs1;
372      my $blockID2 = shift @blockIDs2;      while ($blockID0 lt $trailer || $blockID1 lt $trailer) {
     while ($blockID1 lt $trailer || $blockID2 lt $trailer) {  
373          # Compare the block IDs.          # Compare the block IDs.
374          if ($blockID1 lt $blockID2) {          if ($blockID0 lt $blockID1) {
375              # Block 1 is only in the first set.              # Block 0 is only in the first set.
376                Trace("Block IDs $blockID0 vs. $blockID1. Set 0 selected.") if T(SimCompare => 3);
377                $set0Blocks{$blockID0} = $groups0{$blockID0};
378                $blockID0 = shift @blockIDs0;
379            } elsif ($blockID0 gt $blockID1) {
380                # Block 1 is only in the second set.
381                Trace("Block IDs $blockID0 vs. $blockID1. Set 1 selected.") if T(SimCompare => 3);
382              $set1Blocks{$blockID1} = $groups1{$blockID1};              $set1Blocks{$blockID1} = $groups1{$blockID1};
383              $blockID1 = shift @blockIDs1;              $blockID1 = shift @blockIDs1;
         } elsif ($blockID1 gt $blockID2) {  
             # Block 2 is only in the second set.  
             $set2Blocks{$blockID2} = $groups2{$blockID2};  
             $blockID2 = shift @blockIDs2;  
384          } else {          } else {
385              # We have a block in both sets.              # We have a block in both sets.
386                Trace("Block IDs $blockID0 vs. $blockID1. Both selected.") if T(SimCompare => 3);
387              $bothBlocks{$blockID1} = $groups1{$blockID1};              $bothBlocks{$blockID1} = $groups1{$blockID1};
388                $blockID0 = shift @blockIDs0;
389              $blockID1 = shift @blockIDs1;              $blockID1 = shift @blockIDs1;
             $blockID2 = shift @blockIDs2;  
390          }          }
391      }      }
392        # The set1 and set2 lists are too big. They contain groups that are common in their
393        # respective sets but not common in both sets. We want groups that are common in
394        # their respective sets but completely absent in the other set. We do this by getting
395        # a complete list of the blocks in each set and deleting anything we find.
396        $self->RemoveBlocks(\%set0Blocks, $set1);
397        $self->RemoveBlocks(\%set1Blocks, $set0);
398      # Return the result.      # Return the result.
399      return (\%set1Blocks, \%set2Blocks, \%bothBlocks);      return (\%set0Blocks, \%set1Blocks, \%bothBlocks);
400    }
401    
402    =head3 RemoveBlocks
403    
404    C<< $simBlocks->RemoveBlocks(\%blockMap, \@set); >>
405    
406    Remove from the specified block map any blocks that occur in the specified set of genomes.
407    The block map can contain any data, but it must be keyed by block ID.
408    
409    =over 4
410    
411    =item blockMap
412    
413    Reference to a hash of block IDs to block data. The character of the block data is
414    irrelevant to this method.
415    
416    =item set
417    
418    Reference to a list of genome IDs. Any block that occurs in any one of the specified
419    genomes will be removed from the block map.
420    
421    =back
422    
423    =cut
424    #: Return Type ;
425    sub RemoveBlocks {
426        # Get the parameters.
427        my ($self, $blockMap, $set) = @_;
428        # Get a list of the blocks in the genome set.
429        my %setBlocks = $self->BlocksInSet($set, 1);
430        # Loop through the incoming block map, deleting any blocks found
431        # in the new block set. Note we make the assumption that the
432        # incoming block map is smaller.
433        my @blockList = keys %{$blockMap};
434        for my $blockID (@blockList) {
435            if (exists $setBlocks{$blockID}) {
436                delete $blockMap->{$blockID};
437            }
438        }
439  }  }
440    
441  =head3 BlocksInSet  =head3 BlocksInSet
442    
443  C<< my %blockList = $simBlocks->BlocksInSet($set); >>  C<< my %blockList = $simBlocks->BlocksInSet($set, $count); >>
444    
445  Return a list of the group blocks found in a given set of genomes. The list returned  Return a list of the group blocks found in a given number of the genomes in a given
446  will be a hash of B<DBObject>s, each corresponding to a single B<GroupBlock> record,  set. The list returned will be a hash of B<DBObject>s, each corresponding to a single
447  with a B<HasInstanceOf> record attached, though the content of the B<HasInstanceOf>  B<GroupBlock> record, with a B<HasInstanceOf> record attached, though the content of
448  record is not predictable. The hash will be keyed by group ID.  the B<HasInstanceOf> record is not predictable. The hash will be keyed by block ID.
449    
450  =over 4  =over 4
451    
452  =item set  =item set
453    
454  Reference to a list of genome IDs. All blocks appearing in any one of the genome  Reference to a list of genome IDs.
455  IDs will be in the list returned.  
456    =item count (optional)
457    
458    Minimum number of genomes from the set in which the block must appear. If C<1> is
459    specified, the list will include any block that occurs in at least one genome.
460    If C<5> is specified, the list will only include blocks that occur in at least
461    5 genomes from the set. If the set consists of only 4 genomes in this latter
462    case, no blocks will be returned. The default is to return blocks that occur in
463    all genomes of the set.
464    
465  =item RETURN  =item RETURN
466    
# Line 236  Line 473 
473  #: Return Type %%;  #: Return Type %%;
474  sub BlocksInSet {  sub BlocksInSet {
475      # Get the parameters.      # Get the parameters.
476      my ($self, $set) = @_;      my ($self, $set, $count) = @_;
477        # If the count was not specified, set it t
478        if (!defined $count) {
479            $count = @{$set};
480        }
481        Trace("BlocksInSet called for $count genomes of set (" . join(", ", @{$set}) . ").") if T(2);
482      # Create a hash to hold the groups found. The hash will be keyed by group ID      # Create a hash to hold the groups found. The hash will be keyed by group ID
483      # and contain the relevant DB object.      # and contain the relevant DB object.
484      my %retVal = ();      my %retVal = ();
485        # Create a hash to contain the number of genomes in which each block was found.
486        my %counters = ();
487      # Loop through the genome IDs in the specified set.      # Loop through the genome IDs in the specified set.
488      for my $genomeID (@{$set}) {      for my $genomeID (@{$set}) {
489          # Get a list of group blocks for this genome.          # Get a list of group blocks for this genome.
# Line 249  Line 493 
493          for my $block (@blocks) {          for my $block (@blocks) {
494              # Get the ID of this block.              # Get the ID of this block.
495              my ($blockID) = $block->Value('GroupBlock(id)');              my ($blockID) = $block->Value('GroupBlock(id)');
496              # Store it in the hash. If it's a duplicate, it will erase the              # Determine whether this block is new or has been found in a previous
497              # old one.              # genome of the set.
498                if (exists $counters{$blockID}) {
499                    # Here it's old. Increment its counter.
500                    $counters{$blockID}++;
501                } else {
502                    # Here it's new. Create a counter for it.
503                    $counters{$blockID} = 1;
504                }
505                # If it meets the threshhold, put it in the return hash. If it's already
506                # there, we'll simply overwrite the old copy, which makes no difference
507                # to the caller.
508                Trace("Block $blockID has $counters{$blockID} occurrences as of genome $genomeID.") if T(SimSet => 4);
509                if ($counters{$blockID} >= $count) {
510              $retVal{$blockID} = $block;              $retVal{$blockID} = $block;
511                    Trace("Block $blockID added to set.") if T(SimSet => 3);
512                }
513          }          }
514      }      }
515      # Return the result.      # Return the result.
# Line 291  Line 549 
549      my ($self, $blockID, $genomes) = @_;      my ($self, $blockID, $genomes) = @_;
550      # Create the return hash.      # Create the return hash.
551      my %retVal = ();      my %retVal = ();
552        # Count the regions found.
553        my $regionCount = 0;
554      # Query all the regions for the specified block.      # Query all the regions for the specified block.
555      my $query = $self->Get(['ContainsRegionIn'], "ContainsRegionIn(from-link) = ?", $blockID);      my $query = $self->Get(['IncludesRegion', 'Region'], "IncludesRegion(from-link) = ?",
556                               $blockID);
557      # Loop through the query.      # Loop through the query.
558      while (my $region = $query->Fetch) {      while (my $region = $query->Fetch) {
559          # Get this region's data.          # Get this region's data.
560          my ($contigID, $start, $dir, $len, $dna) =          my ($location, $dna) =
561              $region->Values(['ContainsRegionIn(to-link)', 'ContainsRegionIn(position)',              $region->Values(['Region(id)', 'Region(content)']);
                              'ContainsRegionIn(direction)', 'ContainsRegionIn(len)',  
                              'ContainsRegionIn(content)']);  
562          # Insure it belongs to one of our genomes.          # Insure it belongs to one of our genomes.
563          my $genomeID = Genome::FromContig($contigID);          my $found = SetNumber($location, $genomes);
         my $found = grep($_ eq $genomeID, @{$genomes});  
564          # If it does, put it in the hash.          # If it does, put it in the hash.
565          if ($found) {          if ($found >= 0) {
             my $location = "${contigID}_$start$dir$len";  
566              $retVal{$location} = $dna;              $retVal{$location} = $dna;
567                $regionCount++;
568          }          }
569      }      }
570        Trace("$regionCount regions found by GetRegions.") if T(2);
571      # Return the result.      # Return the result.
572      return %retVal;      return %retVal;
573  }  }
574    
575    =head3 SetNumber
576    
577    C<< my $setNumber = SimBlocks::SetNumber($contigRegion, \@set0, \@set1, ..., \@setN); >>
578    
579    Examine a region string, contig ID, or genome ID, and return the number of the genome
580    set to which it belongs.
581    
582    =over 4
583    
584    =item contigRegion
585    
586    A region string, contig ID, or genome ID. Because the contig ID is prefixed by the genome ID
587    delimited by a colon symbol (C<:>), we split at the first colon to get the desired ID.
588    
589    =item set0, set1, ..., setN
590    
591    A list of references to genome ID lists. Essentially, we will return the index of the
592    first incoming list that has the specified genome ID in it.
593    
594    =item RETURN
595    
596    The index in the list of sets of the set containing the relevant genome, or -1 if
597    the genome was not found.
598    
599    =back
600    
601    =cut
602    #: Return Type $
603    sub SetNumber {
604        # Get the parameters.
605        my ($contigRegion, @sets) = @_;
606        # Extract the genome ID.
607        my ($genomeID) = split /:/, $contigRegion;
608        # Clear the return variable. A negative number means nothing found.
609        my $retVal = -1;
610        # Loop until we find the genome.
611        for (my $i = 0; $retVal < 0 && $i <= $#sets; $i++) {
612            my $set = $sets[$i];
613            if (grep /^$genomeID$/, @{$set}) {
614                $retVal = $i;
615            }
616        }
617        # Return the result.
618        return $retVal;
619    }
620    
621  =head3 TagDNA  =head3 TagDNA
622    
623  C<< my $taggedDNA = SimBlocks::TagDNA($pattern, $dnaString, $prefix, $suffix); >>  C<< my $taggedDNA = SimBlocks::TagDNA($pattern, $dnaString, $prefix, $suffix); >>
624    
625  Convert a DNA string from the B<ContainsRegionIn> relationship to the actual  Convert a DNA string from the B<Region> relation to the actual DNA.
626  DNA, optionally with markings surrounding them. The DNA string will contain  The incoming DNA string will contain only the values corresponding to the
627  only the values corresponding to the question marks in the pattern,  question marks in the pattern. The pattern should be taken from the DNA
628  which should be taken from the DNA string's parent B<GroupBlock>. The  string's parent B<GroupBlock>. The resulting DNA sequence is built by
629  resulting DNA sequence is built by copying  copying the DNA string data into the pattern positions that have question
630    marks. If the string is intended for display in an HTML page, the
631    I<$prefix> and I<$suffix> parameters can be used to surround the
632    question mark positions with HTML markers that specify a different style,
633    weight, or font.
634    
635  =over 4  =over 4
636    
# Line 333  Line 642 
642    
643  String containing DNA string to be marked.  String containing DNA string to be marked.
644    
645  =item prefix  =item prefix (optional)
646    
647  String to be inserted before each position of variance.  String to be inserted before each position of variance.
648    
649  =item suffix  =item suffix (optional)
650    
651  String to be inserted after each position of variance.  String to be inserted after each position of variance.
652    
# Line 378  Line 687 
687          # Start past the question mark.          # Start past the question mark.
688          $start = $position + 1;          $start = $position + 1;
689      }      }
690        # Tack on the residual.
691        $retVal .= substr $pattern, $start;
692      # Return the result.      # Return the result.
693      return $retVal;      return $retVal;
694  }  }
695    
696    =head3 SnipScan
697    
698    C<< my %positions = $simBlocks->SnipScan($blockObject, \@set0, \@set1); >>
699    
700    Examine the specified block and return a list of the positions at which the
701    nucleotide values for regions in the first genome set differ from the values
702    for regions in the second genome set. This conditions is only satisfied if
703    the nucleotides occurring in the first set's regions never occur in the
704    second set's regions. This is a very strict condition. If, for example, C<A>
705    occurs in one of the regions for the first set, it cannot occur in any of the
706    regions for the second set.
707    
708    =over 4
709    
710    =item blockObject
711    
712    A C<DBObject> representing the B<GroupBlock> record for the desired block or
713    the actual ID of the block whose regions are to be examined. It is expected that the
714    block will have regions in all of the genomes for both sets, but this is not
715    required by the algorithm.
716    
717    =item set0
718    
719    Reference to a list of the IDs for the genomes in the first set (0).
720    
721    =item set1
722    
723    Reference to a list of the IDs for the genomes in the second set (1).
724    
725    =item RETURN
726    
727    A hash that maps positions to contents. Each position will be relative to the
728    start of the block. The mapped value will list the nucleotides found in the
729    first set and the nucleotides found in the second set. So, for example,
730    C<['AC', 'G']> means that both C<A> and C<C> are found in the genomes of the
731    first set (0), but only C<G> is found in genomes of the second set (1).
732    
733    =back
734    
735    =cut
736    #: Return Type %;
737    sub SnipScan {
738        # Get the parameters.
739        my ($self, $blockObject, $set0, $set1) = @_;
740        # Convert an incoming block ID to a block object.
741        if (ref $blockObject ne "DBObject") {
742            $blockObject = $self->GetEntity('GroupBlock', $blockObject);
743        }
744        # Get the ID and length of this block.
745        my ($blockID) = $blockObject->Value('GroupBlock(id)');
746        my ($len) = $blockObject->Value('GroupBlock(snip-count)');
747        # Create a hash that can be used to identify the genomes in each set.
748        my %setMap = ();
749        for my $genomeID (@{$set0}) {
750            $setMap{$genomeID} = 0;
751        }
752        for my $genomeID (@{$set1}) {
753            $setMap{$genomeID} = 1;
754        }
755        # Create the snip hash. We'll create empty lists for each snip position.
756        # As soon as a position becomes ineligible we delete it from the hash.
757        my %snipHash = ();
758        for (my $i = 0; $i < $len; $i++) {
759            $snipHash{$i} = ['', ''];
760        }
761        # Ask for the regions in the block.
762        my $query = $self->Get(['IncludesRegion', 'Region'], "IncludesRegion(from-link) = ?",
763                               $blockID);
764        # Loop through the regions.
765        while (my $region = $query->Fetch) {
766            # Determine this region's genome set. We only continue if the region is in
767            # one of the sets.
768            my ($location) = $region->Value('Region(id)');
769            my $genomeID = Genome::FromContig($location);
770            if (exists $setMap{$genomeID}) {
771                my $setIndex = $setMap{$genomeID};
772                # Get the region's snip values.
773                my ($regionValue) = $region->Value('Region(content)');
774                # Loop through the positions that are still active.
775                my @positions = keys %snipHash;
776                for my $position (@positions) {
777                    # Get the nucleotide at the current position in this region.
778                    my $letter = substr $regionValue, $position, 1;
779                    # Find it in the appropriate list.
780                    my $letterList = $snipHash{$position};
781                    if ((index $letterList->[1-$setIndex], $letter) >= 0) {
782                        # It's in the other set, so this position is no
783                        # longer valid.
784                        delete $snipHash{$position};
785                    } elsif ((index $letterList->[$setIndex], $letter) < 0) {
786                        # It's in neither set, so we add it to the list we're
787                        # accumulating for this set at this position.
788                        $letterList->[$setIndex] .= $letter;
789                    }
790                }
791            }
792        }
793        # Now %snipHash contains all of the information we need, but we must relocate the
794        # positions so they are relative to the block address instead of the snip
795        # positions. We need the block pattern to do this.
796        my ($blockPattern) = $blockObject->Value('GroupBlock(pattern)');
797        my @positions = ParsePattern($blockPattern);
798        # Create the return hash.
799        my %retVal = ();
800        # Relocate the positions.
801        for my $inPosition (keys %snipHash) {
802            $retVal{$positions[$inPosition]} = $snipHash{$inPosition};
803        }
804        # Return the result.
805        return %retVal;
806    }
807    
808    =head3 ParsePattern
809    
810    C<< my @positions = SimBlocks::ParsePattern($pattern); >>
811    
812    Get a list of the positions of variance inside a block pattern. The
813    positions of variance are marked by question marks, so all we need to
814    do is return the location of every question mark in the incoming
815    patter string.
816    
817    =over 4
818    
819    =item pattern
820    
821    DNA pattern for a group block.
822    
823    =item RETURN
824    
825    Returns a list of position numbers. These are coded as offsets, with 0 indicating
826    the first character of the pattern.
827    
828    =back
829    
830    =cut
831    #: Return Type @;
832    sub ParsePattern {
833        # Get the parameters.
834        my ($pattern) = @_;
835        # Create the return list.
836        my @retVal = ();
837        # Loop through the pattern, looking for question marks.
838        my $position = 0;
839        while (($position = index($pattern, "?", $position)) >= 0) {
840            push @retVal, $position;
841            $position++;
842        }
843        # Return the result.
844        return @retVal;
845    }
846    
847    =head3 MergeDNA
848    
849    C<< my ($groupSequence, $variance) = SimBlocks::MergeDNA($groupSequence, $newSequence); >>
850    
851    Merge the DNA for a region into the group representation of similar DNA, returning the
852    result and the positions of variance. Positions of variance in the group representation
853    are indicated by question marks. This method essentially compares the new DNA to the
854    group DNA and adds question marks wherever the new DNA does not match.
855    
856    =over 4
857    
858    =item groupSequence
859    
860    Group representation of the DNA into which the new DNA is to be merged.
861    
862    =item newSequence
863    
864    New DNA to be merged into the group representation.
865    
866    =item RETURN
867    
868    Returns the merged group representation followed by an updated count of the number
869    of question marks in the group.
870    
871    =back
872    
873    =cut
874    sub MergeDNA {
875        # Get the parameters.
876        my ($groupSequence, $newSequence) = @_;
877        # Declare the return variables.
878        my ($retVal, $varianceCount) = ("", 0);
879        # If there is no group representation, or if the group representation is the
880        # same as the new DNA, we return the new DNA; otherwise, we have to merge.
881        # Note that the new DNA cannot contain any question marks, so if either of
882        # the degenerate cases is true, we return a variance of 0.
883        if ((! $groupSequence) || ($groupSequence eq $newSequence)) {
884            $retVal = $newSequence;
885        } else {
886            # Here the group representation and the new DNA are different, so we
887            # have to find all the points of difference and replace them with
888            # question marks. We do this using a trick: by XOR-ing the two
889            # strings together, we create a new string with nulls everywhere
890            # except where there's a mismatch. We then find the non-null characters
891            # and replace the corresponding positions in the return string with
892            # question marks. First, we get a copy of the group representation to
893            # play with.
894            $retVal = $groupSequence;
895            # Next we compute the XOR string.
896            my $hexor = $retVal ^ $newSequence;
897            # This loop finds all non-null string positions.
898            while ($hexor =~ m/[^\x00]/g) {
899                # Here we get the position past the mismatch point.
900                my $pos = pos $hexor;
901                # Replace the real mismatch point with a question mark.
902                substr $retVal, $pos - 1, 1, "?";
903                $varianceCount++;
904            }
905        }
906        # Return the result.
907        return ($retVal, $varianceCount);
908    }
909    
910    =head3 GetAlignment
911    
912    C<< my %sequences = $$simBlocks->GetAlignment(\@blockIDs, \@genomeIDs, $indels); >>
913    
914    Return an alignment of the specified genomes relative to the specified block
915    IDs. Only blocks in which all of the genomes occur will produce output for
916    the alignment.
917    
918    =over 4
919    
920    =item blockIDs
921    
922    Reference to a list of the IDs of the blocks from which the alignment
923    should be constructed.
924    
925    =item genomeIDs
926    
927    Reference to a list of the genome IDs participating in the alignment.
928    
929    =item indels (optional)
930    
931    C<1> if columns containing insertion characters (C<->) should be included
932    in the alignment, else C<0>. The default is C<0>.
933    
934    =item RETURN
935    
936    Returns a hash that maps each genome ID to its sequence in the alignment.
937    
938    =back
939    
940    =cut
941    #: Return Type %;
942    sub GetAlignment {
943        # Get the parameters.
944        my ($self, $blockIDs, $genomeIDs, $indels) = @_;
945        # Create the return hash. We will start with a null string for each
946        # genome, and add the alignment data one block at a time.
947        my %retVal = ();
948        for my $genomeID (@{$genomeIDs}) {
949            $retVal{$genomeID} = "";
950        }
951        # Get the number of genomes. This will help us in determining which
952        # blocks to keep.
953        my $totalGenomes = @{$genomeIDs};
954        # Loop through the blocks.
955        for my $blockID (@{$blockIDs}) {
956            # Get the block's data.
957            my $blockData = $self->GetEntity('GroupBlock', $blockID);
958            my ($blockName) = $blockData->Value('GroupBlock(description)');
959            Trace("Processing common block $blockID ($blockName).") if T(Alignment => 4);
960            # Only proceed if the block has real variance in it.
961            my ($snipCount) = $blockData->Value('GroupBlock(snip-count)');
962            if ($snipCount > 0) {
963                # Get this block's regions in the specified genomes.
964                my %regionHash = $self->GetRegions($blockID, $genomeIDs);
965                # Verify that each genome is represented.
966                my %blockHash = ();
967                my $genomeCount = 0;
968                for my $region (keys %regionHash) {
969                    # Get the genome ID from the region string.
970                    $region =~ m/^([^:]+):/;
971                    # If it's new, count it.
972                    if (! exists $blockHash{$1}) {
973                        $blockHash{$1} = $regionHash{$region};
974                        $genomeCount++;
975                    }
976                }
977                # At this point, if the number of genomes found matches the
978                # size of the list, this block is good.
979                if ($genomeCount == $totalGenomes) {
980                    if (! $indels) {
981                        # If indels are off, we need to remove some of the columns. We do
982                        # this by finding C<-> characters in each genome's DNA sequence,
983                        # and deleting all instances of the relevant columns.
984                        for my $genomeID (@{$genomeIDs}) {
985                            Trace("Hyphen check for genome $genomeID.") if T(Alignment => 3);
986                            while ((my $pos = index($blockHash{$genomeID},"-")) >= 0) {
987                                # Here we found an insertion character. We delete its column
988                                # from all the genomes in the alignment.
989                                Trace("Hyphen found at position $pos in genome $genomeID.") if T(Alignment => 4);
990                                for my $genomeID2 (@{$genomeIDs}) {
991                                    substr $blockHash{$genomeID2}, $pos, 1, "";
992                                }
993                            }
994                        }
995                    }
996                    Trace("Common block $blockID ($blockName) added to alignment.") if T(Alignment => 3);
997                    for my $genomeID (@{$genomeIDs}) {
998                        $retVal{$genomeID} .= $blockHash{$genomeID};
999                    }
1000                }
1001            }
1002        }
1003        # Return the result.
1004        return %retVal;
1005    }
1006    
1007    =head3 DistanceMatrix
1008    
1009    C<< my %distances = SimBlocks::DistanceMatrix(\%sequences, \%distances); >>
1010    
1011    Compute the distances between the sequences in an alignment. the L</SequenceDistance>
1012    method is used to compute the individual distances.
1013    
1014    =over 4
1015    
1016    =item sequences
1017    
1018    Reference to a hash of genome IDs to alignment sequences. The alignment sequences may
1019    only contain the characters C<AGCT->, and must all be the same length.
1020    
1021    =item distances
1022    
1023    Reference to a hash that maps pairs of letters to distance values. Each letter pair
1024    must be represented as a two-character string. Dual strings such as "AG" and "GA" must
1025    map to the same distance. If no distance matrix is specified, the default distance
1026    matrix is used.
1027    
1028    =item RETURN
1029    
1030    Returns a hash that maps genome ID pairs to distance values. The ID pairs are
1031    formed using a slash. For example, the distance from C<158878.1> to C<282459.1>
1032    would be found attached to the key C<158878.1/282459.1>.
1033    
1034    =back
1035    
1036    =cut
1037    #: Return Type %
1038    sub DistanceMatrix {
1039        # Get the parameters.
1040        my ($sequences, $distances) = @_;
1041        # Declare the return variable.
1042        my %retVal = ();
1043        # Create a sorted list of the genome IDs.
1044        my @genomes = sort keys %{$sequences};
1045        # Loop through the genome IDs.
1046        for (my $i = 0; $i <= $#genomes; $i++) {
1047            my $genomei = $genomes[$i];
1048            # Create the 0-distance pair.
1049            $retVal{"$genomei/$genomei"} = 0;
1050            # Loop through the genomes lexically after this one. We'll
1051            # generate one distance and use it for both orderings.
1052            for (my $j = $i + 1; $j <= $#genomes; $j++) {
1053                my $genomej = $genomes[$j];
1054                # Compute the distance.
1055                my $distance = SequenceDistance($sequences->{$genomei}, $sequences->{$genomej},
1056                                                $distances);
1057                Trace("Distance from $genomei to $genomej is $distance.") if T(Distance => 3);
1058                # Store it in the return hash.
1059                $retVal{"$genomei/$genomej"} = $distance;
1060                $retVal{"$genomej/$genomei"} = $distance;
1061            }
1062        }
1063        # Return the result matrix.
1064        return %retVal;
1065    }
1066    
1067    =head3 SequenceDistance
1068    
1069    C<< my $dist = SimBlocks::SequenceDistance($seq1, $seq2, $distances); >>
1070    
1071    Return the distance between two sequences. The distance presumes that
1072    each alignment is a vector of sorts, with purines (C<A>/C<T>) and pyrmidines (C<G>/C<C>)
1073    being a half-unit from themselves and a full unit from each other. The distance is normalized
1074    by the number of positions in the alignment. Thus, C<AATGCTT> is 0.6267 from C<ATTTGCA>.
1075    The fourth and sixth positions are a full-unit jump and the second, fifth, and seventh
1076    positions are half-unit jumps. This works out to
1077    
1078        sqrt(0 + 0.5^2 + 0 + 1^2 + 0.5^2 + 1^2 + 0.5^2)/sqrt(7)
1079    
1080    which is approximately 0.6267. Insertion characters (C<->) are considered a distance of
1081    0 from everything. This distance model can be overridden by supplying a custom distance
1082    matrix as the third parameter.
1083    
1084    =over 4
1085    
1086    =item seq1
1087    
1088    Origin sequence for computing the distance.
1089    
1090    =item seq2
1091    
1092    End sequence for computing the distance. Both sequences must be the same length.
1093    
1094    =item distances
1095    
1096    Reference to a hash that maps pairs of letters to distance values. Each letter pair
1097    must be represented as a two-character string. Dual strings such as "AG" and "GA" must
1098    map to the same distance. If no distance matrix is specified, the default distance
1099    matrix is used.
1100    
1101    =back
1102    
1103    =cut
1104    #: Return Type $;
1105    sub SequenceDistance {
1106        # Get the parameters.
1107        my ($seq1, $seq2, $distances) = @_;
1108        if (!defined $distances) {
1109            $distances = DefaultDistances();
1110        }
1111        # Declare the return value. Initially, this will be a sum of squares. We'll
1112        # take the square root and divide out the length later.
1113        my $retVal = 0;
1114        # Loop through the sequences, character by character, comparing.
1115        my $n = length $seq1;
1116        for (my $i = 0; $i < $n; $i++) {
1117            my $s1 = substr $seq1, $i, 1;
1118            my $s2 = substr $seq2, $i, 1;
1119            my $step = $distances->{"$s1$s2"};
1120            Trace("No step distance found for \"$s1$s2\".") if !defined $step && T(Distance => 1);
1121            $retVal += $step * $step;
1122        }
1123        # Divide by the length and take the square root.
1124        $retVal = sqrt($retVal / $n);
1125        # Return the result.
1126        return $retVal;
1127    }
1128    
1129    =head3 GetBlock
1130    
1131    C<< my $blockData = $simBlocks->GetBlock($blockID); >>
1132    
1133    Return a B<DBObject> for a specified group block.
1134    
1135    =over 4
1136    
1137    =item blockID
1138    
1139    ID of the desired block.
1140    
1141    =item RETURN
1142    
1143    Returns a B<DBObject> for the group block in question. The object allows access to
1144    all of the fields in the C<GroupBlock> relation of the database.
1145    
1146    =back
1147    
1148    =cut
1149    #: Return Type $%;
1150    sub GetBlock {
1151        # Get the parameters.
1152        my ($self, $blockID) = @_;
1153        # Get the group block.
1154        my $retVal = $self->GetEntity('GroupBlock', $blockID);
1155        # Return the result.
1156        return $retVal;
1157    }
1158    
1159    =head3 GetBlockPieces
1160    
1161    C<< my @blocks = $blockData->GetBlockPieces($location); >>
1162    
1163    Return a map of the block pieces inside the specified location. The return
1164    value will be a list of block locations. A block location is essentially a
1165    location string with a block ID in place of the contig ID.
1166    
1167    This routine depends upon the fact that there is no overlap between blocks.
1168    A given nucleotide on a Contig is in exactly one block. Therefore, to find
1169    all the blocks that overlap a specific section of the Contig, we find the
1170    first block that ends after the start of the section and proceed until we
1171    find the last block that begins before the end of the section. The only
1172    complexity that occurs is if the Contig folds back in upon itself and
1173    appears twice in the same block. In this case, we might get two segments from
1174    the same block, and they may or may not overlap.
1175    
1176    =over 4
1177    
1178    =item location
1179    
1180    A basic location object or location string defining the range whose block pieces
1181    are desired.
1182    
1183    =item RETURN
1184    
1185    Returns a list of block location strings. Each string consists of a block ID
1186    followed by a start location and an end location. The three pieces of the
1187    string are separated by underscores (C<_>).
1188    
1189    =back
1190    
1191    =cut
1192    #: Return Type %;
1193    sub GetBlockPieces {
1194        # Get the parameters.
1195        my ($self, $location) = @_;
1196        # Declare the return variable.
1197        my @retVal = ();
1198        # Parse the incoming location.
1199        my $loc = BasicLocation->new($location);
1200        # Determine the parameters needed to get the desired list of regions.
1201        my ($left, $right, $contigID) = ($loc->Left, $loc->Right, $loc->Contig);
1202        Trace("Searching for regions near " . $loc->String) if T(4);
1203        # Ask for the regions in the section we want.
1204        my @regions = $self->GetAll(['Region', 'IncludesRegion', 'GroupBlock'],
1205                                    'Region(endpoint) >= ? AND Region(position) <= ? AND Region(contigID) = ?',
1206                                    [$left, $right, $contigID],
1207                                    ['GroupBlock(id)', 'GroupBlock(pattern)',
1208                                     'GroupBlock(len)',
1209                                     'Region(position)', 'Region(endpoint)',
1210                                     'Region(content)']);
1211        # Loop through the regions found. For each region we will output a location string.
1212        for my $regionData (@regions) {
1213            # Declare the variable to contain the formatted location.
1214            my $blockLoc;
1215            # Separate the fields from the region data.
1216            my ($blockID, $pattern, $len, $position, $end, $content) = @{$regionData};
1217            # Determine whether we're using the full region or only part of it.
1218            if ($position >= $left && $end <= $right) {
1219                # Here we're using the full region, so the location is simple.
1220                $blockLoc = "${blockID}_1_${len}";
1221            } else {
1222                # This is the hard part. We need to convert the contig positions to
1223                # block positions. Since the block contains insertion characters and
1224                # the regions don't, this requires fancy dancing. First, we get the
1225                # region's DNA string.
1226                my $dnaSequence = TagDNA($pattern, $content);
1227                # Let's call the "area of interest" the part of the Contig that
1228                # is in the incoming location AND in the current region. This could
1229                # be at either end of the region or somewhere in the middle. We
1230                # need to walk the block DNA to find what positions in the block
1231                # correspond to the desired positions in the contig. First, we
1232                # put ourselves at the beginning of the block and the region,
1233                # and walk to the start point. Note that our position arithmetic
1234                # is 1-based, in keeping with the conventions of biologists.
1235                my $blockPos = 1;
1236                my $contigPos = $position;
1237                # Check to see if we can use the current position as are starting
1238                # point or we need to find it.
1239                if ($left > $position) {
1240                    # Here we need to find the left edge of the area of interest.
1241                    $blockPos = WalkDNA($blockPos, $contigPos, $dnaSequence, $left);
1242                    # Update the contig position to match.
1243                    $contigPos = $left;
1244                }
1245                # Save the left edge.
1246                my $blockLeft = $blockPos;
1247                # Now find the right edge.
1248                if ($right >= $end) {
1249                    # Here the right edge is the end of the block.
1250                    $blockPos = $len;
1251                } else {
1252                    # Here we have to find the right edge of the area of interest.
1253                    $blockPos = WalkDNA($blockPos, $contigPos, $dnaSequence, $right);
1254                }
1255                my $blockRight = $blockPos;
1256                # Format the location.
1257                $blockLoc = "${blockID}_${blockLeft}_${blockRight}";
1258            }
1259            # Push the block location onto the return list.
1260            push @retVal, $blockLoc;
1261        }
1262        # Return the result.
1263        return @retVal;
1264    }
1265    
1266    =head3 GetFeatureBlockPieces
1267    
1268    C<< my @pieces = $simBlocks->GetFeatureBlockPieces($fig, \@featureIDs, $distance); >>
1269    
1270    Return a list of the block pieces within the specified distance of the
1271    specified features. This method essentially computes locations from
1272    the distance and the feature locations, then passes them to L/GetBlockPieces>.
1273    That method will return the sections of various similarity blocks that
1274    occur inside the the locations computed.
1275    
1276    =over 4
1277    
1278    =item fig
1279    
1280    A FIG-like object that can be used to convert features into locations.
1281    
1282    =item featureIDs
1283    
1284    Reference to a list of feature IDs. Blocks in the vicinity of any of the
1285    features are returned.
1286    
1287    =item distance
1288    
1289    Distance to search from the feature's actual location.
1290    
1291    =item RETURN
1292    
1293    Returns a list of block piece location objects. A block piece location object
1294    is a standard B<BasicLocation> object with a block ID instead of a contig ID.
1295    
1296    =back
1297    
1298    =cut
1299    #: Return Type @;
1300    sub GetFeatureBlockPieces {
1301        # Get the parameters.
1302        my ($self, $fig, $featureIDs, $distance) = @_;
1303        # Declare a hash for keeping the return data. This will weed out
1304        # duplicates, of which we expect plenty. We'll need to handle
1305        # overlaps later, however.
1306        my %retHash = ();
1307        # Loop through the features.
1308        for my $featureID (@{$featureIDs}) {
1309            # Get the feature's genome ID.
1310            my $genomeID = FIG::genome_of($featureID);
1311            # Get the feature's locations.
1312            my @locations = $fig->feature_location($featureID);
1313            # Loop through the locations, sticking the resulting pieces in the return
1314            # hash.
1315            for my $loc (@locations) {
1316                my $locObject = BasicLocation->new($loc);
1317                # Get the length of the contig in question.
1318                my $len = $fig->contig_ln($genomeID, $locObject->Contig);
1319                # Expand the location.
1320                $locObject->Widen($distance, $len);
1321                # Insure the location is Sprout-style;
1322                $locObject->FixContig($genomeID);
1323                # Get the desired block pieces.
1324                my @pieces = $self->GetBlockPieces($locObject);
1325                Trace(scalar(@pieces) . " pieces found for location $loc.") if T(4);
1326                # Put them in the hash.
1327                for my $piece (@pieces) {
1328                    $retHash{$piece} = 1;
1329                }
1330            }
1331        }
1332        # Now we have all the block pieces that occur in any one of our regions. The
1333        # next step is to merge adjacent and overlapping regions. First we convert
1334        # the pieces to location objects so we can sort them.
1335        my @retVal = ();
1336        for my $piece (keys %retHash) {
1337            my $loc = BasicLocation->new($piece);
1338            push @retVal, $loc;
1339        }
1340        Trace("Beginning sort.") if T(3);
1341        @retVal = sort { BasicLocation::Cmp($a,$b) } @retVal;
1342        Trace(scalar(@retVal) . " pieces found before overlap check.") if T(3);
1343        # Now the locations are sorted by block ID, start position, and descending
1344        # length. This means that if there's an overlap, the two overlapping
1345        # pieces will be next to each other.
1346        my $i = 0;
1347        while ($i < $#retVal) {
1348            # Check for overlap between this location and the next.
1349            my ($type, $len) = Overlap::CheckOverlap($retVal[$i], $retVal[$i+1]);
1350            if ($len == 0) {
1351                # Here there is no overlap, so we push ahead.
1352                $i++;
1353            } elsif ($type eq 'embedded') {
1354                # Here the second piece is inside the first, so we delete the
1355                # second.
1356                delete $retVal[$i+1];
1357            } else {
1358                # Here we have a normal overlap. Because all of our pieces
1359                # are forward locations, this means the second piece starts
1360                # before the end of the of the first. We set the end point
1361                # if the first to the end point of the second and delete the
1362                # second.
1363                $retVal[$i]->SetEnd($retVal[$i+1]->EndPoint);
1364                delete $retVal[$i+1];
1365            }
1366        }
1367        # Return the result.
1368        return @retVal;
1369    }
1370    
1371    =head3 WalkDNA
1372    
1373    C<< my $blockPos = SimBlocks::WalkDNA($blockPos, $contigPos, $dna, $loc); >>
1374    
1375    Location the desired position within a block of a specified location in a contig.
1376    
1377    The idea here is that a block contains insertion characters and a contig doesn't.
1378    The incoming DNA sequence is for the block. On entry, we have a current position
1379    in the block and a current position in the contig. We advance the two pointers
1380    in parallel until the contig pointer equals the desired location.
1381    
1382    =over 4
1383    
1384    =item blockPos
1385    
1386    Current position in the block (1-based).
1387    
1388    =item contigPos
1389    
1390    Current position in the contig.
1391    
1392    =item dna
1393    
1394    DNA string for the block with the points of variance filled in with data from
1395    the contig. In other word's the block's version of the current contig region.
1396    
1397    =item loc
1398    
1399    Desired location in the contig.
1400    
1401    =item RETURN
1402    
1403    Returns the position in the block corresponding to the desired position in the
1404    contig.
1405    
1406    =back
1407    
1408    =cut
1409    #: Return Type $;
1410    sub WalkDNA {
1411        # Get the parameters.
1412        my ($blockPos, $contigPos, $dna, $loc) = @_;
1413        # Copy the incoming positions into variables with which we can play. While
1414        # we're at it, we'll convert the block position from 1-based to 0-based.
1415        my ($retVal, $contigPtr) = ($blockPos - 1, $contigPos);
1416        # Loop until we find our destination point.
1417        while ($contigPtr < $loc) {
1418            # Find the next insertion character in the DNA string.
1419            my $nextHyphen = index $dna, '-', $retVal;
1420            # At this point, "nextHyphen" is either -1 (indicating no hyphen
1421            # found) or it is positioned on a hyphen. If it's -1, move it after
1422            # the end of the dna string.
1423            if ($nextHyphen < 0) {
1424                $nextHyphen = length $dna;
1425            }
1426            # Now, if we subtract $retVal from $nextHyphen, we have the largest
1427            # value by which we can advance the contig pointer without getting
1428            # out of sync. The first step is to see if this would move us past
1429            # our desired location.
1430            my $offset = $nextHyphen - $retVal;
1431            my $contigNext = $contigPtr + $offset;
1432            if ($contigNext >= $loc) {
1433                # We now know that everything between the current position and
1434                # the desired position is real nucleotides. We simply push both
1435                # pointers forward by the same amount so that the contig pointer
1436                # becomes $loc.
1437                $retVal += $loc - $contigPtr;
1438                $contigPtr = $loc
1439            } else {
1440                # Here we need to go the whole distance to the next hyphen and
1441                # keep going.
1442                $contigPtr += $offset;
1443                $retVal += $offset;
1444                $retVal++;
1445            }
1446        }
1447        # Return the result. Note we convert back to 1-based.
1448        return $retVal + 1;
1449    }
1450    
1451  1;  1;

Legend:
Removed from v.1.1  
changed lines
  Added in v.1.7

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3