[Bio] / Sprout / SimBlocks.pm Repository:
ViewVC logotype

Diff of /Sprout/SimBlocks.pm

Parent Directory Parent Directory | Revision Log Revision Log | View Patch Patch

revision 1.1, Wed May 4 03:24:43 2005 UTC revision 1.2, Thu Jun 9 19:06:55 2005 UTC
# Line 3  Line 3 
3      require Exporter;      require Exporter;
4      use ERDB;      use ERDB;
5      @ISA = qw(Exporter ERDB);      @ISA = qw(Exporter ERDB);
6      @EXPORT = qw();      @EXPORT = qw(DefaultDistances);
7      @EXPORT_OK = qw();      @EXPORT_OK = qw(BlocksInSet GetBlock CompareGenomes DBLoad DistanceMatrix GetAlignment GetRegions ParsePattern RemoveBlocks SequenceDistance SetNumber SnipScan);
8    
9      use strict;      use strict;
10      use Tracer;      use Tracer;
# Line 23  Line 23 
23  by the C<SimBlocksDBD.xml> file.  by the C<SimBlocksDBD.xml> file.
24    
25  The primary entities of the database are B<Genome>, which represents a specified  The primary entities of the database are B<Genome>, which represents a specified
26  organism, B<Group>, which represents a set of similar regions, and B<Contig>,  organism, B<GroupBlock>, which represents a set of similar regions, and B<Contig>,
27  which represents a contiguous segment of DNA. The key relationship is  which represents a contiguous segment of DNA. The key relationship is
28  B<ContainsRegionIn>, which maps groups to contigs. The mapping information  B<ContainsRegionIn>, which maps blocks to contigs. The mapping information
29  includes the DNA nucleotides as well as the location in the contig.  includes the DNA nucleotides as well as the location in the contig.
30    
31  For a set of genomes, we will define a I<common group> as a group that  =head2 Formal Definitions
32  is present in most of the genomes. (The definition of I<most> is  
33  determined by a parameter.)  Let {B<G(1)>, B<G(2)>, ..., B<G(>I<n>B<)>} be a set of genomes
34    and let {B<g(1)>, B<g(2)>, ..., B<g(>I<m>B<)>} be the set of
35    genes called on these genomes.
36    
37    The primitive notion from which similarity blocks are derived is the
38    I<correspondence mapping>. Roughly, this mapping connects genes on
39    one genome to corresponding genes on other genomes. We will treat
40    the mapping as an equivalence class. In practice, the entire mapping
41    will not be available when we are building to load files for the
42    similarity block database. Instead, the user will provide a subset
43    of the mapping from which the entire partitioning can be deduced
44    via commutativity, reflexivity, and transitivity. This subset is
45    called the I<correspondence relation>.
46    
47    In actual practice, the correspondence relation will contain some
48    false positives. The process of developing the correspondence
49    relation usually involves analyzing similarities above a certain
50    threshhold. (This type of similarity is not necessarily transitive;
51    however, the true, underlying correspondence mapping we want I<is>
52    transitive.) Often, a sequence of correspondence pairs that maps
53    a gene on one particular genome to another gene on the same genome
54    is an indication of a false positive.
55    
56    In addition to false positives, we may have invalid genes, missing
57    genes, frame shifts, and incorrect start locations. Detecting and
58    cleaning these problems is an important part of getting a good
59    correspondence relation.
60    
61    Given a full-blown correspondence mapping, we define the following
62    concepts.
63    
64    =over 4
65    
66    =item corresponding gene
67    
68    The I<corresponding genes> of a given gene B<g(>I<i>B<)> are those
69    genes mapped to it by the correspondence mapping.
70    
71    =item gene block
72    
73    A I<gene block> is a maximal set of corresponding genes.
74    
75    =item intergenic region
76    
77    An I<intergenic region> is a section of the contig between
78    two adjacent genes. The adjacent genes are called the
79    I<left gene> and the I<right gene>, depending on whether they
80    precede or follow the region. For example, if C<G2_100+50>
81    and C<G2_175+100> are adjacent, C<G2_150+25> is their
82    intergenic region, the left gene is C<G2_100+50>, and
83    the right gene is C<G2_175+100>.
84    
85    =item intergenic block
86    
87    An I<intergenic block> is a maximal set of intergenic regions such
88    that (1) the left genes all correspond and have the same orientation,
89    (2) the right genes all correspond and have the same orientation, and
90    (3) the region lengths differ by no more than a specified threshhold
91    (usually 10%).
92    
93    =item similarity block
94    
95    A I<similarity block> is either a gene block or an intergenic block.
96    A similarity block is therefore a set of regions (either gene regions
97    or intergenic regions) that are tied together by the correspondence
98    mapping.
99    
100    =item common block
101    
102    A I<common block> with respect to a subset of the genomes
103    {B<G(>I<i1>B<)>, B<G(>I<i2>B<)>, ... B<G(>I<ik>B<)>} is a
104    similarity block that has exactly one region in each
105    of the genomes of the subset. There is a lesser notion of
106    a block that occurs in most genomes in the subset. The actual
107    program allows the notion of I<most> to be defined externally.
108    The default is all of the genomes in the subset.
109    
110    =item unique sequence
111    
112    A I<unique sequence> is a region that does not appear in any
113    similarity block with another region. In other words, a unique
114    sequence is a region that does not correspond to any other region.
115    
116    =back
117    
118    =head2 Usage
119    
120  The similarity block database must be able to answer two questions, taking  The similarity block database must be able to answer two questions, taking
121  as input two sets of genomes.  as input two genome sets. (These would be subsets of the total set of genomes
122    represented in the correspondence mapping.)
123    
124  =over 4  =over 4
125    
126  =item (1) Which groups are common in both sets and which are common in only one or the other?  =item (1) Which blocks are common in one set and do not occur in the other.
127    
128  =item (2) For each group that is common in both sets, how do the individual regions differ?  These are considered major variations. They result from prophage insertions,
129    transposition events, and recombinations. A major variation results in a
130    different set of proteins in an operating cell.
131    
132    =item (2) What individual variations in the shared common blocks distinguish the two sets?
133    
134    These are considered minor variations, and are the most common result of
135    single-nucleotide mutations. Rather than changing the set of proteins,
136    minor variations change the proteins themselves.
137    
138  =back  =back
139    
140  This class is a subclass of B<ERDB>, and inherits all of its public methods.  This class is a subclass of B<ERDB>, and inherits all of its public methods.
141    
142    =head2 Setup
143    
144    To begin working with similarity blocks, you must first create a database to
145    contain the block data. Next, you must specify the various similarity block
146    parameters in C<FIG_Config>. These are as follows. (Of course, the values
147    you specify will be different.)
148    
149        $simBlocksDB     = "conser5_blocks";    # SimBlocks database schema name
150        $simBlocksData   = "$data/BlockData";   # directory containing SimBlocks data files
151                                                # (used to load the Similarity database)
152        $simBlastData    = "$data/BlastData";   # directory containing SimBlast data files
153                                                # (input to SimBlast.pl)
154    
155    Run the C<SimBlast> script to load the database.
156    
157    The C<SimBlockForm.cgi> service enables you to run a similarity analysis between genome
158    sets in the database. The C<Html/ERDBTest.html> page enables you to run general queries
159    against the database and execute simple scripts (however, this latter capability is
160    disabled unless you have
161    
162        $debug_mode      = 1;                   # TRUE to enable the debugging scripts
163    
164    The Similarity Block methods and scripts work in the Windows version of the SEED (see
165    C<WinBuild>).
166    
167  =cut  =cut
168    
169  #: Constructor SimBlocks->new();  #: Constructor SimBlocks->new();
# Line 62  Line 181 
181  will be used. The user name and password for connecting are taken from  will be used. The user name and password for connecting are taken from
182  C<FIG_Config::dbuser> and C<FIG_Config::dbpass>.  C<FIG_Config::dbuser> and C<FIG_Config::dbpass>.
183    
184    In almost every case, you will be calling
185    
186    C<< my $simdb = SimBlocks->new(); >>
187    
188  =over 4  =over 4
189    
190  =item dbname (optional)  =item dbname (optional)
# Line 106  Line 229 
229      return $retVal;      return $retVal;
230  }  }
231    
232    =head3 DefaultDistances
233    
234    C<< my $distances = DefaultDistances(); >>
235    
236    Return the default distance matrix for computing the alignment distances (see
237    also L</DistanceMatrix>. This matrix returns a distance of 0 between an insertion
238    character (C<->) and any other nucleotide, a distance of 0.5 between nucleotides
239    of the same type, and a distance of 1 between nucleotides of different types.
240    
241    =cut
242    
243    my $DefaultDistanceMatrix = { AA  => 0,      AC  => 1,     AG  => 0.5,   AT  => 1,
244                                  AN  => 0.625,  AR  => 0.25,  AY  => 1,    'A-' => 0,
245                                  CA  => 1,      CC  => 0,     CG  => 1,     CT  => 0.5,
246                                  CN  => 0.625,  CR  => 1,     CY  => 0.25, 'C-' => 0,
247                                  GA  => 0.5,    GC  => 1,     GG  => 0,     GT  => 1,
248                                  GN  => 0,      GR  => 0.25,  GY  => 1,    'G-' => 0,
249                                  TA  => 1,      TC  => 0.5,   TG  => 1,     TT  => 0,
250                                  TN  => 0,      TR  => 1,     TY  => 0.25, 'T-' => 0,
251                                  RA  => 0.25,   RC  => 1,     RG  => 0.25,  RT  => 1,
252                                  RN  => 0.625,  RR  => 0.25,  RY  => 1,    'R-' => 0,
253                                  YA  => 1,      YC  => 0.25,  YG  => 1,     YT  => 0.25,
254                                  YN  => 0.625,  YR  => 1,     YY  => 0.25, 'Y-' => 0,
255                                  NA  => 0.625,  NC  => 0.625, NG  => 0.625, NT  => 0.625,
256                                  NN  => 0.625,  NR  => 0.625, NY  => 0.625,'N-' => 0,
257                                 '-A' => 0,     '-C' => 0,    '-G' => 0,    '-T' => 0,
258                                 '-N' => 0,     '-R' => 0,    '-Y' => 0,    '--' => 0 };
259    
260    sub DefaultDistances {
261        return $DefaultDistanceMatrix;
262    }
263    
264  =head3 DBLoad  =head3 DBLoad
265    
266  C<< my $stats = $simBlocks->DBLoad($rebuild); >>  C<< my $stats = $simBlocks->DBLoad($rebuild); >>
# Line 119  Line 274 
274  =item rebuild  =item rebuild
275    
276  TRUE if the tables should be re-created; FALSE if the tables should be cleared and  TRUE if the tables should be re-created; FALSE if the tables should be cleared and
277  reloaded. The default is FALSE; however, TRUE is considerably faster.  reloaded. The default is FALSE; however, TRUE is considerably faster. You would use
278    FALSE if you had added non-standard indexes to the database and didn't want to lose
279    them.
280    
281  =item RETURN  =item RETURN
282    
# Line 141  Line 298 
298    
299  =head3 CompareGenomes  =head3 CompareGenomes
300    
301  C<< my (\%set1Blocks, \%set2Blocks, \%bothBlocks) = $simBlocks->CompareGenomes(\@set1, \@set2); >>  C<< my (\%set0Blocks, \%set1Blocks, \%bothBlocks) = $simBlocks->CompareGenomes(\@set0, \@set1); >>
302    
303  Analyze two sets of genomes for commonalities. The group blocks returned will be divided  Analyze two sets of genomes for commonalities. The group blocks returned will be divided
304  into three hashes: one for those found only in set 1, one for those found only in set 2,  into three hashes: one for those common to set 0 and not occurring at all in set 1, one
305  and one for those found in both sets. Each hash is keyed by group ID and will contain  for those common to set 1 and not occurring at all in set 0, and one for those common
306  B<DBObject>s for B<GroupBlock> records with B<HasInstanceOf> data attached, though the  to both sets. Each hash is keyed by group ID and will contain B<DBObject>s for
307  genome ID in the B<HasInstanceOf> section is not generally predictable.  B<GroupBlock> records with B<HasInstanceOf> data attached, though the genome ID in
308    the B<HasInstanceOf> section is not generally predictable.
309    
310  =over 4  =over 4
311    
312  =item set1  =item set0
313    
314  Reference to a list of genome IDs for the genomes in the first set.  Reference to a list of genome IDs for the genomes in the first set (set 0).
315    
316  =item set2  =item set1
317    
318  Reference to a list of genome IDs for the genomes in the second set.  Reference to a list of genome IDs for the genomes in the second set (set 1).
319    
320  =item RETURN  =item RETURN
321    
322  Returns a list of hashes of lists. Each hash is keyed by group ID, and will contain  Returns a triple of hashes. Each hash is keyed by group ID, and will contain
323  B<DBObject>s for records in the B<GroupBlock> table. Groups found only in set  B<DBObject>s for records in the B<GroupBlock> table. Groups found in all of the
324  1 will be in the first hash, groups found only in set 2 will be in the second  genomes in set 0 but in none of the genomes of set 1 will be in the first hash,
325  hash, and groups found in both sets will be in the third hash.  groups found in all of the genomes in set 1 but in none of the genomes of set 0
326    will be in the second hash, and groups found in all of the genomes of both sets
327    will be in the third hash.
328    
329  =back  =back
330    
# Line 172  Line 332 
332  #: Return Type @%;  #: Return Type @%;
333  sub CompareGenomes {  sub CompareGenomes {
334      # Get the parameters.      # Get the parameters.
335      my ($self, $set1, $set2) = @_;      my ($self, $set0, $set1) = @_;
336      # Declare the three hashes.      # Declare the three hashes.
337      my (%set1Blocks, %set2Blocks, %bothBlocks);      my (%set0Blocks, %set1Blocks, %bothBlocks);
338      # Our first task is to find the groups in each genome set.      # Our first task is to find the common groups in each genome set.
339        my %groups0 = $self->BlocksInSet($set0);
340      my %groups1 = $self->BlocksInSet($set1);      my %groups1 = $self->BlocksInSet($set1);
     my %groups2 = $self->BlocksInSet($set2);  
341      # Create a trailer key to help tighten the loops.      # Create a trailer key to help tighten the loops.
342      my $trailer = "\xFF";      my $trailer = "\xFF";
343      # The groups are hashed by group ID. We will move through them in key order      # The groups are hashed by group ID. We will move through them in key order
344      # to get the "set1", "set2", and "both" groups.      # to get the "set0", "set1", and "both" groups.
345        my @blockIDs0 = (sort keys %groups0, $trailer);
346      my @blockIDs1 = (sort keys %groups1, $trailer);      my @blockIDs1 = (sort keys %groups1, $trailer);
     my @blockIDs2 = (sort keys %groups2, $trailer);  
347      # Prime the loop by getting the first ID from each list. Thanks to the trailers      # Prime the loop by getting the first ID from each list. Thanks to the trailers
348      # we know this process can't fail.      # we know this process can't fail.
349        my $blockID0 = shift @blockIDs0;
350      my $blockID1 = shift @blockIDs1;      my $blockID1 = shift @blockIDs1;
351      my $blockID2 = shift @blockIDs2;      while ($blockID0 lt $trailer || $blockID1 lt $trailer) {
     while ($blockID1 lt $trailer || $blockID2 lt $trailer) {  
352          # Compare the block IDs.          # Compare the block IDs.
353          if ($blockID1 lt $blockID2) {          if ($blockID0 lt $blockID1) {
354              # Block 1 is only in the first set.              # Block 0 is only in the first set.
355                Trace("Block IDs $blockID0 vs. $blockID1. Set 0 selected.") if T(SimCompare => 3);
356                $set0Blocks{$blockID0} = $groups0{$blockID0};
357                $blockID0 = shift @blockIDs0;
358            } elsif ($blockID0 gt $blockID1) {
359                # Block 1 is only in the second set.
360                Trace("Block IDs $blockID0 vs. $blockID1. Set 1 selected.") if T(SimCompare => 3);
361              $set1Blocks{$blockID1} = $groups1{$blockID1};              $set1Blocks{$blockID1} = $groups1{$blockID1};
362              $blockID1 = shift @blockIDs1;              $blockID1 = shift @blockIDs1;
         } elsif ($blockID1 gt $blockID2) {  
             # Block 2 is only in the second set.  
             $set2Blocks{$blockID2} = $groups2{$blockID2};  
             $blockID2 = shift @blockIDs2;  
363          } else {          } else {
364              # We have a block in both sets.              # We have a block in both sets.
365                Trace("Block IDs $blockID0 vs. $blockID1. Both selected.") if T(SimCompare => 3);
366              $bothBlocks{$blockID1} = $groups1{$blockID1};              $bothBlocks{$blockID1} = $groups1{$blockID1};
367                $blockID0 = shift @blockIDs0;
368              $blockID1 = shift @blockIDs1;              $blockID1 = shift @blockIDs1;
             $blockID2 = shift @blockIDs2;  
369          }          }
370      }      }
371        # The set1 and set2 lists are too big. They contain groups that are common in their
372        # respective sets but not common in both sets. We want groups that are common in
373        # their respective sets but completely absent in the other set. We do this by getting
374        # a complete list of the blocks in each set and deleting anything we find.
375        $self->RemoveBlocks(\%set0Blocks, $set1);
376        $self->RemoveBlocks(\%set1Blocks, $set0);
377      # Return the result.      # Return the result.
378      return (\%set1Blocks, \%set2Blocks, \%bothBlocks);      return (\%set0Blocks, \%set1Blocks, \%bothBlocks);
379    }
380    
381    =head3 RemoveBlocks
382    
383    C<< $simBlocks->RemoveBlocks(\%blockMap, \@set); >>
384    
385    Remove from the specified block map any blocks that occur in the specified set of genomes.
386    The block map can contain any data, but it must be keyed by block ID.
387    
388    =over 4
389    
390    =item blockMap
391    
392    Reference to a hash of block IDs to block data. The character of the block data is
393    irrelevant to this method.
394    
395    =item set
396    
397    Reference to a list of genome IDs. Any block that occurs in any one of the specified
398    genomes will be removed from the block map.
399    
400    =back
401    
402    =cut
403    #: Return Type ;
404    sub RemoveBlocks {
405        # Get the parameters.
406        my ($self, $blockMap, $set) = @_;
407        # Get a list of the blocks in the genome set.
408        my %setBlocks = $self->BlocksInSet($set, 1);
409        # Loop through the incoming block map, deleting any blocks found
410        # in the new block set. Note we make the assumption that the
411        # incoming block map is smaller.
412        my @blockList = keys %{$blockMap};
413        for my $blockID (@blockList) {
414            if (exists $setBlocks{$blockID}) {
415                delete $blockMap->{$blockID};
416            }
417        }
418  }  }
419    
420  =head3 BlocksInSet  =head3 BlocksInSet
421    
422  C<< my %blockList = $simBlocks->BlocksInSet($set); >>  C<< my %blockList = $simBlocks->BlocksInSet($set, $count); >>
423    
424  Return a list of the group blocks found in a given set of genomes. The list returned  Return a list of the group blocks found in a given number of the genomes in a given
425  will be a hash of B<DBObject>s, each corresponding to a single B<GroupBlock> record,  set. The list returned will be a hash of B<DBObject>s, each corresponding to a single
426  with a B<HasInstanceOf> record attached, though the content of the B<HasInstanceOf>  B<GroupBlock> record, with a B<HasInstanceOf> record attached, though the content of
427  record is not predictable. The hash will be keyed by group ID.  the B<HasInstanceOf> record is not predictable. The hash will be keyed by block ID.
428    
429  =over 4  =over 4
430    
431  =item set  =item set
432    
433  Reference to a list of genome IDs. All blocks appearing in any one of the genome  Reference to a list of genome IDs.
434  IDs will be in the list returned.  
435    =item count (optional)
436    
437    Minimum number of genomes from the set in which the block must appear. If C<1> is
438    specified, the list will include any block that occurs in at least one genome.
439    If C<5> is specified, the list will only include blocks that occur in at least
440    5 genomes from the set. If the set consists of only 4 genomes in this latter
441    case, no blocks will be returned. The default is to return blocks that occur in
442    all genomes of the set.
443    
444  =item RETURN  =item RETURN
445    
# Line 236  Line 452 
452  #: Return Type %%;  #: Return Type %%;
453  sub BlocksInSet {  sub BlocksInSet {
454      # Get the parameters.      # Get the parameters.
455      my ($self, $set) = @_;      my ($self, $set, $count) = @_;
456        # If the count was not specified, set it t
457        if (!defined $count) {
458            $count = @{$set};
459        }
460        Trace("BlocksInSet called for $count genomes of set (" . join(", ", @{$set}) . ").") if T(2);
461      # Create a hash to hold the groups found. The hash will be keyed by group ID      # Create a hash to hold the groups found. The hash will be keyed by group ID
462      # and contain the relevant DB object.      # and contain the relevant DB object.
463      my %retVal = ();      my %retVal = ();
464        # Create a hash to contain the number of genomes in which each block was found.
465        my %counters = ();
466      # Loop through the genome IDs in the specified set.      # Loop through the genome IDs in the specified set.
467      for my $genomeID (@{$set}) {      for my $genomeID (@{$set}) {
468          # Get a list of group blocks for this genome.          # Get a list of group blocks for this genome.
# Line 249  Line 472 
472          for my $block (@blocks) {          for my $block (@blocks) {
473              # Get the ID of this block.              # Get the ID of this block.
474              my ($blockID) = $block->Value('GroupBlock(id)');              my ($blockID) = $block->Value('GroupBlock(id)');
475              # Store it in the hash. If it's a duplicate, it will erase the              # Determine whether this block is new or has been found in a previous
476              # old one.              # genome of the set.
477                if (exists $counters{$blockID}) {
478                    # Here it's old. Increment its counter.
479                    $counters{$blockID}++;
480                } else {
481                    # Here it's new. Create a counter for it.
482                    $counters{$blockID} = 1;
483                }
484                # If it meets the threshhold, put it in the return hash. If it's already
485                # there, we'll simply overwrite the old copy, which makes no difference
486                # to the caller.
487                Trace("Block $blockID has $counters{$blockID} occurrences as of genome $genomeID.") if T(SimSet => 4);
488                if ($counters{$blockID} >= $count) {
489              $retVal{$blockID} = $block;              $retVal{$blockID} = $block;
490                    Trace("Block $blockID added to set.") if T(SimSet => 3);
491                }
492          }          }
493      }      }
494      # Return the result.      # Return the result.
# Line 291  Line 528 
528      my ($self, $blockID, $genomes) = @_;      my ($self, $blockID, $genomes) = @_;
529      # Create the return hash.      # Create the return hash.
530      my %retVal = ();      my %retVal = ();
531        # Count the regions found.
532        my $regionCount = 0;
533      # Query all the regions for the specified block.      # Query all the regions for the specified block.
534      my $query = $self->Get(['ContainsRegionIn'], "ContainsRegionIn(from-link) = ?", $blockID);      my $query = $self->Get(['IncludesRegion', 'Region'], "IncludesRegion(from-link) = ?",
535                               $blockID);
536      # Loop through the query.      # Loop through the query.
537      while (my $region = $query->Fetch) {      while (my $region = $query->Fetch) {
538          # Get this region's data.          # Get this region's data.
539          my ($contigID, $start, $dir, $len, $dna) =          my ($location, $dna) =
540              $region->Values(['ContainsRegionIn(to-link)', 'ContainsRegionIn(position)',              $region->Values(['Region(id)', 'Region(content)']);
                              'ContainsRegionIn(direction)', 'ContainsRegionIn(len)',  
                              'ContainsRegionIn(content)']);  
541          # Insure it belongs to one of our genomes.          # Insure it belongs to one of our genomes.
542          my $genomeID = Genome::FromContig($contigID);          my $found = SetNumber($location, $genomes);
         my $found = grep($_ eq $genomeID, @{$genomes});  
543          # If it does, put it in the hash.          # If it does, put it in the hash.
544          if ($found) {          if ($found >= 0) {
             my $location = "${contigID}_$start$dir$len";  
545              $retVal{$location} = $dna;              $retVal{$location} = $dna;
546                $regionCount++;
547          }          }
548      }      }
549        Trace("$regionCount regions found by GetRegions.") if T(2);
550      # Return the result.      # Return the result.
551      return %retVal;      return %retVal;
552  }  }
553    
554    =head3 SetNumber
555    
556    C<< my $setNumber = SimBlocks::SetNumber($contigRegion, \@set0, \@set1, ..., \@setN); >>
557    
558    Examine a region string, contig ID, or genome ID, and return the number of the genome
559    set to which it belongs.
560    
561    =over 4
562    
563    =item contigRegion
564    
565    A region string, contig ID, or genome ID. Because the contig ID is prefixed by the genome ID
566    delimited by a colon symbol (C<:>), we split at the first colon to get the desired ID.
567    
568    =item set0, set1, ..., setN
569    
570    A list of references to genome ID lists. Essentially, we will return the index of the
571    first incoming list that has the specified genome ID in it.
572    
573    =item RETURN
574    
575    The index in the list of sets of the set containing the relevant genome, or -1 if
576    the genome was not found.
577    
578    =back
579    
580    =cut
581    #: Return Type $
582    sub SetNumber {
583        # Get the parameters.
584        my ($contigRegion, @sets) = @_;
585        # Extract the genome ID.
586        my ($genomeID) = split /:/, $contigRegion;
587        # Clear the return variable. A negative number means nothing found.
588        my $retVal = -1;
589        # Loop until we find the genome.
590        for (my $i = 0; $retVal < 0 && $i <= $#sets; $i++) {
591            my $set = $sets[$i];
592            if (grep /^$genomeID$/, @{$set}) {
593                $retVal = $i;
594            }
595        }
596        # Return the result.
597        return $retVal;
598    }
599    
600  =head3 TagDNA  =head3 TagDNA
601    
602  C<< my $taggedDNA = SimBlocks::TagDNA($pattern, $dnaString, $prefix, $suffix); >>  C<< my $taggedDNA = SimBlocks::TagDNA($pattern, $dnaString, $prefix, $suffix); >>
# Line 378  Line 662 
662          # Start past the question mark.          # Start past the question mark.
663          $start = $position + 1;          $start = $position + 1;
664      }      }
665        # Tack on the residual.
666        $retVal .= substr $pattern, $start;
667      # Return the result.      # Return the result.
668      return $retVal;      return $retVal;
669  }  }
670    
671    =head3 SnipScan
672    
673    C<< my %positions = $simBlocks->SnipScan($blockObject, \@set0, \@set1); >>
674    
675    Examine the specified block and return a list of the positions at which the
676    nucleotide values for regions in the first genome set differ from the values
677    for regions in the second genome set. This conditions is only satisfied if
678    the nucleotides occurring in the first set's regions never occur in the
679    second set's regions. This is a very strict condition. If, for example, C<A>
680    occurs in one of the regions for the first set, it cannot occur in any of the
681    regions for the second set.
682    
683    =over 4
684    
685    =item blockObject
686    
687    A C<DBObject> representing the B<GroupBlock> record for the desired block or
688    the actual ID of the block whose regions are to be examined. It is expected that the
689    block will have regions in all of the genomes for both sets, but this is not
690    required by the algorithm.
691    
692    =item set0
693    
694    Reference to a list of the IDs for the genomes in the first set (0).
695    
696    =item set1
697    
698    Reference to a list of the IDs for the genomes in the second set (1).
699    
700    =item RETURN
701    
702    A hash that maps positions to contents. Each position will be relative to the
703    start of the block. The mapped value will list the nucleotides found in the
704    first set and the nucleotides found in the second set. So, for example,
705    C<['AC', 'G']> means that both C<A> and C<C> are found in the genomes of the
706    first set (0), but only C<G> is found in genomes of the second set (1).
707    
708    =back
709    
710    =cut
711    #: Return Type %;
712    sub SnipScan {
713        # Get the parameters.
714        my ($self, $blockObject, $set0, $set1) = @_;
715        # Convert an incoming block ID to a block object.
716        if (ref $blockObject ne "DBObject") {
717            $blockObject = $self->GetEntity('GroupBlock', $blockObject);
718        }
719        # Get the ID and length of this block.
720        my ($blockID) = $blockObject->Value('GroupBlock(id)');
721        my ($len) = $blockObject->Value('GroupBlock(snip-count)');
722        # Create a hash that can be used to identify the genomes in each set.
723        my %setMap = ();
724        for my $genomeID (@{$set0}) {
725            $setMap{$genomeID} = 0;
726        }
727        for my $genomeID (@{$set1}) {
728            $setMap{$genomeID} = 1;
729        }
730        # Create the snip hash. We'll create empty lists for each snip position.
731        # As soon as a position becomes ineligible we delete it from the hash.
732        my %snipHash = ();
733        for (my $i = 0; $i < $len; $i++) {
734            $snipHash{$i} = ['', ''];
735        }
736        # Ask for the regions in the block.
737        my $query = $self->Get(['IncludesRegion', 'Region'], "IncludesRegion(from-link) = ?",
738                               $blockID);
739        # Loop through the regions.
740        while (my $region = $query->Fetch) {
741            # Determine this region's genome set. We only continue if the region is in
742            # one of the sets.
743            my ($location) = $region->Value('Region(from-link)');
744            my $genomeID = Genome::FromContig($location);
745            if (exists $setMap{$genomeID}) {
746                my $setIndex = $setMap{$genomeID};
747                # Get the region's snip values.
748                my ($regionValue) = $region->Value('Region(content)');
749                # Loop through the positions that are still active.
750                my @positions = keys %snipHash;
751                for my $position (@positions) {
752                    # Get the nucleotide at the current position in this region.
753                    my $letter = substr $regionValue, $position, 1;
754                    # Find it in the appropriate list.
755                    my $letterList = $snipHash{$position};
756                    if ((index $letterList->[1-$setIndex], $letter) >= 0) {
757                        # It's in the other set, so this position is no
758                        # longer valid.
759                        delete $snipHash{$position};
760                    } elsif ((index $letterList->[$setIndex], $letter) < 0) {
761                        # It's in neither set, so we add it to the list we're
762                        # accumulating for this set at this position.
763                        $letterList->[$setIndex] .= $letter;
764                    }
765                }
766            }
767        }
768        # Now %snipHash contains all of the information we need, but we must relocate the
769        # positions so they are relative to the block address instead of the snip
770        # positions. We need the block pattern to do this.
771        my ($blockPattern) = $blockObject->Value('GroupBlock(pattern)');
772        my @positions = ParsePattern($blockPattern);
773        # Create the return hash.
774        my %retVal = ();
775        # Relocate the positions.
776        for my $inPosition (keys %snipHash) {
777            $retVal{$positions[$inPosition]} = $snipHash{$inPosition};
778        }
779        # Return the result.
780        return %retVal;
781    }
782    
783    =head3 ParsePattern
784    
785    C<< my @positions = SimBlocks::ParsePattern($pattern); >>
786    
787    Get a list of the positions of variance inside a block pattern. The
788    positions of variance are marked by question marks, so all we need to
789    do is return the location of every question mark in the incoming
790    patter string.
791    
792    =over 4
793    
794    =item pattern
795    
796    DNA pattern for a group block.
797    
798    =item RETURN
799    
800    Returns a list of position numbers. These are coded as offsets, with 0 indicating
801    the first character of the pattern.
802    
803    =back
804    
805    =cut
806    #: Return Type @;
807    sub ParsePattern {
808        # Get the parameters.
809        my ($pattern) = @_;
810        # Create the return list.
811        my @retVal = ();
812        # Loop through the pattern, looking for question marks.
813        my $position = 0;
814        while (($position = index($pattern, "?", $position)) >= 0) {
815            push @retVal, $position;
816            $position++;
817        }
818        # Return the result.
819        return @retVal;
820    }
821    
822    
823    
824    =head3 MergeDNA
825    
826    C<< my ($groupSequence, $variance) = SimBlocks::MergeDNA($groupSequence, $newSequence); >>
827    
828    Merge the DNA for a region into the group representation of similar DNA, returning the
829    result and the positions of variance. Positions of variance in the group representation
830    are indicated by question marks. This method essentially compares the new DNA to the
831    group DNA and adds question marks wherever the new DNA does not match.
832    
833    =over 4
834    
835    =item groupSequence
836    
837    Group representation of the DNA into which the new DNA is to be merged.
838    
839    =item newSequence
840    
841    New DNA to be merged into the group representation.
842    
843    =item RETURN
844    
845    Returns the merged group representation followed by an updated count of the number
846    of question marks in the group.
847    
848    =back
849    
850    =cut
851    sub MergeDNA {
852        # Get the parameters.
853        my ($groupSequence, $newSequence) = @_;
854        # Declare the return variables.
855        my ($retVal, $varianceCount) = ("", 0);
856        # If there is no group representation, or if the group representation is the
857        # same as the new DNA, we return the new DNA; otherwise, we have to merge.
858        # Note that the new DNA cannot contain any question marks, so if either of
859        # the degenerate cases is true, we return a variance of 0.
860        if ((! $groupSequence) || ($groupSequence eq $newSequence)) {
861            $retVal = $newSequence;
862        } else {
863            # Here the group representation and the new DNA are different, so we
864            # have to find all the points of difference and replace them with
865            # question marks. We do this using a trick: by XOR-ing the two
866            # strings together, we create a new string with nulls everywhere
867            # except where there's a mismatch. We then find the non-null characters
868            # and replace the corresponding positions in the return string with
869            # question marks. First, we get a copy of the group representation to
870            # play with.
871            $retVal = $groupSequence;
872            # Next we compute the XOR string.
873            my $hexor = $retVal ^ $newSequence;
874            # This loop finds all non-null string positions.
875            while ($hexor =~ m/[^\x00]/g) {
876                # Here we get the position past the mismatch point.
877                my $pos = pos $hexor;
878                # Replace the real mismatch point with a question mark.
879                substr $retVal, $pos - 1, 1, "?";
880                $varianceCount++;
881            }
882        }
883        # Return the result.
884        return ($retVal, $varianceCount);
885    }
886    
887    =head3 GetAlignment
888    
889    C<< my %sequences = $$simBlocks->GetAlignment(\@blockIDs, \@genomeIDs, $indels); >>
890    
891    Return an alignment of the specified genomes relative to the specified block
892    IDs. Only blocks in which all of the genomes occur will produce output for
893    the alignment.
894    
895    =over 4
896    
897    =item blockIDs
898    
899    Reference to a list of the IDs of the blocks from which the alignment
900    should be constructed.
901    
902    =item genomeIDs
903    
904    Reference to a list of the genome IDs participating in the alignment.
905    
906    =item indels (optional)
907    
908    C<1> if columns containing insertion characters (C<->) should be included
909    in the alignment, else C<0>. The default is C<0>.
910    
911    =item RETURN
912    
913    Returns a hash that maps each genome ID to its sequence in the alignment.
914    
915    =back
916    
917    =cut
918    #: Return Type %;
919    sub GetAlignment {
920        # Get the parameters.
921        my ($self, $blockIDs, $genomeIDs, $indels) = @_;
922        # Create the return hash. We will start with a null string for each
923        # genome, and add the alignment data one block at a time.
924        my %retVal = ();
925        for my $genomeID (@{$genomeIDs}) {
926            $retVal{$genomeID} = "";
927        }
928        # Get the number of genomes. This will help us in determining which
929        # blocks to keep.
930        my $totalGenomes = @{$genomeIDs};
931        # Loop through the blocks.
932        for my $blockID (@{$blockIDs}) {
933            # Get the block's data.
934            my $blockData = $self->GetEntity('GroupBlock', $blockID);
935            my ($blockName) = $blockData->Value('GroupBlock(description)');
936            Trace("Processing common block $blockID ($blockName).") if T(Alignment => 4);
937            # Only proceed if the block has real variance in it.
938            my ($snipCount) = $blockData->Value('GroupBlock(snip-count)');
939            if ($snipCount > 0) {
940                # Get this block's regions in the specified genomes.
941                my %regionHash = $self->GetRegions($blockID, $genomeIDs);
942                # Verify that each genome is represented.
943                my %blockHash = ();
944                my $genomeCount = 0;
945                for my $region (keys %regionHash) {
946                    # Get the genome ID from the region string.
947                    $region =~ m/^([^:]+):/;
948                    # If it's new, count it.
949                    if (! exists $blockHash{$1}) {
950                        $blockHash{$1} = $regionHash{$region};
951                        $genomeCount++;
952                    }
953                }
954                # At this point, if the number of genomes found matches the
955                # size of the list, this block is good.
956                if ($genomeCount == $totalGenomes) {
957                    if (! $indels) {
958                        # If indels are off, we need to remove some of the columns. We do
959                        # this by finding C<-> characters in each genome's DNA sequence,
960                        # and deleting all instances of the relevant columns.
961                        for my $genomeID (@{$genomeIDs}) {
962                            Trace("Hyphen check for genome $genomeID.") if T(Alignment => 3);
963                            while ((my $pos = index($blockHash{$genomeID},"-")) >= 0) {
964                                # Here we found an insertion character. We delete its column
965                                # from all the genomes in the alignment.
966                                Trace("Hyphen found at position $pos in genome $genomeID.") if T(Alignment => 4);
967                                for my $genomeID2 (@{$genomeIDs}) {
968                                    substr $blockHash{$genomeID2}, $pos, 1, "";
969                                }
970                            }
971                        }
972                    }
973                    Trace("Common block $blockID ($blockName) added to alignment.") if T(Alignment => 3);
974                    for my $genomeID (@{$genomeIDs}) {
975                        $retVal{$genomeID} .= $blockHash{$genomeID};
976                    }
977                }
978            }
979        }
980        # Return the result.
981        return %retVal;
982    }
983    
984    =head3 DistanceMatrix
985    
986    C<< my %distances = SimBlocks::DistanceMatrix(\%sequences, \%distances); >>
987    
988    Compute the distances between the sequences in an alignment. the L</SequenceDistance>
989    method is used to compute the individual distances.
990    
991    =over 4
992    
993    =item sequences
994    
995    Reference to a hash of genome IDs to alignment sequences. The alignment sequences may
996    only contain the characters C<AGCT->, and must all be the same length.
997    
998    =item distances
999    
1000    Reference to a hash that maps pairs of letters to distance values. Each letter pair
1001    must be represented as a two-character string. Dual strings such as "AG" and "GA" must
1002    map to the same distance. If no distance matrix is specified, the default distance
1003    matrix is used.
1004    
1005    =item RETURN
1006    
1007    Returns a hash that maps genome ID pairs to distance values. The ID pairs are
1008    formed using a slash. For example, the distance from C<158878.1> to C<282459.1>
1009    would be found attached to the key C<158878.1/282459.1>.
1010    
1011    =back
1012    
1013    =cut
1014    #: Return Type %
1015    sub DistanceMatrix {
1016        # Get the parameters.
1017        my ($sequences, $distances) = @_;
1018        # Declare the return variable.
1019        my %retVal = ();
1020        # Create a sorted list of the genome IDs.
1021        my @genomes = sort keys %{$sequences};
1022        # Loop through the genome IDs.
1023        for (my $i = 0; $i <= $#genomes; $i++) {
1024            my $genomei = $genomes[$i];
1025            # Create the 0-distance pair.
1026            $retVal{"$genomei/$genomei"} = 0;
1027            # Loop through the genomes lexically after this one. We'll
1028            # generate one distance and use it for both orderings.
1029            for (my $j = $i + 1; $j <= $#genomes; $j++) {
1030                my $genomej = $genomes[$j];
1031                # Compute the distance.
1032                my $distance = SequenceDistance($sequences->{$genomei}, $sequences->{$genomej},
1033                                                $distances);
1034                Trace("Distance from $genomei to $genomej is $distance.") if T(Distance => 3);
1035                # Store it in the return hash.
1036                $retVal{"$genomei/$genomej"} = $distance;
1037                $retVal{"$genomej/$genomei"} = $distance;
1038            }
1039        }
1040        # Return the result matrix.
1041        return %retVal;
1042    }
1043    
1044    =head3 SequenceDistance
1045    
1046    C<< my $dist = SimBlocks::SequenceDistance($seq1, $seq2, $distances); >>
1047    
1048    Return the distance between two sequences. The distance presumes that
1049    each alignment is a vector of sorts, with purines (C<A>/C<T>) and pyrmidines (C<G>/C<C>)
1050    being a half-unit from themselves and a full unit from each other. The distance is normalized
1051    by the number of positions in the alignment. Thus, C<AATGCTT> is 0.6267 from C<ATTTGCA>.
1052    The fourth and sixth positions are a full-unit jump and the second, fifth, and seventh
1053    positions are half-unit jumps. This works out to
1054    
1055        sqrt(0 + 0.5^2 + 0 + 1^2 + 0.5^2 + 1^2 + 0.5^2)/sqrt(7)
1056    
1057    which is approximately 0.6267. Insertion characters (C<->) are considered a distance of
1058    0 from everything. This distance model can be overridden by supplying a custom distance
1059    matrix as the third parameter.
1060    
1061    =over 4
1062    
1063    =item seq1
1064    
1065    Origin sequence for computing the distance.
1066    
1067    =item seq2
1068    
1069    End sequence for computing the distance. Both sequences must be the same length.
1070    
1071    =item distances
1072    
1073    Reference to a hash that maps pairs of letters to distance values. Each letter pair
1074    must be represented as a two-character string. Dual strings such as "AG" and "GA" must
1075    map to the same distance. If no distance matrix is specified, the default distance
1076    matrix is used.
1077    
1078    =back
1079    
1080    =cut
1081    #: Return Type $;
1082    sub SequenceDistance {
1083        # Get the parameters.
1084        my ($seq1, $seq2, $distances) = @_;
1085        if (!defined $distances) {
1086            $distances = DefaultDistances();
1087        }
1088        # Declare the return value. Initially, this will be a sum of squared. We'll
1089        # take the square root and divide out the length later.
1090        my $retVal = 0;
1091        # Loop through the sequences, character by character, comparing.
1092        my $n = length $seq1;
1093        for (my $i = 0; $i < $n; $i++) {
1094            my $s1 = substr $seq1, $i, 1;
1095            my $s2 = substr $seq2, $i, 1;
1096            my $step = $distances->{"$s1$s2"};
1097            Trace("No step distance found for \"$s1$s2\".") if !defined $step && T(Distance => 1);
1098            $retVal += $step * $step;
1099        }
1100        # Divide by the length and take the square root.
1101        $retVal = sqrt($retVal / $n);
1102        # Return the result.
1103        return $retVal;
1104    }
1105    
1106    =head3 GetBlock
1107    
1108    C<< my $blockData = $simBlocks->GetBlock($blockID); >>
1109    
1110    Return a B<DBObject> for a specified group block.
1111    
1112    =over 4
1113    
1114    =item blockID
1115    
1116    ID of the desired block.
1117    
1118    =item RETURN
1119    
1120    Returns a B<DBObject> for the group block in question. The object allows access to
1121    all of the fields in the C<GroupBlock> relation of the database.
1122    
1123    =back
1124    
1125    =cut
1126    #: Return Type $%;
1127    sub GetBlock {
1128        # Get the parameters.
1129        my ($self, $blockID) = @_;
1130        # Get the group block.
1131        my $retVal = $self->GetEntity('GroupBlock', $blockID);
1132        # Return the result.
1133        return $retVal;
1134    }
1135    
1136  1;  1;

Legend:
Removed from v.1.1  
changed lines
  Added in v.1.2

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3