[Bio] / FigKernelPackages / FullLocation.pm Repository:
ViewVC logotype

Annotation of /FigKernelPackages/FullLocation.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.23 - (view) (download) (as text)

1 : parrello 1.1 #!/usr/bin/perl -w
2 : olson 1.10 #
3 :     # Copyright (c) 2003-2006 University of Chicago and Fellowship
4 :     # for Interpretations of Genomes. All Rights Reserved.
5 :     #
6 :     # This file is part of the SEED Toolkit.
7 : parrello 1.12 #
8 : olson 1.10 # The SEED Toolkit is free software. You can redistribute
9 :     # it and/or modify it under the terms of the SEED Toolkit
10 : parrello 1.12 # Public License.
11 : olson 1.10 #
12 :     # You should have received a copy of the SEED Toolkit Public License
13 :     # along with this program; if not write to the University of Chicago
14 :     # at info@ci.uchicago.edu or the Fellowship for Interpretation of
15 :     # Genomes at veronika@thefig.info or download a copy from
16 :     # http://www.theseed.org/LICENSE.TXT.
17 :     #
18 :    
19 : parrello 1.1
20 :     package FullLocation;
21 :    
22 :     use strict;
23 :     use Tracer;
24 :     use PageBuilder;
25 :     use BasicLocation;
26 :    
27 :     =head1 Full Location Object
28 :    
29 :     =head2 Introduction
30 :    
31 :     A I<full location object> describes a list of basic locations (segments) in a
32 :     particular genome. In addition to an array of basic location objects, it contains
33 :     a genome ID and a reference to a FIG-like object. The FIG-like object is always
34 :     accessed using the variable I<$fig>. This simplifies the process of determining which
35 :     FIG methods must be supported in order to make use of this object's features.
36 :    
37 :     The simplest way to create a full location object is by passing in the genome ID
38 :     and a location string. The location string contains a list of basic locations
39 :     separated by commas. These are converted into location objects and assembled into
40 :     the full location. The full location is considered I<bounded> by the first and last
41 :     basic locations in the list. This bounded region has a B<Left>, B<Right>, B<Begin>, and
42 :     B<Endpoint>, just like a basic location. So, for example, with a location list of
43 :    
44 :     RED_100_250, RED_275_325, RED_330_430
45 :    
46 :     the B<Left> and B<Begin> locations are C<RED_100>, while the B<Right> and B<EndPoint>
47 : parrello 1.23 locations are C<RED_430>. Similarly, with the location list
48 : parrello 1.1
49 :     BLUE_500_450, BLUE_425_300, BLUE_295_200
50 :    
51 : parrello 1.23 the B<Left> and B<EndPoint> locations are C<BLUE_200>, while the B<Right> and B<Begin>
52 : parrello 1.1 locations are C<BLUE_500>.
53 :    
54 :     A location can be converted to a DNA string using the genome ID and data accessible
55 :     through the fig-like object. A location also has a I<translation> that represents the
56 :     protein sequence produced DNA. The translation can be computed from the DNA or can be
57 :     provided by the client. If it is provided by the client, it will automatically be
58 :     extended when the boundaries are moved.
59 :    
60 :     Theoretically, a location can contain basic locations on different contigs and
61 :     pointing in different directions. In practice, this does not occur; therefore,
62 :     the location will be treated as a set of basic locations in a single direction
63 :     on a single contig. If this is not the case, problems will arise with some of
64 :     the methods.
65 :    
66 :     =cut
67 :    
68 :     #: Constructor FullLocation->new();
69 :    
70 :     =head2 Public Methods
71 :    
72 :     =head3 new
73 :    
74 : parrello 1.18 my $loc = FullLocation->new($fig, $genomeID, $locList, $translation, $code);
75 : parrello 1.1
76 :     Construct a new FullLocation object.
77 :    
78 :     =over 4
79 :    
80 :     =item fig
81 :    
82 :     A fig-like object that can be used to get the DNA and translation information.
83 :    
84 :     =item genomeID
85 :    
86 :     ID of the genome containing the location.
87 :    
88 :     =item locList
89 :    
90 :     List of locations. This can be a reference to a list of location strings, a
91 :     comma-delimited list of location strings, or a reference to a list of basic
92 :     location objects.
93 :    
94 :     =item translation (optional)
95 :    
96 :     Protein string representing the DNA inside the boundaries of the location.
97 :    
98 : parrello 1.14 =item code (optional)
99 :    
100 :     Translation code table to use for translating DNA triplets to proteins. If
101 :     none is specified, the standard code table will be used. If a number is
102 :     specified, then the appropriate NCBI code table will be requested from
103 : parrello 1.22 [[FigPm]].
104 : parrello 1.14
105 : parrello 1.1 =back
106 :    
107 :     =cut
108 :    
109 :     sub new {
110 :     # Get the parameters.
111 : parrello 1.14 my ($class, $fig, $genomeID, $locList, $translation, $code) = @_;
112 :     # If there's no code, default to the standard translation code.
113 :     if (! defined $code) {
114 :     $code = FIG::standard_genetic_code();
115 :     } elsif (! ref $code) {
116 :     $code = FIG::genetic_code($code);
117 :     }
118 : parrello 1.1 # Create the $loc object.
119 :     my $retVal = {
120 :     fig => $fig,
121 :     genomeID => $genomeID,
122 : parrello 1.14 translation => $translation,
123 :     code => $code
124 : parrello 1.1 };
125 :     # The tricky part is the location list. Regardless of its incoming format,
126 :     # we must convert it to a list of location objects.
127 :     my $locType = ref $locList;
128 :     if ($locType eq '') {
129 :     # Here we have a comma-delimited list of locations, so we convert it to
130 :     # an array reference.
131 :     $retVal->{locs} = _ParseLocations($retVal, split /\s*,\s*/, $locList);
132 :     } elsif ($locType eq 'ARRAY') {
133 :     # Here we have an array of location objects or strings, which we can
134 :     # pass directly to the parser.
135 :     $retVal->{locs} = _ParseLocations($retVal, @{$locList});
136 :     } else {
137 :     Confess("Invalid location list parameter of type $locType.");
138 :     }
139 :     # Scan the location array to determine the contig and direction. We choose
140 :     # the most popular contig and direction to represent all the locations.
141 :     my %contigs = ();
142 :     my %dirs = ( '+' => 0, '-' => 0 );
143 :     for my $loc (@{$retVal->{locs}}) {
144 :     $dirs{$loc->Dir}++;
145 :     $contigs{$loc->Contig}++
146 :     }
147 :     # Choose the most popular direction and contig.
148 :     $retVal->{dir} = GetBest(%dirs);
149 :     $retVal->{contig} = GetBest(%contigs);
150 :     if ($dirs{$retVal->{dir}} == 0) {
151 :     # Here the location list was empty.
152 :     Confess("Attempt to create a location for genome $genomeID that has no locations.");
153 :     }
154 :     # Bless and return the object.
155 :     bless $retVal, $class;
156 :     return $retVal;
157 :     }
158 :    
159 :     =head3 Locs
160 :    
161 : parrello 1.12 my $locObject = $loc->Locs->[$idx];
162 : parrello 1.1
163 :     Return a reference to the array of location objects.
164 :    
165 : parrello 1.13 B<NOTE>: Do not update the locations, as it will mess up the translation. Use the
166 :     L</Extend> method to change a full location's begin and/or end points.
167 :    
168 : parrello 1.1 =cut
169 :     #: Return Type $@;
170 :     sub Locs {
171 :     return $_[0]->{locs};
172 :     }
173 :    
174 :     =head3 Contig
175 :    
176 : parrello 1.12 my $contigID = $loc->Contig();
177 : parrello 1.1
178 :     Return the ID of the base contig for this location list. The base contig is the
179 :     contig used for most of the locations in the list.
180 :    
181 :     =cut
182 :     #: Return Type $;
183 :     sub Contig {
184 :     # Get the parameters.
185 :     my ($self) = @_;
186 :     # Return the result.
187 :     return $self->{contig};
188 :     }
189 :    
190 :     =head3 Dir
191 :    
192 : parrello 1.12 my $dir = $loc->Dir;
193 : parrello 1.1
194 :     Return the base direction for this location list. The base direction is the direction
195 :     (C<+> or C<->) used for most of the locations in the list.
196 :    
197 :     =cut
198 :     #: Return Type $;
199 :     sub Dir {
200 :     # Get the parameters.
201 :     my ($self) = @_;
202 :     # Return the result.
203 :     return $self->{dir};
204 :     }
205 :    
206 : parrello 1.2 =head3 NextPoint
207 :    
208 : parrello 1.12 my $offset = $loc->NextPoint;
209 : parrello 1.2
210 :     Return the location immediately after the end point of the last location.
211 :    
212 :     =cut
213 :     #: Return Type $;
214 :     sub NextPoint {
215 :     # Get the parameters.
216 :     my ($self) = @_;
217 :     # Get the last location.
218 :     my (undef, undef, $locN) = $self->GetBounds();
219 :     # Return the point after it.
220 :     return $locN->PointOffset($locN->Length);
221 :     }
222 :    
223 :     =head3 PrevPoint
224 :    
225 : parrello 1.12 my $offset = $loc->PrevPoint;
226 : parrello 1.2
227 :     Return the location immediately before the begin point of the first location.
228 :    
229 :     =cut
230 :     #: Return Type $;
231 :     sub PrevPoint {
232 :     # Get the parameters.
233 :     my ($self) = @_;
234 :     # Get the last location.
235 :     my (undef, $loc1, undef) = $self->GetBounds();
236 :     # Return the point after it.
237 :     return $loc1->PointOffset(-1);
238 :     }
239 :    
240 :     =head3 Begin
241 :    
242 : parrello 1.12 my $offset = $loc->Begin;
243 : parrello 1.2
244 :     Return the begin point of the first location.
245 :    
246 :     =cut
247 :     #: Return Type $;
248 :     sub Begin {
249 :     # Get the parameters.
250 :     my ($self) = @_;
251 : parrello 1.4 # Get the first location.
252 : parrello 1.2 my (undef, $loc1, undef) = $self->GetBounds();
253 :     # Return its begin point.
254 :     return $loc1->Begin;
255 :     }
256 :    
257 : parrello 1.4 =head3 EndPoint
258 :    
259 : parrello 1.12 my $offset = $loc->EndPoint;
260 : parrello 1.4
261 :     Return the end point of the last location.
262 :    
263 :     =cut
264 :     #: Return Type $;
265 :     sub EndPoint {
266 :     # Get the parameters.
267 :     my ($self) = @_;
268 :     # Get the last location.
269 :     my (undef, undef, $locN) = $self->GetBounds();
270 :     # Return its end point.
271 :     return $locN->EndPoint;
272 :     }
273 :    
274 : parrello 1.2 =head3 SeedString
275 :    
276 : parrello 1.12 my $string = $loc->SeedString;
277 : parrello 1.2
278 :     Return a comma-delimited list of this object's basic locations, in SEED format.
279 :    
280 :     =cut
281 :     #: Return Type $;
282 :     sub SeedString {
283 :     # Get the parameters.
284 :     my ($self) = @_;
285 :     # Map the location list to SEED strings.
286 :     my @seeds = map { $_->SeedString } @{$self->{locs}};
287 :     # Return the result.
288 :     return join ", ", @seeds;
289 :     }
290 :    
291 :     =head3 Adjusted
292 :    
293 : parrello 1.12 my $offset = $loc->Adjusted($oldOffset, $distance);
294 : parrello 1.2
295 :     Adjust the specified offset by the specified distance in the direction of this
296 :     location. If this is a forward location, the distance is added; if it is a backward
297 :     location, the distance is subtracted.
298 :    
299 :     =over 4
300 :    
301 :     =item oldOffset
302 :    
303 :     Offset to adjust.
304 :    
305 :     =item distance
306 :    
307 :     Distance by which to adjust the offset. This value can be negative.
308 :    
309 :     =item RETURN
310 :    
311 :     Returns a new offset formed by moving the specified distance from the original offset
312 :     in this location's direction.
313 :    
314 :     =back
315 :    
316 :     =cut
317 :     #: Return Type $;
318 :     sub Adjusted {
319 :     # Get the parameters.
320 :     my ($self, $oldOffset, $distance) = @_;
321 :     # Do the adjustment.
322 :     return $oldOffset + ($self->Dir eq '+' ? $distance : -$distance);
323 :     }
324 :    
325 : parrello 1.1 =head3 GetBest
326 :    
327 : parrello 1.12 my $bestKey = FullLocation::GetBest(%hash);
328 : parrello 1.1
329 :     Return the key of the hash element with the highest positive numeric value.
330 :    
331 :     =over 4
332 :    
333 :     =item hash
334 :    
335 :     A hash mapping keys to numbers.
336 :    
337 :     =item RETURN
338 :    
339 :     Returns the key having the highest value. If the hash is empty, or has no non-negative
340 : overbeek 1.6 values, returns C<undef>.
341 : parrello 1.1
342 :     =back
343 :    
344 :     =cut
345 :     #: Return Type $;
346 :     sub GetBest {
347 :     # Get the parameters.
348 :     my (%hash) = @_;
349 :     # Declare the return variable and initialize the best-count.
350 :     my ($retVal, $best) = (undef, 0);
351 :     # Search the hash.
352 :     for my $key (keys %hash) {
353 :     my $value = $hash{$key};
354 :     if ($value >= $best) {
355 :     $retVal = $key;
356 :     $best = $value;
357 :     }
358 :     }
359 :     # Return the result.
360 :     return $retVal;
361 :     }
362 :    
363 :     =head3 DNA
364 :    
365 : parrello 1.12 my $dnaString = $loc->DNA;
366 : parrello 1.1
367 :     Return the complete DNA string for this location.
368 :    
369 :     =cut
370 :     #: Return Type $;
371 :     sub DNA {
372 :     # Get the parameters.
373 :     my ($self) = @_;
374 :     my $fig = $self->{fig};
375 :     # Use the FIG object to extract the DNA.
376 :     my $retVal = $fig->dna_seq($self->{genomeID}, $self->Contig, map { $_->SeedString } @{$self->Locs});
377 :     # Return the result.
378 :     return $retVal;
379 :     }
380 :    
381 : parrello 1.2 =head3 Codon
382 :    
383 : parrello 1.12 my $codon = $loc->Codon($point);
384 : parrello 1.2
385 :     Return the DNA codon at the specified point on this location's contig in this
386 :     location's direction.
387 :    
388 :     =over 4
389 :    
390 :     =item point
391 :    
392 :     Offset into the contig of the codon.
393 :    
394 :     =item RETURN
395 :    
396 :     Returns a three-letter DNA codon from the specified point.
397 :    
398 :     =back
399 :    
400 :     =cut
401 :     #: Return Type $;
402 :     sub Codon {
403 :     # Get the parameters.
404 :     my ($self, $point) = @_;
405 :     # Get the FIG object.
406 :     my $fig = $self->{fig};
407 :     # Compute the codon location.
408 :     my $loc = $self->Contig . "_" . $point . "_" . $self->Adjusted($point,2);
409 :     # Return the DNA.
410 :     return $fig->dna_seq($self->{genomeID}, $loc);
411 :     }
412 :    
413 : parrello 1.1 =head3 Translation
414 :    
415 : parrello 1.12 my $proteinString = $loc->Translation($code, $fixStart);
416 : parrello 1.1
417 :     Return the protein translation of this location's DNA. The first time a
418 :     translation is requested, it will be cached in the object, and returned
419 :     unmodified. It is also possible that a translation specified in the
420 :     constructor exists, in which case it will be returned. Thus, the
421 :     I<$code> and I<$fixStart> parameters only matter on the first call.
422 :    
423 :     =over 4
424 :    
425 :     =item code (optional)
426 :    
427 :     Translation code table, in the form of a hash mapping DNA triples to protein
428 :     letters. If omitted, the standard translation table in C<FIG.pm> will be
429 :     used.
430 :    
431 :     =item fixStart (optional)
432 :    
433 :     TRUE if the first DNA triple should be given special handling, else FALSE.
434 :     If TRUE, then a value of C<TTG> or C<GTG> in the first position will be
435 :     translated to C<M> instead of the value specified in the translation code.
436 :    
437 :     =item RETURN
438 :    
439 :     Returns the protein translation for this location.
440 :    
441 :     =back
442 :    
443 :     =cut
444 :     #: Return Type $;
445 :     sub Translation {
446 :     # Get the parameters.
447 :     my ($self, $code, $fixStart) = @_;
448 :     # Declare the return variable.
449 :     my $retVal;
450 :     # Check for a cached translation.
451 :     if ($self->{translation}) {
452 :     # Return the cahced translation.
453 :     $retVal = $self->{translation};
454 :     } else {
455 : parrello 1.14 # Check for a translation code.
456 :     if (! defined $code) {
457 :     $code = FIG::standard_genetic_code();
458 :     }
459 : parrello 1.1 # Here we have to do some work. Extract our DNA.
460 :     my $dna = $self->DNA;
461 :     # Translate it.
462 :     $retVal = FIG::translate($dna, $code, $fixStart);
463 : parrello 1.14 # Chop off the stop codon.
464 :     $retVal =~ s/\*$//;
465 :     # Cache the translation and the code table.
466 : parrello 1.1 $self->{translation} = $retVal;
467 : parrello 1.14 $self->{code} = $code;
468 : parrello 1.1 }
469 :     # Return the result.
470 :     return $retVal;
471 :     }
472 :    
473 : parrello 1.23 =head3 ConstrainPoint
474 : parrello 1.9
475 : parrello 1.23 my $constrainedPoint = $loc->ConstrainPoint($point);
476 : parrello 1.9
477 :     Change a point location value so that it fits inside the base contig. If the point
478 :     location is less than 1, it will be set to 1. If it's greater than the length of
479 :     the contig, it will be set to the length of the contig.
480 :    
481 :     =over 4
482 :    
483 :     =item point
484 :    
485 :     Location to be constrained.
486 :    
487 :     =item RETURN
488 :    
489 :     Returns the value of the nearest location that is on the base contig of this location.
490 :    
491 :     =back
492 :    
493 :     =cut
494 :     #: Return Type $;
495 : parrello 1.23 sub ConstrainPoint {
496 : parrello 1.9 # Get the parameters.
497 :     my ($self, $point) = @_;
498 :     # Declare the return variable.
499 :     my $retVal;
500 :     # Check for a value less than 1.
501 :     if ($point < 1) {
502 :     $retVal = 1;
503 :     } else {
504 :     # Check for a value off the end of the contig.
505 :     my $fig = $self->{fig};
506 :     my $contigEnd = $fig->contig_ln($self->{genomeID}, $self->Contig);
507 :     if ($point > $contigEnd) {
508 :     # Bring the point back onto the contig.
509 :     $retVal = $contigEnd;
510 :     } else {
511 :     # Return the incoming point unmodified.
512 :     $retVal = $point;
513 :     }
514 :     }
515 :     # Return the result.
516 :     return $retVal;
517 :     }
518 :    
519 : parrello 1.3 =head3 ExtremeCodon
520 :    
521 : parrello 1.12 my $loc->ExtremeCodon($dir);
522 : parrello 1.3
523 :     Return the most extreme codon in the specified direction. This is not always the most
524 :     extreme location, since the distance to the appropriate edge of the location must be
525 :     a multiple of 3.
526 :    
527 :     =over 4
528 :    
529 :     =item dir
530 :    
531 :     C<first> to get the codon moving away from the beginning of the location, C<last>
532 :     to get the codon moving away from the end of the location.
533 :    
534 :     =item RETURN
535 :    
536 : parrello 1.4 Returns the edge location of the desired codon. If we are going toward the left, this
537 :     is the left point in the codon; if we are going toward the right, this is the right
538 :     point in the codon.
539 : parrello 1.3
540 :     =back
541 :    
542 :     =cut
543 :    
544 :     sub ExtremeCodon {
545 :     # Get the parameters.
546 :     my ($self, $dir) = @_;
547 :     my $fig = $self->{fig};
548 :     # The first task is to determine the starting point and direction for the
549 :     # search. We start by converting the direction to the same format as the
550 :     # location direction.
551 :     my $parity = ($dir eq 'first' ? '-' : '+');
552 : parrello 1.4 # Get the contig length.
553 :     my $contig_len = $fig->contig_ln($self->{genomeID}, $self->Contig);
554 : parrello 1.8 # If we're moving in the opposite direction from the location, we're going to
555 : parrello 1.3 # go toward the beginning of the contig; otherwise, we're going toward the
556 :     # end.
557 :     my ($multiplier, $endPoint);
558 :     if ($parity ne $self->Dir) {
559 :     ($multiplier, $endPoint) = (-3, 1);
560 :     } else {
561 : parrello 1.4 ($multiplier, $endPoint) = (3, $contig_len);
562 : parrello 1.3 }
563 :     # Now we need the start point, which is determined by direction of this method.
564 : parrello 1.4 my $beginPoint = ($parity eq '-' ? $self->Begin : $self->EndPoint);
565 : parrello 1.3 # Compute the number of positions to move and add it to the begin point.
566 :     my $retVal = int(($endPoint - $beginPoint) / $multiplier) * $multiplier +
567 :     $beginPoint;
568 :     # Return the codon found.
569 :     return $retVal;
570 :     }
571 :    
572 : parrello 1.13
573 : parrello 1.2 =head3 Extend
574 :    
575 : parrello 1.12 my = $loc->Extend($newBegin, $newEnd, $trimFlag);
576 : parrello 1.2
577 :     Extend this gene to a new begin point and a new end point. If a translation exists,
578 : parrello 1.5 it will be updated to match the new locations. The I<$trimFlag> indicates whether
579 :     or not it is permissible to shrink the location at either end. If an attempt is made
580 :     to shrink and I<$trimFlag> is not specified, then a fatal error will occur.
581 : parrello 1.2
582 :     =over 4
583 :    
584 :     =item newBegin
585 :    
586 : parrello 1.13 Proposed new beginning offset. If undefined, the begin location will not be changed.
587 :    
588 :     =item newEnd
589 : parrello 1.2
590 : parrello 1.13 Proposed new ending offset. If undefined, the ending location will not be changed.
591 : parrello 1.2
592 : parrello 1.13 =item trimFlag
593 :    
594 :     If TRUE, the begin and end offsets can shrink the location.
595 : parrello 1.2
596 :     =back
597 :    
598 :     =cut
599 :     #: Return Type ;
600 :     sub Extend {
601 :     # Get the parameters.
602 : parrello 1.5 my ($self, $newBegin, $newEnd, $trimFlag) = @_;
603 : parrello 1.2 my $fig = $self->{fig};
604 :     # Get our boundaries.
605 :     my ($boundLoc, $loc1, $locN) = $self->GetBounds;
606 :     # Get the current length of the start location.
607 : parrello 1.5 my $len = $boundLoc->Length;
608 :     # Extend the beginning of the bounds.
609 : parrello 1.13 if (defined $newBegin) {
610 :     $boundLoc->SetBegin($newBegin);
611 :     my $excess = $boundLoc->Length - $len;
612 :     # Insure this is a real extension.
613 :     if ($excess < 0) {
614 :     # Find out if we can trim.
615 :     if (! $trimFlag) {
616 :     # We can't trim, so it's an error.
617 :     Confess("Invalid begin location $newBegin for location " . $boundLoc->String . ".");
618 :     } elsif ($self->{translation}) {
619 :     # We can trim, and a translation exists, so we lop some characters off
620 :     # the front. Note that we divide by -3 because if we're here, the excess
621 :     # is automatically negative.
622 :     my $proteinExcess = int($excess / -3);
623 :     $self->{translation} = substr $self->{translation}, $proteinExcess;
624 :     }
625 :     } elsif ($excess > 0 && $self->{translation}) {
626 :     # Here we have new stuff to translate. Get its location.
627 :     my $excessLoc = BasicLocation->new($loc1->Contig, $newBegin, $loc1->Dir, $excess);
628 :     # Extract its DNA and translate it.
629 :     my $newDNA = $fig->dna_seq($self->{genomeID}, $excessLoc->SeedString);
630 : parrello 1.14 my $newTran = FIG::translate($newDNA, $self->{code});
631 : parrello 1.13 # Prefix the new translation to the old one.
632 :     $self->{translation} = $newTran . $self->{translation};
633 :     }
634 :     # We successfully updated the translation (if necessary), so we adjust the
635 :     # start of the first location in the full location's list.
636 :     $loc1->SetBegin($newBegin);
637 :     # Get the new current length of the bounds location.
638 :     $len = $boundLoc->Length;
639 :     }
640 :     # Extend the ending.
641 :     if (defined $newEnd) {
642 :     $boundLoc->SetEnd($newEnd);
643 :     my $excess = $boundLoc->Length - $len;
644 :     # Insure this is a real extension.
645 :     if ($excess < 0) {
646 :     # Find out if we can trim.
647 :     if (! $trimFlag) {
648 :     # We can't trim, so it's an error.
649 :     Confess("Invalid end location $newEnd for location " . $boundLoc->String . ".");
650 :     } elsif ($self->{translation}) {
651 :     # We can trim, and a translation exists, so we lop off some proteins at
652 :     # the end. Note that we divide by 3 and the excess is negative, so the
653 :     # result will be negative. We use it as a negative length in the substr
654 :     # expression to trim end characters.
655 :     my $negativeProteinExcess = int($excess / 3);
656 :     $self->{translation} = substr $self->{translation}, 0, $negativeProteinExcess;
657 :     }
658 :     } elsif ($excess > 0 && $self->{translation}) {
659 :     # Here we have new stuff to translate. Get its location.
660 :     my $excessLoc = BasicLocation->new($locN->Contig, $boundLoc->PointOffset($len), $locN->Dir,
661 :     $excess);
662 :     # Extract its DNA and translate it.
663 :     my $newDNA = $fig->dna_seq($self->{genomeID}, $excessLoc->SeedString);
664 : parrello 1.14 my $newTran = FIG::translate($newDNA, $self->{code});
665 : parrello 1.13 # Append the new translation to the old one.
666 :     $self->{translation} .= $newTran;
667 :     }
668 :     # Here we sucessfully updated the translation and the update is legal, so
669 :     # we can modify the end of the last location.
670 :     $locN->SetEnd($newEnd);
671 :     }
672 : parrello 1.15 if ($self->{translation}) {
673 :     # Chop the stop codon off the end of the translation.
674 :     $self->{translation} =~ s/\*$//;
675 :     }
676 : parrello 1.2 }
677 :    
678 : parrello 1.14 =head3 ConstrainCodon
679 :    
680 :     my $point = $loc->ConstrainCodon($point, $codonPoint);
681 :    
682 :     Constrain the specified point so that it is inside the bounds of this
683 :     location's contig and its distance to the specified codon point is a
684 :     multiple of 3.
685 :    
686 :     =over 4
687 :    
688 :     =item point
689 :    
690 :     Point index (relative to the contig) to be constrained.
691 :    
692 :     =item codonPoint
693 :    
694 :     Index (relative to the contig) of a point that is the start of a codon.
695 :    
696 :     =item RETURN
697 :    
698 :     Returns the constrained index.
699 :    
700 :     =back
701 :    
702 :     =cut
703 :    
704 :     sub ConstrainCodon {
705 :     # Get the parameters.
706 :     my ($self, $point, $codonPoint) = @_;
707 :     # Declare the return variable.
708 :     my $retVal = $point;
709 :     # Check for too far left.
710 :     if ($retVal < 1) {
711 :     $retVal = ($codonPoint - 1) % 3 + 1;
712 :     } else {
713 :     # Check for too far right.
714 :     my $contigLen = $self->{fig}->contig_ln($self->{genomeID}, $self->Contig);
715 :     if ($retVal > $contigLen) {
716 :     $retVal = $contigLen - ($contigLen - $codonPoint) % 3;
717 :     }
718 :     }
719 :     # Return the result.
720 :     return $retVal;
721 :     }
722 :    
723 :    
724 : parrello 1.1 =head3 GetBounds
725 :    
726 : parrello 1.13 my ($boundLoc, $loc1, $locN) = $loc->GetBounds;
727 : parrello 1.1
728 :     Analyze this location and return information about its boundaries. This includes
729 :     a location for the bounds, the first location, and the last location. The
730 :     bounds essentially define the location as it would be if it were all on a single
731 :     contig in the same direction and had no gaps. The first location is the location
732 :     object containing the begin point of the bounds, and the last location is the
733 :     location object containing the end point of the bounds.
734 :    
735 :     =cut
736 :    
737 :     sub GetBounds {
738 :     # Get the parameters.
739 :     my ($self) = @_;
740 :     # Declare the return variables.
741 :     my ($boundLoc, $loc1, $locN);
742 :     # Get a reference to the location list.
743 :     my $locList = $self->Locs;
744 :     # The most common case is a singleton location list. We handle that first.
745 :     if (@{$locList} == 1) {
746 :     my $bloc = $locList->[0];
747 :     $boundLoc = BasicLocation->new($bloc);
748 :     $loc1 = $bloc;
749 :     $locN = $bloc;
750 :     } else {
751 :     # Here we have a multiple-location list. We search for the leftmost left
752 :     # and rightmost right on the base contig. To do that, we first extract
753 :     # all the eligible locations.
754 :     my $baseContig = $self->Contig;
755 :     my @baseLocs = grep { $_->Contig eq $baseContig } @{$locList};
756 :     # Next we prime the loop with a location popped off the list.
757 :     my $loc0 = pop @baseLocs;
758 :     my ($leftLoc, $rightLoc) = ($loc0, $loc0);
759 :     # Search for the leftmost and rightmost locations.
760 :     for my $loci (@baseLocs) {
761 :     if ($loci->Left < $leftLoc->Left) {
762 :     $leftLoc = $loci;
763 :     }
764 :     if ($loci->Right > $rightLoc->Right) {
765 :     $rightLoc = $loci;
766 :     }
767 :     }
768 :     # Now we have enough information to build the bounding location. First,
769 :     # we get the length.
770 :     my $len = $rightLoc->Right - $leftLoc->Left + 1;
771 :     # Next, we arrange the left and right locations according to the direction.
772 :     if ($self->Dir eq '+') {
773 :     ($loc1, $locN) = ($leftLoc, $rightLoc);
774 :     } else {
775 :     ($loc1, $locN) = ($rightLoc, $leftLoc);
776 :     }
777 :     # Finally, we create the bounding location.
778 :     $boundLoc = BasicLocation->new($baseContig, $loc1->Begin, $self->Dir, $len);
779 :     }
780 :     # Return the results.
781 :     return ($boundLoc, $loc1, $locN);
782 :     }
783 :    
784 : parrello 1.21 =head2 Codon Search Methods
785 :    
786 :     =head3 UpstreamSearch
787 :    
788 :     my $loc = $floc->UpstreamSearch($pattern, $limit);
789 :    
790 :     Search upstream from this location for a codon as identified by the
791 :     specified pattern, stopping at the end of the contig or when the
792 :     specified limit is reached.
793 :    
794 :     =over 4
795 :    
796 :     =item pattern
797 :    
798 :     Codon pattern to search for, expressed as a bar-delimited list of base triplets.
799 :     For example, C<taa|tag|tga> would search for a stop codon.
800 :    
801 :     =item limit (optional)
802 :    
803 :     Maximum number of base pairs to search. Must be a multiple of 3.
804 :    
805 :     =item RETURN
806 :    
807 :     Returns a [[BasicLocationPm]] object for the codon found.
808 :    
809 :     =back
810 :    
811 :     =cut
812 :    
813 :     sub UpstreamSearch {
814 :     # Get the parameters.
815 :     my ($self, $pattern, $limit) = @_;
816 :     # Get the FIG object.
817 :     my $fig = $self->{fig};
818 :     # Declare the return variable.
819 :     my $retVal;
820 :     # Locate the starting and ending positions for the search. The search
821 :     # ends immediately upstream of our begin point.
822 :     my $end = $self->PrevPoint;
823 :     my $start = $end - ($self->Dir . 1) * ($limit - 1);
824 :     # Constrain these values to the inside of the contig.
825 :     $start = $self->ConstrainCodon($start, $self->Begin);
826 :     $end = $self->ConstrainCodon($end, $self->Begin + ($self->Dir . 2));
827 :     # Get the DNA to search. Note we convert it automatically to lower case.
828 :     my $dna = lc $fig->dna_seq($self->{genomeID}, $self->Contig . "_${start}_${end}");
829 :     # Insure the pattern is also lower-case.
830 :     $pattern = lc $pattern;
831 :     # Get the location of the last codon in the dna sequence.
832 :     my $i1 = length($dna) - 3;
833 :     Trace("$i1 base pairs in search.") if T(4);
834 :     Trace("Upsearch DNA translation\n" . FIG::translate($dna, $self->{code})) if T(4);
835 :     for (my $i = $i1; $i >= 0 && ! defined($retVal); $i -= 3) {
836 :     # Check for a match.
837 :     if (substr($dna, $i, 3) =~ /$pattern/) {
838 :     # Compute the actual return value. This will also stop the loop.
839 :     $retVal = BasicLocation->new($self->Contig, $i * ($self->Dir . 1) + $start, $self->Dir, 3);
840 :     }
841 :     }
842 :     # Return the result.
843 :     return $retVal;
844 :     }
845 :    
846 :    
847 :     =head3 DownstreamSearch
848 :    
849 :     my $loc = $floc->DownstreamSearch($pattern, $limit);
850 :    
851 :     Search downstream from this location for a codon as identified by the
852 :     specified pattern, stopping at the end of the contig or when the
853 :     specified limit is reached.
854 :    
855 :     =over 4
856 :    
857 :     =item pattern
858 :    
859 :     Codon pattern to search for, expressed as a bar-delimited list of base triplets.
860 :     For example, C<taa|tag|tga> would search for a stop codon.
861 :    
862 :     =item limit (optional)
863 :    
864 :     Maximum number of base pairs to search. Must be a multiple of 3.
865 :    
866 :     =item RETURN
867 :    
868 : parrello 1.22 Returns a [[BasicLocationPm]] object for the codon found.
869 : parrello 1.21
870 :     =back
871 :    
872 :     =cut
873 :    
874 :     sub DownstreamSearch {
875 :     # Get the parameters.
876 :     my ($self, $pattern, $limit) = @_;
877 :     # Get the FIG object.
878 :     my $fig = $self->{fig};
879 :     # Declare the return variable.
880 :     my $retVal;
881 :     # Locate the starting and ending positions for the search. The search starts
882 :     # immediately downstream of our end point.
883 :     my $start = $self->NextPoint;
884 :     my $end = $start + ($self->Dir . 1) * ($limit - 1);
885 :     # Do a down search to find the codon.
886 :     $retVal = $self->DownSearch($pattern, $start, $end);
887 :     # Return the result.
888 :     return $retVal;
889 :     }
890 :    
891 :    
892 :     =head3 InsideSearch
893 :    
894 :     my $loc = $floc->InsideSearch($pattern);
895 :    
896 :     Search inside this location for a codon as identified by the
897 :     specified pattern.
898 :    
899 :     =over 4
900 :    
901 :     =item pattern
902 :    
903 :     Codon pattern to search for, expressed as a bar-delimited list of base triplets.
904 :     For example, C<taa|tag|tga> would search for a stop codon.
905 :    
906 :     =item RETURN
907 :    
908 :     Returns a [[BasicLocationPm]] object for the codon found.
909 :    
910 :     =back
911 :    
912 :     =cut
913 :    
914 :     sub InsideSearch {
915 :     # Get the parameters.
916 :     my ($self, $pattern, $limit) = @_;
917 :     # Get the FIG object.
918 :     my $fig = $self->{fig};
919 :     # Declare the return variable.
920 :     my $retVal;
921 :     # Locate the starting and ending positions for the search. The search
922 :     # is entirely inside the location.
923 :     my $start = $self->Begin;
924 :     my $end = $self->EndPoint;
925 :     # Do a down search to find the codon.
926 :     $retVal = $self->DownSearch($pattern, $start, $end);
927 :     # Return the result.
928 :     return $retVal;
929 :     }
930 :    
931 :     =head3 DownSearch
932 :    
933 :     my $loc = $floc->DownSearch($pattern, $start, $end);
934 :    
935 :     Search parallel to this location for a codon as identified by the
936 :     specified pattern, starting at the specified start point and stopping
937 :     at the specified end point.
938 :    
939 :     =over 4
940 :    
941 :     =item pattern
942 :    
943 :     Codon pattern to search for, expressed as a bar-delimited list of base triplets.
944 :     For example, C<taa|tag|tga> would search for a stop codon.
945 :    
946 :     =item $start
947 :    
948 :     Starting position for search.
949 :    
950 :     =item $end
951 :    
952 :     Ending position for search.
953 :    
954 :     =item RETURN
955 :    
956 :     Returns a [[BasicLocationPm]] object for the codon found.
957 :    
958 :     =back
959 :    
960 :     =cut
961 :    
962 :     sub DownSearch {
963 :     # Get the parameters.
964 :     my ($self, $pattern, $start, $end) = @_;
965 :     # Get the FIG object.
966 :     my $fig = $self->{fig};
967 :     # Declare the return variable.
968 :     my $retVal;
969 :     # Insure we're inside the contig.
970 :     my $realStart = $self->ConstrainCodon($start, $start);
971 :     my $realEnd = $self->ConstrainCodon($end, $start + ($self->Dir . 2));
972 :     # Get the DNA to search. Note we convert it automatically to lower case.
973 :     my $dna = lc $fig->dna_seq($self->{genomeID}, $self->Contig . "_${realStart}_${realEnd}");
974 :     # Insure the pattern is also lower-case.
975 :     $pattern = lc $pattern;
976 :     # Get the length of the dna sequence.
977 :     my $i1 = length($dna);
978 :     Trace("$i1 base pairs in search.") if T(4);
979 :     Trace("Downsearch DNA translation\n" . FIG::translate($dna, $self->{code})) if T(4);
980 :     for (my $i = 0; $i < $i1 && ! defined($retVal); $i += 3) {
981 :     # Check for a match.
982 :     if (substr($dna, $i, 3) =~ /$pattern/) {
983 :     # Compute the actual return value. This will also stop the loop.
984 :     $retVal = BasicLocation->new($self->Contig, $i * ($self->Dir . 1) + $realStart, $self->Dir, 3);
985 :     }
986 :     }
987 :     # Return the result.
988 :     return $retVal;
989 :     }
990 :    
991 : parrello 1.14 =head3 PickGeneBoundaries
992 :    
993 : parrello 1.19 my $rc = $floc->PickGeneBoundaries(-stop => $stopPattern,
994 :     -start => $startPattern, -limit => $limit);
995 : parrello 1.14
996 :     Update this location so that it has a valid start and stop. The basic
997 :     algorithm used is to search upstream and downstream for stop codons, then
998 :     search between the stop codons for the first start codon.
999 :    
1000 :     =over 4
1001 :    
1002 : parrello 1.19 =item stop (optional)
1003 : parrello 1.14
1004 :     Search pattern for the stop codon, encoded as a bar-delimited list of DNA triplets
1005 :     (e.g. C<tta|ata|tag>). If omitted, the default is to look for stop codons in the
1006 :     attached genetic code table.
1007 :    
1008 : parrello 1.19 =item start (optional)
1009 : parrello 1.14
1010 :     Search pattern for the start codon, encoded as a bar-delimited list of DNA triplets.
1011 :     If omitted, the default is C<atg|gtg|ttg>.
1012 :    
1013 :     =item limit (optional)
1014 :    
1015 : parrello 1.18 If a number, then the maximum distance to search when attempting to extend the
1016 :     location. If a [[BasicLocationPm]] object or a location string, then none of the searches will
1017 :     go outside the region spanned by the location. If omitted, the default is a scalar value of C<9000>.
1018 : parrello 1.14
1019 :     =item RETURN
1020 :    
1021 : parrello 1.20 Returns TRUE if successful, FALSE if the process fails.
1022 : parrello 1.14
1023 :     =back
1024 :    
1025 :     =cut
1026 :    
1027 :     sub PickGeneBoundaries {
1028 :     # Get the parameters.
1029 : parrello 1.19 my ($self, %parms) = @_;
1030 : parrello 1.14 # Declare the return variable.
1031 :     my $retVal;
1032 : parrello 1.19 # Get the stop pattern.
1033 :     my $stopPattern = $parms{-stop} || lc join("|", grep { $self->{code}->{$_} eq '*' } keys %{$self->{code}});
1034 :     # Get the start pattern.
1035 :     my $startPattern = $parms{-start} || 'atg|gtg|ttg';
1036 :     # Compute the limits. We have an upstream limit and a downstream limit.
1037 :     my $limit = $parms{-limit};
1038 : parrello 1.18 my ($upLimit, $downLimit);
1039 :     if (! defined $limit) {
1040 :     $upLimit = 9000;
1041 :     $downLimit = 9000;
1042 :     } elsif ($limit =~ /^\d+$/) {
1043 :     $upLimit = $limit;
1044 :     $downLimit = $limit;
1045 :     } else {
1046 :     my $limitLoc = (! ref $limit ? new BasicLocation($limit) : $limit);
1047 :     $upLimit = abs($limitLoc->Begin - $self->Begin);
1048 :     $downLimit = abs($limitLoc->EndPoint - $self->EndPoint);
1049 :     }
1050 : parrello 1.21 # Insure the limits are on codon boundaries.
1051 :     $upLimit -= $upLimit % 3;
1052 :     $downLimit -= $downLimit % 3;
1053 : parrello 1.20 # Save the current boundaries.
1054 :     my $oldBegin = $self->Begin;
1055 :     my $oldEnd = $self->EndPoint;
1056 : parrello 1.21 # Get the contig boundaries.
1057 :     my $contigBegin = $self->ExtremeCodon('first');
1058 :     my $contigEnd = $self->ExtremeCodon('last');
1059 :     # Get the distance to each extreme.
1060 :     my $upExtreme = abs($contigBegin - $oldBegin);
1061 :     my $downExtreme = abs($contigEnd - $oldEnd);
1062 :     Trace("Up: limit = $upLimit, extreme = $upExtreme. Down: limit = $downLimit, extreme = $downExtreme.") if T(4);
1063 : parrello 1.14 # Search upstream for a stop.
1064 : parrello 1.18 my $upLoc = $self->UpstreamSearch($stopPattern, $upLimit);
1065 : parrello 1.15 # Check to see if we found one.
1066 :     my $newBegin;
1067 : parrello 1.21 if (defined $upLoc) {
1068 :     # Here we found a stop, so the new beginning is after the codon found.
1069 :     my $newOrfBegin = $upLoc->PointOffset(3);
1070 :     $self->Extend($newOrfBegin);
1071 : parrello 1.18 # Now look for a start between the stop codon found and the end of the location.
1072 :     my $startLoc = $self->InsideSearch($startPattern);
1073 : parrello 1.21 if (! defined $startLoc) {
1074 : parrello 1.19 Trace("New start not found for " . $self->SeedString() . ".") if T(3);
1075 :     } else {
1076 : parrello 1.21 # We found the start, so it becomes our new beginning.
1077 :     $newBegin = $startLoc->Begin;
1078 :     }
1079 :     } elsif ($upExtreme <= $upLimit) {
1080 :     # Here we fell off the end of the contig while searching, so this is ok.
1081 :     Trace("Contig beginning used for stop codon after upstream search.") if T(3);
1082 :     $newBegin = $contigBegin;
1083 :     } else {
1084 :     Trace("Upstream stop not found for " . $self->SeedString() . ".") if T(3);
1085 :     }
1086 :     # Only proceed if we have a new beginning.
1087 :     if (defined $newBegin) {
1088 :     # Search downstream for a stop codon.
1089 :     my $stopLoc = $self->DownstreamSearch($stopPattern, $downLimit);
1090 :     # Check the result.
1091 :     my $endPoint;
1092 :     if (defined $stopLoc) {
1093 :     # Here we found the end point.
1094 :     $endPoint = $stopLoc->EndPoint;
1095 :     } elsif ($downLimit >= $downExtreme) {
1096 :     # Here we fell off the end, so we use the end.
1097 :     $endPoint = $contigEnd;
1098 :     Trace("Contig end used for stop codon after downstream search.") if T(3);
1099 :     } else {
1100 :     Trace("Downsream stop not found for " . $self->SeedString() . ".") if T(3);
1101 :     }
1102 :     if (defined $endPoint) {
1103 :     # Extend the location to the start and stop found. This will almost certainly
1104 :     # require trimming in the start-codon direction.
1105 :     $self->Extend($newBegin, $endPoint, 'trim');
1106 :     # Denote we've succeeded.
1107 :     $retVal = 1;
1108 : parrello 1.18 }
1109 : parrello 1.14 }
1110 : parrello 1.20 # If we failed, restore the location.
1111 :     if (! $retVal) {
1112 :     $self->Extend($oldBegin, $oldEnd, 'trim');
1113 :     }
1114 : parrello 1.14 # Return the success indicator.
1115 :     return $retVal;
1116 :     }
1117 :    
1118 : parrello 1.9
1119 : parrello 1.1 =head2 Internal Methods
1120 :    
1121 :     =head3 ParseLocations
1122 :    
1123 :     Parse an array into a list of basic location objects. The array can contain
1124 :     basic location objects or location strings. The first parameter must be the
1125 :     full location object being constructed.
1126 :    
1127 :     This is a static method.
1128 :    
1129 :     =cut
1130 :    
1131 :     sub _ParseLocations {
1132 :     # Get the location list.
1133 :     my ($parent, @locs) = @_;
1134 :     # Create the return array.
1135 :     my @retVal = ();
1136 :     # Create a location index counter.
1137 :     my $idx = 0;
1138 :     # Loop through the locations.
1139 :     for my $loc (@locs) {
1140 :     # Create a variable to hold the location object created.
1141 :     my $locObject;
1142 :     # Check to see if this is a string or a location object.
1143 :     if (ref $loc eq '') {
1144 :     # It's a string, so parse it into a location object.
1145 :     $locObject = BasicLocation->new($loc, $parent, $idx);
1146 :     } elsif (UNIVERSAL::isa($loc, "BasicLocation")) {
1147 :     # It's a location object, so copy it and set the parent and
1148 :     # index.
1149 :     $locObject = BasicLocation->new($loc);
1150 :     $locObject->Attach($parent, $idx);
1151 :     } else {
1152 :     # Here we have an error.
1153 :     my $type = ref $loc;
1154 :     Confess("Invalid location object of type $type found at index $idx.");
1155 :     }
1156 :     # Add the location to the list.
1157 :     push @retVal, $locObject;
1158 :     $idx++;
1159 :     }
1160 :     # Return a reference to the location list.
1161 :     return \@retVal;
1162 :     }
1163 :    
1164 :     =head3 FindPattern
1165 :    
1166 :     Locate the index of a specified pattern in a DNA string.
1167 :    
1168 :     This is a static method.
1169 :    
1170 :     =over 4
1171 :    
1172 :     =item dna
1173 :    
1174 :     DNA string to search.
1175 :    
1176 :     =item pattern
1177 :    
1178 : parrello 1.2 Pattern for which to search (see L</Search>).
1179 : parrello 1.1
1180 :     =item RETURN
1181 :    
1182 :     Returns the index of the specified pattern, or C<undef> if the pattern is not found.
1183 :    
1184 :     =back
1185 :    
1186 :     =cut
1187 :     #: Return Type $;
1188 :     sub _FindPattern {
1189 :     # Get the parameters.
1190 : parrello 1.2 my ($dna, $pattern) = @_;
1191 : parrello 1.1 # Declare the return variable.
1192 :     my $retVal;
1193 :     # Insure the pattern is lower case.
1194 :     my $realPattern = lc $pattern;
1195 : parrello 1.9 # Start at the beginning of the string. We will chop stuff off the string
1196 :     # as we search through it. This smount chopped must then be added to the offset
1197 :     # of the found string in order to get the return value.
1198 :     my $pos = 0;
1199 :     # Do the search. We search for stop codons, and stop at the first one
1200 :     # that lies on a codon boundary.
1201 :     while (!defined($retVal) && $dna =~ m/$realPattern/g) {
1202 :     # We have a match. Get its location. Note that the "pos" function returns
1203 :     # the point where the search left off, so we have to back off a bit to
1204 :     # get the starting point of the codon.
1205 :     my $newPos = (pos $dna) - 3;
1206 :     # See if this is on a codon boundary.
1207 :     my $mod = $newPos % 3;
1208 :     if ($mod == 0) {
1209 :     # It is, so return the offset from the original start of the string.
1210 :     $retVal = $pos + $newPos;
1211 :     } else {
1212 :     # Here we need to keep searching. First, however, we move the search
1213 :     # position forward to the next codon boundary. This avoids useless checking
1214 :     # of the next byte or two and insures that we don't find the same value
1215 :     # again.
1216 :     $newPos += 3 - $mod;
1217 :     $pos += $newPos;
1218 :     $dna = substr($dna, $newPos);
1219 :     }
1220 :     }
1221 :     # Return the result.
1222 :     return $retVal;
1223 :     }
1224 :    
1225 : parrello 1.1
1226 :     1;
1227 :    

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3