[Bio] / FigKernelPackages / ServerThing.pm Repository:
ViewVC logotype

Diff of /FigKernelPackages/ServerThing.pm

Parent Directory Parent Directory | Revision Log Revision Log | View Patch Patch

revision 1.36, Thu Feb 25 20:30:10 2010 UTC revision 1.37, Sun Mar 14 13:26:32 2010 UTC
# Line 291  Line 291 
291      }      }
292  }  }
293    
 =head3 FindGeneCorrespondenceFile  
294    
295      my $ih = ServerThing::FindGeneCorrespondenceFile($genome1, $genome2);  =head2 Gene Correspondence File Methods
296    
297  Return an open input handle for a file that maps the genes in the first genome  These methods relate to gene correspondence files, which are generated by the
298  to corresponding genes in the second genome. These files can be found for some  L<svr_corresponding_genes.pl> script. Correspondence files are cached in the
299  genomes in the organism directories. Additional files are available in the  organism cache (I<$FIG_Config::orgCache>) directory. Eventually they will be
300  organism cache directory (I<$FIG_Config::orgCache>). If the desired file does  copied into the organism directories themselves. At that point, the code below
301  not exist, one will be created in the organism cache directory, presumably to be  will be modified to check the organism directories first and use the cache
302  moved to a permanent location at a later time.  directory if no file is found there.
303    
304    A gene correspondence file contains correspondences from a source genome to a
305    target genome. Most such correspondences are bidirectional best hits. A unidirectional
306    best hit may exist from the source genome to the target genome or in the reverse
307    direction from the targtet genome to the source genome. The cache directory itself
308    is divided into subdirectories by organism. The subdirectory has the source genome
309    name and the files themselves are named by the target genome.
310    
311    Some of the files are invalid and will be erased when they are found. A file is
312    considered invalid if it has a non-numeric value in a numeric column or if it
313    does not have any unidirectional hits from the target genome to the source
314    genome.
315    
316    The process of managing the correspondence files is tricky and dangerous because
317    of the possibility of race conditions. It can take several minutes to generate a
318    file, and if two processes try to generate the same file at the same time we need
319    to make sure they don't step on each other.
320    
321    In stored files, the source genome ID is always lexically lower than the target
322    genome ID. If a correspondence in the reverse direction is desired, the converse
323    file is found and the contents flipped automatically as they are read. So, the
324    correspondence from B<360108.3> to B<100226.1> would be found in a file with the
325    name B<360108.3> in the directory for B<100226.1>. Since this file actually has
326    B<100226.1> as the source and B<360108.3> as the target, the columns are
327    re-ordered and the arrows reversed before the file contents are passed to the
328    caller.
329    
330    =head4 Gene Correspondence List
331    
332    A gene correspondence file contains 18 columns. These are usually packaged as
333    a reference to list of lists. Each sub-list has the following format.
334    
335    =over 4
336    
337    =item 0
338    
339    The ID of a PEG in genome 1.
340    
341    =item 1
342    
343    The ID of a PEG in genome 2 that is our best estimate of a "corresponding gene".
344    
345    =item 2
346    
347    Count of the number of pairs of matching genes were found in the context.
348    
349    =item 3
350    
351    Pairs of corresponding genes from the contexts.
352    
353    =item 4
354    
355    The function of the gene in genome 1.
356    
357    =item 5
358    
359    The function of the gene in genome 2.
360    
361    =item 6
362    
363    Comma-separated list of aliases for the gene in genome 1 (any protein with an
364    identical sequence is considered an alias, whether or not it is actually the
365    name of the same gene in the same genome).
366    
367    =item 7
368    
369    Comma-separated list of aliases for the gene in genome 2 (any protein with an
370    identical sequence is considered an alias, whether or not it is actually the
371    name of the same gene in the same genome).
372    
373    =item 8
374    
375    Bi-directional best hits will contain "<=>" in this column; otherwise, "->" will appear.
376    
377    =item 9
378    
379    Percent identity over the region of the detected match.
380    
381    =item 10
382    
383    The P-score for the detected match.
384    
385    =item 11
386    
387    Beginning match coordinate in the protein encoded by the gene in genome 1.
388    
389    =item 12
390    
391    Ending match coordinate in the protein encoded by the gene in genome 1.
392    
393    =item 13
394    
395    Length of the protein encoded by the gene in genome 1.
396    
397    =item 14
398    
399    Beginning match coordinate in the protein encoded by the gene in genome 2.
400    
401    =item 15
402    
403    Ending match coordinate in the protein encoded by the gene in genome 2.
404    
405    =item 16
406    
407    Length of the protein encoded by the gene in genome 2.
408    
409    =item 17
410    
411    Bit score for the match. Divide by the length of the longer PEG to get
412    what we often refer to as a "normalized bit score".
413    
414    =back
415    
416    In the actual files, there will also be reverse correspondences indicated by a
417    back-arrow ("<-") in item (8). The output returned by the servers, however,
418    is filtered so that only forward correspondences occur. If a converse file
419    is used, the columns are re-ordered and the arrows reversed so that it looks
420    correct.
421    
422    =cut
423    
424    # hash for reversing the arrows
425    use constant ARROW_FLIP => { '->' => '<-', '<=>' => '<=>', '<-' => '->' };
426    # list of columns that contain numeric values that need to be validated
427    use constant NUM_COLS => [2,9,10,11,12,13,14,15,16,17];
428    
429    =head3 CheckForGeneCorrespondenceFile
430    
431        my ($fileName, $converse) = ServerThing::CheckForGeneCorrespondenceFile($genome1, $genome2);
432    
433    Try to find a gene correspondence file for the specified genome pairing. If the
434    file exists, its name and an indication of whether or not it is in the correct
435    direction will be returned.
436    
437  =over 4  =over 4
438    
439  =item genome1  =item genome1
440    
441  ID of the source genome.  Source genome for the desired correspondence.
442    
443  =item genome2  =item genome2
444    
445  ID of the target genome.  Target genome for the desired correspondence.
446    
447  =item RETURN  =item RETURN
448    
449  Returns the name of a tab-delimited file. The first column in the file contains IDs  Returns a two-element list. The first element is the name of the file containing the
450  of genes in the source genome; the second column contains IDs of genes in the  correspondence, or C<undef> if the file does not exist. The second element is TRUE
451  target genone. The remaining columns contain additional data about the correspondence.  if the correspondence would be forward or FALSE if the file needs to be flipped.
452    
453  =back  =back
454    
455  =cut  =cut
456    
457  sub FindGeneCorrespondenceFile {  sub CheckForGeneCorrespondenceFile {
458      # Get the parameters.      # Get the parameters.
459      my ($genome1, $genome2) = @_;      my ($genome1, $genome2) = @_;
460      # Declare the return variable.      # Declare the return variables.
461      my $retVal;      my ($fileName, $converse);
462      # Look for a pre-computed file in the organism directories.      # Determine the ordering of the genome IDs.
463      my $fileName = "$FIG_Config::organisms/$genome1/CorrToReferenceGenomes/$genome2";      my ($corrFileName, $genomeA, $genomeB) = ComputeCorrespondenceFileName($genome1, $genome2);
464      if (-f $fileName) {      $converse = ($genomeA ne $genome1);
465        # Look for a file containing the desired correspondence. (The code to check for a
466        # pre-computed file in the organism directories is currently turned off, because
467        # these files are all currently invalid.)
468        my $testFileName = "$FIG_Config::organisms/$genomeA/CorrToReferenceGenomes/$genomeB";
469        if (0 && -f $testFileName) {
470          # Use the pre-computed file.          # Use the pre-computed file.
471          Trace("Using pre-computed file $fileName for genome correspondence.") if T(3);          Trace("Using pre-computed file $fileName for genome correspondence.") if T(3);
472          $retVal = Open(undef, "<$fileName");          $fileName = $testFileName;
473      } else {      } elsif (-f $corrFileName) {
474          # Check for an organism cache.          $fileName = $corrFileName;
475          if (! $FIG_Config::orgCache) {          Trace("Using cached file $fileName for genome correspondence.") if T(3);
476              # No cache, so simply open a pipe.      }
477              Trace("No organism cache found: using pipe to compute $genome1 vs. $genome2.") if T(3);      # Return the result.
478              $retVal = Open(undef, "$FIG_Config::bin/svr_corresponding_genes $genome1 $genome2 |");      return ($fileName, $converse);
479    }
480    
481    
482    =head3 ComputeCorrespondenceFileName
483    
484        my ($fileName, $genomeA, $genomeB) = ServerThing::ComputeCorrespondenceFileName($genome1, $genome2);
485    
486    Compute the name to be given to a genome correspondence file in the organism cache
487    and return the source and target genomes that would be in it.
488    
489    =over 4
490    
491    =item genome1
492    
493    Source genome for the desired correspondence.
494    
495    =item genome2
496    
497    Target genome for the desired correspondence.
498    
499    =item RETURN
500    
501    Returns a three-element list. The first element is the name of the file to contain the
502    correspondence, the second element is the name of the genome that would act as the
503    source genome in the file, and the third element is the name of the genome that would
504    act as the target genome in the file.
505    
506    =back
507    
508    =cut
509    
510    sub ComputeCorrespondenceFileName {
511        # Get the parameters.
512        my ($genome1, $genome2) = @_;
513        # Declare the return variables.
514        my ($fileName, $genomeA, $genomeB);
515        # Determine the ordering of the genome IDs.
516        if ($genome1 lt $genome2) {
517            ($genomeA, $genomeB) = ($genome1, $genome2);
518          } else {          } else {
519              # Insure the source organism has a subdirectory in the cache.          ($genomeA, $genomeB) = ($genome2, $genome1);
520              my $orgDir = "$FIG_Config::orgCache/$genome1";      }
521        # Insure the source organism has a subdirectory in the organism cache.
522        my $orgDir = "$FIG_Config::orgCache/$genomeA";
523              Tracer::Insure($orgDir, 0777);              Tracer::Insure($orgDir, 0777);
524              # Check for a correspondence file that matches the target genome.      # Compute the name of the correspondence file for the appropriate target genome.
525              $fileName = "$orgDir/$genome2";      $fileName = "$orgDir/$genomeB";
526              if (-f $fileName) {      # Return the results.
527                  # We found one, so try to open it.      return ($fileName, $genomeA, $genomeB);
528                  my $ok = open $retVal, "<$fileName";  }
529                  # If the open failed, then the file is being built by another process, so  
530                  # use a pipe.  
531                  if (! $ok) {  =head3 CreateGeneCorrespondenceFile
532                      Trace("Failed to open $fileName. Using pipe to compute correspondence.") if T(3);  
533                      $retVal = Open(undef, "$FIG_Config::bin/svr_corresponding_genes $genome1 $genome2 |");      my ($fileName, $converse) = ServerThing::CheckForGeneCorrespondenceFile($genome1, $genome2);
534                  } else {  
535                      Trace("Using cached file $fileName for genome correspondence.") if T(3);  Create a new gene correspondence file in the organism cache for the specified
536    genome correspondence. The name of the new file will be returned along with
537    an indicator of whether or not it is in the correct direction.
538    
539    =over 4
540    
541    =item genome1
542    
543    Source genome for the desired correspondence.
544    
545    =item genome2
546    
547    Target genome for the desired correspondence.
548    
549    =item RETURN
550    
551    Returns a two-element list. The first element is the name of the file containing the
552    correspondence, or C<undef> if an error occurred. The second element is TRUE
553    if the correspondence would be forward or FALSE if the file needs to be flipped.
554    
555    =back
556    
557    =cut
558    
559    sub CreateGeneCorrespondenceFile {
560        # Get the parameters.
561        my ($genome1, $genome2) = @_;
562        # Declare the return variables.
563        my ($fileName, $converse);
564        # Compute the ultimate name for the correspondence file.
565        my ($corrFileName, $genomeA, $genomeB) = ComputeCorrespondenceFileName($genome1, $genome2);
566        $converse = ($genome1 ne $genomeA);
567        # Generate a temporary file name in the same directory. We'll build the temporary
568        # file and then rename it when we're done.
569        my $tempFileName = "$corrFileName.$$.tmp";
570        # This will be set to FALSE if we detect an error.
571        my $fileOK = 1;
572        # The file handles will be put in here.
573        my ($ih, $oh);
574        # Protect from errors.
575        eval {
576            # Open the temporary file for output.
577            $oh = Open(undef, ">$tempFileName");
578            # Open a pipe to get the correspondence data.
579            $ih = Open(undef, "$FIG_Config::bin/svr_corresponding_genes -u localhost $genomeA $genomeB |");
580            Trace("Creating correspondence file for $genomeA to $genomeB in temporary file $tempFileName.") if T(3);
581            # Copy the pipe date into the temporary file.
582            while (! eof $ih) {
583                my $line = <$ih>;
584                print $oh $line;
585            }
586            # Close both files. If the close fails we need to know: it means there was a pipe
587            # error.
588            $fileOK &&= close $ih;
589            $fileOK &&= close $oh;
590        };
591        if ($@) {
592            # Here a fatal error of some sort occurred. We need to force the files closed.
593            close $ih if $ih;
594            close $oh if $oh;
595        } elsif ($fileOK) {
596            # Here everything worked. Try to rename the temporary file to the real
597            # file name.
598            if (rename $tempFileName, $corrFileName) {
599                # Everything is ok, fix the permissions and return the file name.
600                chmod 0664, $corrFileName;
601                $fileName = $corrFileName;
602                Trace("Created correspondence file $fileName.") if T(3);
603            }
604        }
605        # If the temporary file exists, delete it.
606        if (-f $tempFileName) {
607            unlink $tempFileName;
608                  }                  }
609        # Return the results.
610        return ($fileName, $converse);
611    }
612    
613    
614    =head3 ReadGeneCorrespondenceFile
615    
616        my $list = ServerThing::ReadGeneCorrespondenceFile($fileName, $converse);
617    
618    Return the contents of the specified gene correspondence file in the form of
619    a list of lists, with backward correspondences filtered out. If the file is
620    for the converse of the desired correspondence, the columns will be reordered
621    automatically so that it looks as if the file were designed for the proper
622    direction.
623    
624    =over 4
625    
626    =item fileName
627    
628    The name of the gene correspondence file to read.
629    
630    =item converse (optional)
631    
632    TRUE if the file is for the converse of the desired correspondence, else FALSE.
633    If TRUE, the file columns will be reorderd automatically. The default is FALSE,
634    meaning we want to use the file as it appears on disk.
635    
636    =item RETURN
637    
638    Returns a L</Gene Correspondence List> in the form of a reference to a list of lists.
639    If the file's contents are invalid or an error occurs, an undefined value will be
640    returned.
641    
642    =back
643    
644    =cut
645    
646    sub ReadGeneCorrespondenceFile {
647        # Get the parameters.
648        my ($fileName, $converse) = @_;
649        # Declare the return variable. We will only put something in here if we are
650        # completely successful.
651        my $retVal;
652        # This value will be set to 1 if an error is detected.
653        my $error = 0;
654        # Try to open the file.
655        my $ih;
656        Trace("Reading correspondence file $fileName.") if T(3);
657        if (! open $ih, "<$fileName") {
658            # Here the open failed, so we have an error.
659            Trace("Failed to open gene correspondence file $fileName: $!") if T(3);
660            $error = 1;
661        }
662        # The gene correspondence list will be built in here.
663        my @corrList;
664        # This variable will be set to TRUE if we find a reverse correspondence somewhere
665        # in the file. Not finding one is an error.
666        my $reverseFound = 0;
667        # Loop until we hit the end of the file or an error occurs. We must check the error
668        # first in case the file handle failed to open.
669        while (! $error && ! eof $ih) {
670            # Get the current line.
671            my @row = Tracer::GetLine($ih);
672            # Get the correspondence direction and check for a reverse arrow.
673            $reverseFound = 1 if ($row[8] eq '<-');
674            # If we're in converse mode, reformat the line.
675            if ($converse) {
676                ($row[1], $row[0], $row[2], $row[3], $row[5], $row[4], $row[7], $row[6],
677                 ARROW_FLIP->{$row[8]}, $row[9], $row[10], $row[14], $row[15], $row[16],
678                 $row[11], $row[12], $row[13], $row[17]) = @row;
679            }
680            # Validate the row.
681            if (ValidateGeneCorrespondenceRow(\@row)) {
682                Trace("Invalid row $. found in correspondence file $fileName.") if T(3);
683                $error = 1;
684            }
685            # If this row is in the correct direction, keep it.
686            if ($row[8] eq '->') {
687                push @corrList, \@row;
688            }
689        }
690        # Close the input file.
691        close $ih;
692        # If we have no errors and we found a reverse arrow, keep the result.
693        if (! $error) {
694            if ($reverseFound) {
695                $retVal = \@corrList;
696              } else {              } else {
697                  # Here we need to create the file.              Trace("No reverse arrow found in correspondence file $fileName.") if T(3);
698                  Trace("Creating genome correspondence file $fileName.") if T(3);          }
                 system "svr_corresponding_genes $genome1 $genome2 >$fileName";  
                 Trace("Using created file for genome correspondence.") if T(3);  
                 $retVal = Open(undef, "<$fileName");  
699              }              }
700        # Return the result (if any).
701        return $retVal;
702          }          }
703    
704    
705    =head3 ValidateGeneCorrespondenceRow
706    
707        my $errorCount = ServerThing::ValidateGeneCorrespondenceRow($row);
708    
709    Validate a gene correspondence row. The numeric fields are checked to insure they
710    are numeric and the source and target gene IDs are validated. The return value will
711    indicate the number of errors found.
712    
713    =over 4
714    
715    =item row
716    
717    Reference to a list containing a single row from a L</Gene Correspondence List>.
718    
719    =item RETURN
720    
721    Returns the number of errors found in the row. A return of C<0> indicates the row
722    is valid.
723    
724    =back
725    
726    =cut
727    
728    sub ValidateGeneCorrespondenceRow {
729        # Get the parameters.
730        my ($row, $genome1, $genome2) = @_;
731        # Denote no errors have been found so far.
732        my $retVal = 0;
733        # Check for non-numeric values in the number columns.
734        for my $col (@{NUM_COLS()}) {
735            unless ($row->[$col] =~ /^-?\d+\.?\d*(?:e[+-]?\d+)?$/) {
736                $retVal++;
737            }
738        }
739        # Check the gene IDs.
740        for my $col (0, 1) {
741            unless ($row->[$col] =~ /^fig\|\d+\.\d+\.\w+\.\d+$/) {
742                $retVal++;
743            }
744        }
745        # Verify the arrow.
746        unless (exists ARROW_FLIP->{$row->[8]}) {
747            $retVal++;
748      }      }
749      # Return the desired file handle.      # Return the error count.
750      return $retVal;      return $retVal;
751  }  }
752    

Legend:
Removed from v.1.36  
changed lines
  Added in v.1.37

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3