[Bio] / FigKernelScripts / blast2gff.pl Repository:
ViewVC logotype

View of /FigKernelScripts/blast2gff.pl

Parent Directory Parent Directory | Revision Log Revision Log

Revision 1.23 - (download) (as text) (annotate)
Fri Oct 22 04:15:37 2010 UTC (9 years, 4 months ago) by redwards
Branch: MAIN
CVS Tags: mgrast_dev_08112011, mgrast_dev_08022011, rast_rel_2014_0912, myrast_rel40, mgrast_dev_05262011, mgrast_dev_04082011, mgrast_version_3_2, mgrast_dev_12152011, mgrast_dev_06072011, rast_rel_2014_0729, mgrast_dev_02212011, rast_rel_2010_1206, mgrast_release_3_0, mgrast_dev_03252011, rast_rel_2011_0119, mgrast_release_3_0_4, mgrast_release_3_0_2, mgrast_release_3_0_3, mgrast_release_3_0_1, mgrast_dev_03312011, mgrast_release_3_1_2, mgrast_release_3_1_1, mgrast_release_3_1_0, mgrast_dev_04132011, mgrast_dev_04012011, myrast_33, rast_rel_2011_0928, mgrast_dev_04052011, mgrast_dev_02222011, mgrast_dev_10262011, HEAD
Changes since 1.22: +12 -2 lines
adding an option for gziped files

# Copyright (c) 2003-2006 University of Chicago and Fellowship
# for Interpretations of Genomes. All Rights Reserved.
# This file is part of the SEED Toolkit.
# The SEED Toolkit is free software. You can redistribute
# it and/or modify it under the terms of the SEED Toolkit
# Public License. 
# You should have received a copy of the SEED Toolkit Public License
# along with this program; if not write to the University of Chicago
# at info@ci.uchicago.edu or the Fellowship for Interpretation of
# Genomes at veronika@thefig.info or download a copy from
# http://www.theseed.org/LICENSE.TXT.


=head1 blast2gff.pl

This is a de novo orf caller that I have written. At the base level it
will take two input files, a fasta file and a blastx file. Based on
the latter the ORFs in the former will be called. They will be given
unique fig ids and a gff3 file created that can be used to load the
genome into the seed.

In addition, the blastx file will be rewritten in seed sims
formatting. The changes are minor and include the addition of the
approximate lengths of the proteins (these are known for the database
but not for the queries), and we also reduce the starts/stops from

find the ORFs based on blastp using m8 output. This will allow for some frameshifting if the same ORF hits nearby and there is another overlapping ORF. Here is an example of the problem:

AAFY01000001    fig|1314.1.peg.1145     29.61   1074    736     28      4565    1404    2       940     7e-110  399.8
AAFY01000001    fig|1349.1.peg.604      34.96   718     463     13      3539    1398    260     933     1e-105  386.0
AAFY01000001    fig|1349.1.peg.604      33.98   103     68      2       4565    4257    2       96      4.3e-08 61.62
AAFY01000001    fig|1585.2.peg.426      29.76   988     681     21      4535    1611    13      931     1e-102  375.6
AAFY01000001    fig|100226.1.peg.1706   26.84   1099    757     26      4571    1416    2       1047    2.7e-95 351.3
AAFY01000001    fig|159087.1.peg.1691   26.48   1046    761     22      4571    1458    4       958     1.4e-83 312.4
AAFY01000001    fig|160488.1.peg.3093   25.68   1059    779     20      4574    1422    10      978     3.4e-82 307.8

There is probably an ORF that runs from 1400-ish through
4500-ish. We'd want to be inclusive and call this something like 1416
through 4571. We then run into problems that there may be
frameshifts/gaps here, so the protein sequence may be wrong, but for
metagenomics it shouldn't matter too greatly.

First we are going to iterate through the file and figure out the
starts and ends of the ORFs. Then we are going to iterate through
again and rewrite the output with new id's. This means we go throught
the blast output files twice.

We are going to start by making qstart/end be strand ignorant in the
first iteration. It will make life easier with the comparisons

Initially we are not calling things the same orf if the start and stop
are both less than the start or both less than the stop. We will only
add something once, and then move along to the next one.

We will have to do one final round of merging everything with everything else

We also reformat the sims for the output, adding the length of the proteins. 

In the initial version there was a problem where the strand
information was disposed of, and calls that were overlapping but on
different strands were causing a problem. Should be fixed now.


use strict;
use DB_File;
use raelib;
use FIG;
use URI::Escape;
my $fig=new FIG;

my $usage=<<EOF;

$0 <options> <list of blast files>


-f fasta file of contig sequences. This should be the fasta file that you submitted to the blast job
-s output file for sims (default=sims) 
-o output file for gff3 (default=gff3)
-ev file for the annotation evidence, a tab separated list of peg, function, similar peg, length of that peg, timestamp

-gs	genus/species. This is the display name of the sequence
-taxonomy taxonomy (be sure to enquote). We usually use unclassified; environmental samples; Something for this. E.g. see ~/FIGdisk/FIG/Data/Organisms/9999999.1/TAXONOMY
-proj	project A project description like "Water Samples"
-taxid	taxon id (optional)


-g genome id (default=xxxxxxx). Just use the genome number, the fig| stuff will be added as needed. Metagenomes usually have 6 or more consecutive nine's
-pegnum      (default=0) last peg number used 
-figonly     ignore hits that are not to fig proteins
-nr          name of NR to use (defaults to NR in SEED path)
-nrlen       btree database of NR sequence lengths
-b           blast program (default=blastx). This is just used (currently) for factoring the lengths
-e           E value cutoff (default = 10)
-na          Do not annotate the PEGs. The PEGs will be annotated towards subsystems...


The critical options are
-f, -s, -o, -gs, -taxonomy, -proj, -g

Others can be left as defaults. Don't forget to add a list of blastx files at the end. Something like blast/out*


my ($blastfile, $simsfile, $gfffile, $genome, $figonly, $fastaf, $ann_ev_file)=([], 'sims', 'gff3', 'xxxxxxx', '', '', '');
my ($taxid, $genusspecies, $taxonomy, $project)=('', 'Unknown sample', 'None', 'Robs BLAST ORF calling');
my ($nrf, $blastprogram)=($FIG_Config::global."/nr", 'blastx');
my ($lastpeg, $evaluecutoff, $annotate) = (0, 10, 1);
my $nrlenf;

while (@ARGV) {
 my $test=shift @ARGV;
 if ($test eq "-s")           { $simsfile = shift @ARGV; }
 elsif ($test eq "-o")        { $gfffile  = shift @ARGV; }
 elsif ($test eq "-f")        { $fastaf   = shift @ARGV; }
 elsif ($test eq "-e")        { $evaluecutoff   = shift @ARGV; }
 elsif ($test eq "-ev")       { $ann_ev_file = shift @ARGV; }
 elsif ($test eq "-g")        { $genome   = shift @ARGV; }
 elsif ($test eq "-pegnum")   { $lastpeg  = shift @ARGV; }
 elsif ($test eq "-figonly")  { $figonly  = 1; }
 elsif ($test eq "-taxid")    { $taxid    = shift @ARGV; }
 elsif ($test eq "-nr")       { $nrf      = shift @ARGV; }
 elsif ($test eq "-nrlen")    { $nrlenf   = shift @ARGV; }
 elsif ($test eq "-gs")       { $genusspecies = shift @ARGV; }
 elsif ($test eq "-taxonomy") { $taxonomy = shift @ARGV; }
 elsif ($test eq "-proj")     { $project  = shift @ARGV; }
 elsif ($test eq "-b")        { $blastprogram = shift @ARGV; $blastprogram = lc($blastprogram); }
 elsif ($test eq "-na")       { $annotate = 0}
 else {push @$blastfile, $test}

die $usage unless ($blastfile->[0]);

my ($start);
foreach my $bf (@$blastfile)
    if ($bf =~ /gz$/) {
        open(IN, "gunzip -c $bf |") || die "can't open pipe to gunzip $bf";
    else {
        open(IN, $bf) || die "Can't open $bf";
    while (<IN>)
	my ($query,$hit, $percent_id, $hsp_len, $mismatches,$gaps,$qstart,$qend,$hstart,$hend,$evalue,$bits) = split;
	next unless ($evalue <= $evaluecutoff);
	next if ($hit !~ /^fig/ && $figonly);
	my $strand=1;
	if ($qstart > $qend)
	    ($qstart, $qend)=($qend, $qstart);
	# If we already have an ORF candidate at this start position of the same size.
	next if (exists $start->{$query}->{$strand}->{$qstart} &&
		 $start->{$query}->{$strand}->{$qstart} == $qend);
	my $added;

	# Loop over the starts we already have for this read and strand
	foreach my $teststart (sort {$a <=> $b} keys %{$start->{$query}->{$strand}})
	    my $testend=$start->{$query}->{$strand}->{$teststart};

	    # $teststart - $testend is the ORF in question

	    # Skip this candidate if it falls entirely before or after the current blast hit
	    next if (($teststart < $qstart && $testend < $qstart) || ($teststart > $qend && $testend > $qend));
	    #print STDERR "Merged $teststart and ", $start->{$query}->{$teststart}, " into $qstart and $qend for $query\n";

	    # Extend the ORF stop if this overlapping hit extends beyond the end.
	    if ($qend > $start->{$query}->{$strand}->{$teststart})
	    # Extend the ORF start if this overlapping hit extends before the start
	    if ($qstart < $teststart)
		delete $start->{$query}->{$strand}->{$teststart};
	    # so we can assign it to the gene later
	unless ($added)
    close IN;

print STDERR "Read the files the first time around in ", time-$^T, " seconds\n";

# now cycle through all of them and double check if we have overlapping ends to store
my $sortedstarts;

# $id is the read id
foreach my $id (keys %$start)
    foreach my $strand (keys %{$start->{$id}})
	# compute list of all ORF starts in this strand for this read
        my @allstarts = sort {$a <=> $b} keys %{$start->{$id}->{$strand}};
        foreach my $i (0 .. $#allstarts-1)
            foreach my $j ($i+1 .. $#allstarts)
		# Compare each pair of ORFs.
                my ($starti, $startj)=($allstarts[$i], $allstarts[$j]);

		# Ensure we have orfs for each.
                next unless (exists $start->{$id}->{$strand}->{$starti} && exists $start->{$id}->{$strand}->{$startj});

		# Skip if non-overlapping
                next if (($starti < $startj && $start->{$id}->{$strand}->{$starti} < $startj) 
                        || ($start->{$id}->{$strand}->{$starti} > $start->{$id}->{$strand}->{$startj} 
                            && $starti > $start->{$id}->{$strand}->{$startj}));

		# If the ORFs overlap, use the longest.
		# make the ends the same for each
                if ($start->{$id}->{$strand}->{$starti} > $start->{$id}->{$strand}->{$startj}) 

		# now remove the lowest start
		# ? Looks like we use the longest ORF.
                if ($starti < $startj)
		    delete $start->{$id}->{$strand}->{$startj};
		    delete $start->{$id}->{$strand}->{$starti};

        @{$sortedstarts->{$id}->{$strand}} = sort {$a <=> $b} keys %{$start->{$id}->{$strand}};
#print STDERR "For $id there are $#allstarts starts after round 2\n";
# We just need to work our way back through the feature table seeing where the similarities are, 
# concatenating them (maybe) and rewriting the sim table. I am not sure if we should concatenate them or leave them separate.

# read the nr file if we have one
# If $nrf ends in ".btree", it is a btree form of the NR. Create a DB_File tie for NR lookups.

my $length;
my $fa;
my($tied, $len_tied);

if ($nrf =~ /\.btree$/)
    $fa = {};
    $length = {};
    $tied = tie %$fa, "DB_File", $nrf, O_RDONLY, 0666, $DB_BTREE;
    $tied or die "Cannot open btree $nrf: $!\n";
    print "Tied btree\n";
elsif ($nrf)
    print STDERR "Reading NR file $nrf\n";
    map {$fa->{$_}=~s/\s+//; $length->{$_}=length($fa->{$_})} keys %$fa;

if ($nrlenf ne '')
    $len_tied = tie %$length, "DB_File", $nrlenf, O_RDONLY, 0666, $DB_BTREE;
    $len_tied or die "Cannot open btree $nrlenf: $!\n";
    print "Tied length btree $nrlenf\n";

# read the fasta file if we have one
my $fasta=raelib->read_fasta($fastaf) if (-e $fastaf);
map {$fasta->{$_}=~s/X/N/g; $fasta->{$_}=~s/\s+//g;} keys %$fasta; # correct phred/phrap calls at source

# if we want to write the annotation evidence, open a file
if ($ann_ev_file)
	open(ANNEV, ">$ann_ev_file") || die "Can't open $ann_ev_file";
open(SIMS, ">$simsfile") || die "Can't open $simsfile for writing\n";
my $contiginfo; my $translation;
my $pegcount; 
my $nextpeg = $lastpeg; 
my $pegoncontig;
my $fnrole;
foreach my $bf (@$blastfile) {
    if ($bf =~ /gz$/) {
        open(IN, "gunzip -c $bf |") || die "can't open pipe to gunzip $bf";
    else {
        open(IN, $bf) || die "Can't open $bf";
 while (<IN>) {
  my ($query,$hit, $percent_id, $hsp_len, $mismatches,$gaps,$qstart,$qend,$hstart,$hend,$evalue,$bits) = split;
  next unless ($evalue <= $evaluecutoff);
  next if ($hit !~ /^fig/ && $figonly);
  unless ($hit) {
   print STDERR  "Didn't get a hit from $bf line \n$_\n";
  my $strand=1; my $printstrand="+";
  if ($qstart > $qend) {($qstart, $qend)=($qend, $qstart); $strand=-1; $printstrand = "-"}
  if (!exists $length->{$hit}) {
   # figure out the lengths of the two sequences. This is tricky because we need to adjust for using (t)blastx
   my $hmax;
   $hstart > $hend ? $hmax=$hstart : $hmax=$hend;
   # We are generally using blastx and so we have dna vs protein. The query sequence should be divided by 3 to get protein length
   # we also need to adjust this because the length of the "protein" is really the length of the sequence
   # int removes the decimal, so if the number is x.1 or more, the +0.9 will increment it to the next number
   # hence int(1/3 + 0.9)=1 and int(3/3 + 0.9)=1

   if ($blastprogram eq "tblastn" || $blastprogram eq "tblastx") {
    my $hlen=abs($hstart-$hend); $hlen=int($hlen/3 + 0.9);
    if (!exists $length->{$hit} || $hlen > $length->{$hit}) {$length->{$hit}=$hlen}
   # finally just add the max if it is not one of those.
   unless (exists $length->{$hit}) {$length->{$hit}=$hmax}
  # the peg matches so this should be in the same place
  my $changed;
  foreach my $i (0 .. @{$sortedstarts->{$query}->{$strand}}) { 
   if (${$sortedstarts->{$query}->{$strand}}[$i] <= $qstart && $start->{$query}->{$strand}->{${$sortedstarts->{$query}->{$strand}}[$i]} >= $qend) {
    $pegcount->{$query}->{($i+1)} = ++$nextpeg unless ($pegcount->{$query}->{($i+1)});
    my $peg="fig|". $genome. ".peg.". $pegcount->{$query}->{($i+1)};
    unless (exists $contiginfo->{$peg}) {
     my ($ps, $pe)=(${$sortedstarts->{$query}->{$strand}}[$i], $start->{$query}->{$strand}->{${$sortedstarts->{$query}->{$strand}}[$i]});
     # GFF3: contig	db/type	cds	start	end	score strand	phase	attributes
     $contiginfo->{$peg}=[$query, "the SEED", "cds", $ps, $pe, ".", $printstrand, ".", 
     push @{$pegoncontig->{$query}}, $peg;
     # translate the dna sequence
     if ($fasta->{$query} && !$translation->{$peg}) {
      my $subseq;
      if ($printstrand eq "-") {
       $subseq=substr($fasta->{$query}, $ps, $pe-$ps); 
       if (!$subseq || length($subseq) < 6) {$subseq=$fasta->{$query}} # there is a problem with some of the sequences not having any translation
       $subseq = raelib->rc($subseq);
      } else {
       # I want to sanity check these, so we're going to calc them separately.
       my $st=$ps-1; my $sp=$pe-$ps+1;
       if ($st < 0) {$st=0}
       if ($sp > length($fasta->{$query})) {$sp=length($fasta->{$query})}
       $subseq=substr($fasta->{$query}, $st, $sp);
       if (!$subseq || length($subseq) < 6) {$subseq=$fasta->{$query}} # there is a problem with some of the sequences not having any translation
     unless (exists $length->{$peg}) {$length->{$peg}=int(($pe-$ps)/3+0.9)}
    # convert the query to be the peg id
  unless ($changed) {
   print STDERR "Couldn't figure out a match for $hit with start $qstart and end $qend\n";
  # for the sims the lengths need to be corrected so that we think they are protein and not dna!
  # we need to adjust the length to take into account the DNA->protein conversion
  # in this conversion we subtract the offset of the gene from the beginning of the dna
  # and then we increment that by 1 because we don't want a start of 0
  # we divide by 3 to convert from dna->aa 
  # then we add 0.9 before taking the int. This is a big roundup
  # if, after the subtraction, we have 0, and then add 1, when we do int(1/3) we get 0
  # if we add 0.9 we do int(0.333+0.9)=1
  # if we're int'ing on 3 we would get int(3/3 + 0.9)=1

  if ($blastprogram eq "blastx" || $blastprogram eq "tblastx")  {
   #$qstart=int(((($qstart-$contiginfo->{$query}->[3])+1)/3) + 0.9); $qend=int(((($qend-$contiginfo->{$query}->[3])+1)/3) + 0.9);
   $qstart=1; $qend=$length->{$query};
  if ($blastprogram eq "tblastn" || $blastprogram eq "tblastx") {
   $hstart=int(((($hstart-$contiginfo->{$query}->[3])+1)/3) + 0.9); $hend=int(((($hend-$contiginfo->{$query}->[3])+1)/3) + 0.9);
  # unless ($length->{$query}) {print STDERR "No length for query $query\n"}
  # unless ($length->{$hit}) {print STDERR "No length for hit -->$hit<--\n"}
  $qstart > $qend ? (($qstart, $qend)=($qend, $qstart)) : 1;
  $hstart > $hend ? (($hstart, $hend)=($hend, $hstart)) : 1;
  # annotate the pegs. We do this favoring (a) the first annotation, and then (b) annotations that would bring the peg into subsystems
  if ($annotate && !$fnrole->{$query})
  	my $inss;
	my $annotation_from;
  	foreach my $expandedpeg ($fig->mapped_prot_ids($hit))
		if ($expandedpeg->[0] =~ /^fig\|/)
			my $fn=scalar($fig->function_of($expandedpeg->[0]));
			if (!$fnrole->{$query}) {$fnrole->{$query}=$fn; $annotation_from=$expandedpeg->[0]} # give the first function
			my @ss=$fig->role_to_subsystems($fn);
			if (scalar(@ss) && !$inss)
		last if ($inss);
	print ANNEV join("\t", $query, $fnrole->{$query}, $annotation_from, $fig->translation_length($annotation_from)), "\n" if ($ann_ev_file);
	$contiginfo->{$query}->[8] .= ";description=".$fnrole->{$query} if ($fnrole->{$query});
  print SIMS join("\t", $query,$hit, $percent_id, $hsp_len, $mismatches,$gaps,$qstart,$qend,
                         $hstart,$hend,$evalue,$bits, $length->{$query}, $length->{$hit}), "\n";
 close IN;
close ANNEV;
close SIMS;

print STDERR "Read the files the second time around in ", time-$^T, " (total) seconds\n";

# now we want to write out the feature table (although we need contig info and whatnot)
# this is in GFF3 format. Roughly.

open(GFF, ">$gfffile") || die "Can't open $gfffile for writing";
print GFF <<EOF;
##gff-version   3

foreach my $contig (sort {$a cmp $b} keys %$start) {
 if ($fasta->{$contig}) {
  print GFF "##sequence-region\t$contig\t1\t", length($fasta->{$contig}), "\n";
 else {
  print STDERR "$contig not found in the fasta file\n";
  my @longest=map {$_=$contiginfo->{$_}->[4]} @{$pegoncontig->{$contig}};
  print GFF "##sequence-region\t$contig\t1\t$longest[$#longest]\n";

 # print GFF map {(join "\t", (@{$contiginfo->{$_}})) . "\n"} @{$pegoncontig->{$contig}};
 my $poc = $pegoncontig->{$contig};

 if (ref($poc))
     for my $ent (@$poc)
	     if (ref($contiginfo->{$ent}))
		     print GFF join("\t", @{$contiginfo->{$ent}}), "\n";
		     print STDERRR Dumper("No contiginfo for $ent in ", $poc);
     print STDERR Dumper("missing", $poc);

print GFF "\n##FASTA\n";
print GFF map {">$_\n" . $translation->{$_}. "\n"} keys %$translation;
print GFF map {">$_\n" . $fasta->{$_} . "\n"} keys %$fasta;

print STDERR "Calculating the gff3 and sims is complete in ", time-$^T, " seconds\nPlease run the following commands:\n";
print STDERR <<EOF;

To Add Genome:
	gff2seed $gfffile 
	fig add_genome as described in the output from gff2seed. Be sure to use the -force and -skipnr options

To Add Sims:
	cp $simsfile $FIG_Config::data/NewSims/$genome
	index_sims $FIG_Config::data/NewSims/$genome

To generate assignments
	fig pegs_of $genome | tr \\, \\\\n | auto_assign > $genome.assigned_pegs.txt
	fig assign_functionF master:auto_assign $genome.assigned_pegs.txt


MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3