[Bio] / FigKernelScripts / seed2gff.pl Repository:
ViewVC logotype

View of /FigKernelScripts/seed2gff.pl

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.36 - (download) (as text) (annotate)
Mon Jun 30 15:53:03 2008 UTC (11 years, 7 months ago) by redwards
Branch: MAIN
CVS Tags: mgrast_dev_08112011, rast_rel_2009_05_18, mgrast_dev_08022011, rast_rel_2014_0912, myrast_rel40, mgrast_dev_05262011, rast_rel_2008_12_18, mgrast_dev_04082011, rast_rel_2008_07_21, rast_rel_2010_0928, rast_2008_0924, mgrast_version_3_2, mgrast_dev_12152011, mgrast_dev_06072011, rast_rel_2008_09_30, rast_rel_2009_0925, rast_rel_2010_0526, rast_rel_2014_0729, mgrast_dev_02212011, rast_rel_2010_1206, mgrast_release_3_0, mgrast_dev_03252011, rast_rel_2010_0118, mgrast_rel_2008_0924, mgrast_rel_2008_1110_v2, rast_rel_2009_02_05, rast_rel_2011_0119, mgrast_rel_2008_0625, mgrast_release_3_0_4, mgrast_release_3_0_2, mgrast_release_3_0_3, mgrast_release_3_0_1, mgrast_dev_03312011, mgrast_release_3_1_2, mgrast_release_3_1_1, mgrast_release_3_1_0, mgrast_dev_04132011, rast_rel_2008_10_09, mgrast_dev_04012011, rast_release_2008_09_29, mgrast_rel_2008_0806, mgrast_rel_2008_0923, mgrast_rel_2008_0919, rast_rel_2009_07_09, rast_rel_2010_0827, mgrast_rel_2008_1110, myrast_33, rast_rel_2011_0928, rast_rel_2008_09_29, mgrast_rel_2008_0917, rast_rel_2008_10_29, mgrast_dev_04052011, mgrast_dev_02222011, rast_rel_2009_03_26, mgrast_dev_10262011, rast_rel_2008_11_24, rast_rel_2008_08_07, HEAD
Changes since 1.35: +4 -1 lines
getting rid of need for strain info

# -*- perl -*-
#
# Copyright (c) 2003-2006 University of Chicago and Fellowship
# for Interpretations of Genomes. All Rights Reserved.
#
# This file is part of the SEED Toolkit.
# 
# The SEED Toolkit is free software. You can redistribute
# it and/or modify it under the terms of the SEED Toolkit
# Public License. 
#
# You should have received a copy of the SEED Toolkit Public License
# along with this program; if not write to the University of Chicago
# at info@ci.uchicago.edu or the Fellowship for Interpretation of
# Genomes at veronika@thefig.info or download a copy from
# http://www.theseed.org/LICENSE.TXT.
#


# some pod

=head1 SEED2GFF3 CONVERTER

This will take data from the SEED and print it out in GFF3 format suitable for exchange with  other software.

GFF3 format is essentially an extended tab-separated format. See below for full details. GFF3 is also supposed to have some validation with the SO/GO components. At the moment that funcionality is missing. If anyone knows of a GFF3 validator, please try running the output of this script through it and let me know the outcome.
 
Written by Rob Edwards January 1, 2005.
 RobE@thefig.info

=head2 DESCRIPTION OF THE DATA FORMAT FOR GFF3

 This is the GFF3 format taken from http://song.sourceforge.net/gff3.shtml
 column		contents
 1		sequence id = contig name
 2		source = NMPDR or SEED
 3		type of feature. At the moment we will use: 
			CDS: SO:0000316 (CDS)
			RNA: SO:0000483 (nc_primary_transcript) 
			These come from 
 4		start (1-based integer)
 5		end (note: start <= end)
 6		score (not used)
 7		strand ('+', '-', '.', '?' for plus, minus, either, don't know)
 8		phase (usu. used for exons. I ignore)
 9		attributes:
			ID     - must be unique - this will be the FIG id right now
			Name   - display name   - this will be the gene name if we know it ??
			Alias  - other names - note that I will mostly only use this for aliases that
				 we don't know databases for. Those will be in dbxref
			Parent - anything that is a sub part of something else. I don't think we'll use this
				 but it could be used for promoters, etc
			Target - not used
			Gap    - not used
			Note   - anything I can't figure out where it should be
			DBXref - the following aliases (at the moment)
				 GI, KEGG, SP, UNI, EMBL
				 these are in the form described at ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs
				 Please note that there is now a BRC-compliant list at http://www.brc-central.org/cgi-bin/brc-central/dbxref_list.cgi
				 Dbxref="EMBL:AA816246"
				 Dbxref="NCBI_gi:10727410"
				 Dbxref="KEGG:vch+VC0007"
				 Dbxref="UniProt:Q9KVY1"
				 Dbxref="Swiss-Prot"
			Ontology_term - the GO number
				 Ontology_term="GO:1234"



=head2 GENE ONTOLOGY

I ditched the old GO term file stuff, and now we just grab the GO terms from the attributes.

=head2 NOTE ON EXTERNAL MODULES
 
I use the following modules:
 
 Getopt::Long  - this is used for parsing the command line
 URI::Escape   - this is to escape any text that we use. This is available from http://search.cpan.org/~rse/lcwa-1.0.0/

=head2 BUGS AND CONTACT

Contact Rob Edwards (RobE@thefig.info) with any bugs and/or comments.

=head2 TO DO

Bioperl is working on a GFF3 writer. It would probably be nice to convert this to create Bio::SeqFeature::Generic objects and then output through bioperl. That way the output could be generalize for GenBank, EMBL, fasta, etc, etc, etc.

At the time of writing Bioperls GFF3 writer was not working properly!

=head2 NMPDR vs SEED

There are some specific requirements that we need to have in the GFF3 files for the BRC interoperability working group. The NMPDR flag will bring those in.

=head2 timestamps

If a file called 'assigned' is present in the working directory, and it contains five columns the peg and timestamp (in the first two columns) will be used to timestamp the data.

Similarly if a file called connected is also present, the same will happen.

=cut



use strict;
use FIG;
use URI::Escape; # this is used to escape the sequences and comes from http://search.cpan.org/~rse/lcwa-1.0.0/
use Getopt::Long;
my $gzip=1;
# check and see if we can use gzip
eval {require PerlIO::gzip};
undef $gzip if ($@);

my $fig=new FIG;

my $usage=<<EOF;

$0 <OPTIONS>
 -g <genome number>         Number of the genome to extract (required). 
 -o <output file>           Default is the genome name. This will only be used for the first genome if many are requested.
 -u <user>                  Default is master:master
 -b <value>		    First base to include (inclusive)
 -e <value>                 Last base to include (inclusive)
 -s 			    Output the CDS and protein sequences in FASTA format as well as the whole genome DNA sequence
 -t |trn|pro|cds|gen|all|   Include features in the output. See below.
 -escapespace		    Escape spaces (default is to leave spaces as ' '. If this is called they will be converted to %20) 
 -linelength <value>        Line length for the sequences. Default is 60 nt
 -nmpdr			    Include NMPDR specific requirements
 -xml			    Output an XML file of the genome data. This will be enabled by default if you use -nmpdr since we have to have XML for the NMPDR orgs.
 
 The genome number can be a comma separated list of genomes, and that way the ontology file will only be read once. Don't put spaces in the list unless you quote the whole thing, of course.

 e.g. $0 -g 243277.1,223926.1,216895.1,196600.1 -n ./gene_association.goa_uniprot.gz

 will extract all the Vibrio genomes.

 Features:

 Features can be any of:
 	trn  :  transcript
 	pro  :  protein
 	cds  :  cds
 	gen  :  gene
 	all  :  trn, pro, cds, gen

 Note at the moment this will probably put out the same information for all of these!!
 
EOF

my ($allgen, $outputf, $user, $beg, $end, $outputfasta, $trn, $escapespace, $linelength, $nmpdr, $xmloutput);
GetOptions(
 'g|genome:s'      => \$allgen,
 'o|output:s'      => \$outputf,
 'u|user:s'        => \$user,
 'b|begin:i'       => \$beg,
 'e|end:i'         => \$end,
 's|seq'           => \$outputfasta,
 't|type:s'        => \$trn,    
 'x|xml'	   => \$xmloutput,
 'escapespace'     => \$escapespace,
 'linelength:i'    => \$linelength, 
 'nmpdr'	   => \$nmpdr,
);


die $usage unless ($allgen);

my %outputtype;
foreach (split /\,\s*/, $trn) {
 if (/trn/i || /transcript/i) {$outputtype{'transcript'}=1}
 elsif (/pro/i || /protein/i) {$outputtype{'protein'}=1}
 elsif (/cds/i) {$outputtype{'cds'}=1}
 elsif (/gen/i) {$outputtype{'gene'}=1}
 elsif (/all/i) {foreach my $t (qw[transcript protein cds gene]) {$outputtype{$t}=1}}
 else {print STDERR "Don't know what output type |$_| is, so we ignored it\n"}
}
my $fastasequences;
unless ($linelength) {$linelength=60}

unless ($user) {$user="master:master"}

# set the source depending on if we are making the files for the NMPDR
my $source="The SEED";
$nmpdr ? $source = "NMPDR" : 1;
$nmpdr ? $xmloutput = 1 : 1; # set xml output for the nmpdr only settings.

my %asstimestamp;
if (-e "assigned")
{
	open(IN, "assigned") || die "Can't open assigned even though it exists";
	while(<IN>)
	{
		chomp;
		my @a=split /\t/;
		$asstimestamp{$a[0]}="$a[2]:$a[1]";
	}
}
else {print STDERR "The file 'assigned' was not found so no timestamps added\n"}
my %conntimestamp;
if (-e "connected")
{
	open(IN, "connected") || die "Can't open connected even though it exists";
	while(<IN>)
	{
		chomp;
		my @a=split /\t/;
		$conntimestamp{$a[0]}="$a[2]:$a[1]";
	}
}
else {print STDERR "The file 'connected' was not found so no timestamps added\n"}


# now process each genome
my $gs;
foreach my $genome (split /\,/, $allgen) {
 unless ($outputf) {$outputf = $genome}
 if (defined $beg && defined $end) 
 {
  if ($end < $beg) {($end, $beg)=($beg, $end)}
  print STDERR "Only outputting the ", int(($end-$beg)/1000), " kb between $beg and $end\n";
  $outputf .= "_" . $beg . "_" . $end;
 }
 
 $gs=$fig->genus_species($genome);
 unless ($gs) {
  print STDERR "Couldn't recognize $genome. Skipped\n";
  next;
 }
 else {print STDERR "Converting $gs ($genome) to GFF3. Output will be in $outputf\n"} 
 
 my $taxid=$genome; $taxid=~s/\.\d+$//;
  # a hack to separate genus/species from strain
  $gs =~ /^(\S+\s+\S+)\s+(.*)$/;
  my ($on, $strain)=($1, $2);
  # unless ($on) {die "Genus/Species not defined"}
  unless (defined $on) {$on = ""}; # just get rid of undef warnings
  unless (defined $strain) {$strain = ""}; # just get rid of undef warnings
  
 
 if ($xmloutput) {
 	my $xmlf=$outputf;
	$xmlf =~ s/\.gff3*$/.xml/;
 	open(XML, ">$xmlf") || die "Can't open $xmlf for writing";
	my ($xmlname, $xmlacro)=("The SEED Environment", "SEED");
	$nmpdr ? (($xmlname, $xmlacro)=("National Microbial Pathogen Data Resource", "NMPDR")) : 1;
	my $xmltime=scalar(localtime(time));
	# this is to remove the dir from the start of the output file name so that the link that 
	# appears in the xml is correct. Oy.
	my $linktooutput=$outputf;
	$linktooutput =~ s/^.*?\///;
	print XML <<EOF;
<?xml version='1.0' encoding='utf-8'?>
<!-- Preliminary XML creation. Some fields may be intentionally left blank or reverted to No -->
<!-- XML Created at $xmltime -->
<!-- For comments or questions on this XML file please contact Rob Edwards -->

<source>
  <category name="brc">
  <parameter name="name">$xmlname</parameter>
  <parameter name="acronym">$xmlacro</parameter>
  </category>

  <category name="organism">
  <parameter name="name">$on</parameter>
   <parameter name="strain">$strain</parameter>
  <parameter name="taxon_id">$taxid</parameter>
  </category>

  <category name="gff3">
  <parameter name="corresponding_gff3_file">$linktooutput</parameter>
  <parameter name="ftp_url">ftp://ftp.nmpdr.org/$linktooutput</parameter>
  </category>

  <category name="submission">
  <parameter name="brc_central_submits_to_refseq">No</parameter>
  </category>

  <category name="curatorship">
  <parameter name="annotation">NMPDR</parameter>
  <parameter name="gene_ends">NMPDR</parameter>
  <parameter name="gene_families">NMPDR</parameter>
  </category>

  <category name="ownership">
  <parameter name="original_source"></parameter>
  <parameter name="original_sequence"></parameter>
  <parameter name="original_annotation"></parameter>
  <parameter name="downloaded_from"></parameter>
  <parameter name="date_of_download"></parameter>
  <parameter name="owner_url"></parameter>
  </category> 

EOF
  }
 
 open (OUT, ">$outputf") || die "Can't open $outputf for writing";
 my $taxonomy=$fig->taxonomy_of($genome);
 open( PROJECT, "<$FIG_Config::organisms/$genome/PROJECT" ) || return ();
 my @project = <PROJECT>;
 chomp(@project);
 close PROJECT;

 my $md5=$fig->genome_md5sum($genome);
 print OUT "##gff-version\t3\n";
 print OUT "##feature-ontology\tso.obo\n##attribute-ontology\tgff3_attributes.obo\n" if ($nmpdr); # they want this in there, even though we don't use GO. Whatever.
 print OUT <<EOF;
#seed\tgenome_id\t$genome
#seed\ttaxon_id\t$taxid
#seed\tname\t$gs
#seed\ttaxonomy\t$taxonomy
#seed\tproject\t$project[$#project]
#seed\tgenome_md5\t$md5
#The web_ids are unique ids that you can link to our data with. Use the url: http://www.nmpdr.org/linkin.cgi?id=web_id to access the data

EOF

 my $contig_data; # all the data we get for all the contigs. Going to chew thru memory I think
 my %len; # length of each of the contigs.

 my @all_pegs = $fig->pegs_of($genome);

 my(%go, %ev);

 warn "Retrieve all go terms\n";
 for my $res ($fig->get_attributes(\@all_pegs, "GO"))
 {
     push(@{$go{$res->[0]}}, $res);
 }

 warn "Retrieve all ev codes\n";
 for my $res ($fig->get_attributes(\@all_pegs, "evidence_code"))
 {
     push(@{$ev{$res->[0]}}, $res);
 }


 foreach my $peg (sort { &FIG::by_fig_id($a,$b) } @all_pegs, $fig->rnas_of($genome)) {
  # get some initial information about this peg 
  #if (lc($type) eq "peg") {$type="SO:0000316"}
  #elsif (lc($type) eq "rna") {$type="SO:0000483"}
  #else {die "Can't figure out what type |$type| is"}
 
  ###########
  #
  # Begin figuring out the column 9 information about notes and aliases and GO terms
  # All the information is temporarily stored in @alias or @note, and at the end is joined
  # into $allnote
  #
  ###########

  
  my $note; # the notes for the last column. This is a hash where each key is the type (Note, Dbxref, etc) and the value is a reference to an array of values
  my $func=$fig->function_of($peg, $user);

  # parse out the ec number if there is one
  if ($func =~ /E\.*C\.*\s+(\d+|-)\.(\d+|-)\.(\d+|-)\.(\d+|-)/)
  {
    #push @{$note->{"ec_number"}}, "$1.$2.$3.$4";
    $note->{"Dbxref"}->{"EC:$1.$2.$3.$4"}=1;
  }
  
  # if this is for the nmpdr we are required to have a description, so we'll make it hypothetical protein :)
  ($nmpdr && !$func && $peg =~ /peg/) ? ($func = "hypothetical protein") : 1;
  $func=uri_escape($func);
  ($nmpdr && $func)  ? push @{$note->{"description"}}, $func : $func ? push @{$note->{"Note"}}, $func : 1;

  #push @{$note->{"Note"}}, $func if ($func); # this should escape the strings using URL rules.
  
  foreach my $alias ($fig->feature_aliases($peg))
  {
   if (my $cleanalias=$fig->rewrite_db_xrefs_brc($alias))
   {
	$note->{"Dbxref"}->{$cleanalias}=1;
   }
   elsif ($alias =~ /^[a-zA-Z][a-zA-Z0-9]{2,}\_[a-zA-Z]*\d+$/)
   {
   	# that should be the regexp for a valid locus tag. Starts with a letter, has at least three alphanumerics before the _
	# then has a series of optional letters and ends with numbers
       if (!$nmpdr)
       {
	   push @{$note->{"locus_tag"}}, $alias;
       }
   }
   elsif (($alias =~ /^.{4,6}$/ && $alias !~ /^cds\./) || $alias =~ s/^GENE\:// || $alias =~ /^tRNA-/)
   {
   	push @{$note->{"gene_symbol"}}, $alias;
   }
   else {
    # don't know what it is so keep it as an alias
    $alias = uri_escape($alias); # just in case!
    push @{$note->{"Alias"}}, $alias;
   }
  }

  # Now for the GO terms
  my $go=&go_term(\%go, $peg);
  push @{$note->{"Ontology_term"}}, $go if ($go);

  # And the timestamps
  if ($asstimestamp{$peg}) {push @{$note->{"assigned_timestamp"}}, $asstimestamp{$peg}}
  if ($conntimestamp{$peg}) {push @{$note->{"connected_timestamp"}}, $conntimestamp{$peg}}

  # And the evidence code
#  $ev_data = $fig->get_attributes($peg, "evidence_code");
  my $ev_data = $ev{$peg};
  my $evc=join (",", map {$_->[2] =~ s/\;.*$//; $_->[2]} $ev_data ? @$ev_data : ());
  push @{$note->{"evidence_code"}}, $evc if ($evc);

  # now go through the links and see if there is anything that we should add as a Dbxref.
  # the links are stored like this:
  # <A HREF=http://smart.embl-heidelberg.de/smart/do_annotation.pl?ACC=SM00482>SMART SM00482</A>
  # we will parse out the things between the > and </A>. In this case the SMART SM00482
  # generally these are in the format DB\s+ID, but I have added some special cases
  #
  # Set links_to_keep to true for the link to keep it. Set links to keep to false to ignore it
  
  my %links_to_keep=(
        Pfam            => 1,
        PubMed          => 1,
        PIRSF           => 1,
        SMART           => 1,
        InterPro        => 1,
	PMID		=> 1,
	UniProtKB	=> 1,
        PDB             => 1,
        UniProt         => 1,
        IEDB            => 1,
        EcoCyc          => 1,
        Colibri         => 1,
        ATP             => 0,
        Animation       => 0,
        vibrio          => 0,
	HarvardClone	=> 0,
	"E."		=> 0,
        Entrez          => 0,
	Locus_Tag    	=> 0,
	locus_tag    	=> 0,
    );

  my %warn;
  foreach my $ln ($fig->fid_links($peg))
  {
      my ($db, $id);
      if ($ln =~ /field0=CATID/ && $ln =~ /query0=(\d+)/ && ($id=$1) && $ln =~ /pcenter=harvard/) {$db="HarvardClone"}
      elsif ($ln =~ /(PIRSF)(\d+)/) {($db, $id)=($1, $2)} # special case for pirsf
      elsif ($ln =~ />(\S+)\s+(\S+.*?)</) {($db, $id)=($1, $2)}
	# correct some cases where the db needs a specific name
	$db eq "PubMed" ? $db = "PMID" : 1;
	$db eq "UniProt" ? $db = "UniProtKB" : 1;
      
      $db =~ s/\://;
      if (!$db && !$id) {print STDERR "Ignored link: $ln\n"; next}
      if (!defined $links_to_keep{$db} && !$warn{$db}) 
      {
        $warn{$db}=1; 
        print STDERR "Not sure whether to keep or ignore DB: |$db| ID: |$id| $ln\n";
        next
      } #the warn is so we only error once
      if ($links_to_keep{$db}) {$note->{"Dbxref"}->{"$db:$id"}=1}
  }

 
  # the LAST thing I am going to add as a note is the FIG id so that I can grep it out easily
  $note->{"Dbxref"}->{"FIG_ID:$peg"}=1;
  
  # finally join all the notes into a long string that can be added as column 9. 
  
  # RAE: Jan 2006. Initially we had Dbxrefs joined by comma's. Now we have multiple per line
  # I actually like this better as you can hit them all with Dbxref=(.*?)
  # RAE: Sep 2006. Now they changed back to joined with comma's, so I did too.
 
  my $allnotes;
  if (1) 
  {
    $allnotes=join(";", map {$_ .= "=" . join(",", @{$note->{$_}})} grep {!/Dbxref/} keys %$note); 
    $allnotes && ($allnotes .= ";"); # we only want to add a trailing ; if we actually have something to add it to
    $allnotes.="Dbxref=". join(",", sort keys %{$note->{"Dbxref"}}) if (defined $note->{"Dbxref"});
  }
  else
  {
    $allnotes=join(";", map {$_ .= "=" . join(",", @{$note->{$_}})} grep {!/Dbxref/} keys %$note); 
    $allnotes && ($allnotes .= ";"); # we only want to add a trailing ; if we actually have something to add it to
    $allnotes.="Dbxref=". join(";Dbxref=", sort keys %{$note->{"Dbxref"}}) if (defined $note->{"Dbxref"});
  }

  # do we want to convert '%20' to  ' '
  unless ($escapespace) {$allnotes =~ s/\%20/ /g}

  
  ###########
  #
  # End figuring out the column 9 information about notes and aliases and GO terms
  #
  ###########

  my $gff; # where we store the output
  my @location=$fig->feature_location($peg);

  warn "$peg @location\n";


  foreach my $loc (@location) {
   $loc =~ /^(.*)\_(\d+)\_(\d+)$/;
   my ($contig, $start, $stop)=($1, $2, $3);
   my $original_contig=$contig;

   my $contig_id = $contig;
   if ($nmpdr)
   {
       $contig_id = "nmpdr|$genome.contig.$contig_id";
   }

   $contig = uri_escape($contig); # the contig name must be escaped

   unless (defined $len{$contig}) {$len{$contig}=$fig->contig_ln($genome, $original_contig)}
   my $strand='+';
   
   next if (defined $beg && ($start < $beg || $stop < $beg));
   next if (defined $end && ($start > $end || $stop > $end));
   
   if ($start > $stop) {($start, $stop, $strand)=($stop, $start, '-')}
   elsif ($start == $stop) {$strand="."}
   
   my $type=$fig->ftype($peg);
   
   if ($type eq "peg") {
    # it is a protein coding gene
    # create an artificial id that is just the peg.(\d+) information
    # we will use this to create ids in the form cds.xxx; trn.xxxx; pro.xxx; gen.xxx;
    # note I have got rid of this because it was stupid. Now we use the peg id.
    $peg =~ /.peg.(\d+)/;
    my $geneid=$1;
    
    # SO terms:
    # transcript: SO:0000673
    # gene: SO:0000704
    # cds: SO:0000316
    # protein:  NOT VALID: should be protein_coding_primary_transcript SO:0000120

    my $gene_id = "gene.$geneid";
    my $gff_id = $peg;

    if ($nmpdr)
    {
	$gene_id = "nmpdr|$genome.$gene_id";
	$gff_id =~ s/^fig/nmpdr/;
    }
    
    if ($outputfasta) {
     my $adddnaseq = uc($fig->dna_seq($genome, $fig->feature_location($peg)));
     $adddnaseq =~ s/(.{$linelength})/$1\n/g; chomp($adddnaseq);
     $adddnaseq = uc($adddnaseq);
     $fastasequences .= ">$gene_id\n$adddnaseq\n";
     my $addseq = $fig->get_translation($peg);
     $addseq =~ s/(.{$linelength})/$1\n/g; chomp($addseq); # the chomp is so that we know for sure to add the line back
     $addseq=uc($addseq);
     #$fastasequences .= ">pro.$geneid\n$addseq\n";
     $fastasequences .= ">$gff_id\n$addseq\n";
    }
    
    push (@{$contig_data->{$contig}}, join("\t", $contig_id, $source, "gene", $start, $stop, ".", $strand, ".", "ID=$gene_id"));
    #push (@{$contig_data->{$contig}}, join("\t", $contig, $source, "cds", $start, $stop, ".", $strand, ".", "ID=pro.$geneid;$allnotes;Parent=gene.$geneid"));
    push (@{$contig_data->{$contig}}, join("\t", $contig_id, $source, "cds", $start, $stop, ".", $strand, ".", "ID=$gff_id;web_id=$peg;$allnotes;Parent=$gene_id"));
   } # end the if type==peg
   elsif ($type eq "rna") {
    $peg =~ /.(rna.\d+)/;
    my $geneid=$1;
    my $type="tRNA"; # tRNA is a valid SOFA term == SO:0000253

    my $gene_id = "gene.$geneid";
    my $gff_id = $peg;

    if ($nmpdr)
    {
	$gene_id = "nmpdr|$genome.$gene_id";
	$gff_id =~ s/^fig/nmpdr/;
    }

    if ($outputfasta) {
     my $addseq = $fig->dna_seq($genome, $fig->feature_location($peg));
     $addseq =~ s/(.{$linelength})/$1\n/g; chomp($addseq);
     $addseq = uc($addseq);
     $fastasequences .= ">$gff_id\n$addseq\n";
    }
    push (@{$contig_data->{$contig}}, join("\t", $contig_id, $source, "gene", $start, $stop, ".", $strand, ".", "ID=$gene_id"));
    push (@{$contig_data->{$contig}}, (join "\t", ($contig_id, $source, $type, $start, $stop, ".", $strand, ".", "ID=$gff_id;web_id=$peg;$allnotes;Parent=$gene_id")));
   } # end the if type == rna
   else {
    die "Don't know what type: |$type| is";
   }
  } # end the foreach $loc (@location)
 } # end the foreach peg/rna
 
 #############
 #
 # Now we just need to add the sequence data
 #
 #############
 
 if ($xmloutput) {print XML '<category name="sequence_info">'."\n"}

 foreach my $contig (sort keys %{$contig_data}) 
 {
  my ($contigstart, $contigend)=(1, $len{$contig});
  if (defined $beg) {$contigstart=$beg}
  if (defined $end) {$contigend  =$end}

  my $contig_id = $contig;
  if ($nmpdr)
  {
      $contig_id = "nmpdr|$genome.contig.$contig_id";
  }
  
  print OUT "##sequence-region\t$contig_id\t$contigstart\t$contigend\n";
 
  if ($xmloutput)
  {
  	print XML <<EOF;
	<sequence>
		<internal>
		<parameter name="sequence_identifier">fig|$genome.contig.$contig</parameter>
		</internal>
		<genbank>
		<parameter name="latest_accession"></parameter>
		<parameter name="latest_version"></parameter>
		<parameter name="owner"></parameter>
		</genbank>
		<refseq>
		<parameter name="latest_accession"></parameter>
		<parameter name="latest_version"></parameter>
		</refseq>
		<modified>
		<parameter name="sequence_changed">NO</parameter> 
		<parameter name="submitted_to_genbank">NO</parameter> 
		</modified>
	</sequence>

EOF
  }

  # print out the sequence definition line
  #NMPDR NOTE: Some of this could be wrong, esp, the molecule_type and translation_table, but we don't have that information at the moment.

  print OUT join
    ("\t", 
    $contig_id, $source, "contig", $contigstart, $contigend, ".\t+\t.", 
    "ID=$contig_id;Name=Contig $contig;organism_name=$on;strain=$strain;Dbxref=taxon:$taxid;molecule_type=dsDNA;translation_table=11"
    ), "\n";
  print OUT join("\n", @{$contig_data->{$contig}}), "\n";
 }
 if ($xmloutput) {print XML "</category>\n"}

 
 print OUT "\n##FASTA\n";
 # print out the cds and pro if we need them
 print OUT $fastasequences if ($outputfasta);
 
 foreach my $contig (sort $fig->all_contigs($genome)) 
 {
  my $len=$fig->contig_ln($genome, $contig);
  my $dna_seq=$fig->dna_seq($genome, $contig . "_1_". $len);
  if (defined $beg) 
  {
   unless (defined $end) {$end=$len}
   $dna_seq = substr($dna_seq, $beg, $end);
  } 
  elsif (defined $end)
  {
   $beg=1;
   $dna_seq = substr($dna_seq, $beg, $end);
  }
  
  my $contig=uri_escape($contig);
  $dna_seq =~ s/(.{60})/$1\n/g;
  chomp($dna_seq); # just remove the last \n if there is one

  my $contig_id = $contig;
  if ($nmpdr)
  {
      $contig_id = "nmpdr|$genome.contig.$contig_id";
  }


  print OUT ">$contig_id\n$dna_seq\n";
 }
 undef $outputf; # so we don't overwrite the output file!
 close OUT;
 if ($xmloutput) {print XML "</source>\n"; close XML}
}

print STDERR "Completed in ", time - $^T, " seconds\n";
exit(0);


sub go_term {
    my($go_cache, $peg) = @_;
  my @ego;
  # my @attr = $fig->get_attributes($peg, "GO"));
  my $attr = $go_cache->{$peg};
    return "" unless ref($attr);
  foreach my $g (@$attr)
  {
  	my $term = $g->[2];
	$term =~ s/^.*?(GO\:\S+).*?$/$1/;
	push @ego, $term;
  }
  return join(",", @ego);
}




MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3