GFF3 Help

Written by RobE, May, 2006.

Getting GFF3 files from the SEED

The command seed2gff takes a genome and writes a GFF3 file for it. Its options let you choose which parts of the genome to include in the GFF3 file, such as proteins, or restrict the output to a particular region of the genome.

Creating GFF3 files for the NMPDR

Use the program nmpdr2gff to create the GFF3 files for uploading to the BRC site. It takes a single argument, the name of the directory to put the files into. The program goes through each organism and looks for the flag file NMPDR in the organism directory. If the flag file is present, it creates the GFF3 file and writes it into a subdirectory named after the genus. A couple of flags must be set to switch the GFF3 output from SEED to NMPDR for the BRC; the main difference is that the database is called NMPDR, not SEED.
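The flag-file check described above can be sketched in shell. This is only an illustration of the selection logic, not the code of nmpdr2gff. The function name list_nmpdr_orgs and the assumption that the organism directories sit directly under one parent directory are hypothetical.

```shell
# Hypothetical helper: print each organism directory under "$1" that
# contains a flag file named NMPDR. The directory layout is an assumption.
list_nmpdr_orgs() {
    for org in "$1"/*/; do
        # Skip the unexpanded pattern when the parent has no subdirectories.
        [ -d "$org" ] || continue
        if [ -e "${org}NMPDR" ]; then
            # Strip the trailing slash left by the glob before printing.
            printf '%s\n' "${org%/}"
        fi
    done
}
```

Calling list_nmpdr_orgs with the parent directory prints one line per flagged organism, which is roughly the set of genomes the text says would get a GFF3 file.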

Once these files are created you can gzip them and transfer them to the BRC site via ftp. You'll need the username and password from Tom Creasey at TIGR, or me.
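As a rough sketch of the compression step, the following shell function gzips everything below the output directory in one pass. It assumes that the directory holds only the GFF3 files written by nmpdr2gff; the function name gzip_outputs is made up for this example. The ftp transfer itself is left out because the host and login details are not given here.

```shell
# Hypothetical helper: compress every regular file below the output
# directory "$1". Assumption: the directory holds only GFF3 output.
# Files that already end in .gz are skipped, so a rerun is harmless.
gzip_outputs() {
    find "$1" -type f ! -name '*.gz' -exec gzip {} +
}
```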

Getting GFF3 files from the BRC site

ftp to the BRC Central site and download the files that are from the other sites. Alternatively, the easiest way to get the data is to use this command:

wget -r

This will recursively download the entire directory structure on the ftp site. I have been doing this in /home/seed/IOWG/.

Once you have downloaded the data, use the command to convert those files into the three files that we need for the mapping: fasta, assigned_functions, and org.table. This command just takes the name of the directory containing all the subdirectories and goes through them looking for GFF files and extracting data.
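To get a feel for what such a directory walk involves, here is a small shell sketch that finds GFF files below a directory and counts the feature types in each. It is not the conversion command and does not produce the three mapping files. The function name summarise_gff and the *.gff file name pattern are assumptions. It relies only on the standard GFF3 layout: nine tab-separated columns with the feature type in column 3, comment lines starting with '#', and an optional ##FASTA section at the end.

```shell
# Hypothetical helper: find GFF files below "$1" and print the count of
# each feature type (column 3) per file. The *.gff pattern is an
# assumption about how the downloaded files are named.
summarise_gff() {
    find "$1" -type f -name '*.gff' | sort | while IFS= read -r f; do
        printf '%s\n' "$f"
        # Stop at the ##FASTA directive (only sequence follows it) and
        # skip comment lines and lines without all nine columns.
        awk -F '\t' '
            /^##FASTA/ { exit }
            /^#/ || NF < 9 { next }
            { n[$3]++ }
            END { for (t in n) printf "  %s\t%d\n", t, n[t] }
        ' "$f" | sort
    done
}
```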

seed2gff options

 -g                         Number of the genome to extract (required).
 -o                         Default is the genome name. This will only be used for the first genome if many are requested.
 -n                         See POD for description. Optional.
 -u                         Default is master:master
 -b                         First base to include (inclusive)
 -e                         Last base to include (inclusive)
 -s                         Output the CDS and protein sequences in FASTA format as well as the whole genome DNA sequence
 -t |trn|pro|cds|gen|all|   Include features in the output. See below.
 -escapespace               Escape spaces as %20 (default is to leave them as ' ')
 -linelength                Line length for the sequences. Default is 60 nt
 -nmpdr                     Include NMPDR specific requirements
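The effect of two of the formatting options, -escapespace and -linelength, can be mimicked with standard tools. The sketch below is not taken from seed2gff; it only shows what the described output should look like. The function names are made up for this example.

```shell
# Mimic the described effect of -escapespace: replace each space with %20.
escape_spaces() {
    sed 's/ /%20/g'
}

# Mimic the described effect of -linelength: wrap a sequence to a fixed
# width, 60 characters unless another width is given as "$1".
wrap_sequence() {
    fold -w "${1:-60}"
}
```

For example, piping a function name containing spaces through escape_spaces turns every space into %20, and a 70 nt sequence piped through wrap_sequence comes out as one line of 60 and one of 10.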

 The genome number can be a comma-separated list of genomes; that way the ontology file will only be read once. Don't put spaces in the list unless you quote the whole thing, of course.

 e.g. seed2gff -g 243277.1,223926.1,216895.1,196600.1 -n ./gene_association.goa_uniprot.gz

 will extract all the Vibrio genomes.
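If the genome numbers are kept one per line in a file, the comma-separated list for -g can be built rather than typed. This is a minimal sketch; the function name join_genomes and the file name vibrio_genomes.txt in the comment are hypothetical.

```shell
# Join genome IDs (one per line on standard input) into a single
# comma-separated string with no spaces, suitable for the -g option.
join_genomes() {
    # Drop blank lines, then join the remaining lines with commas.
    grep -v '^[[:space:]]*$' | paste -s -d , -
}

# Possible use (vibrio_genomes.txt is a hypothetical file name):
#   seed2gff -g "$(join_genomes < vibrio_genomes.txt)" -n ./gene_association.goa_uniprot.gz
```

Because the joined string contains no spaces, it is passed to seed2gff as a single argument, which is what the note about quoting asks for.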


 Features can be any of:
        trn  :  transcript
        pro  :  protein
        cds  :  cds
        gen  :  gene
        all  :  trn, pro, cds, gen

 Note: at the moment this will probably output the same information for all of these!