[Bio] / FigKernelScripts / kmer_property_search.pl Repository:
ViewVC logotype

View of /FigKernelScripts/kmer_property_search.pl

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.3 - (download) (as text) (annotate)
Wed Sep 18 13:09:09 2013 UTC (6 years, 1 month ago) by overbeek
Branch: MAIN
CVS Tags: rast_rel_2014_0729, rast_rel_2014_0912, HEAD
Changes since 1.2: +4 -3 lines
make frac hits for kmer a parameter

########################################################################
=head1 kmer_property_search

Classify a genome based on a property and a set of values of the property.

------

Example:

    Suppose that you wished to implement a classifier that took as
    input a fasta file of amino acid strings (the tranlations of the pegs),
    and you wanted to classify the genome they came from as either "gram_positive"
    or "gram_negative".  In this case, the "property" is "gram_stain", and the
    possible values might be

          {gram_negative,gram_positive,should_not_classify}

    I added the third value to cover things like Mycoplasmas, but
    you could just as easily leave it off.

    Then, you make an initialized Data directory using

        mkdir Data
        cp properties.table Data

    Here, the properties.table is a 2-column, tab-separated
    table. The first column must be a pubSEED genome id, and the 
    second must be one of the proprty values (the actual set of
    values is implied by the number of distinct values in the 
    second column).

    Then, assuming input is a fasta file of translations of
    the protein-encoding genes in a single genome, use

        kmer_property_search -d Data < input > report

    to compute the "classification counts" for the strings in "input".
    Lines like the following are written to STDERR indicating the calls
    of specific pegs.

    gram_positive	fig|83333.1.peg.4290	6
    gram_negative	fig|83333.1.peg.4290	297
    .
    .
    .
    Each line reflects the classification of a single peg in the input.
    At the lend, you will get a summary like

    4178	gram_negative
    422	        gram_positive

    indicating that in this case (E.coli), the predicted value would be
    gram_negative.

    NOTE: The first time or two you use this, it will go very, very slow.
          The first invocation constructs a kmer table, and this can take
          several hours.  The next time a memory map is constructed, and
          this can take 1-20 minutes.  After that, the times should drop 
          to a few seonds. 
=cut

########
use strict;
use Data::Dumper;
use SeedEnv;
use Getopt::Long;
use SeedEnv;

my $usage = "usage: kmer_property_search -d DataDir [-f MinFracForKmer] < PegFile > Report\n";
my $dataD;
my $min_hits = 5;
my $max_gap  = 200;
my $frac = 0.8;    # Note that MinFract only applies on a build (when Data/final.kmers is not there)

my $rc  = GetOptions('d=s' => \$dataD,
		     'f=f' => \$frac,
		     'g=i' => \$max_gap,
                     'm=i' => \$min_hits);

if ((! $rc) || (! -d $dataD))
{ 
    print STDERR $usage; exit ;
}
if (! -s "$dataD/properties") { die "you need to specify a properties in $dataD/properties" }

if (! -s "$dataD/final.kmers")
{
    &SeedUtils::run("kp_build_Data -d $dataD -k 8 -f $frac");
}

my $primes = [3769,6337,12791,24571,51043,101533,206933,400187,
              821999,2000003,4000037,8000009,16000057,32000011,
	      64000031,128000003,248000009,508000037,1073741824,
	      1400303159,2147483648];
open(SZ,"<$dataD/size") || die "could not open $dataD/size";
my $sz = <SZ>;
chomp $sz;
my $i;
print STDERR "required sz = $sz\n";
for ($i=0; ($i < @$primes) && ($primes->[$i] < (3 * $sz)); $i++) {}
if ($i == @$primes) { die "$sz is too large - adjust the '$primes' above" }
my $hash_size = $primes->[$i];
print STDERR "hash_size=$hash_size\n";
my $write_mem = '';
if (! -s "$dataD/kmer.table.mem_map")
{
    $write_mem = "-w ";
}

&SeedUtils::run("kmer_guts -s $hash_size -D $dataD -a $write_mem -m $min_hits -g $max_gap | kp_process_hits -d $dataD");



MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3