[Bio] / FigMetagenomeTools / homopolymers.pl Repository:
ViewVC logotype

View of /FigMetagenomeTools/homopolymers.pl

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.1.1.1 - (download) (as text) (annotate) (vendor branch)
Mon Feb 19 17:15:26 2007 UTC (13 years ago) by olson
Branch: x, MAIN
CVS Tags: mgrast_dev_08112011, mgrast_dev_08022011, mgrast_dev_05262011, mgrast_dev_04082011, mgrast_version_3_2, mgrast_dev_12152011, mgrast_dev_06072011, mgrast_rel_2008_0806, mgrast_dev_10262011, mgrast_dev_02212011, mgrast_rel_2008_0923, mgrast_release_3_0, mgrast_dev_03252011, mgrast_rel_2008_0924, mgrast_rel_2008_1110_v2, mgrast_rel_2008_0625, mgrast_release_3_0_4, mgrast_release_3_0_2, mgrast_release_3_0_3, mgrast_release_3_0_1, mgrast_dev_03312011, mgrast_release_3_1_2, mgrast_release_3_1_1, mgrast_release_3_1_0, mgrast_dev_04132011, mgrast_dev_04012011, mgrast_rel_2008_0919, mgrast_rel_2008_1110, myrast_33, mgrast_rel_2008_0917, mgrast_dev_04052011, mgrast_dev_02222011, y, HEAD
Changes since 1.1: +0 -0 lines
Initial import

#!/usr/bin/perl -w

=pod

=head1 homopolymers.pl

Are homopolymers a problem in our sequences? In 454 sequencing, the controls clearly demonstrate that up-to seven bases are OK, but more than that are a problem. So given a fasta sequence(s) what are the characteristics of the homopolymeric tracts in those sequences

=head2 methods

There are (at least) two different ways of doing this in perl, and some sort of both methods are encoded below. The first method uses index for an exact pattern match. This is considerably faster than the regexp, and is the method that I used. The problem is that in the run AAAAAA if you look for AAA you will get two hits. Therefore, I modified the search to look for all variants of NAAAN except NAAAA and AAAAN. This makes it at least feasible to search through the 454 datasets.

The second method, using regexps, is conceptually a lot easier - you just find the longest runs and "x" them out so that you don't see them again and count the number of times that you do this, but it is much, much slower.

=cut

use strict;
use Rob;

my $f = shift || die "$0 <fasta file>";
my $fa=Rob->read_fasta($f);

#my ($minlen, $maxlen)=(3,15);
my ($minlen, $maxlen)=(10,200);

my %tracts=&tracts;
my %score; my $len; my %hitseq; my $seqcount;
# initialize to remove undef errors;
map {$score{$_}=0} ($minlen .. $maxlen);

foreach my $id (keys %$fa)
{
	#print STDERR "$id\n";
	my $seq=uc($fa->{$id});
	$len+=length($seq);
	# this adds a non-running base at the start so we correctly see runs at the start/ends of sequences
	my $first=substr($seq, 0, 1);
	my $last=substr($seq, -1);
	$first =~ tr/AGCT/TCGA/;
	$last =~ tr/AGCT/TCGA/;
	$seq = $first.$seq.$last;
	foreach my $tr (keys %tracts)
	{
		my $hit=0;
		my $posn=index($seq, $tr);
		while ($posn > -1)
		{
			$hit++;
			$posn++;
			$posn=index($seq, $tr, $posn);
		}
		$score{length($tr)-2}+=$hit if ($hit);
		$hitseq{length($tr)-2}++ if ($hit);
		$seqcount++;
	}
}



print join("\t", "Length", "Number of sequences", "Total sequence length", "Homopolymers/bp", "Number of seqs with hp", "Total no. of seqs"), "\n";
#print map {"$_\t$score{$_}\t$len\t".$score{$_}/$len."\t$hitseq{$_}\t$seqcount\n"} sort {$a <=> $b} keys %score;
print map {"$_\t$score{$_}\t$len\t".$score{$_}/$len."\t$hitseq{$_}\t$seqcount\n"} ($minlen .. $maxlen);


sub tracts {
	# we are only concerned with homopolymeric tracts upto 12bp, so lets fake them
	my %seq;
	foreach my $base (qw[G A T C])
	{
#for (my $i=3; $i<=15; $i++) {$seq{$base x $i}=1}
		for (my $i=$minlen; $i<=$maxlen; $i++)
		{
			my $run = $base x $i;
			foreach my $start (grep {$_ ne $base} qw[G A T C])
			{
				foreach my $stop (grep {$_ ne $base} qw[G A T C])
				{
					$seq{$start.$run.$stop}=1;
				}
			}
		}
	}
	print STDERR "There are ", scalar(keys %seq), " seuences to check\n";
	return %seq;
}





exit(0);

# old way using regexp. Ignore this
my %tract;
foreach my $id (keys %$fa)
{
	print STDERR "$id\n";
	my $seq=uc($fa->{$id});
	my $len=length($seq);
	foreach my $base (qw[G A T C])
	{
		for (my $i=$len; $i>3; $i--)
		{
			my $from=$base x $i;
			my $to="n" x $i;
			#my $count=($seq =~ s/$base{$i}/"n" x $i/eg);
			my $count=($seq =~ s/$from/$to/g);
			if ($count) {$tract{$i}+=$count}
		}
	}
}

print map {"$_\t$tract{$_}\n"} sort {$a <=> $b} keys %tract;


MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3