[Bio] / FigMetagenomeTools / trim_repeats.pl Repository:
ViewVC logotype

View of /FigMetagenomeTools/trim_repeats.pl

Parent Directory Parent Directory | Revision Log Revision Log

Revision - (download) (as text) (annotate) (vendor branch)
Mon Feb 19 17:15:26 2007 UTC (13 years ago) by olson
Branch: x, MAIN
CVS Tags: mgrast_dev_08112011, mgrast_dev_08022011, mgrast_dev_05262011, mgrast_dev_04082011, mgrast_version_3_2, mgrast_dev_12152011, mgrast_dev_06072011, mgrast_rel_2008_0806, mgrast_dev_10262011, mgrast_dev_02212011, mgrast_rel_2008_0923, mgrast_release_3_0, mgrast_dev_03252011, mgrast_rel_2008_0924, mgrast_rel_2008_1110_v2, mgrast_rel_2008_0625, mgrast_release_3_0_4, mgrast_release_3_0_2, mgrast_release_3_0_3, mgrast_release_3_0_1, mgrast_dev_03312011, mgrast_release_3_1_2, mgrast_release_3_1_1, mgrast_release_3_1_0, mgrast_dev_04132011, mgrast_dev_04012011, mgrast_rel_2008_0919, mgrast_rel_2008_1110, myrast_33, mgrast_rel_2008_0917, mgrast_dev_04052011, mgrast_dev_02222011, y, HEAD
Changes since 1.1: +0 -0 lines
Initial import

#!/usr/bin/perl -w

# trim repeats. Save only those that start or stop between given coordinates or those longer than a given length

# Copyright Rob Edwards
# You may distribute this script under the GPL. See below.

=head1 NAME

trim_repeats.pl - remove repeats not in a specific region or shorter than a defined length


You can run this script from the command line with the command

perl -w trim_repeats.pl -f <filename> -b <beginning> -e <end> -m <minimum length>


This script will read in a genbank file and remove any repeats that are outside of
a given range. I use it to cut down the number of repeats in a file.

=head1 AUTHOR -  Rob Edwards

Email redwards@utmem.edu

The author gratefully acknowledges all the help from the members of the BioPerl team.


use strict;

my ($file, $start, $stop, $min)=('', 0,0,0);
my ($rangeb, $rangee);

while (@ARGV) {
 my $t=shift @ARGV;
 if ($t eq "-f") {$file=shift @ARGV}
 elsif ($t eq "-b") {$start=shift @ARGV}
 elsif ($t eq "-e") {$stop =shift @ARGV}
 elsif ($t eq "-m") {$min  =shift @ARGV}
 elsif ($t eq "-r") {$t = shift; ($rangeb, $rangee)=split /\.\./, $t}

unless ($file) {die "$0\n-f file name\n-b beginning\n-e end\n-r range (must be something like 200.300) to cover\n-m minimum\nIf beginning or end are not supplied length of sequence is used"}

unless ($stop) {
 # quickly see how long the sequence is. Could get it from the first line, but ech.
 open(IN, $file) || die "Can't open $file";
 my ($keep, $seq);
 while (<IN>) {
  # Example first line:
  #LOCUS       Phage_1                79274 bp    dna     linear   UNK
  if (/^LOCUS.*?(\d+)\s+bp/) {$stop=$1;last}
  if (/ORIGIN/) {$keep=1; next}
  next unless ($keep);
  chomp; s/\s//g; s/\d//g;
  $seq .= $_;
 close IN;
 if (!$stop && $seq) {$stop=length($seq)}
 unless ($stop) {die "Can't parse the length from $file??"}

if ($start>$stop) {($start, $stop)=($stop, $start)}

print STDERR "Looking for repeats from $start to $stop with minimum length $min\n";

open (IN, $file) || die "Can't open $file";

my $out=$file;
$out =~ s/.gbk$/.trimmed.gbk/;
$out =~ s/.gb$/.trimmed.gbk/;
if ($out eq $file) {$out .= ".trimmed.gbk"}

my $skip;
open (OUT, ">$out") || die "Can't open $out for writing";
while (my $l=<IN>) {
 if ($skip) {
  next if ($l =~ /^\s+\//);
  undef $skip;
 unless ($l =~ /^\s+repeat/) {print OUT $l; next}
 if ($l =~ /repeat_region/) {print OUT $l; next}
 # repeat has the format join(411..423,13244..13256)
 my ($one, $two, $three, $four)=($1, $2, $3, $4);
 unless ($one && $two && $three && $four) {die "Can't parse numbers from $l\n"}
 my $keep;
 foreach my $pos ($one, $two, $three, $four) {
   if ((($two-$one+1) > $min) && $pos > $start && $pos < $stop) {$keep=1}
 if ($rangeb && $rangee) {
  if (($one < $rangeb || $two < $rangeb) && ($three > $rangee || $four > $rangee)) {$keep=1}
  else {undef $keep}
 if ($keep) {print OUT $l; next}
 else {$skip=1}

#    Copyright 2002-2004 Rob Edwards
#    For updates, more information, or to discuss the scripts
#    please contact Rob Edwards at redwards@utmem.edu or via http://www.salmonella.org/
#    This file is part of Rob's Scripts developed by Rob Edwards.
#    These scripts are free software; you can redistribute and/or modify
#    them under the terms of the GNU General Public License as published by
#    the Free Software Foundation; either version 2 of the License, or
#    (at your option) any later version.
#    They are distributed in the hope that they will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    GNU General Public License for more details.
#    You should have received a copy of the GNU General Public License
#    in the file (COPYING) along with these scripts; if not, write to the Free Software
#    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3