Diff of /FigKernelPackages/representative_sequences.pm

revision 1.1, Wed Dec 20 20:49:21 2006 UTC revision 1.2, Fri Feb 16 17:41:34 2007 UTC
# Line 29  Line 29 
29  #    ( \@repseqs, \%representing, \@low_sim ) = representative_sequences( $ref,  #    ( \@repseqs, \%representing, \@low_sim ) = representative_sequences( $ref,
30  #                                                 \@seqs, $max_sim, \%options );  #                                                 \@seqs, $max_sim, \%options );
31  #  #
32    #  Build or add to a set of representative sequences (if you do not want an
33    #  enrichment of sequences around a focus sequence (called the reference), this
34    #  is probably the subroutine that you want).
35    #
36    #    \@reps = rep_seq_2( \@reps, \@new, \%options );
37    #    \@reps = rep_seq_2(         \@new, \%options );
38    #
39    #  or
40    #
41    #    ( \@reps, \%representing ) = rep_seq_2( \@reps, \@new, \%options );
42    #    ( \@reps, \%representing ) = rep_seq_2(         \@new, \%options );
43  #  #
44  #  Output:  #  Output:
45  #  #
# Line 53  Line 64 
64  #               included in the representative set.  A limit is put on the  #               included in the representative set.  A limit is put on the
65  #               similarity of lineages retained by the reference sequence with  #               similarity of lineages retained by the reference sequence with
66  #               the max_ref_sim option (default = 0.99).  The reference sequence  #               the max_ref_sim option (default = 0.99).  The reference sequence
67  #               should not be repeated in the set of other sequences.  #               should not be repeated in the set of other sequences.  (Only
68    #               applies to representative_sequences; there is no equivalent for
69    #               rep_seq_2.)
70    #
71    #    \@reps     In rep_seq_2, these sequences will each be placed in their own
72    #               cluster, regardless of their similarity to one another.  Each
73    #               remaining sequence is added to the cluster to which it is
74    #               most similar, unless it is less similar than max_sim, in
75    #               which case it represents a new cluster.
76  #  #
77  #    \@seqs     Set of sequences to be pruned.  If there is no reference  #    \@seqs     Set of sequences to be pruned.  If there is no reference
78  #               sequence, the first sequence in this list will be the starting  #               sequence, the first sequence in this list will be the starting
# Line 64  Line 83 
83  #               relative to the reference (or the first sequence if there is no  #               relative to the reference (or the first sequence if there is no
84  #               reference) are dropped.  #               reference) are dropped.
85  #  #
86  #    $max_sim   Sequences with a higher similarity than max_sim to a retained  #    $max_sim   (representative_sequences only; an option for rep_seq_2)
87  #               sequence will be deleted.  The details of the behaviour is  #               Sequences with a higher similarity than max_sim to an existing
88  #               modified by other options. (default = 0.80)  #               representative sequence will not be included in the @reps
89    #               output.  Their ids are associated with the identifier of the
90    #               sequence representing them in \%representing.  The details of
91    #               the behaviour are modified by other options. (default = 0.80)
92  #  #
93  #    \%options  Key=>Value pairs that modify the behaviour:  #    \%options  Key=>Value pairs that modify the behaviour:
94  #  #
95  #        logfile    Filehandle for a logfile of the progress.  As each  #        logfile    Filehandle for a logfile of the progress.  As each
96  #                   representative sequence is identified, its id is written  #                   sequence is analyzed, its disposition is recorded.
97  #                   to the logfile, followed by a tab separated list of the  #                   In representative_sequences(), the id of each new
98  #                   ids that it represents.  Autoflush is set for the logfile.  #                   representative is followed by a tab separated list of the
99    #                   ids that it represents.  In rep_seq_2(), as each sequence
100    #                   is analyzed, it is recorded, followed by the id of the
101    #                   sequence representing it, if it is not the first member
102    #                   of a new cluster.  Autoflush is set for the logfile.
103  #                   If the value supplied is not a reference to a GLOB, then  #                   If the value supplied is not a reference to a GLOB, then
104  #                   the log is sent to STDOUT (which is probably not what you  #                   the log is sent to STDOUT (which is probably not what you
105  #                   want in most cases).  The behavior is intended to aid in  #                   want in most cases).  The behavior is intended to aid in
106  #                   following progress, and in recovery of interrupted runs.  #                   following progress, and in recovery of interrupted runs.
107  #  #
108  #        max_ref_sim  #        max_ref_sim (representative_sequences only)
109  #                   Maximum similarity of any sequence to the reference.  If  #                   Maximum similarity of any sequence to the reference.  If
110  #                   max_ref_sim is less than max_sim, it is silently reset to  #                   max_ref_sim is less than max_sim, it is silently reset to
111  #                   max_sim.  (default = 0.99, because 1.0 can be annoying)  #                   max_sim.  (default = 0.99, because 1.0 can be annoying)
112  #  #
113  #        max_e_val  Maximum E-value for BLAST (probably moot, but will help  #        max_e_val  Maximum E-value for blastall.  Probably moot, but will help
114  #                   with performance) (default = 0.01)  #                   with performance.  (default = 0.01)
115    #
116    #        max_sim    Sequences with a higher similarity than max_sim to a
117    #                   retained sequence will be deleted.  The details of the
118    #                   behaviour are modified by other options. (default = 0.80)
119    #                   (a parameter for representative_sequences, but an option
120    #                   for rep_seq_2).
121  #  #
122  #        sim_meas   Measure similarity for inclusion or exclusion by  #        sim_meas   Measure similarity for inclusion or exclusion by
123  #                  'identity_fraction' (default), 'positive_fraction', or  #                  'identity_fraction' (default), 'positive_fraction', or
124  #                  'score_per_position'  #                  'score_per_position'
125  #  #
126  #        save_exp   When there is a reference sequence, lineages more similar  #        save_exp   (representative_sequences only)
127    #                   When there is a reference sequence, lineages more similar
128  #                   than max_sim will be retained near the reference.  The  #                   than max_sim will be retained near the reference.  The
129  #                   default goal is to save one member of each lineage.  If  #                   default goal is to save one member of each lineage.  If
130  #                   the initial representative of the lineage is seq1, we  #                   the initial representative of the lineage is seq1, we
# Line 124  Line 157 
157  #                   are roughly 0.7 to 1.0.  (At save_exp = 0, any similarity  #                   are roughly 0.7 to 1.0.  (At save_exp = 0, any similarity
158  #                   would be allowed; yuck.)  #                   would be allowed; yuck.)
159  #  #
160  #        stable     If true (not undef, '', or 0), then the representatives  #        stable     (representative_sequences only; always true for rep_seq_2)
161    #                   If true (not undef, '', or 0), then the representatives
162  #                   will be chosen from as early in the list as possible (this  #                   will be chosen from as early in the list as possible (this
163  #                   facilitates augmentation of an existing list).  #                   facilitates augmentation of an existing list).
164  #  #
165  #-------------------------------------------------------------------------------  #-------------------------------------------------------------------------------
166  #  #
167  #  Diagram of the pruning behavior:  #  Diagram of the pruning behavior of representative_sequences():
168  #  #
169  #  0.5       0.6       0.7       0.8       0.9       1.0   Similarity  #  0.5       0.6       0.7       0.8       0.9       1.0   Similarity
170  #   |---------|---------|---------|---------|---------|  #   |---------|---------|---------|---------|---------|
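
The clustering rule that v.1.2 documents for rep_seq_2 (each member of \@reps seeds its own cluster; each remaining sequence joins the cluster it is most similar to, or starts a new one if it is less similar than max_sim) can be illustrated with a minimal sketch. This is not the module's implementation. The names toy_rep_seq and toy_identity_fraction are hypothetical, sequences are assumed to be [ id, sequence ] pairs, the similarity is a naive ungapped identity fraction rather than a BLAST-derived measure, a similarity exactly equal to max_sim is assumed to join the existing cluster, and every representative is assumed to get an entry in %representing even when it represents nothing else.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the BLAST-based similarity measure:
# identical positions in an ungapped comparison, divided by the longer length.
sub toy_identity_fraction {
    my ( $s1, $s2 ) = @_;
    my ( $l1, $l2 ) = ( length( $s1 ), length( $s2 ) );
    my $max = $l1 > $l2 ? $l1 : $l2;
    my $min = $l1 < $l2 ? $l1 : $l2;
    return 0 if ! $max;
    my $same = 0;
    for my $i ( 0 .. $min - 1 ) {
        $same++ if substr( $s1, $i, 1 ) eq substr( $s2, $i, 1 );
    }
    return $same / $max;
}

# Sketch of the documented rule.  Sequences are [ id, sequence ] pairs
# (an assumption for this example, not the module's data layout).
sub toy_rep_seq {
    my ( $reps, $new, $options ) = @_;
    my $max_sim = defined $options->{ max_sim } ? $options->{ max_sim } : 0.80;

    # Existing representatives each keep their own cluster, whatever
    # their similarity to one another.
    my @reps         = @{ $reps || [] };
    my %representing = map { $_->[0] => [] } @reps;

    foreach my $seq ( @$new ) {
        # Find the most similar current representative.
        my ( $best_id, $best_sim ) = ( undef, -1 );
        foreach my $rep ( @reps ) {
            my $sim = toy_identity_fraction( $seq->[1], $rep->[1] );
            ( $best_id, $best_sim ) = ( $rep->[0], $sim ) if $sim > $best_sim;
        }

        if ( defined $best_id && $best_sim >= $max_sim ) {
            # Similar enough: record it under its representative.
            push @{ $representing{ $best_id } }, $seq->[0];
        }
        else {
            # Less similar than max_sim to everything: new cluster.
            push @reps, $seq;
            $representing{ $seq->[0] } = [];
        }
    }

    return ( \@reps, \%representing );
}

# Usage with made-up data: r1 is a pre-existing representative.
my ( $reps, $representing ) = toy_rep_seq(
    [ [ r1 => 'AAAAAAAAAA' ] ],
    [ [ s1 => 'AAAAAAAAAT' ], [ s2 => 'CCCCCCCCCC' ], [ s3 => 'CCCCCCCCCG' ] ],
    { max_sim => 0.80 }
);
```

In this toy run, s1 joins r1 (similarity 0.9), s2 starts a new cluster (similarity 0 to r1), and s3 joins s2 (similarity 0.9). Because input order decides which sequence becomes a representative, the result is order dependent, which is consistent with the note that 'stable' is always true for rep_seq_2.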
