[Bio] / FigKernelScripts / analyze_loose_sets.pl Repository:
ViewVC logotype

View of /FigKernelScripts/analyze_loose_sets.pl

Parent Directory Parent Directory | Revision Log Revision Log

Revision 1.3 - (download) (as text) (annotate)
Thu Apr 21 21:04:15 2011 UTC (8 years, 11 months ago) by parrello
Branch: MAIN
CVS Tags: mgrast_dev_08112011, mgrast_dev_06072011, mgrast_dev_08022011, rast_rel_2014_0912, myrast_rel40, rast_rel_2014_0729, mgrast_dev_05262011, mgrast_release_3_1_2, mgrast_release_3_1_1, rast_rel_2011_0928, mgrast_version_3_2, mgrast_dev_12152011, mgrast_dev_10262011, mgrast_release_3_1_0, HEAD
Changes since 1.2: +15 -3 lines
Added CSV output that can be used to generate a spreadsheet of recommended functional assignments.

#!/usr/bin/perl -w

# Copyright (c) 2003-2006 University of Chicago and Fellowship
# for Interpretations of Genomes. All Rights Reserved.
# This file is part of the SEED Toolkit.
# The SEED Toolkit is free software. You can redistribute
# it and/or modify it under the terms of the SEED Toolkit
# Public License.
# You should have received a copy of the SEED Toolkit Public License
# along with this program; if not write to the University of Chicago
# at info@ci.uchicago.edu or the Fellowship for Interpretation of
# Genomes at veronika@thefig.info or download a copy from
# http://www.theseed.org/LICENSE.TXT.

=head1 Analyze Loose Sets

    analyze_loose_sets set_file

This script reads a file of loose gene sets with functional assignments and outputs
an analysis of the assignments found in each set.

The input should be a tab-delimited file with set numbers in the first column, feature
IDs in the second column, and functional assignments in the third column. All features
in the same set are believed to be members of the same protein family, so it is
expected that they would all have the same function.

There will be six output files, each with the same name as the input file plus a

Four output files will have the same format: a tab-delimited file
containing a set number in the first column, a functional assignment in the
second, a count in the third, and a comma-delimited list of feature IDs in the fourth.
The count indicates the number of features in the set having that functional
role. The first output file will have a suffix of C<.con> and will only contain
sets that have a single functional role. The second output file will have a
suffix of C<.pinc> and will contain sets that have multiple functional
roles of which one is believed to be correct. In this file the recommended role will be
output first. The third output file will have a suffix of C<.hinc> and will contain sets
that have multiple functional roles which would be consistent or could be fixed to consistent
if the hypothetical roles were removed from consideration. The fourth output
file will have a suffix of C<.inc> and will contain sets that have multiple
functional roles for which no recommendation can be made. In this file the roles
will be output from most common to least common.

The fifth output file will have a suffix of C<.chg> and will contain the recommended
functional role changes. It will have a feature ID in the first column, a constant
C<analysis> in the second column, and the recommended function ID in the third.

The sixth output file will be a comma-separated values file with a suffix of C<.csv>
suitable for loading into Excel. The first column in this file will be a set number, the
second column will be the best functional assignment for the set, and the third column
will be blank.

The single positional parameter is the name of the input file and is required.

The command-line parameters are as follows.

=over 4

=item user

Name suffix to be used for log files. If omitted, the PID is used.

=item trace

Numeric trace level. A higher trace level causes more messages to appear. The
default trace level is 2. Tracing will be directly to the standard output
as well as to a C<trace>I<User>C<.log> file in the FIG temporary directory,
where I<User> is the value of the B<user> option above.

=item background

Save the standard and error output to files. The files will be created
in the FIG temporary directory and will be named C<err>I<User>C<.log> and
C<out>I<User>C<.log>, respectively, where I<User> is the value of the
B<user> option above.

=item h

Display this command's parameters and options.

=item dbname

Name of the Sapling database to use. This option is generally only useful for debugging.

=item dbhost

SQL host for the Sapling database to use. This option is generally only useful for debugging.

=item dbport

Database port to use for the Sapling database. This option is generally only useful for debugging.



    use strict;
    use Tracer;
    use Stats;
    use SeedUtils;
    use Sapling;

    use RoleRuleSubstring;

my ($options, @parameters) = StandardSetup([], {
                                                dbname => [$FIG_Config::saplingDB, "name of the Sapling database to use"],
                                                dbhost => ["", "host containing the Sapling database"],
                                                dbport => ["", "port for connecting to the Sapling database"],
                                                trace => ["3-", "tracing level"],
                                               }, "<inputFileName>",

# Create a statistics object.
my $stats = Stats->new();
# Get the Sapling database.
my $sap = Sapling->new(dbName => $options->{dbname}, dbhost => $options->{dbhost},
                       port => $options->{dbport});
# Create the list of role rules.
my @RoleRules = (RoleRuleSubstring->new($sap, $stats));
eval {
    # We have valid options here, so we can proceed. This variable will contain the current
    # set number.
    my $currentSet = "";
    # This hash will list the IDs of the sets containing each non-hypothetical assignment found.
    # At the end of the run, information about the multiply-occurring assignments will be displayed.
    my %roleSets;
    # This hash counts the number of times each role was found in the current set.
    my %setRoles;
    # Get the input file name.
    my $inFileName = $parameters[0];
    if (! $inFileName || $inFileName eq "-") {
        Confess("No input file specified.");
    # Open the input and output files.
    Trace("Opening set file $inFileName.") if T(2);
    my $ih = Open(undef, "<$inFileName");
    Open(\*CONH, ">$inFileName.con");
    Open(\*INCH, ">$inFileName.inc");
    Open(\*PINCH, ">$inFileName.pinc");
    Open(\*HINCH, ">$inFileName.hinc");
    Open(\*CHGH, ">$inFileName.chg");
    Open(\*CSVH, ">$inFileName.csv");
    # Put column headers in the CSV file.
    print CSVH qq("Set #","Best Functional Assignment","Recommendation"\n);
    # Loop through the input.
    while (! eof $ih) {
        # Get the next input line.
        my ($set, $fid, $function) = Tracer::GetLine($ih);
        $stats->Add(lines => 1);
        # Is this a new set?
        if ($set ne $currentSet) {
            # Yes. Produce output for the old set.
            ProcessOldSet($currentSet, \%setRoles, $stats);
            # Set up for the new set.
            %setRoles = ();
            $currentSet = $set;
        # Record this role for this set.
        push @{$setRoles{$function}}, $fid;
    # Produce output for the residual set.
    ProcessOldSet($currentSet, \%setRoles, $stats);
    # Close the files.
    close $ih;
    close CONH;
    close INCH;
    close PINCH;
    close HINCH;
    close CHGH;
if ($@) {
    Trace("Error in script: $@") if T(0);
Trace("Statistics for run:\n" . $stats->Show()) if T(2);

# This function outputs the analysis of a single set. The parameters are
#   $currentSet     ID of the relevant set
#   $setRoles       reference to a hash mapping each function to the features for which it occurred in
#                   the set (used as input)
#   $stats          statistics object (updated)
sub ProcessOldSet {
    # Get the parameters.
    my ($currentSet, $setRoles, $stats) = @_;
    # Only proceed if this is a real set.
    if ($currentSet ne "") {
        $stats->Add(sets => 1);
        # The chosen output handle goes in here.
        my $oh;
        # The sorted list of roles to output goes here.
        my $sortedRoles;
        # The recommended role, if any, goes here.
        my $recommendedRole;
        # Is this a consistent set?
        if (scalar(keys %$setRoles) == 1) {
            # Yes. Output to the consistent file and denote we have another consistent set.
            $oh = \*CONH;
            $stats->Add(consistentSet => 1);
            $sortedRoles = [keys %$setRoles];
        } else {
            # Check for outnumbered hypotheticals. Non-hypotheticals will be considered
            # "regular" and hypotheticals will be considered "excess".
            my $regularRoles = {};
            my $excessRoles = {};
            my $regularCount = 0;
            my $excessCount = 0;
            for my $role (keys %$setRoles) {
                my $fids = $setRoles->{$role};
                if (SeedUtils::hypo($role)) {
                    $excessRoles->{$role} = $fids;
                    $excessCount += scalar @$fids;
                } else {
                    $regularRoles->{$role} = $fids;
                    $regularCount += scalar @$fids;
            my $excessList = [];
            # Do we have outnumbered hypotheticals?
            if ($excessCount && $regularCount > $excessCount * 2) {
                # Yes. Sort the hypotheticals into the excess list.
                $excessList = RoleRule::Sort($excessRoles);
            } else {
                # No. Denote we have no excess rules.
                $regularRoles = $setRoles;
                $excessRoles = [];
                $excessCount = 0;
            # Loop through the role rules, stopping on the first match.
            for my $roleRule (@RoleRules) { last if defined $sortedRoles;
                # Try to match this rule.
                $sortedRoles = $roleRule->Check($regularRoles);
            # If no rules matched, we have a purely inconsistent set.
            if (! defined $sortedRoles) {
                $stats->Add(inconsistentSet => 1);
                $oh = \*INCH;
                # Do a standard sort of the roles.
                $sortedRoles = RoleRule::Sort($setRoles);
                # Denote the excess roles no longer matter.
                $excessList = [];
                $excessCount = 0;
            } else {
                # Here we want to recommend a role change.
                $stats->Add(fixableSet => 1);
                # Output to the proper file depending on whether hypotheticals are a factor.
                $oh = ($excessCount ? \*HINCH : \*PINCH);
                # Get the recommended role.
                $recommendedRole = $sortedRoles->[0];
                # Output the change instructions.
                for my $role (@$sortedRoles) {
                    if ($role ne $recommendedRole) {
                        for my $fid (@{$setRoles->{$role}}) {
                            print CHGH "$fid\tanalysis\t$recommendedRole\n";
                            $stats->Add(changeRequest => 1);
                # Add the excess roles (if any) to the list for the final output.
                push @$sortedRoles, @$excessList;
        # Send the best role to the CSV file.
        print CSVH qq($currentSet,"$sortedRoles->[0]",\n);
        # Loop through the roles.
        for my $role (@$sortedRoles) {
            # Output this role.
            my @fids = @{$setRoles->{$role}};
            print $oh join("\t", $currentSet, $role, scalar(@fids), join(", ", @fids)) . "\n";
            $stats->Add(roleSetPairs => 1);

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3