View of /FigKernelPackages/Observation.pm

Revision 1.5
Wed Jun 13 17:56:35 2007 UTC (12 years, 6 months ago) by arodri7
Branch: MAIN
Changes since 1.4: +97 -4 lines
Added Identical Protein sub

package Observation;

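use strict;
use warnings;

# Exporter's import() must be inherited or imported for @EXPORT_OK to take effect.
use Exporter 'import';
our @EXPORT_OK = qw(get_objects);

use FIG;    # the get_*_observations() helpers below construct FIG objects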

1;

# $Id: Observation.pm,v 1.5 2007/06/13 17:56:35 arodri7 Exp $

=head1 NAME

Observation -- A presentation layer for observations in SEED.

=head1 DESCRIPTION

The SEED environment contains various sources of information for sequence features. The purpose of this library is to provide a 
single interface to this data.

The data can be used to display information for a given sequence feature (protein or other, but primarily information is computed for proteins). 

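Example:

    use FIG;
    use Observation;

    my $fig = new FIG;
    my $fid = "fig|83333.1.peg.3";

    # get_objects() expects the class (or an object) as its first argument
    my $observations = Observation->get_objects($fid);
    foreach my $observation (@$observations) {
        print "ID: " . $fid . "\n";
        print "Start: " . $observation->start() . "\n";
        ...
    }

B<get_objects() returns a reference to an array of Observation objects.>

C<< $observation->acc >> returns the accession number for the observation, if present. Note that a method call is not
interpolated inside a double-quoted string, so print it with C<< print $observation->acc . "\n" >>.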

=cut

=head1 BACKGROUND

=head2 Data incorporated in the Observations 

As the goal of this library is to provide an integrated view, we combine diverse sources of evidence.

=head3 SEED core evidence

The core SEED data structures provided by FIG.pm. These are Similarities, BBHs and PCHs.

=head3 Attribute based Evidence

We use the SEED attribute infrastructure to store information computed by a variety of computational procedures.

These include InterPro hits via InterProScan (ipr), NCBI Conserved Domain Database hits via PSSM (cdd),
PFAM hits via HMM (pfam), SignalP results (signalp), and various others.
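
As a minimal sketch of how one such attribute might be unpacked: the snippet below splits a composite key of the form TOOL::ACCESSION and a value of the form EVALUE;FROM-TO. Both formats are inferred from the regular expressions used later in this package; the helper name parse_domain_attribute and the sample values are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of this package): split a composite
# attribute key and its value into the fields an observation needs.
# Assumed formats: key "TOOL::ACCESSION", value "EVALUE;FROM-TO".
sub parse_domain_attribute {
    my ($key, $value) = @_;

    # Return nothing if either part does not have the assumed shape.
    my ($class, $acc) = $key =~ /^([A-Za-z0-9]+)::(.+)$/
        or return;
    my ($evalue, $from, $to) = $value =~ /^([^;]+);(\d+)-(\d+)$/
        or return;

    return {
        class  => $class,
        acc    => $acc,
        evalue => $evalue,
        start  => $from,
        stop   => $to,
    };
}

# Made-up sample data.
my $hit = parse_domain_attribute('PFAM::PF00001', '1e-20;12-250');
print "$hit->{class} $hit->{acc} $hit->{start}-$hit->{stop}\n" if $hit;
```

Returning nothing for malformed input lets the caller skip attributes it does not understand instead of storing half-filled records.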

=head1 METHODS

The public methods this package provides are listed below:

=head3 acc()

A valid accession or remote ID (in the style of a db_xref) or a valid local ID (FID) in case this is supported.

=cut

sub acc {
  my ($self) = @_;

  return $self->{acc};
}

=head3 description()

The description of the hit. Taken from the data or, in some cases (e.g. IPR or PFAM), from our Ontology database.

B<Please note:>
Either remoteid or description is required.

=cut

sub description {
  my ($self) = @_;

  return $self->{description};
}

=head3 class()

The class of evidence (required). This is usually simply the name of the tool or the name of the SEED data structure.
B<Please note> that the class is connected to both display_method() and url().

Current valid classes are:

=over 9

=item SIM (seq)

=item BBH (seq)

=item PCH (fc)

=item FIGFAM (seq)

=item IPR (dom)

=item CDD (dom)

=item PFAM (dom)

=item SIGNALP (dom)

=item CELLO (loc)

=item TMHMM (loc)

=item HMMTOP (loc)

=back

=cut

sub class {
  my ($self) = @_;

  return $self->{class};
}

=head3 type()

The type of evidence (required).

Where type is one of the following:

=over 8

=item seq=Sequence similarity

=item dom=domain based match

=item loc=Localization of the feature

=item fc=Functional coupling.

=back
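
The relation between class and type listed above can be sketched as a simple lookup table. This is an illustrative sketch only: the table is copied from the two lists in this documentation, while the names %TYPE_OF_CLASS and type_for_class are hypothetical and do not exist in this package.

```perl
use strict;
use warnings;

# Class-to-type table, copied from the class() and type() documentation.
my %TYPE_OF_CLASS = (
    SIM     => 'seq',
    BBH     => 'seq',
    FIGFAM  => 'seq',
    PCH     => 'fc',
    IPR     => 'dom',
    CDD     => 'dom',
    PFAM    => 'dom',
    SIGNALP => 'dom',
    CELLO   => 'loc',
    TMHMM   => 'loc',
    HMMTOP  => 'loc',
);

# Hypothetical helper: return the type for a class, or undef if the
# class is not in the table. The lookup is case-insensitive.
sub type_for_class {
    my ($class) = @_;
    return $TYPE_OF_CLASS{ uc $class };
}

print type_for_class('PFAM'), "\n";    # prints "dom"
```

Keeping the mapping in one table means a new class of evidence only needs one new entry rather than another branch in the code.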

=cut

sub type {
  my ($self) = @_;

  return $self->{type};
}

=head3 start()

Start of hit in query sequence.

=cut

sub start {
  my ($self) = @_;

  return $self->{start};
}

=head3 stop()

End of the hit in query sequence.

=cut

sub stop {
  my ($self) = @_;

  return $self->{stop};
}

=head3 evalue()

E-value or P-Value if present.

=cut

sub evalue {
  my ($self) = @_;

  return $self->{evalue};
}

=head3 score()

Score if present. 

B<Please note:>
Either score or evalue is required.

=cut

sub score {
  my ($self) = @_;

  return $self->{score};
}


=head3 display_method()

If available, use the function specified here to display the "raw" observation.
In the case of a BLAST alignment of fid1 and fid2, a CGI script
will be called to display the results of running the command "bl2seq fid1 fid2".

B<Please note> that the URL linked to in display_method() is an external component and needs to be added to the code for every class of evidence.

=cut 

sub display_method {
  my ($self) = @_;
  
  # add code here

  return $self->{display_method};
}

=head3 rank()

Returns an integer from 1 to 10 indicating the importance of this observation.

Currently always returns 1.

=cut

sub rank {
  my ($self) = @_;

#  return $self->{rank};

  return 1;
}

=head3 supports_annotation()

Does this observation support the annotation of its feature?

Returns

=over 3

=item 10, if feature annotation is identical to $self->description

=item 1, Feature annotation is similar to $self->annotation; this is computed using FIG::SameFunc() 

=item undef

=back 
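
Since the method body is not implemented yet, the intended return values can be sketched as follows. This is only an illustration under stated assumptions: annotation_support is a hypothetical helper, and the comparison routine is passed in as a code reference that stands in for a function-comparison call; it is not the SEED API.

```perl
use strict;
use warnings;

# Hypothetical scoring helper following the three documented outcomes:
# 10 for an identical annotation, 1 for a similar one, undef otherwise.
# $is_similar is a code reference supplied by the caller (an assumption).
sub annotation_support {
    my ($annotation, $description, $is_similar) = @_;

    return unless defined $annotation && defined $description;
    return 10 if $annotation eq $description;
    return 1  if $is_similar->($annotation, $description);
    return;
}

# Toy comparator for the demonstration: case-insensitive equality.
my $same_ignoring_case = sub { lc($_[0]) eq lc($_[1]) };

my $score = annotation_support('DNA polymerase I', 'dna polymerase i',
                               $same_ignoring_case);
print defined $score ? $score : 'undef', "\n";    # prints "1"
```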

=cut

sub supports_annotation {
  my ($self) = @_;

  # no code here so far

  return $self->{supports_annotation};
}

=head3 url()

URL describing the subject. In case of a BLAST hit against a sequence, this URL will lead to a page displaying the sequence record for the sequence. In case of an HMM hit, the URL will lead to the description of the matched model.

=cut

sub url {
  my ($self) = @_;

  # get_url() reads the object itself, so call it as a method
  my $url = $self->get_url();

  return $url;
}

=head3 get_objects()

This is the B<REAL WORKHORSE> method of this Package.

It will probably have to:

=over 5

=item get all sims for the feature

=item get all bbhs for the feature

=item copy information from sim to bbh (bbhs have no match location etc.)

=item get pchs (difficult)

=item get attributes (there is code for this in get_attribute_based_observations())

=back

get_attribute_based_observations() fills an array of arrays of hashes like this:

  my $datasets =
     [
       [ { name => 'acc', value => '1234' },
         { name => 'from', value => '4' },
         { name => 'to', value => '400' },
         ....
       ],
       [ { name => 'acc', value => '456' },
         { name => 'from', value => '1' },
         { name => 'to', value => '100' },
         ....
       ],
       ...
     ];

It will invoke the required calls to the SEED API to retrieve the information required.
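
As a rough sketch of the final step, the array of arrays of name/value hashes can be flattened into one record per dataset. The sample data is made up, the helper name datasets_to_records is hypothetical, and the attribute names start and stop follow the accessors of this package rather than the from/to names in the illustration above.

```perl
use strict;
use warnings;

# Made-up datasets in the documented shape: each dataset is an array of
# { name => ..., value => ... } hashes.
my $datasets = [
    [ { name => 'acc',   value => '1234' },
      { name => 'start', value => 4 },
      { name => 'stop',  value => 400 } ],
    [ { name => 'acc',   value => '456' },
      { name => 'start', value => 1 },
      { name => 'stop',  value => 100 } ],
];

# Hypothetical helper: turn every dataset into a plain hash keyed by
# attribute name and return a reference to the list of hashes.
sub datasets_to_records {
    my ($sets) = @_;
    my @records;
    for my $set (@$sets) {
        my %record;
        $record{ $_->{name} } = $_->{value} for @$set;
        push @records, \%record;
    }
    return \@records;
}

my $records = datasets_to_records($datasets);
printf "%s: %d-%d\n", @{$_}{qw(acc start stop)} for @$records;
```

get_objects() does the same copy, but into a blessed Observation object so that the accessor methods can be used on the result.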

=cut

sub get_objects {
  my ($self,$fid) = @_;

  my $objects = [];
  my @matched_datasets=();

  # call function that fetches attribute based observations
  # returns an array of arrays of hashes
  # 
  get_attribute_based_observations($fid,\@matched_datasets);

  # read sims
  get_sims_observations($fid,\@matched_datasets);

  # read identical proteins list of sequences
  get_identical_proteins($fid,\@matched_datasets);  

  # read sims + bbh (enrich BBHs with sims coordinates etc.)
  # read pchs
  # read figfam match data from 48hr directory (BobO knows how to do this!)
  # what sources of evidence did I miss?

  foreach my $dataset (@matched_datasets) {
    my $object = $self->new();
    foreach my $attribute (@$dataset) {
      $object->{$attribute->{'name'}} = $attribute->{'value'};
    }
    # remember which feature this observation belongs to
    $object->{feature_id} = $fid;
    push (@$objects, $object);
  }

  
  return $objects;
}

=head1 Internal Methods 

These methods are not meant to be used outside of this package. 

B<Please do not use them outside of this package!>

=cut


=head3 get_url (internal)

get_url() returns a valid URL or undef for any observation.

URLs are constructed by looking at the accession acc() and name().

Information from both attributes is combined with a table of base URLs stored in this function.
At present the table is commented out and the function always returns undef.
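
A minimal sketch of the intended lookup, assuming the table is keyed by the observation class: the base URLs are copied from the commented-out table in this function (they date from 2007 and may no longer resolve), while the helper name build_url and the use of upper-case class names as keys are assumptions.

```perl
use strict;
use warnings;

# Base URLs copied from the commented-out table in get_url().
my %BASE_URL = (
    PFAM => 'http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?',
    IPR  => 'http://www.ebi.ac.uk/interpro/DisplayIproEntry?ac=',
    CDD  => 'http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=',
    SIM  => 'http://www.theseed.org/linkin.cgi?id=',
    BBH  => 'http://www.theseed.org/linkin.cgi?id=',
);

# Hypothetical helper: append the accession to the base URL of the
# class, or return undef when the class or the accession is missing.
sub build_url {
    my ($class, $acc) = @_;
    return unless defined $class && defined $acc && length $acc;

    my $base = $BASE_URL{$class};
    return unless defined $base;

    return $base . $acc;
}

print build_url('PFAM', 'PF00001'), "\n";
```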

=cut

sub get_url {

 my ($self) = @_;
 my $url='';

# a hash with a URL for each observation; identified by name() 
#my $URL             => { 'PFAM' => "http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?" ,\
#                       'IPR'    => "http://www.ebi.ac.uk/interpro/DisplayIproEntry?ac=" ,\
#                          'CDD' => "http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=",\
#                       'PIR'    => "http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=",\
#                       'FIGFAM' => '',\
#	                   'sim'=> "http://www.theseed.org/linkin.cgi?id=",\
#			   'bbh'=> "http://www.theseed.org/linkin.cgi?id="
#};

# if (defined $URL{$self->name}) {
#     $url = $URL{$self->name}.$self->acc;
#     return $url;
# }
# else 
     return undef;
}

=head3 get_display_method (internal)

get_display_method() returns a valid URL or undef for any observation.

URLs are constructed by looking at the accession acc() and name();
information from both attributes is combined with a table of base URLs stored in this function.
At present the table is commented out and the function always returns undef.
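
The URL layout in the commented-out code (base URL, then the feature ID, then "&id2=" and the accession) can be sketched like this. The helper name build_alignment_url and the restriction to the SIM and BBH classes are assumptions; the base URL is the one from the commented-out table.

```perl
use strict;
use warnings;

# Base URL copied from the commented-out table in get_display_method().
my $FEATALIGN = 'http://www.theseed.org/featalign.cgi?id1=';

# Hypothetical helper: only similarity-style classes get an alignment
# link; everything else yields undef.
sub build_alignment_url {
    my ($class, $feature_id, $acc) = @_;
    return unless defined $class && $class =~ /^(?:SIM|BBH)$/i;
    return unless defined $feature_id && defined $acc;

    return $FEATALIGN . $feature_id . '&id2=' . $acc;
}

print build_alignment_url('SIM', 'fig|83333.1.peg.3', 'fig|83333.1.peg.4'), "\n";
```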

=cut

sub get_display_method {

 my ($self) = @_;

# a hash with a URL for each observation; identified by name() 
#my $URL             => { 'sim'=> "http://www.theseed.org/featalign.cgi?id1=",\
#	                 'bbh'=> "http://www.theseed.org/featalign.cgi?id1="
# };

#if (defined $URL{$self->name}) {
#     $url = $URL{$self->name}.$self->feature_id."&id2=".$self->acc;
#     return $url;
# }
# else 
     return undef;
}

=head3 get_attribute_based_observations (internal)

This method retrieves evidence from the attribute server.

=cut

sub get_attribute_based_observations{

    # we read a FIG ID and a reference to an array (of arrays of hashes, see above)
    my ($fid,$datasets_ref) = (@_);

    my $_myfig = new FIG;
    
    foreach my $attr_ref ($_myfig->get_attributes($fid)) {

        # convert the ref into a string for easier handling
        my ($string) = "@$attr_ref";

#	print "S:$string\n";
        my ($key,$val) = ( $string =~ /\S+\s(\S+)\s(\S+)/);

        # TODO (FM->TD): THIS SHOULD BE DONE ANOTHER WAY
        # we need to do the right thing for each type, i.e. no evalue and no coordinates for CELLO, but a score, etc.
        # as far as possible this should be configured so that the type of observation and the regexp are
        # stored somewhere for easy expansion
        #

        if (($key =~ /PFAM::/) || ( $key =~ /IPR::/) || ( $key =~ /CDD::/) ) {

            # some keys are composite, e.g. CDD::1233244 or PFAM::PF1233

            if ( $key =~ /::/ ) {
                my ($firstkey,$restkey) = ( $key =~ /([a-zA-Z0-9]+)::(.*)/);
                $val=$restkey.";".$val;
                $key=$firstkey;
            }

            my ($acc,$raw_evalue, $from,$to) = ($val =~ /(\S+);(\S+);(\d+)-(\d+)/ );

            # default for tools that do not give us an evalue
            my $evalue = 255;
            if (defined $raw_evalue) {
                # FIXME: the raw evalue is not decoded yet. The earlier attempt
                # (split on /(\d+).(\d+)/, then mantissa = $k / 100 and
                # exponent = 1000 + $expo) DID NOT WORK PROPERLY, so a fixed
                # placeholder is stored instead.
                $evalue = "0.01";
            }

            # unroll it all into an array of hashes
            # this needs to be done differently for different types of observations
            my $dataset = [ { name => 'class', value => $key },
                            { name => 'acc' , value => $acc},
                            { name => 'type', value => "dom"} , # this clearly needs to be done properly FM->TD
			    { name => 'evalue', value => $evalue },
                            { name => 'start', value => $from},
                            { name => 'stop' , value => $to}
                            ];

            push (@{$datasets_ref} ,$dataset);
        }
    }
}

=head3 get_sims_observations() (internal)

This method retrieves sims and fills the internal data structures.
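
A minimal sketch of the per-hit conversion, using a plain array in place of a similarity row. The row is made up and follows the BLAST tabular column order noted in the commented-out get_sims_and_bbhs() code; the helper name sim_to_dataset is hypothetical. If real rows follow that order, indices 8 and 9 are the coordinates on the subject sequence.

```perl
use strict;
use warnings;

# A made-up row in BLAST tabular (m8) column order:
# id1, id2, %ident, align len, mismatches, gaps,
# q.start, q.stop, s.start, s.stop, evalue, bit score
my @row = ('fig|83333.1.peg.3', 'fig|99287.1.peg.5', 98.5, 310, 4, 0,
           1, 310, 3, 312, '1e-150', 580);

# Hypothetical helper: picks the same fields, by the same indices,
# as get_sims_observations() does.
sub sim_to_dataset {
    my ($sim) = @_;
    return [
        { name => 'class',  value => 'SIM' },
        { name => 'acc',    value => $sim->[1] },
        { name => 'type',   value => 'seq' },
        { name => 'evalue', value => $sim->[10] },
        { name => 'start',  value => $sim->[8] },
        { name => 'stop',   value => $sim->[9] },
    ];
}

print join(', ', map { "$_->{name}=$_->{value}" } @{ sim_to_dataset(\@row) }), "\n";
```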

=cut

sub get_sims_observations{

    my ($fid,$datasets_ref) = (@_);
    my $fig = new FIG;
    my @sims= $fig->nsims($fid,100,1e-20,"fig");
    my ($dataset);
    foreach my $sim (@sims){
	my $hit = $sim->[1];
	my $evalue = $sim->[10];
	my $from = $sim->[8];
	my $to = $sim->[9];
	$dataset = [ { name => 'class', value => "SIM" },
			{ name => 'acc' , value => $hit},
			{ name => 'type', value => "seq"} ,
			{ name => 'evalue', value => $evalue },
			{ name => 'start', value => $from},
			{ name => 'stop' , value => $to}
			];
    push (@{$datasets_ref} ,$dataset);
    }
}

=head3 get_identical_proteins() (internal)

This method retrieves the list of identical proteins and fills the internal data structures.
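
The mapping from ID prefix to source database can be sketched as an ordered table of patterns. The patterns and names are taken from the code of this method; the table name @SOURCE_OF_PREFIX, the helper database_for_id and the sample ID are assumptions for illustration.

```perl
use strict;
use warnings;

# Prefix patterns and database names as used in get_identical_proteins().
# The first matching pattern wins.
my @SOURCE_OF_PREFIX = (
    [ qr/^fig\|/     => 'FIG' ],
    [ qr/^gi\|/      => 'NCBI' ],
    [ qr/^[NXYZA]P_/ => 'RefSeq' ],
    [ qr/^sp\|/      => 'SwissProt' ],
    [ qr/^uni\|/     => 'UniProt' ],
    [ qr/^tigr\|/    => 'TIGR' ],
    [ qr/^pir\|/     => 'PIR' ],
    [ qr/^kegg\|/    => 'KEGG' ],
    [ qr/^tr\|/      => 'TrEMBL' ],
    [ qr/^eric\|/    => 'ASAP' ],
);

# Hypothetical helper: return the database name for an ID, or undef if
# no prefix matches.
sub database_for_id {
    my ($id) = @_;
    for my $pair (@SOURCE_OF_PREFIX) {
        my ($pattern, $name) = @$pair;
        return $name if $id =~ $pattern;
    }
    return;
}

print database_for_id('sp|P12345'), "\n";    # prints "SwissProt"
```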

=cut

sub get_identical_proteins{

    my ($fid,$datasets_ref) = (@_);
    my $fig = new FIG;
    my @funcs = ();

    my @maps_to = grep { $_ ne $fid and $_ !~ /^xxx/ } map { $_->[0] } $fig->mapped_prot_ids($fid);
    
    foreach my $id (@maps_to) {
        my ($tmp, $who);
        # look up the function of the identical protein, not of the query feature
        if (($id ne $fid) && ($tmp = $fig->function_of($id))) {
            if ($id =~ /^fig\|/)           { $who = "FIG" }
            elsif ($id =~ /^gi\|/)            { $who = "NCBI" }
            elsif ($id =~ /^[NXYZA]P_/)       { $who = "RefSeq" }
            elsif ($id =~ /^sp\|/)            { $who = "SwissProt" }
            elsif ($id =~ /^uni\|/)           { $who = "UniProt" }
            elsif ($id =~ /^tigr\|/)          { $who = "TIGR" }
            elsif ($id =~ /^pir\|/)           { $who = "PIR" }
            elsif ($id =~ /^kegg\|/)          { $who = "KEGG" }
            elsif ($id =~ /^tr\|/)            { $who = "TrEMBL" }
            elsif ($id =~ /^eric\|/)          { $who = "ASAP" }

            push(@funcs, [$id,$who,$tmp]);
        }
    }

    my ($dataset);
    foreach my $row (@funcs){
        my $id = $row->[0];
        # organism of the identical protein, not of the query feature
        my $organism = $fig->org_of($id);
        my $who = $row->[1];
        my $assignment = $row->[2];
        $dataset = [ { name => 'class', value => "IDENTICAL" },
		     { name => 'id' , value => $id},
		     { name => 'organism', value => "$organism"} ,
		     { name => 'database', value => $who },
		     { name => 'description' , value => $assignment}
		     ];
        push (@{$datasets_ref} ,$dataset);
    }

}


=head3 get_sims_and_bbhs() (internal)

This method retrieves sims and also BBHs and fills the internal data structures. It is currently commented out.

=cut

#     sub get_sims_and_bbhs{

# 	# blast m8 output format
# 	# id1, id2, %ident, align len, mismatches, gaps, q.start, q.stop, s. start, s.stop, eval, bit
	
# 	my $Sims=();
# 	@sims_src = $fig->sims($fid,80,500,"fig",0);
# 	print "found $#sims_src SIMs\n";
# 	foreach $sims (@sims_src) {
# 	    my ($sims_string) = "@$sims";
# #       print "$sims_string\n";
# 	    my ($rfid,$start,$stop,$eval) = ( $sims_string =~ /\S+\s+(\S+)\s+\S+\s\S+\s+(\S+)\s+(\S+)\s+
# 					      \S+\s+\S+\s+\S+\s+\S+\s+(\S+)+.*/);
# #       print "ID: $rfid, E:$eval, Start:$start stop:$stop\n";
# 	    $Sims{$rfid}{'eval'}=$eval;
# 	    $Sims{$rfid}{'start'}=$start;
# 	    $Sims{$rfid}{'stop'}=$stop;
# 	    print "$rfid $Sims{$rfid}{'eval'}\n";
# 	}
	
# 	# BBHs
# 	my $BBHs=();
	
# 	@bbhs_src = $fig->bbhs($fid,1.0e-10);
# 	print "found $#bbhs_src BBHs\n";
# 	foreach $bbh (@bbhs_src) {
# 	    #print "@$bbh\n";
# 	    my ($bbh_string) = "@$bbh";
# 	    my ($rfid,$eval,$score) = ( $bbh_string =~ /(\S+)\s(\S+)\s(\S+)/);
# 	    #print "ID: $rfid, E:$eval, S:$score\n";
# 	    $BBHs{$rfid}{'eval'}=$eval;
# 	    $BBHs{$rfid}{'score'}=$score;
# #print "$rfid $BBHs{$rfid}{'eval'}\n";
# 	}

#     }



=head3 new (internal)

Instantiate a new object.

=cut

sub new {
  my ($self) = @_;

  $self = { acc => '',
	    description => '',
	    class => '',
	    type => '',
	    start => '',
	    stop => '',
	    evalue => '',
	    score => '',
	    display_method => '',
	    feature_id => '',
	    rank => '',
	    supports_annotation => '',
	    id => '',
            organism => '',
            database => ''
	  };
  
  bless($self, 'Observation');
  
  return $self;
}

=head3 feature_id (internal)

Returns the ID of the feature these Observations belong to.

=cut

sub feature_id {
  my ($self) = @_;

  return $self->{feature_id};
}

=head3 id (internal)

Returns the ID of the identical sequence.

=cut

sub id {
    my ($self) = @_;

    return $self->{id};
}

=head3 organism (internal)

Returns the organism of the identical sequence.

=cut

sub organism {
    my ($self) = @_;

    return $self->{organism};
}

=head3 database (internal)

Returns the database of the identical sequence

=cut

sub database {
    my ($self) = @_;

    return $self->{database};
}

