Attributes

Updated July 11th, 2005. Rob Edwards

I have added attributes to the database in a more significant way. This page is to document those attributes and ways to access/modify them. The page has two sections, a non-technical section for general discussion and overview, and a technical section for behind-the-scenes type information.

Most people should read the first section and ignore the second section.

A comment on nomenclature: I use the terms key/value pair and attribute interchangeably. Something can have an attribute, and you have to say what that attribute is and what its value is. We also have the idea that an attribute can be a URL, and if so, it should be presented as a URL. So each attribute actually has three elements: key, value, and a second value. You will see this in action. The third element is called URL, but we always check that it begins with http before turning it into a link, so we reserve the option of renaming it and using it for something else. Any such change will be mentioned here.

Overview

We have extended the notion of key/value pairs beyond things associated with a peg and into the arena of anything. Any feature such as peg, prophage, rna, insertion element, and so on, can have a key/value pair associated with it. In addition, genomes have key/value pairs associated with them. In this sense, we can annotate the organisms from which the genomes were derived and begin to ask complex questions of the type "show me all organisms that are motile but don't have any flagellar genes". We are working on this interface.

The key/value pairs are designed to be "lightweight" objects ideal for data mining rather than the rich, complex objects associated with annotations. If you are curating individual proteins you should probably do that using the annotation links on protein.cgi since those allow tracking of who does what, and when. In contrast the key/value pairs will likely be loaded in batch from the command line without regard for overwriting other values!

Try the following exercises to see key/value pairs in action:

Definitions

This section defines attributes in the SEED and describes the locations and implementations of the files and directories used to store and retrieve them.
  1. Attributes have the following four fields

  2. File Locations
  3. Scripts for working with attributes
     Here are a few common scripts that you may want to use:
    1. load_attributes
       This will delete the current attributes database, look through all the potential places that attributes are stored, and add those attributes to the database. Both genome-specific and global attributes will be added. Finally, each transaction_log is processed and its data added back into the database. This is used to add new data to a database, and to rebuild an existing database.

    2. gather_attributes
       Attributes are stored in disparate locations (global, genome, etc.), and this script will look through all of them and print out any attributes that are found. gather_attributes can take an optional -d on the command line, and will then "delete" any attributes file that it finds. It doesn't actually delete the file; rather, it moves it to FIG_Config::temp/Attributes/deleted_attributes, and you can delete it from there.

    3. distribute_attributes
       This script will take any attributes on STDIN and write them to their appropriate locations.

       Recommended: run these two commands together, first running gather_attributes to collate the information and delete it:
       $ gather_attributes -d > gathered_attributes.txt

       And then run the distribute command:

       $ sort -u gathered_attributes.txt | distribute_attributes

       This will recreate the attributes files, and overcome any potential problems of writing files that are being moved.

    4. dump_attributes
       Dumps the current value of each attribute from the database, so these have all the changes in transaction_log already enacted.

Methods for accessing attributes

The attribute methods have been rewritten to handle all kinds of attributes. The key/value pairs can be associated with a feature like a peg, rna, or prophage, or with a genome.

There are several base attribute methods:

 get_attributes
 add_attribute
 delete_attribute
 change_attribute

There are also methods for more complex things:

 get_keys
 get_values
 guess_value_format

By default all keys are case sensitive, and all keys have leading and trailing white space removed.

Attributes are not a 1:1 mapping: a single key can have several values.
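
The two rules above (keys are trimmed but case sensitive, and one key may hold several values) can be illustrated with a small in-memory sketch. This is not the SEED implementation: the %store hash, the clean_key and add_value helpers, and the feature id and values are all invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical in-memory store: fid => key => [values].
# This only mimics the documented behaviour; it is not how the SEED stores data.
my %store;

# Strip leading and trailing white space; case is left untouched.
sub clean_key {
    my ($key) = @_;
    $key =~ s/^\s+//;
    $key =~ s/\s+$//;
    return $key;
}

# Append a value under the cleaned key, so one key can hold several values.
sub add_value {
    my ($fid, $key, $value) = @_;
    push @{ $store{$fid}{ clean_key($key) } }, $value;
    return;
}

my $fid = 'fig|83333.1.peg.1';                # invented feature id
add_value($fid, ' PIRSF ', 'PIRSF000001');    # white space is stripped
add_value($fid, 'PIRSF',   'PIRSF000002');    # second value for the same key
add_value($fid, 'pirsf',   'PIRSF000003');    # different case, different key

for my $key (sort keys %{ $store{$fid} }) {
    my $values = $store{$fid}{$key};
    print "$key: @$values\n";
}
```

Here ' PIRSF ' and 'PIRSF' end up as the same key with two values, while 'pirsf' is kept separate.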

get_attributes

get_attributes takes up to four arguments, at least one of which must be given: fid (which can be a genome, peg, rna, or other id), key, value, and url.

It will find any attribute that has the characteristics you request, and returns each match as a four-tuple of [fid, key, value, url].

You can request the E. coli attributes like this: $fig->get_attributes('83333.1');

You can request any PIRSF key like this: $fig->get_attributes('', 'PIRSF');

You can request any google url like this: $fig->get_attributes('', '', '', 'http://www.google.com');

NOTE: If there are no attributes an empty array will be returned. You need to check for this and not assume that it will be undef.
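
As a rough sketch of how a caller might handle the result, the code below formats a list of [fid, key, value, url] tuples and treats the empty list explicitly. The @attributes data is invented and stands in for a real call such as $fig->get_attributes('', 'PIRSF'); the helper names format_attribute and report are hypothetical. The "begins with http" test mirrors the check described in the nomenclature note.

```perl
use strict;
use warnings;

# Hypothetical stand-in for: my @attributes = $fig->get_attributes('', 'PIRSF');
# The feature ids, values, and URL below are invented for illustration.
my @attributes = (
    ['fig|83333.1.peg.1', 'PIRSF', 'PIRSF000001', 'http://example.org/PIRSF000001'],
    ['fig|83333.1.peg.2', 'PIRSF', 'PIRSF000002', ''],
);

# Format one [fid, key, value, url] tuple. The fourth element is only shown
# as a link when it begins with "http".
sub format_attribute {
    my ($tuple) = @_;
    my ($fid, $key, $value, $url) = @$tuple;
    my $line = "$fid\t$key\t$value";
    $line .= "\t<$url>" if defined $url && $url =~ /^http/;
    return $line;
}

# Turn a result list into printable lines, checking for the empty list
# rather than for undef.
sub report {
    my @tuples = @_;
    return ('no attributes found') unless @tuples;
    return map { format_attribute($_) } @tuples;
}

print "$_\n" for report(@attributes);
print "$_\n" for report();
```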

add_attribute

Add a new key/value pair to something. Something can be a genome id, a peg, an rna, prophage, whatever.

Arguments:

        feature id, this can be a peg, genome, etc,
        key name. This is case sensitive and has the leading and trailing white space removed
        value
        optional URL to add
        optional file to store the attributes in.

A note on file names. At the moment the file assigned_attributes is used to store new attributes by default, and load_attributes loads that file last so any changes will overwrite existing keys. However, this is not quite true, since we can now have multiple key/values for a single peg. Using this method you can define a filename to store the attributes in. The directory structure will be figured out for you, so you can use something like "pirsf" as the file name.

delete_attribute

Remove a key from a feature.

 Arguments:
        feature id, this can be a peg, genome, etc,
        key name to delete
 Deleted attributes are stored in global/deleted_attributes

change_attribute

 Change the value of a key/value pair (and optionally its url).
 Arguments:
        feature id, this can be a peg, genome, etc,
        key name whose value to replace
        value to replace it with
        optional URL to add
        optional file to store the changes in.

See the note in add_attribute about files. You should almost always omit the file argument so that the default (assigned_attributes) is used, as it is loaded last. However, this allows you to change the file if you wish.

Returns 0 on error and 1 on success.

erase_attribute_entirely

This method will remove any notion of the attribute that you give it. It is different from delete as that just removes a single attribute associated with a peg. This will remove the files and uninstall the attributes from the database so there is no memory of that type of attribute. All of the attribute files are moved to FIG_Tmp/Attributes/deleted_attributes, and so you can recover the data for a while. Still, you should probably use this carefully!

I use this to clean out old PIR superfamily attributes immediately before installing the new correspondence table.

e.g. my $status=$fig->erase_attribute_entirely("pirsf");

This will return the number of files that were moved to the new location.

get_keys

Get all the keys that we know about.

Without any arguments:

Returns a reference to a hash, where the key is the type of feature (peg, genome, rna, prophage, etc.), and the value is a reference to a hash where the key is the key name and the value is a reference to an array of all features with that key.

e.g.

        print "There are ", scalar @{$fig->get_keys->{'peg'}->{'PIRSF'}}, " PIRSF keys in the database\n";

        my $keys = $fig->get_keys;
        foreach my $type (keys %$keys) {
            foreach my $label (keys %{$keys->{$type}}) {
                foreach my $peg (@{$keys->{$type}->{$label}}) {
                    # do something to each peg and genome here
                }
            }
        }

With an argument (that should be a recognized type like peg, rna, genome, etc):

Returns a reference to a hash where the key is the key name and the value is a reference to an array of features. This should use less memory than the form above. The argument should (currently) be peg, rna, pp, genome, or any other recognized feature type (generally defined as the .peg. in the fid). The default is to return all keys, and this can also be requested explicitly with the argument all.
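
As a rough illustration of the single-argument form, the sketch below walks a hash of the shape described above (key name => reference to an array of feature ids) and counts the features per key. The data is invented and stands in for the result of $fig->get_keys('peg'); count_features_per_key is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical stand-in for: my $keys = $fig->get_keys('peg');
# Shape assumed from the description: key name => [feature ids].
my $keys = {
    PIRSF => ['fig|83333.1.peg.1', 'fig|83333.1.peg.2'],
    color => ['fig|83333.1.peg.2'],
};

# Count how many features carry each key.
sub count_features_per_key {
    my ($keys) = @_;
    my %count;
    $count{$_} = scalar @{ $keys->{$_} } for keys %$keys;
    return \%count;
}

my $count = count_features_per_key($keys);
printf "%-10s %d\n", $_, $count->{$_} for sort keys %$count;
```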

get_values

Get all the values that we know about

Without any arguments:

Returns a reference to a hash, where the key is the type of feature (peg, genome, rna, prophage, etc.), and the value is a reference to a hash where the key is the value and the value is the number of occurrences.

e.g. print "There are ", $fig->get_values->{'peg'}->{'100'}, " keys with the value 100 in the database\n";

With a single argument:

The argument is assumed to be the type (rna, peg, genome, etc).

With two arguments:

The first argument is the type (rna, peg, genome, etc), and the second argument is the key.

In each case it will return a reference to a hash.

E.g.

        $fig->get_values(); # will get all values
        $fig->get_values('peg'); # will get all values for pegs
        $fig->get_values('peg', 'pirsf'); # will get all values for pegs with attribute pirsf
        $fig->get_values(undef, 'pirsf'); # will get all values for anything with that attribute
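
A small sketch of working with the returned hash follows. It assumes that the two-argument form returns a hash of value => number of occurrences, which is an inference from the description of the no-argument form rather than something stated outright. The data is invented, and by_frequency is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical stand-in for: my $values = $fig->get_values('peg', 'pirsf');
# Shape assumed: value => number of occurrences.
my $values = {
    'PIRSF000001' => 12,
    'PIRSF000002' => 3,
    'PIRSF000003' => 7,
};

# Return the values ordered from most to least frequent (ties broken by name).
sub by_frequency {
    my ($values) = @_;
    my @ordered = sort { $values->{$b} <=> $values->{$a} || $a cmp $b } keys %$values;
    return @ordered;
}

for my $value (by_frequency($values)) {
    print "$value\t$values->{$value}\n";
}
```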

key_info

Access a hash of key information. The data that are returned are:

hash key name   what is it                                                   data type
single          Whether the attribute can handle only a single data point    [boolean]
description     Explanation of key                                           [free text]
readonly        Whether to allow read/write                                  [boolean]
is_cv           Attribute is a cv term                                       [boolean]

Single is a boolean; if it is true, only the last value returned should be used. Note that the other methods will still return all the values, and it is up to the implementer to ensure that only the last value is used.

Description is a user-supplied explanation of the key and can be free text.

If a reference to a hash is provided along with the key, those values will be written to the attribute_keys file.

Returns an empty hash if the key is not provided or doesn't exist.

e.g.
$fig->key_info($key, \%data); # set the data
$data=$fig->key_info($key); # get the data
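
Because the other methods still return every value for a single-valued key, the caller has to apply the rule. A minimal sketch of doing so is shown below. The $info hash stands in for the result of $fig->key_info('motile'), the values are invented, and effective_values is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical stand-in for: my $info = $fig->key_info('motile');
my $info = {
    single      => 1,
    description => 'Whether the organism is motile',
    readonly    => 0,
    is_cv       => 0,
};

# Hypothetical values, in the order the attribute methods returned them.
my @values = ('motile', 'non-motile');

# Apply the "single" rule: keep only the last value when the key is
# single-valued; otherwise pass all values through unchanged.
sub effective_values {
    my ($info, @values) = @_;
    return @values unless $info->{single} && @values;
    return ($values[-1]);
}

print join(', ', effective_values($info, @values)), "\n";
```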

get_key_value

Given a key and a value, it will return anything that has both.

E.g.

        my @nonmotile_genomes = $fig->get_key_value('motile', 'non-motile');
        my @bluepegs          = $fig->get_key_value('color', 'blue');

If either the key or the value is omitted, it will return all the matching sets.

guess_value_format

There are occasions where I want to know what kind of value a key has. I have three scenarios right now:

 1. strings
 2. numbers
 3. percentiles (a type of number, I know)

In these cases, I may want to know something about them and do something interesting with them. This will try to guess what the values are for a given key so that you can try to limit what people add. At the moment this is pure guesswork; I suppose we could put some restrictions on t/v pairs, but I don't feel like it.

This method will return a reference to an array. If the value is a string there will only be one element in that array, the word "string". If the value is a number, there will be three elements: the word "float" in position 0, and then the minimum and maximum values. You can figure out if it is a percent :-)
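
One way a caller might interpret the returned array is sketched below. The layout of the array follows the description above; the rule that a range within 0 to 100 is reported as a percent is only an assumption of this sketch, and describe_format is a hypothetical helper. The sample arrays stand in for results of $fig->guess_value_format($key).

```perl
use strict;
use warnings;

# Describe a guess_value_format result: ['string'] or ['float', $min, $max].
# Treating a 0..100 range as a percent is an assumption of this sketch,
# not something the method reports itself.
sub describe_format {
    my ($format) = @_;
    my ($type, $min, $max) = @$format;
    return 'string' if $type eq 'string';
    return "percent ($min to $max)" if $min >= 0 && $max <= 100;
    return "number ($min to $max)";
}

# Hypothetical stand-ins for: my $format = $fig->guess_value_format($key);
print describe_format(['string']), "\n";
print describe_format(['float', 0, 98.5]), "\n";
print describe_format(['float', -3, 4500]), "\n";
```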

attribute_location

This is just an internal method to find the appropriate location of the attributes file depending on whether it is a peg, an rna, or a genome or whatever.