Revision 1.6
Tue Jul 19 04:54:22 2005 UTC (14 years, 8 months ago) by redwards
Branch: MAIN
Changes since 1.5: +17 -1 lines
Updating attributes code in seed

<h1 style="text-align: center">Attributes</h1>

<h2 style="text-align: center">Updated July 11th, 2005. Rob Edwards</h2>

		<h3 style="text-align: center">Contents</h3>
		        <li><a href="#overview">Overview</a></li>
			<li><a href="#definitions">Definitions</a></li>
			<li><a href="#filelocations">File Locations</a></li>
			<li><a href="#scripts">Scripts for working with attributes</a></li>
			<li><a href="#methods">Methods for accessing attributes</a></li>
				<li><a href="#get_attributes">get_attributes</a></li>
				<li><a href="#add_attribute">add_attribute</a></li>
				<li><a href="#delete_attribute">delete_attribute</a></li>
				<li><a href="#change_attribute">change_attribute</a></li>
				<li><a href="#erase_attribute_entirely">erase_attribute_entirely</a></li>
				<li><a href="#get_keys">get_keys</a></li>
				<li><a href="#get_values">get_values</a></li>
				<li><a href="#key_info">key_info</a></li>
				<li><a href="#get_key_value">get_key_value</a></li>
				<li><a href="#guess_value_format">guess_value_format</a></li>
				<li><a href="#attribute_location">attribute_location</a></li>

<p>I have added attributes to the database in a more significant way. This page is to document those attributes and ways to access/modify them. The page has two sections, a non-technical section for general discussion and overview, and a technical section for behind-the-scenes type information.</p>

<p>Most people should read the first section and ignore the second section.</p>

<p>A comment on nomenclature: I use the terms key/value pairs and attributes interchangeably. Something can have an attribute, and you have to say what that attribute is and what its value is. We also have the idea that an attribute can be a URL, and if so, it should be presented as a URL. So we actually have key, value, value. You will see this in action. The third element is called URL, but we always check that it begins with http before turning it into a link, and so we reserve the option of renaming it and making it something else. Any such change will be mentioned here.</p>

<h3><a name="overview">Overview</a></h3>

<p>We have extended the notion of key/value pairs beyond things associated with a peg and into the arena of anything. Any feature such as peg, prophage, rna, insertion element, and so on, can have a key/value pair associated with it. In addition, <em>genomes</em> have key/value pairs associated with them. In this sense, we can annotate the organisms from which the genomes were derived and begin to ask complex questions of the type "show me all organisms that are motile but don't have any flagellar genes". We are working on this interface.</p>

<p>The key/value pairs are designed to be "lightweight" objects ideal for data mining rather than the rich, complex objects associated with annotations. If you are curating individual proteins you should probably do that using the annotation links on <a href="/FIG/protein.cgi">protein.cgi</a> since those allow tracking of who does what, and when. In contrast the key/value pairs will likely be loaded in batch from the command line without regard for overwriting other values!</p>

<p>Try the following exercises to see key/value pairs in action:</p>

<li>Choose an organism from the FIG search page and select statistics to see the list. There is an option at the bottom of the page to edit the key/value pairs, and this will pull up a table where you can enter the information for an organism.</li>

<li>Open the <a href="http://localhost/FIG/subsys.cgi?user=&ssa_name=Flagellum&request=show_ssa&can_alter=">Flagellum subsystem</a>, and scroll to the checkboxes/buttons at the bottom. There are two pulldown lists; from the first one (labeled "color rows by each organism's attribute") choose MOTILE, and click show spreadsheet. The sheet is now highlighted with motile and non-motile organisms that have flagella. This view is also helped by decreasing the text size from the view menu. There is a key at the bottom just above the "show spreadsheet" button so you know which color is which, and in this case there is only motile and non-motile. This key is also an active link that will limit the display of the spreadsheet to just those particular organisms that you have highlighted.</li>

<li>Now choose WIDTH from the same pull down menu, and click show spreadsheet. Because width is a numeric variable, I grouped these key/value pairs in 1/10ths of the maximum. If you look at the Color Descriptions box you will see ranges (this is not perfect at the moment, but it is on the way).</li>

<li>Now reset the WIDTH pull-down menu to empty (the first option in the list), and choose PIRSF from the menu labelled "color columns by each PEGs attribute" and click show spreadsheet. This is the same as before, but hopefully we can add more keys here and color other things.</li>

<li>From one of the PEGs that is colored as having a PIRSF link click on the link to get to the protein page. There is the attributes box (as before), and a new "Edit Attributes" button. When you click this, you will get three fields, key, value, and URL. If you go to a protein that does not have any attributes yet, you still get the edit box to let you add some attributes.</li>
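<p>The grouping of a numeric attribute such as WIDTH into tenths of the maximum, mentioned in the exercises above, could look roughly like the following. This is only a sketch: the function name <code>tenth_bins</code> and the genome names are invented for illustration, values are assumed to be non-negative, and the actual code in subsys.cgi may group differently.</p>

```perl
use strict;
use warnings;
use List::Util qw(max);
use POSIX qw(floor);

# Hypothetical helper: assign each numeric value to one of ten bins (0..9),
# where each bin spans one tenth of the maximum value seen.
sub tenth_bins {
    my (%value_of) = @_;
    return () unless %value_of;
    my $max = max(values %value_of);
    my %bin_of;
    for my $id (keys %value_of) {
        my $bin = $max > 0 ? floor(10 * $value_of{$id} / $max) : 0;
        $bin = 9 if $bin > 9;    # the maximum itself goes into the top bin
        $bin_of{$id} = $bin;
    }
    return %bin_of;
}

# Invented widths for three made-up genomes.
my %bin_of = tenth_bins(genomeA => 0.5, genomeB => 1.0, genomeC => 2.0);
printf "%s is in bin %d\n", $_, $bin_of{$_} for sort keys %bin_of;
```

<p>Each bin can then be given its own color, and the bin boundaries give the ranges shown in the Color Descriptions box.</p>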


<h3><a name="definitions">Definitions</a></h3>

This section defines attributes in the SEED and describes the locations and implementations of the files and directories used to store and retrieve them.

	<li style="font-weight: 700">Attributes have the following four fields</li>
		<li><em>ID</em>. This is usually a gene id or genome id but doesn't <i>have</i> to be.</li>
		<li><em>Key</em>. This is the key. The key should be unique (but doesn't have to be) and we will provide a method through the clearinghouse to allow you to register a key and/or check whether someone else has assigned a key.
			<li>The key does not have to be unique, but this will assist in the exchange of data between machines.</li>
			<li>Keys are case sensitive</li>
			<li>An optional mapping is provided between a key and an explanation of what the key means (see below)</li>
			<li>By default, any key can have multiple values. If a key is to have only one value then a boolean can be set (see below) to limit this behavior</li>
			<li>Keys cannot contain the following characters: space, tab, or newline, or any of @$!#%^&amp;*()`~{}[]|\:;"'&lt;&gt;?,./</li>

		<li><em>Value</em>. The value is free form and there are no limitations on what is contained in the value.</li>
		<li><em>URL</em>. The URL is optional, and not required for any data set.</li>
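<p>The rules for keys listed above (white space trimmed, case preserved, certain characters forbidden) can be checked with a few lines of Perl. This is a minimal sketch, not the code used in FIG.pm; the name <code>clean_key</code> is invented for illustration.</p>

```perl
use strict;
use warnings;

# The punctuation that may not appear in a key, as listed above.
my $FORBIDDEN = '@$!#%^&*()`~{}[]|\\:;"\'<>?,./';

# Hypothetical helper: trim a key and return it, or return undef if the key
# is empty or contains white space or a forbidden character.
sub clean_key {
    my ($key) = @_;
    return undef unless defined $key;
    $key =~ s/^\s+//;    # leading white space is removed
    $key =~ s/\s+$//;    # trailing white space is removed
    return undef if $key eq '';
    return undef if $key =~ /\s/;    # space, tab or newline inside the key
    for my $ch (split //, $FORBIDDEN) {
        return undef if index($key, $ch) >= 0;
    }
    return $key;    # case is left untouched because keys are case sensitive
}

for my $candidate (' PIRSF ', 'my key', 'width/height') {
    my $key = clean_key($candidate);
    print defined $key ? "accepted '$key'\n" : "rejected '$candidate'\n";
}
```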
	<li style="font-weight: 700"><a name="filelocations">File Locations</a></li>
		<li><em>General Attributes</em> Attributes are stored in the following locations:</li>
			<li>$FIG_Config::organisms/xxxxx/Attributes contains the genome and organism attributes</li>
			<li>$FIG_Config::organisms/xxxxx/Features/peg/Attributes contains the attributes for pegs</li>
			<li>$FIG_Config::organisms/xxxxx/Features/rna/Attributes contains the attributes for rnas... etc</li>
			<li>Note that general attributes should not normally be stored in $FIG_Config::global (see below)</li>
		<li>All attributes files can hold comments as long as the line begins with a pound sign. Blank lines are also ignored.
		<li><em>Modified attributes</em></li>
			<li>Modified attributes are stored in the files transaction_log</li>
			<li>There are separate transaction_logs in each of the locations where attributes are stored (e.g. the Features/peg/Attributes, Organism/nnnn.nn/Attributes, and Global/Attributes directories)</li>
			<li>The transaction_log file has the following format:
				<li>Method. This must be one of ADD/CHANGE/DELETE</li>
				<li>Feature ID (e.g. peg, genome, or RNA number)</li>
				<li>Old value</li>
				<li>Old URL</li>
				<li>New value</li>
				<li>New URL</li>
			<li>The old value, old URL, new value, and new URL are optional depending on the method. For example, old value/old URL can be null if the method is add, and new value/new URL can be null if the method is delete.</li>
			<li>If the old value and old URL are omitted and the method is delete, all attributes that match the key will be deleted from the feature</li>
			<li>Metadata associated with a key is stored in $FIG_Config::global/Attributes/attribute_keys</li>
			<li>This file has the following format, with the columns separated by tabs:</li>
			<li>single datum only. A boolean, if set will limit the data associated with the key to a single datum, otherwise the key will be assumed to allow multiple data sets. Note that this is for information only and we will store all the data associated with a key</li>
			<li>Other information about the key (e.g. name of experiment, experimental details, etc).</li>
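<p>As an illustration of the comment and blank-line rules for attributes files described above, here is a small reader. It is a sketch under a loud assumption: this page does not spell out the column layout of an attributes file, so the tab-separated order id, key, value, optional URL used here is a guess, and the data lines are invented.</p>

```perl
use strict;
use warnings;

# Hypothetical reader. ASSUMPTION: each data line is tab-separated as
# id, key, value and an optional url; the real layout may differ.
sub read_attributes {
    my ($fh) = @_;
    my @attributes;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^#/;       # comment lines begin with a pound sign
        next if $line =~ /^\s*$/;    # blank lines are ignored
        my ($id, $key, $value, $url) = split /\t/, $line;
        push @attributes, [$id, $key, $value, $url];
    }
    return @attributes;
}

# Invented example data, read through an in-memory filehandle.
my $text = "# example data, invented for illustration\n"
         . "\n"
         . "fig|83333.1.peg.1\tPIRSF\tPIRSF000001\thttp://example.org/PIRSF000001\n"
         . "83333.1\tMOTILE\tmotile\n";
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
my @attributes = read_attributes($fh);
close $fh;
print scalar(@attributes), " attributes read\n";
```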
	<li style="font-weight: 700"><a name="scripts">Scripts for working with attributes</a></li>
	<li>Here are a few common scripts that you may want to use:
		<p>This will delete the current attributes database, look through all the potential places that attributes are stored and add those attributes into the database. Both genome-specific and global attributes will be added. Finally, each of the transaction_logs are processed and the data added back into the database. This is used to add new data to a database, and to rebuild an existing database.</p>
		<p>Attributes are stored in disparate locations (global, genome, etc) and this will look through all the various locations and print out any attributes that are found. Gather attributes can take an optional -d on the command line, and will "delete" any attributes file that it finds. It doesn't actually delete the file, rather moves it to FIG_Config::temp/Attributes/deleted_attributes, and you can delete it from there.</p>
		<p>This script will take any attributes on STDIN and write them to their appropriate locations.</p>
<p><b>Recommended</b> The recommended way to run these two commands is to first run gather_attributes to collate the information and delete it:</p>
<pre>$ gather_attributes -d &gt; gathered_attributes.txt</pre>
<p>And then to run the distribute command:</p>
<pre>$ sort -u gathered_attributes.txt | distribute_attributes</pre>

<p>This will recreate the attributes files, and overcome any potential problems of writing files that are being moved.</p>
		<p>Dumps the current value of each attribute from the database, so these have all the changes in transaction_log already enacted.</p>

<h3><a name="methods">Methods for accessing attributes</a></h3>
<p>The attributes methods have now been rewritten for handling all kinds of attributes. The key/value pairs can be associated with a feature like a peg, rna, or prophage, or a genome.</p>
<p>There are several base attribute methods:</p>
<p>There are also methods for more complex things:</p>
<p>By default all keys are case sensitive, and all keys have leading and trailing white space removed.</p>
<p>Attributes are not a 1:1 mapping: a single key can have several values.</p>
<h3><a name="get_attributes">get_attributes</a></h3>
<p>Get attributes requires one of four keys:
fid (which can be genome, peg, rna, or other id), key, value, or url.</p>
<p>It will find any attribute that has the characteristics that you request, and for each match it will return a four-tuple of:
[fid, key, value, url]</p>
<p>You can request an E. coli key like this
<p>You can request any PIRSF key like this
$fig-&gt;get_attributes('', 'PIRSF');</p>
<p>You can request any google url like this
$fig-&gt;get_attributes('', '', '', 'http://www.google.com');</p>
<p>NOTE: If there are no attributes an empty array will be returned. You need to check for this and not assume that it will be undef.</p>
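<p>To show the calling convention and the empty-result check without a running SEED, the sketch below uses a small in-memory stand-in. The data and the stand-in <code>get_attributes</code> function are invented; in particular, exact string matching on every field is an assumption, and the real method in FIG.pm may match differently.</p>

```perl
use strict;
use warnings;

# In-memory stand-in for the attributes database (invented data).
my @ATTRIBUTES = (
    ['83333.1',           'MOTILE', 'motile',      ''],
    ['fig|83333.1.peg.1', 'PIRSF',  'PIRSF000001', 'http://www.google.com'],
    ['fig|83333.1.peg.2', 'PIRSF',  'PIRSF000002', ''],
);

# Stand-in with the argument order fid, key, value, url. An empty or
# undefined argument matches anything; other arguments must match exactly.
sub get_attributes {
    my @want = @_;
    my @found;
    ROW: for my $row (@ATTRIBUTES) {
        for my $i (0 .. 3) {
            next unless defined $want[$i] && $want[$i] ne '';
            next ROW unless $row->[$i] eq $want[$i];
        }
        push @found, $row;    # each result is [fid, key, value, url]
    }
    return @found;
}

my @pirsf = get_attributes('', 'PIRSF');
print "$_->[0] has PIRSF value $_->[2]\n" for @pirsf;

# Test for an empty list rather than for undef.
my @none = get_attributes('', 'NO_SUCH_KEY');
print "no attributes found\n" unless @none;
```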
<h3><a name="add_attribute">add_attribute</a></h3>
<p>Add a new key/value pair to something. Something can be a genome id, a peg, an rna, prophage, whatever.</p>
<pre>
        feature id, this can be a peg, genome, etc,
        key name. This is case sensitive and has the leading and trailing white space removed
        optional URL to add
        optional file to store the attributes in.</pre>
<p>A note on file names. At the moment the file assigned_attributes is used to store new attributes by default, and load_attributes loads that file last so any changes will overwrite existing keys. However this is not quite true since we can now have multiple key/values for a single peg. Using this method you can define a filename to store the attributes in. The directory structure will be figured out for you, so you can use something like "pirsf" as the file name.</p>
<h3><a name="delete_attribute">delete_attribute</a></h3>
<p>Remove a key from a feature.</p>
<pre>
        feature id, this can be a peg, genome, etc,
        key name to delete</pre>
<p>Deleted attributes are stored in global/deleted_attributes</p>
<h3><a name="change_attribute">change_attribute</a></h3>
<p>Change the value of a key/value pair (and optionally its url).</p>
<pre>
        feature id, this can be a peg, genome, etc,
        key name whose value to replace
        value to replace it with
        optional URL to add
        optional file to store the changes in.</pre>
<p>See the note in add_attribute about files. Almost always you should not include this so that the default (assigned_attributes) is used as it is loaded last. However, this allows you to change the file if you wish.</p>
<p>Returns 0 on error and 1 on success.</p>
<h3><a name="erase_attribute_entirely">erase_attribute_entirely</a></h3>
<p>This method will remove any notion of the attribute that you give it. It is different from delete as that just removes a single attribute associated with a peg. This will remove the files and uninstall the attributes from the database so there is no memory of that type of attribute. All of the attribute files are moved to FIG_Tmp/Attributes/deleted_attributes, and so you can recover the data for a while. Still, you should probably use this carefully!</p>
<p>I use this to clean out old PIR superfamily attributes immediately before installing the new correspondence table.</p>
<p>e.g. my $status=$fig-&gt;erase_attribute_entirely("pirsf");</p>
<p>This will return the number of files that were moved to the new location.</p>
<h3><a name="get_keys">get_keys</a></h3>
<p>Get all the keys that we know about.</p>
<p>Without any arguments:</p>
<p>Returns a reference to a hash, where the key is the type of feature (peg, genome, rna, prophage, etc), and the value is a reference to a hash where the key is the key name and the value is a reference to an array of all features with that key.</p>
<p>print "There are ", scalar @{$fig-&gt;get_keys-&gt;{'peg'}-&gt;{'PIRSF'}}, " PIRSF keys in the database\n";</p>
<pre>my $keys=$fig-&gt;get_keys;
foreach my $type (keys %$keys) {
 foreach my $label (keys %{$keys-&gt;{$type}}) {
  foreach my $peg (@{$keys-&gt;{$type}-&gt;{$label}}) {
    # .. do something to each peg and genome here
  }
 }
}</pre>
<p>With an argument (that should be a recognized type like peg, rna, genome, etc):</p>
<p>Returns a reference to a hash where the key is the key name and the value is the reference to the array. This should use less memory than above.
The argument should be (currently) peg, rna, pp, genome, or any other recognized feature type (generally defined as the .peg. in the fid). The default is to return all keys, and this can also be specified with all</p>
<h3><a name="get_values">get_values</a></h3>
<p>Get all the values that we know about</p>
<p>Without any arguments:</p>
<p>Returns a reference to a hash, where the key is the type of feature (peg, genome, rna, prophage, etc), and the value is a reference to a hash where the key is the value and the value is the number of occurrences.</p>
<p>e.g. print "There are ", $fig-&gt;get_values-&gt;{'peg'}-&gt;{'100'}, " keys with the value 100 in the database\n";</p>
<p>With a single argument:</p>
<p>The argument is assumed to be the type (rna, peg, genome, etc).</p>
<p>With two arguments:</p>
<p>The first argument is the type (rna, peg, genome, etc), and the second argument is the key.</p>
<p>In each case it will return a reference to a hash.</p>
<pre>
        $fig-&gt;get_values(); # will get all values
        $fig-&gt;get_values('peg'); # will get all values for pegs
        $fig-&gt;get_values('peg', 'pirsf'); # will get all values for pegs with attribute pirsf
        $fig-&gt;get_values(undef, 'pirsf'); # will get all values for anything with that attribute</pre>
<h3><a name="key_info">key_info</a></h3>
<p>Access a reference to an array of [single, explanation]</p>
<p>Single is a boolean; if it is true, only the last value returned should be used. Note that the other methods will still return all the values; it is up to the implementer to ensure that only the last value is used.</p>
<p>Explanation is a user-derived explanation that can be defined.</p>
<p>If a reference to an array is provided, along with the key, those values will be set.</p>
<pre>$fig-&gt;key_info($key, \@data); # set the data
$data=$fig-&gt;key_info($key); # get the data</pre>
<h3><a name="get_key_value">get_key_value</a></h3>
<p>Given a key and a value will return anything that has both</p>
<pre>
        my @nonmotile_genomes = $fig-&gt;get_key_value('motile', 'non-motile');
        my @bluepegs          = $fig-&gt;get_key_value('color', 'blue');</pre>
<p>If either the key or the value is omitted, it will return all the matching sets.</p>
<h3><a name="guess_value_format">guess_value_format</a></h3>
<p>There are occasions where I want to know what a value is for a key. I have three scenarios right now:</p>
<pre>
 1. strings
 2. numbers
 3. percentiles ( a type of number, I know)</pre>
<p>In these cases, I may want to know something about them and do something interesting with them. This will try and guess what the values are for a given key so that you can try and limit what people add. At the moment this is pure guess work; I suppose we could put some restrictions on t/v pairs, but I don't feel like it.</p>
<p>This method will return a reference to an array. If the element is a string there will only be one element in that array, the word "string". If the value is a number, there will be three elements, the word "float" in position 0, and then the minimum and maximum values. You can figure out if it is a percent :-)</p>
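<p>The return convention described above can be sketched in a few lines. This is not the FIG.pm implementation: the function name <code>guess_format</code> is invented, it takes a plain list of values rather than a key, and using <code>looks_like_number</code> from Scalar::Util to decide what counts as a number is an assumption.</p>

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
use List::Util qw(min max);

# Hypothetical helper: return ['string'] if any value is not numeric,
# otherwise ['float', minimum, maximum].
sub guess_format {
    my @values = @_;
    return ['string'] unless @values;
    for my $value (@values) {
        return ['string'] unless looks_like_number($value);
    }
    return ['float', min(@values), max(@values)];
}

my $width  = guess_format(0.5, 2, 1.25);
my $motile = guess_format('motile', 'non-motile');
print "width: @$width\n";
print "motile: @$motile\n";
```

<p>A caller that sees "float" with a minimum and maximum inside 0 to 100 could then decide whether to treat the values as percentages.</p>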
<h3><a name="attribute_location">attribute_location</a></h3>
<p>This is just an internal method to find the appropriate location of the attributes file depending on whether it is a peg, an rna, or a genome or whatever.</p>
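<p>A rough idea of what such a lookup involves, based on the directory layout given under File Locations: the sketch below maps an id to a directory. The function name <code>attributes_dir</code> and the example base directory are invented, and it assumes feature ids of the form fig|genome.type.number and genome ids of the form nnnn.nn; the real method may handle more cases.</p>

```perl
use strict;
use warnings;

# Hypothetical helper: work out the Attributes directory for an id, given
# the organisms directory (the value of $FIG_Config::organisms).
sub attributes_dir {
    my ($organisms_dir, $id) = @_;
    # Feature id, e.g. fig|83333.1.peg.12: genome 83333.1, type peg
    if ($id =~ /^fig\|(\d+\.\d+)\.([^.]+)\.\d+$/) {
        return "$organisms_dir/$1/Features/$2/Attributes";
    }
    # Bare genome id, e.g. 83333.1
    if ($id =~ /^\d+\.\d+$/) {
        return "$organisms_dir/$id/Attributes";
    }
    return undef;    # not an id this sketch understands
}

# '/data/Organisms' is a made-up location used only for this example.
for my $id ('fig|83333.1.peg.12', 'fig|83333.1.rna.3', '83333.1') {
    print "$id -> ", attributes_dir('/data/Organisms', $id), "\n";
}
```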
