Revision 1.6
Tue May 16 16:29:22 2006 UTC (13 years, 8 months ago) by overbeek
<h1 style="text-align: center">Protein Families</h1>
<h2 style="text-align: center">Updated May 6th, 2006. Rob Edwards</h2>

	                <h3 style="text-align: center">Contents</h3>
			<ul>
			<li><a href="#overview">Overview</a></li>
			<li><a href="#description">FIGFams</a></li>
			<li><a href="#building">Building protein families</a></li>
			<li><a href="#converting">Converting Ross' families to FIGfams</a></li>
			<li><a href="#example1">Example 1: extending existing protein families</a></li>
			</ul>

<h3><a name="overview">Overview</a></h3>

<p>FIG is introducing a comprehensive protein family concept. We have generated correspondences between the proteins in the protein families available from a diverse array of sources, including <a href="ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/">COGs</a>, <a href="ftp://ftp.genome.ad.jp/pub/kegg/tarfiles">KEGG</a>, <a href="/FIG/pir.cgi">PIR</a>, <a href="ftp://us.expasy.org/databases/prosite/">Prosite</a> by SwissProt, <a href="ftp://ftp.tigr.org/pub/data/TIGRFAMs/">TIGRFAMs</a>, <a href="http://www.cbil.upenn.edu/gene-family/">OrthoMCL</a>, <a href="http://aclame.ulb.ac.be/">ACLAME</a>, and <a href="ftp://ftp.genetics.wustl.edu/pub/Pfam/">Pfam</a>.</p>

<p>In addition, we have created <span style="font-weight: bolder">FIGFams</span>, a new type of protein family. FIGFams have the following characteristics:</p>
	<ul>
	<li>FIGFams are united by a common functional role.</li>
	<li>FIGFams are extracted as a single column from a <a href="/FIG/subsys.cgi">subsystem</a> spreadsheet.</li>
	<li>FIGFams are not based on homology, although members of a family are usually homologous with each other.</li>
	</ul>

<h3><a name="description">FIGFams</a></h3>
<p>FIGFams are created by extracting a single column from a subsystem spreadsheet and calling all proteins in that column a family. The proteins need not be homologous, although they should perform the same function. We are attempting to highlight proteins that are in a FIGFam but should not be, and proteins that should be in a FIGFam but are not.</p>
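<p>The column extraction described above can be sketched as follows. This is only an illustration: the in-memory spreadsheet layout, the function name <tt>extract_family</tt>, the role names, and the protein IDs are all assumptions made for the example, not the actual SEED data structures.</p>

```python
from typing import Dict, List

# Hypothetical in-memory subsystem spreadsheet: one row per genome,
# one column per functional role, each cell a list of protein IDs.
Spreadsheet = Dict[str, Dict[str, List[str]]]


def extract_family(spreadsheet: Spreadsheet, role: str) -> List[str]:
    """Collect every protein in the column for `role`, across all genomes."""
    members: List[str] = []
    for genome in sorted(spreadsheet):
        # A genome may have an empty cell, or no cell at all, for a role.
        members.extend(spreadsheet[genome].get(role, []))
    return members


# Illustrative data only; the genome names, roles, and IDs are made up.
example_spreadsheet: Spreadsheet = {
    "genome_A": {"Role X": ["fig|1.1.peg.10"], "Role Y": ["fig|1.1.peg.11"]},
    "genome_B": {"Role X": ["fig|2.1.peg.7", "fig|2.1.peg.8"], "Role Y": []},
}

family = extract_family(example_spreadsheet, "Role X")
```

<p>Note that nothing in this step compares sequences: membership comes only from the column, which is why a family may contain non-homologous proteins.</p>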

<h3><a name="data">FIGFam data</a></h3>
<p>FIGFams data is stored in several places. The raw data is downloaded from the sites above and parsed into the following files. By default, these files are expected to be in <em>$FIG_Config::global/ProteinFamilies/Sources/</em>.</p>
<p>After parsing, we expect the following files in each of the subdirectories:</p>
	<p>An ID mapping file with two tab-separated columns. The first column contains the family ID, and the second column contains the proteins that are in that family.</p>
	<p>A functional mapping file, also with two tab-separated columns. The first column is the family ID, which should match the family IDs in the ID mapping file. The second column is the function of that family.</p>
	<p>A FASTA format file containing the protein sequences.</p>
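<p>As a rough illustration of the two tab-separated formats, the sketch below reads an ID mapping and a functional mapping into dictionaries. It assumes one protein per line in the ID mapping, with the family ID repeated on each line; the function names and the sample data are hypothetical.</p>

```python
import io
from collections import defaultdict
from typing import Dict, Iterable, List


def read_id_map(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each family ID to the proteins listed for it (assumes one protein per line)."""
    families: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        family_id, protein_id = line.split("\t")[:2]
        families[family_id].append(protein_id)
    return dict(families)


def read_functions(lines: Iterable[str]) -> Dict[str, str]:
    """Map each family ID to its function."""
    functions: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        family_id, function = line.split("\t", 1)
        functions[family_id] = function
    return functions


# Made-up sample content standing in for the two files.
sample_id_map = io.StringIO("fam1\tprotA\nfam1\tprotB\nfam2\tprotC\n")
sample_functions = io.StringIO("fam1\tExample function one\nfam2\tExample function two\n")

id_map = read_id_map(sample_id_map)
functions = read_functions(sample_functions)

# Every family in the ID mapping should also have a function.
missing = sorted(set(id_map) - set(functions))
```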

<p>A common ID (cid) that covers all the proteins is produced from these data. The proteins are then merged so that any protein that is part of a longer protein is considered part of that family, which results in a set of proteins associated with each common ID. Finally, the common IDs are mapped to the families that contain them. These data can be retrieved from the database objects.</p>
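<p>One way to picture the merging step is the sketch below. It assumes that "part of a longer protein" means exact substring containment of the amino acid sequence, which may not be how the real build decides it, and it uses the ID of the longest sequence as a stand-in for the common ID. The function name and the sequences are illustrative only.</p>

```python
from typing import Dict, List


def group_by_containment(sequences: Dict[str, str]) -> Dict[str, List[str]]:
    """Group protein IDs under the first (longest) protein whose sequence contains theirs."""
    # Visit the longest sequences first so that representatives exist
    # before the fragments that may belong to them.
    ordered = sorted(sequences, key=lambda pid: (-len(sequences[pid]), pid))
    groups: Dict[str, List[str]] = {}
    for pid in ordered:
        seq = sequences[pid]
        for representative in groups:
            if seq in sequences[representative]:
                groups[representative].append(pid)
                break
        else:
            # No longer protein contains this one, so it starts a new group.
            groups[pid] = [pid]
    return groups


# Made-up sequences: p2 is a fragment of p1, p3 is identical to p1.
example_sequences = {
    "p1": "MKTAYIAKQR",
    "p2": "TAYIA",
    "p3": "MKTAYIAKQR",
    "p4": "GGGG",
}

groups = group_by_containment(example_sequences)
```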

<h3><a name="building">Building Protein Families</a></h3>

<p>We tried to take an inclusive approach while building the protein families. Data were downloaded from the following sites:</p>
<table border=1>
	<tr>
	<th>Source</th>
	<th>Last Download Date</th>
	</tr>
	<tr>
	<td><a href="ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/">ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/</a></td>
	<td>Dec 5th, 2005</td>
	</tr>
	<tr>
	<td><a href="ftp://ftp.genome.ad.jp/pub/kegg/tarfiles/">ftp://ftp.genome.ad.jp/pub/kegg/tarfiles/</a></td>
	<td>Dec 5th, 2005</td>
	</tr>
	<tr>
	<td><a href="http://orthomcl.cbil.upenn.edu/cgi-bin/OrthoMclWeb.cgi">OrthoMCL website</a></td>
	<td>June 7th, 2005</td>
	</tr>
	<tr>
	<td><a href="http://theseed.uchicago.edu/FIG/pir.cgi">http://theseed.uchicago.edu/FIG/pir.cgi</a></td>
	<td>Dec 5th, 2005</td>
	</tr>
	<tr>
	<td><a href="http://www.expasy.org/prosite/">http://www.expasy.org/prosite/</a></td>
	<td>Dec 5th, 2005</td>
	</tr>
	<tr>
	<td><a href="http://aclame.ulb.ac.be/">aclame.ulb.ac.be</a></td>
	<td>June 7th, 2005</td>
	</tr>
	<tr>
	<td><a href="ftp://ftp.genetics.wustl.edu/pub/Pfam/">ftp://ftp.genetics.wustl.edu/pub/Pfam/</a></td>
	<td>Dec 5th, 2005</td>
	</tr>
	<tr>
	<td><a href="ftp://ftp.tigr.org/pub/data/TIGRFAMs/">ftp://ftp.tigr.org/pub/data/TIGRFAMs/</a></td>
	<td>Dec 5th, 2005</td>
	</tr>
</table>

<p>Please email Rob Edwards if you would like more data included in these comparisons.</p>

<h3><a name="converting">Converting FIGfams to protein families</a></h3>
<p>Several steps are needed to take Ross' output and convert it into protein families loaded into a SEED:</p>
<ol>
<li>You will start with the files families.2c and family.functions. First, make a fasta file of the proteins in the families for accurate comparisons. Use this command:
<br /><tt>nice cat families.2c | cut -f2 | get_translations &gt; fasta 2&gt; fasta.err</tt><br />
If you get an error, you are almost certainly using the wrong figdisk for the get_translations command.</li>
<li>Then copy the files fasta, families.2c, and family.functions to the destination SEED, or to a directory such as the shared directories on the CI machines.</li>
<li>Convert those files as follows:
    <ul>
        <li>The fasta file is fine as it is.</li>
        <li>family.functions needs to be renamed and have fig| prepended to every line. This will take care of that:
            <br /><tt>perl -pe 's/^/fig\|/' family.functions &gt; family.funcs</tt></li>
        <li>families.2c needs to be renamed, have fig| prepended to every line, and be trimmed to two columns: just the family ID and the protein ID. This will take care of that:
            <br /><tt>perl -pe 's/^([^\t]*\t[^\t]*).*$/fig\|$1/' families.2c &gt; id.map</tt></li>
    </ul></li>
<li>I do all this work in /home/seed/Rob/ProteinFamilies. That directory has the following structure:
    <ul>
        <li>The directory ProteinFamilies contains the sources of the other protein families, and these should be updated periodically. The protein family source directories contain the instructions for downloading and parsing those families. The FIG directory is a link to the latest version of those families in the upper level directory.</li>
        <li>The dated directories contain the FIGfams builds and the families that were built.</li>
    </ul></li>
<li>In the directory /home/seed/Rob/ProteinFamilies/ProteinFamilies, run the command:
        <br /><tt>build_protein_families Sources</tt>
    <br />This will create the correspondence between protein families, put it in FIG/Data/Global/ProteinFamilies, and load the databases.</li>
</ol>
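<p>For readers who prefer not to use the Perl one-liners, the two file conversions above can be sketched in Python as shown here. The function names and sample lines are hypothetical, and the sketch assumes the family ID and protein ID are the first two tab-separated columns of families.2c.</p>

```python
from typing import Iterable, List


def prefix_functions(lines: Iterable[str]) -> List[str]:
    """Prepend fig| to every non-blank line of family.functions."""
    return ["fig|" + line.rstrip("\n") for line in lines if line.strip()]


def build_id_map(lines: Iterable[str]) -> List[str]:
    """Prepend fig| and keep only the first two columns of families.2c."""
    converted: List[str] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        converted.append("fig|" + "\t".join(fields[:2]))
    return converted


# Made-up sample lines standing in for the real files.
sample_functions = ["0001\tExample function\n"]
sample_families = [
    "0001\tfig|83333.1.peg.1\textra column\n",
    "0001\tfig|83333.1.peg.2\textra column\n",
]

family_funcs = prefix_functions(sample_functions)
id_map_lines = build_id_map(sample_families)
```

<p>Writing <tt>family_funcs</tt> and <tt>id_map_lines</tt> out one entry per line would give files in the same shape as family.funcs and id.map.</p>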

<h3><a name="example1">Example 1</a></h3>

<p>In this example, we demonstrate how to identify the discrepancies between different protein families.</p>
<p>To start, we need to find a family. Let's use the family <i>Low molecular weight protein tyrosine phosphatase (EC</i> as an example. You can find this family in several different ways:</p>
<ul>
<li style="margin: 6px 6px">You can perform a <b>free text search</b> that will find this family and others with similar names.
<br />For example, entering "protein tyrosine phosphatase" in the free text search on the <a href="http://theseed.uchicago.edu/FIG/proteinfamilies.cgi">Protein Families</a> page will give you the results shown <a href="http://theseed.uchicago.edu/FIG/proteinfamilies.cgi?freetext=protein+tyrosine+phosphatase&freetextsearch=Text+Search">here</a>.</li>
<li style="margin: 6px 6px">You can <b>search</b> the protein families for an individual protein.
<br />For example, entering <i>fig|119072.1.peg.133</i> in the protein ID box will lead you to <a href="http://theseed.uchicago.edu/FIG/proteinfamilies.cgi?prot=fig%7C119072.1.peg.133&equivalence=Show+Protein+Families">this page</a>, which shows not only this family but other related families.</li>
<li style="margin: 6px 6px">You can <b>browse</b> through all the families, looking for the family you are interested in, by clicking one of the family buttons.
<br />At the moment the list returned is limited to 1,000 families so that browsers can load the menus easily. However, you can narrow this display to selected families by, for example, entering "tyrosine" in the limit box and clicking rebuild table.</li>
<li style="margin: 6px 6px">You can enter directly from a <b>SEED protein page</b>.
<br />For example, open <a href="http://theseed.uchicago.edu/FIG/protein.cgi?user=&prot=fig|281309.1.peg.374">fig|281309.1.peg.374</a> and then click the link entitled <i>Explore Protein Families for fig|281309.1.peg.374</i>.</li>
</ul>
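<p>The free text search and protein search links above follow a simple query-string pattern, so they can be built programmatically. The sketch below reproduces the two URLs from this example using only the parameter names visible in those links; the helper function names are made up, and no other parameters of proteinfamilies.cgi are assumed.</p>

```python
from urllib.parse import urlencode

BASE_URL = "http://theseed.uchicago.edu/FIG/proteinfamilies.cgi"


def free_text_search_url(text: str) -> str:
    """Build the link for a free text search of the protein families."""
    query = urlencode({"freetext": text, "freetextsearch": "Text Search"})
    return BASE_URL + "?" + query


def protein_search_url(protein_id: str) -> str:
    """Build the link that shows the families containing one protein."""
    query = urlencode({"prot": protein_id, "equivalence": "Show Protein Families"})
    return BASE_URL + "?" + query


text_url = free_text_search_url("protein tyrosine phosphatase")
protein_url = protein_search_url("fig|119072.1.peg.133")
```

<p>Here <tt>urlencode</tt> takes care of the escaping, so the <tt>|</tt> in the protein ID becomes <tt>%7C</tt> and spaces become <tt>+</tt>, matching the links shown in the list.</p>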
