The SEED/PIR comparisons are controlled by the script pir.cgi. These are the functions in this script.
There are many ways to compare the data between the SEED and the PIR superfamilies. You can enter the comparison directly using the list menu, you can ask for those proteins that are in many superfamilies or in many subsystems, and you can enter directly via the spreadsheets. This help should guide you through some of the entry points and what the data mean. It is accessible from the links on the appropriate pages to remind you when necessary.
First we have a simple comparison of PIR and SEED functions. The list provides a summary of the PIR superfamilies and the number of pegs that map directly to each superfamily. There should be a many-to-one relationship here because each superfamily has many pegs in it.
Immediately below the list there are some options to control which superfamilies are displayed in the list:
You can control the number of pegs that the superfamily must contain. Using this option you can limit the list to those superfamilies that have multiple PEGs in the SEED database, and are therefore likely to be more consistent. Minimum number of pegs per PIR superfamily shown in list
You can also choose to show all superfamilies (note that this is the same as setting the minimum number to 1). Show all PIR superfamilies
The next option is to show the subsystem counts in the menu. This shows the correspondence between PIR superfamilies and SEED subsystems. This doesn't have to be a one-to-one relationship because one superfamily can be represented by more than one subsystem.
For example, "Pyrophosphate--fructose 6-phosphate 1-phosphotransferase, alpha subunit (EC 2.7.1.90)" is in the subsystems "Embden-Meyerhof and Gluconeogenesis" and "Fructose_and_Mannose_metabolism". Checking the box will show the correspondence. On some machines this may take a minute or two to compute.
Show subsystem counts in list
By default the list shows only the fully annotated PIR superfamilies, and not the preliminary superfamilies. However, you can reverse this and display only the preliminary superfamilies if desired. The correspondences in preliminary superfamilies may be less well developed. Show only preliminary PIR superfamilies
You can limit the PIR superfamilies shown on the list by some text. The text is a case-insensitive match, and you can search for something like "glutamate" or a superfamily number like "729"
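The list options described above (minimum peg count, full versus preliminary superfamilies, and the case-insensitive text match) can be sketched as a single filter. This is only an illustration and not the code in pir.cgi: the `Superfamily` record, the `filter_superfamilies` function, and the peg counts are hypothetical, while the superfamily numbers and names are taken from the example table later in this help.

```python
from typing import NamedTuple


class Superfamily(NamedTuple):
    number: str        # PIR superfamily number, e.g. "PIRSF001370"
    name: str          # superfamily name
    peg_count: int     # pegs mapped to the superfamily (hypothetical values below)
    preliminary: bool  # True for preliminary superfamilies


# Names and numbers come from the example table; the peg counts are made up.
FAMILIES = [
    Superfamily("PIRSF001370", "thiamine diphosphate-dependent enzyme, acetolactate synthase type", 40, False),
    Superfamily("PIRSF002891", "rod protein flgF", 12, True),
    Superfamily("PIRSF005419", "Type III secretion system/flagellar apparatus protein, InvA/LcrD/FlhA type", 8, True),
    Superfamily("PIRSF004862", "probable flagellar basal-body M ring protein", 3, True),
    Superfamily("PIRSF006184", "flagellar basal body P-ring protein flgI", 5, True),
]


def filter_superfamilies(families, min_pegs=1, text="", preliminary=False):
    """Return the superfamilies that pass all three list options."""
    needle = text.lower()  # the text match ignores case
    return [
        f for f in families
        if f.peg_count >= min_pegs
        and f.preliminary == preliminary
        # The text may match either the name or the superfamily number.
        and (needle in f.name.lower() or needle in f.number.lower())
    ]


# Preliminary superfamilies whose name mentions "flagell", in any case.
for family in filter_superfamilies(FAMILIES, text="FLAGELL", preliminary=True):
    print(family.number, family.name)
```

An empty search text matches everything, so leaving the text box blank applies only the other two options.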
There are three choices: update the view, which presents the same page with the options you have selected here; show the correspondence between the PIR superfamilies and the SEED; or reset the list back to the original values. The correspondence is described below.
Generating the tables takes about five minutes, and so you need to be patient and wait for the results. Resist the temptation to keep clicking the button.
The two options are to select what types of data are presented. See below for an example of the data that will be returned.
The first choice allows you to decide on the correspondence that you want to see. If you check the box (the default) you will only see those superfamilies that have proteins that are in subsystems. A superfamily has several proteins in it, and some of those may be in subsystems as well as superfamilies. However, there are also superfamilies whose proteins are not in subsystems. These are not shown by default because we do not make any assertions about the annotations of those proteins.
The second choice allows you to sort the table that is returned. The data can either be sorted by the "Number of annotations in subsystems" or by the "Number of SEED annotations". See below for a description of these.
The table that is returned will look something like this:
|Number of annotations in subsystems||Number of SEED annotations||PIRSF (link goes to SEED/PIR comparison)||Superfamily name||Subsystems in superfamily|
|31||83||PIRSF001370||(Full) thiamine diphosphate-dependent enzyme, acetolactate synthase type||Valine_Biosynthesis; Acetoin_metabolism; Valine_Synthesis; Xanthine_to_Glycine; Allantoin_degradation; Inositol_catabolism|
|30||33||PIRSF002891||(Preliminary) rod protein flgF||Flagellum|
|20||32||PIRSF005419||(Preliminary) Type III secretion system/flagellar apparatus protein, InvA/LcrD/FlhA type||Vibrio_Experimental_Type_III_secretion_system_; Flagellum; Type_III_secretion_system|
|17||17||PIRSF004862||(Preliminary) probable flagellar basal-body M ring protein||Flagellum|
|17||18||PIRSF006184||(Preliminary) flagellar basal body P-ring protein flgI||Flagellum|
Note that the header is repeated throughout the table to keep it clear which column is which.
The table contains the following information:
Number of annotations in subsystems: This is the number of different annotations that this superfamily has, considering only those annotations that are in subsystems. A large number indicates that the superfamily covers proteins with different roles in the subsystems, and there is probably a conflict between the superfamily and the SEED. These are the superfamilies, subsystems, or annotations that need the most attention.
Number of SEED annotations: This is the total number of different annotations that this protein family encompasses in the SEED, including those proteins that are not in subsystems. A lower number is also better, and the excess over the first column probably represents the proteins that have yet to be included in subsystems.
PIRSF: The number is the number of the PIR superfamily, and the link takes you to the correspondence between this superfamily and the SEED database. See the correspondence help below.
Superfamily name: The name of the superfamily.
Subsystems in superfamily: The different subsystems that proteins in this superfamily are members of.
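The two sort orders, and the "excess" of SEED annotations over annotations in subsystems, can be illustrated with the five example rows above. This is a sketch only: the `sort_rows` and `excess` helpers are hypothetical names, and sorting in descending order is an assumption that happens to agree with the order of the example table.

```python
# (annotations in subsystems, SEED annotations, PIRSF, subsystems in superfamily)
ROWS = [
    (17, 18, "PIRSF006184", ["Flagellum"]),
    (31, 83, "PIRSF001370", ["Valine_Biosynthesis", "Acetoin_metabolism", "Valine_Synthesis",
                             "Xanthine_to_Glycine", "Allantoin_degradation", "Inositol_catabolism"]),
    (20, 32, "PIRSF005419", ["Vibrio_Experimental_Type_III_secretion_system_", "Flagellum",
                             "Type_III_secretion_system"]),
    (30, 33, "PIRSF002891", ["Flagellum"]),
    (17, 17, "PIRSF004862", ["Flagellum"]),
]


def sort_rows(rows, by="subsystem"):
    """Sort by one of the two count columns, largest first (assumed order)."""
    column = {"subsystem": 0, "seed": 1}[by]
    return sorted(rows, key=lambda row: row[column], reverse=True)


def excess(row):
    """SEED annotations that are not accounted for by annotations in subsystems."""
    return row[1] - row[0]


for row in sort_rows(ROWS, by="seed"):
    print(row[2], row[0], row[1], "excess:", excess(row))
```

Sorting by "seed" puts PIRSF001370 first, and its excess of 52 (83 minus 31) is the kind of gap that the second column is meant to expose.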
An example of the correspondence table is shown below:
|Superfamily (link goes to PIR)||Genome||UniProt||PEG||FIG Function||FIG Subsystem|
|PIRSF002185(Preliminary) Escherichia coli ribosomal protein L16||Buchnera aphidicola str. APS (Acyrthosiphon pisum)||uni|P57584||492||LSU ribosomal protein L16p (L10e)||Ribosome LSU bacterial|
|PIRSF002185(Preliminary) Escherichia coli ribosomal protein L16||Acanthamoeba castellanii||uni|P46768||26||LSU ribosomal protein L16|
|PIRSF002185(Preliminary) Escherichia coli ribosomal protein L16||Naegleria gruberi||uni|Q9G8Q3||26||LSU ribosomal protein L16|
|PIRSF002185(Preliminary) Escherichia coli ribosomal protein L16||Aquifex aeolicus VF5||uni|O66438||11||LSU ribosomal protein L16p (L10e)||Ribosome LSU bacterial|
|PIRSF002185(Preliminary) Escherichia coli ribosomal protein L16||Guillardia theta||uni|O46901||117||LSU ribosomal protein L16|
The table contains the following information:
Superfamily: The name of the superfamily and a link to the PIR website that describes the superfamily.
Genome: The name of the genome.
UniProt: The UniProt ID and a link to the PIR website describing that protein.
PEG: Just the PEG number in the genome of interest. This is a shorter label, so we show 43 instead of writing out the whole fig id in the form fig|83333.1.peg.43. The link will take you to the SEED protein page.
FIG Function: The function that the protein has in the SEED database. Identical functions are colored with the same color so that you can easily identify which proteins have the same function and which do not.
FIG Subsystem: All of the subsystems that the protein is present in.
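Two of the conventions in this table, the shortened PEG number and the one-color-per-function rule, can be sketched as follows. This is not the code used by pir.cgi: the `short_peg` and `assign_colors` helpers and the color palette are hypothetical, and the regular expression assumes that every fig id has exactly the form shown above (fig|genome.peg.number).

```python
import re

# Assumed shape of a fig id, based on the example fig|83333.1.peg.43.
FIG_ID = re.compile(r"^fig\|(\d+\.\d+)\.peg\.(\d+)$")


def short_peg(fig_id):
    """Split a full fig id into its genome id and the short PEG number."""
    match = FIG_ID.match(fig_id)
    if match is None:
        raise ValueError(f"not a peg id: {fig_id!r}")
    return match.group(1), int(match.group(2))


def assign_colors(functions, palette=("#ffcccc", "#ccffcc", "#ccccff", "#ffffcc")):
    """Give each distinct function one color, in order of first appearance.

    The palette is made up for this sketch; colors repeat if there are more
    distinct functions than palette entries.
    """
    colors = {}
    for function in functions:
        if function not in colors:
            colors[function] = palette[len(colors) % len(palette)]
    return colors


genome, peg = short_peg("fig|83333.1.peg.43")
print(genome, peg)

# Functions taken from the example correspondence table.
functions = [
    "LSU ribosomal protein L16p (L10e)",
    "LSU ribosomal protein L16",
    "LSU ribosomal protein L16",
    "LSU ribosomal protein L16p (L10e)",
    "LSU ribosomal protein L16",
]
colors = assign_colors(functions)
for function in functions:
    print(colors[function], function)
```

Because the color is looked up by the exact function string, the two spellings in the example table receive different colors, which is what makes inconsistent annotations stand out.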
At the top of the page there is a link to either "Show All Matches" or "Show only matches with a subsystem". In the former case, every match between the PIR superfamily and the SEED database will be shown. In the latter case, only those proteins that are present in subsystems will be shown.
You can download new data from the PIR ftp site and install the data directly. This is a two-step process. First, we check whether the new file is more current than the old one (there is no point in updating otherwise, unless, of course, you have added new genomes). Second, we actually get the data.
Click on the button to see whether there are new updates. You will then see a page that looks something along these lines (of course, the times will be different!):
This example is for files that are all current.
The local file is up to date and there is no need to update your source PIR superfamilies.
The remote file was modified on Fri Jan 14 13:26:47 2005
The local file was modified on Sat Jan 29 11:31:55 2005
If your files are not current, then the message will be something like this:
The remote file ftp://ftp.pir.georgetown.edu/pir_databases/pirsf/data/pirsfinfo.dat is newer than your current file. You should proceed with the update.
The remote file was modified on Fri Jan 14 13:26:47 2005
The local file was modified on Wed Dec 29 11:31:55 2004
If there is a problem with the internet connection, or the file cannot be accessed for some reason (e.g. the name is wrong), you will not be given the option to proceed with the update, and you will see a message like this:
Could not connect to PIR to check the status of the PIR file. Please check the location of ftp://ftp.pir.georgetown.edu/pir_databases/pirsf/data/pirsfinfo.dat
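The check behind these messages amounts to comparing two modification times. The sketch below parses timestamps in the format shown in the example messages and decides whether an update is worthwhile. It is an illustration under stated assumptions rather than the actual check: the `needs_update` function is a hypothetical name, and the day and month names are assumed to be in English (the default C locale).

```python
from datetime import datetime

# Format of the timestamps in the messages above, e.g. "Fri Jan 14 13:26:47 2005".
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def needs_update(remote_stamp, local_stamp):
    """Return True when the remote file is newer than the local copy."""
    remote = datetime.strptime(remote_stamp, STAMP_FORMAT)
    local = datetime.strptime(local_stamp, STAMP_FORMAT)
    return remote > local


# The "all current" example: the local file is newer, so no update is needed.
print(needs_update("Fri Jan 14 13:26:47 2005", "Sat Jan 29 11:31:55 2005"))

# The "not current" example: the remote file is newer than the local one.
print(needs_update("Fri Jan 14 13:26:47 2005", "Wed Dec 29 11:31:55 2004"))
```

The comparison says nothing about new genomes that you may have added locally, which is why updating can still be worthwhile even when the local file is current.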
Clicking on the "Update Data" or "Update Anyway" buttons will start the download and reinitiate the comparison of the SEED data with the PIR data. The downloading and installation of the data is run in the background using the script 'load_pirsf' because it takes a significant amount of time and resources. You can monitor the progress in the SEED control panel. While the data is being installed you should not use the PIR superfamilies. Although they will show up, they are being edited, added, and deleted, and are therefore unstable. Installation of the data should take about 10-15 minutes.
Once the update has run, you will see the front page again, with a message telling you that the update is complete.
The correspondence between PIR and SEED is highlighted in the spreadsheets. A sample of a few columns is shown below. Note that this table is for demonstration purposes only and the correspondence will likely change.
|Genome ID||Organism||Variant Code||cysB||cysC||cysD||cysI||cysJ||cysN||cysQ||cysS|
|Yersinia pestis CO92 [B]||2277||3343 ||3345 ||3349||3350 ||3344 ||3504||3079 |
|Vibrio parahaemolyticus RIMD 2210633 [B]||1101||296 ||292 ||2721||2722 ||293 ||1150 |
|Shigella flexneri 2a str. 301 [B]||1222||2595 ||2597 ||2601 ||2602 ||2596 ||4023||444 |
|Bacillus halodurans C-125 [B]||1489 ||610 ||609 ||111, 112 [5, 3]|
The columns of the table are colored based on the superfamilies that the proteins are in, and in theory each column should be the same color and complete throughout.
Note that the small, slightly superscripted numbers are linked to the PIR correspondence table, so you can click through and see proteins missing from either side as described above.
This example demonstrates the different aspects of the PIR/SEED interactions:
The download file is generated for those proteins that are in PIR superfamilies. The file has four columns separated by tabs:
This file is generated on the fly and stored in a temporary location so it may or may not exist. You can create it at any time by clicking the button.