Help On SEED/PIR Comparisons

Rob Edwards April 1, 2005

The SEED/PIR comparisons are controlled by the script pir.cgi. These are the functions in this script.

There are many ways to compare the data between the SEED and the PIR superfamilies. You can enter the comparison directly using the list menu, you can ask for those proteins that are in many superfamilies or in many subsystems, and you can enter directly via the spreadsheets. This help should guide you through some of the entry points and explain what the data mean. It is accessible from links on the appropriate pages to remind you when necessary.

First we have a simple comparison of PIR and SEED functions. The list provides a summary of the PIR superfamilies and the number of PEGs that map directly to each superfamily. This should be a many-to-one relationship because each superfamily has many PEGs in it.
Immediately below the list there are some options to control which superfamilies are displayed in the list:

Control menu contents

You can control the minimum number of PEGs that a superfamily must contain to be shown. Using this option you can limit the list to those superfamilies that have multiple PEGs in the SEED database, and are therefore likely to be more consistent. The control is labeled "Minimum number of pegs per PIR superfamily shown in list".

You can also choose to show all PIR superfamilies (note that this is the same as setting the minimum number to 1).
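
The minimum-PEG option can be pictured as a simple threshold on a table of counts. The following Python sketch is only an illustration and is not the code used by pir.cgi; the function name, the PEG counts, and the identifier PIRSF999999 are hypothetical.

```python
def filter_by_min_pegs(peg_counts, minimum=1):
    """Keep only the superfamilies that have at least `minimum` PEGs."""
    return {sf: n for sf, n in peg_counts.items() if n >= minimum}


# Hypothetical PEG counts per superfamily (illustrative values only).
peg_counts = {"PIRSF001370": 40, "PIRSF002891": 12, "PIRSF999999": 1}

# Only superfamilies with multiple PEGs.
shown = filter_by_min_pegs(peg_counts, minimum=2)

# A minimum of 1 is the same as showing all superfamilies.
everything = filter_by_min_pegs(peg_counts, minimum=1)
```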

The next option is to show the subsystem counts in the menu. This shows the correspondence between PIR superfamilies and SEED subsystems. This doesn't have to be a one-to-one relationship because one superfamily can be represented by more than one subsystem.
For example, "Pyrophosphate--fructose 6-phosphate 1-phosphotransferase, alpha subunit (EC 2.7.1.90)" is in the subsystems "Embden-Meyerhof and Gluconeogenesis" and "Fructose_and_Mannose_metabolism". Checking the "Show subsystem counts in list" box will show the correspondence. On some machines this may take a minute or two to compute.

By default the list shows only the fully annotated PIR superfamilies, and not the preliminary superfamilies. However, you can reverse this and display only the preliminary superfamilies by checking "Show only preliminary PIR superfamilies". The correspondences in preliminary superfamilies may be less well developed.

You can limit the PIR superfamilies shown on the list by entering some text. The text is a case-insensitive match, and you can search for something like "glutamate" or a superfamily number like "729".
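
As a rough illustration of the text filter, the Python sketch below does a case-insensitive substring match against both the superfamily number and its name. Whether pir.cgi uses a substring match or a pattern match is an assumption here, and the function name is hypothetical; the superfamily names are taken from the example table later in this help.

```python
def matches(query, number, name):
    """Case-insensitive substring match against the PIRSF number or the name."""
    q = query.lower()
    return q in number.lower() or q in name.lower()


superfamilies = [
    ("PIRSF001370", "thiamine diphosphate-dependent enzyme, acetolactate synthase type"),
    ("PIRSF002891", "rod protein flgF"),
    ("PIRSF005419", "Type III secretion system/flagellar apparatus protein, InvA/LcrD/FlhA type"),
]

# Searching by text, in any letter case.
hits_by_text = [num for num, name in superfamilies if matches("FLAGELLAR", num, name)]

# Searching by (part of) a superfamily number.
hits_by_number = [num for num, name in superfamilies if matches("2891", num, name)]
```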

There are three choices: update the view, which presents the same page with the options that you have selected here; show the correspondence between the PIR superfamilies and the SEED; or reset the list back to its original values. The correspondence is described below.

Generate Data Tables

Generating the tables takes about five minutes, and so you need to be patient and wait for the results. Resist the temptation to keep clicking the button.

The two options control which data are presented and how they are sorted. See below for an example of the data that will be returned.

The first choice allows you to decide on the correspondence that you want to see. If the box is checked (the default) you will only see those superfamilies that have proteins that are in subsystems. A superfamily has several proteins in it, and some of those may be in subsystems as well as superfamilies. However, there are also superfamilies whose proteins are not in any subsystem. These are not shown by default because we do not make any assertions about the annotations of those proteins.

The second choice allows you to sort the table that is returned. The data can either be sorted by the "Number of annotations in subsystems" or by the "Number of SEED annotations". See below for a description of these.

The table that is returned will look something like this:

Number of annotations in subsystems | Number of SEED annotations | PIRSF (link goes to SEED/PIR comparison) | Superfamily name | Subsystems in superfamily
31 | 83 | PIRSF001370 (Full) | thiamine diphosphate-dependent enzyme, acetolactate synthase type | Valine_Biosynthesis; Acetoin_metabolism; Valine_Synthesis; Xanthine_to_Glycine; Allantoin_degradation; Inositol_catabolism
30 | 33 | PIRSF002891 (Preliminary) | rod protein flgF | Flagellum
20 | 32 | PIRSF005419 (Preliminary) | Type III secretion system/flagellar apparatus protein, InvA/LcrD/FlhA type | Vibrio_Experimental_Type_III_secretion_system_; Flagellum; Type_III_secretion_system
17 | 17 | PIRSF004862 (Preliminary) | probable flagellar basal-body M ring protein | Flagellum
17 | 18 | PIRSF006184 (Preliminary) | flagellar basal body P-ring protein flgI | Flagellum

Note that the header is repeated throughout the table to keep it clear which column is which.

The table contains the following information:

  Number of annotations in subsystems
    This is the number of different annotations that this superfamily has, considering only those annotations that are in subsystems. A large number indicates that the superfamily covers proteins with different roles in the subsystems, and there is probably a conflict between the superfamily and the SEED. These are the superfamilies, subsystems, or annotations that need the most attention.

  Number of SEED annotations
    This is the total number of different annotations that this superfamily encompasses in the SEED, including those of proteins that are not in subsystems. A lower number is also better here, and the excess over the first column probably represents proteins that have yet to be included in subsystems.

  PIRSF
    The number is the number of the PIR superfamily, and the link takes you to the correspondence between this superfamily and the SEED database. See the correspondence help below.

  Superfamily name
    The name of the superfamily.

  Subsystems in superfamily
    The different subsystems that proteins in this superfamily are members of.
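
The two count columns described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code in pir.cgi: the function name, the superfamily labels SF_A and SF_B, and the role names are all hypothetical, and each protein is represented simply as an (annotation, subsystems) pair.

```python
def annotation_counts(proteins):
    """Return (annotations in subsystems, all SEED annotations) for one superfamily.

    `proteins` is a list of (annotation, subsystems) pairs, one per protein.
    """
    all_annotations = {ann for ann, _ in proteins}
    in_subsystems = {ann for ann, subsystems in proteins if subsystems}
    return len(in_subsystems), len(all_annotations)


# Hypothetical members of two superfamilies (not real SEED data).
superfamilies = {
    "SF_A": [("role 1", ["Flagellum"]), ("role 1", ["Flagellum"]), ("role 2", [])],
    "SF_B": [
        ("role 3", ["Flagellum"]),
        ("role 4", ["Type_III_secretion_system"]),
        ("role 5", []),
    ],
}

rows = [(sf, *annotation_counts(members)) for sf, members in superfamilies.items()]

# Sort by the first count column, largest first, so that the superfamilies
# most likely to be in conflict with the SEED come to the top.
rows.sort(key=lambda row: row[1], reverse=True)
```

Sorting on `row[2]` instead would correspond to choosing "Number of SEED annotations" as the sort column.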

Correspondence between SEED and PIR

An example of the correspondence table is shown below:

Correspondence between SEED and PIR
PIR Superfamily (link goes to PIR) | Genome | UniProt | PEG | FIG Function | FIG Subsystem
PIRSF002185(Preliminary) Escherichia coli ribosomal protein L16 | Buchnera aphidicola str. APS (Acyrthosiphon pisum) | uni|P57584 | 492 | LSU ribosomal protein L16p (L10e) | Ribosome LSU bacterial
PIRSF002185(Preliminary) Escherichia coli ribosomal protein L16 | Acanthamoeba castellanii | uni|P46768 | 26 | LSU ribosomal protein L16 |
PIRSF002185(Preliminary) Escherichia coli ribosomal protein L16 | Naegleria gruberi | uni|Q9G8Q3 | 26 | LSU ribosomal protein L16 |
PIRSF002185(Preliminary) Escherichia coli ribosomal protein L16 | Aquifex aeolicus VF5 | uni|O66438 | 11 | LSU ribosomal protein L16p (L10e) | Ribosome LSU bacterial
PIRSF002185(Preliminary) Escherichia coli ribosomal protein L16 | Guillardia theta | uni|O46901 | 117 | LSU ribosomal protein L16 |

The table contains the following information:

  PIR Superfamily
    The name of the superfamily and a link to the PIR website that describes the superfamily.

  Genome
    The name of the genome.

  UniProt
    The UniProt ID and a link to the PIR website describing that protein.

  PEG
    Just the PEG number in the genome of interest. This keeps the link short, so we show 43 instead of writing out the whole fig id in the form fig|83333.1.peg.43. The link will take you to the SEED protein page.

  FIG Function
    The function that the protein has in the SEED database. Identical functions are shown in the same color so that you can easily identify which proteins have the same function and which do not.

  FIG Subsystem
    All of the subsystems that the protein is present in.
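
To make the shortened PEG column concrete, the Python sketch below extracts the PEG number from a full fig id of the form shown above. The function name is hypothetical and the parsing rule is an assumption based only on the example id fig|83333.1.peg.43.

```python
def peg_number(fig_id):
    """Return the trailing PEG number from an id such as fig|83333.1.peg.43."""
    prefix, genome_and_peg = fig_id.split("|", 1)
    if prefix != "fig" or ".peg." not in genome_and_peg:
        raise ValueError(f"not a PEG id: {fig_id!r}")
    # Everything after the last ".peg." is the number shown in the table.
    return int(genome_and_peg.rsplit(".peg.", 1)[1])


short = peg_number("fig|83333.1.peg.43")
```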

At the top of the page there is a link to either "Show All Matches" or "Show only matches with a subsystem". In the former case, every match between the PIR superfamily and the SEED database will be shown. In the latter case, only those proteins that are present in subsystems will be shown.
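
The difference between the two views can be illustrated with the example rows from the correspondence table above. This Python sketch is not taken from pir.cgi; the function name and the tuple layout (UniProt ID, PEG, FIG function, list of subsystems) are assumptions made for the example.

```python
# Rows from the example correspondence table: (UniProt, PEG, function, subsystems).
rows = [
    ("uni|P57584", 492, "LSU ribosomal protein L16p (L10e)", ["Ribosome LSU bacterial"]),
    ("uni|P46768", 26, "LSU ribosomal protein L16", []),
    ("uni|Q9G8Q3", 26, "LSU ribosomal protein L16", []),
    ("uni|O66438", 11, "LSU ribosomal protein L16p (L10e)", ["Ribosome LSU bacterial"]),
    ("uni|O46901", 117, "LSU ribosomal protein L16", []),
]


def select_matches(rows, only_with_subsystem):
    """Return every match, or only the matches whose protein is in a subsystem."""
    if only_with_subsystem:
        return [row for row in rows if row[3]]
    return list(rows)


all_matches = select_matches(rows, only_with_subsystem=False)
subsystem_matches = select_matches(rows, only_with_subsystem=True)
```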

Update Data

You can download new data from the PIR FTP site and install the data directly. This is a two-step process. First, we check to see whether the new file is more current than the old one (there is no point updating otherwise, unless, of course, you have added new genomes). Second, we actually get the data.

Click on the button to see whether there are new updates. You will then see a page that looks something along these lines (of course, the times will be different!):

This example is for files that are all current.

The local file is up to date and there is no need to update your source PIR superfamilies.

The remote file was modified on Fri Jan 14 13:26:47 2005

The local file was modified on Sat Jan 29 11:31:55 2005

If your files are not current, then the message will be something like this:

The remote file ftp://ftp.pir.georgetown.edu/pir_databases/pirsf/data/pirsfinfo.dat is newer than your current file. You should proceed with the update.

The remote file was modified on Fri Jan 14 13:26:47 2005

The local file was modified on Wed Dec 29 11:31:55 2004

If there is a problem with the internet connection or the file cannot be accessed for some reason (e.g. the name is wrong), you will not be given the option to proceed with the update, and you will see a message like this:

Could not connect to PIR to check the status of the PIR file. Please check the location of ftp://ftp.pir.georgetown.edu/pir_databases/pirsf/data/pirsfinfo.dat
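
The three outcomes above amount to comparing two modification times. The Python sketch below shows one way to make that decision from timestamps in the format printed in the messages; it is only an illustration, not the check performed by pir.cgi. The function name and the returned labels are hypothetical, and a remote timestamp of None stands in for a failed connection.

```python
from datetime import datetime

# Format of the timestamps in the messages, e.g. "Fri Jan 14 13:26:47 2005".
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def update_status(remote_stamp, local_stamp):
    """Classify the update check as 'unreachable', 'update', or 'current'."""
    if remote_stamp is None:
        # The remote file could not be reached, so no update is offered.
        return "unreachable"
    remote = datetime.strptime(remote_stamp, STAMP_FORMAT)
    local = datetime.strptime(local_stamp, STAMP_FORMAT)
    return "update" if remote > local else "current"


# The two examples shown in this help.
up_to_date = update_status("Fri Jan 14 13:26:47 2005", "Sat Jan 29 11:31:55 2005")
needs_update = update_status("Fri Jan 14 13:26:47 2005", "Wed Dec 29 11:31:55 2004")
```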

Clicking on the "Update Data" or "Update Anyway" button will start the download and restart the comparison of the SEED data with the PIR data. The download and installation of the data are run in the background using the script 'load_pirsf' because they take a significant amount of time and resources. You can monitor the progress in the SEED control panel. While the data is being installed you should not use the PIR superfamilies. Although they will show up, they are being edited, added, and deleted, and are therefore unstable. Installation of the data should take about 10-15 minutes.

Once the update has run, you will see the front page again, with a message telling you that the update is complete.

Subsystem Spreadsheets

The correspondence between PIR and SEED is highlighted in the spreadsheets. A sample of a few columns is shown below. Note that this table is for demonstration purposes only and the correspondence will likely change.

Genome ID Organism Variant Code cysB cysC cysD cysI cysJ cysN cysQ cysS
Yersinia pestis CO92 [B] 2277 3343   [10] 3345   [1] 3349 3350   [9] 3344   [2] 3504 3079   [3]
Vibrio parahaemolyticus RIMD 2210633 [B] 1101 296   [10] 292   [1] 2721 2722   [9] 293   [2] 1150   [3]
Shigella flexneri 2a str. 301 [B] 1222 2595   [10] 2597   [1] 2601   [8] 2602   [9] 2596   [2] 4023 444   [3]
Bacillus halodurans C-125 [B] 1489   [10] 610   [8] 609   [9] 111, 112   [5, 3]

The columns of the table are colored based on the superfamilies that the proteins are in, and in theory each column should be the same color and complete throughout.
Note that the small numbers that are slightly superscripted [5] are linked to the PIR correspondence table so you can click through and see proteins missing from either side as described above.

This example demonstrates the different aspects of the PIR/SEED interactions: