Sprout Genome and Subsystem Database
A [i]genome[/i] contains the sequence data for a particular individual organism.
Genus of the relevant organism.
RandParam('streptococcus', 'staphyloccocus', 'felis', 'homo', 'ficticio', 'strangera', 'escherischia', 'carborunda')
Species of the relevant organism.
StringGen('PKVKVKVKVKV')
The unique characterization identifies the particular organism instance from which the
genome is taken. It is possible to have in the database more than one genome for a
particular species, and every individual organism has variations in its DNA.
StringGen('PKVKVK999')
The access code determines which users can look at the data relating to this genome.
Each user is associated with a set of access codes. In order to view a genome, one of
the user's access codes must match this value.
RandParam('low','medium','high')
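The access rule above can be sketched as a simple membership test. This is a minimal illustration, not the database's own code: the function names and the dictionary layout (genome ID mapped to access code) are assumptions made for the example.

```python
def can_view(genome_access_code, user_access_codes):
    """Return True if one of the user's access codes matches the genome's code."""
    return genome_access_code in set(user_access_codes)


def viewable_genomes(genomes, user_access_codes):
    """List the IDs of the genomes a user may view.

    `genomes` is a hypothetical dict mapping genome ID -> access code.
    """
    codes = set(user_access_codes)
    return [genome_id for genome_id, code in genomes.items() if code in codes]
```

For example, a user holding only the code 'low' would see just the genomes whose access code is 'low'.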
TRUE if the genome is complete, else FALSE
The taxonomy string contains the full taxonomy of the organism, with individual elements
separated by semicolons (and optional white space), starting with the domain and ending with
the disambiguated genus and species (which is the organism's scientific name plus an
identifying string).
join('; ', (RandParam('bacteria', 'archaea', 'eukaryote', 'virus', 'environmental'),
ListGen('PKVKVKVK', 5), $this->{genus}, $this->{species}))
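A taxonomy string in this form could be split back into its elements roughly as follows. This is a sketch under the stated format only (semicolon separators with optional white space); the function name is hypothetical.

```python
import re


def parse_taxonomy(taxonomy):
    """Split a taxonomy string into its elements, domain first."""
    # Semicolons separate the elements; white space around them is optional.
    parts = re.split(r"\s*;\s*", taxonomy.strip())
    return [part for part in parts if part]
```

The first element of the returned list is the domain, and the remaining elements run from the broadest grouping down to the organism name.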
The group identifies a special grouping of organisms that would be displayed on a particular
page or that is of particular interest to a research group or web site. A single genome can belong
to multiple such groups or to none at all.
This index allows the applications to find all genomes associated with
a specific access code, so that a complete list of the genomes users can view
may be generated.
This index allows the applications to find all genomes for a particular
species.
A [i]source[/i] describes a place from which genome data was taken. This can be an organization
or a paper citation.
URL of the paper cited or of the organization's web site. This field is optional.
"http://www.conservativecat.com/Ferdy/TestTarget.php?Source=" . $this->{id}
Description of the source. The description can be a street address or a citation.
$this->{id} . ': ' . StringGen(IntGen(50,200))
A [i]contig[/i] is a contiguous run of residues. The contig's ID consists of the
genome ID followed by a name that identifies which contig this is for the parent genome. As
is the case with all keys in this database, the individual components are separated by a
period.
[p]A contig can contain over a million residues. For performance reasons, therefore,
the contig is split into multiple pieces called [i]sequences[/i]. The sequences
contain the characters that represent the residues as well as data on the quality of
the residue identification.
A [i]sequence[/i] is a continuous piece of a [i]contig[/i]. Contigs are split into
sequences so that we don't have to have the entire contig in memory when we are
manipulating it. The key of the sequence is the contig ID followed by the index of
the begin point.
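The splitting of a contig into keyed sequences might look like the sketch below. The chunk size, the function name, and the assumption that the begin point in the key is 1-based are illustrative choices, not values taken from the database.

```python
def split_contig(contig_id, residues, chunk_size=4000):
    """Split a contig's residue string into sequence records.

    Each record is (key, start, length, residues), where the key is the
    contig ID followed by a period and the begin point of the piece.
    chunk_size is an assumed value chosen for illustration.
    """
    pieces = []
    for offset in range(0, len(residues), chunk_size):
        chunk = residues[offset:offset + chunk_size]
        start = offset + 1  # assumed 1-based begin point
        pieces.append((f"{contig_id}.{start}", start, len(chunk), chunk))
    return pieces
```

Only the piece that covers the residues of interest then needs to be loaded, rather than the whole contig.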
String consisting of the residues. Each residue is described by a single
character in the string.
RandChars("ACGT", IntGen(100,400))
String describing the quality data for each base pair. Individual values will
be separated by periods. The value represents the negative exponent of the probability
of error. Thus, for example, a quality of 30 indicates the probability of error is
10^-30. A higher quality number indicates a better chance of a correct match. It is possible
that the quality data is not known for a sequence. If that is the case, the quality
vector will contain the string [b]unknown[/b].
unknown
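A quality vector in this format could be decoded as sketched below. The function names are hypothetical, and the values are assumed to be integers.

```python
def parse_quality(quality_vector):
    """Return the per-residue quality values, or None if the quality is unknown."""
    if quality_vector == "unknown":
        return None
    # Individual values are separated by periods.
    return [int(value) for value in quality_vector.split(".")]


def error_probability(quality):
    """The quality value is the negative exponent of the error probability."""
    return 10.0 ** -quality
```

A caller would check for None before using the values, since an unknown quality vector carries no per-residue information.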
A [i]feature[/i] is a part of a genome that is of special interest. Features
may be spread across multiple contigs of a genome, but never across more than
one genome. Features can be assigned to roles via spreadsheet cells,
and are the targets of annotation.
Code indicating the type of this feature.
RandParam('peg','rna')
Alternative name for this feature. A feature can have many aliases.
StringGen('Pgi|99999', 'Puni|XXXXXX', 'PAAAAAA999')
[i](optional)[/i] A translation of this feature's residues into character
codes, formed by concatenating the pieces of the feature together. For a
protein encoding group, this is the protein characters. For other types
it is the DNA characters.
Upstream sequence of the feature. This includes residues preceding the feature as well as some of
the feature's initial residues.
TRUE if this feature is still considered valid, FALSE if it has been logically deleted.
1
Web hyperlink for this feature. A feature can have no hyperlinks or it can have many. The
links are to other websites that have useful information about the gene that the feature represents, and
are coded as raw HTML, using [b]<a href="[i]link[/i]">[i]text[/i]</a>[/b] notation.
'http://www.conservativecat.com/Ferdy/TestTarget.php?Source=' . $this->{id} .
"&Number=" . IntGen(1,99)
This index allows the user to find the feature corresponding to
the specified alias name.
A [i]synonym group[/i] represents a group of features. Substantially identical features
are mapped to the same synonym group, and this information is used to expand similarities.
A [i]role[/i] describes a biological function that may be fulfilled by a feature.
One of the main goals of the database is to record the roles of the various features.
EC code for this role.
StringGen(IntGen(20,40)) . "(" . $this->{id} . ")"
Abbreviated name for the role, generally non-unique, but useful
in column headings for HTML tables.
This index allows the user to find the role corresponding to
an EC number.
An [i]annotation[/i] contains supplementary information about a feature. Annotations
are currently the only objects that may be inserted directly into the database. All other
information is loaded from data exported by the SEED.
Date and time of the annotation.
Text of the annotation.
This index allows the user to find recent annotations.
A [i]reaction[/i] is a chemical process catalyzed by a protein. The reaction ID
is generally a small number preceded by a letter.
HTML string containing a link to a web location that describes the
reaction. This field is optional.
TRUE if this reaction is reversible, else FALSE
A [i]compound[/i] is a chemical that participates in a reaction.
All compounds have a unique ID and may also have one or more names.
Priority of a compound name. The name with the lowest
priority is the main name of this compound.
Descriptive name for the compound. A compound may
have several names.
Chemical Abstract Service ID for this compound (optional).
Name used in reaction display strings.
It is the same as the name possessing a priority of 1, but it is placed
here to speed up the query used to create the display strings.
This index allows the user to find the compound corresponding to
the specified name.
This index allows the user to find the compound corresponding to
the specified CAS ID.
This index allows the user to access the compound names in
priority order.
A [i]subsystem[/i] is a collection of roles that work together in a cell. Identification of subsystems
is an important tool for recognizing parallel genetic features in different organisms.
Name of the person currently in charge of the subsystem.
Descriptive notes about the subsystem.
General classification data about the subsystem.
A [i]role subset[/i] is a named collection of roles in a particular subsystem. The
subset names are generally very short, non-unique strings. The ID of the parent
subsystem is prefixed to the subset ID in order to make it unique.
A [i]genome subset[/i] is a named collection of genomes that participate
in a particular subsystem. The subset names are generally very short, non-unique
strings. The ID of the parent subsystem is prefixed to the subset ID in order
to make it unique.
Part of the process of locating and assigning features is creating a spreadsheet of
genomes and roles to which features are assigned. A [i]spreadsheet cell[/i] represents one
of the positions on the spreadsheet.
A [i]user[/i] is a person who can make annotations and view data in the database. The
user object is keyed on the user's login name.
Full name or description of this user.
Access code possessed by this
user. A user can have many access codes; a genome is accessible to the user if its
access code matches any one of the user's access codes.
RandParam('low', 'medium', 'high')
A [i]property[/i] is a type of assertion that could be made about the properties of
a particular feature. Each property instance is a key/value pair and can be associated
with many different features. Conversely, a feature can be associated with many key/value
pairs, even some that notionally contradict each other. For example, there can be evidence
that a feature is essential to the organism's survival and evidence that it is superfluous.
Name of this property.
Value associated with this property. For each property
name, there must be a property record for all of its possible
values.
This index enables the application to find all values for a specified property
name, or any given name/value pair.
A functional diagram describes a set of chemical reactions, often comprising a single
subsystem. A diagram is identified by a short name and contains a longer descriptive name.
The actual diagram shows which functional roles guide the reactions along with the inputs
and outputs; the database, however, only indicates which roles belong to a particular
map.
Descriptive name of this diagram.
An external alias is a feature name for a functional assignment that is not a
FIG ID. Functional assignments for external aliases are kept in a separate section of
the database. This table contains a description of the relevant organism for an
external alias functional assignment.
Descriptive name of the target organism for this external alias.
An external alias is a feature name for a functional assignment that is not a
FIG ID. Functional assignments for external aliases are kept in a separate section of
the database. This table contains the functional role for the external alias functional
assignment.
Functional role for this external alias.
A coupling is a relationship between two features. The features are
physically close on the contig, and there is evidence that they generally
belong together. The key of this entity is formed by combining the coupled
feature IDs with a space.
A number based on the set of PCHs (pairs of close homologs). A PCH
indicates that two genes near each other on one genome are very similar to
genes near each other on another genome. The score only counts PCHs for which
the genomes are very different. (In other words, we have a pairing that persists
between different organisms.) A higher score implies a stronger meaning to the
clustering.
A PCH (physically close homolog) connects a clustering (which is a
pair of physically close features on a contig) to a second pair of physically
close features that are similar to the first. Essentially, the PCH is a
relationship between two clusterings in which the first clustering's features
are similar to the second clustering's features. The simplest model for
this would be to simply relate clusterings to each other; however, not all
physically close pairs qualify as clusterings, so we relate a clustering to
a pair of features. The key is the clustering key followed by the IDs
of the features in the second pair.
TRUE if this PCH is used in scoring the attached clustering,
else FALSE. If a clustering has a PCH for a particular genome and many
similar genomes are present, then a PCH will probably exist for the
similar genomes as well. When this happens, only one of the PCHs will
be scored: the others are considered duplicates of the same evidence.
This relationship connects a feature to all the functional couplings
in which it participates. A functional coupling is a recognition of the fact
that the features are close to each other on a chromosome, and similar
features in other genomes also tend to be close.
Ordinal position of the feature in the coupling. Currently,
this is either "1" or "2".
This index enables the application to view the features of
a coupling in the proper order. The order influences the way the
PCHs are examined.
This relation connects a synonym group to the features that make it
up.
This relationship connects a genome to all of its features. This
relationship is redundant in a sense, because the genome ID is part
of the feature ID; however, it makes the creation of certain queries more
convenient because you can drag in filtering information for a feature's
genome.
Feature type (e.g. peg, rna).
This index enables the application to view the features of a
Genome sorted by type.
This relationship connects a functional coupling to the physically
close homologs (PCHs) which affirm that the coupling is meaningful.
This relationship connects a PCH to the features that represent its
evidence. Each PCH is connected to a parent coupling that relates two features
on a specific genome. The PCH's evidence that the parent coupling is functional
is the existence of two physically close features on a different genome that
correspond to the features in the coupling. Those features are found on the
far side of this relationship.
Ordinal position of the feature in the coupling that corresponds
to our target feature. There is a one-to-one correspondence between the
features connected to the PCH by this relationship and the features
connected to the PCH's parent coupling. The ordinal position is used
to decode that relationship. Currently, this field is either "1" or
"2".
This index enables the application to view the features of
a PCH in the proper order.
This relationship connects a genome to the contigs that contain the actual genetic
information.
This relationship connects a genome to the sources that mapped it. A genome can
come from a single source or from a cooperation among multiple sources.
A contig is stored in the database as an ordered set of sequences. By splitting the
contig into sequences, we get a performance boost from only needing to keep small portions
of a contig in memory at any one time. This relationship connects the contig to its
constituent sequences.
Length of the sequence.
Index (1-based) of the point in the contig where this
sequence starts.
This index enables the application to find all of the sequences in
a contig in order, and makes it easier to find a particular residue section.
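Given the sequences of a contig in start order, the piece holding a particular residue can be located with a binary search. The sketch below assumes each sequence is represented as a (start, length) pair with a 1-based start; the function name and the return convention are illustrative.

```python
from bisect import bisect_right


def find_sequence(sequences, residue):
    """Locate the sequence that holds a given contig residue.

    `sequences` is a list of (start, length) pairs sorted by start.
    Returns (sequence position in the list, 0-based offset inside that
    sequence), or None if no sequence covers the residue.
    """
    starts = [start for start, _ in sequences]
    position = bisect_right(starts, residue) - 1
    if position < 0:
        return None
    start, length = sequences[position]
    if residue < start + length:
        return position, residue - start
    return None
```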
This relationship connects a feature to its annotations.
This relationship connects an annotation to the user who made it.
This relationship connects a subsystem to the genomes that use
it. If the subsystem has been curated for the genome, then the subsystem's roles will also be
connected to the genome features through the [b]SSCell[/b] object.
Code indicating the subsystem variant to which this
genome belongs. Each subsystem can have multiple variants. A variant
code of [b]-1[/b] indicates that the genome does not have a functional
variant of the subsystem. A variant code of [b]0[/b] indicates that
the genome's participation is considered uncertain.
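An application might interpret the two special variant codes along the lines of the sketch below. The function name and the returned labels are hypothetical, and the code is compared as a string because the stored type is not stated here.

```python
def describe_variant(code):
    """Return a short label for a subsystem variant code."""
    text = str(code).strip()
    if text == "-1":
        return "no functional variant"
    if text == "0":
        return "participation uncertain"
    return f"variant {text}"
```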
This index enables the application to find all of the genomes using
a subsystem in order by variant code, which is how we wish to display them
in the spreadsheets.
This relationship connects roles to the subsystems that implement them.
Column number for this role in the specified subsystem's
spreadsheet.
This index enables the application to see the subsystem roles
in column order. The ordering of the roles is usually significant,
so it is important to preserve it.
This relationship connects a subsystem's spreadsheet cell to the
genome for the spreadsheet column.
This relationship connects a subsystem's spreadsheet cell to the
role for the spreadsheet row.
This relationship connects a subsystem's spreadsheet cell to the
features assigned to it.
ID of this feature's cluster. Clusters represent families of
related proteins participating in a subsystem.
This relationship connects a reaction to the compounds that participate
in it.
TRUE if the compound is a product of the reaction, FALSE if
it is a substrate. When a reaction is written on paper in
chemical notation, the substrates are left of the arrow and the
products are to the right. Sorting on this field will cause
the substrates to appear first, followed by the products. If the
reaction is reversible, then the notion of substrates and products
is not as intuitive; however, a value here of FALSE still puts the
compound left of the arrow and a value of TRUE still puts it to the
right.
Number of molecules of the compound that participate in a
single instance of the reaction. For example, if a reaction
produces two water molecules, the stoichiometry of water for the
reaction would be two. When a reaction is written on paper in
chemical notation, the stoichiometry is the number next to the
chemical formula of the compound.
TRUE if this compound is one of the main participants in
the reaction, else FALSE. It is permissible for none of the
compounds in the reaction to be considered main, in which
case this value would be FALSE for all of the relevant
compounds.
An optional character string that indicates the relative
position of this compound in the reaction's chemical formula. The
location affects the way the compounds are presented as we cross the
relationship from the reaction side. The product/substrate flag
comes first, then the value of this field, then the main flag.
The default value is an empty string; however, the empty string
sorts first, so if this field is used, it should probably be
used for every compound in the reaction.
A unique ID for this record. The discriminator does not
provide any useful data, but it prevents identical records from
being collapsed by the SELECT DISTINCT command used by ERDB to
retrieve data.
This index presents the compounds in the reaction in the
order they should be displayed when writing it in chemical notation.
All the substrates appear before all the products, and within that
ordering, the main compounds appear first.
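The display order described above can be sketched as a three-part sort key. The dictionary field names ("product", "loc", "main") are assumptions made for the example and are not necessarily the database's field names.

```python
def display_order(participants):
    """Sort a reaction's compounds for display in chemical notation.

    Each participant is a dict with a boolean "product" flag, a "loc"
    string (empty by default), and a boolean "main" flag.
    """
    return sorted(
        participants,
        # Substrates (product=False) sort before products; the loc string
        # comes next; main compounds (main=True) sort ahead of the others.
        key=lambda c: (c["product"], c["loc"], not c["main"]),
    )
```

Because the empty location string sorts first, mixing empty and non-empty locations within one reaction would pull the compounds without a location to the front of their group.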
This relationship connects a feature to the contig segments that work together
to effect it. The segments are numbered sequentially starting from 1. The database is
required to place an upper limit on the length of each segment. If a segment is longer
than the maximum, it can be broken into smaller bits.
[p]The upper limit enables applications to locate all features that contain a specific
residue. For example, if the upper limit is 100 and we are looking for a feature that
contains residue 234 of contig [b]ABC[/b], we can look for features with a begin point
between 135 and 333. The results can then be filtered by direction and length of the
segment.
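The search window in the example above can be computed as follows. This is a sketch, assuming that a backward segment starts at its begin point and extends toward lower residue numbers; the function names are hypothetical.

```python
def begin_window(residue, max_length):
    """Range of begin points a segment containing `residue` could have."""
    return residue - max_length + 1, residue + max_length - 1


def segment_contains(begin, length, direction, residue):
    """Filter step: does this particular segment cover the residue?"""
    if length == 0:
        return False  # a zero-length segment is a point between residues
    if direction == "+":
        return begin <= residue <= begin + length - 1
    # Assumed: a backward segment runs from `begin` down to begin - length + 1.
    return begin - length + 1 <= residue <= begin
```

With an upper limit of 100 and residue 234, begin_window gives the range 135 to 333, matching the example; segment_contains then discards candidates whose direction and length do not actually reach the residue.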
Sequence number of this segment.
Index (1-based) of the first residue in the contig that
belongs to the segment.
Number of residues in the segment. A length of 0 identifies
a specific point between residues. This is the point before the residue if the direction
is forward and the point after the residue if the direction is backward.
Direction of the segment: [b]+[/b] if it is forward and
[b]-[/b] if it is backward.
This index allows the application to find all the segments of a feature in
the proper order.
This index is the one used by applications to find all the feature
segments that contain a specific residue.
This relationship is one of two that relate features to each other. It
connects features that are very similar but on separate genomes. A
bidirectional best hit relationship exists between two features [b]A[/b]
and [b]B[/b] if [b]A[/b] is the best match for [b]B[/b] on [b]A[/b]'s genome
and [b]B[/b] is the best match for [b]A[/b] on [b]B[/b]'s genome.
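The bidirectional best hit condition can be expressed as a two-way lookup. In this sketch, `genome_of` and `best_match` are hypothetical lookup tables supplied by the caller, and the feature IDs in the usage example are invented.

```python
def is_bidirectional_best_hit(a, b, genome_of, best_match):
    """Check whether features a and b are bidirectional best hits.

    genome_of maps a feature ID to its genome ID; best_match maps
    (feature, target genome) to the best-matching feature on that genome.
    """
    if genome_of[a] == genome_of[b]:
        return False  # the relationship connects features on separate genomes
    a_is_best_for_b = best_match.get((b, genome_of[a])) == a
    b_is_best_for_a = best_match.get((a, genome_of[b])) == b
    return a_is_best_for_b and b_is_best_for_a
```

A one-way best match is not enough: if B's best match on A's genome is A, but A's best match on B's genome is some other feature, the pair does not qualify.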
ID of the genome containing the target (to) feature.
Score for this relationship.
This index allows the application to find a feature's best hit for
a specific target genome.
This relationship connects a feature to its known property values.
The relationship contains text data that indicates the paper or organization
that discovered evidence that the feature possesses the property. So, for
example, if two papers presented evidence that a feature is essential,
there would be an instance of this relationship for both.
URL or citation of the paper or
institution that reported evidence of the relevant feature possessing
the specified property value.
This relationship connects a role to the diagrams on which it
appears. A role frequently identifies an enzyme, and can appear in many
diagrams. A diagram generally contains many different roles.
This relationship connects a subsystem to the spreadsheet cells
used to analyze and display it. The cells themselves can be thought of
as a grid with Roles on one axis and Genomes on the other. The
various features of the subsystem are then assigned to the cells.
This relationship identifies the users trusted by each
particular user. When viewing functional assignments, the
assignment displayed is the most recent one by a user trusted
by the current user. The current user implicitly trusts himself.
If no trusted users are specified in the database, the user
also implicitly trusts the user [b]FIG[/b].
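The trust rule could be applied as in the sketch below. It reads "no trusted users are specified" as meaning none are specified for the current user, which is an interpretation; the data layouts and function names are likewise assumptions.

```python
def trusted_users(user, trust_table):
    """Return the set of users whose assignments `user` will accept.

    trust_table is a hypothetical dict mapping a user to the users
    that user explicitly trusts.
    """
    trusted = {user}  # the current user implicitly trusts himself
    explicit = trust_table.get(user, ())
    if explicit:
        trusted.update(explicit)
    else:
        trusted.add("FIG")  # fallback when no trusted users are specified
    return trusted


def displayed_assignment(user, assignments, trust_table):
    """Pick the most recent assignment made by a trusted user.

    assignments is a list of (timestamp, author, text) tuples.
    Returns None if no trusted user has made an assignment.
    """
    trusted = trusted_users(user, trust_table)
    candidates = [a for a in assignments if a[1] in trusted]
    return max(candidates, key=lambda a: a[0], default=None)
```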
This relationship connects a role subset to the roles that it covers.
A subset is, essentially, a named group of roles belonging to a specific
subsystem, and this relationship effects that. Note that while a role
may belong to many subsystems, a subset belongs to only one subsystem,
and all roles in the subset must have that subsystem in common.
This relationship connects a subset to the genomes that it covers.
A subset is, essentially, a named group of genomes participating in a specific
subsystem, and this relationship effects that. Note that while a genome
may belong to many subsystems, a subset belongs to only one subsystem,
and all genomes in the subset must have that subsystem in common.
This relationship connects a subsystem to its constituent
role subsets. Note that some roles in a subsystem may not belong to a
subset, so the relationship between roles and subsystems cannot be
derived from the relationships going through the subset.
This relationship connects a subsystem to its constituent
genome subsets. Note that some genomes in a subsystem may not belong to a
subset, so the relationship between genomes and subsystems cannot be
derived from the relationships going through the subset.
This relationship connects a role to the reactions it catalyzes.
The purpose of a role is to create proteins that trigger certain
chemical reactions. A single reaction can be triggered by many roles,
and a role can trigger many reactions.