The Underlying Database Architecture

Basic Concepts

Metadata Structures

The metadata structures describe the entities and relationships implemented in the database. They are, in fact, a database describing the database itself.

ENTITY

An entity is a real or abstract thing on which we wish to keep data. The terms entity and object are mostly interchangeable; however, for our purposes, object will only be used to describe an entity instance, rather than an entity type. In the relations that implement an entity, there must be an ID field that contains the entity key.

entity-id (key) displayable common name of the entity
relation-id (multiple) a relation used to implement the entity

RELATIONSHIP

A relationship is a connection between a pair of entities.

relationship-id (key) displayable common name of the relationship
relation-id relation used to implement the relationship
arity type of relationship: 1-to-many, many-to-many, many-to-1, or 1-to-1
source-entity-id name of the entity type from which the relationship starts
target-entity-id name of the entity type at which the relationship ends

RELATION

A relation is a physical table that implements a relationship or partly implements an entity.

name (key) name of the physical relation

FIELD

A field is a physical table column that ultimately contains the actual data.

relation-id (key.1) ID of the relation containing this field
name (key.2) name of the field
data-type type of data stored in the field
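
As an illustration only, the four metadata structures above can be sketched as plain Python records. The field names follow the listings above; the Catalog class, the fields_of_entity helper, the relation names "Genome" and "GenomeAccess", and the "string" data types are hypothetical and do not come from this design.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Field:
    relation_id: str  # key.1: ID of the relation containing this field
    name: str         # key.2: name of the field
    data_type: str    # type of data stored in the field


@dataclass(frozen=True)
class Relation:
    name: str  # key: name of the physical relation


@dataclass
class Entity:
    entity_id: str  # key: displayable common name of the entity
    relation_ids: List[str] = field(default_factory=list)  # implementing relations


@dataclass(frozen=True)
class Relationship:
    relationship_id: str   # key: displayable common name of the relationship
    relation_id: str       # relation used to implement the relationship
    arity: str             # "1-to-many", "many-to-many", "many-to-1", or "1-to-1"
    source_entity_id: str
    target_entity_id: str


@dataclass
class Catalog:
    """Hypothetical container: the 'database describing the database'."""
    entities: Dict[str, Entity] = field(default_factory=dict)
    relationships: Dict[str, Relationship] = field(default_factory=dict)
    relations: Dict[str, Relation] = field(default_factory=dict)
    fields: List[Field] = field(default_factory=list)

    def fields_of_entity(self, entity_id: str) -> List[Field]:
        # Collect every field of every relation that implements the entity.
        wanted = set(self.entities[entity_id].relation_ids)
        return [f for f in self.fields if f.relation_id in wanted]


# Assumed example data: relation names and data types are invented.
catalog = Catalog()
catalog.relations["Genome"] = Relation("Genome")
catalog.relations["GenomeAccess"] = Relation("GenomeAccess")
catalog.entities["GENOME"] = Entity("GENOME", ["Genome", "GenomeAccess"])
catalog.fields += [
    Field("Genome", "genome-id", "string"),
    Field("Genome", "genus", "string"),
    Field("GenomeAccess", "genome-id", "string"),
    Field("GenomeAccess", "access-code", "string"),
]
catalog.relationships["HasContig"] = Relationship(
    "HasContig", "HasContig", "1-to-many", "GENOME", "CONTIG"
)
```

Note that every relation implementing GENOME carries the genome-id field, matching the rule that each implementing relation must contain the entity key.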

Methods

The following methods are provided to access data in the database. Methods that allow iteration will have GetFirst and GetNext versions. For example, the GetObjects operation will be implemented as two methods: GetFirstObject and GetNextObject.
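
The GetFirst/GetNext convention might look like the following minimal sketch. Only the method names GetFirstObject and GetNextObject come from the text above; the ObjectCursor class, the in-memory row list, and the use of None to signal the end of the data are assumptions made for illustration.

```python
class ObjectCursor:
    """Hypothetical cursor over a fixed collection of objects."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0

    def get_first_object(self):
        # Restart the iteration and return the first object (or None).
        self._pos = 0
        return self.get_next_object()

    def get_next_object(self):
        # Return the next object, or None once the data is exhausted.
        if self._pos >= len(self._rows):
            return None
        row = self._rows[self._pos]
        self._pos += 1
        return row


def collect_all(cursor):
    """Typical caller loop: one GetFirst call, then GetNext until done."""
    found = []
    obj = cursor.get_first_object()
    while obj is not None:
        found.append(obj)
        obj = cursor.get_next_object()
    return found
```

A caller never needs to know how many objects exist in advance, which is the point of splitting the operation into two methods.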

Surface Database Architecture

Entities

GENOME

	[genome-id,genus,species,unique-characterization,source-id]
	[genome-id,access-code]

SOURCE

	[source-id,label,URL,description]

CONTIG

	[contig-id]

The contig-id is the genome-id and the contig name. A CONTIG is a contiguous section of a genome that was produced by a sequencing project. The CONTIGs are named and generated externally and then loaded into the database.

SEQUENCE

	[sequence-id,sequence]
	[sequence-id,quality-vector]

The sequence-id is the contig-id and the begin point. The sequence is an ordered collection of characters from an alphabet. The quality vector holds, for each character in the sequence, an integer exponent indicating the likelihood of an error. So, a quality value of 30 means the likelihood that the character is correct is (1 - 10^-30).
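
Taking the definition above literally (a quality value q means an error likelihood of 10^-q), the conversion can be sketched as follows. The function names are hypothetical. Because a value such as 1 - 10^-30 is indistinguishable from 1.0 in double-precision floating point, the sketch also exposes the error likelihood directly.

```python
def error_likelihood(quality: int) -> float:
    """Likelihood that a character is wrong, per the rule above: 10^-q."""
    return 10.0 ** (-quality)


def correct_likelihood(quality: int) -> float:
    """Likelihood that a character is correct: 1 - 10^-q."""
    return 1.0 - error_likelihood(quality)


def least_reliable_position(quality_vector):
    """Index of the character with the highest error likelihood."""
    return min(range(len(quality_vector)), key=lambda i: quality_vector[i])
```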

The character data for the CONTIG is broken into SEQUENCEs so that we do not have to manipulate the entire CONTIG as a string in memory. This is important, because some CONTIGs can be hundreds of megacharacters in length.
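
A minimal sketch of this splitting is shown below. It assumes 1-based begin points and represents each sequence-id as a (contig-id, begin point) pair; the split_contig name, the chunk_size parameter, and the example contig-id are illustrative assumptions, since the text does not state a chunk size or an ID format.

```python
from typing import Iterator, Tuple

SequenceId = Tuple[str, int]  # (contig-id, 1-based begin point)


def split_contig(contig_id: str, residues: str,
                 chunk_size: int) -> Iterator[Tuple[SequenceId, str]]:
    """Yield (sequence-id, sequence) pairs covering the whole contig."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(residues), chunk_size):
        begin = offset + 1  # convert 0-based offset to 1-based position
        yield (contig_id, begin), residues[offset:offset + chunk_size]


# Hypothetical contig-id built from a genome-id and a contig name.
pieces = list(split_contig("G1:ABC", "ACGTACGTAC", 4))
```

Here pieces holds three SEQUENCEs beginning at positions 1, 5, and 9; the last one is shorter because the contig length is not a multiple of the chunk size.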

FEATURE

	[feature-id,type]
	[feature-id,alias]
	[feature-id,DNA-sequence]
	[feature-id,translation]
	[feature-id,upstream-sequence]
	[feature-id,virulence]
	[feature-id,essentiality]

ROLE

	[role-id,role]

ANNOTATION

	[annotation-id,time,annotation,confidence]

ASSIGNMENT

	[assignment-id,confidence]

SUBSYSTEM

	[subsystem-id,subsystem-name]

SSCELL

	[cell-id,subsystem-id]

USER

	[user-id,user-name,password]
	[user-id,access-code]

FUSION

	[feature-id-1,feature-id-2]

Relationships

GENOME HasContig CONTIG

A single GENOME is composed of multiple CONTIGs.

GENOME ComesFrom SOURCE

A single GENOME can come from a single SOURCE or from cooperation by multiple SOURCEs. Multiple GENOMEs may come from a single SOURCE.

CONTIG IsMadeUpOf SEQUENCE

A single CONTIG is made up of multiple SEQUENCEs.

start-position ordinal number of this sequence in the CONTIG (For example, a start-position of 100 means that this sequence starts at the 100th position of the CONTIG.)
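
The start-position makes it possible to read a region of a CONTIG by touching only the SEQUENCEs that overlap it. The sketch below assumes 1-based positions and an in-memory list of (start-position, sequence) pairs; the read_region name and the sample data are hypothetical.

```python
def read_region(segments, start, length):
    """Return `length` characters of the contig beginning at position `start`.

    segments: iterable of (start_position, text) pairs, sorted by
    start_position, with 1-based positions as in the description above.
    """
    end = start + length - 1
    pieces = []
    for seg_start, text in segments:
        seg_end = seg_start + len(text) - 1
        if seg_end < start or seg_start > end:
            continue  # this sequence does not overlap the requested region
        lo = max(start, seg_start) - seg_start
        hi = min(end, seg_end) - seg_start + 1
        pieces.append(text[lo:hi])
    return "".join(pieces)


# Assumed example: the contig ACGTTTGACC stored as three sequences.
segments = [(1, "ACGT"), (5, "TTGA"), (9, "CC")]
```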

FEATURE IsDescribedBy ANNOTATION

Multiple ANNOTATIONs can be made on a single FEATURE.

USER Made ANNOTATION

Multiple ANNOTATIONs can be made by a single USER.

USER Assigned ASSIGNMENT

Multiple ASSIGNMENTs can be made by a single USER.

FEATURE IsTargetOf ASSIGNMENT

Multiple ASSIGNMENTs can be made to a single FEATURE.

ASSIGNMENT Implements ROLE

Multiple ASSIGNMENTs can describe a single ROLE. Multiple ROLEs can be implemented by a single ASSIGNMENT.

GENOME ParticipatesIn SUBSYSTEM

Multiple GENOMEs can participate in multiple SUBSYSTEMs.

variant description of the subsystem variant

ROLE OccursIn SUBSYSTEM

Multiple ROLEs can be achieved by multiple SUBSYSTEMs.

SSCELL BelongsTo GENOME

Multiple SSCELLs belong to a single GENOME.

SSCELL RelatesTo ROLE

Multiple SSCELLs relate to a single ROLE.

FEATURE IsLocatedIn CONTIG

A single FEATURE is located in multiple CONTIGs; a CONTIG contains multiple FEATURE locations. This relationship enables us to find the gene sequences in the CONTIGs that make up the FEATURE.

In order to ensure that we are able to find all genes relating to a particular location, we impose a maximum size on each span encoded by this relationship. So, for example, if the maximum span size is 100 and we want to find all features that include position 321 of CONTIG ABC, we would search for location data with beginning points in positions 222 through 420, and only emit those whose length and direction carry them across position 321.

locN ordinal number of this location for the FEATURE
beg position of this location's first nucleotide in the CONTIG
len number of nucleotides used by this location in the CONTIG
dir direction of the location from the beginning point in the CONTIG
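
The bounded-span search described above can be sketched as follows. The field names mirror locN, beg, len, and dir; encoding dir as "+" (forward) or "-" (backward) and scanning an in-memory list in place of an indexed range query on beg are assumptions for illustration.

```python
from typing import List, NamedTuple, Tuple


class Location(NamedTuple):
    feature_id: str
    loc_n: int       # locN: ordinal number of this location for the FEATURE
    beg: int         # position of the first nucleotide in the CONTIG
    length: int      # len: number of nucleotides used
    direction: str   # dir: "+" forward or "-" backward (assumed encoding)


def search_window(position: int, max_span: int) -> Tuple[int, int]:
    """Range of beg values that could possibly reach `position`."""
    return position - max_span + 1, position + max_span - 1


def covered_range(loc: Location) -> Tuple[int, int]:
    """Lowest and highest contig positions covered by a location."""
    if loc.direction == "+":
        return loc.beg, loc.beg + loc.length - 1
    return loc.beg - loc.length + 1, loc.beg


def features_at(locations: List[Location], position: int,
                max_span: int) -> List[str]:
    low, high = search_window(position, max_span)
    hits = []
    for loc in locations:
        if not (low <= loc.beg <= high):
            continue  # outside the window: cannot reach the position
        first, last = covered_range(loc)
        if first <= position <= last:
            hits.append(loc.feature_id)
    return hits


# Assumed sample data for position 321 with a maximum span size of 100.
locations = [
    Location("f1", 1, 222, 100, "+"),  # covers 222..321
    Location("f2", 1, 420, 100, "-"),  # covers 321..420
    Location("f3", 1, 300, 10, "+"),   # covers 300..309
    Location("f4", 1, 221, 100, "+"),  # covers 221..320, outside window
    Location("f5", 1, 330, 5, "-"),    # covers 326..330
]
```

With these values, search_window(321, 100) gives the 222 through 420 window from the example, and only f1 and f2 are emitted.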

SSCELL Contains FEATURE

A single SSCELL contains multiple FEATUREs; a FEATURE may be contained in multiple SSCELLs.

FEATURE IsRelatedTo FEATURE

Multiple FEATUREs are related to multiple other FEATUREs. This relationship is symmetric: if one FEATURE is related to a second, the second is related to the first.

score measurement of the level of the relationship
type type of relationship (similarity, bidirectional best hit, or chromosome clustering)
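
Because IsRelatedTo holds in both directions, one way to store it is a single row per unordered pair of FEATUREs, answering queries from either side. This is only one possible approach; the text does not say how the relationship is stored, and the RelatedFeatures class and its methods are hypothetical.

```python
class RelatedFeatures:
    """Hypothetical store keeping one row per unordered feature pair."""

    def __init__(self):
        self._rows = {}  # (smaller-id, larger-id, type) -> score

    def add(self, feature_a, feature_b, score, rel_type):
        # Canonical ordering makes (a, b) and (b, a) the same row.
        key = (min(feature_a, feature_b), max(feature_a, feature_b), rel_type)
        self._rows[key] = score

    def related_to(self, feature_id):
        """Return sorted (other-feature, type, score) tuples."""
        found = []
        for (a, b, rel_type), score in self._rows.items():
            if a == feature_id:
                found.append((b, rel_type, score))
            elif b == feature_id:
                found.append((a, rel_type, score))
        return sorted(found)


related = RelatedFeatures()
related.add("f2", "f1", 0.9, "similarity")
related.add("f1", "f3", 0.5, "chromosome clustering")
```

The alternative of storing both directions as separate rows makes lookups simpler at the cost of doubling the data and keeping the two copies in step.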

FUSION Yields FEATURE

Multiple FUSIONs produce a single FEATURE.