Parent Directory
|
Revision Log
no message
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <html> <head> <title>Notes about the SPROUT Database</title> <meta name="GENERATOR" content="Microsoft Visual Studio .NET 7.1"> <meta name="vs_targetSchema" content="http://schemas.microsoft.com/intellisense/ie5"> </head> <body BGCOLOR="#FFFF80"> <h1>The Underlying Database Architecture</h1> <h2>Basic Concepts</h2> <UL> <li> At its lowest level, Sprout is a configurable Entity-Relationship database that supports only inserts. <ul> <li> Only a small number of tables will support insertion. <li> The real data is kept in tab-delimited flat files that are used to load the data into the database. <ul> <li> The Sprout database will periodically be loaded from tab-delimited files generated by the current SEED.</li> </ul> <li> Inserted data is remembered so that it is not lost during the load. <li> Only certain entities will allow inserts. For example, the <b>GENOME</b> entity is marked so that new genomes can only come from the flat files, while the <b>ANNOTATION</b> entity can be inserted directly into the database.</li> </ul> <LI> Each entity consists of multiple relations, all with the same ID field. <UL> <li> The use of multiple relations allows multiple value occurrences. For example, the <b>USER</b> entity has an <b>access-code</b> attribute that occurs multiple times. This attribute would be implemented as a second relation. <LI> Every entity has a relation that contains an <STRONG>id</STRONG> field and a <STRONG> type</STRONG> field. The <STRONG>id</STRONG> may be a number or a string.The ID of an entity instance is unique to that instance for all entities of that type. </LI> </UL> <li> <ul> <li> The ID is generated externally so that it can be used to communicate between the database and the flat files. <li> This requires some special handling when dealing with items inserted directly into the database.</li> </ul> <li> Entities are connected by single relations called relationships. <ul> <li> A relationship is keyed on the IDs of the two related entities. The IDs are called <STRONG>from</STRONG> and <STRONG>to</STRONG>. In addition, it may contain additional fields that act as intersection data. <li> The relationship may also contain additional attributes to represent intersection data. For example, the <b>IsLocatedIn</b> relationship contains the ID of a <b>FEATURE</b> and a <b>CONTIG</b>, plus a a <b>dir</b> attribute that describes the direction of the gene. <li> Relationships are all binary. Non-binary relationships are implemented by adding new entities.</li> </ul> </li> </UL> <h2><a name="Structures">Metadata Structures</a></h2> <p>The metadata structures describe the entities and relationships implemented in the database. They are, in fact a database describing the database itself.</p> <h3>ENTITY</h3> <p>An <i>entity</i> is a real or abstract thing on which we wish to keep data. The terms <i>entity</i> and <i>object</i> are mostly interchangeable; however, for our purposes, <i>object</i> will only be used to describe an entity instance, rather than an entity type. In the relations that implement an entity, there must be an ID field that contains the entity key.</p> <table border="2"> <tr> <td><b>entity-id</b></td> <td><i>(key)</i> displayable common name of the entity</td> </tr> <tr> <td><b>relation-id</b></td> <td><i>(multiple)</i> a relation used to implement the entity</td> </tr> </table> <h3>RELATIONSHIP</h3> <p>A <i>relationship</i> is a connection between a pair of entities.</p> <table border="2"> <tr> <td><b>relationship-id</b></td> <td><i>(key)</i> displayable common name of the relationship</td> </tr> <tr> <td><b>relation-id</b></td> <td>relation used to implement the relationship</td> </tr> <tr> <td><b>arity</b></td> <td>type of relationship: 1-to-many, many-to-many, many-to-1, or 1-to-1</td> </tr> <tr> <td><b>source-entity-id</b></td> <td>name of the entity type from which the relationship starts</td> </tr> <tr> <td><b>target-entity-id</b></td> <td>name of the entity type into which the relationship ends</td> </tr> </table> <h3>RELATION</h3> <p>A <i>relation</i> is a physical table that implements a relationship or partly implements an entity.</p> <table border="2"> <tr> <td><b>name</b></td> <td><i>(key)</i> name of the physical relation</td> </tr> </table> <h3>FIELD</h3> <p>A <i>field</i> is a physical table column that ultimately contains the actual data.</p> <table border="2"> <tr> <td><b>relation-id</b></td> <td><i>(key.1)</i> ID of the relation containing this field</td> </tr> <tr> <td><b>name</b></td> <td><i>(key.2)</i> name of the field</td> </tr> <tr> <td><b>data-type</b></td> <td>type of data stored in the field</td> </tr> </table> <h2>Methods</h2> <p>The following methods are provided to access data in the database. Methods that allow iteration will have <b>GetFirst</b> and <b>GetNext</b> versions. For example, the <b>GetObjects</b> operation will be implemented as two methods-- <b>GetFirstObject</b> and <b>GetNextObject</b>.</p> <ul> <li> <b>Load</b>: Load the data from a flat file into the database. <ul> <li> If the database already exists, special handling is required to maintain inserted rows. <li> If the database does not exist, the tables will be created from the metadata.</li> </ul> <li> <b>GetEntityTypes</b>: Return a list of the entity types. <li> <b>GetObjects</b>: Iterate through the instances of a specified entity type. <ul> <li> A more or less arbitrary filtering mechanism will be needed. <li> The results will be returned in an indeterminate order. <li> Only one type of object will be returned. <li> This method only uses relationships for filtering purposes.</li> </ul> <li> <b>AccessObject</b>: Get a handle for extracting data from a specific entity instance. <li> <b>GetAttributes</b>: Iterate through the fields for a specified object. <ul> <li> Some fields will occur multiple times. For example, one particular feature instance may be spread between six contigs, while another appears only once. In this case, the first feature will have six occurrences of the <b>locN</b> field while the second feature will have only one.</li> </ul> <li> <b>GetValue</b>: Return the value of an attribute. <ul> <li> Because an attribute may occur multiple times, an ordinal number is required to identify the desired occurrence.</li> </ul> </li> </ul> <h1>Surface Database Architecture</h1> <h2>Entities</h2> <h3>GENOME</h3> <pre> [genome-id,genus,species,unique-characterization,source-id] [genome-id,access-code] </pre> <h3>SOURCE</h3> <pre> [source-id,label,URL,description] </pre> <h3>CONTIG</h3> <pre> [contig-id] </pre> <p>The contig-id is the genome-id and the contig name. A <b>CONTIG</b> is a contiguous section of a genome that was produced by a sequencing project. The <b>CONTIG</b>s are named and generated externally and then loaded into the database.</p> <h3>SEQUENCE</h3> <pre> [sequence-id,sequence] [sequence-id,quality-vector] </pre> <p>The sequence id is the contig-id and the begin point. The sequence is an ordered collection of characters from an alphabet. For each character in the sequence, the quality vector is an integer exponent indicating the likelihood of an error. So, a quality value of 30 means the likelihood that the chqaracter is correct is (1 - 10^-30).</p> <p>The character data for the <b>CONTIG</b> is broken into <b>SEQUENCE</b>s so that we do not have to manipulate the entire <b>CONTIG</b> as a string in memory. This is important, because some <b>CONTIG</b>s can be hundreds of megacharacters in length.</p> <h3>FEATURE</h3> <pre> [feature-id,type] [feature-id,alias] [feature-id,DNA-sequence] [feature-id,translation] [feature-id,upstream-sequence] [feature-id,virulence] [feature-id,essentiality] </pre> <h3>ROLE</h3> <pre> [role-id,role] </pre> <h3>ANNOTATION</h3> <pre> [annotation-id,time,annotation,confidence] </pre> <h3>ASSIGNMENT</h3> <pre> [assignment-id,confidence] </pre> <h3>SUBSYSTEM</h3> <pre> [subsystem-id,subsystem-name] </pre> <h3>SSCELL</h3> <pre> [cell-id,subsystem-id] </pre> <h3>USER</h3> <pre> [user-id,user-name,password] [user-id,access-code] </pre> <h3>FUSION</h3> <pre> [feature-id-1, feature-id-2] </pre> <h2>Relationships</h2> <h3>GENOME HasContig CONTIG</h3> <p>A single <b>GENOME</b> is composed of multiple <b>CONTIG</b>s.</p> <h3>GENOME ComesFrom SOURCE</h3> <p>A single <b>GENOME</b> can come from a single <STRONG>SOURCE </STRONG>or from cooperation by multiple <b>SOURCE</b>s. Multiple <STRONG>GENOME</STRONG>s may come from a single <STRONG>SOURCE</STRONG>.</p> <h3>CONTIG IsMadeUpOf SEQUENCE</h3> <p>A single <b>CONTIG</b> is made up of multiple <b>SEQUENCE</b>s.</p> <table border="2"> <tr> <td><b>start-position</b></td> <td>ordinal number of this sequence in the <b>CONTIG</b> (For example, a <b>start-position</b> of 100 means that this sequence starts at the 100th position of the <b>CONTIG</b>.</td> </tr> </table> <h3>FEATURE IsDescribedBy ANNOTATION</h3> <p>Multiple <b>ANNOTATION</b>s can be made on a single <STRONG>FEATURE</STRONG>.</p> <h3>USER Made ANNOTATION</h3> <p>Multiple <b>ANNOTATION</b>s can be made by a single <b>USER</b>.</p> <h3>USER Assigned ASSIGNMENT</h3> <p>Multiple <b>ASSIGNMENT</b>s can be made by a single <b>USER</b></p> <h3>FEATURE IsTargetOf ASSIGNMENT</h3> <p>Multiple <b>ASSIGNMENT</b>s can be made to a single <b>FEATURE</b>.</p> <h3>ASSIGNMENT Implements ROLE</h3> <p>Multiple <b>ASSIGNMENT</b>s can describe a single <STRONG>ROLE</STRONG>. Multiple <b>ROLE</b>s can be implemented by a single <STRONG>ASSIGNMENT</STRONG>.</p> <h3>GENOME ParticipatesIn SUBSYSTEM</h3> <p>Multiple <b>GENOME</b>s can participate in multiple <b>SUBSYSTEM</b>s.</p> <table border="2"> <tr> <td><b>variant</b></td> <td>description of the subsystem variant</td> </tr> </table> <h3>ROLE OccursIn SUBSYSTEM</h3> <p>Multiple <b>ROLE</b>s can be acheived by multiple <b>SUBSYSTEM</b>s.</p> <h3>SSCELL BelongsTo GENOME</h3> <p>Multiple <b>SSCELL</b>s belong to a single <b>GENOME</b>.</p> <h3>SSCELL RelatesTo ROLE</h3> <p>Multiple <b>SSCELL</b>s relate to a single <b>ROLE</b>.</p> <h3>FEATURE IsLocatedIn CONTIG</h3> <p>A single <b>FEATURE</b> is located in multiple <b>CONTIG</b>s; a <b>CONTIG</b> contains multiple <b>FEATURE</b> locations. This relationship enables us to find the gene sequences in the <b>CONTIG</b>s that make up the <b>FEATURE</b>.</p> <p>In order to insure that we are able to find all genes relating to a particular location we imposed a maximum size on each span encoded by this relationship. So, for example, if the maximum span size is 100 and we want to find all features that include position 321 of <b>CONTIG</b> ABC, we would search for location data relating to positions 222 through 420, and only emit them if the length and direction cross the 321 location.</p> <table border="2"> <tr> <td><b>locN</b></td> <td>ordinal number of this location for the <b>FEATURE</b></td> </tr> <tr> <td><b>beg</b></td> <td>position of this location's first nucleotide in the <b>CONTIG</b></td> </tr> <tr> <td><b>len</b></td> <td>number of nucleotides used by this location in the <b>CONTIG</b></td> </tr> <tr> <td><b>dir</b></td> <td>direction of the location from the beginning point <b>CONTIG</b></td> </tr> </table> <h3>SSCELL Contains FEATURE</h3> <p>A single <b>SSCELL</b> contains multiple <b>FEATURE</b>s; a <b>FEATURE</b> may be contained in multiple <b>SSCELL</b>s.</p> <h3>FEATURE IsRelatedTo FEATURE</h3> <p>Multiple <b>FEATURE</b>s are related to multiple other <b>FEATURE</b>s. This relationship is commutative.</p> <table border="2"> <tr> <td><b>score</b></td> <td>measurement of the level of the relationship</td> </tr> <tr> <td><b>type</b></td> <td>type of relationship (similarity, bidirectional best hit, or chromosome clustering)</td> </tr> </table> <h3>FUSION Yields FEATURE</h3> <p>Multiple <b>FUSION</b>s produce a single <b>FEATURE</b>.</p> </body> </html>
MCS Webmaster | ViewVC Help |
Powered by ViewVC 1.0.3 |