[Bio] / Sprout / SproutNotes.htm Repository:
ViewVC logotype

View of /Sprout/SproutNotes.htm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.1 - (download) (as text) (annotate)
Wed Oct 6 18:17:41 2004 UTC (15 years, 2 months ago) by parrello
Branch point for: MAIN, avendor
Initial revision

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
    <head>
        <title>Notes about the SPROUT Database</title>
        <meta name="GENERATOR" content="Microsoft Visual Studio .NET 7.1">
        <meta name="vs_targetSchema" content="http://schemas.microsoft.com/intellisense/ie5">
    </head>
    <body BGCOLOR="#FFFF80">
        <h1>The Underlying Database Architecture</h1>
        <h2>Basic Concepts</h2>
        <UL>
            <li>
                At its lowest level, Sprout is a configurable Entity-Relationship database that
                supports only inserts.
                <ul>
                    <li>
                    Only a small number of tables will support insertion.
                    <li>
                        The real data is kept in tab-delimited flat files that are used to load the
                        data into the database.
                        <ul>
                            <li>
                                The Sprout database will periodically be loaded from tab-delimited files
                                generated by the current SEED.</li>
                        </ul>
                    <li>
                    Inserted data is remembered so that it is not lost during the load.
                    <li>
                        Only certain entities will allow inserts. For example, the <b>GENOME</b> entity
                        is marked so that new genomes can only come from the flat files, while the <b>ANNOTATION</b>
                        entity can be inserted directly into the database.</li>
                </ul>
            <LI>
                Each entity consists of multiple relations, all with the same ID field.
                <UL>
                    <li>
                        The use of multiple relations allows multiple value occurrences. For example,
                        the <b>USER</b> entity has an <b>access-code</b>
                    attribute that occurs multiple times. This attribute would be implemented as a
                    second relation.
                    <LI>
                        Every entity has a relation that contains an <STRONG>id</STRONG> field and a <STRONG>
                            type</STRONG> field. The <STRONG>id</STRONG> may be a number or a
                        string.The ID of an entity instance is unique to that instance for all entities
                        of that type.
                    </LI>
                </UL>
            <li>
                <ul>
                    <li>
                    The ID is generated externally so that it can be used to communicate between
                    the database and the flat files.
                    <li>
                        This requires some special handling when dealing with items inserted directly
                        into the database.</li>
                </ul>
            <li>
                Entities are connected by single relations called relationships.
                <ul>
                    <li>
                        A relationship is keyed on the IDs of the two related entities. The IDs are
                        called <STRONG>from</STRONG> and <STRONG>to</STRONG>. In addition, it may
                    contain additional fields that act as intersection data.
                    <li>
                        The relationship may also contain additional attributes to represent
                        intersection data. For example, the <b>IsLocatedIn</b> relationship contains
                        the ID of a <b>FEATURE</b> and a <b>CONTIG</b>, plus a a <b>dir</b>
                    attribute that describes the direction of the gene.
                    <li>
                        Relationships are all binary. Non-binary relationships are implemented by
                        adding new entities.</li>
                </ul>
            </li>
        </UL>
        <h2><a name="Structures">Metadata Structures</a></h2>
        <p>The metadata structures describe the entities and relationships implemented in
            the database. They are, in fact a database describing the database itself.</p>
        <h3>ENTITY</h3>
        <p>An <i>entity</i> is a real or abstract thing on which we wish to keep data. The
            terms <i>entity</i> and <i>object</i> are mostly interchangeable; however, for
            our purposes, <i>object</i> will only be used to describe an entity instance,
            rather than an entity type. In the relations that implement an entity, there
            must be an ID field that contains the entity key.</p>
        <table border="2">
            <tr>
                <td><b>entity-id</b></td>
                <td><i>(key)</i> displayable common name of the entity</td>
            </tr>
            <tr>
                <td><b>relation-id</b></td>
                <td><i>(multiple)</i> a relation used to implement the entity</td>
            </tr>
        </table>
        <h3>RELATIONSHIP</h3>
        <p>A <i>relationship</i> is a connection between a pair of entities.</p>
        <table border="2">
            <tr>
                <td><b>relationship-id</b></td>
                <td><i>(key)</i> displayable common name of the relationship</td>
            </tr>
            <tr>
                <td><b>relation-id</b></td>
                <td>relation used to implement the relationship</td>
            </tr>
            <tr>
                <td><b>arity</b></td>
                <td>type of relationship: 1-to-many, many-to-many, many-to-1, or 1-to-1</td>
            </tr>
            <tr>
                <td><b>source-entity-id</b></td>
                <td>name of the entity type from which the relationship starts</td>
            </tr>
            <tr>
                <td><b>target-entity-id</b></td>
                <td>name of the entity type into which the relationship ends</td>
            </tr>
        </table>
        <h3>RELATION</h3>
        <p>A <i>relation</i> is a physical table that implements a relationship or partly
            implements an entity.</p>
        <table border="2">
            <tr>
                <td><b>name</b></td>
                <td><i>(key)</i> name of the physical relation</td>
            </tr>
        </table>
        <h3>FIELD</h3>
        <p>A <i>field</i> is a physical table column that ultimately contains the actual
            data.</p>
        <table border="2">
            <tr>
                <td><b>relation-id</b></td>
                <td><i>(key.1)</i> ID of the relation containing this field</td>
            </tr>
            <tr>
                <td><b>name</b></td>
                <td><i>(key.2)</i> name of the field</td>
            </tr>
            <tr>
                <td><b>data-type</b></td>
                <td>type of data stored in the field</td>
            </tr>
        </table>
        <h2>Methods</h2>
        <p>The following methods are provided to access data in the database. Methods that
            allow iteration will have <b>GetFirst</b> and <b>GetNext</b> versions. For
            example, the <b>GetObjects</b> operation will be implemented as two methods-- <b>GetFirstObject</b>
            and <b>GetNextObject</b>.</p>
        <ul>
            <li>
                <b>Load</b>: Load the data from a flat file into the database.
                <ul>
                    <li>
                    If the database already exists, special handling is required to maintain
                    inserted rows.
                    <li>
                        If the database does not exist, the tables will be created from the metadata.</li>
                </ul>
            <li>
                <b>GetEntityTypes</b>: Return a list of the entity types.
            <li>
                <b>GetObjects</b>: Iterate through the instances of a specified entity type.
                <ul>
                    <li>
                    A more or less arbitrary filtering mechanism will be needed.
                    <li>
                    The results will be returned in an indeterminate order.
                    <li>
                    Only one type of object will be returned.
                    <li>
                        This method only uses relationships for filtering purposes.</li>
                </ul>
            <li>
                <b>AccessObject</b>: Get a handle for extracting data from a specific
            entity instance.
            <li>
                <b>GetAttributes</b>: Iterate through the fields for a specified object.
                <ul>
                    <li>
                        Some fields will occur multiple times. For example, one particular feature
                        instance may be spread between six contigs, while another appears only once. In
                        this case, the first feature will have six occurrences of the <b>locN</b> field
                        while the second feature will have only one.</li>
                </ul>
            <li>
                <b>GetValue</b>: Return the value of an attribute.
                <ul>
                    <li>
                        Because an attribute may occur multiple times, an ordinal number is required to
                        identify the desired occurrence.</li>
                </ul>
            </li>
        </ul>
        <h1>Surface Database Architecture</h1>
        <h2>Entities</h2>
        <h3>GENOME</h3>
        <pre>
	[genome-id,genus,species,unique-characterization,source-id]
	[genome-id,access-code]
</pre>
        <h3>SOURCE</h3>
        <pre>
	[source-id,label,URL,description]
</pre>
        <h3>CONTIG</h3>
        <pre>
	[contig-id]
</pre>
        <p>The contig-id is the genome-id and the contig name. A <b>CONTIG</b> is a
            contiguous section of a genome that was produced by a sequencing project. The <b>CONTIG</b>s
            are named and generated externally and then loaded into the database.</p>
        <h3>SEQUENCE</h3>
        <pre>
    [sequence-id,sequence]
    [sequence-id,quality-vector]
</pre>
        <p>The sequence id is the contig-id and the begin point. The sequence is an ordered
            collection of characters from an alphabet. For each character in the sequence,
            the quality vector is an integer exponent indicating the likelihood of an
            error. So, a quality value of 30 means the likelihood that the chqaracter is
            correct is (1 - 10^-30).</p>
        <p>The character data for the <b>CONTIG</b> is broken into <b>SEQUENCE</b>s so that
            we do not have to manipulate the entire <b>CONTIG</b> as a string in memory.
            This is important, because some <b>CONTIG</b>s can be hundreds of
            megacharacters in length.</p>
        <h3>FEATURE</h3>
        <pre>
	[feature-id,type]
	[feature-id,alias]
	[feature-id,DNA-sequence]
	[feature-id,translation]
	[feature-id,upstream-sequence]
	[feature-id,virulence]
	[feature-id,essentiality]
</pre>
        <h3>ROLE</h3>
        <pre>
	[role-id,role]
</pre>
        <h3>ANNOTATION</h3>
        <pre>
	[annotation-id,time,annotation,confidence]
</pre>
        <h3>ASSIGNMENT</h3>
        <pre>
    [assignment-id,confidence]
</pre>
        <h3>SUBSYSTEM</h3>
        <pre>
	[subsystem-id,subsystem-name]
</pre>
        <h3>SSCELL</h3>
        <pre>
	[cell-id,subsystem-id]
</pre>
        <h3>USER</h3>
        <pre>
	[user-id,user-name,password]
	[user-id,access-code]
</pre>
        <h3>FUSION</h3>
        <pre>
    [feature-id-1, feature-id-2]
</pre>
        <h2>Relationships</h2>
        <h3>GENOME HasContig CONTIG</h3>
        <p>A single <b>GENOME</b> is composed of multiple <b>CONTIG</b>s.</p>
        <h3>GENOME ComesFrom SOURCE</h3>
        <p>A single <b>GENOME</b> can come from a single <STRONG>SOURCE </STRONG>or from
            cooperation by multiple <b>SOURCE</b>s. Multiple <STRONG>GENOME</STRONG>s may
            come from a single <STRONG>SOURCE</STRONG>.</p>
        <h3>CONTIG IsMadeUpOf SEQUENCE</h3>
        <p>A single <b>CONTIG</b> is made up of multiple <b>SEQUENCE</b>s.</p>
        <table border="2">
            <tr>
                <td><b>start-position</b></td>
                <td>ordinal number of this sequence in the <b>CONTIG</b> (For example, a <b>start-position</b>
                    of 100 means that this sequence starts at the 100th position of the <b>CONTIG</b>.</td>
            </tr>
        </table>
        <h3>FEATURE IsDescribedBy ANNOTATION</h3>
        <p>Multiple <b>ANNOTATION</b>s can be made on a single <STRONG>FEATURE</STRONG>.</p>
        <h3>USER Made ANNOTATION</h3>
        <p>Multiple <b>ANNOTATION</b>s can be made by a single <b>USER</b>.</p>
        <h3>USER Assigned ASSIGNMENT</h3>
        <p>Multiple <b>ASSIGNMENT</b>s can be made by a single <b>USER</b></p>
        <h3>FEATURE IsTargetOf ASSIGNMENT</h3>
        <p>Multiple <b>ASSIGNMENT</b>s can be made to a single <b>FEATURE</b>.</p>
        <h3>ASSIGNMENT Implements ROLE</h3>
        <p>Multiple <b>ASSIGNMENT</b>s can describe a single <STRONG>ROLE</STRONG>.
            Multiple <b>ROLE</b>s can be implemented by a single <STRONG>ASSIGNMENT</STRONG>.</p>
        <h3>GENOME ParticipatesIn SUBSYSTEM</h3>
        <p>Multiple <b>GENOME</b>s can participate in multiple <b>SUBSYSTEM</b>s.</p>
        <table border="2">
            <tr>
                <td><b>variant</b></td>
                <td>description of the subsystem variant</td>
            </tr>
        </table>
        <h3>ROLE OccursIn SUBSYSTEM</h3>
        <p>Multiple <b>ROLE</b>s can be acheived by multiple <b>SUBSYSTEM</b>s.</p>
        <h3>SSCELL BelongsTo GENOME</h3>
        <p>Multiple <b>SSCELL</b>s belong to a single <b>GENOME</b>.</p>
        <h3>SSCELL RelatesTo ROLE</h3>
        <p>Multiple <b>SSCELL</b>s relate to a single <b>ROLE</b>.</p>
        <h3>FEATURE IsLocatedIn CONTIG</h3>
        <p>A single <b>FEATURE</b> is located in multiple <b>CONTIG</b>s; a <b>CONTIG</b> contains
            multiple <b>FEATURE</b> locations. This relationship enables us to find the
            gene sequences in the <b>CONTIG</b>s that make up the <b>FEATURE</b>.</p>
        <p>In order to insure that we are able to find all genes relating to a particular
            location we imposed a maximum size on each span encoded by this relationship.
            So, for example, if the maximum span size is 100 and we want to find all
            features that include position 321 of <b>CONTIG</b> ABC, we would search for
            location data relating to positions 222 through 420, and only emit them if the
            length and direction cross the 321 location.</p>
        <table border="2">
            <tr>
                <td><b>locN</b></td>
                <td>ordinal number of this location for the <b>FEATURE</b></td>
            </tr>
            <tr>
                <td><b>beg</b></td>
                <td>position of this location's first nucleotide in the <b>CONTIG</b></td>
            </tr>
            <tr>
                <td><b>len</b></td>
                <td>number of nucleotides used by this location in the <b>CONTIG</b></td>
            </tr>
            <tr>
                <td><b>dir</b></td>
                <td>direction of the location from the beginning point <b>CONTIG</b></td>
            </tr>
        </table>
        <h3>SSCELL Contains FEATURE</h3>
        <p>A single <b>SSCELL</b> contains multiple <b>FEATURE</b>s; a <b>FEATURE</b> may
            be contained in multiple <b>SSCELL</b>s.</p>
        <h3>FEATURE IsRelatedTo FEATURE</h3>
        <p>Multiple <b>FEATURE</b>s are related to multiple other <b>FEATURE</b>s. This
            relationship is commutative.</p>
        <table border="2">
            <tr>
                <td><b>score</b></td>
                <td>measurement of the level of the relationship</td>
            </tr>
            <tr>
                <td><b>type</b></td>
                <td>type of relationship (similarity, bidirectional best hit, or chromosome
                    clustering)</td>
            </tr>
        </table>
        <h3>FUSION Yields FEATURE</h3>
        <p>Multiple <b>FUSION</b>s produce a single <b>FEATURE</b>.</p>
    </body>
</html>

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3