Annotation of /Sprout/SproutNotes.htm

Revision 1.1.1.1

1 : parrello 1.1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
2 :     <html>
3 :     <head>
4 :     <title>Notes about the SPROUT Database</title>
5 :     <meta name="GENERATOR" content="Microsoft Visual Studio .NET 7.1">
6 :     <meta name="vs_targetSchema" content="http://schemas.microsoft.com/intellisense/ie5">
7 :     </head>
8 :     <body BGCOLOR="#FFFF80">
9 :     <h1>The Underlying Database Architecture</h1>
10 :     <h2>Basic Concepts</h2>
11 :     <UL>
12 :     <li>
13 :     At its lowest level, Sprout is a configurable Entity-Relationship database that
14 :     supports only inserts.
15 :     <ul>
16 :     <li>
17 :     Only a small number of tables will support insertion.
18 :     <li>
19 :     The real data is kept in tab-delimited flat files that are used to load the
20 :     data into the database.
21 :     <ul>
22 :     <li>
23 :     The Sprout database will periodically be loaded from tab-delimited files
24 :     generated by the current SEED.</li>
25 :     </ul>
26 :     <li>
27 :     Inserted data is remembered so that it is not lost during the load.
28 :     <li>
29 :     Only certain entities will allow inserts. For example, the <b>GENOME</b> entity
30 :     is marked so that new genomes can only come from the flat files, while the <b>ANNOTATION</b>
31 :     entity can be inserted directly into the database.</li>
32 :     </ul>
33 :     <LI>
34 :     Each entity consists of one or more relations, all with the same ID field.
35 :     <UL>
36 :     <li>
37 :     The use of multiple relations allows multiple value occurrences. For example,
38 :     the <b>USER</b> entity has an <b>access-code</b>
39 :     attribute that occurs multiple times. This attribute would be implemented as a
40 :     second relation.
41 :     <LI>
42 :     Every entity has a relation that contains an <STRONG>id</STRONG> field and a
43 :     <STRONG>type</STRONG> field. The <STRONG>id</STRONG> may be a number or a
44 :     string. The ID of an entity instance is unique among all instances
45 :     of that entity type.
46 :     </LI>
47 :     </UL>
48 :     <li>
49 :     <ul>
50 :     <li>
51 :     The ID is generated externally so that it can be used to communicate between
52 :     the database and the flat files.
53 :     <li>
54 :     This requires some special handling when dealing with items inserted directly
55 :     into the database.</li>
56 :     </ul>
57 :     <li>
58 :     Entities are connected by single relations called relationships.
59 :     <ul>
60 :     <li>
61 :     A relationship is keyed on the IDs of the two related entities. The IDs are
62 :     called <STRONG>from</STRONG> and <STRONG>to</STRONG>.
64 :     <li>
65 :     The relationship may also contain additional attributes to represent
66 :     intersection data. For example, the <b>IsLocatedIn</b> relationship contains
67 :     the ID of a <b>FEATURE</b> and a <b>CONTIG</b>, plus a <b>dir</b>
68 :     attribute that describes the direction of the gene.
69 :     <li>
70 :     Relationships are all binary. Non-binary relationships are implemented by
71 :     adding new entities.</li>
72 :     </ul>
73 :     </li>
74 :     </UL>
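<p>The concepts above can be sketched with an in-memory SQLite database. This is
only an illustration under stated assumptions: the table names, the column names
(<b>from_id</b> and <b>to_id</b> stand in for <b>from</b> and <b>to</b>, which are
SQL keywords), and the sample values are all hypothetical and are not taken from
the Sprout implementation.</p>

```python
import sqlite3

# Hypothetical physical schema; the notes do not prescribe table or column names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- USER entity: a primary relation (id, type) plus a secondary relation
    -- for the multi-valued access-code attribute, sharing the same ID field.
    CREATE TABLE User (id TEXT PRIMARY KEY, type TEXT);
    CREATE TABLE UserAccessCode (id TEXT, access_code TEXT);

    -- Two more entities and the binary relationship that connects them.
    CREATE TABLE Feature (id TEXT PRIMARY KEY, type TEXT);
    CREATE TABLE Contig (id TEXT PRIMARY KEY, type TEXT);
    CREATE TABLE IsLocatedIn (from_id TEXT, to_id TEXT, dir TEXT);
""")

# Placeholder rows; the values stored in the type column are assumptions.
conn.execute("INSERT INTO User VALUES ('u1', 'USER')")
conn.executemany(
    "INSERT INTO UserAccessCode VALUES (?, ?)",
    [("u1", "codeA"), ("u1", "codeB")],
)
conn.execute("INSERT INTO Feature VALUES ('f1', 'FEATURE')")
conn.execute("INSERT INTO Contig VALUES ('c1', 'CONTIG')")
conn.execute("INSERT INTO IsLocatedIn VALUES ('f1', 'c1', '+')")

# A multi-valued attribute is read from the secondary relation by entity ID.
codes = [
    row[0]
    for row in conn.execute(
        "SELECT access_code FROM UserAccessCode WHERE id = ? ORDER BY access_code",
        ("u1",),
    )
]

# A relationship row yields the target ID plus its intersection data.
located = conn.execute(
    "SELECT to_id, dir FROM IsLocatedIn WHERE from_id = ?", ("f1",)
).fetchall()
```

<p>Here <b>codes</b> holds both access codes for the one user, and <b>located</b>
holds the contig ID together with the <b>dir</b> intersection value.</p>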
75 :     <h2><a name="Structures">Metadata Structures</a></h2>
76 :     <p>The metadata structures describe the entities and relationships implemented in
77 :     the database. They are, in fact, a database describing the database itself.</p>
78 :     <h3>ENTITY</h3>
79 :     <p>An <i>entity</i> is a real or abstract thing on which we wish to keep data. The
80 :     terms <i>entity</i> and <i>object</i> are mostly interchangeable; however, for
81 :     our purposes, <i>object</i> will only be used to describe an entity instance,
82 :     rather than an entity type. In the relations that implement an entity, there
83 :     must be an ID field that contains the entity key.</p>
84 :     <table border="2">
85 :     <tr>
86 :     <td><b>entity-id</b></td>
87 :     <td><i>(key)</i> displayable common name of the entity</td>
88 :     </tr>
89 :     <tr>
90 :     <td><b>relation-id</b></td>
91 :     <td><i>(multiple)</i> a relation used to implement the entity</td>
92 :     </tr>
93 :     </table>
94 :     <h3>RELATIONSHIP</h3>
95 :     <p>A <i>relationship</i> is a connection between a pair of entities.</p>
96 :     <table border="2">
97 :     <tr>
98 :     <td><b>relationship-id</b></td>
99 :     <td><i>(key)</i> displayable common name of the relationship</td>
100 :     </tr>
101 :     <tr>
102 :     <td><b>relation-id</b></td>
103 :     <td>relation used to implement the relationship</td>
104 :     </tr>
105 :     <tr>
106 :     <td><b>arity</b></td>
107 :     <td>type of relationship: 1-to-many, many-to-many, many-to-1, or 1-to-1</td>
108 :     </tr>
109 :     <tr>
110 :     <td><b>source-entity-id</b></td>
111 :     <td>name of the entity type from which the relationship starts</td>
112 :     </tr>
113 :     <tr>
114 :     <td><b>target-entity-id</b></td>
115 :     <td>name of the entity type at which the relationship ends</td>
116 :     </tr>
117 :     </table>
118 :     <h3>RELATION</h3>
119 :     <p>A <i>relation</i> is a physical table that implements a relationship or partly
120 :     implements an entity.</p>
121 :     <table border="2">
122 :     <tr>
123 :     <td><b>name</b></td>
124 :     <td><i>(key)</i> name of the physical relation</td>
125 :     </tr>
126 :     </table>
127 :     <h3>FIELD</h3>
128 :     <p>A <i>field</i> is a physical table column that ultimately contains the actual
129 :     data.</p>
130 :     <table border="2">
131 :     <tr>
132 :     <td><b>relation-id</b></td>
133 :     <td><i>(key.1)</i> ID of the relation containing this field</td>
134 :     </tr>
135 :     <tr>
136 :     <td><b>name</b></td>
137 :     <td><i>(key.2)</i> name of the field</td>
138 :     </tr>
139 :     <tr>
140 :     <td><b>data-type</b></td>
141 :     <td>type of data stored in the field</td>
142 :     </tr>
143 :     </table>
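<p>Because each <b>FIELD</b> row names its relation, its column, and its data type,
rows like these carry enough information to derive table definitions. The sketch
below is a minimal illustration of that idea; the sample rows, the SQL type names,
and the function name <b>create_statements</b> are assumptions, not part of the
Sprout code.</p>

```python
# Hypothetical FIELD rows: (relation-id, name, data-type).
FIELDS = [
    ("Genome", "genome_id", "TEXT"),
    ("Genome", "genus", "TEXT"),
    ("GenomeAccessCode", "genome_id", "TEXT"),
    ("GenomeAccessCode", "access_code", "TEXT"),
]


def create_statements(fields):
    """Group FIELD rows by relation and emit one CREATE TABLE per relation."""
    columns_by_relation = {}
    for relation, name, data_type in fields:
        columns_by_relation.setdefault(relation, []).append(f"{name} {data_type}")
    # Dicts preserve insertion order, so relations come out in first-seen order.
    return [
        f"CREATE TABLE {relation} ({', '.join(columns)})"
        for relation, columns in columns_by_relation.items()
    ]


statements = create_statements(FIELDS)
```

<p>With the sample rows, <b>statements</b> contains one definition for each of the
two relations that implement the hypothetical genome entity.</p>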
144 :     <h2>Methods</h2>
145 :     <p>The following methods are provided to access data in the database. Methods that
146 :     allow iteration will have <b>GetFirst</b> and <b>GetNext</b> versions. For
147 :     example, the <b>GetObjects</b> operation will be implemented as two methods: <b>GetFirstObject</b>
148 :     and <b>GetNextObject</b>.</p>
149 :     <ul>
150 :     <li>
151 :     <b>Load</b>: Load the data from a flat file into the database.
152 :     <ul>
153 :     <li>
154 :     If the database already exists, special handling is required to maintain
155 :     inserted rows.
156 :     <li>
157 :     If the database does not exist, the tables will be created from the metadata.</li>
158 :     </ul>
159 :     <li>
160 :     <b>GetEntityTypes</b>: Return a list of the entity types.
161 :     <li>
162 :     <b>GetObjects</b>: Iterate through the instances of a specified entity type.
163 :     <ul>
164 :     <li>
165 :     A more or less arbitrary filtering mechanism will be needed.
166 :     <li>
167 :     The results will be returned in an indeterminate order.
168 :     <li>
169 :     Only one type of object will be returned.
170 :     <li>
171 :     This method only uses relationships for filtering purposes.</li>
172 :     </ul>
173 :     <li>
174 :     <b>AccessObject</b>: Get a handle for extracting data from a specific
175 :     entity instance.
176 :     <li>
177 :     <b>GetAttributes</b>: Iterate through the fields for a specified object.
178 :     <ul>
179 :     <li>
180 :     Some fields will occur multiple times. For example, one particular feature
181 :     instance may be spread between six contigs, while another appears only once. In
182 :     this case, the first feature will have six occurrences of the <b>locN</b> field
183 :     while the second feature will have only one.</li>
184 :     </ul>
185 :     <li>
186 :     <b>GetValue</b>: Return the value of an attribute.
187 :     <ul>
188 :     <li>
189 :     Because an attribute may occur multiple times, an ordinal number is required to
190 :     identify the desired occurrence.</li>
191 :     </ul>
192 :     </li>
193 :     </ul>
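<p>The <b>GetFirst</b>/<b>GetNext</b> pairing and the ordinal argument of <b>GetValue</b>
can be sketched as follows. This is an in-memory stand-in, not the Sprout API: the
class name <b>ObjectCursor</b>, the dictionary representation of an object, the
filter callback, and the choice of 1-based ordinals are all assumptions made for
illustration.</p>

```python
class ObjectCursor:
    """Hypothetical stand-in for the GetFirstObject/GetNextObject pair."""

    def __init__(self, objects, keep=lambda obj: True):
        # The filter is an arbitrary predicate, as the notes suggest.
        self._matches = [obj for obj in objects if keep(obj)]
        self._pos = 0

    def get_first_object(self):
        """Restart the iteration and return the first match, or None."""
        self._pos = 0
        return self.get_next_object()

    def get_next_object(self):
        """Return the next match, or None when the iteration is exhausted."""
        if self._pos >= len(self._matches):
            return None
        obj = self._matches[self._pos]
        self._pos += 1
        return obj


def get_value(obj, attribute, ordinal=1):
    """Return one occurrence of an attribute; ordinals are 1-based here."""
    return obj[attribute][ordinal - 1]


# Each attribute maps to a list because any attribute may occur multiple times.
features = [
    {"id": ["f1"], "locN": [1, 2, 3]},
    {"id": ["f2"], "locN": [1]},
]

cursor = ObjectCursor(features, keep=lambda obj: len(obj["locN"]) > 1)
multi_location_ids = []
obj = cursor.get_first_object()
while obj is not None:
    multi_location_ids.append(get_value(obj, "id"))
    obj = cursor.get_next_object()
```

<p>Returning <b>None</b> at the end of the iteration is one possible convention; the
notes do not say how exhaustion is signalled.</p>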
194 :     <h1>Surface Database Architecture</h1>
195 :     <h2>Entities</h2>
196 :     <h3>GENOME</h3>
197 :     <pre>
198 :     [genome-id,genus,species,unique-characterization,source-id]
199 :     [genome-id,access-code]
200 :     </pre>
201 :     <h3>SOURCE</h3>
202 :     <pre>
203 :     [source-id,label,URL,description]
204 :     </pre>
205 :     <h3>CONTIG</h3>
206 :     <pre>
207 :     [contig-id]
208 :     </pre>
209 :     <p>The contig-id is the genome-id and the contig name. A <b>CONTIG</b> is a
210 :     contiguous section of a genome that was produced by a sequencing project. The <b>CONTIG</b>s
211 :     are named and generated externally and then loaded into the database.</p>
212 :     <h3>SEQUENCE</h3>
213 :     <pre>
214 :     [sequence-id,sequence]
215 :     [sequence-id,quality-vector]
216 :     </pre>
217 :     <p>The sequence id is the contig-id and the begin point. The sequence is an ordered
218 :     collection of characters from an alphabet. For each character in the sequence,
219 :     the quality vector holds an integer exponent indicating the likelihood of an
220 :     error. So, a quality value of 30 means the likelihood that the character is
221 :     correct is (1 - 10^-30).</p>
222 :     <p>The character data for the <b>CONTIG</b> is broken into <b>SEQUENCE</b>s so that
223 :     we do not have to manipulate the entire <b>CONTIG</b> as a string in memory.
224 :     This is important, because some <b>CONTIG</b>s can be hundreds of
225 :     megacharacters in length.</p>
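<p>Splitting a <b>CONTIG</b> into <b>SEQUENCE</b>s keyed by contig-id and begin point
might look like the sketch below. The chunk size, the "." separator in the
sequence-id, the 1-based begin point, and the function name <b>split_contig</b> are
assumptions; the notes only say that the id combines the contig-id and the begin
point.</p>

```python
def split_contig(contig_id, residues, chunk_size):
    """Yield (sequence_id, chunk) pairs covering the contig in order.

    The sequence_id joins the contig-id and the 1-based begin point of the
    chunk; both the separator and the 1-based convention are assumptions.
    """
    for begin in range(0, len(residues), chunk_size):
        yield f"{contig_id}.{begin + 1}", residues[begin:begin + chunk_size]


# A toy contig; real contigs are far too long to show here.
chunks = list(split_contig("G1:ABC", "ACGTACGTAC", 4))
```

<p>Because the function is a generator, a caller can process one chunk at a time
instead of holding every chunk in memory, although this toy example still passes
the whole contig in as a single string.</p>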
226 :     <h3>FEATURE</h3>
227 :     <pre>
228 :     [feature-id,type]
229 :     [feature-id,alias]
230 :     [feature-id,DNA-sequence]
231 :     [feature-id,translation]
232 :     [feature-id,upstream-sequence]
233 :     [feature-id,virulence]
234 :     [feature-id,essentiality]
235 :     </pre>
236 :     <h3>ROLE</h3>
237 :     <pre>
238 :     [role-id,role]
239 :     </pre>
240 :     <h3>ANNOTATION</h3>
241 :     <pre>
242 :     [annotation-id,time,annotation,confidence]
243 :     </pre>
244 :     <h3>ASSIGNMENT</h3>
245 :     <pre>
246 :     [assignment-id,confidence]
247 :     </pre>
248 :     <h3>SUBSYSTEM</h3>
249 :     <pre>
250 :     [subsystem-id,subsystem-name]
251 :     </pre>
252 :     <h3>SSCELL</h3>
253 :     <pre>
254 :     [cell-id,subsystem-id]
255 :     </pre>
256 :     <h3>USER</h3>
257 :     <pre>
258 :     [user-id,user-name,password]
259 :     [user-id,access-code]
260 :     </pre>
261 :     <h3>FUSION</h3>
262 :     <pre>
263 :     [feature-id-1, feature-id-2]
264 :     </pre>
265 :     <h2>Relationships</h2>
266 :     <h3>GENOME HasContig CONTIG</h3>
267 :     <p>A single <b>GENOME</b> is composed of multiple <b>CONTIG</b>s.</p>
268 :     <h3>GENOME ComesFrom SOURCE</h3>
269 :     <p>A single <b>GENOME</b> can come from a single <STRONG>SOURCE</STRONG> or from
270 :     cooperation by multiple <b>SOURCE</b>s. Multiple <STRONG>GENOME</STRONG>s may
271 :     come from a single <STRONG>SOURCE</STRONG>.</p>
272 :     <h3>CONTIG IsMadeUpOf SEQUENCE</h3>
273 :     <p>A single <b>CONTIG</b> is made up of multiple <b>SEQUENCE</b>s.</p>
274 :     <table border="2">
275 :     <tr>
276 :     <td><b>start-position</b></td>
277 :     <td>position in the <b>CONTIG</b> at which this sequence starts (for example, a <b>start-position</b>
278 :     of 100 means that this sequence starts at the 100th position of the <b>CONTIG</b>)</td>
279 :     </tr>
280 :     </table>
281 :     <h3>FEATURE IsDescribedBy ANNOTATION</h3>
282 :     <p>Multiple <b>ANNOTATION</b>s can be made on a single <STRONG>FEATURE</STRONG>.</p>
283 :     <h3>USER Made ANNOTATION</h3>
284 :     <p>Multiple <b>ANNOTATION</b>s can be made by a single <b>USER</b>.</p>
285 :     <h3>USER Assigned ASSIGNMENT</h3>
286 :     <p>Multiple <b>ASSIGNMENT</b>s can be made by a single <b>USER</b>.</p>
287 :     <h3>FEATURE IsTargetOf ASSIGNMENT</h3>
288 :     <p>Multiple <b>ASSIGNMENT</b>s can be made to a single <b>FEATURE</b>.</p>
289 :     <h3>ASSIGNMENT Implements ROLE</h3>
290 :     <p>Multiple <b>ASSIGNMENT</b>s can describe a single <STRONG>ROLE</STRONG>.
291 :     Multiple <b>ROLE</b>s can be implemented by a single <STRONG>ASSIGNMENT</STRONG>.</p>
292 :     <h3>GENOME ParticipatesIn SUBSYSTEM</h3>
293 :     <p>Multiple <b>GENOME</b>s can participate in multiple <b>SUBSYSTEM</b>s.</p>
294 :     <table border="2">
295 :     <tr>
296 :     <td><b>variant</b></td>
297 :     <td>description of the subsystem variant</td>
298 :     </tr>
299 :     </table>
300 :     <h3>ROLE OccursIn SUBSYSTEM</h3>
301 :     <p>Multiple <b>ROLE</b>s can be achieved by multiple <b>SUBSYSTEM</b>s.</p>
302 :     <h3>SSCELL BelongsTo GENOME</h3>
303 :     <p>Multiple <b>SSCELL</b>s belong to a single <b>GENOME</b>.</p>
304 :     <h3>SSCELL RelatesTo ROLE</h3>
305 :     <p>Multiple <b>SSCELL</b>s relate to a single <b>ROLE</b>.</p>
306 :     <h3>FEATURE IsLocatedIn CONTIG</h3>
307 :     <p>A single <b>FEATURE</b> is located in multiple <b>CONTIG</b>s; a <b>CONTIG</b> contains
308 :     multiple <b>FEATURE</b> locations. This relationship enables us to find the
309 :     gene sequences in the <b>CONTIG</b>s that make up the <b>FEATURE</b>.</p>
310 :     <p>In order to ensure that we are able to find all genes relating to a particular
311 :     location, we impose a maximum size on each span encoded by this relationship.
312 :     So, for example, if the maximum span size is 100 and we want to find all
313 :     features that include position 321 of <b>CONTIG</b> ABC, we would search for
314 :     location data relating to positions 222 through 420, and only emit them if the
315 :     length and direction cross the 321 location.</p>
316 :     <table border="2">
317 :     <tr>
318 :     <td><b>locN</b></td>
319 :     <td>ordinal number of this location for the <b>FEATURE</b></td>
320 :     </tr>
321 :     <tr>
322 :     <td><b>beg</b></td>
323 :     <td>position of this location's first nucleotide in the <b>CONTIG</b></td>
324 :     </tr>
325 :     <tr>
326 :     <td><b>len</b></td>
327 :     <td>number of nucleotides used by this location in the <b>CONTIG</b></td>
328 :     </tr>
329 :     <tr>
330 :     <td><b>dir</b></td>
331 :     <td>direction of the location from the beginning point in the <b>CONTIG</b></td>
332 :     </tr>
333 :     </table>
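<p>The bounded-span search described above can be sketched as follows. The sketch
assumes that <b>dir</b> is "+" or "-", and that for a "-" location <b>beg</b> is the
highest-numbered position of the span (the first nucleotide in reading direction).
These conventions, the tuple layout, and the function names are assumptions, not
statements about the Sprout code.</p>

```python
MAX_SPAN = 100  # maximum span size used in the example above


def search_window(position, max_span=MAX_SPAN):
    """Range of beg values whose spans could still reach the position."""
    return position - (max_span - 1), position + (max_span - 1)


def covers(beg, length, direction, position):
    """True if the span starting at beg crosses the position."""
    if direction == "+":
        return beg <= position <= beg + length - 1
    return beg - length + 1 <= position <= beg


def features_at(locations, contig_id, position, max_span=MAX_SPAN):
    """Return the sorted IDs of features with a location crossing the position.

    Each location is a (feature_id, contig_id, beg, len, dir) tuple.
    """
    low, high = search_window(position, max_span)
    return sorted({
        feature
        for feature, contig, beg, length, direction in locations
        if contig == contig_id
        and low <= beg <= high          # cheap range filter on beg
        and covers(beg, length, direction, position)  # exact check
    })


locations = [
    ("f1", "ABC", 300, 50, "+"),  # covers 300..349
    ("f2", "ABC", 350, 40, "-"),  # covers 311..350
    ("f3", "ABC", 400, 20, "-"),  # covers 381..400
    ("f4", "XYZ", 300, 50, "+"),  # different contig
]

window = search_window(321)
hits = features_at(locations, "ABC", 321)
```

<p>For position 321 the window is 222 through 420, matching the example in the text.
All three locations on contig ABC fall inside it, but only <b>f1</b> and <b>f2</b>
actually cross position 321. In a database, the range test on <b>beg</b> would
typically be the indexed part of the query.</p>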
334 :     <h3>SSCELL Contains FEATURE</h3>
335 :     <p>A single <b>SSCELL</b> contains multiple <b>FEATURE</b>s; a <b>FEATURE</b> may
336 :     be contained in multiple <b>SSCELL</b>s.</p>
337 :     <h3>FEATURE IsRelatedTo FEATURE</h3>
338 :     <p>Multiple <b>FEATURE</b>s are related to multiple other <b>FEATURE</b>s. This
339 :     relationship is symmetric.</p>
340 :     <table border="2">
341 :     <tr>
342 :     <td><b>score</b></td>
343 :     <td>measurement of the level of the relationship</td>
344 :     </tr>
345 :     <tr>
346 :     <td><b>type</b></td>
347 :     <td>type of relationship (similarity, bidirectional best hit, or chromosome
348 :     clustering)</td>
349 :     </tr>
350 :     </table>
351 :     <h3>FUSION Yields FEATURE</h3>
352 :     <p>Multiple <b>FUSION</b>s produce a single <b>FEATURE</b>.</p>
353 :     </body>
354 :     </html>
