<h1>Exchange of Assignments and Annotations: the SEED Perspective</h1>

In this document I describe the basic concepts relating to assignments
and annotations as they are implemented in the SEED. I will discuss
notions relating to exchange of annotations and maintenance of
relevant events. I will not discuss issues relating to ontologies
and the use of constrained vocabularies to formulate functions.
These topics are important, but they are largely independent of the
issues I will be covering in this document.
<p>
The topic of annotation is certainly not limited to the specific class
of features we call coding sequences (this type of feature is often
abbreviated as a <b>CDS</b> or, within the SEED, as a protein-encoding
gene called a <b>PEG</b>). However, most of the central issues do
relate to CDSs. Hence, I will focus on annotation of CDSs; the
generalization of the concept to arbitrary features is straightforward.

<h2>Annotation</h2>
An <b>annotation</b> is a time-stamped piece of
text written by an <i>author</i> that is attached to a feature. It
may be viewed as a 4-tuple: {Feature,TimeStamp,Author,Text}. The text
can be either <i>structured</i> or <i>unstructured</i>. One
particular type of structured annotation is used to record a judgement
relating to the function of a protein encoded by a specific gene. The
basic syntax of this form of structured annotation is <b>The function
of <i>Gene</i> is <i>Function</i></b>.

<h2>Assignment</h2>
An <b>assignment</b> is a 3-tuple {Feature,Author,Function}. When a
user of the SEED generates an assignment, an annotation is also
generated to record the event.

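<p>
The two tuples and the rule that every assignment also produces an annotation can be sketched as follows. This is only an illustration, not the SEED's own code: the class names, the use of an integer timestamp, the helper <code>make_assignment</code>, and the feature id and function string in the example are all assumptions made for the sketch.

```python
from dataclasses import dataclass
import time


@dataclass(frozen=True)
class Annotation:
    """The 4-tuple {Feature,TimeStamp,Author,Text}."""
    feature: str
    timestamp: int   # seconds since the epoch (representation assumed)
    author: str
    text: str


@dataclass(frozen=True)
class Assignment:
    """The 3-tuple {Feature,Author,Function}."""
    feature: str
    author: str
    function: str


def make_assignment(feature, author, function, now=None):
    """Return the assignment and the structured annotation that records it."""
    stamp = int(time.time()) if now is None else now
    assignment = Assignment(feature, author, function)
    # Structured text follows the pattern "The function of Gene is Function".
    annotation = Annotation(feature, stamp, author,
                            f"The function of {feature} is {function}")
    return assignment, annotation


# Hypothetical feature id, author, and function, for illustration only.
assignment, annotation = make_assignment(
    "fig|83333.1.peg.4", "master:alice", "Homoserine kinase", now=1095379200)
print(annotation.text)
```

Keeping the annotation as a separate immutable record reflects the idea that the assignment may later change while the record of the event remains.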
<h2>Function of a Protein Encoded by a Gene</h2>
A <b>Function</b> also has some minimal structure. It can be an
arbitrary string of text that does not contain any occurrences of
either "; " or " / ", which is called a <b>Basic Function</b>. It can
also be a sequence of basic functions separated by "; ", which may be
taken to mean "the function is asserted to be one or more of the basic
functions". It can also be a sequence of basic functions separated by
" / ", which may be taken to mean "the function is asserted to be all
of the basic functions".

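<p>
A small parser makes the three cases concrete. This is a sketch under assumptions: the function name <code>parse_function</code> and the labels it returns are invented here, and because the description above does not say what a string containing both separators means, the sketch simply rejects it.

```python
def parse_function(function):
    """Split a function string into (kind, list of basic functions)."""
    has_or = "; " in function    # "one or more of the basic functions"
    has_and = " / " in function  # "all of the basic functions"
    if has_or and has_and:
        # Not covered by the description; refusing is a choice made for this sketch.
        raise ValueError("function mixes '; ' and ' / ' separators")
    if has_or:
        return "one_or_more", function.split("; ")
    if has_and:
        return "all", function.split(" / ")
    return "basic", [function]


print(parse_function("Kinase A; Kinase B"))
print(parse_function("Synthase alpha / Synthase beta"))
```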
<h2>The Issue of IDs</h2>

To understand the issues relating to IDs of coding sequences, consider
a situation in which we have a system (say, an instance of the SEED)
which contains 100 closely related genomes all sharing the same genus
and species. Within just 2-3 years this will be commonplace.
Already, we have versions with 5-10 distinct strains of
<i>Staphylococcus aureus</i>. Now suppose that a specific gene
appears with exactly identical sequence in each of the 100 genomes.
Further, in one of the genomes it has been duplicated. Hence, we have
101 distinct coding sequences that all translate to a single protein
sequence.
Finally, assume that in the genome with two copies, the upstream
regions contain regulatory sites that cause one copy to be expressed
only at high temperatures and the other copy at low temperatures; that
is, the two copies have what are arguably different functions.
<p>
When comparing data from two systems, it may or may not be trivial to
determine when two genomes are identical. If users of each system are
making occasional corrections to the actual sequences of the genomes,
it becomes somewhat problematic. When genomes are not precisely
identical, it can be quite difficult to determine whether genes from
the two genomes should be thought of as identical or not.
<p>
When a version of the SEED receives an assignment from an external
source, it is normally received as a 3-tuple
{ExternalID,Sequence,Function}, where the <i>Sequence</i> is a protein
sequence (i.e., the translation of the CDS). If the <i>ExternalID</i>
can reliably be mapped to a specific coding sequence in the SEED, then
the assignment is unambiguous. On the other hand, if the mapping
cannot be done unambiguously, <b>the assignment is taken as a set of
assertions -- one for each of the internal CDSs that have matching
translations</b>.
Two translations are considered matching if after discarding the
initial amino acids, one of the sequences is a suffix of the other and
the shorter sequence has a length that is at least 70% of the length
of the longer sequence.
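<p>
The matching rule can be sketched as below. Several details are assumptions of this sketch rather than statements about the SEED: that "the initial amino acids" means the first residue of each sequence (the <code>skip</code> parameter), that lengths are compared after that residue is dropped, and the function name itself. The example sequences are made up.

```python
def translations_match(seq_a, seq_b, min_ratio=0.70, skip=1):
    """Return True if two protein translations match under the suffix rule."""
    # Drop the initial residue(s); alternate start codons can change them.
    a, b = seq_a[skip:], seq_b[skip:]
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return False
    # One sequence must be a suffix of the other, and the shorter one must be
    # at least min_ratio of the longer one's length.
    return longer.endswith(shorter) and len(shorter) >= min_ratio * len(longer)


# Same C-terminal region, different called start: treated as matching.
print(translations_match("MKVTSAAGGL", "MTSAAGGL"))
# Suffix relation holds, but the shorter sequence is under 70% of the longer.
print(translations_match("MKLVTSAAGG", "MAAGG"))
```

Dropping the first residue before the suffix test lets two calls of the same gene with different start positions still be recognized as the same protein.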
<p>
This naturally raises the issue of how unambiguous mappings can be
determined.
Within the SEED, the following steps are used:
<ol>
<li>
If the exchange is with a version of the SEED, then FIG ids can be
matched. If some genomes exist in only one version, or if ids have
been added to one or both of the versions, ids may fail to match.
<li>
Otherwise,
if CDS ids (e.g., gi or RefSeq ids) can be matched, then an
unambiguous correspondence can be established.
<li>
Otherwise, if identical versions of a genome are in use (ensuring that
checksums of the contigs in the genome give identical results), then
locations on contigs can be used as ids to determine unambiguous
matches.
</ol>

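<p>
The three steps above form a simple fallback chain, which can be sketched as follows. The record layout, the lookup tables, and every id in the example are hypothetical; in particular, folding the contig checksum into the location key is just one way of expressing "identical versions of a genome" and is an assumption of this sketch.

```python
def resolve_cds(record, fig_ids, alias_to_fig, location_to_fig):
    """Return the local FIG id for an incoming record, or None if no
    unambiguous mapping is found."""
    # Step 1: exchange with another SEED, so FIG ids may match directly.
    fig_id = record.get("fig_id")
    if fig_id in fig_ids:
        return fig_id
    # Step 2: a shared external CDS id (e.g. a gi or RefSeq id).
    for alias in record.get("aliases", ()):
        if alias in alias_to_fig:
            return alias_to_fig[alias]
    # Step 3: identical genome versions; the key carries the contig checksum,
    # so a location only matches when the checksums agree.
    location = record.get("location")  # (contig_checksum, begin, end)
    if location in location_to_fig:
        return location_to_fig[location]
    return None


# Hypothetical local tables and incoming records.
fig_ids = {"fig|83333.1.peg.1"}
alias_to_fig = {"gi|0000001": "fig|83333.1.peg.2"}
location_to_fig = {("checksum-abc", 190, 255): "fig|83333.1.peg.3"}

print(resolve_cds({"aliases": ["gi|0000001"]},
                  fig_ids, alias_to_fig, location_to_fig))
```

When the function returns <code>None</code>, the assignment would be treated as a set of assertions over all CDSs with matching translations, as described earlier.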
<h2>The Notion of Cooperative Maintenance of a Master Set of Annotations</h2>

The SEED is designed to support a group of annotators who wish to
cooperatively annotate a set of genomes. By "cooperatively annotate",
I mean that they wish to overwrite each other's assignments -- they
are <i>trusted</i> annotators. The SEED also supports any number of
independent annotators -- individuals who do not overwrite each
other's assignments. Corresponding to each SEED, there is a single
set of cooperating users, and they specify user ids of the form
<b>master:</b><i>user</i>.
Note that the annotations recording assignments
should never be overwritten in any event.

<h2>The Use and Synchronization of Cooperating Annotation Systems</h2>

Multiple copies of the SEED can be used to maintain synchronized assignments.
Usually, the systems would be those of
a cooperating group of annotators. The SEED provides the capabilities
for daily automatic synchronizations. This is achieved by designating
one of the systems as a "clearinghouse". On a daily basis, the
clearinghouse will acquire all newly-generated annotations and
assignments from each of the other participating systems. Then, it
will merge and dispense updates to the other systems. Setting up and
administering this behaviour is described in a separate document.

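<p>
The acquire, merge, and dispense cycle can be illustrated for annotations alone, modelling each system as a set of {Feature,TimeStamp,Author,Text} tuples. This is a deliberately simplified sketch: it uses a plain set union instead of tracking which annotations are newly generated, it does not cover assignments, and the function name and sample data are assumptions.

```python
def synchronize(clearinghouse, others):
    """Merge annotations through a clearinghouse, modifying the sets in place."""
    # Acquire: the clearinghouse collects annotations from every system.
    for system in others:
        clearinghouse.update(system)
    # Dispense: every system receives the merged set.
    for system in others:
        system.update(clearinghouse)


# Hypothetical annotation tuples on three systems.
hub = {("fig|1.1.peg.1", 100, "master:alice", "note A")}
site_b = {("fig|1.1.peg.2", 200, "master:bob", "note B")}
site_c = set()

synchronize(hub, [site_b, site_c])
print(len(hub), len(site_b), len(site_c))
```

Because the union is idempotent, running the cycle again without new annotations changes nothing.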
<h2>Introduction of Externally-Generated Assignments and Annotations</h2>

Any SEED system can initiate a transfer of annotations and assignments
from other SEED systems. We expect to move towards common protocols
that allow such transfers with a growing number of non-SEED annotation
systems.
<p>

When annotations are transferred they are simply merged. When
assignments are transferred and the author is not a cooperating user,
the SEED user is offered the option of accepting them (or not
accepting them). If they are accepted and the author was a
cooperating annotator, the assignment will be made (but no annotation
will be generated). If they are accepted and the author was not a
cooperating annotator, the assignment will be made and an annotation
recording the event will be made. A "short-cut" to acceptance can be
utilized for a cooperating annotator -- an assignment that is
accompanied by an annotation that is time-stamped as more recent than
any existing annotations is automatically accepted.

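<p>
One possible reading of these acceptance rules is sketched below. It is an interpretation, not the SEED's implementation: it assumes that a cooperating annotator is recognized by the <b>master:</b> prefix, that "existing annotations" means those already attached to the same feature, and that time stamps are comparable numbers. The names <code>Decision</code> and <code>decide</code> are invented for the sketch.

```python
from typing import NamedTuple


class Decision(NamedTuple):
    make_assignment: bool
    generate_annotation: bool


def decide(author, incoming_stamp, existing_stamps, user_accepts):
    """Decide what to do with one transferred assignment."""
    cooperating = author.startswith("master:")
    newest = max(existing_stamps, default=None)
    # Short-cut: a cooperating annotator's assignment whose accompanying
    # annotation is newer than every existing one is accepted automatically.
    shortcut = (cooperating and incoming_stamp is not None
                and (newest is None or incoming_stamp > newest))
    accepted = shortcut or user_accepts
    if not accepted:
        return Decision(False, False)
    # A new annotation is generated only for non-cooperating authors.
    return Decision(True, not cooperating)


print(decide("master:alice", 300, [100, 200], user_accepts=False))
print(decide("carol", 300, [100], user_accepts=True))
```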
<h2>The Introduction of a New Release of the SEED</h2>

Introduction of a new release of the SEED is conceptually the same as
<ol>
<li>
considering the new release as the current system (with minimal
assignments and annotations),
<li>
treating the old version of the SEED as the source of all of its
existing assignments and annotations that were made since the
completion of the installation of the last release, and
<li>
accepting all assignments that would introduce changes (i.e., that do
not match the assignments supplied with the release); you
will not be asked whether or not you wish to accept them.
</ol>

