<h1>Analysis of Variation Within a Set of Closely Related Genomes</h1>

Both the SEED and the SPROUT will include features designed to support
analysis of variation between closely related genomes. For the
record, we expect to see 20-100 closely related genomes for a number
of pathogens and production strains by early 2007. It is possible
that we will see such collections even earlier. Within a very
few more years it will be commonplace to see hundreds of closely
related strains. The driving force will be the plummeting costs of
sequencing and resequencing. The sole issue is how much can be
learned from such collections. If such efforts actually produce
better production strains, new antibiotics, or new vaccines, there is
no question that the sequencing costs can easily be justified.

<h2>Major and Minor Variation: the Basic Concepts</h2>

Consider a set of very closely related genomes. There will be regions
in which all of the genomes contain very nearly identical sequence for
long stretches, although each genome will contain insertions,
deletions and rearrangements. A <b>block</b> is a corresponding section from two or
more genomes that can easily be accurately aligned. Often the
sequence similarity is near 100%. An <b>unaligned sequence</b> is a
sequence from one of the genomes that has not been placed in any block.
From an initial collection of similar genomes, we will
form a collection of blocks and unaligned sequences such that:
<ul>
<li>
each genome can be viewed as a sequence of entries from blocks and
unaligned sequences,
<li>
if part of a gene occurs within a block, the whole gene occurs within
the block,
<li>
each block contains at most one entry from any single genome, and
<li>
whenever a block can be constructed (rather than leaving a collection
of unaligned sequences), it will be formed.
</ul>
<br>
Note that these conditions are somewhat imprecise and do not, as
they stand, force a unique collection of blocks and unaligned
sequences. However, for our purposes, this is not critical. One of
the most serious ambiguities involves the notion of "accurately
aligned". On the one extreme, we might force blocks to be free of
indels. On the other, we might allow fairly substantial stretches
of indels (say, up to 20). The exact parameters that we use will be
important, but they represent a detail to be determined as we proceed.
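<p>
As a rough illustration, two of the conditions above (at most one entry per
genome in a block, and genes never split by a block boundary) can be checked
mechanically. The sketch below is only one possible reading: the
<i>Region</i> type, its inclusive coordinates, and the function names are
assumptions made for this example, not part of the SEED or the SPROUT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A stretch of one contig; used here for block entries and for genes.

    Hypothetical type for this sketch. Coordinates are assumed inclusive,
    with beg <= end.
    """
    genome: str
    contig: str
    beg: int
    end: int


def one_entry_per_genome(block):
    """Condition: each block contains at most one entry from any genome."""
    genomes = [entry.genome for entry in block]
    return len(genomes) == len(set(genomes))


def gene_respected(entry, gene):
    """Condition: a gene is either wholly inside the entry or not in it at all."""
    if entry.genome != gene.genome or entry.contig != gene.contig:
        return True
    overlaps = gene.beg <= entry.end and entry.beg <= gene.end
    inside = entry.beg <= gene.beg and gene.end <= entry.end
    return inside or not overlaps


def block_is_valid(block, genes):
    """Check both conditions for one block against a list of genes."""
    return one_entry_per_genome(block) and all(
        gene_respected(entry, gene) for entry in block for gene in genes
    )
```

The remaining two conditions (full coverage of each genome, and forming a
block whenever one can be formed) depend on the alignment parameters discussed
above and are not addressed by this sketch.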
<p>
Blocks in which one or more genomes are not represented, together with all
unaligned sequences, are thought of as <b>major variations</b>.
Columns within a block that contain differing characters are thought
of as <b>minor variations</b> or <b>SNP</b>s.
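<p>
To make the two definitions concrete, here is a minimal sketch that works on
an already-aligned block, represented as a mapping from genome ID to an
aligned string. That representation, the function names, and the choice to
compare characters exactly as given (so a gap character would simply count as
a differing character) are assumptions of the example.

```python
def snp_offsets(rows):
    """Return the offsets of columns whose characters are not all identical.

    rows maps a genome ID to its aligned sequence within one block;
    all sequences are assumed to have the same length.
    """
    seqs = list(rows.values())
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("aligned sequences in a block must have equal length")
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


def is_major_variation(rows, all_genomes):
    """A block is a major variation if some genome is not represented in it.

    A single-entry block (an unaligned sequence) qualifies whenever the
    collection holds more than one genome.
    """
    return set(rows) != set(all_genomes)
```

For example, a block with rows <i>ACGTAC</i>, <i>ACGAAC</i> and <i>ACGTAT</i>
has minor variations at offsets 3 and 5 (counting from 0).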
<p>
Finally, we can dispense with the concept of "unaligned sequence" if
we allow blocks containing a single entry. I think that it is useful
to leave the discussion above in terms of blocks and unaligned
sequences and to treat this merger as an implementation detail.

<h2>The SPROUT Entities and Relationships Needed to Support Variation</h2>

We will be using the SEED and the SPROUT to connect variation to
phenotype. The discussion so far supports the addition of the
following SPROUT entities and relationships:
<ul>
<li><b>Block</b>,
<ul>
<li><b>Block ContainsSubsequenceFrom Genome</b> with three fields of
intersection data (<i>Contig, Beg, End</i>)
</ul>

<li><b>MajorVariation</b>,
<ul>
<li><b>MajorVariation IsExposedBy Block</b>
</ul>
<li><b>SNP</b>,
<ul>
<li><b>SNP IsMinorVariationIn Block</b> with one field of intersection
data (<i>Offset</i> into the Block), and
<li><b>SNP HasCharacterIn Genome</b> with one field of intersection
data (<i>Value</i>, which is the character in the genome)
</ul>
</ul>
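<p>
One way to picture these entities and relationships is as relational tables.
The sketch below uses an in-memory SQLite database purely for illustration; it
is not how the SPROUT stores its data. The entity, relationship and
intersection-field names come from the list above, while the <i>id</i>,
<i>block_id</i>, <i>genome_id</i>, <i>snp_id</i> and
<i>major_variation_id</i> columns and the two helper functions are
assumptions.

```python
import sqlite3

# Entity and relationship names follow the list above; the key columns are
# hypothetical. Intersection fields are quoted because End and Offset are
# SQL keywords.
SCHEMA = """
CREATE TABLE Block (id TEXT PRIMARY KEY);
CREATE TABLE ContainsSubsequenceFrom (
    block_id  TEXT REFERENCES Block(id),
    genome_id TEXT,
    "Contig"  TEXT,
    "Beg"     INTEGER,
    "End"     INTEGER
);
CREATE TABLE MajorVariation (id TEXT PRIMARY KEY);
CREATE TABLE IsExposedBy (
    major_variation_id TEXT REFERENCES MajorVariation(id),
    block_id           TEXT REFERENCES Block(id)
);
CREATE TABLE SNP (id TEXT PRIMARY KEY);
CREATE TABLE IsMinorVariationIn (
    snp_id   TEXT REFERENCES SNP(id),
    block_id TEXT REFERENCES Block(id),
    "Offset" INTEGER
);
CREATE TABLE HasCharacterIn (
    snp_id    TEXT REFERENCES SNP(id),
    genome_id TEXT,
    "Value"   TEXT
);
"""


def open_db():
    """Create an empty in-memory database holding the sketch schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def characters_at(db, snp_id):
    """Return {genome_id: character} for one SNP."""
    rows = db.execute(
        'SELECT genome_id, "Value" FROM HasCharacterIn WHERE snp_id = ?',
        (snp_id,),
    )
    return dict(rows.fetchall())
```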
<br>
A key component of the analysis required to do this will
be to correlate variations with affected genes. This would argue for
the addition of the following relationships:
<ul>
<li><b>Feature ParticipatesInMajorVariationWith MajorVariation</b>,
<li><b>Feature ContainsSNP SNP</b>,
<li><b>Feature HasUpStreamImpactedByMajorVariation MajorVariation</b>,
<li><b>Feature HasDownStreamImpactedByMajorVariation MajorVariation</b>,
<li><b>Feature HasUpStreamImpactedBySNP SNP</b>, and
<li><b>Feature HasDownStreamImpactedBySNP SNP</b>
</ul>
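<p>
Populating the SNP relationships amounts to comparing the position of each
SNP with the coordinates of nearby features. A minimal sketch follows. The
200-position window that defines "upstream" and "downstream", the strand
convention, and the function name are all assumptions; the note above does
not say how far the impact of a variation is taken to extend.

```python
def relate_snp_to_feature(feature_beg, feature_end, strand, snp_pos, window=200):
    """Name the relationship (if any) between a feature and a SNP position.

    feature_beg <= feature_end are contig coordinates, strand is '+' or '-',
    and window is a hypothetical cutoff for the flanking regions.
    Returns a relationship name from the list above, or None.
    """
    if feature_beg <= snp_pos <= feature_end:
        return "ContainsSNP"
    before = feature_beg - window <= snp_pos < feature_beg
    after = feature_end < snp_pos <= feature_end + window
    if before:
        # Lower coordinates are upstream only for a forward-strand feature.
        return "HasUpStreamImpactedBySNP" if strand == "+" else "HasDownStreamImpactedBySNP"
    if after:
        return "HasDownStreamImpactedBySNP" if strand == "+" else "HasUpStreamImpactedBySNP"
    return None
```

The same comparison, applied to the endpoints of a major variation rather
than to a single position, would serve for the three MajorVariation
relationships.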

<h2>Relating Phenotype to Variation</h2>

The central goal of encoding variation in genomes will be to relate
these differences to differences in phenotype. To understand what is
needed requires that we first consider some typical biological
experiments. Then, when we consider how to encode the relationships
between phenotype and variation, whatever choices we select should at
least adequately handle these prototypical examples:
<ul>
<li>Biologists frequently perform knockouts and then measure changes
between those organisms with and without the knockout. Sometimes this
is done for a single gene, and the result is summarized in a journal
article. Sometimes it is methodically done for large numbers of genes
and the changes being examined are things like "essentiality" or
"virulence". Sometimes the measurements record profiles of
expression under differing conditions for the two organisms.
<li>A closely related type of data is acquired when one or more genes
are "added to a genome" (either via actual integration or by the
addition of a plasmid), and measurements are taken comparing the
derived genome with the original genome.
<li>A third type of data amounts to taking a large number of
unsequenced organisms, recording the presence or absence of a specific
known variation, and attaching some measurement of phenotype.
<li>A final type of data amounts to taking a small set of strains from
the set of those that have been sequenced and acquiring numerous
measurements, each reflecting an aspect of phenotype.
</ul>

<h2>The Issue of What Is Meant by a Genome?</h2>

The SEED and SPROUT model of a genome amounts to a collection of
contigs that cover at least 90-95% of the actual genome (at least this
is the SPROUT model -- this amounts to the SEED model of a <i>complete
genome</i>, which clearly includes those that are only nearly
complete). When people discuss knockouts or measurements against
hundreds of unsequenced clinical isolates, clearly some new issues are
appearing. How should these cases be handled?
<p>
Perhaps a good place to start, as in many things bacterial, is with
<i>E. coli</i>. We now have, thanks to the efforts of Mori's group in
Japan, a set of knockouts for a majority of the <i>E. coli</i> genes.
This amounts to a collection of thousands of distinct strains, each
carefully constructed to include all but one gene from a sequenced
strain (i.e., each of the distinct strains is believed to be
essentially identical to a known ancestor, except for the disruption
of a single gene). Suppose that we have measured some property of
each of these thousands of strains. How should we represent the
results?
<p>
To begin with, we need to accept that we are dealing with thousands of
distinct genomes. They need to have distinct IDs and be treated as
different. On the other hand, we do not have the precise sequence of
these thousands of genomes. Rather, we think of each of them as "the
common genome with gene X disrupted". That is, in most aspects they
are viewed as completely identical to the original strain.
We do have the option of generating the conceptual genome
(at least a pretty good approximation of it) and adding it to a
running version of the SEED or the SPROUT. However, there is another approach.
I propose adding a single entity (the recipe of how a
genome was constructed) and a number of
relationships to the SPROUT model:
<ul>
<li><b>DerivationChanges</b>
<ul>
<li><b>Genome WasConstructedBy DerivationChanges</b>
<li><b>DerivationChanges IncludeAdditionOf Feature</b>
<li><b>DerivationChanges IncludeDeletionOf Feature</b>
<li><b>DerivationChanges WereAppliedTo Genome</b>
</ul>
</ul>

<p>
In this case, the newly generated <i>Genome</i> is not related to any
<i>Features</i> or <i>Contigs</i>. Rather, it is related to a recipe,
which has the necessary relationships to the original genome.
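<p>
The recipe idea can be illustrated by computing the conceptual feature set of
a derived genome on demand rather than storing it. In the sketch below, the
<i>DerivationChanges</i> fields mirror the relationships listed above; the
dictionary-based lookup, the function name, and the strain and feature IDs in
the usage note are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DerivationChanges:
    """Recipe for one derived genome (field names are assumptions)."""
    applied_to: str                                # WereAppliedTo Genome
    additions: set = field(default_factory=set)    # IncludeAdditionOf Feature
    deletions: set = field(default_factory=set)    # IncludeDeletionOf Feature


def features_of(genome_id, sequenced, recipes):
    """Return the feature IDs of a genome, following recipes where needed.

    sequenced maps a sequenced genome ID to its set of feature IDs;
    recipes maps a derived genome ID to its DerivationChanges
    (the WasConstructedBy relationship).
    """
    if genome_id in sequenced:
        return set(sequenced[genome_id])
    recipe = recipes[genome_id]
    parent = features_of(recipe.applied_to, sequenced, recipes)
    return (parent - recipe.deletions) | recipe.additions
```

With a sequenced strain <i>K12</i> holding features <i>geneA</i>,
<i>geneB</i> and <i>geneC</i>, a knockout recipe that deletes <i>geneB</i>
yields <i>geneA</i> and <i>geneC</i> for the derived genome. Because a recipe
may itself be applied to a derived genome, a plasmid addition on top of the
knockout resolves through both recipes.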

<h2>Back to the Issue of Connecting Phenotype to Variation</h2>

In my view, there is no single "right" way to encode phenotype;
there are many ways and none that I know of is clearly superior. I am
going to propose an approach which has the redeeming characteristics of
being both simple and general.
<p>
I will begin by thinking of a phenotype as no more than a 3-tuple that
can be attached as a tag to one or more genomes. The fields in the
tuple are
<ol>
<li>an arbitrary <i>label</i>,
<li>a single floating-point number, and
<li>a single text field.
</ol>
For boolean attributes (such as <i>essential</i> or
<i>gram-positive</i>), the floating-point number will simply be 0 or
1. In other cases it will reflect a numeric value to be associated
with one or more genomes. That is, I have essentially reduced the
notion of phenotype to the quite different notion of a
<i>measurement</i>, and assume that any single measurement can apply
to an arbitrary set of genomes. In this case, we assume that
variations connected to the distinct genomes will ultimately explain
the differences in measurement. Sorting out the relevant causal
relationships becomes one of the central issues of this century.
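<p>
The 3-tuple and its attachment to a set of genomes can be written down
directly. In this sketch the field names <i>value</i> and <i>text</i>, the
in-memory index, and the helper functions are assumptions; only the
<i>label</i> field is named in the text above.

```python
from typing import NamedTuple


class Measurement(NamedTuple):
    label: str    # arbitrary label, e.g. "essential"
    value: float  # 0.0 or 1.0 for boolean attributes, otherwise a numeric value
    text: str     # free text


def attach(index, measurement, genomes):
    """Attach one measurement to every genome in the given collection.

    index maps a genome ID to the list of measurements tagged onto it.
    """
    for genome in genomes:
        index.setdefault(genome, []).append(measurement)


def value_of(index, genome, label):
    """Return the first value recorded under label for a genome, or None."""
    for measurement in index.get(genome, []):
        if measurement.label == label:
            return measurement.value
    return None
```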
<p>
However, we may wish to connect measurements with variation, as well.
That is, once we have catalogued the variations within a family of
closely related genomes, we may wish to attach measurements to the
variations themselves. Essentially, we wish to be able to encode
statements of the form "of the set of X genomes that have variation Y,
P had the following measurement and N did not". This makes sense for
boolean measurements. Non-boolean properties would be handled (less
than elegantly) by dividing a range of potential values into discrete
"buckets" and then considering presence or absence of a value in a
bucket as a boolean property.
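<p>
The "P had it and N did not" statement, including the bucketing of
non-boolean values, can be sketched as a simple tally. The half-open bucket
(lower bound included, upper bound excluded), the decision to skip genomes
with no recorded value, and the function names are assumptions of the
example.

```python
def in_bucket(value, low, high):
    """Boolean view of a numeric value: does it fall in the bucket [low, high)?"""
    return low <= value < high


def tally(genomes_with_variation, values, low, high):
    """Count P (value in the bucket) and N (value outside it).

    genomes_with_variation lists the genomes that carry variation Y;
    values maps a genome ID to its numeric measurement. Genomes with no
    recorded value are counted in neither P nor N.
    """
    p = n = 0
    for genome in genomes_with_variation:
        if genome not in values:
            continue
        if in_bucket(values[genome], low, high):
            p += 1
        else:
            n += 1
    return p, n
```

A boolean measurement stored as 0 or 1 fits the same tally by using a bucket
that contains only the value 1, for instance the range from 0.5 to 1.5.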
<p>
These considerations cause me to propose the following additions to
the existing SPROUT model:
<ul>
<li><b>Measurement</b>, which will include fields encoding the
<i>label</i>, numeric value, and arbitrary text.
<ul>
<li><b>Genome HasMeasurement Measurement</b>
<li><b>Measurement CorrelatesWithMajorVariation MajorVariation</b>
<li><b>Measurement CorrelatesWithSNP SNP</b>
</ul>
</ul>

This appears to me to be a reasonable approach. I solicit comments.
