<!-- ViewVC annotation: revision 1.1, parrello -->
<h1>Analysis of Variation Within a Set of Closely Related Genomes</h1>

Both the SEED and the SPROUT will include features designed to support
analysis of variation between closely related genomes. For the
record, we expect to see 20-100 closely related genomes for a number
of pathogens and production strains by early 2007. It is possible
that we will see such collections even earlier. Within a very
few more years it will be commonplace to see hundreds of closely
related strains. The driving force will be the plummeting costs of
sequencing and resequencing. The sole issue is how much can be
learned from such collections. If such efforts actually produce
better production strains, new antibiotics, or new vaccines, there is
no question that the sequencing costs can easily be justified.

<h2>Major and Minor Variation: the Basic Concepts</h2>

Consider a set of very closely related genomes. There will be regions
in which all of the genomes contain very nearly identical sequence for
long stretches, although each genome will contain insertions,
deletions and rearrangements. A <b>block</b> is a corresponding section from two or
more genomes that can easily be accurately aligned. Often the
sequence similarity is near 100%. An <b>unaligned sequence</b> is a
sequence from one of the genomes.
From an initial collection of similar genomes, we will
form a collection of blocks and unaligned sequences such that:
<ul>
<li>
each genome can be viewed as a sequence of entries from blocks and
unaligned sequences,
<li>
if part of a gene occurs within a block, the whole gene occurs within
the block,
<li>
each block contains at most one entry from any single genome, and
<li>
whenever a block can be constructed (rather than leaving a collection
of unaligned sequences), it will be formed.
</ul>
<br>
Note that these four conditions are somewhat imprecise and do not, as
they stand, force a unique collection of blocks and unaligned
sequences. However, for our purposes, this is not critical. One of
the most serious ambiguities involves the notion of "accurately
aligned". At one extreme, we might force blocks to be free of
indels. At the other, we might allow fairly substantial stretches
of indels (say, up to 20). The exact parameters that we use will be
important, but they represent a detail to be determined as we proceed.
<p>
All unaligned sequences, together with blocks in which one or more
genomes are not represented, are thought of as <b>major variations</b>.
Columns within a block that contain differing characters are thought
of as <b>minor variations</b> or <b>SNP</b>s.
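<p>
As a rough illustration of these definitions, the following Python sketch
models a block as a set of aligned rows and derives major and minor
variations from it. The names (<code>Block</code>, <code>add</code>,
<code>is_major_variation</code>, <code>snp_columns</code>) are hypothetical
and not part of the SEED or the SPROUT. The sketch assumes the rows have
already been aligned to equal length, with <code>-</code> marking an indel.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Aligned rows, one per genome (hypothetical in-memory model)."""
    rows: dict[str, str] = field(default_factory=dict)  # genome ID -> aligned sequence

    def add(self, genome_id: str, aligned: str) -> None:
        # Enforce "each block contains at most one entry from any single genome".
        if genome_id in self.rows:
            raise ValueError(f"block already has an entry for {genome_id}")
        # Assumption: rows are pre-aligned, so they must all have the same length.
        if self.rows and len(aligned) != len(next(iter(self.rows.values()))):
            raise ValueError("aligned rows must have equal length")
        self.rows[genome_id] = aligned

    def is_major_variation(self, all_genomes: set[str]) -> bool:
        # A block in which one or more genomes are not represented.
        return not all_genomes.issubset(self.rows)

    def snp_columns(self) -> list[int]:
        # 0-based columns whose characters differ; note that a column
        # containing an indel marker ('-') is also reported here.
        seqs = list(self.rows.values())
        return [i for i, col in enumerate(zip(*seqs)) if len(set(col)) > 1]


block = Block()
block.add("g1", "ACGT")
block.add("g2", "ACGA")
block.add("g3", "ACGT")
```

With these three rows, only the last column differs, so
<code>block.snp_columns()</code> reports a single minor variation. The block
counts as a major variation only when the genome set contains a genome that
has no row in it.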
<p>
Finally, we can dispense with the concept of "unaligned sequence" if
we allow blocks containing a single entry. I think that it is useful
to leave the discussion above in terms of blocks and unaligned
sequences and to treat this merger as an implementation detail.

<h2>The SPROUT Entities and Relationships Needed to Support Variation</h2>

We will be using the SEED and the SPROUT to connect variation to
phenotype. The discussion so far supports the addition of the
following SPROUT entities and relationships:
<ul>
<li><b>Block</b>,
<ul>
<li><b>Block ContainsSubsequenceFrom Genome</b> with three fields of
intersection data (<i>Contig, Beg, End</i>)
</ul>

<li><b>MajorVariation</b>,
<ul>
<li><b>MajorVariation IsExposedBy Block</b>
</ul>
<li><b>SNP</b>
<ul>
<li><b>SNP IsMinorVariationIn Block</b> with one field of intersection
data (<i>Offset</i> into the Block), and
<li><b>SNP HasCharacterIn Genome</b> with one field of intersection
data (<i>Value</i>, which is the character in the genome)
</ul>
</ul>
<br>
A key component of the analysis required to do this will
be to correlate variations with affected genes. This would argue for
the addition of the following relationships:
<ul>
<li><b>Feature ParticipatesInMajorVariationWith MajorVariation</b>,
<li><b>Feature ContainsSNP SNP</b>,
<li><b>Feature HasUpStreamImpactedByMajorVariation MajorVariation</b>,
<li><b>Feature HasDownStreamImpactedByMajorVariation MajorVariation</b>,
<li><b>Feature HasUpStreamImpactedBySNP SNP</b>, and
<li><b>Feature HasDownStreamImpactedBySNP SNP</b>
</ul>
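<p>
One way the SNP-to-feature relationships above might be derived is sketched
below in Python. Everything here is an assumption made for illustration: the
record types, the function names, the 200-base flanking window, and the
simplification that blocks are indel-free and that both the block entry and
the feature lie on the forward strand (so a block <i>Offset</i> maps linearly
onto a contig position, and "upstream" means lower coordinates).

```python
from typing import NamedTuple, Optional


class BlockEntry(NamedTuple):
    """One row of Block ContainsSubsequenceFrom Genome (Contig, Beg, End)."""
    genome: str
    contig: str
    beg: int  # assumed 1-based, inclusive
    end: int


class Feature(NamedTuple):
    fid: str
    contig: str
    beg: int
    end: int


def snp_position(entry: BlockEntry, offset: int) -> int:
    """Contig coordinate of a 0-based block offset (indel-free, forward strand)."""
    return entry.beg + offset


def classify(feature: Feature, contig: str, pos: int, flank: int = 200) -> Optional[str]:
    """Name the relationship, if any, between a feature and a SNP position.

    `flank` is a hypothetical window size, not a value from the design.
    """
    if feature.contig != contig:
        return None
    if feature.beg <= pos <= feature.end:
        return "ContainsSNP"
    if feature.beg - flank <= pos < feature.beg:
        return "HasUpStreamImpactedBySNP"
    if feature.end < pos <= feature.end + flank:
        return "HasDownStreamImpactedBySNP"
    return None


entry = BlockEntry("g1", "c1", 1000, 1999)
gene = Feature("f1", "c1", 1100, 1400)
relation = classify(gene, entry.contig, snp_position(entry, 50))
```

A real implementation would also have to handle reverse-strand features and
blocks that contain indels, both of which this sketch ignores.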

<h2>Relating Phenotype to Variation</h2>

The central goal of encoding variation in genomes will be to relate
these differences to differences in phenotype. To understand what is
needed, we must first consider some typical biological
experiments. Then, when we consider how to encode the relationships
between phenotype and variation, whatever choices we make should at
least adequately handle these prototypical examples:
<ul>
<li>Biologists frequently perform knockouts and then measure changes
between those organisms with and without the knockout. Sometimes this
is done for a single gene, and the result is summarized in a journal
article. Sometimes it is methodically done for large numbers of genes
and the changes being examined are things like "essentiality" or
"virulence". Sometimes the measurements record profiles of
expression under differing conditions for the two organisms.
<li>A closely related type of data is acquired when one or more genes
are "added to a genome" (either via actual integration or by the
addition of a plasmid), and measurements are taken comparing the
derived genome with the original genome.
<li>A third type of data amounts to taking a large number of
unsequenced organisms, recording the presence or absence of a specific
known variation, and attaching some measurement of phenotype.
<li>A final type of data amounts to taking a small set of strains from
the set of those that have been sequenced and acquiring numerous
measurements, each reflecting an aspect of phenotype.
</ul>

<h2>The Issue of What Is Meant by a Genome?</h2>

The SEED and SPROUT model of a genome amounts to a collection of
contigs that cover at least 90-95% of the actual genome (at least this
is the SPROUT model -- this amounts to the SEED model of a <i>complete
genome</i>, which clearly includes those that are only nearly
complete). When people discuss knockouts or measurements against
hundreds of unsequenced clinical isolates, clearly some new issues are
appearing. How should these cases be handled?
<p>
Perhaps a good place to start, as in many things bacterial, is with
<i>E. coli</i>. We now have, thanks to the efforts of Mori's group in
Japan, a set of knockouts for a majority of the <i>E. coli</i> genes.
This amounts to a collection of thousands of distinct strains, each
carefully constructed to include all but one gene from a sequenced
strain (i.e., each of the distinct strains is believed to be
essentially identical to a known ancestor, except for the disruption
of a single gene). Suppose that we have measured some property of
each of these thousands of strains. How should we represent the
results?
<p>
To begin with, we need to accept that we are dealing with thousands of
distinct genomes. They need to have distinct IDs and be treated as
different. On the other hand, we do not have the precise sequence of
these thousands of genomes. Rather, we think of each of them as "the
common genome with gene X disrupted". That is, in most aspects they
are viewed as completely identical to the original strain.
We do have the option of basically generating the conceptual genome
(at least a pretty good approximation of it) and adding it to a
running version of the SEED or the SPROUT. However, there is another approach.
I propose adding a single entity (the recipe of how a
genome was constructed) and a number of
relationships to the SPROUT model:
<ul>
<li><b>DerivationChanges</b>
<ul>
<li><b>Genome WasConstructedBy DerivationChanges</b>
<li><b>DerivationChanges IncludeAdditionOf Feature</b>
<li><b>DerivationChanges IncludeDeletionOf Feature</b>
<li><b>DerivationChanges WereAppliedTo Genome</b>
</ul>
</ul>

<p>
In this case, the newly-generated <i>Genome</i> is not related to any
<i>Features</i> or <i>Contigs</i>. Rather, it is related to a recipe,
which has the necessary relationships to the original genome.
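<p>
To make the recipe idea concrete, here is a minimal Python sketch of how the
effective feature set of a derived genome could be resolved by following its
recipe back to the original genome. The class layout, the function name
<code>features_of</code>, and the genome and feature IDs are all hypothetical;
the sketch also assumes that a recipe may be applied to a genome that is
itself derived, which the text above does not state.

```python
from dataclasses import dataclass, field


@dataclass
class DerivationChanges:
    """Recipe for constructing one genome from another (hypothetical layout)."""
    applied_to: str                                    # WereAppliedTo Genome
    additions: set[str] = field(default_factory=set)   # IncludeAdditionOf Feature
    deletions: set[str] = field(default_factory=set)   # IncludeDeletionOf Feature


def features_of(genome_id: str,
                base_features: dict[str, set[str]],
                constructed_by: dict[str, DerivationChanges]) -> set[str]:
    """Resolve the feature IDs of a genome, following recipes to the ancestor."""
    recipe = constructed_by.get(genome_id)  # Genome WasConstructedBy DerivationChanges
    if recipe is None:
        # A sequenced genome: its features are stored directly.
        return set(base_features[genome_id])
    parent = features_of(recipe.applied_to, base_features, constructed_by)
    return (parent - recipe.deletions) | recipe.additions


# Made-up IDs for illustration only.
base = {"K12": {"geneA", "geneB", "geneC"}}
recipes = {
    "K12_dB": DerivationChanges("K12", deletions={"geneB"}),
    "K12_dB_plusX": DerivationChanges("K12_dB", additions={"geneX"}),
}
knockout = features_of("K12_dB", base, recipes)
```

The derived genomes store nothing but their recipes. Their feature sets are
computed on demand from the sequenced ancestor, which is what keeps thousands
of knockout strains cheap to represent.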

<h2>Back to the Issue of Connecting Phenotype to Variation</h2>

In my view, there is no single "right" way to encode phenotype;
there are many ways and none that I know of is clearly superior. I am
going to propose an approach which has the redeeming characteristic of
being both simple and general.
<p>
I will begin by thinking of a phenotype as no more than a 3-tuple that
can be attached as a tag to one or more genomes. The fields in the
tuple are
<ol>
<li>an arbitrary <i>label</i>,
<li>a single floating-point number, and
<li>a single text field.
</ol>
For boolean attributes (such as <i>essential</i> or
<i>gram-positive</i>), the floating-point number will simply be 0 or
1. In other cases it will reflect a numeric value to be associated
with one or more genomes. That is, I have essentially reduced the
notion of phenotype to the quite different notion of a
<i>measurement</i>, and assume that any single measurement can apply
to an arbitrary set of genomes. In this case, we assume that
variations connected to the distinct genomes will ultimately explain
the differences in measurement. Sorting out the relevant causal
relationships becomes one of the central issues of this century.
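<p>
The 3-tuple can be sketched in Python as follows. The type name
<code>Measurement</code> matches the entity proposed in this note, but the
field names, the <code>attach</code> helper, the genome IDs, and the example
text are assumptions made for illustration.

```python
from typing import Iterable, NamedTuple


class Measurement(NamedTuple):
    label: str    # arbitrary label, e.g. "essential"
    value: float  # 0 or 1 for boolean attributes, otherwise a numeric value
    text: str     # free-form text


def attach(index: dict[str, list[Measurement]],
           measurement: Measurement,
           genome_ids: Iterable[str]) -> None:
    """Tag every listed genome with the same measurement."""
    for genome_id in genome_ids:
        index.setdefault(genome_id, []).append(measurement)


# Hypothetical genome IDs and note text.
index: dict[str, list[Measurement]] = {}
essential = Measurement("essential", 1.0, "illustrative note")
attach(index, essential, ["g1", "g2"])
```

A single measurement object is shared by every genome it applies to, which
mirrors the assumption that one measurement can tag an arbitrary set of
genomes.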
<p>
However, we may wish to connect measurements with variation, as well.
That is, once we have catalogued the variations within a family of
closely-related genomes, we may wish to attach measurements to the
variations themselves. Essentially, we wish to be able to encode
statements of the form "of the set of X genomes that have variation Y,
P had the following measurement and N did not". This makes sense for
boolean measurements. Non-boolean properties would be handled (less
than elegantly) by dividing a range of potential values into discrete
"buckets" and then considering presence or absence of a value in a
bucket as a boolean property.
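<p>
The bucketing idea can be illustrated with a short Python sketch that
produces the "P had it and N did not" counts for one bucket. The function
names, the cut points, and the sample values are invented for the example. It
assumes every genome carrying the variation has a recorded value, and that a
value equal to a cut point falls into the higher bucket.

```python
import bisect


def bucket_index(value: float, edges: list[float]) -> int:
    """Index of the bucket holding `value`, given ascending cut points.

    With edges [0.5, 1.0]: bucket 0 is value < 0.5, bucket 1 is
    0.5 <= value < 1.0, and bucket 2 is value >= 1.0.
    """
    return bisect.bisect_right(edges, value)


def tally(genomes_with_variation: list[str],
          values: dict[str, float],
          edges: list[float],
          bucket: int) -> tuple[int, int]:
    """Return (P, N): genomes whose value is in `bucket`, and those whose is not."""
    p = sum(1 for g in genomes_with_variation
            if bucket_index(values[g], edges) == bucket)
    return p, len(genomes_with_variation) - p


# Invented measurements for four hypothetical genomes.
values = {"g1": 0.2, "g2": 0.7, "g3": 0.9, "g4": 1.4}
p, n = tally(["g1", "g2", "g3"], values, edges=[0.5, 1.0], bucket=1)
```

Here three genomes carry the variation. Two of them fall in the middle
bucket and one does not, so the statement encoded is "of the 3 genomes that
have the variation, 2 had the measurement and 1 did not".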
<p>
These considerations cause me to propose the following additions to
the existing SPROUT model:
<ul>
<li><b>Measurement</b>, which will include fields encoding the
<i>label</i>, numeric value, and arbitrary text.
<ul>
<li><b>Genome HasMeasurement Measurement</b>
<li><b>Measurement CorrelatesWithMajorVariation MajorVariation</b>
<li><b>Measurement CorrelatesWithSNP SNP</b>
</ul>
</ul>

This appears to me to be a reasonable approach. I solicit comments.