[Bio] / Sprout / BioWords.pm Repository:
ViewVC logotype

Annotation of /Sprout/BioWords.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.3 - (view) (download) (as text)

1 : parrello 1.1 #!/usr/bin/perl -w
2 :    
3 :     #
4 :     # Copyright (c) 2003-2006 University of Chicago and Fellowship
5 :     # for Interpretations of Genomes. All Rights Reserved.
6 :     #
7 :     # This file is part of the SEED Toolkit.
8 :     #
9 :     # The SEED Toolkit is free software. You can redistribute
10 :     # it and/or modify it under the terms of the SEED Toolkit
11 :     # Public License.
12 :     #
13 :     # You should have received a copy of the SEED Toolkit Public License
14 :     # along with this program; if not write to the University of Chicago
15 :     # at info@ci.uchicago.edu or the Fellowship for Interpretation of
16 :     # Genomes at veronika@thefig.info or download a copy from
17 :     # http://www.theseed.org/LICENSE.TXT.
18 :     #
19 :    
20 :     package BioWords;
21 :    
22 :     use strict;
23 :     use Tracer;
24 :    
25 :     =head1 BioWords Package
26 :    
27 :     Microbiological Word Conflation Helper
28 :    
29 :     =head2 Introduction
30 :    
31 :     This object is in charge of managing keywords used to search the database. Its
32 :     purpose is to insure that if a user types something close to the correct word, a
33 :     usable result will be returned.
34 :    
35 :     A keyword string consists of words separated by delimiters. A I<word> is an
36 :     uninterrupted sequence of letters, semidelimiters (currently only C<'>) and digits.
37 :     A word that begins with a letter is called a I<real word>. For each real word we
38 :     produce two alternate forms. The I<stem> represents the root form of the word
39 :     (e.g. C<skies> to C<ski>, C<following> to C<follow>). The I<phonex> is computed
40 :     from the stem by removing the vowels and equating consonants that produce similar
41 :     sounds. It is likely a mispelled word will have the same phonex as its real form.
42 :    
43 :     In addition to computing stems and phonexes, this object also I<cleans> a
44 :     keyword. I<Cleaning> consists of converting upper-case letters to lower case and
45 :     converting certain delimiters. In particular, bar (C<|>), colon (C<:>), and
46 :     semi-colon (C<;>) are converted to a single quote (C<'>) and period (C<.>) and
47 :     hyphen (C<->) are converted to underscore (C<_>). The importance of this is that
48 :     the single quote and underscore are considered word characters by the search
49 :     software. The cleaning causes the names of chemical compounds and the IDs of
50 :     features and genomes to behave as words when searching.
51 :    
52 :     Search words must be at least three characters long, so the stem of a real word with
53 :     only three letters is the word itself, and any real word with only two letters
54 :     is discarded. In addition, there is a list of I<stop words> that are discarded
55 :     by the keyword search. These will have an empty string for the stem and phonex.
56 :    
57 :     Note that the stemming algorithm differs from the standard for English because
58 :     of the use of Greek and Latin words in chemical compound names and genome
59 :     taxonomies. The algorithm has been evolving in response to numerous experiments
60 :     and is almost certainly not in its last iteration.
61 :    
62 :     The fields in this object are as follows.
63 :    
64 :     =over 4
65 :    
66 :     =item stems
67 :    
68 :     Hash of the stems found so far. This is cleared by L</AnalyzeSearchExpression>,
69 :     so it can be used by clients to determine the number of search expressions
70 :     containing a particular stem.
71 :    
72 :     =item cache
73 :    
74 :     Reference to a hash that maps a pure word to a hash containing its stem, a count of
75 :     the number of times it has occurred, and its phonex. The hash is also used to keep
76 :     exceptions (which map to their predetermined stem) and stop words (which map to an
77 :     empty string). The cache should only be used when the number of words being
78 :     processed is small. If multiple millions of words are put into the cache, it
79 :     causes the application to hang.
80 :    
81 :     =item stopFile
82 :    
83 :     The name of a file containing the stop word list, one word per line. The stop
84 :     word file is read into the cache the first time we try to stem a pure word.
85 :     Once the file is read, this field is cleared so that we know it's handled.
86 :    
87 :     =item exceptionFile
88 :    
89 :     The name of a file containing exception rules, one rule per line. Each rule
90 :     consists of a space-delimited list of words followed by a single stem. The
91 :     exception file is read into the cache the first time we try to stem a pure word.
92 :     Once the file is read, this field is cleared so that we know it's handled.
93 :    
94 :     =item cacheFlag
95 :    
96 :     TRUE if incoming words should be cached, else FALSE.
97 :    
98 :     =item VOWEL
99 :    
100 :     The list of vowel characters (lower-case). This defaults to the value of the
101 :     compile-time constant VOWELS, but may be overridden by the constructor.
102 :    
103 :     =item LETTER
104 :    
105 :     The list of letter characters (lower-case). This defaults to the value of the
106 :     compile-time constant LETTERS, but may be overridden by the constructor. All
107 :     of the vowels should be included in the list of letters.
108 :    
109 :     =item DIGIT
110 :    
111 :     The list of digit characters (lower-case). This defaults to the value of the
112 :     compile-time constant DIGITS, but may be overridden by the constructor.
113 :    
114 :     =item WORD
115 :    
116 :     The list of all word-like characters. This is the union of the letters
117 :     and digits.
118 :    
119 :     =back
120 :    
121 :     We allow configuration of letters, digits, and vowels; but in general the
122 :     stemming and phonex algorithms are aware of the English language and what the
123 :     various letters mean. The main use of the configuration strings is to allow
124 :     flexibility in the treatment of special characters, such as underscore (C<_>) and
125 :     the single quote (C<'>). The defaults have all been chosen fairly carefully based
126 :     on empirical testing, but of course everything is subject to evolution.
127 :    
128 :     =head2 Special Declarations
129 :    
130 :     =head3 EMPTY
131 :    
132 :     The EMPTY constant simply evaluates to the empty string. It makes the stemming
133 :     rules more readable.
134 :    
135 :     =cut
136 :    
137 :     use constant EMPTY => '';
138 :    
139 :     =head3 SHORT
140 :    
141 :     The SHORT constant specifies the minimum length for a word. A word shorter than
142 :     the minimum length is treated as a stop word.
143 :    
144 :     =cut
145 :    
146 :     use constant SHORT => 3;
147 :    
148 :     =head3 VOWELS
149 :    
150 :     String containing the characters that are considered vowels (lower case only).
151 :    
152 :     =cut
153 :    
154 :     use constant VOWELS => q(aeiou_);
155 :    
156 :     =head3 LETTERS
157 :    
158 :     String containing the characters that are considered letters (lower case only).
159 :    
160 :     =cut
161 :    
162 :     use constant LETTERS => q(abcdefghijklmnopqrstuvwxyz_);
163 :    
164 :     =head3 DIGITS
165 :    
166 :     String containing the characters that are considered digits (lower case only).
167 :    
168 :     =cut
169 :    
170 :     use constant DIGITS => q(0123456789');
171 :    
172 :     =head3 new
173 :    
174 :     my $bw = BioWords->new(%options);
175 :    
176 :     Construct a new BioWords object. The following options are supported.
177 :    
178 :     =over 4
179 :    
180 :     =item exceptions
181 :    
182 :     Name of the exception file, or a reference to a hash containing the exception
183 :     rules. The default is to have no exceptions.
184 :    
185 :     =item stops
186 :    
187 :     Name of the stop word file, or a reference to a list containing the stop words.
188 :     The default is to have no stop words.
189 :    
190 :     =item vowels
191 :    
192 :     List of characters to be treated as vowels (lower-case only). The default is
193 :     a compile-time constant.
194 :    
195 :     =item letters
196 :    
197 :     List of characters to be treated as letters (lower-case only). The default is a
198 :     compile-time constant.
199 :    
200 :     =item digits
201 :    
202 :     List of characters to be treated as digits (lower-case only). The default is a
203 :     compile-time constant.
204 :    
205 :     =item cache
206 :    
207 :     If TRUE, then words will be cached when they are processed. If FALSE, the cache
208 :     will only be used for stopwords and exceptions. The default is TRUE.
209 :    
210 :     =back
211 :    
212 :     =cut
213 :    
214 :     sub new {
215 :     # Get the parameters.
216 :     my ($class, %options) = @_;
217 :     # Get the options.
218 : parrello 1.3 my $exceptionOption = $options{exceptions} || "$FIG_Config::sproutData/Exceptions.txt";
219 :     my $stopOption = $options{stops} || "$FIG_Config::sproutData/StopWords.txt";
220 : parrello 1.1 my $vowels = $options{vowels} || VOWELS;
221 :     my $letters = $options{letters} || LETTERS;
222 :     my $digits = $options{digits} || DIGITS;
223 :     my $cacheFlag = (defined $options{cache} ? $options{cache} : 1);
224 :     my $cache = {};
225 :     # Create the BioWords object.
226 :     my $retVal = {
227 :     cache => $cache,
228 :     cacheFlag => $cacheFlag,
229 :     stopFile => undef,
230 :     exceptionFile => undef,
231 :     stems => {},
232 :     VOWEL => $vowels,
233 :     LETTER => $letters,
234 :     DIGIT => $digits,
235 :     WORD => "$letters$digits"
236 :     };
237 :     # Now we need to deal with the craziness surrounding the exception hash and the stop word
238 :     # list, both of which are loaded into the cache before we start processing anything
239 :     # serious. The exceptions and stops could be passed in as hash references, in which case
240 :     # we load them into the cache. Alternatively, they could be file names, which we save
241 :     # to be read in when we need them. So, first, we check for an exception file name.
242 :     if (! ref $exceptionOption) {
243 :     # Here we have a file name. We store it in the object.
244 :     $retVal->{exceptionFile} = $exceptionOption;
245 :     } else {
246 :     # Here we have a hash. Slurp it into the cache.
247 :     for my $exceptionWord (keys %{$exceptionOption}) {
248 :     Store($retVal, $exceptionWord, $exceptionOption->{$exceptionWord}, 0);
249 :     }
250 :     }
251 :     # Now we check for a stopword file name.
252 :     if (! ref $stopOption) {
253 :     # Store it in the object.
254 :     $retVal->{stopFile} = $stopOption;
255 :     } else {
256 :     # No file name, so slurp in the list of words.
257 :     for my $stopWord (@{$stopOption}) {
258 :     Stop($retVal, $stopWord);
259 :     }
260 :     }
261 :     # Bless and return the object.
262 :     bless $retVal, $class;
263 :     return $retVal;
264 :     }
265 :    
266 :     =head2 Public Methods
267 :    
268 :     =head3 Stop
269 :    
270 :     $bio->Stop($word);
271 :    
272 :     Denote that a word is a stop word.
273 :    
274 :     =over 4
275 :    
276 :     =item word
277 :    
278 :     Word to be declared as a stop word.
279 :    
280 :     =back
281 :    
282 :     =cut
283 :    
284 :     sub Stop {
285 :     # Get the parameters.
286 :     my ($self, $word) = @_;
287 : parrello 1.3 Trace("$word is a stop word.") if T(4);
288 : parrello 1.1 # Store the stop word.
289 :     $self->{cache}->{$word} = {stem => EMPTY, phonex => EMPTY, count => 0 };
290 :     }
291 :    
292 :     =head3 Store
293 :    
294 :     $bio->Store($word, $stem, $count);
295 :    
296 :     Store a word in the cache. The word will be mapped to the
297 :     specified stem and its count will be set to the specified value. The phonex
298 :     will be computed automatically from the stem. This method can also be used to
299 :     store exceptions. In that case, the count should be C<0>.
300 :    
301 :     =over 4
302 :    
303 :     =item word
304 :    
305 :     Word to be stored.
306 :    
307 :     =item stem
308 :    
309 :     Proposed stem.
310 :    
311 :     =item count
312 :    
313 :     Proposed count. This should be C<0> for exceptions and C<1> for normal
314 :     words. The default is C<1>.
315 :    
316 :     =back
317 :    
318 :     =cut
319 :    
320 :     sub Store {
321 :     # Get the parameters.
322 :     my ($self, $word, $stem, $count) = @_;
323 :     # Default the count.
324 :     my $realCount = (defined $count ? $count : 1);
325 :     # Get the phonex for the specified stem.
326 :     my $phonex = $self->_phonex($stem);
327 :     # Store the word in the cache.
328 :     $self->{cache}->{$word} = { stem => $stem, phonex => $phonex, count => $realCount };
329 :     }
330 :    
331 :     =head3 Split
332 :    
333 :     my @words = $bio->Split($string);
334 :    
335 :     Split a string into keywords. A keyword is this context is either a
336 :     delimiter sequence or a combination of letters, digits, underscores
337 :     (C<_>), and isolated single quotes (C<'>). All letters are converted to
338 :     lower case, and any white space sequence inside the string is converted
339 :     to a single space. Prior to splitting the string, certain strings that
340 :     have special biological meaning are modified, and certain delimiters are
341 :     converted. This helps to resolve some ambiguities (e.g. which alias names
342 :     use colons and which use vertical bars) and makes strings such as EC
343 :     numbers appear to be singleton keywords. The list of keywords we output
344 :     can be rejoined and then passed unmodified to a keyword search; however,
345 :     before doing that the individual pure words should be stemmed and checked
346 :     for spelling.
347 :    
348 :     =over 4
349 :    
350 :     =item string
351 :    
352 :     Input string to process.
353 :    
354 :     =item RETURN
355 :    
356 :     Returns a List of normalized keywords and delimiters.
357 :    
358 :     =back
359 :    
360 :     =cut
361 :    
362 :     sub Split {
363 :     # Get the parameters.
364 :     my ($self, $string) = @_;
365 :     # Convert letters to lower case and collapse the white space. Note that we use the "s" modifier on
366 :     # the substitution so that new-lines are treated as white space, and we take precautions so that
367 :     # an undefined input is treated as a null string (which saves us from compiler warnings).
368 :     my $lowered = (defined($string) ? lc $string : "");
369 :     $lowered =~ s/\s+/ /sg;
370 : parrello 1.3 # Connect the TC prefix to TC numbers.
371 :     $lowered =~ s/TC ((?:\d+|-)(?:\.(?:\d+|-)){3})/TC_$1/g;
372 : parrello 1.1 # Trim the leading space (if any).
373 :     $lowered =~ s/^ //;
374 :     # Fix the periods in EC and TC numbers. Note here we are insisting on real
375 :     # digits rather than the things we treat as digits. We are parsing for real EC
376 :     # and TC numbers, not generalized strings, and the format is specific.
377 :     $lowered =~ s/(\d+|\-)\.(\d+|-)\.(\d+|-)\.(\d+|-)/$1_$2_$3_$4/g;
378 :     # Fix non-trailing periods.
379 :     $lowered =~ s/\.([$self->{WORD}])/_$1/g;
380 :     # Fix non-leading minus signs.
381 :     $lowered =~ s/([$self->{WORD}])[\-]/$1_/g;
382 :     # Fix interior vertical bars and colons
383 :     $lowered =~ s/([$self->{WORD}])[|:]([$self->{WORD}])/$1'$2/g;
384 :     # Now split up the list so that each keyword is in its own string. The delimiters between
385 :     # are kept, so when we're done everything can be joined back together again.
386 :     Trace("Normalized string is -->$lowered<--") if T(4);
387 :     my @pieces = map { split(/([^$self->{WORD}]+)/, $_) } $lowered;
388 :     # The last step is to separate spaces from the other delimiters.
389 :     my @retVal;
390 :     for my $piece (@pieces) {
391 :     while (substr($piece,0,1) eq " ") {
392 :     $piece = substr($piece, 1);
393 :     push @retVal, " ";
394 :     }
395 :     while ($piece =~ /(.+?) (.*)/) {
396 :     push @retVal, $1, " ";
397 :     $piece = $2;
398 :     }
399 :     if ($piece ne "") {
400 :     push @retVal, $piece;
401 :     }
402 :     }
403 :     # Return the result.
404 :     return @retVal;
405 :     }
406 :    
407 :     =head3 Region1
408 :    
409 :     my $root = $bio->Region1($word);
410 :    
411 :     Return the suffix region for a word. This is referred to as I<region 1>
412 :     in the literature on word stemming, and it consists of everything after
413 :     the first non-vowel that follows a vowel.
414 :    
415 :     =over 4
416 :    
417 :     =item word
418 :    
419 :     Lower-case word whose suffix region is desired.
420 :    
421 :     =item RETURN
422 :    
423 :     Returns the suffix region, or the empty string if there is no suffix region.
424 :    
425 :     =back
426 :    
427 :     =cut
428 :    
429 :     sub Region1 {
430 :     # Get the parameters.
431 :     my ($self, $word) = @_;
432 :     # Declare the return variable.
433 :     my $retVal = "";
434 :     # Look for the R1.
435 :     if ($word =~ /[$self->{VOWEL}][^$self->{VOWEL}](.+)/i) {
436 :     $retVal = $1;
437 :     }
438 :     # Return the result.
439 :     return $retVal;
440 :     }
441 :    
442 :     =head3 FindRule
443 :    
444 :     my ($prefix, $suffix, $replacement) = BioWords::FindRule($word, @rules);
445 :    
446 :     Find the appropriate suffix rule for a word. Suffix rules are specified
447 :     as pairs in a list. Syntactically, the rule list may look like a hash,
448 :     but the order of the rules is important, so in fact it is a list. The
449 :     first rule whose key matches the suffix is applied. The part of the word
450 :     before the suffix, the suffix itself, and the value of the rule are all
451 :     passed back to the caller. If no rule matches, the prefix will be the
452 :     entire input word, and the suffix and replacement will be an empty string.
453 :    
454 :     =over 4
455 :    
456 :     =item word
457 :    
458 :     Word to parse. It should already be normalized to lower case.
459 :    
460 :     =item rules
461 :    
462 :     A list of rules. Each rule is represented by two entries in the list-- a suffix
463 :     to match and a value to return.
464 :    
465 :     =item RETURN
466 :    
467 :     Returns a three-element list. The first element will be the portion of the word
468 :     before the matched suffix, the second element will be the suffix itself, and the
469 :     third will be the replacement recommended by the matched rule. If no rule
470 :     matches, the first element will be the whole word and the other two will be
471 :     empty strings.
472 :    
473 :     =back
474 :    
475 :     =cut
476 :    
477 :     sub FindRule {
478 :     # Get the parameters.
479 :     my ($word, @rules) = @_;
480 :     # Declare the return variables.
481 :     my ($prefix, $suffix, $replacement) = ($word, EMPTY, EMPTY);
482 :     # Search for a match. We'll stop on the first one.
483 :     for (my $i = 0; ! $suffix && $i < $#rules; $i += 2) {
484 :     my $len = length($rules[$i]);
485 :     if ($rules[$i] eq substr($word, -$len)) {
486 :     $prefix = substr($word, 0, length($word) - $len);
487 :     $suffix = $rules[$i];
488 :     $replacement = $rules[$i+1];
489 :     }
490 :     }
491 :     # Return the results.
492 :     return ($prefix, $suffix, $replacement);
493 :     }
494 :    
495 :     =head3 Process
496 :    
497 :     my $stem = $biowords->Process($word);
498 :    
499 :     Compute the stem of the specified word and record it in the cache.
500 :    
501 :     =over 4
502 :    
503 :     =item word
504 :    
505 :     Word to be processed.
506 :    
507 :     =item RETURN
508 :    
509 :     Returns the stem of the word (which could be the original word itself. If the word
510 :     is a stop word, returns a null string.
511 :    
512 :     =back
513 :    
514 :     =cut
515 :    
516 :     sub Process {
517 :     # Get the parameters.
518 :     my ($self, $word) = @_;
519 :     # Verify that the cache is initialized.
520 :     my $cache = $self->_initCache();
521 :     # Declare the return variable.
522 :     my $retVal;
523 :     # Get the word in lower case and compute its length.
524 :     my $lowered = lc $word;
525 :     my $len = length $lowered;
526 : parrello 1.3 Trace("Processing \"$lowered\".") if T(4);
527 : parrello 1.1 # Check to see what type of word it is.
528 :     if ($lowered =~ /[^$self->{WORD}]/) {
529 :     # It's delimiters. Return it unchanged and don't record it.
530 :     $retVal = $lowered;
531 :     } elsif ($len < $self->{SHORT}) {
532 :     # It's too short. Treat it as a stop word.
533 :     $retVal = EMPTY;
534 : parrello 1.3 } elsif (exists $cache->{$lowered}) {
535 : parrello 1.1 # It's already in the cache. Get the cache entry.
536 :     my $entry = $cache->{$lowered};
537 :     $retVal = $entry->{stem};
538 :     # If it is NOT a stop word, count it.
539 :     if ($retVal ne EMPTY) {
540 :     $entry->{count}++;
541 :     }
542 :     } elsif ($len == $self->{SHORT}) {
543 :     # It's already the minimum length. The stem is the word itself.
544 :     $retVal = $lowered;
545 :     # Store it if we're using the cache.
546 :     if ($self->{cacheFlag}) {
547 :     $self->Store($lowered, $retVal, 1);
548 :     }
549 :     } else {
550 :     # Here we have a new word. We compute the stem and store it.
551 :     $retVal = $self->_stem($lowered);
552 :     # Store the word if we're using the cache.
553 :     if ($self->{cacheFlag}) {
554 :     $self->Store($lowered, $retVal, 1);
555 :     }
556 :     }
557 :     # We're done. If the stem is non-empty, add it to the stem list.
558 :     if ($retVal ne EMPTY) {
559 :     $self->{stems}->{$retVal} = 1;
560 : parrello 1.2 Trace("\"$word\" stems to \"$retVal\".") if T(3);
561 :     } else {
562 :     Trace("\"$word\" discarded by stemmer.") if T(3);
563 : parrello 1.1 }
564 :     # Return the stem.
565 :     return $retVal;
566 :     }
567 :    
568 :     =head3 IsWord
569 :    
570 :     my $flag = $biowords->IsWord($word);
571 :    
572 :     Return TRUE if the specified string is a word and FALSE if it is a
573 :     delimiter.
574 :    
575 :     =over 4
576 :    
577 :     =item word
578 :    
579 :     String to examine.
580 :    
581 :     =item RETURN
582 :    
583 :     Returns TRUE if the string contains no delimiters, else FALSE.
584 :    
585 :     =back
586 :    
587 :     =cut
588 :    
589 :     sub IsWord {
590 :     # Get the parameters.
591 :     my ($self, $word) = @_;
592 :     # Test the word.
593 :     my $retVal = ($word =~ /^[$self->{WORD}]+$/);
594 :     # Return the result.
595 :     return $retVal;
596 :     }
597 :    
598 :     =head3 StemList
599 :    
600 :     my @stems = $biowords->StemList();
601 :    
602 :     Return the list of stems found in the last search expression.
603 :    
604 :     =cut
605 :    
606 :     sub StemList {
607 :     # Get the parameters.
608 :     my ($self) = @_;
609 :     # Return the keys of the stem hash.
610 :     my @retVal = keys %{$self->{stems}};
611 :     return @retVal;
612 :     }
613 :    
614 :     =head3 StemLookup
615 :    
616 :     my ($stem, $phonex) = $biowords->StemLookup($word);
617 :    
618 :     Return the stem and phonex for the specified word.
619 :    
620 :     =over 4
621 :    
622 :     =item word
623 :    
624 :     Word whose stem and phonex are desired.
625 :    
626 :     =item RETURN
627 :    
628 :     Returns a two-element list. If the word is found in the cache, the
629 :     list will consist of the stem followed by the phonex. If the word
630 :     is a stop word, the list will consist of two empty strings.
631 :    
632 :     =back
633 :    
634 :     =cut
635 :    
636 :     sub StemLookup {
637 :     # Get the parameters.
638 :     my ($self, $word) = @_;
639 :     # Declare the return variables.
640 :     my ($stem, $phonex);
641 :     # Get the cache.
642 :     my $cache = $self->{cache};
643 :     # Check the cache for the word.
644 :     if (exists $cache->{$word}) {
645 :     # It's found. Return its data.
646 :     ($stem, $phonex) = map { $_->{stem}, $_->{phonex} } $cache->{$word};
647 :     } else {
648 :     # It's not found. Compute the stem and phonex.
649 :     my $lowered = lc $word;
650 :     $stem = $self->Process($lowered);
651 :     $phonex = $self->_phonex($stem);
652 :     }
653 :     # Return the results.
654 :     return ($stem, $phonex);
655 :     }
656 :    
657 :     =head3 WordList
658 :    
659 :     my $words = $biowords->WordList($keep);
660 :    
661 :     Return a list of all of the words that were found by
662 :     L</AnalyzeSearchExpression>. Stop words will not be included.
663 :     Because the list could potentially contain millions of words, it is returned
664 :     as a list reference.
665 :    
666 :     =cut
667 :    
668 :     sub WordList {
669 :     # Get the parameters.
670 :     my ($self) = @_;
671 :     # Get the cache.
672 :     my $cache = $self->{cache};
673 :     # Declare the return variable.
674 :     my $retVal;
675 :     # Extract the desired words from the cache.
676 :     $retVal = [ grep { $cache->{$_}->{count} } keys %{$cache} ];
677 :     # Return the result.
678 :     return $retVal;
679 :     }
680 :    
681 :    
682 :     =head3 PrepareSearchExpression
683 :    
684 :     my $searchExpression = $bio->PrepareSearchExpression($string);
685 :    
686 :     Convert an incoming string to a search expression. The string is split
687 :     into pieces, the pieces are stemmed and processed into the cache, and
688 :     then they are rejoined after certain adjustments are made. In particular,
689 :     words without an operator preceding them are prefixed with a plus (C<+>)
690 :     so that they are treated as required words.
691 :    
692 :     =over 4
693 :    
694 :     =item string
695 :    
696 :     Search expression to prepare.
697 :    
698 :     =item RETURN
699 :    
700 :     Returns a modified version of the search expression with words converted to
701 :     stems, stop words eliminated, and plus signs placed before unmodified words.
702 :    
703 :     =back
704 :    
705 :     =cut
706 :    
707 :     sub PrepareSearchExpression {
708 :     # Get the parameters.
709 :     my ($self, $string) = @_;
710 :     # Declare the return variable.
711 :     my $retVal = "";
712 :     # Analyze the search expression.
713 :     my @parts = $self->AnalyzeSearchExpression($string);
714 :     # Now we have to put the pieces back together. At any point, we need
715 :     # to know if we are inside quotes or in the scope of an operator.
716 :     my ($inQuotes, $activeOp) = (0, 0);
717 :     for my $part (@parts) {
718 :     # Is this a word?
719 :     if ($part =~ /[a-z0-9]$/) {
720 :     # Yes. If no operator is present, add a plus.
721 :     if (! $activeOp && ! $inQuotes) {
722 :     $retVal .= "+";
723 :     $activeOp = 0;
724 :     }
725 :     } else {
726 :     # Here we have one or more operators. We process them
727 :     # individually.
728 :     for my $op (split //, $part) {
729 :     if ($op eq '"') {
730 :     # Here we have a quote.
731 :     if ($inQuotes) {
732 :     # A close quote turns off operator scope.
733 :     $inQuotes = 0;
734 :     $activeOp = 0;
735 :     } else {
736 :     # An open quote puts us in quote mode. Words inside
737 : parrello 1.3 # quotes do not need the plus added, but the
738 :     # quote does.
739 : parrello 1.1 $inQuotes = 1;
740 : parrello 1.3 $retVal .= "+";
741 : parrello 1.1 }
742 :     } elsif ($op eq ' ') {
743 :     # Spaces detach us from the preceding operator.
744 :     $activeOp = 0;
745 :     } else {
746 :     # Everything else puts us in operator scope.
747 :     $activeOp = 1;
748 :     }
749 :     }
750 :     }
751 :     # Add this part to the output string.
752 :     $retVal .= $part;
753 :     }
754 :     # Return the result.
755 :     return $retVal;
756 :     }
757 :    
758 :     =head3 AnalyzeSearchExpression
759 :    
760 :     my @list = $bio->AnalyzeSearchExpression($string);
761 :    
762 :     Analyze the components of a search expression and return them to the
763 :     caller. Statistical information about the words in the expression will
764 :     have been stored in the cache, and the return value will be a list of
765 :     stems and delimiters.
766 :    
767 :     =over 4
768 :    
769 :     =item string
770 :    
771 :     Search expression to analyze.
772 :    
773 :     =item RETURN
774 :    
775 :     Returns a list of words and delimiters, in an order corresponding to the
776 :     original expression. Real words will have been converted to stems and
777 :     stop words will have been converted to empty strings.
778 :    
779 :     =back
780 :    
781 :     =cut
782 :    
783 :     sub AnalyzeSearchExpression {
784 :     # Get the parameters.
785 :     my ($self, $string) = @_;
786 :     # Clear the stem list.
787 :     $self->{stems} = {};
788 :     # Normalize and split the search expression.
789 :     my @parts = $self->Split($string);
790 :     # Declare the return variable.
791 :     my @retVal;
792 :     # Now we loop through the parts, processing them.
793 :     for my $part (@parts) {
794 :     my $stem = $self->Process($part);
795 :     push @retVal, $stem;
796 :     Trace("Stem of \"$part\" is \"$stem\".") if T(4);
797 :     }
798 :     # Return the result.
799 :     return @retVal;
800 :     }
801 :    
802 : parrello 1.3 =head3 WildsOfEC
803 :    
804 :     my @ecWilds = BioWords::WildsOfEC($number);
805 :    
806 :     Return a list of all of the possible wild-carded EC numbers that would
807 :     match the specified EC number.
808 :    
809 :     =over 4
810 :    
811 :     =item number
812 :    
813 :     EC number to process.
814 :    
815 :     =item RETURN
816 :    
817 :     Returns a list consisting of the original EC number and all other
818 :     EC numbers that subsume it.
819 :    
820 :     =back
821 :    
822 :     =cut
823 :    
824 :     sub WildsOfEC {
825 :     # Get the parameters.
826 :     my ($number) = @_;
827 :     # Declare the return variable. It contains at the start the original
828 :     # EC number.
829 :     my @retVal = $number;
830 :     # Bust the EC number into pieces.
831 :     my @pieces = split '.', $number;
832 :     # Put it back together with hyphens.
833 :     for (my $i = 1; $i <= $#pieces; $i++) {
834 :     if ($pieces[$i] ne '-') {
835 :     my @wildPieces;
836 :     for (my $j = 0; $j <= $#pieces; $j++) {
837 :     push @wildPieces, ($j < $i ? $pieces[$i] : '-');
838 :     }
839 :     push @retVal, join(".", @wildPieces);
840 :     }
841 :     }
842 :     # Return the result.
843 :     return @retVal;
844 :     }
845 :    
846 :     =head3 ExtractECs
847 :    
848 :     my @ecThings = BioWords::ExtractECs($string);
849 :    
850 :     Return any individual EC numbers found in the specified string.
851 :    
852 :     =over 4
853 :    
854 :     =item string
855 :    
856 :     String containing potential EC numbers.
857 :    
858 :     =item RETURN
859 :    
860 :     Returns a list of all the EC numbers and subsuming EC numbers found in the string.
861 :    
862 :     =back
863 :    
864 :     =cut
865 :    
866 :     sub ExtractECs {
867 :     # Get the parameters.
868 :     my ($string) = @_;
869 :     # Find all the EC numbers in the string.
870 :     my @ecs = ($string =~ /ec\s+(\d+(?:\.\d+|\.-){3})/gi);
871 :     # Get the wild versions.
872 :     my @retVal = map { WildsOfEc($_) } @ecs;
873 :     # Return the result.
874 :     return @retVal;
875 :     }
876 :    
877 : parrello 1.1 =head2 Internal Methods
878 :    
879 :     =head3 _initCache
880 :    
881 :     my $cache = $biowords->_initCache();
882 :    
883 :     Insure the cache is initialized. If exception and stop word files exist,
884 :     they will be read into memory and used to populate the cache. A reference to
885 :     the cache will be returned to the caller.
886 :    
887 :     =cut
888 :    
889 :     sub _initCache {
890 :     # Get the parameters.
891 :     my ($self) = @_;
892 :     # Check for a stopword file.
893 :     if ($self->{stopFile}) {
894 :     # Read the file.
895 :     my @lines = Tracer::GetFile($self->{stopFile});
896 : parrello 1.3 Trace(scalar(@lines) . " lines found in stop file.") if T(3);
897 : parrello 1.1 # Insert it into the cache.
898 :     for my $line (@lines) {
899 :     $self->Stop(lc $line);
900 :     }
901 :     # Denote that the stopword file has been processed.
902 :     $self->{stopFile} = EMPTY;
903 :     }
904 :     # Check for an exception list.
905 :     if ($self->{exceptionFile}) {
906 :     # Read the file.
907 :     my @lines = Tracer::GetFile($self->{exceptionFile});
908 : parrello 1.3 Trace(scalar(@lines) . " lines found in exception file.") if T(3);
909 : parrello 1.1 # Loop through the lines.
910 :     for my $line (@lines) {
911 :     # Extract the words.
912 :     my @words = split /\s+/, $line;
913 :     # Map all of the starting words to the last word.
914 :     my $stem = pop @words;
915 :     for my $word (@words) {
916 :     $self->Store($word, $stem, 0);
917 :     }
918 :     }
919 :     # Denote that the exception file has been procesed.
920 :     $self->{exceptionFile} = EMPTY;
921 :     }
922 :     # Return the cache.
923 :     return $self->{cache};
924 :     }
925 :    
926 :     =head3 _stem
927 :    
928 :     my $stem = $biowords->_stem($word);
929 :    
930 :     Compute the stem of an incoming word. This is an internal method that
931 :     does not check the cache or do any length checking.
932 :    
933 :     =over 4
934 :    
935 :     =item word
936 :    
937 :     The word to stem. It must already have been converted to lower case.
938 :    
939 :     =item RETURN
940 :    
941 :     Returns the stem of the incoming word, which could possibly be the word itself.
942 :    
943 :     =back
944 :    
945 :     =cut
946 :    
947 :     sub _stem {
948 :     # Get the parameters.
949 :     my ($self, $word) = @_;
950 :     # Copy the word so we can mangle it.
951 :     my $retVal = $word;
952 :     # Convert consonant "y" to "j".
953 :     $retVal =~ s/^y/j/;
954 :     $retVal =~ s/([aeiou])y/$1j/g;
955 :     # Convert vowel "y" to "i".
956 :     $retVal =~ tr/y/i/;
957 :     # Compute the R1 and R2 regions. R1 is everything after the first syllable,
958 :     # and R2 is everything after the second syllable.
959 :     my $r1 = $self->Region1($retVal);
960 :     my $r2 = $self->Region1($r1);
961 :     # Compute the physical locations of the regions.
962 :     my $len = length $retVal;
963 :     my $p1 = $len - length $r1;
964 :     my $p2 = $len - length $r2;
965 :     # These variables will be used by FindRule.
966 :     my ($prefix, $suffix, $ruleValue);
967 : parrello 1.3 # Remove the genitive apostrophe.
968 : parrello 1.1 ($retVal, $suffix, $ruleValue) = FindRule($retVal, q('s') => EMPTY, q('s) => EMPTY, q(') => EMPTY);
969 :     # Process latin endings.
970 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, us => 'i', um => 'a', ae => 'a');
971 :     # Latin endings only apply if they follow a consonant.
972 :     if ($prefix =~ /[^aeiou]$/) {
973 :     $retVal = "$prefix$ruleValue";
974 :     }
975 :     # Convert plurals to singular.
976 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, sses => 'ss', ied => 'i', ies => 'i', s => 's');
977 :     if ($ruleValue eq 'i') {
978 :     # If the prefix length is one, we append an "e".
979 :     if (length $prefix <= 1) {
980 :     $ruleValue .= "e"
981 :     }
982 :     } elsif ($ruleValue eq 's') {
983 :     # Here we have a naked "s" at the end. We null it out if the prefix ends in a
984 :     # consonant or an 'e'. Nulling it will cause the "s" to be removed.
985 :     if ($prefix =~ /[^aiou]$/) {
986 :     $ruleValue = EMPTY;
987 :     }
988 :     }
989 :     # Finish the singularization. The possibly-modified rule value is applied to the prefix.
990 :     # If no rule applied, this has no effect, since the prefix is the whole word and the
991 :     # rule value is the empty string.
992 :     $retVal = "$prefix$ruleValue";
993 :     # Catch the special "izing" construct.
994 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, izing => 'is');
995 :     $retVal = "$prefix$ruleValue";
996 :     # Convert adverbs to adjectives.
997 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, eedli => 'ee', eed => 'ee',
998 :     ingli => EMPTY, ing => EMPTY, edli => EMPTY,
999 :     ed => EMPTY);
1000 :     # These rules only apply in limited circumstances.
1001 :     if ($ruleValue eq 'ee') {
1002 :     # The "ee" replacement only applies if it occurs in region 1. If it does not
1003 :     # occur there, then we put the suffix back.
1004 :     if (length($prefix) < $p1) {
1005 :     $ruleValue = $suffix;
1006 :     }
1007 :     } elsif ($suffix) {
1008 :     # Here the rule value is the empty string. It only applies if there is a
1009 :     # vowel in the prefix.
1010 :     if ($prefix !~ /[aeiou]/) {
1011 :     # No vowel, so put the suffix back.
1012 :     $ruleValue = $suffix;
1013 :     } else {
1014 :     # The prefix is now the whole word, because the rule value is the empty
1015 :     # string. Check for ending mutations. We may need to add an "e" or
1016 :     # remove a doubled letter.
1017 :     ($prefix, $suffix, $ruleValue) = FindRule($prefix, at => 'ate', bl => 'ble', iz => 'ize',
1018 :     bb => 'b', dd => 'd', ff => 'f', gg => 'g',
1019 :     mm => 'n', nn => 'n', pp => 'p', rr => 'r',
1020 :     tt => 't');
1021 :     }
1022 :     }
1023 :     # Apply the modifications.
1024 :     $retVal = "$prefix$ruleValue";
1025 :     # Now we get serious. Here we're looking for special suffixes.
1026 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, ational => 'ate', tional => 'tion',
1027 :     enci => 'ence', anci => 'ance', abli => 'able',
1028 :     entli => 'ent', ization => 'ize', izer => 'ize',
1029 :     ation => 'ate', ator => 'ate', alism => 'al',
1030 :     aliti => 'al', alli => 'al', fulness => 'ful',
1031 :     ousness => 'ous', ousli => 'ous', ivness => 'ive',
1032 :     iviti => 'ive', biliti => 'ble', bli => 'ble',
1033 :     logi => 'log', fulli => 'ful', lessli => 'less',
1034 :     cli => 'c', dli => 'd', eli => 'e', gli => 'g',
1035 :     hli => 'h', kli => 'k', mli => 'm', nli => 'n',
1036 :     rli => 'r', tli => 't', alize => 'al', icate => 'ic',
1037 :     iciti => 'ic', ical => 'ic');
1038 :     # These only apply if they are in R1.
1039 :     if ($ruleValue && length($prefix) >= $p1) {
1040 :     $retVal = "$prefix$ruleValue";
1041 :     }
1042 :     # Conflate "ence" to "ent" if it's in R2.
1043 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, ence => 'ent');
1044 :     if ($ruleValue && length($prefix) >= $p2) {
1045 :     $retVal = "$prefix$ruleValue";
1046 :     }
1047 :     # Now zap "ful", "ness", "ative", and "ize", but only if they're in R1.
1048 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, ful => EMPTY, ness => EMPTY, ize => EMPTY);
1049 :     if (length($prefix) >= $p1) {
1050 :     $retVal = $prefix;
1051 :     }
1052 :     # Now we have some suffixes that get deleted if they're in R2.
1053 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, ement => EMPTY, ment => EMPTY, able => EMPTY,
1054 :     ible => EMPTY, ance => EMPTY, ence => EMPTY,
1055 :     ant => EMPTY, ent => EMPTY, ism => EMPTY, ate => EMPTY,
1056 :     iti => EMPTY, ous => EMPTY, ive => EMPTY, ize => EMPTY,
1057 :     al => EMPTY, er => EMPTY, ic => EMPTY, sion => 's',
1058 :     tion => 't', alli => 'al');
1059 :     if (length($prefix) >= $p2) {
1060 :     $retVal = $prefix;
1061 :     }
1062 :     # Process the doubled L.
1063 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, ll => 'l');
1064 :     $retVal = "$prefix$ruleValue";
1065 : parrello 1.3 # Check for an ending 'e'.
1066 :     $retVal =~ s/([$self->{VOWEL}][^$self->{VOWEL}]+)e$/$1/;
1067 : parrello 1.1 # Return the result.
1068 :     return $retVal;
1069 :     }
1070 :    
1071 :     =head3 _phonex
1072 :    
1073 :     my $phonex = $biowords->_phonex($word);
1074 :    
1075 :     Compute the phonetic version of a word. Vowels are ignored, doubled
1076 :     letters are trimmed to singletons, and certain letters or letter
1077 :     combinations are conflated. The resulting word is likely to match a
1078 :     misspelling of the original.
1079 :    
1080 :     This is an internal method. It does not check the cache and it assumes
1081 :     the word has already been converted to lower case.
1082 :    
1083 :     =over 4
1084 :    
1085 :     =item word
1086 :    
1087 :     Word whose phonetic translation is desired.
1088 :    
1089 :     =item RETURN
1090 :    
1091 :     Returns a more-or-less phonetic translation of the word.
1092 :    
1093 :     =back
1094 :    
1095 :     =cut
1096 :    
1097 :     sub _phonex {
1098 :     # Get the parameters.
1099 :     my ($self, $word) = @_;
1100 :     # Declare the return variable.
1101 :     my $retVal = $word;
1102 :     # Handle some special cases. For typed IDs, we remove the type. For
1103 :     # horrible multi-part chemical names, remove everything in front of
1104 :     # the last underscore.
1105 :     if ($word =~ /_([$self->{LETTER}]+)$/ && length($1) > $self->{SHORT}) {
1106 :     $word = $1;
1107 :     } elsif ($word =~ /^[$self->{LETTER}]+'(.+)$/ && length($1) > $self->{SHORT}) {
1108 :     $word = $1;
1109 :     }
1110 :     # Convert the pesky sibilant combinatorials to their own private symbol.
1111 :     $retVal =~ s/sch|ch|sh/S/g;
1112 :     # Convert PH to F.
1113 :     $retVal =~ s/ph/f/g;
1114 :     # Remove silent constructs.
1115 :     $retVal =~ s/gh//g;
1116 :     $retVal =~ s/^ps/s/;
1117 :     # Convert soft G to J and soft C to S.
1118 :     $retVal =~ s/g(e|i)/j$1/g;
1119 :     $retVal =~ s/c(e|i)/s$1/g;
1120 :     # Convert C to K, S to Z, M to N.
1121 :     $retVal =~ tr/csm/kzn/;
1122 :     # Singlify doubled letters.
1123 :     $retVal =~ tr/a-z//s;
1124 :     # Split off the first letter.
1125 :     my $first = substr($retVal, 0, 1, "");
1126 :     # Delete the vowels.
1127 :     $retVal =~ s/[$self->{VOWEL}]//g;
1128 :     # Put the first letter back.
1129 :     $retVal = $first . $retVal;
1130 :     # Return the result.
1131 :     return $retVal;
1132 :     }
1133 :    
1134 :     1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3