[Bio] / Sprout / BioWords.pm Repository:
ViewVC logotype

Annotation of /Sprout/BioWords.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.2 - (view) (download) (as text)

1 : parrello 1.1 #!/usr/bin/perl -w
2 :    
3 :     #
4 :     # Copyright (c) 2003-2006 University of Chicago and Fellowship
5 :     # for Interpretations of Genomes. All Rights Reserved.
6 :     #
7 :     # This file is part of the SEED Toolkit.
8 :     #
9 :     # The SEED Toolkit is free software. You can redistribute
10 :     # it and/or modify it under the terms of the SEED Toolkit
11 :     # Public License.
12 :     #
13 :     # You should have received a copy of the SEED Toolkit Public License
14 :     # along with this program; if not write to the University of Chicago
15 :     # at info@ci.uchicago.edu or the Fellowship for Interpretation of
16 :     # Genomes at veronika@thefig.info or download a copy from
17 :     # http://www.theseed.org/LICENSE.TXT.
18 :     #
19 :    
20 :     package BioWords;
21 :    
22 :     use strict;
23 :     use Tracer;
24 :    
25 :     =head1 BioWords Package
26 :    
27 :     Microbiological Word Conflation Helper
28 :    
29 :     =head2 Introduction
30 :    
31 :     This object is in charge of managing keywords used to search the database. Its
32 :     purpose is to insure that if a user types something close to the correct word, a
33 :     usable result will be returned.
34 :    
35 :     A keyword string consists of words separated by delimiters. A I<word> is an
36 :     uninterrupted sequence of letters, semidelimiters (currently only C<'>) and digits.
37 :     A word that begins with a letter is called a I<real word>. For each real word we
38 :     produce two alternate forms. The I<stem> represents the root form of the word
39 :     (e.g. C<skies> to C<ski>, C<following> to C<follow>). The I<phonex> is computed
40 :     from the stem by removing the vowels and equating consonants that produce similar
41 :     sounds. It is likely a mispelled word will have the same phonex as its real form.
42 :    
43 :     In addition to computing stems and phonexes, this object also I<cleans> a
44 :     keyword. I<Cleaning> consists of converting upper-case letters to lower case and
45 :     converting certain delimiters. In particular, bar (C<|>), colon (C<:>), and
46 :     semi-colon (C<;>) are converted to a single quote (C<'>) and period (C<.>) and
47 :     hyphen (C<->) are converted to underscore (C<_>). The importance of this is that
48 :     the single quote and underscore are considered word characters by the search
49 :     software. The cleaning causes the names of chemical compounds and the IDs of
50 :     features and genomes to behave as words when searching.
51 :    
52 :     Search words must be at least three characters long, so the stem of a real word with
53 :     only three letters is the word itself, and any real word with only two letters
54 :     is discarded. In addition, there is a list of I<stop words> that are discarded
55 :     by the keyword search. These will have an empty string for the stem and phonex.
56 :    
57 :     Note that the stemming algorithm differs from the standard for English because
58 :     of the use of Greek and Latin words in chemical compound names and genome
59 :     taxonomies. The algorithm has been evolving in response to numerous experiments
60 :     and is almost certainly not in its last iteration.
61 :    
62 :     The fields in this object are as follows.
63 :    
64 :     =over 4
65 :    
66 :     =item stems
67 :    
68 :     Hash of the stems found so far. This is cleared by L</AnalyzeSearchExpression>,
69 :     so it can be used by clients to determine the number of search expressions
70 :     containing a particular stem.
71 :    
72 :     =item cache
73 :    
74 :     Reference to a hash that maps a pure word to a hash containing its stem, a count of
75 :     the number of times it has occurred, and its phonex. The hash is also used to keep
76 :     exceptions (which map to their predetermined stem) and stop words (which map to an
77 :     empty string). The cache should only be used when the number of words being
78 :     processed is small. If multiple millions of words are put into the cache, it
79 :     causes the application to hang.
80 :    
81 :     =item stopFile
82 :    
83 :     The name of a file containing the stop word list, one word per line. The stop
84 :     word file is read into the cache the first time we try to stem a pure word.
85 :     Once the file is read, this field is cleared so that we know it's handled.
86 :    
87 :     =item exceptionFile
88 :    
89 :     The name of a file containing exception rules, one rule per line. Each rule
90 :     consists of a space-delimited list of words followed by a single stem. The
91 :     exception file is read into the cache the first time we try to stem a pure word.
92 :     Once the file is read, this field is cleared so that we know it's handled.
93 :    
94 :     =item cacheFlag
95 :    
96 :     TRUE if incoming words should be cached, else FALSE.
97 :    
98 :     =item VOWEL
99 :    
100 :     The list of vowel characters (lower-case). This defaults to the value of the
101 :     compile-time constant VOWELS, but may be overridden by the constructor.
102 :    
103 :     =item LETTER
104 :    
105 :     The list of letter characters (lower-case). This defaults to the value of the
106 :     compile-time constant LETTERS, but may be overridden by the constructor. All
107 :     of the vowels should be included in the list of letters.
108 :    
109 :     =item DIGIT
110 :    
111 :     The list of digit characters (lower-case). This defaults to the value of the
112 :     compile-time constant DIGITS, but may be overridden by the constructor.
113 :    
114 :     =item WORD
115 :    
116 :     The list of all word-like characters. This is the union of the letters
117 :     and digits.
118 :    
119 :     =back
120 :    
121 :     We allow configuration of letters, digits, and vowels; but in general the
122 :     stemming and phonex algorithms are aware of the English language and what the
123 :     various letters mean. The main use of the configuration strings is to allow
124 :     flexibility in the treatment of special characters, such as underscore (C<_>) and
125 :     the single quote (C<'>). The defaults have all been chosen fairly carefully based
126 :     on empirical testing, but of course everything is subject to evolution.
127 :    
128 :     =head2 Special Declarations
129 :    
130 :     =head3 EMPTY
131 :    
132 :     The EMPTY constant simply evaluates to the empty string. It makes the stemming
133 :     rules more readable.
134 :    
135 :     =cut
136 :    
137 :     use constant EMPTY => '';
138 :    
139 :     =head3 SHORT
140 :    
141 :     The SHORT constant specifies the minimum length for a word. A word shorter than
142 :     the minimum length is treated as a stop word.
143 :    
144 :     =cut
145 :    
146 :     use constant SHORT => 3;
147 :    
148 :     =head3 VOWELS
149 :    
150 :     String containing the characters that are considered vowels (lower case only).
151 :    
152 :     =cut
153 :    
154 :     use constant VOWELS => q(aeiou_);
155 :    
156 :     =head3 LETTERS
157 :    
158 :     String containing the characters that are considered letters (lower case only).
159 :    
160 :     =cut
161 :    
162 :     use constant LETTERS => q(abcdefghijklmnopqrstuvwxyz_);
163 :    
164 :     =head3 DIGITS
165 :    
166 :     String containing the characters that are considered digits (lower case only).
167 :    
168 :     =cut
169 :    
170 :     use constant DIGITS => q(0123456789');
171 :    
172 :     =head3 new
173 :    
174 :     my $bw = BioWords->new(%options);
175 :    
176 :     Construct a new BioWords object. The following options are supported.
177 :    
178 :     =over 4
179 :    
180 :     =item exceptions
181 :    
182 :     Name of the exception file, or a reference to a hash containing the exception
183 :     rules. The default is to have no exceptions.
184 :    
185 :     =item stops
186 :    
187 :     Name of the stop word file, or a reference to a list containing the stop words.
188 :     The default is to have no stop words.
189 :    
190 :     =item vowels
191 :    
192 :     List of characters to be treated as vowels (lower-case only). The default is
193 :     a compile-time constant.
194 :    
195 :     =item letters
196 :    
197 :     List of characters to be treated as letters (lower-case only). The default is a
198 :     compile-time constant.
199 :    
200 :     =item digits
201 :    
202 :     List of characters to be treated as digits (lower-case only). The default is a
203 :     compile-time constant.
204 :    
205 :     =item cache
206 :    
207 :     If TRUE, then words will be cached when they are processed. If FALSE, the cache
208 :     will only be used for stopwords and exceptions. The default is TRUE.
209 :    
210 :     =back
211 :    
212 :     =cut
213 :    
214 :     sub new {
215 :     # Get the parameters.
216 :     my ($class, %options) = @_;
217 :     # Get the options.
218 :     my $exceptionOption = $options{exceptions} || "$FIG_Config::sproutData/exceptions.txt";
219 :     my $stopOption = $options{stops} || "$FIG_Config::sproutData/stopwords.txt";
220 :     my $vowels = $options{vowels} || VOWELS;
221 :     my $letters = $options{letters} || LETTERS;
222 :     my $digits = $options{digits} || DIGITS;
223 :     my $cacheFlag = (defined $options{cache} ? $options{cache} : 1);
224 :     my $cache = {};
225 :     # Create the BioWords object.
226 :     my $retVal = {
227 :     cache => $cache,
228 :     cacheFlag => $cacheFlag,
229 :     stopFile => undef,
230 :     exceptionFile => undef,
231 :     stems => {},
232 :     VOWEL => $vowels,
233 :     LETTER => $letters,
234 :     DIGIT => $digits,
235 :     WORD => "$letters$digits"
236 :     };
237 :     # Now we need to deal with the craziness surrounding the exception hash and the stop word
238 :     # list, both of which are loaded into the cache before we start processing anything
239 :     # serious. The exceptions and stops could be passed in as hash references, in which case
240 :     # we load them into the cache. Alternatively, they could be file names, which we save
241 :     # to be read in when we need them. So, first, we check for an exception file name.
242 :     if (! ref $exceptionOption) {
243 :     # Here we have a file name. We store it in the object.
244 :     $retVal->{exceptionFile} = $exceptionOption;
245 :     } else {
246 :     # Here we have a hash. Slurp it into the cache.
247 :     for my $exceptionWord (keys %{$exceptionOption}) {
248 :     Store($retVal, $exceptionWord, $exceptionOption->{$exceptionWord}, 0);
249 :     }
250 :     }
251 :     # Now we check for a stopword file name.
252 :     if (! ref $stopOption) {
253 :     # Store it in the object.
254 :     $retVal->{stopFile} = $stopOption;
255 :     } else {
256 :     # No file name, so slurp in the list of words.
257 :     for my $stopWord (@{$stopOption}) {
258 :     Stop($retVal, $stopWord);
259 :     }
260 :     }
261 :     # Bless and return the object.
262 :     bless $retVal, $class;
263 :     return $retVal;
264 :     }
265 :    
266 :     =head2 Public Methods
267 :    
268 :     =head3 Stop
269 :    
270 :     $bio->Stop($word);
271 :    
272 :     Denote that a word is a stop word.
273 :    
274 :     =over 4
275 :    
276 :     =item word
277 :    
278 :     Word to be declared as a stop word.
279 :    
280 :     =back
281 :    
282 :     =cut
283 :    
284 :     sub Stop {
285 :     # Get the parameters.
286 :     my ($self, $word) = @_;
287 :     # Store the stop word.
288 :     $self->{cache}->{$word} = {stem => EMPTY, phonex => EMPTY, count => 0 };
289 :     }
290 :    
291 :     =head3 Store
292 :    
293 :     $bio->Store($word, $stem, $count);
294 :    
295 :     Store a word in the cache. The word will be mapped to the
296 :     specified stem and its count will be set to the specified value. The phonex
297 :     will be computed automatically from the stem. This method can also be used to
298 :     store exceptions. In that case, the count should be C<0>.
299 :    
300 :     =over 4
301 :    
302 :     =item word
303 :    
304 :     Word to be stored.
305 :    
306 :     =item stem
307 :    
308 :     Proposed stem.
309 :    
310 :     =item count
311 :    
312 :     Proposed count. This should be C<0> for exceptions and C<1> for normal
313 :     words. The default is C<1>.
314 :    
315 :     =back
316 :    
317 :     =cut
318 :    
319 :     sub Store {
320 :     # Get the parameters.
321 :     my ($self, $word, $stem, $count) = @_;
322 :     # Default the count.
323 :     my $realCount = (defined $count ? $count : 1);
324 :     # Get the phonex for the specified stem.
325 :     my $phonex = $self->_phonex($stem);
326 :     # Store the word in the cache.
327 :     $self->{cache}->{$word} = { stem => $stem, phonex => $phonex, count => $realCount };
328 :     }
329 :    
330 :    
331 :    
332 :     =head3 Split
333 :    
334 :     my @words = $bio->Split($string);
335 :    
336 :     Split a string into keywords. A keyword is this context is either a
337 :     delimiter sequence or a combination of letters, digits, underscores
338 :     (C<_>), and isolated single quotes (C<'>). All letters are converted to
339 :     lower case, and any white space sequence inside the string is converted
340 :     to a single space. Prior to splitting the string, certain strings that
341 :     have special biological meaning are modified, and certain delimiters are
342 :     converted. This helps to resolve some ambiguities (e.g. which alias names
343 :     use colons and which use vertical bars) and makes strings such as EC
344 :     numbers appear to be singleton keywords. The list of keywords we output
345 :     can be rejoined and then passed unmodified to a keyword search; however,
346 :     before doing that the individual pure words should be stemmed and checked
347 :     for spelling.
348 :    
349 :     =over 4
350 :    
351 :     =item string
352 :    
353 :     Input string to process.
354 :    
355 :     =item RETURN
356 :    
357 :     Returns a List of normalized keywords and delimiters.
358 :    
359 :     =back
360 :    
361 :     =cut
362 :    
363 :     sub Split {
364 :     # Get the parameters.
365 :     my ($self, $string) = @_;
366 :     # Convert letters to lower case and collapse the white space. Note that we use the "s" modifier on
367 :     # the substitution so that new-lines are treated as white space, and we take precautions so that
368 :     # an undefined input is treated as a null string (which saves us from compiler warnings).
369 :     my $lowered = (defined($string) ? lc $string : "");
370 :     $lowered =~ s/\s+/ /sg;
371 :     # Trim the leading space (if any).
372 :     $lowered =~ s/^ //;
373 :     # Fix the periods in EC and TC numbers. Note here we are insisting on real
374 :     # digits rather than the things we treat as digits. We are parsing for real EC
375 :     # and TC numbers, not generalized strings, and the format is specific.
376 :     $lowered =~ s/(\d+|\-)\.(\d+|-)\.(\d+|-)\.(\d+|-)/$1_$2_$3_$4/g;
377 :     # Fix non-trailing periods.
378 :     $lowered =~ s/\.([$self->{WORD}])/_$1/g;
379 :     # Fix non-leading minus signs.
380 :     $lowered =~ s/([$self->{WORD}])[\-]/$1_/g;
381 :     # Fix interior vertical bars and colons
382 :     $lowered =~ s/([$self->{WORD}])[|:]([$self->{WORD}])/$1'$2/g;
383 :     # Now split up the list so that each keyword is in its own string. The delimiters between
384 :     # are kept, so when we're done everything can be joined back together again.
385 :     Trace("Normalized string is -->$lowered<--") if T(4);
386 :     my @pieces = map { split(/([^$self->{WORD}]+)/, $_) } $lowered;
387 :     # The last step is to separate spaces from the other delimiters.
388 :     my @retVal;
389 :     for my $piece (@pieces) {
390 :     while (substr($piece,0,1) eq " ") {
391 :     $piece = substr($piece, 1);
392 :     push @retVal, " ";
393 :     }
394 :     while ($piece =~ /(.+?) (.*)/) {
395 :     push @retVal, $1, " ";
396 :     $piece = $2;
397 :     }
398 :     if ($piece ne "") {
399 :     push @retVal, $piece;
400 :     }
401 :     }
402 :     # Return the result.
403 :     return @retVal;
404 :     }
405 :    
406 :     =head3 Region1
407 :    
408 :     my $root = $bio->Region1($word);
409 :    
410 :     Return the suffix region for a word. This is referred to as I<region 1>
411 :     in the literature on word stemming, and it consists of everything after
412 :     the first non-vowel that follows a vowel.
413 :    
414 :     =over 4
415 :    
416 :     =item word
417 :    
418 :     Lower-case word whose suffix region is desired.
419 :    
420 :     =item RETURN
421 :    
422 :     Returns the suffix region, or the empty string if there is no suffix region.
423 :    
424 :     =back
425 :    
426 :     =cut
427 :    
428 :     sub Region1 {
429 :     # Get the parameters.
430 :     my ($self, $word) = @_;
431 :     # Declare the return variable.
432 :     my $retVal = "";
433 :     # Look for the R1.
434 :     if ($word =~ /[$self->{VOWEL}][^$self->{VOWEL}](.+)/i) {
435 :     $retVal = $1;
436 :     }
437 :     # Return the result.
438 :     return $retVal;
439 :     }
440 :    
441 :     =head3 FindRule
442 :    
443 :     my ($prefix, $suffix, $replacement) = BioWords::FindRule($word, @rules);
444 :    
445 :     Find the appropriate suffix rule for a word. Suffix rules are specified
446 :     as pairs in a list. Syntactically, the rule list may look like a hash,
447 :     but the order of the rules is important, so in fact it is a list. The
448 :     first rule whose key matches the suffix is applied. The part of the word
449 :     before the suffix, the suffix itself, and the value of the rule are all
450 :     passed back to the caller. If no rule matches, the prefix will be the
451 :     entire input word, and the suffix and replacement will be an empty string.
452 :    
453 :     =over 4
454 :    
455 :     =item word
456 :    
457 :     Word to parse. It should already be normalized to lower case.
458 :    
459 :     =item rules
460 :    
461 :     A list of rules. Each rule is represented by two entries in the list-- a suffix
462 :     to match and a value to return.
463 :    
464 :     =item RETURN
465 :    
466 :     Returns a three-element list. The first element will be the portion of the word
467 :     before the matched suffix, the second element will be the suffix itself, and the
468 :     third will be the replacement recommended by the matched rule. If no rule
469 :     matches, the first element will be the whole word and the other two will be
470 :     empty strings.
471 :    
472 :     =back
473 :    
474 :     =cut
475 :    
476 :     sub FindRule {
477 :     # Get the parameters.
478 :     my ($word, @rules) = @_;
479 :     # Declare the return variables.
480 :     my ($prefix, $suffix, $replacement) = ($word, EMPTY, EMPTY);
481 :     # Search for a match. We'll stop on the first one.
482 :     for (my $i = 0; ! $suffix && $i < $#rules; $i += 2) {
483 :     my $len = length($rules[$i]);
484 :     if ($rules[$i] eq substr($word, -$len)) {
485 :     $prefix = substr($word, 0, length($word) - $len);
486 :     $suffix = $rules[$i];
487 :     $replacement = $rules[$i+1];
488 :     }
489 :     }
490 :     # Return the results.
491 :     return ($prefix, $suffix, $replacement);
492 :     }
493 :    
494 :     =head3 Process
495 :    
496 :     my $stem = $biowords->Process($word);
497 :    
498 :     Compute the stem of the specified word and record it in the cache.
499 :    
500 :     =over 4
501 :    
502 :     =item word
503 :    
504 :     Word to be processed.
505 :    
506 :     =item RETURN
507 :    
508 :     Returns the stem of the word (which could be the original word itself. If the word
509 :     is a stop word, returns a null string.
510 :    
511 :     =back
512 :    
513 :     =cut
514 :    
515 :     sub Process {
516 :     # Get the parameters.
517 :     my ($self, $word) = @_;
518 :     # Verify that the cache is initialized.
519 :     my $cache = $self->_initCache();
520 :     # Declare the return variable.
521 :     my $retVal;
522 :     # Get the word in lower case and compute its length.
523 :     my $lowered = lc $word;
524 :     my $len = length $lowered;
525 :     # Check to see what type of word it is.
526 :     if ($lowered =~ /[^$self->{WORD}]/) {
527 :     # It's delimiters. Return it unchanged and don't record it.
528 :     $retVal = $lowered;
529 :     } elsif ($len < $self->{SHORT}) {
530 :     # It's too short. Treat it as a stop word.
531 :     $retVal = EMPTY;
532 :     } elsif (exists $cache->{$retVal}) {
533 :     # It's already in the cache. Get the cache entry.
534 :     my $entry = $cache->{$lowered};
535 :     $retVal = $entry->{stem};
536 :     # If it is NOT a stop word, count it.
537 :     if ($retVal ne EMPTY) {
538 :     $entry->{count}++;
539 :     }
540 :     } elsif ($len == $self->{SHORT}) {
541 :     # It's already the minimum length. The stem is the word itself.
542 :     $retVal = $lowered;
543 :     # Store it if we're using the cache.
544 :     if ($self->{cacheFlag}) {
545 :     $self->Store($lowered, $retVal, 1);
546 :     }
547 :     } else {
548 :     # Here we have a new word. We compute the stem and store it.
549 :     $retVal = $self->_stem($lowered);
550 :     # Store the word if we're using the cache.
551 :     if ($self->{cacheFlag}) {
552 :     $self->Store($lowered, $retVal, 1);
553 :     }
554 :     }
555 :     # We're done. If the stem is non-empty, add it to the stem list.
556 :     if ($retVal ne EMPTY) {
557 :     $self->{stems}->{$retVal} = 1;
558 : parrello 1.2 Trace("\"$word\" stems to \"$retVal\".") if T(3);
559 :     } else {
560 :     Trace("\"$word\" discarded by stemmer.") if T(3);
561 : parrello 1.1 }
562 :     # Return the stem.
563 :     return $retVal;
564 :     }
565 :    
566 :     =head3 IsWord
567 :    
568 :     my $flag = $biowords->IsWord($word);
569 :    
570 :     Return TRUE if the specified string is a word and FALSE if it is a
571 :     delimiter.
572 :    
573 :     =over 4
574 :    
575 :     =item word
576 :    
577 :     String to examine.
578 :    
579 :     =item RETURN
580 :    
581 :     Returns TRUE if the string contains no delimiters, else FALSE.
582 :    
583 :     =back
584 :    
585 :     =cut
586 :    
587 :     sub IsWord {
588 :     # Get the parameters.
589 :     my ($self, $word) = @_;
590 :     # Test the word.
591 :     my $retVal = ($word =~ /^[$self->{WORD}]+$/);
592 :     # Return the result.
593 :     return $retVal;
594 :     }
595 :    
596 :     =head3 StemList
597 :    
598 :     my @stems = $biowords->StemList();
599 :    
600 :     Return the list of stems found in the last search expression.
601 :    
602 :     =cut
603 :    
604 :     sub StemList {
605 :     # Get the parameters.
606 :     my ($self) = @_;
607 :     # Return the keys of the stem hash.
608 :     my @retVal = keys %{$self->{stems}};
609 :     return @retVal;
610 :     }
611 :    
612 :     =head3 StemLookup
613 :    
614 :     my ($stem, $phonex) = $biowords->StemLookup($word);
615 :    
616 :     Return the stem and phonex for the specified word.
617 :    
618 :     =over 4
619 :    
620 :     =item word
621 :    
622 :     Word whose stem and phonex are desired.
623 :    
624 :     =item RETURN
625 :    
626 :     Returns a two-element list. If the word is found in the cache, the
627 :     list will consist of the stem followed by the phonex. If the word
628 :     is a stop word, the list will consist of two empty strings.
629 :    
630 :     =back
631 :    
632 :     =cut
633 :    
634 :     sub StemLookup {
635 :     # Get the parameters.
636 :     my ($self, $word) = @_;
637 :     # Declare the return variables.
638 :     my ($stem, $phonex);
639 :     # Get the cache.
640 :     my $cache = $self->{cache};
641 :     # Check the cache for the word.
642 :     if (exists $cache->{$word}) {
643 :     # It's found. Return its data.
644 :     ($stem, $phonex) = map { $_->{stem}, $_->{phonex} } $cache->{$word};
645 :     } else {
646 :     # It's not found. Compute the stem and phonex.
647 :     my $lowered = lc $word;
648 :     $stem = $self->Process($lowered);
649 :     $phonex = $self->_phonex($stem);
650 :     }
651 :     # Return the results.
652 :     return ($stem, $phonex);
653 :     }
654 :    
655 :     =head3 WordList
656 :    
657 :     my $words = $biowords->WordList($keep);
658 :    
659 :     Return a list of all of the words that were found by
660 :     L</AnalyzeSearchExpression>. Stop words will not be included.
661 :     Because the list could potentially contain millions of words, it is returned
662 :     as a list reference.
663 :    
664 :     =cut
665 :    
666 :     sub WordList {
667 :     # Get the parameters.
668 :     my ($self) = @_;
669 :     # Get the cache.
670 :     my $cache = $self->{cache};
671 :     # Declare the return variable.
672 :     my $retVal;
673 :     # Extract the desired words from the cache.
674 :     $retVal = [ grep { $cache->{$_}->{count} } keys %{$cache} ];
675 :     # Return the result.
676 :     return $retVal;
677 :     }
678 :    
679 :    
680 :     =head3 PrepareSearchExpression
681 :    
682 :     my $searchExpression = $bio->PrepareSearchExpression($string);
683 :    
684 :     Convert an incoming string to a search expression. The string is split
685 :     into pieces, the pieces are stemmed and processed into the cache, and
686 :     then they are rejoined after certain adjustments are made. In particular,
687 :     words without an operator preceding them are prefixed with a plus (C<+>)
688 :     so that they are treated as required words.
689 :    
690 :     =over 4
691 :    
692 :     =item string
693 :    
694 :     Search expression to prepare.
695 :    
696 :     =item RETURN
697 :    
698 :     Returns a modified version of the search expression with words converted to
699 :     stems, stop words eliminated, and plus signs placed before unmodified words.
700 :    
701 :     =back
702 :    
703 :     =cut
704 :    
705 :     sub PrepareSearchExpression {
706 :     # Get the parameters.
707 :     my ($self, $string) = @_;
708 :     # Declare the return variable.
709 :     my $retVal = "";
710 :     # Analyze the search expression.
711 :     my @parts = $self->AnalyzeSearchExpression($string);
712 :     # Now we have to put the pieces back together. At any point, we need
713 :     # to know if we are inside quotes or in the scope of an operator.
714 :     my ($inQuotes, $activeOp) = (0, 0);
715 :     for my $part (@parts) {
716 :     # Is this a word?
717 :     if ($part =~ /[a-z0-9]$/) {
718 :     # Yes. If no operator is present, add a plus.
719 :     if (! $activeOp && ! $inQuotes) {
720 :     $retVal .= "+";
721 :     $activeOp = 0;
722 :     }
723 :     } else {
724 :     # Here we have one or more operators. We process them
725 :     # individually.
726 :     for my $op (split //, $part) {
727 :     if ($op eq '"') {
728 :     # Here we have a quote.
729 :     if ($inQuotes) {
730 :     # A close quote turns off operator scope.
731 :     $inQuotes = 0;
732 :     $activeOp = 0;
733 :     } else {
734 :     # An open quote puts us in quote mode. Words inside
735 :     # quotes do not need the plus added.
736 :     $inQuotes = 1;
737 :     }
738 :     } elsif ($op eq ' ') {
739 :     # Spaces detach us from the preceding operator.
740 :     $activeOp = 0;
741 :     } else {
742 :     # Everything else puts us in operator scope.
743 :     $activeOp = 1;
744 :     }
745 :     }
746 :     }
747 :     # Add this part to the output string.
748 :     $retVal .= $part;
749 :     }
750 :     # Return the result.
751 :     return $retVal;
752 :     }
753 :    
754 :     =head3 AnalyzeSearchExpression
755 :    
756 :     my @list = $bio->AnalyzeSearchExpression($string);
757 :    
758 :     Analyze the components of a search expression and return them to the
759 :     caller. Statistical information about the words in the expression will
760 :     have been stored in the cache, and the return value will be a list of
761 :     stems and delimiters.
762 :    
763 :     =over 4
764 :    
765 :     =item string
766 :    
767 :     Search expression to analyze.
768 :    
769 :     =item RETURN
770 :    
771 :     Returns a list of words and delimiters, in an order corresponding to the
772 :     original expression. Real words will have been converted to stems and
773 :     stop words will have been converted to empty strings.
774 :    
775 :     =back
776 :    
777 :     =cut
778 :    
779 :     sub AnalyzeSearchExpression {
780 :     # Get the parameters.
781 :     my ($self, $string) = @_;
782 :     # Clear the stem list.
783 :     $self->{stems} = {};
784 :     # Normalize and split the search expression.
785 :     my @parts = $self->Split($string);
786 :     # Declare the return variable.
787 :     my @retVal;
788 :     # Now we loop through the parts, processing them.
789 :     for my $part (@parts) {
790 :     my $stem = $self->Process($part);
791 :     push @retVal, $stem;
792 :     Trace("Stem of \"$part\" is \"$stem\".") if T(4);
793 :     }
794 :     # Return the result.
795 :     return @retVal;
796 :     }
797 :    
798 :     =head2 Internal Methods
799 :    
800 :     =head3 _initCache
801 :    
802 :     my $cache = $biowords->_initCache();
803 :    
804 :     Insure the cache is initialized. If exception and stop word files exist,
805 :     they will be read into memory and used to populate the cache. A reference to
806 :     the cache will be returned to the caller.
807 :    
808 :     =cut
809 :    
810 :     sub _initCache {
811 :     # Get the parameters.
812 :     my ($self) = @_;
813 :     # Check for a stopword file.
814 :     if ($self->{stopFile}) {
815 :     # Read the file.
816 :     my @lines = Tracer::GetFile($self->{stopFile});
817 :     # Insert it into the cache.
818 :     for my $line (@lines) {
819 :     $self->Stop(lc $line);
820 :     }
821 :     # Denote that the stopword file has been processed.
822 :     $self->{stopFile} = EMPTY;
823 :     }
824 :     # Check for an exception list.
825 :     if ($self->{exceptionFile}) {
826 :     # Read the file.
827 :     my @lines = Tracer::GetFile($self->{exceptionFile});
828 :     # Loop through the lines.
829 :     for my $line (@lines) {
830 :     # Extract the words.
831 :     my @words = split /\s+/, $line;
832 :     # Map all of the starting words to the last word.
833 :     my $stem = pop @words;
834 :     for my $word (@words) {
835 :     $self->Store($word, $stem, 0);
836 :     }
837 :     }
838 :     # Denote that the exception file has been procesed.
839 :     $self->{exceptionFile} = EMPTY;
840 :     }
841 :     # Return the cache.
842 :     return $self->{cache};
843 :     }
844 :    
845 :     =head3 _stem
846 :    
847 :     my $stem = $biowords->_stem($word);
848 :    
849 :     Compute the stem of an incoming word. This is an internal method that
850 :     does not check the cache or do any length checking.
851 :    
852 :     =over 4
853 :    
854 :     =item word
855 :    
856 :     The word to stem. It must already have been converted to lower case.
857 :    
858 :     =item RETURN
859 :    
860 :     Returns the stem of the incoming word, which could possibly be the word itself.
861 :    
862 :     =back
863 :    
864 :     =cut
865 :    
866 :     sub _stem {
867 :     # Get the parameters.
868 :     my ($self, $word) = @_;
869 :     # Copy the word so we can mangle it.
870 :     my $retVal = $word;
871 :     # Convert consonant "y" to "j".
872 :     $retVal =~ s/^y/j/;
873 :     $retVal =~ s/([aeiou])y/$1j/g;
874 :     # Convert vowel "y" to "i".
875 :     $retVal =~ tr/y/i/;
876 :     # Compute the R1 and R2 regions. R1 is everything after the first syllable,
877 :     # and R2 is everything after the second syllable.
878 :     my $r1 = $self->Region1($retVal);
879 :     my $r2 = $self->Region1($r1);
880 :     # Compute the physical locations of the regions.
881 :     my $len = length $retVal;
882 :     my $p1 = $len - length $r1;
883 :     my $p2 = $len - length $r2;
884 :     # These variables will be used by FindRule.
885 :     my ($prefix, $suffix, $ruleValue);
886 :     # Remove genitives.
887 :     ($retVal, $suffix, $ruleValue) = FindRule($retVal, q('s') => EMPTY, q('s) => EMPTY, q(') => EMPTY);
888 :     # Process latin endings.
889 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, us => 'i', um => 'a', ae => 'a');
890 :     # Latin endings only apply if they follow a consonant.
891 :     if ($prefix =~ /[^aeiou]$/) {
892 :     $retVal = "$prefix$ruleValue";
893 :     }
894 :     # Convert plurals to singular.
895 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, sses => 'ss', ied => 'i', ies => 'i', s => 's');
896 :     if ($ruleValue eq 'i') {
897 :     # If the prefix length is one, we append an "e".
898 :     if (length $prefix <= 1) {
899 :     $ruleValue .= "e"
900 :     }
901 :     } elsif ($ruleValue eq 's') {
902 :     # Here we have a naked "s" at the end. We null it out if the prefix ends in a
903 :     # consonant or an 'e'. Nulling it will cause the "s" to be removed.
904 :     if ($prefix =~ /[^aiou]$/) {
905 :     $ruleValue = EMPTY;
906 :     }
907 :     }
908 :     # Finish the singularization. The possibly-modified rule value is applied to the prefix.
909 :     # If no rule applied, this has no effect, since the prefix is the whole word and the
910 :     # rule value is the empty string.
911 :     $retVal = "$prefix$ruleValue";
912 :     # Catch the special "izing" construct.
913 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, izing => 'is');
914 :     $retVal = "$prefix$ruleValue";
915 :     # Convert adverbs to adjectives.
916 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, eedli => 'ee', eed => 'ee',
917 :     ingli => EMPTY, ing => EMPTY, edli => EMPTY,
918 :     ed => EMPTY);
919 :     # These rules only apply in limited circumstances.
920 :     if ($ruleValue eq 'ee') {
921 :     # The "ee" replacement only applies if it occurs in region 1. If it does not
922 :     # occur there, then we put the suffix back.
923 :     if (length($prefix) < $p1) {
924 :     $ruleValue = $suffix;
925 :     }
926 :     } elsif ($suffix) {
927 :     # Here the rule value is the empty string. It only applies if there is a
928 :     # vowel in the prefix.
929 :     if ($prefix !~ /[aeiou]/) {
930 :     # No vowel, so put the suffix back.
931 :     $ruleValue = $suffix;
932 :     } else {
933 :     # The prefix is now the whole word, because the rule value is the empty
934 :     # string. Check for ending mutations. We may need to add an "e" or
935 :     # remove a doubled letter.
936 :     ($prefix, $suffix, $ruleValue) = FindRule($prefix, at => 'ate', bl => 'ble', iz => 'ize',
937 :     bb => 'b', dd => 'd', ff => 'f', gg => 'g',
938 :     mm => 'n', nn => 'n', pp => 'p', rr => 'r',
939 :     tt => 't');
940 :     }
941 :     }
942 :     # Apply the modifications.
943 :     $retVal = "$prefix$ruleValue";
944 :     # Now we get serious. Here we're looking for special suffixes.
945 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, ational => 'ate', tional => 'tion',
946 :     enci => 'ence', anci => 'ance', abli => 'able',
947 :     entli => 'ent', ization => 'ize', izer => 'ize',
948 :     ation => 'ate', ator => 'ate', alism => 'al',
949 :     aliti => 'al', alli => 'al', fulness => 'ful',
950 :     ousness => 'ous', ousli => 'ous', ivness => 'ive',
951 :     iviti => 'ive', biliti => 'ble', bli => 'ble',
952 :     logi => 'log', fulli => 'ful', lessli => 'less',
953 :     cli => 'c', dli => 'd', eli => 'e', gli => 'g',
954 :     hli => 'h', kli => 'k', mli => 'm', nli => 'n',
955 :     rli => 'r', tli => 't', alize => 'al', icate => 'ic',
956 :     iciti => 'ic', ical => 'ic');
957 :     # These only apply if they are in R1.
958 :     if ($ruleValue && length($prefix) >= $p1) {
959 :     $retVal = "$prefix$ruleValue";
960 :     }
961 :     # Conflate "ence" to "ent" if it's in R2.
962 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, ence => 'ent');
963 :     if ($ruleValue && length($prefix) >= $p2) {
964 :     $retVal = "$prefix$ruleValue";
965 :     }
966 :     # Now zap "ful", "ness", "ative", and "ize", but only if they're in R1.
967 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, ful => EMPTY, ness => EMPTY, ize => EMPTY);
968 :     if (length($prefix) >= $p1) {
969 :     $retVal = $prefix;
970 :     }
971 :     # Now we have some suffixes that get deleted if they're in R2.
972 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, ement => EMPTY, ment => EMPTY, able => EMPTY,
973 :     ible => EMPTY, ance => EMPTY, ence => EMPTY,
974 :     ant => EMPTY, ent => EMPTY, ism => EMPTY, ate => EMPTY,
975 :     iti => EMPTY, ous => EMPTY, ive => EMPTY, ize => EMPTY,
976 :     al => EMPTY, er => EMPTY, ic => EMPTY, sion => 's',
977 :     tion => 't', alli => 'al');
978 :     if (length($prefix) >= $p2) {
979 :     $retVal = $prefix;
980 :     }
981 :     # Process the doubled L.
982 :     ($prefix, $suffix, $ruleValue) = FindRule($retVal, ll => 'l');
983 :     $retVal = "$prefix$ruleValue";
984 :     # Return the result.
985 :     return $retVal;
986 :     }
987 :    
988 :     =head3 _phonex
989 :    
990 :     my $phonex = $biowords->_phonex($word);
991 :    
992 :     Compute the phonetic version of a word. Vowels are ignored, doubled
993 :     letters are trimmed to singletons, and certain letters or letter
994 :     combinations are conflated. The resulting word is likely to match a
995 :     misspelling of the original.
996 :    
997 :     This is an internal method. It does not check the cache and it assumes
998 :     the word has already been converted to lower case.
999 :    
1000 :     =over 4
1001 :    
1002 :     =item word
1003 :    
1004 :     Word whose phonetic translation is desired.
1005 :    
1006 :     =item RETURN
1007 :    
1008 :     Returns a more-or-less phonetic translation of the word.
1009 :    
1010 :     =back
1011 :    
1012 :     =cut
1013 :    
1014 :     sub _phonex {
1015 :     # Get the parameters.
1016 :     my ($self, $word) = @_;
1017 :     # Declare the return variable.
1018 :     my $retVal = $word;
1019 :     # Handle some special cases. For typed IDs, we remove the type. For
1020 :     # horrible multi-part chemical names, remove everything in front of
1021 :     # the last underscore.
1022 :     if ($word =~ /_([$self->{LETTER}]+)$/ && length($1) > $self->{SHORT}) {
1023 :     $word = $1;
1024 :     } elsif ($word =~ /^[$self->{LETTER}]+'(.+)$/ && length($1) > $self->{SHORT}) {
1025 :     $word = $1;
1026 :     }
1027 :     # Convert the pesky sibilant combinatorials to their own private symbol.
1028 :     $retVal =~ s/sch|ch|sh/S/g;
1029 :     # Convert PH to F.
1030 :     $retVal =~ s/ph/f/g;
1031 :     # Remove silent constructs.
1032 :     $retVal =~ s/gh//g;
1033 :     $retVal =~ s/^ps/s/;
1034 :     # Convert soft G to J and soft C to S.
1035 :     $retVal =~ s/g(e|i)/j$1/g;
1036 :     $retVal =~ s/c(e|i)/s$1/g;
1037 :     # Convert C to K, S to Z, M to N.
1038 :     $retVal =~ tr/csm/kzn/;
1039 :     # Singlify doubled letters.
1040 :     $retVal =~ tr/a-z//s;
1041 :     # Split off the first letter.
1042 :     my $first = substr($retVal, 0, 1, "");
1043 :     # Delete the vowels.
1044 :     $retVal =~ s/[$self->{VOWEL}]//g;
1045 :     # Put the first letter back.
1046 :     $retVal = $first . $retVal;
1047 :     # Return the result.
1048 :     return $retVal;
1049 :     }
1050 :    
1051 :     1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3