[Bio] / Sprout / BioWords.pm Repository:
ViewVC logotype

View of /Sprout/BioWords.pm

Parent Directory Parent Directory | Revision Log Revision Log


Revision 1.1 - (download) (as text) (annotate)
Tue Sep 9 21:02:10 2008 UTC (11 years ago) by parrello
Branch: MAIN
Changes for v24 database.

#!/usr/bin/perl -w

#
# Copyright (c) 2003-2006 University of Chicago and Fellowship
# for Interpretations of Genomes. All Rights Reserved.
#
# This file is part of the SEED Toolkit.
#
# The SEED Toolkit is free software. You can redistribute
# it and/or modify it under the terms of the SEED Toolkit
# Public License.
#
# You should have received a copy of the SEED Toolkit Public License
# along with this program; if not write to the University of Chicago
# at info@ci.uchicago.edu or the Fellowship for Interpretation of
# Genomes at veronika@thefig.info or download a copy from
# http://www.theseed.org/LICENSE.TXT.
#

package BioWords;

    use strict;
    use Tracer;

=head1 BioWords Package

Microbiological Word Conflation Helper

=head2 Introduction

This object is in charge of managing keywords used to search the database. Its
purpose is to insure that if a user types something close to the correct word, a
usable result will be returned.

A keyword string consists of words separated by delimiters. A I<word> is an
uninterrupted sequence of letters, semidelimiters (currently only C<'>) and digits.
A word that begins with a letter is called a I<real word>. For each real word we
produce two alternate forms. The I<stem> represents the root form of the word
(e.g. C<skies> to C<ski>, C<following> to C<follow>). The I<phonex> is computed
from the stem by removing the vowels and equating consonants that produce similar
sounds. It is likely a mispelled word will have the same phonex as its real form.

In addition to computing stems and phonexes, this object also I<cleans> a
keyword. I<Cleaning> consists of converting upper-case letters to lower case and
converting certain delimiters. In particular, bar (C<|>), colon (C<:>), and
semi-colon (C<;>) are converted to a single quote (C<'>) and period (C<.>) and
hyphen (C<->) are converted to underscore (C<_>). The importance of this is that
the single quote and underscore are considered word characters by the search
software. The cleaning causes the names of chemical compounds and the IDs of
features and genomes to behave as words when searching.

Search words must be at least three characters long, so the stem of a real word with
only three letters is the word itself, and any real word with only two letters
is discarded. In addition, there is a list of I<stop words> that are discarded
by the keyword search. These will have an empty string for the stem and phonex.

Note that the stemming algorithm differs from the standard for English because
of the use of Greek and Latin words in chemical compound names and genome
taxonomies. The algorithm has been evolving in response to numerous experiments
and is almost certainly not in its last iteration.

The fields in this object are as follows.

=over 4

=item stems

Hash of the stems found so far. This is cleared by L</AnalyzeSearchExpression>,
so it can be used by clients to determine the number of search expressions
containing a particular stem.

=item cache

Reference to a hash that maps a pure word to a hash containing its stem, a count of
the number of times it has occurred, and its phonex. The hash is also used to keep
exceptions (which map to their predetermined stem) and stop words (which map to an
empty string). The cache should only be used when the number of words being
processed is small. If multiple millions of words are put into the cache, it
causes the application to hang.

=item stopFile

The name of a file containing the stop word list, one word per line. The stop
word file is read into the cache the first time we try to stem a pure word.
Once the file is read, this field is cleared so that we know it's handled.

=item exceptionFile

The name of a file containing exception rules, one rule per line. Each rule
consists of a space-delimited list of words followed by a single stem. The
exception file is read into the cache the first time we try to stem a pure word.
Once the file is read, this field is cleared so that we know it's handled.

=item cacheFlag

TRUE if incoming words should be cached, else FALSE.

=item VOWEL

The list of vowel characters (lower-case). This defaults to the value of the
compile-time constant VOWELS, but may be overridden by the constructor.

=item LETTER

The list of letter characters (lower-case). This defaults to the value of the
compile-time constant LETTERS, but may be overridden by the constructor. All
of the vowels should be included in the list of letters.

=item DIGIT

The list of digit characters (lower-case). This defaults to the value of the
compile-time constant DIGITS, but may be overridden by the constructor.

=item WORD

The list of all word-like characters. This is the union of the letters
and digits.

=back

We allow configuration of letters, digits, and vowels; but in general the
stemming and phonex algorithms are aware of the English language and what the
various letters mean. The main use of the configuration strings is to allow
flexibility in the treatment of special characters, such as underscore (C<_>) and
the single quote (C<'>). The defaults have all been chosen fairly carefully based
on empirical testing, but of course everything is subject to evolution.

=head2 Special Declarations

=head3 EMPTY

The EMPTY constant simply evaluates to the empty string. It makes the stemming
rules more readable.

=cut

use constant EMPTY => '';

=head3 SHORT

The SHORT constant specifies the minimum length for a word. A word shorter than
the minimum length is treated as a stop word.

=cut

use constant SHORT => 3;

=head3 VOWELS

String containing the characters that are considered vowels (lower case only).

=cut

use constant VOWELS => q(aeiou_);

=head3 LETTERS

String containing the characters that are considered letters (lower case only).

=cut

use constant LETTERS => q(abcdefghijklmnopqrstuvwxyz_);

=head3 DIGITS

String containing the characters that are considered digits (lower case only).

=cut

use constant DIGITS => q(0123456789');

=head3 new

    my $bw = BioWords->new(%options);

Construct a new BioWords object. The following options are supported.

=over 4

=item exceptions

Name of the exception file, or a reference to a hash containing the exception
rules. The default is to have no exceptions.

=item stops

Name of the stop word file, or a reference to a list containing the stop words.
The default is to have no stop words.

=item vowels

List of characters to be treated as vowels (lower-case only). The default is
a compile-time constant.

=item letters

List of characters to be treated as letters (lower-case only). The default is a
compile-time constant.

=item digits

List of characters to be treated as digits (lower-case only). The default is a
compile-time constant.

=item cache

If TRUE, then words will be cached when they are processed. If FALSE, the cache
will only be used for stopwords and exceptions. The default is TRUE.

=back

=cut

sub new {
    # Get the parameters.
    my ($class, %options) = @_;
    # Get the options.
    my $exceptionOption = $options{exceptions} || "$FIG_Config::sproutData/exceptions.txt";
    my $stopOption = $options{stops} || "$FIG_Config::sproutData/stopwords.txt";
    my $vowels = $options{vowels} || VOWELS;
    my $letters = $options{letters} || LETTERS;
    my $digits = $options{digits} || DIGITS;
    my $cacheFlag = (defined $options{cache} ? $options{cache} : 1);
    my $cache = {};
    # Create the BioWords object.
    my $retVal = { 
                    cache => $cache,
                    cacheFlag => $cacheFlag,
                    stopFile => undef,
                    exceptionFile => undef,
                    stems => {},
                    VOWEL => $vowels,
                    LETTER => $letters,
                    DIGIT => $digits,
                    WORD => "$letters$digits"
                 };
    # Now we need to deal with the craziness surrounding the exception hash and the stop word
    # list, both of which are loaded into the cache before we start processing anything
    # serious. The exceptions and stops could be passed in as hash references, in which case
    # we load them into the cache. Alternatively, they could be file names, which we save
    # to be read in when we need them. So, first, we check for an exception file name.
    if (! ref $exceptionOption) {
        # Here we have a file name. We store it in the object.
        $retVal->{exceptionFile} = $exceptionOption;
    } else {
        # Here we have a hash. Slurp it into the cache.
        for my $exceptionWord (keys %{$exceptionOption}) {
            Store($retVal, $exceptionWord, $exceptionOption->{$exceptionWord}, 0);
        }
    }
    # Now we check for a stopword file name.
    if (! ref $stopOption) {
        # Store it in the object.
        $retVal->{stopFile} = $stopOption;
    } else {
        # No file name, so slurp in the list of words.
        for my $stopWord (@{$stopOption}) {
            Stop($retVal, $stopWord);
        }
    }
    # Bless and return the object.
    bless $retVal, $class;
    return $retVal;
}

=head2 Public Methods

=head3 Stop

    $bio->Stop($word);

Denote that a word is a stop word.

=over 4

=item word

Word to be declared as a stop word.

=back

=cut

sub Stop {
    # Get the parameters.
    my ($self, $word) = @_;
    # Store the stop word.
    $self->{cache}->{$word} = {stem => EMPTY, phonex => EMPTY, count => 0 };
}

=head3 Store

    $bio->Store($word, $stem, $count);

Store a word in the cache. The word will be mapped to the
specified stem and its count will be set to the specified value. The phonex
will be computed automatically from the stem. This method can also be used to
store exceptions. In that case, the count should be C<0>.

=over 4

=item word

Word to be stored.

=item stem

Proposed stem.

=item count

Proposed count. This should be C<0> for exceptions and C<1> for normal
words. The default is C<1>.

=back

=cut

sub Store {
    # Get the parameters.
    my ($self, $word, $stem, $count) = @_;
    # Default the count.
    my $realCount = (defined $count ? $count : 1);
    # Get the phonex for the specified stem.
    my $phonex = $self->_phonex($stem);
    # Store the word in the cache.
    $self->{cache}->{$word} = { stem => $stem, phonex => $phonex, count => $realCount };
}



=head3 Split

    my @words = $bio->Split($string);

Split a string into keywords. A keyword is this context is either a
delimiter sequence or a combination of letters, digits, underscores
(C<_>), and isolated single quotes (C<'>). All letters are converted to
lower case, and any white space sequence inside the string is converted
to a single space. Prior to splitting the string, certain strings that
have special biological meaning are modified, and certain delimiters are
converted. This helps to resolve some ambiguities (e.g. which alias names
use colons and which use vertical bars) and makes strings such as EC
numbers appear to be singleton keywords. The list of keywords we output
can be rejoined and then passed unmodified to a keyword search; however,
before doing that the individual pure words should be stemmed and checked
for spelling.

=over 4

=item string

Input string to process.

=item RETURN

Returns a List of normalized keywords and delimiters.

=back

=cut

sub Split {
    # Get the parameters.
    my ($self, $string) = @_;
    # Convert letters to lower case and collapse the white space. Note that we use the "s" modifier on
    # the substitution so that new-lines are treated as white space, and we take precautions so that
    # an undefined input is treated as a null string (which saves us from compiler warnings).
    my $lowered = (defined($string) ? lc $string : "");
    $lowered =~ s/\s+/ /sg;
    # Trim the leading space (if any).
    $lowered =~ s/^ //;
    # Fix the periods in EC and TC numbers. Note here we are insisting on real
    # digits rather than the things we treat as digits. We are parsing for real EC
    # and TC numbers, not generalized strings, and the format is specific.
    $lowered =~ s/(\d+|\-)\.(\d+|-)\.(\d+|-)\.(\d+|-)/$1_$2_$3_$4/g;
    # Fix non-trailing periods.
    $lowered =~ s/\.([$self->{WORD}])/_$1/g;
    # Fix non-leading minus signs.
    $lowered =~ s/([$self->{WORD}])[\-]/$1_/g;
    # Fix interior vertical bars and colons
    $lowered =~ s/([$self->{WORD}])[|:]([$self->{WORD}])/$1'$2/g;
    # Now split up the list so that each keyword is in its own string. The delimiters between
    # are kept, so when we're done everything can be joined back together again.
    Trace("Normalized string is -->$lowered<--") if T(4);
    my @pieces = map { split(/([^$self->{WORD}]+)/, $_) } $lowered;
    # The last step is to separate spaces from the other delimiters.
    my @retVal;
    for my $piece (@pieces) {
        while (substr($piece,0,1) eq " ") {
            $piece = substr($piece, 1);
            push @retVal, " ";
        }
        while ($piece =~ /(.+?) (.*)/) {
            push @retVal, $1, " ";
            $piece = $2;
        }
        if ($piece ne "") {
            push @retVal, $piece;
        }
    }
    # Return the result.
    return @retVal;
}

=head3 Region1

    my $root = $bio->Region1($word);

Return the suffix region for a word. This is referred to as I<region 1>
in the literature on word stemming, and it consists of everything after
the first non-vowel that follows a vowel.

=over 4

=item word

Lower-case word whose suffix region is desired.

=item RETURN

Returns the suffix region, or the empty string if there is no suffix region.

=back

=cut

sub Region1 {
    # Get the parameters.
    my ($self, $word) = @_;
    # Declare the return variable.
    my $retVal = "";
    # Look for the R1.
    if ($word =~ /[$self->{VOWEL}][^$self->{VOWEL}](.+)/i) {
        $retVal = $1;
    }
    # Return the result.
    return $retVal;
}

=head3 FindRule

    my ($prefix, $suffix, $replacement) = BioWords::FindRule($word, @rules);

Find the appropriate suffix rule for a word. Suffix rules are specified
as pairs in a list. Syntactically, the rule list may look like a hash,
but the order of the rules is important, so in fact it is a list. The
first rule whose key matches the suffix is applied. The part of the word
before the suffix, the suffix itself, and the value of the rule are all
passed back to the caller. If no rule matches, the prefix will be the
entire input word, and the suffix and replacement will be an empty string.

=over 4

=item word

Word to parse. It should already be normalized to lower case.

=item rules

A list of rules. Each rule is represented by two entries in the list-- a suffix
to match and a value to return.

=item RETURN

Returns a three-element list. The first element will be the portion of the word
before the matched suffix, the second element will be the suffix itself, and the
third will be the replacement recommended by the matched rule. If no rule
matches, the first element will be the whole word and the other two will be
empty strings.

=back

=cut

sub FindRule {
    # Get the parameters.
    my ($word, @rules) = @_;
    # Declare the return variables.
    my ($prefix, $suffix, $replacement) = ($word, EMPTY, EMPTY);
    # Search for a match. We'll stop on the first one.
    for (my $i = 0; ! $suffix && $i < $#rules; $i += 2) {
        my $len = length($rules[$i]);
        if ($rules[$i] eq substr($word, -$len)) {
            $prefix = substr($word, 0, length($word) - $len);
            $suffix = $rules[$i];
            $replacement = $rules[$i+1];
        }
    }
    # Return the results.
    return ($prefix, $suffix, $replacement);
}

=head3 Process

    my $stem = $biowords->Process($word);

Compute the stem of the specified word and record it in the cache.

=over 4

=item word

Word to be processed.

=item RETURN

Returns the stem of the word (which could be the original word itself. If the word
is a stop word, returns a null string.

=back

=cut

sub Process {
    # Get the parameters.
    my ($self, $word) = @_;
    # Verify that the cache is initialized.
    my $cache = $self->_initCache();
    # Declare the return variable.
    my $retVal;
    # Get the word in lower case and compute its length.
    my $lowered = lc $word;
    my $len = length $lowered;
    # Check to see what type of word it is.
    if ($lowered =~ /[^$self->{WORD}]/) {
        # It's delimiters. Return it unchanged and don't record it.
        $retVal = $lowered;
    } elsif ($len < $self->{SHORT}) {
        # It's too short. Treat it as a stop word.
        $retVal = EMPTY;
    } elsif (exists $cache->{$retVal}) {
        # It's already in the cache. Get the cache entry.
        my $entry = $cache->{$lowered};
        $retVal = $entry->{stem};
        # If it is NOT a stop word, count it.
        if ($retVal ne EMPTY) {
            $entry->{count}++;
        }
    } elsif ($len == $self->{SHORT}) {
        # It's already the minimum length. The stem is the word itself.
        $retVal = $lowered;
        # Store it if we're using the cache.
        if ($self->{cacheFlag}) {
            $self->Store($lowered, $retVal, 1);
        }
    } else {
        # Here we have a new word. We compute the stem and store it.
        $retVal = $self->_stem($lowered);
        # Store the word if we're using the cache.
        if ($self->{cacheFlag}) {
            $self->Store($lowered, $retVal, 1);
        }
    }
    # We're done. If the stem is non-empty, add it to the stem list.
    if ($retVal ne EMPTY) {
        $self->{stems}->{$retVal} = 1;
    }
    # Return the stem.
    return $retVal;
}

=head3 IsWord

    my $flag = $biowords->IsWord($word);

Return TRUE if the specified string is a word and FALSE if it is a
delimiter.

=over 4

=item word

String to examine.

=item RETURN

Returns TRUE if the string contains no delimiters, else FALSE.

=back

=cut

sub IsWord {
    # Get the parameters.
    my ($self, $word) = @_;
    # Test the word.
    my $retVal = ($word =~ /^[$self->{WORD}]+$/);
    # Return the result.
    return $retVal;
}

=head3 StemList

    my @stems = $biowords->StemList();

Return the list of stems found in the last search expression.

=cut

sub StemList {
    # Get the parameters.
    my ($self) = @_;
    # Return the keys of the stem hash.
    my @retVal = keys %{$self->{stems}};
    return @retVal;
}

=head3 StemLookup

    my ($stem, $phonex) = $biowords->StemLookup($word);

Return the stem and phonex for the specified word.

=over 4

=item word

Word whose stem and phonex are desired.

=item RETURN

Returns a two-element list. If the word is found in the cache, the
list will consist of the stem followed by the phonex. If the word
is a stop word, the list will consist of two empty strings.

=back

=cut

sub StemLookup {
    # Get the parameters.
    my ($self, $word) = @_;
    # Declare the return variables.
    my ($stem, $phonex);
    # Get the cache.
    my $cache = $self->{cache};
    # Check the cache for the word.
    if (exists $cache->{$word}) {
        # It's found. Return its data.
        ($stem, $phonex) = map { $_->{stem}, $_->{phonex} } $cache->{$word};
    } else {
        # It's not found. Compute the stem and phonex.
        my $lowered = lc $word;
        $stem = $self->Process($lowered);
        $phonex = $self->_phonex($stem);
    }
    # Return the results.
    return ($stem, $phonex);
}

=head3 WordList

    my $words = $biowords->WordList($keep);

Return a list of all of the words that were found by
L</AnalyzeSearchExpression>. Stop words will not be included.
Because the list could potentially contain millions of words, it is returned
as a list reference.

=cut

sub WordList {
    # Get the parameters.
    my ($self) = @_;
    # Get the cache.
    my $cache = $self->{cache};
    # Declare the return variable.
    my $retVal;
    # Extract the desired words from the cache.
    $retVal = [ grep { $cache->{$_}->{count} } keys %{$cache} ];
    # Return the result.
    return $retVal;
}


=head3 PrepareSearchExpression

    my $searchExpression = $bio->PrepareSearchExpression($string);

Convert an incoming string to a search expression. The string is split
into pieces, the pieces are stemmed and processed into the cache, and
then they are rejoined after certain adjustments are made. In particular,
words without an operator preceding them are prefixed with a plus (C<+>)
so that they are treated as required words.

=over 4

=item string

Search expression to prepare.

=item RETURN

Returns a modified version of the search expression with words converted to
stems, stop words eliminated, and plus signs placed before unmodified words.

=back

=cut

sub PrepareSearchExpression {
    # Get the parameters.
    my ($self, $string) = @_;
    # Declare the return variable.
    my $retVal = "";
    # Analyze the search expression.
    my @parts = $self->AnalyzeSearchExpression($string);
    # Now we have to put the pieces back together. At any point, we need
    # to know if we are inside quotes or in the scope of an operator.
    my ($inQuotes, $activeOp) = (0, 0);
    for my $part (@parts) {
        # Is this a word?
        if ($part =~ /[a-z0-9]$/) {
            # Yes. If no operator is present, add a plus.
            if (! $activeOp && ! $inQuotes) {
                $retVal .= "+";
                $activeOp = 0;
            }
        } else {
            # Here we have one or more operators. We process them
            # individually.
            for my $op (split //, $part) {
                if ($op eq '"') {
                    # Here we have a quote.
                    if ($inQuotes) {
                        # A close quote turns off operator scope.
                        $inQuotes = 0;
                        $activeOp = 0;
                    } else {
                        # An open quote puts us in quote mode. Words inside
                        # quotes do not need the plus added.
                        $inQuotes = 1;
                    }
                } elsif ($op eq ' ') {
                    # Spaces detach us from the preceding operator.
                    $activeOp = 0;
                } else {
                    # Everything else puts us in operator scope.
                    $activeOp = 1;
                }
            }
        }
        # Add this part to the output string.
        $retVal .= $part;
    }
    # Return the result.
    return $retVal;
}

=head3 AnalyzeSearchExpression

    my @list = $bio->AnalyzeSearchExpression($string);

Analyze the components of a search expression and return them to the
caller. Statistical information about the words in the expression will
have been stored in the cache, and the return value will be a list of
stems and delimiters.

=over 4

=item string

Search expression to analyze.

=item RETURN

Returns a list of words and delimiters, in an order corresponding to the
original expression. Real words will have been converted to stems and
stop words will have been converted to empty strings.

=back

=cut

sub AnalyzeSearchExpression {
    # Get the parameters.
    my ($self, $string) = @_;
    # Clear the stem list.
    $self->{stems} = {};
    # Normalize and split the search expression.
    my @parts = $self->Split($string);
    # Declare the return variable.
    my @retVal;
    # Now we loop through the parts, processing them.
    for my $part (@parts) {
        my $stem = $self->Process($part);
        push @retVal, $stem;
        Trace("Stem of \"$part\" is \"$stem\".") if T(4);
    }
    # Return the result.
    return @retVal;
}

=head2 Internal Methods

=head3 _initCache

    my $cache = $biowords->_initCache();

Insure the cache is initialized. If exception and stop word files exist,
they will be read into memory and used to populate the cache. A reference to
the cache will be returned to the caller.

=cut

sub _initCache {
    # Get the parameters.
    my ($self) = @_;
    # Check for a stopword file.
    if ($self->{stopFile}) {
        # Read the file.
        my @lines = Tracer::GetFile($self->{stopFile});
        # Insert it into the cache.
        for my $line (@lines) {
            $self->Stop(lc $line);
        }
        # Denote that the stopword file has been processed.
        $self->{stopFile} = EMPTY;
    }
    # Check for an exception list.
    if ($self->{exceptionFile}) {
        # Read the file.
        my @lines = Tracer::GetFile($self->{exceptionFile});
        # Loop through the lines.
        for my $line (@lines) {
            # Extract the words.
            my @words = split /\s+/, $line;
            # Map all of the starting words to the last word.
            my $stem = pop @words;
            for my $word (@words) {
                $self->Store($word, $stem, 0);
            }
        }
        # Denote that the exception file has been procesed.
        $self->{exceptionFile} = EMPTY;
    }
    # Return the cache.
    return $self->{cache};
}

=head3 _stem

    my $stem = $biowords->_stem($word);

Compute the stem of an incoming word. This is an internal method that
does not check the cache or do any length checking.

=over 4

=item word

The word to stem. It must already have been converted to lower case.

=item RETURN

Returns the stem of the incoming word, which could possibly be the word itself.

=back

=cut

sub _stem {
    # Get the parameters.
    my ($self, $word) = @_;
    # Copy the word so we can mangle it.
    my $retVal = $word;
    # Convert consonant "y" to "j".
    $retVal =~ s/^y/j/;
    $retVal =~ s/([aeiou])y/$1j/g;
    # Convert vowel "y" to "i".
    $retVal =~ tr/y/i/;
    # Compute the R1 and R2 regions. R1 is everything after the first syllable,
    # and R2 is everything after the second syllable.
    my $r1 = $self->Region1($retVal);
    my $r2 = $self->Region1($r1);
    # Compute the physical locations of the regions.
    my $len = length $retVal;
    my $p1 = $len - length $r1;
    my $p2 = $len - length $r2;
    # These variables will be used by FindRule.
    my ($prefix, $suffix, $ruleValue);
    # Remove genitives.
    ($retVal, $suffix, $ruleValue) = FindRule($retVal, q('s') => EMPTY, q('s) => EMPTY, q(') => EMPTY);
    # Process latin endings.
    ($prefix, $suffix, $ruleValue) = FindRule($retVal, us => 'i', um => 'a', ae => 'a');
    # Latin endings only apply if they follow a consonant.
    if ($prefix =~ /[^aeiou]$/) {
        $retVal = "$prefix$ruleValue";
    }
    # Convert plurals to singular.
    ($prefix, $suffix, $ruleValue) = FindRule($retVal, sses => 'ss', ied => 'i', ies => 'i', s => 's');
    if ($ruleValue eq 'i') {
        # If the prefix length is one, we append an "e".
        if (length $prefix <= 1) {
            $ruleValue .= "e"
        }
    } elsif ($ruleValue eq 's') {
        # Here we have a naked "s" at the end. We null it out if the prefix ends in a
        # consonant or an 'e'. Nulling it will cause the "s" to be removed.
        if ($prefix =~ /[^aiou]$/) {
            $ruleValue = EMPTY;
        }
    }
    # Finish the singularization. The possibly-modified rule value is applied to the prefix.
    # If no rule applied, this has no effect, since the prefix is the whole word and the
    # rule value is the empty string.
    $retVal = "$prefix$ruleValue";
    # Catch the special "izing" construct.
    ($prefix, $suffix, $ruleValue) = FindRule($retVal, izing => 'is');
    $retVal = "$prefix$ruleValue";
    # Convert adverbs to adjectives.
    ($prefix, $suffix, $ruleValue) = FindRule($retVal, eedli => 'ee', eed => 'ee',
                                              ingli => EMPTY, ing => EMPTY, edli => EMPTY,
                                              ed => EMPTY);
    # These rules only apply in limited circumstances.
    if ($ruleValue eq 'ee') {
        # The "ee" replacement only applies if it occurs in region 1. If it does not
        # occur there, then we put the suffix back.
        if (length($prefix) < $p1) {
            $ruleValue = $suffix;
        }
    } elsif ($suffix) {
        # Here the rule value is the empty string. It only applies if there is a
        # vowel in the prefix.
        if ($prefix !~ /[aeiou]/) {
            # No vowel, so put the suffix back.
            $ruleValue = $suffix;
        } else {
            # The prefix is now the whole word, because the rule value is the empty
            # string. Check for ending mutations. We may need to add an "e" or
            # remove a doubled letter.
            ($prefix, $suffix, $ruleValue) = FindRule($prefix, at => 'ate', bl => 'ble', iz => 'ize',
                                                      bb => 'b', dd => 'd', ff => 'f', gg => 'g',
                                                      mm => 'n', nn => 'n', pp => 'p', rr => 'r',
                                                      tt => 't');
        }
    }
    # Apply the modifications.
    $retVal = "$prefix$ruleValue";
    # Now we get serious. Here we're looking for special suffixes.
    ($prefix, $suffix, $ruleValue) = FindRule($retVal, ational => 'ate', tional => 'tion',
                                              enci => 'ence', anci => 'ance', abli => 'able',
                                              entli => 'ent', ization => 'ize', izer => 'ize',
                                              ation => 'ate', ator => 'ate', alism => 'al',
                                              aliti => 'al', alli => 'al', fulness => 'ful',
                                              ousness => 'ous', ousli => 'ous', ivness => 'ive',
                                              iviti => 'ive', biliti => 'ble', bli => 'ble',
                                              logi => 'log', fulli => 'ful', lessli => 'less',
                                              cli => 'c', dli => 'd', eli => 'e', gli => 'g',
                                              hli => 'h', kli => 'k', mli => 'm', nli => 'n',
                                              rli => 'r', tli => 't', alize => 'al', icate => 'ic',
                                              iciti => 'ic', ical => 'ic');
    # These only apply if they are in R1.
    if ($ruleValue && length($prefix) >= $p1) {
        $retVal = "$prefix$ruleValue";
    }
    # Conflate "ence" to "ent" if it's in R2.
    ($prefix, $suffix, $ruleValue) = FindRule($retVal, ence => 'ent');
    if ($ruleValue && length($prefix) >= $p2) {
        $retVal = "$prefix$ruleValue";
    }
    # Now zap "ful", "ness", "ative", and "ize", but only if they're in R1.
    ($prefix, $suffix, $ruleValue) = FindRule($retVal, ful => EMPTY, ness => EMPTY, ize => EMPTY);
    if (length($prefix) >= $p1) {
        $retVal = $prefix;
    }
    # Now we have some suffixes that get deleted if they're in R2.
    ($prefix, $suffix, $ruleValue) = FindRule($retVal, ement => EMPTY, ment => EMPTY, able => EMPTY,
                                              ible => EMPTY, ance => EMPTY, ence => EMPTY,
                                              ant => EMPTY, ent => EMPTY, ism => EMPTY, ate => EMPTY,
                                              iti => EMPTY, ous => EMPTY, ive => EMPTY, ize => EMPTY,
                                              al => EMPTY, er => EMPTY, ic => EMPTY, sion => 's',
                                              tion => 't', alli => 'al');
    if (length($prefix) >= $p2) {
        $retVal = $prefix;
    }
    # Process the doubled L.
    ($prefix, $suffix, $ruleValue) = FindRule($retVal, ll => 'l');
    $retVal = "$prefix$ruleValue";
    # Return the result.
    return $retVal;
}

=head3 _phonex

    my $phonex = $biowords->_phonex($word);

Compute the phonetic version of a word. Vowels are ignored, doubled
letters are trimmed to singletons, and certain letters or letter
combinations are conflated. The resulting word is likely to match a
misspelling of the original.

This is an internal method. It does not check the cache and it assumes
the word has already been converted to lower case.

=over 4

=item word

Word whose phonetic translation is desired.

=item RETURN

Returns a more-or-less phonetic translation of the word.

=back

=cut

sub _phonex {
    # Get the parameters.
    my ($self, $word) = @_;
    # Declare the return variable.
    my $retVal = $word;
    # Handle some special cases. For typed IDs, we remove the type. For
    # horrible multi-part chemical names, remove everything in front of
    # the last underscore.
    if ($word =~ /_([$self->{LETTER}]+)$/ && length($1) > $self->{SHORT}) {
        $word = $1;
    } elsif ($word =~ /^[$self->{LETTER}]+'(.+)$/ && length($1) > $self->{SHORT}) {
        $word = $1;
    }
    # Convert the pesky sibilant combinatorials to their own private symbol.
    $retVal =~ s/sch|ch|sh/S/g;
    # Convert PH to F.
    $retVal =~ s/ph/f/g;
    # Remove silent constructs.
    $retVal =~ s/gh//g;
    $retVal =~ s/^ps/s/;
    # Convert soft G to J and soft C to S.
    $retVal =~ s/g(e|i)/j$1/g;
    $retVal =~ s/c(e|i)/s$1/g;
    # Convert C to K, S to Z, M to N.
    $retVal =~ tr/csm/kzn/;
    # Singlify doubled letters.
    $retVal =~ tr/a-z//s;
    # Split off the first letter.
    my $first = substr($retVal, 0, 1, "");
    # Delete the vowels.
    $retVal =~ s/[$self->{VOWEL}]//g;
    # Put the first letter back.
    $retVal = $first . $retVal;
    # Return the result.
    return $retVal;
}

1;

MCS Webmaster
ViewVC Help
Powered by ViewVC 1.0.3