Packages Used:
Perl 5.005 ...................................... CPAN
British National Corpus ......... http://info.ox.ac.uk/bnc/
Introduction
Natural Language Processing (NLP) is a branch of Artificial Intelligence involving the development of computer programs to understand or generate documents written in "natural language" — that is, any human language, like English, Hebrew, Swahili, or Xhosa. Creating programs that exhibit full understanding of natural language has long been a goal of AI.
Here are some typical NLP applications:
Word assistance to users. For example, a human might ask: "What is the adverb form of 'accident'?", and the computer might reply: "'accidentally' is probably the word you want, although 3% of users spell it 'accidently'".
A smarter web search engine, or a knowbot (knowledge robot), that lets you search for a keyword such as compute to retrieve all documents that contain that keyword, or conceptually related keywords like computationally.
A smart document categorizer that reads a series of documents and sorts them into different categories based upon disproportionate use of particular keywords ("this document appears to be about nuclear physics, because it mentions keywords like atom and nucleus far more than average documents").
A smart document summarizer that summarizes one or more documents into a more compact form and presents a digest (see CS-Web: A Lightweight Summarizer for HTML, in TPJ #16).
All of these programs require some understanding of natural language. For perfect summarization or perfect translation, we'd need total understanding, and that's far too difficult because of the intricacies of human communication.
Fortunately, we don't need complete language understanding to obtain some utility. Even a little knowledge of language is useful for some applications, and in this spirit, let's start at the lower levels and pick one small part of natural language which we can process relatively easily — the structure of single words.
Morphology: Word Form And Structure
Natural languages like English often express shades of meaning by changing the form of a single word, such as by adding a prefix or a suffix to an existing word. Adding s to computer changes the meaning from "a single computer" to "multiple computers": a noun-to-noun change. We can also move between nouns, verbs, and adjectives with similarly simple changes (move becomes movement). The study of the form of words is called morphology.
This article shows how to implement a program that uses morphological rules to generate many related words from one given word. This will be useful in search tools, because when the user searches for computer, the search engine can now find documents with related words like compute, computing, and computationally.
So let's look at four common morphological processes and how we might model them in Perl. The first is called inflection, in which a word is transformed to indicate the tense (for example, look becomes looks, looking, and looked) or the number (for example, husky becomes huskies).
As soon as we start thinking about such transformations from a Perl perspective, we can't help but think of regular expressions: if you have the verb look and you want to generate the present participle (looking), this regular expression springs to mind:
s/$/ing/ # add "ing" to $_
In this spirit, Table 1a shows some typical inflections that pluralize nouns, and Table 1b shows some tense inflections on verbs.
Unfortunately, English has many irregularities, ranging from whether or not to suppress a trailing e, up to the impressive past tense of go: went, with no letters in common!
Sometimes two different rules can apply to a word, but only one of the inflections is actually correct. Is the plural form of omnibus really omnibi according to the s/us$/i/ rule? Or is it actually omnibuses because of s/us$/uses/? In some cases (geniuses and genii, syllabuses and syllabi) both inflections are permitted. How can a program know what to do?
We'll downgrade our rules to rules of thumb — or, as AI researchers call them, heuristics. We'll accept that they will sometimes be wrong, and later we'll develop a way to filter out incorrect applications.
Table 1a: Some rules for pluralizing English nouns.
CLASS | SINGULAR | PLURAL | RULE |
regular | dog, cat, computer | dogs, cats, computers | s/$/s/ |
ending in s, x, z, sh, ch | gas, tax, wish | gases, taxes, wishes | s/$/es/ |
ending in f | calf, elf, dwarf | calves, elves, dwarves | s/f$/ves/ |
ending in o | bamboo, folio | bamboos, folios | s/o$/os/ |
ending in o | domino, torpedo | dominoes, torpedoes | s/o$/oes/ |
ending in y | sky, husky | skies, huskies | s/y$/ies/ |
ending in y | decoy, donkey | decoys, donkeys | s/y$/ys/ |
ending in sis | crisis, analysis | crises, analyses | s/sis$/ses/ |
ending in us | radius, genius | radii, genii | s/us$/i/ |
ending in us | omnibus, genius | omnibuses, geniuses | s/us$/uses/ |
irregular | man | men | s/an$/en/ |
Table 1b: Some rules for inflecting English verbs.
CLASS | BASE | -S FORM | -ING FORM | PAST FORM |
regular | look | looks (s/$/s/) | looking (s/$/ing/) | looked (s/$/ed/) |
regular | walk | walks (s/$/s/) | walking (s/$/ing/) | walked (s/$/ed/) |
regular | decide | decides (s/$/s/) | deciding (s/e$/ing/) | decided (s/e$/ed/) |
regular | watch | watches (s/$/es/) | watching (s/$/ing/) | watched (s/$/ed/) |
irregular | see | sees (s/$/s/) | seeing (s/$/ing/) | saw (s/ee$/aw/) |
irregular | flee | flees (s/$/s/) | fleeing (s/$/ing/) | fled (s/ee$/ed/) |
irregular | come | comes (s/$/s/) | coming (s/e$/ing/) | came (s/ome$/ame/) |
irregular | go | goes (s/$/es/) | going (s/$/ing/) | went (s/go$/went/) |
irregular | take | takes (s/$/s/) | taking (s/e$/ing/) | took (s/take$/took/) |
The second morphological process is called derivation, and involves taking a base word (the stem) and transforming it from one part of speech into another (for instance, from the verb ignite to the noun ignition). Table 2 shows some typical derivations.
Table 2: Examples of derivational morphology
NOUN | DERIVATION | CLASS |
nation | national | adjective |
nation | nationalize | verb |
VERB | DERIVATION | CLASS |
slow | slowly | adverb |
observe | observation | noun |
observe | observable | adjective |
nationalize | nationalization | noun |
develop | development | noun |
ADJECTIVE | DERIVATION | CLASS |
national | nationally | adverb |
thick | thickness | noun |
Nominalization is an especially important type of derivation — deriving a noun from a verb or adjective. Let's look at one example of derivation in detail — the verb observe. As well as the usual inflectional tenses (observing, etc.), we derive the adjective observable and, by nominalization, the noun observation, the plural inflection observations, another adjective observational, and an adverb observationally.
There are two other major morphological processes in which words are formed in English: compounding (ear + ache = earache, work + load = workload) and affixation (dis + arrange = disarrange). Compounding combines two words; affixation combines a word with a prefix or suffix. These actions can also be expressed in regular expressions (s/^/ear/), but for the purposes of this article we'll ignore compounding and affixation.
To summarize, we can transform words by adding certain prefixes or suffixes to a stem, or by joining two stems together. Salton (1989) estimated that there are about 75 prefixes and around 250 suffixes in English. Most importantly for us, Perl regular expressions are perfectly suited to performing all these transformations.
Now let's build a set of derivation and inflection rules and then use these rules to analyze a corpus, a collection of carefully selected documents. Several large corpuses, representative of different types of documents, have been built — a typical representative corpus of everyday written language is the British National Corpus. Details about the BNC research project can be found on the web at http://info.ox.ac.uk/bnc/. It is a collection of over 4,000 different documents with around 100 million words in over 6 million sentences. The number of distinct words — the vocabulary — is a little under 400,000. We'll use it for the examples below.
Morphological Analysis And Perl
We now present a series of Perl programs performing simple morphological inflections and derivations on single words. Our starting point is a series of documents, forming our corpus. Since we're working only with single words, we don't need to retain the original word order in the documents. All we need is a list of the words that appear in the documents, and how many times each word appears.
Our first step, therefore, is to transform the documents into a word frequency list which tells us how many distinct word forms there are, and how many times each form occurs. The rest of the programs can then operate on the word frequency lists, which are much smaller and more easily managed than the documents themselves.
Constructing A Word Frequency List
Our first program, mkfreq, creates a word frequency list from a set of documents. The only hard part is defining what constitutes a word. One definition might be a maximal run of alphanumeric characters: that is, whatever remains when the text is split on /[\W_]+/.
This definition isn't perfect: it splits contractions like don't into two words. Let's live with that, and instead of calling the things identified by our regular expression words, we'll call them tokens. We'll routinely lowercase all tokens, since "work" and "Work" shouldn't be treated as different words.
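Before looking at the full program, here is a minimal sketch of the tokenize-and-count core, assuming the documents arrive as files named on the command line (or on standard input); the real mkfreq in Listing 1 adds the reporting described next.

# A minimal sketch (not the real mkfreq): count lowercased tokens.
my ( %freq, $totaltokens );
while ( my $line = <> ) {
    foreach my $token ( split( /[\W_]+/, lc($line) ) ) {
        next if $token eq "";    # split can produce a leading empty field
        $freq{$token}++;
        $totaltokens++;
    }
}
print "Total tokens: $totaltokens\n";
print "Vocabulary:   ", scalar( keys %freq ), "\n";
foreach my $word ( sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq ) {
    print "$word\t$freq{$word}\n";
}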
Listing 1 shows mkfreq. We use a hash to store the word frequencies, and display two simple but useful measures of the linguistic diversity in the documents: the total number of tokens and the vocabulary. The frequency list is displayed in descending order of frequency using a custom sort order. For cryptographic applications, a knowledge of the most frequent words in the English language is useful. Looking at the nine most frequent words in the British National Corpus, we see:
Word | Frequency |
the | 6,198,660 |
of | 3,112,676 |
and | 2,695,784 |
to | 2,672,790 |
a | 2,241,128 |
in | 1,996,243 |
that | 1,150,942 |
it | 1,091,584 |
is | 1,004,574 |
Total | 22,164,381 |
Some observations: these nine words alone account for more than a fifth of the corpus. They're all function words (determiners, conjunctions, pronouns, prepositions, and the verb to be), carrying no information about the subject of a document. Most search engines exclude these uninformative words from their indexes. Excluded words are called stopwords.
The frequency list can also help us out by identifying misspelled words; we'd expect to find uncommon spelling mistakes at the bottom of our frequency list. We might decide to exclude all tokens with a frequency below some threshold (say, one or two occurrences) in order to exclude many spelling mistakes (and, inevitably, some legitimate words) from our processing. In the British National Corpus discussed above, 154,883 tokens occur exactly once and 49,465 tokens occur exactly twice. Together, these comprise 204,348 tokens out of the total vocabulary of 381,807 tokens: 53% of the vocabulary occurs fewer than three times!
Excluding low-frequency tokens is a simple addition to mkfreq. We omit that here, but it's in the version of mkfreq on the TPJ website. We also omit tokens with digits.
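The filter itself is only a few lines. Here is one way it might look (the threshold of three occurrences is purely illustrative, and the version on the TPJ website may differ):

# Illustrative filter: drop tokens containing digits and tokens
# occurring fewer than $minfreq times.
my $minfreq = 3;
foreach my $token ( keys %freq ) {
    delete $freq{$token}
        if $token =~ /\d/ || $freq{$token} < $minfreq;
}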
Morphological Inflections And Derivations
Some derivation rules apply only to particular word categories — you wouldn't try to find the plural of a verb. However, there's no algorithmic way of identifying from its form alone whether an English word is a verb or a noun; not even reliable heuristics are possible.
Ideally, we'd have some way to take a word and find out which category it belonged to. However, this would require a knowledge base containing every base word from the corpus. Constructing and maintaining such a knowledge base would be a major undertaking.
So, let's turn the problem on its head: how far can we get with no knowledge base whatever? Could we, in fact, make reasonably accurate derivations based only upon our corpus? If we have no knowledge of word categories, then we must start by applying all derivation rules to all tokens in the corpus, and then see if there's any way of eliminating the duds.
Representing A Single Rule
A rule is basically an s/// regular expression. However, we need to store a collection of rules in an array and apply them all to a token. Having decided to ignore affixation and compounding, if we look back to Tables 1a, 1b, and 2, we notice that all our rules for inflection and derivation are of the form s/ending$/suffix/.
This suggests that we can represent each rule as an ending-suffix pair with the meaning: if the token ends in ending, remove that ending and append suffix; otherwise, the rule does not apply.
Given this, we could write a function that attempts to apply a rule pair (ending, suffix) to a given token, returning either the derived token or the empty string (if the token does not end in ending). This subroutine, derive(), is shown below.
sub derive {    # Make a single derivation
    my ( $t, $r, $a ) = @_;
    if ( $r ) {
        return "" unless $t =~ /.$r$/;
        $t =~ s/$r$//;
    }
    return "$t$a";
}
We embedded this into a test harness that we named applyruletotoken, which, at its core, is simply this:
my $d = derive( @ARGV );
print "Derivation is '$d'\n" if $d;
print "No derivation\n" unless $d;
We can now perform a few simple tests:
Command | Response |
applyruletotoken look '' ing | Derivation is 'looking' |
applyruletotoken look 's' sing | No derivation |
applyruletotoken take 'e' ing | Derivation is 'taking' |
Representing Many Rules
The next step is to figure out how to represent all the derivation rules so that we can apply all of them to each token and determine which succeed. For convenience we'll collapse our rule pair (ending, suffix) into a single string: -ending+suffix, or just +suffix if ending is empty. The subroutine setuprules(), shown below, creates an array of rules, @rule, and then processes it into an array of removal strings, @remove, and an array of addition strings, @add. Only a few rules are shown, although we routinely use many more.
# Set up the derivation rules and processed forms.
sub setuprules {
    my ( $r, $a );
    @rule =    # A rule is a disguised pair of the form -R+A or +A
      (
        "+s", "+es", "-us+i", "-is+es",          # singular -> plural noun
        "+ion", "+tion", "+sion", "-te+tion",    # verb -> noun
        "-ify+ification", "+ment", "+ness", "+fulness",
        "+ful", "+ic", "+al",                    # noun -> adjective
        "-e+ing", "-s+ssing", "+ing", "+ed",     # verb -> tense forms
        "+ly",                                   # adjective -> adverb
      );
    @add = @remove = ();
    foreach (@rule) {
        ( $r, $a ) = ( $1, $2 ) if /^-(\w+)\+(\w+)$/;
        ( $r, $a ) = ( "", $1 ) if /^\+(\w+)$/;
        push( @remove, $r );
        push( @add, $a );
    }
}
We had to make a few small changes to derive(), and our new version is shown below. Whenever it finds a valid derivation, it calls handlederivation(), providing the token, the rule, and the derived token.
# Our updated derive, which now calls handlederivation()
sub derive {
    my ( $t ) = @_;
    my ( $i, $a, $r, $d );
    for ( $i = 0; $i < @add; $i++ ) {
        $a = $add[$i];
        $r = $remove[$i];
        $d = $t;
        if ( $r ) {
            next unless $t =~ /.$r$/;
            $d =~ s/$r$//;
        }
        $d = "$d$a";
        handlederivation( $t, $rule[$i], $d );
    }
}

sub handlederivation {
    my ( $token, $rule, $derivation ) = @_;
    print "$token\t$rule\t$derivation\n";
}
We embedded these subroutines into a test harness called applyallrules, whose heart is simply this:
setuprules();
derive( @ARGV );
A test such as applyallrules look will generate the following output:
look    +s         looks
look    +es        lookes
look    +ion       lookion
look    +tion      looktion
look    +sion      looksion
look    +ment      lookment
look    +ness      lookness
look    +fulness   lookfulness
look    +ful       lookful
look    +ic        lookic
look    +al        lookal
look    +ing       looking
look    +ed        looked
look    +ly        lookly
As we would expect, this generates some good derivations (looks, looking, looked) but the majority of the derivations are not real English words.
Telling Good From Bad
We need some way to distinguish the real English words from the bogus, and this is the point at which an English word knowledge base could be used. We could use an external wordlist such as the Unix dictionary /usr/dict/words, rejecting any derivation that doesn't appear in it, but many uncommon words (and even some common ones) aren't included. Augmenting a wordlist with a large collection of specialized words would be another major endeavor.
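For comparison, such a wordlist check is easy to sketch, assuming a one-word-per-line file like /usr/dict/words; the problem is coverage, not code:

# Sketch: load a wordlist and use it to vet candidate derivations.
my %isword;
open( WORDS, "/usr/dict/words" ) || die "can't open /usr/dict/words\n";
while ( my $w = <WORDS> ) {
    chomp $w;
    $isword{ lc $w } = 1;
}
close WORDS;

# A derivation $d would then be accepted only if $isword{$d} is true.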
The Key Insight!
Fortunately, we have another wordlist available that is a source of exactly the specialized words that we need — the frequency list derived from the corpus! Using this insight, we propose the following method:
1. Take an existing token in the frequency list.
2. Apply all derivation rules to that token as above.
3. Reject any derivation that is not itself present in the frequency list.
Implementing It
Our new system, filterallrules, is just applyallrules with two changes: the readfreqfile() subroutine shown below reads a frequency list into %freq, and derive() calls handlederivation() only if the word is in the frequency list: handlederivation( $t, $rule[$i], $d ) if $freq{$d};.
sub readfreqfile {    # Read the frequency list
    my ( $filename ) = @_;
    my ( $line, $word, $count );
    open( IN, $filename ) || die "can't open $filename\n";
    %freq = ();
    while ( $line = <IN> ) {
        last unless $line =~ m/^=======/;
        chomp $line;
        $totaltokens = $1 if $line =~ m/= Total tokens:\s+(\d+)/;
        $vocab       = $1 if $line =~ m/= Vocabulary:\s+(\d+)/;
    }
    while ( $line = <IN> ) {
        chomp $line;
        next if $line =~ /^\d/;
        ( $word, $count ) = split( /\s+/, $line );
        $freq{$word} = $count;
    }
    close IN;
}
Applying this to look with a frequency file derived from the British National Corpus now correctly sifts out the true English words from the duds:
look    +s      looks
look    +ing    looking
look    +ed     looked
These inflections are rather straightforward and obvious. Consider a more complex example — the derivations of approximate:
approximate    +s          approximates
approximate    -te+tion    approximation
approximate    -e+ing      approximating
approximate    +ly         approximately
This technique is not perfect. There are two main types of false derivations produced: first, when two completely unrelated English words happen to match a derivation rule (for example, if the +s rule is applied to asses, we get assess). There are surprisingly few cases of this type, however, because English has strong morphological constraints.
The second and more common type of false derivation arises from specialized terms, proper names, foreign words, and spelling mistakes in our corpus. For example, the electrical term AC is lowercased to ac and then is used to generate seemingly related words such as aced and action. This is inevitable without deeper knowledge about the words we're processing.
Applying The Derivation Process To All Tokens
The final step is to apply this derivation process not just to one token, but to all tokens within the corpus, and to display the results in a pleasing format. To apply the derivation rules to all tokens, we define a new subroutine called deriveall() that iterates over all tokens:
# Call derive() for every token in the frequency table
sub deriveall {
    my ( $t );
    %b2d = ();
    foreach $t ( sort( keys( %freq ) ) ) {
        derive( $t );
    }
}
To tabulate the results, we redefine handlederivation() as shown below. The successive calls to handlederivation() now build a hash called %b2d (base-to-derivations) that maps each base token to an array of all of its derivations. We use references so that our hash can have many values for each key.
sub handlederivation {
    my ( $token, $rule, $derivation ) = @_;
    my ( $ref );
    $ref = $b2d{$token};
    $ref = $b2d{$token} = [ ] unless defined $ref;
    push( @$ref, $derivation );
}
Once deriveall() completes, the entire %b2d hash has been constructed. To display all the derivations, we define a subroutine named tabulate, which displays each word and its frequency:
# Tabulate the results in the %b2d hash.
sub tabulate {
    my ( @b, $base, $list, $t, $f );
    # @b contains all the base tokens (the keys of %b2d)
    @b = sort( keys( %b2d ) );
    foreach $base (@b) {
        # $list is a reference to an array of derivations
        $list = $b2d{$base};
        $f = $freq{$base};
        print "$base/$f\t";
        foreach $t ( @$list ) {
            $f = $freq{$t};
            print "$t/$f\t";
        }
        print "\n";
    }
}

We assembled a final test harness, also called tabulate, which did little more than this:
readfreqfile( $ARGV[0] );
setuprules();
deriveall();
tabulate();
Now the user can receive a summary of the derivations obtained, where each token is displayed followed by its frequency. Here is an excerpt of the output:
approximate/39   approximation/258   approximating/2   approximately/35
argument/3       arguments/3
arise/34         arises/19           arising/41
arrange/4        arrangement/17      arranging/1
artificial/10    artificially/1
aspect/2         aspects/4
assess/23        assessment/10       assessing/10      assessed/7
assistant/3      assistants/1
assume/68        assumes/10          assuming/69       assumption/49    assumptions/19
asymptotic/17    asymptotics/8
Summary
We have managed to obtain remarkably accurate heuristic derivations by taking a set of real texts, turning them into a frequency distribution, applying a series of very simple addition and substitution rules to every token, and rejecting derivations that do not appear in the frequency distribution. Some derivations are wrong, but a high percentage are accurate. The results are certainly good enough to enable Web searches for morphologically related words — a significant improvement over a dumb keyword search.
Future Work
To enhance our system, more rules could be added (we routinely use a much bigger set, and are augmenting them all the time). More categories of replacement rules (such as affixation and compounding) should be added too.
We could make mkfreq retrieve an arbitrary web page, discard its HTML tags, and generate its frequency list using the LWP::Simple module, exactly as shown by Jon Orwant and Dan Gruhl in TPJ #13. A version that does this (webmkfreq) is available from the TPJ web site.
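The idea is straightforward; a rough sketch, using a naive regex to strip tags rather than a proper HTML parser (the real webmkfreq may well differ), looks like this:

# Rough sketch of a web-page frequency counter (not the real webmkfreq).
use LWP::Simple;

my $url  = $ARGV[0] || die "usage: webmkfreq URL\n";
my $page = get( $url );
die "couldn't fetch $url\n" unless defined $page;

$page =~ s/<[^>]*>/ /g;    # crude tag stripping; a real HTML parser is safer

my %freq;
foreach my $token ( split( /[\W_]+/, lc($page) ) ) {
    next if $token eq "" || $token =~ /\d/;
    $freq{$token}++;
}
foreach my $word ( sort { $freq{$b} <=> $freq{$a} } keys %freq ) {
    print "$word\t$freq{$word}\n";
}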
Reading in the entire token frequency file cost a lot of time and memory, so we implemented it as a DBM file instead. This significantly speeds up programs that access only a few frequency elements (such as plural and adverb), but slows down programs like tabulate that access a very large number of frequency elements many times. We concluded that in our era of large high-speed computers and bountiful memory, reading a token frequency file into memory is realistic.
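For reference, here is a sketch of the DBM approach using Perl's built-in dbmopen(); the file name bnc_freq is made up for the example, and the sketch skips the header lines that readfreqfile() handles:

# Sketch: build the DBM file once from a plain "word count" frequency list...
my ( %dbmfreq, $word, $count );
dbmopen( %dbmfreq, "bnc_freq", 0644 ) || die "can't create DBM file: $!\n";
open( IN, $ARGV[0] ) || die "can't open $ARGV[0]\n";
while ( <IN> ) {
    chomp;
    ( $word, $count ) = split( /\s+/ );
    $dbmfreq{$word} = $count;
}
close IN;
dbmclose( %dbmfreq );

# ...so that later programs can look up just the tokens they need:
dbmopen( %dbmfreq, "bnc_freq", undef ) || die "can't open DBM file: $!\n";
print "huskies/$dbmfreq{huskies}\n" if $dbmfreq{huskies};
dbmclose( %dbmfreq );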
One could easily add the ability to include known word lists such as Unix's /usr/dict/words, so that we could include entries for derivations of a base token not present in the corpus.
We have also implemented two small applications called plural and adverb, which use expanded single-purpose sets of rules to report on the probable plural and adverb forms (if any) of a given word. The adverb program answers the very first question we posed in this article:
Adverb form of accident: accidently (3.5%) accidentally (96.5%)
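A minimal sketch of how adverb might arrive at those percentages, assuming the %freq hash has already been loaded by readfreqfile() and using just two illustrative rules, +ly and +ally (the real program uses a larger, hand-tuned set):

# Sketch: report candidate adverb forms of a word with relative frequencies.
sub adverbforms {
    my ( $word ) = @_;
    my ( @cand, $c, $total, %count );
    @cand = ( "${word}ly", "${word}ally" );    # illustrative rules only
    foreach $c ( @cand ) {
        next unless $freq{$c};                 # reject forms absent from the corpus
        $count{$c} = $freq{$c};
        $total    += $freq{$c};
    }
    return unless $total;
    print "Adverb form of $word:";
    foreach $c ( sort keys %count ) {
        printf( " %s (%.1f%%)", $c, 100 * $count{$c} / $total );
    }
    print "\n";
}

adverbforms( "accident" );

The percentages come straight from the corpus counts of the surviving candidates, which is all the real programs need to rank the alternatives.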
__END__