This article is about how I used regular expressions to turn common English text into Braille, correctly using all the possible shorthand-like contractions and abbreviations that Braille provides. This is basically a case study in how a messy problem got solved straightaway with Perl - but along the way I'll mention some of the new features in Perl 5.005's regular expressions that I used; and in the end, I find a surprising commonality between regexes and natural language writing systems.
When I was a little kid, I had a children's book about Helen Keller. I don't remember reading it; I just remember the back cover, which had the Braille alphabet printed on it - well, embossed, actually. They had the Roman letter "a" in ink, and then below that the embossed dot pattern for Braille "a", and so on up to "z". So I got the idea that Braille printing is just like a letter-for-letter substitution cipher for the Roman alphabet.
Then I started noticing on men's room doors that below "MEN" in big Roman letters, there'd be the same word in Braille - but sometimes the word would have three Braille characters, and sometimes just two. And that I found perplexing. I couldn't imagine how the word "men" could end up only two characters long.
So I asked my friend Sheri, who's been reading and writing Braille since she was a kid, how "men" could be just two characters long. She explained that Braille has contractions: the first character in "men" is "m", and the second character is a contraction for "en". (For the rest of this article, I'll use wedges to denote contractions, so "<en>" will mean the single Braille character that is a contraction for the two Roman letters "en". Braille readers typically use the term "sign" to mean a Braille character.)
Moreover, some contractions are context-specific - there's a contraction for "com", but it can apply only at the start of words, as in "computer". These shorthand-like contractions make reading Braille faster, and they make books in Braille less expensive and less bulky.
There are contractions for "en", "st", "ea", "er", "the" (the letter sequence, not just the word "the"), and lots more. It occurred to me that there were cases where I could imagine more than one way to apply contractions. Consider the two ways to encode the word "leather":
  l e a t h e r
    \_/ \_/ \_/     - contract "ther" as "<th><er>"

or

  l e a t h e r
    \_/ \___/       - contract "ther" as "<the>r"
In each case, you end up with a word four characters long, so brevity can't be a tie-breaker. I asked Sheri whether both were acceptable spellings and she said no; the second spelling, with "<the>r", is the only correct way to encode "leather". This reminded me of English hyphenation. For example, you could hyphenate "boggling" as "bogg-ling" (sort of like "gos-ling", maybe), but you could also hyphenate it as "bog-gling", and that seems the best way to do it, so that gets to be the correct way.
So I was curious how this would be implemented as an encoding algorithm. Sheri explained that programs do exist for taking conventional English text, and applying all the contractions possible, in the right places - what I'll just call "Braille encoding." Braille encoding programs can be embedded in tactile displays for computers, so that when I send Sheri email (unencoded), it shows on her display in encoded Braille. Or a Braille encoder can be a program that encodes a whole file at a time, typically for preparing an unencoded text file for sending to a Braille embosser, a kind of impact printer.
One encoder program is from the National Federation of the Blind, and is called NFBTRANS. NFBTRANS is by no means the only such program around. Another widely used program is DBT, Duxbury Braille Translator. DBT, however, is a commercial product, with no source available. NFBTRANS is free, and comes with source code.
I hunted down a copy of NFBTRANS, and looked at the source code that came with it, and discovered that it was not just a Braille encoder, but more a typesetting system like troff. It was about seven thousand lines of C, and in the guts of it, I did manage to find the encoding routines - and I couldn't make heads or tails of them. NFBTRANS reads the list of contractions from an external data file that comes with the program, but I couldn't figure out exactly how the rules were applied - namely, in what order or priority.
In other words, the file had rules like:
"the" contracts in any context "er" contracts in any context "th" contracts in any context
...but how or why would NFBTRANS manage to always contract the end of the word "leather" to "<the>r" and not "<th><er>"? Were there other "-ther-" words that could get "<th><er>" instead of "<the>r"?
To fill in the blanks in my understanding of Braille, and to try my hand at modeling a linguistic system in Perl, I decided to try to develop an algorithm in Perl to correctly apply these rules, and to check them against NFBTRANS's output.
I'd had some experience with formal linguistic models of the subsystems that make up natural languages. Every time I'd seen a linguist create a formal model of a subsystem of a language (whether English phonology, or Hindi syntax, or something less abstract, like hyphenation), the rules that model the system have been in one of the two kinds of frameworks I'll explain below - either the kind I call a "generativity system," or the kind I call an "optimality system". So I expected Braille encoding to be one of these two sorts of rule systems, and the first task was to figure out which it was.
Linguists often use the first kind of rule system, the generativity system, to model phonology, the study of the way the sounds of language are represented in the brain. Phonological models assume that words are stored in a somewhat simplified, abstracted form, and that to work out the details of how these are to be pronounced, we apply a series of rules.
For example, a phonological model of my dialect of American English would assume that the "long i" sound is always stored as simply that. But I, like most Americans, pronounce it differently in "kite" than in "fly". To account for this difference, phonological models say that this sound starts out as just itself (here written "/aj/"), but that rules may apply to change it in some contexts. This isn't just about vowels - the two different-sounding t's in "write" and in "writing" supposedly both start out as just /t/, but a rule happens to change the latter one to sound like a /d/.
The rules describing these two phonological changes could look like:

  R1:  aj -> Uj / __ [ptk]    (/aj/ becomes /Uj/ before /p/, /t/, or /k/)
  R2:  t  -> d  / V __ V      (/t/ becomes /d/ between vowels)
These rules, when applied in that order, are supposed to take words in a representation that doesn't have these finer distinctions (like the difference between /aj/ and /Uj/), and change those representations to flesh them out with those distinctions.
Now, translating rules like these into regular expressions used to be hard, but now, with Perl 5.005 regexes, it's a snap, with "lookbehind" and "lookahead":
  s/aj          # target:            the vowel "aj"
    (?=[ptk])   # following context: p, t, or k
   /Uj/gx  && print "R1 applies. Output: $_\n";

  s/(?<=[aeiouIUj])   # preceding context: any vowel
    t                 # target:            "t"
    (?=[aeiouIUj])    # following context: any vowel
   /d/gx   && print "R2 applies. Output: $_\n";
What this gets you over /aj[ptk]/ or /[aeiouIUj]t[aeiouIUj]/ is that with lookbehind (as in (?<=[aeiouIUj])) or lookahead (as in (?=[aeiouIUj])), the text matching the lookbehind and lookahead parts of the regex isn't part of what gets matched by the regex as a whole.
That is, text that matches the lookbehind expression is free to precede the "cursor" (the part of the regex mechanism that pos reports the position of), and text matching the lookahead expression doesn't move the cursor forward, as normal matching does. (Also, the lookbehind text doesn't end up in $&, which you probably know as the "what matched?" regex variable, but which you can better think of as "now that we've matched something, what was between where the point started, and where it ended up?")
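To make that concrete, here's a tiny demonstration (the sample string is arbitrary) of how lookahead text stays out of $& and doesn't advance the cursor:

  $_ = "writing";
  /wri(?=t)/g && print "Matched \"$&\"; cursor is at position ", pos, "\n";
  # Prints:  Matched "wri"; cursor is at position 3
  # The "t" had to be there for the match to succeed, but it isn't
  # part of $&, and pos() stops short of it.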
Now, the above phonetic rules are greatly simplified for the sake of discussion, but you can see their application to "write + ing", starting from its simplified abstract form /rajt + IN/:
  $_ = 'rajtIN';
  print "Input to system: $_\n";

  s/aj
    (?=[ptk])
   /Uj/gx  && print "R1 applies. Output: $_\n";

  s/(?<=[aeiouIUj])
    t
    (?=[aeiouIUj])
   /d/gx   && print "R2 applies. Output: $_\n";

  print "Output of system: $_\n";
This prints:
  Input to system: rajtIN
  R1 applies. Output: rUjtIN
  R2 applies. Output: rUjdIN
  Output of system: rUjdIN
And this gives the correct pronunciation for "writing" in my dialect, /rUjdIN/. Change the first line to "rajdIN" ("riding"), and neither rule applies, and you get "rajdIN" out. (So, yes, when I speak, "writing" and "riding" sound different.) And, importantly, if you swap the rules so that R2 applies before R1, you get:
  Input to system: rajtIN
  R2 applies. Output: rajdIN
  Output of system: rajdIN
(So, if your dialect has "writing" and "riding" sounding the same, it might be that rule-ordering is the only difference between your dialect and mine.)
The ordering of the rules in generativity systems is crucial; if you have the right rules in the wrong order, you get the wrong answer. If Braille encoding were a generativity system, I'd need to figure out how to order the rules from the NFBTRANS data table.
The algorithms I've seen that implement English hyphenation are basically optimality systems. I encourage interested readers to look at the hyphenation algorithm in TeX (which you can see reiterated in Perl in CPAN's TeX::Hyphen), but for the sake of discussion, suppose you can model English hyphenation with these rules for ranking candidate forms:

  * Add a point if the hyphen falls between two consonants.
  * Add another point if those two consonants are the same letter.
  * Subtract a point if the hyphen falls between a consonant and a vowel.
  * Subtract ten points if the hyphen strands a vowelless fragment at the
    start or end of the word.
  * The candidate with the highest total score wins.
In Perl, you could implement this as:
  use strict;
  my $in = 'boggling';
  my $best = $in;
  my $best_score = 0;
  my $Cons  = 'bcdfghjklmnpqrstvwxz';
  my $Vowel = 'aeiouy';

  foreach my $i (1 .. (length($in) - 1)) {
    $_ = $in;
    my $score = 0;
    substr($_, $i, 0) = '-';              # insert a hyphen at position $i

    ++$score     if /[$Cons]-[$Cons]/oi;  # between two consonants
    ++$score     if /([$Cons])-\1/oi;     # between a double consonant
    --$score     if /[$Cons]-[$Vowel]/oi; # between consonant and vowel
    $score -= 10 if /^[$Cons]+-/oi || /-[$Cons]+$/oi;  # vowelless fragment

    print " \"$_\" : score: $score\n";
    if($score > $best_score) { $best_score = $score; $best = $_; }
  }
  print "Best: \"$best\" with score of $best_score\n";
The output of this is:
"b-oggling" : score: -11 "bo-ggling" : score: 0 "bog-gling" : score: 2 "bogg-ling" : score: 1 "boggl-ing" : score: -1 "boggli-ng" : score: -10 "bogglin-g" : score: -9 Best: "bog-gling" with score of 2
...so these rules seem to work right - the highest-ranked form ("bog-gling") is the best; and incidentally, the second best ("bogg-ling") is not too bad, and from there on out it's all quite bad ("bo-ggling", "boggli-ng", etc.).
Note that it doesn't matter here what order you apply the rules in; it just matters what weight gets attached to each rule. In the above example, I've kept it simple, but suppose we now add a rule that means "the closer to the middle of the word you hyphenate, the better", such as:
  $score += .5 * (length($in) - abs($i - length($in) / 2));
If we leave that weighting constant at 0.5, you still get "bog-gling" coming out on top. Change it to a 1, and we have a tie between "bog-gling" and "bogg-ling", since the point that "bog-gling" gets for hyphenating between a double consonant is offset by the point that it loses out on for the hyphen not being at the exact middle of the word. And if we change the constant to 1.5, we get "bogg-ling" coming out on top.
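Here's a quick check of that arithmetic, scoring just the two top candidates under the same rules as above:

  my $len = length('boggling');    # 8
  foreach my $w (0.5, 1, 1.5) {
    # base scores from before: "bog-gling" ($i == 3) scored 2,
    # and "bogg-ling" ($i == 4) scored 1
    my $bog  = 2 + $w * ($len - abs(3 - $len / 2));
    my $bogg = 1 + $w * ($len - abs(4 - $len / 2));
    printf "weight %.1f : bog-gling %4.1f   bogg-ling %4.1f\n",
           $w, $bog, $bogg;
  }
  # This prints:
  #   weight 0.5 : bog-gling  5.5   bogg-ling  5.0
  #   weight 1.0 : bog-gling  9.0   bogg-ling  9.0   <- the tie
  #   weight 1.5 : bog-gling 12.5   bogg-ling 13.0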
If NFBTRANS's rules somehow interacted with an optimality system with ranking-rules like "give K points for every letter this contraction saves", or possibly "give (or subtract?) L points if this word ends up with contractions next to each other" or "subtract a half-point if the word ends in a contraction", then I'd need to first puzzle out what these ranking-rules were, and then figure out what their values for K and L were.
So I decided to take the rules from NFBTRANS's rules file, cook them up into a regular expression, and use that to perform all the substitutions, in a one-pass regex like s/($re)/&lookup($1)/eg, an approach I borrowed from Kevin Lenzo's article on converting text to speech in TPJ #12. I figured that this approach would by no means behave correctly, but that the cases where it didn't behave correctly would give me some kind of hint as to whether a generativity system or an optimality system was called for, and from there I could start worrying about the ordering or weighting of rules.
Granted, next to either generativity systems or optimality systems, a big one-pass regular expression replacement seems pretty strange, but Perl makes regexes so easy that it seemed the path of least resistance for a first hack.
"the" contracts to "<the>" in any context "er" contracts to "<er>" in any context "th" contracts to "<th>" in any context "ea" contracts to "<ea>" only in the middle of words "ar" contracts to "<ar>" in any context "ear" contracts to "e<ar>" in any context
Now, a regex for "'the' in any context" would be simply /the/. But how to express "'ea' only in the middle of words"? You could do it with something like /(?<=\w)ea(?=\w)/ to assert that there be a word character before and after the 'ea', but Perl already provides a metacharacter for matching a word boundary (\b), or the absence of one (\B). In other words, what's needed here is the absence of word boundaries before and after the 'ea', and you can do that with simply /\Bea\B/. Translating the above mini-list of Braille contraction rules into a regex gives:
  %contraction_for = (
    'the' => '<the>',  # Remember: I'm using "<the>" here in place of
                       # the Braille ASCII, or Unicode, code for the
                       # single Braille character that contracts "the"
    'er'  => '<er>',
    'th'  => '<th>',
    'ea'  => '<ea>',
    'ar'  => '<ar>',
    'ear' => 'e<ar>',
  );

  s/(the|ear|ar|\Bea\B|er|th)/$contraction_for{$1}/eg;
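As a quick sanity check (the test words are arbitrary), you can verify that /\Bea\B/ matches only a word-internal "ea":

  foreach my $word ('each', 'tea', 'heart') {
    print "$word: ", ($word =~ /\Bea\B/ ? "contractible" : "not contractible"), "\n";
  }
  # each:  not contractible   ("ea" is at the start of the word)
  # tea:   not contractible   ("ea" is at the end of the word)
  # heart: contractible       ("ea" is word-internal)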
Now, notice that I moved the longer strings to the start of the regex. This is a crucial point in understanding Perl's implementation of regular expressions: given alternates, Perl will match the first one it can match, regardless of whether there may be later alternates that are longer. So if "th" came before "the" in that regex, the "th" would always match, never letting "the" match. Moreover, if the regex contained both "\Bea\B" and a bare "ea", and the "ea" came first, it would block the "\Bea\B" from ever matching. So, in producing this regular expression, I had to make sure that the alternates started out with the longest strings, in the most specific contexts (like "\Bfoo\B"), and worked their way down to the shortest match strings, in the most general contexts (like just plain "foo"). This was a simple matter of a Schwartzian transform, where the sorter function looked like:
  ... sort {
        length($b->[0]) <=> length($a->[0])  # sort first by length of the literal
          or
        $b->[1] <=> $a->[1]                  # then by context "precedence",
                                             # where "\B_" gets a higher
                                             # "precedence" score than "_"
      } ...
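For the curious, here's a sketch of how that sort might fit into building the regex; the data layout here, pairs of [literal, context-precedence], is my simplification, not necessarily what the real program tracked:

  # Each rule is [literal, precedence]; 1 means "needs \B..\B context".
  my @rules = ( ['the', 0], ['th', 0], ['ea', 1],
                ['ar', 0], ['er', 0], ['ear', 0] );
  my $my_re = join '|',
    map  { $_->[1] ? "\\B$_->[0]\\B" : $_->[0] }
    sort { length($b->[0]) <=> length($a->[0])
             or
           $b->[1] <=> $a->[1] }
    @rules;
  print "$my_re\n";   # something like:  the|ear|\Bea\B|th|ar|er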
By this time, I had 180 lines of code for reading the contraction rules file and transforming it into a regex. Only about 20 lines were necessary to perform the contractions, and they were basically a wrapper around this:
$word =~ s/($my_re)/$contraction_for{$1}/oeg;
"foo" contracts to "<X>" at word-start "foo" contracts to "<Y>" elsewhere
Now, you could model this pair of (fictitious!) rules with a regex that contains (...|\bfoo|foo|...), but then whether "\bfoo" or "foo" matches, you end up simply with $1 holding "foo". In other words, while you can use "\b" or real lookbehind/lookahead to select for certain contexts, you don't end up knowing which alternative actually matched.
However, a new feature in Perl 5.005 is exactly what I needed here. The feature is still somewhat experimental, and a full discussion of it would require an article of its own; but to make a long story short, having (?{ CODE }) in a regular expression will cause CODE (any Perl code) to be evaluated whenever the regular expression engine hits that point in trying to match the regular expression. So, to differentiate between "\bfoo" and "foo" matching, I add to the regex a bit of code to be executed at the end of each:
  $word =~ s/(
               \bfoo (?{ $x = 1 })
             |
               foo   (?{ $x = 2 })
             )
            /$contract[$x]{$1}/egx;
Here, if \bfoo is the bit of the regex that matches, the last thing that the match does is to execute $x=1, so then the string that got matched (say, the "foo" in "foolish") gets replaced with the value of $contract[1]{'foo'}. But if it's the second bit of the regex ("foo") that matches (as in the "foo" in "buffoon"), $x=2 will get executed, and so "foo" will get replaced with the value of $contract[2]{'foo'}. This array-of-hashes @contract (which is accessed like $contract[context_flag_numeric]{string_to_be_replaced}) will have been filled by the same part of the program that read the rules in and created the regular expression.
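Here's a self-contained toy version of the whole trick, using the fictitious "foo" rules from above (the <X> and <Y> replacements are made up):

  my @contract;
  $contract[1]{'foo'} = '<X>';   # the word-start contraction
  $contract[2]{'foo'} = '<Y>';   # the anywhere-else contraction
  my $x;
  foreach my $word ('foolish', 'buffoon') {
    (my $out = $word) =~ s/(
                             \bfoo (?{ $x = 1 })
                           |
                             foo   (?{ $x = 2 })
                           )
                          /$contract[$x]{$1}/egx;
    print "$word => $out\n";
  }
  # foolish => <X>lish
  # buffoon => buf<Y>n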
"the" contracts to "<the>" in any context "er" contracts to "<er>" in any context "th" contracts to "<th>" in any context "ea" contracts to "<ea>" in the middle of words "ar" contracts to "<ar>" in any context "ear" contracts to "e<ar>" in any context
The rule that replaces "the" with the single character "<the>" blocks the more general rule of replacing "th" with "<th>" from applying. Likewise, the rule that targets "ear" blocks the rule that targets "ea" from applying - which is why "heart" contracts to "he<ar>t", not "h<ea>rt".
Now, for some reason, it's incorrect to contract the "ear" in "tearoom" at all. This could be modelled by saying that "tearoom" is on a list of exception words that are exempt from contractions, but then you have two kinds of data in the system - the rules table, and the list of exceptions. The way NFBTRANS models this is to simply add another rule:
"tearoom" contracts to "tearoom" in any context
A rule that replaces a word with itself seems entirely pointless, but the point of it is to block the more general rules. (Actually, the rule says "replace 'tearo' with 'tearo' in any context", not specifying the whole word - but I think the people who thought up that rule thought of "tearoom" as the only word it could apply to, and were not aware of the rarer "-tearo-" words: "Aotearoa" or "stearolactone".)
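Here's a toy demonstration of that blocking, using a three-rule subset (the replacements come from our mini-list, plus the do-nothing 'tearo' rule):

  my %contraction_for = ('tearo' => 'tearo', 'ear' => 'e<ar>', 'ea' => '<ea>');
  foreach my $word ('heart', 'tearoom') {
    (my $out = $word) =~ s/(tearo|ear|\Bea\B)/$contraction_for{$1}/eg;
    print "$word => $out\n";
  }
  # heart   => he<ar>t
  # tearoom => tearoom    (the do-nothing "tearo" rule got there first)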
The NFBTRANS rules file basically consists of a long list of general contraction rules like the ones we've seen, plus a long list of word-specific exception rules like that "tearoom" rule.
The above "tearoom" rule is an example that happens to block all contractions for that word, but the average rule is meant to stop only some of the contractions from applying. For example, applying the normal contraction rules to "pineapple" would contract it as "p<in><ea>pple". However, this spelling is considered incorrect, so there's a rule specifically to fix this with the correct spelling:
"pineapple" contracts to "p<in>eapple" in any context
And so on for "phoenix" (to block the "en" from contracting), "coworker" (to block the "ow" from contracting, lest they become cow orkers), and all the way up to the longest rule, which replaces "psychedelic" with "psy<ch>edelic" (blocking the "<ed>").
Incidentally, as with exceptions in hyphenation, most of these exceptions to the normal Braille contraction rules are motivated by a reluctance to contract across (or a preference to hyphenate at) boundaries between syllables or between morphemes - roots, prefixes, or suffixes, as in "tea-room", "co-worker", and so on. But the "ea" in "create" does contract to the single "<ea>" character, even though it's clearly across two syllables. Since the conventions for Braille contractions evolved naturally - just as general English spelling did - the system is full of patterns, and messy exceptions, and messy exceptions to the messy exceptions.
I took the "Unix wordlist" (a motley wordlist of 25,000 English words, one per line), and encoded it with a copy of NFBTRANS, so that I could compare it to my algorithm's output. I wrote a small program to use my algorithm to encode the wordlist, print all cases where my algorithm's output disagreed with the output of the inscrutable NFBTRANS algorithm, and then report my algorithm's accuracy.
Guessing wildly, I expected no better than 50% similarity between the output of my algorithm and the output of NFBTRANS. In fact, the two agreed on nearly every word in the list.
Most of the cases of disagreement were words like "R&D" or "O'Leary" - since this was a first hack, I had not bothered to properly deal with words with internal punctuation or complex capitalization, which require somewhat special treatment in Braille encoding. The remaining dozen words where my algorithm encoded differently than NFBTRANS were because I'd misimplemented the regex context for one of the more obscure rules. (And incidentally, I've only implemented my encoder for words in isolation - in encoding of actual sentences, there's punctuation to deal with; and a few Braille contractions in words are sensitive to the context of the word, like whether there's punctuation after the word, which makes things a bit messier.)
[Listing 1: Sample Text in Braille]
This one-pass regex approach was just a stab in the dark, and I expected to use the list of its failures to refine the algorithm - say, developing a generativity system that would make one pass to handle word-initial contractions, then another to handle word-final contractions, then a final pass for everything left over in between. Or maybe the answer would lie in an optimality system to consider all possible ways to contract a word, and select the one that had the fewest characters, or maybe the fewest dots total, or maybe the most contractions closest to the beginning or end of the word.
But it was nothing so complicated - this big regex, in one pass, did the whole job. Not only did this seem much too easy (although this is a not unfamiliar sensation in programming Perl), but it was quite unprecedented - I'd never heard of a phenomenon in natural language that could be treated with a regular expression.
I'd always considered regular expressions to be essentially abstract and mathematical. After all, the formalism of regular expressions was devised just this century, coming out of the work of mathematician Stephen Kleene; and the workings of the regex engine (as in Mark-Jason Dominus's article in TPJ #9) are explained in terms of automata. And whatever an automaton is (a big scary robot?), it sure doesn't sound like anything to do with something that people do unconsciously when they write Braille. (Incidentally, people have been writing English in Braille this way for decades longer than regexes have been around.)
But in spite of the very formalist origins of regular expressions, I think that the basic mechanism underlying regex replacement - scanning left to right, matching as much as you can, context permitting - is not so unnatural and abstract. That's what we do when we use systems of contraction or abbreviation, whether in Braille, Gregg shorthand, or our own idiosyncratic abbreviations: we start at the beginning of the word (as opposed to inking in the end, and working backwards from there), and we abbreviate as much, and as early, as possible.
Granted, Perl regexes have plenty of features that have no parallel in this process of Braille contractions (like backtracking, notably). But there's the commonality in that what's called "greedy matching" for regexes is effectively the same as what we do when we abbreviate/contract, because of a mix of economy (wanting to save space) and laziness (not having to operate on the whole word at a time, or to consider lots of alternatives, but just contracting in one pass, as you write, starting at the beginning of the word). Perl programmers know from experience that regexes are immensely useful, and my experience with Braille encoding is just another case of that. But the fact that regexes have a basic commonality with a linguistic phenomenon like Braille encoding suggests that not only are regexes a useful tool, but that they're also linguistically natural and, hopefully, more learnable and valuable because of it.
__END__
Sheri Wells-Jensen (sheri@sg.inter.edu) is a linguist who used to be a programmer, way back when everyone thought FORTH was a pretty neat idea.
(Sean adds: thanks to Tack-Tiles (tack-tiles.com) for lending me a set of their Braille blocks, and thanks to David Ondrik for taking the picture of them. And I'd also like to thank the conferees at this year's YAPC for their encouraging response to my talk on Braille encoding with regular expressions.)
The final Braille-encoding code is available, along with some other Braille links, at http://www.netadventure.net/~sburke/bounce.cgi/braille.