Packages Used
Festival: http://www.cstr.ed.ac.uk/projects/festival.html
Talking computers are ubiquitous in science fiction; machines speak so often in the movies that we think nothing of it. From an alien robot poised to destroy the Earth unless given a speech key ("Klaatu barada nikto" in The Day the Earth Stood Still and echoed in Army of Darkness) to the terrifyingly calm HAL 9000 in 2001: A Space Odyssey, machines have communicated with people through speech. The computer in Star Trek has spoken and understood speech since the earliest episodes. Speech sounds easy, because it's natural to us - but to a machine?
Let's ignore the problem of getting computers to think of things worth saying, and look only at turning word sequences into speech. Let's ignore prosody, too. Prosody - the intonation, rhythm, and timing of speech - is important to how we interpret what's said, as well as how we feel about it, but it can't be given proper care in the span of this article, partly because it's almost completely unsolved. The input to our system is stark and minimal, as is the output - there are no lingering pauses or dynamics, no irony or sarcasm. (Not on purpose, anyway.)
If we are given plain text as input, how should it be spoken? How do we make a transducer that accepts text and outputs audio? In this article, we'll walk through a series of Perl speech synthesizers. Many of the ideas here apply to both natural and artificial languages, and are recurrent themes in the synthesis work at Carnegie Mellon University, where I spend my days (and nights). The code in this article is available on the TPJ web site (http://www.tpj.com/programs.html) and at http://www.cs.cmu.edu/~lenzo/tpj/synthesis_examples.html.
Audio samples of each method are included.
As a first attempt, we could use pre-recorded speech for everything we might want our computer to say. It wouldn't be very flexible, but it would sound great. For every possible input, we'd need a recording. Your phone company probably does this - "What city, please?", "What listing?"
Spokesvoices like James Earl Jones are recorded as expertly as possible to show off the sound quality, and greet you with phrases like "Welcome to Bell Atlantic." This sort of thing is pretty easy to do - just play the right sound file at the right time.
while (<>) { chomp; say $speech{$_}; }
%speech is a hash of speech samples, perhaps tied to a database using tie or dbmopen. say() is a subroutine that knows about sending audio to something that can play it, and $_ is, of course, the Perl default variable.
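I won't show a full say() here, because it depends entirely on your audio setup. Here's a minimal sketch of one possibility, assuming the values in %speech are paths to pre-recorded WAV files and that a command-line player such as SoX's play is installed; the details will vary from system to system.

# A possible say(): each value in %speech is assumed to be the path to a
# pre-recorded waveform; unknown input (an undefined sample) is skipped.
sub say {
    my ($sample) = @_;
    return unless defined $sample;
    system('play', $sample) == 0
        or warn "couldn't play $sample: $?";
}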
As limited as this might seem, it is the perfect solution for applications that require only a small vocabulary - in games like WarCraft, for instance, where the characters have a few randomized responses they utter as you order them about - "Yes, Master?" "Of course, Master." If you need any more flexibility, though, you'll be stuck - and if you ever need to record new material, you have to be careful to emulate the same recording conditions.
Let's look at some variations on this approach:
while (<>) { s/($text)/$speech{$1}/g; say $_; }
The s///g replaces all of the words in $text with hash entries from %speech; this tokenizes the input, and turns each token into an audio waveform. More formally, $text is an expression that matches words in the language defined by %speech. Whole words are replaced with their audio entries. Words not in the language have no entry in %speech, and so produce no output. So, for our purposes, the expression in $text can be quite simple:
$text = '\S+|\s+'
\s+ matches one or more whitespace characters, and \S+ matches one or more non-whitespace characters. Clearly, punctuation is part of the word in this tokenization. For example, if the previous sentence were the input text, 'Clearly,' and 'tokenization.' would become tokens, as would all the other words; if the sentence you're reading now were the input, we'd even have quoted words as tokens. If we separate out punctuation, and fold the uppercase letters to lower case so that "Clearly" and "clearly" are the same word, we can generalize our tokenization a bit. With a few changes we can split off some punctuation, in much the same way whitespace was separated from the words. For instance, we could do this:
$text = '[.!?"]|\S+|\s+'
which turns some of the punctuation into separate tokens.
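If you want to see exactly which tokens a pattern produces, you can collect the matches in list context; this little check isn't part of the synthesizer, just a way to poke at the regex:

my $text   = '[.!?"]|[^\s.!?"]+|\s+';
my @tokens = ('Clearly, it works!' =~ /($text)/g);
print join('', map { "[$_]" } @tokens), "\n";
# prints: [Clearly,][ ][it][ ][works][!]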
We can convert to lower case with the lc operator:
while (<>) { s/($text)/say $speech{lc($1)}/eg; }
The sample is spoken immediately, rather than explicitly passed to say() after the entire line is processed. The /e on the end of the s/// tells Perl to evaluate say $speech{lc($1)} as a full-fledged Perl expression.
This can work nicely for a small, constrained vocabulary, but you need a sample for every word ever to be spoken. The result sounds unnatural, because the words are spoken without regard to their context in the sentence. For instance, words at the end of an utterance are often spoken more slowly, and with a final drop in pitch. This technique doesn't handle that.
How do we handle "out-of-vocabulary" words? If we wanted our system to speak a word for which there is no presampled audio, we could break it down into smaller parts: phonemes. Just as our written words are composed of an alphabet of letters, our spoken words are composed of an alphabet of phonemes. So we'd need to make a mapping from a sequence of text words to phonemes, and ultimately to audio waveforms.
Now, the regex $text remains the same, but we change the substitution:
while (<>) { s/($text)/ map { say $speech{$_} } text2phones lc($1)/eg }
The %speech hash now contains phonemes. More precisely, it contains phones, which are concrete, context-dependent realizations of abstract phonemes. For instance, the phoneme /T/ can be spoken as the phone [D] in words like 'butter', or as [T] if you pronounce it in isolation (such as at a spelling bee). The audio depends on more than just the phoneme; it also depends on the setting and context.
A couple of points: Notice that the language is still a "regular" grammar - the sort that can be recognized by a vanilla regular expression. (Perl's so-called "regular expressions" are more powerful than we need, because they can handle irregular grammars as well.) A regular grammar deals with patterns of symbols composed with only concatenation ('and'), disjunction ('or'), and closure (zero or more iterations of a pattern). Perl regular expressions allow things like look-ahead assertions, which let them parse some context-sensitive languages; in other words, Perl's regular expressions aren't regular at all.
What magic lies hidden in text2phones()? English spelling encodes a great deal of history, but the pronunciations of words are not always obvious - for instance, words such as 'night', 'knight', 'ought', and 'cough' don't sound like they have a 'gh' in them; well, at least not the 'g' as in 'great' or the 'h' as in 'history'. Languages such as Spanish and Greek have closer relationships between their written and spoken forms; Chinese, on the other hand, has no such relationship whatsoever.
There are a number of possible strategies for tackling this problem. Many speech synthesizers use a two-tiered method - first, check if it's in a small dictionary of pronunciations, and, if not, resort to a text-to-phoneme conversion strategy. Any word pronounced properly by the rules is removed from the dictionary, so as to keep the total conversion package reasonably small. Festival, a free, source-available synthesizer from the University of Edinburgh, uses this strategy, as does my own Perl-based synthesis, phonebox.
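The dictionaries and rules themselves are too big to show here, but the control flow of the two-tiered lookup is simple. In this sketch, %dict and letters_to_phonemes() are stand-ins for a real pronouncing dictionary and a trained letter-to-phoneme converter, not Festival's actual interface:

# Two tiers: exceptions live in %dict, everything else goes
# through the (hypothetical) trained letter-to-sound rules.
my %dict = (
    'colonel' => [qw(K ER N AH L)],
    'of'      => [qw(AH V)],
);

sub text2phones {
    my ($word) = @_;
    return @{ $dict{$word} } if exists $dict{$word};   # tier one: dictionary
    return letters_to_phonemes($word);                 # tier two: rules
}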
Looking things up in a dictionary is good work when you can get it, but you'll always run into out-of-vocabulary words for which there's no predefined pronunciation. Even ignoring the fact that human languages change all the time by adding words and changing usages, the number of entries will get quite large if every word is included. Building generic text-to-phoneme transducers is an interesting problem that requires generalization to unforeseen data; decision trees, borrowed from the discipline of machine learning, work reasonably well. A decision tree is an organization of nodes in which each node represents a decision; you can download Perl code to build dictionary decision trees from http://www.cs.cmu.edu/~lenzo/t2p and the TPJ web site. The technique is described there, and it includes the letter-to-phoneme converter as well as some sample data.
At CMU we train decision trees for text-to-phoneme conversion using pronouncing dictionaries such as the CMU dictionary, the Oxford Advanced Learner's dictionary, the Moby lexicon, or any number of others. Given a set of words and their pronunciations, a set of alignments between the letters and phones is produced, generating a mapping from letters to phonemes. For example, here are some possible alignments of the letters in the word 'algebraically' with a string of phonemes:
a  l  g  e  b  r  a  i  c  a  l  l  y
AE L  JH AH B  R  EY IH K  _  L  _  IY
There are some tricky issues in generating good alignments; for this example alone there are 78 possible alignments, and good alignments are critical for getting acceptable results from any learning method. (The number 78 comes from C(13,11), pronounced "thirteen choose eleven", which is the number of possible ways you can choose eleven items from a set of thirteen. In this case there are thirteen letters and eleven phonemes, so there are eleven slots that get phonemes and two that don't; the ones that don't get a phoneme end up aligning with a null. C(n,k) is equal to n!/(k!(n-k)!), where n! is n factorial, or n * (n-1) * (n-2) * ... * 1.) NetTalk, a neural network for speech output that is successful despite the relatively poor performance of its text-to-phoneme rules, uses this sort of encoding of letters in a window of neighboring letters. Methods for generating these alignments are discussed in [Black, Lenzo, and Pagel] and [Pagel, Lenzo, and Black].
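If you want to check that arithmetic, C(n,k) is only a few lines of Perl; this is just a sanity check, not part of the synthesizer:

# C(n,k) = n!/(k!(n-k)!), computed as a running product
# so the intermediate values stay small.
sub choose {
    my ($n, $k) = @_;
    my $c = 1;
    $c = $c * ($n - $_ + 1) / $_ for 1 .. $k;
    return $c;
}
print choose(13, 11), "\n";   # prints 78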
Once the alignments have been generated, they are exploded into feature vectors, one per letter, that include some context: three letters on each side. Each feature (neighboring letter, in this case) is given a name; 'L' for the letter itself, 'L1' for one to the left, 'R1' for one to the right, and so on. The word 'absurdities', for instance, ends up producing eleven feature vectors:
L3 L2 L1 L  R1 R2 R3  Phoneme
-  -  -  a  b  s  u   AH
-  -  a  b  s  u  r   B
-  a  b  s  u  r  d   S
a  b  s  u  r  d  i   ER
b  s  u  r  d  i  t   _
s  u  r  d  i  t  i   D
u  r  d  i  t  i  e   AH
r  d  i  t  i  e  s   T
d  i  t  i  e  s  -   IY
i  t  i  e  s  -  -   _
t  i  e  s  -  -  -   Z
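Here is one way such vectors might be built, assuming the alignment step has already padded the phoneme list with '_' for the nulls so that it is exactly as long as the word; the field names follow the table above. This is a sketch, not the code from the t2p package:

# One feature vector (a hash) per letter, with a window of three
# letters on each side; positions off the end of the word get '-'.
sub letter_vectors {
    my ($letters, $phones) = @_;   # e.g. 'absurdities', [qw(AH B S ER _ D AH T IY _ Z)]
    my @l = split //, $letters;
    my @vectors;
    for my $i (0 .. $#l) {
        my %v = (L => $l[$i], Phoneme => $phones->[$i]);
        for my $off (-3 .. 3) {
            next if $off == 0;
            my $name = $off < 0 ? 'L' . -$off : 'R' . $off;
            my $j = $i + $off;
            $v{$name} = ($j >= 0 && $j <= $#l) ? $l[$j] : '-';
        }
        push @vectors, \%v;
    }
    return @vectors;
}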
These vectors are taken together to build a decision tree that tests the features and produces an output phoneme (or no phoneme at all, denoted by the underscore in the fifth and tenth vectors). The result is an example of a finite state transducer, the core of several speech synthesis systems, such as the Lucent synthesizer. Here's one subtree trained from the CMU dictionary:
if ($feature{'L'} eq 'C') {
    if ($feature{'R1'} eq 'A') {
        if ($feature{'L1'} eq 'A') {
            if ($feature{'L2'} eq 'F') { return 'S'; }
            if ($feature{'L2'} eq 'R') {
                if ($feature{'L3'} eq 'U') { return 'S'; }
                return 'K';
            }
            return 'K';
        }
        ...
This says that the letter C, in this context:
Feature        L3  L2  L1  L   R1  R2  R3  Phoneme
Letter         ?   F   A   C   A   ?   ?   S
Testing order  -   4   3   1   2   -   -
(tested in the order L, R1, L1, L2) should become the S phoneme (as in the Americanized "façade"). If the second letter to the left is an R instead of an F, then if there's a U to the left of that R ("Curaçao"), output the S phoneme; otherwise output K. You can see that these trees actually represent a great deal of the dictionary directly.
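To make the traversal concrete, here is the fragment above wrapped up as a toy subroutine, with the branches that were elided collapsed into returning undef (the real tree is far larger); feeding it the context of the 'c' in 'facade' and the second 'c' in 'curacao' exercises the two paths just described:

# Only the branch shown above; anything the fragment didn't cover
# falls through to undef here.
sub c_phoneme {
    my %feature = @_;
    return undef unless $feature{L}  eq 'C'
                    and $feature{R1} eq 'A'
                    and $feature{L1} eq 'A';
    return 'S' if $feature{L2} eq 'F';                    # "facade"
    if ($feature{L2} eq 'R') {
        return $feature{L3} eq 'U' ? 'S' : 'K';           # "curacao" vs. others
    }
    return 'K';
}

print c_phoneme(L3 => '-', L2 => 'F', L1 => 'A', L => 'C',
                R1 => 'A', R2 => 'D', R3 => 'E'), "\n";   # S
print c_phoneme(L3 => 'U', L2 => 'R', L1 => 'A', L => 'C',
                R1 => 'A', R2 => 'O', R3 => '-'), "\n";   # S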
Once the text has been converted into a phoneme sequence using these letter contexts, each phoneme can be immediately replaced with its audio waveform, but only if we ignore the phoneme context.
You'll quickly find that simply stitching together phoneme sequences sounds pretty bad; examples of this are on both the TPJ web site and my own. Part of the reason is that the actual sounds of a language depend upon the neighboring sounds - they're context-dependent. Just as the letter transducer uses context to decide which phoneme to speak, the phone depends upon the neighboring phonemes. We need the phonemes to know who their neighbors are, but as long as we have just one s/// that does the whole job, we can't do that.
For an example of why context is necessary, consider the words 'pat' and 'spat'. In 'pat', the p is aspirated - it's spoken with a puff of air. In 'spat', there is no aspiration. Also, when an R comes after a vowel, it colors the vowel in a way that's different from an R that's before a vowel; L does the same thing, revealing the two kinds of [L] in the word 'lateral'.
Capturing the context and using it to pick the sample can be accomplished easily with Perl's irregular expressions:
s/($text)/phones lc($1)/eg;
s/(?=(^|$phoneme))\s*($phoneme)\s*(?=($phoneme|$))/say $speech{"$2($1,$3)"}/eg;
Here, the (?=) items are assertions that must be satisfied but aren't considered part of the matching text.
The first s/// turns a word sequence into a phoneme sequence, and the second speaks a phoneme in the context of its left and right neighbors; it captures some of the coarticulation effects that smear sounds into one another. This would be fine if you had every possible entry for every possible phoneme in context listed in %speech. That number can be quite large - for instance, if the size of the phonemic inventory is, say, 51 sounds, that gives a total of 132651 (51^3) entries in the table. It's difficult to get complete coverage of these units from a single talker, let alone with consistent quality, so we turn that last hash table lookup into a subroutine. A phoneme in context like this is often called a triphone, meaning one phoneme with its left and right neighbors fixed.
s/($text)/phones lc($1)/eg;
s/(?=(^|$phoneme))\s*($phoneme)\s*(?=($phoneme|$))/say best_triphone($2, $1, $3)/eg;

sub best_triphone {
    my ($phoneme, $left, $right) = @_;
    return $speech{"$phoneme($left,$right)"}
        || $speech{"$phoneme(,$right)"}
        || $speech{"$phoneme($left,)"}
        || $speech{"$phoneme(,)"};
}
So the hash of samples, %speech, needs to contain some entries for the "backed-off" triphones; for instance, P(,R), the phoneme P with an R on the right and anything on the left. It will also need P alone, in case neither context matches.
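Because $phoneme will generally match a variable-length string, Perl's lookbehind can't be used to grab the left neighbor, so it can be handy to have a version that carries the left context in an explicit variable instead. Here's a sketch that does the same triphone-with-backoff selection with a plain loop; it reuses say() and best_triphone() from above, and assumes phones() returns a space-separated phoneme string such as "P AE T":

# Speak a phoneme string one phoneme at a time, handing each one its
# left and right neighbors (empty at the edges) to best_triphone().
sub speak_phoneme_string {
    my ($string) = @_;
    my @p = split ' ', $string;
    for my $i (0 .. $#p) {
        my $left  = $i > 0   ? $p[$i - 1] : '';
        my $right = $i < $#p ? $p[$i + 1] : '';
        say best_triphone($p[$i], $left, $right);
    }
}
speak_phoneme_string(phones('pat'));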
This is effectively an implementation of optimality theory - it finds the best non-empty subset given a partially ordered set of constraints. The "unwrapping" of the single phoneme into itself plus context is another example of the two-level conversion we used in the letter-to-phoneme rules: a finite context-sensitive language is expressed with an equivalent regular grammar, which lets us parse in a particularly easy way - with finite state machines. With these context-sensitive rewrite rules, we have a triphone synthesizer. It doesn't consider prosody, it performs no analysis beyond one level of neighboring context, but it works.
Even with everything described here, there's still much missing: intonation, stress, voice quality - everything that really makes a voice sound like an individual. A complete speech synthesis system is more complex than what you've seen here, but the basic techniques can be applied to any system.
Now that speech-interactive systems are actually out there, on the phone and on the desktop, we're seeing just how unsolved many of these synthesis problems really are. They don't sound very good; there's not a lot of individuality in the voices; they certainly don't sound interested in what they're doing; but the work goes on. I like to think we're making progress - the Earth does not stand still. Gort! Klaatu barada nikto.
__END__