A TEXT-TO-SPEECH PRIMER.

Purpose of text-to-speech.

Among the many definitions that could be given of text-to-speech, the one that describes it as a way of having a computer audibly communicate information to the user is probably the most relevant within the context of this statement. In situations where visual feedback is inadequate or even impossible, audible feedback may be an essential feature; in other situations it may simply add extra value to a product. Generally, text-to-speech provides a very valuable and flexible alternative to digital-audio recordings in cases where:

The applications that can benefit from text-to-speech functionality cover a wide range of products in the markets of PC multimedia, telephony, automotive, consumer electronics etc. In the telecommunications business, the technology can be utilised in such applications as home banking, remote e-mail and fax access, database-driven inquiry systems and PC-based phone management systems, to name but a few. In the computer industry, text-to-speech provides considerable added value for both business and home applications. Language learning, PC-based video games, proofreading and data verification, message notification, answering machines and television viewing have become leading features in the home PC market. Integrated into automotive electronics such as phone, navigation and information systems, text-to-speech offers, next to the obvious convenience, potential safety benefits as well: it provides a common hands-free interface, thereby keeping the driver’s attention on the road and increasing driving comfort and safety. Consumer electronics can also benefit greatly from the inclusion of text-to-speech functionality. The toy and appliance markets present one specific range of needs, while pocket or hand-held translators, digital answering machines, portable and cellular phones, organisers and PCs cover the other end of the spectrum. In industry, text-to-speech can help to produce speaking measurement or alarm systems and announcement systems, or it can be used for task list dictation, facilitating hands- and eyes-free operation. In the medical field, finally, text-to-speech constitutes an excellent tool for the production of aids for people with disabilities: reading machines for blind people, communication aids for people with speech impairments etc.

Definition of concepts:

Different implementations of text-to-speech systems exist. This section discusses some of the concepts on which these systems are built. Generally, a text-to-speech system can be broken down into three main parts: a linguistic, a phonetic and an acoustic part. First, an ordinary text is input to the system. A linguistic module converts this text into a phonetic representation. From this representation, the phonetic processing module calculates the speech parameters. Finally, an acoustic module uses these parameters to generate a synthetic speech signal.

Linguistic processing:

The linguistic module of a text-to-speech system performs several tasks: text normalisation, spelling-to-phonetics conversion (i.e. grapheme-to-phoneme conversion and stress assignment), lexical and morphological analysis, syntactic analysis, and, to a lesser extent, semantic analysis.

Text normalisation:

The text-to-speech system should be able to read aloud any written text, even if it contains a miscellany of abbreviations, dates, currency indications, time indications, addresses, telephone numbers, bank account numbers and various other symbols such as quotation marks, parentheses, apostrophes and other punctuation marks. For example, to solve the abbreviation problem, an abbreviation dictionary can be used. Abbreviations that do not occur in the dictionary are then pronounced as single words or are spelled out depending on the graphotactic structure of the abbreviation. Another example of text normalisation is the processing of digits. Digits are handled according to the syntactic and semantic context in which they appear. In English (as in Dutch and German) digit strings such as 1991 are pronounced differently according to the context (number or year). This is not the case in Spanish or French. In Spanish for example, the conversion of digit strings also needs lexical information because the pronunciation of the digit string sometimes changes depending on the gender of the noun or on the following abbreviation. To handle text normalisation, text-to-speech systems use a lot of orthographic knowledge, frequently phrased by linguistic context-dependent rules, in combination with dictionary lookup.
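
The two normalisation examples above (abbreviation lookup and context-dependent digit reading) can be sketched as follows. This is a minimal illustration, not the method of any particular system: the abbreviation dictionary, the year cue words and the 1100 to 1999 year range are all assumptions chosen for the example, and punctuation attached to a digit string is not handled.

```python
import re

# Hypothetical abbreviation dictionary; a real one would be far larger.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}

# Assumed cue words suggesting that the following digit string is a year.
YEAR_CUES = {"in", "since", "year"}

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()


def two_digits(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def as_number(n):
    """Spell out an integer in the range 0..9999 as a cardinal number."""
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def as_year(n):
    """Spell out a year in the range 1100..1999 as two digit pairs."""
    high, low = divmod(n, 100)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)


def normalise(text):
    """Expand known abbreviations and digit strings, token by token."""
    out = []
    previous = ""
    for token in text.split():
        lower = token.lower()
        if lower in ABBREVIATIONS:
            out.append(ABBREVIATIONS[lower])
        elif re.fullmatch(r"\d{1,4}", token):
            n = int(token)
            # Context rule: a year cue before the digits selects the year reading.
            if previous in YEAR_CUES and 1100 <= n <= 1999:
                out.append(as_year(n))
            else:
                out.append(as_number(n))
        else:
            out.append(token)
        previous = lower
    return " ".join(out)
```

With these assumptions, `normalise("in 1991")` reads the digits as a year, while `normalise("1991 books")` reads them as a cardinal number.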

Pre-processing modules:

Although state-of-the-art text-to-speech systems offer sophisticated text normalisation features by themselves, for specific applications they also rely on additional text pre-processing modules to optimise the text-to-speech performance.

Orthographics-to-phonetics:

This conversion is one of the main tasks of the linguistic processing part. A text-to-speech system needs a lot of pronunciation knowledge to perform this task, which includes grapheme-to-phoneme conversion, syllabification and stress assignment. Different ways of orthographic-to-phonetic conversion are possible:

  • consulting dictionaries containing full word forms or morphemes.
  • using a set of pronunciation rules.
  • using techniques such as neural nets or classification trees.

Most (commercial) text-to-speech systems use a hybrid strategy combining word dictionaries, morpheme dictionaries and pronunciation rules. Although the same strategy can be used for the development of all language versions, it is obvious that each language has its own particularities.
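
A hybrid strategy of this kind might look like the sketch below: a word dictionary is consulted first, and letter-to-sound rules handle everything else. The lexicon entries, the rule table and the ARPAbet-like phoneme symbols are assumptions made for illustration; real rule sets are context-dependent and far larger, and this sketch omits syllabification and stress assignment.

```python
# Hypothetical exception dictionary for words the rules would get wrong.
LEXICON = {"one": "w ah n", "two": "t uw", "the": "dh ah"}

# Hypothetical letter-to-sound rules, keyed by grapheme (one or two letters).
RULES = {
    "sh": "sh", "ch": "ch", "th": "th", "ee": "iy", "oo": "uw",
    "a": "ae", "b": "b", "d": "d", "e": "eh", "f": "f", "g": "g",
    "h": "hh", "i": "ih", "k": "k", "l": "l", "m": "m", "n": "n",
    "o": "aa", "p": "p", "r": "r", "s": "s", "t": "t", "u": "ah",
}


def to_phonemes(word):
    """Return a phoneme list: dictionary lookup first, rules as fallback."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word].split()
    phonemes = []
    i = 0
    while i < len(word):
        # Prefer the longest matching grapheme at the current position.
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in RULES:
                phonemes.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            i += 1  # no rule for this letter: skip it
    return phonemes
```

Here `to_phonemes("one")` is answered by the dictionary, whereas `to_phonemes("ship")` falls through to the rules.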

Lexical, morphological and syntactic analysis:

Lexical, morphological and syntactic analysis is needed to solve pronunciation ambiguities. The English word re’cord (verb), for example, can also be pronounced as ’record (noun). In French, the character string président is pronounced differently depending on its part of speech (noun or verb). Lexical, morphological and syntactic information is also very important to create a correct prosodic pattern for each sentence. For instance, important syntactic boundaries entail intonational changes and vowel lengthening. A frequently used method for tagging isolated words with their parts of speech is a combination of morphological rules and dictionary look-up. For example, particular word endings help predict the part of speech of words. The syntactic analysis can be performed with different parsing techniques. Some of these techniques were developed within the field of Natural Language Processing (NLP) and adapted to the special needs of text-to-speech synthesis. For example, parsing techniques for text-to-speech, much more than for NLP applications such as text translation, must meet the real-time requirement. Most of the current commercially available text-to-speech systems do not perform a full syntactic analysis, i.e. they do not construct a full syntax tree, but rather perform a phrase-level parsing. For instance, context-dependent rules can be used to solve part-of-speech ambiguities and divide a sentence into word groups and prosodic phrases.
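
The combination of dictionary look-up, word-ending rules and context-dependent rules described above can be illustrated with a toy tagger that resolves the stress of "record". The word lists, suffix table, tag names and stress notation are all assumptions made for this sketch; they are not taken from any actual system.

```python
# Hypothetical closed-class word lists used as a small dictionary.
DETERMINERS = {"the", "a", "an", "this", "that"}
VERB_CUES = {"to", "will", "can", "must"}

# Hypothetical morphological rules: word ending -> part of speech.
SUFFIX_TAGS = [("ly", "ADV"), ("tion", "NOUN"), ("ing", "VERB"), ("ed", "VERB")]

# Stress-marked forms of a homograph, keyed by part of speech.
HOMOGRAPHS = {"record": {"NOUN": "'record", "VERB": "re'cord"}}


def tag_words(words):
    """Assign a coarse part-of-speech tag to each word."""
    tags = []
    previous = ""
    for word in words:
        lower = word.lower()
        if lower in DETERMINERS:
            tag = "DET"
        elif lower in VERB_CUES:
            tag = "AUX"  # coarse label for words that precede a verb
        elif previous in DETERMINERS:
            tag = "NOUN"  # context rule: determiner + word
        elif previous in VERB_CUES:
            tag = "VERB"  # context rule: "to"/modal + word
        else:
            # Morphological rule: guess from the word ending, default to noun.
            tag = next((t for s, t in SUFFIX_TAGS if lower.endswith(s)), "NOUN")
        tags.append(tag)
        previous = lower
    return tags


def pronounce_homographs(sentence):
    """Replace known homographs by the stress-marked form for their tag."""
    words = sentence.split()
    tags = tag_words(words)
    return [HOMOGRAPHS.get(w.lower(), {}).get(t, w) for w, t in zip(words, tags)]
```

Under these rules, "the record" yields the noun form and "to record" the verb form, without building a full syntax tree.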

Phonetic processing:

The phonetic module performs two main tasks:

  • segmental synthesis.
  • creation of good prosodic patterns.

Segmental synthesis:

This part of the text-to-speech system is responsible for the synthesis of the spectral characteristics of synthetic speech. In most systems, the segmental synthesis module also handles amplitude (loudness). There are two different approaches to segmental synthesis:

  • phoneme synthesis (synthesis by rule).
  • synthesis based on segment concatenation.

In the concept of synthesis by rule, some target speech parameters are stored for each phoneme. During the synthesis, the system starts from these target values and then uses rules to create correct spectral transitions. The resulting speech parameter vectors are then used to drive a speech synthesiser, which is frequently a formant synthesiser. An alternative to using a formant synthesiser is to work with articulatory models. Although these might ultimately offer a more interesting and flexible approach, the speech quality of current articulatory-based systems is still inferior to the speech quality of state-of-the-art formant synthesis systems.

Systems with segment concatenation use small speech segments taken from human speech to create synthetic speech. With a finite set of well-chosen speech segments, it is possible to synthesise any text by concatenating segments. The selection of the elementary building blocks is a key factor, determining to a great extent both the complexity and the quality of the system. Text-to-speech systems using the segment concatenation technique typically use diphones as elementary speech units. A diphone is a small speech segment starting somewhere ’in the middle’ of one phoneme and ending ’in the middle’ of the next phoneme. Consequently, the transition as well as most of the coarticulation effects between the phonemes are preserved inside the unit. The use of diphones as elementary building blocks for speech synthesis is based on the assumption that coarticulation effects are local effects. Yet, this is not always the case, which is why some systems use segments of different lengths, e.g. diphones in combination with larger units such as triphones or tetraphones. An equally important issue in that respect is the development of methods for segment database creation and segment selection.

Since the advantages and disadvantages of synthesis by rule and concatenative synthesis are to a certain extent complementary, a hybrid text-to-speech strategy is receiving increased attention.
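
To make the diphone idea concrete, the sketch below turns a phoneme sequence into the diphone units needed to synthesise it and joins the stored segments. The unit naming scheme, the silence symbol and the inventory layout (a mapping from unit name to a list of samples) are assumptions for illustration; a real system would also smooth the joins and adjust prosody.

```python
def to_diphones(phonemes, silence="_"):
    """List the diphone units covering a phoneme sequence.

    Each unit runs from the middle of one phoneme to the middle of the
    next; silence is added at both ends so the first and last phonemes
    are fully covered.
    """
    padded = [silence, *phonemes, silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def concatenate(diphones, inventory):
    """Join the stored sample lists for the given units, in order.

    `inventory` is a hypothetical mapping from unit name to samples.
    """
    samples = []
    for name in diphones:
        samples.extend(inventory[name])
    return samples
```

For the phonemes `["h", "e", "l", "o"]`, `to_diphones` returns `["_-h", "h-e", "e-l", "l-o", "o-_"]`: a sequence of n phonemes always needs n + 1 diphones.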

Prosody:

To synthesise intelligible and natural-sounding speech, it is essential to create good prosodic characteristics. The synthesis of prosody involves two steps:

  • the production of a good intonation contour.
  • the assignment of a correct duration to each phoneme.

As already mentioned, the creation of a correct amplitude (loudness) contour is frequently handled as a part of the segmental synthesis module.

With respect to the intonation, some important principles have to be taken into account. Each sentence contains at least one important or dominant word. In a lot of languages, an important word is marked by means of an intonation accent realised as a pitch movement on the lexically accented syllable of the important word. Intonation is not only used to emphasise words but also to mark the sentence type (e.g. declarative versus interrogative, WH-questions versus yes/no-questions) and to mark important syntactic boundaries (e.g. with phrase-final continuation rises). In tone languages such as Chinese, word meanings and/or grammatical contrasts can be conveyed by variations in pitch. In pitch-accent languages such as Swedish and Japanese, a particular syllable in a word is pronounced with a certain tone. This is in contrast to languages such as English, where each word has a fixed lexical stress position but the use of pitch is less restricted. Apart from all the intonation effects just described, some segmental effects (such as the influence of the post-vocalic consonant on the pitch of the preceding vowel) can also be observed in natural intonation contours.

A text-to-speech system should include a language-specific intonation module that models the perceptually relevant intonation effects of the target language. Such an intonation model should at least take into account the number, location and stress level of the important words, the location of the major syntactic boundaries and the sentence type. Among the different approaches possible, an approach applicable to a lot of languages (such as English and Dutch) is to describe pitch contours by means of standardised pitch movements (rises and falls). Rules specify how these elementary pitch movements can be combined to create intonation contours for entire messages.

Assigning a correct duration to each phoneme is essential. Measurements on speech data as well as perceptual experiments demonstrate the relevance and the importance of good duration models. Phoneme durations are influenced by a lot of factors. Without being exhaustive, the list below shows some of the factors a duration model should take into account, as they influence the intrinsic duration of the phonemes:

  • the phonetic context.
  • the stress level.
  • the position within the word.
  • the syntactic structure of the sentence.
  • the opposition between content and function words.
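
One simple way to combine the factors listed above is a rule model that scales an intrinsic duration by one multiplier per factor. The sketch below assumes this multiplicative form; the intrinsic durations and all scaling factors are invented for illustration and are not measured values.

```python
# Hypothetical intrinsic durations in milliseconds (illustrative only).
INTRINSIC_MS = {"ae": 120, "s": 90, "t": 60}
DEFAULT_MS = 80


def phoneme_duration(phoneme, stressed=False, word_final=False,
                     phrase_final=False, function_word=False,
                     before_voiced=False):
    """Return a duration in ms from an intrinsic value and scaling rules."""
    duration = INTRINSIC_MS.get(phoneme, DEFAULT_MS)
    if before_voiced:      # phonetic context
        duration *= 1.2
    if stressed:           # stress level
        duration *= 1.2
    if word_final:         # position within the word
        duration *= 1.1
    if phrase_final:       # syntactic structure: lengthening at boundaries
        duration *= 1.4
    if function_word:      # function words are shortened
        duration *= 0.8
    return round(duration)
```

With these invented factors, a stressed "ae" lasts 144 ms instead of its intrinsic 120 ms, and the same vowel in a function word is shortened to 96 ms.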

Duration models can be developed and implemented in different ways, resulting, for example, in rule models, neural net models or decision tree models. Some of the models are phoneme-oriented while others predict the duration of syllables before assigning durations to phonemes. Although the prosody models in text-to-speech systems have become increasingly sophisticated, synthetic prosody is still one of the main causes of the quality difference between synthetic and human speech.

In applications where at least parts of the messages to be synthesised are fixed, off-line prosody optimisation can be performed. This is for example the case for interactive voice response systems, dialogue systems, automatic traffic messaging and navigation systems. One way to do off-line prosody optimisation is to use the technique of prosody transplantation. This technique is based on the idea of transplanting (copying) intonation and duration values from a recorded donor message (human speech) to the phonetic transcription of the same message. The enriched phonetic transcription thus obtained serves as input for the text-to-speech system, bypassing the linguistic and prosodic modules. The output of this Phonetics-to-Speech (PTS) synthesis is high-quality synthetic speech. Transplanted prosody and PTS can also be used in applications where fixed messages are combined with variable information: full text-to-speech is used for the variable parts, while PTS and transplanted prosody are used for the fixed parts.

Acoustic processing:

The last part of a text-to-speech system performs the acoustic processing. At this stage, the speech data created in the previous stage of the processing are converted into a speech signal. The synthesis model used should allow the independent manipulation of spectral characteristics, phoneme duration and intonation. As mentioned before, a text-to-speech system based on phoneme synthesis uses a speech synthesiser (usually a formant synthesiser) to create the speech output. However, a system that uses segment concatenation does not necessarily need such a synthesiser. Concatenation of speech segments and synthesis of prosody can be done in the time domain using pitch-synchronous methods such as the TD-PSOLA (Time Domain Pitch Synchronous Overlap Add) technique. Another possibility is to use pitch-synchronous synthesis techniques in combination with a residual-excited LP (Linear Prediction) speech production model (or another derivative of this model). In this case, prosody manipulations can be done in the residual domain. Another technique is based on an LP model in combination with a parametric excitation model. This approach offers extra flexibility to control the voice quality.

The foregoing text is attributed to Scansoft Inc.
