Use Embedded Speech Commands to Fine-Tune Spoken Output
As described in Control Speech Quality Using Embedded Speech Commands, you use embedded commands to fine-tune the pronunciation of individual words in the text your application passes to a synthesizer. Even if you use only a few of the embedded speech commands described in this section, you may significantly increase the understandability of your application’s spoken output. This section provides an overview of embedded speech command syntax, lists the available commands, and illustrates how to use them to achieve different effects.
Note that some embedded speech commands have functional equivalents provided by the Carbon selector mechanism (for a complete list of available selectors, see Speech Synthesis Manager Reference). This means that to achieve some effects, you can either insert the embedded command in the text or pass the equivalent selector to the Carbon SetSpeechInfo function. If you use the SetSpeechInfo function (described in Adjust Speech Channel Settings Using the Carbon Speech Synthesis API), the effect applies to all speech passing through the current speech channel, subject to synthesizer capabilities. If you use the embedded command to achieve the same effect, however, it applies only to the word that immediately follows the embedded command.
Embedded Speech Command Delimiters
When processing an input string or buffer, speech synthesizers look for special strings of characters called command delimiters. These character strings are usually defined to be pairings of printable characters that do not typically appear in the text. One character string is defined as the begin command delimiter and another character string is defined as the end command delimiter. When the synthesizer encounters the begin command delimiter string, it interprets the characters following it as one or more embedded commands until it reaches the end command delimiter string.
The default begin and end command delimiter strings recognized by the MacinTalk synthesizer are “[[” and “]]”, respectively. You can change these strings if necessary, but you should take care to use printable characters that you do not expect to see in the text your application processes. Also, if you change the default delimiters, be sure to change them back to the default characters when you have finished with the text, because the change is persistent for the current speech channel. For example, if you expect square brackets to appear in the text you’ll be sending to the synthesizer, you can change the default command delimiters to strings containing other printable characters that do not naturally occur in your text.
You can disable the processing of all embedded commands by setting both the begin and end command delimiters to two NUL bytes. You might want to do this if your application speaks text over which you have no control and you’re absolutely sure the text contains no embedded commands. To disable processing of embedded commands programmatically, use the soCommandDelimiter selector with the SetSpeechInfo function, as shown below:
// Create a structure to hold the new delimiter values
DelimiterInfo MyNewDelimiters;
MyNewDelimiters.startDelimiter[0] = 0;
MyNewDelimiters.startDelimiter[1] = 0;
MyNewDelimiters.endDelimiter[0] = 0;
MyNewDelimiters.endDelimiter[1] = 0;
SetSpeechInfo(CurrentSpeechChannel, soCommandDelimiter, &MyNewDelimiters);
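Because custom delimiters are only safe when they never occur in the text being spoken, an application might scan the text before choosing them. The following is a minimal sketch in plain C that makes no Speech Synthesis Manager calls; the function name text_contains_delimiters is hypothetical and is not part of any Apple API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an Apple API): returns true if either
   candidate delimiter string occurs anywhere in the text. */
static bool text_contains_delimiters(const char *text,
                                     const char *begin_delim,
                                     const char *end_delim)
{
    assert(text != NULL && begin_delim != NULL && end_delim != NULL);

    /* strstr matches an empty needle at the start of any string, so an
       empty candidate is treated as unusable rather than as a match. */
    bool has_begin = begin_delim[0] != '\0' && strstr(text, begin_delim) != NULL;
    bool has_end = end_delim[0] != '\0' && strstr(text, end_delim) != NULL;
    return has_begin || has_end;
}
```

If the function returns true for a candidate pair, that pair would probably be a poor choice for the text in question, because the synthesizer could interpret ordinary content as the start or end of a command.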
Overview of Embedded Speech Command Syntax
Note: This section describes enough of the embedded command syntax for you to be able to understand the examples in this document. For a formal description of the syntax of embedded speech commands and their parameters, see Syntax of Embedded Speech Commands.
All embedded commands consist of a 4-character command code and a parameter, enclosed by the begin and end command delimiter strings. For example, the emph command requires a parameter that tells the synthesizer to increase or decrease the emphasis with which to speak the next word, as shown below:
[[emph +]]
The + parameter tells the synthesizer to increase emphasis for the following word.
More than one command may occur within a single pair of delimiter strings if they are separated by semicolons, as shown below:
[[emph +; rate 165]]
Together, these commands tell the synthesizer to speak the following word or phrase with increased emphasis and at a rate of 165 words per minute.
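As an illustration of the semicolon-separated form, the sketch below assembles a command list in plain C. It assumes the default “[[” and “]]” delimiters, and the names append_text and build_command_list are hypothetical helpers, not part of the Speech Synthesis Manager.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Appends s to out (which already holds *used characters), keeping the
   result NUL-terminated. Returns 0 on success, -1 if it would not fit. */
static int append_text(char *out, size_t out_size, size_t *used, const char *s)
{
    size_t len = strlen(s);
    if (*used + len + 1 > out_size) {
        return -1;
    }
    memcpy(out + *used, s, len + 1);  /* copies the terminating NUL too */
    *used += len;
    return 0;
}

/* Hypothetical helper (not an Apple API): writes "[[cmd1; cmd2; ...]]"
   into out, assuming the default delimiters are in effect.
   Returns 0 on success, -1 if the buffer is too small. */
static int build_command_list(char *out, size_t out_size,
                              const char *const commands[], size_t count)
{
    assert(out != NULL && commands != NULL && count > 0);

    size_t used = 0;
    if (append_text(out, out_size, &used, "[[") != 0) return -1;
    for (size_t i = 0; i < count; i++) {
        /* Commands after the first are separated by a semicolon. */
        if (i > 0 && append_text(out, out_size, &used, "; ") != 0) return -1;
        if (append_text(out, out_size, &used, commands[i]) != 0) return -1;
    }
    return append_text(out, out_size, &used, "]]");
}
```

Passing the two strings "emph +" and "rate 165" produces the same command list as the example above.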
A parameter may consist of a string, a numeric type, or an operating-system type, and may be accompanied by the + or - characters (the exact format of a parameter depends on the command with which it’s associated). Some commands allow you to use the parameter to specify either an absolute value or a relative value. For example, the volm command allows you to specify a particular volume or an amount by which to increase or decrease the current volume, as shown below:
[[volm 0.3]]
This command sets the volume with which the following word is spoken to 0.3.
[[volm +0.1]]
This command increases the volume with which the following word is spoken by 0.1.
The speech synthesizer ignores all whitespace within an embedded command, so you may insert as many spaces as you need to make your command text more readable.
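The difference between absolute and relative parameters comes down to whether the value carries an explicit sign. The sketch below formats a volm command either way in plain C; format_volm is a hypothetical helper name, and the one-decimal precision is an arbitrary choice made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an Apple API): formats a volm command using
   the default delimiters. A relative value always carries an explicit
   sign; an absolute value never does.
   Returns 0 on success, -1 if the buffer is too small. */
static int format_volm(char *out, size_t out_size, double value, bool relative)
{
    assert(out != NULL);
    /* A negative absolute value would print a leading '-' and so be
       read as a relative change, which is not what the caller asked for. */
    assert(relative || value >= 0.0);

    int n;
    if (relative) {
        n = snprintf(out, out_size, "[[volm %+.1f]]", value);  /* "%+" forces a sign */
    } else {
        n = snprintf(out, out_size, "[[volm %.1f]]", value);
    }
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

The same pattern would apply to other commands whose syntax accepts an optional + or - prefix.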
In addition, this document uses the following characters to express the syntax of embedded speech commands (these characters do not appear in actual embedded speech commands):
- The < and > characters enclose items that represent logical units, such as string, character, integer, or real value. When you insert an embedded command in your text, you replace the logical unit with an actual value. For example, you might replace “<RealValue>” with 3.0. For precise definitions of each logical unit, see the formal description of the syntax in Syntax of Embedded Speech Commands.
- The | character means “or” and appears between members in a list of possible items, any single one of which may be used. For example, the emph command accepts either the + character or the - character for its parameter. Therefore, the syntax of the emph command is expressed as emph + | -.
- The [ and ] characters enclose an optional item or list of items. For example, the rate command accepts the optional addition of the + or - character to its numerical parameter to indicate a change relative to the current value. Therefore, the syntax of the rate command is expressed as rate [+ | -] <RealValue>.
- Items followed by an ellipsis character (…) may be repeated one or more times.
The OS X Embedded Speech Commands
Table 3-1 describes the embedded speech commands, their parameters, equivalent speech information selectors (if they exist), and in which versions of OS X the commands are available. The syntax of each command in Table 3-1 is expressed using the conventions described in Overview of Embedded Speech Command Syntax.
Note: All embedded speech commands, except for ctxt, are available in OS X v10.0 and later. The ctxt command is available in OS X v10.4 and later.
Table 3-1 Embedded speech commands
|Command|Syntax and description|Selector|
|---|---|---|
|char
|char NORM | LTRL
The character mode command sets the word-speaking mode of the speech channel. When the NORM parameter is used, the synthesizer attempts to automatically convert words into speech. This is the most basic function of the synthesizer. When the LTRL parameter is used, the synthesizer speaks the individual characters of every word, number, and symbol following the command (all other embedded commands are processed normally). For example, to cause the synthesizer to speak the word “cat” as “C-A-T,” you would include the following in a text buffer or string:
[[char LTRL]] cat [[char NORM]]
|soCharacterMode
|
|cmnt
|cmnt [<Character>...]
The comment command is ignored by speech synthesizers. It enables you to add arbitrary content to the text buffer that will never be included in the spoken output. Note that the comment text itself must be included within the begin and end command delimiters of the cmnt command.
[[cmnt This is a comment that will be ignored by the synthesizer.]]
|None|
|ctxt
|ctxt [WSKP | WORD | NORM | TSKP | TEXT]
The context command allows you to identify the context of a word to help the synthesizer generate the correct pronunciation of that word, even if no other words in the surrounding phrase or sentence are spoken. Because the pronunciation of words can be different depending on the context in which they appear, you can use the context command to specify the pronunciation used in a particular context.
The context command recognizes two modes: word-by-word and text fragment. In both modes, you use the appropriate “skip” parameter (WSKP or TSKP) to identify the text that provides context and the WORD or TEXT parameter to identify the word or phrase whose pronunciation is affected by the context. The synthesizer parses the entire phrase or sentence to determine the correct pronunciation of the word or phrase, but does not speak the portions of the text marked as “skipped.” Use the [[ctxt NORM]] command to signal a return to the default input-processing mode.
In word-by-word mode, the synthesizer parses the complete text selection to determine the part of speech (such as noun or verb) of the specified word. The synthesizer pronounces the word according to its part of speech, but it does not make any intonation or duration adjustments to the pronunciation. For example, the word “coordinates” is pronounced differently depending on whether it is used as a noun or a verb. The two sentences below illustrate how to use the context command to tell the synthesizer which pronunciation of the word to use:
[[ctxt WSKP]] GPS provides [[ctxt WORD]] coordinates. [[ctxt NORM]]
[[ctxt WSKP]] The post office [[ctxt WORD]] coordinates [[ctxt WSKP]] its deliveries. [[ctxt NORM]]
In text fragment mode, the synthesizer parses the complete text selection to determine the part of speech and the intonation and duration of the specified word or phrase. For example, the different pronunciations of the phrase “first step” are informed by the context provided by the surrounding words in the following two sentences:
[[ctxt TSKP]] Your [[ctxt TEXT]] first step [[ctxt TSKP]] should be to relax. [[ctxt NORM]]
[[ctxt TSKP]] To relax should be your [[ctxt TEXT]] first step. [[ctxt NORM]]
|None|
|dlim
|dlim <BeginDelimiter> <EndDelimiter>
The delimiter command changes the character sequences that indicate the beginning and end of all subsequent embedded speech commands. The new delimiters take effect after the command list containing the dlim command has been completely processed. If the delimiter strings are empty, an error is generated. If you want to disable embedded command processing for the remainder of the text buffer, you can pass two NUL bytes in the BeginDelimiter and EndDelimiter parameters.
[[dlim $ $]]
|soCommandDelimiter
|
|emph
|emph + | -
The emphasis command causes the synthesizer to speak the next word with greater or less emphasis than it is currently using. The + parameter increases emphasis and the - parameter decreases emphasis.
For example, to emphasize the word “not” in the following phrase, use the emph command as follows:
Do [[emph +]] not [[emph -]] over tighten the screw.
|None|
|inpt
|inpt TEXT | PHON | TUNE
The input mode command switches the input-processing mode to textual mode, phoneme mode, or TUNE format mode. Note that some synthesizers may define additional speech input modes you can use. The default input-processing mode is textual, and you should always use the [[inpt TEXT]] command to revert to textual mode after you’re finished providing content in one of the other modes. In phoneme mode, the synthesizer interprets characters as representing phonemes (listed in Phonemes). In the TUNE format mode, the synthesizer recognizes the same set of phonemes but also interprets additional information that specifies a precise spoken contour, or tune, for the words. For more information about the TUNE format, see Use the TUNE Format to Supply Complex Pitch Contours.
For example, to supply the phonemic representation of a name that synthesizers frequently mispronounce, you can use the inpt command as follows:
My name is [[inpt PHON]] AY1yIY2SAX [[inpt TEXT]].
|soInputMode
|
|nmbr
|nmbr NORM | LTRL
The number mode command sets the number-speaking mode of the synthesizer. The NORM parameter causes the synthesizer to speak the number 46 as “forty-six,” whereas the LTRL parameter causes the synthesizer to speak the same number as “four six.”
For example, to make it clear that the following 7-digit number is a phone number, you can use the nmbr command to tell the synthesizer to say each digit separately, as follows:
Please call me at [[nmbr LTRL]] 5551990 [[nmbr NORM]].
|soNumberMode
|
|pbas
|pbas [+ | -] <RealValue>
The baseline pitch command changes the current speech pitch for the speech channel to the specified real value. If the pitch value is preceded by the + or - character, the speech pitch is adjusted relative to its current value. Baseline pitch values are always positive numbers in the range of 1.000 to 127.000.|soPitchBase
|
|pmod
|pmod [+ | -] <RealValue>
The pitch modulation command changes the modulation range for the speech channel, based on the specified modulation-depth real value.|soPitchMod
|
|rate
|rate [+ | -] <RealValue>
The speech rate command sets the speech rate on the speech channel to the specified real value. Speech rates fall in the range 0.000 to 65535.999, which translates into a range of 50 to 500 words per minute. If the rate is preceded by a + or - character, the speech rate is increased or decreased relative to its current value.|soRate
|
|rset
|rset <32BitValue>
The reset command resets the speech channel’s voice and attributes to default values. The parameter has no effect; it should be set to 0.|soReset
|
|slnc
|slnc <32BitValue>
The silence command causes the synthesizer to generate silence for the specified number of milliseconds. You might want to insert extra silence between two sentences to allow listeners to fully absorb the meaning of the first one. Note that the precise timing of the silence will vary among synthesizers.|None|
|sync
|sync <32BitValue>
The synchronization command causes an application’s synchronization callback procedure to be executed. The callback is made as the audio corresponding to the next word begins to sound. The 32-bit value is set by the application and is passed to the callback procedure.
You can use the sync command to trigger a callback at times other than those defined by the built-in callbacks (such as the phoneme and speech-done callbacks). For example, you might want to perform some custom processing each time a date is spoken to highlight its place on a graphical timeline. To do this, you would define a synchronization callback procedure and refcon values, and insert a sync command after each date in the text, as follows:
In 1066 [[sync 0x000000A1]], William the Conqueror invaded England and by 1072 [[sync 0x000000A2]], the whole of England was conquered and united.
|soSyncCallback
|
|vers
|vers <32BitValue>
The format version command tells the speech synthesizer which embedded command format version will be used by all subsequent embedded speech commands.|None|
|volm
|volm [+ | -] <RealValue>
The speech volume command sets the speech volume on the current speech channel to the specified real value. If the volume value is preceded by a + or - character, the speech volume is increased or decreased relative to its current value.|soVolume
|
|xtnd
|xtnd <OSType> [<Parameter> ...]
The synthesizer-specific xtnd command enables other synthesizer-specific commands to be embedded in the text. The first parameter (OSType) must be the creator ID of the synthesizer. The remaining optional parameters are synthesizer-specific.|soSynthExtension
|
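Several commands in Table 3-1 are used in pairs (a mode is switched on before a word and switched back afterward), and the sync command takes a 32-bit value written in the examples as eight hexadecimal digits. The sketch below shows one way an application might generate such text in plain C. The names wrap_with_mode and format_sync are hypothetical helpers, not part of the Speech Synthesis Manager, and the default delimiters are assumed.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an Apple API): wraps text so that a mode is
   switched on before it and restored after it, for example
   "[[nmbr LTRL]] 5551990 [[nmbr NORM]]".
   Returns 0 on success, -1 if the buffer is too small. */
static int wrap_with_mode(char *out, size_t out_size, const char *command,
                          const char *on_param, const char *off_param,
                          const char *text)
{
    assert(out != NULL && command != NULL && on_param != NULL &&
           off_param != NULL && text != NULL);

    int n = snprintf(out, out_size, "[[%s %s]] %s [[%s %s]]",
                     command, on_param, text, command, off_param);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}

/* Hypothetical helper (not an Apple API): formats a sync command whose
   32-bit value is written as eight uppercase hexadecimal digits.
   Returns 0 on success, -1 if the buffer is too small. */
static int format_sync(char *out, size_t out_size, uint32_t refcon)
{
    assert(out != NULL);

    int n = snprintf(out, out_size, "[[sync 0x%08" PRIX32 "]]", refcon);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Calling wrap_with_mode with "nmbr", "LTRL", "NORM", and "5551990" reproduces the phone-number fragment from the nmbr row, and format_sync with 0xA1 reproduces the first sync command from the sync row.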
I just tried it with “Do [[emph +]] not [[emph -]] over tighten the screw.” as the text to speak. It still seems to work after all these years.