Phoneme Tool/data format
The phoneme editor embeds the following ASCII text block at the end of a .wav file:
VERSION 1.0
PLAINTEXT
{
example sentence
}
WORDS
{
WORD example <start time> <end time>
{
<phoneme id> <phoneme name> <start time> <end time> <volume (unused, always 1)>
}
WORD sentence <as above>
{
<as above>
}
}
EMPHASIS
{
<time> <normalised value>
}
CLOSECAPTION
{
english
{
PHRASE unicode <size of text in *bytes*; the text has no nul terminator> <text, apparently encoded as either UCS-2 or UTF-16> <start time> <end time>
}
}
OPTIONS
{
voice_duck <1/0>
}
All sections are required, even if they are empty (as EMPHASIS often is).
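As a rough illustration, here is a minimal sketch that pulls the per-phoneme timings out of the WORDS section of such a block. It is written in Python purely for this article; the Phoneme type and parse_words helper are hypothetical and not part of any Valve tool, and the CLOSECAPTION section is skipped entirely because its text field is not plain ASCII.

# Hypothetical sketch: extract phoneme timings from the WORDS section of the
# text block described above. Type and function names are made up.
from dataclasses import dataclass

@dataclass
class Phoneme:
    phoneme_id: int
    name: str
    start: float
    end: float

def parse_words(block: str) -> list[tuple[str, list[Phoneme]]]:
    words: list[tuple[str, list[Phoneme]]] = []
    lines = iter(block.splitlines())
    for line in lines:                    # skip ahead to the WORDS section
        if line.strip() == "WORDS":
            break
    current = None
    depth = 0
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "{":
            depth += 1
        elif tokens[0] == "}":
            depth -= 1
            if depth <= 0:                # closing brace of the WORDS section
                break
        elif tokens[0] == "WORD":
            # WORD <word text> <start time> <end time>
            current = []
            words.append((tokens[1], current))
        elif current is not None and len(tokens) >= 4:
            # <phoneme id> <phoneme name> <start time> <end time> <volume>
            current.append(Phoneme(int(tokens[0]), tokens[1],
                                   float(tokens[2]), float(tokens[3])))
    return words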
Todo: Regarding the closed-captioning section: what other language identifiers are valid? Are there valid encodings other than "unicode"? Can there be multiple PHRASE entries? Is this method of closed-captioning deprecated, does it supersede the method described here, or do the two exist alongside each other?
VDAT chunk
As WAV is a chunk-based file format derived from RIFF, WAV files containing phoneme data store it in a custom chunk of type VDAT. The chunk consists of the four ASCII characters VDAT (56 44 41 54 in hexadecimal), followed by four bytes giving the length of the chunk, excluding the eight identifier and length bytes; all values are encoded in little endian. The plaintext block described above follows immediately afterwards.
Todo: Does the Phoneme Editor use the VDAT chunk, or does it exist only in audio shipped with Valve's games? Does Source still recognise phoneme data that doesn't use the VDAT chunk?
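For illustration, a minimal Python sketch of locating the chunk, assuming a standard RIFF layout; the find_vdat helper is hypothetical and not part of any Valve tool.

# Hypothetical sketch: walk the RIFF chunks of a .wav file and return the
# payload of the VDAT chunk, i.e. the ASCII text block described above.
import struct

def find_vdat(path: str) -> bytes | None:
    with open(path, "rb") as f:
        riff, _size, wave = struct.unpack("<4sI4s", f.read(12))
        if riff != b"RIFF" or wave != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            header = f.read(8)
            if len(header) < 8:
                return None                      # no VDAT chunk present
            chunk_id, length = struct.unpack("<4sI", header)
            if chunk_id == b"VDAT":
                return f.read(length)            # the phoneme text block
            # Skip this chunk; RIFF chunk data is padded to an even length.
            f.seek(length + (length & 1), 1)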
Phoneme IDs
- 95 <sil>
- 97 aa2
- 98 b
- 100 d
- 101 ey
- 102 f
- 103 g
- 104 hh
- 105 iy
- 106 y
- 107 c
- 108 l
- 109 m
- 110 n
- 111 ow
- 112 p
- 114 r2
- 115 s
- 116 t
- 117 uw
- 118 v
- 119 w
- 122 z
- 230 ae
- 240 dh
- 331 nx
- 593 aa
- 596 ao
- 601 ax
- 602 er
- 603 eh
- 604 ax2
- 605 er2
- 609 g2
- 614 hh2
- 616 ih2
- 618 ih
- 619 l2
- 633 r
- 635 r3
- 638 d2
- 643 sh
- 650 uh
- 652 ah
- 658 zh
- 676 jh
- 679 ch
- 952 th
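The numeric IDs appear to be the decimal Unicode code points of the corresponding IPA characters (for example, 230 is æ and 952 is θ); this is an observation from the list above rather than anything documented by Valve. The hypothetical Python snippet below uses that correspondence to print each entry alongside its IPA symbol.

# Hypothetical snippet: if the phoneme IDs really are Unicode code points of
# IPA characters, chr() recovers the symbol for each entry in the list above.
PHONEME_IDS = {
    95: "<sil>", 97: "aa2", 98: "b", 100: "d", 101: "ey", 102: "f",
    103: "g", 104: "hh", 105: "iy", 106: "y", 107: "c", 108: "l",
    109: "m", 110: "n", 111: "ow", 112: "p", 114: "r2", 115: "s",
    116: "t", 117: "uw", 118: "v", 119: "w", 122: "z", 230: "ae",
    240: "dh", 331: "nx", 593: "aa", 596: "ao", 601: "ax", 602: "er",
    603: "eh", 604: "ax2", 605: "er2", 609: "g2", 614: "hh2", 616: "ih2",
    618: "ih", 619: "l2", 633: "r", 635: "r3", 638: "d2", 643: "sh",
    650: "uh", 652: "ah", 658: "zh", 676: "jh", 679: "ch", 952: "th",
}

for code, name in PHONEME_IDS.items():
    print(f"{code:>4}  {name:<5}  {chr(code)}")   # e.g. " 230  ae     æ"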