In order to perform phoneme extraction you must have the Microsoft Speech API 5.1 (SAPI 5.1) installed. It can be downloaded from Microsoft's web site.
The FacePoser application contains a tool for editing phoneme/word tags for the .wav files that actors can use with the "SPEAK" event. You can either load a scene that contains a spoken .wav file and then select any of the SPEAK events in the Choreography View, or you can directly load a .wav file by clicking the "Load" button along the bottom of the Phoneme Editor view.
Once you've loaded a .wav file, the display shows the general wave form of the sound file. Along the top, the display shows the previously recognized words of the sentence; along the bottom, it shows the previously tagged phonemes of the spoken .wav. Useful information about the .wav file is displayed in the bottom section of the view. The full text of the sentence and information about the currently selected phoneme/word are displayed along the right side of the workspace. A scroll bar at the top slides the wave view left/right, and the mouse wheel zooms in/out. The zoom factor is shown at the bottom left of the tool window. Finally, a tab control switches between manipulating phonemes, editing phoneme emphasis, and editing close captioning/localization information.
Phoneme Editor Tools
- Redo Extraction
- Resubmits the sound file to the speech recognizer. If this is successful, a new list of words/phonemes will show up "inset" from the original data. To accept the new data and begin editing it, right-click in the workspace (in the wave form display) and choose "Commit extraction" from the context menu. To remove the inset data, right-click and select "Clear extraction" from the menu. Note that committing the results doesn't overwrite the original .wav file. That only happens when you click the "Save Changes" button, or when you answer "Yes" to the "Save file" prompt when changing .wav files or quitting the FacePoser application.
- Press the save changes button to save the working .wav file out to disk (see Phoneme Tool/data format).
- Load a new .wav file into the editor for editing.
- This option has three sub-options to play the original .wav, the edited wav or just the selected portion, if a selected portion is active. Playing and stopping the .wav can also be accomplished by pressing the Spacebar.
- These options either load a new .wav or save the changes made to the current .wav.
- Stops all sound playback on the sound engine
- If you've marked some portions of the .wav file as selected by dragging the left mouse along the wave form, you can click this button to remove all such markings.
- Redo extraction
- Same as button (above)
- Redo extraction of selected words
- This option requires that you have a portion of the wave form selected as well as a contiguous set of words from the sentence selected. The option sends that subset of the sentence to the phoneme extraction tool and displays the results when finished. The tool will not change the positions of words, though it will wipe out and re-populate any phonemes belonging to words in the set. The phoneme extractor sometimes has a hard time with long sentences. In such cases, working on sections of the sentence piecemeal can help with extraction.
- Commit extraction
- If word/phoneme data has been processed by the extraction system, choosing "Commit" will overwrite the current working data.
- Clear extraction
- Throws away the "uncommitted" data.
- Cleanup words/phonemes
- Iterates through all phonemes and words, finds words that are within a couple of pixels of touching (or that overlap by such an amount), and fixes up the start/end times of the words/phonemes.
- Change Speech API
- The SDK version of FacePoser supports Microsoft SAPI 5.1 for performing automatic phoneme extraction from .wav files.
- Import / export word data to .txt
- If you need to work with the .wav file in a sound tool which strips our data chunks, you can save the original data lump into a .txt file and reapply after you edit the .wav externally.
- Disable voice duck
- The Source engine automatically lowers non-voice volume levels when a spoken wav is playing back. This behavior can be disabled for a spoken .wav by choosing "Disable voice duck" from the right-click menu.
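The "Cleanup words/phonemes" pass described above can be pictured with a small sketch. This is not FacePoser's actual code: the `Tag` type, the `cleanup_tags` function, the 2-pixel default tolerance, and the choice to snap both edges to the midpoint are all assumptions made for illustration. The source only says that near-touching or slightly overlapping items have their start/end times fixed up.

```python
from dataclasses import dataclass


@dataclass
class Tag:
    """A word or phoneme tag with times in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


def cleanup_tags(tags, pixels_per_second, snap_pixels=2.0):
    """Close small gaps or overlaps between neighbouring tags.

    Assumption: when the on-screen gap (or overlap) between two neighbours
    is no larger than snap_pixels, both edges move to the midpoint.
    """
    tolerance = snap_pixels / pixels_per_second  # convert pixels to seconds
    ordered = sorted(tags, key=lambda t: t.start)
    for prev, nxt in zip(ordered, ordered[1:]):
        gap = nxt.start - prev.end  # a negative gap means the tags overlap
        if gap != 0.0 and abs(gap) <= tolerance:
            boundary = (prev.end + nxt.start) / 2.0
            prev.end = boundary
            nxt.start = boundary
    return ordered


# At 100 pixels per second, a 0.01 s gap is one pixel wide and gets closed,
# while the 0.5 s gap before the third word is left alone.
words = [Tag("hello", 0.0, 0.50), Tag("world", 0.51, 1.0), Tag("again", 1.5, 2.0)]
cleaned = cleanup_tags(words, pixels_per_second=100.0)
```

Expressing the tolerance in pixels means the result depends on the current zoom factor, which matches the description of the tool working on items "within a couple of pixels" of each other.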
The general interaction UI works as follows:
- To select, use left mouse button on items.
- To deselect, click outside the item area for the type of item being used.
- To shift the position of an item left, right, hold down
- To shift a boundary/edge of an item, hold down
Note that the cursor reflects the current mode: a 4-way cursor means the item can be shifted, and an East-West cursor means the item can be resized.
To select a portion of the waveform, simply click and drag with the left mouse button. To move the selection area, hold and use the left mouse button to drag the area. To resize the selection, hover the mouse over the solid blue lines at either edge while holding . To deselect, click anywhere outside of the current selection, or press . You can play the current selection or re-extract phonemes using the right mouse context menu or by hitting .
Use the left mouse to select words. Once selected, one or more words can be moved by holding down the key and using the mouse to drag the selection. If a single word is selected, it can be moved by holding down and using or on the keyboard to shift it pixel by pixel. The size of a word can be adjusted by holding and hovering the mouse over the edge of the word, then clicking and dragging the edge left or right. The right boundary (end time) of a word can be adjusted using the keyboard by holding and using the / keys.
To deselect words, click anywhere outside of the word area (e.g., just above the words area works just fine)
Right-clicking without words selected brings up a context menu with just a couple of options. First, the "Edit sentence text…" option allows you to specify the entire text of the current sentence. Clicking OK to exit the dialog will cause phoneme extraction to be performed again. Additionally, "Cleanup words/phonemes" is available any time a .wav is loaded.
If you have one or more words selected, the right menu shows additional options:
- Delete word
- You can delete the selected word(s) using this option.
- Edit word
- If there is just one word selected, you can type in new text for the word by selecting this option. Only one word may be entered.
- Insert word before/after word
- If you have a single word selected, and there is sufficient time before/after the word, then you can insert a new word by choosing this menu item. A dialog appears in which you can type a single word. Once you click OK, another dialog appears which allows you to pick one or more phonemes for the word just entered. You can type a space-separated list of phonemes, click one or more phoneme buttons to create the phoneme list for the newly entered word, or just click Cancel to put in a word with no phonemes.
- Add phoneme to word
- If the selected word doesn't have any phonemes, you can choose this option to allow entry of a string of one or more phonemes to use for the word.
- Select all words before/after word
- If a single word is selected, you can use this option to select the rest of the row in either direction (so you can shift everything down with the mouse easily)
- Deselect all
- Deselects all words/phonemes currently selected
- Merge words
- If two or more contiguous words are selected, choosing "Merge words" will make the start time of each word match the end time of the previous word
- Separate words
- If two or more contiguous selected words are close together, this option will provide a bit of space between the words.
- Clear Undo
- Resets undo information, deleting the undo history.
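The "Merge words" and "Separate words" options above amount to simple adjustments of word boundaries. The sketch below is illustrative only: the `Word` type, the function names, and the `min_gap` value are hypothetical, and the source does not say how much space "Separate words" inserts or which edges it moves.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A word tag with times in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


def merge_words(words):
    """Make each word start exactly where the previous word ends.

    Expects contiguous words already sorted by start time.
    """
    for prev, cur in zip(words, words[1:]):
        cur.start = prev.end


def separate_words(words, min_gap=0.02):
    """Open a small gap between neighbours that are closer than min_gap.

    Assumption: the missing space is taken equally from both words.
    """
    for prev, cur in zip(words, words[1:]):
        shortfall = min_gap - (cur.start - prev.end)
        if shortfall > 0:
            prev.end -= shortfall / 2.0
            cur.start += shortfall / 2.0


selection = [Word("good", 0.0, 0.4), Word("morning", 0.45, 0.9)]
merge_words(selection)      # "morning" now starts at 0.4
separate_words(selection)   # the two words are now about 0.02 s apart
```

Note that this sketch does not move any phonemes that belong to the words; a real implementation would need to keep them inside their word's new bounds.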
The phoneme area behaves almost identically to the word area as far as mouse and keyboard interaction are concerned.
When using the mouse to drag one or more selected phonemes/words, both the selection rubber band shown while dragging and the move itself are bounded to the valid amount of space.
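One way to picture the bounded drag is as a clamp on the requested time offset. The sketch below is a guess at the idea rather than the tool's implementation: `clamp_drag` and its parameters are hypothetical, and it assumes the "valid space" is the interval between the nearest unselected neighbours (or the ends of the .wav).

```python
def clamp_drag(delta, group_start, group_end, left_limit, right_limit):
    """Limit a drag so the selected group stays inside the free space.

    delta        requested move in seconds (negative means left)
    group_start  start time of the first selected item
    group_end    end time of the last selected item
    left_limit   end of the nearest item to the left (or 0.0)
    right_limit  start of the nearest item to the right (or the .wav length)
    """
    min_delta = left_limit - group_start   # furthest allowed move to the left
    max_delta = right_limit - group_end    # furthest allowed move to the right
    return max(min_delta, min(delta, max_delta))


# A group spanning 1.0 s to 2.0 s with free space from 0.5 s to 2.25 s:
allowed = clamp_drag(0.5, 1.0, 2.0, 0.5, 2.25)   # only 0.25 s of the move fits
```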
Phoneme Editor Keyboard Shortcuts
- If a .wav is currently being played, stop playback. If not, deselects all words/phonemes/selection areas
- Moves the keyboard focus either to the word area (PGUP) or the phoneme area (PGDN). The current focus area is shown by a light green bar along the top or bottom edge of the word or phoneme display. Clicking/manipulating words or phonemes will set the focus appropriately.
- RIGHT/LEFT arrow
- The right/left arrows move and select the next or previous word or phoneme. For phonemes, the arrows cycle within a word.
- / +
- You can change words at any time by using the TAB key.
- + ARROW KEY
- Move the selected word/phoneme to right or left
- + ARROW KEY
- Resize end position of selected word/phoneme
- / +
- Insert a new word to right/left of selected word/phoneme
- Delete selected word(s) (which deletes all phonemes of the word, too) or delete selected phoneme(s).
- UP or +
- Edit the selected word or phoneme.
- Play selection or entire wav file.
Phoneme Emphasis Editing
By clicking on the "Emphasis" tab with a .wav loaded, you'll see most of the view grayed out but there will now be a work area with a blue line at the center of the screen. You can create an emphasis spline by laying down points using the key and left-clicking on points in the work area.
Once you have placed points, you can select them (shown in red) by dragging a rectangle around the desired points with the mouse. To move the points, just left-click on one or more selected points and move the mouse. If you right-click in the work area, there are various options for selecting/deselecting all points and for undo/redo of editing changes.
The emphasis track scales the intensity of phonemes during playback. For certain phonemes, you may want to author a "weak" and "strong" version and add these to the "phonemes_weak" and "phonemes_strong" expression class files. Note that Valve did not actually use this feature in shipping HL2 (but in theory, it should work).
The blue center line represents normal emphasis, using the phonemes in the "phonemes" class. As the line goes toward the top, the phoneme from "phonemes" is faded out and the phoneme from "phonemes_strong" is faded in. If a phoneme doesn't have a strong or weak override, then the absolute scale for emphasis is clamped accordingly.
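The cross-fade described above can be sketched as a pair of blend weights. Everything specific here is an assumption: the `emphasis_weights` function, the 0.0 to 1.0 emphasis range with 0.5 as the center line, the linear fade, the mirrored behavior toward "phonemes_weak" at the bottom, and the interpretation of "clamped" as falling back to the center line when an override is missing. The source only describes the fade toward "phonemes_strong" at the top.

```python
def emphasis_weights(emphasis, has_strong=True, has_weak=True):
    """Return blend weights for the three phoneme expression classes.

    emphasis is assumed to run from 0.0 (bottom) to 1.0 (top), with 0.5 on
    the blue center line. All of this is an illustrative guess.
    """
    e = min(1.0, max(0.0, emphasis))
    # Without an override on one side, stay on the center line for that side.
    if e > 0.5 and not has_strong:
        e = 0.5
    if e < 0.5 and not has_weak:
        e = 0.5

    if e >= 0.5:
        t = (e - 0.5) * 2.0  # 0 at the center line, 1 at the top
        return {"phonemes": 1.0 - t, "phonemes_strong": t, "phonemes_weak": 0.0}
    t = (0.5 - e) * 2.0      # 0 at the center line, 1 at the bottom
    return {"phonemes": 1.0 - t, "phonemes_strong": 0.0, "phonemes_weak": t}


# Halfway between the center line and the top: an even mix of the normal
# and strong versions of the phoneme.
weights = emphasis_weights(0.75)
```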