Text to Speech

From Valve Developer Community

Text to speech (as it applies to the Source engine) is the technology of turning raw text into actual audio. Currently, Valve doesn't use TTS in any of its games, relying instead on voice actors for all of its spoken dialogue. However, TTS technology is coming along well, and is now at a point where it can be useful for more experimental or research-oriented Source mods. It can also serve as a placeholder for audio that will be recorded by actors later. TTS has the additional nice property that, because the audio is generated by the computer, all the lip-synching can be done automatically, without the need to run the file through FacePoser.


Here is a video (18 megs) demonstrating the TTS code below in action; the only additional processing on the voices is a small, hardcoded dictionary of transformations for difficult-to-pronounce words and phrases.


hl2TTS is an awkwardly named Python module that can handle everything necessary to automatically generate WAVs in the correct format and with the lip-synching information attached. These files are ready to be played in HL2 (after being added to script files, of course). Getting the module running is a bit of a pain because of the various dependencies, but once it's set up it's straightforward to use. To use this module, you will need to do the following:

  • Have a basic understanding of Python. The code is straightforward to use, but you do need to open it in a Python editor (IDLE works fine.) It would be nice to have just an .EXE that runs the script, but that hasn't happened yet. In the meantime, 20 minutes of fooling with Python will get you up and running just fine.
  • Install Microsoft SAPI 5.1. This is the same package you need for doing lip-synching in FacePoser, so you may have it already. This is the package that actually does the generation.
  • Install the Python Win32 Extensions. This allows pyTTS (the Python wrapper library for SAPI) to correctly link the DLLs.
  • Install pyTTS. This is a nice wrapper library from the good people at UNC that allows us to use SAPI from within Python very easily.

You are now ready to use hl2TTS.py. Import the module, and then use the StraightForwardRecord function. For instance, if you're at an interactive Python prompt, do:

import hl2TTS
hl2TTS.StraightForwardRecord("The quick brown fox jumped over the lazy dog.", "c:\\test")

This will generate a lip-synched audio file named c:\test0.wav of a computerized voice speaking the words "The quick brown fox jumped over the lazy dog."
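The numbering suffix comes from the module writing one WAV per chunk of text, counting up from zero. A minimal sketch of that naming convention (the helper name here is hypothetical, not part of hl2TTS itself):

```python
def chunk_filenames(base_path, chunks):
    """Hypothetical reconstruction of hl2TTS's output naming:
    each chunk of text gets the base path plus a running index
    and a .wav extension."""
    return [base_path + str(i) + ".wav" for i, _ in enumerate(chunks)]

# A single unsplit string produces one file, named <base>0.wav:
names = chunk_filenames("c:\\test",
                        ["The quick brown fox jumped over the lazy dog."])
# names == ["c:\\test0.wav"]
```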

You can also make a slightly more complicated call to the StraightForwardRecord function by passing in the name of a voice:

hl2TTS.StraightForwardRecord("This is said by Microsoft Sam", "c:\\sam", "Microsoft Sam")

(You can see the list of installed voices, and set a default voice, within the Speech option in Control Panel.)

Finally, there is a split option available for particularly long text. It attempts to split the text at sentence or comma breaks in order to generate shorter WAVs, which helps if you are having problems with skipping or hitching when the audio is played through Source.

hl2TTS.StraightForwardRecord("This is the first sentence of a really really really really really really really long piece of text. "
                             "This is the second sentence of a really really really really really really long piece of text", "c:\\long", split=True)

This will generate the first sentence in a file called c:\long0.wav, and the second sentence in a separate file, c:\long1.wav.
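The exact splitting rules live inside hl2TTS.py, but the idea is: break at sentence ends first, falling back to comma breaks for oversized sentences, so each resulting chunk becomes its own WAV. A rough sketch of that idea (an illustration, not the module's actual code; the function name and length limit are assumptions):

```python
import re

def split_for_tts(text, max_len=200):
    """Illustrative approximation of hl2TTS's split option:
    cut long text at sentence boundaries (. ! ?), then at
    commas, so each chunk stays under max_len characters."""
    # Split on sentence-ending punctuation, keeping it attached to the sentence.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    for sentence in sentences:
        if len(sentence) <= max_len:
            chunks.append(sentence)
        else:
            # Fall back to comma breaks for a sentence that is still too long.
            chunks.extend(part.strip() for part in sentence.split(','))
    return [c for c in chunks if c]

# Two sentences become two chunks, hence two WAVs (<base>0.wav, <base>1.wav):
# split_for_tts("First sentence. Second sentence.")
#     -> ['First sentence.', 'Second sentence.']
```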

The Python code is commented and tries to be clear, but don't hesitate to email the author if you have any problems. The author would also like to hear about any projects or mods that use the code.


Although it isn't written up at all, there is also C++ code that duplicates much of the functionality of the Python version above. It's currently embedded in a large project and was abandoned in favour of the Python version, so at this point it's not worth writing up and posting to the Wiki. If this is something you're interested in, however, please contact ndnichols.

There also exists C++ code to produce lip-synched speech truly on the fly, which is then spoken externally to, but concurrently with, the HL2 engine. Essentially, there is a function in the codebase that takes a string to say, speaks the string through the TTS engine, and lip-syncs the actor's lips as the text is being spoken. With this technique, for example, you could have a cute multiplayer mod where any messages sent as text are actually pronounced on the other clients' machines. Unfortunately, again, this code is in the middle of a larger project and hasn't been touched in a while. Again, though, if you want to do something cool with this, please don't hesitate to contact ndnichols.
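The multiplayer idea above boils down to feeding each incoming chat line to a TTS speak call on a worker thread, so the engine keeps running while the voice plays. A hypothetical sketch of that relay in Python (the queue, thread, and `speak` callback are all illustrative assumptions; the actual code described above is C++ inside the engine):

```python
import queue
import threading

def chat_to_speech_relay(speak, stop_word="__quit__"):
    """Hypothetical sketch: push chat messages onto a queue and
    speak them on a background thread, concurrently with the game."""
    messages = queue.Queue()

    def worker():
        while True:
            text = messages.get()
            if text == stop_word:
                break
            speak(text)  # in practice, a TTS engine call (e.g. via SAPI)

    thread = threading.Thread(target=worker)
    thread.start()
    return messages, thread

# Usage with a stub in place of a real TTS engine:
spoken = []
inbox, t = chat_to_speech_relay(spoken.append)
inbox.put("gg everyone")
inbox.put("__quit__")
t.join()
# spoken is now ["gg everyone"]
```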


At this point, text to speech is not ready to actually replace voice actors. While the pronunciations of words can usually be tweaked to sound correct enough, the overwhelming flatness of the voices dooms them for any kind of production work on a major game. They do have some advantages over voice actors, though; for one, it's a way to quickly get dialog up and running in scripts. For development purposes, it's better to have flat voices now (with lip-synching for free) than professional voice actors at some point in the future. Furthermore, you can imagine some kinds of mods (maybe featuring a large computer or robotic presence) for which the stoic, flat voices may be appropriate. Finally, TTS technology is of course the only way to create WAVs automatically.


There are much better voices available than the stock voices that come with XP; they tend to not be too expensive, either. We've had particularly good luck with the NeoSpeech voices from NextUp.com, but as always, your mileage may vary.

Please let ndnichols know if you find any of this code useful.