To build an acoustic model to support a new language, LumenVox needs a significant amount of audio in that language. Because building acoustic models for speech recognition is a complex task, the collected audio must be of a specific type.

Please note that the requirements below describe everything needed for a complete acoustic model. Companies wishing to assist LumenVox in the audio collection process are welcome to contribute only part of the total.

We need a large number of speakers: at least 1,000 different speakers, split evenly between male and female. To ensure the greatest diversity of speaking styles, the speakers should represent a variety of ages, although the majority should be between 20 and 40 years old.
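As a rough illustration, a contributor could check a speaker list against these targets before submitting audio. The sketch below is only an assumption about how such a check might look: the Speaker record, the "male"/"female" labels, and the 5% balance tolerance are hypothetical and are not part of any LumenVox specification.

```python
from dataclasses import dataclass


@dataclass
class Speaker:
    # Hypothetical metadata record; field names are illustrative only.
    speaker_id: str
    gender: str  # assumed labels: "male" or "female"
    age: int


def check_speaker_pool(speakers, min_speakers=1000, balance_tolerance=0.05):
    """Return a list of problems found; an empty list means no problems."""
    problems = []
    # Count each speaker once, even if listed several times.
    unique = {s.speaker_id: s for s in speakers}
    total = len(unique)

    if total < min_speakers:
        problems.append(
            f"only {total} distinct speakers; at least {min_speakers} needed"
        )
    if total == 0:
        return problems

    male = sum(1 for s in unique.values() if s.gender == "male")
    female = sum(1 for s in unique.values() if s.gender == "female")
    # The tolerance is an assumed reading of "split evenly".
    if abs(male - female) / total > balance_tolerance:
        problems.append(f"gender split is uneven: {male} male, {female} female")

    # "Majority" is read here as strictly more than half.
    aged_20_to_40 = sum(1 for s in unique.values() if 20 <= s.age <= 40)
    if aged_20_to_40 * 2 <= total:
        problems.append("speakers aged 20 to 40 are not the majority")

    return problems
```

A partial contribution would naturally fail the speaker-count check; in that case the gender and age checks are still useful on their own.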

The audio must be from telephony applications. Audio collected from a microphone or other source is not acceptable for our purposes, as telephone audio has a distinct set of characteristics that are necessary for our acoustic models.

The audio content itself must also be diverse. Ideally, each speaker reads only one script, and reads it only once. A speaker repeating the same sentence multiple times does not help the model.

Captured utterances should include multiple phrases of the following types:

  • Several application words.
  • Yes/no utterances.
  • An utterance of several isolated digits.
  • An utterance of several connected digits (e.g. a credit card number or PIN).
  • Natural numbers.
  • Currency amounts.
  • Names.
  • Phonetically rich sentences.

Each utterance needs an accurate transcription that is associated with the audio file in some way. The method of correlating audio with transcripts can be simple, as long as there is an easy way for LumenVox to automatically match up audio files with their transcripts.
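One simple convention that allows automatic matching is to give each transcript the same base file name as its audio file, for example utt01.wav alongside utt01.txt. The sketch below shows how such pairs could be collected and how unmatched files could be flagged. The directory layout, the file extensions, and the function name are assumptions made for illustration, not a required LumenVox format.

```python
from pathlib import Path


def pair_audio_with_transcripts(root, audio_ext=".wav", transcript_ext=".txt"):
    """Pair audio files with transcripts that share the same path and base name.

    Returns (pairs, missing_transcripts, orphan_transcripts), where pairs maps
    each audio path to the text of its transcript.
    """
    root = Path(root)

    def index(ext):
        # Key each file by its path relative to root, with the extension removed.
        return {
            p.with_suffix("").relative_to(root): p
            for p in root.rglob(f"*{ext}")
        }

    audio = index(audio_ext)
    transcripts = index(transcript_ext)

    pairs = {
        audio[key]: transcripts[key].read_text(encoding="utf-8").strip()
        for key in audio.keys() & transcripts.keys()
    }
    # Audio with no transcript, and transcripts with no audio.
    missing_transcripts = sorted(audio[k] for k in audio.keys() - transcripts.keys())
    orphan_transcripts = sorted(
        transcripts[k] for k in transcripts.keys() - audio.keys()
    )
    return pairs, missing_transcripts, orphan_transcripts
```

Running a check like this before delivery makes it easy to confirm that every audio file has exactly one transcript. A single index file that lists audio file names next to their transcriptions would serve the same purpose.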

