Asterisk PHP Generic Speech Interface


This page describes the LumenVox-Asterisk AGI/PHP Speech Interface, a PHP library that exposes automatic speech recognition (ASR) and text-to-speech (TTS) functionality to users of Asterisk, The Open Source PBX.


Updates

October 19, 2016: The function get_agi_variables() was altered to correctly set the return variable to an array before processing.


Background

Asterisk has a number of ways to interact with speech servers (ASR/TTS), but the preferred mechanism involves using a community-built, open-source set of modules called UniMRCP for Asterisk. These modules provide a number of Asterisk dialplan applications that expose speech functionality over the Media Resource Control Protocol (MRCP). They include applications such as:

  • MRCPRecog to perform speech recognition.
  • MRCPSynth to perform speech synthesis.
  • SynthAndRecog to perform TTS and ASR at the same time.

As a contributor to the UniMRCP Project, LumenVox fully supports working with these modules, and the LumenVox Knowledge Base includes complete instructions on how to configure UniMRCP-Asterisk and LumenVox.

While these dialplan applications include lots of functionality, building speech applications entirely in dialplan is problematic for a few reasons, especially as the complexity of a speech application increases. ASR and TTS in dialplan work well as long as the questions being asked of callers are relatively simple. Asking if a caller said yes or no might require some dialplan that looks like:

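exten => s,n,MRCPRecog(builtin:grammar/boolean)
exten => s,n,GotoIf($[ "${RECOG_INSTANCE(0/0)}" = "true"]?Yes:No)
exten => s,n(Yes),MRCPSynth(You said yes)
; end the call here so the Yes branch does not fall through to No
exten => s,n,Hangup()
exten => s,n(No),MRCPSynth(You said no)
exten => s,n,Hangup()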

It is a simple matter of calling the MRCPRecog application, checking the RECOG_INSTANCE variable for the first answer (denoted as 0/0), and then acting on its contents. However, speech returns can be complex objects represented as XML strings. Imagine an application that returns a list of names matched against a database, complete with multiple results ("Did you want to talk to Joe Smith or Joe Jones?") and extra information such as the extensions that belong to each person. In this case, manipulating a complex return in dialplan can be very clunky, as you must iterate through the various results and possibly send them out to external applications to parse XML, which is not easily done in dialplan.

Our solution to this problem is the creation of this interface, which is designed to be used by scripts which are invoked over the Asterisk Gateway Interface (AGI). We have built several methods which perform common speech functions, including handling complex returns, and have endeavored to abstract some of the more idiosyncratic dialplan elements into mechanisms which will be familiar to any PHP developer.

Speech Interface Goals and Caveats

LumenVox's primary goal in building this speech interface was initially to provide some example applications that demonstrate complex speech concepts on Asterisk. In building those applications, it became clear that it was beneficial to abstract much of the common speech functionality into a few methods. In addition to being good development practice, it allows our individual sample applications to be much cleaner and easier to read.

It also has the added benefit of providing an open and (hopefully) easy-to-use interface for any Asterisk developer who is comfortable with PHP. We intend this interface to meet the following goals:

  • Abstract the trickier parts of speech recognition such as navigating an NLSML return so that new developers can focus on building applications quickly and efficiently, and not have to learn a bunch of standards first.
  • Provide intuitive ways in PHP to change speech settings, specify grammars, etc. so that PHP developers don't have to learn the quirks of the dialplan applications.
  • Be easy to read, understand, and change so that anyone who wishes to can extend the interface in order to add new functionality.

Users should be aware of the following caveats about the interface:

  • The interface has not been tested with any speech software besides LumenVox. In particular, because of how the different vendors have implemented the NLSML specification, it is likely that the parse_nlsml method will not behave as expected with other speech vendors.
  • The interface has only been tested against a few sample applications. As with any new software, it is likely there are bugs. Bug reports and patches are welcome.
  • We have not implemented every single piece of speech functionality that would be useful. For instance, a language like VoiceXML is a much more feature-complete and abstract interface for building speech applications. This interface is designed to cover 80-90% of the things we believe developers need the most, and is designed to be easy to use above all else.
  • The interface does not use the PHPAGI library, so it may duplicate some functionality present in that library.

Using the Speech Interface

The speech interface is designed to be called from PHP applications which are invoked using Asterisk's AGI command. We assume familiarity with basic Asterisk dialplan, AGI, and PHP development.

The speech_interface.php file should be included in your PHP/AGI scripts. Once you have done that, you will have access to the complete set of classes and methods described below.

The general pattern for performing ASR/TTS is the following:

  1. Create a new object from the speech class
  2. Use the various set and get methods in the class to set any parameters
  3. Call one of the three primary methods available in the class to perform ASR, TTS, or both
  4. Check the result
    • If you just performed TTS, the result is simply a string indicating success or failure
    • If you performed ASR, the result will be a speech_return object
  5. Repeat steps 2-4 as needed
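
The steps above might look roughly like the following inside an AGI script. This is a minimal sketch rather than code taken from the library: ask_for_city() and the fake_speech stand-in are hypothetical names, the prompt and grammar URI are placeholders, and step 2 (setting parameters) is left out. It relies only on the mrcp_synth_and_recog() and check_match() methods and the speech_return properties described later on this page.

```php
<?php
// Hypothetical helper (not part of the interface): prompts the caller and
// returns the top interpretation, or null if nothing usable was recognized.
// $speech is expected to behave like the speech class described on this page.
function ask_for_city($speech, $grammar_uri)
{
    // Step 3: perform TTS and ASR together.
    $result = $speech->mrcp_synth_and_recog('Which city are you calling about?', $grammar_uri);

    // Step 4: check the result.
    if ($result->recognition_status !== 'OK') {
        return null; // ERROR or INTERRUPTED
    }
    if (!$speech->check_match($result)) {
        return null; // no-input or no-match
    }
    return $result->answers[0]->interpretation[0];
}

// Stand-in used only so this sketch runs without Asterisk. In a real AGI
// script you would include speech_interface.php and create a speech object
// instead (step 1); how that object is constructed is not shown here.
class fake_speech
{
    public function mrcp_synth_and_recog($text, $grammar)
    {
        $answer = new stdClass();
        $answer->confidence = 82;
        $answer->grammar = $grammar;
        $answer->input = 'New York';
        $answer->interpretation = array('New York');

        $result = new stdClass();
        $result->recognition_status = 'OK';
        $result->recognition_result = '000';
        $result->answers = array($answer);
        return $result;
    }

    public function check_match($speech_return)
    {
        return $speech_return->recognition_result === '000'
            && count($speech_return->answers) > 0;
    }
}

$city = ask_for_city(new fake_speech(), 'http://myserver/grammars/cities.grxml');
```

Passing the speech object in as a parameter keeps the helper easy to exercise with a stand-in like the one above, and lets a script reuse one object across several prompts (step 5).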

speech class

The speech class exposes all of the methods needed to perform ASR and TTS. Creating a new speech object is the starting point for all speech applications.

mrcp_synth(text)

Performs synthesis on text.

It takes one required parameter:

  • text is a string containing the text to be synthesized. This may be plaintext, it may be an SSML document, or it may be the URI to an SSML document. We suggest using URIs whenever possible.

The mrcp_synth method returns a string indicating one of three results:

  • OK: synthesis performed as expected
  • ERROR: there was an error performing the synthesis
  • INTERRUPTED: the caller hung up while the synthesis was in progress
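
One way a script might branch on these three strings is sketched below. The helper name and the action labels it returns ('continue', 'fallback', 'stop') are invented for illustration and are not part of the interface.

```php
<?php
// Hypothetical helper: maps the string returned by mrcp_synth() to the
// action a script might take next.
function next_step_after_synth($status)
{
    switch ($status) {
        case 'OK':
            return 'continue';  // prompt played normally
        case 'INTERRUPTED':
            return 'stop';      // caller hung up; nothing more to play
        case 'ERROR':
        default:
            return 'fallback';  // synthesis failed; unknown values are treated the same way
    }
}

// Typical use (assuming $speech is a speech object):
//   $action = next_step_after_synth($speech->mrcp_synth('Welcome.'));
```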

mrcp_synth_and_recog(text, grammar)

Performs synthesis while doing ASR.

It takes two required parameters:

  • text is a string containing the text to be synthesized. This may be plaintext, it may be an SSML document, or it may be the URI to an SSML document. We suggest using URIs whenever possible.
  • grammar is a string containing the speech recognition grammar to use. The string may contain the contents of the grammar itself, or it may be the URI to a grammar document. We suggest using URIs whenever possible.

The mrcp_synth_and_recog method returns a speech_return object.

check_match(speech_return)

Checks a speech_return object to see if there was a match. Returns a Boolean indicating whether there was at least one match or not.

This method also calls the no_input() or no_match() methods in the event that there was a no-input or no-match result.

This is mainly a helper method that is not required, but it can help simplify a common task.

check_ambiguity(speech_return)

Checks a speech_return object to see if there is any ambiguity. Ambiguous speech results are not always possible, but when they occur they must be handled very carefully. Returns a Boolean indicating whether there is ambiguity or not.

This is a more advanced method that will probably not be used in many applications.

ask_yes_or_no(question)

Asks the user a yes or no question (synthesizing it using text supplied in the question parameter) and returns a Boolean indicating whether they said yes (true) or no (false).

no_input()

Plays out a synthesized message (by default, "I'm sorry, but I did not hear what you said").

no_match()

Plays out a synthesized message (by default, "I'm sorry, but I did not understand what you said").

Setting Parameters

There are a number of parameters that can be set on a per-interaction basis (defaults are controlled in configuration files). The speech class exposes a get and set method for each parameter. The parameters allowed are:

General Settings

The following setting applies to mrcp_synth(), mrcp_recog(), and mrcp_synth_and_recog().

  • mrcp_profile
    • The name of the profile to use in mrcp.conf
    • Affects mrcp_synth(), mrcp_recog(), and mrcp_synth_and_recog()

ASR Settings

The following settings only affect the mrcp_recog() and mrcp_synth_and_recog() methods.

  • confidence_threshold
    • The confidence threshold to be used for a recognition, set on a 0.0-1.0 scale. If the confidence for the recognition falls below this value, the recognizer will return a NO-MATCH instead of a normal result.
    • Default: 0.5
  • barge_in
    • Whether barge-in should be disabled (0), enabled (1), or performed by Asterisk (2).
    • Default: 1
    • Note: we strongly recommend using only 0 or 1.
  • sensitivity
    • The barge-in sensitivity, set on a scale of 0.0-1.0. The higher the value, the more sensitive the system will be, making it easier for callers to barge-in.
    • Default: 0.5
  • speed_vs_accuracy
    • Whether the ASR should favor speed or accuracy when performing a recognition, on a scale of 0.0-1.0. Higher numbers indicate the ASR should be faster but less accurate.
    • Default: 0.5
  • max_nbest
    • The maximum number of results that the recognizer should return. Increasing this value will provide matches besides the absolute best, which may be useful in situations where complex disambiguation is required.
    • Default: 1
  • recognition_timeout
    • Once barge-in is detected, the amount of time the caller has to speak, in milliseconds, before the recognizer returns a timeout event.
    • Default: 15000
  • no_input_timeout
    • The length of time the caller has to start speaking, in milliseconds, before the recognizer returns a no-input-timeout event.
    • Default:
  • speech_complete_timeout
    • The length of time the caller can pause after speaking before the recognizer begins processing the audio as a complete utterance. Set in milliseconds.
    • Default: 800
  • dtmf_interdigit_timeout
    • When entering DTMF, the amount of time, in milliseconds, the recognizer will wait between key presses before processing the result.
    • Default: 5000
  • dtmf_terminate_timeout
    • The total amount of time, in milliseconds, the user has to enter DTMF.
    • Default: 10000
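
Because these settings use different scales (0.0-1.0 values, small integers, and millisecond counts), a script may want to sanity-check a value before handing it to the matching set method. The validator below is a hypothetical sketch and not part of the interface; the ranges come from the descriptions above, and the requirement that timeouts be non-negative integers is an assumption.

```php
<?php
// Hypothetical validator for the ASR settings listed above.
function is_valid_asr_setting($name, $value)
{
    $unit_scale = array('confidence_threshold', 'sensitivity', 'speed_vs_accuracy');
    $milliseconds = array(
        'recognition_timeout',
        'no_input_timeout',
        'speech_complete_timeout',
        'dtmf_interdigit_timeout',
        'dtmf_terminate_timeout',
    );

    if (in_array($name, $unit_scale, true)) {
        return is_numeric($value) && $value >= 0.0 && $value <= 1.0;
    }
    if ($name === 'barge_in') {
        return in_array($value, array(0, 1, 2), true);
    }
    if ($name === 'max_nbest') {
        return is_int($value) && $value >= 1;
    }
    if (in_array($name, $milliseconds, true)) {
        return is_int($value) && $value >= 0; // assumption: whole, non-negative milliseconds
    }
    return false; // not an ASR setting named on this page
}
```

The integer checks are strict, so values read from AGI variables (which arrive as strings) would need to be cast first.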

TTS Settings

The following settings apply to mrcp_synth() and mrcp_synth_and_recog().

  • synth_language
    • The language to use for synthesis. Can be overridden by SSML documents. Set using the standard language-COUNTRY format, e.g. en-US or es-MX.
    • Default: en-US
  • voice_name
    • The voice name to use for synthesis. Can be overridden by SSML documents.
    • Default: (null; LumenVox will auto-select a voice)
  • voice_gender
    • The gender to use for synthesis, either "male" or "female." Can be overridden by SSML documents.
    • Default: (null; LumenVox will auto-select the gender)
  • prosody_volume
    • The volume of the synthesis. Valid settings are:
      • silent
      • x-soft
      • soft
      • medium
      • loud
      • x-loud
      • default
    • Default: medium
  • prosody_rate
    • How quickly the synthesizer should speak. Valid settings are:
      • x-slow
      • slow
      • medium
      • fast
      • x-fast
      • default
    • Default: medium
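
Several of the TTS settings accept only a fixed set of keywords, which makes them easy to check with a lookup table. The sketch below is hypothetical and not part of the interface; the keyword lists are copied from the settings above, and the regular expression for synth_language is a simplified assumption about the language-COUNTRY format.

```php
<?php
// Hypothetical validator for the TTS settings listed above.
function is_valid_tts_setting($name, $value)
{
    $allowed = array(
        'voice_gender'   => array('male', 'female'),
        'prosody_volume' => array('silent', 'x-soft', 'soft', 'medium', 'loud', 'x-loud', 'default'),
        'prosody_rate'   => array('x-slow', 'slow', 'medium', 'fast', 'x-fast', 'default'),
    );

    if (isset($allowed[$name])) {
        return in_array($value, $allowed[$name], true);
    }
    if ($name === 'synth_language') {
        // Simplified check for language-COUNTRY, e.g. en-US or es-MX.
        return is_string($value) && preg_match('/^[a-z]{2}-[A-Z]{2}$/', $value) === 1;
    }
    if ($name === 'voice_name') {
        return is_string($value) && $value !== ''; // any non-empty name is accepted here
    }
    return false; // not a TTS setting named on this page
}
```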

speech_return

The speech_return is an object representing the return from an ASR interaction. It provides no methods, but has several properties that can be examined to get the status of an ASR request, the result, the answer(s), etc.

recognition_status

A string indicating whether the recognition worked (OK), failed (ERROR), or that the caller hung up during the interaction (INTERRUPTED).

recognition_result

A string indicating whether there was input from the caller (000), a no-match (001), a no-input (002), or a timeout (003).

This value is what is checked by speech->check_match().
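
For logging or custom handling, the four codes can be turned into readable labels. This is a hypothetical helper, not part of the interface, and the label text is invented for illustration; only the codes themselves come from the description above.

```php
<?php
// Hypothetical helper: turns a recognition_result code into a readable label.
function describe_recognition_result($code)
{
    $labels = array(
        '000' => 'match',     // there was input from the caller
        '001' => 'no-match',
        '002' => 'no-input',
        '003' => 'timeout',
    );
    return isset($labels[$code]) ? $labels[$code] : 'unknown';
}

$label = describe_recognition_result('002');
```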

mode

A string indicating whether the user responded by speaking (voice) or keypress (dtmf).

nlsml_string

A string containing the complete NLSML-formatted XML returned by the recognizer over MRCP. In general, users should not parse this. Instead, the answers array contains a pre-parsed version of this XML. It is provided here in case a developer needs it for some other purpose.

answers

An array of speech_answer objects.

The answers property of a speech_return object contains the actual results of the speech interaction. Assuming there was a valid recognition to parse, each element in the answers array is a speech_answer object with the following properties:

  • confidence
    • An integer representing the confidence score on the 0-100 scale.
  • grammar
    • A string containing the identifier of the grammar that was matched.
  • input
    • A string containing the actual text that was recognized.
  • interpretation
    • An array containing one or more semantic interpretations for the input. Each semantic interpretation is a separate element of the interpretation array.
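
Note that the confidence property is reported on a 0-100 scale, while the confidence_threshold setting uses 0.0-1.0. If an application applies its own second, stricter check (for example, to decide whether to confirm a result with the caller), the two scales need to be reconciled. The helper below is a hypothetical sketch of that comparison, not part of the interface.

```php
<?php
// Hypothetical helper: compares an answer's 0-100 confidence score with a
// threshold expressed on the 0.0-1.0 scale used by confidence_threshold.
function meets_threshold($confidence, $threshold)
{
    // round() guards against floating-point artifacts when scaling,
    // e.g. 0.29 * 100 evaluating to slightly less than 29.
    return $confidence >= (int) round($threshold * 100);
}
```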

Ambiguity

Developers without much familiarity with speech recognition may find this a little confusing. In particular, the difference between the answers array and the interpretation array is not always clear at first, but it is a necessary distinction that makes understanding ambiguous results easier.

In general, each element in answers corresponds to one n-best result from the recognizer. Since the default n-best value is 1, in most cases answers will be an array with only one element. Each element in the interpretation array corresponds to a possible semantic meaning for the answer. In most cases, there is only 1 semantic meaning for a given answer, so the interpretation array will only contain one element.

For instance, assume a caller says "Yeah" using the built-in Boolean grammar. If n-best is set to its default of 1, then only 1 answer will be returned. The built-in Boolean grammar does not allow for multiple interpretations, so the only interpretation will be "true." This would be represented like this:

answers
  [0]
    confidence : 89
    grammar : "builtin:grammar/boolean"
    input : "Yeah"
    interpretation
      [0] "true"

This can be checked easily in PHP:

// $speech_return is the object returned by mrcp_recog() or mrcp_synth_and_recog()
if ($speech_return->answers[0]->interpretation[0] == "true") {
	echo "You said yes.";
} else {
	echo "You said no.";
}

For the vast majority of speech interactions, it is relatively safe to assume that there will be only one answer and only one interpretation (in fact, it is good design to aim for this whenever possible).

However, some results will be ambiguous. Imagine a directory with two different people named "John." If the user just says "John," there would be 1 answer with 2 interpretations:

answers
  [0]
    confidence : 73
    grammar : "http://myserver/grammars/directory.grxml"
    input : "John"
    interpretation
      [0] "John Smith"
      [1] "John Jones"

In this case, it would be important for the application to check the length of the interpretation array (e.g. using sizeof()) and behave intelligently.
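
A minimal sketch of such a length check follows. The speech_return here is built by hand to mirror the dump above so the example runs without Asterisk, and has_multiple_interpretations() is a hypothetical helper; the speech class's own check_ambiguity() method serves a similar purpose.

```php
<?php
// Stand-in for the speech_return shown above. A real object would come
// from mrcp_recog() or mrcp_synth_and_recog().
$answer = new stdClass();
$answer->confidence = 73;
$answer->grammar = 'http://myserver/grammars/directory.grxml';
$answer->input = 'John';
$answer->interpretation = array('John Smith', 'John Jones');

$speech_return = new stdClass();
$speech_return->answers = array($answer);

// Hypothetical helper: true when an answer has more than one interpretation.
function has_multiple_interpretations($answer)
{
    return count($answer->interpretation) > 1;
}

$first = $speech_return->answers[0];
if (has_multiple_interpretations($first)) {
    // Build a follow-up question that lists every candidate.
    $prompt = 'Did you mean ' . implode(' or ', $first->interpretation) . '?';
} else {
    $prompt = 'Connecting you to ' . $first->interpretation[0] . '.';
}
```

The resulting $prompt could then be passed to mrcp_synth_and_recog() along with a grammar that accepts the listed names.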

It's also possible to have multiple n-best matches in cases where you expect a lot of phonetic similarities:

answers
  [0]
    confidence : 82
    grammar : "http://myserver/grammars/cities.grxml"
    input : "New York"
    interpretation
      [0] "New York"
  [1]
    confidence : 51
    grammar : "http://myserver/grammars/cities.grxml"
    input : "Newark"
    interpretation
      [0] "Newark"

In this case, there are two separate answers, each with 1 interpretation.

And of course it's possible to get returns with multiple answers and multiple interpretations:

answers
  [0]
    confidence : 73
    grammar : "http://myserver/grammars/directory.grxml"
    input : "John"
    interpretation
      [0] "John Smith"
      [1] "John Jones"
  [1]
    confidence : 40
    grammar : "http://myserver/grammars/directory.grxml"
    input : "Juan"
    interpretation
      [0] "Juan Rodriguez"

Dealing with ambiguity is a complicated topic, and it is addressed to some extent in our example PHP Asterisk speech applications. The important things to remember are:

  • Most applications do not need to worry about this, and only need to check answers[0]->interpretation[0] to get the result.
  • If your application/grammar is returning ambiguous results, make sure that it is something you expect. Many new speech developers unintentionally build grammars that allow for multiple parses. Use a tool like the LumenVox Speech Tuner to test your grammars and ensure that you are not accidentally allowing this.
  • If you do intend to allow ambiguous results, make sure you have a strategy for dealing with them. Common approaches to disambiguation include iterating through the answers/interpretation arrays with a foreach statement, or passing the object to an external process to disambiguate.
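
Iterating with foreach, as mentioned in the last point, might be sketched as follows. Both helpers are hypothetical, and the sample data mirrors the final dump above (two answers, three interpretations in total).

```php
<?php
// Hypothetical helper used only to build sample data for this sketch.
function make_answer($confidence, $input, array $interpretations)
{
    $answer = new stdClass();
    $answer->confidence = $confidence;
    $answer->grammar = 'http://myserver/grammars/directory.grxml';
    $answer->input = $input;
    $answer->interpretation = $interpretations;
    return $answer;
}

// Hypothetical helper: flattens every answer/interpretation pair into one
// list, keeping the order returned by the recognizer.
function list_candidates($speech_return)
{
    $candidates = array();
    foreach ($speech_return->answers as $answer) {
        foreach ($answer->interpretation as $interpretation) {
            $candidates[] = array(
                'confidence'     => $answer->confidence,
                'input'          => $answer->input,
                'interpretation' => $interpretation,
            );
        }
    }
    return $candidates;
}

$speech_return = new stdClass();
$speech_return->answers = array(
    make_answer(73, 'John', array('John Smith', 'John Jones')),
    make_answer(40, 'Juan', array('Juan Rodriguez')),
);

$candidates = list_candidates($speech_return);
```

A flat list like this is convenient for building a confirmation prompt or for handing off to an external disambiguation step.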

Multi-Slot Returns

In addition to ambiguity, it's possible for an interpretation to be what is called a "multi-slot" return. In the examples above, the return from the recognizer was always a string like "John Smith" or "New York." There could be multiple interpretations, but in any case it was always a string. Those are single-slot returns.

A multi-slot return includes more than one piece of information. Imagine if we wanted to expand our directory application to return a person's name and extension. We might now get a result like:

answers
  [0]
    confidence : 73
    grammar : "http://myserver/grammars/directory.grxml"
    input : "John"
    interpretation
      [0]
        full_name : "John Smith"
        extension : "100"
      [1]
        full_name : "John Jones"
        extension : "101"

In this case, we see that an interpretation element may itself be an object, with properties containing the relevant information. This is handled automatically as part of the speech class's parsing of NLSML. Ultimately, these complex returns are controlled by the grammar (see the LumenVox Knowledge Base's SISR tutorial), and most grammars will return simple strings. But if you do write a grammar that produces a multi-slot return, you must account for it in the application code that checks the return (i.e. you cannot always assume that interpretation[n] is a string).
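
One way to avoid assuming a string is to normalize every interpretation into an array of slots before using it. The helper below is a hypothetical sketch, not part of the interface; the 'value' key used for single-slot returns is made up for illustration.

```php
<?php
// Hypothetical helper: normalizes an interpretation into an array of slots.
// A single-slot string is stored under the made-up key 'value'.
function interpretation_slots($interpretation)
{
    if (is_object($interpretation)) {
        return get_object_vars($interpretation); // multi-slot: one entry per property
    }
    return array('value' => (string) $interpretation);
}

// Sample data mirroring interpretation[0] in the dump above.
$multi = new stdClass();
$multi->full_name = 'John Smith';
$multi->extension = '100';

$slots = interpretation_slots($multi);
$single = interpretation_slots('New York');
```

With this in place, application code can look up $slots['extension'] when it exists and fall back to $single['value'] for simple grammars.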
