Asterisk PHP Store Locator


This document describes a "Store Locator" application for Asterisk that uses automatic speech recognition (ASR) and text-to-speech (TTS) to ask a caller for a city and state (using a list of cities in the United States), simulating a transaction common to many store locator IVR applications. It makes use of our Asterisk AGI PHP speech_interface. If you would like a simpler example first, you may also be interested in our Asterisk PHP Hello World application.

The application is intended as an example of how to perform a difficult speech transaction in Asterisk. The city/state question is a speech interaction with a high degree of ambiguity, meaning callers often respond with utterances that could have more than one meaning. For instance, a caller may respond by saying "Paris," which is unclear because there are several cities named Paris in the USA. This application attempts to disambiguate by asking follow-up questions such as "Did you mean Paris, Texas?"

The application should provide a good starting point for any Asterisk developer looking to build similar applications. Some familiarity is assumed with the PHP language, but we have tried to develop the application in such a way that it can be ported to other languages fairly easily.

Updates

October 19, 2016: The downloadable code was altered to eliminate two bugs. The grammar had the incorrect SISR specification. The code was calling the function get_agi_variables incorrectly.

Prerequisites

The application is largely an Asterisk Gateway Interface (AGI) script written in PHP. It has been tested on Asterisk 11, but should work on any version of Asterisk that is supported by the UniMRCP for Asterisk module.

Only a small amount of software is required:

  • Asterisk
  • UniMRCP for Asterisk dialplan applications built and installed
  • PHP
  • LumenVox ASR
  • LumenVox TTS
  • The LumenVox-Asterisk AGI/PHP Speech Interface
  • A web server to store grammars and SSML documents is recommended, but not required

Installation

Put the store_locator.php file into your Asterisk AGI directory (/var/lib/asterisk/agi-bin/ by default). The speech_interface.php file should also be in this directory. Be sure that you have made it executable by the Asterisk user (chmod 755 store_locator.php).

You will also need to place three files in a location that the LumenVox Media Server can access. They are:

  • say-cityState.ssml
  • citystate.gram
  • state.gram

Ideally, these will be hosted by a web server, as that is our recommended way of serving grammar and SSML documents to LumenVox. However, they may be placed in a directory on the file system that the Media Server can read.

Configuration

Dialplan

You must invoke the AGI application by editing your Asterisk dialplan. A simple context to do so would look like:

[storelocator]
exten => s,1,Answer
exten => s,n,Wait(1)
exten => s,n,AGI(store_locator.php)
exten => s,n,Verbose(1, city: ${CITY}, state: ${STATE})
exten => s,n,Hangup

When the AGI application has finished executing, it returns two channel variables called CITY and STATE. The above dialplan will simply log them to the Asterisk console.

PHP

The store_locator.php file includes two variables near the top that must be configured.

$base_grammar_uri="http://myserver/storelocator/";
$base_ssml_uri="http://myserver/storelocator/";

These should be modified to be valid URIs (http or file) that correspond to the location where the SSML and grammar files were placed.
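
A common slip when editing these values is leaving off the trailing slash, which would produce a URI such as http://myserver/storelocatorcitystate.gram. The sketch below shows one way to guard against that. The helper build_resource_uri is hypothetical (it is not part of store_locator.php or the speech interface), the host name is a placeholder, and the code assumes PHP 7 or later.

```php
<?php
// Hypothetical helper, not part of store_locator.php: joins a base URI and
// a file name, adding the separating slash only when it is missing.
function build_resource_uri(string $base_uri, string $file_name): string
{
    return rtrim($base_uri, '/') . '/' . ltrim($file_name, '/');
}

// Placeholder host and directory; substitute your own server or file path.
$base_grammar_uri = 'http://myserver/storelocator';
$city_state_grammar = build_resource_uri($base_grammar_uri, 'citystate.gram');

echo $city_state_grammar, PHP_EOL;
```

With a helper like this, the base URI works whether or not it ends in a slash.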

Use

Using the application is simple: dial the extension you have configured in the dialplan. You should hear the TTS ask you to "Please say the city and state" and then you can say a city/state pair in the United States (only cities with populations above 5,000 people are allowed).

You can just say a city, in which case you will be asked to clarify (or disambiguate) which location you wanted by saying the state or, depending on how many matches there were, picking the city/state pair from a list.

Once the application has a clear match, it will speak out the recognized city/state pair and exit. It returns the city and state to the Asterisk dialplan as channel variables.

Explanation

store_locator.php

The application consists of one PHP file, store_locator.php. The code is commented for those who wish to just dive in, but provided below is a more detailed explanation of what each segment does.

Initial Setup

#!/usr/bin/php
<?php

include 'speech_interface.php'; 

$agivars = array();
$agivars = get_agi_variables();

The above segment is simply the start of the file, in which we declare that it is a PHP script and include the required speech_interface. We call a function in the speech interface to get the AGI variables that Asterisk passes in and put them in a new array called $agivars.
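
For readers unfamiliar with AGI: when a script starts, Asterisk writes a block of "agi_name: value" lines to the script's standard input and ends the block with a blank line. The sketch below shows how such a block can be parsed. It is an illustration only, not the implementation of get_agi_variables; the function name parse_agi_variables is hypothetical, and an in-memory stream stands in for standard input so the example runs outside Asterisk.

```php
<?php
// Hypothetical illustration of reading the AGI variable block.
// This is not the speech interface's get_agi_variables() implementation.
function parse_agi_variables($stream): array
{
    $vars = [];
    while (($line = fgets($stream)) !== false) {
        $line = trim($line);
        if ($line === '') {
            break; // A blank line ends the variable block.
        }
        // Split on the first colon only, since values may contain colons.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $vars[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $vars;
}

// Simulate Asterisk's input with an in-memory stream.
$stream = fopen('php://memory', 'r+');
fwrite($stream, "agi_request: store_locator.php\nagi_extension: s\n\n");
rewind($stream);
$parsed = parse_agi_variables($stream);
fclose($stream);

print_r($parsed);
```

In a live AGI script the same function would be given STDIN instead of the in-memory stream.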

We next create a simple class called desired_location that will store our ultimate answer, and create an instance of it called $recognized_location:

class desired_location {
	public $city = "";
	public $state = "";
}

$recognized_location = new desired_location;

We set up the location of our grammar and SSML files:

$base_grammar_uri="http://10.22.0.63/storelocator/";
$base_ssml_uri="http://10.22.0.63/storelocator/";
$ask_city_state_ssml = $base_ssml_uri . "say-cityState.ssml";
$city_state_grammar = $base_grammar_uri . "citystate.gram";
$state_grammar = $base_grammar_uri . "state.gram";

We then create an instance of the speech class called $sp and use the methods provided by the class to set our default confidence threshold, n-best value, and some timeouts:

$sp = new speech(); // Initialize the speech object.

$sp->set_confidence_threshold(0.3); // Turn down confidence threshold
$sp->set_max_nbest(2); // We also want to set n-best to get up to 2 matches
$sp->set_no_input_timeout(2000);
$sp->set_recognition_timeout(5000);
$sp->set_speech_complete_timeout(800);

Raising the n-best value is useful because it allows us to get multiple matches when cities sound similar. We also turn down the confidence threshold to 0.3 (from a default of 0.5), since confidence scores tend to run a little low for these sorts of returns.
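
To make the effect of these two settings concrete, the sketch below applies a confidence threshold and an n-best limit to a made-up list of candidate results. The recognizer applies these settings itself, so nothing like this is needed in the application; the function filter_nbest, the data shape, and the scores are all hypothetical. The code assumes PHP 7 or later.

```php
<?php
// Hypothetical illustration only: the ASR engine does this internally.
// Each candidate is ['text' => string, 'confidence' => float between 0 and 1].
function filter_nbest(array $candidates, float $threshold, int $max_nbest): array
{
    // Drop candidates that score below the confidence threshold.
    $kept = array_filter($candidates, function ($candidate) use ($threshold) {
        return $candidate['confidence'] >= $threshold;
    });
    // Order the survivors from most to least confident.
    usort($kept, function ($a, $b) {
        return $b['confidence'] <=> $a['confidence'];
    });
    // Keep at most $max_nbest results.
    return array_slice($kept, 0, $max_nbest);
}

// Made-up scores for similar-sounding city names.
$candidates = [
    ['text' => 'Harris', 'confidence' => 0.18],
    ['text' => 'Paris',  'confidence' => 0.62],
    ['text' => 'Perris', 'confidence' => 0.41],
];

print_r(filter_nbest($candidates, 0.3, 2));
```

With the threshold at 0.3 and n-best at 2, both "Paris" and "Perris" survive; at the default threshold of 0.5 only "Paris" would.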

Ask City/State Loop

The bulk of the application is the following loop:

$city_state_done = false;
while (!$city_state_done){

	$city_state_answer = $sp->mrcp_synth_and_recog($ask_city_state_ssml, $city_state_grammar);
	if ($city_state_answer->recognition_status!=speech_return::status_successful){
		// In a real application, this is where you might transfer to an operator or revert to DTMF.
		log_agi('There was an error with speech recognition and/or TTS.');
		exit();
	}
	// Did we get at least one match?
	if (!$sp->check_match($city_state_answer)) {
		$city_state_done = false;
	}	
	else {
		// Was there ambiguity?
		if ($sp->check_ambiguity($city_state_answer)){
			// Attempt to resolve the ambiguity; if we do we're done, otherwise we're repeating the loop
			if(resolve_city_state_ambiguity($city_state_answer)) {
				$city_state_done = true;
			}
			else {
				$city_state_done = false;
			}
		}
		// 1 match, no ambiguity means our work here is done.
		else {
			$recognized_location->city = trim(strval($city_state_answer->answers[0]->interpretation[0]->city));
			$recognized_location->state = trim(strval($city_state_answer->answers[0]->interpretation[0]->state));
			$city_state_done = true;
		}
	}
}

There's a lot happening here, so let's go through the functionality one piece at a time.

  1. The caller is asked to speak a city and state using the mrcp_synth_and_recog() method, which performs TTS and ASR in a single call. The response is saved in an object called $city_state_answer.
  2. We check to see if the recognition_status on our $city_state_answer object is successful. If it is not, this indicates an error with ASR or TTS (e.g. no licenses, no ASR/TTS servers available) and must be handled. In this stub application we simply write that out to the Asterisk console and exit, but in a real application this is where we might transfer the caller to an operator or to a DTMF-only application.
  3. We use the check_match() method of the speech class to see if we got at least one match. This method automatically plays the no-match or no-input prompts (e.g. "I did not understand you") if there was no match. If there was no match, we'll repeat the loop at step #1.
  4. Assuming there was a match, we use check_ambiguity() to see whether the response was ambiguous. If it returns true (meaning there was ambiguity), we call the resolve_city_state_ambiguity() function (defined later on) to ask the caller to disambiguate. If the result of that function is true, then we have our answer and end our loop by setting $city_state_done to true; otherwise we'll repeat the loop.
  5. If there was no ambiguity, there's nothing to do except set the city and state properties of $recognized_location to the first answer and interpretation we got from the recognizer. We want to make sure we evaluate the answers as strings, and we trim them as some of the city names in the grammar have extraneous spaces in them. Then we set $city_state_done to true to exit the loop.
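
Note that the loop as written repeats until a location is resolved, with no limit on the number of attempts. A real application would probably cap the retries before falling back to an operator or DTMF. The sketch below shows one way to structure that. It is a hypothetical pattern, not part of store_locator.php: ask_with_retry_limit and the stand-in callback are invented for illustration, and the callback takes the place of one pass through steps 1 to 5.

```php
<?php
// Hypothetical retry wrapper. $ask_once stands in for one pass through the
// loop body and returns a location array, or null when nothing was resolved.
function ask_with_retry_limit(callable $ask_once, int $max_attempts)
{
    for ($attempt = 1; $attempt <= $max_attempts; $attempt++) {
        $location = $ask_once($attempt);
        if ($location !== null) {
            return $location; // Resolved: stop asking.
        }
    }
    return null; // Out of attempts: the caller should fall back.
}

// Stand-in for a real pass: fails on the first attempt, succeeds on the second.
$fake_pass = function (int $attempt) {
    return $attempt < 2 ? null : ['city' => 'Paris', 'state' => 'Texas'];
};

$location = ask_with_retry_limit($fake_pass, 3);
if ($location === null) {
    echo "Giving up; transfer to an operator or DTMF here.", PHP_EOL;
} else {
    echo $location['city'], ', ', $location['state'], PHP_EOL;
}
```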

Return

After the loop, we simply output some information and exit:

$tts_text = "I recognized : " . $recognized_location->city . " " . $recognized_location->state;
$sp->mrcp_synth($tts_text);

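// Return the city and state to the dialplan as channel variables.
// Escape spaces so multi-word values such as "New York" are passed as a single AGI argument.
$recognized_location->city = str_replace(' ', '\ ', $recognized_location->city);
$recognized_location->state = str_replace(' ', '\ ', $recognized_location->state);
execute_agi('SET VARIABLE CITY ' . $recognized_location->city);
execute_agi('SET VARIABLE STATE ' . $recognized_location->state);
exit();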

We'll play out the recognition to the caller using mrcp_synth(), then use the SET VARIABLE command in the AGI to set new channel variables called CITY and STATE that the dialplan can use as needed.
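
The space escaping matters because AGI splits a command line into arguments on whitespace, so a multi-word value must be escaped (or quoted) to arrive as a single argument. The sketch below isolates that step in a small function so the resulting command string can be inspected. The function format_set_variable is hypothetical and only builds the string; it does not send anything to Asterisk.

```php
<?php
// Hypothetical helper: builds the AGI command string that the script would
// pass to execute_agi(). It only formats text and talks to nothing.
function format_set_variable(string $name, string $value): string
{
    // Backslash-escape spaces so the value stays a single AGI argument.
    $escaped = str_replace(' ', '\ ', $value);
    return 'SET VARIABLE ' . $name . ' ' . $escaped;
}

echo format_set_variable('CITY', 'San Diego'), PHP_EOL;  // SET VARIABLE CITY San\ Diego
echo format_set_variable('STATE', 'California'), PHP_EOL; // SET VARIABLE STATE California
```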

Resolving Ambiguity

The final part of the application is the resolve_city_state_ambiguity() function. It takes an object of the speech_result type as its input (i.e. the return from mrcp_synth_and_recog or mrcp_recog) and attempts to get the caller to clarify which city/state pair they wanted.

function resolve_city_state_ambiguity ($speech_result) {
	global $recognized_location;
	global $sp;	
	global $state_grammar;
	$ambiguity_resolved = false;

We'll be reading and writing the global $recognized_location and $sp objects plus $state_grammar, so we declare them as global variables. Then we set up a new Boolean called $ambiguity_resolved.

	if (sizeof($speech_result->answers) > 3) {
		$tts_text = "You will need to be more specific.";
		$sp->mrcp_synth($tts_text);
		$ambiguity_resolved = false;
	}

In the event that there are more than three n-best answers (which shouldn't happen unless the n-best setting is changed), we don't even want to try to disambiguate. In that case we skip the disambiguation steps, return false from the function, and let the main loop ask the caller to repeat themselves.

	else
		// The foreach below is the entire else branch, so it is skipped when there are too many answers.
		foreach ($speech_result->answers as $answers_array){
			if (sizeof($answers_array->interpretation) > 1){
				$tts_text = "I have multiple matches for that city. Please say the state.";
				$state_answer = $sp->mrcp_synth_and_recog($tts_text, $state_grammar);
				$recognized_state = strval($state_answer->answers[0]->interpretation[0]);
				foreach ($answers_array->interpretation as $interpretation_array){
					if($recognized_state == strval($interpretation_array->state)){
						$recognized_location->city = trim(strval($interpretation_array->city));
						$recognized_location->state = trim(strval($interpretation_array->state));
						$ambiguity_resolved = true;
						break 2; // Leave both foreach loops.
					}
				}
				$sp->mrcp_synth('I am sorry, but I do not recognize that combination of city and state.');
			}

If there were 3 or fewer answers, however, we're going to iterate through the results using a pair of foreach statements. The first foreach looks at the answers and the second will examine the interpretations (please see the speech_interface documentation for an explanation of the difference).

Using sizeof($answers_array->interpretation) lets us check if a given answer has multiple interpretations; if it does, that implies the caller only spoke a city. We'll resolve that ambiguity by using the state grammar and asking the caller to speak the name of the state they're looking for. We store their response to that prompt as $state_answer, and we check to see if the recognized state matches any of the states in the interpretations array for the current answer.

If we get a match, then we have our city/state pair. We set the $recognized_location properties to what we got from the current answer/interpretation, note that $ambiguity_resolved is now true, and break out of both foreach loops.
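
The matching step can be viewed in isolation as a search through the interpretations for the state the caller just spoke. The sketch below uses plain arrays in place of the speech interface's result objects, so the data shape, the function find_interpretation_by_state, and the sample entries are all hypothetical. It assumes PHP 7.1 or later for the nullable return type.

```php
<?php
// Hypothetical stand-in for the inner foreach: plain arrays replace the
// result objects returned by the speech interface.
function find_interpretation_by_state(array $interpretations, string $state): ?array
{
    foreach ($interpretations as $interpretation) {
        // Trim both sides, since grammar values may carry stray spaces.
        if (trim($interpretation['state']) === trim($state)) {
            return $interpretation;
        }
    }
    return null; // No interpretation is in the spoken state.
}

// Made-up interpretations for a caller who said only "Paris".
$interpretations = [
    ['city' => 'Paris ', 'state' => 'Texas'],
    ['city' => 'Paris ', 'state' => 'Tennessee'],
];

$match = find_interpretation_by_state($interpretations, 'Tennessee');
var_dump($match);
var_dump(find_interpretation_by_state($interpretations, 'Ohio'));
```

A null result corresponds to the case where the application plays "I do not recognize that combination of city and state."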

			else {
				$recognized_location->city = trim(strval($answers_array->interpretation[0]->city));
				$recognized_location->state = trim(strval($answers_array->interpretation[0]->state));
				$tts_text = "Did you want " . $recognized_location->city . " " . $recognized_location->state;
				if($sp->ask_yes_or_no($tts_text)){
					$ambiguity_resolved = true;
					break;
				}
			}
		}
	if (!$ambiguity_resolved){
		$sp->mrcp_synth("Let us start again.");
	}
	return $ambiguity_resolved;
}

The else clause above gets triggered if there were not multiple interpretations for an answer. In that case, we just want to confirm the city/state with the caller. We use the ask_yes_or_no method in the speech class to easily ask if they want the city/state we recognized, and if they answer yes we can break out of the foreach loop and return.

Otherwise, we will continue looping until we are out of answers to look at. If that happens, then $ambiguity_resolved is still set to false, we will play our error message, and restart the main Ask City/State Loop.

Grammars and SSML

There are two included grammar files and one SSML file that are used in this application. For more information on writing grammars and SSML documents, please refer to the LumenVox documentation.

citystate.gram

The primary grammar file is large but simple. It consists of one rule with several thousand lines. Here is the start and end of the file:

#ABNF 1.0;
language en-US;
mode voice;
root $CityState;
tag-format <semantics/1.0>;

$CityState = (
(/4512/((Alamitos ) [(California )]){ out.city='Alamitos'; out.state='California';}) | 
(/4512/((Alton ) [(Ohio )]){ out.city='Alton'; out.state='Ohio';}) | 
(/4512/((Arlington  Alexandria ) [(Virginia )]){ out.city='Arlington  Alexandria'; out.state='Virginia';}) | 
(/4512/((Bowie | Glen Dale ) [(Maryland )]){ out.city='Bowie  Glen Dale'; out.state='Maryland';}) | 
(/4512/((Crystal Lake  ) [(Florida )]){ out.city='Crystal Lake '; out.state='Florida';}) | 
(/4512/((Keys ) [(Florida )]){ out.city='Keys'; out.state='Florida';}) | 
(/4258/((Newton Falls  ) [(Ohio )]){ out.city='Newton Falls '; out.state='Ohio';}) |

[. . .]

(/6852/((San Jose  ) [(California )]){ out.city='San Jose '; out.state='California';}) | 
(/6882/((Detroit  ) [(Michigan )]){ out.city='Detroit '; out.state='Michigan';}) | 
(/6975/((San Antonio  ) [(Texas )]){ out.city='San Antonio '; out.state='Texas';}) | 
(/6994/((Dallas  ) [(Texas )]){ out.city='Dallas '; out.state='Texas';}) | 
(/7008/((San Diego  ) [(California )]){ out.city='San Diego '; out.state='California';}) | 
(/7046/((Phoenix  ) [(Arizona )]){ out.city='Phoenix '; out.state='Arizona';}) | 
(/7116/((Philadelphia  ) [(Pennsylvania )]){ out.city='Philadelphia '; out.state='Pennsylvania';}) | 
(/7242/((Houston  ) [(Texas )]){ out.city='Houston '; out.state='Texas';}) | 
(/7439/((Chicago  ) [(Illinois )]){ out.city='Chicago '; out.state='Illinois';}) | 
(/7561/((Los Angeles   | L A) [(California )]){ out.city='Los Angeles '; out.state='California';}) | 
(/7947/((New York   | New York City) [(New York )]){ out.city='New York '; out.state='New York';}) );

There are a few interesting things to point out about the grammar:

  1. Saying a state is always optional, but any city always returns both a city and state. This is done using SISR tags to set the return to an object with properties called city and state. Thus if somebody says something like "San Diego," the grammar will return a city of "San Diego" and a state of "California."
  2. City names that occur in multiple states, e.g. "Paris," will return multiple interpretations. This is why the resolve_city_state_ambiguity() function is written the way that it is, i.e. multiple interpretations for a given answer cause the system to ask for the state.
  3. Each alternative in the list is weighted roughly by population, so that the most populous city (New York) is the most likely to be returned. The exact numbers for the weights are the result of extensive tuning, so while they are proportional to population they don't match up exactly. The weights are scaled a bit to prevent the most populous cities from dominating the results.
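
A grammar with several thousand alternatives is normally generated rather than typed by hand. The sketch below shows how the alternatives could be produced from a list of cities, following the pattern of the excerpt above. It is a guess at a workable approach, not the tool that built citystate.gram: both functions are hypothetical, the weights are taken as given rather than computed, and names are assumed to contain no quote characters.

```php
<?php
// Hypothetical generator for one weighted alternative. The state is
// optional in speech but always present in the returned object.
function build_city_alternative(string $city, string $state, int $weight): string
{
    return sprintf(
        "(/%d/((%s) [(%s)]){ out.city='%s'; out.state='%s';})",
        $weight, $city, $state, $city, $state
    );
}

// Joins the alternatives into a single $CityState rule.
function build_city_state_rule(array $cities): string
{
    $alternatives = [];
    foreach ($cities as $entry) {
        $alternatives[] = build_city_alternative($entry['city'], $entry['state'], $entry['weight']);
    }
    return "\$CityState = (\n" . implode(" |\n", $alternatives) . " );";
}

// Two entries taken from the excerpt above.
$cities = [
    ['city' => 'Alton', 'state' => 'Ohio',    'weight' => 4512],
    ['city' => 'Keys',  'state' => 'Florida', 'weight' => 4512],
];

echo build_city_state_rule($cities), PHP_EOL;
```

Generating the values this way also avoids the stray spaces that the application currently has to trim.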

state.gram

The state grammar is similar to the citystate grammar, except it only accepts state names:

#ABNF 1.0 UTF-8;

language en-US;
mode voice;
tag-format <semantics/1.0>;

root $StateOnly;

$StateOnly = (
/4599/(Alabama ){ out='Alabama';} |
/4451/(Arkansas ){ out='Arkansas';} |
/4679/(Arizona ){ out='Arizona';} |
/5220/(California ){ out='California';} |
/4606/(Colorado ){ out='Colorado';} |
/4521/(Connecticut ){ out='Connecticut';} |
/4093/(Delaware ){ out='Delaware';} |
/5008/(Florida ){ out='Florida';} |
/4806/(Georgia ){ out='Georgia';} |
/4217/(Hawaii  | "{HH AX V AY IY:Hawaii}" | "{HH AX W AX IY:Hawaii}" | "{HH AX V AX IY:Hawaii}"){ out='Hawaii';} |
/4470/(Iowa ){ out='Iowa';} |
/4251/(Idaho ){ out='Idaho';} |
/4908/(Illinois ){ out='Illinois';} |
/4695/(Indiana ){ out='Indiana';} |
/4447/(Kansas ){ out='Kansas';} |
/4573/(Kentucky ){ out='Kentucky';} |
/4597/(Louisiana ){ out='Louisiana';} |
/4701/(Massachusetts ){ out='Massachusetts';} |
/4661/(Maryland ){ out='Maryland';} |
/4228/(Maine ){ out='Maine';} |
/4839/(Michigan ){ out='Michigan';} |
/4635/(Minnesota ){ out='Minnesota';} |
/4672/(Missouri ){ out='Missouri';} |
/4466/(Mississippi ){ out='Mississippi';} |
/4124/(Montana ){ out='Montana';} |
/4793/(North Carolina ){ out='North Carolina';} |
/4009/(North Dakota ){ out='North Dakota';} |
/4314/(Nebraska ){ out='Nebraska';} |
/4225/(New Hampshire  | newhampshire){ out='New Hampshire';} |
/4794/(New Jersey ){ out='New Jersey';} |
/4341/(New Mexico ){ out='New Mexico';} |
/4409/(Nevada ){ out='Nevada';} |
/4876/(Ohio ){ out='Ohio';} |
/4524/(Oklahoma ){ out='Oklahoma';} |
/4532/(Oregon ){ out='Oregon';} |
/4900/(Pennsylvania ){ out='Pennsylvania';} |
/4553/(Puerto Rico ){ out='Puerto Rico';} |
/4166/(Rhode Island ){ out='Rhode Island';} |
/4579/(South Carolina ){ out='South Carolina';} |
/4068/(South Dakota ){ out='South Dakota';} |
/4680/(Tennessee ){ out='Tennessee';} |
/5083/(Texas ){ out='Texas';} |
/4415/(Utah ){ out='Utah';} |
/4751/(Virginia ){ out='Virginia';} |
/4002/(Vermont ){ out='Vermont';} |
/4696/(Washington ){ out='Washington';} |
/4658/(Wisconsin ){ out='Wisconsin';} |
/4323/(West Virginia ){ out='West Virginia';} |
/3942/(Wyoming ){ out='Wyoming';} );

Again, notice that there is some weighting based on population.

say-cityState.ssml

The say-cityState.ssml file is very simple and is largely included just to be an example of how to reference SSML documents when performing syntheses:

<?xml version="1.0" encoding="UTF-8" ?>
<speak version="1.0" xml:lang="en-US">
	Please say the city and state for which you would like to find a store.
</speak>
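
If a prompt needs to be produced programmatically, a document of the same shape can be built as a string and saved where the Media Server can read it. The sketch below is one hedged way to do that. The function build_ssml_prompt is hypothetical, and the only point it illustrates is that the prompt text should be XML-escaped before being placed inside the speak element.

```php
<?php
// Hypothetical helper: wraps prompt text in a minimal SSML document that
// mirrors the structure of say-cityState.ssml.
function build_ssml_prompt(string $text, string $lang = 'en-US'): string
{
    // Escape &, <, > and quotes so the text cannot break the XML.
    $escaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    return '<?xml version="1.0" encoding="UTF-8" ?>' . "\n"
        . '<speak version="1.0" xml:lang="' . $lang . '">' . "\n"
        . "\t" . $escaped . "\n"
        . '</speak>' . "\n";
}

$ssml = build_ssml_prompt('Please say the city & state.');
echo $ssml;
```

The returned string could then be written out with file_put_contents() to the directory that $base_ssml_uri points at.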
