Field | Type | Label | Description |
standard_audio_format | AudioFormat.StandardAudioFormat | Standard audio format |
Field | Type | Label | Description |
session_id | string | References an existing session_id (uuid) |
|
audio_id | string | ID of the audio requested (note that this could be the session_id, to request the inbound audio resource) |
|
audio_start | int64 | Number of milliseconds from the beginning of the audio to return (default is from the beginning) |
|
audio_length | int64 | Maximum number of milliseconds to return. A zero value returns all available audio (from requested start point). (default is all audio, from start point) |
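The audio_start and audio_length fields above can be combined to retrieve audio in bounded segments rather than in one large response. The following is a minimal sketch of how a client might compute the (audio_start, audio_length) pairs; the helper name pull_windows and the idea that the client already knows the total duration in milliseconds are assumptions made for illustration, not part of the API.

```python
from typing import Iterator, Tuple


def pull_windows(total_ms: int, chunk_ms: int) -> Iterator[Tuple[int, int]]:
    """Yield (audio_start, audio_length) pairs, in milliseconds, covering [0, total_ms).

    Hypothetical helper: total_ms is assumed to be known by the caller.
    A yielded length is never zero, because a zero audio_length means
    "return all available audio" according to the field description.
    """
    if total_ms < 0 or chunk_ms <= 0:
        raise ValueError("total_ms must be >= 0 and chunk_ms must be > 0")
    start = 0
    while start < total_ms:
        length = min(chunk_ms, total_ms - start)
        yield start, length
        start += length


# 2.5 seconds of audio requested in windows of at most 1 second.
windows = list(pull_windows(total_ms=2500, chunk_ms=1000))
```

Each pair would then populate audio_start and audio_length in one audio pull request.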
Field | Type | Label | Description |
audio_data | bytes | Binary audio data that was requested |
Field | Type | Label | Description |
session_id | string | References an existing session_id (uuid) to where the audio will be sent |
|
audio_data | bytes | Binary audio data to be added to the audio resource |
Currently no fields returned
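Because audio_data is sent as a single block per request, a large buffer may need to be split before it is pushed. The sketch below shows one way to do that in plain Python. The helper name split_audio and the 1 MiB default block size are assumptions chosen for illustration; the appropriate limit depends on how the gRPC channel and server are configured.

```python
from typing import List


def split_audio(audio: bytes, max_block_bytes: int = 1024 * 1024) -> List[bytes]:
    """Split a raw audio buffer into blocks of at most max_block_bytes.

    Hypothetical helper: each returned block would be sent as the
    audio_data field of a separate push request, in order.
    """
    if max_block_bytes <= 0:
        raise ValueError("max_block_bytes must be > 0")
    return [
        audio[offset:offset + max_block_bytes]
        for offset in range(0, len(audio), max_block_bytes)
    ]


# 2500 bytes of silence split into blocks of at most 1000 bytes.
original = b"\x00" * 2500
blocks = split_audio(original, max_block_bytes=1000)
```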
Field | Type | Label | Description |
session_id | string | References an existing session_id (uuid). Set it in the first AudioStreamRequest message |
|
audio_data | bytes | Streamed binary audio data to be added to the audio resource |
Currently no fields returned
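For a client-streaming call, requests are usually produced by a generator. The sketch below illustrates the rule that session_id is set in the first streamed message. The AudioStreamRequest dataclass here is only a stand-in for the generated message class, and leaving session_id empty on later messages is an assumption; the field descriptions do not say whether repeating it is allowed.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class AudioStreamRequest:
    """Stand-in for the generated message class (illustration only)."""
    session_id: str = ""
    audio_data: bytes = b""


def stream_requests(session_id: str, chunks: Iterable[bytes]) -> Iterator[AudioStreamRequest]:
    """Yield one request per audio chunk, with session_id only on the first."""
    first = True
    for chunk in chunks:
        yield AudioStreamRequest(
            session_id=session_id if first else "",
            audio_data=chunk,
        )
        first = False


# "example-session" is a placeholder, not a real session ID.
requests = list(stream_requests("example-session", [b"\x01\x02", b"\x03\x04"]))
```

With generated gRPC code, a generator like this would typically be passed directly to the client-streaming method on the stub.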
FinalResultsReady
Callback sent when final interaction results are ready.
Subsequent call(s) to InteractionRequestResults() can be used to obtain
the results object.
This callback signals that all processing related to this interaction is
finished.
Field | Type | Label | Description |
session_id | string | The session object being referenced |
|
interaction_id | string | The interaction object being referenced |
Field | Type | Label | Description |
session_id | string | The session object being referenced |
|
interaction_id | string | The interaction object being referenced |
Currently no fields returned
Field | Type | Label | Description |
session_id | string | The session object being referenced |
|
interaction_id | string | The interaction object being referenced |
Currently no fields returned
Field | Type | Label | Description |
session_id | string | The session object being referenced |
|
interaction_id | string | The interaction object being referenced |
Currently no fields returned
Field | Type | Label | Description |
session_id | string | The session object being referenced |
|
interaction_ids | string | repeated | List of grammar load interaction IDs, one for each root grammar to activate |
Field | Type | Label | Description |
interaction_id | string | Interaction ID (uuid) that can be used during subsequent ASR processing |
Field | Type | Label | Description |
session_id | string | The session object being referenced |
|
language | string | The language selector for the specified grammar (e.g.: "en-US", "de-DE" or dialect-independent "en", "de", etc.) |
|
grammar_url | string | A grammar URL to be loaded |
|
inline_grammar_text | string | A string containing the raw grammar text |
Field | Type | Label | Description |
interaction_id | string | Interaction ID (uuid) that can be used to reference the grammar during subsequent API calls |
Field | Type | Label | Description |
session_id | string | The session object being referenced |
|
grammar_ids | string | repeated | List of grammar load interaction IDs, one for each root grammar to activate |
input_text | string | Input text to be parsed against the grammar[s] |
Field | Type | Label | Description |
interaction_id | string | The interaction object being referenced by the request |
Field | Type | Label | Description |
session_id | string | The session object being referenced |
|
language | string | Synthesis language for this request (e.g.: "en-US", "de-DE", etc.) |
|
ssml_url | string | URL from which to fetch the synthesis request SSML |
|
inline_request | InteractionCreateTTSRequest.InlineTTSRequest | Inline TTS definition (text and optional parameters) |
|
audio_format | AudioFormat | Audio format to be generated by TTS Synthesis |
Inline TTS definition (text and optional parameters)
Field | Type | Label | Description |
text | string | Text to synthesize; can be simple text or SSML |
|
voice | string | Optional TTS voice (if using simple text, or if not specified within SSML) |
Field | Type | Label | Description |
interaction_id | string | Interaction ID (uuid) that can be used during subsequent TTS processing |
Field | Type | Label | Description |
session_id | string | The session object being referenced |
|
interaction_id | string | The interaction object being referenced |
Currently no fields returned
Field | Type | Label | Description |
session_id | string | The session object being referenced |
|
interaction_id | string | The interaction object to get the current settings from |
Field | Type | Label | Description |
json_settings | string | A JSON encoded string containing the requested settings |
Field | Type | Label | Description |
session_id | string | The session object being referenced |
|
interaction_id | string | The interaction object being referenced |
Field | Type | Label | Description |
result_ready | bool | The result status |
|
results_json | string | The JSON object containing the result being requested or empty if result_ready is false |
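Since results_json is empty when result_ready is false, a client should check the flag before decoding. The following is a minimal sketch of that check. The dataclass is a stand-in for the generated response message, and the structure of the decoded JSON is not documented here, so the sample payload in the usage lines is purely hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class InteractionRequestResultsResponse:
    """Stand-in for the generated response message (illustration only)."""
    result_ready: bool = False
    results_json: str = ""


def decode_results(response: InteractionRequestResultsResponse) -> Optional[Any]:
    """Return the decoded results, or None if no result is available yet."""
    if not response.result_ready or not response.results_json:
        return None
    return json.loads(response.results_json)


# Hypothetical payloads; real result objects will have a different shape.
pending = decode_results(InteractionRequestResultsResponse())
done = decode_results(
    InteractionRequestResultsResponse(result_ready=True, results_json='{"example": 1}')
)
```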
Field | Type | Label | Description |
session_id | string | The session object being referenced |
|
interaction_id | string | The interaction object to set the settings for |
|
json_settings | string | JSON formatted settings to be configured. |
Currently no fields returned
IntermediateResultsReady
Callback sent when intermediate interaction results are available.
Call(s) to InteractionRequestResults() can be used to obtain the results object.
Field | Type | Label | Description |
session_id | string | The session object being referenced |
|
interaction_id | string | The interaction object being referenced |
Field | Type | Label | Description |
session_id | string | Reference to session to close |
Currently no fields returned
Field | Type | Label | Description |
audio_format | AudioFormat | Audio parameters for the audio resource object associated with the new session being created. These audio parameters include various attributes such as encoding format, sample rate, etc. Use STANDARD_AUDIO_FORMAT_NO_AUDIO_RESOURCE if no audio resource needs to be created |
Field | Type | Label | Description |
session_id | string | Session ID of newly created session (will be returned from initial call) |
|
vad_event | VadEvent | VAD event notification |
|
final_result | FinalResultsReady | Final results ready notification |
|
partial_result | IntermediateResultsReady | Intermediate results ready notification |
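The streamed response described by the fields above can carry a session ID or one of three notifications, so a client typically inspects each message to decide how to handle it. The sketch below shows one possible dispatch. The dataclass is a stand-in for the generated message, and modelling unset fields as None is an assumption: with real generated protobuf classes, message-typed fields are never None and presence would normally be checked with HasField.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionCreateResponse:
    """Stand-in for the generated message (illustration only)."""
    session_id: str = ""
    vad_event: Optional[object] = None
    final_result: Optional[object] = None
    partial_result: Optional[object] = None


def classify(message: SessionCreateResponse) -> str:
    """Name the populated part of a streamed response message."""
    if message.vad_event is not None:
        return "vad_event"
    if message.final_result is not None:
        return "final_result"
    if message.partial_result is not None:
        return "partial_result"
    if message.session_id:
        return "session_id"
    return "empty"


# The initial message of the stream is expected to carry the new session ID.
first_kind = classify(SessionCreateResponse(session_id="example-session"))
```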
Field | Type | Label | Description |
session_id | string | Which session object to get the settings from |
Field | Type | Label | Description |
json_settings | string | A JSON encoded string containing the requested settings |
Field | Type | Label | Description |
session_id | string | Which session to set the settings for |
|
json_settings | string | JSON formatted settings to be configured. |
Currently no fields returned
VadEvent
Message used to signal events over the course of Voice Activity Detection
processing.
The audio_offset will signify at what point within the session audio
resource the event occurred.
Field | Type | Label | Description |
session_id | string | The session object being referenced |
|
interaction_id | string | The interaction object being referenced |
|
vad_event_type | VadEvent.VadEventType | The type of event this message represents |
|
audio_offset | int32 | The offset in milliseconds from the beginning of the audio resource at which this event occurred |
Specification for the audio format
Not all standard formats are supported in all cases. Different
interactions can natively handle a subset of the total audio formats.
Name | Number | Description |
STANDARD_AUDIO_FORMAT_UNSPECIFIED | 0 | Undefined audio |
STANDARD_AUDIO_FORMAT_ULAW_8KHZ | 1 | ULAW 8000 HZ, 1 byte per sample |
STANDARD_AUDIO_FORMAT_ALAW_8KHZ | 2 | ALAW 8000 HZ, 1 byte per sample |
STANDARD_AUDIO_FORMAT_PCM_8KHZ | 3 | PCM 8000 HZ, 2 bytes per sample |
STANDARD_AUDIO_FORMAT_PCM_16KHZ | 10 | PCM 16000 HZ, 2 bytes per sample |
STANDARD_AUDIO_FORMAT_PCM_22KHZ | 20 | PCM 22050 HZ, 2 bytes per sample |
STANDARD_AUDIO_FORMAT_NO_AUDIO_RESOURCE | 100 | Used to indicate that no audio resource should be allocated |
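The sample rates and bytes-per-sample values in the table above are enough to convert between durations in milliseconds (as used by audio_start, audio_length and audio_offset) and byte counts. The sketch below does this conversion. It assumes single-channel audio, which the table does not state explicitly, and the helper name bytes_for_duration is hypothetical.

```python
# (sample rate in Hz, bytes per sample), copied from the format table.
FORMATS = {
    "STANDARD_AUDIO_FORMAT_ULAW_8KHZ": (8000, 1),
    "STANDARD_AUDIO_FORMAT_ALAW_8KHZ": (8000, 1),
    "STANDARD_AUDIO_FORMAT_PCM_8KHZ": (8000, 2),
    "STANDARD_AUDIO_FORMAT_PCM_16KHZ": (16000, 2),
    "STANDARD_AUDIO_FORMAT_PCM_22KHZ": (22050, 2),
}


def bytes_for_duration(format_name: str, duration_ms: int) -> int:
    """Return the number of bytes holding duration_ms of audio.

    Assumes mono audio. Whole samples only: a partial trailing sample
    is dropped by the integer division.
    """
    if duration_ms < 0:
        raise ValueError("duration_ms must be >= 0")
    sample_rate, bytes_per_sample = FORMATS[format_name]
    samples = sample_rate * duration_ms // 1000
    return samples * bytes_per_sample


# A 20 ms frame of 8 kHz u-law audio and of 16 kHz PCM audio.
ulaw_frame = bytes_for_duration("STANDARD_AUDIO_FORMAT_ULAW_8KHZ", 20)
pcm16_frame = bytes_for_duration("STANDARD_AUDIO_FORMAT_PCM_16KHZ", 20)
```

A calculation like this can help size audio blocks so that each push or stream message covers a predictable duration.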
Name | Number | Description |
VAD_EVENT_TYPE_UNSPECIFIED | 0 | Undefined VAD event type |
VAD_EVENT_TYPE_BEGIN_PROCESSING | 1 | VAD begins processing audio |
VAD_EVENT_TYPE_BARGE_IN | 2 | Barge-in occurred; audio that will be processed by the ASR starts here. This notification might be useful, for example, to stop prompt playback |
VAD_EVENT_TYPE_END_OF_SPEECH | 3 | End-of-speech occurred, no further audio will be processed by VAD for the specified interaction. If the setting InteractionASR_VoiceActivityDetection.AUTO_FINALIZE_ON_EOS is true, the ASR will immediately finish processing audio at this point |
VAD_EVENT_TYPE_BARGE_IN_TIMEOUT | 4 | VAD timed out waiting for audio barge-in (start-of-speech). The audio manager will no longer process audio for this interaction. |
VAD_EVENT_TYPE_END_OF_SPEECH_TIMEOUT | 5 | VAD timed out waiting for audio barge-out (end-of-speech). The audio manager will no longer process audio for this interaction. |
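A client that reacts to these events often only needs to know whether VAD has stopped consuming audio for the interaction. The sketch below mirrors the enum values from the table in a local IntEnum and groups the three event types whose descriptions say that no further audio will be processed. The local enum and the helper vad_stopped are illustrative; generated code would provide its own enum constants.

```python
from enum import IntEnum


class VadEventType(IntEnum):
    """Local mirror of the VAD event values listed in the table."""
    VAD_EVENT_TYPE_UNSPECIFIED = 0
    VAD_EVENT_TYPE_BEGIN_PROCESSING = 1
    VAD_EVENT_TYPE_BARGE_IN = 2
    VAD_EVENT_TYPE_END_OF_SPEECH = 3
    VAD_EVENT_TYPE_BARGE_IN_TIMEOUT = 4
    VAD_EVENT_TYPE_END_OF_SPEECH_TIMEOUT = 5


# Event types described as ending audio processing for the interaction.
STOP_EVENTS = frozenset({
    VadEventType.VAD_EVENT_TYPE_END_OF_SPEECH,
    VadEventType.VAD_EVENT_TYPE_BARGE_IN_TIMEOUT,
    VadEventType.VAD_EVENT_TYPE_END_OF_SPEECH_TIMEOUT,
})


def vad_stopped(event_type: int) -> bool:
    """Return True if the event means VAD will process no more audio."""
    return VadEventType(event_type) in STOP_EVENTS


barge_in_stops = vad_stopped(VadEventType.VAD_EVENT_TYPE_BARGE_IN)
end_of_speech_stops = vad_stopped(3)
```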
SpeechAPIService
The LumenVox Speech API can be used to access various speech resources,
such as Automatic Speech Recognition (ASR), Text-To-Speech (TTS),
Transcription, and Call-Progress-Analysis (CPA).
Method Name | Request Type | Response Type | Description |
SessionCreate | SessionCreateRequest | SessionCreateResponse stream | SessionCreate Creates a new session and returns its ID and session-related messages through streamed response callback messages. The returned session_id (uuid) can be used when making other requests for the session. Also optionally creates a new audio resource using the specified audio_format parameters. Only one audio resource can be created per session. Audio is added to this audio resource via gRPC (AudioPush or AudioStream). Typically, audio is either streamed in with AudioStream or sent in blocks using AudioPush. Both methods push audio into the same internal audio resource for processing. |
SessionClose | SessionCloseRequest | SessionCloseResponse | SessionClose Closes the specified session. Once closed, a session can no longer be referenced for requests and should be assumed to be no longer valid. |
AudioStream | AudioStreamRequest stream | AudioStreamResponse | AudioStream Sends a stream of binary audio data into the specified audio resource. Note that this may be called before an interaction exists, which allows audio to be added before creating interactions that will process the audio. |
AudioPush | AudioPushRequest | AudioPushResponse | AudioPush Sends a block of binary audio data into the specified audio resource. Note that this may be called before an interaction exists, which allows audio to be added before creating interactions that will process the audio. Please consider the gRPC maximum message size limits. |
AudioPull | AudioPullRequest | AudioPullResponse | AudioPull Returns a block of audio data from an audio resource. A begin point in milliseconds and a maximum length can be specified to return a segment of the audio data. By default, all audio data within the resource is returned. Because of gRPC maximum message size limits, API clients may want to retrieve audio in more manageable chunk sizes; using the default of always returning the entire audio buffer may not be advisable in all situations. |
SessionSetSettings | SessionSetSettingsRequest | SessionSetSettingsResponse | SessionSetSettings Applies configuration changes to specified session settings. |
SessionGetSettings | SessionGetSettingsRequest | SessionGetSettingsResponse | SessionGetSettings Returns a JSON encoded string containing the requested session settings. |
InteractionCreateASR | InteractionCreateASRRequest | InteractionCreateASRResponse | InteractionCreateASR Creates a new ASR interaction for the specified session. This type of object is required to access ASR functionality. Use the returned interaction_id in subsequent ASR requests. |
InteractionCreateTTS | InteractionCreateTTSRequest | InteractionCreateTTSResponse | InteractionCreateTTS Creates a new TTS interaction for the specified session. This type of object is required to access TTS functionality. Use the returned interaction_id in subsequent TTS requests. |
InteractionCreateGrammarLoad | InteractionCreateGrammarLoadRequest | InteractionCreateGrammarLoadResponse | InteractionCreateGrammarLoad Requests a grammar be loaded within the specified session/interaction. The returned interaction_id may be referenced in subsequent ASR requests. |
InteractionCreateGrammarParse | InteractionCreateGrammarParseRequest | InteractionCreateGrammarParseResponse | InteractionCreateGrammarParse Creates a new grammar parse interaction for the specified session. A grammar parse interaction allows sending text directly, to be parsed by the active grammars. Essentially this is the same as an ASR interaction, but the speech-to-text functionality is skipped. The raw text is passed in directly instead of having the ASR engine supply the text from the audio. The text is parsed with the active grammars in the same way as an ASR interaction. The returned interaction_id may be used to determine the status of this request as well as to access results when processing is completed. |
InteractionSetSettings | InteractionSetSettingsRequest | InteractionSetSettingsResponse | InteractionSetSettings Adds or modifies the specified settings to the specified interaction. Settings not mentioned in this call will remain unaffected. |
InteractionGetSettings | InteractionGetSettingsRequest | InteractionGetSettingsResponse | InteractionGetSettings Returns a JSON encoded string containing the current settings for the specified interaction. |
InteractionBeginProcessing | InteractionBeginProcessingRequest | InteractionBeginProcessingResponse | InteractionBeginProcessing Begins processing the specified interaction. Typically, any interaction settings that are needed should be set before calling InteractionBeginProcessing. Calling this function triggers backend services to begin processing the audio or text being input. |
InteractionFinalizeProcessing | InteractionFinalizeProcessingRequest | InteractionFinalizeProcessingResponse | InteractionFinalizeProcessing Used to force VAD to complete when VAD is used, or after VAD has detected the beginning of speech. Takes all available resource audio and triggers an ASR decode. This is optional most of the time, when the default auto-decode setting is used. This can also be used when performing DTMF or Text type interactions. Results for the interaction may be available during subsequent calls to InteractionRequestResults. |
InteractionRequestResults | InteractionRequestResultsRequest | InteractionRequestResultsResponse | InteractionRequestResults Returns an interaction's results as a JSON encoded string. Note that an empty JSON object may be returned if no results are currently available. |
InteractionCancel | InteractionCancelRequest | InteractionCancelResponse | InteractionCancel Cancels the specified interaction. Any active processing related to the interaction is stopped. |
InteractionClose | InteractionCloseRequest | InteractionCloseResponse | InteractionClose Closes the specified interaction. |
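Because InteractionRequestResults may return before results are available, a client that does not rely on the streamed callbacks may want to retry with a bounded loop. The sketch below shows one such loop. The fetch callable stands in for whatever code issues the InteractionRequestResults call and returns (result_ready, results_json); that callable, the retry count, and the delay are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, Tuple


def poll_results(
    fetch: Callable[[], Tuple[bool, str]],
    attempts: int = 10,
    delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch until it reports a ready result or attempts run out.

    Returns the results JSON string, or None if no result became ready.
    The sleep function is injectable so the loop can be tested quickly.
    """
    for attempt in range(attempts):
        ready, results_json = fetch()
        if ready:
            return results_json
        if attempt < attempts - 1:
            sleep(delay_s)
    return None


# Simulated sequence: not ready twice, then a hypothetical result payload.
_responses = iter([(False, ""), (False, ""), (True, '{"example": true}')])
outcome = poll_results(lambda: next(_responses), sleep=lambda seconds: None)
```

Where the FinalResultsReady callback is available on the session stream, waiting for it and then requesting results once is likely preferable to polling.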
.proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby |
double | | double | double | float | float64 | double | float | Float |
float | | float | float | float | float32 | float | float | Float |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum |
uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required) |
uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required) |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required) |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum |
sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum |
bool | | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8) |
bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT) |
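The efficiency notes in the table above follow from how variable-length (varint) encoding works. The sketch below re-implements the standard protobuf varint and ZigZag encodings in plain Python to show the difference in encoded size; it is an illustration of the wire format, not code taken from any protobuf library, and the function names are chosen for this example.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("encode_varint expects a non-negative integer")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int32(n: int) -> bytes:
    """int32: negative values are sign-extended to 64 bits before encoding."""
    return encode_varint(n & 0xFFFFFFFFFFFFFFFF)


def zigzag32(n: int) -> int:
    """Map signed 32-bit values to unsigned: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def encode_sint32(n: int) -> bytes:
    """sint32: ZigZag-map first, so small negative values stay short."""
    return encode_varint(zigzag32(n))


size_int32 = len(encode_int32(-1))    # 10 bytes
size_sint32 = len(encode_sint32(-1))  # 1 byte
size_large = len(encode_varint(2 ** 28))  # 5 bytes, versus 4 for fixed32
```

This is why the table recommends sint32 or sint64 for fields that often hold negative values, and fixed32 for values that are frequently above 2^28.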
Copyright (C) 2001-2024, Ai Software, LLC d/b/a LumenVox