Protocol Documentation

speech.proto

AudioFormat

Field Type Label Description
standard_audio_format AudioFormat.StandardAudioFormat

Standard audio format

AudioPullRequest

Field Type Label Description
session_id string

References an existing session_id (uuid)

audio_id string

ID of the requested audio. Note that this can be the session_id, to request the inbound audio resource.

audio_start int64

Offset in milliseconds from the beginning of the audio at which the returned audio starts (default is the beginning of the audio)

audio_length int64

Maximum number of milliseconds to return. A zero value (the default) returns all available audio from the requested start point.
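
As a rough illustration of how audio_start and audio_length combine, the sketch below splits a known audio duration into consecutive windows, each of which could populate one AudioPullRequest. The helper name pull_windows, and the assumption that the caller already knows the total duration in milliseconds, are illustrative and not part of the API.

```python
def pull_windows(total_ms, chunk_ms):
    """Return (audio_start, audio_length) pairs that cover total_ms milliseconds."""
    if chunk_ms <= 0:
        # audio_length == 0 means "all available audio", so it cannot be a chunk size.
        raise ValueError("chunk_ms must be positive")
    windows = []
    start = 0
    while start < total_ms:
        length = min(chunk_ms, total_ms - start)
        windows.append((start, length))
        start += length
    return windows


# Example: 2.5 seconds of audio pulled in windows of at most 1 second.
windows = pull_windows(2500, 1000)
```

Each pair maps directly onto the audio_start and audio_length fields; the last window is shortened so that a zero length is never requested.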

AudioPullResponse

Field Type Label Description
audio_data bytes

Binary audio data that was requested

AudioPushRequest

Field Type Label Description
session_id string

References an existing session_id (uuid) to where the audio will be sent

audio_data bytes

Binary audio data to be added to the audio resource
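
Because each AudioPushRequest carries its audio in a single message, large buffers may need to be split before sending. The following sketch divides a byte buffer into blocks of a caller-chosen maximum size. The helper name split_audio is hypothetical, and the appropriate limit depends on the gRPC configuration of the deployment (many gRPC implementations default to about 4 MiB for received messages, but this should be verified rather than assumed).

```python
def split_audio(data, max_block_bytes):
    """Split a bytes object into consecutive blocks of at most max_block_bytes."""
    if max_block_bytes <= 0:
        raise ValueError("max_block_bytes must be positive")
    return [
        data[offset:offset + max_block_bytes]
        for offset in range(0, len(data), max_block_bytes)
    ]


# Example: each block would become the audio_data of one AudioPushRequest.
blocks = split_audio(b"abcdefgh", 3)
```

Concatenating the blocks in order reproduces the original buffer, so the audio resource receives the same bytes as a single large push would have delivered.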

AudioPushResponse

Currently no fields returned

AudioStreamRequest

Field Type Label Description
session_id string

References an existing session_id (uuid). Set it in the first AudioStreamRequest message.

audio_data bytes

Streamed binary audio data to be added to the audio resource
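
A minimal sketch of the request sequence described above, using plain dictionaries as stand-ins for generated AudioStreamRequest messages. The generator name stream_requests is hypothetical, and leaving session_id unset after the first message is an assumption drawn from the field description rather than a documented requirement.

```python
def stream_requests(session_id, chunks):
    """Yield dicts standing in for AudioStreamRequest messages."""
    first = True
    for chunk in chunks:
        if first:
            # The first message identifies the session that receives the audio.
            yield {"session_id": session_id, "audio_data": chunk}
            first = False
        else:
            # Later messages carry audio only (assumption, see lead-in).
            yield {"audio_data": chunk}


requests = list(stream_requests("example-session", [b"\x00\x01", b"\x02\x03"]))
```

With generated gRPC stubs, a generator of this shape is typically what gets passed to a client-streaming method.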

AudioStreamResponse

Currently no fields returned

FinalResultsReady

Callback sent when final interaction results are ready.

Subsequent call(s) to InteractionRequestResults() can be used to obtain the results object.

This callback signals that all processing related to this interaction is finished.

Field Type Label Description
session_id string

The session object being referenced

interaction_id string

The interaction object being referenced

InteractionBeginProcessingRequest

Field Type Label Description
session_id string

The session object being referenced

interaction_id string

The interaction object being referenced

InteractionBeginProcessingResponse

Currently no fields returned

InteractionCancelRequest

Field Type Label Description
session_id string

The session object being referenced

interaction_id string

The interaction object being referenced

InteractionCancelResponse

Currently no fields returned

InteractionCloseRequest

Field Type Label Description
session_id string

The session object being referenced

interaction_id string

The interaction object being referenced

InteractionCloseResponse

Currently no fields returned

InteractionCreateASRRequest

Field Type Label Description
session_id string

The session object being referenced

interaction_ids string repeated

List of grammar load interaction IDs, one for each root grammar to activate

InteractionCreateASRResponse

Field Type Label Description
interaction_id string

Interaction ID (uuid) that can be used during subsequent ASR processing

InteractionCreateGrammarLoadRequest

Field Type Label Description
session_id string

The session object being referenced

language string

The language selector for the specified grammar (e.g. "en-US", "de-DE", or dialect-independent "en", "de", etc.)

grammar_url string

A grammar URL to be loaded

inline_grammar_text string

A string containing the raw grammar text

InteractionCreateGrammarLoadResponse

Field Type Label Description
interaction_id string

Interaction ID (uuid) that can be used to reference the grammar during subsequent API calls

InteractionCreateGrammarParseRequest

Field Type Label Description
session_id string

The session object being referenced

grammar_ids string repeated

List of grammar load interaction IDs, one for each root grammar to activate

input_text string

Input text to be parsed against the grammar[s]

InteractionCreateGrammarParseResponse

Field Type Label Description
interaction_id string

The interaction object being referenced by the request

InteractionCreateTTSRequest

Field Type Label Description
session_id string

The session object being referenced

language string

Synthesis language for this request (e.g.: "en-US", "de-DE", etc.)

ssml_url string

URL from which to fetch the synthesis request SSML

inline_request InteractionCreateTTSRequest.InlineTTSRequest

Inline TTS definition (text and optional parameters)

audio_format AudioFormat

Audio format to be generated by TTS Synthesis

InteractionCreateTTSRequest.InlineTTSRequest

Inline TTS definition (text and optional parameters)

Field Type Label Description
text string

Text to synthesize; can be simple text or SSML

voice string

Optional TTS voice (if using simple text, or if not specified within SSML)

InteractionCreateTTSResponse

Field Type Label Description
interaction_id string

Interaction ID (uuid) that can be used during subsequent TTS processing

InteractionCreateTextNormalizationRequest

Field Type Label Description
session_id string

The session object being referenced

language string

The language selector for the specified grammar (e.g. "en-US", "de-DE", or dialect-independent "en", "de", etc.)

input_text string

Input text to be normalized.

enable_inverse_text_normalization bool

Set to true to enable inverse_text_normalization (going from spoken form → written form, e.g. twenty two → 22)

enable_punctuation_capitalization_normalization bool

Set to true to enable punctuation and capitalization_normalization

enable_redaction_normalization bool

Set to true to enable redaction of sensitive information

InteractionCreateTextNormalizationResponse

Field Type Label Description
interaction_id string

The interaction object being referenced by the request

InteractionFinalizeProcessingRequest

Field Type Label Description
session_id string

The session object being referenced

interaction_id string

The interaction object being referenced

InteractionFinalizeProcessingResponse

Currently no fields returned

InteractionGetSettingsRequest

Field Type Label Description
session_id string

The session object being referenced

interaction_id string

The interaction object to get the current settings from

InteractionGetSettingsResponse

Field Type Label Description
json_settings string

A JSON encoded string containing the requested settings

InteractionRequestResultsRequest

Field Type Label Description
session_id string

The session object being referenced

interaction_id string

The interaction object being referenced

InteractionRequestResultsResponse

Field Type Label Description
result_ready bool

The result status

results_json string

The JSON object containing the requested result, or empty if result_ready is false
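
One way a client might use result_ready is to poll until a result is available. The sketch below assumes a hypothetical callable, fetch, that performs one InteractionRequestResults call and returns the pair (result_ready, results_json); the helper name, retry count, and delay are illustrative only.

```python
import time


def wait_for_results(fetch, attempts=10, delay_s=0.1):
    """Call fetch() until it reports result_ready, then return results_json.

    fetch is a hypothetical callable returning (result_ready, results_json).
    Returns None if no result is ready after the given number of attempts.
    """
    for attempt in range(attempts):
        result_ready, results_json = fetch()
        if result_ready:
            return results_json
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return None
```

Polling is only one option: the FinalResultsReady callback described in this document signals when final results can be requested, which avoids repeated calls.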

InteractionSetSettingsRequest

Field Type Label Description
session_id string

The session object being referenced

interaction_id string

The interaction object to set the settings for

json_settings string

JSON formatted settings to be configured.
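
The InteractionSetSettings method description states that settings not mentioned in a call remain unaffected. The sketch below models that behavior locally to show what a caller can expect. The setting names are placeholders, not real LumenVox settings, and the top-level (non-recursive) merge is an assumption, since this document does not say how nested settings are combined.

```python
import json


def merge_settings(current, json_settings):
    """Return a copy of current with the keys from json_settings added or replaced."""
    updates = json.loads(json_settings)
    merged = dict(current)
    merged.update(updates)
    return merged


# Placeholder setting names, for illustration only.
current = {"EXAMPLE_SETTING_A": 1, "EXAMPLE_SETTING_B": 2}
json_settings = json.dumps({"EXAMPLE_SETTING_B": 20, "EXAMPLE_SETTING_C": 30})
merged = merge_settings(current, json_settings)
```

Building the json_settings string with json.dumps, as shown, avoids hand-written JSON and its quoting mistakes.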

InteractionSetSettingsResponse

Currently no fields returned

IntermediateResultsReady

Callback sent when intermediate interaction results are available.

Call(s) to InteractionRequestResults() can be used to obtain the results object.

Field Type Label Description
session_id string

The session object being referenced

interaction_id string

The interaction object being referenced

SessionCloseRequest

Field Type Label Description
session_id string

Reference to session to close

SessionCloseResponse

Currently no fields returned

SessionCreateRequest

Field Type Label Description
audio_format AudioFormat

Audio parameters for the audio resource object associated with the new session being created. These audio parameters include various attributes such as encoding format, sample rate, etc. Use STANDARD_AUDIO_FORMAT_NO_AUDIO_RESOURCE if no audio resource needs to be created.

SessionCreateResponse

Field Type Label Description
session_id string

Session ID of newly created session (will be returned from initial call)

vad_event VadEvent

VAD event notification

final_result FinalResultsReady

Final results ready notification

partial_result IntermediateResultsReady

Intermediate results ready notification
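
Since SessionCreate returns a stream of SessionCreateResponse messages, a client usually needs to work out what each message carries. The sketch below uses plain dictionaries as stand-ins for the generated message type and assumes that each streamed message populates exactly one of the four fields; this document does not state that explicitly, so treat it as an assumption. The function name classify_response is hypothetical.

```python
RESPONSE_FIELDS = ("session_id", "vad_event", "final_result", "partial_result")


def classify_response(response):
    """Return the name of the first populated field, or None if all are empty."""
    for field in RESPONSE_FIELDS:
        if response.get(field):
            return field
    return None


# Example stream: the session ID first, then callback notifications.
stream = [
    {"session_id": "example-session"},
    {"vad_event": {"vad_event_type": 2, "audio_offset": 480}},
    {"final_result": {"interaction_id": "example-interaction"}},
]
kinds = [classify_response(message) for message in stream]
```

A client loop could branch on the returned name, for example storing the session_id from the first message and requesting results when final_result arrives.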

SessionGetSettingsRequest

Field Type Label Description
session_id string

Which session object to get the settings from

SessionGetSettingsResponse

Field Type Label Description
json_settings string

A JSON encoded string containing the requested settings

SessionSetSettingsRequest

Field Type Label Description
session_id string

Which session to set the settings for

json_settings string

JSON formatted settings to be configured.

SessionSetSettingsResponse

Currently no fields returned

VadEvent

Message used to signal events over the course of Voice Activity Detection processing.

The audio_offset will signify at what point within the session audio resource the event occurred.

Field Type Label Description
session_id string

The session object being referenced

interaction_id string

The interaction object being referenced

vad_event_type VadEvent.VadEventType

The type of event this message represents

audio_offset int32

The offset in milliseconds from the beginning of the audio resource at which this event occurred

AudioFormat.StandardAudioFormat

Specification for the audio format

Not all standard formats are supported in all cases. Different interactions can natively handle a subset of the total audio formats.

Name Number Description
STANDARD_AUDIO_FORMAT_UNSPECIFIED 0

Undefined audio

STANDARD_AUDIO_FORMAT_ULAW_8KHZ 1

ULAW 8000 Hz, 1 byte per sample

STANDARD_AUDIO_FORMAT_ALAW_8KHZ 2

ALAW 8000 Hz, 1 byte per sample

STANDARD_AUDIO_FORMAT_PCM_8KHZ 3

PCM 8000 Hz, 2 bytes per sample

STANDARD_AUDIO_FORMAT_PCM_16KHZ 10

PCM 16000 Hz, 2 bytes per sample

STANDARD_AUDIO_FORMAT_PCM_22KHZ 20

PCM 22050 Hz, 2 bytes per sample

STANDARD_AUDIO_FORMAT_NO_AUDIO_RESOURCE 100

Used to indicate that no audio resource should be allocated
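
The sample rates and sample widths listed above determine how many bytes correspond to a given duration, which is useful when relating millisecond offsets to buffer sizes. The sketch below assumes single-channel (mono) audio, which this document does not state; the table and function names are illustrative.

```python
# (sample_rate_hz, bytes_per_sample), keyed by StandardAudioFormat number.
FORMATS = {
    1: (8000, 1),    # STANDARD_AUDIO_FORMAT_ULAW_8KHZ
    2: (8000, 1),    # STANDARD_AUDIO_FORMAT_ALAW_8KHZ
    3: (8000, 2),    # STANDARD_AUDIO_FORMAT_PCM_8KHZ
    10: (16000, 2),  # STANDARD_AUDIO_FORMAT_PCM_16KHZ
    20: (22050, 2),  # STANDARD_AUDIO_FORMAT_PCM_22KHZ
}


def bytes_for_duration(format_number, duration_ms):
    """Return the byte count for duration_ms of mono audio in the given format."""
    sample_rate_hz, bytes_per_sample = FORMATS[format_number]
    samples = sample_rate_hz * duration_ms // 1000  # whole samples only
    return samples * bytes_per_sample


one_second_ulaw = bytes_for_duration(1, 1000)
```

Formats 0 (unspecified) and 100 (no audio resource) are deliberately left out of the table, so looking them up raises KeyError.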

VadEvent.VadEventType

Name Number Description
VAD_EVENT_TYPE_UNSPECIFIED 0

Undefined VAD event type

VAD_EVENT_TYPE_BEGIN_PROCESSING 1

VAD begins processing audio

VAD_EVENT_TYPE_BARGE_IN 2

Barge-in occurred; the audio that will be processed by the ASR starts here. This notification might be useful, for example, to stop prompt playback.

VAD_EVENT_TYPE_END_OF_SPEECH 3

End-of-speech occurred, no further audio will be processed by VAD for the specified interaction. If the setting InteractionASR_VoiceActivityDetection.AUTO_FINALIZE_ON_EOS is true, the ASR will immediately finish processing audio at this point

VAD_EVENT_TYPE_BARGE_IN_TIMEOUT 4

VAD timed out waiting for audio barge-in (start-of-speech). The audio manager will no longer process audio for this interaction.

VAD_EVENT_TYPE_END_OF_SPEECH_TIMEOUT 5

VAD timed out waiting for audio barge-out (end-of-speech). The audio manager will no longer process audio for this interaction.
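
The event types above can be mirrored in client code to decide when audio for an interaction is no longer being processed. The sketch below copies the enum numbers from this table into a local IntEnum (generated code would normally supply these constants) and groups the three events whose descriptions say that audio processing stops. The class and function names are illustrative.

```python
from enum import IntEnum


class VadEventType(IntEnum):
    UNSPECIFIED = 0
    BEGIN_PROCESSING = 1
    BARGE_IN = 2
    END_OF_SPEECH = 3
    BARGE_IN_TIMEOUT = 4
    END_OF_SPEECH_TIMEOUT = 5


# Events after which, per the descriptions above, no further audio is processed.
AUDIO_STOPPED = frozenset({
    VadEventType.END_OF_SPEECH,
    VadEventType.BARGE_IN_TIMEOUT,
    VadEventType.END_OF_SPEECH_TIMEOUT,
})


def audio_processing_stopped(event_type):
    """Return True if the event means audio is no longer processed for the interaction."""
    return VadEventType(event_type) in AUDIO_STOPPED
```

A client might stop sending audio for the interaction once this returns True, and stop prompt playback when it sees BARGE_IN.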

SpeechAPIService

The LumenVox Speech API can be used to access various speech resources, such as Automatic Speech Recognition (ASR), Text-To-Speech (TTS), Transcription, and Call-Progress-Analysis (CPA).

Method Name | Request Type | Response Type | Description
SessionCreate SessionCreateRequest SessionCreateResponse stream

SessionCreate Creates a new session and returns its ID and session-related messages through streamed response callback messages. The returned session_id (uuid) can be used when making other requests for the session. Also optionally creates a new audio resource using the specified audio_format parameters. Only one audio resource can be created per session. Audio is added to this audio resource via gRPC, either streamed in with AudioStream or sent in blocks using AudioPush. Both methods push audio into the same internal audio resource for processing.

SessionClose SessionCloseRequest SessionCloseResponse

SessionClose Closes the specified session. Once closed, a session can no longer be referenced for requests and should be assumed to be no longer valid.

AudioStream AudioStreamRequest stream AudioStreamResponse

AudioStream Sends a stream of binary audio data into the specified audio resource. Note that this may be called before an interaction exists, which allows audio to be added before creating interactions that will process the audio.

AudioPush AudioPushRequest AudioPushResponse

AudioPush Sends a block of binary audio data into the specified audio resource. Note that this may be called before an interaction exists, which allows audio to be added before creating interactions that will process the audio. Please consider the gRPC maximum message size limits.

AudioPull AudioPullRequest AudioPullResponse

AudioPull Returns a block of audio data from an audio resource. A start point in milliseconds and a maximum length can be specified to return a segment of the audio data. By default, all audio data within the resource is returned. Because of gRPC maximum message size limits, API clients may want to retrieve audio in more manageable chunks; relying on the default of returning the entire audio buffer may not be advisable in all situations.

SessionSetSettings SessionSetSettingsRequest SessionSetSettingsResponse

SessionSetSettings Applies configuration changes to specified session settings.

SessionGetSettings SessionGetSettingsRequest SessionGetSettingsResponse

SessionGetSettings Returns a JSON encoded string containing the requested session settings.

InteractionCreateASR InteractionCreateASRRequest InteractionCreateASRResponse

InteractionCreateASR Creates a new ASR interaction for the specified session. This type of object is required to access ASR functionality. Use the returned interaction_id in subsequent ASR requests.

InteractionCreateTTS InteractionCreateTTSRequest InteractionCreateTTSResponse

InteractionCreateTTS Creates a new TTS interaction for the specified session. This type of object is required to access TTS functionality. Use the returned interaction_id in subsequent TTS requests.

InteractionCreateGrammarLoad InteractionCreateGrammarLoadRequest InteractionCreateGrammarLoadResponse

InteractionCreateGrammarLoad Requests a grammar be loaded within the specified session/interaction. The returned interaction_id may be referenced in subsequent ASR requests.

InteractionCreateGrammarParse InteractionCreateGrammarParseRequest InteractionCreateGrammarParseResponse

InteractionCreateGrammarParse Creates a new grammar parse interaction for the specified session. A grammar parse interaction allows text to be sent directly and parsed by the active grammars. Essentially this is the same as an ASR interaction, but the speech-to-text step is skipped: the raw text is passed in directly instead of having the ASR engine supply the text from the audio. The text is parsed with the active grammars in the same way as in an ASR interaction. The returned interaction_id may be used to determine the status of this request, as well as to access results when processing is completed.

InteractionSetSettings InteractionSetSettingsRequest InteractionSetSettingsResponse

InteractionSetSettings Adds or modifies the specified settings to the specified interaction. Settings not mentioned in this call will remain unaffected.

InteractionGetSettings InteractionGetSettingsRequest InteractionGetSettingsResponse

InteractionGetSettings Returns a JSON encoded string containing the current settings for the specified interaction.

InteractionBeginProcessing InteractionBeginProcessingRequest InteractionBeginProcessingResponse

InteractionBeginProcessing Begins processing the specified interaction. Typically, any interaction settings that are needed should be set before calling InteractionBeginProcessing. Calling this function triggers backend services to begin processing the audio or text input.

InteractionFinalizeProcessing InteractionFinalizeProcessingRequest InteractionFinalizeProcessingResponse

InteractionFinalizeProcessing Forces VAD to complete when VAD is in use, or after VAD has detected the beginning of speech. Takes all available resource audio and triggers an ASR decode. This is optional most of the time, when the default auto-decode setting is used. This can also be used when performing DTMF or text type interactions. Results for the interaction may be available during subsequent calls to InteractionRequestResults.

InteractionRequestResults InteractionRequestResultsRequest InteractionRequestResultsResponse

InteractionRequestResults Returns an interaction's results as a JSON encoded string. Note that an empty JSON object may be returned if no results are currently available

InteractionCancel InteractionCancelRequest InteractionCancelResponse

InteractionCancel Cancels the specified interaction. Any active processing related to the interaction is stopped.

InteractionClose InteractionCloseRequest InteractionCloseResponse

InteractionClose Closes the specified interaction.

Scalar Value Types

.proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby
double | | double | double | float | float64 | double | float | Float
float | | float | float | float | float32 | float | float | Float
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required)
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum
uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required)
uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required)
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required)
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required)
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum
sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required)
sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum
bool | | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8)
bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT)

Copyright (C) 2001-2024, Ai Software, LLC d/b/a LumenVox