Audio and voice

Overview

Audio and Voice Nodes are specialized components within the Dynamiq framework designed to handle Speech-to-Text (STT), Text-to-Speech (TTS), and Speech-to-Speech (STS) conversions. These nodes can transcribe audio files to text and synthesize spoken audio from text, leveraging Whisper for STT and ElevenLabs for TTS and STS.

The Audio and Voice nodes are essential for workflows that require:

  • Transcribing audio files to text (e.g., meeting recordings).

  • Converting text to audio, or audio to audio, for generating voice responses (e.g., virtual assistants, interactive systems).

Whisper Speech-to-Text (STT)

Whisper node

The Whisper node enables audio transcription using the Whisper model, providing high-quality speech-to-text conversion. This node is part of the Audio group in the Workflow editor and requires a connection to either the Whisper or OpenAI API.

Configuration

  • Name: Customizable name for identifying this node.

  • Connection: The connection configuration for Whisper.

  • Model: Model name, e.g., whisper-1.

Input

  • audio: Audio file input in bytes or BytesIO format, supporting various audio formats (the default is audio/wav).

Output

  • content: Transcription output as a string containing the recognized text from the audio input.

When used for tracing or as the final output, content is a base64-encoded string to facilitate easy handling and transport.
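To consume a traced transcription downstream, the base64 string has to be decoded first. A minimal Python sketch (the helper name is illustrative, not part of the Dynamiq API):

```python
import base64

def decode_transcription(content_b64: str) -> str:
    """Decode a base64-encoded transcription back to plain text."""
    return base64.b64decode(content_b64).decode("utf-8")

# Round-trip: encode a transcription the way a trace would, then decode it.
encoded = base64.b64encode("Hello from the meeting.".encode("utf-8")).decode("ascii")
print(decode_transcription(encoded))  # -> Hello from the meeting.
```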

Connection Configuration

Whisper connection
  • Type: Whisper

  • Name: Customizable name for identifying this connection.

  • API key: Your API key

  • URL: Whisper API URL, e.g., https://api.openai.com/v1/ for OpenAI
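Under the hood, these connection settings map onto a plain HTTP request against an OpenAI-compatible transcription endpoint. A hedged sketch of how the URL, key, and model would be assembled (the builder function is illustrative, not part of the Dynamiq API; the audio file itself is attached as the multipart `file` field when the request is actually sent):

```python
def build_transcription_request(base_url: str, api_key: str, model: str = "whisper-1"):
    """Assemble the endpoint URL, auth header, and form fields for a
    speech-to-text call against an OpenAI-compatible API."""
    url = base_url.rstrip("/") + "/audio/transcriptions"
    headers = {"Authorization": f"Bearer {api_key}"}
    data = {"model": model}
    return url, headers, data

url, headers, data = build_transcription_request("https://api.openai.com/v1/", "YOUR_API_KEY")
print(url)  # -> https://api.openai.com/v1/audio/transcriptions
```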

Usage Example

Audio workflow with Whisper
  1. Add an Input node and connect your audio file.

  2. Drag a Whisper node into the workspace and connect it to the Input node. Set the desired model and other configurations.

  3. Make sure the Whisper node's Input section uses an input transformer such as {"audio": "$.input.output.files[0]"} to pass the exact file from the list.

  4. Attach a downstream node (e.g. Output) to handle the transcribed content.
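The input transformer in step 3 is a JSONPath expression; in plain Python it amounts to picking one file out of the Input node's output. A hand-written equivalent, assuming the Input node exposes uploaded files under `output.files` (the Workflow editor evaluates the JSONPath for you):

```python
def apply_transformer(payload: dict) -> dict:
    """Equivalent of the transformer {"audio": "$.input.output.files[0]"}:
    select the first uploaded file from the Input node's output."""
    return {"audio": payload["input"]["output"]["files"][0]}

payload = {"input": {"output": {"files": [b"<wav bytes>", b"<second file>"]}}}
print(apply_transformer(payload))  # -> {'audio': b'<wav bytes>'}
```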

ElevenLabs Text-to-Speech

ElevenLabs TTS node

The ElevenLabs TTS node converts text into high-quality synthesized speech using ElevenLabs’ advanced TTS models. It provides options to adjust the voice characteristics, making it suitable for generating lifelike audio from text.

Configuration

  • Name: Customizable name for identifying this node.

  • Connection: The connection configuration for ElevenLabs.

  • Model: Model name, e.g., eleven_monolingual_v1.

  • Voices: Select from available voices, e.g. Rachel, to match your required voice profile.

  • Stability: Controls the stability and consistency of the voice.

  • Similarity: Adjusts how closely the voice resembles the original.

  • Style Exaggeration: Amplifies the style of the speaker, enhancing expressiveness.

  • Speaker Boost: Toggle to increase the likeness to the selected voice.
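The four voice controls above correspond to fields of the `voice_settings` object in the public ElevenLabs REST API. A sketch of the mapping (the helper function itself is illustrative, not part of the Dynamiq API):

```python
def voice_settings(stability: float, similarity: float,
                   style_exaggeration: float, speaker_boost: bool) -> dict:
    """Map the node's voice controls onto the ElevenLabs voice_settings JSON."""
    return {
        "stability": stability,             # consistency of the voice
        "similarity_boost": similarity,     # closeness to the original voice
        "style": style_exaggeration,        # expressiveness amplification
        "use_speaker_boost": speaker_boost, # extra likeness to the selected voice
    }

print(voice_settings(0.5, 0.75, 0.0, True))
```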

Input

  • text: Text input in string format for conversion to speech.

Output

  • content: Audio output as bytes, containing the synthesized speech.

When used for tracing or as the final output, content is a base64-encoded string to facilitate easy handling and transport.
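Because the synthesized audio arrives base64-encoded in traces and final outputs, saving it to disk takes one decode step. A minimal sketch (the helper name and file path are illustrative):

```python
import base64
import os
import tempfile

def save_audio(content_b64: str, path: str) -> int:
    """Decode base64 audio content and write the raw bytes to disk.

    Returns the number of bytes written."""
    audio_bytes = base64.b64decode(content_b64)
    with open(path, "wb") as f:
        return f.write(audio_bytes)

# Placeholder bytes stand in for real MP3 data returned by the node.
fake_audio = base64.b64encode(b"\x49\x44\x33 fake mp3 bytes").decode("ascii")
out_path = os.path.join(tempfile.gettempdir(), "speech.mp3")
save_audio(fake_audio, out_path)
```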

Connection Configuration

ElevenLabs connection
  • Type: ElevenLabs

  • Name: Customizable name for identifying this connection.

  • API key: Your API key

Usage Example

Audio flow with ElevenLabs TTS
  1. Add an Input node to pass in text data.

  2. Add an OpenAI node to handle the question and return an answer (optional).

  3. Connect the ElevenLabs TTS node to the OpenAI node and configure the model, voice, and settings as desired.

  4. Attach a downstream node (e.g. Output) to save or process the generated audio content.
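The steps above ultimately drive one HTTP call to the ElevenLabs text-to-speech endpoint, whose response body is the raw audio. A hedged sketch of assembling that request (endpoint path and `xi-api-key` header come from the public ElevenLabs REST API; the builder function and the `VOICE_ID` placeholder are illustrative):

```python
import json

def build_tts_request(api_key: str, voice_id: str, text: str,
                      model: str = "eleven_monolingual_v1"):
    """Assemble URL, headers, and JSON body for an ElevenLabs TTS call."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    headers = {"xi-api-key": api_key, "Content-Type": "application/json"}
    body = json.dumps({"text": text, "model_id": model})
    return url, headers, body

url, headers, body = build_tts_request("YOUR_API_KEY", "VOICE_ID", "Hello!")
print(url)  # -> https://api.elevenlabs.io/v1/text-to-speech/VOICE_ID
```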

ElevenLabs Speech-to-Speech

ElevenLabs STS node

The ElevenLabs STS node enables the transformation of an audio input into a new synthesized audio output in the selected voice. This node is particularly useful for voice modulation or re-synthesis applications, where the input speech is “re-voiced” using ElevenLabs models.

Configuration

  • Name: Customizable name for identifying this node.

  • Connection: The connection configuration for ElevenLabs.

  • Model: Model name, e.g., eleven_english_sts_v2.

  • Voices: Select from available voices, e.g. Dave, to match your required voice profile.

  • Stability: Controls the stability and consistency of the voice.

  • Similarity: Adjusts how closely the voice resembles the original.

  • Style Exaggeration: Amplifies the style of the speaker, enhancing expressiveness.

  • Speaker Boost: Toggle to increase the likeness to the selected voice.

Input

  • audio: Audio file input in bytes or BytesIO format, representing the original speech to be transformed.

Output

  • content: Audio output as bytes, containing the synthesized speech that mirrors the input but with the selected voice characteristics.

When used for tracing or as the final output, content is a base64-encoded string to facilitate easy handling and transport.

Connection Configuration

ElevenLabs connection
  • Type: ElevenLabs

  • Name: Customizable name for identifying this connection.

  • API key: Your API key

Usage Example

Audio flow with ElevenLabs STS
  1. Add an Input node to provide the original audio file.

  2. Connect it to the ElevenLabs STS node, select the desired model and voice, and configure the settings.

  3. Make sure the ElevenLabs STS node's Input section uses an input transformer such as {"audio": "$.input.output.files[0]"} to pass the exact file from the list.

  4. Attach a downstream node (e.g. Output) to export the generated audio content.
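As with TTS, the STS flow boils down to one call to the ElevenLabs speech-to-speech endpoint, with the source audio attached as a multipart file. A hedged sketch of assembling the request (endpoint path from the public ElevenLabs REST API; the builder function and `VOICE_ID` placeholder are illustrative; the audio bytes go in the multipart `audio` field when the request is actually sent):

```python
def build_sts_request(api_key: str, voice_id: str,
                      model: str = "eleven_english_sts_v2"):
    """Assemble URL, headers, and form fields for an ElevenLabs
    speech-to-speech call; the source audio is attached separately
    as the multipart `audio` field."""
    url = f"https://api.elevenlabs.io/v1/speech-to-speech/{voice_id}"
    headers = {"xi-api-key": api_key}
    data = {"model_id": model}
    return url, headers, data

url, headers, data = build_sts_request("YOUR_API_KEY", "VOICE_ID")
print(url)  # -> https://api.elevenlabs.io/v1/speech-to-speech/VOICE_ID
```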
