Audio and voice
Audio and Voice Nodes are specialized components within the Dynamiq framework designed to handle Speech-to-Text (STT), Text-to-Speech (TTS), and Speech-to-Speech (STS) conversions. These nodes can transcribe audio files to text and synthesize spoken audio from text, leveraging Whisper for STT and ElevenLabs for TTS and STS.
The Audio and Voice nodes are essential for workflows that require:
Transcribing audio files to text (e.g., meeting recordings).
Converting text to audio, or audio to audio, for generating voice responses (e.g., virtual assistants, interactive systems).
The Whisper node enables audio transcription using the Whisper model, providing high-quality speech-to-text conversion. This node is part of the Audio group in the Workflow editor and requires a connection to either the Whisper or OpenAI API.
Name: Customizable name for identifying this node.
Connection: The connection configuration for Whisper.
Model: Model name, e.g., whisper-1.
audio: Audio file input in Bytes or BytesIO format, supporting various audio formats (default is audio/wav).
content: Transcription output as a string containing the recognized text from the audio input.
When used for tracing or as the final output, content is a base64-encoded string to facilitate easy handling and transport.
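Because traced output arrives base64-encoded, consumers need to decode it before use. This is a minimal sketch using only the standard library; it assumes the traced content field holds base64-encoded UTF-8 text (the helper name is illustrative, not part of Dynamiq):

```python
import base64

def decode_traced_content(content_b64: str) -> str:
    """Decode a base64-encoded transcription back to plain text."""
    return base64.b64decode(content_b64).decode("utf-8")

# Example: a transcription that arrived base64-encoded in a trace
encoded = base64.b64encode("Hello from the meeting.".encode("utf-8")).decode("ascii")
print(decode_traced_content(encoded))  # Hello from the meeting.
```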
Type: Whisper
Name: Customizable name for identifying this connection.
API key: Your API key
URL: Whisper API URL, e.g., https://api.openai.com/v1/ for OpenAI.
Add an Input node and connect your audio file.
Drag a Whisper node into the workspace and connect it to the Input node. Set the desired model and other configurations.
Make sure that the Whisper node's Input section uses an input transformer like {"audio": "$.input.output.files[0]"} to pass the exact file from the list.
Attach a downstream node (e.g. Output) to handle the transcribed content.
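The input transformer in the steps above selects a single file from the Input node's output. Purely as an illustration of how such a selector resolves (this is not Dynamiq's internal implementation, which supports richer expressions), a minimal path lookup could look like:

```python
import re

def apply_selector(payload: dict, selector: str):
    """Resolve a minimal '$.a.b[0]'-style selector against a dict payload.
    Illustrative only -- Dynamiq's input transformer is more capable."""
    value = payload
    for part in selector.lstrip("$.").split("."):
        # Each part is a key, optionally followed by a list index like files[0]
        m = re.match(r"(\w+)(?:\[(\d+)\])?$", part)
        key, idx = m.group(1), m.group(2)
        value = value[key]
        if idx is not None:
            value = value[int(idx)]
    return value

# {"audio": "$.input.output.files[0]"} picks the first uploaded file
payload = {"input": {"output": {"files": [b"RIFF...wav bytes"]}}}
print(apply_selector(payload, "$.input.output.files[0]"))
```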
The ElevenLabs TTS node converts text into high-quality synthesized speech using ElevenLabs’ advanced TTS models. It provides options to adjust the voice characteristics, making it suitable for generating lifelike audio from text.
Name: Customizable name for identifying this node.
Connection: The connection configuration for ElevenLabs.
Model: Model name, e.g., eleven_monolingual_v1.
Voices: Select from available voices, e.g., Rachel, to match your required voice profile.
Stability: Controls the stability and consistency of the voice.
Similarity: Adjusts how closely the voice resembles the original.
Style Exaggeration: Amplifies the style of the speaker, enhancing expressiveness.
Speaker Boost: Toggle to increase the likeness to the selected voice.
text: Text input in string format for conversion to speech.
content: Audio output as bytes, containing the synthesized speech.
When used for tracing or as the final output, content is a base64-encoded string to facilitate easy handling and transport.
Type: ElevenLabs
Name: Customizable name for identifying this connection.
API key: Your API key
Add an Input node to pass in text data.
Add an OpenAI node to handle the question and return an answer (optional).
Connect the ElevenLabs TTS node to the OpenAI node and configure the model, voice, and settings as desired.
Attach a downstream node (e.g. Output) to save or process the generated audio content.
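For context on what the voice settings above map to, here is a sketch of how a TTS request could be assembled against the public ElevenLabs REST API. The endpoint path, the xi-api-key header, and the voice_settings field names (stability, similarity_boost, style, use_speaker_boost) reflect the public API but should be verified against the current ElevenLabs API reference; this builds the request without sending it:

```python
def build_tts_request(text, voice_id, api_key,
                      stability=0.5, similarity=0.75,
                      style=0.0, speaker_boost=True):
    """Assemble URL, headers, and JSON body for an ElevenLabs TTS call.
    Field names follow the public ElevenLabs REST API (verify before use)."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    headers = {"xi-api-key": api_key, "Content-Type": "application/json"}
    body = {
        "text": text,
        "model_id": "eleven_monolingual_v1",
        "voice_settings": {
            "stability": stability,          # voice stability/consistency
            "similarity_boost": similarity,  # closeness to the original voice
            "style": style,                  # style exaggeration
            "use_speaker_boost": speaker_boost,
        },
    }
    return url, headers, body
```

In a workflow you would not call this yourself; the node issues the request and returns the audio bytes as content.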
The ElevenLabs STS node enables the transformation of an audio input into a new synthesized audio output in the selected voice. This node is particularly useful for voice modulation or re-synthesis applications, where the input speech is “re-voiced” using ElevenLabs models.
Name: Customizable name for identifying this node.
Connection: The connection configuration for ElevenLabs.
Model: Model name, e.g., eleven_english_sts_v2.
Voices: Select from available voices, e.g., Dave, to match your required voice profile.
Stability: Controls the stability and consistency of the voice.
Similarity: Adjusts how closely the voice resembles the original.
Style Exaggeration: Amplifies the style of the speaker, enhancing expressiveness.
Speaker Boost: Toggle to increase the likeness to the selected voice.
audio: Audio file input in bytes or BytesIO format, representing the original speech to be transformed.
content: Audio output as bytes, containing the synthesized speech that mirrors the input but with the selected voice characteristics.
When used for tracing or as the final output, content is a base64-encoded string to facilitate easy handling and transport.
Type: ElevenLabs
Name: Customizable name for identifying this connection.
API key: Your API key
Add an Input node to provide the original audio file.
Connect it to the ElevenLabs STS node, select the desired model and voice, and configure the settings.
Make sure that the ElevenLabs STS node's Input section uses an input transformer like {"audio": "$.input.output.files[0]"} to pass the exact file from the list.
Attach a downstream node (e.g. Output) to export the generated audio content.