Gemini 3.1 Flash TTS
Generate expressive speech from text with Gemini 3.1 Flash TTS, including optional multi-speaker voice configuration.
Model Overview
Gemini 3.1 Flash TTS converts text prompts into speech audio and supports inline delivery cues such as [laughing], [whispering], and [short pause]. The model can run as a single-speaker voice or use a speakers list where prompt line prefixes match configured speaker aliases.
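To make the prompt conventions concrete, here is a minimal sketch that splits a script into speaker turns and collects the inline cues in each one. It is only an illustration of how such a script is laid out, not the model's own parser. It assumes one turn per line, alphanumeric prefixes followed by a colon, and cues written in square brackets; `parse_script` is a hypothetical helper name.

```python
import re

# Assumption: cues are written as [tag] and prefixes as "Alias:" at line start.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")
PREFIX_PATTERN = re.compile(r"^([A-Za-z0-9]+):\s*(.*)$")


def parse_script(prompt):
    """Split a script into turns with speaker, text, and inline tags."""
    turns = []
    for line in prompt.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between turns
        match = PREFIX_PATTERN.match(line)
        if match:
            speaker, text = match.group(1), match.group(2)
        else:
            speaker, text = None, line  # no prefix: treat as single-speaker text
        turns.append({
            "speaker": speaker,
            "text": text,
            "tags": TAG_PATTERN.findall(text),
        })
    return turns


script = "Host: Welcome back.\nDrChen: [excited] Hello [short pause] there."
for turn in parse_script(script):
    print(turn["speaker"], turn["tags"])
```

Running a script through a helper like this before synthesis is a quick way to see which lines carry a prefix and which cues each speaker will receive.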
Best At
- Dialogue-style text-to-speech with named speakers.
- Expressive narration controlled by natural-language style instructions.
- Speech prompts that use inline audio tags for pacing, tone, and delivery.
Limitations / Not Good At
- Speaker aliases must match the prefixes used in the prompt.
- The single-speaker voice setting is ignored when speaker groups are provided.
- Very long or inconsistently tagged scripts may need editing before synthesis.
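The alias constraint above can be checked before a script is sent for synthesis. The following sketch compares the prefixes found in a prompt with a list of configured aliases and reports mismatches. It assumes case-sensitive matching and one prefix per line, neither of which is confirmed by this page; `check_speakers` is a hypothetical helper name.

```python
import re

# Assumption: a prefix is an alphanumeric alias followed by a colon at line start.
PREFIX_RE = re.compile(r"^\s*([A-Za-z0-9]+):", re.MULTILINE)


def check_speakers(prompt, aliases):
    """Return a list of human-readable problems; an empty list means no mismatch."""
    problems = []
    for alias in aliases:
        # Aliases should be alphanumeric text without spaces.
        if not re.fullmatch(r"[A-Za-z0-9]+", alias):
            problems.append(f"alias {alias!r} is not alphanumeric without spaces")
    used = set(PREFIX_RE.findall(prompt))
    configured = set(aliases)
    for name in sorted(used - configured):
        problems.append(f"prefix {name!r} has no configured speaker")
    for name in sorted(configured - used):
        problems.append(f"speaker {name!r} never appears as a prefix")
    return problems


print(check_speakers("Host: Hi\nGuest: Hello", ["Host", "Dr Chen"]))
```

An empty result only means the names line up; it says nothing about script length or tag consistency, which still need a manual read.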
Ideal Use Cases
- Podcast or interview drafts with two or more named speakers.
- Narration, explainer audio, and character dialogue.
- Prototyping multilingual or expressive speech workflows from text.
Input & Output Format
- Input: required prompt, optional style_instructions, optional single-speaker voice, optional language_code, optional temperature, output_format, and zero or more speaker groups with speaker_id plus voice.
- Output: generated audio returned on response.
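As a rough illustration of the input shape, the sketch below assembles the listed fields into a plain dictionary. The field names come from the list above; the `speakers` key, the overall payload layout, and the `build_input` helper are assumptions for illustration, not a documented request schema.

```python
def build_input(prompt, style_instructions=None, voice=None, language_code=None,
                temperature=None, output_format="mp3", speakers=None):
    """Assemble an input dictionary, leaving out optional fields that are unset."""
    if not prompt:
        raise ValueError("prompt is required")
    payload = {"prompt": prompt, "output_format": output_format}
    optional = {
        "style_instructions": style_instructions,
        "voice": voice,
        "language_code": language_code,
        "temperature": temperature,
    }
    for key, value in optional.items():
        if value is not None:
            payload[key] = value
    if speakers:
        # Assumed key name "speakers"; each group pairs speaker_id with voice.
        payload["speakers"] = [
            {"speaker_id": alias, "voice": speaker_voice}
            for alias, speaker_voice in speakers
        ]
        # The single-speaker voice is ignored with speaker groups, so drop it.
        payload.pop("voice", None)
    return payload


payload = build_input(
    "Host: Welcome back.\nDrChen: [excited] Thanks for having me.",
    style_instructions="Warm, conversational pacing.",
    speakers=[("Host", "Kore"), ("DrChen", "Kore")],
)
print(payload)
```

Both speakers use Kore here only because it is the one voice name this page shows; in practice each speaker would usually get a distinct preset.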
Performance Notes
- Fal prices this model per 1000 input characters.
- Multi-speaker synthesis is enabled only when at least one speaker group is present.
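Both notes reduce to simple arithmetic and a length check, sketched below. The rate used is a made-up placeholder, not Fal's actual price, and the sketch assumes billing is proportional to prompt length; whether usage is rounded up or whether style instructions count as input characters is not stated here. `estimate_cost` and `is_multi_speaker` are hypothetical helper names.

```python
def estimate_cost(prompt, price_per_1000_chars):
    """Estimate cost assuming proportional billing on prompt characters only."""
    return len(prompt) / 1000 * price_per_1000_chars


def is_multi_speaker(speakers):
    """Multi-speaker synthesis applies when at least one speaker group exists."""
    return bool(speakers) and len(speakers) > 0


# Placeholder rate for illustration only; check the current price list.
HYPOTHETICAL_RATE = 0.10

script = "Host: Welcome back.\nDrChen: [excited] Thanks for having me."
print(len(script), "characters")
print("estimated cost:", estimate_cost(script, HYPOTHETICAL_RATE))
print("multi-speaker:", is_multi_speaker([{"speaker_id": "Host", "voice": "Kore"}]))
```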
Prompt
String. Text to convert to speech. Use speaker prefixes that match configured speaker aliases for multi-speaker scripts.
Example: Host: Welcome back. DrChen: [excited] Gemini TTS can now generate expressive multi-speaker dialogue from a script.
Style Instructions
String. Optional natural-language instructions for tone, pace, accent, emotion, or delivery style.
Voice
String. Voice preset for single-speaker synthesis. Ignored when speaker groups are configured.
Default: Kore
Language Code
String. Optional language hint, such as English (US), Japanese (Japan), or Chinese Mandarin (China). Leave empty for auto-detect.
Speakers
Inferred. Repeatable speaker configs for multi-speaker synthesis. Speaker aliases must match prefixes in the prompt.
Speaker Alias
String. Alias used in the prompt, for example 'Host' in 'Host: Welcome back'. Use alphanumeric text without spaces.
Voice
String. Voice preset for this speaker.
Default: Kore
Temperature
Number. Controls delivery variation. Lower values are more predictable; higher values are more varied.
Default: 1
Output Format
String. Generated audio file format.
Default: mp3
Audio
Inferred. Generated speech audio.
Nodespell Team
Type: Node
Status: Official
Package: Nodespell AI
Category: AI / Audio / Google