Kokoro 82M
High-quality, efficient text-to-speech model (82M parameters) based on StyleTTS2.
Model Overview
A text-to-speech (TTS) model that generates natural-sounding speech from text, based on the StyleTTS2 architecture with 82 million parameters.
Best At
Generating clear and expressive speech across multiple languages, including American English, British English, French, Hindi, Italian, and Japanese. It offers a good balance of quality and efficiency.
Limitations / Not Good At
Voice quality and availability vary by language. Some languages have fewer voices, or voices trained on less audio, which can make the synthesized speech sound less natural.
Ideal Use Cases
- Creating voiceovers for videos and presentations
- Generating audiobooks or podcast segments
- Developing interactive voice response (IVR) systems
- Accessibility tools for content creators
- Prototyping voice applications
Input & Output Format
- Input: Text (string), Voice (string, optional), Speed (number, optional)
- Output: URI of the generated audio file (e.g., WAV)
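As an illustration of the input format above, here is a minimal Python sketch that assembles the three inputs into a request payload. The `KokoroInput` class, its `to_payload` method, and the lowercase key names `text`, `voice`, and `speed` are assumptions made for this example; they are not a documented Nodespell API. The check that speed is positive is also an assumption, since the page does not state the allowed range.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class KokoroInput:
    """Hypothetical container mirroring the three documented inputs."""

    text: str
    voice: Optional[str] = None    # optional: left out of the payload when not set
    speed: Optional[float] = None  # optional: left out of the payload when not set

    def to_payload(self) -> dict:
        """Validate the fields and drop optional ones that were not set."""
        if not isinstance(self.text, str) or not self.text.strip():
            raise ValueError("text must be a non-empty string")
        # Assumed rule: a speed multiplier only makes sense when positive.
        if self.speed is not None and self.speed <= 0:
            raise ValueError("speed must be a positive number")
        return {key: value for key, value in asdict(self).items() if value is not None}


payload = KokoroInput(text="Hello from Kokoro.", voice="af_bella", speed=1.0).to_payload()
print(payload)
```

Omitting `voice` and `speed` leaves only the `text` key in the payload, so the node's own defaults would apply.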
Performance Notes
The model is fast and cost-efficient because of its relatively small size (82M parameters). Long text inputs are split automatically before synthesis.
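The page does not describe how long text is split. The sketch below shows one common approach, sentence-boundary chunking with a character limit, only to illustrate the idea. The function name `split_text` and the 200-character default are assumptions for this example and do not reflect the node's actual behavior.

```python
import re
from typing import List


def split_text(text: str, max_chars: int = 200) -> List[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    # Break after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A sentence longer than the limit is wrapped on word boundaries,
        # or cut hard if it contains no space within the limit.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:cut].rstrip())
            sentence = sentence[cut:].lstrip()
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


for chunk in split_text("One. Two! Three? Four.", max_chars=10):
    print(chunk)
```

Each chunk could then be synthesized separately and the resulting audio concatenated in order. Splitting at sentence ends keeps pauses and intonation inside a chunk rather than across a cut.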
Inputs
- Text (String): Text to convert to speech. Long text is automatically split.
- Speed (Number, default: 1): Speech speed multiplier (0.5 = half speed, 2.0 = double speed).
- Voice (String, default: af_bella): Voice to use for synthesis.
Outputs
- Output (Inferred)
Type: Node
Status: Official
Package: Nodespell AI
Category: AI / Audio / Kokoro