Advanced Features
This guide covers advanced features of PyKokoro for power users.
Note
Use KokoroPipeline as the supported interface. Legacy Kokoro snippets
can be updated by replacing kokoro.create with pipe.run.
Voice Blending
Create custom voices by blending multiple voices together.
Basic Voice Blending
from pykokoro import Kokoro, VoiceBlend

with Kokoro() as kokoro:
    # Blend two voices equally
    blend = VoiceBlend.parse("af_bella + af_sarah")
    audio, sr = kokoro.create(
        "This is a blended voice",
        voice=blend
    )
Weighted Blending
Control the contribution of each voice:
from pykokoro import Kokoro, VoiceBlend

with Kokoro() as kokoro:
    # 70% bella, 30% sarah
    blend = VoiceBlend.parse("af_bella*0.7 + af_sarah*0.3")
    audio, sr = kokoro.create(
        "Weighted blend",
        voice=blend
    )
    # Percentage notation (normalized automatically)
    blend2 = VoiceBlend.parse("af_bella*70% + af_sarah*30%")
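To make the blend notation concrete, here is a minimal sketch of how an expression such as "af_bella*70% + af_sarah*30%" could be parsed and normalized. The helper parse_blend is hypothetical and written for illustration only; it is not PyKokoro's parser, and VoiceBlend.parse may differ in the details it accepts.

```python
def parse_blend(expr: str) -> dict[str, float]:
    """Parse 'voice*weight + voice*weight' into normalized weights.

    Hypothetical helper for illustration; not part of PyKokoro.
    """
    raw: dict[str, float] = {}
    for term in expr.split("+"):
        term = term.strip()
        if not term:
            raise ValueError(f"empty term in blend expression: {expr!r}")
        name, _, weight_text = term.partition("*")
        name = name.strip()
        weight_text = weight_text.strip()
        if not weight_text:
            weight = 1.0  # no explicit weight: equal share before normalization
        elif weight_text.endswith("%"):
            weight = float(weight_text[:-1]) / 100.0  # percentage notation
        else:
            weight = float(weight_text)
        if weight < 0:
            raise ValueError(f"negative weight for voice {name!r}")
        raw[name] = raw.get(name, 0.0) + weight
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("blend weights must sum to a positive value")
    # Normalize so the weights always sum to 1.
    return {name: weight / total for name, weight in raw.items()}


print(parse_blend("af_bella + af_sarah"))
print(parse_blend("af_bella*70% + af_sarah*30%"))
```

Normalizing at the end means that weights given as fractions, percentages, or arbitrary ratios all describe the same kind of blend.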
Multiple Voice Blending
Blend more than two voices:
from pykokoro import Kokoro, VoiceBlend

with Kokoro() as kokoro:
    # Three-way blend
    blend = VoiceBlend.parse(
        "af_bella*0.5 + af_sarah*0.3 + af_nicole*0.2"
    )
    audio, sr = kokoro.create(
        "Complex blend",
        voice=blend
    )
Creating Blended Voice Programmatically
from pykokoro import Kokoro, VoiceBlend

# Create blend object directly
blend = VoiceBlend(
    voices=["af_bella", "af_sarah", "am_adam"],
    weights=[0.4, 0.4, 0.2]
)

with Kokoro() as kokoro:
    audio, sr = kokoro.create(
        "Custom blend",
        voice=blend
    )
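Conceptually, a blend is a weighted combination of the individual voice embeddings. The sketch below illustrates that idea with plain Python lists. The function blend_embeddings and the toy two-element vectors are assumptions made for illustration; real voice embeddings are larger arrays loaded from the voices file, and PyKokoro's internal implementation may differ.

```python
def blend_embeddings(embeddings, weights):
    """Return the weighted average of equal-length vectors.

    embeddings: dict mapping voice name -> list of floats (toy data here)
    weights:    dict mapping voice name -> non-negative weight
    Hypothetical helper for illustration; not part of PyKokoro.
    """
    names = list(weights)
    if not names:
        raise ValueError("at least one voice is required")
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    length = len(embeddings[names[0]])
    if any(len(embeddings[name]) != length for name in names):
        raise ValueError("all embeddings must have the same length")

    blended = [0.0] * length
    for name in names:
        share = weights[name] / total  # normalized contribution of this voice
        for i, value in enumerate(embeddings[name]):
            blended[i] += share * value
    return blended


toy = {"af_bella": [1.0, 0.0], "af_sarah": [0.0, 1.0]}
print(blend_embeddings(toy, {"af_bella": 0.75, "af_sarah": 0.25}))
```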
Phoneme-Based Generation
For precise control, generate speech directly from phonemes.
Using create_from_phonemes()
from pykokoro import Kokoro

with Kokoro() as kokoro:
    # Get phonemes for text
    phonemes = kokoro.tokenizer.phonemize("Hello, world!")
    # Generate from phonemes
    audio, sr = kokoro.create_from_phonemes(
        phonemes,
        voice="af_bella",
        speed=1.0
    )
Text to Phonemes
Convert text to phonemes:
from pykokoro import Kokoro

with Kokoro() as kokoro:
    # Get phonemes
    phonemes = kokoro.tokenizer.phonemize(
        "Hello, world!",
        lang="en-us"
    )
    print(f"Phonemes: {phonemes}")
    # Get detailed phoneme info
    result = kokoro.tokenizer.text_to_phonemes(
        "Hello",
        lang="en-us",
        with_words=True
    )
    print(result)
PhonemeSegment Processing
Work with phoneme segments for batch processing:
from pykokoro import phonemize_text_list, create_tokenizer

tokenizer = create_tokenizer()
texts = ["Hello", "World", "How are you?"]
segments = phonemize_text_list(texts, tokenizer, lang="en-us")
for segment in segments:
    print(f"Text: {segment.text}")
    print(f"Phonemes: {segment.phonemes}")
    print(f"Tokens: {segment.tokens}")
Advanced Text Splitting
Split and Phonemize in One Step (Legacy API)
Use split_and_phonemize_text for advanced text processing with automatic splitting and phoneme generation.
The split_and_phonemize_text function intelligently handles long text by:
Splitting text using your chosen mode (paragraph, sentence, or clause)
Phonemizing each segment
Automatically cascading to finer split modes if segments exceed the phoneme limit
Only truncating as a last resort (when even individual words are too long)
Cascade behavior:
paragraph mode → sentence → clause → word → truncate
sentence mode → clause → word → truncate
clause mode → word → truncate
word mode → truncate (with warning)
from pykokoro import split_and_phonemize_text, create_tokenizer

tokenizer = create_tokenizer()
long_text = """
This is the first sentence. This is the second.

This is a new paragraph.
"""
# The function automatically ensures all segments stay within limit
segments = split_and_phonemize_text(
    long_text,
    tokenizer,
    lang="en-us",
    split_mode="sentence",  # Will cascade to clause/word if needed
    max_phoneme_length=510  # Kokoro's maximum
)
for segment in segments:
    print(f"Paragraph {segment.paragraph}, Sentence {segment.sentence}")
    print(f"Text: {segment.text}")
    print(f"Phonemes: {segment.phonemes[:50]}...")
    # len(segment.phonemes) is guaranteed to be <= 510
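The cascade described above can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, under stated assumptions: cascade_split is a hypothetical function, it measures plain character length instead of phoneme length, it uses simple regular expressions instead of spaCy, and in word mode it emits one segment per word.

```python
import re

# Coarse-to-fine order, mirroring the cascade described above.
MODES = ["paragraph", "sentence", "clause", "word"]

# Simplified splitters (assumptions): the real library may split differently.
SPLITTERS = {
    "paragraph": lambda text: re.split(r"\n\s*\n", text),
    "sentence": lambda text: re.split(r"(?<=[.!?])\s+", text),
    "clause": lambda text: re.split(r"(?<=[,;:.!?])\s+", text),
    "word": lambda text: text.split(),
}


def cascade_split(text, mode="sentence", limit=510):
    """Split text by mode; oversized pieces cascade to the next finer mode."""
    pieces = [p.strip() for p in SPLITTERS[mode](text) if p.strip()]
    finer_modes = MODES[MODES.index(mode) + 1:]
    result = []
    for piece in pieces:
        if len(piece) <= limit:
            result.append(piece)
        elif finer_modes:
            # Too long: retry this piece with the next finer split mode.
            result.extend(cascade_split(piece, finer_modes[0], limit))
        else:
            # Word mode and still too long: truncate as a last resort.
            result.append(piece[:limit])
    return result


sample = "One two three, four five. Six seven."
print(cascade_split(sample, "sentence", limit=15))
```

With a limit of 15 characters, the first sentence is too long and is split again at the comma, while the second sentence is kept whole.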
Split Modes in Detail (Legacy API)
Paragraph Mode:
segments = split_and_phonemize_text(
text,
tokenizer,
split_mode="paragraph" # Splits on double newlines
)
Sentence Mode:
Requires spaCy for sentence boundary detection:
segments = split_and_phonemize_text(
text,
tokenizer,
split_mode="sentence" # Splits on sentence boundaries
)
Clause Mode:
Splits on both sentences and commas for finer control:
segments = split_and_phonemize_text(
text,
tokenizer,
split_mode="clause" # Splits on sentences and commas
)
Pause Mode
The modern Kokoro.create() API uses pause_mode for controlling pause behavior:
from pykokoro import Kokoro

with Kokoro() as kokoro:
    # Default: TTS controls pauses naturally
    audio, sr = kokoro.create(
        long_text,
        voice="af_sarah"
    )
    # Auto mode: PyKokoro inserts boundary pauses
    audio, sr = kokoro.create(
        long_text,
        voice="af_sarah",
        pause_mode="auto",
        pause_clause=0.25,     # Clause boundaries
        pause_sentence=0.5,    # Sentence boundaries
        pause_paragraph=1.0,   # Paragraph boundaries
        pause_variance=0.05,   # Natural variance (Gaussian)
        random_seed=42         # For reproducible results
    )
    # Manual mode: PyKokoro trims and preserves explicit pauses
    audio, sr = kokoro.create(
        long_text,
        voice="af_sarah",
        pause_mode="manual",   # Preserve explicit pauses
        pause_clause=0.25,     # Clause boundaries
        pause_sentence=0.5,    # Sentence boundaries
        pause_paragraph=1.0,   # Paragraph boundaries
        pause_variance=0.05,   # Natural variance (Gaussian)
        random_seed=42         # For reproducible results
    )
Pause Variance Details:
The pause_variance parameter adds Gaussian (normal distribution) variance to
pause durations, making speech sound more natural:
0.0 - No variance, exact pause durations
0.05 - Default, ±100ms at 95% confidence interval
0.1 - Higher variance, ±200ms at 95% confidence
The variance ensures that pauses are never exactly the same length, mimicking natural human speech rhythm.
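The figures above are consistent with treating pause_variance as a standard deviation in seconds: for a normal distribution, about 95% of draws fall within 1.96 standard deviations, and 1.96 × 0.05 s ≈ 0.1 s. The following sketch illustrates that reading with the standard library. The function jittered_pauses is hypothetical, and clamping at zero is an assumption; PyKokoro's own sampling may differ.

```python
import random


def jittered_pauses(base, count, variance=0.05, seed=None):
    """Draw pause durations (seconds) around `base` with Gaussian jitter.

    `variance` is used as the standard deviation (an assumption).
    A fixed `seed` makes the sequence reproducible.
    """
    rng = random.Random(seed)
    # Clamp at zero so a large negative draw never yields a negative pause.
    return [max(0.0, rng.gauss(base, variance)) for _ in range(count)]


first = jittered_pauses(0.5, 5, variance=0.05, seed=42)
second = jittered_pauses(0.5, 5, variance=0.05, seed=42)
print(first == second)                         # same seed, same pauses
print(jittered_pauses(0.5, 3, variance=0.0))   # no jitter: exact durations
```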
Reproducibility:
Use random_seed for consistent output across runs:
# Same output every time
audio1, sr = kokoro.create(text, voice="af_sarah",
pause_mode="auto",
random_seed=42)
audio2, sr = kokoro.create(text, voice="af_sarah",
pause_mode="auto",
random_seed=42)
# audio1 and audio2 are identical
# Different output each time
audio3, sr = kokoro.create(text, voice="af_sarah",
pause_mode="auto",
random_seed=None) # or omit parameter
Custom Warning Callbacks
Handle warnings during phoneme generation:
from pykokoro import split_and_phonemize_text, create_tokenizer

def my_warning_handler(message):
    print(f"WARNING: {message}")

tokenizer = create_tokenizer()
segments = split_and_phonemize_text(
    very_long_text,
    tokenizer,
    warn_callback=my_warning_handler
)
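In longer-running jobs it can be more useful to collect warnings than to print them. The sketch below builds such a callback with the standard logging module. It assumes, as in the example above, that the callback receives a single message string; make_warning_collector and the logger name are hypothetical.

```python
import logging

# Hypothetical logger name, chosen for this example only.
logger = logging.getLogger("tts.warnings")


def make_warning_collector():
    """Return (callback, messages); the callback records every warning."""
    messages = []

    def handler(message):
        messages.append(message)        # keep for later inspection
        logger.warning("%s", message)   # also route through logging

    return handler, messages


handler, seen = make_warning_collector()
handler("segment truncated")            # stands in for a library call
print(seen)
```

The returned handler could then be passed as warn_callback, and the collected list inspected after processing finishes.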
GPU Acceleration
Automatic GPU Detection
PyKokoro automatically uses GPU if available:
from pykokoro import Kokoro, get_device

# Check available device
device = get_device()
print(f"Using device: {device}")
# Kokoro will use GPU automatically
with Kokoro() as kokoro:
    audio, sr = kokoro.create("Hello!", voice="af_bella")
Forcing Specific Device
from pykokoro import Kokoro
# Force CPU
kokoro_cpu = Kokoro(device="cpu")
# Force CUDA (NVIDIA)
kokoro_gpu = Kokoro(device="cuda")
# Force ROCm (AMD)
kokoro_rocm = Kokoro(device="rocm")
GPU Information
from pykokoro import get_gpu_info
info = get_gpu_info()
print(f"Device: {info['device']}")
print(f"Providers: {info['providers']}")
Custom Model Paths
Model Sources
PyKokoro supports multiple model sources:
HuggingFace (Default):
from pykokoro import Kokoro
# Default: HuggingFace with 54 voices
kokoro = Kokoro(model_source="huggingface", model_quality="fp32")
HuggingFace v1.0 (Default - 54 voices, 8 quality options):
# Default: HuggingFace v1.0
kokoro = Kokoro(model_quality="q8") # Recommended default
HuggingFace v1.1-zh (103 voices, 8 quality options):
# HuggingFace v1.1-zh with English + Chinese voices
# Supports all quantization levels: fp32, fp16, q8, q8f16, q4, q4f16, uint8, uint8f16
kokoro = Kokoro(
model_variant="v1.1-zh",
model_quality="q8" # All qualities available
)
# Use English voices
audio, sr = kokoro.create(
"Hello world!",
voice="af_maple", # v1.1-zh English voice
lang="en-us"
)
GitHub v1.0 (54 voices, 4 quality options):
# GitHub v1.0 with GPU-optimized fp16
kokoro = Kokoro(
model_source="github",
model_variant="v1.0",
model_quality="fp16-gpu" # Options: fp32, fp16, fp16-gpu, q8
)
GitHub v1.1-zh (103 voices, fp32 only):
# GitHub v1.1-zh with English + Chinese voices
kokoro = Kokoro(
model_source="github",
model_variant="v1.1-zh",
model_quality="fp32" # Only fp32 available
)
# Use English voices
audio, sr = kokoro.create(
"Hello world!",
voice="af_maple", # v1.1-zh English voice
lang="en-us"
)
Note: Chinese text generation is currently in development. Use English voices from v1.1-zh with English text for now.
Use Custom Model Files
from pykokoro import Kokoro
kokoro = Kokoro(
model_path="/path/to/custom/model.onnx",
voices_path="/path/to/voices.bin"
)
Download Models Manually
HuggingFace Models:
from pykokoro import download_model, download_voice, download_all_models
# Download specific model quality (v1.0 by default)
download_model(variant="v1.0", quality="q8")
# Download specific voice
download_voice(voice_name="af_bella", variant="v1.0")
# Download all models
download_all_models(variant="v1.0")
GitHub Models:
from pykokoro.onnx_backend import (
download_model_github,
download_voices_github,
download_all_models_github
)
# Download GitHub v1.0 model
download_model_github(variant="v1.0", quality="fp16-gpu")
# Download GitHub v1.0 voices
download_voices_github(variant="v1.0")
# Download all GitHub v1.1-zh files
download_all_models_github(
variant="v1.1-zh",
quality="fp32",
progress_callback=lambda msg: print(msg)
)
HuggingFace v1.1-zh Models:
from pykokoro import (
download_model,
download_all_voices,
download_all_models,
download_config
)
# Download HuggingFace v1.1-zh model (with quantization)
download_model(variant="v1.1-zh", quality="q8")
# Download all 103 voices for v1.1-zh
def progress(voice_name, current, total):
    print(f"Downloading {current}/{total}: {voice_name}")

download_all_voices(variant="v1.1-zh", progress_callback=progress)
# Download configuration for v1.1-zh
download_config(variant="v1.1-zh")
# Download everything (model + config + all voices)
download_all_models(
variant="v1.1-zh",
quality="q8",
progress_callback=lambda msg: print(msg)
)
Available Quality Options by Source:
HuggingFace v1.0: fp32, fp16, q8, q8f16, q4, q4f16, uint8, uint8f16
HuggingFace v1.1-zh: fp32, fp16, q8, q8f16, q4, q4f16, uint8, uint8f16
GitHub v1.0: fp32, fp16, fp16-gpu, q8
GitHub v1.1-zh: fp32 only
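The list above can be encoded as a small lookup table for validating a configuration before any download starts. This is a sketch built only from the options listed here; QUALITY_OPTIONS and validate_quality are hypothetical names, not part of PyKokoro.

```python
HF_QUALITIES = {"fp32", "fp16", "q8", "q8f16", "q4", "q4f16", "uint8", "uint8f16"}

# (model_source, model_variant) -> allowed model_quality values, per the list above.
QUALITY_OPTIONS = {
    ("huggingface", "v1.0"): HF_QUALITIES,
    ("huggingface", "v1.1-zh"): HF_QUALITIES,
    ("github", "v1.0"): {"fp32", "fp16", "fp16-gpu", "q8"},
    ("github", "v1.1-zh"): {"fp32"},
}


def validate_quality(source, variant, quality):
    """Return `quality` if it is listed for the source/variant, else raise ValueError."""
    allowed = QUALITY_OPTIONS.get((source, variant))
    if allowed is None:
        raise ValueError(f"unknown source/variant: {source}/{variant}")
    if quality not in allowed:
        raise ValueError(
            f"{quality!r} is not available for {source} {variant}; "
            f"choose one of {sorted(allowed)}"
        )
    return quality


print(validate_quality("github", "v1.0", "fp16-gpu"))
```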
Get Model Paths
from pykokoro import get_model_path, get_voice_path
# HuggingFace model paths
model_path = get_model_path(quality="q8")
voice_path = get_voice_path()
print(f"Model: {model_path}")
print(f"Voices: {voice_path}")
# GitHub model paths are stored in variant-specific subdirectories:
# ~/.cache/pykokoro/models/onnx/v1.0/kokoro-v1.0.onnx
# ~/.cache/pykokoro/models/onnx/v1.1-zh/kokoro-v1.1-zh.onnx
# ~/.cache/pykokoro/voices/v1.0/voices-v1.0.bin
# ~/.cache/pykokoro/voices/v1.1-zh/voices-v1.1-zh.bin
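The layout shown in the comments above follows a regular pattern, which the sketch below reproduces with pathlib. The function github_cache_paths is hypothetical, and the default root of ~/.cache/pykokoro is taken from the comments above; the actual cache location may vary by platform, so get_user_cache_path() remains the authoritative source.

```python
from pathlib import Path


def github_cache_paths(variant, cache_root=None):
    """Build the expected model and voices paths for a GitHub variant.

    Mirrors the directory pattern shown above; illustration only.
    """
    root = Path(cache_root) if cache_root else Path.home() / ".cache" / "pykokoro"
    model = root / "models" / "onnx" / variant / f"kokoro-{variant}.onnx"
    voices = root / "voices" / variant / f"voices-{variant}.bin"
    return model, voices


model, voices = github_cache_paths("v1.1-zh")
print(model)
print(voices)
print(model.exists())  # False until the model has been downloaded
```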
Advanced Tokenizer Configuration
Custom Tokenizer Settings
from pykokoro import create_tokenizer, TokenizerConfig
# Custom tokenizer config
# Note: espeak_config must already be defined; its construction is not shown here.
tokenizer_config = TokenizerConfig(
    vocab_path="/path/to/vocab.txt",
    espeak_config=espeak_config
)
tokenizer = create_tokenizer(config=tokenizer_config)
Mixed Language Support
For text with multiple languages:
from pykokoro import create_tokenizer, TokenizerConfig
config = TokenizerConfig(
enable_mixed_language=True,
primary_language="en-us",
allowed_languages=["en-us", "es", "fr"],
language_confidence_threshold=0.7
)
tokenizer = create_tokenizer(config=config)
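One plausible reading of these settings is that a detected language is used only when it is in allowed_languages and the detector's confidence reaches the threshold; otherwise the primary language is used. The sketch below shows that decision rule in isolation. It is an assumption about the behavior, not a description of PyKokoro's implementation, and resolve_language is a hypothetical helper.

```python
def resolve_language(
    detected,
    confidence,
    primary="en-us",
    allowed=("en-us", "es", "fr"),
    threshold=0.7,
):
    """Pick the language for a text segment.

    Falls back to `primary` when the detection is not allowed
    or not confident enough (assumed rule, for illustration).
    """
    if detected in allowed and confidence >= threshold:
        return detected
    return primary


print(resolve_language("es", 0.92))  # confident and allowed
print(resolve_language("es", 0.40))  # below threshold: primary language
print(resolve_language("de", 0.99))  # not in allowed list: primary language
```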
Audio Trimming
Trim Silence from Audio
from pykokoro import Kokoro, trim

# Generate audio with silence
with Kokoro() as kokoro:
    audio, sr = kokoro.create("Hello!", voice="af_bella")
# Trim silence
trimmed_audio, trim_info = trim(audio)
print(f"Original: {len(audio)} samples")
print(f"Trimmed: {len(trimmed_audio)} samples")
print(f"Trim info: {trim_info}")
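For intuition, silence trimming can be thought of as dropping leading and trailing samples whose amplitude stays below a threshold. The sketch below does this on a plain list of floats. It is a simplified illustration, not the algorithm used by pykokoro.trim; trim_silence, its threshold, and the (start, end) return value are all assumptions.

```python
def trim_silence(samples, threshold=0.01):
    """Strip leading/trailing samples with |amplitude| <= threshold.

    Returns (trimmed_samples, (start, end)), where start/end are the
    indices of the kept slice in the original sequence.
    """
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) <= threshold:
        start += 1
    while end > start and abs(samples[end - 1]) <= threshold:
        end -= 1
    return samples[start:end], (start, end)


audio = [0.0, 0.0, 0.5, -0.3, 0.0]
trimmed, bounds = trim_silence(audio)
print(trimmed, bounds)
```

Silence in the middle of the signal is left untouched; only the edges are trimmed.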
Short Sentence Handling
PyKokoro improves short, single-word sentences by surrounding the word with a
pause marker. You can tune settings via ShortSentenceConfig:
from pykokoro.short_sentence_handler import ShortSentenceConfig
short_config = ShortSentenceConfig(
phoneme_pretext="…",
)
Configuration Management
Save and Load Configuration
from pykokoro import save_config, load_config
# Save configuration
config = {
"default_voice": "af_bella",
"default_speed": 1.0,
"model_quality": "q8"
}
save_config(config, "my_config.json")
# Load configuration
loaded_config = load_config("my_config.json")
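A common pattern is to overlay a saved configuration on top of defaults, so that a file containing only a few keys still yields a complete configuration. The sketch below shows this with the standard json module, assuming the configuration is stored as a JSON object (as the file name above suggests). The functions save_json and load_with_defaults are hypothetical and independent of save_config and load_config.

```python
import json
from pathlib import Path

# Default values taken from the example above.
DEFAULTS = {
    "default_voice": "af_bella",
    "default_speed": 1.0,
    "model_quality": "q8",
}


def save_json(config, path):
    """Write a configuration dict as pretty-printed JSON."""
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")


def load_with_defaults(path):
    """Load a JSON config and fill in any missing keys from DEFAULTS."""
    merged = dict(DEFAULTS)
    config_file = Path(path)
    if config_file.exists():
        merged.update(json.loads(config_file.read_text(encoding="utf-8")))
    return merged
```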
Get Cache Paths
from pykokoro import get_user_cache_path, get_user_config_path
cache_path = get_user_cache_path()
config_path = get_user_config_path()
print(f"Cache: {cache_path}")
print(f"Config: {config_path}")
Performance Tips
Reuse Kokoro Instance
Don’t create a new Kokoro() for each request - initialize once and reuse.
Use GPU When Available
GPU acceleration provides 3-10x speedup.
Batch Processing
Process multiple texts in one session to avoid initialization overhead.
Choose Appropriate Model Quality
Use q6 or q8 for production; fp16 only when quality is critical.
Use pause_mode for Long Text
Using pause_mode="auto" with appropriate pause settings improves quality for long text.
Internal Architecture
Understanding PyKokoro’s Internal Structure
PyKokoro uses a modular architecture with specialized manager classes for different responsibilities:
OnnxSessionManager (pykokoro/onnx_session.py)
Manages ONNX Runtime session creation and configuration:
Automatic provider selection (CUDA → ROCm → CPU)
Session options and execution providers
Fallback handling when GPU is unavailable
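The CUDA → ROCm → CPU order can be expressed as a simple priority search over the providers that ONNX Runtime reports as available. The sketch below uses ONNX Runtime's provider names but takes the available list as an argument, so it runs without onnxruntime installed; in practice that list would typically come from onnxruntime.get_available_providers(). The function pick_provider is hypothetical and not part of OnnxSessionManager.

```python
# Preferred order, matching the selection order described above.
PROVIDER_PRIORITY = [
    "CUDAExecutionProvider",
    "ROCMExecutionProvider",
    "CPUExecutionProvider",
]


def pick_provider(available):
    """Return the highest-priority execution provider present in `available`."""
    for provider in PROVIDER_PRIORITY:
        if provider in available:
            return provider
    raise RuntimeError(f"no supported execution provider in {list(available)}")


print(pick_provider(["CPUExecutionProvider"]))
print(pick_provider(["CPUExecutionProvider", "CUDAExecutionProvider"]))
```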
VoiceManager (pykokoro/voice_manager.py)
Handles voice loading and blending:
Loads voice embeddings from binary files
Implements voice blending with weighted combinations
Validates voice availability across model variants
AudioGenerator (pykokoro/audio_generator.py)
Manages the audio generation pipeline:
Converts phonemes to tokens
Runs ONNX inference for audio generation
Handles speed adjustment and audio post-processing
MixedLanguageHandler (pykokoro/mixed_language_handler.py)
Automatic language detection for multilingual text:
Detects language boundaries in mixed-language text
Routes text segments to appropriate language models
Configurable confidence thresholds
PhonemeDictionary (pykokoro/phoneme_dictionary.py)
Custom word-to-phoneme mappings:
Override default pronunciation for specific words
Support for context-aware phoneme substitution
JSON-based dictionary format
Using Manager Classes Directly
While most users interact with the high-level Kokoro API, advanced users can work with manager classes directly:
from pykokoro.onnx_session import OnnxSessionManager
from pykokoro.voice_manager import VoiceManager
from pykokoro.audio_generator import AudioGenerator
# Create ONNX session with custom options
session_manager = OnnxSessionManager(
device="cuda",
providers=["CUDAExecutionProvider"],
user_session_options={"intra_op_num_threads": 4}
)
session = session_manager.create_session(
model_path="/path/to/model.onnx"
)
# Load voices with custom blend
voice_manager = VoiceManager(model_source="huggingface")
voice_manager.load_voices("/path/to/voices.bin")
voice_data = voice_manager.get_blended_voice("af_bella*0.7 + af_sarah*0.3")
# Generate audio
audio_gen = AudioGenerator(
session=session,
sample_rate=24000,
lang="en-us"
)
audio = audio_gen.generate_audio_from_phonemes(
phonemes="həˈloʊ wɝld",
voice_data=voice_data,
speed=1.0
)
Custom Phoneme Dictionaries
Create custom pronunciation mappings:
from pykokoro.phoneme_dictionary import PhonemeDictionary
# Create dictionary
dictionary = PhonemeDictionary()
# Add custom pronunciations
dictionary.add_word("PyKokoro", "paɪ kəˈkɔɹoʊ")
dictionary.add_word("ONNX", "ɑnɪks")
# Save to file
dictionary.save("custom_pronunciations.json")
# Load and use
loaded_dict = PhonemeDictionary.load("custom_pronunciations.json")
# Apply to tokenizer
from pykokoro import create_tokenizer
tokenizer = create_tokenizer()
tokenizer.phoneme_dictionary = loaded_dict
# Now "PyKokoro" will use custom pronunciation
phonemes = tokenizer.phonemize("Welcome to PyKokoro!")
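To show what a word-level override amounts to, the sketch below substitutes dictionary entries for matching words and hands every other word to a fallback function. It is a simplified illustration and not PyKokoro's implementation: phonemize_with_overrides is hypothetical, the case-insensitive lookup is an assumption, and the fallback used here merely lowercases the word in place of a real phonemizer.

```python
import re


def phonemize_with_overrides(text, overrides, fallback):
    """Replace known words with their custom phonemes; use `fallback` otherwise.

    overrides: dict mapping word -> phoneme string
    fallback:  callable taking a word and returning its phonemes
    """
    lookup = {word.lower(): phonemes for word, phonemes in overrides.items()}
    parts = []
    # Tokens alternate between runs of word characters and everything else.
    for token in re.findall(r"\w+|\W+", text):
        if re.fullmatch(r"\w+", token):
            key = token.lower()
            parts.append(lookup[key] if key in lookup else fallback(token))
        else:
            parts.append(token)  # punctuation and spacing pass through unchanged
    return "".join(parts)


custom = {"PyKokoro": "paɪ kəˈkɔɹoʊ"}
# Stand-in fallback: a real one would call the tokenizer's phonemizer.
print(phonemize_with_overrides("Welcome to PyKokoro!", custom, str.lower))
```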
Next Steps
Examples - Real-world usage examples
API Reference - Complete API documentation