Pipeline Usage and Stages ========================= ``KokoroPipeline`` is the configurable engine behind the high-level ``Kokoro`` class. Use it when you want to swap parsing/segmentation stages, run custom G2P logic, or control model loading at a lower level. Pipeline overview ----------------- The default pipeline wiring is: ``doc_parser -> g2p -> phoneme_processing -> audio_generation -> audio_postprocessing`` Default stage classes: * ``SsmdDocumentParser`` * ``KokoroG2PAdapter`` * ``OnnxPhonemeProcessorAdapter`` * ``OnnxAudioGenerationAdapter`` * ``OnnxAudioPostprocessingAdapter`` If any of the audio stages are omitted, the pipeline builds a ``Kokoro`` ONNX backend and wires the missing adapters automatically. Quick start ----------- .. code-block:: python from pykokoro import GenerationConfig, KokoroPipeline, PipelineConfig config = PipelineConfig( voice="af_bella", generation=GenerationConfig(speed=1.0), ) pipeline = KokoroPipeline(config) result = pipeline.run("Hello from the pipeline.") result.save_wav("output.wav") # Inspect intermediates segments = result.segments phoneme_segments = result.phoneme_segments # Enable trace details when needed traced = pipeline.run("Hello", return_trace=True) if traced.trace: print(traced.trace.warnings) Configuration ------------- ``PipelineConfig`` and ``GenerationConfig`` are frozen dataclasses. Use ``dataclasses.replace`` when you want a modified copy. .. code-block:: python from dataclasses import replace from pykokoro import GenerationConfig, KokoroPipeline, PipelineConfig cfg = PipelineConfig(voice="af_bella") faster_cfg = replace(cfg, generation=replace(cfg.generation, speed=1.2)) pipeline = KokoroPipeline(faster_cfg) PipelineConfig fields ~~~~~~~~~~~~~~~~~~~~~ Core ^^^^ * ``voice``: Default voice name (``str``) or ``VoiceBlend`` used unless SSMD metadata overrides the voice per segment. * ``generation``: ``GenerationConfig`` instance with speed, language, pause handling, and phoneme controls. Model and provider ^^^^^^^^^^^^^^^^^^ * ``model_quality``: ``"fp32"``, ``"fp16"``, ``"fp16-gpu"``, ``"q8"``, ``"q8f16"``, ``"q4"``, ``"q4f16"``, ``"uint8"``, ``"uint8f16"``. ``None`` uses the backend default. * ``model_source``: ``"huggingface"`` or ``"github"``. * ``model_variant``: ``"v1.0"`` or ``"v1.1-zh"``. * ``model_path``: Path to a local ONNX model file. Overrides model download. * ``voices_path``: Path to a local voices file. Overrides voice download. * ``provider``: ONNX provider name (``"auto"``, ``"cpu"``, ``"cuda"``, ``"openvino"``, ``"directml"``, ``"coreml"``). * ``provider_options``: Dict of provider/session options passed to ONNX Runtime. * ``session_options``: Pre-built ``onnxruntime.SessionOptions`` (advanced use). Tokenizer and phoneme handling ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * ``tokenizer_config``: ``TokenizerConfig`` used by SSMD parsing and ``kokorog2p``. * ``tokenizer_config.spacy_model``: spaCy package name or ``"auto"``. * ``tokenizer_config.spacy_model_size``: package tier for auto mode (``"sm"``, ``"md"``, ``"lg"``, ``"trf"``). Default: ``"md"``. * ``espeak_config``: Deprecated espeak configuration. Prefer ``TokenizerConfig``. * ``short_sentence_config``: ``ShortSentenceConfig`` for short-sentence handling. * ``overlap_mode``: ``"snap"`` clips overlapping SSMD spans to segment bounds, ``"strict"`` drops partial spans and emits trace warnings. Helper for auto spaCy model tier: .. code-block:: python from pykokoro import PipelineConfig, with_spacy_model_size cfg = PipelineConfig(voice="af_bella") cfg = with_spacy_model_size(cfg, size="md") Other ^^^^^ * ``return_trace``: Include ``Trace`` in ``AudioResult`` with timings/warnings. * ``enable_deprecation_warnings``: Reserved for compatibility warnings. * ``cache_dir``: Directory for the G2P disk cache (JSON files). Set ``None`` to disable caching. GenerationConfig fields ~~~~~~~~~~~~~~~~~~~~~~~ * ``speed``: Speech rate multiplier (``1.0`` is normal). * ``lang``: Default language code for phonemization (``"en-us"`` etc). * ``is_phonemes``: Treat input text as phoneme strings instead of raw text. * ``pause_mode``: ``"tts"`` keeps natural model pauses, ``"manual"`` trims segment silence and preserves explicit pauses, ``"auto"`` inserts pauses at sentence/paragraph boundaries and trims segment silence. * ``pause_clause``: Default pause for SSMD ``...c`` breaks (seconds). * ``pause_sentence``: Default pause for SSMD ``...s`` breaks (seconds). * ``pause_paragraph``: Default pause for SSMD ``...p`` breaks (seconds). * ``pause_variance``: Stored for compatibility with the ``Kokoro`` API. The pipeline stages do not currently apply variance. * ``random_seed``: Stored for compatibility with the ``Kokoro`` API. The pipeline stages do not currently use the seed. * ``enable_short_sentence``: Override short sentence handling for the run. Runtime overrides ----------------- ``KokoroPipeline.run`` accepts overrides for any ``PipelineConfig`` field. The ``lang`` keyword is special-cased to update ``generation.lang`` for convenience. .. code-block:: python from dataclasses import replace from pykokoro import GenerationConfig # Override just the language result = pipeline.run("Bonjour", lang="fr") # Override generation settings per call manual = replace( pipeline.config.generation, pause_mode="manual", pause_sentence=0.5, ) result = pipeline.run("Hello...s world", generation=manual) # Override model settings per call result = pipeline.run("Quick test", model_quality="q8") Stage behavior -------------- SSMD document parser ~~~~~~~~~~~~~~~~~~~~ ``SsmdDocumentParser`` uses ``parse_ssmd_to_segments`` to turn SSMD markup into clean text plus metadata spans, pause boundaries, and sentence/paragraph segments. It honors ``generation.pause_*`` values when converting break strengths into durations. Supported SSMD features include: * Break markers: ``...c``, ``...s``, ``...p``, ``...500ms`` * Language overrides: ``[Bonjour](fr)`` * Phoneme overrides: ``[tomato](ph: t eh m aa t ow)`` * Prosody markup (rate/pitch/volume) and emphasis * Voice markers: ``[Hello]{voice="af_sarah"}`` The parser attaches SSMD metadata to annotation spans so later stages can select per-segment voices and prosody. Plain text sentence splitting ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``PlainTextDocumentParser`` uses the optional ``phrasplit`` package for sentence splitting. When ``phrasplit`` is unavailable, it falls back to a single segment. The language model is derived from ``generation.lang`` using spaCy package naming rules (for example ``en_core_web_sm`` for English). Split boundaries are forced at SSMD pause boundaries and at spans that contain phoneme overrides so those overrides are kept intact. Set ``PYKOKORO_DEBUG_SEGMENTS=1`` to log segment offsets. Kokoro G2P adapter ~~~~~~~~~~~~~~~~~~ ``KokoroG2PAdapter`` uses the ``kokorog2p`` package to produce phonemes and token IDs. * ``generation.lang`` selects the G2P language. * ``generation.is_phonemes`` treats input as phonemes and skips text G2P. * SSMD ``ph``/``phonemes`` spans override phonemes for that segment. * ``tokenizer_config`` is forwarded to ``kokorog2p.get_g2p``. * ``spacy_model="auto"`` resolves per language (default size ``md``). * ``cache_dir`` enables on-disk caching of phonemes/tokens. * Long phoneme token sequences are split into batches of ``MAX_PHONEME_LENGTH``. Onnx phoneme processing ~~~~~~~~~~~~~~~~~~~~~~~ ``OnnxPhonemeProcessorAdapter`` calls the ONNX backend to normalize tokens, skip empty segments, and apply short-sentence handling. * ``short_sentence_config`` controls defaults for short sentence handling. * ``generation.enable_short_sentence`` can override the config per run. Onnx audio generation ~~~~~~~~~~~~~~~~~~~~~ ``OnnxAudioGenerationAdapter`` generates raw audio per phoneme segment. * ``voice`` provides the default voice style. * SSMD voice metadata (``voice``/``voice_name``) overrides the voice per segment. * ``generation.speed`` controls synthesis speed. Onnx audio postprocessing ~~~~~~~~~~~~~~~~~~~~~~~~~ ``OnnxAudioPostprocessingAdapter`` trims silence and concatenates segments. * ``generation.pause_mode"`` set to ``"manual"`` or ``"auto"`` enables silence trimming before inserting explicit pauses. * SSMD prosody metadata (rate/pitch/volume) is applied to each segment. * ``pause_before``/``pause_after`` values from G2P are inserted between segments. Customizing the pipeline ------------------------ You can replace individual stages or use the provided no-op adapters. The showcase script demonstrates multiple wiring styles: ``examples/pipeline_stage_showcase.py`` Example with explicit stage wiring: .. code-block:: python from pykokoro import GenerationConfig, KokoroPipeline, PipelineConfig from pykokoro.onnx_backend import Kokoro from pykokoro.stages.audio_generation.onnx import OnnxAudioGenerationAdapter from pykokoro.stages.audio_postprocessing.onnx import OnnxAudioPostprocessingAdapter from pykokoro.stages.doc_parsers.ssmd import SsmdDocumentParser from pykokoro.stages.g2p.kokorog2p import KokoroG2PAdapter from pykokoro.stages.phoneme_processing.onnx import OnnxPhonemeProcessorAdapter cfg = PipelineConfig( voice="af_heart", generation=GenerationConfig(lang="en-us"), ) kokoro = Kokoro(model_quality=cfg.model_quality) pipeline = KokoroPipeline( cfg, doc_parser=SsmdDocumentParser(), g2p=KokoroG2PAdapter(), phoneme_processing=OnnxPhonemeProcessorAdapter(kokoro), audio_generation=OnnxAudioGenerationAdapter(kokoro), audio_postprocessing=OnnxAudioPostprocessingAdapter(kokoro), ) Local model files and providers ------------------------------- To load local ONNX artifacts, set ``model_path`` and ``voices_path``. You can also select a specific execution provider. .. code-block:: python from pathlib import Path from pykokoro import KokoroPipeline, PipelineConfig cfg = PipelineConfig( voice="af_bella", model_path=Path("/models/kokoro.onnx"), voices_path=Path("/models/voices.bin"), provider="cuda", provider_options={"device_id": 0}, ) pipeline = KokoroPipeline(cfg) result = pipeline.run("Hello from local files.")