JP TTS Engine Comparison

10 agents · 4 GitHub deep-dives · May 21 2026 · Apple Silicon (M4 Pro, 64GB)

最近、新しい本を読み始めました。物語は静かな町から始まり、主人公はゆっくりと自分の過去と向き合っていきます。
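Test passage rendered by every engine: about 10s of natural narrative JP. In English: "I recently started reading a new book. The story begins in a quiet town, and the protagonist slowly comes to terms with their own past."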

Per-engine: what it is, how it works, license, Apple Silicon viability, the invocation that produced its sample.

VoxCPM2 · Apache-2.0 · MLX (2nd-class) · OpenBMB · 2B params · tokenizer-free DiT · 48kHz
How it works
Tokenizer-free diffusion-autoregressive: LocEnc → TSLM → RALM → LocDiT. AudioVAE V2 reconstructs 48kHz audio. Trained on 2M+ hours of multilingual data. Five modes: zero-shot, voice-design (caption), continuation (prompt audio+text), reference clone (ref audio+text), ultimate (all four).
Strong
Multilingual coverage incl. JP. Apache-2.0. Voice Design via natural-language prompt. 48kHz output. Apple-Silicon MLX port shipped 2026-04-30 (PR #641).
Weak on Apple Silicon
Diffusion-on-MLX is undercooked — maintainer publicly redirects performance-hungry users to nanovllm-voxcpm (CUDA fork). MPS has explicit upstream-acknowledged limitations. Clone mode produces a "fluent-foreigner accent" on JP (HF discussion #14).
JP accent fix (maintainer-confirmed)
Switch from clone → Voice Design (caption/instruct), lower CFG from 2.0 to 1.5, chunk to ≤2 sentences per call. From VoxCPM issue #222.
python voxcpm_tts.py "..." --mode default --instruct "落ち着いた大人の朗読、自然な日本語" --cfg-value 1.5
Irodori-TTS v3 · MIT · MLX native · Aratako (Chihiro Arata) · 500M params · RF-DiT + DACVAE · 48kHz · JP-only
How it works
Flow-matching DiT with cross-attention text/caption conditioning. Three condition signals: text (mandatory), caption (instruct/voice-design), ref_audio (zero-shot clone). Speaker Inversion (PR #18, merged Apr 2026) addresses cross-sentence voice drift via speaker_kv_scale tuning.
Strong
Built JP-first: JP-native creators specifically praise it for reducing 「ペラペラ外国人っぽさ」 (fluent-foreigner accent). MIT license. 40+ emoji emotion markers (😭 = crying, 🤧 = cold, 😄 = laughing). At 500M params it is small and fast on Apple Silicon.
Weak
Kanji handling: pre-convert unusual kanji to hiragana. Long sentences (>30s) trigger skipping, so chunk them. Narrow dynamic range (issue #9, unanswered for 3+ weeks). Cross-sentence voice drift (the dealbreaker in Wooly-Fluffy issue #136; may be fixed by Speaker Inversion).
Memory (mlx-audio)
Default sequence_length=750 needs ~24GB unified memory. 16GB Macs use sequence_length≤400 + cfg_guidance_mode=alternating. Joe's 64GB has plenty of headroom.
python irodori_tts.py "..." --caption "落ち着いた大人の朗読、自然な日本語" --model mlx-community/Irodori-TTS-500M-v3-8bit
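As a rough illustration of the memory guidance in this card, the sketch below maps a Mac's unified memory to the two settings quoted there. Treating 24GB as a hard cut-off is an assumption, `IrodoriSettings` and `pick_irodori_settings` are hypothetical helper names, and how the values are actually handed to mlx-audio is deliberately not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IrodoriSettings:
    sequence_length: int
    cfg_guidance_mode: Optional[str]  # None means "leave the engine default"


def pick_irodori_settings(unified_memory_gb: int) -> IrodoriSettings:
    """Map unified memory to the settings quoted in the memory note."""
    if unified_memory_gb >= 24:
        # Default sequence length; the note puts its footprint at ~24GB.
        # Using 24 as a hard threshold is an assumption of this sketch.
        return IrodoriSettings(sequence_length=750, cfg_guidance_mode=None)
    # Smaller Macs (the note names 16GB): shorter sequence + alternating CFG.
    return IrodoriSettings(sequence_length=400, cfg_guidance_mode="alternating")


if __name__ == "__main__":
    print(pick_irodori_settings(64))  # the 64GB machine used in this report
    print(pick_irodori_settings(16))
```

The note allows any sequence_length up to 400 on 16GB machines; the sketch simply picks the upper bound.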
Qwen3-TTS · Apache-2.0 · MLX native · Alibaba · 24kHz · 18 langs incl JP
How it works
Multilingual TTS with built-in voice catalog (Ono_Anna for JP) plus zero-shot clone. Natural-language instruct prompt for emotion/style.
Strong
Working reliably on Apple Silicon. Built-in JP speaker Ono_Anna. Apache-2.0. Instruct prompts handle calm-narrator naturally.
Weak
"Decent, not Fish-tier." Same fluent-foreigner accent in zero-shot clone mode. No major model upgrade since launch.
Verdict
Reliable commercial baseline. Use when you need a known-working Apache engine and decent JP is enough.
python qwen_tts.py "..." --voice Ono_Anna --instruct "穏やかな大人の朗読"
Google Chirp 3 HD ja-JP · Cloud (you own output) · Apple Silicon: N/A (cloud) · $30/M chars · 1M chars/mo free · 24kHz
How it works
Google's transformer-based neural TTS. JP voices follow ja-JP-Chirp3-HD-{Name} pattern. Stock voices tested here: Charon, Kore, Aoede, Zephyr, Achernar.
Strong
Polished out of the box. JP-native reviewer rates 8/10. Handles JP punctuation/line breaks for pause control. You own the output (no training-data restrictions on commercial use).
Weak
API dependency. No offline. No instruct/emotion control (Chirp 3 line). Limited voice options (~5–10 ja-JP voices).
Joe-specific cost math
Manaoke catalog (100 songs × 500 chars) ≈ 50K chars ≈ $1.50 total. Free tier covers most realistic use. DBZ episode ~60K chars ≈ $1.80/ep.
python google_chirp_tts.py "..." --voice ja-JP-Chirp3-HD-Charon
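The cost math in this card can be checked with a few lines of Python. This is a minimal sketch using only the list price and free tier quoted in the card header; the function names are hypothetical, and the assumption that the free tier is simply subtracted from a month's total is ours, so check the current pricing page before budgeting.

```python
PRICE_PER_MILLION_CHARS_USD = 30.0  # from the card header
FREE_CHARS_PER_MONTH = 1_000_000    # from the card header


def list_price_usd(chars: int) -> float:
    """Cost at list price, ignoring the free tier."""
    return chars * PRICE_PER_MILLION_CHARS_USD / 1_000_000


def billed_usd(chars_this_month: int) -> float:
    """Cost after subtracting the monthly free tier (assumed simple subtraction)."""
    billable = max(0, chars_this_month - FREE_CHARS_PER_MONTH)
    return list_price_usd(billable)


if __name__ == "__main__":
    manaoke_chars = 100 * 500    # 100 songs x 500 chars
    dbz_episode_chars = 60_000   # one episode, per the estimate in the card
    print(f"Manaoke catalog: ${list_price_usd(manaoke_chars):.2f}")    # $1.50
    print(f"DBZ episode:     ${list_price_usd(dbz_episode_chars):.2f}")  # $1.80
    print(f"Billed if both land in one month: ${billed_usd(manaoke_chars + dbz_episode_chars):.2f}")  # $0.00
```

Both workloads together come to 110K chars, well inside the 1M free characters, which is why the free tier covers most realistic use.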
AivisSpeech-Engine · ⚠ AGPL-3.0 (transitive) · CPU-only on Mac · Walkers Inc · v1.2.0 (Apr 30 2026) · SBV2 JP-Extra inside · LGPL wrapper
How it works
LGPL-3.0 wrapper around Style-Bert-VITS2 JP-Extra. Better JP NLP frontend than raw SBV2 (OpenJTalk dictionary bundled). AIVMX voice format. Engine + AivisHub model marketplace ecosystem.
License landmine
Engine LICENSE says LGPL-3.0, but pyproject.toml pins SBV2 from tsukumijima's fork (AGPL-3.0) and imports it in-process, so AGPL applies to the whole stack for a commercial network deployment. Confirmed via primary-source inspection.
Apple Silicon
CPU-only by design. pyproject.toml selects plain onnxruntime for darwin/arm64 (no CoreML). Maintainer: 「GPU 対応は積極的には行っておりません」 (roughly, "we are not actively working on GPU support"; issue #21).
Anneli scandal
Default voice (Sept 2025) was an unauthorized clone of voice actress Hibiku Yamamura. AivisHub was paused, then reopened with a provenance audit. The engine is architecturally unaffected, but the reputational shadow remains. Not in our A/B (out of license and brand scope).
⚠ Not installed for this comparison — license + Mac performance ceiling rules it out for Joe.
Style-Bert-VITS2 (JP-Extra) · AGPL-3.0 · MPS inference, CPU training · litagin02 · ~440M · JP-Extra arch frozen since 2024-05
How it works
VITS-family architecture with style embeddings and BERT-conditioned text encoder. JP-Extra is the JP-specialized branch with stronger JP prosody. Trained corpora include JVNV (CC BY-SA 4.0).
Strong
JP-native pitch-accent quality among the strongest in OSS. JVNV checkpoints are commercial-OK with attribution. MPS inference works on torch ≥ 2.7.1.
Weak
AGPL-3.0 copyleft. The training code is hardcoded to CUDA; PR #240 adds MPS support but has been open and unmerged for 7+ weeks. Empirically, a 12-core M2 Max CPU takes 5m10s to train on a 6s clip, so fine-tuning on Mac is impractical. JP-Extra architecture frozen since 2024.
Maintainer direction
Ver 3.0.0 PR #231: "Japanese only support, completely abolish Chinese and English." Pivot in progress.
⚠ Not installed for this comparison — same AGPL + Mac-CPU ceiling as AivisSpeech.
Fish Audio S2 (hosted Plus) · $11/mo, commercial OK · OSS weights NC-only · Fish Audio Co.
How it works
Fish-Speech architecture (DAR + VQGAN). The OSS S1-mini/S2-Pro weights are CC-BY-NC-SA, which blocks commercial use. The hosted Plus tier explicitly grants commercial output rights.
Strong
The reference "Fish sound" you've been chasing. Cleanest legal path to it. JP support strong.
Weak
$132/yr subscription. API dependency. No offline. Russia/CN-adjacent vendor flag — check non-adversarial preference if it matters.
Recommendation
Subscribe for one month, ear-test on real passages, decide if it's actually what you want. Cheapest way to calibrate the upper bound.
⚠ Not subscribed yet. Action: 1-month Plus seat ($11) to definitively answer "is Fish what I want."
CosyVoice 3 · Removed 2026-05-20 · Apache-2.0 but lost JP A/B
Why removed
JP output had distinct Chinese-accented prosody (kanji-hanzi leakage). Paper §4.4 admits 28× ZH-vs-JP training imbalance. Reclaimed 11.1GB.
Good at
Mandarin, ZH↔EN cross-lingual cloning, Chinese dialects via instruct. None of which is Joe's use case.


Recommended next research moves

  1. 5 broad research agents (Twitter, YouTube, Reddit/HN/Qiita, Academic, Alt-routes) [done]
  2. 4 GitHub deep-dive agents (Irodori, VoxCPM2, AivisSpeech, SBV2) [done]
  3. 1 gap-fill agent (demos, HF, Discord, CN, awesome-lists) [done]
  4. VoxCPM2 Voice Design + CFG 1.5 + chunked variants [done]
  5. Irodori v3 via mlx-audio 0.4.3+ (just merged 2026-05-18) [done]
  6. Google Chirp 3 HD ja-JP, 5 voices [done]
  7. Test Irodori v3 + Speaker Inversion knobs (speaker_kv_scale); first public test of the post-PR-#18 configuration [next]
  8. One month of Fish Audio Plus ($11) to ear-test the actual Fish ceiling [next]
  9. Try Step-Audio-EditX (Apache-2.0, JP-capable, surfaced in Round 5); single demo listen [future]
  10. If VoxCPM2 Voice Design closes the accent gap, migrate to mlx-audio PyPI 0.4.4 when it is released (would drop the SHA-pin) [future]

Quality levers by engine

VoxCPM2 — fix the accent

Maintainer-confirmed stack, from HF discussion #14 and VoxCPM issue #222:

  1. Voice Design mode (--mode default --instruct "...") — not clone
  2. CFG 1.5 instead of default 2.0 (--cfg-value 1.5)
  3. Chunk to ≤2 sentences per call
  4. Avoid self-conditioning (prompt_audio + prompt_text) in Voice Design mode — it produced 192s of hallucinated audio in our test
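Steps 1 to 3 above can be sketched as a small pre-processing helper that splits text into chunks of at most two sentences and builds one command per chunk. The flags are the ones from the `voxcpm_tts.py` invocation shown earlier in this report; the sentence splitter is a naive assumption (it only looks for 。！？), and the function names are hypothetical.

```python
import re
import shlex

PASSAGE = (
    "最近、新しい本を読み始めました。"
    "物語は静かな町から始まり、主人公はゆっくりと自分の過去と向き合っていきます。"
)


def split_sentences(text: str) -> list[str]:
    """Naive JP sentence split: a run of text plus its closing 。！？ (assumption)."""
    parts = re.findall(r"[^。！？]+[。！？]*", text)
    return [p.strip() for p in parts if p.strip()]


def chunk_sentences(text: str, max_sentences: int = 2) -> list[str]:
    """Group sentences so each call gets at most `max_sentences` (step 3)."""
    sentences = split_sentences(text)
    return [
        "".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


def voxcpm_command(chunk: str, instruct: str, cfg_value: float = 1.5) -> list[str]:
    """Voice Design call (step 1) with lowered CFG (step 2); no prompt audio (step 4)."""
    return [
        "python", "voxcpm_tts.py", chunk,
        "--mode", "default",
        "--instruct", instruct,
        "--cfg-value", str(cfg_value),
    ]


if __name__ == "__main__":
    instruct = "落ち着いた大人の朗読、自然な日本語"
    for chunk in chunk_sentences(PASSAGE * 2):
        print(shlex.join(voxcpm_command(chunk, instruct)))
```

Returning an argument list rather than a shell string keeps the Japanese text and the instruct prompt safe from quoting problems if the commands are later passed to `subprocess.run`.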

Irodori v3 — play to its strengths

  • Pre-convert unusual kanji to hiragana — model's training data is limited on rare kanji
  • Chunk anything >30s of speech — training clips were ≤30s, skips appear past that
  • Tune speaker_kv_scale (default 5.0) to control reference adherence vs naturalness
  • Use emoji markers for emotion: 😭 cry · 🤧 cold · 😄 laugh · etc.
  • Accept narrow dynamic range — production-mix afterwards if needed
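The first two bullets can be combined into a text-preparation pass, sketched below under stated assumptions. Speaking pace is calibrated from the roughly 10s test passage used in this report, which is only an estimate; the readings table is supplied by the user (the single entry shown is illustrative); and `apply_readings` and `chunk_by_duration` are hypothetical names, not part of Irodori or mlx-audio.

```python
import re

PASSAGE = (
    "最近、新しい本を読み始めました。"
    "物語は静かな町から始まり、主人公はゆっくりと自分の過去と向き合っていきます。"
)
# Assumption: the test passage takes about 10 seconds to speak.
ASSUMED_CHARS_PER_SECOND = len(PASSAGE) / 10.0
MAX_CHUNK_SECONDS = 30.0  # skips appear past roughly 30s, per the list above


def apply_readings(text: str, readings: dict[str, str]) -> str:
    """Replace rare kanji words with hiragana from a user-supplied table."""
    # Longest keys first, so a compound is replaced before any of its parts.
    for word in sorted(readings, key=len, reverse=True):
        text = text.replace(word, readings[word])
    return text


def estimated_seconds(text: str) -> float:
    return len(text) / ASSUMED_CHARS_PER_SECOND


def chunk_by_duration(text: str, max_seconds: float = MAX_CHUNK_SECONDS) -> list[str]:
    """Pack whole sentences into chunks whose estimated length stays under budget."""
    sentences = re.findall(r"[^。！？]+[。！？]*", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and estimated_seconds(current + sentence) > max_seconds:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)  # a single over-long sentence is kept whole
    return chunks


if __name__ == "__main__":
    print(apply_readings("彼は躊躇した。", {"躊躇": "ちゅうちょ"}))
    for chunk in chunk_by_duration(PASSAGE * 8):
        print(round(estimated_seconds(chunk), 1), chunk[:12] + "...")
```

A sentence that exceeds the budget on its own is passed through unsplit, so very long sentences still need manual breaking at a comma.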

Google Chirp — when shipping commercially

  • Default to Charon for warm narration · Kore for neutral · Aoede for energetic
  • 1M chars/mo free — covers most of Joe's use case at zero $
  • JP punctuation (。 、 ・) controls pause length naturally — no SSML needed
  • You own the output — no commercial-use restrictions

Fish hosted — the ceiling test

  • $11/mo Plus seat explicitly grants commercial rights on outputs that the CC-BY-NC-SA OSS weights forbid
  • Use it for one month to decide if Fish quality is what you actually want, or if your ear is happy with VoxCPM2 Voice Design
  • If yes: $132/yr is cheaper than every other route
  • If no: stop chasing it — you've calibrated the upper bound

Optimal config per use case

Manaoke karaoke

Default: Google Chirp 3 HD Charon / Kore (you own output, $1.50 for whole catalog at scale).

Voice variety: mix in VoxCPM2 Voice Design with character-instructs for non-narrator lines.

Avoid: AivisSpeech / SBV2 (AGPL). Shipping publicly and commercially with copyleft inheritance is the wrong fight.

DBZ Narrator (personal)

Default: Irodori v3 caption mode with character-instructs ("迫力ある男性ナレーター、緊張感のある朗読" etc.). MIT license, JP-native voice, emoji for emphasis.

Backup: VoxCPM2 Voice Design with CFG 1.5 for variety.

Personal viewing = AGPL is acceptable if needed, but Irodori MIT is the cleaner default.

Mom's record narration

Default: Google Chirp 3 HD Charon — calm, polished, no fiddling. 1M/mo free covers most realistic length. Personal use = license irrelevant either way.

If offline matters: Irodori v3 with calm caption.

Silo / screenplay reads (offline)

Default: Irodori v3 (offline, MIT) for JP; Qwen3-TTS Ono_Anna as backup.

Air-gapped Silo machine: NO cloud option. Local-only mandate is a hard constraint here.

Strategy: chunk screenplays sentence-by-sentence, render per-character with distinct captions.
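The strategy above can be sketched as a job builder that turns a screenplay into one render command per sentence, each with its character's caption. The "NAME: line" script format, the character names, and the helper names are all assumptions made for illustration; the command shape and model name are taken from the Irodori invocation earlier in this report, and the two captions are ones already quoted in it.

```python
import re
import shlex

MODEL = "mlx-community/Irodori-TTS-500M-v3-8bit"  # from the Irodori invocation
LINE = re.compile(r"^\s*([^:：]+?)\s*[:：]\s*(.+)$")  # assumed "NAME: dialogue" format

# Hypothetical character names mapped to captions quoted in this report.
CAPTIONS = {
    "NARRATOR": "落ち着いた大人の朗読、自然な日本語",
    "HERO": "迫力ある男性ナレーター、緊張感のある朗読",
}

SCRIPT = """\
NARRATOR: 最近、新しい本を読み始めました。物語は静かな町から始まり、主人公はゆっくりと自分の過去と向き合っていきます。
HERO: 最近、新しい本を読み始めました。
"""


def parse_screenplay(script: str) -> list[tuple[str, str]]:
    """Return (character, sentence) pairs, one per sentence."""
    pairs = []
    for raw in script.splitlines():
        match = LINE.match(raw)
        if not match:
            continue  # skip blank lines and anything not in NAME: form
        name, dialogue = match.groups()
        for sentence in re.findall(r"[^。！？]+[。！？]*", dialogue):
            if sentence.strip():
                pairs.append((name, sentence.strip()))
    return pairs


def render_jobs(script: str, captions: dict[str, str], default_caption: str) -> list[list[str]]:
    """Build one irodori_tts.py command per sentence, captioned per character."""
    return [
        ["python", "irodori_tts.py", sentence,
         "--caption", captions.get(name, default_caption),
         "--model", MODEL]
        for name, sentence in parse_screenplay(script)
    ]


if __name__ == "__main__":
    for job in render_jobs(SCRIPT, CAPTIONS, CAPTIONS["NARRATOR"]):
        print(shlex.join(job))
```

Everything here runs locally and only prepares commands, which fits the air-gapped constraint; unknown characters fall back to the default caption rather than failing.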

Architectural reality on Apple Silicon (2026)

Things we found that aren't going to change soon — adjust expectations, don't fight the platform.