JP TTS Engine Comparison

10 agents · 4 GitHub deep-dives · May 21 2026 · Apple Silicon (M4 Pro, 64GB)

最近、新しい本を読み始めました。物語は静かな町から始まり、主人公はゆっくりと自分の過去と向き合っていきます。

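Test passage rendered by every engine: about 10s of natural narrative Japanese. In English: "I recently started reading a new book. The story begins in a quiet town, and the protagonist slowly comes to face their own past."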

Per-engine: what it is, how it works, license, Apple Silicon viability, the invocation that produced its sample.

VoxCPM2 · Apache-2.0 · MLX (2nd-class) · OpenBMB · 2B params · tokenizer-free DiT · 48kHz

How it works

Tokenizer-free diffusion-autoregressive: LocEnc → TSLM → RALM → LocDiT. AudioVAE V2 reconstructs 48kHz. Trained on 2M+ hours multilingual. Five modes: zero-shot, voice-design (caption), continuation (prompt audio+text), reference clone (ref audio+text), ultimate (all four).

Strong

Multilingual coverage incl. JP. Apache-2.0. Voice Design via natural-language prompt. 48kHz output. Apple-Silicon MLX port shipped 2026-04-30 (PR #641).

Weak on Apple Silicon

Diffusion-on-MLX is undercooked — maintainer publicly redirects performance-hungry users to nanovllm-voxcpm (CUDA fork). MPS has explicit upstream-acknowledged limitations. Clone mode produces a "fluent-foreigner accent" on JP (HF discussion #14).

JP accent fix (maintainer-confirmed)

Switch from clone → Voice Design (caption/instruct), lower CFG from 2.0 to 1.5, chunk to ≤2 sentences per call. From VoxCPM issue #222.

python voxcpm_tts.py "..." --mode default --instruct "落ち着いた大人の朗読、自然な日本語" --cfg-value 1.5

Irodori-TTS v3 · MIT · MLX native · Aratako (Chihiro Arata) · 500M params · RF-DiT + DACVAE · 48kHz · JP-only

How it works

Flow-matching DiT with cross-attention text/caption conditioning. Three condition signals: text (mandatory), caption (instruct/voice-design), ref_audio (zero-shot clone). Speaker Inversion (PR #18, merged Apr 2026) addresses cross-sentence voice drift via speaker_kv_scale tuning.

Strong

Built JP-first: JP-native creators specifically praise it for reducing 「ペラペラ外国人っぽさ」 (fluent-foreigner accent). MIT license. 40+ emoji emotion markers (😭 = crying, 🤧 = cold, 😄 = laughing). At 500M params it is small and fast on Apple Silicon.

Weak

Kanji handling: pre-convert unusual kanji to hiragana. Long inputs (>30s of speech) trigger skipping, so chunk them. Narrow dynamic range (issue #9, unanswered for 3+ weeks). Cross-sentence voice drift (the dealbreaker in Wooly-Fluffy issue #136; may be fixed by Speaker Inversion).

Memory (mlx-audio)

Default sequence_length=750 needs ~24GB unified memory. 16GB Macs use sequence_length≤400 + cfg_guidance_mode=alternating. Joe's 64GB has plenty of headroom.
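The memory guidance above can be turned into a small lookup. This is a minimal sketch, not mlx-audio's own logic: the function name, the `headroom_gb` margin, and the choice to apply the 16GB recipe to everything below the default threshold are assumptions; only the 750 / ~24GB and ≤400 / alternating figures come from this report.

```python
DEFAULT_SEQUENCE_LENGTH = 750   # from the report: needs ~24GB unified memory
DEFAULT_FOOTPRINT_GB = 24.0
REDUCED_SEQUENCE_LENGTH = 400   # from the report: upper bound for 16GB Macs


def irodori_mlx_settings(unified_memory_gb: float, headroom_gb: float = 4.0) -> dict:
    """Suggest Irodori settings for a Mac with the given unified memory.

    headroom_gb is an assumed safety margin for the OS and other apps.
    """
    if unified_memory_gb >= DEFAULT_FOOTPRINT_GB + headroom_gb:
        # Enough room for the default sequence length; leave the CFG mode alone.
        return {"sequence_length": DEFAULT_SEQUENCE_LENGTH, "cfg_guidance_mode": None}
    # Assumption: anything smaller uses the report's 16GB recipe.
    return {"sequence_length": REDUCED_SEQUENCE_LENGTH, "cfg_guidance_mode": "alternating"}


print(irodori_mlx_settings(64))
print(irodori_mlx_settings(16))
```

A value of `None` for `cfg_guidance_mode` here simply means "do not override the default".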

python irodori_tts.py "..." --caption "落ち着いた大人の朗読、自然な日本語" --model mlx-community/Irodori-TTS-500M-v3-8bit

Qwen3-TTS · Apache-2.0 · MLX native · Alibaba · 24kHz · 18 langs incl. JP

How it works

Multilingual TTS with built-in voice catalog (Ono_Anna for JP) plus zero-shot clone. Natural-language instruct prompt for emotion/style.

Strong

Works reliably on Apple Silicon. Built-in JP speaker Ono_Anna. Apache-2.0. Instruct prompts handle a calm-narrator style naturally.

Weak

"Decent, not Fish-tier." Same fluent-foreigner accent in zero-shot clone mode. No major model upgrade since launch.

Verdict

Reliable commercial baseline. Use when you need a known-working Apache engine and decent JP is enough.

python qwen_tts.py "..." --voice Ono_Anna --instruct "穏やかな大人の朗読"

Google Chirp 3 HD ja-JP · Cloud, you own output · N/A (cloud) · $30/M chars · 1M chars/mo free · 24kHz

How it works

Google's transformer-based neural TTS. JP voices follow ja-JP-Chirp3-HD-{Name} pattern. Stock voices tested here: Charon, Kore, Aoede, Zephyr, Achernar.
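The voice naming pattern can be captured in a tiny helper so a typo in a voice name fails early. A minimal sketch: the function name and the decision to restrict it to the five voices tested here are assumptions, and the service offers voices beyond this list.

```python
# The five stock voices tested in this comparison.
TESTED_VOICES = ("Charon", "Kore", "Aoede", "Zephyr", "Achernar")


def chirp3_hd_voice(name: str, locale: str = "ja-JP") -> str:
    """Build a voice ID following the {locale}-Chirp3-HD-{Name} pattern."""
    if name not in TESTED_VOICES:
        raise ValueError(f"{name!r} was not one of the voices tested: {TESTED_VOICES}")
    return f"{locale}-Chirp3-HD-{name}"


print(chirp3_hd_voice("Charon"))
```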

Strong

Polished out of the box. A JP-native reviewer rates it 8/10. Handles JP punctuation and line breaks for pause control. You own the output (no training-data restrictions on commercial use).

Weak

API dependency. No offline. No instruct/emotion control (Chirp 3 line). Limited voice options (~5–10 ja-JP voices).

Joe-specific cost math

Manaoke catalog (100 songs × 500 chars) ≈ 50K chars ≈ $1.50 total. Free tier covers most realistic use. DBZ episode ~60K chars ≈ $1.80/ep.
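The cost figures above follow directly from the listed price. A minimal sketch of the arithmetic, assuming the free allowance is simply subtracted from the month's character count before billing (the function name and that billing model are assumptions):

```python
PRICE_PER_MILLION_CHARS_USD = 30.0   # from the report
FREE_CHARS_PER_MONTH = 1_000_000     # from the report


def chirp_cost_usd(chars: int, free_chars: int = 0) -> float:
    """Estimated cost in USD for `chars` characters after `free_chars` are waived."""
    billable = max(0, chars - free_chars)
    return round(billable * PRICE_PER_MILLION_CHARS_USD / 1_000_000, 2)


catalog_chars = 100 * 500                      # 100 songs x 500 chars
print(chirp_cost_usd(catalog_chars))           # list price, ignoring the free tier
print(chirp_cost_usd(60_000))                  # one ~60K-char episode at list price
print(chirp_cost_usd(catalog_chars, FREE_CHARS_PER_MONTH))  # within the free tier
```

At list price the catalog comes to $1.50 and an episode to $1.80; both fit inside a single month's free allowance on their own.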

python google_chirp_tts.py "..." --voice ja-JP-Chirp3-HD-Charon

AivisSpeech-Engine · ⚠ AGPL-3.0 (transitive) · CPU-only on Mac · Walkers Inc · v1.2.0 (Apr 30 2026) · SBV2 JP-Extra inside · LGPL wrapper

How it works

LGPL-3.0 wrapper around Style-Bert-VITS2 JP-Extra. Better JP NLP frontend than raw SBV2 (OpenJTalk dictionary bundled). AIVMX voice format. Engine + AivisHub model marketplace ecosystem.

License landmine

Engine LICENSE says LGPL-3.0, but pyproject.toml pins SBV2 from tsukumijima's fork (AGPL-3.0) and imports it in-process, so AGPL terms apply to the whole stack for a commercial network deployment. Confirmed via primary-source inspection.

Apple Silicon

CPU-only by design. pyproject.toml selects plain onnxruntime for darwin/arm64 (no CoreML). Maintainer: 「GPU 対応は積極的には行っておりません」 ("we are not actively pursuing GPU support") (issue #21).

Anneli scandal

Default voice (Sept 2025) was unauthorized clone of voice actress Hibiku Yamamura. AivisHub paused, reopened with provenance audit. Engine architecturally unaffected; reputational shadow remains. NOT in our A/B (out of license + brand scope).

⚠ Not installed for this comparison: the license and the Mac performance ceiling rule it out for Joe.

Style-Bert-VITS2 (JP-Extra) · AGPL-3.0 · MPS inference, CPU training · litagin02 · ~440M · JP-Extra arch frozen since 2024-05

How it works

VITS-family architecture with style embeddings and BERT-conditioned text encoder. JP-Extra is the JP-specialized branch with stronger JP prosody. Trained corpora include JVNV (CC BY-SA 4.0).

Strong

JP-native pitch-accent quality among the strongest in OSS. JVNV checkpoints are commercial-OK with attribution. MPS inference works on torch ≥ 2.7.1.

Weak

AGPL-3.0 copyleft. The training code is CUDA-hardcoded, so on Mac it runs on CPU only; PR #240 adds MPS support but has been open 7+ weeks unmerged. Empirical: an M2 Max 12-core CPU takes 5m10s to train on a 6s clip, so fine-tuning is impractical on Mac. JP-Extra architecture frozen since 2024.

Maintainer direction

Ver 3.0.0 PR #231: "Japanese only support, completely abolish Chinese and English." Pivot in progress.

⚠ Not installed for this comparison — same AGPL + Mac-CPU ceiling as AivisSpeech.

Fish Audio S2 (hosted Plus) · $11/mo · commercial OK · OSS weights NC-only · Fish Audio Co.

How it works

Fish-Speech architecture (DAR + VQGAN). The OSS S1-mini/S2-Pro weights are CC-BY-NC-SA, which blocks commercial use. The hosted Plus tier explicitly grants commercial output rights.

Strong

The reference "Fish sound" you've been chasing. Cleanest legal path to it. JP support strong.

Weak

$132/yr subscription. API dependency. No offline. Vendor carries a Russia/CN-adjacent flag; weigh it against any non-adversarial-vendor preference if that matters.

Recommendation

Subscribe for one month, ear-test on real passages, decide if it's actually what you want. Cheapest way to calibrate the upper bound.

⚠ Not subscribed yet. Action: 1-month Plus seat ($11) to definitively answer "is Fish what I want."

CosyVoice 3 · Removed 2026-05-20 · Apache-2.0 but lost JP A/B

Why removed

JP output had distinct Chinese-accented prosody (kanji-hanzi leakage). Paper §4.4 admits 28× ZH-vs-JP training imbalance. Reclaimed 11.1GB.

Good at

Mandarin, ZH↔EN cross-lingual cloning, Chinese dialects via instruct. None of which is Joe's use case.

12 voices · letters shuffled per load · Pick the one you like best · Reveal labels when done.

Recommended next research moves

5 broad research agents (Twitter, YouTube, Reddit/HN/Qiita, Academic, Alt-routes) · done
4 GitHub deep-dive agents (Irodori, VoxCPM2, AivisSpeech, SBV2) · done
1 gap-fill agent (demos, HF, Discord, CN, awesome-lists) · done
VoxCPM2 Voice Design + CFG 1.5 + chunked variants · done
Irodori v3 via mlx-audio 0.4.3+ (just merged 2026-05-18) · done
Google Chirp 3 HD ja-JP, 5 voices · done
Test Irodori v3 + Speaker Inversion knobs (speaker_kv_scale), the first public test of the post-PR-#18 configuration · next
One month of Fish Audio Plus ($11) to ear-test the actual Fish ceiling · next
Try Step-Audio-EditX (Apache-2.0, JP-capable, surfaced in Round 5) with a single demo listen · future
If VoxCPM2 Voice Design closes the accent gap, migrate to mlx-audio PyPI 0.4.4 when it cuts (would drop the SHA-pin) · future

Quality levers by engine

VoxCPM2 — fix the accent

HF discussion #14 + VoxCPM issue #222 maintainer-confirmed stack:

Voice Design mode (--mode default --instruct "...") — not clone
CFG 1.5 instead of default 2.0 (--cfg-value 1.5)
Chunk to ≤2 sentences per call
Avoid self-conditioning (prompt_audio + prompt_text) in Voice Design mode — it produced 192s of hallucinated audio in our test
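The stack above can be sketched as a chunker that emits one command line per chunk of at most two sentences. The flags are the ones shown in the VoxCPM2 invocation in this report; the helper names are hypothetical, and the splitter is a simple assumption that breaks on 。！？ (including inside quoted dialogue).

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split after Japanese sentence-ending punctuation, keeping it attached."""
    parts = re.split(r"(?<=[。！？])", text)
    return [p.strip() for p in parts if p.strip()]


def chunk_sentences(text: str, max_sentences: int = 2) -> list[str]:
    """Group sentences into chunks of at most `max_sentences` each."""
    sentences = split_sentences(text)
    return ["".join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]


def voxcpm_commands(text: str, instruct: str, cfg_value: float = 1.5) -> list[list[str]]:
    """Build one argv list per chunk: Voice Design mode, lowered CFG, no prompt audio."""
    return [
        ["python", "voxcpm_tts.py", chunk,
         "--mode", "default", "--instruct", instruct, "--cfg-value", str(cfg_value)]
        for chunk in chunk_sentences(text)
    ]


for argv in voxcpm_commands("一つ目です。二つ目です。三つ目です。", "落ち着いた大人の朗読、自然な日本語"):
    print(argv)
```

The argv lists can be passed to `subprocess.run` as-is; building them as lists avoids shell-quoting problems with Japanese text.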

Irodori v3 — play to its strengths

Pre-convert unusual kanji to hiragana — model's training data is limited on rare kanji
Chunk anything >30s of speech — training clips were ≤30s, skips appear past that
Tune speaker_kv_scale (default 5.0) to control reference adherence vs naturalness
Use emoji markers for emotion: 😭 cry · 🤧 cold · 😄 laugh · etc.
Accept narrow dynamic range — production-mix afterwards if needed
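The 30-second limit can be approximated before rendering by estimating duration from character count. A minimal sketch: the rate of 5 characters per second is an assumption derived from this report's test passage (about 54 characters in roughly 10s), and the function names are hypothetical.

```python
import re

CHARS_PER_SECOND = 5.0   # assumption, from the ~54-char / ~10s test passage
MAX_SECONDS = 30.0       # from the report: skips appear past ~30s


def estimate_seconds(text: str) -> float:
    """Rough spoken duration of `text` at the assumed speaking rate."""
    return len(text) / CHARS_PER_SECOND


def chunk_by_duration(text: str, max_seconds: float = MAX_SECONDS) -> list[str]:
    """Pack whole sentences into chunks whose estimated duration stays under the limit."""
    sentences = [s for s in re.split(r"(?<=[。！？])", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if current and estimate_seconds(current + sentence) > max_seconds:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks


long_text = ("あ" * 99 + "。") * 3   # three sentences of ~20s each
print([round(estimate_seconds(c)) for c in chunk_by_duration(long_text)])
```

A single sentence that is itself longer than the limit is left as one oversized chunk; splitting inside a sentence would need a separate rule, for example breaking at 、.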

Google Chirp — when shipping commercially

Default to Charon for warm narration · Kore for neutral · Aoede for energetic
1M chars/mo free — covers most of Joe's use case at zero $
JP punctuation (。、・) controls pause length naturally — no SSML needed
You own the output — no commercial-use restrictions

Fish hosted — the ceiling test

$11/mo Plus seat explicitly grants commercial rights on outputs that the CC-BY-NC-SA OSS weights forbid
Use it for one month to decide if Fish quality is what you actually want, or if your ear is happy with VoxCPM2 Voice Design
If yes: $132/yr is cheaper than every other route
If no: stop chasing it — you've calibrated the upper bound

Optimal config per use case

Manaoke karaoke

Default: Google Chirp 3 HD Charon / Kore (you own output, $1.50 for whole catalog at scale).

Voice variety: mix in VoxCPM2 Voice Design with character-instructs for non-narrator lines.

Avoid: AivisSpeech / SBV2 (AGPL). Shipping a public commercial product with copyleft inheritance is the wrong fight.

DBZ Narrator (personal)

Default: Irodori v3 caption mode with character-instructs ("迫力ある男性ナレーター、緊張感のある朗読" etc.). MIT license, JP-native voice, emoji for emphasis.

Backup: VoxCPM2 Voice Design with CFG 1.5 for variety.

Personal viewing = AGPL is acceptable if needed, but Irodori MIT is the cleaner default.

Mom's record narration

Default: Google Chirp 3 HD Charon (calm, polished, no fiddling). The 1M chars/mo free tier covers most realistic lengths. Personal use, so the license is irrelevant either way.

If offline matters: Irodori v3 with calm caption.

Silo / screenplay reads (offline)

Default: Irodori v3 (offline, MIT) for JP; Qwen3-TTS Ono_Anna as backup.

Air-gapped Silo machine: NO cloud option. Local-only mandate is a hard constraint here.

Strategy: chunk screenplays sentence-by-sentence, render per-character with distinct captions.
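The strategy can be sketched as a planner that turns a screenplay into per-sentence render jobs, each carrying its character's caption. The `SPEAKER: text` line format, the speaker names, and the function name are assumptions for illustration; the two caption strings are taken from this report.

```python
import re

# Hypothetical speaker-to-caption map; caption strings come from this report.
CAPTIONS = {
    "NARRATOR": "落ち着いた大人の朗読、自然な日本語",
    "HERO": "迫力ある男性ナレーター、緊張感のある朗読",
}


def render_jobs(script: str, captions: dict, default_caption: str) -> list[dict]:
    """Return one job per sentence, tagged with the speaker's caption."""
    jobs = []
    for line in script.splitlines():
        if ":" not in line:
            continue  # skip blank lines and stage directions (ASCII colon only)
        speaker, text = line.split(":", 1)
        speaker = speaker.strip()
        caption = captions.get(speaker, default_caption)
        for sentence in re.split(r"(?<=[。！？])", text):
            if sentence.strip():
                jobs.append({"speaker": speaker, "caption": caption, "text": sentence.strip()})
    return jobs


script = "NARRATOR: 町は静かだった。夜が来た。\nHERO: 行くぞ！"
jobs = render_jobs(script, CAPTIONS, default_caption=CAPTIONS["NARRATOR"])
for job in jobs:
    print(job["speaker"], job["text"])
```

Each job maps onto one local engine call with its own text and caption, which keeps every render short and keeps one voice description per character.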

Architectural reality on Apple Silicon (2026)

Things we found that aren't going to change soon — adjust expectations, don't fight the platform.

Diffusion-on-MLX is undercooked. VoxCPM2 upstream redirects performance-hungry users to nanovllm-voxcpm (CUDA fork). Apple's ml-explore/mlx team has zero in-flight TTS work across all 2026 issues. The accent artifact in clone mode is partly architectural.
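AGPL-3.0 engines are CPU-bound on Mac by maintainer choice. AivisSpeech ships with no Apple GPU acceleration (plain onnxruntime, no CoreML), and SBV2 runs inference on MPS but trains on CPU only. tsukumijima is explicit: 「GPU 対応は積極的には行っておりません」 ("we are not actively pursuing GPU support").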
The actively-Mac-friendly engines are RF-DiT (Irodori) and Apache-2.0 multilingual (VoxCPM2, Qwen3-TTS). Bet on these for local; bet on cloud for commercial polish.
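The local-versus-cloud split can be expressed as a filter over the engines covered here. A minimal sketch: the field names and the boolean simplification are assumptions. In particular, `copyleft_free_commercial` reflects this report's choice to avoid AGPL for commercial shipping, not a claim that AGPL forbids commercial use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    name: str
    license: str
    local: bool                     # runs on the Mac with no network call
    copyleft_free_commercial: bool  # commercial use without copyleft obligations, per this report


ENGINES = [
    Engine("VoxCPM2", "Apache-2.0", True, True),
    Engine("Irodori-TTS v3", "MIT", True, True),
    Engine("Qwen3-TTS", "Apache-2.0", True, True),
    Engine("Google Chirp 3 HD", "cloud", False, True),
    Engine("Fish Audio S2 (hosted Plus)", "hosted", False, True),
    Engine("AivisSpeech-Engine", "AGPL-3.0 (transitive)", True, False),
    Engine("Style-Bert-VITS2 (JP-Extra)", "AGPL-3.0", True, False),
]


def candidates(offline_required: bool, commercial: bool) -> list[str]:
    """Names of engines that satisfy the offline and commercial constraints."""
    return [
        e.name for e in ENGINES
        if (e.local or not offline_required)
        and (e.copyleft_free_commercial or not commercial)
    ]


print(candidates(offline_required=True, commercial=True))
print(candidates(offline_required=False, commercial=True))
```

The filter only narrows the field by license and deployment constraints; picking among the survivors still comes down to the ear test.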