---
license: other
license_name: boson-higgs-audio-v3-research-and-non-commercial-license
license_link: LICENSE
language:
- af
- ar
- as
- ast
- az
- ba
- be
- bg
- bn
- bs
- ca
- ceb
- ckb
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- ga
- gl
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- is
- it
- jv
- ka
- kab
- kam
- kea
- kk
- kn
- ko
- ky
- la
- lb
- lg
- ln
- lt
- luo
- lv
- mhr
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- ne
- nl
- no
- nso
- ny
- oc
- om
- pa
- pl
- ps
- pt
- ro
- ru
- rw
- sd
- sk
- sl
- sn
- so
- sq
- sr
- sv
- sw
- ta
- te
- tg
- tl
- tr
- ug
- uk
- umb
- ur
- uz
- vi
- xh
- zh
- zu
tags:
- text-to-speech
- speech-generation
- voice-agent
- expressive-speech
- controllable-tts
- multilingual-tts
pipeline_tag: text-to-speech
library_name: transformers
---

# Higgs Audio v3 TTS

Higgs Audio v3 TTS is built for voice chat: it **speaks, not just reads**. It turns model responses into expressive conversational speech across **100+ languages**, with **zero-shot voice cloning** and **inline control** over emotion, style, prosody, pauses, and sound effects.

> [!TIP]
> Released for research and non-commercial use under the **Boson Higgs Audio v3 Research and Non-Commercial License**. Production, hosted APIs, or revenue-generating use requires a separate commercial license. Prohibited: voice cloning without consent, impersonation, fraud, election deception, biometric surveillance, or any unlawful use.

![Higgs Audio v3 TTS Architecture](./assets/model_architecture.png)

The Higgs autoregressive decoder consumes interleaved text and audio tokens. Audio is encoded by the **Higgs Tokenizer** into 8 codebooks at 25 fps, staggered via a **delay pattern**, then mapped to backbone hidden states through a **multi-codebook fused embedding**. Output codes pass through a **multi-codebook fused head**, are de-delayed, and are decoded back to a waveform.

| Component | Spec |
|---|---|
| Backbone | ~4B autoregressive decoder (36 layers, hidden=2560, GQA 32/8) |
| Multi-codebook embedding / head | Fused single-tensor, tied with text embedding |
| Context length | 8,192 tokens (training sequence length) |
| Audio tokens | 8 codebooks × 1026 vocab, delay pattern |
| Sample rate | 24 kHz |
| Frame rate | 25 fps (40 ms / frame) |

## Supported Languages

The model reaches **single-digit WER/CER on 102 languages**, which split into two tiers.

### WER/CER under 5: polished, production-quality (85)

🇿🇦 Afrikaans · 🇸🇦🇪🇬 Arabic · 🇦🇲 Armenian · 🇮🇳 Assamese · 🇪🇸 Asturian · 🇦🇿 Azerbaijani · 🇷🇺 Bashkir · 🇪🇸 Basque · 🇧🇾 Belarusian · 🇧🇩🇮🇳 Bengali · 🇧🇦 Bosnian · 🇧🇬 Bulgarian · 🇪🇸 Catalan · 🇵🇭 Cebuano · 🇮🇶 Central Kurdish · 🇨🇳 Chinese · 🇭🇷 Croatian · 🇨🇿 Czech · 🇩🇰 Danish · 🇳🇱🇧🇪 Dutch · 🇷🇺 Eastern Mari · 🇺🇸🇬🇧🇦🇺 English · 🌍 Esperanto · 🇪🇪 Estonian · 🇫🇮 Finnish · 🇫🇷🇨🇦 French · 🇪🇸 Galician · 🇬🇪 Georgian · 🇩🇪🇦🇹 German · 🇬🇷 Greek · 🇮🇳 Gujarati · 🇭🇹 Haitian Creole · 🇳🇬 Hausa · 🇮🇱 Hebrew · 🇮🇳 Hindi · 🇭🇺 Hungarian · 🇮🇩 Indonesian · 🇮🇹 Italian · 🇯🇵 Japanese · 🇮🇩 Javanese · 🇮🇳 Kannada · 🇰🇿 Kazakh · 🇰🇷 Korean · 🇷🇼 Kinyarwanda · 🇰🇬 Kyrgyz · 🇱🇻 Latvian · 🇨🇩 Lingala · 🇱🇹 Lithuanian · 🇰🇪 Luo · 🇲🇰 Macedonian · 🇲🇾🇮🇩 Malay · 🇮🇳 Malayalam · 🇲🇹 Maltese · 🇳🇿 Māori · 🇮🇳 Marathi · 🇲🇳 Mongolian · 🇳🇵 Nepali · 🇳🇴 Norwegian · 🇫🇷 Occitan · 🇮🇷🇦🇫 Persian · 🇵🇱 Polish · 🇵🇹🇧🇷 Portuguese · 🇷🇴 Romanian · 🇷🇺 Russian · 🇿🇦 Sepedi · 🇷🇸 Serbian · 🇿🇼 Shona · 🇸🇰 Slovak · 🇸🇮 Slovene · 🇪🇸🇲🇽 Spanish · 🇹🇿🇰🇪 Swahili · 🇸🇪 Swedish · 🇵🇭 Tagalog · 🇹🇯 Tajik · 🇮🇳🇱🇰 Tamil · 🇮🇳 Telugu · 🇹🇭 Thai · 🇹🇷 Turkish · 🇺🇦 Ukrainian · 🇵🇰🇮🇳 Urdu · 🇨🇳 Uyghur · 🇺🇿 Uzbek · 🇻🇳 Vietnamese · 🇿🇦 Xhosa · 🇿🇦 Zulu

### WER/CER between 5 and 10: usable, but less polished (17)

🇦🇱 Albanian · 🇲🇼🇿🇲 Chichewa/Nyanja · 🇮🇳🇵🇰 Eastern Punjabi · 🇺🇬 Ganda · 🇮🇸 Icelandic · 🇮🇪 Irish · 🇩🇿 Kabyle · 🇨🇻 Kabuverdianu · 🇰🇪 Kamba · 🇻🇦 Latin · 🇱🇺 Luxembourgish · 🇪🇹🇰🇪 Oromo · 🇦🇫🇵🇰 Pashto · 🇵🇰🇮🇳 Sindhi · 🇸🇴 Somali · 🇦🇴 Umbundu · 🇬🇧 Welsh

## Control Tokens

All tags follow the `<|category:value|>` syntax and can be inserted mid-utterance.

> For how to place these tags when writing the target text (sentence-level vs. inline, `sfx` formatting, stacking, worked examples), see **[PROMPTING.md](./PROMPTING.md)**.

- **Emotion**: `elation`, `amusement`, `enthusiasm`, `determination`, `pride`, `contentment`, `affection`, `relief`, `contemplation`, `confusion`, `surprise`, `awe`, `longing`, `arousal`, `anger`, `fear`, `disgust`, `bitterness`, `sadness`, `shame`, `helplessness`

| Token | Description |
|---|---|
| `<\|emotion:elation\|>` | Elation / joy |
| `<\|emotion:amusement\|>` | Amusement / playful laughter |
| `<\|emotion:enthusiasm\|>` | Enthusiasm / excitement |
| `<\|emotion:determination\|>` | Determination / firmness |
| `<\|emotion:pride\|>` | Pride / confidence |
| `<\|emotion:contentment\|>` | Calm satisfaction |
| `<\|emotion:affection\|>` | Warmth / affection |
| `<\|emotion:relief\|>` | Relief |
| `<\|emotion:contemplation\|>` | Thoughtful / reflective |
| `<\|emotion:confusion\|>` | Confused |
| `<\|emotion:surprise\|>` | Surprised |
| `<\|emotion:awe\|>` | Awe / wonder |
| `<\|emotion:longing\|>` | Longing / yearning |
| `<\|emotion:arousal\|>` | Heightened desire |
| `<\|emotion:anger\|>` | Anger |
| `<\|emotion:fear\|>` | Fear |
| `<\|emotion:disgust\|>` | Disgust |
| `<\|emotion:bitterness\|>` | Bitterness |
| `<\|emotion:sadness\|>` | Sadness |
| `<\|emotion:shame\|>` | Shame |
| `<\|emotion:helplessness\|>` | Helplessness |

- **Style**: `singing`, `shouting`, `whispering`

| Token | Description |
|---|---|
| `<\|style:singing\|>` | Singing |
| `<\|style:shouting\|>` | Shouting / projected voice |
| `<\|style:whispering\|>` | Whisper |

- **Sound effects**: `cough`, `laughter`, `crying`, `screaming`, `burping`, `humming`, `sigh`, `sniff`, `sneeze`

Pair each token with the matching onomatopoeia immediately after it.

| Token | Description | Suggested onomatopoeia |
|---|---|---|
| `<\|sfx:cough\|>` | Cough | Ahem |
| `<\|sfx:laughter\|>` | Laughter | Haha / Hehe |
| `<\|sfx:crying\|>` | Crying | Boohoo / Sob |
| `<\|sfx:screaming\|>` | Screaming | Ahh / Aaah |
| `<\|sfx:burping\|>` | Burping | Burp |
| `<\|sfx:humming\|>` | Humming | Hmm / Mmm |
| `<\|sfx:sigh\|>` | Sigh | Uh / Ahh |
| `<\|sfx:sniff\|>` | Sniff | Sff |
| `<\|sfx:sneeze\|>` | Sneeze | Achoo |

- **Prosody**
  - Speed: `speed_very_slow`, `speed_slow`, `speed_fast`, `speed_very_fast`
  - Pauses: `pause`, `long_pause`
  - Pitch: `pitch_low`, `pitch_high`
  - Delivery: `expressive_high`, `expressive_low`

| Token | Effect |
|---|---|
| `<\|prosody:speed_very_slow\|>` | ≈0.65× speed |
| `<\|prosody:speed_slow\|>` | ≈0.85× speed |
| `<\|prosody:speed_fast\|>` | ≈1.2× speed |
| `<\|prosody:speed_very_fast\|>` | ≈1.4× speed |
| `<\|prosody:pitch_low\|>` | ≈−3 semitones |
| `<\|prosody:pitch_high\|>` | ≈+2.5 semitones |
| `<\|prosody:pause\|>` | ≈400–700 ms pause |
| `<\|prosody:long_pause\|>` | ≈700–1500 ms pause |
| `<\|prosody:expressive_high\|>` | More expressive delivery |
| `<\|prosody:expressive_low\|>` | Flatter delivery |

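
The `<|category:value|>` syntax and the sfx pairing rule above can be wrapped in a small client-side helper. This is a minimal sketch, not part of the released tooling: the names `ALLOWED`, `SFX_ONOMATOPOEIA`, `tag`, and `sfx` are hypothetical, the value lists are copied from this card, and where the table suggests two onomatopoeia the first one is used.

```python
# Suggested onomatopoeia per sound effect (first suggestion from the table above).
SFX_ONOMATOPOEIA = {
    "cough": "Ahem",
    "laughter": "Haha",
    "crying": "Boohoo",
    "screaming": "Ahh",
    "burping": "Burp",
    "humming": "Hmm",
    "sigh": "Uh",
    "sniff": "Sff",
    "sneeze": "Achoo",
}

# Values listed in this card, grouped by category.
ALLOWED = {
    "emotion": {
        "elation", "amusement", "enthusiasm", "determination", "pride",
        "contentment", "affection", "relief", "contemplation", "confusion",
        "surprise", "awe", "longing", "arousal", "anger", "fear", "disgust",
        "bitterness", "sadness", "shame", "helplessness",
    },
    "style": {"singing", "shouting", "whispering"},
    "prosody": {
        "speed_very_slow", "speed_slow", "speed_fast", "speed_very_fast",
        "pause", "long_pause", "pitch_low", "pitch_high",
        "expressive_high", "expressive_low",
    },
    "sfx": set(SFX_ONOMATOPOEIA),
}


def tag(category: str, value: str) -> str:
    """Format one control token, rejecting values not listed in this card."""
    if value not in ALLOWED.get(category, set()):
        raise ValueError(f"unknown control token: {category}:{value}")
    return f"<|{category}:{value}|>"


def sfx(name: str) -> str:
    """Format a sound-effect token immediately followed by its onomatopoeia."""
    return tag("sfx", name) + SFX_ONOMATOPOEIA[name]


text = (
    tag("emotion", "amusement")
    + "That was kind of hilarious. "
    + sfx("laughter")
    + ", I was not ready for that."
)
print(text)
```

Validating against a fixed list catches typos such as `<|emotion:boredom|>` before the text reaches the model; how the model treats an unlisted tag is not specified here.
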
## Evaluation Benchmarks

### Multilingual Voice Clone

We evaluate Higgs Audio v3 TTS on public multilingual TTS suites and our internal 111-language Higgs-Multilingual set, covering both common and lower-resource languages.

WER / CER (↓, ×100) is macro-averaged across each benchmark's language set. Lower is better; **bold** marks the best per row. All numbers are reproducible end-to-end with original metrics and normalization.

| Benchmark | Higgs Audio v2 | Higgs Audio v3 | Fish Audio S2 Pro | Qwen3-TTS-1.7B | VibeVoice-7B | IndexTTS-2 | MiMo-Audio-7B-Instruct | MOSS-TTS-v1.5 | OmniVoice | ChatterBox | FireRedTTS-2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SeedTTS | 2.10 | **1.11** | 1.31 | 1.30 | 3.59 | 1.63 | 3.70 | 1.73 | 1.21 | 17.00 | 1.72 |
| CV3 | 21.19 | **4.41** | 4.60 | 7.73 | 11.66 | 129.26 | 71.55 | 6.11 | 4.92 | 32.62 | 19.20 |
| MiniMax-Multilingual | 49.86 | **2.74** | 5.15 | 27.41 | 8.21 | 112.91 | 85.67 | 3.78 | 2.98 | 49.30 | 12.52 |
| Higgs-Multilingual | 52.24 | **3.61** | 8.68 | 97.09 | 13.74 | 57.71 | 59.61 | 21.28 | 3.63 | 57.52 | 33.69 |

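
To make "macro-averaged WER / CER (×100)" concrete, here is a minimal sketch of one way such a number could be computed. It is an illustration under stated assumptions, not the evaluation code behind the table: the function names are hypothetical, the per-language rate is assumed to be corpus-level (total edits over total reference units), text normalization and ASR transcription are omitted, and the sample sentences are toy data.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two sequences of units."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(pairs: list[tuple[str, str]], char_level: bool = False) -> float:
    """WER (or CER when char_level=True), x100, over (reference, hypothesis) pairs."""
    errors = total = 0
    for ref, hyp in pairs:
        ref_units = list(ref) if char_level else ref.split()
        hyp_units = list(hyp) if char_level else hyp.split()
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return 100.0 * errors / total


def macro_average(per_language: dict[str, float]) -> float:
    """Unweighted mean over languages, so each language counts equally."""
    return sum(per_language.values()) / len(per_language)


# Toy data: one substituted word in English, a perfect match in Chinese.
scores = {
    "en": error_rate([("the cat sat", "the cat sat"), ("hello world", "hello word")]),
    "zh": error_rate([("你好世界", "你好世界")], char_level=True),
}
print(scores, macro_average(scores))
```

Macro-averaging weights every language equally regardless of how many test utterances it has, which is why a single weak language can move the benchmark-level number noticeably.
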
### Emergent TTS

Win-rate (↑) per category, measured as judge preference vs. the baseline; **bold** marks the highest win-rate per column. For a fair comparison, every model shares the same reference audio per prompt, and we run the benchmark text verbatim, with no inline control tags inserted.

| Model | Overall ↑ | Emotions ↑ | Foreign Words ↑ | Paralinguistics ↑ | Complex Pronunciation ↑ | Questions ↑ | Syntactic Complexity ↑ |
|---|---|---|---|---|---|---|---|
| Higgs Audio v3 | **53.65%** | 53.75% | **48.75%** | **68.57%** | 25.10% | **61.43%** | **60.71%** |
| Fish Audio S2 Pro | 43.80% | 53.04% | 33.93% | 53.75% | 18.16% | 55.00% | 45.71% |
| Qwen3-TTS-1.7B | 38.84% | 45.54% | 24.64% | 44.29% | **30.00%** | 53.39% | 34.11% |
| IndexTTS-2 | 31.12% | 39.29% | 5.36% | 42.50% | 12.45% | 45.89% | 38.93% |
| MOSS-TTS-v1.5 | 43.89% | 60.54% | 35.18% | 51.43% | 11.63% | 53.21% | 47.32% |
| OmniVoice | 40.82% | **61.07%** | 28.75% | 52.68% | 13.67% | 45.00% | 40.36% |

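
As a rough illustration of a per-category win-rate, the sketch below aggregates pairwise judge verdicts against a baseline. It is not the benchmark's scoring code: the function name, the verdict labels, and the choice to count a tie as half a win are assumptions made for this example, and the verdicts are toy data.

```python
from collections import defaultdict

# Assumed scoring: a tie counts as half a win for the candidate.
SCORE = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def win_rates(verdicts: list[tuple[str, str]]) -> dict[str, float]:
    """Return win-rate in percent per category from (category, outcome) pairs.

    Each outcome describes the candidate model against the baseline.
    """
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for category, outcome in verdicts:
        totals[category] += SCORE[outcome]
        counts[category] += 1
    return {c: 100.0 * totals[c] / counts[c] for c in counts}


# Toy verdicts for two categories.
toy = [
    ("Emotions", "win"), ("Emotions", "loss"),
    ("Emotions", "tie"), ("Emotions", "win"),
    ("Questions", "win"), ("Questions", "loss"),
]
print(win_rates(toy))
```

Under this convention, 50% means the candidate is on par with the baseline in that category.
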
## Usage

### SGLang Usage

Pair the weights in this repo with [**SGLang-Omni**](https://github.com/sgl-project/sglang-omni), a production serving stack with continuous batching for multi-codebook decoding and the same inline tag controls. The [Higgs TTS cookbook](https://sgl-project.github.io/sglang-omni/cookbook/higgs_tts.html) walks you through installation, server launch, request examples, and the full API reference.

#### Install and Serve

```bash
docker pull lmsysorg/sglang-omni:dev
docker run -it --gpus all --shm-size 32g --ipc host --network host --privileged \
  lmsysorg/sglang-omni:dev /bin/zsh

git clone git@github.com:sgl-project/sglang-omni.git && cd sglang-omni
uv venv .venv -p 3.12 && source .venv/bin/activate
uv pip install -v -e .
```

```bash
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
hf download bosonai/higgs-audio-v3-tts-4b

sgl-omni serve \
  --model-path bosonai/higgs-audio-v3-tts-4b \
  --port 8000
```

#### Zero-shot synthesis

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, how are you?"}' \
  --output output.wav
```

#### Voice cloning

Supplying the reference transcript (`text`) materially improves cloning fidelity.

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Have a nice day and enjoy south california sunshine.",
        # Reference clip to clone, plus its transcript.
        "references": [{
            "audio_path": "ref.wav",
            "text": "Hey, Adam here. Let's create something that feels real, sounds human, and connects every time.",
        }],
        "temperature": 0.8,
        "top_k": 50,
        "max_new_tokens": 1024,
    },
)
resp.raise_for_status()  # fail loudly instead of writing an error body to disk

with open("output.wav", "wb") as f:
    f.write(resp.content)
```

#### Streaming (Server-Sent Events)

Set `"stream": true` to receive base64-encoded WAV chunks as the vocoder emits them, with sub-second time-to-first-audio. Each event carries `audio.data` (base64 WAV bytes); the terminal event has `finish_reason: "stop"` plus usage metadata.

```python
import base64
import json

import requests

with requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={"input": "Get the trust fund to the bank early.", "stream": True},
    stream=True,
) as resp, open("output.wav", "wb") as f:
    for line in resp.iter_lines():
        # Skip blank keep-alive lines, non-data lines, and the [DONE] sentinel.
        if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        event = json.loads(line[6:])  # drop the "data: " prefix
        if event.get("finish_reason") == "stop":
            break
        audio = event.get("audio") or {}
        if audio.get("data"):
            f.write(base64.b64decode(audio["data"]))
```

#### Inline control tokens

Embed `<|emotion:…|>`, `<|style:…|>`, `<|prosody:…|>`, and `<|sfx:…|>` tokens directly in `input`. Two rules:

1. **Delivery tokens first.** Emotion, style, and the prosody *speed / pitch / expressive* tokens shape the whole turn, so put them at the start of `input`. Positional tokens (`<|prosody:pause|>`, `<|prosody:long_pause|>`, `<|sfx:…|>`) go inline exactly where they fire.
2. **Pair every `<|sfx:…|>` with its onomatopoeia.** E.g. `<|sfx:laughter|>Haha`, `<|sfx:sigh|>Uh`, `<|sfx:sneeze|>Achoo`. The written sound gives the model the acoustic cue to realize the effect.

Example (amusement + laughter):

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "<|emotion:amusement|><|prosody:expressive_high|>Wait, wait, that was kind of hilarious. <|sfx:laughter|>Hehe, no, seriously, I was not ready for that."}' \
  --output output.wav
```

#### Throughput

Throughput on Seed-TTS EN (full set, **N=1088** per run). Client `--max-concurrency` sweep against a Higgs server (`max_running_requests=16`, bf16, CUDA Graph on). Each row is the **mean of 3 runs**. Hardware: **1× H100**.

| Concurrency | Throughput (req/s) | Mean latency | RTF (per-req) | audio_s/s |
|---|---|---|---|---|
| 1 | 1.62 | 617 ms | 0.147 | 6.89 |
| 2 | 2.70 | 742 ms | 0.180 | 11.37 |
| 4 | 5.45 | 733 ms | 0.177 | 22.84 |
| 8 | 8.91 | 898 ms | 0.217 | 37.38 |
| 16 | 14.74 | 1079 ms | 0.262 | 61.84 |

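
The columns of the table can be derived from per-request timings and the total wall-clock time of a run. The sketch below is illustrative only and is not the benchmark script: `RequestRecord` and `summarize` are hypothetical names, end-to-end latency is used as a stand-in for processing time when computing RTF, and the records are toy data.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_s: float  # end-to-end time for the request, in seconds
    audio_s: float    # duration of the generated audio, in seconds


def summarize(records: list[RequestRecord], wall_clock_s: float) -> dict[str, float]:
    """Aggregate one benchmark run into the four reported metrics."""
    n = len(records)
    return {
        # Completed requests per second of wall-clock time.
        "throughput_req_s": n / wall_clock_s,
        # Average end-to-end latency, in milliseconds.
        "mean_latency_ms": 1000.0 * sum(r.latency_s for r in records) / n,
        # Average of per-request (time / audio duration); < 1 is faster than real time.
        "rtf_per_req": sum(r.latency_s / r.audio_s for r in records) / n,
        # Seconds of audio produced per second of wall-clock time.
        "audio_s_per_s": sum(r.audio_s for r in records) / wall_clock_s,
    }


# Toy run: two requests, each producing 4 s of audio, finished in 2 s overall.
toy_run = [RequestRecord(latency_s=0.6, audio_s=4.0),
           RequestRecord(latency_s=0.8, audio_s=4.0)]
print(summarize(toy_run, wall_clock_s=2.0))
```

Note that per-request RTF and aggregate `audio_s/s` answer different questions: the first is about how fast a single caller gets audio, the second about how much audio the server produces in total under load.
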
- **Concurrency**: Maximum number of in-flight client requests (`--max-concurrency`).
- **Throughput (req/s)**: Completed requests divided by total benchmark wall-clock time.
- **Mean latency**: Average end-to-end time per request (send to full response received).
- **RTF (per-req)**: Average ratio of processing time to generated audio duration per request (<1 is faster than real time).
- **audio_s/s**: Total seconds of audio produced divided by total benchmark wall-clock time.

To reproduce the results, follow the instructions in [this script](https://github.com/sgl-project/sglang-omni/blob/main/benchmarks/eval/benchmark_tts_seedtts.py).

### API Usage

For zero-ops deployment, use the [**Boson AI API**](https://docs.boson.ai/models/higgs-audio-tts/overview).

## Citation

```bibtex
@misc{bosonai_higgs_audio_tts_v3_2026,
  title        = {Higgs Audio v3 TTS: Conversational Speech for Voice AI from Boson AI},
  author       = {Boson AI},
  year         = {2026},
  howpublished = {https://huggingface.co/bosonai/higgs-audio-v3-tts-4b},
}
```

## License

Boson Higgs Audio v3 Research and Non-Commercial License. See [LICENSE](./LICENSE).