---
license: other
license_name: boson-higgs-audio-v3-research-and-non-commercial-license
license_link: LICENSE
language:
- af
- ar
- as
- ast
- az
- ba
- be
- bg
- bn
- bs
- ca
- ceb
- ckb
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- ga
- gl
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- is
- it
- jv
- ka
- kab
- kam
- kea
- kk
- kn
- ko
- ky
- la
- lb
- lg
- ln
- lt
- luo
- lv
- mhr
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- ne
- nl
- no
- nso
- ny
- oc
- om
- pa
- pl
- ps
- pt
- ro
- ru
- rw
- sd
- sk
- sl
- sn
- so
- sq
- sr
- sv
- sw
- ta
- te
- tg
- tl
- tr
- ug
- uk
- umb
- ur
- uz
- vi
- xh
- zh
- zu
tags:
- text-to-speech
- speech-generation
- voice-agent
- expressive-speech
- controllable-tts
- multilingual-tts
pipeline_tag: text-to-speech
library_name: transformers
---
# Higgs Audio v3 TTS
Higgs Audio v3 TTS is built for voice chat: it **speaks, not just reads**. It turns model responses into expressive conversational speech across **100+ languages**, with **zero-shot voice cloning** and **inline control** over emotion, style, prosody, pauses, and sound effects.
> [!TIP]
> Released for research and non-commercial use under the **Boson Higgs Audio v3 Research and Non-Commercial License**. Production, hosted APIs, or revenue-generating use requires a separate commercial license. Prohibited: voice cloning without consent, impersonation, fraud, election deception, biometric surveillance, or any unlawful use.

The Higgs autoregressive decoder consumes interleaved text and audio tokens. Audio is encoded by the **Higgs Tokenizer** into 8 codebooks at 25 fps, staggered via a **delay pattern**, then mapped to backbone hidden states through a **multi-codebook fused embedding**. Output codes pass through a **multi-codebook fused head**, are de-delayed, and are decoded back to a waveform.
| Component | Spec |
|---|---|
| Backbone | ~4B autoregressive decoder (36 L, hidden=2560, GQA 32/8) |
| Multi-codebook embedding / head | Fused single-tensor, tied with text embedding |
| Context length | 8,192 tokens (training sequence length) |
| Audio tokens | 8 codebooks × 1026 vocab, delay pattern |
| Sample rate | 24 kHz |
| Frame rate | 25 fps (40 ms / frame) |
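
As a rough illustration of the delay pattern, the sketch below staggers an 8-codebook grid of frames so that codebook `k` is shifted right by `k` steps, then undoes the shift. The one-step-per-codebook offset and the `PAD` filler id are assumptions taken from the common delay-pattern formulation, not details confirmed for the Higgs implementation; `apply_delay` and `remove_delay` are hypothetical names.

```python
PAD = -1  # hypothetical filler id; the real model would use its own special codes


def apply_delay(codes):
    """Shift codebook k right by k steps.

    codes: list of K rows, each holding T codes. Returns K rows of length T + K - 1.
    """
    k_total = len(codes)
    return [
        [PAD] * k + list(row) + [PAD] * (k_total - 1 - k)
        for k, row in enumerate(codes)
    ]


def remove_delay(delayed):
    """Undo apply_delay and recover the aligned K x T grid."""
    k_total = len(delayed)
    t_total = len(delayed[0]) - (k_total - 1)
    return [row[k:k + t_total] for k, row in enumerate(delayed)]


# 8 codebooks, 5 frames (5 frames at 25 fps = 200 ms of audio)
codes = [[10 * k + t for t in range(5)] for k in range(8)]
delayed = apply_delay(codes)
restored = remove_delay(delayed)
```

Under this assumed layout, staggering 8 codebooks adds 7 extra steps to every sequence, which is 280 ms at 40 ms per frame.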
## Supported Languages
The model reaches **single-digit WER/CER on 102 languages**, which split into two tiers.
### WER/CER under 5: polished, production-quality (85)
🇿🇦 Afrikaans · 🇸🇦🇪🇬 Arabic · 🇦🇲 Armenian · 🇮🇳 Assamese · 🇪🇸 Asturian · 🇦🇿 Azerbaijani · 🇷🇺 Bashkir · 🇪🇸 Basque · 🇧🇾 Belarusian · 🇧🇩🇮🇳 Bengali · 🇧🇦 Bosnian · 🇧🇬 Bulgarian · 🇪🇸 Catalan · 🇵🇭 Cebuano · 🇮🇶 Central Kurdish · 🇨🇳 Chinese · 🇭🇷 Croatian · 🇨🇿 Czech · 🇩🇰 Danish · 🇳🇱🇧🇪 Dutch · 🇷🇺 Eastern Mari · 🇺🇸🇬🇧🇦🇺 English · Esperanto · 🇪🇪 Estonian · 🇫🇮 Finnish · 🇫🇷🇨🇦 French · 🇪🇸 Galician · 🇬🇪 Georgian · 🇩🇪🇦🇹 German · 🇬🇷 Greek · 🇮🇳 Gujarati · 🇭🇹 Haitian Creole · 🇳🇬 Hausa · 🇮🇱 Hebrew · 🇮🇳 Hindi · 🇭🇺 Hungarian · 🇮🇩 Indonesian · 🇮🇹 Italian · 🇯🇵 Japanese · 🇮🇩 Javanese · 🇮🇳 Kannada · 🇰🇿 Kazakh · 🇰🇷 Korean · 🇷🇼 Kinyarwanda · 🇰🇬 Kyrgyz · 🇱🇻 Latvian · 🇨🇩 Lingala · 🇱🇹 Lithuanian · 🇰🇪 Luo · 🇲🇰 Macedonian · 🇲🇾🇮🇩 Malay · 🇮🇳 Malayalam · 🇲🇹 Maltese · 🇳🇿 Māori · 🇮🇳 Marathi · 🇲🇳 Mongolian · 🇳🇵 Nepali · 🇳🇴 Norwegian · 🇫🇷 Occitan · 🇮🇷🇦🇫 Persian · 🇵🇱 Polish · 🇵🇹🇧🇷 Portuguese · 🇷🇴 Romanian · 🇷🇺 Russian · 🇿🇦 Sepedi · 🇷🇸 Serbian · 🇿🇼 Shona · 🇸🇰 Slovak · 🇸🇮 Slovene · 🇪🇸🇲🇽 Spanish · 🇹🇿🇰🇪 Swahili · 🇸🇪 Swedish · 🇵🇭 Tagalog · 🇹🇯 Tajik · 🇮🇳🇱🇰 Tamil · 🇮🇳 Telugu · 🇹🇭 Thai · 🇹🇷 Turkish · 🇺🇦 Ukrainian · 🇵🇰🇮🇳 Urdu · 🇨🇳 Uyghur · 🇺🇿 Uzbek · 🇻🇳 Vietnamese · 🇿🇦 Xhosa · 🇿🇦 Zulu
### WER/CER between 5 and 10: usable, but less polished (17)
🇦🇱 Albanian · 🇲🇼🇿🇲 Chichewa/Nyanja · 🇮🇳🇵🇰 Eastern Punjabi · 🇺🇬 Ganda · 🇮🇸 Icelandic · 🇮🇪 Irish · 🇩🇿 Kabyle · 🇨🇻 Kabuverdianu · 🇰🇪 Kamba · 🇻🇦 Latin · 🇱🇺 Luxembourgish · 🇪🇹🇰🇪 Oromo · 🇦🇫🇵🇰 Pashto · 🇵🇰🇮🇳 Sindhi · 🇸🇴 Somali · 🇦🇴 Umbundu · 🇬🇧 Welsh
## Control Tokens
All tags follow `<|category:value|>` syntax and can be inserted mid-utterance.
> For how to place these tags when writing the target text (sentence-level vs. inline, `sfx` formatting, stacking, worked examples), see **[PROMPTING.md](./PROMPTING.md)**.
- **Emotion**: `elation`, `amusement`, `enthusiasm`, `determination`, `pride`, `contentment`, `affection`, `relief`, `contemplation`, `confusion`, `surprise`, `awe`, `longing`, `arousal`, `anger`, `fear`, `disgust`, `bitterness`, `sadness`, `shame`, `helplessness`

| Token | Description |
|---|---|
| `<\|emotion:elation\|>` | Elation / joy |
| `<\|emotion:amusement\|>` | Amusement / playful laughter |
| `<\|emotion:enthusiasm\|>` | Enthusiasm / excitement |
| `<\|emotion:determination\|>` | Determination / firmness |
| `<\|emotion:pride\|>` | Pride / confidence |
| `<\|emotion:contentment\|>` | Calm satisfaction |
| `<\|emotion:affection\|>` | Warmth / affection |
| `<\|emotion:relief\|>` | Relief |
| `<\|emotion:contemplation\|>` | Thoughtful / reflective |
| `<\|emotion:confusion\|>` | Confused |
| `<\|emotion:surprise\|>` | Surprised |
| `<\|emotion:awe\|>` | Awe / wonder |
| `<\|emotion:longing\|>` | Longing / yearning |
| `<\|emotion:arousal\|>` | Heightened desire |
| `<\|emotion:anger\|>` | Anger |
| `<\|emotion:fear\|>` | Fear |
| `<\|emotion:disgust\|>` | Disgust |
| `<\|emotion:bitterness\|>` | Bitterness |
| `<\|emotion:sadness\|>` | Sadness |
| `<\|emotion:shame\|>` | Shame |
| `<\|emotion:helplessness\|>` | Helplessness |

- **Style**: `singing`, `shouting`, `whispering`

| Token | Description |
|---|---|
| `<\|style:singing\|>` | Singing |
| `<\|style:shouting\|>` | Shouting / projected voice |
| `<\|style:whispering\|>` | Whisper |

- **Sound effects**: `cough`, `laughter`, `crying`, `screaming`, `burping`, `humming`, `sigh`, `sniff`, `sneeze`

Pair each token with the matching onomatopoeia immediately after it.

| Token | Description | Suggested onomatopoeia |
|---|---|---|
| `<\|sfx:cough\|>` | Cough | Ahem |
| `<\|sfx:laughter\|>` | Laughter | Haha / Hehe |
| `<\|sfx:crying\|>` | Crying | Boohoo / Sob |
| `<\|sfx:screaming\|>` | Screaming | Ahh / Aaah |
| `<\|sfx:burping\|>` | Burping | Burp |
| `<\|sfx:humming\|>` | Humming | Hmm / Mmm |
| `<\|sfx:sigh\|>` | Sigh | Uh / Ahh |
| `<\|sfx:sniff\|>` | Sniff | Sff |
| `<\|sfx:sneeze\|>` | Sneeze | Achoo |

- **Prosody**
  - Speed: `speed_very_slow`, `speed_slow`, `speed_fast`, `speed_very_fast`
  - Pauses: `pause`, `long_pause`
  - Pitch: `pitch_low`, `pitch_high`
  - Delivery: `expressive_high`, `expressive_low`

| Token | Effect |
|---|---|
| `<\|prosody:speed_very_slow\|>` | ≈0.65× speed |
| `<\|prosody:speed_slow\|>` | ≈0.85× speed |
| `<\|prosody:speed_fast\|>` | ≈1.2× speed |
| `<\|prosody:speed_very_fast\|>` | ≈1.4× speed |
| `<\|prosody:pitch_low\|>` | ≈−3 semitones |
| `<\|prosody:pitch_high\|>` | ≈+2.5 semitones |
| `<\|prosody:pause\|>` | ≈400–700 ms pause |
| `<\|prosody:long_pause\|>` | ≈700–1500 ms pause |
| `<\|prosody:expressive_high\|>` | More expressive delivery |
| `<\|prosody:expressive_low\|>` | Flatter delivery |
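
Because a misspelled tag is easy to miss, it can help to check text against the vocabulary before sending it. The following is a minimal client-side sketch, not part of the Higgs tooling: `parse_tags`, `unknown_tags`, and `strip_tags` are hypothetical helpers, and the vocabulary is copied from the lists above.

```python
import re

# Values copied from the control-token lists above.
VOCAB = {
    "emotion": {
        "elation", "amusement", "enthusiasm", "determination", "pride",
        "contentment", "affection", "relief", "contemplation", "confusion",
        "surprise", "awe", "longing", "arousal", "anger", "fear", "disgust",
        "bitterness", "sadness", "shame", "helplessness",
    },
    "style": {"singing", "shouting", "whispering"},
    "sfx": {
        "cough", "laughter", "crying", "screaming", "burping",
        "humming", "sigh", "sniff", "sneeze",
    },
    "prosody": {
        "speed_very_slow", "speed_slow", "speed_fast", "speed_very_fast",
        "pause", "long_pause", "pitch_low", "pitch_high",
        "expressive_high", "expressive_low",
    },
}

# Matches the <|category:value|> syntax.
TAG_RE = re.compile(r"<\|([a-z_]+):([a-z_]+)\|>")


def parse_tags(text):
    """Return (category, value) pairs in order of appearance."""
    return TAG_RE.findall(text)


def unknown_tags(text):
    """Return the tags whose category or value is not in VOCAB."""
    return [(c, v) for c, v in parse_tags(text) if v not in VOCAB.get(c, set())]


def strip_tags(text):
    """Remove every tag, leaving only the words to be spoken."""
    return TAG_RE.sub("", text)


sample = "<|emotion:relief|>Okay. <|prosody:pause|>We made it. <|sfx:sigh|>Ahh"
tags = parse_tags(sample)
bad = unknown_tags("<|emotion:joy|>Hi")
plain = strip_tags(sample)
```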
## Evaluation Benchmarks
### Multilingual Voice Clone
We evaluate Higgs Audio v3 TTS on public multilingual TTS suites and our internal 111-language Higgs-Multilingual set, covering both common and lower-resource languages.
Scores are WER / CER (↓, ×100), macro-averaged across each benchmark's language set. Lower is better; **bold** marks the best per row. All numbers are reproducible end-to-end with the original metrics and normalization.

| Benchmark | Higgs Audio v2 | Higgs Audio v3 | Fish Audio S2 Pro | Qwen3-TTS-1.7B | VibeVoice-7B | IndexTTS-2 | MiMo-Audio-7B-Instruct | MOSS-TTS-v1.5 | OmniVoice | ChatterBox | FireRedTTS-2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SeedTTS | 2.10 | **1.11** | 1.31 | 1.30 | 3.59 | 1.63 | 3.70 | 1.73 | 1.21 | 17.00 | 1.72 |
| CV3 | 21.19 | **4.41** | 4.60 | 7.73 | 11.66 | 129.26 | 71.55 | 6.11 | 4.92 | 32.62 | 19.20 |
| MiniMax-Multilingual | 49.86 | **2.74** | 5.15 | 27.41 | 8.21 | 112.91 | 85.67 | 3.78 | 2.98 | 49.30 | 12.52 |
| Higgs-Multilingual | 52.24 | **3.61** | 8.68 | 97.09 | 13.74 | 57.71 | 59.61 | 21.28 | 3.63 | 57.52 | 33.69 |
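
To make "macro-averaged" concrete, here is a minimal sketch of how a per-language error rate and its macro average could be computed. It is not the evaluation code behind the table: text normalization is omitted, pooling errors over a language's utterances before dividing is an assumption, and `edit_distance`, `error_rate`, and `macro_average` are hypothetical names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = cur
    return prev[-1]


def error_rate(pairs, unit="word"):
    """WER (unit='word') or CER (unit='char') over (reference, hypothesis) pairs, x100."""
    split = str.split if unit == "word" else list
    errors = sum(edit_distance(split(r), split(h)) for r, h in pairs)
    total = sum(len(split(r)) for r, _ in pairs)
    return 100.0 * errors / total


def macro_average(per_language):
    """Unweighted mean over languages, so every language counts equally."""
    return sum(per_language.values()) / len(per_language)


scores = {
    "en": error_rate([("the cat sat", "the cat sat"), ("hello world", "hello word")]),
    "zh": error_rate([("你好世界", "你好世届")], unit="char"),
}
overall = macro_average(scores)
```

Macro-averaging weights a language with few test utterances the same as one with many, which is why a single weak language can move a benchmark-level number noticeably.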
### Emergent TTS
Win-rate (↑) per category, measured as judge preference vs the BASELINE row; **bold** marks the highest win-rate per column. For a fair comparison, every model shares the same reference audio per prompt, and we run the benchmark text verbatim, with no inline control tags inserted.

| Model | Overall ↑ | Emotions ↑ | Foreign Words ↑ | Paralinguistics ↑ | Complex Pronunciation ↑ | Questions ↑ | Syntactic Complexity ↑ |
|---|---|---|---|---|---|---|---|
| Higgs Audio v3 | **53.65%** | 53.75% | **48.75%** | **68.57%** | 25.10% | **61.43%** | **60.71%** |
| Fish Audio S2 Pro | 43.80% | 53.04% | 33.93% | 53.75% | 18.16% | 55.00% | 45.71% |
| Qwen3-TTS-1.7B | 38.84% | 45.54% | 24.64% | 44.29% | **30.00%** | 53.39% | 34.11% |
| IndexTTS-2 | 31.12% | 39.29% | 5.36% | 42.50% | 12.45% | 45.89% | 38.93% |
| MOSS-TTS-v1.5 | 43.89% | 60.54% | 35.18% | 51.43% | 11.63% | 53.21% | 47.32% |
| OmniVoice | 40.82% | **61.07%** | 28.75% | 52.68% | 13.67% | 45.00% | 40.36% |
## Usage
### SGLang Usage
Pair the weights in this repo with [**SGLang-Omni**](https://github.com/sgl-project/sglang-omni), a production serving stack with continuous batching for multi-codebook decoding and the same inline tag controls. The Higgs TTS cookbook walks you through installation, server launch, request examples, and the full API reference.
See the [Higgs TTS cookbook](https://sgl-project.github.io/sglang-omni/cookbook/higgs_tts.html) for the full details.
#### Install and Serve
```bash
# Pull the dev image and open a shell in the container.
docker pull lmsysorg/sglang-omni:dev
docker run -it --gpus all --shm-size 32g --ipc host --network host --privileged \
  lmsysorg/sglang-omni:dev /bin/zsh
# In that shell: install SGLang-Omni from source.
git clone git@github.com:sgl-project/sglang-omni.git && cd sglang-omni
uv venv .venv -p 3.12 && source .venv/bin/activate
uv pip install -v -e .
```
```bash
# Replace the placeholder with your own Hugging Face token.
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
hf download bosonai/higgs-audio-v3-tts-4b
sgl-omni serve \
  --model-path bosonai/higgs-audio-v3-tts-4b \
  --port 8000
```
#### Zero-shot synthesis
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, how are you?"}' \
  --output output.wav
```
#### Voice cloning
Supplying the reference transcript (`text`) materially improves cloning fidelity.
```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Have a nice day and enjoy south california sunshine.",
        # Reference clip to clone, plus its transcript.
        "references": [{
            "audio_path": "ref.wav",
            "text": "Hey, Adam here. Let's create something that feels real, sounds human, and connects every time.",
        }],
        "temperature": 0.8, "top_k": 50, "max_new_tokens": 1024,
    },
)
resp.raise_for_status()  # stop here rather than saving an error body as audio
with open("output.wav", "wb") as f:
    f.write(resp.content)
```
#### Streaming (Server-Sent Events)
Set `"stream": true` to receive base64-encoded WAV chunks as the vocoder emits them, giving sub-second time-to-first-audio. Each event carries `audio.data` (base64 WAV bytes); the terminal event has `finish_reason: "stop"` plus usage metadata.
```python
import base64
import json

import requests

with requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={"input": "Get the trust fund to the bank early.", "stream": True},
    stream=True,
) as resp, open("output.wav", "wb") as f:
    for line in resp.iter_lines():
        # Skip keep-alives, non-data lines, and the [DONE] sentinel.
        if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        event = json.loads(line[6:])  # drop the "data: " prefix
        if event.get("finish_reason") == "stop":
            break
        audio = event.get("audio") or {}
        if audio.get("data"):
            f.write(base64.b64decode(audio["data"]))
```
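
If each streamed chunk turns out to be a complete WAV file with its own header (an assumption; the server may instead send a single header followed by raw audio, in which case plain concatenation as above is already right), appending chunks byte-for-byte leaves extra headers inside the file. The sketch below merges such chunks with the standard-library `wave` module. `merge_wav_chunks` and `silent_wav` are hypothetical helpers, and the mono 16-bit format in the demo is likewise assumed; only the 24 kHz rate comes from the spec table.

```python
import io
import wave


def merge_wav_chunks(chunks):
    """Concatenate standalone WAV byte strings into one WAV byte string."""
    out = io.BytesIO()
    writer = None
    for chunk in chunks:
        with wave.open(io.BytesIO(chunk), "rb") as reader:
            if writer is None:
                # Take channels, sample width, and rate from the first chunk.
                writer = wave.open(out, "wb")
                writer.setparams(reader.getparams())
            writer.writeframes(reader.readframes(reader.getnframes()))
    if writer is not None:
        writer.close()  # patches the header with the final frame count
    return out.getvalue()


def silent_wav(n_frames, rate=24000):
    """Build a mono 16-bit WAV of silence, used here only to exercise the merge."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


merged = merge_wav_chunks([silent_wav(2400), silent_wav(1200)])
with wave.open(io.BytesIO(merged), "rb") as reader:
    total_frames = reader.getnframes()
    sample_rate = reader.getframerate()
```

In the streaming loop, this would mean collecting each `base64.b64decode(audio["data"])` result in a list and writing `merge_wav_chunks(...)` once at the end.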
#### Inline control tokens
Embed `<|emotion:…|>`, `<|style:…|>`, `<|prosody:…|>`, and `<|sfx:…|>` tokens directly in `input`. Two rules:
1. **Delivery tokens first.** Emotion, style, and the prosody *speed / pitch / expressive* tokens shape the whole turn, so put them at the start of `input`. Positional tokens (`<|prosody:pause|>`, `<|prosody:long_pause|>`, `<|sfx:…|>`) go inline exactly where they fire.
2. **Pair every `<|sfx:…|>` with its onomatopoeia.** E.g. `<|sfx:laughter|>Haha`, `<|sfx:sigh|>Uh`, `<|sfx:sneeze|>Achoo`. The written sound gives the model the acoustic cue to realize the effect.
Example (amusement + laughter):
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "<|emotion:amusement|><|prosody:expressive_high|>Wait, wait, that was kind of hilarious. <|sfx:laughter|>Hehe, no, seriously, I was not ready for that."}' \
  --output output.wav
```
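The two placement rules can be sketched as a small client-side helper that prefixes turn-level tags and keeps each sound effect glued to its onomatopoeia. `tag`, `sfx`, and `build_input` are hypothetical names, and the emotion, style, prosody ordering of the prefix is an assumption modeled on the example above rather than a documented requirement.

```python
# Prosody values that shape the whole turn (rule 1); pauses are positional instead.
DELIVERY_PROSODY = {
    "speed_very_slow", "speed_slow", "speed_fast", "speed_very_fast",
    "pitch_low", "pitch_high", "expressive_high", "expressive_low",
}


def tag(category, value):
    """Format one control token."""
    return f"<|{category}:{value}|>"


def sfx(name, sound):
    """An sfx tag immediately followed by its onomatopoeia (rule 2)."""
    return tag("sfx", name) + sound


def build_input(text, emotion=None, style=None, prosody=()):
    """Prefix turn-level delivery tags to text that may already hold inline tags."""
    for p in prosody:
        if p not in DELIVERY_PROSODY:
            raise ValueError(f"{p!r} is positional; place it inline in the text")
    prefix = ""
    if emotion:
        prefix += tag("emotion", emotion)
    if style:
        prefix += tag("style", style)
    prefix += "".join(tag("prosody", p) for p in prosody)
    return prefix + text


text = (
    "Wait, wait, that was kind of hilarious. "
    + sfx("laughter", "Hehe")
    + ", no, seriously, I was not ready for that."
)
payload_input = build_input(text, emotion="amusement", prosody=["expressive_high"])
```

Here `payload_input` reproduces the `input` string used in the curl example above.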
#### Throughput
Throughput on Seed-TTS EN (full set, **N=1088** per run). Client `--max-concurrency` sweep against a Higgs server (`max_running_requests=16`, bf16, CUDA Graph on). Each row is the **mean of 3 runs**. Hardware: **1× H100**.

| Concurrency | Throughput (req/s) | Mean latency | RTF (per-req) | audio_s/s |
|---|---|---|---|---|
| 1 | 1.62 | 617 ms | 0.147 | 6.89 |
| 2 | 2.70 | 742 ms | 0.180 | 11.37 |
| 4 | 5.45 | 733 ms | 0.177 | 22.84 |
| 8 | 8.91 | 898 ms | 0.217 | 37.38 |
| 16 | 14.74 | 1079 ms | 0.262 | 61.84 |

- **Concurrency**: Maximum number of in-flight client requests (`--max-concurrency`).
- **Throughput (req/s)**: Completed requests divided by total benchmark wall-clock time.
- **Mean latency**: Average end-to-end time per request (send to full response received).
- **RTF (per-req)**: Average ratio of processing time to generated audio duration per request (<1 is faster than real time).
- **audio_s/s**: Total seconds of audio produced divided by total benchmark wall-clock time.
To reproduce the results, follow the instructions in [this script](https://github.com/sgl-project/sglang-omni/blob/main/benchmarks/eval/benchmark_tts_seedtts.py).
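
The metric definitions above translate directly into a few lines of arithmetic. This sketch is not taken from the linked benchmark script; `RequestRecord` and `summarize` are hypothetical names, and it assumes the "processing time" in the RTF definition equals the end-to-end request latency.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_s: float  # send to full response received
    audio_s: float    # duration of the generated audio


def summarize(records, wall_clock_s):
    """Compute the four reported metrics from per-request records."""
    n = len(records)
    return {
        "throughput_req_s": n / wall_clock_s,
        "mean_latency_ms": 1000.0 * sum(r.latency_s for r in records) / n,
        # Mean of per-request ratios, not the ratio of the totals.
        "rtf_per_req": sum(r.latency_s / r.audio_s for r in records) / n,
        "audio_s_per_s": sum(r.audio_s for r in records) / wall_clock_s,
    }


records = [RequestRecord(0.6, 4.0), RequestRecord(0.9, 5.0)]
stats = summarize(records, wall_clock_s=2.0)
```

As a consistency check on the table, the concurrency-1 row gives 6.89 / 1.62 ≈ 4.25 s of audio per request, and 617 ms / 0.147 ≈ 4.2 s, so the two estimates of mean clip length roughly agree.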
### API Usage
For zero-ops deployment, use the [**Boson AI API**](https://docs.boson.ai/models/higgs-audio-tts/overview).
## Citation
```bibtex
@misc{bosonai_higgs_audio_tts_v3_2026,
title = {Higgs Audio v3 TTS: Conversational Speech for Voice AI from Boson AI},
author = {Boson AI},
year = {2026},
howpublished = {https://huggingface.co/bosonai/higgs-audio-v3-tts-4b},
}
```
## License
Boson Higgs Audio v3 Research and Non-Commercial License; see [LICENSE](./LICENSE).