Commit ·
5402f01
1
Parent(s): e470359
ke/add_agentsmd (#9)
- Add AGENTS.md and PROMPTING.md for agent and authoring guidance (17e548b566b4ba97802d400ce2b342720599ed56)
Co-authored-by: Ke <bytebecky@users.noreply.huggingface.co>
- AGENTS.md +177 -0
- PROMPTING.md +75 -0
- README.md +3 -0
AGENTS.md
ADDED
|
@@ -0,0 +1,177 @@
# AGENTS.md — Higgs Audio v3 TTS (4B)

> Operational guide for AI coding agents. This file is **self-contained**: you can act on it
> even before cloning this repo. For model background, benchmarks, the full language list, and
> citation, see the **[model card README](https://huggingface.co/bosonai/higgs-audio-v3-tts-4b/blob/main/README.md)** — don't duplicate that narrative here.

Higgs Audio v3 TTS is a **4B-parameter, conversational text-to-speech model**: expressive,
low-latency, 100+ languages, zero-shot voice cloning, and inline control over emotion / prosody /
pauses / sound effects mid-utterance.

---

## Step 0 — Pick the right path (read this first)

Choose by constraint, not by habit:

| Goal | Use | Entry point |
|------|-----|-------------|
| Just hear it / try preset voices & avatars | **Live Demo** | https://boson.ai/workspace/avatar |
| Integrate quickly, no GPU, your own voice | **Hosted API** | https://docs.boson.ai/models/higgs-audio-tts/overview |
| Data privacy, custom testing, full control | **Self-host (SGLang-Omni)** | https://lmsys.org/blog/2026-06-04-higgs-audio-v3-tts/ |
| Inspect weights / config / tokenizer | **Model card (this repo)** | https://huggingface.co/bosonai/higgs-audio-v3-tts-4b |

Deep dive on everything: **Technical blog** → https://boson.ai/blog/higgs-audio-v3-tts

---

## Path A — Hosted API (fastest, no GPU)

> **Authoritative docs:** https://docs.boson.ai/models/higgs-audio-tts/overview
> Get an API key, full field reference, and Python/TypeScript SDK examples there.
> An agent cannot invent a key — if `BOSON_API_KEY` is unset, stop and point the user to this page.

```bash
export BOSON_API_KEY=bai-xxxx # obtain from https://docs.boson.ai (key format: bai-...)
```

Basic synthesis:

```bash
curl https://api.boson.ai/v1/audio/speech \
  -H "Authorization: Bearer $BOSON_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"higgs-audio-v3-tts","input":"Hello, this is a test."}' \
  --output out.mp3
```

Request fields:

| Field | Notes |
|-------|-------|
| `model` | `"higgs-audio-v3-tts"` |
| `input` | text to synthesize (**required**) |
| `voice` | preset speaker, e.g. `"jake"` |
| `ref_audio` + `ref_text` | URL/base64 clip + its transcript → **voice cloning** |
| `response_format` | `"mp3"` (default) or `"pcm"` (use `pcm` for low-latency streaming) |
| `stream` | `true` for SSE streaming |

> Verify exact field names/limits against the API docs before shipping — the hosted API evolves
> independently of these weights.
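
The request fields above can also be assembled in code. The following is a minimal Python sketch, not taken from the Boson docs: `build_speech_request` is a hypothetical helper, and the rule that `ref_audio` and `ref_text` must be supplied together is an assumption drawn from the table pairing the two fields. It only builds the JSON body and never contacts the API.

```python
import json
from typing import Optional


def build_speech_request(
    text: str,
    voice: Optional[str] = None,
    ref_audio: Optional[str] = None,
    ref_text: Optional[str] = None,
    response_format: str = "mp3",
    stream: bool = False,
) -> dict:
    """Build a JSON-serializable body for POST /v1/audio/speech (hypothetical helper)."""
    if not text:
        raise ValueError("`input` is required")
    # Assumption: voice cloning needs the clip and its transcript together.
    if (ref_audio is None) != (ref_text is None):
        raise ValueError("`ref_audio` and `ref_text` must be given together")
    if response_format not in ("mp3", "pcm"):
        raise ValueError("response_format must be 'mp3' or 'pcm'")

    body = {
        "model": "higgs-audio-v3-tts",
        "input": text,
        "response_format": response_format,
    }
    if voice is not None:
        body["voice"] = voice
    if ref_audio is not None:
        body["ref_audio"] = ref_audio
        body["ref_text"] = ref_text
    if stream:
        body["stream"] = True
    return body


if __name__ == "__main__":
    # Print the body that would be sent for a preset-voice request.
    print(json.dumps(build_speech_request("Hello, this is a test.", voice="jake")))
```

Rejecting a lone `ref_audio` or `ref_text` locally is a design choice for fast feedback; the hosted API may behave differently, so treat the API docs as the final word.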

---

## Path B — Self-host with SGLang-Omni

### B0 — Preflight: confirm hardware first (do this before pulling anything)

Performance numbers are benchmarked on **1× H100 (80 GB)**. The model is also **confirmed to run on
1× A100 40 GB** — so **~40 GB VRAM is a known-good floor**. Smaller GPUs are **untested** (no data,
not "won't work"). Before deploying:

```bash
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv # GPU present? how much VRAM?
docker --version && docker info | grep -i runtime # Docker + NVIDIA runtime ready?
df -h . # disk for the ~4B weights + image
```

Rules for the agent:
- **No NVIDIA GPU** → stop. Self-host is not viable; steer the user to **Path A (hosted API)**.
- **≥ 40 GB VRAM (e.g. A100 40 GB, H100)** → known-good; proceed.
- **24 GB (e.g. RTX 4090)** → *reported* to work, **not officially verified**. The ~4B weights fit,
  but expect to lower concurrency / `max_new_tokens` and watch for OOM at the `serve` step.
- **< 24 GB VRAM** → untested. It *may* still run (4B model), but no one has verified it. Warn the
  user, and be ready to lower concurrency / `max_new_tokens` if you hit OOM at the `serve` step.
- **Don't assume** a VRAM number — confirm against the SGLang-Omni cookbook / blog before promising
  a given GPU will work: https://lmsys.org/blog/2026-06-04-higgs-audio-v3-tts/
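
The preflight rules above can be sketched as a small decision function. This is an illustrative Python sketch, not part of SGLang-Omni: `classify_gpu` and both thresholds are hypothetical. Two assumptions are baked in: the thresholds sit slightly below `nominal_GB * 1024` because `nvidia-smi` totals often come in a little under that figure, and cards between 24 GB and 40 GB are treated like the 24 GB case, which the rules do not state explicitly.

```python
from typing import Optional

# Thresholds in MiB, as reported by `nvidia-smi --query-gpu=memory.total`.
# Assumption: set slightly below nominal_GB * 1024 so that a nominal
# 24 GB or 40 GB card still lands in its intended bucket.
KNOWN_GOOD_MIB = 40_000
REPORTED_MIB = 24_000


def classify_gpu(total_mib: Optional[int]) -> str:
    """Map total VRAM (MiB) to a preflight decision; None means no NVIDIA GPU."""
    if total_mib is None:
        return "stop: no NVIDIA GPU, use the hosted API (Path A)"
    if total_mib >= KNOWN_GOOD_MIB:
        return "proceed: known-good"
    if total_mib >= REPORTED_MIB:
        # Assumption: the 24-40 GB range is handled like the 24 GB rule.
        return "proceed with caution: reported to work, not officially verified"
    return "warn: untested, may OOM at the serve step"
```

For example, `classify_gpu(16384)` returns the "warn" decision, while `classify_gpu(None)` tells the agent to stop and fall back to Path A.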

### B1 — Install & serve

```bash
# 1. Container
docker pull lmsysorg/sglang-omni:dev
docker run -it --gpus all --shm-size 32g --ipc host --network host --privileged \
  lmsysorg/sglang-omni:dev /bin/zsh

# 2. Engine
git clone git@github.com:sgl-project/sglang-omni.git && cd sglang-omni
uv venv .venv -p 3.12 && source .venv/bin/activate
uv pip install -v -e .

# 3. Weights
hf download bosonai/higgs-audio-v3-tts-4b

# 4. Serve (OpenAI-compatible audio endpoint)
sgl-omni serve --model-path bosonai/higgs-audio-v3-tts-4b --port 8000
```

Call the local server:

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, how are you?"}' \
  --output output.wav
```

**Recommended sampling (voice cloning):** `temperature: 0.8`, `top_k: 50`, `max_new_tokens: 1024`.

Cookbook reference: https://sgl-project.github.io/sglang-omni/cookbook/higgs_tts.html
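
The same local call can be made from Python using only the standard library. This is a hedged sketch mirroring the `curl` command above: `make_local_request` and `synthesize_to_file` are hypothetical helpers, and passing sampling settings as top-level JSON keys through `extra` is an unverified assumption, so check the cookbook for the field names the server actually accepts.

```python
import json
import urllib.request
from typing import Optional


def make_local_request(
    text: str,
    base_url: str = "http://localhost:8000",
    extra: Optional[dict] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST to the local /v1/audio/speech endpoint."""
    body = {"input": text}
    # Assumption: extra fields (e.g. sampling settings) ride along as
    # top-level JSON keys; this is not confirmed by the source document.
    body.update(extra or {})
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, path: str = "output.wav") -> None:
    """Send the request to a running server and write the returned audio bytes."""
    with urllib.request.urlopen(make_local_request(text)) as resp, open(path, "wb") as f:
        f.write(resp.read())
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server; `synthesize_to_file("Hello, how are you?")` only works once `sgl-omni serve` is up on port 8000.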

---

## Control tags — how to write target text

Embed tags directly in the `input` text to steer emotion, prosody, style, and sound effects.
Format is always `<|category:tag|>`, with two placements:

- **Sentence-level** (emotion / style / prosody speed·pitch·expressive) → put at the sentence start.
- **Inline** (sfx, and prosody `pause` / `long_pause`) → insert at the exact spot in the sentence.
- **`sfx` gotcha:** `<|sfx:cough|>Ahem, ...` — tag first, onomatopoeia attached, **no space**.

```
<|emotion:elation|>Welcome aboard, we are thrilled to have you here!
<|emotion:elation|><|sfx:laughter|>Haha, welcome, we're so happy you're here!
Hello there <|prosody:pause|> and welcome to the show.
```

> **Full 43-tag catalog + rules + examples → [PROMPTING.md](https://huggingface.co/bosonai/higgs-audio-v3-tts-4b/blob/main/PROMPTING.md).**
> Only recognized tags work — anything else degrades output or gets read literally.

For chat formatting, use **[`chat_template.jinja`](https://huggingface.co/bosonai/higgs-audio-v3-tts-4b/blob/main/chat_template.jinja)** from the model repo (and the API docs);
**do not hand-assemble the chat prompt** — go through the template.

## Language codes

Only the ISO codes listed in README's supported-languages section are reliable. Codes outside that
list fall back / degrade. → see the [model card README](https://huggingface.co/bosonai/higgs-audio-v3-tts-4b/blob/main/README.md) (`## Supported Languages`).

---

## Repo contents (what's actually here)

This repo (`https://huggingface.co/bosonai/higgs-audio-v3-tts-4b`) is **weights + config**, not an
inference codebase:

- `config.json`, `model.safetensors(.index.json)` — model weights & shape
- `chat_template.jinja` — **authoritative** prompt/chat formatting; respect it
- `tokenizer.json`, `tokenizer_config.json` — tokenizer
- `README.md` — HuggingFace model card (capabilities, benchmarks, languages, citation)
- `LICENSE` — see red line below

## Do / Don't

- ✅ Use `chat_template.jinja` for prompt construction; use the OpenAI-compatible `/v1/audio/speech` shape.
- ✅ Use `pcm` + `stream` for real-time / conversational latency.
- ❌ **Don't use commercially.** License is research & non-commercial
  (`boson-higgs-audio-v3-research-and-non-commercial-license`) — see [LICENSE](https://huggingface.co/bosonai/higgs-audio-v3-tts-4b/blob/main/LICENSE).
- ❌ Don't hardcode the 100-language claim as "any code works" — validate against the supported list.

## Pointers (don't duplicate — link)

All on the model card: `https://huggingface.co/bosonai/higgs-audio-v3-tts-4b/blob/main/README.md`

- Benchmarks / WER-CER tables → README `## Evaluation Benchmarks`
- Full language list → README `## Supported Languages`
- Control-token catalog → README `## Control Tokens`
- Citation → README `## Citation`
PROMPTING.md
ADDED
|
@@ -0,0 +1,75 @@
# Writing target text for Higgs Audio v3 TTS — control tags

How to embed control tags in the `input` text to steer emotion, prosody, style, and sound effects.
For where this fits in the overall workflow, see [AGENTS.md](./AGENTS.md).

## Format rule (read first)

Every tag is `<|category:tag|>`. There are **two placements**:

- **Sentence-level** — emotion, style, and prosody's `speed_* / pitch_* / expressive_*`.
  Put at the **start of the sentence**; it colors the whole sentence.
- **Inline** — sound effects (`sfx`) and prosody's `pause / long_pause`.
  Insert **at the exact position** in the sentence where the effect should occur.

**`sfx` gotcha:** format is `<|sfx:tag|>onomatopoeia, then the line` — the tag comes **first**,
immediately followed by the onomatopoeia with **no space** between them.
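
The format rule can be captured in two small string helpers, which avoids the easy mistake of leaving a space after an `sfx` tag. This is a minimal Python sketch; `sentence_tag` and `sfx` are hypothetical names, not part of any Higgs Audio tooling, and neither helper checks that the tag is in the catalog.

```python
def sentence_tag(category: str, tag: str, sentence: str) -> str:
    """Prefix a sentence-level tag (emotion, style, or sentence-level prosody)."""
    return f"<|{category}:{tag}|>{sentence}"


def sfx(tag: str, onomatopoeia: str, line: str) -> str:
    """Inline sound effect: tag first, onomatopoeia attached with no space."""
    return f"<|sfx:{tag}|>{onomatopoeia.strip()}, {line}"
```

Because both helpers return plain strings, they compose: wrapping an `sfx(...)` result in `sentence_tag("emotion", "elation", ...)` yields a leading emotion tag followed directly by the sound-effect tag.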

## Examples

Sentence-level:
```
<|emotion:elation|>Welcome aboard, we are absolutely thrilled to have you here!
<|style:whispering|>Come closer, I have a little secret to share.
<|prosody:speed_slow|>Take your time, there's really no need to rush.
```

Inline sfx (tag first, onomatopoeia attached, no space):
```
<|sfx:cough|>Ahem, welcome everyone, let's get started.
<|sfx:laughter|>Haha, so glad you could make it!
```

Inline pause (between phrases):
```
Hello there <|prosody:pause|> and welcome to the show.
```

Stacking tags (sentence-level emotion + inline sfx in one line):
```
<|emotion:elation|><|sfx:laughter|>Haha, welcome, welcome, we're so happy you're here!
<|sfx:sigh|>Haah, what a day — but welcome, please make yourself at home.
```

## Tips

- You can stack tags in one sentence (e.g. a leading emotion tag plus an inline sfx).
- `speed_very_slow` only slows the model to roughly ~5s; for slower delivery, insert
  `<|prosody:long_pause|>` between phrases instead.
- Only the tags below are recognized — anything else degrades output or gets read literally.
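
The pause tip can be applied mechanically by joining phrases with a pause tag. This is a small Python sketch under stated assumptions: `with_pauses` is a hypothetical helper, and the single space on each side of the tag simply copies the spacing used in the inline pause example.

```python
from typing import Sequence


def with_pauses(phrases: Sequence[str], pause: str = "long_pause") -> str:
    """Join phrases with an inline prosody pause tag between each pair."""
    if pause not in ("pause", "long_pause"):
        raise ValueError("pause must be 'pause' or 'long_pause'")
    return f" <|prosody:{pause}|> ".join(p.strip() for p in phrases)
```

Calling `with_pauses(["Take your time,", "there's really no need to rush."])` places one `<|prosody:long_pause|>` between the two phrases.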

## Full tag catalog (43)

### Emotion (21) — sentence-level
`affection`, `amusement`, `anger`, `arousal`, `awe`, `bitterness`, `confusion`, `contemplation`,
`contentment`, `determination`, `disgust`, `elation`, `enthusiasm`, `fear`, `helplessness`,
`longing`, `pride`, `relief`, `sadness`, `shame`, `surprise`

Syntax: `<|emotion:elation|>`

### Prosody (10)
- Sentence-level: `speed_very_slow`, `speed_slow`, `speed_fast`, `speed_very_fast`,
  `pitch_low`, `pitch_high`, `expressive_high`, `expressive_low`
- Inline: `pause`, `long_pause`

Syntax: `<|prosody:speed_slow|>`, `<|prosody:pause|>`

### Style (3) — sentence-level
`singing`, `shouting`, `whispering`

Syntax: `<|style:whispering|>`

### Sound effects (9) — inline
`cough`, `laughter`, `crying`, `screaming`, `burping`, `humming`, `sigh`, `sniff`, `sneeze`

Syntax: `<|sfx:cough|>Ahem, ...` (tag first, onomatopoeia attached, no space)
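
Since unrecognized tags degrade output or get read literally, it can help to lint target text against the catalog before sending it. The following Python sketch copies the 43 tags listed above; `CATALOG`, `TAG_RE`, and `unknown_tags` are hypothetical names, and the regular expression only recognizes the `<|category:tag|>` shape, so malformed markup such as `<emotion:joy>` passes through unnoticed.

```python
import re

# The 43 tags from the catalog, grouped by category.
CATALOG = {
    "emotion": {
        "affection", "amusement", "anger", "arousal", "awe", "bitterness",
        "confusion", "contemplation", "contentment", "determination",
        "disgust", "elation", "enthusiasm", "fear", "helplessness",
        "longing", "pride", "relief", "sadness", "shame", "surprise",
    },
    "prosody": {
        "speed_very_slow", "speed_slow", "speed_fast", "speed_very_fast",
        "pitch_low", "pitch_high", "expressive_high", "expressive_low",
        "pause", "long_pause",
    },
    "style": {"singing", "shouting", "whispering"},
    "sfx": {
        "cough", "laughter", "crying", "screaming", "burping",
        "humming", "sigh", "sniff", "sneeze",
    },
}

# Matches anything shaped like <|category:tag|>.
TAG_RE = re.compile(r"<\|([^:|<>]+):([^|<>]+)\|>")


def unknown_tags(text: str) -> list:
    """Return every <|category:tag|> in `text` that is not in the catalog."""
    return [
        f"<|{category}:{tag}|>"
        for category, tag in TAG_RE.findall(text)
        if tag not in CATALOG.get(category, set())
    ]
```

An empty result means every well-formed tag in the text is recognized; anything returned should be fixed or removed before synthesis.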
README.md
CHANGED
|
@@ -151,6 +151,9 @@ The model reaches **single-digit WER/CER on 102 languages**, which split into tw
|
| 151 |
## Control Tokens
|
| 152 |
|
| 153 |
All tags follow `<|category:value|>` syntax and can be inserted mid-utterance.
|
| 154 |
+
|
| 155 |
+
> For how to place these tags when writing the target text (sentence-level vs. inline, `sfx` formatting, stacking, worked examples), see **[PROMPTING.md](./PROMPTING.md)**.
|
| 156 |
+
|
| 157 |
- **Emotion** — `elation`, `amusement`, `enthusiasm`, `determination`, `pride`, `contentment`, `affection`, `relief`, `contemplation`, `confusion`, `surprise`, `awe`, `longing`, `arousal`, `anger`, `fear`, `disgust`, `bitterness`, `sadness`, `shame`, `helplessness`
|
| 158 |
|
| 159 |
<div style="margin-left:1.5em;margin-top:-10px">
|