SilinMeng0510 bytebecky committed on
Commit
5402f01
·
1 Parent(s): e470359

ke/add_agentsmd (#9)

- Add AGENTS.md and PROMPTING.md for agent and authoring guidance (17e548b566b4ba97802d400ce2b342720599ed56)


Co-authored-by: Ke <bytebecky@users.noreply.huggingface.co>

Files changed (3)
  1. AGENTS.md +177 -0
  2. PROMPTING.md +75 -0
  3. README.md +3 -0
AGENTS.md ADDED
@@ -0,0 +1,177 @@
1
+ # AGENTS.md — Higgs Audio v3 TTS (4B)
2
+
3
+ > Operational guide for AI coding agents. This file is **self-contained**: you can act on it
4
+ > even before cloning this repo. For model background, benchmarks, the full language list, and
5
+ > citation, see the **[model card README](https://huggingface.co/bosonai/higgs-audio-v3-tts-4b/blob/main/README.md)** — don't duplicate that narrative here.
6
+
7
+ Higgs Audio v3 TTS is a **4B-parameter, conversational text-to-speech model**: expressive,
8
+ low-latency, 100+ languages, zero-shot voice cloning, and inline control over emotion / prosody /
9
+ pauses / sound effects mid-utterance.
10
+
11
+ ---
12
+
13
+ ## Step 0 — Pick the right path (read this first)
14
+
15
+ Choose by constraint, not by habit:
16
+
17
+ | Goal | Use | Entry point |
18
+ |------|-----|-------------|
19
+ | Just hear it / try preset voices & avatars | **Live Demo** | https://boson.ai/workspace/avatar |
20
+ | Integrate quickly, no GPU, your own voice | **Hosted API** | https://docs.boson.ai/models/higgs-audio-tts/overview |
21
+ | Data privacy, custom testing, full control | **Self-host (SGLang-Omni)** | https://lmsys.org/blog/2026-06-04-higgs-audio-v3-tts/ |
22
+ | Inspect weights / config / tokenizer | **Model card (this repo)** | https://huggingface.co/bosonai/higgs-audio-v3-tts-4b |
23
+
24
+ Deep dive on everything: **Technical blog** → https://boson.ai/blog/higgs-audio-v3-tts
25
+
26
+ ---
27
+
28
+ ## Path A — Hosted API (fastest, no GPU)
29
+
30
+ > **Authoritative docs:** https://docs.boson.ai/models/higgs-audio-tts/overview
31
+ > Get an API key, full field reference, and Python/TypeScript SDK examples there.
32
+ > An agent cannot invent a key — if `BOSON_API_KEY` is unset, stop and point the user to this page.
33
+
34
+ ```bash
35
+ export BOSON_API_KEY=bai-xxxx # obtain from https://docs.boson.ai (key format: bai-...)
36
+ ```
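The "stop if the key is unset" rule above can be enforced before any request is built. This is a minimal sketch, not part of the Boson SDK: the helper name `require_api_key` is hypothetical, and the `bai-` prefix check relies only on the key-format hint in the comment above.

```python
import os


def require_api_key(env=None):
    """Return the Boson API key, or raise if it is missing or malformed."""
    env = os.environ if env is None else env
    key = env.get("BOSON_API_KEY", "").strip()
    if not key:
        # An agent cannot invent a key: stop and point the user to the docs.
        raise RuntimeError(
            "BOSON_API_KEY is unset; get a key via "
            "https://docs.boson.ai/models/higgs-audio-tts/overview"
        )
    # Assumption: keys start with "bai-", per the key-format hint above.
    if not key.startswith("bai-"):
        raise RuntimeError("BOSON_API_KEY does not look like a bai-... key")
    return key
```

Calling `require_api_key()` with no argument reads the real environment; passing a dict keeps the check easy to test.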
37
+
38
+ Basic synthesis:
39
+
40
+ ```bash
41
+ curl https://api.boson.ai/v1/audio/speech \
42
+ -H "Authorization: Bearer $BOSON_API_KEY" \
43
+ -H "Content-Type: application/json" \
44
+ -d '{"model":"higgs-audio-v3-tts","input":"Hello, this is a test."}' \
45
+ --output out.mp3
46
+ ```
47
+
48
+ Request fields:
49
+
50
+ | Field | Notes |
51
+ |-------|-------|
52
+ | `model` | `"higgs-audio-v3-tts"` |
53
+ | `input` | text to synthesize (**required**) |
54
+ | `voice` | preset speaker, e.g. `"jake"` |
55
+ | `ref_audio` + `ref_text` | URL/base64 clip + its transcript → **voice cloning** |
56
+ | `response_format` | `"mp3"` (default) or `"pcm"` (use `pcm` for low-latency streaming) |
57
+ | `stream` | `true` for SSE streaming |
58
+
59
+ > Verify exact field names/limits against the API docs before shipping — the hosted API evolves
60
+ > independently of these weights.
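The request fields above can be assembled and sanity-checked before sending. The sketch below is illustrative only: `build_speech_payload` is a hypothetical helper, and the pairing rule for `ref_audio` / `ref_text` is inferred from the table (a clip plus its transcript), not confirmed against the API docs.

```python
def build_speech_payload(text, voice=None, ref_audio=None, ref_text=None,
                         response_format="mp3", stream=False):
    """Build a JSON-serializable body for POST /v1/audio/speech."""
    if not text or not text.strip():
        raise ValueError("`input` is required")
    # Assumption: voice cloning needs the clip and its transcript together.
    if (ref_audio is None) != (ref_text is None):
        raise ValueError("`ref_audio` and `ref_text` must be given together")
    if response_format not in ("mp3", "pcm"):
        raise ValueError("`response_format` must be 'mp3' or 'pcm'")

    payload = {
        "model": "higgs-audio-v3-tts",
        "input": text,
        "response_format": response_format,
    }
    if voice is not None:
        payload["voice"] = voice
    if ref_audio is not None:
        payload["ref_audio"] = ref_audio
        payload["ref_text"] = ref_text
    if stream:
        payload["stream"] = True
    return payload
```

For a low-latency call, `build_speech_payload("Hello", voice="jake", response_format="pcm", stream=True)` yields the `pcm` + `stream` combination recommended in this guide.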
61
+
62
+ ---
63
+
64
+ ## Path B — Self-host with SGLang-Omni
65
+
66
+ ### B0 — Preflight: confirm hardware first (do this before pulling anything)
67
+
68
+ Performance numbers are benchmarked on **1× H100 (80 GB)**. The model is also **confirmed to run on
69
+ 1× A100 40 GB** — so **~40 GB VRAM is a known-good floor**. Smaller GPUs are **untested** (no data,
70
+ not "won't work"). Before deploying:
71
+
72
+ ```bash
73
+ nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv # GPU present? how much VRAM?
74
+ docker --version && docker info | grep -i runtime # Docker + NVIDIA runtime ready?
75
+ df -h . # free disk for the 4B-parameter weights + container image
76
+ ```
77
+
78
+ Rules for the agent:
79
+ - **No NVIDIA GPU** → stop. Self-host is not viable; steer the user to **Path A (hosted API)**.
80
+ - **≥ 40 GB VRAM (e.g. A100 40 GB, H100)** → known-good; proceed.
81
+ - **24 GB (e.g. RTX 4090)** → *reported* to work, **not officially verified**. The ~4B weights fit,
82
+ but expect to lower concurrency / `max_new_tokens` and watch for OOM at the `serve` step.
83
+ - **< 24 GB VRAM** → untested. It *may* still run (4B model), but no one has verified it. Warn the
84
+ user, and be ready to lower concurrency / `max_new_tokens` if you hit OOM at the `serve` step.
85
+ - **Don't assume** a VRAM number — confirm against the SGLang-Omni cookbook / blog before promising
86
+ a given GPU will work: https://lmsys.org/blog/2026-06-04-higgs-audio-v3-tts/
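The preflight rules above reduce to a small decision function. This is a sketch under stated assumptions: the function name and tier labels are hypothetical, and treating everything from 24 GB up to (but not including) 40 GB as the "reported" tier is an interpretation, since the rules only name 24 GB explicitly.

```python
def vram_tier(nominal_vram_gb):
    """Map a GPU's nominal VRAM (in GB) to the preflight tiers above."""
    if nominal_vram_gb is None or nominal_vram_gb <= 0:
        return "no-gpu"       # stop; steer the user to the hosted API
    if nominal_vram_gb >= 40:
        return "known-good"   # e.g. A100 40 GB, H100
    if nominal_vram_gb >= 24:
        # Assumption: the 24 GB up to <40 GB range is grouped with the
        # "reported to work, not officially verified" tier.
        return "reported"
    return "untested"         # warn; be ready to lower concurrency
```

Note that `nvidia-smi` reports memory in MiB, and the figure can come in slightly under the marketed size, so round to the card's nominal capacity before classifying.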
87
+
88
+ ### B1 — Install & serve
89
+
90
+ ```bash
91
+ # 1. Container
92
+ docker pull lmsysorg/sglang-omni:dev
93
+ docker run -it --gpus all --shm-size 32g --ipc host --network host --privileged \
94
+ lmsysorg/sglang-omni:dev /bin/zsh
95
+
96
+ # 2. Engine
97
+ git clone https://github.com/sgl-project/sglang-omni.git && cd sglang-omni # HTTPS: a fresh container has no SSH keys
98
+ uv venv .venv -p 3.12 && source .venv/bin/activate
99
+ uv pip install -v -e .
100
+
101
+ # 3. Weights
102
+ hf download bosonai/higgs-audio-v3-tts-4b
103
+
104
+ # 4. Serve (OpenAI-compatible audio endpoint)
105
+ sgl-omni serve --model-path bosonai/higgs-audio-v3-tts-4b --port 8000
106
+ ```
107
+
108
+ Call the local server:
109
+
110
+ ```bash
111
+ curl -X POST http://localhost:8000/v1/audio/speech \
112
+ -H "Content-Type: application/json" \
113
+ -d '{"input": "Hello, how are you?"}' \
114
+ --output output.wav
115
+ ```
116
+
117
+ **Recommended sampling (voice cloning):** `temperature: 0.8`, `top_k: 50`, `max_new_tokens: 1024`.
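A Python equivalent of the local `curl` call, with the recommended sampling values attached, might look like the sketch below. It only builds the request and sends nothing. Passing `temperature`, `top_k`, and `max_new_tokens` as top-level JSON fields is an assumption; check the cookbook for where the server actually expects them. `build_local_request` is a hypothetical helper.

```python
import json
import urllib.request

# Values from the recommendation above (voice cloning).
RECOMMENDED_SAMPLING = {"temperature": 0.8, "top_k": 50, "max_new_tokens": 1024}


def build_local_request(text, port=8000, sampling=None):
    """Build (but do not send) a POST to the local /v1/audio/speech endpoint."""
    body = {"input": text}
    # Assumption: sampling parameters are accepted at the top level of the body.
    body.update(sampling or {})
    return urllib.request.Request(
        f"http://localhost:{port}/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(...)` and write the response bytes to a file, mirroring `--output output.wav` in the `curl` example.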
118
+
119
+ Cookbook reference: https://sgl-project.github.io/sglang-omni/cookbook/higgs_tts.html
120
+
121
+ ---
122
+
123
+ ## Control tags — how to write target text
124
+
125
+ Embed tags directly in the `input` text to steer emotion, prosody, style, and sound effects.
126
+ Format is always `<|category:tag|>`, with two placements:
127
+
128
+ - **Sentence-level** (emotion / style / prosody speed·pitch·expressive) → put at the sentence start.
129
+ - **Inline** (sfx, and prosody `pause` / `long_pause`) → insert at the exact spot in the sentence.
130
+ - **`sfx` gotcha:** `<|sfx:cough|>Ahem, ...` — tag first, onomatopoeia attached, **no space**.
131
+
132
+ ```
133
+ <|emotion:elation|>Welcome aboard, we are thrilled to have you here!
134
+ <|emotion:elation|><|sfx:laughter|>Haha, welcome, we're so happy you're here!
135
+ Hello there <|prosody:pause|> and welcome to the show.
136
+ ```
137
+
138
+ > **Full 43-tag catalog + rules + examples → [PROMPTING.md](https://huggingface.co/bosonai/higgs-audio-v3-tts-4b/blob/main/PROMPTING.md).**
139
+ > Only recognized tags work — anything else degrades output or gets read literally.
140
+
141
+ For chat formatting, use **[`chat_template.jinja`](https://huggingface.co/bosonai/higgs-audio-v3-tts-4b/blob/main/chat_template.jinja)** from the model repo (and the API docs);
142
+ **do not hand-assemble the chat prompt** — go through the template.
143
+
144
+ ## Language codes
145
+
146
+ Only the ISO codes listed in the README's supported-languages section are reliable; codes outside that
147
+ list fall back or degrade. → see the [model card README](https://huggingface.co/bosonai/higgs-audio-v3-tts-4b/blob/main/README.md) (`## Supported Languages`).
148
+
149
+ ---
150
+
151
+ ## Repo contents (what's actually here)
152
+
153
+ This repo (`https://huggingface.co/bosonai/higgs-audio-v3-tts-4b`) is **weights + config**, not an
154
+ inference codebase:
155
+
156
+ - `config.json`, `model.safetensors(.index.json)` — model weights & shape
157
+ - `chat_template.jinja` — **authoritative** prompt/chat formatting; respect it
158
+ - `tokenizer.json`, `tokenizer_config.json` — tokenizer
159
+ - `README.md` — HuggingFace model card (capabilities, benchmarks, languages, citation)
160
+ - `LICENSE` — see red line below
161
+
162
+ ## Do / Don't
163
+
164
+ - ✅ Use `chat_template.jinja` for prompt construction; use the OpenAI-compatible `/v1/audio/speech` shape.
165
+ - ✅ Use `pcm` + `stream` for real-time / conversational latency.
166
+ - ❌ **Don't use commercially.** License is research & non-commercial
167
+ (`boson-higgs-audio-v3-research-and-non-commercial-license`) — see [LICENSE](https://huggingface.co/bosonai/higgs-audio-v3-tts-4b/blob/main/LICENSE).
168
+ - ❌ Don't read the "100+ languages" claim as "any language code works"; validate each code against the supported list.
169
+
170
+ ## Pointers (don't duplicate — link)
171
+
172
+ All on the model card: `https://huggingface.co/bosonai/higgs-audio-v3-tts-4b/blob/main/README.md`
173
+
174
+ - Benchmarks / WER-CER tables → README `## Evaluation Benchmarks`
175
+ - Full language list → README `## Supported Languages`
176
+ - Control-token catalog → README `## Control Tokens`
177
+ - Citation → README `## Citation`
PROMPTING.md ADDED
@@ -0,0 +1,75 @@
1
+ # Writing target text for Higgs Audio v3 TTS — control tags
2
+
3
+ How to embed control tags in the `input` text to steer emotion, prosody, style, and sound effects.
4
+ For where this fits in the overall workflow, see [AGENTS.md](./AGENTS.md).
5
+
6
+ ## Format rule (read first)
7
+
8
+ Every tag is `<|category:tag|>`. There are **two placements**:
9
+
10
+ - **Sentence-level** — emotion, style, and prosody's `speed_* / pitch_* / expressive_*`.
11
+ Put at the **start of the sentence**; it colors the whole sentence.
12
+ - **Inline** — sound effects (`sfx`) and prosody's `pause / long_pause`.
13
+ Insert **at the exact position** in the sentence where the effect should occur.
14
+
15
+ **`sfx` gotcha:** format is `<|sfx:tag|>onomatopoeia, then the line` — the tag comes **first**,
16
+ immediately followed by the onomatopoeia with **no space** between them.
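The placement rules above are easy to get wrong by hand, especially the no-space rule for `sfx`. The helpers below are a hypothetical sketch (none of these names come from the model repo) that builds tagged text so the tag and the following text are always attached.

```python
def tag(category, name):
    """Format a control tag as <|category:tag|>."""
    return f"<|{category}:{name}|>"


def sentence(text, *leading_tags):
    """Prefix a sentence with sentence-level tags (emotion, style, prosody)."""
    return "".join(leading_tags) + text.lstrip()


def sfx_line(name, onomatopoeia, rest=""):
    """Tag first, onomatopoeia attached with no space, then the rest of the line."""
    return tag("sfx", name) + onomatopoeia.lstrip() + rest
```

For example, `sentence(sfx_line("laughter", "Haha", ", welcome!"), tag("emotion", "elation"))` stacks a leading emotion tag in front of an sfx line.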
17
+
18
+ ## Examples
19
+
20
+ Sentence-level:
21
+ ```
22
+ <|emotion:elation|>Welcome aboard, we are absolutely thrilled to have you here!
23
+ <|style:whispering|>Come closer, I have a little secret to share.
24
+ <|prosody:speed_slow|>Take your time, there's really no need to rush.
25
+ ```
26
+
27
+ Inline sfx (tag first, onomatopoeia attached, no space):
28
+ ```
29
+ <|sfx:cough|>Ahem, welcome everyone, let's get started.
30
+ <|sfx:laughter|>Haha, so glad you could make it!
31
+ ```
32
+
33
+ Inline pause (between phrases):
34
+ ```
35
+ Hello there <|prosody:pause|> and welcome to the show.
36
+ ```
37
+
38
+ Stacking tags (sentence-level emotion + inline sfx in one line):
39
+ ```
40
+ <|emotion:elation|><|sfx:laughter|>Haha, welcome, welcome, we're so happy you're here!
41
+ <|sfx:sigh|>Haah, what a day — but welcome, please make yourself at home.
42
+ ```
43
+
44
+ ## Tips
45
+
46
+ - You can stack tags in one sentence (e.g. a leading emotion tag plus an inline sfx).
47
+ - `speed_very_slow` only slows the model to roughly 5s; for slower delivery, insert
48
+ `<|prosody:long_pause|>` between phrases instead.
49
+ - Only the tags below are recognized — anything else degrades output or gets read literally.
50
+
51
+ ## Full tag catalog (43)
52
+
53
+ ### Emotion (21) — sentence-level
54
+ `affection`, `amusement`, `anger`, `arousal`, `awe`, `bitterness`, `confusion`, `contemplation`,
55
+ `contentment`, `determination`, `disgust`, `elation`, `enthusiasm`, `fear`, `helplessness`,
56
+ `longing`, `pride`, `relief`, `sadness`, `shame`, `surprise`
57
+
58
+ Syntax: `<|emotion:elation|>`
59
+
60
+ ### Prosody (10)
61
+ - Sentence-level: `speed_very_slow`, `speed_slow`, `speed_fast`, `speed_very_fast`,
62
+ `pitch_low`, `pitch_high`, `expressive_high`, `expressive_low`
63
+ - Inline: `pause`, `long_pause`
64
+
65
+ Syntax: `<|prosody:speed_slow|>`, `<|prosody:pause|>`
66
+
67
+ ### Style (3) — sentence-level
68
+ `singing`, `shouting`, `whispering`
69
+
70
+ Syntax: `<|style:whispering|>`
71
+
72
+ ### Sound effects (9) — inline
73
+ `cough`, `laughter`, `crying`, `screaming`, `burping`, `humming`, `sigh`, `sniff`, `sneeze`
74
+
75
+ Syntax: `<|sfx:cough|>Ahem, ...` (tag first, onomatopoeia attached, no space)
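Because unrecognized tags degrade output or get read literally, it can help to lint target text against the catalog before synthesis. The sketch below copies the 43 tags listed above into a dict; the regex and the `unknown_tags` helper are illustrative assumptions, not something shipped with the model.

```python
import re

# The 43 recognized tags, grouped by category as in the catalog above.
CATALOG = {
    "emotion": {
        "affection", "amusement", "anger", "arousal", "awe", "bitterness",
        "confusion", "contemplation", "contentment", "determination",
        "disgust", "elation", "enthusiasm", "fear", "helplessness",
        "longing", "pride", "relief", "sadness", "shame", "surprise",
    },
    "prosody": {
        "speed_very_slow", "speed_slow", "speed_fast", "speed_very_fast",
        "pitch_low", "pitch_high", "expressive_high", "expressive_low",
        "pause", "long_pause",
    },
    "style": {"singing", "shouting", "whispering"},
    "sfx": {
        "cough", "laughter", "crying", "screaming", "burping",
        "humming", "sigh", "sniff", "sneeze",
    },
}

# Matches anything shaped like <|category:tag|>.
TAG_RE = re.compile(r"<\|([^:|>]+):([^|>]+)\|>")


def unknown_tags(text):
    """Return every <|category:tag|> in text that is not in the catalog."""
    return [
        m.group(0)
        for m in TAG_RE.finditer(text)
        if m.group(2) not in CATALOG.get(m.group(1), ())
    ]
```

An empty result means every tag in the text is recognized; anything returned should be fixed or removed before the text is sent to the model.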
README.md CHANGED
@@ -151,6 +151,9 @@ The model reaches **single-digit WER/CER on 102 languages**, which split into tw
151
  ## Control Tokens
152
 
153
  All tags follow `<|category:value|>` syntax and can be inserted mid-utterance.
154
+
155
+ > For how to place these tags when writing the target text (sentence-level vs. inline, `sfx` formatting, stacking, worked examples), see **[PROMPTING.md](./PROMPTING.md)**.
156
+
157
  - **Emotion** — `elation`, `amusement`, `enthusiasm`, `determination`, `pride`, `contentment`, `affection`, `relief`, `contemplation`, `confusion`, `surprise`, `awe`, `longing`, `arousal`, `anger`, `fear`, `disgust`, `bitterness`, `sadness`, `shame`, `helplessness`
158
 
159
  <div style="margin-left:1.5em;margin-top:-10px">