Tupt sound at regular intervals of time

#8
by Venom2212 - opened

I am getting a "tupt" sound at regular intervals (every 2-3 seconds) when using the stream endpoint with sglang-omni:

import base64
import json
import requests
import sounddevice as sd

BASE_URL = "http://10.10.52.25:8080"

inp = """<|emotion:awe|><|prosody:expressive_high|><|prosody:speed_slow|>
There are moments in life that change everything.
<|prosody:pause|>
Not because they are loud,
or dramatic,
or impossible to ignore.
<|prosody:long_pause|>
But because, in a single instant,
they remind us who we really are.
<|emotion:determination|>
And from that moment forward,
we choose to move ahead,
one step at a time."""

response = requests.post(
    f"{BASE_URL}/v1/audio/speech",
    json={
        "input": inp,
        "stream": True,
    },
    stream=True,
)

response.raise_for_status()

stream = sd.RawOutputStream(
    samplerate=24000,
    channels=1,
    dtype="int16",
)

stream.start()

try:
    for line in response.iter_lines():

        if not line:
            continue

        if isinstance(line, bytes):
            line = line.decode("utf-8")

        if not line.startswith("data: "):
            continue

        if line == "data: [DONE]":
            break

        event = json.loads(line[6:])

        audio = event.get("audio", {})

        if audio.get("data"):
            stream.write(
                base64.b64decode(audio["data"])
            )

finally:
    stream.stop()
    stream.close()
    response.close()
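
One thing that may be worth ruling out on the client side (this is a guess, not a confirmed cause): the loop above writes every decoded chunk straight into a raw int16 stream. If a chunk happened to be a complete WAV file rather than bare PCM, its header bytes would be played as samples, and if a chunk ended in the middle of a 2-byte sample, the following samples would be misaligned. A minimal sketch of two guards for that, where `pcm_payload` and `FrameAligner` are hypothetical helper names and the WAV-per-chunk behaviour is an assumption, not something the server is known to do:

```python
import io
import wave


def pcm_payload(chunk: bytes) -> bytes:
    """Return raw PCM frames; unwrap the chunk if it is a complete WAV file.

    Assumption: a WAV chunk is complete and well formed. A WAV stream with
    placeholder size fields is not handled here.
    """
    if chunk[:4] == b"RIFF" and chunk[8:12] == b"WAVE":
        with wave.open(io.BytesIO(chunk), "rb") as wav:
            return wav.readframes(wav.getnframes())
    return chunk


class FrameAligner:
    """Hold back trailing bytes so every write ends on a whole sample frame."""

    def __init__(self, frame_bytes: int = 2):  # 2 bytes = one mono int16 frame
        self.frame_bytes = frame_bytes
        self.remainder = b""

    def feed(self, chunk: bytes) -> bytes:
        # Prepend bytes left over from the previous chunk, then keep only
        # a whole number of frames and carry the rest forward.
        data = self.remainder + chunk
        usable = len(data) - (len(data) % self.frame_bytes)
        self.remainder = data[usable:]
        return data[:usable]
```

In the loop this would look like `pcm = aligner.feed(pcm_payload(base64.b64decode(audio["data"])))`, followed by `stream.write(pcm)` only when `pcm` is non-empty. If the sound is still there afterwards, the problem is more likely in the generated audio itself than in the playback code.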

I’m experiencing a similar issue. In my tests, the generated audio sometimes contains short “tupt” or click-like sounds at regular intervals, which degrades the overall listening quality.

I hope the team can look into this issue, and I’m looking forward to an official fix in a future update. Thanks again for your great work!

Boson AI org

Hi @Venom2212 @sgxtj, thanks for the valuable feedback. If you are running it offline, I suggest raising an issue with sgl omni, since there might be some bugs when running it that way. I also suggest using the 'pcm' format for the response, which might work better. Let me know if you run into any trouble!
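
For anyone trying the 'pcm' suggestion with the script from the first post, a rough sketch of the request body follows. The field name `response_format` is borrowed from the OpenAI speech API; whether this server reads that field, and what it returns by default, is an assumption that this thread does not confirm. `build_speech_payload` is a hypothetical helper name.

```python
def build_speech_payload(text: str, response_format: str = "pcm") -> dict:
    """Build the JSON body for POST /v1/audio/speech.

    "input" and "stream" match the original script; "response_format" is
    the assumed name of the field that selects raw PCM output.
    """
    return {
        "input": text,
        "stream": True,
        "response_format": response_format,
    }
```

It would replace the inline dict in the original call, for example `requests.post(f"{BASE_URL}/v1/audio/speech", json=build_speech_payload(inp), stream=True)`.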

In my case, I’m using the official API endpoint rather than local inference. I’ve attached the prompt and the generated result for reference (the sound occurs at approximately the 6-second mark). The input text I synthesized was in Danish:

Den nyeste generation af store sprogmodeller udvikler sig løbende inden for arkitektur og træningsparadigmer, og deres kapacitetsgrænser er blevet markant udvidet til at omfatte multimodal forståelse og kompleks ræsonnering. De nuværende mainstream-modeller understøtter ikke blot præcis informationsudtrækning inden for kontekstvinduer på millionniveau, men udviser også en høj grad af menneskelignende sammenhæng i instruktionsfølge og logisk deduktion.
