Tupt sound at regular intervals of time

by Venom2212 - opened 2 days ago

I am getting a "tupt" sound at regular intervals (every 2-3 seconds) when using the streaming endpoint with sglang-omni.

import base64
import json
import requests
import sounddevice as sd

BASE_URL = "http://10.10.52.25:8080"

inp = """<|emotion:awe|><|prosody:expressive_high|><|prosody:speed_slow|>
There are moments in life that change everything.
<|prosody:pause|>
Not because they are loud,
or dramatic,
or impossible to ignore.
<|prosody:long_pause|>
But because, in a single instant,
they remind us who we really are.
<|emotion:determination|>
And from that moment forward,
we choose to move ahead,
one step at a time."""

response = requests.post(
    f"{BASE_URL}/v1/audio/speech",
    json={
        "input": inp,
        "stream": True,
    },
    stream=True,
)

response.raise_for_status()

stream = sd.RawOutputStream(
    samplerate=24000,
    channels=1,
    dtype="int16",
)

stream.start()

try:
    for line in response.iter_lines():

        if not line:
            continue

        if isinstance(line, bytes):
            line = line.decode("utf-8")

        if not line.startswith("data: "):
            continue

        if line == "data: [DONE]":
            break

        event = json.loads(line[6:])

        audio = event.get("audio", {})

        if audio.get("data"):
            stream.write(
                base64.b64decode(audio["data"])
            )

finally:
    stream.stop()
    stream.close()
    response.close()

sgxtj

1 day ago

I’m experiencing a similar issue as well. In my tests, the generated audio sometimes contains short “tupt” or clicking-like sounds at regular intervals, which affects the overall listening quality.

I hope the team can look into this issue, and I’m looking forward to an official fix in a future update. Thanks again for your great work!

popsoda2002

Boson AI org 1 day ago

Hi @Venom2212 @sgxtj, thanks for the valuable feedback. If you want to run it offline, I suggest raising an issue with sgl omni, since there might be some bugs when running it that way. I also suggest using the 'pcm' format for the response, which might work better. Let me know if you run into any trouble!
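
For anyone trying the 'pcm' suggestion with the script from the first post, here is a minimal sketch. It assumes the endpoint accepts an OpenAI-style `response_format` field, which is not confirmed anywhere in this thread, so check the server documentation for the real parameter name. `build_payload` and `FrameAligner` are hypothetical helpers, not part of sglang-omni. The aligner only guards against a 16-bit sample being split across two streamed chunks; whether that is related to the "tupt" sound is unknown.

```python
import base64


def build_payload(text: str) -> dict:
    """Request body for the speech endpoint, asking for raw PCM.

    ASSUMPTION: "response_format" is the field name (OpenAI-style).
    The thread only says to use the 'pcm' format, not how to request it.
    """
    return {"input": text, "stream": True, "response_format": "pcm"}


class FrameAligner:
    """Buffers decoded bytes so that only whole frames are returned."""

    def __init__(self, bytes_per_frame: int = 2):
        # int16 mono audio uses 2 bytes per frame.
        self.bytes_per_frame = bytes_per_frame
        self._leftover = b""

    def feed(self, b64_chunk: str) -> bytes:
        """Decode one base64 chunk and return the frame-aligned part."""
        data = self._leftover + base64.b64decode(b64_chunk)
        usable = len(data) - (len(data) % self.bytes_per_frame)
        # Keep any trailing partial frame for the next chunk.
        self._leftover = data[usable:]
        return data[:usable]
```

In the original loop this would replace the direct write: create `aligner = FrameAligner()` before the loop, then call `stream.write(aligner.feed(audio["data"]))` only when the returned bytes are non-empty.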

sgxtj

1 day ago

edited 1 day ago

In my case, I’m using the official API endpoint rather than local inference. I’ve attached the prompt and the generated result for reference (the sound occurs at approximately the 6-second mark). The input text I synthesized was in Danish:

Den nyeste generation af store sprogmodeller udvikler sig løbende inden for arkitektur og træningsparadigmer, og deres kapacitetsgrænser er blevet markant udvidet til at omfatte multimodal forståelse og kompleks ræsonnering. De nuværende mainstream-modeller understøtter ikke blot præcis informationsudtrækning inden for kontekstvinduer på millionniveau, men udviser også en høj grad af menneskelignende sammenhæng i instruktionsfølge og logisk deduktion.
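
To report timestamps like the 6-second mark more precisely, a rough sketch along these lines could help. It assumes the generated audio has been saved as raw 16-bit mono little-endian PCM, and it uses the 24000 Hz sample rate from the script in the first post. `find_jumps` is a hypothetical helper, and the threshold is an arbitrary starting value that will likely need tuning, since loud speech can also produce large sample-to-sample steps.

```python
import array
import sys


def find_jumps(pcm: bytes, samplerate: int = 24000, threshold: int = 8000) -> list:
    """Return times (in seconds) where consecutive samples differ sharply.

    ASSUMPTION: pcm is raw int16 mono little-endian audio. The threshold
    is a guess, not a calibrated value.
    """
    samples = array.array("h")
    # Drop a trailing odd byte so the buffer holds only whole samples.
    samples.frombytes(pcm[: len(pcm) - (len(pcm) % 2)])
    if sys.byteorder == "big":
        samples.byteswap()  # convert little-endian data to native order
    return [
        i / samplerate
        for i in range(1, len(samples))
        if abs(samples[i] - samples[i - 1]) > threshold
    ]
```

Jump times that repeat at a steady spacing would be consistent with the regular intervals described above, but this only flags candidates and does not identify the cause.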
