How to handle mixed Mandarin and Cantonese text when the model does not support per‑word language selection?

by sgxtj - opened 7 days ago

I am testing a multilingual TTS model with text that mixes Mandarin and Cantonese. The input contains simplified Chinese characters (which should be read in Mandarin) and Cantonese‑specific words (which should be read in Cantonese). My expectation is that the model would automatically switch language based on the vocabulary – speaking Mandarin for standard simplified characters and Cantonese for Cantonese words. However, I noticed that the model does not support specifying the language per utterance or per word. As a result, the synthesized speech does not switch properly and many Cantonese words are pronounced in Mandarin.

What is the recommended way to handle this situation? Are there any practical workarounds, such as splitting the text by language and synthesizing each segment separately, using a language detection step to insert control tokens, or somehow fine‑tuning the model to recognise mixed‑language patterns? Any advice or examples would be very helpful. Thank you.
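To make the "split the text by language" idea concrete, here is a minimal sketch of what I have in mind. It is only a heuristic: the marker set `CANTONESE_MARKERS` is a small hand-picked list of Cantonese-specific characters that I made up for illustration (not a complete lexicon), the labels `"cmn"` and `"yue"` are just names I chose, and it works per clause rather than per word.

```python
import re

# Hypothetical, deliberately small set of Cantonese-specific characters.
# A real system would need a much fuller lexicon or a proper classifier.
CANTONESE_MARKERS = set("嘅咗喺唔冇佢啲嘢咁嚟哋睇咩嗰")

def split_clauses(text):
    """Split text into clauses, keeping trailing punctuation attached."""
    parts = re.findall(r"[^，。！？,.!?]+[，。！？,.!?]*", text)
    return [p.strip() for p in parts if p.strip()]

def label_clause(clause):
    """Label a clause 'yue' (Cantonese) if it contains any marker, else 'cmn'."""
    return "yue" if any(ch in CANTONESE_MARKERS for ch in clause) else "cmn"

def segment_by_language(text):
    """Return (label, text) segments, merging adjacent clauses with the same label."""
    segments = []
    for clause in split_clauses(text):
        lang = label_clause(clause)
        if segments and segments[-1][0] == lang:
            segments[-1] = (lang, segments[-1][1] + clause)
        else:
            segments.append((lang, clause))
    return segments

if __name__ == "__main__":
    for lang, segment in segment_by_language("我今天很忙，明天再说。佢哋喺度食飯。"):
        print(lang, segment)
```

Each segment could then be synthesized on its own and the audio concatenated, although a clause that mixes both languages would still be forced into a single label.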

SilinMeng0510

Boson AI org 6 days ago

Thanks for trying our model and for the feedback :)
Our engineers are currently working on explicit control over dialect switching. This will be our next step for multilingual generation in higgs-audio-v3.5-tts.
For now, our model doesn't support tagging for this, but one usage trick is to maintain two reference audio clips of the same person, one speaking Mandarin and one speaking Cantonese. When you want this person to speak Mandarin, use the Mandarin reference audio, and do the same for Cantonese. This way, the model should be able to speak one of the two, either Mandarin or Cantonese, within a given turn. If you want both together within a single turn, unfortunately we do not currently support this very well.
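A minimal sketch of this trick, assuming the turns have already been labeled by language. Everything named here is hypothetical: the file paths in `REFERENCE_AUDIO`, the labels `"cmn"` and `"yue"`, and the `synthesize` callable, which stands in for whatever function you use to call the model with a text and a reference audio clip. It is not the model's actual API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical paths: two clips of the same speaker, one per language.
REFERENCE_AUDIO: Dict[str, str] = {
    "cmn": "speaker_mandarin.wav",
    "yue": "speaker_cantonese.wav",
}

def synthesize_turns(
    turns: List[Tuple[str, str]],
    synthesize: Callable[[str, str], object],
    references: Dict[str, str] = REFERENCE_AUDIO,
) -> list:
    """Synthesize each (language, text) turn with the matching reference clip.

    `synthesize(text, reference_path)` is a placeholder for the real TTS call.
    """
    outputs = []
    for lang, text in turns:
        if lang not in references:
            raise ValueError(f"no reference audio configured for language {lang!r}")
        outputs.append(synthesize(text, references[lang]))
    return outputs

if __name__ == "__main__":
    # Stub that only records which reference clip would be used for each turn.
    def fake_synthesize(text: str, reference_path: str):
        return (reference_path, text)

    turns = [("cmn", "我今天很忙。"), ("yue", "佢哋喺度食飯。")]
    for item in synthesize_turns(turns, fake_synthesize):
        print(item)
```

Each turn is kept in a single language here, since mixing both within one turn is the case that is not well supported yet.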

sgxtj

1 day ago

Thanks for your reply and for the suggestion.

I tried the workaround you mentioned: maintaining two reference audio clips from the same speaker, one in Mandarin and one in Cantonese, and switching the reference audio depending on the target segment. However, in my tests it still did not achieve the expected result, especially for mixed Mandarin/Cantonese content within the same sentence or turn. Some Cantonese-specific words were still not pronounced naturally, and the overall switching behavior was not stable enough for my use case.

I understand that explicit dialect/language control is not currently supported, and I’m very glad to hear that your team is working on dialect switching. I’m really looking forward to this feature in future versions.

Thanks again for your help, and I really appreciate the great work your team is doing on this model.
