Telegram channel speechtech - Speech Technology

Speech Technology

517 pages of instrument/vocals separation

https://docs.google.com/document/d/17fjNvJzj8ZGSer7c7OFe_CNfUKbAxEh_OBv94ZdRG5c/

Instrumental and vocal & stems separation & mastering
(UVR 5 GUI: VR/MDX-Net/MDX23C/Demucs 1-4, and BS/Mel-Roformer in beta
MVSEP-MDX23-Colab/KaraFan/drumsep/LarsNet/SCNet
x-minus.pro (uvronline.app)/mvsep.com/
GSEP/Dango.ai/Audioshake/Music.ai)

Speech Technology

Dataset generated with OpenAI

https://huggingface.co/datasets/laion/laions_got_talent

"LAION's Got Talent" is a generated dataset comprising voice acting samples that exhibit a wide range of emotions, vocal bursts, topics, and content. This dataset is a component of the BUD-E project, spearheaded by LAION with support from Intel.

Speech Technology

Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics

https://arxiv.org/abs/2503.01174

https://x.com/Sid_Arora_18/status/1897315720205328593

As usual, the baseline cascaded system is intentionally weak. Whisper tiny as a baseline???

Speech Technology

The second baseline from https://x.com/xueyao_98 is now available!

Check out their technical blog and open-sourced code:
Blog: https://veiled-army-9c5.notion.site/Vevo1-5-1d2ce17b49a280b5b444d3fa2300c93a
Code: https://github.com/open-mmlab/Amphion/tree/main/models/svc/vevosing

Training data will be distributed starting April 28th.
Register for SVCC here: https://forms.gle/GZGAWJAZvgDK6QKcA

Speech Technology

Uniform steps are definitely a problem in speech LLMs. A couple of attempts to solve that came together recently; the idea is to apply text/speech alignment before feeding the data into the LLM:

https://github.com/FreedomIntelligence/Soundwave

https://github.com/mtkresearch/TASTE-SpokenLM

https://arxiv.org/abs/2502.12900

Soundwave: Less is More for Speech-Text Alignment in LLMs

Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li

Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at this https URL.
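The sequence-length mismatch described above can be illustrated with a toy pooling step: collapse frame-level speech features into one vector per aligned text unit before feeding them to the LLM. This is only a hypothetical sketch of the general idea, not Soundwave's or TASTE's actual mechanism; `pool_speech_frames` and its inputs are invented for illustration.

```python
def pool_speech_frames(feats, boundaries):
    """Average-pool frame-level speech features into one vector per text unit.

    feats: list of frame feature vectors (list of lists)
    boundaries: frame index at which each aligned text unit ends,
                e.g. from a CTC or forced aligner
    """
    pooled, start = [], 0
    for end in boundaries:
        seg = feats[start:end]
        dim = len(seg[0])
        pooled.append([sum(f[d] for f in seg) / len(seg) for d in range(dim)])
        start = end
    return pooled

# 100 frames of 2-dim features collapsed to 4 token-aligned vectors
frames = [[0.1, 0.2]] * 100
units = pool_speech_frames(frames, [25, 50, 80, 100])
print(len(units))  # 4
```

After pooling, the speech sequence has the same length as the text it aligns to, which removes the uniform-step mismatch before the LLM sees it.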

Speech Technology

This talk was posted here already, but I watched it again recently and can recommend revisiting it:

Hearing the AGI: from GMM-HMM to GPT-4o, by Yu Zhang
November 15th LTI Colloquium Speaker

https://www.youtube.com/watch?v=pRUrO0x637A

Highly recommended:

1. Importance of scale
2. Importance of self-supervised learning for dirty data training
3. Very tricky case with dither seed and self-supervised learning
4. Voice search data is useless
5. Importance of multi-objective training (again)
6. Why readable transcripts (Whisper) are better than good WER (RNNT)
7. Discussion on factors of audio and text data for audio LLM training
8. Size of the decoder and size of the encoder

Not always relevant for us GPU-poor guys, but very nice overall.

Speech Technology

Announcing the AudioMOS Challenge 2025!

Homepage: https://sites.google.com/view/voicemos-challenge/audiomos-challenge-2025

We are enlarging the scope of the previous VoiceMOS challenge series to cover not only speech but also music and general audio.

Founded in 2022, the VoiceMOS Challenge (VMC) series aims to compare prediction techniques for human ratings of speech. To facilitate development in the automatic evaluation of audio generation systems, we decided to enlarge the scope and rename it as the AudioMOS Challenge.

Track 1: MOS prediction for text-to-music systems
This track is based on the MusicEval dataset, spanning 31 TTM systems, along with ratings collected from music experts. Evaluation was conducted across two axes: overall musical impression and alignment with the text prompt.

Track 2: Audiobox-aesthetics-style prediction for TTS, TTA and TTM samples
This track is based on the recently released Meta Audiobox Aesthetics, where they proposed four new axes: production quality, production complexity, content enjoyment, and content usefulness.

Track 3: MOS prediction for speech in high sampling frequencies
For the training set, we provide samples at 16/24/48 kHz. During evaluation, participants are asked to rate samples so that their scores reflect a listening test containing samples at all frequencies.

We are planning to submit a challenge proposal to ASRU2025. The challenge will start officially on April 9th. Please pre-register if interested!

Speech Technology

https://github.com/bytedance/MegaTTS3

Speech Technology

https://github.com/DataoceanAI/Dolphin

Dolphin is a multilingual, multitask ASR model developed through a collaboration between Dataocean AI and Tsinghua University. It supports 40 Eastern languages across East Asia, South Asia, Southeast Asia, and the Middle East, while also supporting 22 Chinese dialects. It is trained on over 210,000 hours of data, which includes both DataoceanAI's proprietary datasets and open-source datasets. The model can perform speech recognition, voice activity detection (VAD), segmentation, and language identification (LID).

It supports Russian, Uzbek, Kazakh, Tajik, etc.

https://github.com/DataoceanAI/Dolphin/blob/main/languages.md

Speech Technology

It all comes from NLP

https://github.com/Bartelds/ctc-dro

https://arxiv.org/abs/2502.01777

CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition

Martijn Bartelds, Ananjan Nandi, Moussa Koulako Bala Doumbouya, Dan Jurafsky, Tatsunori Hashimoto, Karen Livescu

Modern deep learning models often achieve high overall performance, but consistently fail on specific subgroups. Group distributionally robust optimization (group DRO) addresses this problem by minimizing the worst-group loss, but it fails when group losses misrepresent performance differences between groups. This is common in domains like speech, where the widely used connectionist temporal classification (CTC) loss scales with input length and varies with linguistic and acoustic properties, leading to spurious differences between group losses. We present CTC-DRO, which addresses the shortcomings of the group DRO objective by smoothing the group weight update to prevent overemphasis on consistently high-loss groups, while using input length-matched batching to mitigate CTC's scaling issues. We evaluate CTC-DRO on the task of multilingual automatic speech recognition (ASR) across five language sets from the ML-SUPERB 2.0 benchmark. CTC-DRO consistently outperforms group DRO and CTC-based baseline models, reducing the worst-language error by up to 47.1% and the average error by up to 32.9%. CTC-DRO can be applied to ASR with minimal computational costs, and offers the potential for reducing group disparities in other domains with similar challenges.
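The smoothed group-weight update can be sketched as an exponentiated-gradient step whose step size is tempered for persistently high-loss groups. This is a simplified stand-in, not the paper's exact rule; `group_dro_update` and the `alpha` smoothing term are assumptions for illustration.

```python
import math

def group_dro_update(weights, losses, eta=0.1, alpha=1.0):
    """One exponentiated-gradient step of (smoothed) group DRO.

    weights: current per-group mixture weights (sum to 1)
    losses:  per-group losses from the current batch
    alpha:   smoothing constant that tempers the update so persistently
             high-loss groups do not dominate (a simplified stand-in for
             CTC-DRO's smoothed update; the paper's exact rule differs)
    """
    scaled = [w * math.exp(eta * l / (alpha + l)) for w, l in zip(weights, losses)]
    z = sum(scaled)
    return [s / z for s in scaled]

# the higher-loss group gains weight, but the gain saturates as loss grows
print(group_dro_update([0.5, 0.5], [2.0, 1.0]))
```

Plain group DRO would use `exp(eta * l)` directly, so a group whose loss stays large (e.g. because CTC loss scales with input length) would keep absorbing all the weight; the saturating ratio caps that effect.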

Speech Technology

https://github.com/canopyai/Orpheus-TTS/issues/10#issuecomment-2740645470

christophschuhmann left a comment (canopyai/Orpheus-TTS#10)
Hey, Christoph from Laion here, the guy who made the Laion 5 billion data set. I have been making a voice acting data set with some donations from Intel with altogether 5,000 hours of high quality voice acting.
https://huggingface.co/datasets/laion/laions_got_talent_enhanced_flash_annotations_and_long_captions
https://huggingface.co/datasets/laion/laions_got_talent_raw
I was using HyperLab, which is a reseller for OpenAI API, so I never actually agreed to the OpenAI terms of service and then prompted the voice API to role play like an actor at a casting audition. This way I generated evenly distributed utterances over 40 emotion categories for all 11 OpenAI voices for English, German, French, and Spanish. The data is already online and I also have very detailed emotion captions. I will make an official release in the next few weeks, but you could already take it and tune German, Spanish, and French models on it. I would be very happy about a capable German model because I want to deploy voice assistants in schools in Germany. I'm doing all of this in my free time and I am still a high school teacher and want to keep it this way. In the following repository, the quality is the best, but unfortunately I lost the accent labels for the English samples. Some samples in the English part are with accents. In the second repository here, you find the unenhanced data, which is slightly lower from the recording quality, but you can find in the emotion entry of the JSON the corresponding accent. For English, I generated 14 different accents. German, Spanish, and French don't have any accents. Have fun!

Speech Technology

https://github.com/canopyai/Orpheus-TTS

Speech Technology

Some more results from our experiments with GEC with LLMs

https://alphacephei.com/nsh/2025/03/15/generative-error-correction.html

Most 8B models at 4-bit quantization are not very stable; hallucinations are present in about 25% of cases. Qwen is very unstable for this task.

Gemma2 and Gemma3 are OK; we have yet to try the 27B version.

The simple prompts from the papers certainly don't work. One has to provide much more detail and name specific issues in the prompt. We have yet to work on the prompt more.

Even prompt formatting matters: by modifying the prompt format we were able to reduce WER from 26% to 16%.

For now GEC doesn't seem like a breakthrough tech; it feels like some extra sauce is needed. Simple ROVER is equally OK and much more stable.

We discussed on the channel with iLa that an English prompt helps for non-English languages. I think that is possible for some models, but I can't confirm it in experiments.

For big models, splitting the input doesn't help much.

There is still a lot of overcorrection of proper names that are rare and unknown to the LLM, and overcorrection of grammar. We need to work more on this.

The difference between Gemma2-9B and Gemini Flash is not very large, except for the number of hallucinations.

Most models have very poor knowledge of rare domains and poor knowledge about speech (phonetics).
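For comparison, the ROVER baseline mentioned above is essentially alignment plus word-level voting. A toy sketch follows; the real ROVER first aligns hypotheses with dynamic programming, while here the hypotheses are assumed pre-aligned word-for-word, and `rover_vote` is a name invented for illustration.

```python
from collections import Counter

def rover_vote(hypotheses):
    """Position-wise majority vote over ASR hypotheses.

    Assumes the hypotheses are already aligned word-for-word
    (use '*' as a deletion placeholder); real ROVER produces this
    alignment with dynamic programming before voting.
    """
    voted = []
    for words in zip(*[h.split() for h in hypotheses]):
        best, _ = Counter(words).most_common(1)[0]
        if best != "*":
            voted.append(best)
    return " ".join(voted)

print(rover_vote([
    "the cat sat on mat",
    "the cat sad on mat",
    "the cat sat on mat",
]))  # the cat sat on mat
```

Because the vote can only pick words that some system actually emitted, this kind of combination cannot hallucinate, which is one reason it is more stable than LLM-based correction.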

Speech Technology

ICASSP 2025 papers are now available online

https://ieeexplore.ieee.org/xpl/conhome/10887540/proceeding?isnumber=10887541

The program website is weird:

https://icassp25.conflux.events/program

Speech Technology

I've spent some time on generative error correction recently; more numbers and results on it later. Meanwhile, the paper:

https://arxiv.org/abs/2409.09785

Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition

Chao-Han Huck Yang, Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr Żelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Sabato Marco Siniscalchi, Eng Siong Chng, Peter Bell, Catherine Lai, Shinji Watanabe, Andreas Stolcke

Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. These tasks aim to emulate future LLM-based agents handling voice-based interfaces while remaining accessible to a broad audience by utilizing open pretrained language models or agent-based APIs. We also discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.

An older paper is here:

/channel/speechtech/1962

Speech Technology

NVIDIA released a new English model, improving leaderboard results

https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2

https://huggingface.co/spaces/hf-audio/open_asr_leaderboard

Speech Technology

People are excited about https://github.com/MoonshotAI/Kimi-Audio

Accuracy numbers look nice

Speech Technology

Epic review process of MegaTTS3

https://openreview.net/forum?id=o362EkNU2z

The model itself is well designed; we love non-autoregressive models and the MFA aligner too!

https://github.com/bytedance/MegaTTS3

Speech Technology

https://sites.google.com/view/respinasrchallenge2025/home

MADASR 2.0 : Multi-Lingual Multi-Dialect ASR Challenge in 8 Indian Languages

Speech Technology

https://github.com/zhai-lw/SQCodec

https://arxiv.org/abs/2504.04949

One Quantizer is Enough: Toward a Lightweight Audio Codec

Linwei Zhai, Han Ding, Cui Zhao, fei wang, Ge Wang, Wang Zhi, Wei Xi

Speech Technology

Modern diarization

https://github.com/cuhealthybrains/MT-LLM

Speech Technology

https://github.com/gwh22/UniVoice

This work introduces UniVoice, a novel approach that integrates autoregression and flow matching within a transformer-based framework for speech unified understanding and generation. UniVoice is designed to achieve both speech comprehension and generation capabilities through a unified model trained in a single stage. Our experiments demonstrate that UniVoice delivers strong performance for both automatic speech recognition and zero-shot speech synthesis tasks. By combining autoregression and flow matching, UniVoice establishes a foundation for expanding to additional audio understanding and generation tasks using the paradigm in the future.

Speech Technology

https://www.youtube.com/watch?v=2WLH-g4_xeA

Speech Technology

A couple of new TTS releases today

https://github.com/yynil/RWKVTTS

https://github.com/yxlu-0102/IDEA-TTS

Speech Technology

Not a day goes by without a new TTS

https://x.com/anuj_diwan/status/1902884487718965330

If you'd like an open-source text-to-speech model that follows your style instructions, consider using our ParaSpeechCaps-based model!
Model: https://huggingface.co/ajd12342/parler-tts-mini-v1-paraspeechcaps
Paper: https://arxiv.org/abs/2503.04713

Speech Technology

https://x.com/PiotrZelasko/status/1902723841534681357

Canary-1B-Flash and Canary-180M-Flash - two new variants of Canary optimized for fast training and inference.

Key features of Canary-1B-Flash:
* Several times faster!
* More accurate than Canary-1B!
* Word-level timestamps!
* Dropped NC license!
Both models support the same set of languages as original Canary-1B: English, French, Spanish, and German.

Speech Technology

Twitter suggested this paper on GEC to me, stressing the named-entity recognition issue, right on topic:

https://arxiv.org/abs/2410.13198

Speech Technology

An interesting distillation of Kokoro

https://github.com/EndlessReform/smoltts

They also have a speech dataset encoded with Mimi; LibriTTS-R is just 60 MB.

Speech Technology

https://github.com/soham97/mellow

https://arxiv.org/abs/2503.08540

Mellow: a small audio language model for reasoning

Soham Deshmukh, Satvik Dixit, Rita Singh, Bhiksha Raj

Multimodal Audio-Language Models (ALMs) can understand and reason over both audio and text. Typically, reasoning performance correlates with model size, with the best results achieved by models exceeding 8 billion parameters. However, no prior work has explored enabling small audio-language models to perform reasoning tasks, despite the potential applications for edge devices. To address this gap, we introduce Mellow, a small Audio-Language Model specifically designed for reasoning. Mellow achieves state-of-the-art performance among existing small audio-language models and surpasses several larger models in reasoning capabilities. For instance, Mellow scores 52.11 on MMAU, comparable to SoTA Qwen2 Audio (which scores 52.5) while using 50 times fewer parameters and being trained on 60 times less data (audio hrs). To train Mellow, we introduce ReasonAQA, a dataset designed to enhance audio-grounded reasoning in models. It consists of a mixture of existing datasets (30% of the data) and synthetically generated data (70%). The synthetic dataset is derived from audio captioning datasets, where Large Language Models (LLMs) generate detailed and multiple-choice questions focusing on audio events, objects, acoustic scenes, signal properties, semantics, and listener emotions. To evaluate Mellow's reasoning ability, we benchmark it on a diverse set of tasks, assessing on both in-distribution and out-of-distribution data, including audio understanding, deductive reasoning, and comparative reasoning. Finally, we conduct extensive ablation studies to explore the impact of projection layer choices, synthetic data generation methods, and language model pretraining on reasoning performance. Our training dataset, findings, and baseline pave the way for developing small ALMs capable of reasoning.

Speech Technology

https://arxiv.org/abs/2502.06490

Recent Advances in Discrete Speech Tokens: A Review

Yiwei Guo, Zhihan Li, Hankun Wang, Bohan Li, Chongtian Shao, Hanglei Zhang, Chenpeng Du, Xie Chen, Shujie Liu, Kai Yu

The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete, compact, and concise nature, are not only advantageous for efficient transmission and storage, but also inherently compatible with the language modeling framework, enabling seamless integration of speech into text-dominated LLM architectures. Current research categorizes discrete speech tokens into two principal classes: acoustic tokens and semantic tokens, each of which has evolved into a rich research domain characterized by unique design philosophies and methodological approaches. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, conducts a critical examination of the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types. Furthermore, we identify persistent challenges in the field and propose potential research directions, aiming to offer actionable insights to inspire future advancements in the development and application of discrete speech tokens.
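Acoustic tokens in such codecs typically come from residual vector quantization, where each stage quantizes the residual left over by the previous one. Below is a minimal sketch of the idea, not any specific codec's implementation; `rvq_encode` and the tiny codebooks are invented for illustration.

```python
def quantize(vec, codebook):
    # index of the nearest codebook entry by squared Euclidean distance
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])))

def rvq_encode(vec, codebooks):
    """Residual VQ: each stage quantizes what the previous stages missed."""
    residual, codes = list(vec), []
    for cb in codebooks:
        idx = quantize(residual, cb)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],    # coarse stage
    [[0.0, 0.0], [0.25, 0.25]],  # fine stage
]
print(rvq_encode([1.2, 1.2], codebooks))  # [1, 1]
```

The first codebook captures the coarse shape of the vector and the second refines the remainder, which is why the first few RVQ streams carry most of the perceptual content while later streams add detail.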
