Telegram channel speechtech - Speech Technology: Unsorted

Speech Technology

10 July 2025 18:14

A similar thing: https://dmdspeech.github.io, and a few others.

28 June 2025 20:37

Not just speech, but the speech part is also nice.

https://huggingface.co/datasets/facebook/seamless-interaction

26 June 2025 23:06

Supports speech recognition

https://huggingface.co/blog/gemma3n

21 June 2025 16:08

We track Inworld's status because the company was founded by the Dialogflow guys (Dialogflow was very popular in those days). Interesting that AI for games didn't work out.

https://www.linkedin.com/posts/kylangibbs_inworld-is-evolving-1-we-just-published-activity-7341215644828188672-aYWf

Inworld is evolving.

1. We just published our vision of the future. These are distilled learnings based on our first 4 years engaged with partners like NVIDIA, Microsoft, Status, Little Umbrella, Streamlabs, Nanobit, NBCUniversal, Mistral AI, Google and thousands of other developers.

Due to explosive growth in demand we are widening our focus to broader consumer applications (extending from games into new areas like fitness, learning and social connection). We are seeing new and existing companies across consumer categories shift the focus of AI adoption from cost savings to net new revenue opportunities through novel AI-native applications, and we are leaning in to support that shift.

20 June 2025 00:01

LAION proudly presents 2 state-of-the-art emotion detection models for voice and face, surpassing Gemini 2.5 Pro and Hume API. They are completely open under a CC BY 4.0 license, alongside a ~5,000-hour voice-acting dataset & 2 expert-annotated benchmarks.

https://huggingface.co/laion/BUD-E-Whisper

https://arxiv.org/abs/2505.20033

18 June 2025 16:25

The whole playlist of JSALT 2025 videos

https://www.youtube.com/playlist?list=PLSeS0sl8xpTwz7h5iJSniiF89iUdZXNJ2

17 June 2025 11:49

As part of our mission to create open-source datasets for low-resource African languages, Digital Umuganda has released 2,250 hours of open-source Kinyarwanda speech data. To accompany this release, we launched an ASR hackathon on Kaggle, inviting the ecosystem to build models and contribute to shaping the future of low resource language technologies.

Our goal is to collect 10,000 hours each of Kinyarwanda and Swahili speech data. This hackathon is a crucial step in that journey. The feedback will help us refine our data collection strategy for the remaining hours and ensure the datasets meet the needs of developers, researchers, and language advocates across the region.

We would greatly appreciate it if you could share this initiative with your network and help us reach more contributors passionate about language, technology, and open data.

The hackathon is made up of 3 tracks:

Track A – Small: 540 hours of fully transcribed Kinyarwanda speech.

Track B – Medium: 1180 hours of fully transcribed Kinyarwanda speech.

Track C – Large: 1180 hours of transcribed speech plus 1170 hours of unlabeled Kinyarwanda audio.

For more information you can check the Hackathon website https://digital-umuganda.github.io/kasr_hackathon/

11 June 2025 18:50

Runtime quality control is interesting

https://www.linkedin.com/posts/yongyi-zang_github-resemble-aichatterbox-sota-open-source-activity-7333625257456480256-XfT8

Got very curious, so I started to look into the source code of Resemble AI's newly released open TTS model Chatterbox (https://lnkd.in/gzmCFFaQ), which claims to outperform ElevenLabs. Here's a (too quick, hopefully not wrong) tech deep-ish dive into its architecture:

High-level overview: text -> semantic tokens -> flow matching for Mel -> Mel to waveform. For voice conversion, speech -> semantic tokens, and the rest. Speaker embedding conditioning for text -> semantic tokens and semantic tokens -> Mel.

Given a reference audio, it is encoded by a speaker embedding network A (will explain in a second) and a speech tokenizer. The speech tokenizer, S3Tokenizer, looks very much like CosyVoice's with some twists (basically just very semantic because it is trained on an ASR objective).

These two embeddings are then sent into its core sequence-modeling model (a llama). CFG is applied by running the model twice within the same batch, once without the conditioning speaker embedding. These two runs are then averaged to form the final logits based on a weight. The text tokens and speech tokens have separate absolute positional embeddings. (Why not RoPE BTW?)
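The CFG step described above can be sketched roughly as follows. This is a minimal illustration that assumes the common classifier-free-guidance formula `cond + w * (cond - uncond)`; the post only says the two runs are combined "based on a weight", so the exact weighting Chatterbox uses may differ. The function name `cfg_logits` and the toy logit values are hypothetical.

```python
def cfg_logits(cond, uncond, weight):
    """Blend conditional and unconditional logits, classifier-free-guidance style.

    Assumption: the standard CFG extrapolation is used; this is not confirmed
    to be the exact formula in the Chatterbox source.
    """
    if len(cond) != len(uncond):
        raise ValueError("logit vectors must have the same length")
    # weight = 0 returns the conditional logits unchanged;
    # larger weights push further away from the unconditional run.
    return [c + weight * (c - u) for c, u in zip(cond, uncond)]


cond = [2.0, 0.5, -1.0]    # toy logits from the run with the speaker embedding
uncond = [1.0, 1.0, -1.0]  # toy logits from the run without it
print(cfg_logits(cond, uncond, 0.5))  # [2.5, 0.25, -1.0]
```

Running both variants in one batch, as the post describes, costs a single forward pass on a doubled batch instead of two sequential passes.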

A runtime quality control model inspects alignment by looking at the activation map of layer 9 of the sequence model during generation (this is genius actually, they actually went in to look at attention maps!) and makes sure the tokens are attending to the right words (the final token is not being attended to for too long, previous tokens are not attended to again). It can't really fix anything, but it can make it immediately end speech by setting the <EOS> token's probability to ... 2^15...
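A rough sketch of such a runtime check, under stated assumptions: each generation step yields one row of attention weights over the text tokens, and the thresholds `max_steps_on_last` and `max_jump_back` are made-up illustrative values. The class and function names are hypothetical and are not taken from the Chatterbox code; only the 2^15 value comes from the post.

```python
EOS_BOOST = 2.0 ** 15  # value the post says is assigned to <EOS>


class AlignmentMonitor:
    """Tracks which text token each generated speech token attends to."""

    def __init__(self, num_text_tokens, max_steps_on_last=3, max_jump_back=2):
        self.last_index = num_text_tokens - 1
        self.max_steps_on_last = max_steps_on_last  # hypothetical threshold
        self.max_jump_back = max_jump_back          # hypothetical threshold
        self.furthest = 0       # furthest text position attended so far
        self.steps_on_last = 0  # consecutive steps spent on the final text token

    def should_stop(self, attention_row):
        """attention_row: weights of the current speech token over the text tokens."""
        focus = max(range(len(attention_row)), key=attention_row.__getitem__)
        self.steps_on_last = self.steps_on_last + 1 if focus == self.last_index else 0
        jumped_back = self.furthest - focus > self.max_jump_back
        self.furthest = max(self.furthest, focus)
        return jumped_back or self.steps_on_last > self.max_steps_on_last


def force_eos(logits, eos_id):
    """Return a copy of the logits with the <EOS> entry set to a huge value."""
    out = list(logits)
    out[eos_id] = EOS_BOOST
    return out


monitor = AlignmentMonitor(num_text_tokens=4)
rows = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.8, 0.1, 0.0],
        [0.0, 0.1, 0.8, 0.1], [0.0, 0.0, 0.1, 0.9]] + [[0.0, 0.0, 0.1, 0.9]] * 3
flags = [monitor.should_stop(row) for row in rows]
print(flags)  # only the last step, lingering on the final word, trips the check
```

As the post notes, a check like this cannot repair a bad alignment; it can only cut generation short once the attention pattern looks wrong.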

So the speech tokens are generated by the sequence model, and a speaker embedding model B is used to extract the speaker embedding again and uses that as the condition for the vocoder. Remember speaker embedding network A? That's a GE2E (so basically 3 LSTMs). The network B is a CAM++ x-vector; if they didn't change it, then it is likely a pre-trained model with an AAM-Softmax objective. Why two embedding networks? I can't really figure this one out.

So the vocoder is a two-stage system: they first generate a Mel spectrogram, then generate the waveform from Mel. The Mel generation part is identical to CosyVoice, where a CFM model is used; they used HiFTNet instead of HiFiGAN as the vocoder though.

HiFTNet is a neural source-filter network that has several stages. It starts by predicting F0 from Mel, then uses these F0s to generate a source signal. The source signal then conditions an inverse-STFT network (*neural filter*), where frame-wise magnitude and phase are predicted.
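To make the "F0 to source signal" step concrete, here is a deliberately simplified sketch: frame-wise F0 values are held for `hop_length` samples each and integrated into the phase of a single sine wave. This is an assumption-laden illustration; actual neural source-filter models typically add harmonics and noise, and the function name and parameters are hypothetical.

```python
import math


def sine_source(f0_frames, hop_length, sample_rate):
    """Upsample frame-wise F0 (Hz) to samples and integrate it into a sine wave."""
    samples = []
    phase = 0.0
    for f0 in f0_frames:
        for _ in range(hop_length):
            if f0 > 0:
                # Advance the phase by one sample's worth of the current frequency.
                phase = (phase + 2.0 * math.pi * f0 / sample_rate) % (2.0 * math.pi)
                samples.append(math.sin(phase))
            else:
                # Unvoiced frame: silence in this sketch (real systems use noise).
                samples.append(0.0)
    return samples


source = sine_source([100.0, 0.0, 200.0], hop_length=4, sample_rate=16000)
print(len(source))  # 12
```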

When the sequence modeling part is skipped, the speech tokens can also be directly extracted from source speech, which then go through the decoder network.

The watermarking model is a separate model that runs at the waveform level. It adds a watermark into the magnitude spectrogram, then uses 3 branches of convolutions to model watermark presence at different time scales.

02 June 2025 17:41

Discrete diffusion model from PlayHT

https://huggingface.co/PlayHT/PlayDiffusion

code here:

https://github.com/playht/PlayDiffusion

30 May 2025 09:43

Interesting point. We shouldn't test speech LLMs on factual knowledge

https://www.youtube.com/watch?v=2d1MU280yQk

https://github.com/slp-rl/WhiStress

https://arxiv.org/abs/2505.19103

WHISTRESS: Enriching Transcriptions with Sentence Stress Detection

Iddo Yosha, Dorin Shteyman, Yossi Adi

Spoken language conveys meaning not only through words but also through intonation, emotion, and emphasis. Sentence stress, the emphasis placed on specific words within a sentence, is crucial for conveying speaker intent and has been extensively studied in linguistics. In this work, we introduce WHISTRESS, an alignment-free approach for enhancing transcription systems with sentence stress detection. To support this task, we propose TINYSTRESS-15K, a scalable, synthetic training data for the task of sentence stress detection which resulted from a fully automated dataset creation process. We train WHISTRESS on TINYSTRESS-15K and evaluate it against several competitive baselines. Our results show that WHISTRESS outperforms existing methods while requiring no additional input priors during training or inference. Notably, despite being trained on synthetic data, WHISTRESS demonstrates strong zero-shot generalization across diverse benchmarks.

24 May 2025 00:03

So Kyutai released a modular system, https://unmute.sh, basically admitting that their first demo is not really usable.

20 May 2025 02:57

https://arxiv.org/abs/2411.18803

TS3-Codec: Transformer-Based Simple Streaming Single Codec

Haibin Wu, Naoyuki Kanda, Sefik Emre Eskimez, Jinyu Li

Neural audio codecs (NACs) have garnered significant attention as key technologies for audio compression as well as audio representation for speech language models. While mainstream NAC models are predominantly convolution-based, the performance of NACs with a purely transformer-based, and convolution-free architecture remains unexplored. This paper introduces TS3-Codec, a Transformer-Based Simple Streaming Single Codec. TS3-Codec consists of only a stack of transformer layers with a few linear layers, offering greater simplicity and expressiveness by fully eliminating convolution layers that require careful hyperparameter tuning and large computations. Under the streaming setup, the proposed TS3-Codec achieves comparable or superior performance compared to the codec with state-of-the-art convolution-based architecture while requiring only 12% of the computation and 77% of bitrate. Furthermore, it significantly outperforms the convolution-based codec when using similar computational resources.

17 May 2025 03:10

Voicebox is a fundamental model by itself, but this talk has a very interesting part about applying synthetic data to ASR model training.

https://www.youtube.com/watch?v=PKleJNikO8M

14 May 2025 00:29

Gemini models are very good (and the recent 2.5 preview is even better)

https://github.com/ddlBoJack/MMAR

08 May 2025 09:28

New multilingual speech restoration paper out: Miipher-2 🚀! The RTF on a TPU is 0.0078: 1 million hours of data can be cleaned in 3 days using just 100 TPUs!

Paper: https://arxiv.org/abs/2505.04457
Demo: https://google.github.io/df-conformer/miipher2/
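The "3 days on 100 TPUs" figure follows directly from the quoted RTF. A quick check, assuming perfect parallelism across devices and no I/O overhead (the helper name `cleaning_days` is made up for this illustration):

```python
def cleaning_days(audio_hours, rtf, num_devices):
    """Wall-clock days needed to process audio at a given real-time factor."""
    compute_hours = audio_hours * rtf  # device-hours of processing
    return compute_hours / num_devices / 24.0


days = cleaning_days(audio_hours=1_000_000, rtf=0.0078, num_devices=100)
print(f"{days:.2f} days")  # 3.25 days
```

That is 7,800 TPU-hours in total, or 78 hours per device, which rounds to the 3 days quoted in the post.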

10 July 2025 17:26

Not something exceptional, just a current trend

https://arxiv.org/abs/2507.05911

Differentiable Reward Optimization for LLM based TTS system

Changfeng Gao, Zhihao Du, Shiliang Zhang

This paper proposes a novel Differentiable Reward Optimization (DiffRO) method aimed at enhancing the performance of neural codec language models based text-to-speech (TTS) systems. In contrast to conventional reinforcement learning from human feedback (RLHF) approaches applied to TTS, DiffRO directly computes the rewards based on neural codec tokens, rather than relying on synthesized audio. Furthermore, we employ the Gumbel-Softmax technique to render the reward function differentiable, thereby streamlining the RLHF training process. Additionally, we introduce a multi-task reward (MTR) model which can provide feedback from different perspectives and find that it can augment the system's capability to follow instructions effectively. Experimental results indicate that DiffRO significantly improves the pronunciation accuracy of the TTS system, achieving state-of-the-art (SOTA) WER results on the seed-tts-eval benchmark. Moreover, with the integration of the MTR model, we demonstrate the ability to control emotional and quality attributes in a zero-shot manner.
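For readers unfamiliar with the Gumbel-Softmax trick mentioned in the abstract, here is a forward-pass-only sketch in plain Python. It shows how a hard token choice is relaxed into a soft probability vector; it does not reproduce the paper's training setup, and in practice this would be written in an autodiff framework so gradients can flow through the output. The function name and the toy logits are illustrative.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return a relaxed one-hot sample over the token vocabulary."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
soft_token = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
print(soft_token)
```

Lower temperatures push the output toward a one-hot vector, while higher temperatures make it smoother; a reward model fed such soft tokens stays differentiable with respect to the logits.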

27 June 2025 13:57

Gemma3n USM encoder operates at 6.5 frames per second

https://huggingface.co/n0mad-0/gemma3n-usm-rip

22 June 2025 09:53

Tested the https://huggingface.co/kyutai/stt-1b-en_fr model on some diverse data. Accuracy is on the lower side.

CMUKids WER is 11.3, for example, compared to 4.8 for parakeet-tdt-0.6b-v2. LibriSpeech test-clean WER is 4+ too.

Output is sometimes in Chinese, sometimes in Arabic.
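For reference, WER numbers like the ones above are word-level edit distance divided by the reference length. A minimal sketch follows, without the text normalization (casing, punctuation, numerals) that real evaluations apply and that can shift results noticeably; the example sentences are made up.

```python
def wer(reference, hypothesis):
    """Word error rate in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return 100.0 * prev[-1] / len(ref)


print(f"{wer('the cat sat on the mat', 'the cat sit on mat'):.1f}")  # 33.3
```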

21 June 2025 15:44

https://github.com/fluxions-ai/vui
https://huggingface.co/fluxions/vui

Got some attention recently. A multispeaker TTS model with context (like Dia), 100M params.

DIA vs vui

- vui 16x smaller
- Unlimited render length, dia 30 seconds
- vui has 150ms latency, time to first byte
- vui runs in <5gb VRAM
- 4x faster codec
- 1/2 the number of people
- built with google cloud tpus, vs two 4090's in a basement.
- 7x faster RTF

19 June 2025 15:43

https://x.com/kyutai_labs/status/1935652243119788111

Kyutai Speech-To-Text is now open-source! It’s streaming, supports batched inference, and runs blazingly fast: perfect for interactive applications.

Check out the details here: https://kyutai.org/next/stt

https://github.com/kyutai-labs/delayed-streams-modeling

18 June 2025 16:23

https://www.linkedin.com/posts/alexander-polok-b5567284_dicow-diarization-conditioned-whisper-for-activity-7341058825732415488-UVez

We are happy to announce that our DiCoW and DiariZen based system finished 🥈 in the Challenge and Workshop on Multilingual Conversational Speech Language Model (MLC-SLM) at Interspeech 2025 organized by Nexdata.jp【旧Datatang株式会社公式】!

https://www.nexdata.ai/competition/mlc-slm

📄 System description and additional analysis (including dataset inconsistencies) are now available on arXiv:
BUT System for the MLC-SLM Challenge:
👉 https://www.arxiv.org/abs/2506.13414

In addition, I’m very pleased to share that our journal paper:
"DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition"
👉https://www.sciencedirect.com/science/article/abs/pii/S088523082500066X
has been accepted for publication in Computer Speech & Language (Elsevier)!

And last but not least — just yesterday I had the pleasure of presenting a tutorial and mini-challenge on fine-tuning DiCoW in data/compute constrained environments at this year's JSALT Summer School!
🎓 If you want to try it yourself: https://colab.research.google.com/github/Lakoc/JSALT_tutorial/blob/main/challenge.ipynb

https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard

🎥 Recording available here: https://www.youtube.com/watch?v=KqNKGjcsi9g&list=PLSeS0sl8xpTwz7h5iJSniiF89iUdZXNJ2&index=28

17 June 2025 10:34

We speech guys rarely eat our own dogfood ;) A few interesting dictation tools popular these days:

1) Willowvoice
2) Wisprflow
3) Superwhisper

It is interesting how LLM correction plays in here: you don't need a plain transcript; instead, the tool converts your input into the required style.

https://willowvoice.com/

https://wisprflow.ai/

https://superwhisper.com/

10 June 2025 16:16

In-car multispeaker recognition. CER is still 50%

https://github.com/DaiYvhang/AISHELL-5

31 May 2025 06:12

TTS evaluation with audio LLM, interesting results too

https://github.com/boson-ai/EmergentTTS-Eval-public

26 May 2025 10:44

Supports many languages, including Russian. No code/model yet though.

https://funaudiollm.github.io/cosyvoice3/

https://arxiv.org/abs/2505.17589

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Xian Shi, Keyu An, Guanrou Yang, Yabin Li, Yanni Chen, Zhifu Gao, Qian Chen, Yue Gu, Mengzhe Chen, Yafeng Chen, Shiliang Zhang, Wen Wang, Jieping Ye

In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at https://funaudiollm.github.io/cosyvoice3/.

23 May 2025 20:12

Includes ASR, yet to test it

https://developers.googleblog.com/en/introducing-gemma-3n/

17 May 2025 03:12

The paper itself

Using Voicebox-based Synthetic Speech for ASR Adaptation

https://www.isca-archive.org/syndata4genai_2024/dhamyal24_syndata4genai.pdf

17 May 2025 03:10

These ConvAI videos are pretty good, better than papers. I can again recommend the two recent ones:

"Automatic Quality Assessment for Speech and Beyond"

Very important classification of speech quality aspects inside

https://www.youtube.com/watch?v=REH034Wm3so

09 May 2025 18:45

18B speech recognition models

https://huggingface.co/collections/espnet/owls-scaling-laws-for-speech-recognition-and-translation-67ab7f991c194065f057ce8d

https://arxiv.org/abs/2502.10373

OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models

William Chen, Jinchuan Tian, Yifan Peng, Brian Yan, Chao-Han Huck Yang, Shinji Watanabe

Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with the 18B version being the largest speech model, to the best of our knowledge. OWLS leverages up to 360K hours of public speech data across 150 languages, enabling a systematic investigation into how data, model, and compute scaling each influence performance in multilingual speech tasks. We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling. One of our key findings is that scaling enhances performance on low-resource languages/dialects, helping to mitigate bias and improve the accessibility of speech technologies. Finally, we show how OWLS can be used to power new research directions by discovering emergent abilities in la

07 May 2025 16:34

Daniel Povey's talk

https://youtube.com/watch?v=2B1-gKDTuh0

Overall, consistency-based training is becoming more and more important these days in different areas, TTS too. As the amount of training data reaches its limit, better supervision brings more results.
