Introducing Emilia-Large: 200K+ Hours of Open-Source Speech Data!
We’re excited to release Emilia-Large, the largest open-source TTS pretraining dataset: 200K+ hours of multilingual speech data, ready to use for #TTS and #SpeechLM.
https://x.com/realamphion/status/1894719602816393295
KAD: No More FAD! An Effective and Efficient Evaluation Metric for Audio Generation
https://arxiv.org/abs/2502.15602
Although being widely adopted for evaluating generated audio signals, the Fréchet Audio Distance (FAD) suffers from significant limitations, including reliance on Gaussian assumptions, sensitivity to sample size, and high computational complexity. As an alternative, we introduce the Kernel Audio Distance (KAD), a novel, distribution-free, unbiased, and computationally efficient metric based on Maximum Mean Discrepancy (MMD). Through analysis and empirical validation, we demonstrate KAD's advantages: (1) faster convergence with smaller sample sizes, enabling reliable evaluation with limited data; (2) lower computational cost, with scalable GPU acceleration; and (3) stronger alignment with human perceptual judgments. By leveraging advanced embeddings and characteristic kernels, KAD captures nuanced differences between real and generated audio. Open-sourced in the kadtk toolkit, KAD provides an efficient, reliable, and perceptually aligned benchmark for evaluating generative audio models.
https://github.com/YoonjinXD/kadtk
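To make the MMD connection concrete, here is a minimal NumPy sketch of an unbiased squared-MMD estimate between real and generated embedding sets with an RBF kernel. It only illustrates the metric family KAD belongs to, not the kadtk implementation; the kernel choice, bandwidth, and embedding model are assumptions.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    # Pairwise squared Euclidean distances, then a Gaussian (RBF) kernel.
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * bandwidth**2))

def kad_mmd2(real_emb, fake_emb, bandwidth=10.0):
    """Unbiased squared MMD between two sets of audio embeddings."""
    m, n = len(real_emb), len(fake_emb)
    k_xx = rbf_kernel(real_emb, real_emb, bandwidth)
    k_yy = rbf_kernel(fake_emb, fake_emb, bandwidth)
    k_xy = rbf_kernel(real_emb, fake_emb, bandwidth)
    # Drop diagonal terms to get the unbiased estimate.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2 * k_xy.mean()

# Example: embeddings would come from a pretrained audio encoder.
real = np.random.randn(512, 128)
fake = np.random.randn(256, 128)
print(kad_mmd2(real, fake))
```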
This was released recently; somehow I missed it before.
Sortformer diarizer: an open-source, end-to-end neural model for speaker diarization.
- Integration with ASR and LLM Models: Sortformer is designed to be integrated with ASR or LLM models as a Transformer Encoder. It can be used to inject token-level speaker ID info into the encoder parts of ASR models and LLMs.
- Train/Fine-tune via Token-level Labels: Sortformer resolves the permutation problem using arrival-time sort-loss-based training, enabling speaker IDs for words to be trained via token-level labels. No more timestamp-based training for speaker diarization!
https://arxiv.org/abs/2409.06656
https://huggingface.co/nvidia/diar_sortformer_4spk-v1
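As a rough illustration of the arrival-time sorting idea (not NVIDIA's actual code), the sketch below reorders speaker-activity targets by first-active frame, giving output slots a canonical order so that a plain per-frame loss can replace permutation-invariant training; the array layout is an assumption.

```python
import numpy as np

def sort_by_arrival(targets):
    """Reorder speaker columns by arrival time (first active frame).

    targets: (num_frames, num_speakers) binary speaker-activity matrix.
    After sorting, output slot 0 always corresponds to the speaker who
    talks first, slot 1 to the next, and so on.
    """
    num_frames, _ = targets.shape
    # First active frame per speaker; fully silent speakers go last.
    arrival = np.where(targets.any(axis=0), targets.argmax(axis=0), num_frames)
    order = np.argsort(arrival, kind="stable")
    return targets[:, order]

# Toy example: speaker 1 starts talking before speaker 0.
t = np.zeros((10, 2), dtype=int)
t[2:6, 1] = 1   # speaker 1 active in frames 2-5
t[5:9, 0] = 1   # speaker 0 active in frames 5-8
print(sort_by_arrival(t))  # column 0 now holds the earlier speaker
```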
https://huggingface.co/spaces/Speech-Arena-2025/Speech-DF-Arena
1M hours, 48 TB
https://mlcommons.org/2025/01/new-unsupervised-peoples-speech/
Once again (third time) https://github.com/KdaiP/StableTTS is really good.
It is all about conditioning. Many words in the paper, but this picture is the main one.
Guided sampling helps to reduce artifacts and improve clarity, but it also significantly reduces expressiveness. However, simply reducing the temperature has a similar effect with less compute.
https://alphacephei.com/nsh/2025/01/17/guidance.html
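For reference, a minimal sketch of the two knobs being compared: classifier-free guidance (two model calls per step, extrapolating towards the conditional prediction) versus simply scaling the sampling temperature (one call). The model signature and the way the condition is dropped are hypothetical, not the blog's code.

```python
import torch

def guided_velocity(model, x, t, cond, guidance_scale=1.0):
    """Classifier-free guidance: two forward passes per step, extrapolating
    from the unconditional towards the conditional prediction.
    `model(x, t, cond)` is a hypothetical vector-field / noise predictor;
    passing cond=None stands in for the dropped condition."""
    v_cond = model(x, t, cond)
    v_uncond = model(x, t, None)
    return v_uncond + guidance_scale * (v_cond - v_uncond)

def initial_noise(shape, temperature=1.0):
    """The cheaper knob: scale the prior noise. One forward pass per step,
    and (per the note above) a similarly smoothing, less expressive result."""
    return torch.randn(shape) * temperature
```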
Here are histograms illustrating the VITS duration issues: the predicted durations are simply orthogonal to the real ones.
Maybe the notes are somewhat scattered, but I'd rather not use ChatGPT to fix them. Please check our recent experiments; I'd be happy to hear your comments.
https://alphacephei.com/nsh/2025/01/03/matcha-tts-notes.html
A big ASR release from the Wenet team.
TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch
Xingchen Song, Chengdong Liang, Binbin Zhang, Pengshen Zhang, ZiYu Wang, Youcheng Ma, Menglong Xu, Lin Wang, Di Wu, Fuping Pan, Dinghao Zhou, Zhendong Peng
Large Automatic Speech Recognition (ASR) models demand a vast number of parameters, copious amounts of data, and significant computational resources during the training process. However, such models can merely be deployed on high-compute cloud platforms and are only capable of performing speech recognition tasks. This leads to high costs and restricted capabilities. In this report, we initially propose the elastic mixture of experts (eMoE) model. This model can be trained just once and then be elastically scaled in accordance with deployment requirements. Secondly, we devise an unsupervised data creation and validation procedure and gather millions of hours of audio data from diverse domains for training. Using these two techniques, our system achieves elastic deployment capabilities while reducing the Character Error Rate (CER) on the SpeechIO testsets from 4.98% to 2.45%. Thirdly, our model is not only competent in Mandarin speech recognition but also proficient in multilingual, multi-dialect, emotion, gender, and sound event perception. We refer to this as Automatic Speech Perception (ASP), and the perception results are presented in the experimental section.
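A hedged sketch of one way to read the "train once, elastically scale at deployment" idea: a routed mixture-of-experts feed-forward block where the number of experts kept at inference is a deployment-time argument. This is my illustration of the concept, not the TouchASP implementation; all names and dimensions are made up.

```python
import torch
import torch.nn as nn

class ElasticMoE(nn.Module):
    """Mixture-of-experts FFN where the expert count is a deployment choice."""

    def __init__(self, dim=512, hidden=1024, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x, active_experts=None):
        # At deployment, keep only the first `active_experts` experts and
        # truncate the router accordingly -- no retraining needed.
        n = active_experts or len(self.experts)
        logits = self.router(x)[..., :n]
        weights, idx = logits.softmax(dim=-1).topk(min(self.top_k, n), dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalise
        out = torch.zeros_like(x)
        for k in range(idx.shape[-1]):
            for e in range(n):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * self.experts[e](x[mask])
        return out

# Same weights, different compute budgets:
layer = ElasticMoE()
x = torch.randn(4, 512)
full = layer(x)                     # all 8 experts available
small = layer(x, active_experts=2)  # shrunk deployment
```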
https://huggingface.co/blog/big-bench-audio-release
CosyVoice2 release
https://funaudiollm.github.io/cosyvoice2/
https://arxiv.org/abs/2412.10117
In our previous work, we introduced CosyVoice, a multilingual speech synthesis model based on supervised discrete speech tokens. By employing progressive semantic decoding with two popular generative models, language models (LMs) and Flow Matching, CosyVoice demonstrated high prosody naturalness, content consistency, and speaker similarity in speech in-context learning. Recently, significant progress has been made in multi-modal large language models (LLMs), where the response latency and real-time factor of speech synthesis play a crucial role in the interactive experience. Therefore, in this report, we present an improved streaming speech synthesis model, CosyVoice 2, which incorporates comprehensive and systematic optimizations. Specifically, we introduce finite-scalar quantization to improve the codebook utilization of speech tokens. For the text-speech LM, we streamline the model architecture to allow direct use of a pre-trained LLM as the backbone. In addition, we develop a chunk-aware causal flow matching model to support various synthesis scenarios, enabling both streaming and non-streaming synthesis within a single model. By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode. We invite readers to listen to the demos at https://funaudiollm.github.io/cosyvoice2.
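A small sketch of the finite-scalar-quantization idea mentioned in the abstract: bound each latent dimension, round it to a handful of levels, and pass gradients straight through, so every code in the implicit codebook is reachable by construction. The level configuration below is illustrative, not CosyVoice 2's actual setting.

```python
import torch

def fsq(z, levels=(8, 5, 5, 5)):
    """Finite scalar quantization with a straight-through estimator.
    The implicit codebook size is prod(levels), and every code can be hit,
    which is where the improved codebook utilization comes from."""
    half = (torch.tensor(levels, dtype=z.dtype, device=z.device) - 1) / 2
    bounded = torch.tanh(z) * half                      # last dim must equal len(levels)
    quantized = torch.round(bounded)
    return bounded + (quantized - bounded).detach()     # straight-through gradients

tokens = fsq(torch.randn(2, 100, 4))                    # (batch, frames, 4) toy latents
```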
Very good ideas here: dirty-data training, joint ASR/TTS, and so on.
https://arxiv.org/abs/2412.08237
TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch
Xingchen Song, Mengtao Xing, Changwei Ma, Shengqiang Li, Di Wu, Binbin Zhang, Fuping Pan, Dinghao Zhou, Yuekai Zhang, Shun Lei, Zhendong Peng, Zhiyong Wu
It is well known that LLM-based systems are data-hungry. Recent LLM-based TTS works typically employ complex data processing pipelines to obtain high-quality training data. These sophisticated pipelines require excellent models at each stage (e.g., speech denoising, speech enhancement, speaker diarization, and punctuation models), which themselves demand high-quality training data and are rarely open-sourced. Even with state-of-the-art models, issues persist, such as incomplete background noise removal and misalignment between punctuation and actual speech pauses. Moreover, the stringent filtering strategies often retain only 10-30% of the original data, significantly impeding data scaling efforts. In this work, we leverage a noise-robust audio tokenizer (S3Tokenizer) to design a simplified yet effective TTS data processing pipeline that maintains data quality while substantially reducing data acquisition costs, achieving a data retention rate of over 50%. Beyond data scaling challenges, LLM-based TTS systems also incur higher deployment costs compared to conventional approaches. Current systems typically use LLMs solely for text-to-token generation, while requiring separate models (e.g., flow matching models) for token-to-waveform generation, which cannot be directly executed by LLM inference engines, further complicating deployment. To address these challenges, we eliminate redundant modules in both LLM and flow components, replacing the flow model backbone with an LLM architecture. Building upon this simplified flow backbone, we propose a unified architecture for both streaming and non-streaming inference, significantly reducing deployment costs. Finally, we explore the feasibility of unifying TTS and ASR tasks using the same data for training, thanks to the simplified pipeline and the S3Tokenizer that reduces the quality requirements for TTS training data.
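On the unified streaming/non-streaming point, a common way to get both modes out of one trained model is a chunk-aware causal attention mask; a minimal sketch is below. This is a generic illustration under my own assumptions, not the TouchTTS (or CosyVoice 2) code.

```python
import torch

def chunk_causal_mask(seq_len, chunk_size):
    """Boolean attention mask: frame i may attend up to the end of its own chunk.
    chunk_size == seq_len recovers full offline attention; a small chunk_size
    gives streaming behaviour, so one model serves both modes.
    True means attention is allowed."""
    idx = torch.arange(seq_len)
    chunk_end = ((idx // chunk_size) + 1) * chunk_size   # exclusive chunk boundary
    return idx[None, :] < chunk_end[:, None].clamp(max=seq_len)

print(chunk_causal_mask(6, 2).int())   # offline mode: chunk_causal_mask(6, 6)
```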
We all know that distillation is more efficient than training from scratch, so the paper is not very insightful, but it is interesting to see where this all goes.
https://pages.cs.huji.ac.il/adiyoss-lab/slamming/
https://arxiv.org/abs/2502.15814
Slamming: Training a Speech Language Model on One GPU in a Day
Gallil Maimon, Avishai Elmakies, Yossi Adi
We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data and tweaking all other components.
...
https://huggingface.co/datasets/KBLab/rixvox-v2
23k hours of Swedish speech. These guys also release Whisper fine-tunes.
https://huggingface.co/KBLab
https://github.com/JusperLee/TIGER
Demos are pretty nice (video part)
https://cslikai.cn/TIGER/
https://arxiv.org/abs/2502.05232
Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers
Adam Stooke, Rohit Prabhavalkar, Khe Chai Sim, Pedro Moreno Mengibar
Modern systems for automatic speech recognition, including the RNN-Transducer and Attention-based Encoder-Decoder (AED), are designed so that the encoder is not required to alter the time-position of information from the audio sequence into the embedding; alignment to the final text output is processed during decoding. We discover that the transformer-based encoder adopted in recent years is actually capable of performing the alignment internally during the forward pass, prior to decoding. This new phenomenon enables a simpler and more efficient model, the "Aligner-Encoder". To train it, we discard the dynamic programming of RNN-T in favor of the frame-wise cross-entropy loss of AED, while the decoder employs the lighter text-only recurrence of RNN-T without learned cross-attention -- it simply scans embedding frames in order from the beginning, producing one token each until predicting the end-of-message. We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. In a representative comparison, we measure the total inference time for our model to be 2x faster than RNN-T and 16x faster than AED. Lastly, we find that the audio-text alignment is clearly visible in the self-attention weights of a certain layer, which could be said to perform "self-transduction".
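The inference procedure the abstract describes is simple enough to sketch: scan the encoder frames in order and emit one token per frame until end-of-message. The `decoder_step` interface below is hypothetical, not the paper's code.

```python
import torch

@torch.no_grad()
def aligner_encoder_decode(encoder_frames, decoder_step, eos_id, bos_id=0):
    """Greedy decoding for an Aligner-Encoder-style model.

    encoder_frames: (num_frames, dim) tensor from the self-aligning encoder.
    decoder_step:   hypothetical callable (frame, prev_token, state) ->
                    (logits, state), a light text-only recurrence with no
                    learned cross-attention.
    """
    tokens, prev, state = [], bos_id, None
    for frame in encoder_frames:
        logits, state = decoder_step(frame, prev, state)
        prev = int(logits.argmax())
        if prev == eos_id:
            break
        tokens.append(prev)
    return tokens
```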
Great speeds
https://arxiv.org/abs/2406.08835
EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed
Ziyang Zhuang, Chenfeng Miao, Kun Zou, Ming Fang, Tao Wei, Zijian Li, Ning Cheng, Wei Hu, Shaojun Wang, Jing Xiao
Non-autoregressive (NAR) automatic speech recognition (ASR) models predict tokens independently and simultaneously, bringing high inference speed. However, there is still a gap in the accuracy of the NAR models compared to the autoregressive (AR) models. In this paper, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EffectiveASR. It uses an Index Mapping Vector (IMV) based alignment generator to generate alignments during training, and an alignment predictor to learn the alignments for inference. It can be trained end-to-end (E2E) with cross-entropy loss combined with alignment loss. The proposed EffectiveASR achieves competitive results on the AISHELL-1 and AISHELL-2 Mandarin benchmarks compared to the leading models. Specifically, it achieves character error rates (CER) of 4.26%/4.62% on the AISHELL-1 dev/test dataset, which outperforms the AR Conformer with about 30x inference speedup.
https://github.com/FireRedTeam/FireRedASR
FireRedASR is a family of large-scale automatic speech recognition (ASR) models supporting Mandarin, Chinese dialects and English, while also offering singing lyrics recognition capability, achieving a new state-of-the-art on public Mandarin ASR benchmarks.
FireRedASR is designed to meet diverse requirements in superior performance and optimal efficiency across various applications. It comprises two variants:
FireRedASR-LLM: Designed to achieve state-of-the-art (SOTA) performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model (LLM) capabilities.
FireRedASR-AED: Designed to balance high performance and computational efficiency and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture.
https://arxiv.org/pdf/2501.14350
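The Encoder-Adapter-LLM pattern usually boils down to a small module that downsamples and projects acoustic encoder outputs into the LLM embedding space; a hedged sketch is below. Dimensions, stride, and module names are assumptions, not FireRedASR's actual code.

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Downsample and project acoustic encoder outputs into the LLM space."""

    def __init__(self, enc_dim=1024, llm_dim=4096, stride=4):
        super().__init__()
        self.stride = stride
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * stride, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, enc_out):                       # (batch, frames, enc_dim)
        b, t, d = enc_out.shape
        t = t - t % self.stride                       # drop trailing frames
        stacked = enc_out[:, :t].reshape(b, t // self.stride, d * self.stride)
        return self.proj(stacked)                     # (batch, frames/stride, llm_dim)

# The projected speech embeddings are concatenated with the prompt token
# embeddings and fed to the (frozen or lightly tuned) LLM, which then
# generates the transcript autoregressively.
```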
Exactly 20 years ago we started our first project in speech, a voice for Festival TTS. Many things have happened since then, but it has been a great story. Looking forward to the next 20 years now.
https://www.linux.org.ru/news/linux-general/775065?cid=776417
We tried the discrete duration loss from StyleTTS2 in MatchaTTS; it is really good.
https://alphacephei.com/nsh/2025/01/12/discrete-units.html
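For context, one way to make duration prediction discrete is to treat it as classification over frame-count bins and train with cross-entropy instead of regressing (log-)durations. The sketch below shows that general idea under assumed shapes; it is not the exact StyleTTS2 or Matcha-TTS formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteDurationPredictor(nn.Module):
    """Predict each phone's duration as a class over frame-count bins."""

    def __init__(self, dim=192, max_dur=50):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, max_dur))
        self.max_dur = max_dur

    def forward(self, phone_hidden):            # (batch, phones, dim)
        return self.net(phone_hidden)           # (batch, phones, max_dur) logits

    def loss(self, phone_hidden, target_frames):
        # Cross-entropy over duration bins instead of MSE on (log-)durations.
        logits = self.forward(phone_hidden)
        target = target_frames.clamp(max=self.max_dur - 1)
        return F.cross_entropy(logits.transpose(1, 2), target)

predictor = DiscreteDurationPredictor()
h = torch.randn(2, 20, 192)                     # phone-level hidden states
durations = torch.randint(1, 30, (2, 20))       # ground-truth frames per phone
print(predictor.loss(h, durations))
```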
The dataset comprises a 5,000-hour speech corpus in Akan, Ewe, Dagbani, Daagare, and Ikposo. Each language includes 1,000 hours of audio from indigenous speakers and 100 hours of transcriptions.
https://github.com/HCI-LAB-UGSPEECHDATA/speech_data_ghana_ug
Diarization-conditioned Whisper (DiCoW) for target-speaker speech recognition.
https://github.com/BUTSpeechFIT/DiCoW
A paper from respected people. By the way, testing on books (LibriSpeech and MLS) with Llama is usually a bad idea: Llama has already seen all these books many times.
https://arxiv.org/abs/2412.16464
Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition
Keqi Deng, Jinxi Guo, Yingyi Ma, Niko Moritz, Philip C. Woodland, Ozlem Kalinli, Mike Seltzer
While large language models (LLMs) have been applied to automatic speech recognition (ASR), the task of making the model streamable remains a challenge. This paper proposes a novel model architecture, Transducer-Llama, that integrates LLMs into a Factorized Transducer (FT) model, naturally enabling streaming capabilities. Furthermore, given that the large vocabulary of LLMs can cause data sparsity issue and increased training costs for spoken language systems, this paper introduces an efficient vocabulary adaptation technique to align LLMs with speech system vocabularies. The results show that directly optimizing the FT model with a strong pre-trained LLM-based predictor using the RNN-T loss yields some but limited improvements over a smaller pre-trained LM predictor. Therefore, this paper proposes a weak-to-strong LM swap strategy, using a weak LM predictor during RNN-T loss training and then replacing it with a strong LLM. After LM replacement, the minimum word error rate (MWER) loss is employed to finetune the integration of the LLM predictor with the Transducer-Llama model. Experiments on the LibriSpeech and large-scale multi-lingual LibriSpeech corpora show that the proposed streaming Transducer-Llama approach gave a 17% relative WER reduction (WERR) over a strong FT baseline and a 32% WERR over an RNN-T baseline.
Speech talks from MILA
https://poonehmousavi.github.io/rg
CONVAI_RG" rel="nofollow">https://www.youtube.com/@CONVAI_RG
A recent one is Discrete Audio Tokens for Multimodal LLMs by Mirco Ravanelli.
https://www.youtube.com/watch?v=2-Dqzg3fuVE
Upcoming ones are also interesting
https://github.com/imxtx/awesome-controllabe-speech-synthesis
The Codec-SUPERB@SLT 2024 talks about neural audio codecs and speech language models are up on YouTube.
https://www.youtube.com/playlist?list=PLJV_el3uVTsNnC37JYD8kBcNDI7CNJgum
Gemini 2.0 introduces multilingual native audio output.
https://www.youtube.com/watch?v=qE673AY-WEI