Telegram channel speechtech - Speech Technology

Speech Technology

An example of an issue copied from repo to repo:

https://github.com/jaywalnut310/vits/issues/11

In VITS we predict float durations and then convert them to attention steps, so the floats have to be rounded. VITS applies ceil, which makes the total duration longer than the original (the scale usually used is 0.9). As a result, you need to scale back to match the original length:

https://github.com/jaywalnut310/vits/blob/main/models.py#L511

In Glow-TTS there is an extra clamp:

https://github.com/coqui-ai/TTS/blob/main/TTS/tts/models/glow_tts.py#L351

This gets copied from repo to repo. A fun thing happens in Matcha, where we multiply by the length factor after we have applied ceil:

https://github.com/shivammehta25/Matcha-TTS/blob/main/matcha/models/matcha_tts.py#L122
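
A minimal sketch of the rounding issue described above, in plain Python rather than the PyTorch used by the linked repos. The function names, the sample durations, and the 0.9 scale are illustrative assumptions, not code from VITS, Glow-TTS, or Matcha-TTS.

```python
import math


def scale_then_ceil(durations, length_scale=1.0):
    """Apply the length scale first, then round each duration up to whole frames."""
    return [math.ceil(d * length_scale) for d in durations]


def ceil_then_scale(durations, length_scale=1.0):
    """Round up first, then apply the length scale (the order noted for Matcha)."""
    return [math.ceil(d) * length_scale for d in durations]


# Hypothetical per-token durations in frames, as a duration predictor might emit.
durations = [2.3, 1.1, 4.6, 0.4]

total_predicted = sum(durations)                        # sum of the raw floats
total_ceil = sum(scale_then_ceil(durations))            # ceil only, no scaling
total_scaled = sum(scale_then_ceil(durations, 0.9))     # scale 0.9, then ceil
after_ceil = ceil_then_scale(durations, 0.9)            # ceil, then scale 0.9

print(total_predicted, total_ceil, total_scaled, after_ceil)
```

With these sample values, ceil alone turns about 8.4 predicted frames into 11, and scaling by 0.9 before ceil gives 10. Scaling after ceil produces non-integer per-token durations, which then need another rounding step before they can be used as frame counts.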

Speech Technology

Introducing Fish Speech 1.5 🎉 - Making state-of-the-art TTS accessible to everyone!

Highlights:
- #2 ranked on TTS-Arena (as "Anonymous Sparkle")
- 1M hours of multilingual training data
- 13 languages supported, including English, Chinese, Japanese & more
- <150ms latency with high-quality instant voice cloning
- Pretrained model now open source
- Cost-effective self-hosting or cloud options

Let's check out the details 🧵⬇️

https://x.com/FishAudio/status/1864370933496205728

Supported languages:

English (en) >300k hours
Chinese (zh) >300k hours
Japanese (ja) >100k hours
German (de) ~20k hours
French (fr) ~20k hours
Spanish (es) ~20k hours
Korean (ko) ~20k hours
Arabic (ar) ~20k hours
Russian (ru) ~20k hours
Dutch (nl) <10k hours
Italian (it) <10k hours
Polish (pl) <10k hours
Portuguese (pt) <10k hours

Speech Technology

https://x.com/LiuXub/status/1863622470709690575

TAAE — the first Transformer-based Audio AutoEncoder scaled to 1B parameters for neural speech coding! 🔥

TAAE achieves state-of-the-art speech quality at ultra-low bitrates of 400 or 700 bits-per-second, delivering reconstruction quality remarkably close to real audio. It sets a new benchmark for efficient and high-quality speech tokenization.

📖 Paper: https://arxiv.org/abs/2411.19842v1
👂 Demos: https://stability-ai.github.io/stable-codec-demo/
💻 GitHub: https://github.com/Stability-AI/stable-codec

Code and pre-trained models will be released to empower the community!

Speech Technology

SLT 2024 starts tomorrow

https://2024.ieeeslt.org/detailed-schedule/

Speech Technology

https://www.youtube.com/watch?v=TGlfK0lwjgw

Titouan Parcollet is a “Research Scientist at the Samsung AI Center Cambridge” and an “adjunct researcher at the Cambridge Machine Learning Systems Lab from the University of Cambridge”. Further, he is an “Associate Professor on leave from the Laboratoire Informatique d'Avignon (LIA) and Avignon Université (FR)”. His current research focus is on self-supervised / representation learning and on continual learning. He played an instrumental part in the development of SpeechBrain and PyTorch-Kaldi.

Speech Technology

A big review paper

https://www.sciencedirect.com/science/article/pii/S088523082400130X?ssrnid=4870649&dgcid=SSRN_redirect_SD

Refining the evaluation of speech synthesis: A summary of the Blizzard Challenge 2023


The Blizzard Challenge has benchmarked progress in Text-to-Speech (TTS) since 2005. The Challenge has seen important milestones passed, with results suggesting that synthetic speech was indistinguishable from natural speech in terms of intelligibility in 2021 and that by that same year it was perhaps even indistinguishable in naturalness. The high quality of synthetic speech generated by the latest TTS systems has thus revealed limitations with ITU-T P.800.1 Mean Opinion Score (MOS) in detecting the remaining differences between synthetic and natural speech. Yet, it was the only method used in previous Challenges and is still the most popular method in the field for speech synthesis evaluation. In the 2023 Challenge, we addressed observed limitations of past Challenges by incorporating state-of-the-art speech synthesis evaluation techniques to refine the evaluation of speech quality, speaker similarity and intelligibility. For speech quality, a relative comparison of the systems receiving the best MOS was able to discover a greater number of significant differences between systems. Regarding speaker similarity, we demonstrated that there is a strong bias depending on whether the listeners are familiar with the target voice or not. As for intelligibility, the evaluation of language-specific phenomena, such as the pronunciation of homographs, better highlighted system limits compared to global transcription tasks of synthesised utterances. In addition to reporting results for the 18 entries to the 2023 Challenge, we extend the results analysis to type of TTS module to provide some insights on the most recent advances in model design. Overall, this year’s results demonstrate the need for a shift towards new methods for refining TTS evaluation to shed light on increasingly smaller and localised differences between synthesised and natural speech.

Speech Technology

Small is always nice

https://arxiv.org/abs/2408.13920

Wav2Small: Distilling Wav2Vec2 to 72K parameters for Low-Resource Speech emotion recognition

Dionyssos Kounadis-Bastian, Oliver Schrüfer, Anna Derington, Hagen Wierstorf, Florian Eyben, Felix Burkhardt, Björn Schuller

Speech Emotion Recognition (SER) needs high computational resources to overcome the challenge of substantial annotator disagreement. Today SER is shifting towards dimensional annotations of arousal, dominance, and valence (A/D/V). Universal metrics as the L2 distance prove unsuitable for evaluating A/D/V accuracy due to non converging consensus of annotator opinions. However, Concordance Correlation Coefficient (CCC) arose as an alternative metric for A/D/V where a model's output is evaluated to match a whole dataset's CCC rather than L2 distances of individual audios. Recent studies have shown that wav2vec2 / wavLM architectures outputing a float value for each A/D/V dimension achieve today's State-of-the-art (Sota) CCC on A/D/V. The Wav2Vec2.0 / WavLM family has a high computational footprint, but training small models using human annotations has been unsuccessful. In this paper we use a large Transformer Sota A/D/V model as Teacher/Annotator to train 5 student models: 4 MobileNets and our proposed Wav2Small, using only the Teacher's A/D/V outputs instead of human annotations. The Teacher model we propose also sets a new Sota on the MSP Podcast dataset of valence CCC=0.676. We choose MobileNetV4 / MobileNet-V3 as students, as MobileNet has been designed for fast execution times. We also propose Wav2Small - an architecture designed for minimal parameters and RAM consumption. Wav2Small with an .onnx (quantised) of only 120KB is a potential solution for A/D/V on hardware with low resources, having only 72K parameters vs 3.12M parameters for MobileNet-V4-Small.
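
The CCC mentioned in the abstract can be sketched as follows. This is the standard definition (Lin's concordance correlation coefficient) in plain Python, not code from the paper; the function name and the sample values are illustrative assumptions.

```python
from statistics import fmean


def concordance_cc(pred, gold):
    """Concordance correlation coefficient over a whole set of predictions.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean(pred) - mean(gold))**2)
    using population (biased) variance and covariance.
    """
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p, mean_g = fmean(pred), fmean(gold)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both sequences are the same constant")
    return 2 * cov / denom


# Hypothetical valence scores for three utterances.
gold = [0.1, 0.5, 0.9]
shifted_pred = [g + 0.2 for g in gold]  # perfectly correlated, but biased by +0.2

ccc_identical = concordance_cc(gold, gold)
ccc_shifted = concordance_cc(shifted_pred, gold)
print(ccc_identical, ccc_shifted)
```

Unlike Pearson correlation, CCC penalises a constant offset: the shifted predictions are perfectly correlated with the gold labels but score below 1. Because it is computed over the whole set, it is a dataset-level score rather than a per-utterance distance.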

Speech Technology

https://github.com/jishengpeng/WavChat

https://arxiv.org/abs/2411.13577

WavChat: A Survey of Spoken Dialogue Models

Shengpeng Ji, Yifu Chen, Minghui Fang, Jialong Zuo, Jingyu Lu, Hanting Wang, Ziyue Jiang, Long Zhou, Shujie Liu, Xize Cheng, Xiaoda Yang, Zehan Wang, Qian Yang, Jian Li, Yidi Jiang, Jingzhen He, Yunfei Chu, Jin Xu, Zhou Zhao

Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain. Compared to traditional three-tier cascaded spoken dialogue models that comprise speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS), modern spoken dialogue models exhibit greater intelligence. These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech. Moreover, they generate high-quality, multi-turn speech responses with low latency, enabling real-time interaction through simultaneous listening and speaking capability. Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems and the underlying technologies. To address this, we have first compiled existing spoken dialogue systems in the chronological order and categorized them into the cascaded and end-to-end paradigms. We then provide an in-depth overview of the core technologies in spoken dialogue models, covering aspects such as speech representation, training paradigm, streaming, duplex, and interaction capabilities. Each section discusses the limitations of these technologies and outlines considerations for future research. Additionally, we present a thorough review of relevant datasets, evaluation metrics, and benchmarks from the perspectives of training and evaluating spoken dialogue systems. We hope this survey will contribute to advancing both academic research and industrial applications in the field of spoken dialogue systems. The related material is available at https://github.com/jishengpeng/WavChat.

Speech Technology

https://www.youtube.com/watch?v=pRUrO0x637A

Speech Technology

Some numbers on codec quality for a Russian audio dataset

BigVGAN2 is good, but very slow (112M parameters). MEL-Vocos is not perfect. Encodec-Vocos is probably good.

Should we test something else like SNAC?

Speech Technology

Apple's papers are always very practical. This one is also good, with many in-depth experiments and practical cases. Note that the biasing effect is small (WER usually goes down only a little, e.g. 17% -> 15%).

https://arxiv.org/abs/2411.00664

Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval

Nikolaos Flemotomos, Roger Hsiao, Pawel Swietojanski, Takaaki Hori, Dogan Can, Xiaodan Zhuang

Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved transcription accuracy. However, the biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries, which means computational complexity can pose severe practical limitations on the size of the biasing catalogue and consequently on accuracy improvements. This work proposes an approximation to cross-attention scoring based on vector quantization and enables compute- and memory-efficient use of large biasing catalogues. We propose to use this technique jointly with a retrieval based contextual biasing approach. First, we use an efficient quantized retrieval module to shortlist biasing entries by grounding them on audio. Then we use retrieved entries for biasing. Since the proposed approach is agnostic to the biasing method, we investigate using full cross-attention, LLM prompting, and a combination of the two. We show that retrieval based shortlisting allows the system to efficiently leverage biasing catalogues of several thousands of entries, resulting in up to 71% relative error rate reduction in personal entity recognition. At the same time, the proposed approximation algorithm reduces compute time by 20% and memory usage by 85-95%, for lists of up to one million entries, when compared to standard dot-product cross-attention.

Speech Technology

Overall, we find no evidence that multiscale aspects of MR-HuBERT lead to improved acquisition of high level concepts. The question now is how to build an architecture that does leverage this hierarchy?🤔 (4/5)

https://twitter.com/theo_clark_/status/1852299593272131874

https://arxiv.org/abs/2410.23955

Speech Technology

Even with our new speech codec, producing a 2-minute dialogue requires generating over 5000 tokens. To model these long sequences, we developed a specialized Transformer architecture that can efficiently handle hierarchies of information, matching the structure of our acoustic tokens.

https://deepmind.google/discover/blog/pushing-the-frontiers-of-audio-generation/

Speech Technology

Nice paper with a few interesting details. The extra CTC head for Whisper stabilization, for example, is interesting.

https://arxiv.org/abs/2409.09543

Target Speaker ASR with Whisper

Alexander Polok, Dominik Klement, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget

We propose a novel approach to enable the use of large, single speaker ASR models, such as Whisper, for target speaker ASR. The key insight of this method is that it is much easier to model relative differences among speakers by learning to condition on frame-level diarization outputs, than to learn the space of all speaker embeddings. We find that adding even a single bias term per diarization output type before the first transformer block can transform single speaker ASR models, into target speaker ASR models. Our target-speaker ASR model can be used for speaker attributed ASR by producing, in sequence, a transcript for each hypothesized speaker in a diarization output. This simplified model for speaker attributed ASR using only a single microphone outperforms cascades of speech separation and diarization by 11% absolute ORC-WER on the NOTSOFAR-1 dataset.

Speech Technology

The name speaks for itself

https://github.com/yakami129/VirtualWife

Speech Technology

Discrete methods are widely used, but they have disadvantages as well as advantages. There are attempts to replace them with continuous models. This paper is getting quite a lot of attention:

https://x.com/marco_ppasini/status/1864330701530644835

https://arxiv.org/abs/2411.18447

Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation

Marco Pasini, Javier Nistal, Stefan Lattner, George Fazekas

Autoregressive models are typically applied to sequences of discrete tokens, but recent research indicates that generating sequences of continuous embeddings in an autoregressive manner is also feasible. However, such Continuous Autoregressive Models (CAMs) can suffer from a decline in generation quality over extended sequences due to error accumulation during inference. We introduce a novel method to address this issue by injecting random noise into the input embeddings during training. This procedure makes the model robust against varying error levels at inference. We further reduce error accumulation through an inference procedure that introduces low-level noise. Experiments on musical audio generation show that CAM substantially outperforms existing autoregressive and non-autoregressive approaches while preserving audio quality over extended sequences. This work paves the way for generating continuous embeddings in a purely autoregressive setting, opening new possibilities for real-time and interactive generative applications.

Speech Technology

Indic Parler-TTS is a multilingual Indic extension of Parler-TTS Mini.

It is a fine-tuned version of Indic Parler-TTS Pretrained, trained on a 1,806-hour multilingual Indic and English dataset.

Indic Parler-TTS Mini can officially speak in 20 Indic languages, making it comprehensive for regional language technologies, and in English. The 21 languages supported are: Assamese, Bengali, Bodo, Dogri, English, Gujarati, Hindi, Kannada, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Sanskrit, Santali, Sindhi, Tamil, Telugu, and Urdu.

Thanks to its better prompt tokenizer, it can easily be extended to other languages. This tokenizer has a larger vocabulary and handles byte fallback, which simplifies multilingual training.

https://huggingface.co/ai4bharat/indic-parler-tts

Speech Technology

For crypto guys, projects to finetune different TTS models

https://github.com/impel-intelligence/dippy-speech-subnet
https://github.com/myshell-ai/MyShell-TTS-Subnet

Speech Technology

ML-SUPERB 2.0 Challenge at #Interspeech2025

154 languages & 200+ accents/dialects
Live leaderboard & online evaluation! Join now: multilingual.superbbenchmark.org

https://multilingual.superbbenchmark.org/

Speech Technology

Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation

arXiv: https://arxiv.org/abs/2411.06807
Demo: https://chomeyama.github.io/wavehax-demo/

An approach to significantly improve codec generation

Speech Technology

Just a reminder that BEST-RQ is a good self-supervised method

https://arxiv.org/abs/2202.01855

Recently added in SpeechBrain too

https://github.com/speechbrain/speechbrain/releases/tag/v1.0.2

Also

https://github.com/HarunoriKawano/BEST-RQ

Speech Technology

A bit more data on cross-language codecs

Speech Technology

SANE 2024 Videos, interesting things

https://www.youtube.com/playlist?list=PLBJWRPcgwk7vVzKLPnTrqm831VohoLMmy

Speech Technology

https://github.com/john852517791/awesome-fake-audio-detection

Speech Technology

https://github.com/fixie-ai/ultravox/releases/tag/v0.4.1

Speech Technology

It is simply bad

https://arxiv.org/abs/2411.03866

Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward

Shashi Kumar, Iuliia Thorbecke, Sergio Burdisso, Esaú Villatoro-Tello, Manjunath K E, Kadri Hacioğlu, Pradeep Rangappa, Petr Motlicek, Aravind Ganapathiraju, Andreas Stolcke

Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities. Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions, such as domain shifts and different speech perturbations. In this paper, we address these questions by conducting various ablation experiments using a recent and widely adopted approach called SLAM-ASR. We present novel empirical findings that offer insights on how to effectively utilize the SLAM-ASR architecture across a wide range of settings. Our main findings indicate that the SLAM-ASR exhibits poor performance in cross-domain evaluation settings. Additionally, speech perturbations within in-domain data, such as changes in speed or the presence of additive noise, can significantly impact performance. Our findings offer critical insights for fine-tuning and configuring robust LLM-based ASR models, tailored to different data characteristics and computational resources.

Speech Technology

Fish Agent V0.1 3B is a groundbreaking Voice-to-Voice model capable of capturing and generating environmental audio information with unprecedented accuracy. What sets it apart is its semantic-token-free architecture, eliminating the need for traditional semantic encoders/decoders like Whisper and CosyVoice.

Additionally, it stands as a state-of-the-art text-to-speech (TTS) model, trained on an extensive dataset of 700,000 hours of multilingual audio content.

This model is Qwen-2.5-3B-Instruct with continued pretraining on 200B voice & text tokens.

https://huggingface.co/fishaudio/fish-agent-v0.1-3b

Speech Technology

https://huggingface.co/nvidia/stt_uz_fastconformer_hybrid_large_pc

Speech Technology

https://arxiv.org/abs/2410.18908

A Survey on Speech Large Language Models

Jing Peng, Yucheng Wang, Yu Xi, Xu Li, Xizhuo Zhang, Kai Yu

Large Language Models (LLMs) exhibit strong contextual understanding and remarkable multi-task performance. Therefore, researchers have been seeking to integrate LLMs in the broad sense of Spoken Language Understanding (SLU) field. Different from the traditional method of cascading LLMs to process text generated by Automatic Speech Recognition(ASR), new efforts have focused on designing architectures centered around Audio Feature Extraction - Multimodal Information Fusion - LLM Inference(Speech LLMs). This approach enables richer audio feature extraction while simultaneously facilitating end-to-end fusion of audio and text modalities, thereby achieving deeper understanding and reasoning from audio data. This paper elucidates the development of Speech LLMs, offering an in-depth analysis of system architectures and training strategies. Through extensive research and a series of targeted experiments, the paper assesses Speech LLMs' advancements in Rich Audio Transcription and its potential for Cross-task Integration within the SLU field. Additionally, it indicates key challenges uncovered through experimentation, such as the Dormancy of LLMs under certain conditions. The paper further delves into the training strategies for Speech LLMs, proposing potential solutions based on these findings, and offering valuable insights and references for future research in this domain, as well as LLM applications in multimodal contexts.

Speech Technology

Nice approach with prealignment and great speed

https://arxiv.org/abs/2406.08835

EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed

Ziyang Zhuang, Chenfeng Miao, Kun Zou, Shuai Gong, Ming Fang, Tao Wei, Zijian Li, Wei Hu, Shaojun Wang, Jing Xiao

Non-autoregressive (NAR) automatic speech recognition (ASR) models predict tokens independently and simultaneously, bringing high inference speed. However, there is still a gap in the accuracy of the NAR models compared to the autoregressive (AR) models. In this paper, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EffectiveASR. It uses an Index Mapping Vector (IMV) based alignment generator to generate alignments during training, and an alignment predictor to learn the alignments for inference. It can be trained end-to-end (E2E) with cross-entropy loss combined with alignment loss. The proposed EffectiveASR achieves competitive results on the AISHELL-1 and AISHELL-2 Mandarin benchmarks compared to the leading models. Specifically, it achieves character error rates (CER) of 4.26%/4.62% on the AISHELL-1 dev/test dataset, which outperforms the AR Conformer with about 30x inference speedup.
