https://huggingface.co/nvidia/stt_uz_fastconformer_hybrid_large_pc
https://arxiv.org/abs/2410.18908
A Survey on Speech Large Language Models
Jing Peng, Yucheng Wang, Yu Xi, Xu Li, Xizhuo Zhang, Kai Yu
Large Language Models (LLMs) exhibit strong contextual understanding and remarkable multi-task performance. Therefore, researchers have been seeking to integrate LLMs into the broader Spoken Language Understanding (SLU) field. Different from the traditional method of cascading LLMs to process text generated by Automatic Speech Recognition (ASR), new efforts have focused on designing architectures centered around Audio Feature Extraction - Multimodal Information Fusion - LLM Inference (Speech LLMs). This approach enables richer audio feature extraction while simultaneously facilitating end-to-end fusion of audio and text modalities, thereby achieving deeper understanding and reasoning from audio data. This paper elucidates the development of Speech LLMs, offering an in-depth analysis of system architectures and training strategies. Through extensive research and a series of targeted experiments, the paper assesses Speech LLMs' advancements in Rich Audio Transcription and their potential for Cross-task Integration within the SLU field. Additionally, it indicates key challenges uncovered through experimentation, such as the Dormancy of LLMs under certain conditions. The paper further delves into the training strategies for Speech LLMs, proposing potential solutions based on these findings, and offering valuable insights and references for future research in this domain, as well as LLM applications in multimodal contexts.
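A minimal sketch of the Audio Feature Extraction - Multimodal Information Fusion - LLM Inference pattern the survey describes. Every module, dimension, and name below is an illustrative stand-in (a real system would plug in a pretrained speech encoder such as Whisper or WavLM and a pretrained decoder-only LLM), not code from any particular paper:

```python
import torch
import torch.nn as nn

class SpeechLLMSketch(nn.Module):
    """Illustrative speech-LLM skeleton: audio encoder -> adapter -> LLM decoder."""

    def __init__(self, n_mels=80, d_audio=256, d_llm=512, vocab=32000):
        super().__init__()
        # 1) Audio feature extraction (stand-in for a pretrained speech encoder)
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(n_mels, d_audio, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv1d(d_audio, d_audio, kernel_size=3, stride=2, padding=1),
        )
        # 2) Multimodal fusion: project audio frames into the LLM embedding space
        self.adapter = nn.Linear(d_audio, d_llm)
        # 3) LLM inference (stand-in for a pretrained decoder-only LLM)
        self.text_embed = nn.Embedding(vocab, d_llm)
        layer = nn.TransformerEncoderLayer(d_llm, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_llm, vocab)

    def forward(self, mels, prompt_ids):
        # mels: (B, n_mels, T), prompt_ids: (B, L)
        audio = self.audio_encoder(mels).transpose(1, 2)   # (B, T', d_audio)
        audio = self.adapter(audio)                        # (B, T', d_llm)
        text = self.text_embed(prompt_ids)                 # (B, L, d_llm)
        fused = torch.cat([audio, text], dim=1)            # prepend audio tokens
        return self.lm_head(self.llm(fused))               # (B, T'+L, vocab)

model = SpeechLLMSketch()
logits = model(torch.randn(2, 80, 100), torch.randint(0, 32000, (2, 8)))
print(logits.shape)  # torch.Size([2, 33, 32000])
```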
A nice approach with pre-alignment and great speed
https://arxiv.org/abs/2406.08835
EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed
Ziyang Zhuang, Chenfeng Miao, Kun Zou, Shuai Gong, Ming Fang, Tao Wei, Zijian Li, Wei Hu, Shaojun Wang, Jing Xiao
Non-autoregressive (NAR) automatic speech recognition (ASR) models predict tokens independently and simultaneously, bringing high inference speed. However, there is still a gap in the accuracy of the NAR models compared to the autoregressive (AR) models. In this paper, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EffectiveASR. It uses an Index Mapping Vector (IMV) based alignment generator to generate alignments during training, and an alignment predictor to learn the alignments for inference. It can be trained end-to-end (E2E) with cross-entropy loss combined with alignment loss. The proposed EffectiveASR achieves competitive results on the AISHELL-1 and AISHELL-2 Mandarin benchmarks compared to the leading models. Specifically, it achieves character error rates (CER) of 4.26%/4.62% on the AISHELL-1 dev/test dataset, which outperforms the AR Conformer with about 30x inference speedup.
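A toy illustration of single-step NAR decoding driven by a predicted frame-to-token alignment (the role the IMV-based alignment predictor plays in EffectiveASR). This is not the authors' code, just the general mechanism:

```python
import torch

def nar_decode_with_alignment(frame_logits, frame_to_token):
    """Collapse per-frame predictions into a token sequence in one parallel step.

    frame_logits:   (T, V) per-frame token logits from the encoder
    frame_to_token: (T,)   predicted token index for each frame (an IMV-like
                           alignment: monotonic, repeated indices mark frames
                           belonging to the same output token)
    """
    n_tokens = int(frame_to_token.max().item()) + 1
    vocab = frame_logits.size(1)
    # Average the logits of all frames assigned to the same token position,
    # then take an argmax per position -- every token is predicted independently.
    pooled = torch.zeros(n_tokens, vocab)
    pooled.index_add_(0, frame_to_token, frame_logits)
    counts = torch.bincount(frame_to_token, minlength=n_tokens).clamp(min=1)
    pooled = pooled / counts.unsqueeze(1)
    return pooled.argmax(dim=-1)

# Toy example: 6 frames, 3 output tokens, vocabulary of 5
logits = torch.randn(6, 5)
alignment = torch.tensor([0, 0, 1, 1, 1, 2])
print(nar_decode_with_alignment(logits, alignment))  # tensor of 3 token ids
```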
Not really a speech paper, but you might ask where DeepMind and the like get their research ideas from. Not surprisingly, they analyze the biology of the human brain. Take this, for example:
https://arxiv.org/abs/2408.05446
https://twitter.com/stanislavfort/status/1823347721358438624
Inspired by biology we 1) get adversarial robustness + interpretability for free, 2) turn classifiers into generators & 3) design attacks on vLLMs
By the way, Whisper's 30-second window is not arbitrary either: it is biologically motivated, since short-term human memory is estimated to last 15 to 30 seconds.
https://bokcenter.harvard.edu/how-memory-works#:~:text=Time%20and%20inattention%20may%20cause,items%20being%20the%20average%20number.
And while we are at it, neuroscientists believe that neurons operate at about 4 bits
https://brainchip.com/4-bits-are-enough/
An interesting thread on decompiling and optimizing Silero VAD
https://github.com/snakers4/silero-vad/discussions/408#discussioncomment-10348222
Yet another Miipher-ed dataset, FLEURS-R, has been released. It comprises 1.3k hours of studio-quality speech across 102 locales under a CC-BY-4.0 license.
Paper: https://arxiv.org/abs/2408.06227
Dataset: https://huggingface.co/datasets/google/fleurs-r
Our friend @bjutte reports
https://medium.com/@Attendi/optimizing-real-time-factor-addressing-rtf-variability-and-enhancing-audio-transcription-5cb4c27ab767
The speech LLM trend continues
https://arxiv.org/abs/2408.02622
Language Model Can Listen While Speaking
Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen
Dialogue serves as the most natural manner of human-computer interaction (HCI). Recent advancements in speech language models (SLM) have significantly enhanced speech-based conversational AI. However, these models are limited to turn-based conversation, lacking the ability to interact with humans in real-time spoken scenarios, for example, being interrupted when the generated content is not satisfactory. To address these limitations, we explore full duplex modeling (FDM) in interactive speech language models (iSLM), focusing on enhancing real-time interaction and, more explicitly, exploring the quintessential ability of interruption. We introduce a novel model design, namely listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. LSLM fuses both channels for autoregressive generation and detects turn-taking in real time. Three fusion strategies -- early fusion, middle fusion, and late fusion -- are explored, with middle fusion achieving an optimal balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM's robustness to noise and sensitivity to diverse instructions. Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems. This study aims to advance the development of interactive speech dialogue systems, enhancing their applicability in real-world contexts.
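A rough sketch of what "middle fusion" could look like: listening-channel features are projected and added to the speaking-channel hidden states inside a decoder block, so the generator can react to incoming audio while it speaks. The wiring and dimensions are assumptions for illustration, not the LSLM implementation:

```python
import torch
import torch.nn as nn

class MiddleFusionBlock(nn.Module):
    """Illustrative middle-fusion block: fuse streaming listening-channel
    features into the speaking (TTS) channel at an intermediate layer."""

    def __init__(self, d_speak=512, d_listen=256, nhead=8):
        super().__init__()
        self.listen_proj = nn.Linear(d_listen, d_speak)
        self.block = nn.TransformerEncoderLayer(d_speak, nhead=nhead, batch_first=True)

    def forward(self, speak_hidden, listen_feats):
        # speak_hidden: (B, L, d_speak) decoder states of the speaking channel
        # listen_feats: (B, L, d_listen) streaming SSL features, time-aligned
        fused = speak_hidden + self.listen_proj(listen_feats)
        return self.block(fused)

block = MiddleFusionBlock()
out = block(torch.randn(2, 50, 512), torch.randn(2, 50, 256))
print(out.shape)  # torch.Size([2, 50, 512])
```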
2D bucketing for faster training
https://github.com/NVIDIA/NeMo/blob/oomptimizer/docs/source/asr/datasets.rst#2d-bucketing
Canary-1B can be trained with 5x larger batch sizes compared to our earlier baseline. It maxes out GPU utilization (memory, compute, and power consumption wise). As a result the mean training step time is 2.75x longer, resulting in a training throughput of 5x / 2.75x ~= 180% of the original recipe. I managed to reproduce Canary-1B in about 40k training steps on the same number of GPUs, changing only bucketing/batch size settings using new features in this PR.
https://github.com/NVIDIA/NeMo/pull/9763
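A conceptual sketch of the 2D bucketing idea (not NeMo's actual implementation): utterances are binned jointly by audio duration and transcript length, so each bucket can be given its own, larger batch size without blowing up memory on the worst-case combination:

```python
import numpy as np

def make_2d_buckets(durations, token_lens, n_dur_buckets=4, n_len_buckets=3):
    """Toy 2D bucketing: bin utterances jointly by audio duration and
    transcript length. Bucket counts are illustrative."""
    dur_edges = np.quantile(durations, np.linspace(0, 1, n_dur_buckets + 1)[1:-1])
    len_edges = np.quantile(token_lens, np.linspace(0, 1, n_len_buckets + 1)[1:-1])
    buckets = {}
    for i, (d, l) in enumerate(zip(durations, token_lens)):
        key = (int(np.searchsorted(dur_edges, d)), int(np.searchsorted(len_edges, l)))
        buckets.setdefault(key, []).append(i)
    return buckets

rng = np.random.default_rng(0)
durations = rng.uniform(1.0, 30.0, size=1000)                    # seconds
token_lens = (durations * rng.uniform(2, 4, 1000)).astype(int)   # rough transcript lengths
for key, idxs in sorted(make_2d_buckets(durations, token_lens).items()):
    print(key, len(idxs))   # each (duration_bin, length_bin) pair gets its own batch size
```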
UTMOSv2 is out
https://github.com/sarulab-speech/UTMOSv2
UTMOSv2 achieved 1st place in 7 out of 16 metrics and 2nd place in the remaining 9 metrics in VoiceMOS Challenge 2024 Track 1
https://arxiv.org/abs/2407.14358
https://huggingface.co/stabilityai/stable-audio-open-1.0
Stable Audio Open
Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons
Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.
https://github.com/frankyoujian/Edge-Punct-Casing
https://arxiv.org/abs/2407.13142
A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR
Jian You, Xiangfeng Li
Punctuation and word casing prediction are necessary for automatic speech recognition (ASR). With the popularity of on-device end-to-end streaming ASR systems, the on-device punctuation and word casing prediction become a necessity while we found little discussion on this. With the emergence of Transformer, Transformer based models have been explored for this scenario. However, Transformer based models are too large for on-device ASR systems. In this paper, we propose a light-weight and efficient model that jointly predicts punctuation and word casing in real time. The model is based on Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM). Experimental results on the IWSLT2011 test set show that the proposed model obtains 9% relative improvement compared to the best of non-Transformer models on overall F1-score. Compared to the representative of Transformer based models, the proposed model achieves comparable results to the representative model while being only one-fortieth its size and 2.5 times faster in terms of inference time. It is suitable for on-device streaming ASR systems. Our code is publicly available.
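A minimal sketch of a joint CNN + BiLSTM punctuation-and-casing tagger in the spirit of the paper; layer sizes and label sets below are illustrative assumptions, not the authors' configuration:

```python
import torch
import torch.nn as nn

class PunctCaseTagger(nn.Module):
    """Small CNN + BiLSTM tagger that jointly predicts punctuation and casing
    per word (illustrative sizes)."""

    def __init__(self, vocab=10000, d_emb=128, d_hid=128,
                 n_punct=4,   # e.g. none / comma / period / question mark
                 n_case=3):   # e.g. lower / capitalized / all-caps
        super().__init__()
        self.embed = nn.Embedding(vocab, d_emb)
        self.conv = nn.Conv1d(d_emb, d_hid, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(d_hid, d_hid, batch_first=True, bidirectional=True)
        self.punct_head = nn.Linear(2 * d_hid, n_punct)
        self.case_head = nn.Linear(2 * d_hid, n_case)

    def forward(self, word_ids):                    # (B, L)
        x = self.embed(word_ids).transpose(1, 2)    # (B, d_emb, L)
        x = torch.relu(self.conv(x)).transpose(1, 2)
        x, _ = self.bilstm(x)                       # (B, L, 2*d_hid)
        return self.punct_head(x), self.case_head(x)

model = PunctCaseTagger()
punct_logits, case_logits = model(torch.randint(0, 10000, (2, 16)))
print(punct_logits.shape, case_logits.shape)  # (2, 16, 4) (2, 16, 3)
```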
An English TTS objective leaderboard
https://ttsdsbenchmark.com/
Code
https://github.com/ttsds/ttsds
Paper
https://arxiv.org/abs/2407.12707
TTSDS -- Text-to-Speech Distribution Score
Christoph Minixhofer, Ondřej Klejch, Peter Bell
Many recently published Text-to-Speech (TTS) systems produce audio close to real speech. However, TTS evaluation needs to be revisited to make sense of the results obtained with the new architectures, approaches and datasets. We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility. Our approach assesses how well synthetic speech mirrors real speech by obtaining correlates of each factor and measuring their distance from both real speech datasets and noise datasets. We benchmark 35 TTS systems developed between 2008 and 2024 and show that our score computed as an unweighted average of factors strongly correlates with the human evaluations from each time period.
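A conceptual sketch of the scoring idea: for each factor, check whether the synthetic-speech distribution of some correlate lies closer to real speech or to noise, then average across factors. TTSDS defines its own correlates and aggregation; the 1-D Wasserstein distance and 0-100 scaling below are just assumptions for illustration:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def factor_score(synth_feats, real_feats, noise_feats):
    """Score one factor (e.g. a prosody correlate) on a 0-100 scale by asking
    whether the synthetic distribution is closer to real speech or to noise."""
    d_real = wasserstein_distance(synth_feats, real_feats)
    d_noise = wasserstein_distance(synth_feats, noise_feats)
    return 100.0 * d_noise / (d_real + d_noise + 1e-12)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 2000)     # e.g. per-utterance pitch statistics on real speech
noise = rng.uniform(-5, 5, 2000)      # the same correlate computed on noise
synth = rng.normal(0.2, 1.1, 2000)    # synthetic speech, close to real here
scores = [factor_score(synth, real, noise)]   # one entry per factor in practice
print(sum(scores) / len(scores))      # unweighted average of factors, as in the paper
```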
https://voxblink2.github.io/
This is actually multilingual
Another nice thing: they used face recognition to improve speaker identification
If your Whisper has started translating instead of transcribing, it is likely due to an adversarial attack
https://arxiv.org/abs/2407.04482
Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models
Vyas Raina, Mark Gales
Speech enabled foundation models, either in the form of flexible speech recognition based systems or audio-prompted large language models (LLMs), are becoming increasingly popular. One of the interesting aspects of these models is their ability to perform tasks other than automatic speech recognition (ASR) using an appropriate prompt. For example, the OpenAI Whisper model can perform both speech transcription and speech translation. With the development of audio-prompted LLMs there is the potential for even greater control options. In this work we demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks. Without any access to the model prompt it is possible to modify the behaviour of the system by appropriately changing the audio input. To illustrate this risk, we demonstrate that it is possible to prepend a short universal adversarial acoustic segment to any input speech signal to override the prompt setting of an ASR foundation model. Specifically, we successfully use a universal adversarial acoustic segment to control Whisper to always perform speech translation, despite being set to perform speech transcription. Overall, this work demonstrates a new form of adversarial attack on multi-tasking speech enabled foundation models that needs to be considered prior to the deployment of this form of model.
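A heavily hedged sketch of the attack idea: optimize one short waveform prefix so that, prepended to any utterance, it pushes the model toward a fixed target output. The `model`, `DummyASR`, and all hyperparameters are placeholders; this is not the authors' code and not a working Whisper attack:

```python
import torch

def optimize_universal_prefix(model, batches, target_token_ids,
                              prefix_len=16000, steps=100, lr=1e-3):
    """Learn one short waveform segment that, prepended to *any* input audio,
    pushes `model` (a placeholder callable: waveform -> per-token logits)
    toward emitting a fixed target token sequence."""
    prefix = torch.zeros(prefix_len, requires_grad=True)
    opt = torch.optim.Adam([prefix], lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        for waveform in batches:                       # (T,) clean utterances
            adv_input = torch.cat([prefix, waveform])  # prepend the segment
            logits = model(adv_input)                  # (L, vocab)
            loss = loss_fn(logits, target_token_ids)   # force the target output
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():                      # keep the segment quiet
                prefix.clamp_(-0.02, 0.02)
    return prefix.detach()

# Toy usage with a dummy "model" that mean-pools the waveform into token logits.
class DummyASR(torch.nn.Module):
    def __init__(self, vocab=10, out_len=3):
        super().__init__()
        self.proj = torch.nn.Linear(1, vocab * out_len)
        self.vocab, self.out_len = vocab, out_len
    def forward(self, wav):
        pooled = wav.mean().view(1, 1)
        return self.proj(pooled).view(self.out_len, self.vocab)

dummy = DummyASR()
batches = [torch.randn(32000) * 0.1 for _ in range(4)]
target = torch.tensor([1, 2, 3])   # stand-in for "task = translate" tokens
prefix = optimize_universal_prefix(dummy, batches, target, steps=10)
print(prefix.abs().max())
```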
A nice paper with a few interesting details. The extra CTC head for stabilizing Whisper is interesting, for example.
https://arxiv.org/abs/2409.09543
Target Speaker ASR with Whisper
Alexander Polok, Dominik Klement, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget
We propose a novel approach to enable the use of large, single speaker ASR models, such as Whisper, for target speaker ASR. The key insight of this method is that it is much easier to model relative differences among speakers by learning to condition on frame-level diarization outputs, than to learn the space of all speaker embeddings. We find that adding even a single bias term per diarization output type before the first transformer block can transform single speaker ASR models into target speaker ASR models. Our target-speaker ASR model can be used for speaker attributed ASR by producing, in sequence, a transcript for each hypothesized speaker in a diarization output. This simplified model for speaker attributed ASR using only a single microphone outperforms cascades of speech separation and diarization by 11% absolute ORC-WER on the NOTSOFAR-1 dataset.
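The key idea is small enough to sketch: a learned bias per diarization label, added to the frame-level features before the first transformer block. The label set and dimensions below are illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn as nn

class DiarizationConditioning(nn.Module):
    """Add a learned bias per diarization label (e.g. 0=silence, 1=target
    speaker, 2=other speaker, 3=overlap) to frame-level encoder features,
    so a single-speaker model can be steered toward the target speaker."""

    def __init__(self, d_model=512, n_labels=4):
        super().__init__()
        self.bias = nn.Embedding(n_labels, d_model)
        nn.init.zeros_(self.bias.weight)   # start as a no-op on the frozen model

    def forward(self, frame_feats, diar_labels):
        # frame_feats: (B, T, d_model), diar_labels: (B, T) integer labels
        return frame_feats + self.bias(diar_labels)

cond = DiarizationConditioning()
feats = torch.randn(2, 100, 512)
labels = torch.randint(0, 4, (2, 100))
print(cond(feats, labels).shape)  # torch.Size([2, 100, 512])
```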
The name speaks for itself
https://github.com/yakami129/VirtualWife
Some interesting ideas here, for example
We can see that ByT5 is critical for generating coherent speech. A potential explanation could be that ByT5 recognizes individual characters, whereas BERT relies on subword tokenization techniques (such as byte-pair encoding). Since words that are spelled similarly often sound alike, the ability to discern characters can enhance the model’s ability […]
https://arxiv.org/abs/2406.02328
https://github.com/yangdongchao/SimpleSpeech
SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models
Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng
In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simplicity shows in three aspects: (1) It can be trained on a speech-only dataset, without any alignment information; (2) It directly takes plain text as input and generates speech in an NAR way; (3) It tries to model speech in a finite and compact latent space, which alleviates the modeling difficulty of diffusion. More specifically, we propose a novel speech codec model (SQ-Codec) with scalar quantization; SQ-Codec effectively maps the complex speech signal into a finite and compact latent space, named scalar latent space. Benefiting from SQ-Codec, we apply a novel transformer diffusion model in the scalar latent space of SQ-Codec. We train SimpleSpeech on 4k hours of a speech-only dataset; it shows natural prosody and voice cloning ability. Compared with previous large-scale TTS models, it presents significant speech quality and generation speed improvements. Demos are released.
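A sketch of scalar quantization with a straight-through estimator, which is the flavor of trick SQ-Codec builds on: bound each latent dimension and snap it to a small, fixed set of values. The level count and range are illustrative, not the paper's settings:

```python
import torch

def scalar_quantize(z, n_levels=9, value_range=1.0):
    """Squash each latent dimension into a bounded range and round it to one
    of a few fixed values, giving a finite and compact latent space.
    A straight-through estimator keeps the operation differentiable."""
    z = torch.tanh(z) * value_range                   # bound the latent
    step = 2 * value_range / (n_levels - 1)
    z_q = torch.round(z / step) * step                # snap to a finite grid
    return z + (z_q - z).detach()                     # straight-through gradient

latent = torch.randn(2, 16, 50, requires_grad=True)   # (B, dim, frames)
quantized = scalar_quantize(latent)
print(torch.unique(quantized.detach()).numel(), "distinct values used")
quantized.sum().backward()                            # gradients flow to `latent`
print(latent.grad is not None)
```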
Standard speech LLM with some tricks
https://twitter.com/lileics/status/1824919329068228979
The CMU team took first place in the simultaneous translation track (English to German) at IWSLT 2024
https://www.arxiv.org/abs/2408.07452
CMU's IWSLT 2024 Simultaneous Speech Translation System
Xi Xu, Siqi Ouyang, Brian Yan, Patrick Fernandes, William Chen, Lei Li, Graham Neubig, Shinji Watanabe
This paper describes CMU's submission to the IWSLT 2024 Simultaneous Speech Translation task. Our end-to-end speech-to-text (ST) system integrates the WavLM speech encoder, a modality adapter, and the Llama2-7B-Base model as the decoder. ... Both stages are trained on MuST-C v2 data with cross-entropy loss. We adapt our offline ST model for SST using a simple fixed hold-n policy. Experiments show that our model obtains an offline BLEU score of 31.1 and a BLEU score of 29.5 under 2 seconds latency on the MuST-C-v2 tst-COMMON.
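The fixed hold-n policy is simple enough to sketch: at each step, commit every token of the current offline hypothesis except the last n, which may still change as more audio arrives. Values and data below are toy assumptions:

```python
def hold_n_policy(partial_hypothesis, committed, n=3):
    """Commit all tokens of the growing hypothesis except the last n.
    `committed` is the prefix already emitted; n=3 is an illustrative value."""
    stable = partial_hypothesis[:max(len(partial_hypothesis) - n, 0)]
    newly_emitted = stable[len(committed):]
    return committed + newly_emitted, newly_emitted

# Toy simulation: the offline model's hypothesis grows as audio streams in.
hyps = [
    ["Das"],
    ["Das", "ist"],
    ["Das", "ist", "ein"],
    ["Das", "ist", "ein", "kleiner"],
    ["Das", "ist", "ein", "kleiner", "Test", "."],
]
committed = []
for hyp in hyps:
    committed, emitted = hold_n_policy(hyp, committed)
    print(f"hypothesis={hyp} -> emit {emitted}")
```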
As expected, Whisper tops other general-purpose recognizers at lyrics transcription, but specialized systems still beat it
https://audioshake.github.io/jam-alt/
https://huggingface.co/datasets/audioshake/jam-alt
https://arxiv.org/abs/2408.06370
Lyrics Transcription for Humans: A Readability-Aware Benchmark
Ondřej Cífka, Hendrik Schreiber, Luke Miner, Fabian-Robert Stöter
Writing down lyrics for human consumption involves not only accurately capturing word sequences, but also incorporating punctuation and formatting for clarity and to convey contextual information. This includes song structure, emotional emphasis, and contrast between lead and background vocals. While automatic lyrics transcription (ALT) systems have advanced beyond producing unstructured strings of words and are able to draw on wider context, ALT benchmarks have not kept pace and continue to focus exclusively on words. To address this gap, we introduce Jam-ALT, a comprehensive lyrics transcription benchmark. The benchmark features a complete revision of the JamendoLyrics dataset, in adherence to industry standards for lyrics transcription and formatting, along with evaluation metrics designed to capture and assess the lyric-specific nuances, laying the foundation for improving the readability of lyrics. We apply the benchmark to recent transcription systems and present additional error analysis, as well as an experimental comparison with a classical music dataset.
From the new Ultravox release
https://github.com/fixie-ai/ultravox/discussions/78
In addition to increasing the overall size of the training set, v0.3 also introduces two other important changes. The first is that we’re augmenting the ASR data sets with synthetic data in the form of generated continuations. The second change is that we’ve migrated to a Knowledge Distillation approach for calculating loss. Combined, both of these approaches result in much higher speech to text alignment in the adapter. You can learn more in their respective papers.
This paper seems helpful
https://arxiv.org/abs/2405.19041
BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation
Chen Wang, Minpeng Liao, Zhongqiang Huang, Jiajun Zhang
Recent end-to-end approaches have shown promise in extending large language models (LLMs) to speech inputs, but face limitations in directly assessing and optimizing alignment quality and fail to achieve fine-grained alignment due to speech-text length mismatch. We introduce BLSP-KD, a novel approach for Bootstrapping Language-Speech Pretraining via Knowledge Distillation, which addresses these limitations through two key techniques. First, it optimizes speech-text alignment by minimizing the divergence between the LLM's next-token prediction distributions for speech and text inputs using knowledge distillation. Second, it employs a continuous integrate-and-fire strategy to segment speech into tokens that correspond one-to-one with text tokens, enabling fine-grained alignment. We also introduce Partial LoRA (PLoRA), a new adaptation method supporting LLM finetuning for speech inputs under knowledge distillation. Quantitative evaluation shows that BLSP-KD outperforms previous end-to-end baselines and cascaded systems with comparable scale of parameters, facilitating general instruction-following capabilities for LLMs with speech inputs. This approach provides new possibilities for extending LLMs to spoken language interactions.
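A sketch of the distillation objective: match the LLM's next-token distributions for speech input (student) against those for the corresponding text input (teacher). It assumes speech has already been segmented into token-synchronous units (the continuous integrate-and-fire module's job in the paper), so both logit tensors share the same length; shapes and temperature are illustrative:

```python
import torch
import torch.nn.functional as F

def kd_alignment_loss(speech_logits, text_logits, temperature=1.0):
    """KL divergence between the LLM's next-token distribution for the speech
    input (student) and for the matching text input (teacher)."""
    t_probs = F.softmax(text_logits / temperature, dim=-1)           # teacher: text input
    s_logprobs = F.log_softmax(speech_logits / temperature, dim=-1)  # student: speech input
    return F.kl_div(s_logprobs, t_probs, reduction="batchmean") * temperature ** 2

speech_logits = torch.randn(2, 10, 32000, requires_grad=True)  # (B, L, vocab)
text_logits = torch.randn(2, 10, 32000)
loss = kd_alignment_loss(speech_logits, text_logits)
loss.backward()
print(loss.item())
```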
Qwen2-Audio-7B, the next version of Qwen-Audio, is out: it accepts audio and text inputs and generates text outputs!
Demo: https://huggingface.co/spaces/Qwen/Qwen2-Audio-Instruct-Demo
https://github.com/RicherMans/Dasheng
This repo provides checkpoints for the Interspeech 2024 paper Scaling up masked audio encoder learning for general audio classification. The goal of this work is to investigate the scalability of masked autoencoders for audio. Prior work did not scale beyond 10,000 hours of audio, while Dasheng used 272,000 hours of training data.
https://diva-audio.github.io/
[TL;DR] DiVA Llama 3 outperforms existing Speech LMs on QA, Emotion Recognition, and Translation with a speech encoder trained using only weak supervision. DiVA learns to encode speech while preserving the underlying LLM output distribution using cross-modal context distillation between text and speech. DiVA was trained with open-source code in Levanter on 3.5k hours of publicly available and permissively licensed ASR data from Common Voice.
https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
The recently released Llama 3.1 paper has a big section on speech understanding: speech adapter, prosody modeling, speech generation, etc. An interesting overview of the current tech.
A wav2vec encoder pretrained on 15 million hours of audio and finetuned on 230k hours of transcribed data, for example
Start with page 63
https://huggingface.co/spaces/AudioLLMs/AudioBench-Leaderboard
https://arxiv.org/abs/2406.16020
https://github.com/AudioLLMs/AudioBench
The paper is interesting but has many arguable points. For example, the authors see no correlation between objective measures and the Arena score and propose adding extra scores to fit the Arena score. Instead, one could conclude that side-by-side evaluation by non-experts is simply broken.
CMU lectures from Shinji Watanabe
[Fall 2023] Speech Recognition and Understanding
https://www.youtube.com/playlist?list=PLfVqr2l0FG-tW8d5ZSz-_tCgQed_F1ndb
An interesting part about RNN-T vs. attention, where Shinji argues for turning back to CTC decoding instead of RNN-T. A valid point these days
https://youtu.be/BQBOu9BOFpc?list=PLfVqr2l0FG-tW8d5ZSz-_tCgQed_F1ndb&t=2585
Not yet released
https://github.com/QwenLM/Qwen2-Audio
The latest progress of Qwen-Audio: a large-scale audio-language model called Qwen2-Audio, capable of accepting various audio signal inputs and performing audio analysis or responding directly in text to speech instructions. Two distinct audio interaction modes are introduced:
voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input;
audio analysis: users could provide audio and text instructions for analysis during the interaction;
We are going to release two models of the Qwen2-Audio series soon: Qwen2-Audio and Qwen2-Audio-Chat.
The original paper
https://arxiv.org/abs/2407.08551
Autoregressive Speech Synthesis without Vector Quantization
Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition, bypassing the need for vector quantization, which was originally designed for audio compression and sacrifices fidelity compared to mel-spectrograms. Specifically, (i) instead of cross-entropy loss, we apply regression loss with a proposed spectrogram flux loss function to model the probability distribution of the continuous-valued tokens. (ii) we have incorporated variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language models VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling discrete codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. See this https URL for demos of our work.
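A hedged sketch of what a regression-plus-flux objective on continuous mel frames could look like; this is only one plausible reading of the spectrogram flux loss (matching frame-to-frame changes), not the paper's exact definition, and all weights and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def melle_style_loss(pred_mel, target_mel, flux_weight=0.5):
    """Regression term on mel frames plus a 'flux' term on frame-to-frame
    changes. Illustrative only; not the paper's exact formulation."""
    reg = F.l1_loss(pred_mel, target_mel) + F.mse_loss(pred_mel, target_mel)
    pred_flux = pred_mel[:, 1:, :] - pred_mel[:, :-1, :]        # frame deltas
    target_flux = target_mel[:, 1:, :] - target_mel[:, :-1, :]
    flux = F.l1_loss(pred_flux, target_flux)
    return reg + flux_weight * flux

pred = torch.randn(2, 200, 80, requires_grad=True)   # (B, frames, n_mels)
target = torch.randn(2, 200, 80)
loss = melle_style_loss(pred, target)
loss.backward()
print(loss.item())
```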