Some interesting ideas here, for example:
We can see that ByT5 is critical for generating coherent speech. A potential explanation could be that ByT5 recognizes individual characters, whereas BERT relies on subword tokenization techniques (such as byte-pair encoding). Since words that are spelled similarly often sound alike, the ability to discern characters can enhance the model's ability to learn how words are pronounced.
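A quick way to see the character-vs-subword difference is to compare the public tokenizers (a sketch; the checkpoints below are the generic Hugging Face ones, not necessarily the exact variants used in the paper):

```python
from transformers import AutoTokenizer

# ByT5 operates on raw UTF-8 bytes, so every character is visible to the model.
byt5 = AutoTokenizer.from_pretrained("google/byt5-small")
# BERT uses WordPiece subwords, so similarly spelled words can get very different pieces.
bert = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["though", "thought", "thorough"]:
    print(word, "| byt5:", byt5.tokenize(word), "| bert:", bert.tokenize(word))
```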
https://arxiv.org/abs/2406.02328
https://github.com/yangdongchao/SimpleSpeech
SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models
Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng
In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simplicity shows in three aspects: (1) it can be trained on a speech-only dataset, without any alignment information; (2) it directly takes plain text as input and generates speech in an NAR way; (3) it models speech in a finite and compact latent space, which alleviates the modeling difficulty of diffusion. More specifically, we propose a novel speech codec model (SQ-Codec) with scalar quantization; SQ-Codec effectively maps the complex speech signal into a finite and compact latent space, named the scalar latent space. Benefiting from SQ-Codec, we apply a novel transformer diffusion model in the scalar latent space of SQ-Codec. We train SimpleSpeech on 4k hours of speech-only data; it shows natural prosody and voice cloning ability. Compared with previous large-scale TTS models, it presents significant speech quality and generation speed improvements. Demos are released.
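The core SQ-Codec trick, scalar quantization of the latent, is easy to sketch (a toy PyTorch version; shapes and the number of levels are illustrative, not the paper's settings):

```python
import torch

def scalar_quantize(z: torch.Tensor, levels: int = 9) -> torch.Tensor:
    """Map each latent dimension to one of `levels` evenly spaced values in [-1, 1].

    A straight-through estimator keeps gradients flowing to the encoder.
    """
    z = torch.tanh(z)                         # bound the latent to [-1, 1]
    step = 2.0 / (levels - 1)
    z_q = torch.round((z + 1.0) / step) * step - 1.0
    return z + (z_q - z).detach()             # forward pass uses z_q, backward uses z

latent = torch.randn(2, 16, 50)               # (batch, dim, frames), made-up shape
print(scalar_quantize(latent).unique().numel(), "distinct latent values")
```

The finite set of values is what makes the latent space "compact" for the diffusion model that operates on top of it.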
Standard speech LLM with some tricks
https://twitter.com/lileics/status/1824919329068228979
The CMU team took first place in the simultaneous translation track (English to German) at IWSLT 2024.
https://www.arxiv.org/abs/2408.07452
CMU's IWSLT 2024 Simultaneous Speech Translation System
Xi Xu, Siqi Ouyang, Brian Yan, Patrick Fernandes, William Chen, Lei Li, Graham Neubig, Shinji Watanabe
This paper describes CMU's submission to the IWSLT 2024 Simultaneous Speech Translation task. Our end-to-end speech-to-text (ST) system integrates the WavLM speech encoder, a modality adapter, and the Llama2-7B-Base model as the decoder. ... Both stages are trained on MuST-C v2 data with cross-entropy loss. We adapt our offline ST model for SST using a simple fixed hold-n policy. Experiments show that our model obtains an offline BLEU score of 31.1 and a BLEU score of 29.5 under 2 seconds latency on the MuST-C v2 tst-COMMON set.
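The fixed hold-n policy is simple enough to sketch (function and variable names are mine, not from the paper): re-translate the growing input, withhold the last n tokens of the hypothesis because they are still unstable, and never retract what was already shown.

```python
def hold_n_commit(prev_committed, new_hypothesis, n=3):
    """Fixed hold-n policy (sketch): emit everything except the last n tokens,
    and never retract tokens that were already committed to the user."""
    stable = new_hypothesis[: max(len(new_hypothesis) - n, 0)]
    # only extend the output if the stable prefix agrees with what we already emitted
    if stable[: len(prev_committed)] == prev_committed:
        return prev_committed + stable[len(prev_committed):]
    return prev_committed

committed = []
partials = [["Wir", "haben"],
            ["Wir", "haben", "ein", "Modell"],
            ["Wir", "haben", "ein", "Modell", "trainiert", "."]]
for partial in partials:          # at end of utterance the full hypothesis would be flushed
    committed = hold_n_commit(committed, partial, n=2)
    print(committed)
```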
As expected, Whisper comes out on top of other general-purpose recognizers for lyrics transcription, but specialized systems still beat it.
https://audioshake.github.io/jam-alt/
https://huggingface.co/datasets/audioshake/jam-alt
https://arxiv.org/abs/2408.06370
Lyrics Transcription for Humans: A Readability-Aware Benchmark
Ondřej Cífka, Hendrik Schreiber, Luke Miner, Fabian-Robert Stöter
Writing down lyrics for human consumption involves not only accurately capturing word sequences, but also incorporating punctuation and formatting for clarity and to convey contextual information. This includes song structure, emotional emphasis, and contrast between lead and background vocals. While automatic lyrics transcription (ALT) systems have advanced beyond producing unstructured strings of words and are able to draw on wider context, ALT benchmarks have not kept pace and continue to focus exclusively on words. To address this gap, we introduce Jam-ALT, a comprehensive lyrics transcription benchmark. The benchmark features a complete revision of the JamendoLyrics dataset, in adherence to industry standards for lyrics transcription and formatting, along with evaluation metrics designed to capture and assess the lyric-specific nuances, laying the foundation for improving the readability of lyrics. We apply the benchmark to recent transcription systems and present additional error analysis, as well as an experimental comparison with a classical music dataset.
From the new Ultravox release:
https://github.com/fixie-ai/ultravox/discussions/78
In addition to increasing the overall size of the training set, v0.3 also introduces two other important changes. The first is that we’re augmenting the ASR data sets with synthetic data in the form of generated continuations. The second change is that we’ve migrated to a Knowledge Distillation approach for calculating loss. Combined, both of these approaches result in much higher speech to text alignment in the adapter. You can learn more in their respective papers.
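A minimal sketch of what such a distillation loss typically looks like (assuming the usual KL-between-distributions setup; names and shapes are illustrative, not Ultravox's actual code):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Knowledge-distillation loss (sketch): make the speech-conditioned student
    match the text-conditioned teacher's next-token distribution."""
    t = temperature
    teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_student, teacher, reduction="batchmean") * t * t

# toy shapes: (batch * seq_len, vocab)
student = torch.randn(8, 32000)
teacher = torch.randn(8, 32000)
print(kd_loss(student, teacher).item())
```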
This paper seems helpful
https://arxiv.org/abs/2405.19041
BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation
Chen Wang, Minpeng Liao, Zhongqiang Huang, Jiajun Zhang
Recent end-to-end approaches have shown promise in extending large language models (LLMs) to speech inputs, but face limitations in directly assessing and optimizing alignment quality and fail to achieve fine-grained alignment due to speech-text length mismatch. We introduce BLSP-KD, a novel approach for Bootstrapping Language-Speech Pretraining via Knowledge Distillation, which addresses these limitations through two key techniques. First, it optimizes speech-text alignment by minimizing the divergence between the LLM's next-token prediction distributions for speech and text inputs using knowledge distillation. Second, it employs a continuous integrate-and-fire strategy to segment speech into tokens that correspond one-to-one with text tokens, enabling fine-grained alignment. We also introduce Partial LoRA (PLoRA), a new adaptation method supporting LLM finetuning for speech inputs under knowledge distillation. Quantitative evaluation shows that BLSP-KD outperforms previous end-to-end baselines and cascaded systems with a comparable scale of parameters, facilitating general instruction-following capabilities for LLMs with speech inputs. This approach provides new possibilities for extending LLMs to spoken language interactions.
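The continuous integrate-and-fire segmentation can be sketched like this (a toy version; the real model predicts the per-frame weights and handles batching, weight scaling, and tail frames):

```python
import torch

def cif_segment(encoder_states, alphas, threshold=1.0):
    """Continuous integrate-and-fire (sketch): accumulate per-frame weights and
    'fire' one token embedding each time the accumulator crosses the threshold."""
    tokens, acc, cache = [], 0.0, torch.zeros(encoder_states.size(-1))
    for h, a in zip(encoder_states, alphas):
        if acc + a < threshold:
            acc += a
            cache += a * h
        else:
            spill = acc + a - threshold            # the part of this frame that belongs to the next token
            tokens.append(cache + (a - spill) * h)
            acc, cache = spill, spill * h
    return torch.stack(tokens) if tokens else torch.empty(0, encoder_states.size(-1))

frames = torch.randn(20, 4)                        # 20 speech frames, toy dim 4
weights = torch.rand(20) * 0.4                     # predicted weights, toy values
print(cif_segment(frames, weights).shape)          # (#fired tokens, dim)
```

Because one token fires per unit of accumulated weight, the number of output tokens can be trained to match the number of text tokens, which is what enables the one-to-one alignment described above.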
Qwen2-Audio-7B, the next version of Qwen-Audio, which is capable of accepting audio and text inputs and generating text outputs!
Demo: https://huggingface.co/spaces/Qwen/Qwen2-Audio-Instruct-Demo
https://github.com/RicherMans/Dasheng
This repo provides checkpoints for the Interspeech 2024 paper Scaling up masked audio encoder learning for general audio classification. The goal of this work is to investigate the scalability of masked autoencoders for audio. Prior work did not scale beyond 10,000 hours of audio, while Dasheng used 272,000 hours of training data.
https://diva-audio.github.io/
[TL;DR] DiVA Llama 3 outperforms existing Speech LMs on QA, Emotion Recognition, and Translation with a speech encoder trained using only weak supervision. DiVA learns to encode speech while preserving the underlying LLM output distribution using cross-modal context distillation between text and speech. DiVA was trained with open-source code in Levanter on 3.5k hours of publicly available and permissively licensed ASR data from Common Voice.
https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
The recently released Llama 3.1 paper has a big section on speech understanding: speech adapter, prosody modeling, speech generation, etc. An interesting overview of the current tech.
For example, a wav2vec encoder pretrained on 15 million hours of data and fine-tuned on 230k hours of transcribed data.
Start with page 63
https://huggingface.co/spaces/AudioLLMs/AudioBench-Leaderboard
https://arxiv.org/abs/2406.16020
https://github.com/AudioLLMs/AudioBench
The paper is interesting but has many arguable points. For example, the authors see no correlation between objective measures and the Arena score and propose adding extra scores to fit the Arena score. Instead, one could conclude that side-by-side evaluation by non-experts is simply broken.
CMU Lectures from Shinji Watanabe
[Fall 2023] Speech Recognition and Understanding
https://www.youtube.com/playlist?list=PLfVqr2l0FG-tW8d5ZSz-_tCgQed_F1ndb
An interesting part about RNN-T vs. attention, where Shinji argues for turning back to CTC decoding instead of RNN-T. A valid point these days.
https://youtu.be/BQBOu9BOFpc?list=PLfVqr2l0FG-tW8d5ZSz-_tCgQed_F1ndb&t=2585
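For reference, the CTC greedy decoding rule discussed in that part is just a few lines (a sketch over toy frame-level label IDs):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Greedy CTC decoding (sketch): take the argmax label per frame,
    collapse repeated labels, then drop blanks."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

# toy frame-level argmax sequence, where 0 is the blank label
print(ctc_greedy_decode([0, 3, 3, 0, 0, 5, 5, 5, 0, 7]))  # -> [3, 5, 7]
```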
Not yet released
https://github.com/QwenLM/Qwen2-Audio
The latest progress of Qwen-Audio: a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or giving direct textual responses to speech instructions. We introduce two distinct audio interaction modes:
voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input;
audio analysis: users could provide audio and text instructions for analysis during the interaction;
We are going to release two models of the Qwen2-Audio series soon: Qwen2-Audio and Qwen2-Audio-Chat.
The original paper
https://arxiv.org/abs/2407.08551
Autoregressive Speech Synthesis without Vector Quantization
Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei
We present MELLE, a novel continuous-valued token based language modeling approach for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition, bypassing the need for vector quantization, which was originally designed for audio compression and sacrifices fidelity compared to mel-spectrograms. Specifically, (i) instead of cross-entropy loss, we apply a regression loss with a proposed spectrogram flux loss function to model the probability distribution of the continuous-valued tokens; (ii) we incorporate variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language model VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling discrete codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. See this https URL for demos of our work.
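One plausible reading of a regression-plus-flux objective, just to make the idea concrete (a sketch with toy shapes; the paper's exact spectrogram flux loss differs in its details):

```python
import torch
import torch.nn.functional as F

def melle_style_loss(pred_mel, target_mel, flux_weight=1.0):
    """Regression + frame-to-frame 'flux' term (sketch): match the frames themselves
    and also how they change over time, discouraging over-smoothed spectrograms."""
    recon = F.l1_loss(pred_mel, target_mel)
    pred_flux = pred_mel[:, 1:] - pred_mel[:, :-1]        # (batch, frames-1, mels)
    target_flux = target_mel[:, 1:] - target_mel[:, :-1]
    flux = F.l1_loss(pred_flux, target_flux)
    return recon + flux_weight * flux

pred = torch.randn(2, 100, 80)        # toy (batch, frames, mel bins)
target = torch.randn(2, 100, 80)
print(melle_style_loss(pred, target).item())
```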
https://huggingface.co/datasets/FBK-MT/Speech-MASSIVE
Speech-MASSIVE is a multilingual Spoken Language Understanding (SLU) dataset comprising the speech counterpart of a portion of the MASSIVE textual corpus. Speech-MASSIVE covers 12 languages (Arabic, German, Spanish, French, Hungarian, Korean, Dutch, Polish, European Portuguese, Russian, Turkish, and Vietnamese) from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks. MASSIVE utterances' labels span 18 domains, with 60 intents and 55 slots. A full train split is provided for French and German; for all 12 languages (including French and German) we provide few-shot train, validation, and test splits. The few-shot train split (115 examples) covers all 18 domains, 60 intents, and 55 slots (including empty slots).
Whisper actually needs much more exploration. This is a great paper on a relevant subject.
https://arxiv.org/pdf/2406.05806
Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper
Chih-Kai Yang, Kuan-Po Huang, Hung-yi Lee
This research explores how the information in prompts interacts with the high-performing speech recognition model Whisper. We compare its performance when prompted with correct information and when prompted with corrupted, incorrect information. Our results unexpectedly show that Whisper may not understand textual prompts in a human-expected way. Additionally, we find that performance improvement is not guaranteed even with stronger adherence to the topic information in textual prompts. It is also noted that English prompts generally outperform Mandarin ones on datasets of both languages, likely due to differences in training data distributions for these languages, despite the mismatch with pre-training scenarios. Conversely, we discover that Whisper exhibits awareness of misleading information in language tokens by ignoring incorrect language tokens and focusing on the correct ones. In sum, we raise insightful questions about Whisper's prompt understanding and reveal its counter-intuitive behaviors. We encourage further studies.
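This kind of probing is easy to reproduce with the openai-whisper package, where prompts are passed via initial_prompt (the audio path and the prompt texts below are placeholders):

```python
import whisper

model = whisper.load_model("base")

# A correct vs. a deliberately misleading topic prompt, in the spirit of the paper's setup.
for prompt in ["A lecture about speech recognition.",
               "A cooking show about pasta recipes."]:
    result = model.transcribe("sample.wav", initial_prompt=prompt, language="en")
    print(prompt, "->", result["text"])
```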
Not really a speech paper, but you might ask where DeepMind and the like take their research ideas from. Not surprisingly, they analyze the biology of the human brain. Take this, for example:
https://arxiv.org/abs/2408.05446
https://twitter.com/stanislavfort/status/1823347721358438624
Inspired by biology we 1) get adversarial robustness + interpretability for free, 2) turn classifiers into generators & 3) design attacks on vLLMs
By the way, Whisper's 30-second window is not random either; it is biologically motivated: short-term human memory is estimated to last 15 to 30 seconds.
https://bokcenter.harvard.edu/how-memory-works#:~:text=Time%20and%20inattention%20may%20cause,items%20being%20the%20average%20number.
And another aside: neuroscientists believe that neurons operate at about 4 bits of precision.
https://brainchip.com/4-bits-are-enough/
An interesting thread on decompiling and optimizing Silero VAD.
https://github.com/snakers4/silero-vad/discussions/408#discussioncomment-10348222
Yet another Miipher-ed dataset, FLEURS-R, has been released. This dataset comprises 1.3k hours of studio-quality speech across 102 locales, under a CC-BY-4.0 license.
Paper: https://arxiv.org/abs/2408.06227
Dataset: https://huggingface.co/datasets/google/fleurs-r
Our friend @bjutte reports:
https://medium.com/@Attendi/optimizing-real-time-factor-addressing-rtf-variability-and-enhancing-audio-transcription-5cb4c27ab767
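As a reminder, the real-time factor discussed in the post is just processing time divided by audio duration (a minimal sketch; transcribe_fn stands in for whatever engine is being measured):

```python
import time

def real_time_factor(audio_seconds: float, transcribe_fn) -> float:
    """RTF = processing time / audio duration; below 1.0 means faster than real time."""
    start = time.perf_counter()
    transcribe_fn()                      # run the transcription being measured
    return (time.perf_counter() - start) / audio_seconds

# toy usage: a fake 10-second clip "transcribed" by a short sleep
print(real_time_factor(10.0, lambda: time.sleep(0.5)))   # ~0.05
```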
Speech LLM thing goes on
https://arxiv.org/abs/2408.02622
Language Model Can Listen While Speaking
Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen
Dialogue serves as the most natural manner of human-computer interaction (HCI). Recent advancements in speech language models (SLM) have significantly enhanced speech-based conversational AI. However, these models are limited to turn-based conversation, lacking the ability to interact with humans in real-time spoken scenarios, for example, being interrupted when the generated content is not satisfactory. To address these limitations, we explore full duplex modeling (FDM) in interactive speech language models (iSLM), focusing on enhancing real-time interaction and, more explicitly, exploring the quintessential ability of interruption. We introduce a novel model design, namely listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. LSLM fuses both channels for autoregressive generation and detects turn-taking in real time. Three fusion strategies -- early fusion, middle fusion, and late fusion -- are explored, with middle fusion achieving an optimal balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM's robustness to noise and sensitivity to diverse instructions. Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems. This study aims to advance the development of interactive speech dialogue systems, enhancing their applicability in real-world contexts.
2D bucketing for faster training
https://github.com/NVIDIA/NeMo/blob/oomptimizer/docs/source/asr/datasets.rst#2d-bucketing
Canary-1B can be trained with 5x larger batch sizes compared to our earlier baseline. It maxes out GPU utilization (memory-, compute-, and power-wise). As a result, the mean training step time is 2.75x longer, resulting in a training throughput of 5x / 2.75x ≈ 180% of the original recipe. I managed to reproduce Canary-1B in about 40k training steps on the same number of GPUs, changing only bucketing/batch size settings using the new features in this PR.
https://github.com/NVIDIA/NeMo/pull/9763
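The idea behind 2D bucketing, independent of the NeMo implementation, is to group utterances by both audio duration and transcript length so batches waste little padding on either axis (a toy sketch; the bucket edges are made up):

```python
import bisect

def bucket_key(duration_sec, num_tokens,
               dur_edges=(4, 8, 16, 32), tok_edges=(16, 32, 64, 128)):
    """2D bucketing (sketch): the bucket is a pair (duration bin, token-length bin),
    so long audio with short transcripts doesn't share a batch with short audio
    that has long transcripts."""
    return (bisect.bisect_right(dur_edges, duration_sec),
            bisect.bisect_right(tok_edges, num_tokens))

samples = [(3.2, 12), (3.5, 40), (15.0, 30), (29.0, 120)]
for dur, toks in samples:
    print((dur, toks), "-> bucket", bucket_key(dur, toks))
```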
UTMOSv2 is out
https://github.com/sarulab-speech/UTMOSv2
UTMOSv2 achieved 1st place in 7 out of 16 metrics and 2nd place in the remaining 9 metrics in the VoiceMOS Challenge 2024 Track 1.
https://arxiv.org/abs/2407.14358
https://huggingface.co/stabilityai/stable-audio-open-1.0
Stable Audio Open
Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons
Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.
https://github.com/frankyoujian/Edge-Punct-Casing
https://arxiv.org/abs/2407.13142
A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR
Jian You, Xiangfeng Li
Punctuation and word casing prediction are necessary for automatic speech recognition (ASR). With the popularity of on-device end-to-end streaming ASR systems, on-device punctuation and word casing prediction become a necessity, yet we found little discussion of this. With the emergence of the Transformer, Transformer-based models have been explored for this scenario. However, Transformer-based models are too large for on-device ASR systems. In this paper, we propose a light-weight and efficient model that jointly predicts punctuation and word casing in real time. The model is based on a Convolutional Neural Network (CNN) and a Bidirectional Long Short-Term Memory (BiLSTM). Experimental results on the IWSLT2011 test set show that the proposed model obtains a 9% relative improvement over the best non-Transformer model on overall F1-score. Compared to the representative Transformer-based model, the proposed model achieves comparable results while being only one-fortieth its size and 2.5 times faster in terms of inference time. It is suitable for on-device streaming ASR systems. Our code is publicly available.
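A rough sketch of a CNN + BiLSTM joint punctuation/casing tagger in the spirit of the paper (all sizes and label sets are illustrative, not the authors' configuration):

```python
import torch
import torch.nn as nn

class PunctCaseTagger(nn.Module):
    """Joint punctuation + casing tagger (sketch): CNN front-end, BiLSTM, two heads."""
    def __init__(self, vocab_size=8000, emb=128, hidden=128, n_punct=4, n_case=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.conv = nn.Conv1d(emb, emb, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.punct_head = nn.Linear(2 * hidden, n_punct)   # e.g. none , . ?
        self.case_head = nn.Linear(2 * hidden, n_case)     # e.g. lower, Capitalized, UPPER

    def forward(self, token_ids):
        x = self.embed(token_ids)                               # (B, T, emb)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2).relu() # local context via CNN
        x, _ = self.lstm(x)                                      # bidirectional context
        return self.punct_head(x), self.case_head(x)

model = PunctCaseTagger()
punct_logits, case_logits = model(torch.randint(0, 8000, (2, 20)))
print(punct_logits.shape, case_logits.shape)        # (2, 20, 4) (2, 20, 3)
```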
An English TTS objective leaderboard
https://ttsdsbenchmark.com/
Code
https://github.com/ttsds/ttsds
Paper
https://arxiv.org/abs/2407.12707
TTSDS -- Text-to-Speech Distribution Score
Christoph Minixhofer, Ondřej Klejch, Peter Bell
Many recently published Text-to-Speech (TTS) systems produce audio close to real speech. However, TTS evaluation needs to be revisited to make sense of the results obtained with the new architectures, approaches and datasets. We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility. Our approach assesses how well synthetic speech mirrors real speech by obtaining correlates of each factor and measuring their distance from both real speech datasets and noise datasets. We benchmark 35 TTS systems developed between 2008 and 2024 and show that our score computed as an unweighted average of factors strongly correlates with the human evaluations from each time period.
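Roughly, each factor's score compares how close the synthetic features are to real speech versus noise (a simplified sketch of the idea, not the paper's exact formula):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def factor_score(synth_feats, real_feats, noise_feats):
    """TTSDS-style score for one factor (sketch): the closer the synthetic feature
    distribution is to real speech relative to noise, the higher the score (0..100)."""
    d_real = wasserstein_distance(synth_feats, real_feats)
    d_noise = wasserstein_distance(synth_feats, noise_feats)
    return 100.0 * d_noise / (d_real + d_noise)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 1000)      # toy 1-D feature correlates
noise = rng.normal(5.0, 1.0, 1000)
synth = rng.normal(0.3, 1.1, 1000)
print(round(factor_score(synth, real, noise), 1))
```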
https://voxblink2.github.io/
This is actually multilingual
Another nice thing: they used face recognition to improve speaker identification.
If your Whisper suddenly started translating instead of transcribing, it is likely due to hackers.
https://arxiv.org/abs/2407.04482
Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models
Vyas Raina, Mark Gales
Speech enabled foundation models, either in the form of flexible speech recognition based systems or audio-prompted large language models (LLMs), are becoming increasingly popular. One of the interesting aspects of these models is their ability to perform tasks other than automatic speech recognition (ASR) using an appropriate prompt. For example, the OpenAI Whisper model can perform both speech transcription and speech translation. With the development of audio-prompted LLMs there is the potential for even greater control options. In this work we demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks. Without any access to the model prompt it is possible to modify the behaviour of the system by appropriately changing the audio input. To illustrate this risk, we demonstrate that it is possible to prepend a short universal adversarial acoustic segment to any input speech signal to override the prompt setting of an ASR foundation model. Specifically, we successfully use a universal adversarial acoustic segment to control Whisper to always perform speech translation, despite being set to perform speech transcription. Overall, this work demonstrates a new form of adversarial attack on multi-tasking speech enabled foundation models that needs to be considered prior to the deployment of this form of model.
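The deployment side of the attack is what makes it worrying: the same short learned segment is simply prepended to any waveform (a sketch; learning the segment itself requires gradient access to the model, which is not shown here):

```python
import torch

def apply_universal_prefix(audio: torch.Tensor, adv_prefix: torch.Tensor) -> torch.Tensor:
    """Prepend a universal adversarial segment to any input waveform (sketch)."""
    return torch.cat([adv_prefix, audio], dim=-1)

audio = torch.randn(16000 * 5)               # 5 s of toy 16 kHz audio
adv_prefix = torch.zeros(int(16000 * 0.64))  # placeholder ~0.64 s segment (would be learned)
print(apply_universal_prefix(audio, adv_prefix).shape)
```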
Somewhat interesting discussion on the paper Autoregressive Speech Synthesis without Vector Quantization
https://twitter.com/unilightwf/status/1811610158713413716
It's interesting that Microsoft is turning to continuous models. I have seen a few other papers going in the same direction too.
The claim that discrete models have issues with continuity is quite a valid point. For example, it is easy to show that speaker similarity is not perfect, since the discrete representation does not really follow the continuous x-vector. It is strange that none of the discrete-token papers mention that.
We have open-sourced Emilia for speech generation, a 101k-hour dataset in six languages collected in the wild (e.g., talk shows, interviews, debates). Check out the performance of models trained with it.
HF: https://huggingface.co/datasets/amphion/Emilia
ArXiv: https://arxiv.org/abs/2407.05361
Demo: https://emilia-dataset.github.io/Emilia-Demo-Page/
Many speech LLM papers recently; this one is interesting for its claims of very good accuracy on Chinese and English.
https://arxiv.org/abs/2407.04675
Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li, Xiaoyang Li, Zeyang Li, Zehua Lin, Rui Liu, Shouda Liu, Lu Lu, Yizhou Lu, Jingting Ma, Shengtao Ma, Yulin Pei, Chen Shen, Tian Tan, Xiaogang Tian, Ming Tu, Bo Wang, Hao Wang, Yuping Wang, Yuxuan Wang, Hanzhang Xia, Rui Xia, Shuangyi Xie, Hongmin Xu, Meng Yang, Bihong Zhang, Jun Zhang, Wanyi Zhang, Yang Zhang, Yawei Zhang, Yijie Zheng, Ming Zou
Modern automatic speech recognition (ASR) models are required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data-matched scenarios, and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio-conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in the LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves a 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.