speechtech | Unsorted

Telegram channel speechtech - Speech Technology

Speech Technology

Some interesting ideas here, for example

We can see that ByT5 is critical for generating coherent speech. A potential explanation could be that ByT5 recognizes individual characters, whereas BERT relies on subword tokenization techniques (such as byte-pair encoding). Since words that are spelled similarly often sound alike, the ability to discern characters can enhance the model’s ability

https://arxiv.org/abs/2406.02328

https://github.com/yangdongchao/SimpleSpeech

SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models

Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng

In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simplicity shows in three aspects: (1) it can be trained on a speech-only dataset, without any alignment information; (2) it directly takes plain text as input and generates speech in an NAR way; (3) it models speech in a finite and compact latent space, which alleviates the modeling difficulty of diffusion. More specifically, we propose a novel speech codec model (SQ-Codec) with scalar quantization; SQ-Codec effectively maps the complex speech signal into a finite and compact latent space, named the scalar latent space. Benefiting from SQ-Codec, we apply a novel transformer diffusion model in the scalar latent space of SQ-Codec. We train SimpleSpeech on 4k hours of speech-only data; it shows natural prosody and voice cloning ability. Compared with previous large-scale TTS models, it presents significant speech quality and generation speed improvements. Demos are released.
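
The scalar-quantization idea behind SQ-Codec is simple to sketch: each latent dimension is squashed into a bounded range and rounded to a small fixed number of levels, with a straight-through estimator for gradients, which gives the finite and compact latent space the abstract mentions. A minimal sketch - the tanh squashing and the level count are my assumptions, not the paper's exact configuration:

```python
import torch

def scalar_quantize(z: torch.Tensor, levels: int = 9) -> torch.Tensor:
    """Quantize every latent dimension independently to `levels` uniform values in [-1, 1].

    Forward pass uses the rounded values; backward pass is the identity on the
    squashed latent (straight-through estimator), so the codec stays trainable.
    """
    z = torch.tanh(z)                         # squash into a bounded range
    step = 2.0 / (levels - 1)                 # spacing between quantization levels
    z_q = torch.round((z + 1.0) / step) * step - 1.0
    return z + (z_q - z).detach()             # straight-through gradient trick

# toy usage: 8 latent frames with 32 channels
latent = torch.randn(8, 32, requires_grad=True)
quantized = scalar_quantize(latent)
quantized.sum().backward()                    # gradients reach `latent` via the STE
print(quantized.detach().unique().numel(), "distinct values")  # small, finite set
```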

Speech Technology

Standard speech LLM with some tricks

https://twitter.com/lileics/status/1824919329068228979

The CMU team took first place in the simultaneous translation task (English to German) at IWSLT 2024

https://www.arxiv.org/abs/2408.07452

CMU's IWSLT 2024 Simultaneous Speech Translation System

Xi Xu, Siqi Ouyang, Brian Yan, Patrick Fernandes, William Chen, Lei Li, Graham Neubig, Shinji Watanabe

This paper describes CMU's submission to the IWSLT 2024 Simultaneous Speech Translation task. Our end-to-end speech-to-text (ST) system integrates the WavLM speech encoder, a modality adapter, and the Llama2-7B-Base model as the decoder. ... Both stages are trained on MuST-C v2 data with cross-entropy loss. We adapt our offline ST model for SST using a simple fixed hold-n policy. Experiments show that our model obtains an offline BLEU score of 31.1 and a BLEU score of 29.5 under 2 seconds latency on the MuST-C v2 tst-COMMON set.
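
The fixed hold-n policy they mention is simple to state: re-decode the growing audio prefix, then commit everything except the last n tokens of the current hypothesis, never retracting what was already emitted. A toy sketch (function name and example hypotheses are mine, not from the CMU code):

```python
def hold_n_commit(committed, hypothesis, n=2):
    """Fixed hold-n policy: treat all but the last n tokens of the current
    partial hypothesis as stable, and emit only the not-yet-committed suffix."""
    stable = hypothesis[:max(len(hypothesis) - n, 0)]
    if stable[:len(committed)] != committed:   # hypothesis diverged from commitments
        return []                              # emit nothing, wait for more audio
    return stable[len(committed):]

committed = []
# partial hypotheses produced as more audio arrives (illustrative only)
for hyp in [["Das"], ["Das", "ist"], ["Das", "ist", "ein"],
            ["Das", "ist", "ein", "Test", "."]]:
    committed += hold_n_commit(committed, hyp, n=2)
    print(committed)
```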

Speech Technology

As expected, Whisper comes out on top among general-purpose recognizers for lyrics transcription, but specialized systems still beat it

https://audioshake.github.io/jam-alt/

https://huggingface.co/datasets/audioshake/jam-alt

https://arxiv.org/abs/2408.06370

Lyrics Transcription for Humans: A Readability-Aware Benchmark

Ondřej Cífka, Hendrik Schreiber, Luke Miner, Fabian-Robert Stöter

Writing down lyrics for human consumption involves not only accurately capturing word sequences, but also incorporating punctuation and formatting for clarity and to convey contextual information. This includes song structure, emotional emphasis, and contrast between lead and background vocals. While automatic lyrics transcription (ALT) systems have advanced beyond producing unstructured strings of words and are able to draw on wider context, ALT benchmarks have not kept pace and continue to focus exclusively on words. To address this gap, we introduce Jam-ALT, a comprehensive lyrics transcription benchmark. The benchmark features a complete revision of the JamendoLyrics dataset, in adherence to industry standards for lyrics transcription and formatting, along with evaluation metrics designed to capture and assess the lyric-specific nuances, laying the foundation for improving the readability of lyrics. We apply the benchmark to recent transcription systems and present additional error analysis, as well as an experimental comparison with a classical music dataset.

Speech Technology

From the new Ultravox release

https://github.com/fixie-ai/ultravox/discussions/78

In addition to increasing the overall size of the training set, v0.3 also introduces two other important changes. The first is that we're augmenting the ASR datasets with synthetic data in the form of generated continuations. The second change is that we've migrated to a knowledge distillation approach for calculating loss. Combined, both of these approaches result in much better speech-to-text alignment in the adapter. You can learn more in their respective papers.

This paper seems helpful

https://arxiv.org/abs/2405.19041

BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation

Chen Wang, Minpeng Liao, Zhongqiang Huang, Jiajun Zhang

Recent end-to-end approaches have shown promise in extending large language models (LLMs) to speech inputs, but face limitations in directly assessing and optimizing alignment quality and fail to achieve fine-grained alignment due to speech-text length mismatch. We introduce BLSP-KD, a novel approach for Bootstrapping Language-Speech Pretraining via Knowledge Distillation, which addresses these limitations through two key techniques. First, it optimizes speech-text alignment by minimizing the divergence between the LLM's next-token prediction distributions for speech and text inputs using knowledge distillation. Second, it employs a continuous integrate-and-fire strategy to segment speech into tokens that correspond one-to-one with text tokens, enabling fine-grained alignment. We also introduce Partial LoRA (PLoRA), a new adaptation method supporting LLM finetuning for speech inputs under knowledge distillation. Quantitative evaluation shows that BLSP-KD outperforms previous end-to-end baselines and cascaded systems with comparable scale of parameters, facilitating general instruction-following capabilities for LLMs with speech inputs. This approach provides new possibilities for extending LLMs to spoken language interactions.
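
The central objective here is a token-level divergence between the LLM's next-token distributions for the text input (teacher) and the speech input (student), assuming the integrate-and-fire segmentation already gives a one-to-one position match. A hedged sketch of that loss term, not the exact BLSP-KD implementation:

```python
import torch
import torch.nn.functional as F

def kd_alignment_loss(text_logits: torch.Tensor,
                      speech_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) between next-token distributions of the same LLM,
    conditioned on text (teacher) and on speech (student).

    Shapes: (batch, seq_len, vocab). Positions are assumed to align one-to-one,
    which is what the continuous integrate-and-fire segmentation provides.
    """
    teacher = F.softmax(text_logits.detach() / temperature, dim=-1)
    student_log_probs = F.log_softmax(speech_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher, reduction="batchmean") * temperature ** 2

# toy shapes: batch of 2, 10 aligned positions, vocab of 100
loss = kd_alignment_loss(torch.randn(2, 10, 100), torch.randn(2, 10, 100))
print(loss.item())
```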

Speech Technology

Qwen2-Audio-7B, the next version of Qwen-Audio, which is capable of accepting audio and text inputs and generating text outputs!

Demo: https://huggingface.co/spaces/Qwen/Qwen2-Audio-Instruct-Demo

Speech Technology

https://github.com/RicherMans/Dasheng

This repo provides checkpoints for the Interspeech 2024 paper Scaling up masked audio encoder learning for general audio classification. The goal of this work is to investigate the scalability of masked autoencoders for audio. Prior work did not scale beyond 10,000 hours of audio, while Dasheng used 272,000 hours of training data.

Speech Technology

https://diva-audio.github.io/

[TL;DR] DiVA Llama 3 outperforms existing Speech LMs on QA, Emotion Recognition, and Translation with a speech encoder trained using only weak supervision. DiVA learns to encode speech while preserving the underlying LLM output distribution using cross-modal context distillation between text and speech. DiVA was trained with open-source code in Levanter on 3.5k hours of publicly available and permissively licensed ASR data from Common Voice.

Speech Technology

https://ai.meta.com/research/publications/the-llama-3-herd-of-models/

The recently released Llama 3.1 paper has a big section on speech understanding - speech adapter, prosody modeling, speech generation, etc. An interesting overview of the current tech.

For example, the wav2vec encoder is pretrained on 15 million hours of data and fine-tuned on 230k hours of transcribed data.

Start with page 63

Speech Technology

https://huggingface.co/spaces/AudioLLMs/AudioBench-Leaderboard

https://arxiv.org/abs/2406.16020

https://github.com/AudioLLMs/AudioBench

Speech Technology

The paper is interesting but has many arguable points. For example, the authors see no correlation between objective measures and the Arena score and propose adding extra scores to fit the Arena score. Instead, one could conclude that side-by-side evaluation by non-experts is simply broken.

Speech Technology

CMU Lectures from Shinji Watanabe

[Fall 2023] Speech Recognition and Understanding

https://www.youtube.com/playlist?list=PLfVqr2l0FG-tW8d5ZSz-_tCgQed_F1ndb

An interesting part is the discussion of RNN-T vs. attention, where Shinji argues for turning back to CTC decoding instead of RNN-T. A valid point these days.

https://youtu.be/BQBOu9BOFpc?list=PLfVqr2l0FG-tW8d5ZSz-_tCgQed_F1ndb&t=2585

Speech Technology

Not yet released

https://github.com/QwenLM/Qwen2-Audio

This is the latest progress of Qwen-Audio: a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or giving direct textual responses to speech instructions. We introduce two distinct audio interaction modes:

voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input;

audio analysis: users could provide audio and text instructions for analysis during the interaction;

We are going to release two models of the Qwen2-Audio series soon: Qwen2-Audio and Qwen2-Audio-Chat.

Speech Technology

The original paper

https://arxiv.org/abs/2407.08551

Autoregressive Speech Synthesis without Vector Quantization

Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei

We present MELLE, a novel continuous-valued token based language modeling approach for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition, bypassing the need for vector quantization, which was originally designed for audio compression and sacrifices fidelity compared to mel-spectrograms. Specifically, (i) instead of cross-entropy loss, we apply regression loss with a proposed spectrogram flux loss function to model the probability distribution of the continuous-valued tokens; (ii) we have incorporated variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing the output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language model VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling discrete codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. See this https URL for demos of our work.
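
A rough reading of the objective: a plain regression loss on the continuous mel frames replaces cross-entropy over discrete codes, plus a "flux" term on frame-to-frame differences to discourage over-smoothed output. The sketch below is my approximation of that idea, not MELLE's exact formulation:

```python
import torch
import torch.nn.functional as F

def melle_style_loss(pred_mel: torch.Tensor, target_mel: torch.Tensor,
                     flux_weight: float = 0.5) -> torch.Tensor:
    """Regression + spectrogram-flux style loss on mel frames of shape (B, T, n_mels).

    The regression term replaces cross-entropy over discrete codes; the flux
    term compares frame-to-frame deltas so predictions are not over-smoothed.
    Illustrative only - weights and exact terms are assumptions.
    """
    regression = F.l1_loss(pred_mel, target_mel) + F.mse_loss(pred_mel, target_mel)
    pred_flux = pred_mel[:, 1:] - pred_mel[:, :-1]       # temporal deltas
    target_flux = target_mel[:, 1:] - target_mel[:, :-1]
    flux = F.l1_loss(pred_flux, target_flux)
    return regression + flux_weight * flux

# toy usage: batch of 2, 100 frames, 80 mel bins
print(melle_style_loss(torch.randn(2, 100, 80), torch.randn(2, 100, 80)).item())
```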

Speech Technology

https://huggingface.co/datasets/FBK-MT/Speech-MASSIVE

Speech-MASSIVE is a multilingual Spoken Language Understanding (SLU) dataset comprising the speech counterpart for a portion of the MASSIVE textual corpus. Speech-MASSIVE covers 12 languages (Arabic, German, Spanish, French, Hungarian, Korean, Dutch, Polish, European Portuguese, Russian, Turkish, and Vietnamese) from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks. MASSIVE utterances' labels span 18 domains, with 60 intents and 55 slots. Full train split is provided for French and German, and for all the 12 languages (including French and German), we provide few-shot train, validation, test splits. Few-shot train (115 examples) covers all 18 domains, 60 intents, and 55 slots (including empty slots).
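
A quick way to poke at the data is the Hugging Face datasets library; the config name and split below are assumptions based on how MASSIVE-style datasets are usually laid out, so check the dataset card for the exact identifiers:

```python
from datasets import load_dataset

# "fr-FR" and "train" are assumed names - verify them on the dataset card.
ds = load_dataset("FBK-MT/Speech-MASSIVE", "fr-FR", split="train")

print(ds)              # expect audio, transcript, intent and slot annotations
print(ds[0].keys())    # inspect actual column names before building a pipeline
```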

Speech Technology

Whisper needs much more exploration, actually. This is a great paper on a relevant subject

https://arxiv.org/pdf/2406.05806

Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper

Chih-Kai Yang, Kuan-Po Huang, Hung-yi Lee

This research explores how the information of prompts interacts with the high-performing speech recognition model, Whisper. We compare its performances when prompted by prompts with correct information and those corrupted with incorrect information. Our results unexpectedly show that Whisper may not understand the textual prompts in a human-expected way. Additionally, we find that performance improvement is not guaranteed even with stronger adherence to the topic information in textual prompts. It is also noted that English prompts generally outperform Mandarin ones on datasets of both languages, likely due to differences in training data distributions for these languages despite the mismatch with pre-training scenarios. Conversely, we discover that Whisper exhibits awareness of misleading information in language tokens by ignoring incorrect language tokens and focusing on the correct ones. In sum, we raise insightful questions about Whisper's prompt understanding and reveal its counter-intuitive behaviors. We encourage further studies.
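
For context, this is the kind of prompting being probed - the openai-whisper package exposes it through the initial_prompt and language arguments of transcribe (the audio path and prompt text below are placeholders):

```python
import whisper

model = whisper.load_model("small")

# A topic/domain hint passed as a textual prompt; the paper shows that correct
# topic information does not reliably improve accuracy, while wrong language
# tokens tend to be ignored by the model.
result = model.transcribe(
    "meeting.wav",                       # placeholder path
    language="en",
    initial_prompt="A discussion about speech recognition benchmarks.",
)
print(result["text"])
```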

Speech Technology

Not really a speech paper, but you might ask where DeepMind and the like take their research ideas from. Not surprisingly, they analyze the biology of the human brain. Take this for example:

https://arxiv.org/abs/2408.05446

https://twitter.com/stanislavfort/status/1823347721358438624

Inspired by biology we 1) get adversarial robustness + interpretability for free, 2) turn classifiers into generators & 3) design attacks on vLLMs

By the way, Whisper's 30-second window is not arbitrary either - it is biologically motivated: short-term human memory is estimated to last 15 to 30 seconds.

https://bokcenter.harvard.edu/how-memory-works#:~:text=Time%20and%20inattention%20may%20cause,items%20being%20the%20average%20number.

And also by the way, neuroscientists believe that neurons operate at 4 bits

https://brainchip.com/4-bits-are-enough/

Speech Technology

An interesting thread on decompiling and optimizing Silero VAD

https://github.com/snakers4/silero-vad/discussions/408#discussioncomment-10348222

Speech Technology

Yet another Miipher-ed dataset, FLEURS-R, has been released. It comprises 1.3k hours of studio-quality speech across 102 locales, under CC-BY-4.0.

Paper: https://arxiv.org/abs/2408.06227
Dataset: https://huggingface.co/datasets/google/fleurs-r

Speech Technology

Our friend @bjutte reports

https://medium.com/@Attendi/optimizing-real-time-factor-addressing-rtf-variability-and-enhancing-audio-transcription-5cb4c27ab767

Speech Technology

Speech LLM thing goes on

https://arxiv.org/abs/2408.02622

Language Model Can Listen While Speaking

Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

Dialogue serves as the most natural manner of human-computer interaction (HCI). Recent advancements in speech language models (SLM) have significantly enhanced speech-based conversational AI. However, these models are limited to turn-based conversation, lacking the ability to interact with humans in real-time spoken scenarios, for example, being interrupted when the generated content is not satisfactory. To address these limitations, we explore full duplex modeling (FDM) in interactive speech language models (iSLM), focusing on enhancing real-time interaction and, more explicitly, exploring the quintessential ability of interruption. We introduce a novel model design, namely listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. LSLM fuses both channels for autoregressive generation and detects turn-taking in real time. Three fusion strategies -- early fusion, middle fusion, and late fusion -- are explored, with middle fusion achieving an optimal balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM's robustness to noise and sensitivity to diverse instructions. Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems. This study aims to advance the development of interactive speech dialogue systems, enhancing their applicability in real-world contexts.

Speech Technology

2D bucketing for faster training

https://github.com/NVIDIA/NeMo/blob/oomptimizer/docs/source/asr/datasets.rst#2d-bucketing

Canary-1B can be trained with 5x larger batch sizes compared to our earlier baseline. It maxes out GPU utilization (memory, compute, and power consumption wise). As a result the mean training step time is 2.75x longer, resulting in a training throughput of 5x / 2.75x ~= 180% of the original recipe. I managed to reproduce Canary-1B in about 40k training steps on the same number of GPUs, changing only bucketing/batch size settings using new features in this PR.

https://github.com/NVIDIA/NeMo/pull/9763
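
The gist of 2D bucketing: group utterances by both audio duration and transcript length, so each bucket can get its own (larger) batch size with little padding waste in either dimension. A toy illustration with made-up bin edges, not the NeMo/Lhotse implementation:

```python
import bisect

DURATION_EDGES = [5.0, 10.0, 20.0, 30.0]   # seconds, illustrative bins
TOKEN_EDGES = [32, 64, 128, 256]           # transcript tokens, illustrative bins

def bucket_id(duration_s: float, num_tokens: int) -> tuple:
    """Assign an utterance to a 2D bucket (duration bin, token-length bin).

    Each bucket can then be given its own batch size tuned to max out GPU
    memory, instead of one conservative global batch size.
    """
    d = bisect.bisect_left(DURATION_EDGES, duration_s)
    t = bisect.bisect_left(TOKEN_EDGES, num_tokens)
    return d, t

print(bucket_id(7.3, 40))   # -> (1, 1)
```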

Speech Technology

UTMOSv2 is out

https://github.com/sarulab-speech/UTMOSv2

UTMOSv2 achieved 1st place in 7 out of 16 metrics and 2nd place in the remaining 9 metrics in VoiceMOS Challenge 2024 Track 1

Speech Technology

https://arxiv.org/abs/2407.14358

https://huggingface.co/stabilityai/stable-audio-open-1.0

Stable Audio Open

Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons

Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.

Speech Technology

https://github.com/frankyoujian/Edge-Punct-Casing

https://arxiv.org/abs/2407.13142

A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR

Jian You, Xiangfeng Li

Punctuation and word casing prediction are necessary for automatic speech recognition (ASR). With the popularity of on-device end-to-end streaming ASR systems, the on-device punctuation and word casing prediction become a necessity while we found little discussion on this. With the emergence of Transformer, Transformer based models have been explored for this scenario. However, Transformer based models are too large for on-device ASR systems. In this paper, we propose a light-weight and efficient model that jointly predicts punctuation and word casing in real time. The model is based on Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM). Experimental results on the IWSLT2011 test set show that the proposed model obtains 9% relative improvement compared to the best of non-Transformer models on overall F1-score. Compared to the representative of Transformer based models, the proposed model achieves comparable results to the representative model while being only one-fortieth its size and 2.5 times faster in terms of inference time. It is suitable for on-device streaming ASR systems. Our code is publicly available.
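
A shape-level sketch of the kind of model described - a CNN over token embeddings feeding a BiLSTM with two classification heads, one for punctuation and one for casing. All sizes and label sets are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class PunctCaseTagger(nn.Module):
    """Joint punctuation + casing tagger in the spirit of the paper:
    CNN over token embeddings followed by a BiLSTM, with two softmax heads."""

    def __init__(self, vocab=8000, emb=128, hidden=128,
                 punct_labels=4, case_labels=3):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.conv = nn.Conv1d(emb, hidden, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.punct_head = nn.Linear(2 * hidden, punct_labels)  # e.g. none . , ?
        self.case_head = nn.Linear(2 * hidden, case_labels)    # e.g. lower, Cap, UPPER

    def forward(self, tokens):                   # tokens: (B, T) int ids
        x = self.emb(tokens).transpose(1, 2)     # (B, emb, T) for Conv1d
        x = torch.relu(self.conv(x)).transpose(1, 2)
        x, _ = self.lstm(x)                      # (B, T, 2*hidden)
        return self.punct_head(x), self.case_head(x)

model = PunctCaseTagger()
punct_logits, case_logits = model(torch.randint(0, 8000, (2, 16)))
print(punct_logits.shape, case_logits.shape)    # (2, 16, 4) (2, 16, 3)
```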

Speech Technology

An English TTS objective leaderboard

https://ttsdsbenchmark.com/

Code

https://github.com/ttsds/ttsds

Paper

https://arxiv.org/abs/2407.12707

TTSDS -- Text-to-Speech Distribution Score

Christoph Minixhofer, Ondřej Klejch, Peter Bell

Many recently published Text-to-Speech (TTS) systems produce audio close to real speech. However, TTS evaluation needs to be revisited to make sense of the results obtained with the new architectures, approaches and datasets. We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility. Our approach assesses how well synthetic speech mirrors real speech by obtaining correlates of each factor and measuring their distance from both real speech datasets and noise datasets. We benchmark 35 TTS systems developed between 2008 and 2024 and show that our score computed as an unweighted average of factors strongly correlates with the human evaluations from each time period.
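
The scoring scheme is easy to mimic at a high level: each factor gets a score from how much closer the synthetic-speech features are to real speech than to noise, and the final benchmark value is an unweighted mean over factors. The distance measure below is a placeholder; only the combination logic reflects the paper:

```python
import numpy as np

def factor_score(d_real: float, d_noise: float) -> float:
    """Score one factor (prosody, speaker identity, intelligibility, ...) in [0, 100].

    d_real / d_noise are distances from the synthetic-speech feature distribution
    to real-speech and noise reference distributions. The actual distances in
    TTSDS are more involved; this only shows how scores are combined."""
    return 100.0 * d_noise / (d_real + d_noise + 1e-9)

def ttsds_style_score(factor_distances: dict) -> float:
    """Unweighted average of factor scores, as in the paper's final benchmark score."""
    return float(np.mean([factor_score(dr, dn) for dr, dn in factor_distances.values()]))

print(ttsds_style_score({"prosody": (0.2, 0.9),
                         "speaker": (0.1, 0.8),
                         "intelligibility": (0.3, 0.7)}))
```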

Speech Technology

https://voxblink2.github.io/

This is actually multilingual

Another nice thing: they used face recognition to improve speaker identification

Speech Technology

If your Whisper started translating instead of transcribing, it is likely due to hackers

https://arxiv.org/abs/2407.04482

Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models

Vyas Raina, Mark Gales

Speech enabled foundation models, either in the form of flexible speech recognition based systems or audio-prompted large language models (LLMs), are becoming increasingly popular. One of the interesting aspects of these models is their ability to perform tasks other than automatic speech recognition (ASR) using an appropriate prompt. For example, the OpenAI Whisper model can perform both speech transcription and speech translation. With the development of audio-prompted LLMs there is the potential for even greater control options. In this work we demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks. Without any access to the model prompt it is possible to modify the behaviour of the system by appropriately changing the audio input. To illustrate this risk, we demonstrate that it is possible to prepend a short universal adversarial acoustic segment to any input speech signal to override the prompt setting of an ASR foundation model. Specifically, we successfully use a universal adversarial acoustic segment to control Whisper to always perform speech translation, despite being set to perform speech transcription. Overall, this work demonstrates a new form of adversarial attack on multi-tasking speech enabled foundation models that needs to be considered prior to the deployment of this form of model.

Speech Technology

A somewhat interesting discussion of the paper Autoregressive Speech Synthesis without Vector Quantization

https://twitter.com/unilightwf/status/1811610158713413716

It's interesting that Microsoft is turning to continuous models. I have seen a few other papers going in the same direction too.

The claim that discrete models have issues with continuity is quite valid. For example, it is easy to show that speaker similarity is not perfect, since the discrete representation doesn't really follow the continuous x-vector. It is strange that none of the discrete-token papers mention that.

Speech Technology

We have open-sourced Emilia for speech generation, a 101k-hour dataset in six languages collected in the wild (e.g. talk shows, interviews, debates). Check out the performance of models trained with it.
HF: https://huggingface.co/datasets/amphion/Emilia
ArXiv: https://arxiv.org/abs/2407.05361
Demo: https://emilia-dataset.github.io/Emilia-Demo-Page/

Speech Technology

Many LLM papers recently; this one is interesting for its claims of very good accuracy on Chinese and English

https://arxiv.org/abs/2407.04675

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li, Xiaoyang Li, Zeyang Li, Zehua Lin, Rui Liu, Shouda Liu, Lu Lu, Yizhou Lu, Jingting Ma, Shengtao Ma, Yulin Pei, Chen Shen, Tian Tan, Xiaogang Tian, Ming Tu, Bo Wang, Hao Wang, Yuping Wang, Yuxuan Wang, Hanzhang Xia, Rui Xia, Shuangyi Xie, Hongmin Xu, Meng Yang, Bihong Zhang, Jun Zhang, Wanyi Zhang, Yang Zhang, Yawei Zhang, Yijie Zheng, Ming Zou

Modern automatic speech recognition (ASR) models are required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.
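
A shape-level sketch of the audio-conditioned LLM (AcLLM) framing: continuous speech-encoder states are projected by an adapter into the LLM embedding space and consumed together with embedded context text as one sequence. Every size here is an illustrative assumption, and a tiny transformer stands in for the real LLM:

```python
import torch
import torch.nn as nn

class AudioConditionedLM(nn.Module):
    """Minimal sketch of the AcLLM idea: project continuous speech encoder
    states into the LLM embedding space and prepend them, together with
    embedded context text, to the token sequence."""

    def __init__(self, d_audio=512, d_model=1024, vocab=32000):
        super().__init__()
        self.adapter = nn.Sequential(nn.Linear(d_audio, d_model), nn.GELU(),
                                     nn.Linear(d_model, d_model))
        self.tok_emb = nn.Embedding(vocab, d_model)
        # stand-in for a decoder-only LLM body
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, audio_states, context_ids):
        audio = self.adapter(audio_states)            # (B, T_audio, d_model)
        ctx = self.tok_emb(context_ids)               # (B, T_ctx, d_model)
        h = self.llm(torch.cat([ctx, audio], dim=1))  # context + speech as one sequence
        return self.lm_head(h)

model = AudioConditionedLM()
logits = model(torch.randn(1, 50, 512), torch.randint(0, 32000, (1, 8)))
print(logits.shape)   # (1, 58, 32000)
```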
