https://huggingface.co/nvidia/stt_uz_fastconformer_hybrid_large_pc
https://arxiv.org/abs/2410.18908
A Survey on Speech Large Language Models
Jing Peng, Yucheng Wang, Yu Xi, Xu Li, Xizhuo Zhang, Kai Yu
Large Language Models (LLMs) exhibit strong contextual understanding and remarkable multi-task performance. Therefore, researchers have been seeking to integrate LLMs into the broader Spoken Language Understanding (SLU) field. Different from the traditional method of cascading LLMs to process text generated by Automatic Speech Recognition (ASR), new efforts have focused on designing architectures centered around Audio Feature Extraction - Multimodal Information Fusion - LLM Inference (Speech LLMs). This approach enables richer audio feature extraction while simultaneously facilitating end-to-end fusion of audio and text modalities, thereby achieving deeper understanding and reasoning from audio data. This paper elucidates the development of Speech LLMs, offering an in-depth analysis of system architectures and training strategies. Through extensive research and a series of targeted experiments, the paper assesses Speech LLMs' advancements in Rich Audio Transcription and their potential for Cross-task Integration within the SLU field. Additionally, it indicates key challenges uncovered through experimentation, such as the Dormancy of LLMs under certain conditions. The paper further delves into the training strategies for Speech LLMs, proposing potential solutions based on these findings, and offering valuable insights and references for future research in this domain, as well as LLM applications in multimodal contexts.
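A minimal sketch of the Audio Feature Extraction - Multimodal Information Fusion - LLM Inference pattern the survey describes. Every module, dimension, and name below is an illustrative stand-in (a real system would plug in a pretrained speech encoder such as Whisper or WavLM and a pretrained decoder-only LLM), not code from any particular paper:

```python
import torch
import torch.nn as nn

class SpeechLLMSketch(nn.Module):
    """Illustrative speech-LLM skeleton: audio encoder -> adapter -> LLM decoder."""

    def __init__(self, n_mels=80, d_audio=256, d_llm=512, vocab=32000):
        super().__init__()
        # 1) Audio feature extraction (stand-in for a pretrained speech encoder)
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(n_mels, d_audio, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv1d(d_audio, d_audio, kernel_size=3, stride=2, padding=1),
        )
        # 2) Multimodal fusion: project audio frames into the LLM embedding space
        self.adapter = nn.Linear(d_audio, d_llm)
        # 3) LLM inference (stand-in for a pretrained decoder-only LLM)
        self.text_embed = nn.Embedding(vocab, d_llm)
        layer = nn.TransformerEncoderLayer(d_llm, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_llm, vocab)

    def forward(self, mels, prompt_ids):
        # mels: (B, n_mels, T), prompt_ids: (B, L)
        audio = self.audio_encoder(mels).transpose(1, 2)   # (B, T', d_audio)
        audio = self.adapter(audio)                        # (B, T', d_llm)
        text = self.text_embed(prompt_ids)                 # (B, L, d_llm)
        fused = torch.cat([audio, text], dim=1)            # prepend audio tokens
        return self.lm_head(self.llm(fused))               # (B, T'+L, vocab)

model = SpeechLLMSketch()
logits = model(torch.randn(2, 80, 100), torch.randint(0, 32000, (2, 8)))
print(logits.shape)  # torch.Size([2, 33, 32000])
```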
A nice approach with pre-alignment and great speed
https://arxiv.org/abs/2406.08835
EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed
Ziyang Zhuang, Chenfeng Miao, Kun Zou, Shuai Gong, Ming Fang, Tao Wei, Zijian Li, Wei Hu, Shaojun Wang, Jing Xiao
Non-autoregressive (NAR) automatic speech recognition (ASR) models predict tokens independently and simultaneously, bringing high inference speed. However, there is still a gap in the accuracy of the NAR models compared to the autoregressive (AR) models. In this paper, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EffectiveASR. It uses an Index Mapping Vector (IMV) based alignment generator to generate alignments during training, and an alignment predictor to learn the alignments for inference. It can be trained end-to-end (E2E) with cross-entropy loss combined with alignment loss. The proposed EffectiveASR achieves competitive results on the AISHELL-1 and AISHELL-2 Mandarin benchmarks compared to the leading models. Specifically, it achieves character error rates (CER) of 4.26%/4.62% on the AISHELL-1 dev/test dataset, which outperforms the AR Conformer with about 30x inference speedup.
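A toy illustration of single-step NAR decoding driven by a predicted frame-to-token alignment (the role the IMV-based alignment predictor plays in EffectiveASR). This is not the authors' code, just the general mechanism:

```python
import torch

def nar_decode_with_alignment(frame_logits, frame_to_token):
    """Collapse per-frame predictions into a token sequence in one parallel step.

    frame_logits:   (T, V) per-frame token logits from the encoder
    frame_to_token: (T,)   predicted token index for each frame (an IMV-like
                           alignment: monotonic, repeated indices mark frames
                           belonging to the same output token)
    """
    n_tokens = int(frame_to_token.max().item()) + 1
    vocab = frame_logits.size(1)
    # Average the logits of all frames assigned to the same token position,
    # then take an argmax per position -- every token is predicted independently.
    pooled = torch.zeros(n_tokens, vocab)
    pooled.index_add_(0, frame_to_token, frame_logits)
    counts = torch.bincount(frame_to_token, minlength=n_tokens).clamp(min=1)
    pooled = pooled / counts.unsqueeze(1)
    return pooled.argmax(dim=-1)

# Toy example: 6 frames, 3 output tokens, vocabulary of 5
logits = torch.randn(6, 5)
alignment = torch.tensor([0, 0, 1, 1, 1, 2])
print(nar_decode_with_alignment(logits, alignment))  # tensor of 3 token ids
```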
Not really a speech paper, but you might ask where DeepMind and the like get their research ideas from. Not surprisingly, they analyze the biology of the human brain. Take this, for example:
https://arxiv.org/abs/2408.05446
https://twitter.com/stanislavfort/status/1823347721358438624
Inspired by biology we 1) get adversarial robustness + interpretability for free, 2) turn classifiers into generators & 3) design attacks on vLLMs
By the way, Whisper's 30-second window is not arbitrary either: it is biologically motivated, since short-term human memory is estimated to last 15 to 30 seconds.
https://bokcenter.harvard.edu/how-memory-works#:~:text=Time%20and%20inattention%20may%20cause,items%20being%20the%20average%20number.
And while we are at it, neuroscientists believe that neurons operate at about 4 bits
https://brainchip.com/4-bits-are-enough/
An interesting thread on decompiling and optimizing Silero VAD
https://github.com/snakers4/silero-vad/discussions/408#discussioncomment-10348222
Yet another Miipher-ed dataset, FLEURS-R, has been released. It comprises 1.3k hours of studio-quality speech across 102 locales under a CC-BY-4.0 license.
Paper: https://arxiv.org/abs/2408.06227
Dataset: https://huggingface.co/datasets/google/fleurs-r
Our friend @bjutte reports
https://medium.com/@Attendi/optimizing-real-time-factor-addressing-rtf-variability-and-enhancing-audio-transcription-5cb4c27ab767
The speech LLM trend continues
https://arxiv.org/abs/2408.02622
Language Model Can Listen While Speaking
Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen
Dialogue serves as the most natural manner of human-computer interaction (HCI). Recent advancements in speech language models (SLM) have significantly enhanced speech-based conversational AI. However, these models are limited to turn-based conversation, lacking the ability to interact with humans in real-time spoken scenarios, for example, being interrupted when the generated content is not satisfactory. To address these limitations, we explore full duplex modeling (FDM) in interactive speech language models (iSLM), focusing on enhancing real-time interaction and, more explicitly, exploring the quintessential ability of interruption. We introduce a novel model design, namely listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. LSLM fuses both channels for autoregressive generation and detects turn-taking in real time. Three fusion strategies -- early fusion, middle fusion, and late fusion -- are explored, with middle fusion achieving an optimal balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM's robustness to noise and sensitivity to diverse instructions. Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems. This study aims to advance the development of interactive speech dialogue systems, enhancing their applicability in real-world contexts.
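A rough sketch of what "middle fusion" could look like: listening-channel features are projected and added to the speaking-channel hidden states inside a decoder block, so the generator can react to incoming audio while it speaks. The wiring and dimensions are assumptions for illustration, not the LSLM implementation:

```python
import torch
import torch.nn as nn

class MiddleFusionBlock(nn.Module):
    """Illustrative middle-fusion block: fuse streaming listening-channel
    features into the speaking (TTS) channel at an intermediate layer."""

    def __init__(self, d_speak=512, d_listen=256, nhead=8):
        super().__init__()
        self.listen_proj = nn.Linear(d_listen, d_speak)
        self.block = nn.TransformerEncoderLayer(d_speak, nhead=nhead, batch_first=True)

    def forward(self, speak_hidden, listen_feats):
        # speak_hidden: (B, L, d_speak) decoder states of the speaking channel
        # listen_feats: (B, L, d_listen) streaming SSL features, time-aligned
        fused = speak_hidden + self.listen_proj(listen_feats)
        return self.block(fused)

block = MiddleFusionBlock()
out = block(torch.randn(2, 50, 512), torch.randn(2, 50, 256))
print(out.shape)  # torch.Size([2, 50, 512])
```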
2D bucketing for faster training
https://github.com/NVIDIA/NeMo/blob/oomptimizer/docs/source/asr/datasets.rst#2d-bucketing
Canary-1B can be trained with 5x larger batch sizes compared to our earlier baseline. It maxes out GPU utilization (memory, compute, and power consumption wise). As a result the mean training step time is 2.75x longer, resulting in a training throughput of 5x / 2.75x ~= 180% of the original recipe. I managed to reproduce Canary-1B in about 40k training steps on the same number of GPUs, changing only bucketing/batch size settings using new features in this PR.
https://github.com/NVIDIA/NeMo/pull/9763
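A conceptual sketch of the 2D bucketing idea (not NeMo's actual implementation): utterances are binned jointly by audio duration and transcript length, so each bucket can be given its own, larger batch size without blowing up memory on the worst-case combination:

```python
import numpy as np

def make_2d_buckets(durations, token_lens, n_dur_buckets=4, n_len_buckets=3):
    """Toy 2D bucketing: bin utterances jointly by audio duration and
    transcript length. Bucket counts are illustrative."""
    dur_edges = np.quantile(durations, np.linspace(0, 1, n_dur_buckets + 1)[1:-1])
    len_edges = np.quantile(token_lens, np.linspace(0, 1, n_len_buckets + 1)[1:-1])
    buckets = {}
    for i, (d, l) in enumerate(zip(durations, token_lens)):
        key = (int(np.searchsorted(dur_edges, d)), int(np.searchsorted(len_edges, l)))
        buckets.setdefault(key, []).append(i)
    return buckets

rng = np.random.default_rng(0)
durations = rng.uniform(1.0, 30.0, size=1000)                    # seconds
token_lens = (durations * rng.uniform(2, 4, 1000)).astype(int)   # rough transcript lengths
for key, idxs in sorted(make_2d_buckets(durations, token_lens).items()):
    print(key, len(idxs))   # each (duration_bin, length_bin) pair gets its own batch size
```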
UTMOSv2 is out
https://github.com/sarulab-speech/UTMOSv2
UTMOSv2 achieved 1st place in 7 out of 16 metrics and 2nd place in the remaining 9 metrics in VoiceMOS Challenge 2024 Track 1
https://arxiv.org/abs/2407.14358
https://huggingface.co/stabilityai/stable-audio-open-1.0
Stable Audio Open
Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons
Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.
https://github.com/frankyoujian/Edge-Punct-Casing
https://arxiv.org/abs/2407.13142
A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR
Jian You, Xiangfeng Li
Punctuation and word casing prediction are necessary for automatic speech recognition (ASR). With the popularity of on-device end-to-end streaming ASR systems, the on-device punctuation and word casing prediction become a necessity while we found little discussion on this. With the emergence of Transformer, Transformer based models have been explored for this scenario. However, Transformer based models are too large for on-device ASR systems. In this paper, we propose a light-weight and efficient model that jointly predicts punctuation and word casing in real time. The model is based on Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM). Experimental results on the IWSLT2011 test set show that the proposed model obtains 9% relative improvement compared to the best of non-Transformer models on overall F1-score. Compared to the representative of Transformer based models, the proposed model achieves comparable results to the representative model while being only one-fortieth its size and 2.5 times faster in terms of inference time. It is suitable for on-device streaming ASR systems. Our code is publicly available.
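A minimal sketch of a joint CNN + BiLSTM punctuation-and-casing tagger in the spirit of the paper; layer sizes and label sets below are illustrative assumptions, not the authors' configuration:

```python
import torch
import torch.nn as nn

class PunctCaseTagger(nn.Module):
    """Small CNN + BiLSTM tagger that jointly predicts punctuation and casing
    per word (illustrative sizes)."""

    def __init__(self, vocab=10000, d_emb=128, d_hid=128,
                 n_punct=4,   # e.g. none / comma / period / question mark
                 n_case=3):   # e.g. lower / capitalized / all-caps
        super().__init__()
        self.embed = nn.Embedding(vocab, d_emb)
        self.conv = nn.Conv1d(d_emb, d_hid, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(d_hid, d_hid, batch_first=True, bidirectional=True)
        self.punct_head = nn.Linear(2 * d_hid, n_punct)
        self.case_head = nn.Linear(2 * d_hid, n_case)

    def forward(self, word_ids):                    # (B, L)
        x = self.embed(word_ids).transpose(1, 2)    # (B, d_emb, L)
        x = torch.relu(self.conv(x)).transpose(1, 2)
        x, _ = self.bilstm(x)                       # (B, L, 2*d_hid)
        return self.punct_head(x), self.case_head(x)

model = PunctCaseTagger()
punct_logits, case_logits = model(torch.randint(0, 10000, (2, 16)))
print(punct_logits.shape, case_logits.shape)  # (2, 16, 4) (2, 16, 3)
```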
An English TTS objective leaderboard
https://ttsdsbenchmark.com/
Code
https://github.com/ttsds/ttsds
Paper
https://arxiv.org/abs/2407.12707
TTSDS -- Text-to-Speech Distribution Score
Christoph Minixhofer, Ondřej Klejch, Peter Bell
Many recently published Text-to-Speech (TTS) systems produce audio close to real speech. However, TTS evaluation needs to be revisited to make sense of the results obtained with the new architectures, approaches and datasets. We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility. Our approach assesses how well synthetic speech mirrors real speech by obtaining correlates of each factor and measuring their distance from both real speech datasets and noise datasets. We benchmark 35 TTS systems developed between 2008 and 2024 and show that our score computed as an unweighted average of factors strongly correlates with the human evaluations from each time period.
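A conceptual sketch of the scoring idea: for each factor, check whether the synthetic-speech distribution of some correlate lies closer to real speech or to noise, then average across factors. TTSDS defines its own correlates and aggregation; the 1-D Wasserstein distance and 0-100 scaling below are just assumptions for illustration:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def factor_score(synth_feats, real_feats, noise_feats):
    """Score one factor (e.g. a prosody correlate) on a 0-100 scale by asking
    whether the synthetic distribution is closer to real speech or to noise."""
    d_real = wasserstein_distance(synth_feats, real_feats)
    d_noise = wasserstein_distance(synth_feats, noise_feats)
    return 100.0 * d_noise / (d_real + d_noise + 1e-12)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 2000)     # e.g. per-utterance pitch statistics on real speech
noise = rng.uniform(-5, 5, 2000)      # the same correlate computed on noise
synth = rng.normal(0.2, 1.1, 2000)    # synthetic speech, close to real here
scores = [factor_score(synth, real, noise)]   # one entry per factor in practice
print(sum(scores) / len(scores))      # unweighted average of factors, as in the paper
```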
https://voxblink2.github.io/
This is actually multilingual
Another nice thing: they used face recognition to improve speaker identification
If your Whisper has started translating instead of transcribing, it is likely due to an adversarial attack
https://arxiv.org/abs/2407.04482
Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models
Vyas Raina, Mark Gales
Speech enabled foundation models, either in the form of flexible speech recognition based systems or audio-prompted large language models (LLMs), are becoming increasingly popular. One of the interesting aspects of these models is their ability to perform tasks other than automatic speech recognition (ASR) using an appropriate prompt. For example, the OpenAI Whisper model can perform both speech transcription and speech translation. With the development of audio-prompted LLMs there is the potential for even greater control options. In this work we demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks. Without any access to the model prompt it is possible to modify the behaviour of the system by appropriately changing the audio input. To illustrate this risk, we demonstrate that it is possible to prepend a short universal adversarial acoustic segment to any input speech signal to override the prompt setting of an ASR foundation model. Specifically, we successfully use a universal adversarial acoustic segment to control Whisper to always perform speech translation, despite being set to perform speech transcription. Overall, this work demonstrates a new form of adversarial attack on multi-tasking speech enabled foundation models that needs to be considered prior to the deployment of this form of model.
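A heavily hedged sketch of the attack idea: optimize one short waveform prefix so that, prepended to any utterance, it pushes the model toward a fixed target output. The `model`, `DummyASR`, and all hyperparameters are placeholders; this is not the authors' code and not a working Whisper attack:

```python
import torch

def optimize_universal_prefix(model, batches, target_token_ids,
                              prefix_len=16000, steps=100, lr=1e-3):
    """Learn one short waveform segment that, prepended to *any* input audio,
    pushes `model` (a placeholder callable: waveform -> per-token logits)
    toward emitting a fixed target token sequence."""
    prefix = torch.zeros(prefix_len, requires_grad=True)
    opt = torch.optim.Adam([prefix], lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        for waveform in batches:                       # (T,) clean utterances
            adv_input = torch.cat([prefix, waveform])  # prepend the segment
            logits = model(adv_input)                  # (L, vocab)
            loss = loss_fn(logits, target_token_ids)   # force the target output
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():                      # keep the segment quiet
                prefix.clamp_(-0.02, 0.02)
    return prefix.detach()

# Toy usage with a dummy "model" that mean-pools the waveform into token logits.
class DummyASR(torch.nn.Module):
    def __init__(self, vocab=10, out_len=3):
        super().__init__()
        self.proj = torch.nn.Linear(1, vocab * out_len)
        self.vocab, self.out_len = vocab, out_len
    def forward(self, wav):
        pooled = wav.mean().view(1, 1)
        return self.proj(pooled).view(self.out_len, self.vocab)

dummy = DummyASR()
batches = [torch.randn(32000) * 0.1 for _ in range(4)]
target = torch.tensor([1, 2, 3])   # stand-in for "task = translate" tokens
prefix = optimize_universal_prefix(dummy, batches, target, steps=10)
print(prefix.abs().max())
```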
A nice paper with a few interesting details. The extra CTC head for stabilizing Whisper is interesting, for example.
https://arxiv.org/abs/2409.09543
Target Speaker ASR with Whisper
Alexander Polok, Dominik Klement, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget
We propose a novel approach to enable the use of large, single speaker ASR models, such as Whisper, for target speaker ASR. The key insight of this method is that it is much easier to model relative differences among speakers by learning to condition on frame-level diarization outputs, than to learn the space of all speaker embeddings. We find that adding even a single bias term per diarization output type before the first transformer block can transform single speaker ASR models into target speaker ASR models. Our target-speaker ASR model can be used for speaker attributed ASR by producing, in sequence, a transcript for each hypothesized speaker in a diarization output. This simplified model for speaker attributed ASR using only a single microphone outperforms cascades of speech separation and diarization by 11% absolute ORC-WER on the NOTSOFAR-1 dataset.
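The key idea is small enough to sketch: a learned bias per diarization label, added to the frame-level features before the first transformer block. The label set and dimensions below are illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn as nn

class DiarizationConditioning(nn.Module):
    """Add a learned bias per diarization label (e.g. 0=silence, 1=target
    speaker, 2=other speaker, 3=overlap) to frame-level encoder features,
    so a single-speaker model can be steered toward the target speaker."""

    def __init__(self, d_model=512, n_labels=4):
        super().__init__()
        self.bias = nn.Embedding(n_labels, d_model)
        nn.init.zeros_(self.bias.weight)   # start as a no-op on the frozen model

    def forward(self, frame_feats, diar_labels):
        # frame_feats: (B, T, d_model), diar_labels: (B, T) integer labels
        return frame_feats + self.bias(diar_labels)

cond = DiarizationConditioning()
feats = torch.randn(2, 100, 512)
labels = torch.randint(0, 4, (2, 100))
print(cond(feats, labels).shape)  # torch.Size([2, 100, 512])
```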
The name speaks for itself
https://github.com/yakami129/VirtualWife
Some interesting ideas here, for example
We can see that ByT5 is critical for generating coherent speech. A potential explanation could be that ByT5 recognizes individual characters, whereas BERT relies on subword tokenization techniques (such as byte-pair encoding). Since words that are spelled similarly often sound alike, the ability to discern characters can enhance the model’s ability […]
https://arxiv.org/abs/2406.02328
https://github.com/yangdongchao/SimpleSpeech
SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models
Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng
In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simplicity shows in three aspects: (1) It can be trained on a speech-only dataset, without any alignment information; (2) It directly takes plain text as input and generates speech in an NAR way; (3) It tries to model speech in a finite and compact latent space, which alleviates the modeling difficulty of diffusion. More specifically, we propose a novel speech codec model (SQ-Codec) with scalar quantization; SQ-Codec effectively maps the complex speech signal into a finite and compact latent space, named scalar latent space. Benefiting from SQ-Codec, we apply a novel transformer diffusion model in the scalar latent space of SQ-Codec. We train SimpleSpeech on 4k hours of a speech-only dataset; it shows natural prosody and voice cloning ability. Compared with previous large-scale TTS models, it presents significant speech quality and generation speed improvements. Demos are released.
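A sketch of scalar quantization with a straight-through estimator, which is the flavor of trick SQ-Codec builds on: bound each latent dimension and snap it to a small, fixed set of values. The level count and range are illustrative, not the paper's settings:

```python
import torch

def scalar_quantize(z, n_levels=9, value_range=1.0):
    """Squash each latent dimension into a bounded range and round it to one
    of a few fixed values, giving a finite and compact latent space.
    A straight-through estimator keeps the operation differentiable."""
    z = torch.tanh(z) * value_range                   # bound the latent
    step = 2 * value_range / (n_levels - 1)
    z_q = torch.round(z / step) * step                # snap to a finite grid
    return z + (z_q - z).detach()                     # straight-through gradient

latent = torch.randn(2, 16, 50, requires_grad=True)   # (B, dim, frames)
quantized = scalar_quantize(latent)
print(torch.unique(quantized.detach()).numel(), "distinct values used")
quantized.sum().backward()                            # gradients flow to `latent`
print(latent.grad is not None)
```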
Standard speech LLM with some tricks
https://twitter.com/lileics/status/1824919329068228979
The CMU team took first place in the simultaneous translation track (English to German) at IWSLT 2024
https://www.arxiv.org/abs/2408.07452
CMU's IWSLT 2024 Simultaneous Speech Translation System
Xi Xu, Siqi Ouyang, Brian Yan, Patrick Fernandes, William Chen, Lei Li, Graham Neubig, Shinji Watanabe
This paper describes CMU's submission to the IWSLT 2024 Simultaneous Speech Translation task. Our end-to-end speech-to-text (ST) system integrates the WavLM speech encoder, a modality adapter, and the Llama2-7B-Base model as the decoder. ... Both stages are trained on MuST-C v2 data with cross-entropy loss. We adapt our offline ST model for SST using a simple fixed hold-n policy. Experiments show that our model obtains an offline BLEU score of 31.1 and a BLEU score of 29.5 under 2 seconds latency on the MuST-C-v2 tst-COMMON.
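The fixed hold-n policy is simple enough to sketch: at each step, commit every token of the current offline hypothesis except the last n, which may still change as more audio arrives. Values and data below are toy assumptions:

```python
def hold_n_policy(partial_hypothesis, committed, n=3):
    """Commit all tokens of the growing hypothesis except the last n.
    `committed` is the prefix already emitted; n=3 is an illustrative value."""
    stable = partial_hypothesis[:max(len(partial_hypothesis) - n, 0)]
    newly_emitted = stable[len(committed):]
    return committed + newly_emitted, newly_emitted

# Toy simulation: the offline model's hypothesis grows as audio streams in.
hyps = [
    ["Das"],
    ["Das", "ist"],
    ["Das", "ist", "ein"],
    ["Das", "ist", "ein", "kleiner"],
    ["Das", "ist", "ein", "kleiner", "Test", "."],
]
committed = []
for hyp in hyps:
    committed, emitted = hold_n_policy(hyp, committed)
    print(f"hypothesis={hyp} -> emit {emitted}")
```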
As expected, Whisper tops other general-purpose recognizers at lyrics transcription, but specialized systems still beat it
https://audioshake.github.io/jam-alt/
https://huggingface.co/datasets/audioshake/jam-alt
https://arxiv.org/abs/2408.06370
Lyrics Transcription for Humans: A Readability-Aware Benchmark
Ondřej Cífka, Hendrik Schreiber, Luke Miner, Fabian-Robert Stöter
Writing down lyrics for human consumption involves not only accurately capturing word sequences, but also incorporating punctuation and formatting for clarity and to convey contextual information. This includes song structure, emotional emphasis, and contrast between lead and background vocals. While automatic lyrics transcription (ALT) systems have advanced beyond producing unstructured strings of words and are able to draw on wider context, ALT benchmarks have not kept pace and continue to focus exclusively on words. To address this gap, we introduce Jam-ALT, a comprehensive lyrics transcription benchmark. The benchmark features a complete revision of the JamendoLyrics dataset, in adherence to industry standards for lyrics transcription and formatting, along with evaluation metrics designed to capture and assess the lyric-specific nuances, laying the foundation for improving the readability of lyrics. We apply the benchmark to recent transcription systems and present additional error analysis, as well as an experimental comparison with a classical music dataset.
From the new Ultravox release
https://github.com/fixie-ai/ultravox/discussions/78
In addition to increasing the overall size of the training set, v0.3 also introduces two other important changes. The first is that we’re augmenting the ASR data sets with synthetic data in the form of generated continuations. The second change is that we’ve migrated to a Knowledge Distillation approach for calculating loss. Combined, both of these approaches result in much higher speech to text alignment in the adapter. You can learn more in their respective papers.
This paper seems helpful
https://arxiv.org/abs/2405.19041
BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation
Chen Wang, Minpeng Liao, Zhongqiang Huang, Jiajun Zhang
Recent end-to-end approaches have shown promise in extending large language models (LLMs) to speech inputs, but face limitations in directly assessing and optimizing alignment quality and fail to achieve fine-grained alignment due to speech-text length mismatch. We introduce BLSP-KD, a novel approach for Bootstrapping Language-Speech Pretraining via Knowledge Distillation, which addresses these limitations through two key techniques. First, it optimizes speech-text alignment by minimizing the divergence between the LLM's next-token prediction distributions for speech and text inputs using knowledge distillation. Second, it employs a continuous integrate-and-fire strategy to segment speech into tokens that correspond one-to-one with text tokens, enabling fine-grained alignment. We also introduce Partial LoRA (PLoRA), a new adaptation method supporting LLM finetuning for speech inputs under knowledge distillation. Quantitative evaluation shows that BLSP-KD outperforms previous end-to-end baselines and cascaded systems with comparable scale of parameters, facilitating general instruction-following capabilities for LLMs with speech inputs. This approach provides new possibilities for extending LLMs to spoken language interactions.
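A sketch of the distillation objective: match the LLM's next-token distributions for speech input (student) against those for the corresponding text input (teacher). It assumes speech has already been segmented into token-synchronous units (the continuous integrate-and-fire module's job in the paper), so both logit tensors share the same length; shapes and temperature are illustrative:

```python
import torch
import torch.nn.functional as F

def kd_alignment_loss(speech_logits, text_logits, temperature=1.0):
    """KL divergence between the LLM's next-token distribution for the speech
    input (student) and for the matching text input (teacher)."""
    t_probs = F.softmax(text_logits / temperature, dim=-1)           # teacher: text input
    s_logprobs = F.log_softmax(speech_logits / temperature, dim=-1)  # student: speech input
    return F.kl_div(s_logprobs, t_probs, reduction="batchmean") * temperature ** 2

speech_logits = torch.randn(2, 10, 32000, requires_grad=True)  # (B, L, vocab)
text_logits = torch.randn(2, 10, 32000)
loss = kd_alignment_loss(speech_logits, text_logits)
loss.backward()
print(loss.item())
```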
Qwen2-Audio-7B, the next version of Qwen-Audio, is out: it accepts audio and text inputs and generates text outputs!
Demo: https://huggingface.co/spaces/Qwen/Qwen2-Audio-Instruct-Demo
https://github.com/RicherMans/Dasheng
This repo provides checkpoints for the Interspeech 2024 paper Scaling up masked audio encoder learning for general audio classification. The goal of this work is to investigate the scalability of masked autoencoders for audio. Prior work did not scale beyond 10,000 hours of audio, while Dasheng used 272,000 hours of training data.
https://diva-audio.github.io/
[TL;DR] DiVA Llama 3 outperforms existing Speech LMs on QA, Emotion Recognition, and Translation with a speech encoder trained using only weak supervision. DiVA learns to encode speech while preserving the underlying LLM output distribution using cross-modal context distillation between text and speech. DiVA was trained with open-source code in Levanter on 3.5k hours of publicly available and permissively licensed ASR data from Common Voice.
https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
The recently released Llama 3.1 paper has a big section on speech understanding: speech adapter, prosody modeling, speech generation, etc. An interesting overview of the current tech.
A wav2vec encoder pretrained on 15 million hours of audio and finetuned on 230k hours of transcribed data, for example
Start with page 63
https://huggingface.co/spaces/AudioLLMs/AudioBench-Leaderboard
https://arxiv.org/abs/2406.16020
https://github.com/AudioLLMs/AudioBench
The paper is interesting but has many arguable points. For example, the authors see no correlation between objective measures and the Arena score and propose adding extra scores to fit the Arena score. Instead, one could conclude that side-by-side evaluation by non-experts is simply broken.
CMU lectures from Shinji Watanabe
[Fall 2023] Speech Recognition and Understanding
https://www.youtube.com/playlist?list=PLfVqr2l0FG-tW8d5ZSz-_tCgQed_F1ndb
An interesting part about RNN-T vs. attention, where Shinji argues for turning back to CTC decoding instead of RNN-T. A valid point these days
https://youtu.be/BQBOu9BOFpc?list=PLfVqr2l0FG-tW8d5ZSz-_tCgQed_F1ndb&t=2585
Not yet released
https://github.com/QwenLM/Qwen2-Audio
The latest progress of Qwen-Audio: a large-scale audio-language model called Qwen2-Audio, capable of accepting various audio signal inputs and performing audio analysis or responding directly in text to speech instructions. Two distinct audio interaction modes are introduced:
voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input;
audio analysis: users could provide audio and text instructions for analysis during the interaction;
We are going to release two models of the Qwen2-Audio series soon: Qwen2-Audio and Qwen2-Audio-Chat.
The original paper
https://arxiv.org/abs/2407.08551
Autoregressive Speech Synthesis without Vector Quantization
Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition, bypassing the need for vector quantization, which was originally designed for audio compression and sacrifices fidelity compared to mel-spectrograms. Specifically, (i) instead of cross-entropy loss, we apply regression loss with a proposed spectrogram flux loss function to model the probability distribution of the continuous-valued tokens. (ii) we have incorporated variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language models VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling discrete codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. See this https URL for demos of our work.
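A hedged sketch of what a regression-plus-flux objective on continuous mel frames could look like; this is only one plausible reading of the spectrogram flux loss (matching frame-to-frame changes), not the paper's exact definition, and all weights and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def melle_style_loss(pred_mel, target_mel, flux_weight=0.5):
    """Regression term on mel frames plus a 'flux' term on frame-to-frame
    changes. Illustrative only; not the paper's exact formulation."""
    reg = F.l1_loss(pred_mel, target_mel) + F.mse_loss(pred_mel, target_mel)
    pred_flux = pred_mel[:, 1:, :] - pred_mel[:, :-1, :]        # frame deltas
    target_flux = target_mel[:, 1:, :] - target_mel[:, :-1, :]
    flux = F.l1_loss(pred_flux, target_flux)
    return reg + flux_weight * flux

pred = torch.randn(2, 200, 80, requires_grad=True)   # (B, frames, n_mels)
target = torch.randn(2, 200, 80)
loss = melle_style_loss(pred, target)
loss.backward()
print(loss.item())
```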