Telegram channel speechtech - Speech Technology

Speech Technology

An interesting multichannel system with speaker separation, based on NeMo FastConformer

https://github.com/facebookresearch/MMCSG

We prepend a fixed beamformer module before feature extraction in the model. The beamformer takes all 7 input channels and outputs 13 beams: 12 different directions around the wearer of the glasses, plus one beam pointed towards the mouth of the wearer.

The input convolutional layer of the pre-trained model encoder is extended to accept all 13 beams at the input.

The tokenizer of the pretrained model is extended to include two speaker tokens, »0 and »1, for the SELF/OTHER speakers, i.e. the wearer of the glasses and the conversational partner. The corresponding input and output layers are extended to process these two new tokens.

The extended model is finetuned on the chunks prepared in the previous step.
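
A minimal PyTorch sketch of the two "surgery" steps above (hypothetical helper names, not the actual MMCSG baseline code): widening the encoder's first convolution to accept 13 beam channels, and growing the decoder token tables by the two speaker tokens.

```python
import torch
import torch.nn as nn

def extend_input_conv(conv: nn.Conv2d, new_in_channels: int = 13) -> nn.Conv2d:
    """Widen the encoder's first convolution from a single input channel to
    `new_in_channels` beam channels, reusing the pretrained kernel."""
    new_conv = nn.Conv2d(new_in_channels, conv.out_channels,
                         conv.kernel_size, conv.stride, conv.padding)
    with torch.no_grad():
        # Tile the single-channel pretrained kernel over all beams and rescale
        # so the initial activations keep roughly the same magnitude.
        new_conv.weight.copy_(conv.weight.repeat(1, new_in_channels, 1, 1) / new_in_channels)
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

def extend_token_tables(embedding: nn.Embedding, output: nn.Linear, num_new: int = 2):
    """Add rows for the two speaker tokens (»0 = SELF, »1 = OTHER) to the
    decoder embedding table and the output projection."""
    old_vocab, dim = embedding.weight.shape
    new_emb = nn.Embedding(old_vocab + num_new, dim)
    new_out = nn.Linear(dim, old_vocab + num_new)
    with torch.no_grad():
        new_emb.weight[:old_vocab] = embedding.weight
        new_out.weight[:old_vocab] = output.weight
        if output.bias is not None:
            new_out.bias[:old_vocab] = output.bias
    return new_emb, new_out
```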

Speech Technology

https://twitter.com/realmrfakename/status/1761482183745912903

Today, I’m thrilled to release a project I’ve been working on for the past couple weeks in collaboration with Hugging Face: the TTS Arena. The TTS Arena, inspired by LMSys's Chatbot Arena, allows you to enter text which will be synthesized by two SOTA models. You can then vote on which model generated a better sample. The results will be published on a publicly-accessible leaderboard. We’ve added several open access models, including Pheme, MetaVoice, XTTS, OpenVoice, & WhisperSpeech. It also includes the proprietary ElevenLabs model.

https://huggingface.co/spaces/TTS-AGI/TTS-Arena

Speech Technology

Everyone is starring this one today:

https://github.com/hubertsiuzdak/snac

Speech Technology

Vicuna is the best LLM for ASR: WER 1.9 on LibriSpeech test-clean

https://arxiv.org/abs/2402.08846

An Embarrassingly Simple Approach for LLM with Strong ASR Capacity

Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen

In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM. We found that delicate designs are not necessary, while an embarrassingly simple composition of off-the-shelf speech encoder, LLM, and the only trainable linear projector is competent for the ASR task. To be more specific, we benchmark and explore various combinations of LLMs and speech encoders, leading to the optimal LLM-based ASR system, which we call SLAM-ASR. The proposed SLAM-ASR provides a clean setup and little task-specific design, where only the linear projector is trained. To the best of our knowledge, SLAM-ASR achieves the best performance on the Librispeech benchmark among LLM-based ASR models and even outperforms the latest LLM-based audio-universal model trained on massive pair data. Finally, we explore the capability emergence of LLM-based ASR in the process of modal alignment. We hope that our study can facilitate the research on extending LLM with cross-modality capacity and shed light on the LLM-based ASR community.
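
The recipe is simple enough to sketch: a frozen speech encoder, a frozen LLM, and a single trainable linear projector that maps (optionally frame-stacked) encoder outputs into the LLM embedding space. Dimensions and the stacking factor below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """The only trainable piece: stack a few adjacent encoder frames to shorten
    the sequence, then map them into the LLM's token-embedding space."""
    def __init__(self, enc_dim=1280, llm_dim=4096, stack=4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Linear(enc_dim * stack, llm_dim)

    def forward(self, feats):                        # (B, T, enc_dim)
        B, T, D = feats.shape
        T = T - T % self.stack                       # drop the ragged tail
        x = feats[:, :T].reshape(B, T // self.stack, D * self.stack)
        return self.proj(x)                          # (B, T', llm_dim)

# Conceptual training step: place the projected speech frames in front of the
# prompt token embeddings, run the frozen LLM, and apply the usual next-token
# loss on the transcript. Only the projector's parameters receive gradients.
```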

Speech Technology

Honestly, the samples sound a bit plain and robotic

From Amazon

https://amazon-ltts-paper.com/

https://arxiv.org/abs/2402.08093

BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at this https URL.
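
A rough sketch of the two-stage pipeline the abstract describes, with entirely hypothetical interfaces: an autoregressive model emits discrete speechcodes, and a convolutional decoder turns them into waveform chunk by chunk, which is what makes the system streamable.

```python
import torch

@torch.inference_mode()
def stream_tts(text_ids, speechcode_lm, code_decoder, chunk=32):
    """Yield audio chunk by chunk while the autoregressive model is still
    emitting speechcodes. `speechcode_lm.generate_codes` and `code_decoder`
    are stand-ins, not the BASE TTS API."""
    pending = []
    for code in speechcode_lm.generate_codes(text_ids):   # assumed generator
        pending.append(code)
        if len(pending) == chunk:
            yield code_decoder(torch.tensor(pending))      # waveform chunk
            pending = []
    if pending:
        yield code_decoder(torch.tensor(pending))
```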

Speech Technology

The prosody problem: a blog post from Papercup, a speech translation service

https://www.papercup.com/blog/realistic-synthetic-voices

Speech Technology

https://github.com/LAION-AI/natural_voice_assistant

Speech Technology

And upcoming SpeechLLM TTS models from Nvidia too, based on T5 and Megatron

https://github.com/NVIDIA/NeMo/pull/8364

Speech Technology

Modern TTS definitely misses the idea of syntax

https://github.com/shinhyeokoh/rwen

https://arxiv.org/abs/2212.07939

RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis

Shinhyeok Oh, HyeongRae Noh, Yoonseok Hong, Insoo Oh

With the advent of deep learning, a huge number of text-to-speech (TTS) models which produce human-like speech have emerged. Recently, by introducing syntactic and semantic information w.r.t the input text, various approaches have been proposed to enrich the naturalness and expressiveness of TTS models. Although these strategies showed impressive results, they still have some limitations in utilizing language information. First, most approaches only use graph networks to utilize syntactic and semantic information without considering linguistic features. Second, most previous works do not explicitly consider adjacent words when encoding syntactic and semantic information, even though it is obvious that adjacent words are usually meaningful when encoding the current word. To address these issues, we propose Relation-aware Word Encoding Network (RWEN), which effectively allows syntactic and semantic information based on two modules (i.e., Semantic-level Relation Encoding and Adjacent Word Relation Encoding). Experimental results show substantial improvements compared to previous works.
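
As a rough illustration of the second point, explicitly encoding a word from its adjacent words, here is a generic local-window self-attention sketch; it is not the paper's actual RWEN modules, which also bring in syntactic and semantic relations.

```python
import torch
import torch.nn as nn

class AdjacentWordEncoder(nn.Module):
    """Self-attention restricted to a local window, so each word is encoded
    from itself and its immediate neighbours."""
    def __init__(self, dim=256, heads=4, window=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, words):                     # (B, N, dim) word embeddings
        n = words.size(1)
        idx = torch.arange(n, device=words.device)
        # True = position is masked out; keep only neighbours within `window`.
        mask = (idx[None, :] - idx[:, None]).abs() > self.window
        out, _ = self.attn(words, words, words, attn_mask=mask)
        return out
```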

Speech Technology

Our latest breakthrough in speech synthesis – ParrotTTS! 🚀

Developed in collaboration with IIIT Hyderabad and TCS Research, ParrotTTS efficiently transforms text into speech, showcasing remarkable adaptability and language transfer capabilities.

Key Features:
1️⃣ Multi-speaker variant training using transcripts from a single speaker.
2️⃣ Swift adaptation to new languages with just 5 hours of paired data in extremely low-resource settings.
3️⃣ Language transfer without bilingual or parallel examples, preserving speaker-specific characteristics.

ParrotTTS attains SOTA results in an extremely low-resource, multi-lingual setup covering 6 languages (Hindi, Marathi, German, Spanish, French, English). It outperforms various baselines, including Fastspeech2 (a pioneering model from Microsoft Research) using only 30 hours of paired data across 6 languages.

Check out our results and learn more at: https://parrot-tts.github.io/tts/

Kudos to the incredible team behind this innovation:
Saiteja Kosgi Vishal Tambrahalli Neha Sahipjohn Anil Nelakanti Vineet Gandhi

ParrotTTS has been accepted at EACL 2024! 🌟🎉

Speech Technology

An interesting non-autoregressive model landed in ESPnet

https://github.com/espnet/espnet/pull/5363

https://arxiv.org/abs/2010.14233

Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment

Ethan A. Chi, Julian Salazar, Katrin Kirchhoff

Non-autoregressive models greatly improve decoding speed over typical sequence-to-sequence models, but suffer from degraded performance. Infilling and iterative refinement models make up some of this gap by editing the outputs of a non-autoregressive model, but are constrained in the edits that they can make. We propose iterative realignment, where refinements occur over latent alignments rather than output sequence space. We demonstrate this in speech recognition with Align-Refine, an end-to-end Transformer-based model which refines connectionist temporal classification (CTC) alignments to allow length-changing insertions and deletions. Align-Refine outperforms Imputer and Mask-CTC, matching an autoregressive baseline on WSJ at 1/14th the real-time factor and attaining a LibriSpeech test-other WER of 9.0% without an LM. Our model is strong even in one iteration with a shallower decoder.
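
The decoding idea is easy to sketch: start from a CTC alignment (one label or blank per encoder frame), let a decoder re-predict a full alignment conditioned on the previous one, and iterate a fixed number of times before collapsing. The model interface below is a hypothetical stand-in, not the espnet implementation.

```python
import torch

@torch.inference_mode()
def align_refine_decode(encoder_out, ctc_head, refiner, num_iters=3, blank=0):
    """Iterative realignment: each pass re-emits an entire frame-level
    alignment, so insertions and deletions amount to flipping frames between
    blank and non-blank rather than editing the output sequence directly."""
    # Initial alignment from greedy CTC (an assumption; any alignment works).
    alignment = ctc_head(encoder_out).argmax(dim=-1)          # (B, T)
    for _ in range(num_iters):
        # `refiner` stands in for the non-autoregressive decoder that
        # conditions on encoder states and the previous alignment.
        alignment = refiner(encoder_out, alignment).argmax(dim=-1)
    return collapse_ctc(alignment, blank)

def collapse_ctc(alignment, blank=0):
    """Standard CTC collapse: merge repeats, then drop blanks."""
    hyps = []
    for seq in alignment.tolist():
        out, prev = [], None
        for tok in seq:
            if tok != prev and tok != blank:
                out.append(tok)
            prev = tok
        hyps.append(out)
    return hyps
```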

Speech Technology

https://www.assemblyai.com/blog/conformer-2/

Speech Technology

The talk is really nice and touches on many hot problems in modern speech tech:

1. RNN-T models are fast but don't really work for rare words. A deeper LM integration is needed; LODR-like integration helps (a scoring sketch follows this list). A rare-word WER metric is needed too.
2. Modern transducers are very bad at finding the true alignment; they win in accuracy by pushing everything towards the end.
3. Streaming speech recognition is about 2x less accurate. Google hopes to recover more than 50% of the gap with more advanced neural network architectures.
4. Self-supervised training does not really work, as Google sees it. They propose their own loss, more focused on ASR, instead of a contrastive loss.
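
A hedged sketch of what an LODR-like combination looks like during beam search: the external LM score is added and a low-order estimate of the transducer's internal LM is subtracted, so the external LM is not double-counted. The interface and weights below are illustrative assumptions.

```python
def lodr_score(logp_transducer, hyp_tokens, ext_lm, low_order_ilm,
               lam_ext=0.4, lam_ilm=0.2):
    """Score one beam-search hypothesis LODR-style. `ext_lm` and
    `low_order_ilm` are stand-ins exposing a log_prob() method (e.g. a large
    neural LM and a bigram trained on transcripts); the lambda weights are
    illustrative, not tuned values."""
    return (logp_transducer
            + lam_ext * ext_lm.log_prob(hyp_tokens)
            - lam_ilm * low_order_ilm.log_prob(hyp_tokens))
```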

Some extra points discussed:

1. Are blank states harmful?
2. Is it possible to include intonation and other emotions in the lattice representation?
3. WER is not the right metric for streaming either.

Afterthought: a lot of the things we are doing now could fundamentally change in the future.

Speech Technology

https://arxiv.org/abs/2307.03917

On decoder-only architecture for speech-to-text and large language model integration

Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu

Microsoft

Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has not been explored well. The "decoder-only" architecture has also not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification and a simple audio encoder to map the compressed acoustic features to the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller scale randomly initialized speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
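
One simple variant of the CTC-based compression idea is easy to sketch: drop frames the CTC head labels as blank, then project what is left into the LLM embedding space. Names and details below are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CTCCompressor(nn.Module):
    """Use CTC blank posteriors to discard uninformative frames before
    mapping the remaining acoustic features into the LLM's embedding space."""
    def __init__(self, feat_dim=512, llm_dim=4096, vocab=5000, blank=0):
        super().__init__()
        self.ctc_head = nn.Linear(feat_dim, vocab)
        self.proj = nn.Linear(feat_dim, llm_dim)
        self.blank = blank

    def forward(self, feats):                      # (T, feat_dim), one utterance
        post = self.ctc_head(feats).softmax(dim=-1)
        keep = post.argmax(dim=-1) != self.blank   # frames not predicted blank
        return self.proj(feats[keep])              # (T_kept, llm_dim)
```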

Speech Technology

This talk focuses on some foundational problems in practical speech recognition and discusses some solutions for each of these problems.

https://www.youtube.com/watch?v=Y6s4EzDTAwA

Speech Technology

This kind of testing is bad for many reasons, but let's respect the social aspects. Common Voice is a good example here.

Speech Technology

https://github.com/DigitalPhonetics/VoicePAT

Speech Technology

https://twitter.com/AIatMeta/status/1760025535621824776

https://ai.meta.com/datasets/mmcsg-dataset/

MMCSG Dataset

The MMCSG (Multi-Modal Conversations in Smart Glasses) dataset comprises two-sided conversations recorded using Aria glasses, featuring multi-modal data such as multi-channel audio, video, accelerometer, and gyroscope measurements. This dataset is suitable for research in areas like automatic speech recognition, activity detection, and speaker diarization.

Speech Technology

Dropping diffusion was not a great idea: the voice is not crystal clear anymore, it sounds more like a telephone recording.

Speech Technology

Based on Mamba/RWKV

https://github.com/theodorblackbird/lina-speech

Speech Technology

NeMo Canary 1B by NVIDIA

> Tops the Open ASR Leaderboard.
> Beats Whisper to the punch for ASR.
> Beats Seamless M4Tv2 for Speech Translation.
> Supports 4 languages - English, Spanish, French & German.

> Trained on 85,000 hours of annotated audio.
> Encoder-Decoder Architecture
> Fast-Conformer Encoder

https://huggingface.co/spaces/nvidia/canary-1b
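
A usage sketch for trying the model locally via NeMo, based on the model card's interface at the time of writing; argument names have changed across NeMo versions, so treat this as an approximation rather than a definitive recipe.

```python
# pip install "nemo_toolkit[asr]"  (see the model card for the exact version)
from nemo.collections.asr.models import EncDecMultiTaskModel

# Canary-1b is distributed as a multi-task encoder-decoder NeMo model.
canary = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")

# Plain English transcription of 16 kHz mono audio; for speech translation,
# task and target language are selected via an input manifest (see model card).
hyps = canary.transcribe(["sample_16k_mono.wav"], batch_size=4)
print(hyps[0])
```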

Speech Technology

https://www.youtube.com/watch?v=G8it4LGtcuo

Speech Technology

TTS from Nvidia (paper from 2023)

https://github.com/NVIDIA/RAD-MMM

Multilingual Multiaccented Multispeaker TTS with RADTTS

Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro

We work to create a multilingual speech synthesis system which can generate speech with the proper accent while retaining the characteristics of an individual voice. This is challenging to do because it is expensive to obtain bilingual training data in multiple languages, and the lack of such data results in strong correlations that entangle speaker, language, and accent, resulting in poor transfer capabilities. To overcome this, we present a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS with explicit control over accent, language, speaker and fine-grained F0 and energy features. Our proposed model does not rely on bilingual training data. We demonstrate an ability to control synthesized accent for any speaker in an open-source dataset comprising of 7 accents. Human subjective evaluation demonstrates that our model can better retain a speaker's voice and accent quality than controlled baselines while synthesizing fluent speech in all target languages and accents in our dataset.

Speech Technology

https://github.com/facebookresearch/audioseal

Speech Technology

VoxBlink dataset

https://voxblink.github.io/

38k speakers

Speech Technology

https://github.com/speechbrain/speechbrain/tree/develop/recipes/ZaionEmotionDataset

https://arxiv.org/abs/2306.12991

Speech Emotion Diarization: Which Emotion Appears When?

Yingzhi Wang, Mirco Ravanelli, Alaa Nfissi, Alya Yacoubi

Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be considered as discrete speech events with definite temporal boundaries, rather than attributes of the entire utterance. To reflect the fine-grained nature of speech emotions, we propose a new task: Speech Emotion Diarization (SED). Just as Speaker Diarization answers the question of "Who speaks when?", Speech Emotion Diarization answers the question of "Which emotion appears when?". To facilitate the evaluation of the performance and establish a common benchmark for researchers, we introduce the Zaion Emotion Dataset (ZED), an openly accessible speech emotion dataset that includes non-acted emotions recorded in real-life conditions, along with manually-annotated boundaries of emotion segments within the utterance. We provide competitive baselines and open-source the code and the pre-trained models.

Speech Technology

https://github.com/ga642381/Speech-Prompts-Adapters

Speech Technology

In line with this work, we're open-sourcing a new dataset to help the broader community improve fairness of speech recognition models. The dataset includes ~27K utterances in recorded speech from 595 paid participants.

Dataset ➡️ https://ai.meta.com/datasets/speech-fairness-dataset/

https://twitter.com/MetaAI/status/1679525451667238913

Speech Technology

Speech restoration method Miipher (used to generate LibriTTS-R) has been accepted to WASPAA!! It converts degraded speech to studio quality, and generates almost inexhaustible training data for speech generation.

Demo: https://google.github.io/df-conformer/miipher/
Paper: https://arxiv.org/abs/2303.01664

The original thing is also nice:

LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus
Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, Ankur Bapna
This paper introduces a new speech dataset called "LibriTTS-R" designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved. Experimental results show that the LibriTTS-R ground-truth samples showed significantly improved sound quality compared to those in LibriTTS. In addition, neural end-to-end TTS trained with LibriTTS-R achieved speech naturalness on par with that of the ground-truth samples. The corpus is freely available for download from http://www.openslr.org/141/.

Speech Technology

While the experiment test sets are questionable, the overall direction is somewhat interesting

https://arxiv.org/abs/2307.04172

Can Generative Large Language Models Perform ASR Error Correction?

Rao Ma, Mengjie Qian, Potsawee Manakul, Mark Gales, Kate Knill

ASR error correction continues to serve as an important part of post-processing for speech recognition systems. Traditionally, these models are trained with supervised training using the decoding results of the underlying ASR system and the reference text. This approach is computationally intensive and the model needs to be re-trained when switching the underlying ASR model. Recent years have seen the development of large language models and their ability to perform natural language processing tasks in a zero-shot manner. In this paper, we take ChatGPT as an example to examine its ability to perform ASR error correction in the zero-shot or 1-shot settings. We use the ASR N-best list as model input and propose unconstrained error correction and N-best constrained error correction methods. Results on a Conformer-Transducer model and the pre-trained Whisper model show that we can largely improve the ASR system performance with error correction using the powerful ChatGPT model.
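
A rough sketch of what zero-shot, N-best constrained correction can look like in practice; the prompt wording is illustrative, not the paper's exact setup, and the LLM call itself is left as a stand-in.

```python
def build_nbest_correction_prompt(nbest, constrained=True):
    """Build a zero-shot prompt that feeds the ASR N-best list to an LLM.
    In the constrained variant the model must pick one hypothesis; in the
    unconstrained variant it may rewrite freely. Wording is illustrative."""
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    if constrained:
        task = ("Select the hypothesis that is most likely to be the correct "
                "transcription and output it unchanged.")
    else:
        task = ("Output the most likely correct transcription; you may fix "
                "errors by combining or editing the hypotheses.")
    return (f"These are the N-best hypotheses from a speech recognizer:\n"
            f"{hyps}\n{task}")

# corrected = some_llm(build_nbest_correction_prompt(nbest_list))  # stand-in call
```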
