Interesting multichannel system with speaker separation, based on the NeMo FastConformer
https://github.com/facebookresearch/MMCSG
We prepend a fixed beamformer module before feature extraction in the model. The beamformer takes all 7 input channels and outputs 13 beams --- 12 different directions around the wearer of the glasses, plus one beam pointed towards the mouth of the wearer.
The input convolutional layer of the pre-trained model encoder is extended to accept all 13 beams at the input.
The tokenizer of the pretrained model is extended with two speaker tokens, »0 and »1, marking the SELF/OTHER speakers, i.e. the wearer of the glasses and the conversational partner. The corresponding input and output layers are extended to handle these two new tokens.
The extended model is finetuned on the chunks prepared in the previous step.
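A minimal PyTorch sketch of the two surgery steps above, assuming the encoder's first convolution originally takes a single beam and that the speaker tokens are appended at the end of the vocabulary; the function names and layer shapes are illustrative, not the actual NeMo module paths:

```python
import torch
import torch.nn as nn

def widen_input_conv(conv: nn.Conv2d, n_beams: int = 13) -> nn.Conv2d:
    """Replace a Conv2d with in_channels=1 by one taking n_beams channels,
    spreading the pretrained single-channel kernel over all beams so the
    initial output roughly matches the original behaviour."""
    new_conv = nn.Conv2d(n_beams, conv.out_channels, conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight.copy_(conv.weight.repeat(1, n_beams, 1, 1) / n_beams)
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

def extend_vocab(embedding: nn.Embedding, n_new_tokens: int = 2) -> nn.Embedding:
    """Append rows for the new speaker tokens; the decoder's output projection
    would be grown the same way so the new tokens can also be predicted."""
    new_emb = nn.Embedding(embedding.num_embeddings + n_new_tokens,
                           embedding.embedding_dim)
    with torch.no_grad():
        new_emb.weight[:embedding.num_embeddings].copy_(embedding.weight)
    return new_emb
```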
https://twitter.com/realmrfakename/status/1761482183745912903
Today, I’m thrilled to release a project I’ve been working on for the past couple weeks in collaboration with Hugging Face: the TTS Arena. The TTS Arena, inspired by LMSys's Chatbot Arena, allows you to enter text which will be synthesized by two SOTA models. You can then vote on which model generated a better sample. The results will be published on a publicly-accessible leaderboard. We’ve added several open access models, including Pheme, MetaVoice, XTTS, OpenVoice, & WhisperSpeech. It also includes the proprietary ElevenLabs model.
https://huggingface.co/spaces/TTS-AGI/TTS-Arena
Everyone stars https://github.com/hubertsiuzdak/snac today
Vicuna is the best LLM for ASR: WER 1.9 on LibriSpeech test-clean
https://arxiv.org/abs/2402.08846
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen
In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM. We found that delicate designs are not necessary, while an embarrassingly simple composition of off-the-shelf speech encoder, LLM, and the only trainable linear projector is competent for the ASR task. To be more specific, we benchmark and explore various combinations of LLMs and speech encoders, leading to the optimal LLM-based ASR system, which we call SLAM-ASR. The proposed SLAM-ASR provides a clean setup and little task-specific design, where only the linear projector is trained. To the best of our knowledge, SLAM-ASR achieves the best performance on the Librispeech benchmark among LLM-based ASR models and even outperforms the latest LLM-based audio-universal model trained on massive pair data. Finally, we explore the capability emergence of LLM-based ASR in the process of modal alignment. We hope that our study can facilitate the research on extending LLM with cross-modality capacity and shed light on the LLM-based ASR community.
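A rough sketch of that recipe, assuming a frozen speech encoder and a frozen LLM, with only a linear projector (plus simple frame stacking) being trained; the class name and the downsampling factor are assumptions:

```python
import torch
import torch.nn as nn

class SpeechToLLMProjector(nn.Module):
    """The only trainable part: stack neighbouring frames to shorten the
    sequence, then map them linearly into the LLM embedding space."""
    def __init__(self, speech_dim: int, llm_dim: int, downsample: int = 5):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Linear(speech_dim * downsample, llm_dim)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, speech_dim) from a frozen encoder
        b, t, d = speech_feats.shape
        t = (t // self.downsample) * self.downsample
        stacked = speech_feats[:, :t].reshape(b, t // self.downsample,
                                              d * self.downsample)
        return self.proj(stacked)  # (batch, frames', llm_dim)

# Training idea: concatenate [projected speech; prompt embeddings; transcript
# embeddings] along the sequence axis and compute the usual next-token loss
# only on the transcript positions, keeping encoder and LLM frozen.
```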
Honestly, the samples sound a bit plain and robotic
From Amazon
https://amazon-ltts-paper.com/
https://arxiv.org/abs/2402.08093
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at this https URL.
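A toy skeleton of the two-stage pipeline the abstract describes (an autoregressive Transformer over discrete speechcodes, then a convolutional decoder to waveform); all sizes and module choices are made up for illustration:

```python
import torch
import torch.nn as nn

class SpeechcodeLM(nn.Module):
    """Autoregressive text -> speechcode model (GPT-style, causal mask)."""
    def __init__(self, n_text_tokens=256, n_speechcodes=8192, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(n_text_tokens + n_speechcodes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, n_speechcodes)

    def forward(self, tokens):  # tokens: (batch, seq) of text + code ids
        seq = tokens.size(1)
        causal = torch.triu(torch.full((seq, seq), float("-inf"),
                                       device=tokens.device), diagonal=1)
        return self.head(self.backbone(self.embed(tokens), mask=causal))

class SpeechcodeDecoder(nn.Module):
    """Convolutional speechcode -> waveform decoder (vocoder-like)."""
    def __init__(self, n_speechcodes=8192, d_model=512, upsample=256):
        super().__init__()
        self.embed = nn.Embedding(n_speechcodes, d_model)
        self.net = nn.Sequential(
            nn.ConvTranspose1d(d_model, 128, upsample, stride=upsample),
            nn.GELU(),
            nn.Conv1d(128, 1, kernel_size=7, padding=3),
            nn.Tanh())

    def forward(self, codes):  # codes: (batch, frames) of speechcode ids
        x = self.embed(codes).transpose(1, 2)  # (batch, d_model, frames)
        return self.net(x).squeeze(1)          # (batch, samples)
```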
The prosody problem: a blog post from Papercup, a speech translation service
https://www.papercup.com/blog/realistic-synthetic-voices
https://github.com/LAION-AI/natural_voice_assistant
And upcoming SpeechLLM TTS models from Nvidia too, based on T5 and Megatron
https://github.com/NVIDIA/NeMo/pull/8364
Modern TTS definitely misses the idea of syntax
https://github.com/shinhyeokoh/rwen
https://arxiv.org/abs/2212.07939
RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis
Shinhyeok Oh, HyeongRae Noh, Yoonseok Hong, Insoo Oh
With the advent of deep learning, a huge number of text-to-speech (TTS) models which produce human-like speech have emerged. Recently, by introducing syntactic and semantic information w.r.t the input text, various approaches have been proposed to enrich the naturalness and expressiveness of TTS models. Although these strategies showed impressive results, they still have some limitations in utilizing language information. First, most approaches only use graph networks to utilize syntactic and semantic information without considering linguistic features. Second, most previous works do not explicitly consider adjacent words when encoding syntactic and semantic information, even though it is obvious that adjacent words are usually meaningful when encoding the current word. To address these issues, we propose Relation-aware Word Encoding Network (RWEN), which effectively allows syntactic and semantic information based on two modules (i.e., Semantic-level Relation Encoding and Adjacent Word Relation Encoding). Experimental results show substantial improvements compared to previous works.
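Purely as an illustration of the adjacent-word idea (not the authors' implementation), one could fuse each word embedding with its immediate neighbours before the TTS encoder; the window size and fusion layer here are assumptions:

```python
import torch
import torch.nn as nn

class AdjacentWordEncoding(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, word_embs: torch.Tensor) -> torch.Tensor:
        # word_embs: (batch, n_words, dim)
        left = torch.roll(word_embs, shifts=1, dims=1)
        right = torch.roll(word_embs, shifts=-1, dims=1)
        left[:, 0] = 0.0    # no left neighbour for the first word
        right[:, -1] = 0.0  # no right neighbour for the last word
        return self.fuse(torch.cat([left, word_embs, right], dim=-1))
```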
Our latest breakthrough in speech synthesis – ParrotTTS! 🚀
Developed in collaboration with IIIT Hyderabad and TCS Research, ParrotTTS efficiently transforms text into speech, showcasing remarkable adaptability and language transfer capabilities.
Key Features:
1️⃣ Multi-speaker variant training using transcripts from a single speaker.
2️⃣ Swift adaptation to new languages with just 5 hours of paired data in extremely low-resource settings.
3️⃣ Language transfer without bilingual or parallel examples, preserving speaker-specific characteristics.
ParrotTTS attains SOTA results in an extremely low-resource, multi-lingual setup covering 6 languages (Hindi, Marathi, German, Spanish, French, English). It outperforms various baselines, including Fastspeech2 (a pioneering model from Microsoft Research) using only 30 hours of paired data across 6 languages.
Check out our results and learn more at: https://parrot-tts.github.io/tts/
Kudos to the incredible team behind this innovation:
Saiteja Kosgi Vishal Tambrahalli Neha Sahipjohn Anil Nelakanti Vineet Gandhi
ParrotTTS has been accepted at EACL 2024! 🌟🎉
Interesting non-autoregressive model landed in ESPnet
https://github.com/espnet/espnet/pull/5363
https://arxiv.org/abs/2010.14233
Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment
Ethan A. Chi, Julian Salazar, Katrin Kirchhoff
Non-autoregressive models greatly improve decoding speed over typical sequence-to-sequence models, but suffer from degraded performance. Infilling and iterative refinement models make up some of this gap by editing the outputs of a non-autoregressive model, but are constrained in the edits that they can make. We propose iterative realignment, where refinements occur over latent alignments rather than output sequence space. We demonstrate this in speech recognition with Align-Refine, an end-to-end Transformer-based model which refines connectionist temporal classification (CTC) alignments to allow length-changing insertions and deletions. Align-Refine outperforms Imputer and Mask-CTC, matching an autoregressive baseline on WSJ at 1/14th the real-time factor and attaining a LibriSpeech test-other WER of 9.0% without an LM. Our model is strong even in one iteration with a shallower decoder.
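A conceptual sketch of iterative realignment, assuming an encoder with a CTC head and a refinement decoder that re-emits a full frame-level alignment each pass; the model internals are placeholders:

```python
import torch

BLANK = 0

def ctc_collapse(alignment):
    """Standard CTC collapse: merge repeated labels, drop blanks."""
    out, prev = [], None
    for tok in alignment:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out

@torch.no_grad()
def align_refine_decode(encoder, ctc_head, refiner, feats, n_iters: int = 3):
    """feats: (frames, feat_dim) for one utterance; encoder/ctc_head/refiner
    are placeholder callables standing in for the actual model parts."""
    enc = encoder(feats)                    # (frames, d_model)
    alignment = ctc_head(enc).argmax(-1)    # initial frame-level CTC alignment
    for _ in range(n_iters):
        # the refiner sees encoder states plus the current alignment and
        # re-emits a full alignment; length changes happen implicitly by
        # moving labels into or out of blank frames
        alignment = refiner(enc, alignment).argmax(-1)
    return ctc_collapse(alignment.tolist())
```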
The talk is really nice and touches on many hot problems with modern tech:
1. RNNT models are fast but don't really work for rare words. A deeper integration of an LM is needed; LODR-like integration helps (see the sketch after these notes). A rare-word WER metric is required too.
2. Modern transducers are very bad at finding the true alignment; they win in accuracy by pushing everything towards the end.
3. Streaming speech recognition is about 2 times less accurate. Google hopes to close more than 50% of that gap with more advanced neural network architectures.
4. Self-supervised training does not really work, as Google sees it. They propose their own loss, more focused on ASR, instead of a contrastive loss.
Some extra points discussed:
1. Are blank states harmful?
2. Is it possible to include intonation and other emotions in the lattice representation?
3. WER is not the right metric for streaming either.
Afterthought: a lot of things we are doing now can fundamentally change in the future
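The sketch referenced in point 1: a hedged illustration of LODR-style rescoring, where an external LM is added and a low-order source-domain LM is subtracted to cancel the transducer's internal LM; the weights and the LM scoring interface are assumptions:

```python
def lodr_score(hyp_tokens,
               asr_logprob: float,   # log-score of the hypothesis from the RNNT beam
               ext_lm,               # external (rare-word aware) LM, assumed API
               low_order_lm,         # e.g. bigram LM on the ASR training text
               lm_weight: float = 0.4,
               lo_weight: float = 0.3,
               len_bonus: float = 0.5) -> float:
    """score = ASR + w1 * external LM - w2 * low-order LM + length bonus."""
    return (asr_logprob
            + lm_weight * ext_lm.logprob(hyp_tokens)
            - lo_weight * low_order_lm.logprob(hyp_tokens)
            + len_bonus * len(hyp_tokens))

# Rescoring an n-best list would then be:
# best = max(nbest, key=lambda h: lodr_score(h.tokens, h.score, ext_lm, lo_lm))
```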
https://arxiv.org/abs/2307.03917
On decoder-only architecture for speech-to-text and large language model integration
Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu
Microsoft
Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has not been explored well. The "decoder-only" architecture has also not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification and a simple audio encoder to map the compressed acoustic features to the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller scale randomly initialized speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
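A sketch of the CTC-based compression idea from the abstract, keeping one averaged frame per non-blank CTC segment before projecting into the LLM embedding space; this illustrates the idea rather than the paper's exact implementation:

```python
import torch

def ctc_compress(frames: torch.Tensor, ctc_logits: torch.Tensor,
                 blank_id: int = 0) -> torch.Tensor:
    """frames: (T, d) encoder outputs, ctc_logits: (T, vocab).
    Keep one averaged frame per consecutive non-blank CTC label run,
    dropping blank-dominated frames entirely."""
    labels = ctc_logits.argmax(-1)
    kept, bucket, prev = [], [], None
    for t, lab in enumerate(labels.tolist()):
        if lab != prev and bucket:
            kept.append(torch.stack(bucket).mean(0))
            bucket = []
        if lab != blank_id:
            bucket.append(frames[t])
        prev = lab
    if bucket:
        kept.append(torch.stack(bucket).mean(0))
    return torch.stack(kept) if kept else frames[:0]

# A small trainable projector (e.g. torch.nn.Linear(d_audio, d_llm)) would
# then map the compressed frames into the LLM's embedding space.
```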
This talk focuses on some foundational problems in practical speech recognition and discusses some solutions for each of these problems.
https://www.youtube.com/watch?v=Y6s4EzDTAwA
This kind of testing is bad for many reasons, but let's respect the social aspects. Common Voice is a good example here.
https://twitter.com/AIatMeta/status/1760025535621824776
https://ai.meta.com/datasets/mmcsg-dataset/
MMCSG Dataset
The MMCSG (Multi-Modal Conversations in Smart Glasses) dataset comprises two-sided conversations recorded using Aria glasses, featuring multi-modal data such as multi-channel audio, video, accelerometer, and gyroscope measurements. This dataset is suitable for research in areas like automatic speech recognition, activity detection, and speaker diarization.
Dropping diffusion was not a great idea: the voice is not crystal clear anymore, it sounds more like a telephony recording.
Based on Mamba/RWKV
https://github.com/theodorblackbird/lina-speech
NeMo Canary 1B by NVIDIA
> Tops the Open ASR Leaderboard.
> Beats Whisper to the punch for ASR.
> Beats Seamless M4Tv2 for Speech Translation.
> Supports 4 languages - English, Spanish, French & German.
> Trained on 85,000 hours of annotated audio.
> Encoder-Decoder Architecture
> Fast-Conformer Encoder
https://huggingface.co/spaces/nvidia/canary-1b
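A hedged usage sketch; the NeMo transcription API has changed across releases, so treat the class and argument names as assumptions and defer to the model card if they differ:

```python
# pip install "nemo_toolkit[asr]"
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")
# Plain English transcription; other languages and the translation tasks are
# selected via the prompt/task settings documented on the model card.
hypotheses = model.transcribe(["sample.wav"], batch_size=1)
print(hypotheses[0])
```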
TTS from Nvidia (paper from 2023)
https://github.com/NVIDIA/RAD-MMM
Multilingual Multiaccented Multispeaker TTS with RADTTS
Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro
We work to create a multilingual speech synthesis system which can generate speech with the proper accent while retaining the characteristics of an individual voice. This is challenging to do because it is expensive to obtain bilingual training data in multiple languages, and the lack of such data results in strong correlations that entangle speaker, language, and accent, resulting in poor transfer capabilities. To overcome this, we present a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS with explicit control over accent, language, speaker and fine-grained F0 and energy features. Our proposed model does not rely on bilingual training data. We demonstrate an ability to control synthesized accent for any speaker in an open-source dataset comprising of 7 accents. Human subjective evaluation demonstrates that our model can better retain a speaker's voice and accent quality than controlled baselines while synthesizing fluent speech in all target languages and accents in our dataset.
VoxBlink dataset
https://voxblink.github.io/
38k speakers
https://github.com/speechbrain/speechbrain/tree/develop/recipes/ZaionEmotionDataset
https://arxiv.org/abs/2306.12991
Speech Emotion Diarization: Which Emotion Appears When?
Yingzhi Wang, Mirco Ravanelli, Alaa Nfissi, Alya Yacoubi
Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be considered as discrete speech events with definite temporal boundaries, rather than attributes of the entire utterance. To reflect the fine-grained nature of speech emotions, we propose a new task: Speech Emotion Diarization (SED). Just as Speaker Diarization answers the question of "Who speaks when?", Speech Emotion Diarization answers the question of "Which emotion appears when?". To facilitate the evaluation of the performance and establish a common benchmark for researchers, we introduce the Zaion Emotion Dataset (ZED), an openly accessible speech emotion dataset that includes non-acted emotions recorded in real-life conditions, along with manually-annotated boundaries of emotion segments within the utterance. We provide competitive baselines and open-source the code and the pre-trained models.
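To make the task concrete, a toy example of what an SED output looks like and a naive frame-level comparison against a reference; this is not the paper's official metric, just an illustration:

```python
def frame_labels(segments, duration, step=0.01, default="neutral"):
    """segments: list of (start_sec, end_sec, emotion) tuples."""
    n = round(duration / step)
    labels = [default] * n
    for start, end, emotion in segments:
        for i in range(round(start / step), min(round(end / step), n)):
            labels[i] = emotion
    return labels

ref = [(0.0, 1.2, "neutral"), (1.2, 2.4, "angry")]   # annotated boundaries
hyp = [(0.0, 1.0, "neutral"), (1.0, 2.4, "angry")]   # system output
r, h = frame_labels(ref, 2.4), frame_labels(hyp, 2.4)
err = sum(a != b for a, b in zip(r, h)) / len(r)
print(f"frame-level emotion error: {err:.2%}")   # boundary mismatch around 1.0-1.2 s
```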
https://github.com/ga642381/Speech-Prompts-Adapters
In line with this work, we're open-sourcing a new dataset to help the broader community improve fairness of speech recognition models. The dataset includes ~27K utterances in recorded speech from 595 paid participants.
Dataset ➡️ https://ai.meta.com/datasets/speech-fairness-dataset/
https://twitter.com/MetaAI/status/1679525451667238913
Speech restoration method Miipher (used to generate LibriTTS-R) has been accepted to WASPAA!! It converts degraded speech to studio quality, and generates almost inexhaustible training data for speech generation.
Demo: https://google.github.io/df-conformer/miipher/
Paper: https://arxiv.org/abs/2303.01664
Original thing is also nice
LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus
Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, Ankur Bapna
This paper introduces a new speech dataset called ``LibriTTS-R'' designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved. Experimental results show that the LibriTTS-R ground-truth samples showed significantly improved sound quality compared to those in LibriTTS. In addition, neural end-to-end TTS trained with LibriTTS-R achieved speech naturalness on par with that of the ground-truth samples. The corpus is freely available for download from http://www.openslr.org/141/.
While the experiment test sets are questionable, the overall direction is somewhat interesting
https://arxiv.org/abs/2307.04172
Can Generative Large Language Models Perform ASR Error Correction?
Rao Ma, Mengjie Qian, Potsawee Manakul, Mark Gales, Kate Knill
ASR error correction continues to serve as an important part of post-processing for speech recognition systems. Traditionally, these models are trained with supervised training using the decoding results of the underlying ASR system and the reference text. This approach is computationally intensive and the model needs to be re-trained when switching the underlying ASR model. Recent years have seen the development of large language models and their ability to perform natural language processing tasks in a zero-shot manner. In this paper, we take ChatGPT as an example to examine its ability to perform ASR error correction in the zero-shot or 1-shot settings. We use the ASR N-best list as model input and propose unconstrained error correction and N-best constrained error correction methods. Results on a Conformer-Transducer model and the pre-trained Whisper model show that we can largely improve the ASR system performance with error correction using the powerful ChatGPT model.
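A small sketch of the prompting idea, building either the N-best constrained or the unconstrained correction prompt; the wording is an assumption and the actual LLM call is left abstract:

```python
def build_correction_prompt(nbest, constrained: bool = True) -> str:
    """nbest: list of hypothesis strings, best-first."""
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    if constrained:
        task = ("Pick the hypothesis most likely to be the correct "
                "transcription and output only its number.")
    else:
        task = ("Output a corrected transcription, fixing any recognition "
                "errors you can infer from the alternatives.")
    return ("These are the N-best hypotheses from a speech recognizer:\n"
            f"{hyps}\n{task}")

prompt = build_correction_prompt(
    ["i scream for ice cream", "eye scream for ice cream"], constrained=False)
# response = call_llm(prompt)   # e.g. ChatGPT in the zero-/one-shot setting
```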