Nice research on more transparent models is going on
https://github.com/lxy-peter/EfficientPunct
Huggingface is down because of this:
It took a while but we finally released our massive youtube speech dataset: https://huggingface.co/datasets/espnet/yodas. 370k hours across 140 languages.
https://twitter.com/chenwanch1/status/1762942313972592676
The size is about 100TB
TTS has come to the point where data has no labels
https://arxiv.org/abs/2310.16338
Generative Pre-training for Speech with Flow Matching
Alexander H. Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu
Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoder are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step toward this direction by showing a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggested a foundational model for generation tasks in speech can be built with generative pre-training.
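To make the recipe concrete, here is a minimal sketch of one conditional-flow-matching training step with a masked condition, in the spirit of SpeechFlow; the toy model, feature dimensions, and masking ratio are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

# Toy velocity-field model: predicts dx_t/dt given the noisy features, the time step,
# and a partially masked copy of the clean features (the condition).
class VelocityNet(nn.Module):
    def __init__(self, dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t, cond):
        t = t.expand(x_t.shape[0], x_t.shape[1], 1)          # broadcast time over frames
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_step(model, x1, mask_ratio=0.7):
    """One CFM step: x1 is a batch of clean speech features, shape (B, T, dim)."""
    x0 = torch.randn_like(x1)                                # noise sample
    t = torch.rand(x1.shape[0], 1, 1)                        # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                              # linear interpolation path
    target_v = x1 - x0                                       # ground-truth velocity on this path
    cond = x1 * (torch.rand(x1.shape[:2]).unsqueeze(-1) > mask_ratio)  # masked condition
    pred_v = model(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()                 # CFM regression loss

model = VelocityNet()
loss = flow_matching_step(model, torch.randn(4, 100, 80))
loss.backward()
```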
This kind of testing is bad for many reasons but let's respect the social aspects. Common Voice is a good example here.
https://twitter.com/AIatMeta/status/1760025535621824776
https://ai.meta.com/datasets/mmcsg-dataset/
MMCSG Dataset
The MMCSG (Multi-Modal Conversations in Smart Glasses) dataset comprises two-sided conversations recorded using Aria glasses, featuring multi-modal data such as multi-channel audio, video, accelerometer, and gyroscope measurements. This dataset is suitable for research in areas like automatic speech recognition, activity detection, and speaker diarization.
Dropping diffusion was not a great idea: the voice is not crystal clear anymore; it sounds more like a telephony recording.
Based on Mamba/RWKV
https://github.com/theodorblackbird/lina-speech
NeMo Canary 1B by NVIDIA
> Tops the Open ASR Leaderboard.
> Beats Whisper to the punch for ASR.
> Beats Seamless M4Tv2 for Speech Translation.
> Supports 4 languages - English, Spanish, French & German.
> Trained on 85,000 hours of annotated audio.
> Encoder-Decoder Architecture
> Fast-Conformer Encoder
https://huggingface.co/spaces/nvidia/canary-1b
TTS from Nvidia (paper from 2023)
https://github.com/NVIDIA/RAD-MMM
Multilingual Multiaccented Multispeaker TTS with RADTTS
Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro
We work to create a multilingual speech synthesis system which can generate speech with the proper accent while retaining the characteristics of an individual voice. This is challenging to do because it is expensive to obtain bilingual training data in multiple languages, and the lack of such data results in strong correlations that entangle speaker, language, and accent, resulting in poor transfer capabilities. To overcome this, we present a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS with explicit control over accent, language, speaker and fine-grained F0 and energy features. Our proposed model does not rely on bilingual training data. We demonstrate an ability to control synthesized accent for any speaker in an open-source dataset comprising 7 accents. Human subjective evaluation demonstrates that our model can better retain a speaker's voice and accent quality than controlled baselines while synthesizing fluent speech in all target languages and accents in our dataset.
VoxBlink dataset
https://voxblink.github.io/
38k speakers
https://github.com/speechbrain/speechbrain/tree/develop/recipes/ZaionEmotionDataset
https://arxiv.org/abs/2306.12991
Speech Emotion Diarization: Which Emotion Appears When?
Yingzhi Wang, Mirco Ravanelli, Alaa Nfissi, Alya Yacoubi
Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be considered as discrete speech events with definite temporal boundaries, rather than attributes of the entire utterance. To reflect the fine-grained nature of speech emotions, we propose a new task: Speech Emotion Diarization (SED). Just as Speaker Diarization answers the question of "Who speaks when?", Speech Emotion Diarization answers the question of "Which emotion appears when?". To facilitate the evaluation of the performance and establish a common benchmark for researchers, we introduce the Zaion Emotion Dataset (ZED), an openly accessible speech emotion dataset that includes non-acted emotions recorded in real-life conditions, along with manually-annotated boundaries of emotion segments within the utterance. We provide competitive baselines and open-source the code and the pre-trained models.
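To illustrate the task framing, a simplified frame-level scorer can rasterize reference and hypothesis emotion segments onto a common frame grid and count disagreements. This is only a toy illustration, not the EDER metric defined in the paper.

```python
# Hypothetical frame-level scorer for Speech Emotion Diarization: segments are
# (start_sec, end_sec, label) tuples; we rasterize them to a frame grid and count
# mismatched frames. A simplification, not the ZED paper's EDER definition.
def frame_error_rate(ref_segments, hyp_segments, duration, frame=0.01, neutral="neutral"):
    n = int(round(duration / frame))
    ref = [neutral] * n
    hyp = [neutral] * n
    for segs, grid in ((ref_segments, ref), (hyp_segments, hyp)):
        for start, end, label in segs:
            for i in range(int(round(start / frame)), min(int(round(end / frame)), n)):
                grid[i] = label
    errors = sum(r != h for r, h in zip(ref, hyp))
    return errors / n

ref = [(1.0, 2.5, "angry")]
hyp = [(1.2, 2.7, "angry")]
print(frame_error_rate(ref, hyp, duration=5.0))  # fraction of misclassified frames (0.08)
```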
https://github.com/ga642381/Speech-Prompts-Adapters
This Bangla ASR model is trained with ~18,000 hours of YouTube data and gets a 2.5% word error rate on the test dataset.
https://huggingface.co/hishab/hishab_bn_fastconformer
Review of neural codecs
https://arxiv.org/abs/2402.13236
Towards audio language modeling - an overview
Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kai-wei Chang, Ho-Lam Chung, Alexander H. Liu, Hung-yi Lee
Neural audio codecs are initially introduced to compress audio data into compact codes to reduce transmission latency. Researchers recently discovered the potential of codecs as suitable tokenizers for converting continuous audio into discrete codes, which can be employed to develop audio language models (LMs). Numerous high-performance neural audio codecs and codec-based LMs have been developed. The paper aims to provide a thorough and systematic overview of the neural audio codec models and codec-based LMs.
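The core mechanism the survey covers, a codec as a tokenizer, comes down to residual vector quantization: every encoder frame becomes a small stack of codebook indices that an audio LM can model. Below is a self-contained toy sketch; the dimensions and codebook sizes are made up and it does not correspond to any particular codec.

```python
import torch

class ToyRVQ(torch.nn.Module):
    """Residual vector quantizer: each frame becomes n_q discrete code indices."""
    def __init__(self, dim=64, codebook_size=1024, n_q=4):
        super().__init__()
        self.codebooks = torch.nn.Parameter(torch.randn(n_q, codebook_size, dim))

    def forward(self, frames):                       # frames: (T, dim) continuous features
        residual, codes = frames, []
        for cb in self.codebooks:                    # quantize the residual stage by stage
            dist = torch.cdist(residual, cb)         # (T, codebook_size) distances
            idx = dist.argmin(dim=-1)                # nearest codeword per frame
            codes.append(idx)
            residual = residual - cb[idx]            # pass the remainder to the next stage
        return torch.stack(codes, dim=-1)            # (T, n_q) integer tokens for an audio LM

rvq = ToyRVQ()
tokens = rvq(torch.randn(150, 64))                   # 150 frames -> 150 x 4 discrete tokens
print(tokens.shape)
```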
An interesting multichannel system with speaker separation, based on the NeMo FastConformer (the model-surgery steps are sketched below):
https://github.com/facebookresearch/MMCSG
We prepend a fixed beamformer module before feature extraction in the model. The beamformer takes all input 7 channels and outputs 13 beams --- 12 different directions around the wearer of the glasses, plus one beam pointed towards the mouth of the wearer.
The input convolutional layer of the pre-trained model encoder is extended to accept all 13 beams at the input.
The tokenizer of the pretrained model is extended to include two speaker tokens: »0,»1 for SELF/OTHER speaker, i.e. the wearer of the glasses and the conversational partner. The corresponding input and output layers are extended to process these two new tokens.
The extended model is finetuned on the chunks prepared in the previous step.
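A hedged sketch of the two model-surgery steps above in plain PyTorch: widening the input convolution to take 13 beams and growing the token embedding for the two new speaker tokens. The module names and shapes are illustrative, not the actual NeMo internals.

```python
import torch
import torch.nn as nn

def widen_input_conv(conv: nn.Conv2d, new_in_channels: int) -> nn.Conv2d:
    """Copy pre-trained weights into a conv that accepts more input channels;
    the extra channels start from the mean of the existing filters."""
    new_conv = nn.Conv2d(new_in_channels, conv.out_channels,
                         conv.kernel_size, conv.stride, conv.padding)
    with torch.no_grad():
        new_conv.weight[:, :conv.in_channels] = conv.weight
        new_conv.weight[:, conv.in_channels:] = conv.weight.mean(dim=1, keepdim=True)
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

def extend_embeddings(emb: nn.Embedding, extra_tokens: int) -> nn.Embedding:
    """Append rows for new tokens (e.g. SELF/OTHER speaker tags) to a token embedding."""
    new_emb = nn.Embedding(emb.num_embeddings + extra_tokens, emb.embedding_dim)
    with torch.no_grad():
        new_emb.weight[:emb.num_embeddings] = emb.weight
    return new_emb

# e.g. 1-channel front end -> 13 beams, vocabulary grown by 2 speaker tokens
conv = widen_input_conv(nn.Conv2d(1, 64, 3, 2, 1), new_in_channels=13)
emb = extend_embeddings(nn.Embedding(1024, 512), extra_tokens=2)
```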
https://twitter.com/realmrfakename/status/1761482183745912903
Today, I’m thrilled to release a project I’ve been working on for the past couple weeks in collaboration with Hugging Face: the TTS Arena. The TTS Arena, inspired by LMSys's Chatbot Arena, allows you to enter text which will be synthesized by two SOTA models. You can then vote on which model generated a better sample. The results will be published on a publicly-accessible leaderboard. We’ve added several open access models, including Pheme, MetaVoice, XTTS, OpenVoice, & WhisperSpeech. It also includes the proprietary ElevenLabs model.
https://huggingface.co/spaces/TTS-AGI/TTS-Arena
Everyone stars https://github.com/hubertsiuzdak/snac today
Vicuna is the best LLM for ASR: WER 1.9 on LibriSpeech test-clean
https://arxiv.org/abs/2402.08846
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen
In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM. We found that delicate designs are not necessary, while an embarrassingly simple composition of off-the-shelf speech encoder, LLM, and the only trainable linear projector is competent for the ASR task. To be more specific, we benchmark and explore various combinations of LLMs and speech encoders, leading to the optimal LLM-based ASR system, which we call SLAM-ASR. The proposed SLAM-ASR provides a clean setup and little task-specific design, where only the linear projector is trained. To the best of our knowledge, SLAM-ASR achieves the best performance on the Librispeech benchmark among LLM-based ASR models and even outperforms the latest LLM-based audio-universal model trained on massive pair data. Finally, we explore the capability emergence of LLM-based ASR in the process of modal alignment. We hope that our study can facilitate the research on extending LLM with cross-modality capacity and shed light on the LLM-based ASR community.
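The recipe is simple enough to sketch: a frozen speech encoder, a single trainable linear projector into the LLM embedding space, and a frozen LLM trained with the usual next-token loss. The class below is a rough outline under those assumptions (it presumes a Hugging Face-style causal LM that accepts inputs_embeds), not the authors' implementation.

```python
import torch
import torch.nn as nn

class LinearProjectorASR(nn.Module):
    """Frozen speech encoder + trainable linear projector + frozen LLM (SLAM-ASR-style)."""
    def __init__(self, speech_encoder, llm, enc_dim=1024, llm_dim=4096):
        super().__init__()
        self.speech_encoder = speech_encoder.eval().requires_grad_(False)
        self.llm = llm.eval().requires_grad_(False)
        self.projector = nn.Linear(enc_dim, llm_dim)          # the only trainable part

    def forward(self, audio, prompt_embeds, label_embeds):
        with torch.no_grad():
            feats = self.speech_encoder(audio)                # (B, T, enc_dim)
        speech_embeds = self.projector(feats)                 # (B, T, llm_dim)
        # Concatenate [prompt ; projected speech ; transcript] and let the LLM
        # compute the usual next-token loss over the transcript part.
        inputs = torch.cat([prompt_embeds, speech_embeds, label_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```

Only projector.parameters() would be handed to the optimizer, which is what makes the setup so cheap to train.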
Honestly, the samples sound a bit plain and robotic.
From Amazon
https://amazon-ltts-paper.com/
https://arxiv.org/abs/2402.08093
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at this https URL.
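A rough sketch of how such a two-stage, streamable decode could look: an autoregressive speechcode LM followed by a convolutional decoder applied chunk by chunk. Here code_lm and code_decoder are hypothetical placeholders, not the released model.

```python
import torch

# Rough two-stage decode in the spirit of BASE TTS: an autoregressive speechcode LM
# followed by a convolutional decoder applied chunk by chunk (streamable).
# `code_lm` and `code_decoder` are hypothetical placeholders, not the released model.
def synthesize_streaming(code_lm, code_decoder, text_ids, chunk=50, max_codes=1000, eos=0):
    codes, audio_chunks = [], []
    with torch.no_grad():
        for _ in range(max_codes):
            logits = code_lm(text_ids, torch.tensor(codes, dtype=torch.long))
            next_code = int(logits[-1].argmax())              # greedy pick of the next speechcode
            if next_code == eos:
                break
            codes.append(next_code)
            if len(codes) % chunk == 0:                       # emit audio as soon as a chunk is ready
                audio_chunks.append(code_decoder(torch.tensor(codes[-chunk:], dtype=torch.long)))
        tail = len(codes) % chunk
        if tail:
            audio_chunks.append(code_decoder(torch.tensor(codes[-tail:], dtype=torch.long)))
    return torch.cat(audio_chunks) if audio_chunks else torch.empty(0)
```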
The prosody problem: a blog post from Papercup, a speech translation service
https://www.papercup.com/blog/realistic-synthetic-voices
https://github.com/LAION-AI/natural_voice_assistant
And upcoming SpeechLLM TTS models from Nvidia too, based on T5 and Megatron
https://github.com/NVIDIA/NeMo/pull/8364
Modern TTS definitely misses the idea of syntax
https://github.com/shinhyeokoh/rwen
https://arxiv.org/abs/2212.07939
RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis
Shinhyeok Oh, HyeongRae Noh, Yoonseok Hong, Insoo Oh
With the advent of deep learning, a huge number of text-to-speech (TTS) models which produce human-like speech have emerged. Recently, by introducing syntactic and semantic information w.r.t. the input text, various approaches have been proposed to enrich the naturalness and expressiveness of TTS models. Although these strategies showed impressive results, they still have some limitations in utilizing language information. First, most approaches only use graph networks to utilize syntactic and semantic information without considering linguistic features. Second, most previous works do not explicitly consider adjacent words when encoding syntactic and semantic information, even though it is obvious that adjacent words are usually meaningful when encoding the current word. To address these issues, we propose Relation-aware Word Encoding Network (RWEN), which effectively utilizes syntactic and semantic information based on two modules (i.e., Semantic-level Relation Encoding and Adjacent Word Relation Encoding). Experimental results show substantial improvements compared to previous works.
Our latest breakthrough in speech synthesis – ParrotTTS! 🚀
Developed in collaboration with IIIT Hyderabad and TCS Research, ParrotTTS efficiently transforms text into speech, showcasing remarkable adaptability and language transfer capabilities.
Key Features:
1️⃣ Multi-speaker variant training using transcripts from a single speaker.
2️⃣ Swift adaptation to new languages with just 5 hours of paired data in extremely low-resource settings.
3️⃣ Language transfer without bilingual or parallel examples, preserving speaker-specific characteristics.
ParrotTTS attains SOTA results in an extremely low-resource, multi-lingual setup covering 6 languages (Hindi, Marathi, German, Spanish, French, English). It outperforms various baselines, including Fastspeech2 (a pioneering model from Microsoft Research) using only 30 hours of paired data across 6 languages.
Check out our results and learn more at: https://parrot-tts.github.io/tts/
Kudos to the incredible team behind this innovation:
Saiteja Kosgi Vishal Tambrahalli Neha Sahipjohn Anil Nelakanti Vineet Gandhi
ParrotTTS has been accepted at EACL 2024! 🌟🎉
An interesting non-autoregressive model landed in ESPnet
https://github.com/espnet/espnet/pull/5363
https://arxiv.org/abs/2010.14233
Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment
Ethan A. Chi, Julian Salazar, Katrin Kirchhoff
Non-autoregressive models greatly improve decoding speed over typical sequence-to-sequence models, but suffer from degraded performance. Infilling and iterative refinement models make up some of this gap by editing the outputs of a non-autoregressive model, but are constrained in the edits that they can make. We propose iterative realignment, where refinements occur over latent alignments rather than output sequence space. We demonstrate this in speech recognition with Align-Refine, an end-to-end Transformer-based model which refines connectionist temporal classification (CTC) alignments to allow length-changing insertions and deletions. Align-Refine outperforms Imputer and Mask-CTC, matching an autoregressive baseline on WSJ at 1/14th the real-time factor and attaining a LibriSpeech test-other WER of 9.0% without an LM. Our model is strong even in one iteration with a shallower decoder.
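A hedged sketch of the decode loop described in the abstract: start from the CTC alignment, let a non-autoregressive refiner re-predict the full frame-level alignment a few times, then collapse it. The encoder and refiner interfaces here are assumptions for illustration, not the paper's code.

```python
import torch

BLANK = 0

def collapse_ctc(alignment):
    """Standard CTC collapse: merge repeats, then drop blanks."""
    out, prev = [], None
    for tok in alignment.tolist():
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out

@torch.no_grad()
def align_refine_decode(encoder, refiner, audio_feats, n_iters=3):
    """Hypothetical Align-Refine-style decode over frame-level alignments."""
    enc = encoder(audio_feats)             # assumed to output frame-level label logits (T, vocab)
    alignment = enc.argmax(dim=-1)         # initial CTC alignment from a placeholder head
    for _ in range(n_iters):
        logits = refiner(enc, alignment)   # refiner conditions on the previous alignment
        alignment = logits.argmax(dim=-1)  # refined frame-level labels (tokens may move)
    return collapse_ctc(alignment)         # collapsing allows length-changing edits
```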
The talk is really nice and touches on many hot problems with modern tech:
1. RNN-T models are fast but don't really work for rare words. A deeper LM integration is needed; LODR-like integration helps (see the scoring sketch after this list). A rare-word WER metric is needed too.
2. Modern transducers are very bad at finding the true alignment; they win in accuracy by pushing everything to the end.
3. Streaming speech recognition is 2 times less accurate. Google hopes to recover more than 50% of that gap with more advanced neural network architectures.
4. Self-supervised training does not really work, from what Google sees. They propose their own loss, focused more on ASR, instead of a contrastive loss.
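For reference, the LODR-style fusion mentioned in point 1 adds an external LM score during beam search while subtracting a low-order (e.g. bigram) estimate of the model's internal/source LM; roughly, with the weights tuned on a dev set:

```latex
\hat{y} = \arg\max_{y}\ \Big[ \log p_{\text{RNNT}}(y \mid x)
        + \lambda_1 \log p_{\text{ext-LM}}(y)
        - \lambda_2 \log p_{\text{bigram}}(y) \Big]
```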
Some extra points discussed:
1. Are blank states harmful?
2. Is it possible to include intonation and other emotions in the lattice representation?
3. WER is not the right metric for streaming either.
Afterthought: a lot of things we are doing now can fundamentally change in the future.