Nice research on more transparent models is going on
https://github.com/lxy-peter/EfficientPunct
Huggingface is down because of this:
It took a while but we finally released our massive youtube speech dataset: https://huggingface.co/datasets/espnet/yodas. 370k hours across 140 languages.
https://twitter.com/chenwanch1/status/1762942313972592676
The size is about 100TB
TTS has come to the point where data has no labels
https://arxiv.org/abs/2310.16338
Generative Pre-training for Speech with Flow Matching
Alexander H. Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu
Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoder are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step toward this direction by showing a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggested a foundational model for generation tasks in speech can be built with generative pre-training.
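To make the recipe concrete, here is a minimal sketch of one conditional-flow-matching training step with a masked condition, in the spirit of SpeechFlow; the toy model, feature dimensions, and masking ratio are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

# Toy velocity-field model: predicts dx_t/dt given the noisy features, the time step,
# and a partially masked copy of the clean features (the condition).
class VelocityNet(nn.Module):
    def __init__(self, dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t, cond):
        t = t.expand(x_t.shape[0], x_t.shape[1], 1)          # broadcast time over frames
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_step(model, x1, mask_ratio=0.7):
    """One CFM step: x1 is a batch of clean speech features, shape (B, T, dim)."""
    x0 = torch.randn_like(x1)                                # noise sample
    t = torch.rand(x1.shape[0], 1, 1)                        # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                              # linear interpolation path
    target_v = x1 - x0                                       # ground-truth velocity on this path
    cond = x1 * (torch.rand(x1.shape[:2]).unsqueeze(-1) > mask_ratio)  # masked condition
    pred_v = model(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()                 # CFM regression loss

model = VelocityNet()
loss = flow_matching_step(model, torch.randn(4, 100, 80))
loss.backward()
```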
This kind of testing is bad for many reasons but let's respect the social aspects. Common Voice is a good example here.
https://twitter.com/AIatMeta/status/1760025535621824776
https://ai.meta.com/datasets/mmcsg-dataset/
MMCSG Dataset
The MMCSG (Multi-Modal Conversations in Smart Glasses) dataset comprises two-sided conversations recorded using Aria glasses, featuring multi-modal data such as multi-channel audio, video, accelerometer, and gyroscope measurements. This dataset is suitable for research in areas like automatic speech recognition, activity detection, and speaker diarization.
Dropping diffusion was not a great idea: the voice is not crystal clear anymore; it sounds more like a telephony recording.
Based on Mamba/RWKV
https://github.com/theodorblackbird/lina-speech
NeMo Canary 1B by NVIDIA
> Tops the Open ASR Leaderboard.
> Beats Whisper to the punch for ASR.
> Beats Seamless M4Tv2 for Speech Translation.
> Supports 4 languages - English, Spanish, French & German.
> Trained on 85,000 hours of annotated audio.
> Encoder-Decoder Architecture
> Fast-Conformer Encoder
https://huggingface.co/spaces/nvidia/canary-1b
TTS from Nvidia (paper from 2023)
https://github.com/NVIDIA/RAD-MMM
Multilingual Multiaccented Multispeaker TTS with RADTTS
Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro
We work to create a multilingual speech synthesis system which can generate speech with the proper accent while retaining the characteristics of an individual voice. This is challenging to do because it is expensive to obtain bilingual training data in multiple languages, and the lack of such data results in strong correlations that entangle speaker, language, and accent, resulting in poor transfer capabilities. To overcome this, we present a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS with explicit control over accent, language, speaker and fine-grained F0 and energy features. Our proposed model does not rely on bilingual training data. We demonstrate an ability to control synthesized accent for any speaker in an open-source dataset comprising 7 accents. Human subjective evaluation demonstrates that our model can better retain a speaker's voice and accent quality than controlled baselines while synthesizing fluent speech in all target languages and accents in our dataset.
VoxBlink dataset
https://voxblink.github.io/
38k speakers
https://github.com/speechbrain/speechbrain/tree/develop/recipes/ZaionEmotionDataset
https://arxiv.org/abs/2306.12991
Speech Emotion Diarization: Which Emotion Appears When?
Yingzhi Wang, Mirco Ravanelli, Alaa Nfissi, Alya Yacoubi
Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be considered as discrete speech events with definite temporal boundaries, rather than attributes of the entire utterance. To reflect the fine-grained nature of speech emotions, we propose a new task: Speech Emotion Diarization (SED). Just as Speaker Diarization answers the question of "Who speaks when?", Speech Emotion Diarization answers the question of "Which emotion appears when?". To facilitate the evaluation of the performance and establish a common benchmark for researchers, we introduce the Zaion Emotion Dataset (ZED), an openly accessible speech emotion dataset that includes non-acted emotions recorded in real-life conditions, along with manually-annotated boundaries of emotion segments within the utterance. We provide competitive baselines and open-source the code and the pre-trained models.
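To illustrate the task framing, a simplified frame-level scorer can rasterize reference and hypothesis emotion segments onto a common frame grid and count disagreements. This is only a toy illustration, not the EDER metric defined in the paper.

```python
# Hypothetical frame-level scorer for Speech Emotion Diarization: segments are
# (start_sec, end_sec, label) tuples; we rasterize them to a frame grid and count
# mismatched frames. A simplification, not the ZED paper's EDER definition.
def frame_error_rate(ref_segments, hyp_segments, duration, frame=0.01, neutral="neutral"):
    n = int(round(duration / frame))
    ref = [neutral] * n
    hyp = [neutral] * n
    for segs, grid in ((ref_segments, ref), (hyp_segments, hyp)):
        for start, end, label in segs:
            for i in range(int(round(start / frame)), min(int(round(end / frame)), n)):
                grid[i] = label
    errors = sum(r != h for r, h in zip(ref, hyp))
    return errors / n

ref = [(1.0, 2.5, "angry")]
hyp = [(1.2, 2.7, "angry")]
print(frame_error_rate(ref, hyp, duration=5.0))  # fraction of misclassified frames (0.08)
```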
https://github.com/ga642381/Speech-Prompts-Adapters
This Bangla ASR model is trained with ~18,000 hours of YouTube data and gets a 2.5% word error rate on the test dataset.
https://huggingface.co/hishab/hishab_bn_fastconformer
Review of neural codecs
https://arxiv.org/abs/2402.13236
Towards audio language modeling - an overview
Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kai-wei Chang, Ho-Lam Chung, Alexander H. Liu, Hung-yi Lee
Neural audio codecs are initially introduced to compress audio data into compact codes to reduce transmission latency. Researchers recently discovered the potential of codecs as suitable tokenizers for converting continuous audio into discrete codes, which can be employed to develop audio language models (LMs). Numerous high-performance neural audio codecs and codec-based LMs have been developed. The paper aims to provide a thorough and systematic overview of the neural audio codec models and codec-based LMs.
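The core mechanism the survey covers, a codec as a tokenizer, comes down to residual vector quantization: every encoder frame becomes a small stack of codebook indices that an audio LM can model. Below is a self-contained toy sketch; the dimensions and codebook sizes are made up and it does not correspond to any particular codec.

```python
import torch

class ToyRVQ(torch.nn.Module):
    """Residual vector quantizer: each frame becomes n_q discrete code indices."""
    def __init__(self, dim=64, codebook_size=1024, n_q=4):
        super().__init__()
        self.codebooks = torch.nn.Parameter(torch.randn(n_q, codebook_size, dim))

    def forward(self, frames):                       # frames: (T, dim) continuous features
        residual, codes = frames, []
        for cb in self.codebooks:                    # quantize the residual stage by stage
            dist = torch.cdist(residual, cb)         # (T, codebook_size) distances
            idx = dist.argmin(dim=-1)                # nearest codeword per frame
            codes.append(idx)
            residual = residual - cb[idx]            # pass the remainder to the next stage
        return torch.stack(codes, dim=-1)            # (T, n_q) integer tokens for an audio LM

rvq = ToyRVQ()
tokens = rvq(torch.randn(150, 64))                   # 150 frames -> 150 x 4 discrete tokens
print(tokens.shape)
```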
An interesting multichannel system with speaker separation, based on the NeMo FastConformer (the model-surgery steps are sketched below):
https://github.com/facebookresearch/MMCSG
We prepend a fixed beamformer module before feature extraction in the model. The beamformer takes all input 7 channels and outputs 13 beams --- 12 different directions around the wearer of the glasses, plus one beam pointed towards the mouth of the wearer.
The input convolutional layer of the pre-trained model encoder is extended to accept all 13 beams at the input.
The tokenizer of the pretrained model is extended to include two speaker tokens: »0,»1 for SELF/OTHER speaker, i.e. the wearer of the glasses and the conversational partner. The corresponding input and output layers are extended to process these two new tokens.
The extended model is finetuned on the chunks prepared in the previous step.
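A hedged sketch of the two model-surgery steps above in plain PyTorch: widening the input convolution to take 13 beams and growing the token embedding for the two new speaker tokens. The module names and shapes are illustrative, not the actual NeMo internals.

```python
import torch
import torch.nn as nn

def widen_input_conv(conv: nn.Conv2d, new_in_channels: int) -> nn.Conv2d:
    """Copy pre-trained weights into a conv that accepts more input channels;
    the extra channels start from the mean of the existing filters."""
    new_conv = nn.Conv2d(new_in_channels, conv.out_channels,
                         conv.kernel_size, conv.stride, conv.padding)
    with torch.no_grad():
        new_conv.weight[:, :conv.in_channels] = conv.weight
        new_conv.weight[:, conv.in_channels:] = conv.weight.mean(dim=1, keepdim=True)
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

def extend_embeddings(emb: nn.Embedding, extra_tokens: int) -> nn.Embedding:
    """Append rows for new tokens (e.g. SELF/OTHER speaker tags) to a token embedding."""
    new_emb = nn.Embedding(emb.num_embeddings + extra_tokens, emb.embedding_dim)
    with torch.no_grad():
        new_emb.weight[:emb.num_embeddings] = emb.weight
    return new_emb

# e.g. 1-channel front end -> 13 beams, vocabulary grown by 2 speaker tokens
conv = widen_input_conv(nn.Conv2d(1, 64, 3, 2, 1), new_in_channels=13)
emb = extend_embeddings(nn.Embedding(1024, 512), extra_tokens=2)
```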
https://twitter.com/realmrfakename/status/1761482183745912903
Today, I’m thrilled to release a project I’ve been working on for the past couple weeks in collaboration with Hugging Face: the TTS Arena. The TTS Arena, inspired by LMSys's Chatbot Arena, allows you to enter text which will be synthesized by two SOTA models. You can then vote on which model generated a better sample. The results will be published on a publicly-accessible leaderboard. We’ve added several open access models, including Pheme, MetaVoice, XTTS, OpenVoice, & WhisperSpeech. It also includes the proprietary ElevenLabs model.
https://huggingface.co/spaces/TTS-AGI/TTS-Arena
Everyone stars https://github.com/hubertsiuzdak/snac today
Vicuna is the best LLM for ASR: WER 1.9 on LibriSpeech test-clean
https://arxiv.org/abs/2402.08846
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen
In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM. We found that delicate designs are not necessary, while an embarrassingly simple composition of off-the-shelf speech encoder, LLM, and the only trainable linear projector is competent for the ASR task. To be more specific, we benchmark and explore various combinations of LLMs and speech encoders, leading to the optimal LLM-based ASR system, which we call SLAM-ASR. The proposed SLAM-ASR provides a clean setup and little task-specific design, where only the linear projector is trained. To the best of our knowledge, SLAM-ASR achieves the best performance on the Librispeech benchmark among LLM-based ASR models and even outperforms the latest LLM-based audio-universal model trained on massive pair data. Finally, we explore the capability emergence of LLM-based ASR in the process of modal alignment. We hope that our study can facilitate the research on extending LLM with cross-modality capacity and shed light on the LLM-based ASR community.
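The recipe is simple enough to sketch: a frozen speech encoder, a single trainable linear projector into the LLM embedding space, and a frozen LLM trained with the usual next-token loss. The class below is a rough outline under those assumptions (it presumes a Hugging Face-style causal LM that accepts inputs_embeds), not the authors' implementation.

```python
import torch
import torch.nn as nn

class LinearProjectorASR(nn.Module):
    """Frozen speech encoder + trainable linear projector + frozen LLM (SLAM-ASR-style)."""
    def __init__(self, speech_encoder, llm, enc_dim=1024, llm_dim=4096):
        super().__init__()
        self.speech_encoder = speech_encoder.eval().requires_grad_(False)
        self.llm = llm.eval().requires_grad_(False)
        self.projector = nn.Linear(enc_dim, llm_dim)          # the only trainable part

    def forward(self, audio, prompt_embeds, label_embeds):
        with torch.no_grad():
            feats = self.speech_encoder(audio)                # (B, T, enc_dim)
        speech_embeds = self.projector(feats)                 # (B, T, llm_dim)
        # Concatenate [prompt ; projected speech ; transcript] and let the LLM
        # compute the usual next-token loss over the transcript part.
        inputs = torch.cat([prompt_embeds, speech_embeds, label_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```

Only projector.parameters() would be handed to the optimizer, which is what makes the setup so cheap to train.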
Honestly, the samples sound a bit plain and robotic.
From Amazon
https://amazon-ltts-paper.com/
https://arxiv.org/abs/2402.08093
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at this https URL.
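A rough sketch of how such a two-stage, streamable decode could look: an autoregressive speechcode LM followed by a convolutional decoder applied chunk by chunk. Here code_lm and code_decoder are hypothetical placeholders, not the released model.

```python
import torch

# Rough two-stage decode in the spirit of BASE TTS: an autoregressive speechcode LM
# followed by a convolutional decoder applied chunk by chunk (streamable).
# `code_lm` and `code_decoder` are hypothetical placeholders, not the released model.
def synthesize_streaming(code_lm, code_decoder, text_ids, chunk=50, max_codes=1000, eos=0):
    codes, audio_chunks = [], []
    with torch.no_grad():
        for _ in range(max_codes):
            logits = code_lm(text_ids, torch.tensor(codes, dtype=torch.long))
            next_code = int(logits[-1].argmax())              # greedy pick of the next speechcode
            if next_code == eos:
                break
            codes.append(next_code)
            if len(codes) % chunk == 0:                       # emit audio as soon as a chunk is ready
                audio_chunks.append(code_decoder(torch.tensor(codes[-chunk:], dtype=torch.long)))
        tail = len(codes) % chunk
        if tail:
            audio_chunks.append(code_decoder(torch.tensor(codes[-tail:], dtype=torch.long)))
    return torch.cat(audio_chunks) if audio_chunks else torch.empty(0)
```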
The prosody problem: a blog post from Papercup, a speech translation service
https://www.papercup.com/blog/realistic-synthetic-voices
https://github.com/LAION-AI/natural_voice_assistant
And upcoming SpeechLLM TTS models from Nvidia too, based on T5 and Megatron
https://github.com/NVIDIA/NeMo/pull/8364
Modern TTS definitely misses the idea of syntax
https://github.com/shinhyeokoh/rwen
https://arxiv.org/abs/2212.07939
RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis
Shinhyeok Oh, HyeongRae Noh, Yoonseok Hong, Insoo Oh
With the advent of deep learning, a huge number of text-to-speech (TTS) models which produce human-like speech have emerged. Recently, by introducing syntactic and semantic information w.r.t. the input text, various approaches have been proposed to enrich the naturalness and expressiveness of TTS models. Although these strategies showed impressive results, they still have some limitations in utilizing language information. First, most approaches only use graph networks to utilize syntactic and semantic information without considering linguistic features. Second, most previous works do not explicitly consider adjacent words when encoding syntactic and semantic information, even though it is obvious that adjacent words are usually meaningful when encoding the current word. To address these issues, we propose Relation-aware Word Encoding Network (RWEN), which effectively utilizes syntactic and semantic information based on two modules (i.e., Semantic-level Relation Encoding and Adjacent Word Relation Encoding). Experimental results show substantial improvements compared to previous works.
Our latest breakthrough in speech synthesis – ParrotTTS! 🚀
Developed in collaboration with IIIT Hyderabad and TCS Research, ParrotTTS efficiently transforms text into speech, showcasing remarkable adaptability and language transfer capabilities.
Key Features:
1️⃣ Multi-speaker variant training using transcripts from a single speaker.
2️⃣ Swift adaptation to new languages with just 5 hours of paired data in extremely low-resource settings.
3️⃣ Language transfer without bilingual or parallel examples, preserving speaker-specific characteristics.
ParrotTTS attains SOTA results in an extremely low-resource, multi-lingual setup covering 6 languages (Hindi, Marathi, German, Spanish, French, English). It outperforms various baselines, including Fastspeech2 (a pioneering model from Microsoft Research) using only 30 hours of paired data across 6 languages.
Check out our results and learn more at: https://parrot-tts.github.io/tts/
Kudos to the incredible team behind this innovation:
Saiteja Kosgi Vishal Tambrahalli Neha Sahipjohn Anil Nelakanti Vineet Gandhi
ParrotTTS has been accepted at EACL 2024! 🌟🎉
An interesting non-autoregressive model landed in ESPnet
https://github.com/espnet/espnet/pull/5363
https://arxiv.org/abs/2010.14233
Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment
Ethan A. Chi, Julian Salazar, Katrin Kirchhoff
Non-autoregressive models greatly improve decoding speed over typical sequence-to-sequence models, but suffer from degraded performance. Infilling and iterative refinement models make up some of this gap by editing the outputs of a non-autoregressive model, but are constrained in the edits that they can make. We propose iterative realignment, where refinements occur over latent alignments rather than output sequence space. We demonstrate this in speech recognition with Align-Refine, an end-to-end Transformer-based model which refines connectionist temporal classification (CTC) alignments to allow length-changing insertions and deletions. Align-Refine outperforms Imputer and Mask-CTC, matching an autoregressive baseline on WSJ at 1/14th the real-time factor and attaining a LibriSpeech test-other WER of 9.0% without an LM. Our model is strong even in one iteration with a shallower decoder.
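A hedged sketch of the decode loop described in the abstract: start from the CTC alignment, let a non-autoregressive refiner re-predict the full frame-level alignment a few times, then collapse it. The encoder and refiner interfaces here are assumptions for illustration, not the paper's code.

```python
import torch

BLANK = 0

def collapse_ctc(alignment):
    """Standard CTC collapse: merge repeats, then drop blanks."""
    out, prev = [], None
    for tok in alignment.tolist():
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out

@torch.no_grad()
def align_refine_decode(encoder, refiner, audio_feats, n_iters=3):
    """Hypothetical Align-Refine-style decode over frame-level alignments."""
    enc = encoder(audio_feats)             # assumed to output frame-level label logits (T, vocab)
    alignment = enc.argmax(dim=-1)         # initial CTC alignment from a placeholder head
    for _ in range(n_iters):
        logits = refiner(enc, alignment)   # refiner conditions on the previous alignment
        alignment = logits.argmax(dim=-1)  # refined frame-level labels (tokens may move)
    return collapse_ctc(alignment)         # collapsing allows length-changing edits
```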
The talk is really nice and touches on many hot problems with modern tech:
1. RNN-T models are fast but don't really work for rare words. A deeper LM integration is needed; LODR-like integration helps (see the scoring sketch after this list). A rare-word WER metric is needed too.
2. Modern transducers are very bad at finding the true alignment; they win in accuracy by pushing everything to the end.
3. Streaming speech recognition is 2 times less accurate. Google hopes to recover more than 50% of that gap with more advanced neural network architectures.
4. Self-supervised training does not really work, from what Google sees. They propose their own loss, focused more on ASR, instead of a contrastive loss.
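For reference, the LODR-style fusion mentioned in point 1 adds an external LM score during beam search while subtracting a low-order (e.g. bigram) estimate of the model's internal/source LM; roughly, with the weights tuned on a dev set:

```latex
\hat{y} = \arg\max_{y}\ \Big[ \log p_{\text{RNNT}}(y \mid x)
        + \lambda_1 \log p_{\text{ext-LM}}(y)
        - \lambda_2 \log p_{\text{bigram}}(y) \Big]
```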
Some extra points discussed:
1. Are blank states harmful?
2. Is it possible to include intonation and other emotions in the lattice representation?
3. WER is not the right metric for streaming either.
Afterthought: a lot of things we are doing now can fundamentally change in the future.