Parler-TTS quality is nothing exceptional, but the whole idea of working with audio through text prompts is quite interesting (audio cleanup, denoising, separation, etc.).
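For reference, a minimal generation sketch in the spirit of the Parler-TTS README, where a free-text description steers the voice; the model id and API follow the public repo and may change:

```python
# Minimal Parler-TTS sketch: a natural-language description controls the voice,
# while a separate prompt carries the text to be spoken. Model id and API follow
# the public parler-tts repo at the time of writing.
import soundfile as sf
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

description = "A calm female voice, recorded close to the microphone with very little background noise."
prompt = "Text prompts can steer the acoustics, not just the words."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

with torch.no_grad():
    audio = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)

sf.write("parler_out.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)
```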
RALL-E shows that chain-of-thought (CoT) prompting helps improve the robustness of codec LLMs for speech synthesis, reducing the error rate from 68% to 4% on extremely hard test sentences.
https://huggingface.co/papers/2404.03204
https://ralle-demo.github.io/RALL-E/
The audio-to-audio feature in Stable Audio opens up new workflows for rapid sonic exploration in audiovisual production.
https://twitter.com/jordiponsdotme/status/1775779209891246560
English Anime voice dataset
https://huggingface.co/datasets/ShoukanLabs/AniSpeech
and a project to fine-tune StyleTTS2 using it
https://huggingface.co/ShoukanLabs/Vokan
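A quick way to peek at the dataset, assuming it loads with the standard datasets API; config and column names come from the dataset card, not from here:

```python
# Inspect the AniSpeech dataset; config/split/column names are whatever the
# dataset card defines, so list them first rather than hard-coding.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("ShoukanLabs/AniSpeech")
print(configs)                              # available configurations

ds = load_dataset("ShoukanLabs/AniSpeech", configs[0])
print(ds)                                   # splits and columns
sample = next(iter(ds[list(ds.keys())[0]]))
print(sample.keys())                        # expect an audio column plus transcript metadata
```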
Race for parameters in TTS models begins
https://www.linkedin.com/feed/update/urn:li:activity:7179705576422039552/
We have just launched our new Speech Synthesis Foundation Model (SSFM), which is a voice AI model that has adopted a large language model with 14x more parameters than our legacy model. The new model has been trained with 75x more speech database, and with it, we have achieved the state of the art expressiveness of synthetic voice. This means that just 10 seconds of a prompt voice can mimic its voice identity and speaking style. Check out the comparison with OpenAI and Azure TTS in the following link: https://lnkd.in/gaU6zGfu
https://typecast.ai/learn/typecast-ssfm-text-to-speech/
Announcing VoiceCraft
SotA for both speech editing and zero-shot text-to-speech, outperforming VALL-E, XTTS-v2, etc.
VoiceCraft works on in-the-wild data such as movies, random videos and podcasts
We fully open source it at https://github.com/jasonppy/VoiceCraft
Distil-Whisper is now more accurate, efficient and accessible 🚀
distil-large-v3 is 6x faster and within 1% WER of large-v3, with reduced hallucinations and better long-form support across all libraries
Previous Distil-Whisper models were trained on an average audio length of 7 seconds, so predictions beyond that point were largely inaccurate.
To preserve Whisper's ability to transcribe 30-second chunks, we added pre-processing to pack audio to 30 seconds.
Repo: https://huggingface.co/distil-whisper/distil-large-v3
https://twitter.com/sanchitgandhi99/status/1770877844823896117
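If you just want to try it, a hedged sketch with the transformers ASR pipeline (the model card points to this route; the chunking and batch parameters below are illustrative, not tuned recommendations):

```python
# Transcribe long-form audio with distil-large-v3 via the transformers pipeline.
# chunk_length_s / batch_size are illustrative; check the model card for the
# currently recommended long-form settings.
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3",
    torch_dtype=dtype,
    device=device,
)

result = asr("meeting.wav", chunk_length_s=30, batch_size=8, return_timestamps=True)
print(result["text"])
```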
ICASSP2024 proceedings released
https://ieeexplore.ieee.org/xpl/conhome/10445798/proceeding
Checkpoints for NS3 too
https://huggingface.co/amphion/naturalspeech3_facodec
Some further thoughts on NaturalSpeech3
https://alphacephei.com/nsh/2024/03/08/naturalspeech-factorization.html
Industrial cases are always interesting because they demonstrate the real state of the technology. This TTS is nice
https://rime.ai/blog/introducing-mist
SLT 2024 will be held in Macao, China, Dec 2-5, 2024
Sherpa merged NNAPI support on Android
https://github.com/k2-fsa/sherpa-onnx/pull/160
https://github.com/speechbrain/benchmarks/tree/main/benchmarks/CL_MASR
CL-MASR: A Continual Learning Benchmark for Multilingual ASR
This is the official benchmark platform accompanying the paper CL-MASR: A Continual Learning Benchmark for Multilingual ASR.
It includes scripts to train Whisper and WavLM-based ASR systems on a subset of 20 languages selected from Common Voice 13 in a continual learning fashion using a handful of methods including rehearsal-based, architecture-based, and regularization-based approaches.
The goal is to continually learn new languages while limiting forgetting the previously learned ones. An ideal method should achieve both positive forward transfer (i.e. improve performance on new tasks leveraging shared knowledge from previous tasks) and positive backward transfer (i.e. improve performance on previous tasks leveraging shared knowledge from new tasks).
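For orientation, a small sketch of the standard GEM-style forward/backward transfer metrics that this terminology comes from; CL-MASR itself reports WER-based variants, so the accuracy-based numpy version below is an illustration rather than the benchmark's own code:

```python
# GEM-style continual-learning metrics (Lopez-Paz & Ranzato, 2017) behind the
# forward/backward transfer terminology. Illustrative accuracy-based version.
import numpy as np

def cl_metrics(R, b):
    """R[i, j]: accuracy on language j after finishing training on language i.
    b[j]: accuracy on language j before any continual training."""
    T = R.shape[0]
    avg_acc = R[-1].mean()                                        # final average accuracy
    bwt = np.mean([R[-1, j] - R[j, j] for j in range(T - 1)])     # >0 means positive backward transfer
    fwt = np.mean([R[j - 1, j] - b[j] for j in range(1, T)])      # >0 means positive forward transfer
    return avg_acc, bwt, fwt

# Toy run with 3 languages learned sequentially
R = np.array([[0.80, 0.10, 0.05],
              [0.75, 0.78, 0.12],
              [0.73, 0.74, 0.81]])
b = np.array([0.05, 0.08, 0.06])
print(cl_metrics(R, b))
```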
Multimodal speech LLM work by Google DeepMind
Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems
Frank Palma Gomez, Ramon Sanabria, Yun-hsuan Sung, Daniel Cer, Siddharth Dalmia, Gustavo Hernandez Abrego
Large language models (LLMs) are trained on text-only data that go far beyond the languages with paired speech and text data. At the same time, Dual Encoder (DE) based retrieval systems project queries and documents into the same embedding space and have demonstrated their success in retrieval and bi-text mining. To match speech and text in many languages, we propose using LLMs to initialize multi-modal DE retrieval systems. Unlike traditional methods, our system doesn't require speech data during LLM pre-training and can exploit LLM's multilingual text understanding capabilities to match speech and text in languages unseen during retrieval training. Our multi-modal LLM-based retrieval system is capable of matching speech and text in 102 languages despite only training on 21 languages. Our system outperforms previous systems trained explicitly on all 102 languages. We achieve a 10% absolute improvement in Recall@1 averaged across these languages. Additionally, our model demonstrates cross-lingual speech and text matching, which is further enhanced by readily available machine translation data.
https://arxiv.org/abs/2404.01616v2
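A rough sketch of the dual-encoder matching setup the abstract describes, trained with an in-batch contrastive loss; the stub encoders, dimensions, and temperature are placeholders, not the paper's LLM-initialized towers:

```python
# Dual-encoder speech/text matching with an in-batch contrastive (InfoNCE) loss.
# Both encoders below are random stand-ins; the paper initializes them from an LLM.
import torch
import torch.nn.functional as F

class StubEncoder(torch.nn.Module):
    def __init__(self, in_dim, emb_dim=256):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, emb_dim)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)        # unit-norm embeddings

speech_enc = StubEncoder(in_dim=1024)                   # pooled speech features
text_enc = StubEncoder(in_dim=768)                      # pooled text features

speech = torch.randn(8, 1024)                           # a batch of paired speech/text
text = torch.randn(8, 768)

s, t = speech_enc(speech), text_enc(text)
logits = s @ t.T / 0.05                                 # scaled cosine similarities
labels = torch.arange(8)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

recall_at_1 = (logits.argmax(dim=1) == labels).float().mean()
print(loss.item(), recall_at_1.item())
```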
Hm, 12 million hours and only 13% more accurate than Whisper
Key stats:
- Trained on 12.5M hours of training data
- 13.5% more accurate than models like Whisper and >22% more accurate than APIs from Azure/AWS/Google
- Up to 30% fewer hallucinations than seq2seq models like Whisper
- 71% better speaker count estimation and 14% better word timestamp estimation compared to our prior models
- 38 seconds to process a 60-minute audio file
https://twitter.com/AssemblyAI/status/1775527558412460120
Announcing the VoiceMOS Challenge 2024! VMC'24 has been accepted as a special session at SLT 2024. There will be 3 tracks. The challenge tentatively starts on April 10.
Registration form: https://forms.gle/tBBeNdvHghAdjTg27
Website:
https://sites.google.com/view/voicemos-challenge
An interesting concept of error chains in autoregressive models, and also a nice reminder that phones are real
Transducers with Pronunciation-Aware Embeddings for Automatic Speech Recognition
ICASSP 2024
Hainan Xu; Zhehuai Chen; Fei Jia; Boris Ginsburg
This paper proposes Transducers with Pronunciation-aware Embeddings (PET). Unlike conventional Transducers where the decoder embeddings for different tokens are trained independently, the PET model’s decoder embedding incorporates shared components for text tokens with the same or similar pronunciations. With experiments conducted in multiple datasets in Mandarin Chinese and Korean, we show that PET models consistently improve speech recognition accuracy compared to conventional Transducers. Our investigation also uncovers a phenomenon that we call error chain reactions. Instead of recognition errors being evenly spread throughout an utterance, they tend to group together, with subsequent errors often following earlier ones. Our analysis shows that PET models effectively mitigate this issue by substantially reducing the likelihood of the model generating additional errors following a prior one. Our implementation will be open-sourced with the NeMo toolkit.
https://ieeexplore.ieee.org/abstract/document/10447685
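One plausible reading of the pronunciation-aware embedding idea, sketched with a toy lexicon; the exact composition used in PET may differ:

```python
# Sketch: a token's decoder embedding mixes a token-specific vector with a
# component shared across tokens that have the same pronunciation, so homophones
# end up with partially shared embeddings. Toy lexicon, not the paper's setup.
import torch

class PronAwareEmbedding(torch.nn.Module):
    def __init__(self, vocab_size, lexicon, num_phonemes, dim=64):
        super().__init__()
        self.token_emb = torch.nn.Embedding(vocab_size, dim)
        self.phone_emb = torch.nn.Embedding(num_phonemes, dim)
        self.lexicon = lexicon                          # token id -> list of phoneme ids

    def forward(self, token_ids):
        out = []
        for tid in token_ids.tolist():
            phones = torch.tensor(self.lexicon[tid])
            shared = self.phone_emb(phones).mean(dim=0)         # shared pronunciation component
            out.append(self.token_emb.weight[tid] + shared)
        return torch.stack(out)

# Tokens 0 and 1 are homophones (same phoneme ids), so they share the pronunciation part.
lexicon = {0: [0, 1], 1: [0, 1], 2: [2, 3]}
emb = PronAwareEmbedding(vocab_size=3, lexicon=lexicon, num_phonemes=4)
print(emb(torch.tensor([0, 1, 2])).shape)               # torch.Size([3, 64])
```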
The UTokyo-SaruLab team won 1st place in the TTS (Acoustic + Vocoder) track of the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge
https://arxiv.org/abs/2403.13720
UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge
Wataru Nakata, Kazuki Yamauchi, Dong Yang, Hiroaki Hyodo, Yuki Saito
We present UTDUSS, the UTokyo-SaruLab system submitted to Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge. The challenge focuses on using discrete speech unit learned from large speech corpora for some tasks. We submitted our UTDUSS system to two text-to-speech tracks: Vocoder and Acoustic+Vocoder. Our system incorporates neural audio codec (NAC) pre-trained on only speech corpora, which makes the learned codec represent rich acoustic features that are necessary for high-fidelity speech reconstruction. For the acoustic+vocoder track, we trained an acoustic model based on Transformer encoder-decoder that predicted the pre-trained NAC tokens from text input. We describe our strategies to build these models, such as data selection, downsampling, and hyper-parameter tuning. Our system ranked in second and first for the Vocoder and Acoustic+Vocoder tracks, respectively.
The leaderboard for the tasks is now available online and can be accessed at: https://huggingface.co/spaces/NOTSOFAR/CHiME8Challenge.
Additionally, the teams have refined the baselines.
Feels like the e2e era has ended. NaturalSpeech and now this. The coming year will be the year of decomposition.
https://arxiv.org/abs/2403.06387
Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR
Yufeng Yang, Ashutosh Pandey, DeLiang Wang
It has been shown that the intelligibility of noisy speech can be improved by speech enhancement (SE) algorithms. However, monaural SE has not been established as an effective frontend for automatic speech recognition (ASR) in noisy conditions compared to an ASR model trained on noisy speech directly. The divide between SE and ASR impedes the progress of robust ASR systems, especially as SE has made major advances in recent years. This paper focuses on eliminating this divide with an ARN (attentive recurrent network) time-domain and a CrossNet time-frequency domain enhancement models. The proposed systems fully decouple frontend enhancement and backend ASR trained only on clean speech. Results on the WSJ, CHiME-2, LibriSpeech, and CHiME-4 corpora demonstrate that ARN and CrossNet enhanced speech both translate to improved ASR results in noisy and reverberant environments, and generalize well to real acoustic scenarios. The proposed system outperforms the baselines trained on corrupted speech directly. Furthermore, it cuts the previous best word error rate (WER) on CHiME-2 by 28.4% relatively with a 5.57% WER, and achieves 3.32/4.44% WER on single-channel CHiME-4 simulated/real test data without training on CHiME-4.
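The shape of such a decoupled pipeline, sketched with untrained stand-in modules (not the paper's ARN/CrossNet frontend or its ASR backend):

```python
# Decoupled robust ASR: an enhancement frontend maps noisy waveforms toward clean
# speech, and a backend ASR trained only on clean speech transcribes the output.
# Both modules are untrained stand-ins used purely to show the interface.
import torch

class StubEnhancer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv1d(1, 1, kernel_size=9, padding=4)

    def forward(self, noisy_wav):                               # (batch, samples)
        return self.net(noisy_wav.unsqueeze(1)).squeeze(1)

class StubCleanASR(torch.nn.Module):
    """Trained on clean speech only; never sees noisy data."""
    def __init__(self, vocab_size=32):
        super().__init__()
        self.feats = torch.nn.Conv1d(1, 64, kernel_size=400, stride=160)
        self.head = torch.nn.Linear(64, vocab_size)

    def forward(self, wav):
        f = self.feats(wav.unsqueeze(1)).transpose(1, 2)        # (batch, frames, 64)
        return self.head(f).log_softmax(dim=-1)                 # CTC-style log-probs

enhancer, asr = StubEnhancer(), StubCleanASR()
noisy = torch.randn(2, 16000)                                   # 1 s of audio at 16 kHz
with torch.no_grad():
    log_probs = asr(enhancer(noisy))                            # frontend and backend stay separate
print(log_probs.shape)
```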
https://anonymous.4open.science/w/ham-tts/
HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
Anonymous Submission to ACL, 2024
Abstract
Token-based text-to-speech (TTS) models have emerged as a promising avenue for generating natural and realistic speech, yet they grapple with low pronunciation accuracy, speaking style and timbre inconsistency, and a substantial need for diverse training data. In response, we introduce a novel hierarchical acoustic modeling approach complemented by a tailored data augmentation strategy and train it on the combination of real and synthetic data, scaling the data size up to 650k hours, leading to the zero-shot TTS model with 0.8B parameters. Specifically, our method incorporates a latent variable sequence containing supplementary acoustic information based on refined self-supervised learning (SSL) discrete units into the TTS model by a predictor. This significantly mitigates pronunciation errors and style mutations in synthesized speech. During training, we strategically replace and duplicate segments of the data to enhance timbre uniformity. Moreover, a pretrained few-shot voice conversion model is utilized to generate a plethora of voices with identical content yet varied timbres. This facilitates the explicit learning of utterance-level one-to-many mappings, enriching speech diversity and also ensuring consistency in timbre. Comparative experiments demonstrate our model's superiority over VALL-E in pronunciation precision and maintaining speaking style, as well as timbre continuity.
Some developers are really fast
https://github.com/open-mmlab/Amphion/pull/152
For the future
https://arxiv.org/abs/2402.01571
Spiking Music: Audio Compression with Event Based Auto-encoders
Martim Lisboa, Guillaume Bellec
Neurons in the brain communicate information via punctual events called spikes. The timing of spikes is thought to carry rich information, but it is not clear how to leverage this in digital systems. We demonstrate that event-based encoding is efficient for audio compression. To build this event-based representation we use a deep binary auto-encoder, and under high sparsity pressure, the model enters a regime where the binary event matrix is stored more efficiently with sparse matrix storage algorithms. We test this on the large MAESTRO dataset of piano recordings against vector quantized auto-encoders. Not only does our "Spiking Music compression" algorithm achieve a competitive compression/reconstruction trade-off, but selectivity and synchrony between encoded events and piano key strikes emerge without supervision in the sparse regime.
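The storage claim is easy to sanity-check: a sketch comparing dense and CSR storage of a binary event matrix at an assumed 1% event rate (the sparsity level is illustrative, not a number from the paper):

```python
# Sanity check of the sparse-storage argument: a binary event matrix with few
# active events is much cheaper to store in CSR form than as a dense array.
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
frames, channels, p_event = 10_000, 512, 0.01           # 1% event rate is an assumption
events = (rng.random((frames, channels)) < p_event).astype(np.uint8)

dense_bytes = events.nbytes
csr = sparse.csr_matrix(events)
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"dense:  {dense_bytes / 1e6:.2f} MB")
print(f"sparse: {sparse_bytes / 1e6:.2f} MB ({dense_bytes / sparse_bytes:.1f}x smaller)")
```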
Auto-regressive GPT synthesizers are not going to be stable. Transformers are the key
https://github.com/choiHkk/Transformer-TTS-V2
Emote Portrait Alive: Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
People say it is really good. No code yet
https://github.com/HumanAIGC/EMO