Parler-TTS quality is nothing exceptional, but the whole idea of working with audio through text prompts is quite interesting (audio cleanup, denoising, separation, etc.).
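For reference, a minimal generation sketch in the spirit of the Parler-TTS README, where a free-text description steers the voice; the model id and API follow the public repo and may change:

```python
# Minimal Parler-TTS sketch: a natural-language description controls the voice,
# while a separate prompt carries the text to be spoken. Model id and API follow
# the public parler-tts repo at the time of writing.
import soundfile as sf
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

description = "A calm female voice, recorded close to the microphone with very little background noise."
prompt = "Text prompts can steer the acoustics, not just the words."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

with torch.no_grad():
    audio = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)

sf.write("parler_out.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)
```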
RALL-E shows that chain-of-thought (CoT) prompting helps improve the robustness of codec LLMs for speech synthesis, reducing the error rate from 68% to 4% on extremely hard test sentences.
https://huggingface.co/papers/2404.03204
https://ralle-demo.github.io/RALL-E/
The audio-to-audio feature in Stable Audio opens up new workflows for rapid sonic exploration in audiovisual production.
https://twitter.com/jordiponsdotme/status/1775779209891246560
English Anime voice dataset
https://huggingface.co/datasets/ShoukanLabs/AniSpeech
and a project to fine-tune StyleTTS2 using it
https://huggingface.co/ShoukanLabs/Vokan
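A quick way to peek at the dataset, assuming it loads with the standard datasets API; config and column names come from the dataset card, not from here:

```python
# Inspect the AniSpeech dataset; config/split/column names are whatever the
# dataset card defines, so list them first rather than hard-coding.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("ShoukanLabs/AniSpeech")
print(configs)                              # available configurations

ds = load_dataset("ShoukanLabs/AniSpeech", configs[0])
print(ds)                                   # splits and columns
sample = next(iter(ds[list(ds.keys())[0]]))
print(sample.keys())                        # expect an audio column plus transcript metadata
```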
Race for parameters in TTS models begins
https://www.linkedin.com/feed/update/urn:li:activity:7179705576422039552/
We have just launched our new Speech Synthesis Foundation Model (SSFM), which is a voice AI model that has adopted a large language model with 14x more parameters than our legacy model. The new model has been trained with 75x more speech database, and with it, we have achieved the state of the art expressiveness of synthetic voice. This means that just 10 seconds of a prompt voice can mimic its voice identity and speaking style. Check out the comparison with OpenAI and Azure TTS in the following link: https://lnkd.in/gaU6zGfu
https://typecast.ai/learn/typecast-ssfm-text-to-speech/
Announcing VoiceCraft
SotA for both speech editing and zero-shot text-to-speech, outperforming VALL-E, XTTS-v2, etc.
VoiceCraft works on in-the-wild data such as movies, random videos and podcasts
We fully open source it at https://github.com/jasonppy/VoiceCraft
Distil-Whisper is now more accurate, efficient and accessible 🚀
distil-large-v3 is 6x faster and within 1% WER of large-v3, with reduced hallucinations and better long-form support across all libraries
Previous Distil-Whisper models were trained on an average audio length of 7 seconds, so predictions beyond that point were largely inaccurate.
To preserve Whisper's ability to transcribe 30-second chunks, we added pre-processing to pack audio to 30 seconds.
Repo: https://huggingface.co/distil-whisper/distil-large-v3
https://twitter.com/sanchitgandhi99/status/1770877844823896117
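If you just want to try it, a hedged sketch with the transformers ASR pipeline (the model card points to this route; the chunking and batch parameters below are illustrative, not tuned recommendations):

```python
# Transcribe long-form audio with distil-large-v3 via the transformers pipeline.
# chunk_length_s / batch_size are illustrative; check the model card for the
# currently recommended long-form settings.
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3",
    torch_dtype=dtype,
    device=device,
)

result = asr("meeting.wav", chunk_length_s=30, batch_size=8, return_timestamps=True)
print(result["text"])
```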
ICASSP2024 proceedings released
https://ieeexplore.ieee.org/xpl/conhome/10445798/proceeding
Checkpoints for NS3 too
https://huggingface.co/amphion/naturalspeech3_facodec
Some further thoughts on NaturalSpeech3
https://alphacephei.com/nsh/2024/03/08/naturalspeech-factorization.html
Industrial cases are always interesting because they demonstrate the real state of the technology. This TTS is nice
https://rime.ai/blog/introducing-mist
SLT 2024 will be held in Macao, China, Dec 2-5, 2024
Sherpa merged NNAPI support on Android
https://github.com/k2-fsa/sherpa-onnx/pull/160
https://github.com/speechbrain/benchmarks/tree/main/benchmarks/CL_MASR
CL-MASR: A Continual Learning Benchmark for Multilingual ASR
This is the official benchmark platform accompanying the paper CL-MASR: A Continual Learning Benchmark for Multilingual ASR.
It includes scripts to train Whisper and WavLM-based ASR systems on a subset of 20 languages selected from Common Voice 13 in a continual learning fashion using a handful of methods including rehearsal-based, architecture-based, and regularization-based approaches.
The goal is to continually learn new languages while limiting forgetting the previously learned ones. An ideal method should achieve both positive forward transfer (i.e. improve performance on new tasks leveraging shared knowledge from previous tasks) and positive backward transfer (i.e. improve performance on previous tasks leveraging shared knowledge from new tasks).
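For orientation, a small sketch of the standard GEM-style forward/backward transfer metrics that this terminology comes from; CL-MASR itself reports WER-based variants, so the accuracy-based numpy version below is an illustration rather than the benchmark's own code:

```python
# GEM-style continual-learning metrics (Lopez-Paz & Ranzato, 2017) behind the
# forward/backward transfer terminology. Illustrative accuracy-based version.
import numpy as np

def cl_metrics(R, b):
    """R[i, j]: accuracy on language j after finishing training on language i.
    b[j]: accuracy on language j before any continual training."""
    T = R.shape[0]
    avg_acc = R[-1].mean()                                        # final average accuracy
    bwt = np.mean([R[-1, j] - R[j, j] for j in range(T - 1)])     # >0 means positive backward transfer
    fwt = np.mean([R[j - 1, j] - b[j] for j in range(1, T)])      # >0 means positive forward transfer
    return avg_acc, bwt, fwt

# Toy run with 3 languages learned sequentially
R = np.array([[0.80, 0.10, 0.05],
              [0.75, 0.78, 0.12],
              [0.73, 0.74, 0.81]])
b = np.array([0.05, 0.08, 0.06])
print(cl_metrics(R, b))
```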
Multimodal speech LLM work by Google DeepMind
Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems
Frank Palma Gomez, Ramon Sanabria, Yun-hsuan Sung, Daniel Cer, Siddharth Dalmia, Gustavo Hernandez Abrego
Large language models (LLMs) are trained on text-only data that go far beyond the languages with paired speech and text data. At the same time, Dual Encoder (DE) based retrieval systems project queries and documents into the same embedding space and have demonstrated their success in retrieval and bi-text mining. To match speech and text in many languages, we propose using LLMs to initialize multi-modal DE retrieval systems. Unlike traditional methods, our system doesn't require speech data during LLM pre-training and can exploit LLM's multilingual text understanding capabilities to match speech and text in languages unseen during retrieval training. Our multi-modal LLM-based retrieval system is capable of matching speech and text in 102 languages despite only training on 21 languages. Our system outperforms previous systems trained explicitly on all 102 languages. We achieve a 10% absolute improvement in Recall@1 averaged across these languages. Additionally, our model demonstrates cross-lingual speech and text matching, which is further enhanced by readily available machine translation data.
https://arxiv.org/abs/2404.01616v2
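A rough sketch of the dual-encoder matching setup the abstract describes, trained with an in-batch contrastive loss; the stub encoders, dimensions, and temperature are placeholders, not the paper's LLM-initialized towers:

```python
# Dual-encoder speech/text matching with an in-batch contrastive (InfoNCE) loss.
# Both encoders below are random stand-ins; the paper initializes them from an LLM.
import torch
import torch.nn.functional as F

class StubEncoder(torch.nn.Module):
    def __init__(self, in_dim, emb_dim=256):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, emb_dim)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)        # unit-norm embeddings

speech_enc = StubEncoder(in_dim=1024)                   # pooled speech features
text_enc = StubEncoder(in_dim=768)                      # pooled text features

speech = torch.randn(8, 1024)                           # a batch of paired speech/text
text = torch.randn(8, 768)

s, t = speech_enc(speech), text_enc(text)
logits = s @ t.T / 0.05                                 # scaled cosine similarities
labels = torch.arange(8)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

recall_at_1 = (logits.argmax(dim=1) == labels).float().mean()
print(loss.item(), recall_at_1.item())
```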
Hm, 12 million hours and only 13% more accurate than Whisper
Key stats:
- Trained on 12.5M hours of training data
- 13.5% more accurate than models like Whisper and >22% more accurate than APIs from Azure/AWS/Google
- Up to 30% fewer hallucinations than seq2seq models like Whisper
- 71% better speaker count estimation and 14% better word timestamp estimation compared to our prior models
- 38 seconds to process a 60-minute audio file
https://twitter.com/AssemblyAI/status/1775527558412460120
Announcing the VoiceMOS Challenge 2024! VMC'24 has been accepted as a special session at SLT 2024. There will be 3 tracks. The challenge tentatively starts on April 10.
Registration form: https://forms.gle/tBBeNdvHghAdjTg27
Website:
https://sites.google.com/view/voicemos-challenge
An interesting concept of error chains in autoregressive models, and also a nice reminder that phones are real
Transducers with Pronunciation-Aware Embeddings for Automatic Speech Recognition
ICASSP 2024
Hainan Xu; Zhehuai Chen; Fei Jia; Boris Ginsburg
This paper proposes Transducers with Pronunciation-aware Embeddings (PET). Unlike conventional Transducers where the decoder embeddings for different tokens are trained independently, the PET model’s decoder embedding incorporates shared components for text tokens with the same or similar pronunciations. With experiments conducted in multiple datasets in Mandarin Chinese and Korean, we show that PET models consistently improve speech recognition accuracy compared to conventional Transducers. Our investigation also uncovers a phenomenon that we call error chain reactions. Instead of recognition errors being evenly spread throughout an utterance, they tend to group together, with subsequent errors often following earlier ones. Our analysis shows that PET models effectively mitigate this issue by substantially reducing the likelihood of the model generating additional errors following a prior one. Our implementation will be open-sourced with the NeMo toolkit.
https://ieeexplore.ieee.org/abstract/document/10447685
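One plausible reading of the pronunciation-aware embedding idea, sketched with a toy lexicon; the exact composition used in PET may differ:

```python
# Sketch: a token's decoder embedding mixes a token-specific vector with a
# component shared across tokens that have the same pronunciation, so homophones
# end up with partially shared embeddings. Toy lexicon, not the paper's setup.
import torch

class PronAwareEmbedding(torch.nn.Module):
    def __init__(self, vocab_size, lexicon, num_phonemes, dim=64):
        super().__init__()
        self.token_emb = torch.nn.Embedding(vocab_size, dim)
        self.phone_emb = torch.nn.Embedding(num_phonemes, dim)
        self.lexicon = lexicon                          # token id -> list of phoneme ids

    def forward(self, token_ids):
        out = []
        for tid in token_ids.tolist():
            phones = torch.tensor(self.lexicon[tid])
            shared = self.phone_emb(phones).mean(dim=0)         # shared pronunciation component
            out.append(self.token_emb.weight[tid] + shared)
        return torch.stack(out)

# Tokens 0 and 1 are homophones (same phoneme ids), so they share the pronunciation part.
lexicon = {0: [0, 1], 1: [0, 1], 2: [2, 3]}
emb = PronAwareEmbedding(vocab_size=3, lexicon=lexicon, num_phonemes=4)
print(emb(torch.tensor([0, 1, 2])).shape)               # torch.Size([3, 64])
```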
The UTokyo-SaruLab team won 1st place in the TTS (Acoustic + Vocoder) track of the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge
https://arxiv.org/abs/2403.13720
UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge
Wataru Nakata, Kazuki Yamauchi, Dong Yang, Hiroaki Hyodo, Yuki Saito
We present UTDUSS, the UTokyo-SaruLab system submitted to Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge. The challenge focuses on using discrete speech unit learned from large speech corpora for some tasks. We submitted our UTDUSS system to two text-to-speech tracks: Vocoder and Acoustic+Vocoder. Our system incorporates neural audio codec (NAC) pre-trained on only speech corpora, which makes the learned codec represent rich acoustic features that are necessary for high-fidelity speech reconstruction. For the acoustic+vocoder track, we trained an acoustic model based on Transformer encoder-decoder that predicted the pre-trained NAC tokens from text input. We describe our strategies to build these models, such as data selection, downsampling, and hyper-parameter tuning. Our system ranked in second and first for the Vocoder and Acoustic+Vocoder tracks, respectively.
The leaderboard for the tasks is now available online and can be accessed at: https://huggingface.co/spaces/NOTSOFAR/CHiME8Challenge.
Additionally, the teams have refined the baselines.
Feels like the e2e era has ended. NaturalSpeech and now this. The coming year will be the year of decomposition.
https://arxiv.org/abs/2403.06387
Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR
Yufeng Yang, Ashutosh Pandey, DeLiang Wang
It has been shown that the intelligibility of noisy speech can be improved by speech enhancement (SE) algorithms. However, monaural SE has not been established as an effective frontend for automatic speech recognition (ASR) in noisy conditions compared to an ASR model trained on noisy speech directly. The divide between SE and ASR impedes the progress of robust ASR systems, especially as SE has made major advances in recent years. This paper focuses on eliminating this divide with an ARN (attentive recurrent network) time-domain and a CrossNet time-frequency domain enhancement models. The proposed systems fully decouple frontend enhancement and backend ASR trained only on clean speech. Results on the WSJ, CHiME-2, LibriSpeech, and CHiME-4 corpora demonstrate that ARN and CrossNet enhanced speech both translate to improved ASR results in noisy and reverberant environments, and generalize well to real acoustic scenarios. The proposed system outperforms the baselines trained on corrupted speech directly. Furthermore, it cuts the previous best word error rate (WER) on CHiME-2 by 28.4% relatively with a 5.57% WER, and achieves 3.32/4.44% WER on single-channel CHiME-4 simulated/real test data without training on CHiME-4.
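The shape of such a decoupled pipeline, sketched with untrained stand-in modules (not the paper's ARN/CrossNet frontend or its ASR backend):

```python
# Decoupled robust ASR: an enhancement frontend maps noisy waveforms toward clean
# speech, and a backend ASR trained only on clean speech transcribes the output.
# Both modules are untrained stand-ins used purely to show the interface.
import torch

class StubEnhancer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv1d(1, 1, kernel_size=9, padding=4)

    def forward(self, noisy_wav):                               # (batch, samples)
        return self.net(noisy_wav.unsqueeze(1)).squeeze(1)

class StubCleanASR(torch.nn.Module):
    """Trained on clean speech only; never sees noisy data."""
    def __init__(self, vocab_size=32):
        super().__init__()
        self.feats = torch.nn.Conv1d(1, 64, kernel_size=400, stride=160)
        self.head = torch.nn.Linear(64, vocab_size)

    def forward(self, wav):
        f = self.feats(wav.unsqueeze(1)).transpose(1, 2)        # (batch, frames, 64)
        return self.head(f).log_softmax(dim=-1)                 # CTC-style log-probs

enhancer, asr = StubEnhancer(), StubCleanASR()
noisy = torch.randn(2, 16000)                                   # 1 s of audio at 16 kHz
with torch.no_grad():
    log_probs = asr(enhancer(noisy))                            # frontend and backend stay separate
print(log_probs.shape)
```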
https://anonymous.4open.science/w/ham-tts/
HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
Anonymous Submission to ACL, 2024
Abstract
Token-based text-to-speech (TTS) models have emerged as a promising avenue for generating natural and realistic speech, yet they grapple with low pronunciation accuracy, speaking style and timbre inconsistency, and a substantial need for diverse training data. In response, we introduce a novel hierarchical acoustic modeling approach complemented by a tailored data augmentation strategy and train it on the combination of real and synthetic data, scaling the data size up to 650k hours, leading to the zero-shot TTS model with 0.8B parameters. Specifically, our method incorporates a latent variable sequence containing supplementary acoustic information based on refined self-supervised learning (SSL) discrete units into the TTS model by a predictor. This significantly mitigates pronunciation errors and style mutations in synthesized speech. During training, we strategically replace and duplicate segments of the data to enhance timbre uniformity. Moreover, a pretrained few-shot voice conversion model is utilized to generate a plethora of voices with identical content yet varied timbres. This facilitates the explicit learning of utterance-level one-to-many mappings, enriching speech diversity and also ensuring consistency in timbre. Comparative experiments demonstrate our model's superiority over VALL-E in pronunciation precision and maintaining speaking style, as well as timbre continuity.
Some developers are really fast
https://github.com/open-mmlab/Amphion/pull/152
For the future
https://arxiv.org/abs/2402.01571
Spiking Music: Audio Compression with Event Based Auto-encoders
Martim Lisboa, Guillaume Bellec
Neurons in the brain communicate information via punctual events called spikes. The timing of spikes is thought to carry rich information, but it is not clear how to leverage this in digital systems. We demonstrate that event-based encoding is efficient for audio compression. To build this event-based representation we use a deep binary auto-encoder, and under high sparsity pressure, the model enters a regime where the binary event matrix is stored more efficiently with sparse matrix storage algorithms. We test this on the large MAESTRO dataset of piano recordings against vector quantized auto-encoders. Not only does our "Spiking Music compression" algorithm achieve a competitive compression/reconstruction trade-off, but selectivity and synchrony between encoded events and piano key strikes emerge without supervision in the sparse regime.
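The storage claim is easy to sanity-check: a sketch comparing dense and CSR storage of a binary event matrix at an assumed 1% event rate (the sparsity level is illustrative, not a number from the paper):

```python
# Sanity check of the sparse-storage argument: a binary event matrix with few
# active events is much cheaper to store in CSR form than as a dense array.
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
frames, channels, p_event = 10_000, 512, 0.01           # 1% event rate is an assumption
events = (rng.random((frames, channels)) < p_event).astype(np.uint8)

dense_bytes = events.nbytes
csr = sparse.csr_matrix(events)
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"dense:  {dense_bytes / 1e6:.2f} MB")
print(f"sparse: {sparse_bytes / 1e6:.2f} MB ({dense_bytes / sparse_bytes:.1f}x smaller)")
```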
Auto-regressive GPT synthesizers are not going to be stable. Transformers are the key
https://github.com/choiHkk/Transformer-TTS-V2
Emote Portrait Alive: Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
People say it is really good. No code yet
https://github.com/HumanAIGC/EMO