Hm, 12 million hours and only 13% more accurate than Whisper
Key stats:
- Trained on 12.5M hours of audio data
- 13.5% more accurate than models like Whisper and >22% more accurate than APIs from Azure/AWS/Google
- Up to 30% fewer hallucinations than seq2seq models like Whisper
- 71% better speaker count estimation and 14% better word timestamp estimation compared to our prior models
- 38 seconds to process a 60-minute audio file (see the quick throughput check below)
https://twitter.com/AssemblyAI/status/1775527558412460120
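Putting the last stat in perspective, a quick back-of-the-envelope throughput check (plain arithmetic, not a number from the announcement):

```python
# 60 minutes of audio processed in 38 seconds
audio_seconds = 60 * 60
processing_seconds = 38
print(f"~{audio_seconds / processing_seconds:.0f}x faster than real time")  # ~95x
print(f"real-time factor ~{processing_seconds / audio_seconds:.4f}")        # ~0.0106
```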
Announcing the VoiceMOS Challenge 2024! VMC'24 has been accepted as a special session at SLT2024. There will be 3 tracks. The challenge tentatively starts on 4/10.
Registration form: https://forms.gle/tBBeNdvHghAdjTg27
Website:
https://sites.google.com/view/voicemos-challenge
Interesting concept of error chains in autoregressive models, and also a nice reminder that phones are real
Transducers with Pronunciation-Aware Embeddings for Automatic Speech Recognition
ICASSP 2024
Hainan Xu; Zhehuai Chen; Fei Jia; Boris Ginsburg
This paper proposes Transducers with Pronunciation-aware Embeddings (PET). Unlike conventional Transducers where the decoder embeddings for different tokens are trained independently, the PET model’s decoder embedding incorporates shared components for text tokens with the same or similar pronunciations. With experiments conducted in multiple datasets in Mandarin Chinese and Korean, we show that PET models consistently improve speech recognition accuracy compared to conventional Transducers. Our investigation also uncovers a phenomenon that we call error chain reactions. Instead of recognition errors being evenly spread throughout an utterance, they tend to group together, with subsequent errors often following earlier ones. Our analysis shows that PET models effectively mitigate this issue by substantially reducing the likelihood of the model generating additional errors following a prior one. Our implementation will be open-sourced with the NeMo toolkit.
https://ieeexplore.ieee.org/abstract/document/10447685
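The core idea, loosely pictured: the decoder embedding of each token is assembled from a token-specific vector plus a vector shared by all tokens with the same pronunciation. A minimal PyTorch sketch of that composition (my own illustration, not the paper's exact formulation; the vocabulary, pronunciation mapping, and dimensions are made up):

```python
import torch
import torch.nn as nn

class PronunciationAwareEmbedding(nn.Module):
    """Token embedding = token-specific part + part shared per pronunciation class."""
    def __init__(self, vocab_size, num_pron_classes, dim, token_to_pron):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)       # independent per token
        self.pron_emb = nn.Embedding(num_pron_classes, dim)  # shared by homophones
        # token_to_pron: LongTensor [vocab_size], token id -> pronunciation class id
        self.register_buffer("token_to_pron", token_to_pron)

    def forward(self, token_ids):
        return self.token_emb(token_ids) + self.pron_emb(self.token_to_pron[token_ids])

# toy usage: tokens 0 and 1 are homophones (both map to pronunciation class 0)
emb = PronunciationAwareEmbedding(
    vocab_size=4, num_pron_classes=3, dim=8,
    token_to_pron=torch.tensor([0, 0, 1, 2]),
)
print(emb(torch.tensor([[0, 1, 2]])).shape)  # torch.Size([1, 3, 8])
```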
UTokyo-SaruLab team won 1st place in The Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge TTS (Acoustic + Vocoder) Track
https://arxiv.org/abs/2403.13720
UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge
Wataru Nakata, Kazuki Yamauchi, Dong Yang, Hiroaki Hyodo, Yuki Saito
We present UTDUSS, the UTokyo-SaruLab system submitted to Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge. The challenge focuses on using discrete speech unit learned from large speech corpora for some tasks. We submitted our UTDUSS system to two text-to-speech tracks: Vocoder and Acoustic+Vocoder. Our system incorporates neural audio codec (NAC) pre-trained on only speech corpora, which makes the learned codec represent rich acoustic features that are necessary for high-fidelity speech reconstruction. For the acoustic+vocoder track, we trained an acoustic model based on Transformer encoder-decoder that predicted the pre-trained NAC tokens from text input. We describe our strategies to build these models, such as data selection, downsampling, and hyper-parameter tuning. Our system ranked in second and first for the Vocoder and Acoustic+Vocoder tracks, respectively.
The leaderboard for the tasks is now available online and can be accessed at: https://huggingface.co/spaces/NOTSOFAR/CHiME8Challenge.
Additionally, the teams have refined the baselines.
Feels like the e2e era has ended. NaturalSpeech and now this. The coming year will be the year of decomposition
https://arxiv.org/abs/2403.06387
Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR
Yufeng Yang, Ashutosh Pandey, DeLiang Wang
It has been shown that the intelligibility of noisy speech can be improved by speech enhancement (SE) algorithms. However, monaural SE has not been established as an effective frontend for automatic speech recognition (ASR) in noisy conditions compared to an ASR model trained on noisy speech directly. The divide between SE and ASR impedes the progress of robust ASR systems, especially as SE has made major advances in recent years. This paper focuses on eliminating this divide with an ARN (attentive recurrent network) time-domain and a CrossNet time-frequency domain enhancement models. The proposed systems fully decouple frontend enhancement and backend ASR trained only on clean speech. Results on the WSJ, CHiME-2, LibriSpeech, and CHiME-4 corpora demonstrate that ARN and CrossNet enhanced speech both translate to improved ASR results in noisy and reverberant environments, and generalize well to real acoustic scenarios. The proposed system outperforms the baselines trained on corrupted speech directly. Furthermore, it cuts the previous best word error rate (WER) on CHiME-2 by 28.4% relatively with a 5.57% WER, and achieves 3.32/4.44% WER on single-channel CHiME-4 simulated/real test data without training on CHiME-4.
https://anonymous.4open.science/w/ham-tts/
HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
Anonymous Submission to ACL, 2024
Abstract
Token-based text-to-speech (TTS) models have emerged as a promising avenue for generating natural and realistic speech, yet they grapple with low pronunciation accuracy, speaking style and timbre inconsistency, and a substantial need for diverse training data. In response, we introduce a novel hierarchical acoustic modeling approach complemented by a tailored data augmentation strategy and train it on the combination of real and synthetic data, scaling the data size up to 650k hours, leading to the zero-shot TTS model with 0.8B parameters. Specifically, our method incorporates a latent variable sequence containing supplementary acoustic information based on refined self-supervised learning (SSL) discrete units into the TTS model by a predictor. This significantly mitigates pronunciation errors and style mutations in synthesized speech. During training, we strategically replace and duplicate segments of the data to enhance timbre uniformity. Moreover, a pretrained few-shot voice conversion model is utilized to generate a plethora of voices with identical content yet varied timbres. This facilitates the explicit learning of utterance-level one-to-many mappings, enriching speech diversity and also ensuring consistency in timbre. Comparative experiments demonstrate our model's superiority over VALL-E in pronunciation precision and maintaining speaking style, as well as timbre continuity.
Some developers are really fast
https://github.com/open-mmlab/Amphion/pull/152
For the future
https://arxiv.org/abs/2402.01571
Spiking Music: Audio Compression with Event Based Auto-encoders
Martim Lisboa, Guillaume Bellec
Neurons in the brain communicate information via punctual events called spikes. The timing of spikes is thought to carry rich information, but it is not clear how to leverage this in digital systems. We demonstrate that event-based encoding is efficient for audio compression. To build this event-based representation we use a deep binary auto-encoder, and under high sparsity pressure, the model enters a regime where the binary event matrix is stored more efficiently with sparse matrix storage algorithms. We test this on the large MAESTRO dataset of piano recordings against vector quantized auto-encoders. Not only does our "Spiking Music compression" algorithm achieve a competitive compression/reconstruction trade-off, but selectivity and synchrony between encoded events and piano key strikes emerge without supervision in the sparse regime.
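The storage argument is easy to see with a toy calculation: once the binary event matrix is sparse enough, listing the spike coordinates is cheaper than keeping every entry. A small illustration (my own numbers, not the paper's):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
events = (rng.random((512, 4000)) < 0.01).astype(np.uint8)  # ~1% of entries are spikes

dense_bits = events.size  # 1 bit per entry if the dense matrix is bit-packed
coo = sparse.coo_matrix(events)
# each stored event needs a (row, col) index pair, ~log2(dim) bits per index
sparse_bits = coo.nnz * (np.log2(events.shape[0]) + np.log2(events.shape[1]))
print(f"dense: {dense_bits} bits, sparse: {sparse_bits:.0f} bits")
```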
Auto-regressive GPT synthesizers are not going to be stable. Transformers are the key
https://github.com/choiHkk/Transformer-TTS-V2
Emote Portrait Alive: Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
People say it is really good. No code yet
https://github.com/HumanAIGC/EMO
This Bangla ASR model is trained on ~18,000 hours of YouTube data and gets a 2.5% word error rate on the test dataset.
https://huggingface.co/hishab/hishab_bn_fastconformer
Review of neural codecs
https://arxiv.org/abs/2402.13236
Towards audio language modeling - an overview
Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kai-wei Chang, Ho-Lam Chung, Alexander H. Liu, Hung-yi Lee
Neural audio codecs are initially introduced to compress audio data into compact codes to reduce transmission latency. Researchers recently discovered the potential of codecs as suitable tokenizers for converting continuous audio into discrete codes, which can be employed to develop audio language models (LMs). Numerous high-performance neural audio codecs and codec-based LMs have been developed. The paper aims to provide a thorough and systematic overview of the neural audio codec models and codec-based LMs.
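For anyone who hasn't played with codecs as tokenizers, a hedged sketch using the EnCodec checkpoint on the Hugging Face hub (exact API details may differ between transformers versions): raw audio goes in, a small grid of discrete codes comes out, and those codes are what codec-based LMs model.

```python
import torch
from transformers import AutoProcessor, EncodecModel

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

audio = torch.zeros(24000).numpy()  # 1 second of silence as a stand-in for real speech
inputs = processor(raw_audio=audio, sampling_rate=24000, return_tensors="pt")

with torch.no_grad():
    encoded = model.encode(inputs["input_values"], inputs["padding_mask"])

# audio_codes holds integer tokens, one row per codebook, one column per frame
print(encoded.audio_codes.shape)
```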
The audio-to-audio feature in Stable Audio opens up new workflows for rapid sonic exploration in audiovisual production.
https://twitter.com/jordiponsdotme/status/1775779209891246560
English Anime voice dataset
https://huggingface.co/datasets/ShoukanLabs/AniSpeech
and project to finetune styletts2 using it
https://huggingface.co/ShoukanLabs/Vokan
Race for parameters in TTS models begins
https://www.linkedin.com/feed/update/urn:li:activity:7179705576422039552/
We have just launched our new Speech Synthesis Foundation Model (SSFM), which is a voice AI model that has adopted a large language model with 14x more parameters than our legacy model. The new model has been trained with 75x more speech database, and with it, we have achieved the state of the art expressiveness of synthetic voice. This means that just 10 seconds of a prompt voice can mimic its voice identity and speaking style. Check out the comparison with OpenAI and Azure TTS in the following link: https://lnkd.in/gaU6zGfu
https://typecast.ai/learn/typecast-ssfm-text-to-speech/
Announcing VoiceCraft
SotA for both speech editing and zero-shot text-to-speech, outperforming VALL-E, XTTS-v2, etc.
VoiceCraft works on in-the-wild data such as movies, random videos and podcasts
We fully open source it at https://github.com/jasonppy/VoiceCraft
Distil-Whisper is now more accurate, efficient and accessible 🚀
distil-large-v3 is 6x faster and within 1% WER of large-v3, with reduced hallucinations and better long-form support across all libraries
Previous Distil-Whisper models were trained on an average audio length of 7 seconds, so predictions beyond this point were largely inaccurate
To preserve Whisper's ability to transcribe 30-second chunks, we added pre-processing that packs audio segments to 30 seconds
Repo: https://huggingface.co/distil-whisper/distil-large-v3
https://twitter.com/sanchitgandhi99/status/1770877844823896117
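A hedged usage sketch with the transformers ASR pipeline (chunk_length_s=25 follows the model card's long-form suggestion; "sample.wav" is a placeholder path):

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3",
    chunk_length_s=25,  # chunked long-form transcription
)
print(pipe("sample.wav")["text"])
```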
ICASSP2024 proceedings released
https://ieeexplore.ieee.org/xpl/conhome/10445798/proceeding
Checkpoints for NS3 too
https://huggingface.co/amphion/naturalspeech3_facodec
Some further thoughts on NaturalSpeech3
https://alphacephei.com/nsh/2024/03/08/naturalspeech-factorization.html
Industrial cases are always interesting because they demonstrate the real state of the technology. This TTS is nice
https://rime.ai/blog/introducing-mist
SLT 2024 will be in Macao, China, Dec 2-5, 2024
Sherpa merged NNAPI support on Android
https://github.com/k2-fsa/sherpa-onnx/pull/160
Nice research on more transparent models is going on
https://github.com/lxy-peter/EfficientPunct
Huggingface is down because
It took a while but we finally released our massive YouTube speech dataset: https://huggingface.co/datasets/espnet/yodas. 370k hours across 140 languages.
https://twitter.com/chenwanch1/status/1762942313972592676
The size is about 100TB
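Given the size, streaming with the datasets library is probably the only practical way to poke at it. A hedged sketch (the subset name "en000" is an assumption about the configuration layout; check the dataset card for the actual names):

```python
from datasets import load_dataset

# stream one (assumed) English shard instead of downloading ~100TB
ds = load_dataset("espnet/yodas", "en000", split="train", streaming=True)
for sample in ds:
    print(sample.keys())
    break
```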
TTS has come to the point where data has no labels
https://arxiv.org/abs/2310.16338
Generative Pre-training for Speech with Flow Matching
Alexander H. Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu
Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoder are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step toward this direction by showing a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggested a foundational model for generation tasks in speech can be built with generative pre-training.
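For intuition, a minimal sketch of a conditional flow-matching training step on speech features with randomly masked conditions (my own toy version assuming a straight interpolation path; not SpeechFlow's actual recipe or architecture):

```python
import torch

class ToyVelocityModel(torch.nn.Module):
    """Stand-in for the real network: predicts velocity from x_t, t and condition."""
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Linear(2 * dim + 1, dim)

    def forward(self, x_t, t, condition):
        t_frames = t.expand(-1, x_t.size(1), -1)  # broadcast time to every frame
        return self.net(torch.cat([x_t, condition, t_frames], dim=-1))

def flow_matching_step(model, feats):
    """feats: (batch, frames, dim) speech features, e.g. mel spectrograms."""
    noise = torch.randn_like(feats)            # x_0 ~ N(0, I)
    t = torch.rand(feats.size(0), 1, 1)        # one random time per utterance
    x_t = (1 - t) * noise + t * feats          # point on the straight path
    target_velocity = feats - noise            # d x_t / dt along that path

    # keep ~20% of frames as observed condition; the model must in-fill the rest
    cond_mask = (torch.rand(feats.shape[:2]) < 0.2).unsqueeze(-1)
    condition = feats * cond_mask

    pred_velocity = model(x_t, t, condition)
    return ((pred_velocity - target_velocity) ** 2).mean()

model = ToyVelocityModel(dim=80)
print(flow_matching_step(model, torch.randn(2, 100, 80)).item())
```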