Telegram channel speechtech - Speech Technology

Speech Technology

Hm, 12 million hours and only 13% more accurate than Whisper

Key stats:
- Trained on 12.5M hours of audio data
- 13.5% more accurate than models like Whisper and >22% more accurate than APIs from Azure/AWS/Google
- Up to 30% fewer hallucinations than seq2seq models like Whisper
- 71% better speaker count estimation and 14% better word timestamp estimation compared to our prior models
- 38 seconds to process a 60-minute audio file

https://twitter.com/AssemblyAI/status/1775527558412460120
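For scale, the throughput claim works out to roughly 95x faster than real time; this is just arithmetic on the numbers quoted above:

```python
# Real-time factor implied by "38 seconds for a 60-minute file"
audio_seconds = 60 * 60
processing_seconds = 38

rtf = processing_seconds / audio_seconds
speedup = audio_seconds / processing_seconds
print(f"RTF ~ {rtf:.4f} (~{speedup:.0f}x faster than real time)")
```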


Speech Technology

Announcing the VoiceMOS Challenge 2024! VMC'24 has been accepted as a special session at SLT 2024. There will be 3 tracks. The challenge tentatively starts on April 10.

Registration form: https://forms.gle/tBBeNdvHghAdjTg27
Website:
https://sites.google.com/view/voicemos-challenge


Speech Technology

https://github.com/KdaiP/StableTTS


Speech Technology

Interesting concept of error chains in autoregressive models, and also a nice reminder that phones are real

Transducers with Pronunciation-Aware Embeddings for Automatic Speech Recognition
ICASSP 2024
Hainan Xu; Zhehuai Chen; Fei Jia; Boris Ginsburg

This paper proposes Transducers with Pronunciation-aware Embeddings (PET). Unlike conventional Transducers where the decoder embeddings for different tokens are trained independently, the PET model’s decoder embedding incorporates shared components for text tokens with the same or similar pronunciations. With experiments conducted in multiple datasets in Mandarin Chinese and Korean, we show that PET models consistently improve speech recognition accuracy compared to conventional Transducers. Our investigation also uncovers a phenomenon that we call error chain reactions. Instead of recognition errors being evenly spread throughout an utterance, they tend to group together, with subsequent errors often following earlier ones. Our analysis shows that PET models effectively mitigate this issue by substantially reducing the likelihood of the model generating additional errors following a prior one. Our implementation will be open-sourced with the NeMo toolkit.

https://ieeexplore.ieee.org/abstract/document/10447685
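A minimal sketch of the embedding idea as described in the abstract; the module below is my own illustration (names and structure assumed), not the paper's NeMo implementation. Each token's decoder embedding is the sum of a token-private embedding and an embedding shared by all tokens with the same pronunciation class:

```python
import torch
import torch.nn as nn

class PronunciationAwareEmbedding(nn.Module):
    """Toy sketch: decoder embedding = token-private part + shared pronunciation part."""

    def __init__(self, vocab_size, num_pron_classes, dim, token_to_pron):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)       # trained independently per token
        self.pron_emb = nn.Embedding(num_pron_classes, dim)  # shared across homophones
        # token_to_pron: LongTensor [vocab_size] mapping each token to its pronunciation class
        self.register_buffer("token_to_pron", token_to_pron)

    def forward(self, token_ids):
        return self.token_emb(token_ids) + self.pron_emb(self.token_to_pron[token_ids])

# e.g. two Mandarin characters with identical pinyin map to the same pronunciation class,
# so they share the second embedding term while keeping their own private term.
```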


Speech Technology

The UTokyo-SaruLab team won 1st place in the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge TTS (Acoustic + Vocoder) track

https://arxiv.org/abs/2403.13720

UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge

Wataru Nakata, Kazuki Yamauchi, Dong Yang, Hiroaki Hyodo, Yuki Saito

We present UTDUSS, the UTokyo-SaruLab system submitted to Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge. The challenge focuses on using discrete speech unit learned from large speech corpora for some tasks. We submitted our UTDUSS system to two text-to-speech tracks: Vocoder and Acoustic+Vocoder. Our system incorporates neural audio codec (NAC) pre-trained on only speech corpora, which makes the learned codec represent rich acoustic features that are necessary for high-fidelity speech reconstruction. For the acoustic+vocoder track, we trained an acoustic model based on Transformer encoder-decoder that predicted the pre-trained NAC tokens from text input. We describe our strategies to build these models, such as data selection, downsampling, and hyper-parameter tuning. Our system ranked in second and first for the Vocoder and Acoustic+Vocoder tracks, respectively.


Speech Technology

The leaderboard for the tasks is now available online and can be accessed at: https://huggingface.co/spaces/NOTSOFAR/CHiME8Challenge.
Additionally, the teams have refined the baselines.


Speech Technology

Feels like the e2e era has ended. NaturalSpeech and now this. The coming year will be the year of decomposition

https://arxiv.org/abs/2403.06387

Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR
Yufeng Yang, Ashutosh Pandey, DeLiang Wang

It has been shown that the intelligibility of noisy speech can be improved by speech enhancement (SE) algorithms. However, monaural SE has not been established as an effective frontend for automatic speech recognition (ASR) in noisy conditions compared to an ASR model trained on noisy speech directly. The divide between SE and ASR impedes the progress of robust ASR systems, especially as SE has made major advances in recent years. This paper focuses on eliminating this divide with an ARN (attentive recurrent network) time-domain and a CrossNet time-frequency domain enhancement models. The proposed systems fully decouple frontend enhancement and backend ASR trained only on clean speech. Results on the WSJ, CHiME-2, LibriSpeech, and CHiME-4 corpora demonstrate that ARN and CrossNet enhanced speech both translate to improved ASR results in noisy and reverberant environments, and generalize well to real acoustic scenarios. The proposed system outperforms the baselines trained on corrupted speech directly. Furthermore, it cuts the previous best word error rate (WER) on CHiME-2 by 28.4% relatively with a 5.57% WER, and achieves 3.32/4.44% WER on single-channel CHiME-4 simulated/real test data without training on CHiME-4.


Speech Technology

https://anonymous.4open.science/w/ham-tts/

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
Anonymous Submission to ACL, 2024


Abstract
Token-based text-to-speech (TTS) models have emerged as a promising avenue for generating natural and realistic speech, yet they grapple with low pronunciation accuracy, speaking style and timbre inconsistency, and a substantial need for diverse training data. In response, we introduce a novel hierarchical acoustic modeling approach complemented by a tailored data augmentation strategy and train it on the combination of real and synthetic data, scaling the data size up to 650k hours, leading to the zero-shot TTS model with 0.8B parameters. Specifically, our method incorporates a latent variable sequence containing supplementary acoustic information based on refined self-supervised learning (SSL) discrete units into the TTS model by a predictor. This significantly mitigates pronunciation errors and style mutations in synthesized speech. During training, we strategically replace and duplicate segments of the data to enhance timbre uniformity. Moreover, a pretrained few-shot voice conversion model is utilized to generate a plethora of voices with identical content yet varied timbres. This facilitates the explicit learning of utterance-level one-to-many mappings, enriching speech diversity and also ensuring consistency in timbre. Comparative experiments demonstrate our model's superiority over VALL-E in pronunciation precision and maintaining speaking style, as well as timbre continuity.


Speech Technology

Some developers are really fast

https://github.com/open-mmlab/Amphion/pull/152


Speech Technology

For the future

https://arxiv.org/abs/2402.01571

Spiking Music: Audio Compression with Event Based Auto-encoders

Martim Lisboa, Guillaume Bellec

Neurons in the brain communicate information via punctual events called spikes. The timing of spikes is thought to carry rich information, but it is not clear how to leverage this in digital systems. We demonstrate that event-based encoding is efficient for audio compression. To build this event-based representation we use a deep binary auto-encoder, and under high sparsity pressure, the model enters a regime where the binary event matrix is stored more efficiently with sparse matrix storage algorithms. We test this on the large MAESTRO dataset of piano recordings against vector quantized auto-encoders. Not only does our "Spiking Music compression" algorithm achieve a competitive compression/reconstruction trade-off, but selectivity and synchrony between encoded events and piano key strikes emerge without supervision in the sparse regime.
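The storage argument is easy to illustrate: once the binary event matrix is sparse enough, coordinate-style storage beats a dense bitmap. A toy comparison of the two representations (not the paper's codec, just the sparse-storage intuition):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
T, C, p = 10_000, 128, 0.01             # time steps, channels, event probability
events = rng.random((T, C)) < p         # binary event matrix

dense_bits = T * C                      # 1 bit per entry for a dense bitmap
coo = sparse.coo_matrix(events)
coo_bits = coo.nnz * 32                 # (row, col) per event with 16-bit indices

print(f"{coo.nnz} events, dense: {dense_bits} bits, COO: {coo_bits} bits")
```

With ~1% of entries active, the coordinate list is several times smaller than the bitmap, which is the regime the abstract refers to as "high sparsity pressure".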


Speech Technology

Auto-regressive GPT synthesizers are not going to be stable. Transformers are the key

https://github.com/choiHkk/Transformer-TTS-V2


Speech Technology

Emote Portrait Alive: Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

People say it is really good. No code yet

https://github.com/HumanAIGC/EMO


Speech Technology

https://www.youtube.com/watch?v=VXlpF3DrVP0


Speech Technology

This Bangla ASR model is trained on ~18,000 hours of YouTube data and gets a 2.5% word error rate on the test dataset.

https://huggingface.co/hishab/hishab_bn_fastconformer


Speech Technology

Review of neural codecs

https://arxiv.org/abs/2402.13236

Towards audio language modeling - an overview

Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kai-wei Chang, Ho-Lam Chung, Alexander H. Liu, Hung-yi Lee

Neural audio codecs are initially introduced to compress audio data into compact codes to reduce transmission latency. Researchers recently discovered the potential of codecs as suitable tokenizers for converting continuous audio into discrete codes, which can be employed to develop audio language models (LMs). Numerous high-performance neural audio codecs and codec-based LMs have been developed. The paper aims to provide a thorough and systematic overview of the neural audio codec models and codec-based LMs.


Speech Technology

The audio-to-audio feature in stableaudio opens up new workflows for rapid sonic exploration in audiovisual production.

https://twitter.com/jordiponsdotme/status/1775779209891246560


Speech Technology

English Anime voice dataset

https://huggingface.co/datasets/ShoukanLabs/AniSpeech

and a project to fine-tune StyleTTS2 using it

https://huggingface.co/ShoukanLabs/Vokan


Speech Technology

Race for parameters in TTS models begins

https://www.linkedin.com/feed/update/urn:li:activity:7179705576422039552/

We have just launched our new Speech Synthesis Foundation Model (SSFM), a voice AI model built on a large language model with 14x more parameters than our legacy model. The new model has been trained on 75x more speech data, and with it we have achieved state-of-the-art expressiveness of synthetic voice. This means that just 10 seconds of a voice prompt is enough to mimic its voice identity and speaking style. Check out the comparison with OpenAI and Azure TTS in the following link: https://lnkd.in/gaU6zGfu

https://typecast.ai/learn/typecast-ssfm-text-to-speech/


Speech Technology

Announcing VoiceCraft

SotA for both speech editing and zero-shot text-to-speech, outperforming VALL-E, XTTS-v2, etc.

VoiceCraft works on in-the-wild data such as movies, random videos and podcasts

We fully open source it at https://github.com/jasonppy/VoiceCraft


Speech Technology

Distil-Whisper is now more accurate, efficient and accessible 🚀

distil-large-v3 is 6x faster and within 1% WER of large-v3, with reduced hallucinations and better long-form support across all libraries

Previous Distil-Whisper models were trained on an average audio length of 7 seconds, so predictions beyond this point were largely inaccurate

To preserve Whisper's ability to transcribe 30-second chunks, we added pre-processing to pack audio to 30 seconds

Repo: https://huggingface.co/distil-whisper/distil-large-v3

https://twitter.com/sanchitgandhi99/status/1770877844823896117
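A minimal way to try it with the 🤗 Transformers ASR pipeline; the chunking and batching settings below are assumptions, check the model card for the recommended long-form configuration:

```python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",  # or "cpu"
)

# chunked long-form transcription; chunk_length_s/batch_size are assumed values
result = asr("audio.mp3", chunk_length_s=30, batch_size=8, return_timestamps=True)
print(result["text"])
```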


Speech Technology

ICASSP 2024 proceedings released

https://ieeexplore.ieee.org/xpl/conhome/10445798/proceeding


Speech Technology

Checkpoints for NS3 too

https://huggingface.co/amphion/naturalspeech3_facodec


Speech Technology

Some further thoughts on NaturalSpeech 3

https://alphacephei.com/nsh/2024/03/08/naturalspeech-factorization.html


Speech Technology

https://github.com/nii-yamagishilab/ZMM-TTS


Speech Technology

Industrial cases are always interesting because they demonstrate the real state of the technology. This TTS is nice

https://rime.ai/blog/introducing-mist


Speech Technology

SLT 2024 will be in Macao, China, December 2-5, 2024


Speech Technology

Sherpa merged NNAPI support on Android

https://github.com/k2-fsa/sherpa-onnx/pull/160


Speech Technology

Nice research on more transparent models is underway

https://github.com/lxy-peter/EfficientPunct


Speech Technology

Hugging Face is down because of this:

It took a while but we finally released our massive YouTube speech dataset: https://huggingface.co/datasets/espnet/yodas. 370k hours across 140 languages.

https://twitter.com/chenwanch1/status/1762942313972592676

The size is about 100TB
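Given the size, streaming is the sane way to poke at it. A hedged sketch with 🤗 Datasets; the "en000" subset name is an assumption, see the dataset card for the actual configs:

```python
from datasets import load_dataset

# Stream rather than download ~100TB; "en000" is an assumed config name.
ds = load_dataset("espnet/yodas", "en000", split="train", streaming=True)

for sample in ds:
    print(sample.keys())  # inspect one example, then stop
    break
```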


Speech Technology

TTS has come to the point where data has no labels

https://arxiv.org/abs/2310.16338

Generative Pre-training for Speech with Flow Matching

Alexander H. Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu

Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoder are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step toward this direction by showing a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggested a foundational model for generation tasks in speech can be built with generative pre-training.
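For reference, flow-matching pre-training of this kind boils down to regressing a velocity field along a straight path between noise and data, conditioned on a masked view of the target. A bare-bones sketch of a generic training step under those assumptions (my own illustration, not SpeechFlow's actual code):

```python
import torch

def flow_matching_step(model, x1, mask):
    """One training step of flow matching with a masked condition.

    model(x_t, t, cond) predicts a velocity; x1 is a batch of target features
    (e.g. mel frames, shape [B, T, D]); cond is x1 with random spans masked out.
    """
    x0 = torch.randn_like(x1)             # noise sample
    t = torch.rand(x1.shape[0], 1, 1)     # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1            # straight-line interpolation
    target_velocity = x1 - x0             # constant velocity along the path
    cond = x1 * mask                      # masked condition (partial context)
    pred = model(xt, t, cond)
    return ((pred - target_velocity) ** 2).mean()
```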
