Somewhat interesting discussion on the paper Autoregressive Speech Synthesis without Vector Quantization
https://twitter.com/unilightwf/status/1811610158713413716
It's interesting that Microsoft is turning to continuous models. I have seen a few other papers going in the same direction too.
The claim that discrete models have issues with continuity is quite valid. For example, it is easy to show that speaker similarity is not perfect, since the discrete representation doesn't really follow the continuous x-vector. It is strange that none of the discrete papers mention that.
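This is easy to check yourself. A minimal sketch below, using SpeechBrain's pretrained ECAPA-TDNN speaker encoder as a stand-in for an x-vector extractor (my choice, not what any particular paper uses): embed the original recording and the audio resynthesized from discrete tokens, then compare cosine similarity.

```python
# Minimal speaker-similarity check, assuming SpeechBrain's ECAPA-TDNN encoder.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained/spkrec-ecapa-voxceleb",
)

def speaker_embedding(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    return encoder.encode_batch(wav).squeeze()  # 192-dim speaker embedding

ref = speaker_embedding("reference.wav")        # original recording
resyn = speaker_embedding("resynthesized.wav")  # decoded from discrete tokens

similarity = torch.nn.functional.cosine_similarity(ref, resyn, dim=0)
print(f"speaker cosine similarity: {similarity.item():.3f}")
```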
We have open-sourced Emilia for speech generation, a 101k-hour dataset in six languages collected in the wild (e.g. talk shows, interviews, debates). Check out the performance of the model trained with it.
HF: https://huggingface.co/datasets/amphion/Emilia
ArXiv: https://arxiv.org/abs/2407.05361
Demo: https://emilia-dataset.github.io/Emilia-Demo-Page/
Many LLM papers recently; this one is interesting for its claims of very good accuracy on Chinese and English.
https://arxiv.org/abs/2407.04675
Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li, Xiaoyang Li, Zeyang Li, Zehua Lin, Rui Liu, Shouda Liu, Lu Lu, Yizhou Lu, Jingting Ma, Shengtao Ma, Yulin Pei, Chen Shen, Tian Tan, Xiaogang Tian, Ming Tu, Bo Wang, Hao Wang, Yuping Wang, Yuxuan Wang, Hanzhang Xia, Rui Xia, Shuangyi Xie, Hongmin Xu, Meng Yang, Bihong Zhang, Jun Zhang, Wanyi Zhang, Yang Zhang, Yawei Zhang, Yijie Zheng, Ming Zou
Modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.
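Seed-ASR itself is closed, but the AcLLM setup from the abstract is easy to picture: project continuous speech representations into the LLM embedding space and feed them together with the context. A rough generic sketch below; the module names, projector, and loss masking are my own placeholders, not Seed-ASR's actual architecture.

```python
# Generic audio-conditioned LLM (AcLLM) pattern: speech frames are projected
# into the LLM embedding space and prepended to the textual context.
import torch
import torch.nn as nn

class AudioConditionedLM(nn.Module):
    def __init__(self, audio_encoder: nn.Module, llm, audio_dim: int, llm_dim: int):
        super().__init__()
        self.audio_encoder = audio_encoder              # any pretrained speech encoder
        self.projector = nn.Linear(audio_dim, llm_dim)  # map speech frames into LLM space
        self.llm = llm                                  # a HuggingFace-style causal LM

    def forward(self, audio, context_ids, transcript_ids):
        speech = self.projector(self.audio_encoder(audio))   # (B, T_a, llm_dim)
        embed = self.llm.get_input_embeddings()
        context = embed(context_ids)                          # (B, T_c, llm_dim)
        transcript = embed(transcript_ids)                    # (B, T_t, llm_dim)
        inputs = torch.cat([speech, context, transcript], dim=1)
        # compute the LM loss only on the transcript positions
        prefix_len = speech.size(1) + context.size(1)
        ignore = torch.full((speech.size(0), prefix_len), -100,
                            dtype=torch.long, device=transcript_ids.device)
        labels = torch.cat([ignore, transcript_ids], dim=1)
        return self.llm(inputs_embeds=inputs, labels=labels)
```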
Lessons From the Autoregressive/Nonautoregressive Battle in Speech Synthesis
Xu Tan
Microsoft Research Asia
xuta@microsoft.com
2024/1/24
https://tan-xu.github.io/AR-NAR-TTS.pdf
Not the only battle. Discrete/continuous is another one.
Kyutai, a French AI lab with $300M in funding, just unveiled Moshi, an open-source GPT-4o competitor.
Moshi is a real-time multimodal model that can listen and speak.
Code, model, and paper will be released soon.
https://www.youtube.com/live/hm2IJSKcYvo
SPECOM 2024
https://specom2024.ftn.uns.ac.rs/
Paper Submission Deadline is July 15, 2024
Everyone is welcome to participate
https://arxiv.org/abs/2406.18301
MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research
Song Li, Yongbin You, Xuezhi Wang, Zhengkun Tian, Ke Ding, Guanglu Wan
Recently, multilingual artificial intelligence assistants, exemplified by ChatGPT, have gained immense popularity. As a crucial gateway to human-computer interaction, multilingual automatic speech recognition (ASR) has also garnered significant attention, as evidenced by systems like Whisper. However, the proprietary nature of the training data has impeded researchers' efforts to study multilingual ASR. This paper introduces MSR-86K, an evolving, large-scale multilingual corpus for speech recognition research. The corpus is derived from publicly accessible videos on YouTube, comprising 15 languages and a total of 86,300 hours of transcribed ASR data.
Silero VAD upgraded to V5
https://github.com/snakers4/silero-vad/releases/tag/v5.0
The improved version is 3x faster and was trained on 6,000+ languages.
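A quick way to try it, following the torch.hub usage from the silero-vad README (worth double-checking against the v5 release notes in case the interface changed):

```python
# Run Silero VAD on a 16 kHz file and print detected speech segments.
import torch

model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

wav = read_audio("speech.wav", sampling_rate=16000)
timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(timestamps)  # list of {'start': ..., 'end': ...} sample indices
```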
Audio-driven video synthesis. VASA project by Microsoft
https://www.microsoft.com/en-us/research/project/vasa-1/
Speech separation (PixIT, paper below) was recently merged into pyannote.
Model here:
https://huggingface.co/pyannote/speech-separation-ami-1.0
https://arxiv.org/abs/2403.02288
PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings
Joonas Kalda, Clément Pagés, Ricard Marxer, Tanel Alumäe, Hervé Bredin
A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) was proposed as an unsupervised alternative that uses real recordings, yet struggles with overseparation and adapting to long-form audio. We introduce PixIT, a joint approach that combines permutation invariant training (PIT) for speaker diarization (SD) and MixIT for SSep. With a small extra requirement of needing SD labels, it solves the problem of overseparation and allows stitching local separated sources leveraging existing work on clustering-based neural SD. We measure the quality of the separated sources via applying automatic speech recognition (ASR) systems to them. PixIT boosts the performance of various ASR systems across two meeting corpora both in terms of the speaker-attributed and utterance-based word error rates while not requiring any fine-tuning.
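A minimal usage sketch for the pyannote pipeline above, following pyannote.audio's usual Pipeline API; the exact outputs of the separation pipeline should be checked against the model card.

```python
# Joint diarization + separation with the pyannote AMI pipeline.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speech-separation-ami-1.0",
    use_auth_token="HF_TOKEN",  # gated model, needs a HuggingFace token
)

diarization, sources = pipeline("meeting.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
# `sources` holds one separated waveform per speaker (per the model card)
```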
We are thrilled to announce YODAS v2!
- 400k hours, 149 languages of speech data (same as v1)
- supporting long-form speech
- higher sampling rate (24 kHz)
https://huggingface.co/datasets/espnet/yodas2
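A sketch of peeking at one language subset with the datasets library in streaming mode; the config name "en000" is my assumption based on how YODAS shards are usually named, so check the dataset card for the real configs.

```python
# Stream one YODAS v2 shard without downloading the full 400k hours.
from datasets import load_dataset

yodas2 = load_dataset("espnet/yodas2", "en000", streaming=True, split="train")
for sample in yodas2.take(1):
    print(sample.keys())  # expect audio plus long-form text/utterance metadata
```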
https://github.com/facebookresearch/ears_dataset
Highlights
* 100 h of speech data from 107 speakers
* high-quality recordings at 48 kHz in an anechoic chamber
* high speaker diversity with speakers from different ethnicities and ages ranging from 18 to 75 years
* full dynamic range of human speech, ranging from whispering to yelling
* 18 minutes of freeform monologues per speaker
* sentence reading in 7 different reading styles (regular, loud, whisper, high pitch, low pitch, fast, slow)
* emotional reading and freeform tasks covering 22 different emotions for each speaker
The code for the above
https://github.com/aispeech-lab/w2v-cif-bert
Three speech LLMs for today
From Alibaba
https://github.com/cwang621/blsp-emo
https://arxiv.org/abs/2406.03872
BLSP-Emo: Towards Empathetic Large Speech-Language Models
Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, Jiajun Zhang
The recent release of GPT-4o showcased the potential of end-to-end multimodal models, not just in terms of low latency but also in their ability to understand and generate expressive speech with rich emotions. While the details are unknown to the open research community, it likely involves significant amounts of curated data and compute, neither of which is readily accessible. In this paper, we present BLSP-Emo (Bootstrapped Language-Speech Pretraining with Emotion support), a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech and generate empathetic responses. BLSP-Emo utilizes existing speech recognition (ASR) and speech emotion recognition (SER) datasets through a two-stage process. The first stage focuses on semantic alignment, following recent work on pretraining speech-language models using ASR data. The second stage performs emotion alignment with the pretrained speech-language model on an emotion-aware continuation task constructed from SER data. Our experiments demonstrate that the BLSP-Emo model excels in comprehending speech and delivering empathetic responses, both in instruction-following tasks and conversations.
Another one from a random guy, more of a hype thing
https://github.com/fixie-ai/ultravox
Another one from another random guy
https://github.com/jamesparsloe/llm.speech
Some Whisper speedups and low-bit quantization
https://mobiusml.github.io/whisper-static-cache-blog/
https://huggingface.co/datasets/FBK-MT/Speech-MASSIVE
Speech-MASSIVE is a multilingual Spoken Language Understanding (SLU) dataset comprising the speech counterpart for a portion of the MASSIVE textual corpus. Speech-MASSIVE covers 12 languages (Arabic, German, Spanish, French, Hungarian, Korean, Dutch, Polish, European Portuguese, Russian, Turkish, and Vietnamese) from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks. MASSIVE utterances' labels span 18 domains, with 60 intents and 55 slots. Full train split is provided for French and German, and for all the 12 languages (including French and German), we provide few-shot train, validation, test splits. Few-shot train (115 examples) covers all 18 domains, 60 intents, and 55 slots (including empty slots).
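A sketch of loading one language with the datasets library; the "fr-FR" config name follows the MASSIVE locale convention and is my assumption, so check the dataset card first.

```python
# Load the French portion of Speech-MASSIVE and inspect its structure.
from datasets import load_dataset

speech_massive = load_dataset("FBK-MT/Speech-MASSIVE", "fr-FR")
print(speech_massive)                     # available splits
print(speech_massive["train"][0].keys())  # audio, transcript, intent/slot labels
```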
Whisper actually needs much more exploration. This is a great paper on a relevant subject:
https://arxiv.org/pdf/2406.05806
Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper
Chih-Kai Yang, Kuan-Po Huang, Hung-yi Lee
This research explores how the information of prompts interacts with the high-performing speech recognition model, Whisper. We compare its performances when prompted by prompts with correct information and those corrupted with incorrect information. Our results unexpectedly show that Whisper may not understand the textual prompts in a human-expected way. Additionally, we find that performance improvement is not guaranteed even with stronger adherence to the topic information in textual prompts. It is also noted that English prompts generally outperform Mandarin ones on datasets of both languages, likely due to differences in training data distributions for these languages despite the mismatch with pre-training scenarios. Conversely, we discover that Whisper exhibits awareness of misleading information in language tokens by ignoring incorrect language tokens and focusing on the correct ones. In sum, We raise insightful questions about Whisper's prompt understanding and reveal its counter-intuitive behaviors. We encourage further studies.
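The core experiment is easy to reproduce informally with the openai-whisper package's initial_prompt argument: transcribe the same clip with a relevant and an unrelated prompt and compare the outputs.

```python
# Compare Whisper transcriptions under a relevant vs. a misleading prompt.
import whisper

model = whisper.load_model("large-v3")

for prompt in ["A lecture about machine learning.",
               "A cooking show about Italian pasta."]:
    result = model.transcribe("lecture.wav", initial_prompt=prompt)
    print(f"prompt={prompt!r}\n  -> {result['text']}\n")
```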
There are two extremes these days: one party claims that LLMs have magical emergent abilities, the other claims that AI is overhyped and will end soon.
The real situation is actually very simple. I have said this before in several talks but have never seen this simple explanation anywhere. Emergent abilities exist, but they are not magical. LLMs are a real thing, certainly not hype.
It is actually pretty straightforward why LLMs "reason" or, to be more exact, can operate on complex concepts. By processing huge amounts of text with a variety of cost functions, they build an internal representation where those concepts are represented as simple nodes (neurons or groups of neurons). So LLMs really do distill knowledge and build a semantic graph. Alternatively, you can think of them as a very good principal component analysis that extracts many important aspects and their relations. I have said before that the multi-objective aspect is quite important here: it helps to find unrelated concepts faster, and Whisper is a good example of that.
Once knowledge is distilled, you can build on top of that.
There were many attempts to build a semantic graph before, but manual efforts never succeeded because of scale. The real huge advancement is that the automated process works.
Many blame recent video generation LLMs for misunderstanding physics. It's a temporary thing; soon they will understand physics very well.
ReDimNet from IDVoice is coming at Interspeech 2024
When new tech arrives I try to be optimistic. Another attempt to create a SpeechLLM:
https://github.com/skit-ai/SpeechLLM
LibriSpeech test-clean WER is 6.73, while a good system's WER approaches 1.4.
On the other hand, Google Gemini Pro 1.5 WER is quite good on diverse datasets.
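For reference, the metric behind these numbers is plain word error rate; a minimal computation with the jiwer package:

```python
# Word error rate: (substitutions + deletions + insertions) / reference words.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jump over a lazy dog"

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # 2 errors / 9 words ≈ 22.22%
```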
WavLab's XEUS - an SSL speech encoder that covers 4,000+ languages!
XEUS is trained on over 1 million hours of speech. It outperforms both MMS 1B and w2v-BERT v2 2.0 on many tasks.
We're releasing the code, checkpoints, and our 4000+ lang. data!
https://twitter.com/chenwanch1/status/1807834060867186886
Paper: https://wanchichen.github.io/pdf/xeus.pdf
Project Page: https://wavlab.org/activities/2024/xeus/
You can also download the model and crawled data from HuggingFace:
https://huggingface.co/espnet/xeus
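A tiny sketch of pulling the checkpoint locally with huggingface_hub; how to actually load XEUS for feature extraction is best taken from the project page.

```python
# Download the XEUS checkpoint and configs from the HuggingFace Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("espnet/xeus")
print(local_dir)  # folder with the checkpoint and config files
```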
There is still big demand for streaming TTS
https://github.com/OpenT2S/inferStreamHiFiGAN
From Microsoft. The only thing is that it requires 50k hours of data for training.
https://www.microsoft.com/en-us/research/project/e2-tts/
https://arxiv.org/abs/2406.18009
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda
This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained based on the audio infilling task. Unlike many previous works, it does not require additional components (e.g., duration model, grapheme-to-phoneme) or complex techniques (e.g., monotonic alignment search). Despite its simplicity, E2 TTS achieves state-of-the-art zero-shot TTS capabilities that are comparable to or surpass previous works, including Voicebox and NaturalSpeech 3. The simplicity of E2 TTS also allows for flexibility in the input representation. We propose several variants of E2 TTS to improve usability during inference. See this https URL for demo samples.
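The input trick from the abstract is almost the whole story: the character sequence is simply padded with a filler token up to the target mel length, and alignment is learned implicitly by the flow-matching infilling objective. A toy sketch of that padding (token names are illustrative only):

```python
# Pad a character sequence with filler tokens to match the mel frame count,
# as the E2 TTS abstract describes; no duration model or G2P is involved.
FILLER = "<F>"

def extend_with_fillers(text: str, mel_frames: int) -> list[str]:
    chars = list(text)
    assert len(chars) <= mel_frames, "mel sequence must be at least as long as the text"
    return chars + [FILLER] * (mel_frames - len(chars))

print(extend_with_fillers("Hi!", 8))
# ['H', 'i', '!', '<F>', '<F>', '<F>', '<F>', '<F>']
```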
🚀 Exciting News! mHuBERT-147 is Here! 🚀
We've just released mHuBERT-147, a compact powerful multilingual model, reaching top position on ML-SUPERB with just 95M parameters! 🌟
Trained on balanced, high-quality, open-license data, this model rivals MMS-1B but is 10x smaller.
https://x.com/mzboito/status/1800509179226148919/photo/1
Another TTS
https://github.com/Camb-ai/MARS5-TTS
We actually evaluated SLAM-LLM on real tasks. It doesn't work as expected: it performs well on LibriSpeech but is garbage on every real task. So the proper way to integrate speech tokens into an LLM is still unknown. This paper from Cambridge is more reasonable and actually compares with SLAM.
https://arxiv.org/abs/2406.00522
Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning
Keqi Deng, Guangzhi Sun, Philip C. Woodland
Wav2Prompt is proposed which allows straightforward integration between spoken input and a text-based large language model (LLM). Wav2Prompt uses a simple training process with only the same data used to train an automatic speech recognition (ASR) model. After training, Wav2Prompt learns continuous representations from speech and uses them as LLM prompts. To avoid task over-fitting issues found in prior work and preserve the emergent abilities of LLMs, Wav2Prompt takes LLM token embeddings as the training targets and utilises a continuous integrate-and-fire mechanism for explicit speech-text alignment. Therefore, a Wav2Prompt-LLM combination can be applied to zero-shot spoken language tasks such as speech translation (ST), speech understanding (SLU), speech question answering (SQA) and spoken-query-based QA (SQQA). It is shown that for these zero-shot tasks, Wav2Prompt performs similarly to an ASR-LLM cascade and better than recent prior work. If relatively small amounts of task-specific paired data are available in few-shot scenarios, the Wav2Prompt-LLM combination can be end-to-end (E2E) fine-tuned. The Wav2Prompt-LLM combination then yields greatly improved results relative to an ASR-LLM cascade for the above tasks. For instance, for English-French ST with the BLOOMZ-7B1 LLM, a Wav2Prompt-LLM combination gave a 8.5 BLEU point increase over an ASR-LLM cascade.
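The continuous integrate-and-fire (CIF) mechanism mentioned in the abstract is worth spelling out. A compact sketch of generic CIF (not Wav2Prompt's exact implementation): per-frame weights are accumulated, and each time they cross 1.0 a token-level vector is fired as the weighted sum of the consumed frames.

```python
# Generic continuous integrate-and-fire: turns frame-level encoder outputs
# into a shorter sequence of token-level vectors using learned weights.
import torch

def cif(frames: torch.Tensor, alphas: torch.Tensor, threshold: float = 1.0):
    """frames: (T, D) encoder outputs; alphas: (T,) non-negative weights."""
    fired, acc_weight, acc_vec = [], 0.0, torch.zeros(frames.size(1))
    for h, a in zip(frames, alphas):
        a = float(a)
        if acc_weight + a < threshold:
            acc_weight += a
            acc_vec = acc_vec + a * h
        else:
            remainder = threshold - acc_weight
            fired.append(acc_vec + remainder * h)  # emit one token-level vector
            acc_weight = a - remainder             # leftover weight starts the next token
            acc_vec = acc_weight * h
    return torch.stack(fired) if fired else torch.empty(0, frames.size(1))

tokens = cif(torch.randn(50, 256), torch.rand(50) * 0.3)
print(tokens.shape)  # roughly (floor(sum(alphas)), 256)
```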
Meanwhile LAION also plans to implement LLMs with speech tokens for TTS
https://laion.ai/notes/open-gpt-4-o/
A winner of the discrete speech TTS challenge:
https://arxiv.org/abs/2403.13720
UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge
Wataru Nakata, Kazuki Yamauchi, Dong Yang, Hiroaki Hyodo, Yuki Saito
We present UTDUSS, the UTokyo-SaruLab system submitted to Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge. The challenge focuses on using discrete speech unit learned from large speech corpora for some tasks. We submitted our UTDUSS system to two text-to-speech tracks: Vocoder and Acoustic+Vocoder. Our system incorporates neural audio codec (NAC) pre-trained on only speech corpora, which makes the learned codec represent rich acoustic features that are necessary for high-fidelity speech reconstruction. For the acoustic+vocoder track, we trained an acoustic model based on Transformer encoder-decoder that predicted the pre-trained NAC tokens from text input. We describe our strategies to build these models, such as data selection, downsampling, and hyper-parameter tuning. Our system ranked in second and first for the Vocoder and Acoustic+Vocoder tracks, respectively.