speechtech

Telegram channel speechtech - Speech Technology

Speech Technology

A somewhat interesting discussion of the paper Autoregressive Speech Synthesis without Vector Quantization

https://twitter.com/unilightwf/status/1811610158713413716

It's interesting that Microsoft is turning to continuous models. I have seen a few other papers going in the same direction too.

The claim that discrete models have issues with continuity is quite valid. For example, it is easy to show that speaker similarity is not perfect, since a discrete representation does not really follow the continuous x-vector. It is strange that none of the discrete papers mention that.
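
As a rough illustration of how one could check this, here is a minimal sketch that compares speaker embeddings of an original recording and a resynthesized utterance. The SpeechBrain ECAPA model (as a stand-in for x-vectors) and the file names are my own assumptions for illustration, not anything from the discussion.

```python
# Minimal sketch (assumptions: SpeechBrain ECAPA embeddings as an x-vector stand-in,
# illustrative file names).
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, 16000)
    return encoder.encode_batch(wav).squeeze()

ref = embed("reference.wav")     # original speaker recording
syn = embed("synthesized.wav")   # TTS output for the same speaker

cos = torch.nn.functional.cosine_similarity(ref, syn, dim=0).item()
print(f"speaker cosine similarity: {cos:.3f}")  # noticeably low values indicate speaker drift
```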


Speech Technology

We have open-sourced Emilia for speech generation, a 101k-hour dataset in six languages collected in the wild (e.g., talk shows, interviews, debates). Check out the performance of the model trained with it.
HF: https://huggingface.co/datasets/amphion/Emilia
ArXiv: https://arxiv.org/abs/2407.05361
Demo: https://emilia-dataset.github.io/Emilia-Demo-Page/


Speech Technology

Many LLM papers recently; this one is interesting for its claims of very good accuracy on Chinese and English.

https://arxiv.org/abs/2407.04675

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li, Xiaoyang Li, Zeyang Li, Zehua Lin, Rui Liu, Shouda Liu, Lu Lu, Yizhou Lu, Jingting Ma, Shengtao Ma, Yulin Pei, Chen Shen, Tian Tan, Xiaogang Tian, Ming Tu, Bo Wang, Hao Wang, Yuping Wang, Yuxuan Wang, Hanzhang Xia, Rui Xia, Shuangyi Xie, Hongmin Xu, Meng Yang, Bihong Zhang, Jun Zhang, Wanyi Zhang, Yang Zhang, Yawei Zhang, Yijie Zheng, Ming Zou

Modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.


Speech Technology

Lessons From the Autoregressive/Nonautoregressive Battle in Speech Synthesis
Xu Tan
Microsoft Research Asia
xuta@microsoft.com
2024/1/24

https://tan-xu.github.io/AR-NAR-TTS.pdf

Not the only battle. Discrete/continuous is another one.


Speech Technology

Kyutai, a French AI lab with $300M in funding, just unveiled Moshi, an open-source GPT-4o competitor.

Moshi is a real-time multimodal model that can listen, hear, and speak.

The code, model, and paper will be released soon.

https://www.youtube.com/live/hm2IJSKcYvo


Speech Technology

SPECOM 2024

https://specom2024.ftn.uns.ac.rs/

Paper Submission Deadline is July 15, 2024

Everyone is welcome to participate


Speech Technology

https://arxiv.org/abs/2406.18301

MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research

Song Li, Yongbin You, Xuezhi Wang, Zhengkun Tian, Ke Ding, Guanglu Wan

Recently, multilingual artificial intelligence assistants, exemplified by ChatGPT, have gained immense popularity. As a crucial gateway to human-computer interaction, multilingual automatic speech recognition (ASR) has also garnered significant attention, as evidenced by systems like Whisper. However, the proprietary nature of the training data has impeded researchers' efforts to study multilingual ASR. This paper introduces MSR-86K, an evolving, large-scale multilingual corpus for speech recognition research. The corpus is derived from publicly accessible videos on YouTube, comprising 15 languages and a total of 86,300 hours of transcribed ASR data.


Speech Technology

Silero VAD upgraded to V5

https://github.com/snakers4/silero-vad/releases/tag/v5.0

The improved version is 3x faster and is trained on 6000+ languages.
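
A minimal usage sketch following the silero-vad README (function and argument names as I recall them from the repo; they may shift slightly between versions):

```python
# Minimal silero-vad sketch: load the model via torch.hub and get speech timestamps.
import torch

model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

wav = read_audio('example.wav', sampling_rate=16000)                 # mono 16 kHz tensor
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_timestamps)  # list of {'start': ..., 'end': ...} dicts in samples
```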


Speech Technology

Audio-driven video synthesis: the VASA project by Microsoft.

https://www.microsoft.com/en-us/research/project/vasa-1/


Speech Technology

Recently merged into pyannote.

The model is here:

https://huggingface.co/pyannote/speech-separation-ami-1.0

https://arxiv.org/abs/2403.02288

PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings

Joonas Kalda, Clément Pagés, Ricard Marxer, Tanel Alumäe, Hervé Bredin

A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) was proposed as an unsupervised alternative that uses real recordings, yet struggles with overseparation and adapting to long-form audio. We introduce PixIT, a joint approach that combines permutation invariant training (PIT) for speaker diarization (SD) and MixIT for SSep. With a small extra requirement of needing SD labels, it solves the problem of overseparation and allows stitching local separated sources leveraging existing work on clustering-based neural SD. We measure the quality of the separated sources via applying automatic speech recognition (ASR) systems to them. PixIT boosts the performance of various ASR systems across two meeting corpora both in terms of the speaker-attributed and utterance-based word error rates while not requiring any fine-tuning.
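
A usage sketch along the lines of the model card (the pipeline is gated, so an HF token is needed; the return types and the layout of the separated sources are assumptions from my reading of the card, so double-check there):

```python
# Hedged sketch: joint diarization + separation with the pyannote PixIT pipeline.
import soundfile as sf
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speech-separation-ami-1.0",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder token
)

diarization, sources = pipeline("meeting.wav")

# Who spoke when.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")

# One separated waveform per speaker (assumed layout: samples x speakers at 16 kHz).
for i, speaker in enumerate(diarization.labels()):
    sf.write(f"{speaker}.wav", sources.data[:, i], 16000)
```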


Speech Technology

We are thrilled to announce YODAS v2!
- 400k hours of speech data in 149 languages (same as v1)
- support for long-form speech
- higher sampling rate (24 kHz)

https://huggingface.co/datasets/espnet/yodas2
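
A hedged loading sketch with HF datasets; the subset name "en000" is carried over from YODAS v1 and the field names are not checked, so consult the dataset card:

```python
# Hedged sketch: stream one YODAS v2 subset ("en000" is an assumed config name, see the card).
from datasets import load_dataset

ds = load_dataset("espnet/yodas2", "en000", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())  # inspect the audio / text fields before building a pipeline
```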


Speech Technology

https://github.com/facebookresearch/ears_dataset

Highlights
* 100 h of speech data from 107 speakers
* high-quality recordings at 48 kHz in an anechoic chamber
* high speaker diversity with speakers from different ethnicities and age range from 18 to 75 years
* full dynamic range of human speech, ranging from whispering to yelling
* 18 minutes of freeform monologues per speaker
* sentence reading in 7 different reading styles (regular, loud, whisper, high pitch, low pitch, fast, slow)
* emotional reading and freeform tasks covering 22 different emotions for each speaker


Speech Technology

The code for the above

https://github.com/aispeech-lab/w2v-cif-bert


Speech Technology

Three speech LLMs for today

From Alibaba

https://github.com/cwang621/blsp-emo

https://arxiv.org/abs/2406.03872

BLSP-Emo: Towards Empathetic Large Speech-Language Models

Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, Jiajun Zhang

The recent release of GPT-4o showcased the potential of end-to-end multimodal models, not just in terms of low latency but also in their ability to understand and generate expressive speech with rich emotions. While the details are unknown to the open research community, it likely involves significant amounts of curated data and compute, neither of which is readily accessible. In this paper, we present BLSP-Emo (Bootstrapped Language-Speech Pretraining with Emotion support), a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech and generate empathetic responses. BLSP-Emo utilizes existing speech recognition (ASR) and speech emotion recognition (SER) datasets through a two-stage process. The first stage focuses on semantic alignment, following recent work on pretraining speech-language models using ASR data. The second stage performs emotion alignment with the pretrained speech-language model on an emotion-aware continuation task constructed from SER data. Our experiments demonstrate that the BLSP-Emo model excels in comprehending speech and delivering empathetic responses, both in instruction-following tasks and conversations.

Another one from a random guy, more of a hype thing

https://github.com/fixie-ai/ultravox

Another one from another random guy

https://github.com/jamesparsloe/llm.speech


Speech Technology

Some Whisper speedups and low-bit quantization

https://mobiusml.github.io/whisper-static-cache-blog/
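
The blog post itself is about a static KV cache plus HQQ low-bit weights. As a much simpler, unrelated illustration of the weight-quantization side only, here is generic dynamic int8 quantization of a Hugging Face Whisper checkpoint (this is not the blog's method):

```python
# Generic illustration only (not the HQQ/static-cache approach from the blog):
# dynamic int8 quantization of Whisper's Linear layers for CPU inference.
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
qmodel = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
# qmodel.generate(...) can then be used as a drop-in CPU replacement, at some accuracy cost.
```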


Speech Technology

https://huggingface.co/datasets/FBK-MT/Speech-MASSIVE

Speech-MASSIVE is a multilingual Spoken Language Understanding (SLU) dataset comprising the speech counterpart for a portion of the MASSIVE textual corpus. Speech-MASSIVE covers 12 languages (Arabic, German, Spanish, French, Hungarian, Korean, Dutch, Polish, European Portuguese, Russian, Turkish, and Vietnamese) from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks. MASSIVE utterances' labels span 18 domains, with 60 intents and 55 slots. Full train split is provided for French and German, and for all the 12 languages (including French and German), we provide few-shot train, validation, test splits. Few-shot train (115 examples) covers all 18 domains, 60 intents, and 55 slots (including empty slots).
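
A hedged loading sketch; the configuration name "fr-FR" and the split names are my assumptions based on the MASSIVE naming convention, so check the dataset card for the exact ones:

```python
# Hedged sketch: load the French full-train portion of Speech-MASSIVE with HF datasets.
from datasets import load_dataset

ds = load_dataset("FBK-MT/Speech-MASSIVE", "fr-FR", split="train")
print(len(ds), ds.column_names)  # inspect the audio, intent, and slot fields
```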


Speech Technology

Whisper actually needs much more exploration. This is a great paper on a relevant subject.

https://arxiv.org/pdf/2406.05806

Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper

Chih-Kai Yang, Kuan-Po Huang, Hung-yi Lee

This research explores how the information of prompts interacts with the high-performing speech recognition model, Whisper. We compare its performances when prompted by prompts with correct information and those corrupted with incorrect information. Our results unexpectedly show that Whisper may not understand the textual prompts in a human-expected way. Additionally, we find that performance improvement is not guaranteed even with stronger adherence to the topic information in textual prompts. It is also noted that English prompts generally outperform Mandarin ones on datasets of both languages, likely due to differences in training data distributions for these languages despite the mismatch with pre-training scenarios. Conversely, we discover that Whisper exhibits awareness of misleading information in language tokens by ignoring incorrect language tokens and focusing on the correct ones. In sum, We raise insightful questions about Whisper's prompt understanding and reveal its counter-intuitive behaviors. We encourage further studies.
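
For context, a minimal sketch of the kind of probing the paper does, using the openai-whisper package's `initial_prompt` argument (the audio file and prompts here are made up for illustration):

```python
# Sketch: transcribe the same audio with a topic-correct prompt and a misleading one.
import whisper

model = whisper.load_model("small")

correct = model.transcribe(
    "talk.wav", language="en",
    initial_prompt="A lecture about speech recognition and neural networks.")
misleading = model.transcribe(
    "talk.wav", language="en",
    initial_prompt="A cooking show about Italian pasta recipes.")

print(correct["text"])
print(misleading["text"])  # per the paper, adherence to the prompt topic is not guaranteed to help
```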


Speech Technology

There are two extremes these days: one party claims that LLMs have magical emergent abilities, another claims that AI is overhyped and will end soon.

The real situation is actually very simple. I have said this before in several talks but never saw this simple explanation anywhere. Emergent abilities exist, but they are not magical. LLMs are a real thing, certainly not just hype.

It is actually pretty straightforward why LLMs "reason" or, to be more exact, can operate on complex concepts. By processing huge amounts of text with a variety of cost functions, they build an internal representation in which those concepts are represented as simple nodes (neurons or groups). So LLMs really do distill knowledge and build a semantic graph. Alternatively, you can think of them as a very good principal component analysis that can extract many important aspects and their relations. I have said before that the multi-objective part is quite important here: it helps to find unrelated concepts faster, and Whisper is a good example of that.

Once the knowledge is distilled, you can build on top of it.

There were many attempts to build a semantic graph before, but manual efforts never succeeded because of scale. The really huge advancement is that the automated process works.

Many blame recent video-generation LLMs for misunderstanding physics. It's a temporary thing; soon they will understand physics very well.


Speech Technology

ReDimNet from IDVoice is coming at Interspeech 2024.


Speech Technology

When new tech arrives, I try to be optimistic. Here is another attempt to create a SpeechLLM.

https://github.com/skit-ai/SpeechLLM

The LibriSpeech test-clean WER is 6.73, while a good system approaches 1.4.

On the other hand, the Google Gemini Pro 1.5 WER is quite good on diverse datasets.
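
For reference, WER is just word-level edit distance normalized by the reference length; a quick way to verify reported numbers on your own transcripts is jiwer:

```python
# Toy WER check with jiwer (strings are illustrative).
import jiwer

reference = "he hoped there would be stew for dinner"
hypothesis = "he hoped there would be stew for diner"

print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")  # 1 substitution / 8 words = 0.125
```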


Speech Technology

WavLab's XEUS - an SSL speech encoder that covers over 4000 languages!

XEUS is trained on over 1 million hours of speech. It outperforms both MMS 1B and w2v-BERT v2 2.0 on many tasks.

We're releasing the code, checkpoints, and our 4000+ lang. data!

https://twitter.com/chenwanch1/status/1807834060867186886

Paper: https://wanchichen.github.io/pdf/xeus.pdf
Project Page: https://wavlab.org/activities/2024/xeus/

You can also download the model and crawled data from HuggingFace:

https://huggingface.co/espnet/xeus
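
A small sketch for grabbing the released checkpoint files from the Hub (actual loading and inference go through ESPnet; see the project page for the recipe):

```python
# Fetch the XEUS checkpoint/config files locally; inference itself is done via ESPnet recipes.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="espnet/xeus")
print(local_dir)
```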


Speech Technology

There is still big demand for streaming TTS.

https://github.com/OpenT2S/inferStreamHiFiGAN
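
The basic idea behind streaming HiFi-GAN style inference is to vocode incoming mel frames chunk by chunk, reusing a little context at the chunk boundaries. A generic sketch follows; the `vocoder` callable, hop size, and chunk sizes are placeholders, not the API of this repo:

```python
# Generic chunked-vocoder sketch (placeholder interface, not the inferStreamHiFiGAN API).
# vocoder: callable mapping mel [1, n_mels, T] -> waveform [1, T * hop].
import torch

def stream_vocoder(vocoder, mel_frames, chunk=32, context=8, hop=256):
    """Yield waveform chunks as mel frames arrive, reusing `context` frames of left context.

    Boundary artifacts are reduced but not fully removed, since the vocoder's receptive
    field is usually larger than the reused context.
    """
    history, pending = [], []
    for frame in mel_frames:                                           # frame: [n_mels] tensor
        pending.append(frame)
        if len(pending) == chunk:
            mel = torch.stack(history + pending, dim=1).unsqueeze(0)   # [1, n_mels, ctx+chunk]
            wav = vocoder(mel)                                         # [1, (ctx+chunk) * hop]
            yield wav[0, len(history) * hop:]                          # drop re-synthesized context
            history = (history + pending)[-context:]
            pending = []
    if pending:                                                        # flush the tail
        mel = torch.stack(history + pending, dim=1).unsqueeze(0)
        yield vocoder(mel)[0, len(history) * hop:]
```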


Speech Technology

From Microsoft. The only catch is that it requires 50k hours of data for training.

https://www.microsoft.com/en-us/research/project/e2-tts/

https://arxiv.org/abs/2406.18009

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda

This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained based on the audio infilling task. Unlike many previous works, it does not require additional components (e.g., duration model, grapheme-to-phoneme) or complex techniques (e.g., monotonic alignment search). Despite its simplicity, E2 TTS achieves state-of-the-art zero-shot TTS capabilities that are comparable to or surpass previous works, including Voicebox and NaturalSpeech 3. The simplicity of E2 TTS also allows for flexibility in the input representation. We propose several variants of E2 TTS to improve usability during inference. See this https URL for demo samples.
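
The input representation described in the abstract is simple enough to sketch: the character sequence is padded with a filler token up to the length of the target mel spectrogram, so no duration model or grapheme-to-phoneme step is needed. The filler symbol below is made up; this is not the paper's code.

```python
# Sketch of the E2 TTS-style input: characters padded with a filler token to the mel length.
FILLER = "<F>"  # made-up filler symbol

def make_text_input(text: str, num_mel_frames: int) -> list[str]:
    chars = list(text)
    assert len(chars) <= num_mel_frames, "text must be shorter than the target mel sequence"
    return chars + [FILLER] * (num_mel_frames - len(chars))

print(make_text_input("hi there", 12))
# ['h', 'i', ' ', 't', 'h', 'e', 'r', 'e', '<F>', '<F>', '<F>', '<F>']
```

During training, per the abstract, the flow-matching mel generator then learns to infill masked audio spans conditioned on this padded character sequence.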


Speech Technology

https://github.com/thinhlpg/viVoice


Speech Technology

https://github.com/BakerBunker/FreeV


Speech Technology

🚀 Exciting News! mHuBERT-147 is Here! 🚀

We've just released mHuBERT-147, a compact, powerful multilingual model that reaches the top position on ML-SUPERB with just 95M parameters! 🌟

Trained on balanced, high-quality, open-license data, this model rivals MMS-1B but is 10x smaller.

https://x.com/mzboito/status/1800509179226148919/photo/1


Speech Technology

Another TTS

https://github.com/Camb-ai/MARS5-TTS


Speech Technology

We actually evaluated SLAM-LLM on real tasks. It doesn't work as expected: it performs well on LibriSpeech but produces garbage on every real task. For that reason, the proper way to integrate speech tokens into an LLM is still unknown. This paper from Cambridge is more reasonable and actually compares with SLAM.

https://arxiv.org/abs/2406.00522

Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning

Keqi Deng, Guangzhi Sun, Philip C. Woodland

Wav2Prompt is proposed which allows straightforward integration between spoken input and a text-based large language model (LLM). Wav2Prompt uses a simple training process with only the same data used to train an automatic speech recognition (ASR) model. After training, Wav2Prompt learns continuous representations from speech and uses them as LLM prompts. To avoid task over-fitting issues found in prior work and preserve the emergent abilities of LLMs, Wav2Prompt takes LLM token embeddings as the training targets and utilises a continuous integrate-and-fire mechanism for explicit speech-text alignment. Therefore, a Wav2Prompt-LLM combination can be applied to zero-shot spoken language tasks such as speech translation (ST), speech understanding (SLU), speech question answering (SQA) and spoken-query-based QA (SQQA). It is shown that for these zero-shot tasks, Wav2Prompt performs similarly to an ASR-LLM cascade and better than recent prior work. If relatively small amounts of task-specific paired data are available in few-shot scenarios, the Wav2Prompt-LLM combination can be end-to-end (E2E) fine-tuned. The Wav2Prompt-LLM combination then yields greatly improved results relative to an ASR-LLM cascade for the above tasks. For instance, for English-French ST with the BLOOMZ-7B1 LLM, a Wav2Prompt-LLM combination gave a 8.5 BLEU point increase over an ASR-LLM cascade.
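
The continuous integrate-and-fire (CIF) mechanism mentioned in the abstract is easy to sketch: per-frame weights are accumulated until they cross a threshold of 1.0, at which point the weighted frames are emitted ("fired") as one token-level vector. The code below is a simplified illustration, not the paper's implementation (which also handles training-time weight scaling).

```python
# Simplified CIF sketch: accumulate frame weights until the threshold is reached, then fire.
import torch

def cif(frames: torch.Tensor, alphas: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
    """frames: [T, D] encoder outputs, alphas: [T] weights in (0, 1). Returns [N, D] fired vectors."""
    fired, acc_w, acc_v = [], 0.0, torch.zeros(frames.size(1))
    for h, a in zip(frames, alphas.tolist()):
        if acc_w + a < threshold:
            acc_w += a
            acc_v = acc_v + a * h
        else:
            r = threshold - acc_w            # portion of this frame that completes the token
            fired.append(acc_v + r * h)
            acc_w = a - r                    # leftover weight starts the next token
            acc_v = (a - r) * h
    return torch.stack(fired) if fired else torch.zeros(0, frames.size(1))

# Toy usage: 6 frames with weights summing to ~2.4 produce 2 fired token vectors.
tokens = cif(torch.randn(6, 4), torch.tensor([0.4, 0.5, 0.3, 0.6, 0.2, 0.4]))
print(tokens.shape)  # torch.Size([2, 4])
```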


Speech Technology

Meanwhile, LAION also plans to build LLMs with speech tokens for TTS.

https://laion.ai/notes/open-gpt-4-o/


Speech Technology

A winner of the discrete speech TTS challenge:

https://arxiv.org/abs/2403.13720

UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge

Wataru Nakata, Kazuki Yamauchi, Dong Yang, Hiroaki Hyodo, Yuki Saito

We present UTDUSS, the UTokyo-SaruLab system submitted to Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge. The challenge focuses on using discrete speech unit learned from large speech corpora for some tasks. We submitted our UTDUSS system to two text-to-speech tracks: Vocoder and Acoustic+Vocoder. Our system incorporates neural audio codec (NAC) pre-trained on only speech corpora, which makes the learned codec represent rich acoustic features that are necessary for high-fidelity speech reconstruction. For the acoustic+vocoder track, we trained an acoustic model based on Transformer encoder-decoder that predicted the pre-trained NAC tokens from text input. We describe our strategies to build these models, such as data selection, downsampling, and hyper-parameter tuning. Our system ranked in second and first for the Vocoder and Acoustic+Vocoder tracks, respectively.
