
Telegram channel speechtech - Speech Technology


Speech Technology

There are two extremes these days - one camp claims that LLMs have magical emergent abilities, the other claims that AI is overhyped and will soon fizzle out.

The real situation is actually very simple. I have said this in several talks but have never seen this simple explanation written down anywhere. Emergent abilities exist, but they are not magical. LLMs are a real thing, certainly not just hype.

It is actually pretty straightforward why LLMs "reason" or, to be more exact, can operate on complex concepts. By processing huge amounts of text with a variety of cost functions, they build an internal representation where those concepts correspond to simple nodes (neurons or groups of neurons). So LLMs really do distill knowledge and build a semantic graph. Alternatively, you can think of them as a very good principal component analysis that extracts many important aspects and their relations. As I have said before, the multi-objective part is quite important here: it helps to separate unrelated concepts faster, and Whisper is a good example of that.

Once knowledge is distilled, you can build on top of it.

There were many attempts to build semantic graphs before, but manual efforts never succeeded because of scale. The really big advancement is that the automated process works.

Many blame recent video generation models for misunderstanding physics. It's a temporary thing; soon they will understand physics very well.

Speech Technology

ReDimNet from IDVoice is coming at Interspeech 2024

Speech Technology

When new tech arrives, I try to be optimistic. Another attempt to create a SpeechLLM:

https://github.com/skit-ai/SpeechLLM

LibriSpeech test-clean WER is 6.73, while a good system approaches 1.4 WER.

On the other hand, Google Gemini Pro 1.5's WER is quite good on diverse datasets.
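For reference, WER numbers like these are easy to compute yourself with the jiwer package; a minimal sketch with placeholder transcripts, not actual model output:

```python
# Minimal WER check with jiwer; the strings are placeholders, not real
# SpeechLLM or Gemini transcripts.
from jiwer import wer

reference = "he hoped there would be stew for dinner"
hypothesis = "he hoped there would be a stew for dinner"  # model output goes here

print(f"WER: {wer(reference, hypothesis):.2%}")
```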

Speech Technology

WavLab's XEUS - an SSL speech encoder that covers 4000+ languages!

XEUS is trained on over 1 million hours of speech. It outperforms both MMS 1B and w2v-BERT v2 2.0 on many tasks.

We're releasing the code, checkpoints, and our 4000+ lang. data!

https://twitter.com/chenwanch1/status/1807834060867186886

Paper: https://wanchichen.github.io/pdf/xeus.pdf
Project Page: https://wavlab.org/activities/2024/xeus/

You can also download the model and crawled data from HuggingFace:

https://huggingface.co/espnet/xeus
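A minimal way to pull the checkpoint locally, assuming the standard huggingface_hub API (inference itself goes through ESPnet, see the model card for the exact recipe):

```python
# Download the XEUS checkpoint; running the encoder is done through
# ESPnet, see the model card for the recommended recipe.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="espnet/xeus")
print("checkpoint downloaded to", local_dir)
```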

Speech Technology

There is still big demand for streaming TTS

https://github.com/OpenT2S/inferStreamHiFiGAN
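The core idea of streaming vocoding is to feed the vocoder mel chunks as they arrive instead of the whole utterance. A rough sketch below; the vocoder callable, chunk sizes, and hop length are hypothetical, not the inferStreamHiFiGAN API:

```python
def stream_vocode(mel_frames, vocoder, chunk=32, lookahead=4, hop=256):
    """Run a (hypothetical) mel-to-wave vocoder on overlapping chunks so
    audio can be emitted before the full utterance is synthesized."""
    for start in range(0, len(mel_frames), chunk):
        end = min(start + chunk + lookahead, len(mel_frames))
        wav = vocoder(mel_frames[start:end])       # synthesize chunk + lookahead
        keep = min(start + chunk, len(mel_frames)) - start
        yield wav[: keep * hop]                    # trim the lookahead tail
```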

Speech Technology

From Microsoft. The only catch is that it requires 50k hours of data for training.

https://www.microsoft.com/en-us/research/project/e2-tts/

https://arxiv.org/abs/2406.18009

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda

This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained based on the audio infilling task. Unlike many previous works, it does not require additional components (e.g., duration model, grapheme-to-phoneme) or complex techniques (e.g., monotonic alignment search). Despite its simplicity, E2 TTS achieves state-of-the-art zero-shot TTS capabilities that are comparable to or surpass previous works, including Voicebox and NaturalSpeech 3. The simplicity of E2 TTS also allows for flexibility in the input representation. We propose several variants of E2 TTS to improve usability during inference. See this https URL for demo samples.
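The input trick is simple enough to show in a few lines: the character sequence is just padded with filler tokens up to the mel length, and the flow-matching model learns to infill masked audio conditioned on it. A sketch of the padding step only, with a made-up filler symbol:

```python
def build_e2_input(text: str, num_mel_frames: int, filler: str = "<F>") -> list:
    """Pad the character sequence with filler tokens so its length matches
    the mel spectrogram length, following the E2 TTS description."""
    chars = list(text)
    if len(chars) > num_mel_frames:
        raise ValueError("text longer than the target mel sequence")
    return chars + [filler] * (num_mel_frames - len(chars))

# build_e2_input("hi", 6) -> ['h', 'i', '<F>', '<F>', '<F>', '<F>']
```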

Speech Technology

https://github.com/thinhlpg/viVoice

Speech Technology

https://github.com/BakerBunker/FreeV

Speech Technology

🚀 Exciting News! mHuBERT-147 is Here! 🚀

We've just released mHuBERT-147, a compact powerful multilingual model, reaching top position on ML-SUPERB with just 95M parameters! 🌟

Trained on balanced, high-quality, open-license data, this model rivals MMS-1B but is 10x smaller.

https://x.com/mzboito/status/1800509179226148919/photo/1
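If the checkpoint is loadable through transformers, feature extraction looks like the usual HuBERT recipe; the repo id below is my assumption, check the actual model card:

```python
# HuBERT-style feature extraction; the repo id is an assumption, verify it
# against the real mHuBERT-147 model card before use.
import torch
from transformers import AutoFeatureExtractor, AutoModel

repo = "utter-project/mHuBERT-147"  # assumed repo id
extractor = AutoFeatureExtractor.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)

wav = torch.zeros(16000)  # one second of silence at 16 kHz as a stand-in
inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (1, frames, hidden)
print(features.shape)
```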

Speech Technology

Another TTS

https://github.com/Camb-ai/MARS5-TTS

Speech Technology

We actually evaluated SLAM-LLM on real tasks. It doesn't work as expected: it works well on LibriSpeech but produces garbage on every real task. For that reason, the proper way to integrate speech tokens into an LLM is still unknown. This paper from Cambridge is more reasonable and actually compares with SLAM.

https://arxiv.org/abs/2406.00522

Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning

Keqi Deng, Guangzhi Sun, Philip C. Woodland

Wav2Prompt is proposed which allows straightforward integration between spoken input and a text-based large language model (LLM). Wav2Prompt uses a simple training process with only the same data used to train an automatic speech recognition (ASR) model. After training, Wav2Prompt learns continuous representations from speech and uses them as LLM prompts. To avoid task over-fitting issues found in prior work and preserve the emergent abilities of LLMs, Wav2Prompt takes LLM token embeddings as the training targets and utilises a continuous integrate-and-fire mechanism for explicit speech-text alignment. Therefore, a Wav2Prompt-LLM combination can be applied to zero-shot spoken language tasks such as speech translation (ST), speech understanding (SLU), speech question answering (SQA) and spoken-query-based QA (SQQA). It is shown that for these zero-shot tasks, Wav2Prompt performs similarly to an ASR-LLM cascade and better than recent prior work. If relatively small amounts of task-specific paired data are available in few-shot scenarios, the Wav2Prompt-LLM combination can be end-to-end (E2E) fine-tuned. The Wav2Prompt-LLM combination then yields greatly improved results relative to an ASR-LLM cascade for the above tasks. For instance, for English-French ST with the BLOOMZ-7B1 LLM, a Wav2Prompt-LLM combination gave a 8.5 BLEU point increase over an ASR-LLM cascade.
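The continuous integrate-and-fire (CIF) step mentioned in the abstract is easy to sketch: per-frame weights are accumulated until they cross 1.0, at which point one token-level vector is emitted. Simplified version below, ignoring the boundary-splitting detail of the full algorithm:

```python
import torch

def cif(encoder_out: torch.Tensor, alphas: torch.Tensor, threshold: float = 1.0):
    """Simplified continuous integrate-and-fire.
    encoder_out: (T, D) frame-level features, alphas: (T,) weights in (0, 1).
    Returns one integrated vector per fired token."""
    fired, acc = [], 0.0
    integrated = torch.zeros(encoder_out.size(1))
    for h, a in zip(encoder_out, alphas):
        acc += a.item()
        integrated = integrated + a * h
        if acc >= threshold:                  # fire: emit a token-level embedding
            fired.append(integrated)
            acc -= threshold
            integrated = torch.zeros_like(integrated)
    return fired
```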

Speech Technology

Meanwhile LAION also plans to implement LLMs with speech tokens for TTS

https://laion.ai/notes/open-gpt-4-o/

Speech Technology

A winner in the discrete speech TTS challenge

https://arxiv.org/abs/2403.13720

UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge

Wataru Nakata, Kazuki Yamauchi, Dong Yang, Hiroaki Hyodo, Yuki Saito

We present UTDUSS, the UTokyo-SaruLab system submitted to Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge. The challenge focuses on using discrete speech unit learned from large speech corpora for some tasks. We submitted our UTDUSS system to two text-to-speech tracks: Vocoder and Acoustic+Vocoder. Our system incorporates neural audio codec (NAC) pre-trained on only speech corpora, which makes the learned codec represent rich acoustic features that are necessary for high-fidelity speech reconstruction. For the acoustic+vocoder track, we trained an acoustic model based on Transformer encoder-decoder that predicted the pre-trained NAC tokens from text input. We describe our strategies to build these models, such as data selection, downsampling, and hyper-parameter tuning. Our system ranked in second and first for the Vocoder and Acoustic+Vocoder tracks, respectively.

Speech Technology

Conversational Voice Clone Challenge (CoVoC)
ISCSLP2024 Grand Challenge

https://www.magicdatatech.com/iscslp-2024

Call for Participation
Text-to-speech (TTS) aims to produce speech that sounds as natural and human-like as possible. Recent advancements in neural speech synthesis have significantly enhanced the quality and naturalness of generated speech, leading to widespread applications of TTS systems in real-world scenarios. A notable breakthrough in the field is witnessed in zero-shot TTS, driven by expanded datasets and new approaches (e.g.: decoder-only paradigms), attracting extensive attention from academia and industry. However, these advancements haven't been sufficiently investigated to address challenges in spontaneous and conversational contexts. Specifically, the primary challenge lies in effectively managing prosody details in the generated speech, which is attributed to the diverse and intricate spontaneous behaviors that differentiate spontaneous speech from read speech.

Large-scale TTS systems, with their in-context learning ability, have the potential to yield promising outcomes in the mentioned scenarios. However, a prevalent challenge in the field of large-scale zero-shot TTS is the lack of consistency in training and testing datasets, along with a standardized evaluation benchmark. This issue hinders direct comparisons and makes it challenging to assess the performance of various systems accurately.

To promote the development of expressive spontaneous-style speech synthesis in the zero-shot scenario, we are launching the Conversational Voice Clone Challenge (CoVoC). This challenge encompasses a range of diverse training datasets, including the 10K-hour WenetSpeech4TTS dataset, 180 hours of Mandarin spontaneous conversational speech data, and 100 hours of high-quality spoken conversations. Furthermore, a standardized testing dataset, accompanied by carefully designed text, will be made available. The goal of providing these sizable and standardized datasets is to establish a comprehensive benchmark.

Timeline
June 3, 2024 : HQ-Conversations data release

June 10, 2024 : Baseline system release

June 30, 2024 : Evaluation stage starts; Clone-Speaker and Test-Text data release; deadline for challenge registration

July 2, 2024 : Evaluation ends; Test audio and system description submission deadline

July 12, 2024 : Evaluation results release to participants

July 20, 2024 : Deadline for ISCSLP2024 paper submission (only for invited teams)

Speech Technology

State space model for realtime TTS

https://cartesia.ai/blog/sonic

In experiments so far, we've found that we can simultaneously improve model quality, inference speed, throughput, and latency compared to widely used Transformer implementations for audio generation. A parameter-matched Cartesia model trained on Multilingual Librispeech for one epoch achieves 20% lower validation perplexity. On downstream evaluations, this results in a 2x lower word error rate and a 1 point higher quality score (out of 5, as measured on the NISQA evaluation). At inference, it achieves lower latency (1.5x lower time-to-first-audio), faster inference speed (2x lower real-time factor), and higher throughput (4x).
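The reason state space models fit real-time generation is the constant-cost recurrence: each new frame only updates a fixed-size state instead of attending over the whole history. A toy linear SSM step, nothing to do with Cartesia's actual architecture:

```python
import numpy as np

def ssm_step(x, u, A, B, C):
    """One step of a discrete linear state space model:
    x' = A x + B u (state update), y = C x' (output)."""
    x = A @ x + B @ u
    return x, C @ x

# Toy dimensions: state size 8, input/output size 2.
rng = np.random.default_rng(0)
A, B, C = 0.9 * np.eye(8), rng.normal(size=(8, 2)), rng.normal(size=(2, 8))
x = np.zeros(8)
for _ in range(5):            # streaming: O(1) work per incoming frame
    x, y = ssm_step(x, rng.normal(size=2), A, B, C)
```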

Speech Technology

Lessons From the Autoregressive/Nonautoregressive Battle in Speech Synthesis
Xu Tan
Microsoft Research Asia
xuta@microsoft.com
2024/1/24

https://tan-xu.github.io/AR-NAR-TTS.pdf

Not the only battle. Discrete/continuous is another one.

Speech Technology

Kyutai, a French AI lab with $300M in funding, just unveiled Moshi, an open-source GPT-4o competitor.

Moshi is a real-time multimodal model that can listen, hear, and speak.

Code, model, and paper will be released soon.

https://www.youtube.com/live/hm2IJSKcYvo

Speech Technology

SPECOM 2024

https://specom2024.ftn.uns.ac.rs/

Paper Submission Deadline is July 15, 2024

Everyone is welcome to participate

Speech Technology

https://arxiv.org/abs/2406.18301

MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research

Song Li, Yongbin You, Xuezhi Wang, Zhengkun Tian, Ke Ding, Guanglu Wan

Recently, multilingual artificial intelligence assistants, exemplified by ChatGPT, have gained immense popularity. As a crucial gateway to human-computer interaction, multilingual automatic speech recognition (ASR) has also garnered significant attention, as evidenced by systems like Whisper. However, the proprietary nature of the training data has impeded researchers' efforts to study multilingual ASR. This paper introduces MSR-86K, an evolving, large-scale multilingual corpus for speech recognition research. The corpus is derived from publicly accessible videos on YouTube, comprising 15 languages and a total of 86,300 hours of transcribed ASR data.

Speech Technology

Silero VAD upgraded to V5

https://github.com/snakers4/silero-vad/releases/tag/v5.0

The improved version is 3x faster and trained on 6000+ languages.
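Usage stays the same as in previous versions; this follows the torch.hub entry point documented in the repo README, but double-check the v5 release notes for any API changes:

```python
import torch

# Load Silero VAD via torch.hub (API as documented in the repo README).
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

wav = read_audio("example.wav", sampling_rate=16000)
speech = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech)  # list of {'start': ..., 'end': ...} sample indices
```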

Speech Technology

Audio-driven video synthesis. VASA project by Microsoft

https://www.microsoft.com/en-us/research/project/vasa-1/

Speech Technology

Recently merged into pyannote

Model here

https://huggingface.co/pyannote/speech-separation-ami-1.0

https://arxiv.org/abs/2403.02288

PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings

Joonas Kalda, Clément Pagés, Ricard Marxer, Tanel Alumäe, Hervé Bredin

A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) was proposed as an unsupervised alternative that uses real recordings, yet struggles with overseparation and adapting to long-form audio. We introduce PixIT, a joint approach that combines permutation invariant training (PIT) for speaker diarization (SD) and MixIT for SSep. With a small extra requirement of needing SD labels, it solves the problem of overseparation and allows stitching local separated sources leveraging existing work on clustering-based neural SD. We measure the quality of the separated sources via applying automatic speech recognition (ASR) systems to them. PixIT boosts the performance of various ASR systems across two meeting corpora both in terms of the speaker-attributed and utterance-based word error rates while not requiring any fine-tuning.
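A hedged usage sketch based on the generic pyannote Pipeline API; the exact return format of this particular pipeline should be checked against its model card:

```python
# Running the pipeline requires a HuggingFace access token and accepting
# the model's gating conditions. The (diarization, sources) return format
# follows the model card description and may differ.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speech-separation-ami-1.0",
    use_auth_token="HF_TOKEN",  # replace with a real token
)
diarization, sources = pipeline("meeting.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```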

Speech Technology

We are thrilled to announce YODAS v2!
- 400k hours, 149 languages of speech data (same as v1)
- supports long-form speech
- higher sampling rate (24 kHz)

https://huggingface.co/datasets/espnet/yodas2
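Streaming a shard through datasets should work along these lines; the "en000" config name is my guess at the naming scheme, see the dataset card for the real list of subsets:

```python
from datasets import load_dataset

# Stream one language shard instead of downloading 400k hours at once.
# The config name is an assumption; check the dataset card.
yodas = load_dataset("espnet/yodas2", "en000", split="train", streaming=True)
sample = next(iter(yodas))
print(sample.keys())
```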

Speech Technology

https://github.com/facebookresearch/ears_dataset

Highlights
* 100 h of speech data from 107 speakers
* high-quality recordings at 48 kHz in an anechoic chamber
* high speaker diversity with speakers from different ethnicities and age range from 18 to 75 years
* full dynamic range of human speech, ranging from whispering to yelling
* 18 minutes of freeform monologues per speaker
* sentence reading in 7 different reading styles (regular, loud, whisper, high pitch, low pitch, fast, slow)
* emotional reading and freeform tasks covering 22 different emotions for each speaker

Speech Technology

The code for the above

https://github.com/aispeech-lab/w2v-cif-bert

Speech Technology

Three speech LLMs for today

From Alibaba

https://github.com/cwang621/blsp-emo

https://arxiv.org/abs/2406.03872

BLSP-Emo: Towards Empathetic Large Speech-Language Models

Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, Jiajun Zhang

The recent release of GPT-4o showcased the potential of end-to-end multimodal models, not just in terms of low latency but also in their ability to understand and generate expressive speech with rich emotions. While the details are unknown to the open research community, it likely involves significant amounts of curated data and compute, neither of which is readily accessible. In this paper, we present BLSP-Emo (Bootstrapped Language-Speech Pretraining with Emotion support), a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech and generate empathetic responses. BLSP-Emo utilizes existing speech recognition (ASR) and speech emotion recognition (SER) datasets through a two-stage process. The first stage focuses on semantic alignment, following recent work on pretraining speech-language models using ASR data. The second stage performs emotion alignment with the pretrained speech-language model on an emotion-aware continuation task constructed from SER data. Our experiments demonstrate that the BLSP-Emo model excels in comprehending speech and delivering empathetic responses, both in instruction-following tasks and conversations.

Another one from a random guy, more of a hype thing

https://github.com/fixie-ai/ultravox

Another one from another random guy

https://github.com/jamesparsloe/llm.speech

Speech Technology

Some Whisper speedups and low-bit quantization

https://mobiusml.github.io/whisper-static-cache-blog/
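Independent of the tricks in that post, the easy baseline win is loading Whisper in reduced precision; a generic transformers sketch, not the authors' code:

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Generic reduced-precision Whisper inference; this is the baseline that
# the static-cache and low-bit tricks in the linked post speed up further.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model_id = "openai/whisper-small"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id, torch_dtype=dtype).to(device)

audio = torch.zeros(16000).numpy()  # one second of silence as a stand-in input
features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
ids = model.generate(features.to(device, dtype=dtype))
print(processor.batch_decode(ids, skip_special_tokens=True))
```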

Speech Technology

Usually "explainable" means something weird. Like you try to find some neuron in network responsible for the decision. This paper is somewhat different promoting the attribute approach

https://arxiv.org/abs/2405.19796

Prediction accuracy is not as good, so more work is required, but the overall direction is nice.

Explainable Attribute-Based Speaker Verification

Xiaoliang Wu, Chau Luu, Peter Bell, Ajitha Rajan

This paper proposes a fully explainable approach to speaker verification (SV), a task that fundamentally relies on individual speaker characteristics. The opaque use of speaker attributes in current SV systems raises concerns of trust. Addressing this, we propose an attribute-based explainable SV system that identifies speakers by comparing personal attributes such as gender, nationality, and age extracted automatically from voice recordings. We believe this approach better aligns with human reasoning, making it more understandable than traditional methods. Evaluated on the Voxceleb1 test set, the best performance of our system is comparable with the ground truth established when using all correct attributes, proving its efficacy. Whilst our approach sacrifices some performance compared to non-explainable methods, we believe that it moves us closer to the goal of transparent, interpretable AI and lays the groundwork for future enhancements through attribute expansion.
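The core idea boils down to comparing interpretable attribute profiles instead of opaque embeddings; a toy illustration with a made-up attribute set and decision rule:

```python
# Toy attribute-based verification: compare interpretable attribute
# profiles instead of embeddings. The attribute set and the matching
# threshold here are made up for illustration.
def verify(attrs_enroll: dict, attrs_test: dict, min_matches: int = 2) -> bool:
    matches = sum(attrs_enroll[k] == attrs_test[k] for k in attrs_enroll)
    return matches >= min_matches

enroll = {"gender": "female", "nationality": "estonian", "age_band": "30-39"}
test = {"gender": "female", "nationality": "estonian", "age_band": "40-49"}
print(verify(enroll, test))  # True: 2 of 3 attributes agree
```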

Speech Technology

@bjutte from Attendi does a nice job on medical transcription. Check his latest blog post:

https://medium.com/@Attendi/improving-automated-punctuation-of-transcribed-medical-reports-f7c6619b1715

Speech Technology

From Apple, a quite in-depth paper on an alternative to LM rescoring. Feels like this is going to be a general direction in the coming years.

https://arxiv.org/abs/2405.15216

Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition

Zijin Gu, Tatiana Likhomanenko, He Bai, Erik McDermott, Ronan Collobert, Navdeep Jaitly

Language models (LMs) have long been used to improve results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors, however, they showed little improvement over traditional LMs mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), which is a scaled error correction model trained with vast amounts of synthetic data, significantly exceeding prior attempts meanwhile achieving new state-of-the-art ASR performance. We use text-to-speech (TTS) systems to synthesize audio, which is fed into an ASR system to produce noisy hypotheses, which are then paired with the original texts to train the DLM. DLM has several key ingredients: (i) up-scaled model and data; (ii) usage of multi-speaker TTS systems; (iii) combination of multiple noise augmentation strategies; and (iv) new decoding techniques. With a Transformer-CTC ASR, DLM achieves 1.5% word error rate (WER) on test-clean and 3.3% WER on test-other on Librispeech, which to our knowledge are the best reported numbers in the setting where no external audio data are used and even match self-supervised methods which use external audio data. Furthermore, a single DLM is applicable to different ASRs, and greatly surpassing the performance of conventional LM based beam-search rescoring. These results indicate that properly investigated error correction models have the potential to replace conventional LMs, holding the key to a new level of accuracy in ASR systems.
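The data recipe from the abstract is a simple loop: synthesize each training sentence with TTS, transcribe it with the frozen ASR, and keep the (noisy hypothesis, clean text) pair. A schematic sketch with placeholder tts/asr callables, not Apple's code:

```python
# Schematic of the DLM training-data recipe. `tts` and `asr` stand in for
# a multi-speaker TTS system and a frozen ASR model; they are placeholders,
# not concrete APIs.
def make_dlm_pairs(texts, tts, asr, speakers):
    pairs = []
    for text in texts:
        for spk in speakers:                  # several TTS voices per sentence
            audio = tts(text, speaker=spk)    # synthesize audio
            hypothesis = asr(audio)           # noisy ASR hypothesis
            pairs.append((hypothesis, text))  # DLM input -> clean target
    return pairs
```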
