speechtech | Unsorted

Telegram channel speechtech - Speech Technology



Speech Technology

LLMs are the frontier in TTS too (true LLMs, not GPT). SLAM can do it too, by the way. A Microsoft paper:

https://arxiv.org/abs/2404.03204

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao

We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of language models. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to enhance the robustness of LLM-based TTS. To accomplish this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E utilizes the predicted duration prompt to guide the computing of self-attention weights in Transformer to force the model to focus on the corresponding phonemes and prosody features when predicting speech tokens. Results of comprehensive objective and subjective evaluations demonstrate that, compared to a powerful baseline method VALL-E, RALL-E significantly improves the WER of zero-shot TTS from 5.6% (without reranking) and 1.7% (with reranking) to 2.5% and 1.0%, respectively. Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E and reduces the error rate from 68% to 4%.
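
The duration-guided attention is the most mechanical part of this. Below is a hypothetical sketch (not the paper's exact formulation) of turning predicted phoneme durations into an additive attention bias so that each speech-token step mostly attends to its aligned phoneme; the window size and masking value are assumptions.

```python
import torch

def duration_guided_bias(durations, window=1, neg=-1e9):
    """durations: LongTensor [num_phonemes] with predicted frames per phoneme.
    Returns an additive attention bias of shape [num_frames, num_phonemes]."""
    num_frames = int(durations.sum().item())
    num_phones = durations.numel()
    # frame_to_phone[t] = index of the phoneme that frame t is aligned to
    frame_to_phone = torch.repeat_interleave(torch.arange(num_phones), durations)
    bias = torch.full((num_frames, num_phones), neg)
    for t in range(num_frames):
        p = frame_to_phone[t].item()
        lo, hi = max(0, p - window), min(num_phones, p + window + 1)
        bias[t, lo:hi] = 0.0  # allow attention only in a window around the aligned phoneme
    return bias  # added to the attention logits before softmax

# Example: three phonemes predicted to last 2, 1, and 3 frames
print(duration_guided_bias(torch.tensor([2, 1, 3])).shape)  # torch.Size([6, 3])
```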


Speech Technology

Our recent experiments with non-streaming ASR + VAD revealed serious problems with the approach. First, VAD proved insufficient in complex audio conditions, which is understandable given that it has no access to linguistic information. Second, longer audio chunks introduce significant delays even on modern fast networks, and the problem gets worse when chunks of different lengths have to be combined into a batch.

So we have recently been considering a joint streaming/non-streaming approach. The U2 architecture and WeNet actually make a lot of sense here. Shinji's paper is similar:

https://arxiv.org/abs/2405.13514

Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe

End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes: streaming and non-streaming, each with its pros and cons. Streaming ASR processes the speech frames in real-time as it is being received, while non-streaming ASR waits for the entire speech utterance; thus, professionals may have to operate in either mode to satisfy their application. In this work, we present joint optimization of streaming and non-streaming ASR based on multi-decoder and knowledge distillation. Primarily, we study 1) the encoder integration of these ASR modules, followed by 2) separate decoders to make the switching mode flexible, and enhancing performance by 3) incorporating similarity-preserving knowledge distillation between the two modular encoders and decoders. Evaluation results show 2.6%-5.3% relative character error rate reductions (CERR) on CSJ for streaming ASR, and 8.3%-9.7% relative CERRs for non-streaming ASR within a single model compared to multiple standalone modules.
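
The similarity-preserving distillation component can be sketched roughly. The loss below (in the spirit of Tung & Mori's similarity-preserving KD) matches the batch-wise similarity structure of the streaming and full-context encoder outputs; it is an illustration under those assumptions, not the paper's exact loss or placement.

```python
import torch
import torch.nn.functional as F

def similarity_preserving_loss(feats_student, feats_teacher):
    """feats_*: [batch, time, dim] encoder outputs of the two branches."""
    def batch_similarity(x):
        x = x.flatten(start_dim=1)            # [batch, time*dim]
        g = x @ x.t()                         # [batch, batch] Gram matrix
        return F.normalize(g, p=2, dim=1)     # row-normalized similarities
    gs = batch_similarity(feats_student)
    gt = batch_similarity(feats_teacher.detach())  # no gradient into the teacher branch
    return F.mse_loss(gs, gt)

# e.g. total = asr_loss_stream + asr_loss_full + lambda_kd * similarity_preserving_loss(h_stream, h_full)
```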


Speech Technology

More serious speech LLM

https://github.com/ddlBoJack/SLAM-LLM


Speech Technology

Another multimodal LLM doing speech

https://github.com/OpenMOSS/AnyGPT


Speech Technology

Valuation 50x revenue. Ok

https://github.com/PolyAI-LDN/pheme

https://www.linkedin.com/posts/seb-johnson_polyai-secures-near-500mn-valuation-in-boost-activity-7196548712846761984-MOLg/

BREAKING: PolyAI RAISES $50M AT $500M VALUATION

On Monday I posted about PolyAI and how they were shaping up to become one of the UK's strongest AI companies. Today they announced a huge fundraise from new investors including chipmaking giant NVIDIA.

PolyAI is a Conversational AI platform automating customer service. It was founded in 2017 by Nikola Mrkšić, Tsung-Hsien Wen, and Pei-Hao (Eddy) Su who met at the University of Cambridge’s Machine Intelligence Lab.

As part of the fundraise the company highlighted some INSANE financial results:

* $10m revenue last year
* On track to TRIPLE that this year (i.e. $30m)
* 90% Gross Margin (FY23)

PolyAI grew out of Entrepreneur First before going on to raise several investment rounds, including their $40m Series B which valued the company at nearly $300 million post-money. Two years later this has grown to $500m!

A great result for the team and a further boost for the UK’s ambitions of becoming an AI hub.


Speech Technology

Apple optimizes Conformer for mobile devices. Some in-depth ideas

https://arxiv.org/abs/2312.10359

5x faster-than-real-time speech recognition running on an Apple Watch Series 7. This is a lot more solved than you would think!


Speech Technology

https://github.com/tincans-ai/gazelle

https://tincans.ai/

We're building real-time conversational speech with large language models. Bots should be fun to talk to.

Previously, we built large scale machine learning systems at Cash App, Quora, Airbnb, and more.


Speech Technology

It's interesting that the Cambridge lab has been doing a lot of Whisperology recently

https://arxiv.org/abs/2405.06134

Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models

Vyas Raina, Rao Ma, Charles McGhee, Kate Knill, Mark Gales

Recent developments in large speech foundation models like Whisper have led to their widespread use in many automatic speech recognition (ASR) applications. These systems incorporate "special tokens" in their vocabulary, such as <endoftext>, to guide their language generation process. However, we demonstrate that these tokens can be exploited by adversarial attacks to manipulate the model's behavior. We propose a simple yet effective method to learn a universal acoustic realization of Whisper's <endoftext> token, which, when prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively "muting" the model. Our experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97% of speech samples. Moreover, we find that this universal adversarial audio segment often transfers to new datasets and tasks. Overall this work demonstrates the vulnerability of Whisper models to "muting" adversarial attacks, where such attacks can pose both risks and potential benefits in real-world settings: for example the attack can be used to bypass speech moderation systems, or conversely the attack can also be used to protect private speech data.
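
The attack boils down to a small gradient-based loop over a fixed 0.64-second waveform. A minimal sketch under stated assumptions: `wrapped_asr` is a hypothetical differentiable wrapper returning the first decoded-token logits from raw audio (mel extraction in torch plus teacher-forced decoding), and the optimizer, learning rate, and amplitude clamp are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

SR = 16000
adv = torch.zeros(int(0.64 * SR), requires_grad=True)    # universal 0.64-second segment
opt = torch.optim.Adam([adv], lr=1e-3)

def attack_step(wrapped_asr, batch_audio, eot_id):
    """batch_audio: [batch, samples]; push the model to emit <endoftext> first."""
    opt.zero_grad()
    x = torch.cat([adv.expand(batch_audio.size(0), -1), batch_audio], dim=1)
    logits = wrapped_asr(x)                               # [batch, vocab] first-token logits
    loss = F.cross_entropy(logits, torch.full((x.size(0),), eot_id))
    loss.backward()
    opt.step()
    with torch.no_grad():
        adv.clamp_(-0.02, 0.02)                           # keep the segment quiet and valid
    return loss.item()
```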


Speech Technology

https://challenge.singfake.org/

We are organizing the inaugural Singing Voice Deepfake Detection (SVDD) 2024 challenge at IEEE Spoken Language Technology Workshop (SLT) 2024 to foster research and development of sophisticated methods for detecting AI-generated singing voices. This is an emerging issue within the music industry that requires specialized solutions.
Our prior analysis using the SingFake dataset showed a marked decline in performance of state-of-the-art speech deepfake countermeasures when applied to singing voices. This highlighted the need for tailored SVDD techniques.


Speech Technology

Interesting dataset generated with ElevenLabs. Multiple languages in the same phrase

https://huggingface.co/datasets/MohamedRashad/multilingual-tts


Speech Technology

People say this vocoder has a point, combining signal processing with neural techniques:

https://ast-astrec.nict.go.jp/demo_samples/firnet_icassp2024/

FIRNet: Fast and pitch controllable neural vocoder with trainable finite impulse response filter

Some neural vocoders with fundamental frequency (f0) control have succeeded in performing real-time inference on a single CPU while preserving the quality of the synthetic speech. However, compared with legacy vocoders based on signal processing, their inference speeds are still low. This paper proposes a neural vocoder based on the source-filter model with trainable time-variant finite impulse response (FIR) filters, to achieve a similar inference speed to legacy vocoders. In the proposed model, FIRNet, multiple FIR coefficients are predicted using the neural networks, and the speech waveform is then generated by convolving a mixed excitation signal with these FIR coefficients. Experimental results show that FIRNet can achieve an inference speed similar to legacy vocoders while maintaining f0 controllability and natural speech quality.

https://ast-astrec.nict.go.jp/release/preprints/preprint_icassp_2024_ohtani.pdf
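
The core mechanism, convolving a mixed excitation signal with per-frame predicted FIR coefficients, looks roughly like the sketch below; frame length, number of taps, and the framing scheme are placeholders, not FIRNet's actual configuration.

```python
import torch
import torch.nn.functional as F

def apply_time_variant_fir(excitation, fir_coeffs, frame_len=240):
    """excitation: [samples] mixed excitation (e.g. pulse train + noise),
    assumed to contain num_frames * frame_len samples;
    fir_coeffs: [num_frames, taps] filter coefficients predicted per frame."""
    num_frames, taps = fir_coeffs.shape
    out = []
    for i in range(num_frames):
        frame = excitation[i * frame_len:(i + 1) * frame_len]
        padded = F.pad(frame.view(1, 1, -1), (taps - 1, 0))   # causal padding
        kernel = fir_coeffs[i].flip(0).view(1, 1, -1)         # flip for true convolution
        out.append(F.conv1d(padded, kernel).view(-1))
    return torch.cat(out)
```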


Speech Technology

I tested Google ASR recently for English - Chirp, Conformer (latest version) and Gemini. Conformer is not good. Chirp is ok, somewhat better than Whisper V3.


Speech Technology

Idea is relevant. Results are somewhat mixed.

https://arxiv.org/abs/2404.06714

Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness

Xincan Feng, Akifumi Yoshimoto

Recent advancements in Natural Language Processing (NLP) have seen Large-scale Language Models (LLMs) excel at producing high-quality text for various purposes. Notably, in Text-To-Speech (TTS) systems, the integration of BERT for semantic token generation has underscored the importance of semantic content in producing coherent speech outputs. Despite this, the specific utility of LLMs in enhancing TTS synthesis remains considerably limited. This research introduces an innovative approach, Llama-VITS, which enhances TTS synthesis by enriching the semantic content of text using an LLM. Llama-VITS integrates semantic embeddings from Llama2 with the VITS model, a leading end-to-end TTS framework. By leveraging Llama2 for the primary speech synthesis process, our experiments demonstrate that Llama-VITS matches the naturalness of the original VITS (ORI-VITS) and those incorporating BERT (BERT-VITS), on the LJSpeech dataset, a substantial collection of neutral, clear speech. Moreover, our method significantly enhances emotive expressiveness on the EmoV_DB_bea_sem dataset, a curated selection of emotionally consistent speech from the EmoV_DB dataset, highlighting its potential to generate emotive speech.
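
A hedged sketch of the general recipe (a global semantic embedding from a frozen LLM conditioning the TTS text encoder); the module, dimensions, and fusion by broadcast addition are assumptions for illustration, not Llama-VITS's exact design.

```python
import torch
import torch.nn as nn

class SemanticConditioning(nn.Module):
    def __init__(self, llm_dim=4096, enc_dim=192):
        super().__init__()
        self.proj = nn.Linear(llm_dim, enc_dim)

    def forward(self, text_encoder_out, llm_hidden_states):
        """text_encoder_out: [batch, phones, enc_dim] from the TTS text encoder;
        llm_hidden_states: [batch, tokens, llm_dim] from the frozen LLM."""
        sem = self.proj(llm_hidden_states.mean(dim=1))   # mean-pool into one global vector
        return text_encoder_out + sem.unsqueeze(1)       # broadcast-add to every position
```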


Speech Technology

https://github.com/speechbrain/benchmarks/tree/main/benchmarks/CL_MASR

CL-MASR: A Continual Learning Benchmark for Multilingual ASR
This is the official benchmark platform accompanying the paper CL-MASR: A Continual Learning Benchmark for Multilingual ASR.

It includes scripts to train Whisper and WavLM-based ASR systems on a subset of 20 languages selected from Common Voice 13 in a continual learning fashion using a handful of methods including rehearsal-based, architecture-based, and regularization-based approaches.

The goal is to continually learn new languages while limiting forgetting of the previously learned ones. An ideal method should achieve both positive forward transfer (i.e., improve performance on new tasks by leveraging shared knowledge from previous tasks) and positive backward transfer (i.e., improve performance on previous tasks by leveraging shared knowledge from new tasks).
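
As a concrete illustration of the rehearsal family mentioned above, a minimal replay-buffer sketch: keep a few utterances from each previously learned language and mix them into batches while fine-tuning on a new one. Buffer size and mixing ratio are arbitrary choices here, not the benchmark's recipe.

```python
import random

class ReplayBuffer:
    def __init__(self, max_per_lang=100):
        self.max_per_lang = max_per_lang
        self.store = {}                                   # lang -> list of (audio, text)

    def add_language(self, lang, examples):
        self.store[lang] = random.sample(examples, min(self.max_per_lang, len(examples)))

    def sample(self, k):
        pool = [ex for exs in self.store.values() for ex in exs]
        return random.sample(pool, min(k, len(pool)))

def make_batch(new_lang_batch, buffer, replay_ratio=0.25):
    k = int(len(new_lang_batch) * replay_ratio)
    return new_lang_batch + buffer.sample(k)              # old-language data fights forgetting
```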


Speech Technology

Multimodal speech LLM work by Google DeepMind

Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems

Frank Palma Gomez, Ramon Sanabria, Yun-hsuan Sung, Daniel Cer, Siddharth Dalmia, Gustavo Hernandez Abrego

Large language models (LLMs) are trained on text-only data that go far beyond the languages with paired speech and text data. At the same time, Dual Encoder (DE) based retrieval systems project queries and documents into the same embedding space and have demonstrated their success in retrieval and bi-text mining. To match speech and text in many languages, we propose using LLMs to initialize multi-modal DE retrieval systems. Unlike traditional methods, our system doesn't require speech data during LLM pre-training and can exploit LLM's multilingual text understanding capabilities to match speech and text in languages unseen during retrieval training. Our multi-modal LLM-based retrieval system is capable of matching speech and text in 102 languages despite only training on 21 languages. Our system outperforms previous systems trained explicitly on all 102 languages. We achieve a 10% absolute improvement in Recall@1 averaged across these languages. Additionally, our model demonstrates cross-lingual speech and text matching, which is further enhanced by readily available machine translation data.

https://arxiv.org/abs/2404.01616v2
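
Dual-encoder retrieval systems of this kind are usually trained with a symmetric in-batch contrastive loss; a generic sketch is below, with the two towers, temperature, and normalization as assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def dual_encoder_loss(speech_emb, text_emb, temperature=0.05):
    """speech_emb, text_emb: [batch, dim] outputs of the speech and text towers."""
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.t() / temperature              # [batch, batch] similarity matrix
    targets = torch.arange(s.size(0))             # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```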


Speech Technology

Sometimes the world tells you something. Three unrelated sources on sonification I came across last week:

Photo sonification
https://github.com/yarov475/photo-soniphication (from Russian https://habr.com/ru/companies/spbu/articles/816839/)

Images that Sound: Composing Images and Sounds on a Single Canvas
https://arxiv.org/abs/2405.12221 (demo https://ificl.github.io/images-that-sound/)

Sound training platform applied to astronomy
https://arxiv.org/abs/2405.06042

Time to remember Myst and the Pythagoreans


Speech Technology

The Expresso dataset is a high-quality (48kHz) expressive speech dataset that includes both expressively rendered read speech (8 styles, in mono wav format) and improvised dialogues (26 styles, in stereo wav format). The dataset includes 4 speakers (2 males, 2 females), and totals 40 hours (11h read, 30h improvised). The transcriptions of the read speech are also provided.

https://huggingface.co/datasets/ylacombe/expresso
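
A quick way to poke at it with the Hugging Face `datasets` library; the split name below is an assumption and should be checked against the dataset card.

```python
from datasets import load_dataset

expresso = load_dataset("ylacombe/expresso")
print(expresso)                     # inspect available splits and columns
sample = expresso["train"][0]       # "train" is an assumed split name
print(sample.keys())
```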


Speech Technology

https://www.youtube.com/watch?v=JyPEZhMCfcU


Speech Technology

There will be a CHiME-8 webinar! If you're interested in the CHiME-8 challenge, please join!
Date: May 20, 2024
Time: 8:00 AM (US - ET)
Place: https://cmu.zoom.us/j/92314209923?pwd=TFpPUm1DTDhUOHJKbDdndFg1QmxPdz09

The meeting will likely be recorded


Speech Technology

Whisper.cpp implemented flash attention

https://github.com/ggerganov/whisper.cpp/releases/tag/v1.6.0


Speech Technology

And another one for text-guided audio editing

https://github.com/JinhuaLiang/WavCraft


Speech Technology

https://2024.ieeeslt.org/challenges/


Speech Technology

An interesting case of splitting SLU between an STM32 microcontroller and the cloud

https://arxiv.org/abs/2311.18188

Speech Understanding on Tiny Devices with A Learning Cache

Afsara Benazir, Zhiming Xu, Felix Xiaozhu Lin (University of Virginia)

This paper addresses spoken language understanding (SLU) on microcontroller-like embedded devices, integrating on-device execution with cloud offloading in a novel fashion. We leverage temporal locality in the speech inputs to a device and reuse recent SLU inferences accordingly. Our idea is simple: let the device match incoming inputs against cached results, and only offload inputs not matched to any cached ones to the cloud for full inference. Realization of this idea, however, is non-trivial: the device needs to compare acoustic features in a robust yet low-cost way. To this end, we present SpeechCache (or SC), a speech cache for tiny devices. It matches speech inputs at two levels of representations: first by sequences of clustered raw sound units, then as sequences of phonemes. Working in tandem, the two representations offer complementary tradeoffs between cost and efficiency. To boost accuracy even further, our cache learns to personalize: with the mismatched and then offloaded inputs, it continuously finetunes the device's feature extractors with the assistance of the cloud. We implement SC on an off-the-shelf STM32 microcontroller. The complete implementation has a small memory footprint of 2MB. Evaluated on challenging speech benchmarks, our system resolves 45%-90% of inputs on device, reducing the average latency by up to 80% compared to offloading to popular cloud speech recognition services. The benefit brought by our proposed SC is notable even in adversarial settings - noisy environments, cold cache, or one device shared by a number of users.
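
The cache-then-offload control flow can be summarized in a few lines; the unit/phoneme extractors, the similarity measure, and the threshold below are placeholders for illustration, not the paper's implementation.

```python
import difflib

class SpeechCacheSketch:
    def __init__(self, threshold=0.85):
        self.entries = []                        # list of (unit_seq, phoneme_seq, result)
        self.threshold = threshold

    def _match(self, seq, cached):
        return difflib.SequenceMatcher(None, seq, cached).ratio() >= self.threshold

    def infer(self, audio, unit_encoder, phonemizer, cloud_slu):
        units = unit_encoder(audio)              # cheap clustered sound units (level 1)
        for cached_units, cached_phones, result in self.entries:
            if self._match(units, cached_units):
                return result                    # level-1 hit, no offload
        phones = phonemizer(audio)               # costlier phoneme sequence (level 2)
        for cached_units, cached_phones, result in self.entries:
            if self._match(phones, cached_phones):
                return result                    # level-2 hit
        result = cloud_slu(audio)                # miss: offload full inference to the cloud
        self.entries.append((units, phones, result))
        return result
```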


Speech Technology

Multi-resolution is always nice

Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction

https://openreview.net/forum?id=kUuKFW7DIF


Speech Technology

Diffusion-based vocoder better than Vocos

https://github.com/bfs18/rfwave

Overall, there are many complaints about Vocos quality, but I'm not sure why. Technically it is a good architecture. Vocos's problems are likely a limitation of the training setup, not the architecture itself.


Speech Technology

AssemblyAI paper

The technical report about our latest Universal-1 multilingual ASR model is out!

Universal-1 is our latest ASR model in production, designed for high-quality, high-throughput, and large-scale operations. Not only does it demonstrate competitive or superior WERs in English, Spanish, German, and French under various conditions, but it also shows advantages in various practically relevant areas, such as accurate timestamp prediction, robustness against hallucinations, and handling code-switching. In this report, we adopt a holistic, system-centric approach to analyzing various aspects of fully-fledged ASR models to draw practically relevant insights that are useful for real-world services operating at scale. We hope that our report will help advance the speech field as it finds more and more applications in the real world.


https://arxiv.org/abs/2404.09841


Speech Technology

ICASSP starts next week. There will be many cool things, if only I had time to read them all

https://research.google/conferences-and-events/google-at-icassp-2024/

Maybe we should organize a paper reading session


Speech Technology

Parler-TTS quality is nothing exceptional. But the whole idea of working with audio via text prompts is somewhat interesting (audio cleanup, denoising, separation, etc.).


Speech Technology

https://github.com/ga642381/speech-trident


Speech Technology

RALL-E with chain-of-thought (CoT) prompting helps improve the robustness of codec LLMs for speech synthesis, reducing the error rate from 68% to 4% on extremely hard test sentences.

https://huggingface.co/papers/2404.03204
https://ralle-demo.github.io/RALL-E/
