Telegram channel speechtech - Speech Technology

Speech Technology

Usually "explainable" means something weird, like trying to find some neuron in the network responsible for the decision. This paper is somewhat different: it promotes an attribute-based approach.

https://arxiv.org/abs/2405.19796

Prediction accuracy is not as good as non-explainable methods, so more work is required, but the overall direction is nice.

Explainable Attribute-Based Speaker Verification

Xiaoliang Wu, Chau Luu, Peter Bell, Ajitha Rajan

This paper proposes a fully explainable approach to speaker verification (SV), a task that fundamentally relies on individual speaker characteristics. The opaque use of speaker attributes in current SV systems raises concerns of trust. Addressing this, we propose an attribute-based explainable SV system that identifies speakers by comparing personal attributes such as gender, nationality, and age extracted automatically from voice recordings. We believe this approach better aligns with human reasoning, making it more understandable than traditional methods. Evaluated on the Voxceleb1 test set, the best performance of our system is comparable with the ground truth established when using all correct attributes, proving its efficacy. Whilst our approach sacrifices some performance compared to non-explainable methods, we believe that it moves us closer to the goal of transparent, interpretable AI and lays the groundwork for future enhancements through attribute expansion.
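
To make the idea concrete, here is a rough sketch of what attribute-based verification could look like: score a trial by comparing per-attribute posteriors predicted from the two recordings. The attribute set, the upstream classifiers, and the cosine-plus-threshold decision rule are my own placeholders, not the paper's exact system.

```python
# Rough sketch of attribute-based speaker verification scoring.
# The attributes, upstream classifiers and decision rule are placeholders;
# the paper's actual extractors and scoring may differ.
import numpy as np

def attribute_vector(posteriors: dict) -> np.ndarray:
    """Concatenate per-attribute probability distributions into one vector."""
    return np.concatenate([np.asarray(posteriors[k], dtype=float) for k in sorted(posteriors)])

def verify(enroll: dict, test: dict, threshold: float = 0.9) -> bool:
    """Accept the trial if the two recordings have similar attribute profiles."""
    a, b = attribute_vector(enroll), attribute_vector(test)
    score = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity
    return score >= threshold

# Hypothetical attribute posteriors from upstream classifiers.
enroll = {"gender": [0.9, 0.1], "age_band": [0.1, 0.7, 0.2], "nationality": [0.8, 0.1, 0.1]}
test = {"gender": [0.85, 0.15], "age_band": [0.2, 0.6, 0.2], "nationality": [0.7, 0.2, 0.1]}
print(verify(enroll, test))
```

The appeal of such a score is that each attribute's contribution can be inspected directly, which is exactly the explainability argument.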

Speech Technology

@bjutte from Attendi does a nice job on medical transcription. Check his latest blog post:

https://medium.com/@Attendi/improving-automated-punctuation-of-transcribed-medical-reports-f7c6619b1715

Speech Technology

A quite in-depth paper from Apple on an alternative to LM rescoring. Feels like this is going to be a general direction for the coming years.

https://arxiv.org/abs/2405.15216

Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition

Zijin Gu, Tatiana Likhomanenko, He Bai, Erik McDermott, Ronan Collobert, Navdeep Jaitly

Language models (LMs) have long been used to improve results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors, however, they showed little improvement over traditional LMs mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), which is a scaled error correction model trained with vast amounts of synthetic data, significantly exceeding prior attempts meanwhile achieving new state-of-the-art ASR performance. We use text-to-speech (TTS) systems to synthesize audio, which is fed into an ASR system to produce noisy hypotheses, which are then paired with the original texts to train the DLM. DLM has several key ingredients: (i) up-scaled model and data; (ii) usage of multi-speaker TTS systems; (iii) combination of multiple noise augmentation strategies; and (iv) new decoding techniques. With a Transformer-CTC ASR, DLM achieves 1.5% word error rate (WER) on test-clean and 3.3% WER on test-other on Librispeech, which to our knowledge are the best reported numbers in the setting where no external audio data are used and even match self-supervised methods which use external audio data. Furthermore, a single DLM is applicable to different ASRs, and greatly surpassing the performance of conventional LM based beam-search rescoring. These results indicate that properly investigated error correction models have the potential to replace conventional LMs, holding the key to a new level of accuracy in ASR systems.
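
The data recipe from the abstract is easy to picture in code: synthesize audio from clean text with TTS, run it through the ASR, and keep (noisy hypothesis, clean text) pairs for the corrector. A minimal sketch, with placeholder synthesize/transcribe back ends; the paper's scale, multi-speaker TTS setup and noise augmentations are not shown.

```python
# Sketch of the DLM data recipe from the abstract: synthesize audio from text
# with TTS, transcribe it with the ASR system, and pair the noisy hypothesis
# with the original text as (input, target) for the error correction model.
# `synthesize` and `transcribe` are placeholders for real TTS/ASR back ends.
from typing import Callable, Iterable, List, Tuple

def build_dlm_pairs(
    texts: Iterable[str],
    synthesize: Callable[[str, int], bytes],   # (text, speaker_id) -> audio
    transcribe: Callable[[bytes], str],        # audio -> noisy ASR hypothesis
    speakers: List[int],
) -> List[Tuple[str, str]]:
    pairs = []
    for i, text in enumerate(texts):
        speaker = speakers[i % len(speakers)]  # cycle speakers (multi-speaker TTS per the paper)
        audio = synthesize(text, speaker)
        hypothesis = transcribe(audio)
        pairs.append((hypothesis, text))       # DLM learns hypothesis -> clean text
    return pairs
```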

Speech Technology

Sometimes the world tells you something. Three unrelated sources on sonification I came across last week:

Photo sonification
https://github.com/yarov475/photo-soniphication (original Russian article: https://habr.com/ru/companies/spbu/articles/816839/)

Images that Sound: Composing Images and Sounds on a Single Canvas
https://arxiv.org/abs/2405.12221 (demo https://ificl.github.io/images-that-sound/)

Sound training platform applied to astronomy
https://arxiv.org/abs/2405.06042

Time to remember Myst and the Pythagoreans.

Speech Technology

The Expresso dataset is a high-quality (48kHz) expressive speech dataset that includes both expressively rendered read speech (8 styles, in mono wav format) and improvised dialogues (26 styles, in stereo wav format). The dataset includes 4 speakers (2 males, 2 females), and totals 40 hours (11h read, 30h improvised). The transcriptions of the read speech are also provided.

https://huggingface.co/datasets/ylacombe/expresso
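
Should be straightforward to grab from the Hub. A minimal loading sketch using the standard datasets API; the split and column names are my guess and may differ for this repo.

```python
# Minimal loading sketch; split and column names are assumptions based on
# typical Hugging Face audio datasets and may differ for this repo.
from datasets import load_dataset

expresso = load_dataset("ylacombe/expresso", split="train")
print(expresso)                       # inspect available features (audio, text, style, ...)
sample = expresso[0]
print(sample.get("text"), sample.get("style"))
```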

Speech Technology

https://www.youtube.com/watch?v=JyPEZhMCfcU

Speech Technology

There will be a CHiME-8 webinar! If you're interested in the CHiME-8 challenge, please join!
Date: May 20, 2024
Time: 8:00 AM (US - ET)
Place: https://cmu.zoom.us/j/92314209923?pwd=TFpPUm1DTDhUOHJKbDdndFg1QmxPdz09

The meeting will likely be recorded.

Speech Technology

whisper.cpp now implements flash attention.

https://github.com/ggerganov/whisper.cpp/releases/tag/v1.6.0

Speech Technology

And another text-guided editing tool:

https://github.com/JinhuaLiang/WavCraft

Speech Technology

https://2024.ieeeslt.org/challenges/

Speech Technology

An interesting case of splitting SLU between an STM32 microcontroller and the cloud.

https://arxiv.org/abs/2311.18188

Speech Understanding on Tiny Devices with A Learning Cache
Afsara Benazir, Zhiming Xu, Felix Xiaozhu Lin (University of Virginia)
This paper addresses spoken language understanding (SLU) on microcontroller-like embedded devices, integrating on-device execution with cloud offloading in a novel fashion. We leverage temporal locality in the speech inputs to a device and reuse recent SLU inferences accordingly. Our idea is simple: let the device match incoming inputs against cached results, and only offload inputs not matched to any cached ones to the cloud for full inference. Realization of this idea, however, is non-trivial: the device needs to compare acoustic features in a robust yet low-cost way. To this end, we present SpeechCache (or SC), a speech cache for tiny devices. It matches speech inputs at two levels of representations: first by sequences of clustered raw sound units, then as sequences of phonemes. Working in tandem, the two representations offer complementary tradeoffs between cost and efficiency. To boost accuracy even further, our cache learns to personalize: with the mismatched and then offloaded inputs, it continuously finetunes the device's feature extractors with the assistance of the cloud. We implement SC on an off-the-shelf STM32 microcontroller. The complete implementation has a small memory footprint of 2MB. Evaluated on challenging speech benchmarks, our system resolves 45%-90% of inputs on device, reducing the average latency by up to 80% compared to offloading to popular cloud speech recognition services. The benefit brought by our proposed SC is notable even in adversarial settings - noisy environments, cold cache, or one device shared by a number of users.
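
The cache-or-offload control flow from the abstract, as a rough sketch: try to match the input on device, fall back to the cloud only on a miss, and cache the cloud's answer. The real system does robust approximate matching of unit/phoneme sequences and on-device finetuning; here exact-match dictionaries and placeholder extractors stand in for all of that.

```python
# Control-flow sketch of the cache-or-offload idea from the abstract.
# The feature extractors and the cloud client are placeholders, and exact
# dictionary lookup stands in for the paper's robust sequence matching.
from typing import Callable, Dict, List

class SpeechCacheSketch:
    def __init__(self,
                 to_units: Callable[[bytes], List[int]],      # raw audio -> clustered sound units
                 to_phonemes: Callable[[bytes], List[str]],   # raw audio -> phoneme sequence
                 cloud_slu: Callable[[bytes], str]):          # full inference in the cloud
        self.to_units = to_units
        self.to_phonemes = to_phonemes
        self.cloud_slu = cloud_slu
        self.unit_cache: Dict[tuple, str] = {}
        self.phoneme_cache: Dict[tuple, str] = {}

    def infer(self, audio: bytes) -> str:
        units = tuple(self.to_units(audio))
        if units in self.unit_cache:                  # level 1: cheap unit-sequence match
            return self.unit_cache[units]
        phones = tuple(self.to_phonemes(audio))
        if phones in self.phoneme_cache:              # level 2: phoneme-sequence match
            return self.phoneme_cache[phones]
        intent = self.cloud_slu(audio)                # miss: offload for full inference
        self.unit_cache[units] = intent               # remember the offloaded input
        self.phoneme_cache[phones] = intent
        return intent
```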

Speech Technology

Multi-resolution is always nice

Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction

https://openreview.net/forum?id=kUuKFW7DIF

Speech Technology

A diffusion-based vocoder that is better than Vocos:

https://github.com/bfs18/rfwave

Overall, there are many complaints about Vocos quality, but I'm not sure why. Technically it is a good architecture; Vocos' problems are likely a limitation of the training setup, not the architecture itself.

Speech Technology

AssemblyAI paper:

The technical report about our latest Universal-1 multilingual ASR model is out!

Universal-1 is our latest ASR model in production, designed for high-quality, high-throughput, and large-scale operations. Not only does it demonstrate competitive or superior WERs in English, Spanish, German, and French under various conditions, but it also shows advantages in various practically relevant areas, such as accurate timestamp prediction, robustness against hallucinations, and handling code-switching. In this report, we adopt a holistic, system-centric approach to analyzing various aspects of fully-fledged ASR models to draw practically relevant insights that are useful for real-world services operating at scale. We hope that our report will help advance the speech field as it finds more and more applications in the real world.


https://arxiv.org/abs/2404.09841

Speech Technology

ICASSP starts next week. There will be many cool things - if only I had time to read them all.

https://research.google/conferences-and-events/google-at-icassp-2024/

Maybe we should organize a paper reading session.

Speech Technology

Conversational Voice Clone Challenge (CoVoC)
ISCSLP2024 Grand Challenge

https://www.magicdatatech.com/iscslp-2024

Call for Participation
Text-to-speech (TTS) aims to produce speech that sounds as natural and human-like as possible. Recent advancements in neural speech synthesis have significantly enhanced the quality and naturalness of generated speech, leading to widespread applications of TTS systems in real-world scenarios. A notable breakthrough in the field is witnessed in zero-shot TTS, driven by expanded datasets and new approaches (e.g.: decoder-only paradigms), attracting extensive attention from academia and industry. However, these advancements haven't been sufficiently investigated to address challenges in spontaneous and conversational contexts. Specifically, the primary challenge lies in effectively managing prosody details in the generated speech, which is attributed to the diverse and intricate spontaneous behaviors that differentiate spontaneous speech from read speech.

Large-scale TTS systems, with their in-context learning ability, have the potential to yield promising outcomes in the mentioned scenarios. However, a prevalent challenge in the field of large-scale zero-shot TTS is the lack of consistency in training and testing datasets, along with a standardized evaluation benchmark. This issue hinders direct comparisons and makes it challenging to assess the performance of various systems accurately.

To promote the development of expressive spontaneous-style speech synthesis in the zero-shot scenario, we are launching the Conversational Voice Clone Challenge (CoVoC). This challenge encompasses a range of diverse training datasets, including the 10K-hour WenetSpeech4TTS dataset, 180 hours of Mandarin spontaneous conversational speech data, and 100 hours of high-quality spoken conversations. Furthermore, a standardized testing dataset, accompanied by carefully designed text, will be made available. The goal of providing these sizable and standardized datasets is to establish a comprehensive benchmark.

Timeline
June 3, 2024 : HQ-Conversations data release

June 10, 2024 : Baseline system release

June 30, 2024 : Evaluation stage starts; Clone-Speaker and Test-Text data release; deadline for challenge registration

July 2, 2024 : Evaluation ends; Test audio and system description submission deadline

July 12, 2024 : Evaluation results release to participants

July 20, 2024 : Deadline for ISCSLP2024 paper submission (only for invited teams)

Speech Technology

A state-space model for real-time TTS:

https://cartesia.ai/blog/sonic

In experiments so far, we've found that we can simultaneously improve model quality, inference speed, throughput, and latency compared to widely used Transformer implementations for audio generation. A parameter-matched Cartesia model trained on Multilingual Librispeech for one epoch achieves 20% lower validation perplexity. On downstream evaluations, this results in a 2x lower word error rate and a 1 point higher quality score (out of 5, as measured on the NISQA evaluation). At inference, it achieves lower latency (1.5x lower time-to-first-audio), faster inference speed (2x lower real-time factor), and higher throughput (4x).

Speech Technology

LLMs are the frontier in TTS too (true LLMs, not GPT). SLAM can do it too, by the way. A Microsoft paper:

https://arxiv.org/abs/2404.03204

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao

We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of language models. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to enhance the robustness of LLM-based TTS. To accomplish this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E utilizes the predicted duration prompt to guide the computing of self-attention weights in Transformer to enforce the model to focus on the corresponding phonemes and prosody features when predicting speech tokens. Results of comprehensive objective and subjective evaluations demonstrate that, compared to a powerful baseline method VALL-E, RALL-E significantly improves the WER of zero-shot TTS from 5.6% (without reranking) and 1.7% (with reranking) to 2.5% and 1.0%, respectively. Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E and reduces the error rate from 68% to 4%.
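
The second ingredient (duration-guided attention) is the most interesting part. Below is a toy sketch of building an additive attention bias from predicted phoneme durations so each speech frame attends to its own phoneme plus a small window; the window size, the hard -inf masking, and the shapes are my assumptions, not RALL-E's exact formulation.

```python
# Toy sketch of duration-guided attention: build an additive bias so that each
# speech-frame step mostly attends to the phoneme its predicted duration covers
# (plus a small window). Window, hard masking and shapes are assumptions.
import torch

def duration_attention_bias(durations: torch.Tensor, window: int = 1) -> torch.Tensor:
    """durations: (num_phonemes,) predicted frame counts. Returns a (T, num_phonemes) bias."""
    ends = torch.cumsum(durations, dim=0)            # frame index where each phoneme ends
    starts = ends - durations                        # frame index where each phoneme starts
    total_frames = int(ends[-1].item())
    frame_idx = torch.arange(total_frames).unsqueeze(1)          # (T, 1)
    phone_of_frame = (frame_idx >= starts) & (frame_idx < ends)  # (T, P) frame/phoneme alignment
    phone_pos = phone_of_frame.float().argmax(dim=1, keepdim=True)
    allowed = (torch.arange(durations.numel()) - phone_pos).abs() <= window
    bias = torch.where(allowed, torch.zeros(()), torch.full((), float("-inf")))
    return bias                                      # add to attention logits before softmax

durs = torch.tensor([3, 2, 4])                       # toy phoneme durations in frames
print(duration_attention_bias(durs))
```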

Speech Technology

Our recent experiments with non-streaming ASR + VAD revealed serious problems with the approach. First, VAD turned out to be insufficient in complex audio conditions, which is understandable given that it has no access to linguistic information. Second, longer audio chunks introduce significant delays even with modern fast networks, and the problem gets worse when you have to combine chunks of different lengths in a batch.

So we have recently been considering a joint streaming/non-streaming approach. The U2 architecture and WeNet actually make a lot of sense here. Shinji's paper is similar:

https://arxiv.org/abs/2405.13514

Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe

End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes: streaming and non-streaming, each with its pros and cons. Streaming ASR processes the speech frames in real-time as it is being received, while non-streaming ASR waits for the entire speech utterance; thus, professionals may have to operate in either mode to satisfy their application. In this work, we present joint optimization of streaming and non-streaming ASR based on multi-decoder and knowledge distillation. Primarily, we study 1) the encoder integration of these ASR modules, followed by 2) separate decoders to make the switching mode flexible, and enhancing performance by 3) incorporating similarity-preserving knowledge distillation between the two modular encoders and decoders. Evaluation results show 2.6%-5.3% relative character error rate reductions (CERR) on CSJ for streaming ASR, and 8.3%-9.7% relative CERRs for non-streaming ASR within a single model compared to multiple standalone modules.
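
Similarity-preserving KD is a nice trick by itself. A minimal sketch of such a loss between pooled streaming and non-streaming encoder outputs; the mean pooling and MSE weighting are my assumptions, not necessarily the paper's recipe.

```python
# Minimal similarity-preserving KD sketch: match the batch-wise pairwise
# similarity matrices of pooled streaming and non-streaming encoder outputs.
# Pooling choice and loss weighting are assumptions.
import torch
import torch.nn.functional as F

def sp_kd_loss(streaming_feats: torch.Tensor, full_feats: torch.Tensor) -> torch.Tensor:
    """Both inputs: (batch, time, dim). Returns a scalar similarity-preserving loss."""
    s = streaming_feats.mean(dim=1)                   # (B, D) mean-pool over time
    t = full_feats.mean(dim=1)
    g_s = F.normalize(s @ s.T, p=2, dim=1)            # (B, B) row-normalized similarity matrices
    g_t = F.normalize(t @ t.T, p=2, dim=1)
    return F.mse_loss(g_s, g_t)

# Toy usage with random encoder outputs of different lengths.
loss = sp_kd_loss(torch.randn(8, 100, 256), torch.randn(8, 120, 256))
print(loss.item())
```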

Speech Technology

A more serious speech LLM:

https://github.com/ddlBoJack/SLAM-LLM

Speech Technology

Another multimodal LLM doing speech

https://github.com/OpenMOSS/AnyGPT

Speech Technology

Valuation at 50x revenue. OK.

https://github.com/PolyAI-LDN/pheme

https://www.linkedin.com/posts/seb-johnson_polyai-secures-near-500mn-valuation-in-boost-activity-7196548712846761984-MOLg/

BREAKING: PolyAI RAISES $50M AT $500M VALUATION

On monday I posted about PolyAI and how they were shaping up to become one of the UK's strongest AI companies. Today they announced a huge fundraise from new investors including chipmaking giant NVIDIA.

PolyAI is a Conversational AI platform automating customer service. It was founded in 2017 by Nikola Mrkšić, Tsung-Hsien Wen, and Pei-Hao (Eddy) Su who met at the University of Cambridge’s Machine Intelligence Lab.

As part of the fundraise the company highlighted some INSANE financial results:

* $10m revenue last year
* On track to TRIPLE that this year (i.e. $30m)
* 90% Gross Margin (FY23)

PolyAI grew out of Entrepreneur First before going on to raise several investment rounds, including their $40m Series B which valued the company at nearly $300 million post-money. Two years later this has grown to $500m!

A great result for the team and a further boost for the UK’s ambitions of becoming an AI hub.

Speech Technology

Apple optimizes the Conformer for mobile devices. Some in-depth ideas:

https://arxiv.org/abs/2312.10359

5x faster than real-time speech recognition running on an Apple Watch 7. This is a lot more solved than you would think!

Speech Technology

https://github.com/tincans-ai/gazelle

https://tincans.ai/

We're building real-time conversational speech with large language models. Bots should be fun to talk to.

Previously, we built large scale machine learning systems at Cash App, Quora, Airbnb, and more.

Speech Technology

It's interesting that the Cambridge lab has been doing a lot of Whisperology recently.

https://arxiv.org/abs/2405.06134

Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models

Vyas Raina, Rao Ma, Charles McGhee, Kate Knill, Mark Gales

Recent developments in large speech foundation models like Whisper have led to their widespread use in many automatic speech recognition (ASR) applications. These systems incorporate 'special tokens' in their vocabulary, such as <endoftext>, to guide their language generation process. However, we demonstrate that these tokens can be exploited by adversarial attacks to manipulate the model's behavior. We propose a simple yet effective method to learn a universal acoustic realization of Whisper's <endoftext> token, which, when prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively 'muting' the model. Our experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97% of speech samples. Moreover, we find that this universal adversarial audio segment often transfers to new datasets and tasks. Overall this work demonstrates the vulnerability of Whisper models to 'muting' adversarial attacks, where such attacks can pose both risks and potential benefits in real-world settings: for example the attack can be used to bypass speech moderation systems, or conversely the attack can also be used to protect private speech data.
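
The evaluation side of the attack is simple to picture: prepend the same fixed 0.64 s segment to every utterance and count how often the model outputs nothing. A sketch of that, with the adversarial segment assumed already learned and transcribe standing in for an actual Whisper inference call.

```python
# Sketch of evaluating a universal "muting" segment as described in the
# abstract: prepend the same fixed adversarial audio to every utterance and
# count how often the ASR output becomes empty. Learning the segment is not
# shown; `transcribe` is a placeholder for a Whisper inference call.
import numpy as np
from typing import Callable, List

SR = 16_000
SEGMENT_LEN = int(0.64 * SR)        # 0.64 s universal segment, per the paper

def mute_rate(adv_segment: np.ndarray,
              utterances: List[np.ndarray],
              transcribe: Callable[[np.ndarray], str]) -> float:
    assert adv_segment.shape[0] == SEGMENT_LEN
    muted = 0
    for wav in utterances:
        attacked = np.concatenate([adv_segment, wav])   # prepend, don't overwrite the speech
        if transcribe(attacked).strip() == "":
            muted += 1
    return muted / max(len(utterances), 1)
```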

Speech Technology

https://challenge.singfake.org/

We are organizing the inaugural Singing Voice Deepfake Detection (SVDD) 2024 challenge at IEEE Spoken Language Technology Workshop (SLT) 2024 to foster research and development of sophisticated methods for detecting AI-generated singing voices. This is an emerging issue within the music industry that requires specialized solutions.
Our prior analysis using the SingFake dataset showed a marked decline in performance of state-of-the-art speech deepfake countermeasures when applied to singing voices. This highlighted the need for tailored SVDD techniques.

Speech Technology

An interesting dataset generated with ElevenLabs: multiple languages in the same phrase.

https://huggingface.co/datasets/MohamedRashad/multilingual-tts

Speech Technology

People say this vocoder has a point: it joins signal processing with neural techniques.

https://ast-astrec.nict.go.jp/demo_samples/firnet_icassp2024/

FIRNet: Fast and pitch controllable neural vocoder with trainable finite impulse response filter

Some neural vocoders with fundamental frequency (f0) control have succeeded in performing real-time inference on a single CPU while preserving the quality of the synthetic speech. However, compared with legacy vocoders based on signal processing, their inference speeds are still low. This paper proposes a neural vocoder based on the source-filter model with trainable time-variant finite impulse response (FIR) filters, to achieve a similar inference speed to legacy vocoders. In the proposed model, FIRNet, multiple FIR coefficients are predicted using the neural networks, and the speech waveform is then generated by convolving a mixed excitation signal with these FIR coefficients. Experimental results show that FIRNet can achieve an inference speed similar to legacy vocoders while maintaining f0 controllability and natural speech quality.

https://ast-astrec.nict.go.jp/release/preprints/preprint_icassp_2024_ohtani.pdf
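
The synthesis step itself is just time-variant filtering. A toy sketch of convolving a mixed excitation with frame-wise predicted FIR coefficients and overlap-adding the results; the frame size, filter length and excitation here are made up, and the coefficient-predicting network is omitted.

```python
# Toy sketch of the FIRNet synthesis step: convolve an excitation signal with
# frame-wise (time-variant) FIR coefficients and overlap-add the results.
# Frame size, filter length and the excitation are made-up toy values.
import numpy as np

def synthesize(excitation: np.ndarray, fir_per_frame: np.ndarray, hop: int = 256) -> np.ndarray:
    """excitation: (T,) samples; fir_per_frame: (num_frames, taps) predicted filters."""
    num_frames, taps = fir_per_frame.shape
    out = np.zeros(excitation.shape[0] + taps - 1)
    for i in range(num_frames):
        start = i * hop
        frame = excitation[start:start + hop]
        if frame.size == 0:
            break
        segment = np.convolve(frame, fir_per_frame[i])    # filter this frame
        out[start:start + segment.size] += segment        # overlap-add into the output
    return out[:excitation.shape[0]]

# Toy usage: white-noise excitation shaped by random filters.
rng = np.random.default_rng(0)
wav = synthesize(rng.standard_normal(16_000), rng.standard_normal((63, 64)) * 0.1)
print(wav.shape)
```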

Speech Technology

I recently tested Google ASR for English - Chirp, Conformer (latest version), and Gemini. Conformer is not good. Chirp is OK, somewhat better than Whisper V3.

Speech Technology

The idea is relevant. The results are somewhat mixed.

https://arxiv.org/abs/2404.06714

Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness

Xincan Feng, Akifumi Yoshimoto

Recent advancements in Natural Language Processing (NLP) have seen Large-scale Language Models (LLMs) excel at producing high-quality text for various purposes. Notably, in Text-To-Speech (TTS) systems, the integration of BERT for semantic token generation has underscored the importance of semantic content in producing coherent speech outputs. Despite this, the specific utility of LLMs in enhancing TTS synthesis remains considerably limited. This research introduces an innovative approach, Llama-VITS, which enhances TTS synthesis by enriching the semantic content of text using LLM. Llama-VITS integrates semantic embeddings from Llama2 with the VITS model, a leading end-to-end TTS framework. By leveraging Llama2 for the primary speech synthesis process, our experiments demonstrate that Llama-VITS matches the naturalness of the original VITS (ORI-VITS) and those incorporate BERT (BERT-VITS), on the LJSpeech dataset, a substantial collection of neutral, clear speech. Moreover, our method significantly enhances emotive expressiveness on the EmoV_DB_bea_sem dataset, a curated selection of emotionally consistent speech from the EmoV_DB dataset, highlighting its potential to generate emotive speech.
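
Extracting the semantic conditioning is the easy half. A sketch of pulling a global embedding out of a Llama-family model with transformers; mean pooling over the last hidden state and the model id are my assumptions, and the fusion with the VITS text encoder is not shown.

```python
# Sketch of extracting a global semantic embedding from a Llama-family model
# with Hugging Face transformers. Mean-pooling over the last hidden state is
# an assumption, as is the model id (gated repo, requires access approval);
# how Llama-VITS fuses this with the VITS text encoder is not shown.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"          # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.float16)

@torch.no_grad()
def semantic_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state          # (1, seq_len, hidden_dim)
    return hidden.mean(dim=1).squeeze(0)                # (hidden_dim,) pooled sentence embedding

emb = semantic_embedding("The quick brown fox jumps over the lazy dog.")
print(emb.shape)
```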
