Usually "explainable" means something weird. Like you try to find some neuron in network responsible for the decision. This paper is somewhat different promoting the attribute approach
https://arxiv.org/abs/2405.19796
Prediction accuracy is not as good, so more work is required, but the overall direction is nice
Explainable Attribute-Based Speaker Verification
Xiaoliang Wu, Chau Luu, Peter Bell, Ajitha Rajan
This paper proposes a fully explainable approach to speaker verification (SV), a task that fundamentally relies on individual speaker characteristics. The opaque use of speaker attributes in current SV systems raises concerns of trust. Addressing this, we propose an attribute-based explainable SV system that identifies speakers by comparing personal attributes such as gender, nationality, and age extracted automatically from voice recordings. We believe this approach better aligns with human reasoning, making it more understandable than traditional methods. Evaluated on the Voxceleb1 test set, the best performance of our system is comparable with the ground truth established when using all correct attributes, proving its efficacy. Whilst our approach sacrifices some performance compared to non-explainable methods, we believe that it moves us closer to the goal of transparent, interpretable AI and lays the groundwork for future enhancements through attribute expansion.
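The idea is easy to sketch even without the paper's code: extract interpretable attributes (gender, nationality, age) from each recording with separate classifiers, then verify by comparing attribute posteriors instead of opaque embeddings. A minimal toy sketch, where the hypothetical predict_attributes stands in for the actual attribute extractors:

```python
import numpy as np

# Hypothetical attribute extractor: in the paper these are classifiers
# predicting gender / nationality / age from the recording. Here we just
# return made-up posterior vectors for illustration.
def predict_attributes(audio: np.ndarray) -> dict:
    return {
        "gender": np.array([0.9, 0.1]),            # [female, male]
        "nationality": np.array([0.7, 0.2, 0.1]),  # e.g. [US, UK, other]
        "age": np.array([0.1, 0.6, 0.3]),          # [young, middle, senior]
    }

def attribute_score(attrs_a: dict, attrs_b: dict) -> float:
    """Average cosine similarity of attribute posteriors between two recordings."""
    sims = []
    for name in attrs_a:
        a, b = attrs_a[name], attrs_b[name]
        sims.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
    return sum(sims) / len(sims)

enroll = predict_attributes(np.zeros(16000))  # enrollment utterance
test = predict_attributes(np.zeros(16000))    # test utterance
same_speaker = attribute_score(enroll, test) > 0.8  # threshold tuned on dev data
print(same_speaker)
```

The nice part is that the score decomposes per attribute, so a rejection can be explained ("different nationality, similar age") rather than just reported.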
@bjutte from Attendi does a nice job on medical transcription. Check his latest blog post:
https://medium.com/@Attendi/improving-automated-punctuation-of-transcribed-medical-reports-f7c6619b1715
A quite in-depth paper from Apple on an alternative to LM rescoring. It feels like this is going to be a general direction for the coming years
https://arxiv.org/abs/2405.15216
Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition
Zijin Gu, Tatiana Likhomanenko, He Bai, Erik McDermott, Ronan Collobert, Navdeep Jaitly
Language models (LMs) have long been used to improve results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors, however, they showed little improvement over traditional LMs mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), which is a scaled error correction model trained with vast amounts of synthetic data, significantly exceeding prior attempts meanwhile achieving new state-of-the-art ASR performance. We use text-to-speech (TTS) systems to synthesize audio, which is fed into an ASR system to produce noisy hypotheses, which are then paired with the original texts to train the DLM. DLM has several key ingredients: (i) up-scaled model and data; (ii) usage of multi-speaker TTS systems; (iii) combination of multiple noise augmentation strategies; and (iv) new decoding techniques. With a Transformer-CTC ASR, DLM achieves 1.5% word error rate (WER) on test-clean and 3.3% WER on test-other on Librispeech, which to our knowledge are the best reported numbers in the setting where no external audio data are used and even match self-supervised methods which use external audio data. Furthermore, a single DLM is applicable to different ASRs, and greatly surpassing the performance of conventional LM based beam-search rescoring. These results indicate that properly investigated error correction models have the potential to replace conventional LMs, holding the key to a new level of accuracy in ASR systems.
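The data recipe is the interesting part: synthesize audio with TTS for a large text corpus, decode it with the target ASR system, and pair the noisy hypotheses with the original text to train the corrector. A rough sketch of that loop; tts_synthesize and asr_decode are stand-ins (here ASR "errors" are simulated by random word drops just so the sketch runs), not the paper's pipeline:

```python
import json
import random

# Stand-ins for a multi-speaker TTS system and the target ASR model.
def tts_synthesize(text, speaker_id):
    return [0.0] * 16000  # placeholder waveform

def asr_decode(audio, reference):
    words = reference.split()
    noisy = [w for w in words if random.random() > 0.1]  # simulated deletions
    return " ".join(noisy)

def build_dlm_pairs(texts, speakers, out_path="dlm_train.jsonl"):
    """Write (noisy hypothesis -> clean reference) pairs for training a DLM."""
    with open(out_path, "w") as f:
        for i, text in enumerate(texts):
            audio = tts_synthesize(text, speaker_id=speakers[i % len(speakers)])
            hyp = asr_decode(audio, text)
            f.write(json.dumps({"source": hyp, "target": text}) + "\n")

build_dlm_pairs(["the cat sat on the mat", "speech recognition is fun"], speakers=[0, 1])
```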
Sometimes the world tells you something. Three unrelated sources on sonification I came across last week:
Photo sonification
https://github.com/yarov475/photo-soniphication (from a Russian write-up: https://habr.com/ru/companies/spbu/articles/816839/)
Images that Sound: Composing Images and Sounds on a Single Canvas
https://arxiv.org/abs/2405.12221 (demo https://ificl.github.io/images-that-sound/)
Sound training platform applied to astronomy
https://arxiv.org/abs/2405.06042
Time to remember Myst and the Pythagoreans
The Expresso dataset is a high-quality (48kHz) expressive speech dataset that includes both expressively rendered read speech (8 styles, in mono wav format) and improvised dialogues (26 styles, in stereo wav format). The dataset includes 4 speakers (2 males, 2 females), and totals 40 hours (11h read, 30h improvised). The transcriptions of the read speech are also provided.
https://huggingface.co/datasets/ylacombe/expresso
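If you want to poke at it, it should load with the standard datasets API; the split and column names below are not verified, so the sketch just inspects what comes back:

```python
from datasets import load_dataset

# Split and column names are a guess from the dataset card, so inspect first.
expresso = load_dataset("ylacombe/expresso")
print(expresso)                   # shows the available splits and columns
split = list(expresso.keys())[0]
sample = expresso[split][0]
print(sample.keys())              # audio is expected at 48 kHz per the description
```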
There will be a CHiME-8 webinar! If you're interested in the CHiME-8 challenge, please join!
Date: May 20, 2024
Time: 8:00 AM (US - ET)
Place: https://cmu.zoom.us/j/92314209923?pwd=TFpPUm1DTDhUOHJKbDdndFg1QmxPdz09
The meeting will likely be recorded
Whisper.cpp has implemented flash attention
https://github.com/ggerganov/whisper.cpp/releases/tag/v1.6.0
And another text-guided audio editing tool
https://github.com/JinhuaLiang/WavCraft
An interesting case of splitting SLU between an STM32 microcontroller and the cloud
https://arxiv.org/abs/2311.18188
Speech Understanding on Tiny Devices with A Learning Cache
Afsara Benazir, Zhiming Xu, Felix Xiaozhu Lin (University of Virginia)
This paper addresses spoken language understanding (SLU) on microcontroller-like embedded devices, integrating on-device execution with cloud offloading in a novel fashion. We leverage temporal locality in the speech inputs to a device and reuse recent SLU inferences accordingly. Our idea is simple: let the device match incoming inputs against cached results, and only offload inputs not matched to any cached ones to the cloud for full inference. Realization of this idea, however, is non-trivial: the device needs to compare acoustic features in a robust yet low-cost way. To this end, we present SpeechCache (or SC), a speech cache for tiny devices. It matches speech inputs at two levels of representations: first by sequences of clustered raw sound units, then as sequences of phonemes. Working in tandem, the two representations offer complementary tradeoffs between cost and efficiency. To boost accuracy even further, our cache learns to personalize: with the mismatched and then offloaded inputs, it continuously finetunes the device's feature extractors with the assistance of the cloud. We implement SC on an off-the-shelf STM32 microcontroller. The complete implementation has a small memory footprint of 2MB. Evaluated on challenging speech benchmarks, our system resolves 45%-90% of inputs on device, reducing the average latency by up to 80% compared to offloading to popular cloud speech recognition services. The benefit brought by our proposed SC is notable even in adversarial settings - noisy environments, cold cache, or one device shared by a number of users.
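The core loop is simple enough to sketch: quantize the incoming utterance into a discrete unit sequence, match it against cached sequences with a similarity threshold, and only go to the cloud on a miss. A toy version below; the unit extractor and the cloud call are placeholders, not the paper's code:

```python
import difflib

class SpeechCache:
    """Toy on-device cache: match incoming unit sequences against cached ones,
    offload to the cloud only when nothing is close enough."""

    def __init__(self, threshold=0.85):
        self.entries = []          # list of (unit_sequence, slu_result)
        self.threshold = threshold

    def lookup(self, units):
        best, best_sim = None, 0.0
        for cached_units, result in self.entries:
            sim = difflib.SequenceMatcher(None, units, cached_units).ratio()
            if sim > best_sim:
                best, best_sim = result, sim
        return best if best_sim >= self.threshold else None

    def resolve(self, units, cloud_infer):
        result = self.lookup(units)
        if result is None:                       # cache miss -> offload
            result = cloud_infer(units)
            self.entries.append((units, result))
        return result

# cloud_infer is a placeholder for the actual cloud SLU service.
cache = SpeechCache()
print(cache.resolve(["u1", "u7", "u3"], cloud_infer=lambda u: "turn_on_lights"))
print(cache.resolve(["u1", "u7", "u3"], cloud_infer=lambda u: "should_not_be_called"))
```

The real system additionally matches at a phoneme level and finetunes the on-device feature extractor with the offloaded inputs, which is where the "learning" in the title comes from.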
Multi-resolution is always nice
Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction
https://openreview.net/forum?id=kUuKFW7DIF
A diffusion-based vocoder that is better than Vocos
https://github.com/bfs18/rfwave
Overall, there are many complaints about Vocos quality, but I'm not sure why. Technically it is a good architecture; Vocos's problems are likely a limitation of the training setup, not of the architecture itself.
Assembly.AI paper
The technical report about our latest Universal-1 multilingual ASR model is out!
Universal-1 is our latest ASR model in production, designed for high-quality, high-throughput, and large-scale operations. Not only does it demonstrate competitive or superior WERs in English, Spanish, German, and French under various conditions, but it also shows advantages in various practically relevant areas, such as accurate timestamp prediction, robustness against hallucinations, and handling code-switching. In this report, we adopt a holistic, system-centric approach to analyzing various aspects of fully-fledged ASR models to draw practically relevant insights that are useful for real-world services operating at scale. We hope that our report will help advance the speech field as it finds more and more applications in the real world.
https://arxiv.org/abs/2404.09841
ICASSP starts next week. There will be many cool things, if only I had time to read it all
https://research.google/conferences-and-events/google-at-icassp-2024/
Maybe we should organize a paper reading session
Conversational Voice Clone Challenge (CoVoC)
ISCSLP2024 Grand Challenge
https://www.magicdatatech.com/iscslp-2024
Call for Participation
Text-to-speech (TTS) aims to produce speech that sounds as natural and human-like as possible. Recent advancements in neural speech synthesis have significantly enhanced the quality and naturalness of generated speech, leading to widespread applications of TTS systems in real-world scenarios. A notable breakthrough in the field is witnessed in zero-shot TTS, driven by expanded datasets and new approaches (e.g.: decoder-only paradigms), attracting extensive attention from academia and industry. However, these advancements haven't been sufficiently investigated to address challenges in spontaneous and conversational contexts. Specifically, the primary challenge lies in effectively managing prosody details in the generated speech, which is attributed to the diverse and intricate spontaneous behaviors that differentiate spontaneous speech from read speech.
Large-scale TTS systems, with their in-context learning ability, have the potential to yield promising outcomes in the mentioned scenarios. However, a prevalent challenge in the field of large-scale zero-shot TTS is the lack of consistency in training and testing datasets, along with a standardized evaluation benchmark. This issue hinders direct comparisons and makes it challenging to assess the performance of various systems accurately.
To promote the development of expressive spontaneous-style speech synthesis in the zero-shot scenario, we are launching the Conversational Voice Clone Challenge (CoVoC). This challenge encompasses a range of diverse training datasets, including the 10K-hour WenetSpeech4TTS dataset, 180 hours of Mandarin spontaneous conversational speech data, and 100 hours of high-quality spoken conversations. Furthermore, a standardized testing dataset, accompanied by carefully designed text, will be made available. The goal of providing these sizable and standardized datasets is to establish a comprehensive benchmark.
Timeline
June 3, 2024: HQ-Conversations data release
June 10, 2024: Baseline system release
June 30, 2024: Evaluation stage starts; Clone-Speaker and Test-Text data release; deadline for challenge registration
July 2, 2024: Evaluation ends; Test audio and system description submission deadline
July 12, 2024: Evaluation results release to participants
July 20, 2024: Deadline for ISCSLP2024 paper submission (only for invited teams)
A state space model for real-time TTS
https://cartesia.ai/blog/sonic
In experiments so far, we've found that we can simultaneously improve model quality, inference speed, throughput, and latency compared to widely used Transformer implementations for audio generation. A parameter-matched Cartesia model trained on Multilingual Librispeech for one epoch achieves 20% lower validation perplexity. On downstream evaluations, this results in a 2x lower word error rate and a 1 point higher quality score (out of 5, as measured on the NISQA evaluation). At inference, it achieves lower latency (1.5x lower time-to-first-audio), faster inference speed (2x lower real-time factor), and higher throughput (4x).
LLMs are the frontier in TTS too (true ones, not GPT). SLAM can do it too, by the way. A Microsoft paper:
https://arxiv.org/abs/2404.03204
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis
Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao
We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of language models. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to enhance the robustness of LLM-based TTS. To accomplish this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E utilizes the predicted duration prompt to guide the computing of self-attention weights in Transformer to enforce the model to focus on the corresponding phonemes and prosody features when predicting speech tokens. Results of comprehensive objective and subjective evaluations demonstrate that, compared to a powerful baseline method VALL-E, RALL-E significantly improves the WER of zero-shot TTS from 5.6% (without reranking) and 1.7% (with reranking) to 2.5% and 1.0%, respectively. Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E and reduces the error rate from 68% to 4%.
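The CoT part boils down to a two-stage decode: first predict prosody tokens (pitch and duration) from the text, then condition speech-token generation on both, with the durations used to constrain attention. A skeletal sketch with placeholder predictors; in RALL-E both stages live in one codec language model, this split is only for illustration:

```python
import torch

# Placeholder prosody and speech-token predictors (illustrative only).
def predict_prosody(text_tokens):
    # one (pitch, duration) token pair per input token, made up here
    return torch.randint(0, 64, (len(text_tokens), 2))

def predict_speech_tokens(text_tokens, prosody):
    durations = prosody[:, 1].clamp(min=1)
    # duration-guided "attention": each phoneme contributes a number of
    # speech tokens proportional to its predicted duration
    return torch.cat([torch.full((int(d),), i) for i, d in enumerate(durations)])

text_tokens = [12, 5, 33, 7]            # phoneme / text token ids
prosody = predict_prosody(text_tokens)  # step 1: chain-of-thought intermediate
speech = predict_speech_tokens(text_tokens, prosody)  # step 2: speech tokens
print(prosody.shape, speech.shape)
```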
Our recent experiments with non-streaming ASR + VAD revealed serious problems with the approach. First, VAD turned out to be insufficient for complex audio conditions, which is understandable given that it has no access to linguistic information. Second, longer audio chunks introduce significant delays even with modern fast networks, and the problem gets worse if you have to combine chunks of different lengths in a batch.
So we have recently been considering a joint streaming/non-streaming approach. The U2 architecture and WeNet actually make a lot of sense. Shinji's paper is similar here:
https://arxiv.org/abs/2405.13514
Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation
Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe
End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes: streaming and non-streaming, each with its pros and cons. Streaming ASR processes the speech frames in real-time as it is being received, while non-streaming ASR waits for the entire speech utterance; thus, professionals may have to operate in either mode to satisfy their application. In this work, we present joint optimization of streaming and non-streaming ASR based on multi-decoder and knowledge distillation. Primarily, we study 1) the encoder integration of these ASR modules, followed by 2) separate decoders to make the switching mode flexible, and enhancing performance by 3) incorporating similarity-preserving knowledge distillation between the two modular encoders and decoders. Evaluation results show 2.6%-5.3% relative character error rate reductions (CERR) on CSJ for streaming ASR, and 8.3%-9.7% relative CERRs for non-streaming ASR within a single model compared to multiple standalone modules.
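A rough torch sketch of the shared front-end / two-branch idea with a similarity-preserving distillation term between the streaming and non-streaming encoders; the modules and shapes are simplified placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointASR(nn.Module):
    """Shared front-end, separate streaming / non-streaming encoders and heads."""
    def __init__(self, feat_dim=80, hidden=256, vocab=500):
        super().__init__()
        self.frontend = nn.Linear(feat_dim, hidden)
        self.stream_enc = nn.GRU(hidden, hidden, batch_first=True)  # causal
        self.full_enc = nn.GRU(hidden, hidden, batch_first=True,
                               bidirectional=True)                  # full context
        self.stream_head = nn.Linear(hidden, vocab)
        self.full_head = nn.Linear(2 * hidden, vocab)

    def forward(self, feats):
        x = self.frontend(feats)
        s, _ = self.stream_enc(x)
        f, _ = self.full_enc(x)
        return self.stream_head(s), self.full_head(f), s, f

def similarity_kd(s, f):
    """Similarity-preserving KD: match the frame-to-frame similarity structure."""
    sim_s = F.normalize(s @ s.transpose(1, 2), dim=-1)
    sim_f = F.normalize(f @ f.transpose(1, 2), dim=-1)
    return F.mse_loss(sim_s, sim_f)

model = JointASR()
feats = torch.randn(2, 100, 80)
stream_logits, full_logits, s, f = model(feats)
loss_kd = similarity_kd(s, f)
print(stream_logits.shape, full_logits.shape, loss_kd.item())
```

The KD term works on similarity matrices rather than raw features, so the streaming branch can have a different hidden size while still being pushed toward the non-streaming branch's structure.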
A more serious speech LLM
https://github.com/ddlBoJack/SLAM-LLM
Another multimodal LLM doing speech
https://github.com/OpenMOSS/AnyGPT
A valuation of 50x revenue. OK
https://github.com/PolyAI-LDN/pheme
https://www.linkedin.com/posts/seb-johnson_polyai-secures-near-500mn-valuation-in-boost-activity-7196548712846761984-MOLg/
BREAKING: PolyAI RAISES $50M AT $500M VALUATION
On Monday I posted about PolyAI and how they were shaping up to become one of the UK's strongest AI companies. Today they announced a huge fundraise from new investors including chipmaking giant NVIDIA.
PolyAI is a Conversational AI platform automating customer service. It was founded in 2017 by Nikola Mrkšić, Tsung-Hsien Wen, and Pei-Hao (Eddy) Su who met at the University of Cambridge’s Machine Intelligence Lab.
As part of the fundraise the company highlighted some INSANE financial results:
* $10m revenue last year
* On track to TRIPLE that this year (i.e. $30m)
* 90% Gross Margin (FY23)
PolyAI grew out of Entrepreneur First before going on to raise several investment rounds, including their $40m Series B which valued the company at nearly $300 million post-money. Two years later this has grown to $500m!
A great result for the team and a further boost for the UK’s ambitions of becoming an AI hub.
Apple optimizes the Conformer for mobile devices. Some in-depth ideas
https://arxiv.org/abs/2312.10359
5x faster than real-time speech recognition running on an Apple Watch Series 7. This is a lot more solved than you would think!
https://github.com/tincans-ai/gazelle
https://tincans.ai/
We're building real-time conversational speech with large language models. Bots should be fun to talk to.
Previously, we built large scale machine learning systems at Cash App, Quora, Airbnb, and more.
It's interesting that the Cambridge lab has been doing a lot of Whisperology recently
https://arxiv.org/abs/2405.06134
Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models
Vyas Raina, Rao Ma, Charles McGhee, Kate Knill, Mark Gales
Recent developments in large speech foundation models like Whisper have led to their widespread use in many automatic speech recognition (ASR) applications. These systems incorporate `special tokens' in their vocabulary, such as <endoftext>, to guide their language generation process. However, we demonstrate that these tokens can be exploited by adversarial attacks to manipulate the model's behavior. We propose a simple yet effective method to learn a universal acoustic realization of Whisper's <endoftext> token, which, when prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively `muting' the model. Our experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97\% of speech samples. Moreover, we find that this universal adversarial audio segment often transfers to new datasets and tasks. Overall this work demonstrates the vulnerability of Whisper models to `muting' adversarial attacks, where such attacks can pose both risks and potential benefits in real-world settings: for example the attack can be used to bypass speech moderation systems, or conversely the attack can also be used to protect private speech data.
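Conceptually the attack is just gradient descent on a short, fixed audio prefix so that the decoder emits <endoftext> first regardless of what follows. A heavily simplified optimization-loop sketch; eot_loss is a dummy differentiable stand-in (the real attack backpropagates through Whisper itself):

```python
import torch

# Stand-in for "how strongly the model is pushed to emit <endoftext> first
# given prefix + speech"; a dummy objective so the sketch runs.
def eot_loss(prefix, speech_batch):
    joined = torch.cat([prefix.expand(speech_batch.size(0), -1), speech_batch], dim=1)
    return joined.pow(2).mean()

sr = 16000
prefix = torch.zeros(1, int(0.64 * sr), requires_grad=True)  # 0.64 s universal segment
optimizer = torch.optim.Adam([prefix], lr=1e-3)

speech_batch = torch.randn(8, 3 * sr)   # batch of arbitrary utterances
for step in range(100):
    optimizer.zero_grad()
    loss = eot_loss(prefix, speech_batch)
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        prefix.clamp_(-1.0, 1.0)        # keep it a valid waveform
```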
https://challenge.singfake.org/
We are organizing the inaugural Singing Voice Deepfake Detection (SVDD) 2024 challenge at IEEE Spoken Language Technology Workshop (SLT) 2024 to foster research and development of sophisticated methods for detecting AI-generated singing voices. This is an emerging issue within the music industry that requires specialized solutions.
Our prior analysis using the SingFake dataset showed a marked decline in performance of state-of-the-art speech deepfake countermeasures when applied to singing voices. This highlighted the need for tailored SVDD techniques.
An interesting dataset generated with ElevenLabs: multiple languages in the same phrase
https://huggingface.co/datasets/MohamedRashad/multilingual-tts
People say this vocoder has a point in combining signal processing with neural techniques
https://ast-astrec.nict.go.jp/demo_samples/firnet_icassp2024/
FIRNet: Fast and pitch controllable neural vocoder with trainable finite impulse response filter
Some neural vocoders with fundamental frequency (f0) control have succeeded in performing real-time inference on a single CPU while preserving the quality of the synthetic speech. However, compared with legacy vocoders based on signal processing, their inference speeds are still low. This paper proposes a neural vocoder based on the source-filter model with trainable time-variant finite impulse response (FIR) filters, to achieve a similar inference speed to legacy vocoders. In the proposed model, FIRNet, multiple FIR coefficients are predicted using the neural networks, and the speech waveform is then generated by convolving a mixed excitation signal with these FIR coefficients. Experimental results show that FIRNet can achieve an inference speed similar to legacy vocoders while maintaining f0 controllability and natural speech quality.
https://ast-astrec.nict.go.jp/release/preprints/preprint_icassp_2024_ohtani.pdf
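The core operation is cheap and easy to illustrate: the network predicts a short FIR filter per frame and the waveform comes out of convolving a mixed excitation with those frame-wise filters. A numpy sketch of that synthesis step only; the excitation mix and the filter coefficients are random stand-ins for what the network would predict:

```python
import numpy as np

sr, frame_len, fir_len, n_frames = 24000, 240, 64, 100
n_samples = n_frames * frame_len

# Mixed excitation: impulse train at ~120 Hz (voiced) plus noise (unvoiced);
# in FIRNet the mix and the per-frame FIR coefficients are network outputs.
excitation = 0.05 * np.random.randn(n_samples)
excitation[::sr // 120] += 1.0

fir_coeffs = 0.1 * np.random.randn(n_frames, fir_len)  # stand-in for predicted filters

out = np.zeros(n_samples + fir_len - 1)
for i in range(n_frames):
    frame = excitation[i * frame_len:(i + 1) * frame_len]
    seg = np.convolve(frame, fir_coeffs[i])             # frame-wise time-variant FIR
    out[i * frame_len:i * frame_len + len(seg)] += seg  # overlap-add the filter tails

waveform = out[:n_samples]
print(waveform.shape)
```

Since synthesis is just short convolutions over an excitation you control, f0 can be changed by regenerating the impulse train, which is exactly the controllability argument in the paper.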
I recently tested Google ASR for English: Chirp, Conformer (latest version), and Gemini. Conformer is not good. Chirp is OK, somewhat better than Whisper V3.
The idea is relevant. The results are somewhat mixed.
https://arxiv.org/abs/2404.06714
Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness
Xincan Feng, Akifumi Yoshimoto
Recent advancements in Natural Language Processing (NLP) have seen Large-scale Language Models (LLMs) excel at producing high-quality text for various purposes. Notably, in Text-To-Speech (TTS) systems, the integration of BERT for semantic token generation has underscored the importance of semantic content in producing coherent speech outputs. Despite this, the specific utility of LLMs in enhancing TTS synthesis remains considerably limited. This research introduces an innovative approach, Llama-VITS, which enhances TTS synthesis by enriching the semantic content of text using LLM. Llama-VITS integrates semantic embeddings from Llama2 with the VITS model, a leading end-to-end TTS framework. By leveraging Llama2 for the primary speech synthesis process, our experiments demonstrate that Llama-VITS matches the naturalness of the original VITS (ORI-VITS) and those incorporate BERT (BERT-VITS), on the LJSpeech dataset, a substantial collection of neutral, clear speech. Moreover, our method significantly enhances emotive expressiveness on the EmoV_DB_bea_sem dataset, a curated selection of emotionally consistent speech from the EmoV_DB dataset, highlighting its potential to generate emotive speech.
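The integration is conceptually simple: pull a sentence-level (or token-level) embedding out of the LLM and feed it to the TTS model as extra conditioning alongside the text-encoder output. A hedged sketch of the extraction and fusion step; GPT-2 stands in for Llama2 (which needs access approval) and the pooling and projection choices are illustrative, not necessarily the paper's exact setup:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Any causal LM works for the sketch; the paper uses Llama2, GPT-2 is a stand-in.
name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
lm = AutoModel.from_pretrained(name)

def semantic_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = lm(**inputs).last_hidden_state   # (1, T, d)
    return hidden.mean(dim=1)                     # crude sentence-level pooling

sem = semantic_embedding("The quick brown fox jumps over the lazy dog.")

# In Llama-VITS this vector (or per-token embeddings) conditions the VITS
# text encoder; here a placeholder hidden-state tensor shows the fusion.
text_enc_out = torch.randn(1, 42, 192)            # placeholder VITS hidden states
proj = torch.nn.Linear(sem.size(-1), 192)
conditioned = text_enc_out + proj(sem).unsqueeze(1)  # broadcast over time
print(conditioned.shape)
```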