Good VC quality
https://quickvc.github.io/quickvc-demo/
https://github.com/quickvc/QuickVC-VoiceConversion
Open Preview for #ICASSP2023 is now available on @IEEEXplore! Available through June 10, you can now browse all the papers that were accepted to ICASSP 2023, free of charge. Browse research here: https://hubs.la/Q01N_PdX0
New Mandarin speech dataset
https://www.openslr.org/138/
SHALCAS22A
Identifier: SLR138
Summary: A Chinese Mandarin corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd.
Category: Speech
License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Downloads (use a mirror closer to you):
SHALCAS22A.tgz [3.9G] ( Corpus ) Mirrors: [US] [EU] [CN]
About this resource:
SHALCAS22A is a 1-channel Chinese Mandarin speech corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd. It was collected over a Hi-Fi microphone in a quiet environment. The corpus contains 14,580 utterances from 60 speakers. Each speaker has 243 utterances.
The contents include number passwords, short Chinese words, and long Chinese sentences. The mapping between the content and utterance is given in content.txt.
This corpus can be used in text-dependent speaker verification on number passwords, text-independent speaker verification on short utterances, and other speech-related fields. Please cite the corpus as "SHALCAS22A, a free Chinese Mandarin corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd., 2022".
Contact: Feng Hong, hongfeng@mail.ioa.ac.cn
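A minimal sketch of indexing the unpacked corpus. The directory layout and the two-column "utterance ID, text" format of content.txt assumed below are illustrative guesses, not documented here.

```python
from pathlib import Path

# Assumed layout after unpacking SHALCAS22A.tgz: one directory per speaker with .wav
# files, plus a top-level content.txt mapping utterance IDs to their text.
# Both the layout and the "ID<whitespace>text" format are assumptions.
root = Path("SHALCAS22A")

content = {}
with open(root / "content.txt", encoding="utf-8") as f:
    for line in f:
        utt_id, text = line.strip().split(maxsplit=1)
        content[utt_id] = text

# Pair every wav with its text; 60 speakers x 243 utterances should give 14,580 items.
pairs = [(wav, content.get(wav.stem)) for wav in sorted(root.rglob("*.wav"))]
print(len(pairs), "utterances indexed")
```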
Nvidia published a pack of new FastConformer models
https://github.com/NVIDIA/NeMo/commit/091ce965da99f1ca63f64417b0ea612d744c7c81
For example, the English one:
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_hybrid_large_pc
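Loading one of these checkpoints follows the usual NeMo pattern; a minimal sketch, assuming the model name from the NGC card above and a local 16 kHz mono wav as a placeholder.

```python
# Requires: pip install nemo_toolkit[asr]
import nemo.collections.asr as nemo_asr

# Pull the hybrid CTC/RNN-T FastConformer with punctuation & capitalization from NGC.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="stt_en_fastconformer_hybrid_large_pc"
)

# Transcribe a local audio file (placeholder path).
print(asr_model.transcribe(["sample.wav"]))
```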
LODR decoding in K2
https://mp.weixin.qq.com/s/HJDaZ5BN1TzEa8oWQ9CBhw
Adding LODR to the rescoring process increases decoding time by only 20% compared to plain beam search, while reducing the word error rate by 13.8%, so it is both fast and accurate.
http://www.asru2023.org/
Taipei, Taiwan
December 16-20, 2023
Regular & Challenge paper submission due: July 3, 2023
AUDIT:
Audio Editing by Following Instructions with Latent Diffusion Models
Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, Jiang Bian, Sheng Zhao
Abstract. Audio editing is applicable for various purposes, such as adding background sound effects, replacing a musical instrument, and repairing damaged audio. Recently, some diffusion-based methods achieved zero-shot audio editing by using a diffusion and denoising process conditioned on the text description of the output audio. However, these methods still have some problems: 1) they have not been trained on editing tasks and cannot ensure good editing effects; 2) they can erroneously modify audio segments that do not require editing; 3) they need a complete description of the output audio, which is not always available or necessary in practical scenarios. In this work, we propose AUDIT, an instruction-guided audio editing model based on latent diffusion models. Specifically, AUDIT has three main design features: 1) we construct triplet training data (instruction, input audio, output audio) for different audio editing tasks and train a diffusion model using instruction and input (to be edited) audio as conditions and generating output (edited) audio; 2) it can automatically learn to only modify segments that need to be edited by comparing the difference between the input and output audio; 3) it only needs edit instructions instead of full target audio descriptions as text input. AUDIT achieves state-of-the-art results in both objective and subjective metrics for several audio editing tasks (e.g., adding, dropping, replacement, inpainting, super-resolution).
This research is done in alignment with Microsoft's responsible AI principles.
https://audit-demo.github.io/
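A rough conceptual sketch of the triplet-conditioned diffusion training described in the abstract; every module, shape, and schedule below is a placeholder, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for AUDIT's instruction encoder, latent (VAE) encoder, and diffusion
# backbone; all modules and dimensions here are assumptions for illustration.
text_encoder = nn.Embedding(32000, 256)
latent_encoder = nn.Linear(80, 256)
denoiser = nn.Sequential(nn.Linear(256 * 3, 512), nn.ReLU(), nn.Linear(512, 256))

def training_step(instruction_ids, input_latent, output_latent, alpha_bar_t):
    """One step on an (instruction, input audio, output audio) triplet:
    noise the latent of the edited (output) audio and predict that noise,
    conditioned on the edit instruction and the latent of the audio to edit."""
    c_text = text_encoder(instruction_ids).mean(dim=1)
    c_audio = latent_encoder(input_latent).mean(dim=1)
    z0 = latent_encoder(output_latent).mean(dim=1)
    noise = torch.randn_like(z0)
    z_t = alpha_bar_t.sqrt() * z0 + (1 - alpha_bar_t).sqrt() * noise
    pred = denoiser(torch.cat([z_t, c_text, c_audio], dim=-1))
    return F.mse_loss(pred, noise)

loss = training_step(
    instruction_ids=torch.randint(0, 32000, (2, 12)),  # e.g. "add birds chirping", tokenized
    input_latent=torch.randn(2, 100, 80),               # audio to be edited
    output_latent=torch.randn(2, 100, 80),              # target edited audio
    alpha_bar_t=torch.tensor(0.5),
)
```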
Laughter sounds nice, Russian stress is traditionally bad
https://github.com/suno-ai/bark
GPU beam search in pytorch
https://github.com/pytorch/audio/pull/3096
https://groups.inf.ed.ac.uk/edacc/
The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR. Sanabria, Bogoychev, Markl, Carmantini, Klejch, and Bell. ICASSP 2023. Presentation of the EdAcc corpus.
https://www.assemblyai.com/blog/lemur-early-access/
Encodec has just changed to an MIT license. Great news for anyone working on LM approaches to audio or just looking for a high-quality audio codec.
No training code but still a really significant change.
https://github.com/facebookresearch/encodec/commit/349b72939f57cb3bc7b60906c0ee8228c849485d
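Basic usage for extracting discrete codes, roughly following the repo README; the file path and target bandwidth are placeholders.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Pretrained 24 kHz model; the target bandwidth controls how many codebooks are used.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Load an audio file and resample/remix it to the model's expected format.
wav, sr = torchaudio.load("speech.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

# Encode to discrete codes: a [B, n_codebooks, T] tensor of codebook indices.
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
print(codes.shape)
```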
Some Indonesian speech data released recently
https://indonlp.github.io/nusa-catalogue/
For example:
https://github.com/s-sakti/data_indsp_teldialog_svcsr
https://www.hackster.io/shahizat/tinyml-baby-cry-detection-using-chatgpt-and-synthetic-data-1e715b
https://arxiv.org/abs/2203.16776
An Empirical Study of Language Model Integration for Transducer based Speech Recognition
Huahuan Zheng, Keyu An, Zhijian Ou, Chen Huang, Ke Ding, Guanglu Wan
Utilizing text-only data with an external language model (ELM) in end-to-end RNN-Transducer (RNN-T) for speech recognition is challenging. Recently, a class of methods such as density ratio (DR) and internal language model estimation (ILME) have been developed, outperforming the classic shallow fusion (SF) method. The basic idea behind these methods is that RNN-T posterior should first subtract the implicitly learned internal language model (ILM) prior, in order to integrate the ELM. While recent studies suggest that RNN-T only learns some low-order language model information, the DR method uses a well-trained neural language model with full context, which may be inappropriate for the estimation of ILM and deteriorate the integration performance. Based on the DR method, we propose a low-order density ratio method (LODR) by replacing the estimation with a low-order weak language model. Extensive empirical experiments are conducted on both in-domain and cross-domain scenarios on English LibriSpeech & Tedlium-2 and Chinese WenetSpeech & AISHELL-1 datasets. It is shown that LODR consistently outperforms SF in all tasks, while performing generally close to ILME and better than DR in most tests.
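In practice the integration boils down to shallow fusion with the external LM plus subtraction of a low-order (e.g. bi-gram) LM score per hypothesis; a schematic sketch with illustrative weights, not the paper's tuned settings.

```python
import math

def lodr_score(logp_rnnt, logp_elm, logp_lowexternal, lm_scale=0.4, lodr_scale=0.16):
    """Schematic LODR hypothesis score: shallow fusion with the external LM,
    minus a low-order (e.g. bi-gram) LM standing in for the internal LM.
    Scale values here are illustrative, not the paper's tuned settings."""
    return logp_rnnt + lm_scale * logp_elm - lodr_scale * logp_lowexternal

# Example: combine per-hypothesis log-probabilities during beam search rescoring.
print(lodr_score(math.log(0.2), math.log(0.05), math.log(0.1)))
```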
Whisper can actually do crude speaker turn tracking with a prompt. The magic is:
"or do a crude form of speaker turn tracking (e.g. " - Hey how are you doing? - I'm doing good. How are you?", note that the token for " -" is suppressed by default and will need to be enabled manually.)"
https://github.com/openai/whisper/discussions/117#discussioncomment-3727051
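A minimal sketch of that trick with the openai-whisper package; the audio path is a placeholder, and passing an empty suppress_tokens list is one simple way to lift the default suppression of the " -" token.

```python
import whisper

model = whisper.load_model("small")

# Prompt Whisper with a dialogue-style transcript so it keeps emitting " -" turn markers.
prompt = "- Hey, how are you doing? - I'm doing good. How are you?"

result = model.transcribe(
    "dialogue.wav",        # placeholder path
    initial_prompt=prompt,
    suppress_tokens=[],    # the default "-1" suppresses a special-token set that includes " -"
)
print(result["text"])
```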
JAX is faster than PyTorch
https://twitter.com/sanchitgandhi99/status/1649046661816287236
NaturalSpeech 2, a powerful new zero-shot TTS model in the NaturalSpeech series 🔥
1. Latent diffusion model + continuous codec, avoiding the dilemma in language model + discrete codec;
2. Strong zero-shot speech synthesis with a 3s prompt, singing synthesis with only a speech prompt!
abs: https://arxiv.org/abs/2304.09116
project page: https://speechresearch.github.io/naturalspeech2/
Not sure about the claimed accuracy, but the numbers are interesting
https://blog.deepgram.com/nova-speech-to-text-whisper-api/
A remarkable 22% reduction in word error rate (WER)
A blazing-fast 23-78x quicker inference time
A budget-friendly 3-7x lower cost starting at only $0.0043/min
Space is closer than you think. Happy Cosmonautics Day, my friends.
NeMo 1.17 is now released and includes a lot of improvements that users have long requested.
This includes a high-level diarization API, pyctcdecode support for beam search, InterCTC loss support, an AWS SageMaker tutorial, and more!
https://twitter.com/alphacep/status/1644685634404073472
Learning a model from Whisper
https://github.com/speechcatcher-asr
https://www.openslr.org/136/
EMNS
Identifier: SLR136
Summary: An emotive single-speaker dataset for narrative storytelling. EMNS is a dataset containing transcriptions, emotion, emotion intensity, and descriptions of acted speech.
Category: Speech, text-to-speech, automatic speech recognition
License: Apache 2.0
About this resource:
The Emotive Narrative Storytelling (EMNS) corpus is a single-speaker British English speech dataset with high-quality labelled utterances tailored to drive interactive experiences with dynamic and expressive language. Each audio-text pair is reviewed for artefacts and quality. Furthermore, critical features are extracted using natural language descriptions, including word emphasis, level of expressiveness, and emotion.
EMNS data collection tool: https://github.com/knoriy/EMNS-DCT
EMNS cleaner: https://github.com/knoriy/EMNS-cleaner
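A hedged sketch of slicing the corpus by emotion label; the metadata filename and column names below are assumptions about the release layout, not documented fields.

```python
import pandas as pd

# Hypothetical metadata file and column names (the actual EMNS release layout may differ).
meta = pd.read_csv("emns_metadata.csv")

# Keep only highly expressive "surprise" utterances, e.g. for an expressive TTS subset.
subset = meta[(meta["emotion"] == "surprise") & (meta["emotion_intensity"] >= 4)]
print(subset[["audio_path", "transcription"]].head())
```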