grokaem_seby | Unsorted

Telegram channel grokaem_seby - grokaem себя


A bunch of things that I encounter during my journey as NLP/Audio developer


grokaem себя

We talked a bit about diffusion LMs earlier; here is a very nice video on the approaches:

https://youtu.be/bmr718eZYGU?t=750


grokaem себя

One of the worst things is leaving something unfinished, and yet there is no limit to how "finished" something can be. There are many such mini-projects and mini-ideas, they keep piling up, and there you sit as if in a messy apartment where you need to cook, vacuum, and start the laundry, and you have no idea what to tackle first. You just have to start with something.


grokaem себя

Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement
https://www.arxiv.org/abs/2510.01722

The common approach to building an emotional TTS is to fuse the reference speech into a global style vector. This is effective at capturing the overall style, but it cannot model more gradual variation in emotionality at the phoneme, word, and sentence level. The key idea of the paper is to disentangle timbre and emotion into two extractors: a global timbre embedding that carries speaker-specific information, and an emotion embedding that captures only prosodic information.

The introduced model is based on FastSpeech. The authors add a Style Encoder after the Phoneme Embedder to inject style-specific information. Overall, the architecture first learns timbre from the mel-spectrogram, and a similar module serves as the emotion extractor over the phoneme hidden representations. All of these representations are then broadcast-summed and passed to the decoder. However, it is important that each block learns distinctive information, so the authors apply Mutual Information Neural Estimation (MINE) to the pooled timbre and emotion embeddings. To guide the optimization direction, they also add two auxiliary losses: global emotion prediction and speaker prediction.

Code: https://github.com/BaleYang/emotion-timbre-disentangled-tts-code/blob/main/model/modules.py#L767
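To make the MINE term concrete, here is a minimal sketch of a Donsker-Varadhan statistics network over the pooled timbre and emotion embeddings; this is my own illustration, not the authors' implementation (their code is linked above):

```python
# Minimal MINE sketch (my own illustration): the statistics network estimates a lower
# bound on MI between pooled timbre and emotion embeddings; the TTS encoders are trained
# to minimize this estimate while the statistics network maximizes it.
import torch
import torch.nn as nn

class MineStatistics(nn.Module):
    def __init__(self, dim_timbre: int, dim_emotion: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_timbre + dim_emotion, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def mi_lower_bound(self, timbre: torch.Tensor, emotion: torch.Tensor) -> torch.Tensor:
        joint = self.net(torch.cat([timbre, emotion], dim=-1)).mean()
        shuffled = emotion[torch.randperm(emotion.size(0))]           # break the pairing -> marginals
        marginal = self.net(torch.cat([timbre, shuffled], dim=-1)).squeeze(-1)
        log_mean_exp = torch.logsumexp(marginal, dim=0) - torch.log(
            torch.tensor(float(marginal.numel())))
        return joint - log_mean_exp                                   # E_joint[T] - log E_marg[exp T]

mine = MineStatistics(256, 256)
mi = mine.mi_lower_bound(torch.randn(32, 256), torch.randn(32, 256))
print(mi)   # the disentanglement term added (with a weight) to the TTS training loss
```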


grokaem себя

Yesterday I was reading a paper on a German S-Bahn. If you know at least something about Oktoberfest, you can imagine how busy the metro gets these days. Anyway. The beer was nice, the meat even better.


DUALCODEC
https://dualcodec.github.io/

Why did I start reading the paper in the first place? I was curious about the first comment on the previous post. I was searching for something that would relate the number of tokens in the codebook to quality, not just comments but an actual experiment.

DualCodec - a dual-stream approach that integrates semantically rich self-supervised representations with waveform representations.

First some important notes:

1. Frame rate - the number of segments per second in the codec's latent representation.
2. Bitrate - the amount of data per unit of time: bitrate = frame rate × codebooks × log2(codebook size). A quick numeric check follows the list below.

- Frame rate fixed, increase codebooks → higher bitrate
→ More tokens per frame = richer representation.

- Bitrate fixed, increase frame rate → must reduce codebooks or size
→ More temporal resolution, but less per-frame detail.

- Bitrate fixed, decrease frame rate → can afford bigger codebooks
→ Less temporal detail, more per-frame detail.
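A quick sanity check of the bitrate formula, with illustrative numbers (not values from the paper):

```python
import math

def bitrate_bps(frame_rate_hz: float, n_codebooks: int, codebook_size: int) -> float:
    """bitrate = frame rate * codebooks * log2(codebook size), in bits per second."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)

# Illustrative settings, not numbers from the paper:
print(bitrate_bps(12.5, 8, 1024))   # 1000.0 bps: low frame rate, many codebooks
print(bitrate_bps(50.0, 1, 16384))  # 700.0 bps: higher frame rate, one big codebook
```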


In this work, the authors build on the Descript Audio Codec (DAC), which uses very low-dimensional projections and a multi-scale mel loss. The often-mentioned problem of encoding both semantic and acoustic features is resolved here by splitting them into two parts: the first codebook captures semantic-rich information, while the remaining layers encode and decode high-quality audio features with DAC. To achieve disentanglement, the first-layer features are subtracted from the encoder output before the residual RVQ is applied to the rest.

RVQ-1 operates on the 16th layer of w2v-BERT; its output is a 50 Hz feature that is downsampled to 12.5 Hz with 1D pooling. The output of RVQ-1, carrying the semantic information, is subtracted from the encoder's acoustic features, and the residual (audio features minus RVQ-1) is fed to the remaining RVQ stages.
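Here is how I picture the dual-stream split, as a minimal sketch; the nearest-neighbour quantizers and dimensions below are stand-ins rather than the paper's modules:

```python
# Minimal sketch (my reading of the dual-stream layout; the quantizers are stand-ins, not
# the paper's): the semantic stage quantizes SSL-derived features, its quantized output is
# subtracted from the acoustic encoder features, and the residual goes to the remaining RVQ.
import torch

def quantize(x, codebook):
    # nearest-neighbour VQ against a (K, D) codebook; returns token ids and quantized vectors
    idx = torch.cdist(x, codebook).argmin(dim=-1)
    return idx, codebook[idx]

D = 64
acoustic_feat = torch.randn(100, D)          # encoder output, one row per 12.5 Hz frame
semantic_feat = torch.randn(100, D)          # pooled w2v-BERT layer-16 features
sem_codebook, ac_codebook = torch.randn(1024, D), torch.randn(1024, D)

sem_ids, sem_q = quantize(semantic_feat, sem_codebook)   # RVQ-1: semantic stream
residual = acoustic_feat - sem_q                          # remove semantic content
ac_ids, ac_q = quantize(residual, ac_codebook)            # RVQ-2..N on the residual (1 stage here)
decoder_input = sem_q + ac_q
print(sem_ids.shape, ac_ids.shape, decoder_input.shape)
```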

Now about the ablation studies:

1. Semantic enhancement is essential for low WER, both for English and Chinese.
2. Increasing the codebook size up to 16384 yields a huge drop in WER. Moreover, it also improves the subjective metrics UTMOS and MUSHRA.

The authors also compared the performance of using DualCodec in TTS. The superior performance is achieved using it with SoundStorm TTS and DualCodec at 25Hz.


grokaem себя

LIP-to-SPEECH
https://arxiv.org/abs/2509.25670


Lip-to-speech is the composite viseme-to-phoneme problem. Imagine you are deaf and can read lips; unfortunately, that does not help you speak perfectly yourself, does it? For Mandarin there is the added difficulty of lexical tones: words can sound very similar while their meanings differ drastically. To tackle this, two stages are considered: viseme-to-phoneme from a pretrained English model, and synthesis with an additional pitch predictor.

The visual encoder is taken from the pretrained English model; next we have to go from encoded visual features to content. For this, the authors take the beloved wav2vec and train a new unit predictor, based on the Conformer architecture, on k-means tokens from wav2vec. At this point we could already move on to speech, but remember: things that sound the same can mean different things, and what distinguishes those sounds for us is the F0 contour. The pitch predictor receives quite a lot of input: visual features, a speaker embedding, and speech unit embeddings; the pitch predictor itself is a DiT. Interestingly, the authors also train a UV predictor to mask UnVoiced segments. As the mel decoder they chose FastSpeech2. Old school, you will say, where is F5, the youth respects only true flow matching. Here is the answer: it now appears as a PostNet that should refine the resulting spectrogram. The metrics are as usual: everything got better. It is notable that the F0 predictor really does help intelligibility.

Why did I suddenly decide to read about Chinese lip sync late in the evening? I just got to thinking... if you understand one language perfectly and can read lips, can you translate it on the fly? Probably yes. Maybe then straight to lip-to-lip? As terrible as that sounds.


grokaem себя

Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
https://arxiv.org/abs/2509.24629

Word-level emotion controllability. Most of the time, when we discuss emotion controllability, we work at the utterance level. However, this approach still does not perform well, and the underlying problem, as I see it, is the change of emotion, or of prosody itself, at the word level.

In this paper, the authors are trying to resolve it using a self-training strategy.

0️⃣ Self-training - a technique used to improve a model's performance by using unlabeled data to generate pseudo-labels, which then become part of the labeled training set.

1. The model generates labels for a larger pool of unlabeled data, often based on its most confident predictions.
2. These newly labeled data points are added to the training set, and the model is retrained.
3. This cycle repeats, leading to continuous improvement.


The method is based on a two-stage pipeline.

1️⃣ Stage 1: multi-round inference process with transition smoothing.

- transition-smoothing mechanism - a lightweight content aligner that predicts a text token for each speech token and serves as a bypass into the next round: the text tokens from the aligner are passed on to the next inference step.
- speech rate is determined by the speech prompt in CosyVoice 2. Here, for each prompt, we use either interpolation or downsampling to slow down or speed up the speech accordingly. Both are done at the speech-token level in the prompt.
- In the multi-stage inference process we also need to preserve speaker consistency. The authors build a speaker-aware prompt selection strategy that prioritizes emotional prompts.

Overall, a different emotional prompt is used at each inference step, the content aligner passes the content tokens of the predicted speech sample forward, and at each step a separate emotion-based speaker embedding is supplied.

2️⃣ Obviously, such multi-round inference is significantly more expensive ❗️. This overhead is mitigated through self-training in the second stage. The pipeline is similar to most common self-training designs: (1) emotion-transition text is generated, (2) the teacher synthesizes speech with word-level variations, (3) outputs are filtered based on accuracy and expressive similarity, and (4) the student model is fine-tuned on these examples.

However, simply concatenating different emotional prompts introduces emotion inconsistency. To control how attention drifts when transitioning between emotions, the authors introduce a dynamic emotion attention bias. For each speech token, an emotional tag is predicted using a lightweight causal transformer. This output is then passed through an MLP and a softmax, which produce a weight vector of length 7. The weights are used to compute a dynamic attention bias by combining predefined attention templates. These templates are manually created and represent different attention strategies, such as full access for prompt encoding or full access for target sequence generation.
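My reading of the dynamic bias, as a small sketch; the shapes, template contents, and module sizes here are assumptions:

```python
import torch
import torch.nn as nn

n_templates, seq_len = 7, 128
# Predefined bias templates (all-zero stand-ins here), e.g. "full access to the prompt",
# "full access to the target sequence", etc.
templates = torch.zeros(n_templates, seq_len, seq_len)

class DynamicAttentionBias(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, n_templates))

    def forward(self, token_state: torch.Tensor) -> torch.Tensor:
        w = self.mlp(token_state).softmax(dim=-1)         # (batch, 7) template weights
        return torch.einsum("bk,kij->bij", w, templates)  # weighted mix, added to attention logits

bias = DynamicAttentionBias(d_model=256)(torch.randn(2, 256))
print(bias.shape)  # torch.Size([2, 128, 128])
```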

3️⃣ Emotion transitioning was evaluated with https://github.com/fcumlin/DNSMOSPro, and prosody is computed with AutoPCP. Overall, the model does not surpass the baselines on intelligibility metrics but does outperform them on all of the subjective evaluations.

I think the authors did a good job creating a pipeline where you neither need an additional dataset nor have to "break" the TTS model into parts. However, the predefined attention templates seem quite cherry-picked.


grokaem себя

SELECTIVE CLASSIFIER-FREE GUIDANCE FOR ZERO-SHOT TEXT-TO-SPEECH
https://arxiv.org/abs/2509.19668


Originally, we apply CFG at every time step when resolving the trajectory. In this paper the authors explore whether CFG is needed at every time step and how to apply it selectively.

Some new techniques from the paper:

1. Perpendicular component between the two prediction vectors: the part of the conditional prediction that is perpendicular to the unconditional prediction, focusing more on directional differences.
2. Use standard CFG for the early time steps and a partially conditioned CFG for t > 0.08. Partially conditioned CFG drops only the additional speaker conditioning and keeps the text condition. The threshold on t was set via listening experiments to balance SIM and WER. It was observed that the first ~6 time steps already determine text adherence, so full CFG may not be necessary afterwards; the partially conditioned variant, introduced in MegaTTS-3, helps to emphasize speaker similarity in the later steps (a sketch of this selective scheme follows below).
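A sketch of the selective scheme, with an assumed model interface (not the paper's code):

```python
import torch

def selective_cfg(model, x_t, t: float, text, speaker, scale: float = 2.0, t_switch: float = 0.08):
    full = model(x_t, t, text=text, speaker=speaker)          # fully conditioned prediction
    if t <= t_switch:                                         # early steps: standard CFG
        uncond = model(x_t, t, text=None, speaker=None)
        return uncond + scale * (full - uncond)
    partial = model(x_t, t, text=text, speaker=None)          # later steps: drop only the speaker
    return partial + scale * (full - partial)

# Toy check with a dummy "model" that returns zeros of the right shape.
dummy = lambda x, t, text=None, speaker=None: torch.zeros_like(x)
print(selective_cfg(dummy, torch.randn(1, 80, 10), t=0.5, text="hi", speaker=None).shape)
```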

The last CFG strategy strikes a good balance, boosting SIM without significantly decreasing WER. However, the diffusion-inspired perpendicular-component method does not perform any better on either SIM or WER.

Additionally, the authors note an underlying language dependency in F5-TTS, while in CosyVoice2 the semantic tokens and F5-TTS's ConvNeXt layers are not even comparable in parameter count. Still, language dependency might not be the underlying issue in F5-TTS, as the WERs are not that different.



Overall, neither of the strategies is claimed to drastically improve performance. However, I believe the latter CFG strategy could be combined with multiple different conditions (speaker, emotion, environment, etc.) applied at different time steps, treating the inference process as a procedure of refining details.


grokaem себя

MEASURING PROSODY DIVERSITY IN ZERO-SHOT TTS: A NEW METRIC, BENCHMARK, AND EXPLORATION
https://arxiv.org/html/2509.19928v1

Prosody is an umbrella term for emotion, attitude, and intent. I clearly remember from some of my "favorite" lectures during my bachelor's how we were taught the "correct" prosody for different types of sentences.
*Since we can’t ever express things in the expected way, we might as well leave this task to computers.*


As ridiculous as it sounds, it makes total sense, because we love benchmarks and fair publications. Among the newer metrics, I think the Meta metrics became the most popular. The authors of this paper introduce DS-WED, a Discretized Speech Weighted Edit Distance: a weighted edit distance over semantic token sequences derived from speech SSL models (HuBERT, WavLM) through k-means clustering.

Solution: create a dataset using 7 public tts models (1000 samples), ask humans, find correlation.

Score: the two generated samples are encoded into SSL-based tokens, and the weighted Levenshtein distance between them is computed.
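As an illustration of the scoring step, a plain weighted Levenshtein distance over two discrete unit sequences; the actual DS-WED weighting is more involved than these constant costs:

```python
def weighted_edit_distance(a, b, w_sub=1.0, w_ins=1.0, w_del=1.0):
    # Classic DP over the two token sequences with per-operation weights.
    dp = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * w_del
    for j in range(1, len(b) + 1):
        dp[0][j] = j * w_ins
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = dp[i - 1][j - 1] + (0.0 if a[i - 1] == b[j - 1] else w_sub)
            dp[i][j] = min(sub, dp[i - 1][j] + w_del, dp[i][j - 1] + w_ins)
    return dp[-1][-1]

print(weighted_edit_distance([3, 3, 7, 1], [3, 7, 7, 1]))  # 1.0 (one substitution)
```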

Raters: 20 graduate students with TTS experience. Well, all I can say is that people with experience are biased in many ways.

Obviously, DS-WED gets the highest correlation, ahead of RMSE and MCD.

My opinion about the metric: I do not really understand the metric choice as it stands; it would be fairer and easier to compare against the ground truth, in my opinion. Maybe I missed something in the LibriSpeech and Seed-TTS results: do they actually take the true sample and compare it with the generated one?

Even if not, the interpretability is still lacking for me: we compute the distance over two generated samples, so how do we set this up in a benchmark? Do we compare against the same generated sample? The same model? And is higher always better? Why?

I like the effort put into comparing recent publicly available TTS models, as it can be a good starting point for choosing an architecture for the emotional TTS you want to build. However, I agree with /channel/speechtech/2183 that it would be nice to see different setups across model evolutions.


grokaem себя

An interesting post about Suno's new performance: /channel/denissexy/10791


grokaem себя

On the same TTS topic: I consider my past self much smarter and stronger; these days I am not quite that person. But past Milana wrote posts that I love rereading now.

0. My favorite post about alignment in TTS. /channel/grokaem_seby/342
1. The same story about a semantic codec, with reasoning similar to what is here /channel/grokaem_seby/354
2. Audiobox /channel/grokaem_seby/261


grokaem себя

What problems have you been given at LeetCode-style interviews?


grokaem себя

DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models

https://arxiv.org/abs/2508.08961

When something works well, we want to use it everywhere. LLMs for speech tasks are no exception.

However, building a speech understanding and generation model faces a huge modality gap between text and audio. Moreover, speech understanding and generation favor different kinds of knowledge at different levels.

What do the authors suggest implementing instead? Another tokenizer. The major goal of this tokenizer is to learn a better modality commonality between text and speech. Learning = loss functions, and the loss functions are built at different levels of the tokenizer.

1. To teach the model that speech has semantics, we prompt the text LLM (TLLM) with speech tokens and the text answer. These speech tokens, or USTokens, come from a VQ layer; the VQ layer discretizes the H features from the Encoder, which encodes hidden Whisper layers. (understanding-driven loss)
2. To keep rich acoustic features in the discrete tokens, we take the quantized H' embeddings after VQ and decode them; comparing the result with the original Whisper hiddens gives a reconstruction loss (see the sketch below).
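A compact sketch of the two tokenizer losses as I understand them; the module interfaces and the MSE choice for reconstruction are assumptions:

```python
# Minimal sketch (names and shapes are my assumptions): an understanding-driven loss
# (text LLM answering from speech tokens) plus a reconstruction loss between decoded
# quantized features and the Whisper hiddens.
import torch
import torch.nn.functional as F

def tokenizer_losses(whisper_hidden, encoder, vq, decoder, tllm, answer_ids):
    h = encoder(whisper_hidden)                      # continuous features H
    tokens, h_q = vq(h)                              # discrete USTokens + quantized H'
    recon = decoder(h_q)                             # reconstruct the Whisper hiddens
    loss_recon = F.mse_loss(recon, whisper_hidden)
    logits = tllm(speech_tokens=tokens)              # (B, T, V) logits over the text answer
    loss_under = F.cross_entropy(logits.transpose(1, 2), answer_ids)
    return loss_recon + loss_under
```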

As most of us might guess, Whisper may not have such rich acoustic features in the first place, but we will leave that behind the scenes.

For training robustness, the authors also randomly sample, as the prompt for speech generation, the predicted speech tokens, the hidden states, or their concatenation. This makes the model more flexible about relying on any of them.

More than anything else in papers, I enjoy ablation studies, since I do not have time to run them myself. The authors show that every part of their loss function (reconstruction, semantic, and understanding-driven) contributes to the performance of both modalities (understanding and generation).

I wonder whether Whisper might even be a pitfall in this architecture. Could we do a better reconstruction while still preserving the semantics? Is ASR the only task for that? What about a contrastive loss on the hiddens?


grokaem себя

You know those books you start reading and somewhere in the middle realize it just is not working? Same with this paper: having decided yesterday evening that I was not really getting its results, I left rough notes and planned to edit them in the morning over fresh coffee. Or maybe my foggy eyes are simply missing something?

RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning
https://arxiv.org/abs/2507.08012

Controllability is usually treated as adjusting the tiny knobs you are given: speech rate, pitch, or that old familiar emotionality. However, these features are constrained on one hand - control is limited to the acoustic features exposed to the model during training - while being too flexible on the output, yielding different outputs for the same input.

The idea is to find additional controllable features for models that receive a text description prompt as input. As the backbone of such a model the authors take a ParlerTTS checkpoint.

As the key to discovering new features, the mean of the predicted wav2vec embeddings is computed on the 4th layer. The authors also noticed that wav2vec features are text- and speaker-dependent, so they fix these parameters during generation.

After generating many different samples, they project them with PCA. The new labels (features) are assigned according to the resulting clusters. For discovered continuous gradients, the samples are discretised into n bins along the principal axis of variation, and labels are assigned by cosine distance to the mean embeddings. I did not understand how they actually name the classes; maybe it is manual work. What they find for the model that was not trained with emotion labels is that the PCA-projected samples actually fall into 3 clusters. Moreover, after relabeling the samples and fine-tuning the model on them again, the variance within the clusters is reduced. After the third fine-tuning, a substantial reduction in confusion for neutral utterances was observed.
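The labeling procedure, as I understand it, boils down to something like this sketch (the embeddings here are random stand-ins for the layer-4 wav2vec means):

```python
# Minimal sketch: project mean SSL embeddings of generated samples with PCA, then
# discretise the first principal component into n bins to get pseudo-labels for
# re-fine-tuning.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
mean_embeddings = rng.normal(size=(1000, 768))        # stand-in for wav2vec layer-4 means

pc1 = PCA(n_components=2).fit_transform(mean_embeddings)[:, 0]
n_bins = 3
edges = np.quantile(pc1, np.linspace(0, 1, n_bins + 1)[1:-1])
pseudo_labels = np.digitize(pc1, edges)               # 0..n_bins-1, used as new style labels
print(np.bincount(pseudo_labels))
```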

For the model that was trained with emotions, not much was found: 1000 samples share the same F0 contour (why is only F0 inspected?), which indicates that the model does not render as wide a range of emotions as was thought. To get more variation, the authors exclude emotionality from the prompt and the model, and obtain a strong negative correlation between the first principal component and loudness. Blaming the data distribution, the authors do not perform further fine-tuning.


grokaem себя

I read this paper when it first came out, but a) back then I could not post to the channel, and b) I was very busy and only skimmed it. Today at 11:52 I was looking for something light to laze over, tired from all the moving, and here it is: BASE TTS.


BaseTTS
https://huggingface.co/papers/2402.08093

An LLM-based approach: text tokens are followed by speech tokens. The latter come from a decoder-only tokenizer. The tokens predicted by the LLM are concatenated with speaker embeddings and reconstructed via a decoder.

Here you might ask: why do we need to add speaker info to the speech tokens if it could already be there? The answer is that it is not, by design of the loss objectives: speech tokens are more or less about what is being said, not by whom or with what microphone. An obvious fact that is actually hard to implement, since all of the disentangleable features together build the sound the way we like it, and we are not strong enough to learn to distinguish what is useful for us and what is not.

Speechcodes implementation: waveform → WavLM → content and speaker linear regressors → convolutional residual encoder → vector quantization module. To support the separation of speaker and content info, the authors add weighted cosine-similarity and contrastive terms to the loss.

In SpeechGPT module speech and texts have different positional embeddings and separate prediction heads.

Interestingly, the codes predicted by SpeechGPT are not used by the vocoder explicitly. Instead, the last hidden states go through a speechcode decoder (HiFi-GAN-like) and are then fed to the BigVGAN vocoder.

With all that said, scaling matters, and the law from the authors is: Combined with the findings from naturalness MUSHRA, we believe that scaling GPT-based TTS model from 1000+ to 10000+ hours and model size from 100 million to 500 million is the point at which "emergent abilities" [32] start to occur for our TTS.

I was still surprised to see old tortoise and bark in the evaluation part.

For emotionality I kinda like the approach inspired from this saying: These systems require high-quality recordings with a forced speaking style and annotated audio and text data, limiting their usefulness due to the sheer number of emotions expressible in human speech [87]. BASE TTS benefits from being a language model which is both acoustic and semantically/syntactically aware.


grokaem себя

Universal Adaptor

https://arxiv.org/abs/2204.00170

One of the less obvious problems in cascaded TTS systems is the mismatch of parameters in the representations used by the synthesizer and the vocoder. The synthesizer outputs a mel-spectrogram, and the vocoder takes it and transforms it into a waveform. However, a mel-spectrogram is already an audio transformation, and this transformation has a bunch of parameters to tune (sr, log_base, padding, win_length, etc.). The problem arises when the synthesizer and the vocoder use different parameter settings. Although rare, such mismatches do occur.

The authors of this paper suggest using a separate adapter that can transform any such representation into one common to both the synthesizer and the vocoder, even if they were not trained with identical parameters.

This solution might also be helpful for fairness in systems’ comparison since most of the metrics we use are based on an audio representation and any subtle difference can affect the metrics.

The adapter takes two configurations (source and target) along with the source mel-spectrogram, and produces a target mel-spectrogram using the target parameters.

Stage 1: approximation via an inverse transform to a linear spectrogram, then to a waveform with the Griffin-Lim algorithm, and back to a spectrogram with the target parameters. This stage gives a rough, low-quality estimated spectrogram.
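A minimal stage-1 sketch with librosa; the parameter values are illustrative, not the configurations from the paper:

```python
import librosa

sr = 22050
y = librosa.tone(440.0, sr=sr, duration=1.0)   # stand-in for a synthesizer output

# "Source" mel parameters (e.g. what the synthesizer was trained with).
mel_src = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Rough inversion: mel -> linear -> waveform via Griffin-Lim.
y_approx = librosa.feature.inverse.mel_to_audio(mel_src, sr=sr, n_fft=1024, hop_length=256)

# Recompute with the "target" parameters (e.g. what the vocoder expects).
mel_tgt = librosa.feature.melspectrogram(y=y_approx, sr=sr, n_fft=2048, hop_length=300, n_mels=100)
print(mel_src.shape, mel_tgt.shape)
```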

Stage 2: the goal of this part is to approximate all non-normalizing parameters by transforming the low-quality target spectrogram according to the target parameters. The idea is to pass the parameters through linear layers and add them into the conv block at the heart of a UNet.

For all the vocoders, using the adapter resulted in either slightly worse or even better performance. The important point, however, is that moving to a different configuration still leads to comparable results.


grokaem себя

PROMPT-AWARE CLASSIFIER-FREE GUIDANCE FOR DIFFUSION MODELS
https://www.arxiv.org/abs/2509.22728

CFG as a technique has received significant attention and might be considered one of the primary reasons for the remarkable success of diffusion models. However, the choice of the guidance scale remains underexplored.

A fixed scale is often suboptimal and can lead to various artifacts, such as distorted timbre or weak semantic guidance.

One possible solution is iterative inference, which I have used myself. For each preset scale, we compute predefined metrics and choose the best sample across trials. Obviously, this is time-consuming. When one of the metrics is computationally heavy, we have to rely on a proxy. The authors introduce an efficient surrogate for a pretrained diffusion model.

Their idea is to use a lightweight quality estimator that adapts to the prompt’s complexity. First, they build a guidance-based dataset by generating multiple samples across different candidate scales. Each sample is evaluated with a modality-specific score — for audio, this is AudioBox-Aesthetics. Then, they train a score estimator conditioned on the prompt’s semantics and complexity, using CLAP and certain lexical statistics functions. The model is trained to approximate the scores obtained during dataset generation.
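A guess at what such a surrogate could look like; the feature choices and sizes below are assumptions, not the paper's design:

```python
# Minimal sketch: a lightweight regressor that predicts a per-scale quality score
# (e.g. an AudioBox-Aesthetics value) from a CLAP text embedding plus simple
# lexical statistics of the prompt.
import torch
import torch.nn as nn

class ScaleScoreEstimator(nn.Module):
    def __init__(self, clap_dim: int = 512, n_lexical: int = 4, n_scales: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clap_dim + n_lexical, 256), nn.ReLU(),
            nn.Linear(256, n_scales),            # one predicted score per candidate scale
        )

    def forward(self, clap_emb, lexical_stats):
        return self.net(torch.cat([clap_emb, lexical_stats], dim=-1))

est = ScaleScoreEstimator()
scores = est(torch.randn(1, 512), torch.randn(1, 4))
best_scale_idx = scores.argmax(dim=-1)           # pick the scale with the highest predicted score
print(best_scale_idx)
```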

The authors show that the proposed approach yields slight improvements compared to non-adaptive scales. I wish they had also compared the quality for complex prompts to better demonstrate the adaptive capabilities of the scale estimator.


grokaem себя

Following @applied_scientist_blog, I finally came across something genuinely new about codecs. We have discussed them in uni-audio, DualSpeechLM, Moshi, DualCodec. In short: the paper tries to address the codec's sensitivity to small changes (noise and so on). It does so with n projections and a "vote" among all of them, allowing some of the n heads to be wrong while the team still wins the overall vote.

STABLETOKEN: A NOISE-ROBUST SEMANTIC SPEECH TOKENIZER FOR RESILIENT SPEECHLLMS
https://www.arxiv.org/abs/2509.22220

----------
Two problems in semantic tokenizers:

1. Fault tolerance - slight perturbations result in magnified output changes.
2. Distant supervisory signal - no intermediate token stability is preserved as the loss supervises only the final output.

In simple words, RVQ token creation is rather brittle: it forces a very complex continuous state into a discrete bucket in one shot, i.e. a few ms into one token.

StableToken is based on the Voting-LFQ module, a multi-branch quantizer that builds in architectural robustness. In a few brief steps, let's try to understand the algorithm (a sketch follows the list):

1. The speech encoder produces hidden states → h.
2. n linear projection heads create n independent projections → p_i
3. each projected vector is binarized into B_i ∈ {-1, +1}
4. the B_i are aggregated bit-wise over the n heads → B_final
5. (inference) a sign function is applied over B_final, i.e. consensus-based binarization via majority voting
6. each vector is deterministically mapped to an integer index
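A minimal sketch of the voting quantizer as I read it (head count, bit width, and the binarization details are assumptions):

```python
import torch
import torch.nn as nn

class VotingLFQ(nn.Module):
    def __init__(self, d_model: int = 256, n_bits: int = 12, n_heads: int = 5):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d_model, n_bits) for _ in range(n_heads)])

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        votes = torch.stack([torch.sign(head(h)) for head in self.heads])   # (n_heads, B, n_bits)
        b_final = torch.sign(votes.sum(dim=0))                              # bit-wise majority vote
        bits = (b_final > 0).long()                                         # {-1,+1} -> {0,1}
        powers = 2 ** torch.arange(bits.size(-1), device=bits.device)
        return (bits * powers).sum(dim=-1)                                  # integer token index

tokens = VotingLFQ()(torch.randn(3, 256))
print(tokens)   # three integer indices in [0, 2**12)
```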

Trick or treat? 🎃 The n heads make it possible to use different rich representations and all the slight nuances in h, while the consensus-based voting lets the model self-correct its output, addressing fault tolerance.

Fault tolerance is also addressed through noise-aware consensus training. The wow moment for me was how augmentation is integrated during training. You could follow the old path of simply applying augmentation to the samples and adding them to training, the textbook strategy. Instead, the authors encode both the augmented and the clean example; some heads then receive the augmented h and some the clean one, and the loss function forces all heads to produce similar projections.

Needless to say, the results are better on various metrics for this method. I truly hope the authors get published and can open-source their code.

After reading this paper, I wonder whether the consensus objective forces redundant information across the heads. It would also be interesting to treat the heads as experts, drop some of them during training, and then run the voting mechanism over all heads only on confusing parts. It might save compute, but then we should ask whether the heads become the same referees possessing the same information.


grokaem себя

Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage

Speech assistance is not only about the quality of the sound, but also about what we actually say. It might be the right fact in answer to a question about history, or the names for a meeting at 17:00. Invoking broadly used RAG systems in a speech assistant introduces additional latency, leading to even more awkward silence.

The first problem is streaming: the query is being built up as the user speaks; waiting for the full question takes time, and starting only after the full query is formed introduces latency.

The authors build on three key design components:

1. Trigger: when to start a new query.
2. Threads: number of parallel tool query threads to have.
3. Reflector: whether the gathered results are sufficient for the final output.

THINK ON EACH STEP: Let's try to imagine and then follow the authors' thinking strategy. Take the utterance: "oh, today I saw a nice blue dress in the Prada store, I know it's expensive, but I just wonder how much does it really cost?" This sentence might be split into a few chunks. Processing each chunk as it arrives, we can start the retrieval process early. For example, the first chunk of 5 words might not give us good web sources, but the second one, "blue dress in prada store", will certainly trigger the right resources. So we might just retrieve on each chunk we receive, and if we are lucky and at the end our last chunk matches one of the previous ones, we just output the already processed results from that earlier good chunk.
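A toy sketch of this reuse idea (my own illustration, not the paper's system):

```python
# Fire a retrieval query per incoming chunk and, if the final query matches an earlier one
# closely enough, reuse the already-retrieved results instead of making a new tool call.
from difflib import SequenceMatcher

def stream_retrieve(chunks, retrieve, match_threshold=0.8):
    cache = {q: retrieve(q) for q in chunks[:-1]}        # per-chunk retrievals, already in flight
    final = chunks[-1]
    for prev_query, results in cache.items():
        if SequenceMatcher(None, final, prev_query).ratio() >= match_threshold:
            return results                               # reuse the earlier finished call
    return retrieve(final)                               # otherwise, one last retrieval

# Toy usage with a fake retriever:
chunks = ["oh today I saw a nice", "blue dress in prada store", "blue dress in prada store price"]
print(stream_retrieve(chunks, retrieve=lambda q: f"results for: {q}"))
```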

*Since most tool call latency arises from chunking and reranking the chunks of web documents, these checks enable significant latency savings without any compromise in model performance. A key advantage of this strategy is its plug-and-play nature: it requires no additional post-training for speech-in, speech-out models and can be directly applied at inference time across a variety of architectures.*


APPROACH ONE IS INTENSIVE: However, the obvious disadvantage is processing chunks of a fixed length. How do we know when to trigger a tool query? Maybe only when there is a verb in the sentence? Or on a named entity? We cannot decide by a rule, so we model it. At each chunk, the model decides whether an additional call is needed; to decide, it receives both the speech accumulated up to the current chunk and the latest tool query. Moreover, if the model starts a new query, any previous ones are terminated. This approach is perfect for having no parallel threads, reducing the computational overhead on resource-constrained devices. Additionally, no reflector module is needed to choose the best call, because we always keep only the latest and most information-rich call.

DATA: The data is created using a similarity function as a pseudo-metric. Interestingly, training the model only on always-correct labels can lead to misinterpretations in the first steps, after which the model never picks up the additional information later. To mitigate this, the authors introduce a negative sampling strategy and teach the model to recover from its previous mistakes.


grokaem себя

A new on-device 0.5B LLM TTS. Sounds very good in the samples. Will give it a try. Under the hood: https://huggingface.co/neuphonic/neucodec.

https://github.com/neuphonic/neutts-air?tab=readme-ov-file


grokaem себя

A new version of pyannote-audio is out, promising better diarization and speaker counting.

https://github.com/pyannote/pyannote-audio/releases/tag/4.0.0


grokaem себя

People who do vibe coding, how? It is such torture, honestly. I needed to quickly build one small support tool with an unfamiliar API, and it turned into such a long story... I use Cursor and ChatGPT.


grokaem себя

Scaling Transformer-based Text-to-Speech with Knowledge Distillation
https://research.atspotify.com/2025/8/scaling-transformer-based-text-to-speech-with-knowledge-distillation

CFG is one of the most powerful and yet time-consuming operations in recent TTS systems. It lets us steer the output by generating one sample with the conditions given and one without any conditions. In this paper, the authors aim to remove the need for CFG at inference. Additionally, it would be nice to also get a smaller model through knowledge distillation.

During training, we may set a dropout for different conditions (the text to synthesize, the audio prompt, and/or other conditions). The authors of this paper express the CFG-guided result as a sum of the fully conditioned prediction and differences between the unconditional and partially conditioned predictions (picture 1).

The knowledge distillation is done using only a KL divergence between the teacher, which uses CFG, and a student that does not use CFG at inference. All of the student models are first pre-trained with a cross-entropy loss against the teacher model (I dare to assume).
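A sketch of the assumed distillation objective: the teacher's CFG-combined distribution is matched by a single conditional pass of the student:

```python
# Minimal sketch (assumed formulation, not Spotify's code): distill a CFG-guided teacher
# into a student by minimizing KL between the teacher's CFG-combined distribution and
# the student's single-pass conditional distribution.
import torch
import torch.nn.functional as F

def cfg_distill_loss(teacher, student, x, cond, scale: float = 2.0):
    with torch.no_grad():
        t_cond = teacher(x, cond)                      # fully conditioned teacher logits
        t_uncond = teacher(x, None)                    # unconditional teacher logits
        t_cfg = t_uncond + scale * (t_cond - t_uncond) # CFG-combined teacher prediction
        p_teacher = F.softmax(t_cfg, dim=-1)
    log_p_student = F.log_softmax(student(x, cond), dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```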

The authors state that the student model became more conservative in its outputs. The conclusions are drawn both from the intelligibility metrics and from the heatmap of prompt encoding. Overall, the medium student (half the size of the teacher) shows no statistically significant difference from the teacher, although the quality does decrease somewhat. The results demonstrate that this simple knowledge distillation process is an effective approach for achieving faster inference and a smaller memory footprint.

If you’re interested in TTS CFG improvement, I also highly recommend this paper: https://arxiv.org/abs/2504.20334 The results are reproducible (checked myself).


grokaem себя

Of all the things, anxiety is the one I have not learned to deal with to this day. I cannot say I am a perfectionist, but I doubt my own knowledge a lot, girls.

Yesterday I was flipping through this survey while preparing for an interview. It covers various optimizations for transformer networks (basic sparse attention, efficient KV cache, flash attention), as well as the more involved linear sequence modelling (attention reformulations with RNNs and state-space models). Very good visualizations. It also briefly covers MoE, diffusion LLMs, and other applications.

https://arxiv.org/abs/2508.09834


grokaem себя

Mimo.
https://github.com/XiaomiMiMo/MiMo-Audio/tree/main

I deeply respect Xiaomi: I have their electric toothbrush and their water flosser, and the folks there are productive. But it seems that in this one, many people (not only me) missed the answer to a simple question: WHY IS IT DONE THIS WAY?

🕐 Tokenizer: as old as the hills, the same problem of semantic vs. audio features in the tokenizer. Reading the prelude to the new tokenizer, I got excited and started typing this post. But then I realized nothing new happened. All the authors kept is the same RVQ, and they did nothing to regulate the influence of semantic or audio features. Yes, they take the 3rd and the last layer of the encoder, but is that where the happiness lies? Why the 3rd? They did not get to ablation studies and did not show it. And even though the model performs well on benchmarks with this tokenizer, is that real strength, brother, given such an insane amount of data (over one hundred million)?

🕑 Speaking of data, the authors claim it went through strict filtering (stricter than the therapists on Yasno). Besides the usual noise, unsafe content was also removed. There are no statistics on what share of the data was dropped, while the authors upsampled high-quality corpora.

🕒 Since the tokens still come from RVQ, we have to find a way to generate them (I described the problem in the previous post). Here the authors work with a patch encoder and decoder. Interestingly, their motivation was:

Since audio sequences have relatively low information density, individual audio frames convey much less information than text tokens.

I did not quite get the supporting argument for it, but anyway. The encoder transforms a patch into a single hidden vector (single!). That is a very aggressive compression of information. The decoder, in turn, has to take the predicted embedding and produce a new patch of tokens, with the same delayed-factor trick I wrote about yesterday.
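Roughly how I picture the patch encoder and decoder; the linear layers and shapes below are stand-ins for whatever modules the authors actually use:

```python
import torch
import torch.nn as nn

patch_size, d_frame, d_model, vocab = 4, 256, 1024, 8192

patch_encoder = nn.Sequential(
    nn.Flatten(start_dim=-2),                       # (B, patch, d_frame) -> (B, patch*d_frame)
    nn.Linear(patch_size * d_frame, d_model),       # one hidden vector per patch
)
patch_decoder = nn.Linear(d_model, patch_size * vocab)   # logits for every frame in the patch

frames = torch.randn(2, patch_size, d_frame)
patch_vec = patch_encoder(frames)                   # (2, 1024) -> fed to the LLM
logits = patch_decoder(patch_vec).view(2, patch_size, vocab)
print(patch_vec.shape, logits.shape)
```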

In my view, one of the strongest parts of the whole model is exactly this encoder-decoder pair, since it performs such a heavy compression of information and then unfolds it back.

🕓 With all these questioning remarks, the model is capable of working on new tasks, even though it was trained in a multitask scenario. The GPT-3 moment did happen after all.

-----
For more polished writing, check out the channel of Anya, whom I recently met: /channel/applied_scientist_blog/125

If you had a million dollars hours of data, what would you do?


grokaem себя

Moshi
https://arxiv.org/html/2410.00037v2

Today I want to re-read the tokenizer and architecture of Moshi.

Audio Tokenizer 😎

As we've briefly covered, audio tokens from audio codec models usually hold rich enough information to generate speech/audio/music, since they are trained via reconstruction, while their counterparts, semantic tokenizers, match the linguistic tokens. This similarity with language allows generating intelligible and consistent speech by prompting the model with content tokens (the AudioLM approach). However, the speech tokens are modelled over the whole sequence, and such a 2-3 stage architecture is not compatible with real-time applications.

Their codec follows the previous designs of SoundStream; they use a SEANet autoencoder. Importantly, the convolutions in the autoencoder are causal, so the model can be used in a streaming manner. The encoder yields a latent representation at 12.5 frames per second, and RVQ is applied to the resulting latent space. Remember that semantic vs. acoustic tension? Well, the authors tried to distill knowledge from a semantic speech tokenizer into the first RVQ level of their tokenizer with a cosine-distance-based knowledge distillation. However, this distillation loss conflicted with the reconstruction loss (!): since RVQ follows the residual pattern, the next RVQ[1:] levels rely on the first one and could not trade off phonetics against audio quality. To address this, the authors introduced a separate VQ whose output is then summed with the RVQ layers.


Model 😉

One way or another, the RVQ gives us Q (the number of layers) codebooks per frame. If we flattened across the time axis, we would have to generate approximately 100 tokens per second (12.5 * 8 codebooks)! This approach is not only incompatible with streaming, it also does not seem natural for a causal model. Otherwise, we would have to create a window of Q codebooks while still assuming causality in the interdependence of the codebook tokens. Knowing that one frame is represented as Q tokens, let's try to model it; I think of it as a mel bin.

What the authors suggest is another transformer along the depth axis. This transformer predicts each of the codebooks for a timestep s. One might think the causality issue is not resolved here; however, with the second, depth transformer we are modelling only the tokens within one timestep, recreating the residual features of the previous codebook, while the temporal transformer gives us a representation similar to what we would get from the autoencoder part of the RVQ. Since the first codebook is semantic, the authors suggest a delay factor, so that the model first generates the semantic information and only afterwards moves on to the detail factors for further acoustic features.
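A toy illustration of the delay pattern: the semantic codebook stays in place while the acoustic codebooks are shifted by one step (the exact delay in Moshi may differ):

```python
import torch

Q, T, PAD = 4, 6, -1
codes = torch.arange(Q * T).reshape(Q, T)           # (codebooks, timesteps)

delayed = torch.full((Q, T + 1), PAD)
delayed[0, :T] = codes[0]                           # semantic codebook: no delay
delayed[1:, 1:] = codes[1:]                         # acoustic codebooks: delayed by 1 step
print(delayed)
```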

#streaming_tts #audiollm #encodec


grokaem себя

Today I read High-Fidelity Simultaneous Speech-To-Speech Translation (https://arxiv.org/abs/2502.03382); I did not quite understand the architecture at the output stage. I will give it another try.


grokaem себя

Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning

https://arxiv.org/abs/2410.14157

Reasoning and long-term planning are two tasks that not only require different planning distances (PD) but can also include a lot of distracting paths. Examples of such tasks are Sudoku or the Countdown game. AR models do not perform well on these tasks; the authors prove in Section 3.1 that AR, reverse AR, and teacherless training do not work even on toy examples with a PD greater than 1. Discrete diffusion models form one of the alternative branches to AR, sitting somewhere between the two worlds of masking techniques and diffusion (https://www.youtube.com/watch?v=WjAUX23vgfg).

One of the primary ideas is that some subgoals (paths) may be more difficult and therefore require more context and time to be resolved. From a multi-view perspective, x_t at any token n (node, action) is a distinct view of x_0, and each view provides different information about x_0. In the simplified weighted cross-entropy for discrete diffusion with absorbing states, we use a weighting term w(t) that places higher weight at early t → 0. However, some tokens are harder to learn, so we also have to account for the importance, or the ease of reconstruction, of token n. The authors introduce an additional adaptive token-level re-weighting term v(x_t, n) that emphasizes harder tokens. For inference, the authors employ an easy-first top-K decoding strategy. The suggested re-weighting technique led to improvements over random selection.
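My simplified reading of the re-weighted objective as code; the concrete forms of w(t) and v(x_t, n) here are assumptions:

```python
# Minimal sketch: masked-token cross-entropy for absorbing discrete diffusion with a time
# weight w(t) and an adaptive token-level weight v(x_t, n) that up-weights harder
# (lower-confidence) masked positions.
import torch
import torch.nn.functional as F

def reweighted_diffusion_ce(logits, x0, mask, t):
    # logits: (B, T, V) predictions at noise level t; x0: (B, T) clean tokens;
    # mask: (B, T) True where the token was absorbed (masked) in x_t; t: (B,).
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")   # (B, T)
    w_t = 1.0 / t.clamp_min(1e-3)                                        # higher weight as t -> 0
    with torch.no_grad():
        p_correct = logits.softmax(-1).gather(-1, x0.unsqueeze(-1)).squeeze(-1)
        v = 1.0 - p_correct                                              # harder token -> larger v
    return (w_t.unsqueeze(-1) * v * ce * mask).sum() / mask.sum().clamp_min(1)
```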

There is also a very nice error analysis for AR models, showing that the number of errors for the last equations is much higher than for the first ones; the reason lies in the planning strategy of the first steps (the Countdown game).


grokaem себя

This move is cozier than the last one; the first step is done ❤️


grokaem себя

TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet
https://arxiv.org/abs/2507.04349


One of my research interests is controllability in TTS, and more than anything else I am interested in emotionality. However, almost all discussions end with the same phrase: "should we just get more data and train the model?". Training a model is always expensive. In this paper the authors introduce an approach that can be used not only for emotionality but for other kinds of control as well.

Instead of creating a separate module or adding an adapter, we simply copy the model and train the copy with an additional input (emotion). As the backbone they use a flow-matching model, as in F5-TTS.

During fine-tuning the model is initialized with zero-convolution to minimize overfitting.
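The zero-convolution trick in its generic ControlNet form, as a sketch (not the paper's exact code):

```python
# The trainable copy is connected to the frozen backbone through a 1x1 convolution whose
# weights start at zero, so at step 0 the control branch contributes nothing.
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv1d:
    conv = nn.Conv1d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

control_out = torch.randn(2, 80, 100)        # output of the trainable (emotion) copy
backbone_hidden = torch.randn(2, 80, 100)    # hidden state of the frozen flow-matching model
fused = backbone_hidden + zero_conv(80)(control_out)   # identical to the backbone at init
print(torch.allclose(fused, backbone_hidden))          # True before any training step
```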

Interesting observations:

1. only some layers contribute to the final quality of the emotion-augmented model
2. applying the added model at all sampling steps [0, t] hurts WER, so the authors apply it only at some of the steps.

Drawbacks:

1. quality is still dependent on SER
2. SER is calculated on audio and not on text
3. a control scale of around 0.3 significantly lowers intelligibility, which (in my opinion) supports the idea that the trade-off is not resolved (emotion and correct speech do not come together easily in this pipeline)


grokaem себя

Vikhr Borealis - the first open Russian-language audio LLM

We spent a long time (and not very successfully) developing our own TTS, Salt; historically it left us with quite a lot of data and groundwork, so we decided: why not cook up ASR + LLM, as is fashionable?

And so we did. Architecturally it is Whisper + Qwen, trained on 7k hours of audio with only the adapter + LLM being updated. For now it works only in ASR mode; later we may add an instruction-following mode. A benchmark for Russian ASR will also be released; it is still being finalized.
A blog post will come out as well, with small ablations on the data.

At the moment the model beats the Whisper models on Russian and is better than gigam on some of the benchmarks.

Model
Colab to play with
