
Telegram channel grokaem_seby - grokaem себя


A bunch of things that I encounter during my journey as NLP/Audio developer


grokaem себя

WHISPER MODEL ASR

OpenAI published a new ASR model which outperforms others in zero-shot evaluation, uses a huge, diverse dataset of 680,000 hours and can understand other languages. Check out how it deals with Singlish (nirmalya.ghosh/transcribe-singlish-with-openais-whisper-can-meh-118328e4866f).

When I read articles I always look for something new to me and take notes on the papers. This time I wrote a 👉🏼 Medium post 👈🏼 (milana.shxanukova15/what-techniques-can-you-learn-from-whisper-a5bb3bd75daa) about these new things:

- weakly supervised learning and its types
- new text standardization for ASR (p.s. is American really better than British?)
- robustness
- how to multitask?


p.s. In terms of weakly supervised learning, I've found a library, cleanlab, that helps to find mistakes in datasets.

#grokaem_audio
#nlp #audio #milana_medium


grokaem себя

I hope everyone who is reading this post is safe and at peace. I have a sincere faith that this chaos will be over as soon as possible. If you need some help or just want to chat, I'm as always happy to communicate.

I understand that it's almost impossible to think of anything except the war now. But work and studying are the things that distract me, and I hope these posts can help you escape from everything that is happening, at least for a while.


grokaem себя

STANFORD NLP MATERIALS

Stanford NLP materials about text and speech processing
advantages:
- huge! The whole book is around 653 pages
- comprehensive
- easy to read
- questions at the end of paragraphs. (if you used to skip them at school, as I did, I strongly recommend quitting this habit and answering the questions in this book)

disadvantages:
- no code
- no links

p.s. I also stumbled upon a course about speech processing in Nady's channel. Also highly recommended.

#grokaem_courses #grokaem_nlp


grokaem себя

DATAPREP

Just encountered a pretty interesting library that makes EDA fast and comprehensive, and preprocessing easy and fast. I haven't measured how it performs in terms of time, but the range of functions is extensive.

I prefer to use seaborn, but it always feels like I miss some details. It also always means copy-pasting or stitching together Python scripts. So such a library saves time and effort!


DataPrep lets you plot all columns, missing values, distributions and correlations in one report.
It also provides an interface for preprocessing. For text it covers replacement of urls, brackets, stopwords, punctuation, digits and so on. Relatively easy to use.


You can check their user guide.

It is also interactive, which I value the most!

#grokaem_ml
#machine_learning #eda #preprocessing


grokaem себя

SENSITIVITY AND SPECIFICITY

- We care about accuracy!
- Oh, okay, it's your choice, but... don't you care about specificity and sensitivity?
p.s. I'm working on dementia, so my examples are about dementia, not cancer.

Sensitivity = tp / (tp + fn) - the proportion of patients with dementia who are classified as ill, out of all patients with dementia (including those wrongly classified as healthy)
Specificity = tn / (fp + tn) - the proportion of healthy patients who are classified as healthy, out of all healthy patients (including those wrongly classified as ill)

If you're aware of precision and recall, you may notice that sensitivity is recall - how well our model finds the really ill patients.
If you look closely at specificity, you may notice it's recall too! Wait, what? Yes, it's recall with regard to the negative class. It shows how well the model finds the really healthy patients.

What is left?
Precision. We were speaking about sensitivity and specificity from the data side. Precision shows us the results from the model side - how many of the patients classified as ill are really ill.

(recall is the share of ill patients, out of all ill patients, that the algorithm found; specificity is the share of healthy patients, out of all healthy patients, that the algorithm found; precision is how many patients are really ill out of all those the algorithm labeled as ill. Roughly speaking, the denominator is different: for recall and specificity we divide by what is actually in the data, for precision by the number of examples the model gave us.)
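To make the different denominators concrete, here is a minimal sketch in plain Python. The function name and the patient counts are made up for illustration, they are not from any real study.

```python
def classification_rates(tp, fp, tn, fn):
    """Return (sensitivity, specificity, precision) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # recall of the ill class: divide by all really ill
    specificity = tn / (tn + fp)  # recall of the healthy class: divide by all really healthy
    precision = tp / (tp + fp)    # divide by everyone the model called ill
    return sensitivity, specificity, precision


# Hypothetical screening of 100 patients, 20 of them with dementia.
sens, spec, prec = classification_rates(tp=15, fp=8, tn=72, fn=5)
accuracy = (15 + 72) / 100
print(f"accuracy={accuracy:.2f} sensitivity={sens:.2f} "
      f"specificity={spec:.2f} precision={prec:.2f}")
```

With these invented counts accuracy looks fine, while a quarter of the ill patients are missed and about a third of the "ill" predictions are wrong, which is exactly why accuracy alone is not enough.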

help links:
many examples - I recommend first analyzing the tables, calculating the metrics and then checking yourself
easy video about sensitivity and specificity

pictures below may help

#grokaem_ml
#machine_learning #metrics


grokaem себя

FINE-TUNE HEAVY MODELS

- If we fine-tune this pretrained BERT, we will boost our results! But we are poor and don't have enough GPU... I always run into CUDA out of memory 🤯
- Being poor is not a reason to give up, let's figure it out.

A checklist for fine-tuning a heavy model:
Presteps and cautions:
0️⃣.1️⃣ check that it really is a CUDA problem: try a batch size of 1 on CPU, maybe some of your operations trigger the error. Bugs are explained much better when running on CPU. p.s. it's also best practice to turn off the GPU, take just one small batch, check that you get the expected result, and only then run on GPU. It saves time and GPU usage.
0️⃣.2️⃣ if a distilled version of the model is available, try it (a distilled model is a student model that gained knowledge from its teacher). A distilled model is usually several times lighter and takes less GPU memory.
0️⃣.3️⃣ try a smaller batch size - be aware that training with a small batch size can lead to worse overall results. Convergence can be slower, and you may not reach exactly the desired result.

If you don't get the idea of GPU, CPU and TPU, don't worry, you were not born with this knowledge: check out this and this video.
Most of the time this problem happens when you try to allocate too much memory during training: your operations are heavy (like matrix multiplication), your model is heavy, your optimizer is heavy, your network is deep and you store the gradients of its nodes.

1️⃣ try gradient checkpointing. TensorFlow and explanation, PyTorch and explanation
It is a technique where you keep in memory only some of the network's nodes needed for backpropagation and recompute the rest when they are needed, so it's a trade-off between runtime and memory efficiency.
important note: be careful with blocks that contain dropout/BatchNorm, since recomputing them can give different results in the two runs because of randomness.
If you use Hugging Face, check out gradient_checkpointing.
2️⃣ try gradient accumulation (I find it fascinating!) - you accumulate gradients over a few mini-batches and update the weights only once you have gathered the needed number (e.g. the authors used a batch size of 128 for the main training, you cannot afford it and use a batch size of 8, so you step the optimizer only after 128 / 8 = 16 passes). You can do it in your own loop; if you use Hugging Face, check out gradient_accumulation_steps. code snippet to use in a regular training loop.
3️⃣ use Adam8bit. It keeps the optimizer states in 8-bit, which saves a lot of memory. An additional post about Adam differences will be here one day.
4️⃣ use different precisions for training. Precision is basically how accurately numbers are represented during training; it's a pretty neat idea that you can pick up in the tutorial. Hugging Face has a very user-friendly interface for fp16, bf16 and tf32. By the way, tf32 can be enabled with just one line of code for torch: torch.backends.cuda.matmul.allow_tf32 = True
important note: when using a different precision, remember about convergence time and accuracy. Most of the time it's a trade-off between the size of the model, its speed and its accuracy.
By the way 2, if you need to use your model later, you can quantize it. I personally have used only torch resources and Intel OpenVINO.
5️⃣ Be aware of the parameters in DataLoaders and of how you store your data. It's recommended not to keep all your data on the GPU even if it fits: it's better and faster to move each batch to the right device during training and make use of a bigger batch size. If you're worried about time, set pin_memory=True, so your data is kept in pinned (page-locked) CPU memory and is transferred to the GPU faster. Illustration and details.
6️⃣ Nearly forgot: be aware of which operations you can do on CPU and which on GPU. For example, the loss function can be calculated on CPU. Also, when training on GPU, make sure you store loss.item() rather than the loss tensor itself, otherwise it can overload the memory. The thread on when to compute the loss function on GPU and the comment about using it with regard to CPU and GPU synchronization.
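To see why gradient accumulation from the checklist above works, here is a toy sketch in plain Python rather than a real PyTorch loop. The one-weight model, the data and all names are assumptions for illustration. It checks that 16 micro-batches of 8, each scaled by 1/16, give the same update as one batch of 128.

```python
import random


def mse_grad(w, batch):
    """Gradient of the mean squared error of y_pred = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(128)]
data = [(x, 3.0 * x) for x in xs]  # toy data, the true weight is 3.0

w, lr = 0.0, 0.1
target_batch, micro_batch = 128, 8
accum_steps = target_batch // micro_batch  # 128 / 8 = 16 passes

# What we would like to compute but "cannot afford": one big batch of 128.
full_grad = mse_grad(w, data)

accumulated = 0.0
for step in range(accum_steps):
    chunk = data[step * micro_batch:(step + 1) * micro_batch]
    # Scale each micro-batch gradient so the sum is the mean over all 128 samples.
    accumulated += mse_grad(w, chunk) / accum_steps

# Only now do the optimizer step (plain SGD here).
w_accumulated = w - lr * accumulated
w_full = w - lr * full_grad
print(abs(w_accumulated - w_full) < 1e-12)
```

In a real training loop the same idea usually means dividing the loss by the number of accumulation steps, calling backward on every micro-batch, and stepping and zeroing the optimizer only every 16th pass.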

That's not everything, but these seem like the most obvious ones. What techniques do you use?

#grokaem_dl
#deep_learning #cpu #gpu #pytorch #hugging_face


grokaem себя

TOPIC MODELING

Topic modelling is the task of finding the topics that run through a collection of documents. Each document is assumed to be made up of a few topics, each with its relative importance. You can use topic modelling even when it's not the main task: to check the data distribution and the correlated topics in it, and to make use of this information by adding new data or using topic-based pre-trained models.

⚡️IMPORTANT:
I've published all links with some more explanations and questions in my notion, please check the link. I also published the same material on Medium, my first try, check it out here (milana.shxanukova15/topic-modeling-35cd2acb515f). There you can find the link to code samples.

0️⃣ basic info:
Data is a few samples of text; the number of texts should be greater than or equal to the number of topics.
output is a representation of existing topics in all documents.

1️⃣ approaches:
LDA - a generative model that creates topics based on words that generate texts and appear together.
LSA - model that makes use of SVD on document-term matrix to produce topics.
NMF - the same idea as LSA, but it factorizes the matrix into non-negative matrices.
BERTopic - bert embeddings + dimensionality reduction + clustering algorithm.
BERT + LDA - usage of both LDA feature representation and bert embeddings as concatenated input to autoencoder and clustering algorithm as the main topic model.

2️⃣ Metric:
Topic Coherence - how well a topic is supported by a text set (the reference corpus). We measure topic coherence as a score of how often the topic's words occur together in the reference corpus.
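One possible way to turn "words that occur together" into a number is a UMass-style score built on document co-occurrence counts. This is a minimal sketch, not the exact metric of any particular library. The toy corpus and topic word lists are invented, and it assumes every topic word appears in at least one document.

```python
import math


def umass_coherence(topic_words, documents):
    """Sum of log((D(w_i, w_j) + 1) / D(w_j)) over ordered pairs of topic words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            w_i, w_j = topic_words[i], topic_words[j]
            score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score


# Hypothetical reference corpus, already tokenized.
corpus = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "food"],
    ["stock", "market", "price"],
    ["market", "price", "trade"],
    ["cat", "pet", "food"],
]

coherent_topic = ["pet", "dog", "cat"]
mixed_topic = ["pet", "market", "cat"]
print(umass_coherence(coherent_topic, corpus), umass_coherence(mixed_topic, corpus))
```

The topic whose words share documents gets the higher (less negative) score, the mixed one is penalized for pairs that never appear together.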

#nlp #deep_learning #topic_modeling #bert #milana_medium
#grokaem_nlp


grokaem себя

TF-IDF FEATURE SELECTION

#grokaem_nlp
#nlp #data_science #tfidf #feature_selection


You should know tf-idf by default, and it tends to be the litmus test for your first hypotheses, but where can the pitfalls be?

- The output of tf-idf is a matrix of shape (n_samples, n_features), where n_features is the number of words in the vocabulary that you either passed in or that was built internally, and n_samples is the number of texts.
Obviously we may have a ton of words, and then the matrix becomes enormous. What can we do?

0️⃣ We take it for granted that all stop words have been removed. But you can also remove prepositions that are not always in the stop-word list, or pick a set of parts of speech and keep only those. If the lemmatizer doesn't cope well, apply stemming on top: this gives you a pseudo-stem and fewer features. Names and titles can also be handled separately or dropped altogether, and so on...
1️⃣ Choose which features (words or n-grams) to keep using SelectKBest, mutual information, variance threshold.

SelectKBest - a wrapper around score functions that picks the k best scores and the features they belong to. The following functions are used for selection:
- chi2 (explanation of the function, explanation of how it works with tfidf): an analysis of the relation between expected and observed frequencies
- mutual information - works the same way as information gain in decision trees, i.e. the statistical dependence on the presence or absence of a feature; explanation one and two
- anova - one more statistic (explanation via variance)

2️⃣ Drop some features after analyzing feature importance. How is it done?
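To make the (n_samples, n_features) shape and the idea of cutting features tangible, here is a rough sketch in plain Python. The tf-idf variant used here (term frequency times log(N / df) + 1) is only one common choice and not the exact formula of any library. The variance-based cut is a stand-in in the spirit of a variance threshold, and the texts are invented.

```python
import math
from statistics import pvariance


def tfidf_matrix(texts):
    """Return (vocabulary, matrix); the matrix has shape (n_samples, n_features)."""
    docs = [t.lower().split() for t in texts]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = {w: sum(w in d for d in docs) for w in vocab}   # document frequency
    idf = {w: math.log(n / df[w]) + 1.0 for w in vocab}
    matrix = [[d.count(w) / len(d) * idf[w] for w in vocab] for d in docs]
    return vocab, matrix


def top_k_by_variance(vocab, matrix, k):
    """Keep the k columns whose tf-idf values vary the most across texts."""
    variances = [pvariance(col) for col in zip(*matrix)]
    ranked = sorted(range(len(vocab)), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])
    return [vocab[j] for j in keep], [[row[j] for j in keep] for row in matrix]


texts = [
    "the patient sleeps well",
    "the patient forgets names",
    "the doctor checks the patient",
]
vocab, matrix = tfidf_matrix(texts)
print(len(matrix), len(matrix[0]))        # n_samples, n_features

small_vocab, small_matrix = top_k_by_variance(vocab, matrix, k=4)
print(len(small_matrix), len(small_matrix[0]), small_vocab)
```

With a real corpus you would rather use a library vectorizer plus a supervised score such as chi2, since variance alone ignores the labels.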

What methods do you use?


grokaem себя

WIDENING FACTOR
This concept is fairly old, but I'm not a CV person, so I only came across it in practice this year.

widening factor is the factor responsible for the number of feature maps. It was proposed in the WideResnet paper. How a regular resnet works.

0️⃣ The main advantage of resnet is the residual block, which is meant to solve the vanishing gradient problem in deep networks.
But resnet has a drawback, diminishing feature reuse: a situation where either only a few blocks actually 'learn' some information, or many blocks learn too little. (page 2, wideresnet)

1️⃣ One of the proposals is exactly the widening factor, which in effect shows how many times larger the output of a block will be.
This is better in terms of computation and network depth. (page 5, wideresnet) It works simply as multiplying the number of output feature maps by a given value. That's it))

2️⃣ Other goodies they showed:
- adding dropout between the convolution layers rather than between the blocks, which was used as a form of regularization.

3️⃣ What is important to understand?
- the number of parameters grows linearly with the depth of the network and the number of layers inside a block, but quadratically with the width of the block, so all these parameters have to be 'fine-tuned'.
Implementation in pytorch
not a wow video, but a good one
paper
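A quick back-of-the-envelope check of the "linear in depth, quadratic in width" point, as a plain Python sketch. It assumes a simplified block made only of 3x3 convolutions with the same number of input and output feature maps and ignores biases and BatchNorm. The function names and the base width of 16 are illustrative.

```python
def conv3x3_params(in_maps, out_maps):
    """Number of weights in one 3x3 convolution (biases ignored)."""
    return 3 * 3 * in_maps * out_maps


def block_params(base_width, k, convs_per_block=2):
    """Weights of a block whose convolutions all have base_width * k feature maps."""
    width = base_width * k  # the widening factor simply multiplies the feature maps
    return convs_per_block * conv3x3_params(width, width)


base = block_params(16, k=1)
for k in (1, 2, 4, 8):
    params = block_params(16, k)
    print(f"k={k}: {params} weights, {params // base}x the k=1 block")

# Doubling the number of convolutions in the block only doubles the weights.
print(block_params(16, k=1, convs_per_block=4) // base)
```

So doubling k costs four times the weights, while doubling the number of layers costs only two times.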

#grokaem_dl
#deep_learning #dl #resnet #wideresnet #cnn


grokaem себя

ITERATORS AND GENERATORS

0️⃣ Iterables
- are objects capable of returning their members one at a time.
some examples: lists, strings, and tuples.
protocol: __iter__() (or the sequence protocol: __getitem__() and __len__())

1️⃣ Iterator is an object that is used to iterate over an iterable object using the next method.
protocol: __iter__() and __next__()

2️⃣ Generator is a function which returns a generator iterator. It looks like a normal function except that it contains yield expressions for producing a series of values usable in a for-loop or that can be retrieved one at a time with the next() function. credit
Generator functions use yield instead of return; the yield statement remembers the current state of the function and resumes execution only when the next value is asked for.
Generator expressions are similar to list comprehensions, except that, as mentioned above, each element is computed only when it is requested with next(). Example: my_generator = (x**2 for x in my_list)

⚡️We can say that a generator is an easy form of iterator, as we don't need to implement the iterator protocol ourselves to create one. Both follow the lazy-evaluation concept: an object is evaluated when it is needed, not when it is created. credit

⚡️Why should I use them?
Iterators are memory efficient, as operations are executed on call, not during initialization. This makes them suitable for large datasets. One drawback is that an iterator can be iterated over only once.
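Here is a small side-by-side sketch of the three forms: a hand-written iterator, a generator function and a generator expression. The Countdown example is made up for illustration.

```python
class Countdown:
    """Custom iterator: implements __iter__() and __next__() by hand."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self  # an iterator returns itself

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # signals that there is nothing left
        value = self.current
        self.current -= 1
        return value


def countdown(start):
    """Generator function: the same behaviour, the state lives in the paused frame."""
    while start > 0:
        yield start
        start -= 1


my_list = [1, 2, 3]
my_generator = (x**2 for x in my_list)  # nothing is computed yet

it = Countdown(3)
first_pass = list(it)   # consumes the iterator
second_pass = list(it)  # the iterator is already exhausted
print(first_pass, second_pass, list(countdown(3)))
```

The second pass comes back empty, which is the "can be iterated over only once" drawback in action. If you need another pass, create a new iterator.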

python iterators building blocks to use
article about iterators in russian
tricky questions
comprehensive article about iterators
example of a custom iterator
example of a generator

#grokaem_programming
#python #iterators #generators


grokaem себя

TEXT CLASSIFICATION

One of the simplest yet most important tasks in NLP is text classification. I decided to collect some of the known solutions for this task as part of my notion NLP book. You can find the code, the notion page and the Medium page (milana.shxanukova15/text-classification-8d937b5d00c5) using the provided links.

In addition to easy-to-use models such as SVM, Decision Trees and MLP, which I've added and covered, you can also find some info about unusual techniques:

0️⃣CNN + RNN
We can first capture local context information using a CNN and then apply an RNN.

1️⃣sepCNN
sepCNN is a type of CNN where we split the kernel into a few smaller matrices. It can be a spatially separable convolution, where the kernel is split along its spatial dimensions and the depth of the image is not involved, or a depthwise separable convolution, where we first apply a convolution to each channel separately, keeping the depth dimension as it is, and then apply a 1*1 kernel to get more channels in depth, as happens with regular convolutions.
Mainly, because it splits the kernel, sepCNN requires far fewer computations, and at the same time it has fewer parameters.

2️⃣BERT embeddings + numerical features
We can take an existing BERT classification model, which is basically BERT with a linear layer at the end, but we can also implement it ourselves and add numerical features collected from the data. These features can represent anything you want. For instance, it can be the price of a car from your data or a location.
There are a few ways to add such features to the model.
1) Make an ensemble model. One model is just text classification, another works with all the numerical features. Then you weight the results from these models and aggregate the final decision.
2) Concatenate inputs. We use BERT as an embedder, so as output we have word embeddings. Then we concatenate these embeddings with our feature vector and use the result as input for a classifier built from linear layers. Note that normalization is frequently used to preserve the distribution of the feature vector.
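Going back to the sepCNN item above, the saving in parameters is easy to check with a bit of arithmetic. A sketch in plain Python for the depthwise separable case. The kernel size and channel counts are arbitrary example values and biases are ignored.

```python
def regular_conv_params(k, c_in, c_out):
    """Weights of a regular k x k convolution."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights of a depthwise k x k convolution followed by a 1x1 pointwise one."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1x1 convolution that mixes channels and sets the depth
    return depthwise + pointwise


k, c_in, c_out = 3, 64, 128
regular = regular_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
print(regular, separable, round(regular / separable, 1))
```

For these example sizes the separable version needs roughly eight times fewer weights, and the gap grows with the kernel size and the number of output channels.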

I've decided to experiment with TensorFlow this time, so if you have ideas on how to make the code better, I would be happy to get your comments.

#grokaem_nlp #milana_medium


grokaem себя

#advice

I need a piece of advice about visualization. What tools do you usually use to visualize models?


grokaem себя

N_FFT

Once I asked my coworker how he chose parameters for spectrograms, mel spectrograms and so on. I got the worst possible answer for a young developer: I just go by my intuition and experience.
How do you deal with this sort of stuff if you need to rely on something more verified than just your intuition?


Uncertainty principle
Firstly, we need to remember the uncertainty principle, or Gabor limit, which states that a signal cannot be simultaneously sharply localized in both the time and frequency domains. It leads us to the already familiar trade-off between different parameters. We always need to make a choice between time resolution and frequency resolution.
short video about the math behind it
short explanation

What is frequency resolution?
It's how well you can tell frequency components apart: components that are at least this far apart can be distinguished. e.g. resolution = sample_rate / n_fft = 22050 / 1024 ≈ 21.5 Hz, so each bin is about 21.5 Hz wide and we can distinguish frequencies that are about 21.5 Hz apart.
one_explanation
• time resolution involves windowing, let's cover it in the next post

Let's take the first parameter - n_fft.

N_fft - the number that specifies the FFT length and so the number of frequency bins, i.e. the number of rows of the spectrogram along the y axis.

Let's take for example 10 seconds of speech of a control patient. I chose n_fft = 1024, sr = 44100. The shape of the spectrogram is [513, 862].

Why is the height 513 for n_fft = 1024?
For one frame, n_fft // 2 + 1 bins are created. Why so? It follows from the Nyquist theorem and the symmetry of the DFT of a real signal: we don't need the symmetrical part of the DFT. You can see examples where the symmetrical part is not truncated.

Is the more the better?
Yes, but it depends on what you want: a bigger n_fft gives better frequency resolution at the cost of time resolution. For computational efficiency we use powers of 2, typical numbers are 512, 1024 and so on. The best tip is to first understand how important frequency resolution is for your specific task and then estimate your n_fft according to the formula.
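The numbers above can be reproduced with a few lines of plain Python. One assumption that is not stated in the post: the 862 frames match a hop length of 512 samples with centered (padded) frames. That is a common default in audio libraries, but treat it as a guess about the original setup.

```python
def frequency_resolution(sr, n_fft):
    """Width of one frequency bin in Hz."""
    return sr / n_fft


def spectrogram_shape(duration_s, sr, n_fft, hop_length):
    """(frequency bins, time frames), assuming centered frames."""
    n_samples = int(duration_s * sr)
    n_bins = n_fft // 2 + 1                 # the mirrored half of the DFT is dropped
    n_frames = 1 + n_samples // hop_length  # hop_length=512 is an assumption
    return n_bins, n_frames


print(round(frequency_resolution(22050, 1024), 1))
print(spectrogram_shape(10, 44100, 1024, hop_length=512))

# Turning the formula around: the n_fft needed for a target resolution.
target_hz = 10
print(44100 / target_hz)  # then round up to the next power of 2
```

The last line is the "estimate your n_fft according to the formula" step: decide how many Hz apart two components have to be and solve for n_fft.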

N_fft is connected with window size, which we'll cover later.
used links:
nice pictures
great explanation with the picture example
detailed and short video about dft

#grokaem_audio
#audio #nlp #audio_fundamentals


grokaem себя

#research #milana_medium

sometimes I come across as a 'душнила' (a nitpicking bore) because I like to feel in control of everything that is happening, understand every detail, know what others in my team are doing and for what purpose and, most importantly, how everything is organized!

Check out my new post on Medium about how I read articles and how to help yourself organize materials for your projects.
https://medium.com/@milana.shxanukova15/8fa7340e69b1


grokaem себя

#grokaem_seby_thoughts
It's the second anniversary of this channel, 17th of July. 🎉Thanks to everyone who said that they liked posts here, I appreciate your attention and time a lot.
-
But to be honest, I was really exhausted in the last few months and wanted to abandon everything. It could be caused by many reasons from work-study arrangements draining the life energy up to my private life that I was underestimating for the last 4 years. One way or another, I was in such a dejected condition for too long.
————
If you're in the same situation, maybe one idea that helped me will apply to you too.

I've always been of the opinion that people should create value in any form. Helping others in need was my way of thinking about being beneficial. But if you're subscribed to AI channels, most of the time (this June was the perfect example) you see posts about art generation, which is much more about entertainment than necessity. (I don't want to diminish art generation.) But AI is not just about that, it's helpful. I literally cried when I understood that a girl, Clara Woods, could express her thoughts thanks to text-to-speech technologies. It's only one example. There are thousands, from subtitles for deaf people to cancer diagnosis using CV.

You can be beneficial with your knowledge. You can make a difference. You can create history. Even your small steps every day, when you struggle with this silly CUDA bug, can one day be a part of your work for others. Of course, it's not motivational for everyone, but I gained power and belief in my own work by reminding myself of the prospect of doing something important. So I will continue working on dementia, hoping to make diagnoses faster, and investigate audio and NLP to make solutions for deaf people.


grokaem себя

I've decided to make all posts here in English. There are a few reasons:
- maybe the major reason is that I'm losing my English skills while learning German... sad but true... Even though a huge amount of everything I read and listen to is in English (not to mention research publications and technical literature), I don't speak or write as frequently as I did before.
- the field is fully in English, and communication can be built in English with both Russian and non-Russian speakers
- these posts can improve my academic English, which I dislike a little bit...
- I'm a glutton for challenges....

By the way, you can practise correcting mistakes by reading my posts, I'm as always open to criticism. Hope you stay tuned.

p.s. I'm currently in Nizhniy Novgorod, so if you're here too and want to practice English with me, I'm very open to try.


grokaem себя

GFCC, MFCC, MELSPEC

Spoiler - this is a messy post about gfcc, which are used in audio processing. If this area isn't your thing, you might still simply enjoy the video about how we perceive frequencies; it covers the little hairs in our ears and their thickness. It's a school-level topic, but it still blows my mind))
Spoiler 2 - all these representations try to imitate human hearing. Just for fun you can check how well you distinguish frequencies here (better to take one earphone out and turn the volume down), and here up to what frequency you can hear. (I get to 14k)
________________
There are: spectrogram, mel-spectrogram, mfcc, gfcc.
A brief overview of the differences:
0️⃣ a mel-spectrogram differs from a spectrogram by its mel filter banks, which try to imitate how humans hear: lower frequencies are distinguished better than higher ones.
1️⃣ mfcc differs from a mel-spectrogram in that mfcc performs a so-called decorrelation of the mel-spectrogram bins. The video below explains this really well through the notion of the vocal tract frequency response, which carries the formants, and those in turn reflect the identity of the sound.
2️⃣ mfcc differ from gfcc in many ways (details in the paper), but the main ones are the filter, which here is a gammatone filter bank rather than mel filter banks, and the compression: a cubic root is taken instead of a log. That lets the 0th coefficient, which is responsible for the energy of the sound, carry over into the coefficients themselves, and so gfcc are not scale invariant.
________________

GFCC - gammatone frequency cepstral coefficients.
Recall melspec: there we had mel filters, which essentially assign a weight to each frequency value. They are triangular and do their job of spacing out the low and high frequencies. Gammatone filters are similar to mel filters but have a different shape: their edges are softer and they are defined through an impulse response. They were first proposed in 1972, according to Wikipedia, in a study of the cochlear nucleus (a part of the auditory pathway in the brainstem) in cats. The shape of a gammatone filter.
So in essence gfcc is the same melspec, only the filters aren't triangular)

Importantly, although in this messy post I contrast gfcc and mfcc specifically, keep in mind that mfcc derives from the spectrogram, while gfcc takes raw audio.
Both melspec and mfcc are available in torchaudio, while gfcc can be computed with spafe.
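The log vs cubic root point from the overview can be checked on toy numbers. A sketch in plain Python, assuming the cepstral coefficients come from an unnormalized DCT-II over compressed filter-bank energies. The energies are invented and this is not how torchaudio or spafe implement it internally.

```python
import math


def dct2(values):
    """Unnormalized DCT-II of a list of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(values))
        for k in range(n)
    ]


def cepstrum(energies, compress):
    """Compress the filter-bank energies, then decorrelate them with a DCT."""
    return dct2([compress(e) for e in energies])


energies = [4.0, 9.0, 2.5, 7.0, 1.5, 3.0]  # hypothetical filter-bank energies
louder = [10.0 * e for e in energies]      # the same sound, 10x the energy

log_quiet = cepstrum(energies, math.log)
log_loud = cepstrum(louder, math.log)
root_quiet = cepstrum(energies, lambda e: e ** (1 / 3))
root_loud = cepstrum(louder, lambda e: e ** (1 / 3))

# With log, only coefficient 0 moves; with the cubic root, every coefficient is rescaled.
print([round(b - a, 6) for a, b in zip(log_quiet, log_loud)])
print([round(b / a, 6) for a, b in zip(root_quiet, root_loud)])
```

With the log, the loudness change lands entirely in coefficient 0, so the remaining coefficients are scale invariant. With the cubic root the energy leaks into all of them.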

decent posts about mfcc and mel spec
the best video about mfcc, melspec
paper about mfcc and gfcc and the difference between them

#grokaem_audio
#deep_learning #audio_processing


grokaem себя

#датасайнс_ссылки

A handy extension for copying code (and non-code) from slides and videos into plain text. Especially useful when you're watching a video with code and there's no GitHub link in the description.

blackbox.io
