snakers4 | Technologies

Telegram channel snakers4 - Spark in me

2278

Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.


Spark in me

What is amazing about TF and the CUDA / cuDNN drivers is that the documentation is not updated when newer versions are released - and the library file names keep changing, which is annoying af.

Arguably Google and Nvidia are the richest companies in the whole DS stack - but their documentation is the worst among the rich companies.

So if you are updating your docker container and libraries suddenly start producing weird errors - look for compatibility guidelines like this one - https://goo.gl/cF3Swy

Of course the docs and release notes will have no mention of this. Because Google.

Also Docker Hub contains all the versions of CUDA + cuDNN packaged, which helps
- https://hub.docker.com/r/nvidia/cuda/

PS
Pytorch has all this embedded into their official repo list
- http://prntscr.com/i6nfsl

Google, why do you make us suffer?

#deep_learning


Spark in me

Best link about convolution arithmetic
- https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md
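
A minimal sketch of the basic output-size relationships that guide covers, for one spatial axis. Assumptions: no dilation, no output padding, symmetric zero padding; the function names are hypothetical and not taken from the linked repo.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size of a convolution along one axis.

    i - input size, k - kernel size, s - stride, p - zero padding per side.
    Formula: floor((i + 2p - k) / s) + 1
    """
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k, s=1, p=0):
    """Output size of a transposed convolution along one axis.

    Formula (no output padding): s * (i - 1) + k - 2p
    """
    return s * (i - 1) + k - 2 * p


# "same" 3x3 convolution keeps the size
print(conv_output_size(32, k=3, s=1, p=1))               # 32
# typical 7x7 stride-2 stem halves the size
print(conv_output_size(224, k=7, s=2, p=3))              # 112
# 4x4 stride-2 transposed convolution doubles it back
print(transposed_conv_output_size(112, k=4, s=2, p=1))   # 224
```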

#deep_learning


Spark in me

New dimensionality reduction technique - UMAP
- https://github.com/lmcinnes/umap

I will write more as I test it / learn more.

Works well with HDBSCAN and CNNs I guess
- https://goo.gl/9hYAXL

Usage examples
- https://goo.gl/QuYWJF

#data_science


Spark in me

Playing with HDBSCAN in practice.

What I learned: if you have a dense (non-sparse) feature vector with 1000-5000+ dimensions, you should apply PCA before using HDBSCAN.

Their scalability how-to (https://goo.gl/iR9HQu) runs all the benchmarks on 10-dimensional vectors. In practice anything above 50-100 dimensions hit some kind of bottleneck - memory consumption was low, CPU consumption was also low - but pretty much nothing happened for hours.

Also if you want to have large clusters and set (https://goo.gl/eikRy4) the min_samples value to >> 100, there will be a memory explosion due to some kind of caching issue. So even if your cluster size should be 5000+, you are compelled to use min_samples ~ 100.

#data_science


Spark in me

Last post should have contained "DrivenData". I stand corrected.


Spark in me

Key / classic CNN papers

ShuffleNet

ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
- a small resnet-like network that uses pointwise group convolutions, depthwise convolutions and a shuffle layer
- authors - Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun
- paper - http://arxiv.org/abs/1707.01083
- key
-- on ARM devices 13x faster than AlexNet
-- lower top1 error than MobileNet at 40 MFLOPs
- comparable to small versions of NASNET
- 2 ideas
-- use group convolutions for the 1x1 convolutions and depth-wise convolutions for the 3x3 ones
-- use shuffle layer (reshape channels into groups, transpose, flatten back to original dimension)
- illustrations
-- shuffle idea - https://goo.gl/zhTV4E
-- building blocks - https://goo.gl/kok7bL
-- vs. key architectures https://goo.gl/4usdM9
-- vs. MobileNet https://goo.gl/rGoPWX
-- actual inference on mobile device - https://goo.gl/X6vbnd
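
The shuffle idea is easier to see on channel indices. A minimal sketch in pure Python, assuming the channels are a flat list and the number of channels is divisible by the number of groups; the function name is hypothetical, and a real layer would do the same reshape / transpose / flatten on the channel axis of a tensor.

```python
def channel_shuffle(channels, groups):
    """Reorder a flat list of channels the way a channel shuffle does."""
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups

    # 1. reshape the flat list into (groups, per_group)
    grid = [channels[g * per_group:(g + 1) * per_group] for g in range(groups)]
    # 2. transpose to (per_group, groups)
    transposed = zip(*grid)
    # 3. flatten back to the original length
    return [c for row in transposed for c in row]


# 6 channels in 3 groups: [0, 1 | 2, 3 | 4, 5]
print(channel_shuffle(list(range(6)), groups=3))  # [0, 2, 4, 1, 3, 5]
```

After the shuffle each consecutive block of channels contains one channel from every original group, so the next group convolution sees information from all groups.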

#deep_learning
#data_science


Spark in me

For new (!) people on the channel:

- This channel is a practitioner's channel on the following topics: internet, data science, math, deep learning, philosophy
- Focus is on data science
- Don't get upset if your opinion differs. You are welcome to contact me via telegram @snakers41 and email - aveysov@gmail.com
- No bs and ads

Give us a rating:
- /channel/tchannelsbot?start=snakers4

Donations
- Buy me a coffee https://buymeacoff.ee/8oneCIN
- Direct donations - https://goo.gl/kvsovi - 5011673505 (paste this agreement number)
- Yandex - https://goo.gl/zveIOr

Our website
- http://spark-in.me
Our chat
- https://goo.gl/IS6Kzz
DS courses review
- http://goo.gl/5VGU5A
- https://spark-in.me/post/learn-data-science
GAN papers review
- https://spark-in.me/post/gan-paper-review


Spark in me

Internet Digest
- Ben Evans - https://goo.gl/TPyLoD
- Youtube tightening moderation screws for small channels - https://goo.gl/SHpC2h
- Camera strapped to plane - https://vimeo.com/240106846
- Guardian online getting profitable - https://goo.gl/CDpNFb
- Amazon testing a shop without cashiers - you just take goods and walk out - https://goo.gl/hvh63Z
- Drone saving a drowning person - https://goo.gl/RdGYDx

LOL
- And this will go down great with Russian "cluck-cluck" devs and the culture of "trashing everything" that reigns in our IT - https://goo.gl/S5poqv


#internet
#digest


Spark in me

Pytorch in a year review
http://pytorch.org/2018/01/19/a-year-in.html

#deep_learning


Spark in me

Tested bcolz on a simple premise - how fast it can process 1M (1,3476) feature vectors from a CNN. It also looks like it provides 2-3x compression straight out of the box. Nice.

Blazingly fast!
- https://goo.gl/z1MKmH

#data_science


Spark in me

2017 DS/ML digest 1

Have not done digests for quite some time =)

1. Annual digests
1.1 Google Brain - part one https://goo.gl/VQhZmP part two https://goo.gl/XkTRhp
Highlights
- Speech generation https://goo.gl/MEDv7M
- Speech recognition https://goo.gl/tCEkVz
- Auto ML https://goo.gl/fx2FuP
-- NASNET - https://goo.gl/becAET

1.2
Posted before - but WildML 2017 summary is also awesome https://goo.gl/ZFtFVT

2. Datasets
→ YouTube-8M (https://goo.gl/nyP9gp): >7 million YouTube videos annotated with 4,716 different classes
→ YouTube-Bounding Boxes (https://goo.gl/c3K6YY): 5 million bounding boxes from 210,000 YouTube videos
→ Speech Commands Dataset (https://goo.gl/TWsTi8): thousands of speakers saying short command words
→ AudioSet (https://goo.gl/TVA3LJ): 2 million 10-second YouTube clips labeled with 527 different sound events
→ Atomic Visual Actions (AVA) (https://goo.gl/Ba4U73): 210,000 action labels across 57,000 video clips
→ Open Images (https://goo.gl/2Xj8Xd): 9M creative-commons licensed images labeled with 6000 classes
→ Open Images with Bounding Boxes (https://goo.gl/qRkvMy): 1.2M bounding boxes for 600 classes
→ QuickDraw dataset (https://goo.gl/FSsfYm)

3.
Uber on a genetic approach to neural networks - https://eng.uber.com/deep-neuroevolution/

#digest
#data_science
#deep_learning
#machine_learning


Spark in me

@vote Like the twitter repost idea?


Spark in me

Just found out about Facebook's fastText
- https://github.com/facebookresearch/fastText

Seems to be really promising

#data_science
#nlp


Spark in me

Wine 3.0 - the third major version of the Windows system call emulator has been released. This is the system behind the ports of most old games to Mac and Linux. And as of today it is the only decent way to run Microsoft Office and the latest Photoshop on Linux. I am always amazed that the team has had the enthusiasm to keep developing this product for more than ten years - my big congratulations to the team!

This release brings almost full compatibility with the base levels of DirectX/3D 11 and Android support.

https://www.winehq.org/news/2018011801


Spark in me

Nice presentation to learn about Semantic Segmentation
http://slides.com/vladimiriglovikov/title-texttitle-text/fullscreen#/0/5
https://www.youtube.com/watch?v=MYp3OwkiJAs

#data_science
#deep_learning


Spark in me

https://youtu.be/spUNpyF58BY


Spark in me

https://youtu.be/SHTOI0KtZnU


Spark in me

When looking at WorldBank, WEF and some consulting company reports and white-papers I always wondered if anybody reads them.

Here is a possible answer - No
https://img.washingtonpost.com/blogs/wonkblog/files/2014/05/pdfs.jpg

They do not understand that making content more reachable and SEO-friendly helps long-term. But SEO-friendly websites are usually full of bullshit.

#internet


Spark in me

If after updating your Ubuntu packages your dockerized application suddenly fails to see the GPU(s), then you should migrate to nvidia-docker-2.

Links:
https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)

Do not forget to read the section "Removing nvidia-docker 1.0".
In my case everything was solved simply by copy-pasting their commands:

# https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)
# migrate to NVIDIA docker 2

# remove all containers that use nvidia-docker 1.0 volumes, then purge the old package
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo apt-get purge nvidia-docker

# add the repository key and the package list (Ubuntu 16.04, amd64)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

# install nvidia-docker2 and make the Docker daemon reload its configuration
sudo apt-get install nvidia-docker2
sudo pkill -SIGHUP dockerd

# testing
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

#linux


Spark in me

Kaggle stats for 2017
- http://blog.kaggle.com/2018/01/22/reviewing-2017-and-previewing-2018/

I literally choked when I read this:
- $1.5MM competition with TSA to identify threat objects from body scans
- this was a competition where only US citizens were granted prizes => 10 stacked resnets won
- $1.2MM competition with Zillow to improve the Zestimate home valuation algorithm - this has 2 stages, first stage prize was US$50k
- $1MM competition with NIH and Booz Allen to diagnose lung cancer from CT scans - this one was really great - but I did not know much back then, it was early 2017 =(

Also I am not a great data scientist per se, but just comparing the amount of cringe and shitty train/test splits - DoubleData is much better than Kaggle in terms of data SCIENCE.

#data_science


Spark in me

A couple more articles about idiomatic pandas
- https://tomaugspurger.github.io/modern-4-performance
- https://tomaugspurger.github.io/modern-5-tidy

What was useful for me
- Faster reading of the dataframes
- Stack, unstack
- Melt

#data_science
#pandas


Spark in me

A new interesting competition on topcoder
- https://goo.gl/ix7xpx

At least at first glance)

#data_science
#deep_learning


Spark in me

So following our tweet-sender (https://goo.gl/uqGRRA) (which still has some bugs), github.com/borsik helped us to create a high-level Telegram-to-Twitter bot:

- Code https://github.com/snakers4/telegram2twitter
- Twitter channel https://twitter.com/AlexanderVeysov

Effectively this posts our channel to Twitter. Please star/use the code if you need it.

#data_science


Spark in me

Kaggle launching ... MS/CNN/DS education course
- https://twitter.com/fchollet/status/955538373992591360
- https://www.kaggle.com/learn/overview

Once again the syllabus looks ok, but I do not really have time to analyze it now.
Please pm me if you have watched it.
If you are just starting - Andrew Ng would be a safer bet.

#data_science
#course


Spark in me

A list of nice to read articles (RU)
- Nice article about credit score competition - https://goo.gl/7cy3Y1
- Feature engineering https://goo.gl/NkoxWQ
- If you are hardware-strapped - RPI + Movidius stick may work better for inference than just an RPI - https://goo.gl/HC7Uj8

#data_science
#deep_learning


Spark in me

Like the twitter repost idea?

Meh – 11
👍👍👍👍👍👍👍 52%

Yes – 8
👍👍👍👍👍 38%

No – 2
👍 10%

👥 21 people voted so far.


Spark in me

Soon we will be broadcasting our channel to Twitter (and will release our code for that)
- https://twitter.com/AlexanderVeysov

This is the first test post.


Spark in me

List of impressive ML projects in 2017
- https://habrahabr.ru/company/cloud4y/blog/346968/

The majority of them are totally impractical, of course =)

#data_science


Spark in me

Following our tweet-sender I had an idea.

Both Twitter and Telegram have APIs and python bindings.

So why not stream our telegram channel to Twitter? If you want to help us write a class for a $$ reward - please contact me.


Spark in me

A new fast, easy-to-use and efficient clustering algorithm - HDBSCAN. It is really amazing. I am not joking.

Quick links:
- paper https://arxiv.org/pdf/1602.03730.pdf
- library http://hdbscan.readthedocs.io/en

List of plain vanilla algorithms:
- (data) https://goo.gl/6KexoU
- K-Means (https://goo.gl/kjbA1f)
- Affinity Propagation (https://goo.gl/VrX4sy)
- Mean Shift (https://goo.gl/TekyML)
- Spectral Clustering https://goo.gl/RUifoa
- Agglomerative Clustering - https://pypi.python.org/pypi/fastcluster

Newer ones
- DBSCAN (https://goo.gl/DQK2Z3)
-- 2 steps
--- transform - points in dense regions are left alone, while points in sparse regions are moved further away
--- applying single linkage clustering to the transformed space results in a dendrogram

- HDBSCAN (https://goo.gl/XD4y8T)
-- goal was to allow varying density clusters
-- transform the space according to density (!)
-- single linkage clustering on the transformed space
-- the dendrogram is condensed by viewing splits that result in a small number of points splitting off as points ‘falling out of a cluster’

Key evaluation criteria
- Don’t be wrong!
- Intuitive parameters
- Stability
- Performance

Plain English comparison of different algorithms
- http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
- notebook http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb

Naive benchmarking of different clustering algorithms
- http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
- http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Benchmarking%20scalability%20of%20clustering%20implementations-v0.7.ipynb
- speed bench http://hdbscan.readthedocs.io/en/latest/_images/performance_and_scalability_9_1.png
- huge datasets http://hdbscan.readthedocs.io/en/latest/_images/performance_and_scalability_24_1.png

How HDBSCAN works
- http://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html
- Notebook https://goo.gl/iPy23p
- Transform the space according to the density/sparsity
- Build the minimum spanning tree of the distance weighted graph - https://goo.gl/EmQwd8
- Construct a cluster hierarchy of connected components - https://goo.gl/oHZo2W
- Condense the cluster hierarchy based on minimum cluster size - https://goo.gl/awSXjC
- Extract the stable clusters from the condensed tree
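
The first step (transforming the space) can be sketched with the mutual reachability distance described in the how-it-works doc: d_mreach(a, b) = max(core_k(a), core_k(b), d(a, b)). A minimal pure Python sketch under stated assumptions: Euclidean distance, core distance taken as the distance to the k-th nearest neighbour excluding the point itself (the library may count neighbours differently), brute-force O(n^2) search; all function names are hypothetical and this is not the library's implementation.

```python
import math


def core_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour (itself excluded)."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]


def mutual_reachability(points, i, j, k):
    """max(core_k(i), core_k(j), d(i, j)).

    Points in dense regions keep their original distances,
    points in sparse regions are pushed further away.
    """
    return max(
        core_distance(points, i, k),
        core_distance(points, j, k),
        math.dist(points[i], points[j]),
    )


points = [(0, 0), (0, 1), (1, 0), (10, 10)]

# plain distance between the first two points is 1.0,
# but the second point's core distance (k=2) is sqrt(2), so the pair moves apart
print(mutual_reachability(points, 0, 1, k=2))
# for the far-away point the plain distance already dominates
print(mutual_reachability(points, 0, 3, k=2))
```

The minimum spanning tree in the next step is then built on these distances instead of the raw ones.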

Also HDBSCAN has a notion of soft clustering - a custom cluster distance that works for oddly shaped clusters
- http://hdbscan.readthedocs.io/en/latest/soft_clustering_explanation.html
- http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/How%20Soft%20Clustering%20for%20HDBSCAN%20Works.ipynb

#data_science
#clustering
