Kaggle launching ... MS/CNN/DS education course
- https://twitter.com/fchollet/status/955538373992591360
- https://www.kaggle.com/learn/overview
Once again syllabus looks ok, but I would not really have time now to analyze it.
Please pm me if you have watched it.
If you are just starting - Andrew Ng would be a safer bet.
#data_science
#course
A list of nice to read articles (RU)
- Nice article about credit score competition - https://goo.gl/7cy3Y1
- Feature engineering https://goo.gl/NkoxWQ
- If you are hardware-strapped - RPI + Movidius stick may work better for inference than just an RPI - https://goo.gl/HC7Uj8
#data_science
#deep_learning
Like the twitter repost idea?
Meh – 11
👍👍👍👍👍👍👍 52%
Yes – 8
👍👍👍👍👍 38%
No – 2
👍 10%
👥 21 people voted so far.
Soon we will be broadcasting our channel to Twitter (and releasing our code for that)
- https://twitter.com/AlexanderVeysov
This is the first test post.
List of impressive ML projects in 2017
- https://habrahabr.ru/company/cloud4y/blog/346968/
The majority of them are totally impractical, of course =)
#data_science
Following our tweet-sender I had an idea.
Both Twitter and Telegram have APIs and python bindings.
So why not stream our telegram channel to Twitter? If you want to help us write a class for a $$ reward - please contact me.
A new fast, easy-to-use and efficient clustering algorithm - HDBSCAN. It is really amazing. I am not joking.
Quick links:
- paper https://arxiv.org/pdf/1602.03730.pdf
- library http://hdbscan.readthedocs.io/en
List of plain vanilla algorithms:
- (data) https://goo.gl/6KexoU
- K-Means (https://goo.gl/kjbA1f)
- Affinity Propagation (https://goo.gl/VrX4sy)
- Mean Shift (https://goo.gl/TekyML)
- Spectral Clustering https://goo.gl/RUifoa
- Agglomerative Clustering - https://pypi.python.org/pypi/fastcluster
Newer ones
- DBSCAN (https://goo.gl/DQK2Z3)
-- 2 steps
--- transform the space - points in dense regions are left alone, while points in sparse regions are moved further away
--- apply single linkage clustering to the transformed space, which results in a dendrogram that is cut at a distance parameter (eps)
- HDBSCAN (https://goo.gl/XD4y8T)
-- goal was to allow varying density clusters
-- transform the space according to density (!)
-- single linkage clustering on the transformed space
-- the dendrogram is condensed by viewing splits that result in a small number of points splitting off as points ‘falling out of a cluster’
Key evaluation criteria
- Don’t be wrong!
- Intuitive parameters
- Stability
- Performance
Plain English comparison of different algorithms
- http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
- notebook http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb
Naive benchmarking of different clustering algorithms
- http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
- http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Benchmarking%20scalability%20of%20clustering%20implementations-v0.7.ipynb
- speed bench http://hdbscan.readthedocs.io/en/latest/_images/performance_and_scalability_9_1.png
- huge datasets http://hdbscan.readthedocs.io/en/latest/_images/performance_and_scalability_24_1.png
How HDBSCAN works
- http://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html
- Notebook https://goo.gl/iPy23p
- Transform the space according to the density/sparsity
- Build the minimum spanning tree of the distance weighted graph - https://goo.gl/EmQwd8
- Construct a cluster hierarchy of connected components - https://goo.gl/oHZo2W
- Condense the cluster hierarchy based on minimum cluster size - https://goo.gl/awSXjC
- Extract the stable clusters from the condensed tree
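The first two steps above can be sketched in plain Python. This is only an illustrative sketch, not the hdbscan library's implementation: the function names, the toy points and the choice of k are assumptions made for the example. It uses the mutual reachability distance described in the linked docs, d_mreach(a, b) = max(core_k(a), core_k(b), d(a, b)), and then builds a minimum spanning tree with Prim's algorithm.

```python
import math


def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbour (itself excluded)."""
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[k - 1])
    return cores


def mutual_reachability(points, k):
    """Dense matrix of d_mreach(a, b) = max(core_k(a), core_k(b), d(a, b))."""
    cores = core_distances(points, k)
    n = len(points)
    return [
        [
            0.0 if i == j else max(cores[i], cores[j], math.dist(points[i], points[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


def minimum_spanning_tree(weights):
    """Prim's algorithm on a dense weight matrix; returns sorted (weight, u, v) edges."""
    n = len(weights)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge connecting each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v] = weights[u][v]
                parent[v] = u
    return sorted(edges)


# Toy data (assumed for illustration): two tight squares and one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 30)]
mst = minimum_spanning_tree(mutual_reachability(points, k=2))
for weight, u, v in mst:
    print(f"{u} - {v}: {weight:.3f}")
```

Sorting the tree edges by weight is what gives the single linkage hierarchy: the sparse point is attached last and by the heaviest edge, which is the effect of the points in sparse regions being pushed further away.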
Also, HDBSCAN has a notion of soft clustering with a custom cluster distance that works for oddly shaped clusters
- http://hdbscan.readthedocs.io/en/latest/soft_clustering_explanation.html
- http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/How%20Soft%20Clustering%20for%20HDBSCAN%20Works.ipynb
#data_science
#clustering
Digital Ocean just improved their tariffs (more storage and RAM) - best modern VDS provider.
Transferring your application literally takes 2 minutes
- https://goo.gl/AtVLns
#internet
Fast.ai about SF culture
- http://www.fast.ai/2018/01/08/startups/
Nice article about remote working
- https://hackernoon.com/the-stress-of-remote-working-38be5bdcf4da
A subscriber asked about being efficient in a comment under the end-of-2017 article.
Here is my rant in reply
https://spark-in.me/post/plain-efficiency
Let the war in comments begin.
#philosophy
Interesting links / news / reports / data
Technology
- TVs and household items being replaced by smartphones => good for ecology and resources - https://goo.gl/3nw15t
- Once again - Meltdown + Spectre - https://goo.gl/fNrZGV
Internet
- Ben Evans - https://goo.gl/usr11B
- Amazon business structure - https://goo.gl/YKAB9F - hundreds of separate business units
- Uber management planning to sell shares - https://goo.gl/yJMqgc
- Google sold 6M smart speakers in 2017 - https://goo.gl/TVnSyY
- Amazon will use Alexa ... for ads - https://goo.gl/tS3gTU
- Facebook vs fake news https://goo.gl/mabfp6
- Dark side of the Internet - moderation - https://goo.gl/gBcyXx
Mobile
- Apple cripples 3rd party AdTech - https://goo.gl/QdpWwX
- Stats about Facebook chat app - https://newsroom.fb.com/news/2017/12/messengers-2017-year-in-review/
- In the USA, Instagram is dominated by bra commercials - https://goo.gl/Ch7ipB
- Dating apps kill gay bars - https://goo.gl/qyTTk9
- App store 2017 YoY +30% revenue growth - https://goo.gl/xQFBxz
- 50%+ households in the USA are wireless only - https://goo.gl/WUXNRY
ML / DS
- If you have not seen WaveNet speech generation examples go here - https://goo.gl/kbjWXJ
- Apple Maps vs Google Maps - https://goo.gl/yMNth3
-- Looks like Google is constantly using some processing and ML to enhance their maps
-- 3D buildings, small buildings, areas of interest etc
-- Timeline http://prntscr.com/i0kf4x
- Solid state LIDARs will be much cheaper - https://goo.gl/YZomWc
- Creepy ML - Google street images => car models => predictions about race / income / job per household / address / zip-code - https://goo.gl/mTXyW5
- An astronomer shared his experience after spending 3 years getting a Data Science degree - https://goo.gl/KgTmNp
#digest
Interesting datasets from Kaggle
Predict breast cancer from slide images
https://goo.gl/rDxrpZ
High quality academic dataset of 26k images of 41 fruits
https://goo.gl/JLWvLD
Gorgeous illustration of different network algorithms
https://goo.gl/z7oori
Crowd-sourced translation of parallel sentence pairs
https://goo.gl/7ky8Vw
5 years of hourly weather data for 36 cities
https://goo.gl/jjkRSq
#data_science
#datasets
A small hack for using multi-line python CLI commands via bash.
Just paste your long python command into script.sh
python3 train_satellites.py \
--arch linknet34 --batch-size 16 \
--imsize 320 --preset mul_urban --augs True \
--workers 6 --epochs 30 --start-epoch 0 \
--seed 42 --print-freq 50 \
--lr 1e-4 --optimizer adam \
--tensorboard True --tensorboard_images True --lognumber test
Then just:
sh script.sh
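On the Python side, flags like these are typically read with argparse. The sketch below is hypothetical: how train_satellites.py actually parses its arguments is not shown here, so the parser, its defaults and the str2bool helper are assumptions. It illustrates one gotcha with passing True on the command line: argparse's type=bool treats any non-empty string (including False) as true, so a small converter is needed.

```python
import argparse


def str2bool(value):
    """Convert 'True'/'False'-style strings to bool (hypothetical helper)."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Hypothetical parser covering a few of the flags from the command above."""
    parser = argparse.ArgumentParser(description="Hypothetical training CLI")
    parser.add_argument("--arch", default="linknet34")
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--epochs", type=int, default=30)
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--augs", type=str2bool, default=False)
    parser.add_argument("--tensorboard", type=str2bool, default=False)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--arch", "linknet34", "--batch-size", "16", "--augs", "True", "--lr", "1e-4"]
)
print(args)
```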
#data_science
Tested bcolz on a simple premise - how fast can it process 1M (1,3476) feature vectors from a CNN. It also looks like it provides 2-3x compression straight out of the box. Nice.
Blazingly fast!
- https://goo.gl/z1MKmH
#data_science
2017 DS/ML digest 1
Have not done digests for quite some time =)
1. Annual digests
1.1 Google Brain - part one https://goo.gl/VQhZmP, part two https://goo.gl/XkTRhp
Highlights
- Speech generation https://goo.gl/MEDv7M
- Speech recognition https://goo.gl/tCEkVz
- Auto ML https://goo.gl/fx2FuP
-- NASNET - https://goo.gl/becAET
1.2 Posted before - but the WildML 2017 summary is also awesome https://goo.gl/ZFtFVT
2. Datasets
→ YouTube-8M (https://goo.gl/nyP9gp): >7 million YouTube videos annotated with 4,716 different classes
→ YouTube-Bounding Boxes (https://goo.gl/c3K6YY): 5 million bounding boxes from 210,000 YouTube videos
→ Speech Commands Dataset (https://goo.gl/TWsTi8): thousands of speakers saying short command words
→ AudioSet (https://goo.gl/TVA3LJ): 2 million 10-second YouTube clips labeled with 527 different sound events
→ Atomic Visual Actions (AVA) (https://goo.gl/Ba4U73): 210,000 action labels across 57,000 video clips
→ Open Images (https://goo.gl/2Xj8Xd): 9M creative-commons licensed images labeled with 6000 classes
→ Open Images with Bounding Boxes (https://goo.gl/qRkvMy): 1.2M bounding boxes for 600 classes
→ QuickDraw dataset (https://goo.gl/FSsfYm)
3. Uber about a genetic approach to neural networks - https://eng.uber.com/deep-neuroevolution/
#digest
#data_science
#deep_learning
#machine_learning
Just found out about Facebook's fastText
- https://github.com/facebookresearch/fastText
Seems to be really promising
#data_science
#nlp
Wine 3.0 - the third major version of the Windows system call emulator has been released. This is the system that underlies most ports of old games to Mac and Linux. As of today it is also the only decent way to run Microsoft Office and the latest Photoshop on Linux. I am always amazed that the team has had the enthusiasm to keep developing this product for more than ten years. Big congratulations to the team!
This release brings almost full compatibility with the base levels of DirectX/3D 11, plus Android support.
https://www.winehq.org/news/2018011801
Nice presentation to learn about Semantic Segmentation
http://slides.com/vladimiriglovikov/title-texttitle-text/fullscreen#/0/5
https://www.youtube.com/watch?v=MYp3OwkiJAs
#data_science
#deep_learning
Internet digest
- Ben Evans - https://goo.gl/Cymhkf
- New post about chain effects in retail / TV / technology - https://goo.gl/gwuynK
- 39M smart speakers in the US https://goo.gl/nkvUc4
- US$1bn ticketing IPO in China - https://goo.gl/Zt1CmZ
Social Media
- FB updates its news feed algorithm to promote content you are more likely to interact with
https://newsroom.fb.com/news/2018/01/news-feed-fyi-bringing-people-closer-together/
Trivia
- Magnetic disks work after 30 years - https://goo.gl/oWoaWi
- Self-driving cars being DEPLOYED for the SECOND time in a district with retired people - https://goo.gl/AKowqX
#internet
#digest
TF speech competition ended.
- https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/leaderboard
In my opinion it was a very interesting domain, but on day one it was apparent that there was a public repo with 87% accuracy. So I guess 90% is a decent improvement, but judging by team sizes - it is just stacking. Also, in such competitions there is no chance of winning money. Also also - this was just blatant TF marketing.
New competitions
- https://www.kaggle.com/c/data-science-bowl-2018
-- This year it sucks - small prizes, small data, it will be just stacking 100 Unets =( Last year I was too inexperienced to participate =(
- https://goo.gl/qXPUoG - Intel Movidius competition. It also sucks - because you have to use only limited types of hardware and software. Basically this is a marketing campaign
#data_science
#competitions
Next hobby project?
Something more community related - like making a pack of ML-themed stickers – 3
👍👍👍👍👍👍👍 75%
Satellite imaging => roads => road graphs – 1
👍👍 25%
GANs for specific domain search problem
▫️ 0%
Your idea (message me)
▫️ 0%
👥 4 people voted so far.
Amazing spoofers
- looks like they scrape domains periodically
- looks like they know basic domain registration timelines
- I chose to protect my whois information => they scrape websites for emails
- https://pics.spark-in.me/upload/26e127ea9c9aed8d004528f1d599defe.jpg
#internet
Interesting - somebody is developing linear regression bindings for PyTorch
- https://twitter.com/i/web/status/950712702325936128
#data_science
Now I know how to make my remote learning rig perfect - add a hardware reboot watchdog
https://aminux.wordpress.com/2018/01/12/usb-watchdogs-opendev-vs-bitroleum/
#hardware
A nice post of ML predictions for 2018
- https://blog.goodaudience.com/ai-in-2018-for-researchers-8955df0caaf9
#data_science
Trick for image preprocessing - histogram equalization
- http://scikit-image.org/docs/dev/auto_examples/color_exposure/plot_equalize.html
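For intuition, here is a minimal pure-Python sketch of the classic histogram equalization formula for an 8-bit grayscale image. It is not scikit-image's implementation (the linked example uses the library's own exposure functions), and the function name and the toy image are assumptions for illustration. Each intensity is remapped through the cumulative histogram, so that a narrow range of values is stretched over the full 0-255 range.

```python
def equalize_hist(image, levels=256):
    """Equalize a grayscale image given as rows of ints in [0, levels)."""
    flat = [px for row in image for px in row]

    # Histogram of intensities.
    hist = [0] * levels
    for px in flat:
        hist[px] += 1

    # Cumulative distribution of the histogram.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    n = len(flat)
    cdf_min = next(c for c in cdf if c > 0)
    if n == cdf_min:
        # Constant image: nothing to stretch.
        return [row[:] for row in image]

    # Lookup table: map each intensity to its rescaled CDF value.
    lut = [
        max(0, round((c - cdf_min) / (n - cdf_min) * (levels - 1)))
        for c in cdf
    ]
    return [[lut[px] for px in row] for row in image]


# Toy low-contrast image (assumed for illustration): values between 100 and 103.
image = [[100, 100, 101],
         [101, 102, 103]]
for row in equalize_hist(image):
    print(row)
```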
#cv
US$1 million prize US-citizen exclusive Kaggle challenge ... for just stacking Resnets?
- https://www.kaggle.com/c/passenger-screening-algorithm-challenge/discussion/45805
America is fucked up bad...
Also notice the shake-up and top scores
- Public https://goo.gl/2utoDC
- Private https://goo.gl/GXpnWe
#data_science
#sick_sad_worlds