[OC] Yeah Science! Scientific Output vs. National Wealth
/r/dataisbeautiful
https://redd.it/ztjxbn
Interactive map: Coldest day of the year across the United States
https://www.climate.gov/news-features/understanding-climate/interactive-map-coldest-day-year-across-united-states
/r/dataisbeautiful
https://redd.it/zu1vuj
[OC] Big fan of IASIP here! Put together a fun little visual. Happy holidays!
/r/dataisbeautiful
https://redd.it/ztrxrf
US Prison Population and Crime Rate over time [OC]
/r/dataisbeautiful
https://redd.it/ztzln1
[OC] Mexico welcomes nearly half of all international arrivals into Latin America + the Caribbean each year.
/r/dataisbeautiful
https://redd.it/ztrbdj
[P] Extracting and Structuring Recipes Using GPT-3
I've been experimenting with GPT-3 for different use cases over the past few weeks. The latest was seeing how well it could parse structured data out of recipe free text, and how well it could further enrich that data.
The general idea was to chain a few different prompts to the model, with the output of one prompt feeding into the next:
1. Extract ingredients and instructions from the recipe
2. Given the ingredients, group them together into categories
3. Given the full structured recipe generated above, enrich it further with additional metadata (time to cook, healthiness, etc)
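A minimal sketch of how such a prompt chain might look with the OpenAI Python client (the prompts, model choice, and helper function here are my own assumptions, not the exact setup from the post):

```python
# Minimal sketch of the prompt chain: each step's output is pasted into the
# next prompt. Prompts and model are assumptions, not the post's exact setup.
import openai

openai.api_key = "sk-..."  # your API key

def complete(prompt: str) -> str:
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=512,
        temperature=0,
    )
    return resp["choices"][0]["text"].strip()

recipe_text = "..."  # free-text recipe goes here

# 1. Extract ingredients and instructions
extracted = complete(
    "Extract the ingredients and instructions from this recipe as JSON with "
    "keys 'ingredients' and 'instructions':\n\n" + recipe_text
)

# 2. Group the ingredients into categories (grains, dairy, etc.)
grouped = complete(
    "Group these ingredients into categories such as grains, dairy, and "
    "produce. Return JSON:\n\n" + extracted
)

# 3. Enrich the structured recipe with additional metadata
enriched = complete(
    "Given this structured recipe, add metadata such as total cook time and "
    "a healthiness rating. Return JSON:\n\n" + grouped
)
print(enriched)
```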
This worked out better than I expected - given an input recipe I'm able to consistently (and accurately) extract the constituent parts and group the ingredients together logically (like grains, dairy, etc).
I wrote about it here: https://binal.pub/2022/12/extracting-and-structuring-recipes-using-gpt3/
Another thing that surprised me was that this turned out to be a decent recipe generator. Instead of a full recipe I could input just "Pumpkin Pie", and the structured response at the end would be the ingredients and instructions to bake a pumpkin pie, with quantities and timings about what you'd expect.
/r/MachineLearning
https://redd.it/zrzhtq
[OC] Top 10 most used words by subreddit in 2022 (reuploaded)
https://redd.it/zt0zab
@datascientology
[Discussion] Anyone else having a hard time not getting mad/cringing at the general public anthropomorphizing the hell out of ChatGPT?
It was one thing with DALL-E 2, but at least that couldn't talk back to them. I have been in board meetings where powerful people in leadership positions who have nothing to do with tech hold absolutely horrendous ideas about what ChatGPT is. I am not lying: I have genuinely heard them say they believe it's basically conscious, and they use screenshots of excerpts where it says it hates humans as a basis for business decisions about the future of AI in their company. Like... WHAT? Have other people heard absurd things like this too?
I think it's just hard to watch the professional reality of machine learning become so detached from the general public's idea of machine learning. I'm sure as we all get even better at our jobs it's only going to get much, much worse. I wouldn't be surprised if we soon become the new magical witches of the world. I'll see you all on the pyres in 20 years. (OK, really, I'm just joking about that last part.)
What do you all think?
/r/MachineLearning
https://redd.it/ztbsf5
Wheel of Emotional Granularity [by Abby VanMuijen with Michelle McGhee]
/r/Infographics
https://redd.it/zs39x4
Sample Peyote: generate multi-table synthetic data on any topic using GPT-3
Last weekend I created a tool that uses GPT-3 to generate synthetic datasets. I call it Sample Peyote because it hallucinates sample datasets.
Here's a Star Wars dataset that it generated. There are several more examples linked from the README on GitHub. The source code is there, too.
This was mostly a kick-the-tires project to understand what GPT is capable of, but I wanted it to be based in a real workflow with nontrivial requirements:
* **Start from scratch**: Most synthetic data generators work by taking a sample of real data and generating a fake dataset that has similar properties. I want to generate (aka "hallucinate") data starting from just an idea.
* **Cover any topic**: I want to be able to generate data related to many different topics.
* **Generate a database, not just a table**: I don't just want to generate a table. I want to generate a realistic-feeling database, with multiple tables and realistic use of things like foreign keys, ENUMs, and timestamps.
* **Pass the Enhance That! test**: Generate data that "feels authentic."
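The actual implementation is in the linked repo; purely as a rough sketch of the general idea (the prompts, model, and helper function are my assumptions), hallucinating a multi-table dataset from just a topic could look like:

```python
# Rough sketch of the idea: ask GPT-3 for a multi-table schema on a topic,
# then ask it to fill the tables with rows. Prompts and model are assumptions,
# not the actual Sample Peyote code.
import openai

openai.api_key = "sk-..."  # your API key

def complete(prompt: str) -> str:
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=800, temperature=0.7
    )
    return resp["choices"][0]["text"].strip()

topic = "Star Wars"

# Step 1: hallucinate a schema with several related tables.
schema = complete(
    f"Write CREATE TABLE statements for a realistic SQL database about {topic}. "
    "Use 3 or 4 tables with primary keys, foreign keys, ENUMs, and timestamps."
)

# Step 2: hallucinate rows that stay consistent with that schema.
inserts = complete(
    "Given this schema, write INSERT statements with 10 plausible rows per "
    "table, keeping foreign keys consistent across tables:\n\n" + schema
)

print(schema, inserts, sep="\n\n")
```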
I'd love feedback, and ideas for use cases.
/r/datasets
https://redd.it/zrr2yr
TIL: the Uruguay river is mostly not in Uruguay and it doesn't start there.
/r/MapPorn
https://redd.it/zt3azv
Hey guys, I have 19 days to prepare for a 2-hour onsite Data Science interview. How should I prepare for it to maximise my chances?
The company is in the **aerospace and defense sector**.
Job Requirements:
* Good understanding of Python
* Solid foundations in statistics and machine learning, in particular forecasting methods
* Solid knowledge of database operations (SQL and NoSQL)
* A strong taste for creating impactful visualizations (dataviz)
* Experience, if possible in an industrial context, with at least one dashboard creation software/library and data visualization libraries (PowerBI, Grafana, Tableau, Dash, Shiny, D3.js, ...)
If you have roadmaps for preparing for this kind of interview, please share them with me; it's my dream job and I am willing to work hard to get it. I have some basics in data science, statistics, and probability, but I want to start from scratch.
/r/datascience
https://redd.it/zsvcys
[OC] NFL wide receiver Justin Jefferson is on pace to break the single-season receiving yards record
/r/dataisbeautiful
https://redd.it/ztwgd5
Compared to your other family members, how would you rate yourself as a gift giver on a scale of 0-100? [OC]
/r/dataisbeautiful
https://redd.it/ztjykq
Inequality in annual earnings worsens in 2021: Top 1% of earners get a larger share of the earnings pie while the bottom 90% lose ground
https://www.epi.org/publication/inequality-2021-ssa-data/?utm_source=sillychillly
/r/dataisbeautiful
https://redd.it/ztrxaf
Take a moment. How certain are you that you are currently awake and not dreaming? (Everyone)
https://forms.gle/B16YJVwH5feK4HQKA
/r/SampleSize
https://redd.it/ztpcdv
[Q] If randomization can't guarantee balanced proportions of potentially influential characteristics within experimental groups, how useful are quasi-causal analyses like matching?
/r/statistics
https://redd.it/ztf5cv
Master u/etoipiplus1, I salute you with this Desmos homage to your fractal creations
/r/mathpics
https://redd.it/zpquyd
[OC] Animation of Arctic blast sweeping down and across Canada and the US
/r/dataisbeautiful
https://redd.it/ztkgvh
Is your country in green or slightly different green?
/r/dataisugly
https://redd.it/ztik6h
[P] App that Determines Whether You've Been Naughty or Nice Based on Your Reddit Comments
Hex Application
Since we are heading into the holiday season, I thought it would be interesting to see whether you could build a model that judges morality from a user's Reddit comments. I used Scikit-Learn's logistic regression model for this.
I started by downloading around 750 comments from Social Grep's website, which hosts Reddit comments pulled from different sets of subreddits. I pulled from their datasets for confession-like subreddits, the irl subreddits, and the dataset subreddit. I classified the comments manually against a fixed set of morality rules, and once they were scored, I trained and tested the logistic regression model on those comments.
For the user-specific testing, I used PRAW to pull the 50 most recent comments from the username provided in the Hex application. I ran the trained model to output the probability of each comment being nice, averaged those probabilities, and used that value to determine whether the user was naughty or nice. A script then emails the user a CSV with all of the tested comments and the final score.
Based on the results that have come through so far, the model is definitely biased towards giving the user a "nice" decision. I believe that's because the training data is roughly 70% nice versus naughty. Does anyone have a way to keep the model from being biased like that?
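Not from the original post, but one common first step against this kind of class imbalance in scikit-learn is class weighting; a minimal sketch (the comments, labels, and features below are placeholders, not the app's real pipeline):

```python
# Minimal sketch: logistic regression with class weighting to counter the
# ~70/30 nice/naughty imbalance. Data and features are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

comments = ["thanks, this was really helpful!", "you are all idiots"]  # placeholder
labels = [1, 0]  # 1 = nice, 0 = naughty (placeholder labels)

model = make_pipeline(
    TfidfVectorizer(),
    # class_weight="balanced" reweights classes inversely to their frequency,
    # so the minority "naughty" class is not drowned out during training.
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(comments, labels)

# Score a user the same way as the post describes: average the per-comment
# probability of the "nice" class over their recent comments.
user_comments = ["merry christmas everyone", "ugh, terrible take"]  # placeholder
nice_prob = model.predict_proba(user_comments)[:, 1].mean()
verdict = "nice" if nice_prob >= 0.5 else "naughty"
print(verdict, nice_prob)
```

Other options worth trying are oversampling the minority class or simply moving the 0.5 decision threshold to compensate for the skew.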
Feel free to try the app out and let me know what you think!
/r/MachineLearning
https://redd.it/zspu96
[OC] The cost of Christmas varies widely across the world, from less than $100 to over $2000
/r/dataisbeautiful
https://redd.it/ztbovn
How a spider builds an orb web: generate high tensile strength anchor, bridge & frame threads; thin radius threads, and a sticky capture spiral
/r/Infographics
https://redd.it/zta47e
Trinidad and Tobago’s Annual Air Visitor Arrivals
/r/Infographics
https://redd.it/zsrbss
Countries where there were executions in 2022
https://redd.it/zsy3lt
@datascientology
[P] A self-driving car using an Nvidia Jetson Nano, with movement controlled by a pre-trained convolutional neural network (CNN) written in Taichi
Intro & source code: https://github.com/houkensjtu/taichi-hackathon-akinasan
1. The circuit of an ordinary RC toy car is modified so that the Jetson Nano can control the car's movement through its GPIO port. A motor driver board is needed here, because the Jetson Nano's maximum output current is not enough to drive the car's motor directly.
2. The convolutional neural network (CNN) is implemented in the Taichi programming language.
3. Road data was collected, then classified and labeled, and finally used to train the CNN model.
4. The pre-trained model is loaded onto the Jetson Nano, which predicts driving actions from the images captured while driving.
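The repository has the actual control code; purely as an illustration of step 1, driving an H-bridge-style motor controller from the Jetson's GPIO header could look roughly like this (pin numbers and wiring are assumptions, not taken from the project):

```python
# Rough illustration of GPIO motor control on a Jetson Nano through an
# H-bridge motor driver (e.g. L298N-style IN1/IN2 direction inputs).
# Pin numbers and wiring are assumptions, not the project's actual code.
import time
import Jetson.GPIO as GPIO

IN1, IN2 = 35, 36  # assumed header pins wired to the driver's direction inputs

GPIO.setmode(GPIO.BOARD)  # use physical header pin numbering
GPIO.setup(IN1, GPIO.OUT, initial=GPIO.LOW)
GPIO.setup(IN2, GPIO.OUT, initial=GPIO.LOW)

def forward():
    GPIO.output(IN1, GPIO.HIGH)
    GPIO.output(IN2, GPIO.LOW)

def backward():
    GPIO.output(IN1, GPIO.LOW)
    GPIO.output(IN2, GPIO.HIGH)

def stop():
    GPIO.output(IN1, GPIO.LOW)
    GPIO.output(IN2, GPIO.LOW)

try:
    forward()
    time.sleep(1.0)  # drive forward for a second
    stop()
finally:
    GPIO.cleanup()  # release the pins
```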
Demo:
https://reddit.com/link/zshrlv/video/pcm3f6id3f7a1/player
/r/MachineLearning
https://redd.it/zshrlv
[D] When ChatGPT stops being free: run a SOTA LLM in the cloud
TL;DR: I found GPU compute to be generally cheap: spot or on-demand instances with over 100 GB of vRAM can be launched on AWS for a few USD per hour. So I thought it would make sense to run your own SOTA LLM inference endpoint, like BLOOMZ 176B, whenever you need a few questions answered. That still seems more sensible than shoving money into a closed walled garden like "not-so-OpenAI" once they make ChatGPT or GPT-4 available for $$$. But I'm struggling due to a lack of tutorials/resources.
Therefore, I carefully checked benchmarks, model parameters and sizes, as well as training sources, for all the SOTA LLMs here.
Since reading the Chinchilla paper, I've known that OpenAI's original take on model scaling was wrong and that more params != better quality generation. So I was looking for the best-performing openly available LLM, in terms of quality and breadth, to use for multilingual everyday questions / code completion / reasoning, similar to what ChatGPT provides (minus the fine-tuning for chat-style conversations).
My choice fell on BLOOMZ, because it handles multilingual questions well and has good zero-shot performance for instruction-following and Q&A-style text generation. Confusingly, Galactica seems to outperform BLOOM on several benchmarks, but since Galactica was trained on a very narrow corpus of only scientific papers, I guess its usefulness for non-scientific topics is probably limited.
So I tried running the original BLOOM 176B, and alternatively BLOOMZ 176B, on AWS SageMaker JumpStart, which should be a one-click deployment. It fails after 20 minutes. On Azure ML, I tried DeepSpeed-MII, which also supports BLOOM, but that fails too, I guess due to the maximum instance size of 12 GB vRAM.
From my understanding, to save costs on inference it's probably possible to use one or more of the following:
- Precision: int8 instead of fp16
- Microsoft/DeepSpeed-MII for up to a 40x reduction in inference cost on Azure; it also supports int8 and fp16 BLOOM out of the box, but it fails on Azure due to instance size.
- facebook/xformers: not sure, but if I remember correctly this brought inference requirements for Stable Diffusion down to 4 GB vRAM and DreamBooth fine-tuning down to 10 GB. No idea whether it's useful for reducing BLOOM(Z) inference cost, though.
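For reference, a minimal sketch of what int8 loading looks like with Hugging Face transformers plus accelerate and bitsandbytes (this is the generic API, not a verified 176B cloud deployment; BLOOMZ-176B still needs on the order of 200 GB of total GPU memory even in int8):

```python
# Minimal sketch of int8 inference via transformers + accelerate + bitsandbytes.
# Generic API only; not a tested 176B deployment. Swap in a small checkpoint
# (e.g. bigscience/bloomz-7b1) to try it on a single GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloomz"  # or a smaller BLOOMZ checkpoint for testing

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # accelerate shards the model across available GPUs
    load_in_8bit=True,   # bitsandbytes int8 weights, roughly half the vRAM of fp16
)

prompt = "Explain the Chinchilla scaling laws in one sentence:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```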
I have a CompSci background, but I'm not familiar with most of this stuff, beyond having run Stable Diffusion since day one on my RTX 3080 under Linux and doing fine-tuning with DreamBooth. But that was all just following YouTube tutorials. I can't find a single post or YouTube video of anyone explaining a full BLOOM / Galactica / BLOOMZ inference deployment on cloud platforms like AWS/Azure using one of the optimizations mentioned above, let alone deployment of the raw model. :(
I still can't figure it out by myself after 3 days.
TL;DR 2: I'm trying to find like-minded people who are interested in running open-source SOTA LLMs, whether for when ChatGPT goes paid or just for fun.
Any comments, inputs, rants, counter-arguments are welcome.
/end of rant
/r/MachineLearning
https://redd.it/zstequ