datascientology | Education

Telegram channel datascientology - Data Scientology

1234

Hot data science related posts every hour. Chat: https://telegram.me/r_channels Contacts: @lgyanf

Data Scientology

Machine Learning on Spherical Manifold [R]
https://eesuck1.github.io/machine-learning-on-spherical-manifold/

https://redd.it/1tif576
@datascientology

Data Scientology

arXiv implements 1-year ban for papers containing incontrovertible evidence of unchecked LLM-generated errors, such as hallucinated references or results. [N]

From Thomas G. Dietterich (arXiv moderator for cs.LG) on 𝕏 (thread):
https://x.com/tdietterich/status/2055000956144935055
https://xcancel.com/tdietterich/status/2055000956144935055

"Attention arXiv authors: Our Code of Conduct states that by signing your name as an author of a paper, each author takes full responsibility for all its contents, irrespective of how the contents were generated.

If generative AI tools generate inappropriate language, plagiarized content, biased content, errors, mistakes, incorrect references, or misleading content, and that output is included in scientific works, it is the responsibility of the author(s).

We have recently clarified our penalties for this. If a submission contains incontrovertible evidence that the authors did not check the results of LLM generation, this means we can't trust anything in the paper.

The penalty is a 1-year ban from arXiv followed by the requirement that subsequent arXiv submissions must first be accepted at a reputable peer-reviewed venue.

Examples of incontrovertible evidence: hallucinated references, meta-comments from the LLM ("here is a 200 word summary; would you like me to make any changes?"; "the data in this table is illustrative, fill it in with the real numbers from your experiments")."

https://redd.it/1tdje2d
@datascientology

Data Scientology

Odyseus - Spatial VLM : Projecting 2D reasoning into 3D outputs (open source repo)

https://redd.it/1t9tomn
@datascientology

Data Scientology

Open Source project to calibrate for fisheye cameras

https://redd.it/1t57dip
@datascientology

Data Scientology

Is it just me or is the Conference Lottery culture killing research? [D]

I need to vent before I completely burn out. My supervisor has started treating major conferences like weekend hackathons, and I'm losing my mind. We are told to come up with something to submit roughly two weeks before the deadline, and he doesn't even care if it gets rejected. Apparently, the experience of trying is the goal.

It's no wonder top-tier conferences receive tens of thousands of submissions, and I hate my life.

https://redd.it/1t0mct7
@datascientology

Data Scientology

I did 15 AI Engineer interviews in the last 6 months [R]

I’ve spent the last half of 2025 in interview hell. I walked into my first few rounds prepared for deep math proofs, Transformer internals, and heavy LeetCode, but almost none of that came up. 

What they asked was way more practical, and I failed the first three rounds because I was over-preparing for the wrong things. Recruiters don't want a lecture on attention mechanisms anymore, they want to hear about your decisions.

Whenever I walked through a project, the questions were always: "Why RAG instead of fine-tuning for this?" or "How did you actually evaluate the hallucinations?" I failed early on because I’d just say, "I built a PDF chat app." Now, I lead with the trade-offs. 

I explain that I chose RAG because fine-tuning was too expensive for the dataset, used MiniLM for speed, and implemented a semantic chunking strategy that dropped the hallucination rate by 40%. That shift in how I talked about my work changed everything.

Another huge factor is cost and latency. I got my best offer because I could explain exactly how I cut inference costs by 60% using a hybrid local/cloud setup with Phi-3.5-mini and aggressive request caching. 
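The request caching mentioned here could look something like the sketch below. Everything in it is hypothetical: `call_model` stands in for whatever local or cloud model is actually called, and the whitespace/case normalization of the cache key is an assumption, not the poster's setup.

```python
import hashlib


class CachedClient:
    """Wraps a model-calling function with an exact-match response cache."""

    def __init__(self, call_model):
        self._call_model = call_model  # hypothetical: any function str -> str
        self._cache = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt):
        # Normalize whitespace and case so trivially different prompts share an entry.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def complete(self, prompt):
        key = self._key(prompt)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # Cache miss: pay for one model call and remember the answer.
        self.misses += 1
        response = self._call_model(prompt)
        self._cache[key] = response
        return response
```

The ratio of `hits` to total calls gives a rough estimate of how many paid requests the cache avoids, which is the kind of number an interviewer is likely to ask about.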

Companies want to know you aren't just burning GPU credits for fun. During live coding, they usually just had me "build a simple retriever" or fix a hallucination. I used to code in silence and fail; now, I narrate the whole time. 

If I’m using a FAISS flat index, I explain it’s for a small dataset but mention I’d pivot to HNSW for speed if we hit a million vectors. They don't want perfect code, they want to hear you architecting out loud.
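For reference, a "simple retriever" of the kind described could be sketched as below. A flat index does an exhaustive search: the query is scored against every stored vector, which is exact but scales linearly with the number of vectors (HNSW trades some exactness for speed with a graph-based approximate search). This sketch uses only the standard library; the bag-of-words `embed` function is a toy assumption standing in for a real sentence encoder, and `FlatRetriever` is a hypothetical name, not a FAISS class.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: a real system uses a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class FlatRetriever:
    """Exhaustive search: scores the query against every stored document."""

    def __init__(self, documents):
        self.documents = list(documents)
        self.vectors = [embed(d) for d in self.documents]

    def search(self, query, k=2):
        q = embed(query)
        scored = [(cosine(q, v), i) for i, v in enumerate(self.vectors)]
        # Highest score first; ties broken by insertion order.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(self.documents[i], score) for score, i in scored[:k]]
```

Calling `FlatRetriever(docs).search("some query", k=3)` returns up to three `(document, score)` pairs, best match first.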

The next time you’re in a technical round, don't just describe what you built. Describe why you didn't build it the other way. Showing that you weighed the cost of tokens against the accuracy of the model is exactly what separates a hobbyist from a senior engineer.



https://redd.it/1swxcvo
@datascientology

Data Scientology

INT3 compression + fused Metal kernels [R]

Hey guys, I am a researcher and solo founder. I compress models with INT3 at +0.14 nats and built a 2-bit KV cache for long-horizon tasks. I shipped both (INT3 model + INT2 KV) with custom fused Metal kernels for Mac (M-series). Currently Qwen 7B is available in preview.

#install
brew install reinforceai/spiral/spiral

#chat
spiral-chat



I am optimizing the kernels further and working on Triton kernels for GPU support. There is still room to pack more efficiently, and I will share more models soon. I would appreciate any feedback, or suggestions for any model under 100B parameters that you want me to compress.
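For readers unfamiliar with 3-bit weights, here is a generic sketch of what INT3 quantization means. It is plain round-to-nearest symmetric quantization with one scale per group, written as an assumption for illustration; the post does not describe its actual scheme, and this is not the Spiral method. The function names and `group_size` are hypothetical, and packing the 3-bit codes into bytes is left out.

```python
def quantize_int3(weights, group_size=4):
    """Symmetric per-group quantization to 3-bit signed integers in [-4, 3]."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to the code 3 (or -3).
        scale = max_abs / 3 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            codes.append(max(-4, min(3, q)))  # clamp to the 3-bit signed range
    return codes, scales


def dequantize_int3(codes, scales, group_size=4):
    """Reconstruct approximate weights from codes and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]
```

Each weight is stored as one of eight integer codes plus a shared floating-point scale per group, so the reconstruction error for a weight is at most half of its group's scale.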

github.com/ReinforceAI/spiral

https://redd.it/1ssdt0z
@datascientology

Data Scientology

Was looking at an ICLR 2025 Oral paper and I am shocked it got an oral [D]

After my last post about score analysis of ICLR, I am looking into the review itself now.

They evaluated LLM SQL code generation using a natural-language metric rather than an execution metric, and when they tested it they found a false positive rate of around 20%. This is a major flaw. How did it even get an oral?
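To make the distinction concrete, here is a minimal sketch of an execution-based check using Python's built-in `sqlite3`. It is an illustration under assumptions, not the paper's evaluation code: the `papers` table, the example queries, and the `execution_match` helper are all hypothetical. The idea is that two queries count as equivalent only if they return the same rows, whereas a text-similarity metric can rate a near-identical but wrong query highly (a false positive).

```python
import sqlite3


def execution_match(db_setup_sql, predicted_sql, reference_sql):
    """Return True if both queries produce the same rows on the same database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(db_setup_sql)
        try:
            predicted = conn.execute(predicted_sql).fetchall()
        except sqlite3.Error:
            return False  # a query that fails to run cannot be correct
        reference = conn.execute(reference_sql).fetchall()
        # Compare as sorted lists so row order does not matter
        # (assumes the task does not require a specific ORDER BY).
        return sorted(predicted, key=repr) == sorted(reference, key=repr)
    finally:
        conn.close()


SETUP = """
CREATE TABLE papers (id INTEGER, title TEXT, score INTEGER);
INSERT INTO papers VALUES (1, 'A', 8), (2, 'B', 3), (3, 'C', 6);
"""
REFERENCE = "SELECT title FROM papers WHERE score >= 6"
LOOKALIKE = "SELECT title FROM papers WHERE score > 6"  # one character off, wrong rows
REWRITTEN = "SELECT p.title FROM papers AS p WHERE NOT p.score < 6"  # different text, same rows
```

Here `LOOKALIKE` differs from the reference by a single character but returns different rows, while `REWRITTEN` looks quite different and returns the same rows; only an execution check separates the two cases correctly.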

https://openreview.net/forum?id=GGlpykXDCa

https://redd.it/1slxqac
@datascientology

Data Scientology

For Physical AI applications, why do most robotics companies use 3D cameras?

Hi there! I'm a regular guy working at a company that makes cameras and CCTVs. After watching how BIG "physical AI" was at CES 2026, my boss asked me to do research on whether my company could enter the market with some kind of a robotic vision system/module.

At first, my thought was that we could just start off by making active stereo cameras like RealSense since lots of companies seem to be making heavy use of stereo vision systems in their designs. But as I did more research, I was told multiple times that most calculations are actually done with 2D RGB images, not with the point cloud data which the 3D cameras are intended to produce.

Is this true? Are 3D cameras being used just as a temporary step before moving completely to multiple RGB cameras? Is there any consensus on what robotic vision systems will look like in the future?

Thank you for reading my post.

https://redd.it/1sh9gia
@datascientology

Data Scientology

[D] Thoughts on the controversy about Google's new paper?

Openreview: https://openreview.net/forum?id=tO3ASKZlok

It's sad to see almost no one mention this on Reddit, and people are being mean to those who point out concerns.

Edit: Google is allegedly doing the following in their trending TurboQuant paper:

1. Did not fully attribute a previous work, RaBitQ

2. Made an unfair comparison with RaBitQ (single-core CPU vs. GPU)



https://redd.it/1s7m7rn
@datascientology

Data Scientology

Depth Perception Blender Add-on

https://redd.it/1ropx3s
@datascientology

Data Scientology

[D] Saw this paper from ICLR with scores 2, 2, 2, 4 and it got accepted, HOW

https://openreview.net/forum?id=05hNleYOcG

How is this even possible?

https://redd.it/1qxdaqk
@datascientology

Data Scientology

[D] 100 Hallucinated Citations Found in 51 Accepted Papers at NeurIPS 2025

https://gptzero.me/news/neurips

I remember something similar was shared last month about ICLR, where they found hallucinations in submitted papers, but I didn't expect to see them in accepted papers as well.

https://redd.it/1qjz88r
@datascientology

Data Scientology

Oh how far we've come

https://redd.it/1q7hizl
@datascientology

Data Scientology

[P] The State Of LLMs 2025: Progress, Problems, and Predictions
https://magazine.sebastianraschka.com/p/state-of-llms-2025

/r/MachineLearning
https://redd.it/1pzrfbf

Data Scientology

Slop is making me feel disconnected from AI Research [D]

Hello everyone. This is just a small rant on my part. I’m relatively young, a final year undergrad, and I’ve been interested in AI research since I was in high school. Over that period of time I feel there has been a significant shift in the culture surrounding the research.

While I’ve really enjoyed producing some interesting and creative work, I can’t help but feel that the wave of low quality AI research and researchers is slowly making me frustrated. To just give a summary of what I and many others have seen:

- Papers with hallucinated citations and even prompts left in the text.
- Papers with clearly misleading data that does not tell the whole picture.
- Labs that have built a culture around quantity over quality, pumping out pubs, citing each other, and putting the whole lab on each paper to inflate each student's publication record.
- High schoolers (yes, HIGH SCHOOLERS) increasingly submitting to conferences without really knowing what they are doing, but paying a pretty penny to participate in “research programs” which are really just cash cows taking advantage of the fierce competition. See the post on the subreddit for more info.
- Even the so-called “top labs” producing work that is somewhat misleading or not fully representative. For instance, see what happened recently with TurboQuant.
- Research from “low tier institutions” being drowned out because it is not good for clickbait and farming views on LinkedIn and X, even when it is high quality.

It’s… a lot, I know. Of course these problems have been around for a long time, but I feel that lately they have gotten worse and worse. I was originally drawn to AI research primarily for the creativity and freedom, but ironically AI itself has become a hindrance to the quality of the work being published.

Of course I don’t mean to say that all AI has been bad for ML research, I mean even I use it extensively to help me polish my writing and generate seaborn plots for my data, but that is very very different from just pumping out low quality cookie cutter work.

Anyways, just wondering if anyone else shares similar thoughts. I know I’m relatively young here so maybe some of you have better insights into the broader trends over the decades.

https://redd.it/1tfv0vh
@datascientology

Data Scientology

I built an open-source real-time driver monitoring system that detects drowsiness and driver state from a webcam

https://redd.it/1tbravo
@datascientology

Data Scientology

Getting harassed by an aggressive “independent researcher” demanding very specific citations and phrasing in my paper [D]

Hey Reddit,

I’m a researcher in a niche theoretical CS/ML area. Recently I’ve been dealing with repeated emails from an “independent researcher” that feel like straight-up citation harassment.

This person keeps sending follow-ups (including involving editors) insisting I add multiple citations to his arXiv preprints. It’s not a normal “you should cite this” request — he provides exact suggested paragraphs with specific wording about how his papers are “complementary,” “parallel,” foundational to certain results, etc. He nitpicks my current related-work phrasing (e.g. complaining about words like “encompass”), pushes for changes even after camera-ready deadlines, and follows up when I don’t respond quickly.

He frames it all very politely with phrases like “narrow remaining concerns” and “I would be grateful,” but the persistence, detailed boilerplate text he wants me to insert, and looping in others makes it exhausting and inappropriate.

I understand wanting visibility and relevant work deserves citations. But this level of badgering and trying to dictate exact text in someone else’s paper crosses a line.

Has anyone else experienced this kind of aggressive citation solicitation? Is it becoming more common? Or am I overreacting?
Publish-or-perish is bad enough without having to deal with this.

https://redd.it/1t6vvjc
@datascientology

Data Scientology

Are modern ML PhDs becoming too incremental, or is this just what research looks like now? [D]

I’ve been thinking about the current state of machine learning PhDs, including my own work, and I’d like to hear how others see it.
My impression is that a large fraction of modern ML PhD work follows a fairly predictable pattern: take an existing idea, connect it to another existing idea, apply it in a slightly different setting or community, tune the system carefully, add some benchmark results, and present the method as a new state-of-the-art approach. Another common pattern is mostly empirical: run benchmarks, report observations, provide some analysis, and frame that as the main contribution.
To be clear, I’m not saying this work is useless. Incremental progress matters, and not every PhD needs to invent a new paradigm. But sometimes it feels like many ML PhDs are closer to extended master’s theses: more experiments, more compute, more polished writing, and more benchmarks, but not necessarily a deeper scientific contribution.
What bothers me is that the same pattern appears even in top-tier conference papers. A paper may look strong because it has a clean story, a benchmark win, and good presentation, but after removing the “SOTA” claim, it is not always clear what lasting knowledge remains. Did we learn something general? Did we understand a mechanism better? Did we identify a failure mode? Did we create a reusable method or evaluation protocol? Or did we mostly produce another temporary leaderboard improvement?
I’m also reflecting this back onto my own PhD. I see some of the same patterns in my work, so this is not meant as an attack on others. It is more of a concern about the incentives of the field. ML seems to reward publishable deltas: small method variations, new combinations, benchmark improvements, and convincing empirical stories. But I’m less sure whether it consistently rewards deeper understanding.
So my question is:
Have ML PhDs become lower-quality compared to PhDs in other fields, or is this simply the normal shape of cumulative research in a fast-moving empirical field?
And maybe more importantly:
What separates a genuinely strong incremental ML PhD from one that is basically a collection of polished benchmark papers?

https://redd.it/1t311vb
@datascientology

Data Scientology

The difference between CPU and GPU, explained way too simply.

https://redd.it/1syrnhr
@datascientology

Data Scientology

Tried using seam carving to preserve labels while dramatically reducing image size, and the results are really wild
https://redd.it/1su3q22
@datascientology

Data Scientology

[D] It seems that EVERY DAY there are around 100 - 200 new machine learning papers uploaded to arXiv.
https://arxiv.org/list/cs.LG/recent?skip=0&show=500

https://redd.it/1sqi69n
@datascientology

Data Scientology

ICML 2026: Extending the deadline for reviewer final justifications while not extending it for Author-AC comments was a huge mistake [D]

Just as the title says, I believe the decision to extend the deadline for reviewers to post their final justifications while not allowing authors to contact their ACs was a big misstep. I have a reviewer who, in their final justification, is questioning the reliability of our experimental setup and evaluation, as well as the fairness of our comparisons, issues that were never brought up during the initial review or in their response to our rebuttal. It seems as though they were looking for reasons to justify not wanting to move their score from weak accept. It now feels like, despite having otherwise strong reviews that are leaning accept, this review might tank the paper.



https://redd.it/1sjzr15
@datascientology

Data Scientology

[D] How to break free from LLM's chains as a PhD student?

I didn't realize it, but over the past year I have become overreliant on ChatGPT to write code. I am a second-year PhD student and don't want to end up as someone with fake "coding skills" after I graduate. I hear people say all the time that you should use LLMs to write the boring parts of the code and write the core stuff yourself, but the truth is, LLMs are getting better and better at writing even those parts if you write the prompt well (or at least at giving you a template that you can play around with to cross the finish line). Even PhD advisors are convinced that their students are using LLMs to assist in research work, and they mentally expect quicker results. I am currently trying to cope with imposter syndrome because my advisor is happy with my progress, but deep down I know that not 100% of it is my own output. I have started feeling like LLMs have tied my hands so tightly that I can't function without them.

What would be some strategies to reduce my dependency on LLMs for work?

https://redd.it/1sdmn97
@datascientology

Data Scientology

Building an A.I. navigation software that will only require a camera, a raspberry pi and a WiFi connection (DAY 6)

https://redd.it/1ryiw07
@datascientology

Data Scientology

[P] A Python library processing geospatial data for GNNs with PyTorch Geometric

https://redd.it/1r02y6y
@datascientology

Data Scientology

CV / ML / AI Job Board
https://redd.it/1qqzcsq
@datascientology

Data Scientology

Can You MAKE it!

https://redd.it/1qbqpu8
@datascientology

Data Scientology

Frustrated with the lack of ML engineers who understand hardware constraints

We're working on an edge computing project and it’s been a total uphill battle. I keep finding people who can build these massive models in a cloud environment with infinite resources, but then they have no idea how to prune or quantize them for a low-power device. It's like the concept of efficiency just doesn't exist for a lot of modern ML devs. I really need someone who has experience with TinyML or just general optimization for restricted environments. Every candidate we've seen so far just wants to throw more compute at the problem which we literally don't have. Does anyone have advice on where to find the efficiency nerds who actually know how to build for the real world instead of just running notebooks in the cloud?

/r/computervision
https://redd.it/1q1uchd

Data Scientology

How do you as an AI/ML researcher stay current with new papers and repos? [D]

For those doing AI/ML research or engineering:

1. How do you currently discover and track new research?
2. What's the most frustrating part of your research workflow?
3. How much time per week do you spend on research/staying current?

Genuinely curious how others handle this and how much time you’re spending. Thanks!

/r/MachineLearning
https://redd.it/1pxz7it
