datascientology | Education

Telegram channel datascientology - Data Scientology

Hot data science related posts every hour. Chat: https://telegram.me/r_channels Contacts: @lgyanf

Data Scientology

A physical, working LeNet-1 (1989) built from transparent PCBs, glass and aluminium.

https://redd.it/1uhr1g1
@datascientology

Data Scientology

DeepSWE: new benchmark looking at how well today's frontier models can actually write code [R]

DeepSWE delivers four advances over existing public benchmarks:

Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
High diversity: Tasks span a broad pool of 91 repositories across 5 languages.
Real-world complexity: Prompts are ~half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens.
Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.

The result is a benchmark that reflects how today's frontier coding agents actually perform in software engineering work.

https://preview.redd.it/lacvagyr159h1.png?width=1373&format=png&auto=webp&s=6514340a15d51d7f03da733f08fb3f6a302cac75

It's open-source: https://github.com/datacurve-ai/deep-swe

https://redd.it/1ue0hlp
@datascientology

Data Scientology

C++ tracker for small aerial targets

https://redd.it/1u9eder
@datascientology

Data Scientology

How does the ML community view evolutionary algorithm research? Career implications of an EA PhD? [D]

How does the ML research community feel about evolutionary algorithms? Should I do a PhD in this area?

Quick remark: I know some people in the ML community dunk on evolutionary algorithms because there’s often a better optimizer, but they do have their place, which is what researchers in my community aim to quantify.

Background:

I just finished my first year as a mathematics master’s student working on the theory of evolutionary algorithms (EAs)/randomized search heuristics. I’m fortunate to be on a research assistantship and have already coauthored several papers in strong conferences in our area.

I’ve always been more interested in classical ML/deep learning theory but haven’t had anyone to work with. Researchers in my field, including my advisor, occasionally publish in mainstream ML venues such as AAAI and NeurIPS, but it’s primarily the EA venues.

For a while now, I’ve been independently studying deep learning and statistical learning theory, and I have found intersections with my current research that I plan to pursue for my thesis.

With my current CV, it’s looking like I could get into some of the best PhD programs in my area, but I’m wondering if I should try to go to a more ML-centric PhD, even if it means going to a less prestigious institution/group for the sake of my career.

I’m not sure yet what I want to do after my PhD and a possible postdoc, but I want to keep myself competitive for top-tier opportunities.

What implications might doing an EA PhD have for my career? With strong EA publications, could I get into a good ML PhD program if I pitch myself appropriately? Could staying somewhat outside mainstream ML actually be a good career move, given how competitive and crowded ML has become?

https://redd.it/1u66q3l
@datascientology

Data Scientology

Introducing Papers Without Code [P]

Hi, Niels here from the open-source team at Hugging Face.

I've recently relaunched paperswithcode.co as a source for finding the state of the art (SOTA) across various AI domains, from 3D generation to AI agents. This is done by automatically parsing research papers published on arXiv/Hugging Face, enabling leaderboards to be created. See BrowseComp below as an example (a scatter plot and a table are available for each benchmark).

- Scatter plot (you can hover over the dots to see the models):

https://preview.redd.it/9rz2r3ffcf6h1.png?width=2880&format=png&auto=webp&s=b3f8e7a870802f6ef8227ecc0619e9e1057554b0

- Table:

https://preview.redd.it/qoqriddw5f6h1.png?width=2862&format=png&auto=webp&s=a0034574f693847537037013672fb61daf27b16e

As you can see, I've added support for viewing evals for closed-source models, too, given that many benchmarks are nowadays dominated by them, like GPT-5.5 and Mythos 5. You can always disable viewing closed-source evals with a toggle or in your PwC settings:

https://preview.redd.it/p3k6jt6q6f6h1.png?width=1582&format=png&auto=webp&s=40149e51d6b326a77e53e33baf70d9850b3de365

When you turn them off, here's what the open model leaderboard looks like:

https://preview.redd.it/tg42sin36f6h1.png?width=2838&format=png&auto=webp&s=1330a117ae9b4e0ce6d459493ae9e8f64107310a

Closed-source papers are treated as regular "papers", although they can be any source, like a blog post (given that PwC supports submitting any source beyond arXiv). See the GPT-5.5 or Mythos 5 papers as examples, with their evals at the bottom. Notice the "closed" tag on their evals. Hence, you could jokingly call these "papers without code".

Let me know what you think of this, and whether anything needs to be changed or added!

Kind regards,
Niels

https://redd.it/1u1wq0a
@datascientology

Data Scientology

3D Reconstruction from Video - Class Final Project

https://redd.it/1tx9oss
@datascientology

Data Scientology

LibreYOLO v1.2.0 epic release: 16 model families now supported
https://redd.it/1tt6pl8
@datascientology

Data Scientology

Deep Neural Network that turns any Image into a Playable Game! All on consumer GPUs.

https://redd.it/1tom3oa
@datascientology

Data Scientology

Ultralytics Just Added Semantic Segmentation Models & They Look INSANE

https://redd.it/1tkane3
@datascientology

Data Scientology

Slop is making me feel disconnected from AI Research [D]

Hello everyone. This is just a small rant on my part. I’m relatively young, a final year undergrad, and I’ve been interested in AI research since I was in high school. Over that period of time I feel there has been a significant shift in the landscape regarding the culture surrounding the research.

While I’ve really enjoyed producing some interesting and creative work, I can’t help but feel that the wave of low-quality AI research and researchers is slowly making me frustrated. To give a summary of what I and many others have seen:

- Papers with hallucinated citations and even prompts left in the text.
- Papers with clearly misleading data that does not tell the whole picture.
- Labs that have built a culture around quantity over quality, pumping out pubs, citing each other, and putting the whole lab on each paper to inflate each student's publication record.
- High schoolers... yes, HIGH SCHOOLERS, increasingly submitting to conferences without really knowing what they are doing, but paying a pretty penny to participate in “research programs” that are really just cash cows taking advantage of the fierce competition. See the post on the subreddit for more info.
- Even the so-called “top labs” producing work that is somewhat misleading or not fully representative. For instance, see what happened recently with TurboQuant.
- Research from “low tier institutions” being drowned out because it is not good for clickbait and farming views on LinkedIn and X, even if it is high quality.

It’s… a lot, I know. Of course these problems have been around for a long time, but I feel as if lately they have become more and more exacerbated. I originally felt that I was attached to AI research primarily for the creativity and freedom, but I feel that ironically AI itself has been a hindrance to the quality of work being published.

Of course I don’t mean to say that all AI has been bad for ML research, I mean even I use it extensively to help me polish my writing and generate seaborn plots for my data, but that is very very different from just pumping out low quality cookie cutter work.

Anyways, just wondering if anyone else shares similar thoughts. I know I’m relatively young here so maybe some of you have better insights into the broader trends over the decades.

https://redd.it/1tfv0vh
@datascientology

Data Scientology

I built an open-source real-time driver monitoring system that detects drowsiness and driver state from a webcam

https://redd.it/1tbravo
@datascientology

Data Scientology

Getting harassed by an aggressive “independent researcher” demanding very specific citations and phrasing in my paper [D]

Hey Reddit,

I’m a researcher in a niche theoretical CS/ML area. Recently I’ve been dealing with repeated emails from an “independent researcher” that feel like straight-up citation harassment.

This person keeps sending follow-ups (including involving editors) insisting I add multiple citations to his arXiv preprints. It’s not a normal “you should cite this” request — he provides exact suggested paragraphs with specific wording about how his papers are “complementary,” “parallel,” foundational to certain results, etc. He nitpicks my current related-work phrasing (e.g. complaining about words like “encompass”), pushes for changes even after camera-ready deadlines, and follows up when I don’t respond quickly.

He frames it all very politely with phrases like “narrow remaining concerns” and “I would be grateful,” but the persistence, the detailed boilerplate text he wants me to insert, and the looping in of others make it exhausting and inappropriate.

I understand wanting visibility, and relevant work deserves citations. But this level of badgering and trying to dictate exact text in someone else’s paper crosses a line.

Has anyone else experienced this kind of aggressive citation solicitation? Is it becoming more common? Or am I overreacting?
Publish-or-perish is bad enough without having to deal with this.

https://redd.it/1t6vvjc
@datascientology

Data Scientology

Are modern ML PhDs becoming too incremental, or is this just what research looks like now? [D]

I’ve been thinking about the current state of machine learning PhDs, including my own work, and I’d like to hear how others see it.

My impression is that a large fraction of modern ML PhD work follows a fairly predictable pattern: take an existing idea, connect it to another existing idea, apply it in a slightly different setting or community, tune the system carefully, add some benchmark results, and present the method as a new state-of-the-art approach. Another common pattern is mostly empirical: run benchmarks, report observations, provide some analysis, and frame that as the main contribution.

To be clear, I’m not saying this work is useless. Incremental progress matters, and not every PhD needs to invent a new paradigm. But sometimes it feels like many ML PhDs are closer to extended master’s theses: more experiments, more compute, more polished writing, and more benchmarks, but not necessarily a deeper scientific contribution.

What bothers me is that the same pattern appears even in top-tier conference papers. A paper may look strong because it has a clean story, a benchmark win, and good presentation, but after removing the “SOTA” claim, it is not always clear what lasting knowledge remains. Did we learn something general? Did we understand a mechanism better? Did we identify a failure mode? Did we create a reusable method or evaluation protocol? Or did we mostly produce another temporary leaderboard improvement?

I’m also reflecting this back onto my own PhD. I see some of the same patterns in my work, so this is not meant as an attack on others. It is more of a concern about the incentives of the field. ML seems to reward publishable deltas: small method variations, new combinations, benchmark improvements, and convincing empirical stories. But I’m less sure whether it consistently rewards deeper understanding.

So my question is:
Have ML PhDs become lower-quality compared to PhDs in other fields, or is this simply the normal shape of cumulative research in a fast-moving empirical field?

And maybe more importantly:
What separates a genuinely strong incremental ML PhD from one that is basically a collection of polished benchmark papers?

https://redd.it/1t311vb
@datascientology

Data Scientology

The difference between CPU and GPU, explained way too simply.

https://redd.it/1syrnhr
@datascientology

Data Scientology

Tried using seam carving to preserve labels while dramatically reducing image size, and the results are really wild
https://redd.it/1su3q22
@datascientology

Data Scientology

ShadeNet 28M — Dual-mode PBR material estimation from any RGB image

https://redd.it/1ufmhd4
@datascientology

Data Scientology

I've also been looking for the plane!
https://redd.it/1ucd6rd
@datascientology

Data Scientology

Next-Latent Prediction Transformers [R]

Microsoft Research Preprint

Next-token prediction is myopic. What if transformers learn to predict their own next latent state?

Microsoft Research presents Next-Latent Prediction (NextLat): a self-supervised learning method that teaches transformers to form compact world models for reasoning and planning. It also unlocks up to 3.3x faster inference via self-speculative decoding!

On top of next-token prediction, NextLat trains the transformer to predict its own next latent state given the current latent state and next token.

NextLat has a few key benefits:

1. Representation Learning: NextLat encourages transformers to compress history into compact belief states.
2. Better Data Efficiency: predicting in latent space provides denser supervision than predicting one-hot tokens.
3. Faster Inference: via recursive multi-step lookahead.
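
To make the objective described above concrete, here is a rough sketch of an auxiliary next-latent loss. It is not taken from the paper or its code: the linear predictor `W`, the mean-squared-error loss, the toy sizes, and the random stand-ins for latent states and token embeddings are all assumptions chosen for illustration. In training, a term like this would be added to the usual next-token loss.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8  # toy sequence length and latent width (assumed values)

h = rng.standard_normal((T, d))  # stand-in for the model's latent states h_1..h_T
e = rng.standard_normal((T, d))  # stand-in for token embeddings e_1..e_T
W = rng.standard_normal((2 * d, d)) * 0.1  # hypothetical linear latent predictor


def next_latent_loss(h, e, W):
    """MSE between the predicted and the actual next latent state."""
    # Pair each current latent h_t with the embedding of the next token e_{t+1}.
    inputs = np.concatenate([h[:-1], e[1:]], axis=1)
    pred = inputs @ W   # predicted h_{t+1}
    target = h[1:]      # actual h_{t+1}, treated here as a fixed target
    return float(np.mean((pred - target) ** 2))


loss = next_latent_loss(h, e, W)
print(f"auxiliary next-latent loss: {loss:.4f}")
```

The real method may use a different predictor, loss, or gradient handling; see the linked paper for the actual formulation.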

I'm super excited about this work. Please do check it out below:

💬 Blog: https://jaydenteoh.github.io/blog/2026/nextlat
💻 Code: https://github.com/JaydenTeoh
📝 Paper: https://arxiv.org/abs/2511.05963

https://redd.it/1u84mio
@datascientology

Data Scientology

Which software or tools are used to make these kinds of diagrams or animations?
https://redd.it/1u3bh7r
@datascientology

Data Scientology

More than 80% of researchers at CVPR are Chinese. This speaks volumes about the Chinese nexus in research, and something needs to be done about it. [D]

There are coordinated efforts where people have favoured and jeopardised the double blind review process.

No doubt there is great talent among these 80%, but we have to acknowledge that non-Chinese researchers have been sabotaged, and this was also reflected in the recent leaks of the reviewer data from the top ML conferences (won’t name them but they start with i).

I have also personally faced such discrimination and had a discussion on the subreddit asking others if they have witnessed something similar. It was shocking to learn that this is occurring on a large scale.

The question is how do we stop it, or highlight this? We have to preserve the sanctity of the research.

https://redd.it/1u00gdg
@datascientology

Data Scientology

Built an open-source hub of CV notebooks for almost every real-world use case and model

https://redd.it/1tvgjg0
@datascientology

Data Scientology

It's been a decade
https://redd.it/1tqv73m
@datascientology

Data Scientology

How do ML practitioners select hyperparameters, architectures, etc. for self-supervised representation learning when the loss is non-monotonic? [D]

Non-contrastive SSL methods like BYOL/JEPA/data2vec seem promising, but I have no idea what is being learned, or how well; it’s models all the way down. Maybe I’ve got supervised tasks for which I’d like to see transfer, and I can evaluate linear probe/KNN results during training, but that seems like a way to efficiently abuse researcher degrees of freedom.

I know RankMe is meant to help address this: embed some data and SVD the embedding matrix. A healthy learner should produce an embedding with a high effective rank.
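
For reference, the RankMe computation described above fits in a few lines. This is a minimal sketch, assuming the usual definition (the exponential of the entropy of the normalized singular values, with a small `eps` for numerical stability); the synthetic `healthy` and `collapsed` matrices are made up purely to show the two extremes.

```python
import numpy as np


def rankme(embeddings: np.ndarray, eps: float = 1e-7) -> float:
    """Effective rank: exp of the entropy of the normalized singular values."""
    sigma = np.linalg.svd(embeddings, compute_uv=False)
    p = sigma / sigma.sum() + eps  # normalize singular values into a distribution
    return float(np.exp(-np.sum(p * np.log(p))))


rng = np.random.default_rng(0)
# Embeddings that use all 32 dimensions roughly equally.
healthy = rng.standard_normal((512, 32))
# Embeddings that all lie along a single direction (rank 1).
collapsed = np.outer(rng.standard_normal(512), rng.standard_normal(32))

print("healthy:", rankme(healthy))
print("collapsed:", rankme(collapsed))
```

The score is bounded above by the embedding width, so it is only comparable across runs that share the same width and the same evaluation data.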

But JEPA methods already require an entropy-collapse term like Barlow Twins/SIGREG, so the RankMe criterion just becomes part of training. It gets absorbed into a loss which wasn’t monotonic to begin with, and I ought to be able to inflate it by increasing the penalty weight. Surely it’s no longer an effective criterion, right? What else is there?

https://redd.it/1tmprdm
@datascientology

Data Scientology

Machine Learning on Spherical Manifold [R]
https://eesuck1.github.io/machine-learning-on-spherical-manifold/

https://redd.it/1tif576
@datascientology

Data Scientology

arXiv implements 1-year ban for papers containing incontrovertible evidence of unchecked LLM-generated errors, such as hallucinated references or results. [N]

From Thomas G. Dietterich (arXiv moderator for cs.LG) on 𝕏 (thread):
https://x.com/tdietterich/status/2055000956144935055
https://xcancel.com/tdietterich/status/2055000956144935055

"Attention arXiv authors: Our Code of Conduct states that by signing your name as an author of a paper, each author takes full responsibility for all its contents, irrespective of how the contents were generated.

If generative AI tools generate inappropriate language, plagiarized content, biased content, errors, mistakes, incorrect references, or misleading content, and that output is included in scientific works, it is the responsibility of the author(s).

We have recently clarified our penalties for this. If a submission contains incontrovertible evidence that the authors did not check the results of LLM generation, this means we can't trust anything in the paper.

The penalty is a 1-year ban from arXiv followed by the requirement that subsequent arXiv submissions must first be accepted at a reputable peer-reviewed venue.

Examples of incontrovertible evidence: hallucinated references, meta-comments from the LLM ("here is a 200 word summary; would you like me to make any changes?"; "the data in this table is illustrative, fill it in with the real numbers from your experiments")."

https://redd.it/1tdje2d
@datascientology

Data Scientology

Odyseus - Spatial VLM : Projecting 2D reasoning into 3D outputs (open source repo)

https://redd.it/1t9tomn
@datascientology

Data Scientology

Open-source project for calibrating fisheye cameras

https://redd.it/1t57dip
@datascientology

Data Scientology

Is it just me or is the Conference Lottery culture killing research? [D]

I need to vent before I completely burn out. My supervisor has started treating major conferences like weekend hackathons, and I'm losing my mind. We are told to come up with something to submit roughly two weeks before the deadline, and he doesn't even care if it gets rejected. Apparently, the experience of trying is the goal.

It's no wonder top-tier conferences receive tens of thousands of submissions, and I hate my life.

https://redd.it/1t0mct7
@datascientology

Data Scientology

I did 15 AI Engineer interviews in the last 6 months [R]

I’ve spent the last half of 2025 in interview hell. I walked into my first few rounds prepared for deep math proofs, Transformer internals, and heavy LeetCode, but almost none of that came up. 

What they asked was way more practical, and I failed the first three rounds because I was over-preparing for the wrong things. Recruiters don't want a lecture on attention mechanisms anymore, they want to hear about your decisions.

Whenever I walked through a project, the questions were always: "Why RAG instead of fine-tuning for this?" or "How did you actually evaluate the hallucinations?" I failed early on because I’d just say, "I built a PDF chat app." Now, I lead with the trade-offs. 

I explain that I chose RAG because fine-tuning was too expensive for the dataset, used MiniLM for speed, and implemented a semantic chunking strategy that dropped the hallucination rate by 40%. That shift in how I talked about my work changed everything.

Another huge factor is cost and latency. I got my best offer because I could explain exactly how I cut inference costs by 60% using a hybrid local/cloud setup with Phi-3.5-mini and aggressive request caching. 

Companies want to know you aren't just burning GPU credits for fun. During live coding, they usually just had me "build a simple retriever" or fix a hallucination. I used to code in silence and fail; now, I narrate the whole time. 

If I’m using a FAISS flat index, I explain it’s for a small dataset but mention I’d pivot to HNSW for speed if we hit a million vectors. They don't want perfect code, they want to hear you architecting out loud.
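
As an illustration of the "build a simple retriever" exercise, here is a pure-Python sketch of exhaustive top-k search, which is what a flat index does conceptually: score every stored vector against the query and keep the best matches. It does not use FAISS itself, and the function names and toy vectors are invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, doc_vecs, k=2):
    """Flat search: score every document, return the top-k (index, score) pairs."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy document embeddings (made up for illustration).
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
top = retrieve([1.0, 0.05, 0.0], docs, k=2)
print(top)
```

Each query costs time proportional to the number of stored vectors, which is fine for a small corpus. That linear cost is the reason to move to an approximate graph index such as HNSW at larger scale, trading a little recall for speed.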

The next time you’re in a technical round, don't just describe what you built. Describe why you didn't build it the other way. Showing that you weighed the cost of tokens against the accuracy of the model is exactly what separates a hobbyist from a senior engineer.



https://redd.it/1swxcvo
@datascientology

Data Scientology

INT3 compression + fused Metal kernels [R]

Hey guys, I am a researcher and solo founder. I compress models with INT3 at +0.14 nats and built a 2-bit KV cache for long-horizon tasks. I shipped both (INT3 model + INT2 KV) with custom fused Metal kernels for Mac (M-series). Currently Qwen 7B is available in preview.

#install
brew install reinforceai/spiral/spiral

#chat
spiral-chat



I am optimizing kernels further and working on Triton kernels for GPU support. There is still more room to pack more efficiently, I will share more models soon. I will appreciate any feedback or any model you want me to compress within 100B parameters.
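
For readers unfamiliar with 3-bit weights, here is a generic sketch of symmetric absmax quantization to signed 3-bit codes. It is not the method used in spiral, which the post does not describe; the group size, the rounding scheme, and the function names are assumptions for illustration only.

```python
def quantize_int3(weights):
    """Map one group of floats to signed 3-bit codes in [-4, 3] plus a scale."""
    # The largest magnitude maps to code -4 or +4; +4 is clamped to 3 below.
    scale = max(abs(w) for w in weights) / 4 or 1.0  # fall back to 1.0 if all zero
    codes = [max(-4, min(3, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the 3-bit codes."""
    return [c * scale for c in codes]


group = [0.12, -0.50, 0.33, 0.07, -0.21, 0.48, -0.02, 0.26]  # made-up weights
codes, scale = quantize_int3(group)
restored = dequantize(codes, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(codes, scale, max_err)
```

Eight such codes fit in three bytes, and the per-group scale is the only extra storage. Real schemes typically add tricks such as asymmetric ranges, outlier handling, or learned codebooks to reduce the error.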

github.com/ReinforceAI/spiral

https://redd.it/1ssdt0z
@datascientology
