YOLOv8 free alternatives
I'm currently using YOLOv8 for some object detection and classification tasks. Overall, I like the accuracy and speed. But it is licensed. What are some free alternatives to it that offer both detection and classification?
/r/computervision
https://redd.it/1eyet6m
Project I Created the Definitive AUTOMATIC Shiny Hunter for Pokémon BDSP
Hey everyone! I am Dinones! I coded a Python program using object detection that lets my computer hunt for shiny Pokémon on my physical Nintendo Switch while I sleep. So far, I’ve automatically caught shiny Pokémon like Giratina, Dialga or Azelf, Rotom, Drifloon, all three starters, and more in Pokémon BDSP. Curious to see how it works? Check it out! The program is available for everyone! Obviously, for free; I'm just a student who likes to program this stuff in his free time :)
The games run on a Nintendo Switch (not emulated, a real one). The program gets the output images using a capture card, then processes them to detect whether the Pokémon is shiny or not (OpenCV). Finally, it emulates the Joy-Cons over Bluetooth (NXBT) and controls the Switch. It also works on a Raspberry Pi!
I don't make money with this, I just feel my project can be interesting for a lot of people.
📽️ Youtube: https://www.youtube.com/watch?v=84czUOAvNyk
🤖 Github: https://github.com/Dinones/Nintendo-Switch-Pokemon-Shiny-Hunter
https://preview.redd.it/7jbe6fdxrijd1.png?width=1920&format=png&auto=webp&s=626c801925fb0769f59e62ece09f0e00b18b828e
https://preview.redd.it/2h2alqcxrijd1.png?width=1920&format=png&auto=webp&s=fddd11c5c04c58268bbaf0e8bca0fd7081a7f775
/r/MachineLearning
https://redd.it/1evp3wz
Convince me to learn C++ for computer vision.
PLEASE READ THE PARAGRAPHS BELOW
Hi everyone. I am currently in the last year of my master's and have good knowledge of image processing/CV as well as deep learning and machine learning. I plan to pursue a career in computer vision (I currently have a job in this field). I have some C++ knowledge and am still learning, but not once have I come across an application that required me to code in C++. Everything is accessible from Python nowadays, and I know all those tools are written in C/C++ and Python is just a wrapper. I really need your opinions to gain some insight into the use cases of C/C++ in practical computer vision applications, for example CUDA memory management.
/r/computervision
https://redd.it/1epl53b
RPC — A New Way to Build Language Models
Article: jpmag7/rpc-language-modeling-by-relevant-precedence-compression-3d09bb4f23e6
One of the reasons I really like software engineering in general is that anyone can do almost anything with just a computer. But when it comes to AI, and specifically LLMs, you need a ton of resources and money to do anything interesting by yourself.
So recently I've been trying to find a way to build language models with far less training data and far less compute. RPC is my closest attempt at that. It compresses the prompt into a vector representation and then performs a search in a vector database to find the most appropriate next token. It works remarkably well.
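To make the retrieval idea above concrete, here is a minimal sketch of "compress the context into a vector, then look up the nearest stored vector to pick the next token". It is not the RPC implementation from the article: the `encode` function (a decayed bag of one-hot token vectors), the `DECAY` constant, and the in-memory list standing in for a vector database are all assumptions made for illustration.

```python
import math

DECAY = 0.5  # hypothetical: how quickly older tokens fade from the context vector


def encode(context, vocab):
    """Toy stand-in for a learned compressor: recent tokens weigh more."""
    vec = [0.0] * len(vocab)
    for distance, token in enumerate(reversed(context)):
        if token in vocab:
            vec[vocab[token]] += DECAY ** distance
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_memory(tokens, vocab):
    """Store one (context vector, next token) pair per position in the corpus."""
    return [(encode(tokens[:i], vocab), tokens[i]) for i in range(1, len(tokens))]


def next_token(context, memory, vocab):
    """Return the token stored with the most similar context vector."""
    query = encode(context, vocab)
    _, token = max(memory, key=lambda entry: cosine(query, entry[0]))
    return token


tokens = "the cat sat on the mat".split()
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
memory = build_memory(tokens, vocab)
print(next_token(["the", "cat"], memory, vocab))
```

A real system would presumably replace `encode` with a trained model and the list scan with an approximate nearest-neighbor index; the sketch only shows the shape of the lookup.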
I haven't got the time to properly evaluate and test it yet. That's why I'm sharing this with the community, in the hope that someone will give some feedback or even try to replicate it. I'd love for you to take a look at the article and share some thoughts here.
/r/deeplearning
https://redd.it/1ehp00w
How do researchers come up with these ideas?
Hi everyone. I have a question that has been on my mind for a while now, and I was hoping you could help me. How do CV researchers come up with their ideas? I have read over 100 CV papers (not many, I know), but every single time I ask myself: how? How is this justified? For example, in object detection I've read YOLOv6, and all I saw was that they experimented with many configurations with little to no insight. The same goes for most other papers. Yes, I can understand why focal loss or ArcFace might help the learning procedure, but I cannot understand how traversing the feature pyramid top to bottom, bottom to top, bidirectionally, etc. might help when no proper justification is provided. Where is the intuition? In one paper I read, the authors stated that they fuse only the top layers of the feature pyramid together and the bottom layers together, and it works. Why? How? I am really confused, especially since I started working on my thesis, which is about object detection.
/r/computervision
https://redd.it/1e8k928
Transfer Learning vs. Fine-tuning vs. Multitask Learning vs. Federated Learning
/r/deeplearning
https://redd.it/1e31jgt
Would you choose to work as an NLP research engineer or do a PhD, starting this year?
Hi everyone,
I recently graduated from college with a couple of co-authored NLP papers (not first author) and will soon start a one-year MSE program at a top-tier university. I’m currently debating between pursuing a career as a Research Engineer (RE) or going for a PhD after my master’s.
Given some financial pressure from my family, the idea of becoming a Research Engineer at companies like Google or Anthropic is increasingly appealing. However, I’m uncertain about the career trajectory of an RE in NLP. Specifically, I’m curious about the potential for Research Engineers to transition into roles focused on research science or product development within major tech companies.
I would greatly appreciate any insights or advice from those with experience in the field. What does the career path for Research Engineers typically look like? Is there room for growth and movement into other areas within the industry?
Thank you in advance!
/r/LanguageTechnology
https://redd.it/1dv90hv
D Simple Questions Thread
Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead!
Thread will stay alive until next one so keep posting after the date in the title.
Thanks to everyone for answering questions in the previous thread!
/r/MachineLearning
https://redd.it/1dh9f6b
How do cheap CCTV cameras have good object detection and tracking features?
Most of them have extremely low power inputs and come at very cheap prices. How are they able to do the task so well?
Any leads on the tech or algos they use will be very helpful.
/r/computervision
https://redd.it/1dfhkvk
Tiny Time Mixers(TTMs): Powerful Zero/Few-Shot Forecasting Models by IBM
IBM Research released Tiny Time Mixers (TTM): a lightweight, zero-shot time-series forecasting model that even outperforms larger models.
And the interesting part: TTM does not use attention or other Transformer-related stuff!
You can find an analysis & tutorial of the model here.
/r/deeplearning
https://redd.it/1d867c2
Any lessons to be mindful of building a production-level RAG?
I will be working on a RAG system as my graduation project. The plan is to use Amazon Bedrock for the infrastructure while I scrape relevant data (documents). For those of you who have experience working with RAG, are there any lessons/mistakes/tips that you could share? Thanks in advance!
/r/LanguageTechnology
https://redd.it/1d1fyyq
R Introducing SSAMBA: The Self-Supervised Audio Mamba!
Hey Reddit,
Tired of transformers? Is attention really all you need? Meet SSAMBA (Self-Supervised Audio Mamba)! 🐍✨
This attention-free, purely state-space model (SSM)-based, self-supervised marvel doesn’t just hiss—it roars! SSAMBA achieves better or similar performance to its transformer-based counterparts (SSAST) on tasks like speaker identification, keyword spotting, and audio classification. But here's the kicker: it’s much more GPU memory efficient and quicker at inference, especially with longer audio lengths.
Curious? Check out the full paper here: SSAMBA on arXiv
Thanks for tuning in!
/r/MachineLearning
https://redd.it/1cz1yoa
D GPT-4o "natively" multi-modal, what does this actually mean?
What are your best guesses on how it works (training and architecture) vs. the typical VL formula of pretrained vision encoder + pretrained LLM -> fine-tune with multimodal tasks?
E.g., is it fully mixed-modality pre-training of the entire system? Does the model embed all modalities into a shared space for prediction? Does the system "self-select" the modality of output tokens (can it flexibly choose to output audio vs. text based on input tokens) or is this user specified?
/r/MachineLearning
https://redd.it/1crzdhd
a different aspect of the overall experience, UniSim can emulate how humans and agents interact with the world by simulating the visual outcome of both high-level instructions such as “open the drawer” and low-level controls such as “move by x,y” from otherwise static scenes and objects. There are numerous use cases for such a real-world simulator. As an example, we use UniSim to train both high-level vision-language planners and low-level reinforcement learning policies, each of which exhibit zero-shot real-world transfer after training purely in a learned real-world simulator. We also show that other types of intelligence such as video captioning models can benefit from training with simulated experience in UniSim, opening up even wider applications.
**Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors**
Ido Amos, Jonathan Berant, Ankit Gupta
Abstract: Modeling long-range dependencies across sequences is a longstanding goal in machine learning and has led to architectures, such as state space models, that dramatically outperform Transformers on long sequences. However, these impressive empirical gains have been by and large demonstrated on benchmarks (e.g. Long Range Arena), where models are randomly initialized and trained to predict a target label from an input sequence. In this work, we show that random initialization leads to gross overestimation of the differences between architectures and that pretraining with standard denoising objectives, using only the downstream task data, leads to dramatic gains across multiple architectures and to very small gaps between Transformers and state space models (SSMs). In stark contrast to prior works, we find vanilla Transformers to match the performance of S4 on Long Range Arena when properly pretrained, and we improve the best reported results of SSMs on the PathX-256 task by 20 absolute points. Subsequently, we analyze the utility of previously-proposed structured parameterizations for SSMs and show they become mostly redundant in the presence of data-driven initialization obtained through pretraining. Our work shows that, when evaluating different architectures on supervised tasks, incorporation of data-driven priors via pretraining is essential for reliable performance estimation, and can be done efficiently.
/r/MachineLearning
https://redd.it/1co4kfw
D Self-Promotion Thread
Please post your personal projects, startups, product placements, collaboration needs, blogs etc.
Please mention the payment and pricing requirements for products and services.
Please do not post link shorteners, link aggregator websites, or auto-subscribe links.
--
Any abuse of trust will lead to bans.
Encourage others who create new posts for questions to post here instead!
Thread will stay alive until next one so keep posting after the date in the title.
--
Meta: This is an experiment. If the community doesn't like this, we will cancel it. This is to encourage those in the community to promote their work by not spamming the main threads.
/r/MachineLearning
https://redd.it/1euyfi6
D Monthly Who's Hiring and Who wants to be Hired?
For Job Postings please use this template
>Hiring: [Location], Salary: [], [Remote | Relocation], [Full Time | Contract | Part Time] and [Brief overview, what you're looking for]
For Those looking for jobs please use this template
>Want to be Hired: [Location], Salary Expectation: [], [Remote | Relocation], [Full Time | Contract | Part Time] Resume: [Link to resume] and [Brief overview, what you're looking for]
Please remember that this community is geared towards those with experience.
/r/MachineLearning
https://redd.it/1egc1um
I wish this “AI is one step from sentience” thing would stop
The amount of YouTube videos I’ve seen showing a flowchart representation of a neural network next to human neurons and using it to prove AI is capable of human thought...
I could just as easily put all the input nodes next to the output, have them point left instead of right, and it would still be accurate.
Really wish this AI doomsaying would stop using this method to play on the fears of the general public. Let’s be honest, deep learning is no more a human process than JavaScript if/then statements are. It’s just a more convoluted process with far more astounding outcomes.
/r/deeplearning
https://redd.it/1els27c
D Simple Questions Thread
Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead!
Thread will stay alive until next one so keep posting after the date in the title.
Thanks to everyone for answering questions in the previous thread!
/r/MachineLearning
https://redd.it/1ee9dra
RAG evaluation framework
Hi,
I am looking for some good resources for RAG evaluation.
/r/LanguageTechnology
https://redd.it/1eawoog
D AI Study Group: Any willing study partners to form a group to learn architectures, implement them, discuss them, and create some application-level projects?
Basically, I am interested in learning and discussing architectures and implementing them and doing some projects. I prefer to make a group where we get productive, share our learnings, teach each other and have some accountability.
Rather than experts, I would love to connect with those who are intermediate with ML and DL architectures, and are willing to explain and implement things they are interested in. Any country, any age.
If anyone is willing to, please feel free to DM or comment.
Do mention your expertise level and your areas of interest!
/r/MachineLearning
https://redd.it/1e59xvu
D Self-Promotion Thread
Please post your personal projects, startups, product placements, collaboration needs, blogs etc.
Please mention the payment and pricing requirements for products and services.
Please do not post link shorteners, link aggregator websites, or auto-subscribe links.
--
Any abuse of trust will lead to bans.
Encourage others who create new posts for questions to post here instead!
Thread will stay alive until next one so keep posting after the date in the title.
--
Meta: This is an experiment. If the community doesn't like this, we will cancel it. This is to encourage those in the community to promote their work by not spamming the main threads.
/r/MachineLearning
https://redd.it/1dx5tpo
How Does Alexa Avoid Interrupting Itself When Saying Its Own Name?
Hello r/deeplearning community,
I've noticed that Alexa doesn't interrupt itself when it says "Alexa," but it does respond when someone else says it. How does it achieve this? Here are a few questions I have:
1. Self-Recognition: How does Alexa distinguish between its own voice and a user's voice saying "Alexa"?
2. Voice Characteristics: What specific features (e.g., pitch, tone) does Alexa analyze to recognize its own TTS voice?
3. Algorithms and Models: What machine learning models or algorithms are used to handle this task effectively?
4. Implementation: Are there any open-source libraries or best practices for developing a similar functionality?
Any insights or resources would be greatly appreciated. Thanks!
/r/deeplearning
https://redd.it/1do9lcs
Lightning-Fast Text Classification with LLM Embeddings on CPU
I'm happy to introduce fastc, a humble Python library designed to make text classification efficient and straightforward, especially in CPU environments. Whether you’re working on sentiment analysis, spam detection, or other text classification tasks, fastc is oriented for small models and avoids fine-tuning, making it perfect for resource-constrained settings. Despite its simple approach, the performance is quite good.
Key Features
Focused on CPU execution: Use efficient models like deepset/tinyroberta-6l-768d for embedding generation.
Cosine Similarity Classification: Instead of fine-tuning, classify texts using cosine similarity between class embedding centroids and text embeddings.
Efficient Multi-Classifier Execution: Run multiple classifiers without extra overhead when using the same model for embeddings.
Easy Export and Loading with HuggingFace: Models can be easily exported to and loaded from HuggingFace. Unlike with fine-tuning, only one model for embeddings needs to be loaded in memory to serve any number of classifiers.
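The centroid-based classification described in the features above can be sketched in a few lines. This is not fastc's API: the function names (`fit`, `classify`) and the three-dimensional toy vectors are hypothetical, and in practice the embeddings would come from a sentence-embedding model rather than being typed in by hand.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fit(examples):
    """examples maps each class label to a list of example embeddings."""
    return {label: centroid(vectors) for label, vectors in examples.items()}


def classify(centroids, embedding):
    """Pick the class whose centroid has the highest cosine similarity."""
    return max(centroids, key=lambda label: cosine(centroids[label], embedding))


# Hand-written toy embeddings (assumption: real ones come from an embedding model).
examples = {
    "spam": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "ham": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
centroids = fit(examples)
print(classify(centroids, [0.7, 0.3, 0.0]))
```

Because each classifier is just a small dictionary of centroids, many classifiers can share one embedding model, which matches the multi-classifier point in the list.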
https://github.com/EveripediaNetwork/fastc
/r/LanguageTechnology
https://redd.it/1d9g5fa
Understanding the Receptive Field in CNNs
Hey everyone,
I just dropped a new video on my YouTube channel all about the receptive field in Convolutional Neural Networks. I animate everything with Manim. Any feedback is appreciated. :)
Here's the link: https://www.youtube.com/watch?v=ip2HYPC_T9Q
In the video, I break down:
What the receptive field is and why it matters
How it changes as you add more layers to your network
The difference between the theoretical and effective receptive fields
Tips on calculating and visualizing the receptive field for your own model
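For the calculation point above, a short sketch of the standard recurrence for the theoretical receptive field may help: each layer grows the field by `(kernel - 1)` times the cumulative stride of the layers before it. The function name and the `(kernel_size, stride)` layer format are assumptions for illustration, and the sketch ignores dilation and padding; it says nothing about the effective receptive field, which is typically smaller.

```python
def receptive_field(layers):
    """Theoretical receptive field of a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Three stacked 3x3 convolutions with stride 1 see a 7x7 input window.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
```

Adding a strided layer early makes every later layer grow the field faster, which is why downsampling stages dominate the receptive field of deep networks.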
/r/computervision
https://redd.it/1d6irm9
Is there a pre-trained vision model that's good for zero-shot clustering?
I have a dataset of images from The Simpsons. I have trained a face-detection model for Simpsons characters with good results, and after lots of experimenting, I have written a script that gives relatively accurate binary images of the faces. I will attach a screenshot as an example. I have also tried using cv2.findContours to find the contours in these images and treat them as matrices to compute the difference between 2 faces, with no luck.
My end goal is to be able to cluster these faces by character. I know how to train a classifier for this type of thing, but I want to find a method of clustering faces that the model has not seen before. I have tried some more basic ML algorithms without success, and now I think this task may be too complex for that.
I am wondering if there is a vision model that could be well-suited for this? Or if anyone has suggestions for other approaches that would be great too.
Here is an example of my processed Simpsons faces that I want to cluster:
https://preview.redd.it/xpgnocmuxt2d1.png?width=2476&format=png&auto=webp&s=3028610d60ed09978daf4b6394494385988d7e24
I'm still working on isolating the faces of characters with darker skin colors, but I think this should be good enough to start putting together a clustering method.
As a side note, I don’t care that much if, for example, images of Bart Simpson from a front angle end up in a different cluster from images of him from a side angle, as it will be easy enough to manually merge these clusters after the fact.
/r/computervision
https://redd.it/1d1a8rt
Machine Learning Books that emphasize MATH?
Hi all! So far, the best machine learning book that I've come across is ISLP (Introduction to Statistical Learning in Python/R). There is also a book by Dr. Manel Martinez-Ramon that is set to publish in October that I've been eagerly waiting for (took his class, failed it massively, still think he is one of the coolest dudes ever). In the meantime, I'm looking for any books that REALLY help consolidate the mathematical learning into a single resource as best as possible, with references for further reading when necessary. Has anyone come across a deep learning book that is LESS concerned with programming and MORE concerned with the mathematical structures behind the deep learning processes? (ISLP is a great machine learning resource but only has one chapter on deep learning...)
/r/deeplearning
https://redd.it/1cx8ilz
R Our new classification algorithm outperforms CatBoost, XGBoost, LightGBM on five benchmark datasets, on accuracy and response time
Hi All!
We're happy to share LinearBoost, our latest development in machine learning classification algorithms. LinearBoost is based on boosting a linear classifier to significantly enhance performance. Our testing shows it outperforms traditional GBDT algorithms in terms of accuracy and response time across five well-known datasets.
The key to LinearBoost's enhanced performance lies in its approach at each estimator stage. Unlike decision trees used in GBDTs, which select features sequentially, LinearBoost utilizes a linear classifier as its building block, considering all available features simultaneously. This comprehensive feature integration allows for more robust decision-making processes at every step.
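As a rough illustration of what "boosting a linear classifier" can look like, here is a generic AdaBoost loop wrapped around a toy linear base learner. It is not the LinearBoost algorithm from the repo: the weighted centroid-difference learner in `fit_linear`, the error clamp, and the function names are all assumptions chosen to keep the sketch dependency-free. Labels are expected to be +1 or -1.

```python
import math


def fit_linear(X, y, w):
    """Toy linear learner: weighted class centroids, using every feature at once."""
    dim = len(X[0])
    mu = {1: [0.0] * dim, -1: [0.0] * dim}
    mass = {1: 0.0, -1: 0.0}
    for xi, yi, wi in zip(X, y, w):
        mass[yi] += wi
        for d in range(dim):
            mu[yi][d] += wi * xi[d]
    for label in (1, -1):
        mu[label] = [v / mass[label] for v in mu[label]]
    coef = [p - n for p, n in zip(mu[1], mu[-1])]
    mid = [(p + n) / 2 for p, n in zip(mu[1], mu[-1])]
    bias = -sum(c * m for c, m in zip(coef, mid))
    return coef, bias


def predict_linear(model, x):
    coef, bias = model
    return 1 if sum(c * v for c, v in zip(coef, x)) + bias >= 0 else -1


def boost(X, y, rounds=5):
    """Discrete AdaBoost: reweight samples so later learners focus on mistakes."""
    w = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(rounds):
        model = fit_linear(X, y, w)
        preds = [predict_linear(model, x) for x in X]
        err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
        err = min(max(err, 1e-10), 1 - 1e-10)  # clamp to keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, model))
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    score = sum(alpha * predict_linear(model, x) for alpha, model in ensemble)
    return 1 if score >= 0 else -1


X = [(2, 2), (3, 3), (2, 3), (0, 0), (-1, 0), (0, -1)]
y = [1, 1, 1, -1, -1, -1]
ensemble = boost(X, y)
print([predict(ensemble, x) for x in X])
```

The contrast with tree-based boosting is visible in `fit_linear`: each stage produces one weight per feature rather than a sequence of single-feature splits.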
We believe LinearBoost can be a valuable tool for both academic research and real-world applications. Check out our results and code in our GitHub repo: https://github.com/LinearBoost/linearboost-classifier. The algorithm is in its infancy and has certain limitations, as reported in the GitHub repo, but we plan to address them in future work.
We'd love to get your feedback and suggestions for further improvements, as the algorithm is still in its early stages!
/r/MachineLearning
https://redd.it/1cqv5y4
D ICLR Outstanding Paper Awards. Congratulations!
**Vision Transformers Need Registers**
Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski
Abstract: Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
**Generalization in diffusion models arises from geometry-adaptive harmonic representations**
Zahra Kadkhodaie, Florentin Guth, Eero P Simoncelli, Stéphane Mallat
Abstract: Deep neural networks (DNNs) trained for image denoising are able to generate high-quality samples with score-based reverse diffusion algorithms. These impressive capabilities seem to imply an escape from the curse of dimensionality, but recent reports of memorization of the training set raise the question of whether these networks are learning the “true” continuous density of the data. Here, we show that two DNNs trained on non-overlapping subsets of a dataset learn nearly the same score function, and thus the same density, when the number of training images is large enough. In this regime of strong generalization, diffusion-generated images are distinct from the training set, and are of high visual quality, suggesting that the inductive biases of the DNNs are well-aligned with the data density. We analyze the learned denoising functions and show that the inductive biases give rise to a shrinkage operation in a basis adapted to the underlying image. Examination of these bases reveals oscillating harmonic structures along contours and in homogeneous regions. We demonstrate that trained denoisers are inductively biased towards these geometry-adaptive harmonic bases since they arise not only when the network is trained on photographic images, but also when it is trained on image classes supported on low-dimensional manifolds for which the harmonic basis is suboptimal. Finally, we show that when trained on regular image classes for which the optimal basis is known to be geometry-adaptive and harmonic, the denoising performance of the networks is near-optimal.
**Learning Interactive Real-World Simulators**
Sherry Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Leslie Pack Kaelbling, Dale Schuurmans, Pieter Abbeel
Abstract: Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies, to training embodied agents purely in simulation that can be directly deployed in the real world. We explore the possibility of learning a universal simulator (UniSim) of real-world interaction through generative modeling. We first make the important observation that natural datasets available for learning a real-world simulator are often rich along different axes (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing