Reddit DevOps. #devops Thanks @reddit2telegram and @r_channels
I have access to infinite sandbox creds, pipelines, and AI. Useful things for my resume?
We have a sandbox environment, I have access to all our test automation, dev repos, any number of bedrock tokens, use terraform as IaC, grafana dashboards
I can't inject any new dependencies in our pipelines without security approval, but other than that, I can go ham.
What are some useful things I can build that will look good when moving from SDET into a more platform-focused role?
I really enjoy SDET (which just means managing tests in pipeline/automating regression tests tbh), but lots of teams are shifting-left and just depending on the devs to manage these now.
I have no interest in doing developer work. I build internal tools for my team that are like 90% vibe coded, but matching figma mocks with React & writing performant backend code is not fun for me.
For SDET the main bullet points people look for are
- Automated {x}% (or {x} number) of manual regression cases, reducing time to feedback by {x}%
- Created, or had ownership over, some test pipeline and either reduced how long it took to run or increased coverage
- Then random misc bullet points like mobile automation, load testing, environment management
What are the equivalent main bullet points for DevOps work that could land me an associate role?
https://redd.it/1t1s5v7
@r_devops
CloudFront reverse proxy to Vercel with Lambda@Edge host rewrite → 502/503 errors
Hi there. I need to set up reverse proxy behavior using CloudFront. I have two origins: the main one (doesn’t really matter, let it be an EC2 instance) and a Vercel origin front.vercel.app.
I have two behaviors:
- default → returns EC2
- /front* → returns Vercel origin
The problem is that I can’t move my domain to Vercel, so I need to rewrite the Host header to snoika-blog-template.vercel.app.
For that, I used Lambda@Edge (Origin Request) with the following code:
    exports.handler = async (event) => {
      const request = event.Records[0].cf.request;
      request.headers['host'] = [{ key: 'host', value: 'front.vercel.app' }];
      return request;
    };
As a result, I get a 503 error saying the Lambda function is invalid or doesn’t have the required permissions.
What I already checked:
- Lambda is in us-east-1
- Using a published version (not $LATEST)
- Attached to Origin Request
- Tried different origin request policies (AllViewer, AllViewerExceptHostHeader)
- Gave full IAM permissions to the execution role
- Rebuilt CloudFront multiple times
Also, one time it worked for about 5 minutes, then stopped again. It feels like some propagation or replication issue.
Without Lambda, I get:
502 Bad Gateway (CloudFront can’t connect to origin)
Flow should be:
- any route → default website (EC2)
- /front → Vercel website
If anyone has experience with something like this, I’d really appreciate any suggestions 🙏
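One way to rule out the handler logic itself is to run it locally against a hand-built event. The sketch below is a Python version of the same origin-request host rewrite (Lambda@Edge also supports Python runtimes). The `sample_event` is a trimmed, hypothetical stand-in for a real CloudFront event, and `VERCEL_HOST` reuses the placeholder host from the post.

```python
# Placeholder host taken from the post; replace with the real Vercel hostname.
VERCEL_HOST = "front.vercel.app"


def handler(event, context):
    """Origin-request handler: point the Host header at the Vercel origin."""
    request = event["Records"][0]["cf"]["request"]
    # CloudFront keys headers by lowercase name; each value is a list of
    # {"key", "value"} pairs.
    request["headers"]["host"] = [{"key": "Host", "value": VERCEL_HOST}]
    return request


# Hypothetical, trimmed event shaped like a CloudFront origin-request record.
sample_event = {
    "Records": [{"cf": {"request": {
        "uri": "/front/index.html",
        "headers": {"host": [{"key": "Host", "value": "example.com"}]},
    }}}]
}
result = handler(sample_event, None)
```

If the handler behaves as expected locally, the remaining suspects are outside the code. One thing that may be worth checking is the execution role's trust policy, which for Lambda@Edge needs to allow both `lambda.amazonaws.com` and `edgelambda.amazonaws.com`.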
https://redd.it/1t0zptv
@r_devops
Grouping CI test failures by error signature, is this the right approach?
I have a hobby project that I would like to get input on.
I'm a software engineer and don't do much DevOps, but one of the things I work with at my job is GitHub test pipelines.
We have multiple long-running pipelines. Some take several hours, and some can take days. They run 1,300+ test cases, and it is common for a run to have 200+ failed tests. Many of the tests are flaky or unreliable. It's a mess, I know, but I'm sure it's not uncommon.
That makes debugging painful. A failed run with 200–300 failed test cases is not really actionable. It is hard to know where to start, and it is hard to route the failures to the right engineers.
The approach that has made the most sense to me is to group failures by error signature.
Right now the idea is to ingest JUnit XML reports from the pipeline and use those as the main source of truth. Instead of treating every failed test as a separate problem, I identify common failure patterns from the JUnit failure/error output and cluster them together.
In practice, this can reduce something like 200–300 failed tests into something more like 15 distinct failure groups/problems. That is much easier to debug, track, and assign.
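A minimal sketch of the grouping step, assuming standard JUnit XML where a failed `<testcase>` carries a `<failure>` or `<error>` child with a `message` attribute. The normalization rules in `signature` (masking hex values and numbers) and the sample report are illustrative assumptions, not the project's actual rules, and this version only does exact matching after normalization rather than any fuzzy matching.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample report for illustration.
JUNIT_XML = """<testsuite>
  <testcase classname="api" name="test_a">
    <failure message="Timeout after 30s connecting to 10.0.0.5:5432"/>
  </testcase>
  <testcase classname="api" name="test_b">
    <failure message="Timeout after 45s connecting to 10.0.0.9:5432"/>
  </testcase>
  <testcase classname="ui" name="test_c">
    <error message="KeyError: 'user_id'"/>
  </testcase>
  <testcase classname="ui" name="test_d"/>
</testsuite>"""


def signature(message: str) -> str:
    """Mask volatile parts so similar failures share one signature."""
    sig = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    sig = re.sub(r"\d+", "<n>", sig)
    return sig.strip()


def group_failures(xml_text: str) -> dict[str, list[str]]:
    """Map each failure signature to the tests that produced it."""
    groups = defaultdict(list)
    for case in ET.fromstring(xml_text).iter("testcase"):
        # Compare against None explicitly: an Element with no children
        # is falsy, so `find(...) or find(...)` would misbehave here.
        problem = case.find("failure")
        if problem is None:
            problem = case.find("error")
        if problem is None:
            continue  # passing test
        name = f'{case.get("classname")}.{case.get("name")}'
        groups[signature(problem.get("message", ""))].append(name)
    return dict(groups)


groups = group_failures(JUNIT_XML)
```

Here the two timeout failures collapse into one group even though the durations and IP addresses differ, while the `KeyError` stays in its own group.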
It also makes triage easier. If I see one failure group with a single shared signature and many failed tests assigned to it, that is usually higher priority than a small group with only a few failed tests. A large group often means one underlying issue is causing a wide blast radius across the suite.
I started building a dashboard and backend service around this idea. The dashboard shows the test pipelines, failure signatures, how many tests belong to each group, and how those signatures compare to the previous pipeline run. I compare with "x new, y persistent, z resolved", and ideally you want x to be zero, y to be low, and z to be high. You quickly see the health of the pipeline and whether it is regressing or progressing compared to the previous run.
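The "new, persistent, resolved" comparison can be sketched with plain set operations, assuming each run is reduced to its set of failure signatures. The function name and the sample signatures are hypothetical.

```python
def compare_runs(previous: set[str], current: set[str]) -> dict[str, set[str]]:
    """Classify failure signatures relative to the previous pipeline run."""
    return {
        "new": current - previous,         # only in the current run
        "persistent": current & previous,  # in both runs
        "resolved": previous - current,    # only in the previous run
    }


# Hypothetical signatures from two consecutive runs.
prev = {"Timeout after <n>s", "KeyError: 'user_id'"}
curr = {"Timeout after <n>s", "AssertionError: expected <n>"}
diff = compare_runs(prev, curr)
```

Because signatures are compared as exact strings, this only works if the normalization is stable between runs.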
The main thing I care about is not the exact number of failed tests but:
* how many failure groups/clusters exist
* how many tests are assigned to each group (is the group large or small?)
* whether a group is new or recurring.
* whether the number of groups is increasing, decreasing, or staying the same
* whether one failure signature explains a large portion of the failed tests
The grouping is deterministic and fuzzy. No AI or LLM is involved. I wanted the results to be repeatable, explainable, and consistent between runs.
I’m sure similar solutions already exist, but I’m curious how other people approach this.
Do you think grouping failures by error signature is the right abstraction?
Anything I'm missing or could add to this?
Interested in any feedback, especially from people dealing with large CI pipelines or unreliable test suites.
https://redd.it/1t15oy2
@r_devops
Mookbars - Self-contained bookmarks page generator from environment variables
https://redd.it/1t18aab
@r_devops
Need advice about my career
Hey 👋
I'm 23 and currently working as a NOC engineer, with about 5 months of experience in that area. I was always into networking and I have a degree in Network Engineering.
Anyway, lately I've been feeling like networking is about to die, so I've been thinking about shifting to DevOps and I want your advice.
Is there still hope in networking? Or should I shift now, while I'm at the start of my career?
Another thing: I've been thinking about taking the CCNP because I don't have any certifications and my CV feels kind of weak. If I took it and later decided to move to DevOps, would it still be useful to me as a DevOps engineer?
https://redd.it/1t106xi
@r_devops
whats the CHEAPEST Azure VM size I can use?
https://redd.it/1t0z3ku
@r_devops
CircleCI Dynamic Config + Tag Pipelines → “No workflow” after continuation
This is blocking deployment from CI → intg/stge/prod environments.
Hi all,
I’m using CircleCI dynamic config (`setup: true` + continuation orb) and facing an issue where tag-triggered pipelines show:
**“No workflow” after setup phase**, even though continuation returns successfully.
# Setup:
* `setup: true` config with `continuation/continue`
* Passing parameters via JSON (`/tmp/continue-params.json`)
* Tag-based deployments (`ci-*`, `rc-*`, `r-*`)
* Using `when:` in continuation config:

      when:
        matches:
          pattern: "^ci-.*"
          value: << pipeline.parameters.pipeline_tag >>
# What I’ve verified:
* `$CIRCLE_TAG` is available in setup job
* JSON file is written and passed to continuation
* continuation step completes successfully
* workflows exist in `continue_config.yml`
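To double-check the parameter side, the JSON payload and the pattern can be exercised locally. This is a rough sketch: `build_params` and `would_match` are hypothetical helpers, the `pipeline_tag` name comes from the `when:` clause above, and Python's regex engine only approximates how CircleCI evaluates `matches`. It also assumes `pipeline_tag` is declared under `parameters:` in `continue_config.yml`.

```python
import json
import re

TAG_PATTERN = r"^ci-.*"  # same pattern as the `when:` clause in the post


def build_params(env: dict) -> dict:
    """Build continuation parameters from the setup job's environment."""
    # CIRCLE_TAG is only set on tag-triggered pipelines; default to "".
    return {"pipeline_tag": env.get("CIRCLE_TAG", "")}


def would_match(params: dict) -> bool:
    """Approximate the `matches` check with Python's regex engine."""
    return re.search(TAG_PATTERN, params["pipeline_tag"]) is not None


params = build_params({"CIRCLE_TAG": "ci-1.2.3"})
payload = json.dumps(params)  # what would be written to the params file
```

If the value that reaches the continuation is empty (for example because the variable was not set when the file was written), no `when:` clause keyed on the tag can match, which would look like "No workflow".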
# Issue:
* Tag pipeline shows: → Setup job runs → continuation succeeds → **No workflow is selected**
# Question:
* Do tag contexts need to be explicitly preserved differently in dynamic config?
* Is there a known issue with `pipeline.git.tag` or parameter passing for tag-triggered pipelines?
Would really appreciate any insights, especially from folks using dynamic config with tag-based deployments.
Thanks!
https://redd.it/1t0ie30
@r_devops
SASL-OAuthbearer in ESM lambda(AWS)
Hello community,
I'm trying to design a setup where I act purely as a Kafka consumer.
I want to strictly follow an event-driven architecture.
I found a solution using a Lambda event source mapping (ESM), where Lambda simply acts as the consumer.
Here comes the tricky part.
The Kafka producer authenticates with SASL/OAUTHBEARER.
But the Lambda ESM doesn't support this authentication mechanism.
I heard a workaround would be a proxy mechanism, with ECS/EC2/MSK in front of Lambda to perform authentication.
I don't want to run that much infrastructure just to consume a few topics.
Are there any other alternatives?
https://redd.it/1t0gglr
@r_devops
Does anyone have good ideas for Docker alternatives?
Feels like Docker is still the default, but more people seem to be moving away from it depending on their setup. Some are switching to Podman or nerdctl for a more lightweight, daemonless approach. Others are using Kaniko or Buildah for builds, especially in Kubernetes environments. Then you’ve got alternatives like Nomad instead of k8s, or even things like LXC or FreeBSD Jails if you’re not tied to Linux containers. On macOS, I keep seeing OrbStack and Colima come up as Docker Desktop replacements. And when it comes to base images, a lot of teams seem to be moving toward Alpine, distroless, or more minimal/hardened images. It’s mind boggling. What are people here using in practice? Are you fully off Docker, or swapping out parts of the stack?
https://redd.it/1t06ksr
@r_devops
Need guidance switching to DevOps (7 months experience, not a fresher, but getting rejected everywhere)
I could really use some honest advice from people already working in DevOps.
I’m not a fresher. I have around 7 months of experience in a service-based company, but my current role is not related to DevOps. Over the past few months, I’ve been actively trying to switch into DevOps by learning on my own.
So far, I have:
Learned basics of Linux, Git, and scripting
Started working with tools like Docker, Kubernetes, Ansible, GitHub Actions, GitLab CI/CD (still improving)
Built a project (hands-on, not just theory)
Created a resume tailored for DevOps roles
Despite this, I’m getting rejected almost everywhere, sometimes with no response at all.
I’m trying to understand:
Am I missing something important in my preparation?
Is my experience level too low for switching domains?
Should I focus more on projects, certifications, or something else?
How do I make my profile stand out for DevOps roles as a career switcher?
I’m genuinely serious about this field and consistently learning, but I feel stuck right now.
Any guidance, roadmap suggestions, or even harsh reality checks would really help.
Thanks in advance 🙏
https://redd.it/1t039nl
@r_devops
What improved your incident debugging speed the most?
We've put a decent amount of work into observability over the past year: better structured logs, some tracing in key services, and dashboards for the usual metrics. On paper it looks solid.
But during real incidents, debugging still feels slower than it should. We can usually see what is happening in each service, but figuring out how it all connects is still where time gets lost. It often turns into switching between tools and trying to reconstruct the sequence of events manually.
It's not that we are missing data. The path from signal to understanding is still pretty indirect when multiple services are involved.
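To make "reconstructing the sequence of events" concrete, here is a small sketch that assumes every service already emits a shared `trace_id` and a comparable timestamp in its structured logs. The field names and records are hypothetical.

```python
from collections import defaultdict

# Hypothetical structured log records from three services.
logs = [
    {"ts": 3, "service": "billing", "trace_id": "t1", "msg": "charge failed"},
    {"ts": 1, "service": "gateway", "trace_id": "t1", "msg": "request received"},
    {"ts": 2, "service": "orders", "trace_id": "t1", "msg": "order created"},
    {"ts": 2, "service": "gateway", "trace_id": "t2", "msg": "request received"},
]


def timelines(records):
    """Group records by trace ID and order each group by timestamp."""
    by_trace = defaultdict(list)
    for rec in records:
        by_trace[rec["trace_id"]].append(rec)
    return {
        trace: sorted(recs, key=lambda r: r["ts"])
        for trace, recs in by_trace.items()
    }


# Order in which request t1 moved through the services.
t1_path = [r["service"] for r in timelines(logs)["t1"]]
```

The code is the easy part. The hard part is usually getting one ID propagated consistently across every service boundary.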
I’m trying to understand what made a noticeable difference for other teams. Was it a tooling change, better data modeling, tighter service boundaries, or something else entirely?
https://redd.it/1szysrm
@r_devops
Built a rust dashboard to stop giving SSH keys just for service restarts
Hey guys, I have been working as a DevOps engineer for the past 4 years, and one thing that always annoyed me is managing SSH access just so someone can check logs or restart a crashed Docker or systemd service.
So I built a web-based dashboard called portsentinel. It's built entirely in Rust and is open source. The main features are automatic log tailing and the ability to start, restart, stop, and check services without touching a terminal. The fun part for me is that it uses barely 10MB of RAM.
I actually developed this a few months ago but didn't get a chance to get real feedback on it, so the GitHub activity is low right now and my last commit is from about 4 months ago.
Also, full transparency: there's no denying that I used AI to build some of this while learning Rust, but I tweaked, tested, and reiterated it hundreds of times myself on my own VPS nodes to make it stable.
I know it's kind of like promotion, but I really need your valuable feedback on this. Where am I choking on the architecture, and what obvious security things am I missing?
Here's the link to my GitHub: https://github.com/neetesshhr/portsentinel
PS: I made an observability tool so I just used this flair
https://redd.it/1szu8dj
@r_devops
Anyone here working 100% with Crossplane?
Thinking about potentially moving away from Terraform/Pulumi (tired of drift and fixing it), but I want to hear from people actually using Crossplane before diving in.
Curious about:
- Whether it actually simplifies things or just trades one set of problems for another
- Community/ecosystem maturity
- Is the CI/CD cleaner in terms of drift?
https://redd.it/1szqeoz
@r_devops
Why: Infrastructure engineers dealing with AI/ML deployment pain
I've been deploying AI agents for the past year and kept hitting the same wall: agents that worked perfectly in demos would fail silently in production.
Not because the model was bad. Because the infrastructure wasn't designed for agents.
Here's what I learned:
The Problem:
Traditional DevOps assumes deterministic behavior—run the same test twice, get the same result. But AI agents have 63% execution path variance. Your unit tests catch 37% of failures at best.
Traditional APM (Datadog, New Relic) was built for binary failures—crashes, timeouts, 500 errors. But agents fail semantically: wrong tool selection, stale memory, dropped context in handoffs. Nothing alerts. Performance degrades silently.
What the 5% who ship to production do differently:
• Agent registry (every agent has identity, owner, version)
• Session-level traces (not just API logs)
• Behavioral testing (tests that account for non-determinism)
• Pre-execution governance (budget limits, policy guardrails)
• Composable skills (build once, deploy everywhere)
Has anyone else hit this? How are you solving observability and governance for non-deterministic agents in production?
https://redd.it/1szht3d
@r_devops
Need some Advice regarding upskilling and job switch as a CloudOps Engineer ( GCP )
I am a CloudOps Engineer based out of India.
I work with GCP cloud alone.
The work is pretty basic, and I feel like I am not learning anything new. The only thing to do here is repetitive work that could be avoided if rules were put in place.
I haven't touched GKE or Kubernetes yet. My company doesn't normally use GKE apart from a very few projects that I am not part of (only seniors are). I feel like any interesting work is hogged by the senior colleagues.
I have been wanting to switch but haven't been able to: sometimes they say I am inexperienced (2 years), sometimes they say GKE is required, sometimes that I am not a fit.
I also feel like doing just GCP is not good and that I need to go multi-cloud, but I don't know if I will be able to learn AWS or Azure without the kind of hands-on experience I get with GCP at the office.
I have been trying to upskill, but I have been like a child swayed by all the candies (tools, network fundamentals, GKE, open source contribution to learn about a tool, making my own tool, etc.), so I haven't done anything at all.
I really want to switch to a better company, and I was hoping the community could help me in some way (if not completely, at least show the way) to upskill and find jobs.
https://redd.it/1sz8wzk
@r_devops
Data collection question.
When your infra is a mess, with many tools and many setups, and the manager needs accurate data, how do you deal with it? Even if I manage to pull data from the DB, there is always a mismatch. I don't know if it's my thought process or something. I came from a very junior background, like installing Windows, and now do DevOps.
https://redd.it/1t1r1bi
@r_devops
Decisions from planning sessions disappear by the time sprint work actually starts
We write action items at the end of every planning meeting. They go into Jira or Confluence or whoever's Notion page they ended up in. Then the work starts and the pressure takes over and nobody goes back to the list. The ones that get done are the ones someone kept in their head. The rest surface two weeks later as things we meant to do. Is there a system that actually bridges "we agreed on this in the room" and "this is now someone's active ticket"? Or is everyone just accepting the dropout rate?
https://redd.it/1t1jpcy
@r_devops
Are per-query safeguards sufficient for agent-driven database access?
I’ve been messing around with agents that generate and run SQL (LLMs basically), and there’s something that doesn’t quite sit right.
Most of the safety stuff we rely on is per query — permissions, RLS, validation, etc. That all makes sense when a person is writing queries.
But agents don’t really behave like that. They just keep going until they get what they want.
Like, nothing stops something from doing:
SELECT ... LIMIT 100 OFFSET 0
SELECT ... LIMIT 100 OFFSET 100
SELECT ... LIMIT 100 OFFSET 200
and just walking the whole table over time.
Each query on its own is fine. It passes all the checks, respects permissions, nothing obviously wrong.
But taken together… it kind of defeats the point.
This doesn’t feel like SQL injection or anything classic. It’s more like the system is “safe per request” but not really safe in terms of overall behavior.
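One possible shape for a safeguard that looks at the session rather than the single query is a cumulative row budget. This is only a sketch: the `SessionBudget` class, its limit, and the idea of counting returned rows per table are assumptions for illustration, not an existing library.

```python
class SessionBudget:
    """Track cumulative rows returned per (session, table)."""

    def __init__(self, max_rows_per_table: int):
        self.max_rows = max_rows_per_table
        self.seen = {}  # (session_id, table) -> rows returned so far

    def allow(self, session_id: str, table: str, rows: int) -> bool:
        key = (session_id, table)
        total = self.seen.get(key, 0) + rows
        if total > self.max_rows:
            return False  # the query is fine alone, the session total is not
        self.seen[key] = total
        return True


budget = SessionBudget(max_rows_per_table=250)
# Three paginated reads of 100 rows each, like the OFFSET example above.
decisions = [budget.allow("agent-1", "users", 100) for _ in range(3)]
```

The first two pages pass and the third is refused, even though each query would pass any per-query check on its own.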
I’m not sure if I’m overthinking this or if people actually run into it once this stuff is in production.
If you’re using agents against a DB, how are you thinking about this? Or is this just not a concern in practice?
https://redd.it/1t19czm
@r_devops
Should I take the AWS SAA certificate?
I’ve always been against certification over experience, but recently and with the way the market is, I unfortunately think a certificate gives a proper push.
For context, I am a DevOps engineer with some hands-on AWS experience, and it's growing. As the company isn’t huge, I have more ownership and more opportunities to dive deep into the usual DevOps work and also AWS.
I decided to ask here because I noticed a lot of people are taking it. While it won't open doors from zero, it may help with working abroad, at a bigger corporation, or anything in between. I simply have no clue how useful it is, so what does everyone here think? Based on your experience, and especially if you took the certificate yourself, should I take it?
https://redd.it/1t15che
@r_devops
Is "building a Docker image" during the CI pipeline considered a best practice?
Hi everyone! I am new to DevOps and trying to better understand CI/CD best practices.
I often hear that, during the CI flow, we should “compile the source code” and run unit tests. However, I am not completely sure what “compile the source code” means in this context.
For context, my app should deliver a Docker image with a Python app running on it.
Questions:
Should the pipeline simply check out the source code from the repository and compile/build it directly on the CI runner, similar to how we would do it locally?
Or is it considered better practice to first build a Docker image and then compile/run the tests inside that container? If it is, what advantages do we gain by building a Docker image and compiling the source code inside it, rather than compiling the code and running tests directly in the CI runner?
Is building the Docker image during the CI flow considered a best practice at all?
If the Docker image should be built during CI, is it common to reuse that exact same image later in the CD/deployment flow to avoid rebuilding it again?
Sorry for asking so many questions.
I am trying to understand what the most common approach is in real-world environments.
Thanks!
https://redd.it/1t106xt
@r_devops
Project Yellow Olive - Pokemon Yellow inspired Kubernetes TUI game
https://redd.it/1t0wlmr
@r_devops
A question around observability for my startup and our configs
We're a mid-size startup running most of our workloads on AWS ECS with EC2 instances.
Over the last few months we've been trying to get proper observability set up: host metrics, container logs, and some distributed tracing across our services. The problem is we keep ending up with very different collector configurations for each environment and it's becoming hard to manage.
Our engineers are not deeply experienced with OpenTelemetry config yet, so anything that requires heavy YAML tuning tends to get abandoned.
We've been evaluating a few options but what we really need is:
- Something that works well specifically on ECS EC2 (not Fargate)
- A decent UI so engineers can actually make use of the data without writing PromQL from scratch
- Something that doesn't require a completely custom OTel Collector config per environment
Has anyone gone through this with ECS? Would love any recommendations here. I came across a few options and have tried a few of them out, but I'm still confused about which ones actually work.
https://redd.it/1t0m70r
@r_devops
What CI looks like at PostHog in a week: 575K jobs, 33M tests
https://www.mendral.com/blog/ci-at-scale
https://redd.it/1t0fl32
@r_devops
Postmortem: how I lost ~4% of requests to a Node/Nginx timeout mismatch, and the queue migration that fixed it
Sharing a postmortem of an architecture migration that took me too long to do, in case anyone’s running long-running jobs inside HTTP request handlers.
The setup
I run a backend pipeline that does multi-step work: input parsing, several external API calls in sequence, a scoring step, then a synthesis step. End-to-end runtime ranges from 5 to 35 seconds depending on cache state and the number of external sources involved.
For the first few months, I was naive. Request comes in, handler runs the full pipeline, response goes out. Worked fine in dev. Worked fine for the first dozen users.
Where it broke
Two things hit at once.
First, my reverse proxy (Nginx) and my Node runtime had different timeout settings. Node was set to 60s because the pipeline could occasionally hit 35. Nginx was at 30s by default. Cue silent 502s right when a job was about to finish. The user gets an error, the work completes anyway, and you spend a week chasing what looks like a backend bug but is actually a layer mismatch.
Second, when concurrency went up (a batch test with around 50 parallel requests), the runtime started locking. Connections held open, event loop choked, new requests timed out. I lost roughly 4% of requests in that batch.
The fix
Moved to a queue-based architecture. BullMQ on top of Redis. The flow now looks like:
API receives request, validates, drops a job in Redis, returns a job ID immediately (under 100ms). Frontend polls a status endpoint or subscribes via SSE. Separate worker process pulls jobs from the queue, runs the pipeline, writes results back to the database. User fetches the final result by job ID.
Same business logic, completely different runtime profile.
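The shape of that flow can be sketched in-process with Python's standard library. This is a stand-in for the Redis + BullMQ setup described above, not a replacement for it: the queue lives in memory, `run_pipeline` is a placeholder, and all names are hypothetical.

```python
import queue
import threading
import uuid

jobs = queue.Queue()
results = {}  # job_id -> {"status": ..., "result": ...}


def submit(payload):
    """API side: enqueue and return a job ID without running the pipeline."""
    job_id = str(uuid.uuid4())
    results[job_id] = {"status": "waiting", "result": None}
    jobs.put((job_id, payload))
    return job_id


def run_pipeline(payload):
    # Placeholder for the slow multi-step work.
    return payload.upper()


def worker():
    """Worker side: pull jobs and record results until a None sentinel."""
    while True:
        item = jobs.get()
        if item is None:
            break
        job_id, payload = item
        results[job_id]["status"] = "active"
        results[job_id] = {"status": "completed", "result": run_pipeline(payload)}


t = threading.Thread(target=worker)
t.start()
job_id = submit("hello")  # returns immediately with an ID
jobs.put(None)            # stop the worker after the queued job
t.join()
status = results[job_id]  # what a status endpoint would return
```

The point is the split: `submit` never waits on `run_pipeline`, so the request handler's duration no longer depends on how long the job takes.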
What changed
502 errors disappeared overnight. Not reduced, gone. The HTTP layer is now decoupled from job duration entirely.
Concurrency is bounded by worker count, not by HTTP request count. I can scale workers independently. If a job takes 90 seconds, it doesn’t block the API.
Retries became trivial. BullMQ has exponential backoff out of the box. A flaky external API call no longer breaks the user experience, the job just retries.
Observability got better. Each job has a clear lifecycle (waiting, active, completed, failed) and I can replay failed jobs on demand.
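For readers who have not used exponential backoff, here is a rough illustration of the retry behaviour described above. It uses a plain doubling schedule as an assumption; the exact formula and options BullMQ applies may differ, and the actual sleeping between attempts is left out to keep the sketch short.

```python
def backoff_delays(base_ms: int, attempts: int) -> list[int]:
    """Delay before each retry, doubling every time: base, 2*base, 4*base."""
    return [base_ms * 2 ** i for i in range(attempts)]


def run_with_retries(task, attempts: int):
    """Call task until it succeeds or attempts run out (no sleeping here)."""
    last_error = None
    for _ in range(attempts):
        try:
            return task()
        except Exception as exc:  # a real worker would catch narrower errors
            last_error = exc
    raise last_error


calls = {"n": 0}


def flaky():
    """Hypothetical external call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("upstream unavailable")
    return "ok"


delays = backoff_delays(1000, 3)
outcome = run_with_retries(flaky, attempts=3)
```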
What I should have done from day one
Built it on a queue from the start. The “I’ll migrate later when I scale” instinct cost me about three weeks of firefighting. The migration itself took two days. The denial took longer than the work.
If you’re running anything where a single user request triggers more than 5 seconds of backend work, especially with external API calls in the chain, decouple it now. The pattern is well understood, the libraries are mature (BullMQ for Node, Celery for Python, RQ for lighter Python use), and you’ll thank yourself the first time you hit real load.
The catch
You’re trading simplicity for resilience. A queue adds operational surface (Redis to monitor, workers to deploy, DLQs to manage). For a hobby project with 5 users, sync handlers are fine. For anything you’d hate to debug at 2am under load, queues aren’t optional.
Happy to answer specifics on the BullMQ config, Nginx tuning, or the SSE side if anyone’s mid-migration.
https://redd.it/1t05dbj
@r_devops
Does anyone have experience with self-hosting gitlab runners
So our small company is currently using the GitLab shared runners for our CI/CD tests. So far it's been fine, but as we add more and more tests, the time to run them keeps going up. We have parallelized the tests to keep the total run time down, but that also burns more minutes.
Last month we used up more than 32000 runner minutes.
I was thinking of buying a mini-pc and just have that be a dedicated runner machine. It should run the tests faster since it has a local docker cache and the CPU is more powerful too.
Just based on very minimal research I was thinking of something like this. If it performs on par with or better than the shared runners, it should pay for itself in just 3 months.
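The payback estimate is simple arithmetic and easy to redo with your own numbers. In the sketch below, only the 32000 minutes comes from the post. The hardware cost, included minutes, and per-minute price are made-up placeholders, not real GitLab pricing.

```python
def breakeven_months(hardware_cost: float, minutes_per_month: int,
                     included_minutes: int, price_per_minute: float) -> float:
    """Months until the hardware costs less than paying for extra minutes."""
    billable = max(0, minutes_per_month - included_minutes)
    monthly_spend = billable * price_per_minute
    if monthly_spend == 0:
        return float("inf")  # nothing to save, so it never pays off
    return hardware_cost / monthly_spend


# All prices are placeholders; only 32000 comes from the post.
months = breakeven_months(hardware_cost=600.0, minutes_per_month=32000,
                          included_minutes=2000, price_per_minute=0.01)
```

This ignores electricity, maintenance time, and any speedup from the local Docker cache, so treat the result as a rough bound rather than a forecast.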
Is this a bad idea?
Does anyone have experience with this kind of setup?
Recommendations for which machines to use?
https://redd.it/1t03ibi
@r_devops
Cloud Build Problems (1st & 2nd Gen): OAuth Failure, Can't Read Commits, No Build Triggers
Hi everyone,
I'm running into several issues with Google Cloud Build repositories and 2nd generation connections, and I'm hoping someone here has experienced something similar.
**1. OAuth callback error (2nd gen host connection)**
When trying to create a 2nd generation host connection, I get the following error:
`Error processing oauth callback: failed getting OAuth token with the provided code`
I've already retried the OAuth flow multiple times, but the issue persists.
**2. 1st gen repositories not picking latest commits**
For repositories connected using 1st gen, Cloud Build is not detecting the latest commits. It fails with:
`Couldn't read commit <commit-id>`
This suggests it cannot access or resolve the commit, even though it exists in the repo.
**3. 2nd gen connection stopped triggering builds**
We also have an existing 2nd gen host connection that was previously working. Now, when we push new changes, the build is not triggered at all — it seems like the connection is no longer responding.
At this point, it feels like there may be an issue with authentication, repository access, or possibly something broken between 1st gen and 2nd gen integrations.
Has anyone encountered:
* OAuth token issues when creating 2nd gen connections?
* Cloud Build not detecting commits in 1st gen repos?
* 2nd gen connections silently stopping triggers?
Any ideas, debugging tips, or things to check would be greatly appreciated
https://redd.it/1szxc2q
@r_devops
Security tickets in your backlog, what would actually make you fix one this sprint?
Genuine question from someone who works in AppSec and is trying to understand the engineering experience honestly rather than assuming.
When a security finding lands in your team's backlog, what actually happens to it? Not what should happen, what actually happens.
I've spoken to a lot of AppSec practitioners lately and one thing that seems to always pop up as a frustration is that even well-prioritised findings with decent context attached still don't reliably make it into sprints. But I've been hearing almost entirely from the security side and there's significant bias.
So, from a developer or engineering side:
What does a security ticket usually look like when it arrives, and why does that make it hard to act on? What would it need to contain in terms of format, context, timing, framing, etc. for it to genuinely compete with the feature work already planned for your current sprint, or is that simply an impossible ask?
Specific answers are more useful than general ones. The realer the better.
https://redd.it/1szuki3
@r_devops
At what point does “overengineering” in the cloud actually hurt more than it helps?
I’ve been thinking about how easy it is to go from a simple setup to something way more complex than it needs to be.
You start with something straightforward, then add:
Load balancers
Auto-scaling groups
Microservices
Queues, caching layers, etc.
And before you know it, debugging becomes harder, costs go up, and small changes take way longer than they should.
I get that scalability and reliability matter, but sometimes it feels like people design for problems they don’t even have yet.
For those who’ve worked on real systems — how do you decide when to keep things simple vs when to add more architecture?
Where’s that line for you?
https://redd.it/1szntoh
@r_devops
Terraform v1.15.0 rolled out today!
In this release, the main things I focus on:
- Terraform now supports variables and locals in module source and version attributes.
- backend/s3: Support authentication via aws login
https://github.com/hashicorp/terraform/releases
https://redd.it/1szc32o
@r_devops
Update on the cloud waste scanner I posted about last month - added AI/ML rules across AWS, Azure, and GCP
Shared the hygiene rule list here about a month ago. Wanted to post an update since the scope has changed quite a bit.
What's new since then:
Added AI/ML rules across all three providers, opt-in with --category ai. These target resources that look quiet from a billing dashboard but are still running and accruing charges.
AWS (6 new rules, 19 total):
SageMaker inference endpoints — InService with no invocations
SageMaker notebook instances — InService but no recent activity
SageMaker Studio apps (JupyterServer, KernelGateway) — InService but idle
SageMaker training jobs running well past a normal threshold
Bedrock Provisioned Throughputs — InService with no request traffic
EC2 GPU instances with near-zero utilization
Azure (5 new rules, 17 total):
AML compute clusters with a baseline node floor and no observed job activity for 14+ days
AML compute instances in Running state with no recent lifecycle activity
AML managed online endpoints with baseline replicas and zero requests per minute
Azure OpenAI provisioned deployments (PTUs) with no observed API traffic
Azure AI Search services — structurally empty and inactive for 90+ days
GCP (5 new rules, 10 total):
Vertex AI Online Prediction endpoints with a replica floor and zero observed requests
Vertex AI Workbench instances with no activity
Vertex AI training jobs running beyond threshold
Cloud TPU nodes in READY state with near-zero accelerator utilization
Vertex AI Feature Stores with zero serving requests for 30+ days
Also: hardening pass on existing rules
The AI rules in particular went through several rounds of tightening. They now require confirmed monitoring telemetry before emitting: they skip rather than guess when data is missing, the resource is too new to evaluate, or coverage is incomplete.
The intent is that if these fire in CI with --fail-on-confidence HIGH, you're not chasing false positives.
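The "skip rather than guess" behaviour can be illustrated with a small rule function. This is a hypothetical sketch, not the scanner's actual code: `evaluate_endpoint`, the 14-day window, and the one-datapoint-per-day telemetry shape are all assumptions.

```python
from typing import Optional


def evaluate_endpoint(age_days: int, invocations: Optional[list],
                      min_age_days: int = 14) -> Optional[str]:
    """Return a finding, or None when the rule should stay silent."""
    if invocations is None:
        return None  # telemetry missing: skip rather than guess
    if age_days < min_age_days:
        return None  # too new to evaluate
    if len(invocations) < min_age_days:
        return None  # incomplete coverage of the lookback window
    if sum(invocations) == 0:
        return "HIGH: endpoint in service with no invocations"
    return None  # traffic observed, nothing to report


full_window = [0] * 14  # one datapoint per day, all zero
findings = [
    evaluate_endpoint(30, full_window),     # idle and fully observed
    evaluate_endpoint(30, None),            # no telemetry
    evaluate_endpoint(3, [0, 0, 0]),        # too new
    evaluate_endpoint(30, [0] * 13 + [5]),  # has traffic
]
```

Only the first case produces a finding. The other three stay silent, which is the property that keeps a HIGH-confidence gate in CI from firing on missing data.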
Still working on hardening the last two GCP AI rules (Workbench and training job) to the same standard.
What's the AI/ML cost leak you find hardest to catch with existing tooling?
Repo (same as before): https://github.com/cleancloud-io/cleancloud
https://redd.it/1syyut2
@r_devops