Reddit DevOps. #devops Thanks @reddit2telegram and @r_channels
We implemented WAF and our bill suddenly spiked, is this normal?
We recently got hit by a robocall fraud incident, and a number of our customer accounts were compromised. To mitigate this, one of our Development Engineering Managers suggested implementing AWS WAF ATP (Account Takeover Prevention) rules so that malicious requests could be filtered out before reaching our AWS Lambda functions.
The solution was proposed to management and approved before looping in the DevOps team (we don’t have a dedicated security team right now). After enabling WAF, we ended up seeing a cost spike of around $6.5k in just three days, with roughly 10 million requests hitting our APIs.
I’m trying to understand if this is expected behavior when using WAF under attack conditions, or if we might have misconfigured something.
For those with more experience in this space, was the approach itself reasonable?
Is this kind of cost spike normal?
What’s the usual way to handle situations like this without costs blowing up?
I’m relatively new to handling security incidents like this, so any insights or best practices would really help.
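For scale, the two figures above can be turned into a per-request rate. This is a minimal sketch using only the numbers from the post ($6.5k, roughly 10 million requests). The `inspected_fraction` what-if is an assumption for illustration: it supposes the charge scales with the number of requests the paid rule group actually inspects, so narrowing inspection to login traffic would shrink the bill proportionally. The 5% share is made up.

```python
def cost_per_thousand(total_cost_usd: float, requests: int) -> float:
    """Effective cost per 1,000 requests."""
    return total_cost_usd / requests * 1000


def projected_cost(rate_per_thousand: float, requests: int, inspected_fraction: float) -> float:
    """Cost if only a fraction of all requests reaches the paid rule group."""
    return rate_per_thousand * (requests * inspected_fraction) / 1000


rate = cost_per_thousand(6500, 10_000_000)          # figures from the post
what_if = projected_cost(rate, 10_000_000, 0.05)    # 5% is a hypothetical login share
print(f"${rate:.2f} per 1,000 requests; ${what_if:.2f} if only 5% were inspected")
```

If the real rate on the bill matches the per-request price of the rule group, the spike is probably volume rather than misconfiguration, and the lever is how many requests get inspected.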
https://redd.it/1swx3wd
@r_devops
Map Sovereignty, Part 2: One Source for Vector and Raster
https://redd.it/1swiiwu
@r_devops
Replacement for traditional domain-style IdM
Purely hypothetical in a lab space. I'm curious if there is a feature complete selection of tools to fully replace LDAP/Kerberos IdM (think AD or FreeIPA) in a net new environment with no legacy applications and no LDAP/Kerberos dependencies.
My initial research shows this stack may work with some key differences:
* **Keycloak** - OIDC/OAuth2/SAML for everything, including SSH logins; internal user store replaces LDAP. However, no system identity (NSS/PAM) and no POSIX-compliant attribute matching (UID/GID, etc.)
* [**OpenBao**](https://github.com/openbao/openbao)**/HashiCorp Vault** - Handles traditional PKI and credential distribution
* [**Teleport**](https://github.com/gravitational/teleport) - Access plane for providing JIT certs for SSH/Kubernetes/DB access, etc. via cert-based authentication.
* [**SPIFFE**](https://github.com/spiffe/spiffe)**/**[**SPIRE**](https://github.com/spiffe/spire) **Integration** (optional) - Workload identity for tying cryptographic identities to workloads (namely mTLS between services). Replaces Kerberos.
* **DNS server/NTP** (easiest part here)
What am I missing/not thinking of? Has anyone deployed something similar in the wild?
https://redd.it/1swi7ed
@r_devops
Looking for devops partners
Hey guys,
I am currently working as a Cloud Engineer, but I am learning more each day so that I can transition fully to DevOps in a couple of months. I am using K8s, OpenShift, AWS, and ArgoCD at my current job and learning Terraform and Python in my free time. I am looking for people with the same interests so we can form a group on Discord or Telegram and advance faster. Is anyone interested?
https://redd.it/1swlggv
@r_devops
r/devops nowadays
https://redd.it/1swivcl
@r_devops
Affordable PagerDuty alternatives that aren't overkill?
I’m looking for a PagerDuty alternative that won't break the bank.
I’ve already checked out Better Stack and VictorOps, but they both feel way too bloated. They seem to require large teams just to manage the tool itself, not to mention the "enterprise" pricing that comes with them.
Self-hosted tools are not an option right now because of the customer's policy.
Looking for something cost-effective for smaller setups.
Any suggestions for a straightforward on-call/alerting tool that actually stays within a reasonable budget?
Thank you
https://redd.it/1sweekv
@r_devops
Scaling infra & judging pipelines for a 1000+ team hackathon — looking for DevOps insights
Hey everyone,
Disclosure: I’m part of the organizing team behind this hackathon.
We’re organizing SummerSaaS AI Hackathon 2026 and recently crossed 800+ registrations, targeting ~1000+ teams. As we scale this, we’re also running into some interesting DevOps challenges and I’d love input from this community.
💡 Current challenges we’re thinking through:
• Handling burst traffic during submission deadlines
• Designing a fair and scalable judging pipeline (code + demos + AI outputs)
• Managing CI/CD or deployment validation for multiple teams
• Preventing misuse/spam in submissions (especially with AI-generated projects)
• Supporting teams building on different stacks (no-code → full-stack AI apps)
⚙️ What we’re considering:
• Cloud-based scalable submission systems
• Automated evaluation + manual review hybrid
• Sandbox environments for demos
• Basic infra guidelines for participants
📊 Context:
• 800+ registrations already
• Targeting 2500–3000 participants
• Multi-stage format (online → campus → final)
Would really appreciate insights from people who’ve:
👉 run large-scale hackathons
👉 built infra for high-concurrency events
👉 designed evaluation pipelines
Also open to connecting with teams/tools who’ve supported infra for hackathons — especially around cloud credits, CI/CD, or scalable deployments.
Thanks in advance — would love to learn from your experiences 🙌
https://redd.it/1swbe7l
@r_devops
Tool for automatically opening AWS console links in the right account
Sharing this in case it’s useful for anyone managing multiple AWS accounts through IAM Identity Center.
This extension helps with opening AWS links in the correct account context automatically. It checks the URL for an account ID or uses configured keyword mappings, then redirects via the AWS access portal instead of leaving you in the wrong account with a 403 or missing resource.
If the target account isn't clear, it shows a picker instead.
Everything is stored locally in the browser.
Can also act as a manual account switcher for more than 5 accounts.
GitHub: https://github.com/CoreyHayward/AccountHop-for-AWS
Chrome Web Store: https://chromewebstore.google.com/detail/mlkmbmoehpnifbllgklomdjjoiaifmjm?utm_source=item-share-cb
https://redd.it/1sw3fo6
@r_devops
Lead push to migrate automation flows to AI agents
As the title says
We would have lots of different flows, VM updates, cluster rollouts, QA pipelines.
The meeting we had was basically about downsizing Jenkins and our scripts and focusing on agents to do this work instead (to me it's a different type of pipeline). Same with Ansible.
Just wondering whether other companies are seeing the same push, with less focus on normal tooling.
In my head it's all fun, but there will always be hallucinations that you just won't get with strict scripts and tooling.
https://redd.it/1sw1adi
@r_devops
Experience title
Hi all,
Might seem like a useless post, but I’d like opinions from people in the field.
How would you label this kind of experience? DevOps? DevSecOps? SysAdmin? SRE? SysOps? HPC engineer? Something else?
• Automated the deployment and configuration of HPC clusters using Ansible and GitLab-CI pipelines
• Managed job scheduling and resource allocation for a multi-thousand core cluster with Slurm
• Configured HAProxy for load balancing across critical services
• Hardened cluster security with SSH Bastions, PAM tuning, and CrowdSec deployment
• Conducted automated vulnerability assessments using OpenVAS/GVM, Nikto, and Nuclei, and evaluated Wazuh for SIEM use cases
• Deployed a centralized rsyslog logging architecture for continuous security auditing
• Migrated home and project directory mounts to LDAP-backed autofs direct maps
• Architected the migration from Lustre to CephFS with per-project CephX credentials
• Maintained Conda/Micromamba environments and built reproducible Apptainer (Singularity) containers
• Developed Python tooling to reconcile project state across LDAP and database backends
https://redd.it/1svv7mz
@r_devops
Built MiroTalk, an open-source self-hosted WebRTC platform (P2P + SFU)
Started as an experiment and grew into a full project around real-time communication and self-hosted infrastructure.
Story: https://docs.mirotalk.com/story
GitHub: https://github.com/miroslavpejic85
Self-Hosted scripts: https://docs.mirotalk.com/scripts/about/
Any feedback, suggestions, or thoughts from a DevOps/self-hosting perspective are welcome.
https://redd.it/1svowal
@r_devops
How to deal with colleague who produces AI garbage?
I have a colleague who ships brittle and risky automations to prod (at least from my perspective). All of it is produced by AI, and he clearly does not understand how it all works together or why it is designed that way. No guard rails, no validations, fire-and-pray type of scripts.
I did not mind it initially and just let him do his thing; however, I am now affected, as he rolls it out and I am kinda forced to use it. Aside from my own ego (yes, a little bit of ego, I admit) and my personal standard for how I automate stuff, it really is brittle and I see a lot of possible issues that could occur in production with it. My lead does not really review it, as he himself does not code very much.
I don't want to ignore it either, as I might be labeled non-compliant or rebellious. I try to make some suggestions, but I feel like he takes them negatively, so I just keep my mouth shut instead.
How do you deal with it?
https://redd.it/1svif0f
@r_devops
Personal laptop
Hey guys!
I have to buy myself a laptop. I have a personal PC, but now I need a laptop. I have a work laptop that I need to keep with me all the time because of on-call, but I can't use it for personal purposes. The big question is: Lenovo ThinkPad or MacBook Air M4?
I only really need it for browsing, a terminal, accessing my servers and homelab, and VS Code. And long battery life. I had a ThinkPad and it was a great laptop, but I sold it. These new MacBooks are really good, though. What do you have for personal use?
https://redd.it/1sv41or
@r_devops
Should I take a paycut to get into a traditional devops/sre role? Resume attached
Basically, I come from a more Windows Engineer background, but I mostly work on the Windows DevOps side of things.
In my last job, I was pretty much doing
* Create full stack application in AWS (React Frontend, AWS CDK backend with AppSync, S3, Cloudfront, etc etc) in full CI/CD pipelines with dev/prod stages
* A lot of data pulling from Windows Server via PowerShell scripts
* Creating infrastructure automation based off AWS step functions, lambdas, dynamodb, etc and creating a frontend for them so IT engineers can kick them off and observe
* Manage a fleet of 5000 or so Windows Servers, and 500,000 security devices (cameras, badge readers, etc). A lot of my applications and toolings created were to manage these via config enforcement, upgrade them, etc.
* Mostly used Typescript for AWS CDK and React, Python for Lambdas, and PowerShell for Windows Server side of things
At my new job, unfortunately it seems a lot more legacy and on prem. I'm thinking about leaving and trying to get to a more traditional DevOps/SRE type role. I lack a lot of the typical Unix/Container/Kubernetes skills that DevOps jobs are looking for, although I can do them on my own time in a homelab etc, I feel I'll never really get a chance to use them professionally.
At my new job, I'm moreso doing
* Creating new React + C# backend web app tooling for management of internal servers, VM resizing, etc
* Migrating legacy automations by rewriting them for Ansible and PowerShell
* Some typical Windows infra work
* Little to no work in AWS/Azure
Right now, I've pretty much been applying to most SRE/DevOps roles, but I'm not really hearing much back. I'm getting interviews for traditional Windows manage-everything roles, which pay well but aren't what I really enjoy. I currently make around 250k, so I'm thinking I'd realistically need to lower my pay to around the 150-200k range for someone to take a chance on me. I live in a HCOL city.
Here's a sanitized resume as well
https://i.imgur.com/X4Cz7Qw.png
https://redd.it/1susytw
@r_devops
I built a Kubernetes ingress controller in Go around the load-balancing algorithm used by YouTube.
I wrote a Kubernetes ingress controller based on the Prequal paper.
The paper is interesting because Google says it uses this load-balancing approach across 20+ services, including YouTube's serving stack. The core idea is to stop balancing CPU and instead choose backends using two signals:
- requests-in-flight
- backend-reported latency
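To make the two-signal idea concrete, here is a minimal sketch of a hot/cold selection rule as I read it from the Prequal paper: backends whose requests-in-flight sit at or above a threshold count as "hot", the rest as "cold"; pick the cold backend with the lowest latency, and if everything is hot, fall back to the lowest requests-in-flight. The `Probe` type, the `hot_rif_threshold` parameter, and how that threshold gets computed are assumptions for illustration, not code from the repo.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    backend: str
    rif: int            # requests in flight reported by the backend
    latency_ms: float   # backend-reported latency estimate


def pick_backend(probes: list[Probe], hot_rif_threshold: int) -> str:
    """Choose a backend from a pool of recent probe responses."""
    if not probes:
        raise ValueError("probe pool is empty")
    # "Cold" backends are those below the requests-in-flight threshold.
    cold = [p for p in probes if p.rif < hot_rif_threshold]
    if cold:
        # Among cold backends, latency decides.
        return min(cold, key=lambda p: p.latency_ms).backend
    # Everything is hot: fall back to the least-loaded backend.
    return min(probes, key=lambda p: p.rif).backend
```

With probes `a` (rif 2, 30 ms), `b` (rif 9, 5 ms), and `c` (rif 3, 12 ms) and a threshold of 5, this picks `c`: `b` is fastest but hot, so it is skipped.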
Implementation-wise, I built it as:
- a controller watching `Ingress` and `EndpointSlice`
- an in-process reverse proxy
- per-route probe pools so balancing state stays isolated
- async probing to keep decisions off the request path
I also put a benchmark harness around it so I could compare it against round-robin and least-connections under controlled workloads.
If people are interested I can also break out a follow-up post just on the Kubernetes design and route-local state isolation.
Repo: https://github.com/sathwick-p/prequal
Writeup: https://sathwick.xyz/blog/prequal.html
https://redd.it/1suq94o
@r_devops
Weekly Self Promotion Thread
Hey r/devops, welcome to our weekly self-promotion thread!
Feel free to use this thread to promote any projects, ideas, or any repos you're wanting to share. Please keep in mind that we ask you to stay friendly, civil, and adhere to the subreddit rules!
https://redd.it/1swvmef
@r_devops
Should Terraform Pull Environment Variables from AWS Parameter Store?
I am new to DevOps. Sorry if this is a stupid question.
I am working on an application that uses GitHub Actions, Terraform, and AWS. Currently, we store environment variables and secrets in both AWS Secrets Manager and GitHub Secrets. However, due to rising costs with Secrets Manager, we are switching to AWS Parameter Store.
As part of this change, I am considering centralizing all env variables in PS, including those currently stored in GitHub, but I am not sure whether it is best practice to allow Terraform to fetch variables directly from AWS PS. Does that make sense? Or is there a better pattern for managing environment variables in this setup?
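One alternative worth weighing is to have the application or the deploy job read its configuration from Parameter Store itself, since values that Terraform reads through a data source end up in its state file. Below is a rough sketch of that, assuming a boto3 SSM client is passed in and that parameters live under a common path prefix; the `/myapp/prod` path and the `load_env` helper are hypothetical names.

```python
def load_env(ssm, path: str) -> dict[str, str]:
    """Collect every parameter under `path` into an env-style dict."""
    env: dict[str, str] = {}
    kwargs = {"Path": path, "Recursive": True, "WithDecryption": True}
    while True:
        page = ssm.get_parameters_by_path(**kwargs)
        for param in page["Parameters"]:
            # "/myapp/prod/DB_HOST" -> "DB_HOST"
            env[param["Name"].rsplit("/", 1)[-1]] = param["Value"]
        token = page.get("NextToken")
        if not token:
            return env
        # Results are paginated; keep going until there is no token left.
        kwargs["NextToken"] = token
```

Usage would look like `load_env(boto3.client("ssm"), "/myapp/prod")`. Whether this beats reading the values in Terraform depends on who needs them: infrastructure settings Terraform must know at plan time are a different case from runtime app config.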
Thanks.
https://redd.it/1swqb2s
@r_devops
Trying to automate our deployment process
Hey folks,
I’ve recently joined a team where deployments are still fully manual, runbook-driven, and pretty error-prone. I’ve been asked to look into automating the process.
I should also mention I’m fairly new to this, so I’m trying to be thoughtful about not overengineering things or picking the wrong approach early.
# Current setup
We have two apps:
Market-facing app on Kubernetes (EKS on AWS)
Integration app on ECS (Docker-based)
Two environments: demo and production. I’m planning to automate demo first and only touch prod once things are proven.
# What deployments look like today
Each deployment is a long sequence of manual steps, roughly:
Pre-checks (current version, data reconciliation)
Backup + verify it’s safely in S3
Stop services
Pull and configure new release
Run upgrade
Post-checks (pods healthy, UI version correct)
Notify team + scale down
The integration app differs a bit:
Pull from Git
Build Docker images
Force deploy to ECS
Also worth noting:
Some deployments are full upgrades, others are patches, and the steps differ meaningfully
# What I’m trying to figure out
I want to turn this into a reliable pipeline instead of relying on someone executing 30+ steps perfectly every time.
A few things I’m unsure about:
1. Tooling
We’re already deep in AWS. For a mixed EKS + ECS setup, would you lean toward:
CodePipeline / CodeBuild
GitHub Actions
Jenkins
Something else
2. Pipeline design
Would you:
Build one parameterized pipeline
Or split by app and/or environment
Right now I’m leaning toward separate pipelines per app, but curious what’s worked (or failed) for others.
3. Approval / safety gates
Some steps need human confirmation, especially backups.
Example: we should not proceed unless someone confirms the backup completed successfully.
What’s the cleanest way you’ve implemented this?
Manual approval steps in pipeline tools
External checks
Something else
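Whatever tool ends up hosting it, the gate itself is a small idea: run the steps in order, stop on the first failure, and refuse to continue past a gated step until someone confirms. Here is a tool-agnostic sketch; the `run_pipeline` helper and the `confirm` callback are hypothetical, and in a real pipeline `confirm` would map to the tool's manual approval step.

```python
from typing import Callable

Step = tuple[str, Callable[[], bool]]


def run_pipeline(steps: list[Step], confirm: Callable[[str], bool], gated: set[str]) -> list[str]:
    """Run steps in order; stop at the first failure or declined approval."""
    completed: list[str] = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"step failed: {name}")
        # Gated steps need an explicit human "yes" before anything after them runs.
        if name in gated and not confirm(name):
            raise RuntimeError(f"approval declined after: {name}")
        completed.append(name)
    return completed
```

For the runbook above, `steps` would be pre-checks, backup, stop services, and so on, with `gated={"backup"}` so nothing destructive runs until the backup is confirmed in S3.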
4. Notifications
We currently send MS Teams messages at start/end of deployments.
Would you:
Integrate notifications into the pipeline
Or keep that separate
If you’ve built something similar, I’d really appreciate any advice, patterns, or horror stories. Especially around what not to do.
Thanks! 👊🏻
https://redd.it/1swjmr5
@r_devops
Self managed Kubernetes vs EKS
Been running self-managed Kubernetes for a while, and the AWS bill keeps creeping up despite flat traffic. Before I rip-and-replace with EKS, I'm curious: has anyone actually saved money switching to managed Kubernetes, or did you just trade CapEx headaches for unexpected bill shock? What were the hidden costs nobody warned you about?
https://redd.it/1swk39g
@r_devops
Agent-based vs push-based remote management for NAT’d nodes (tradeoffs?)
I’ve been experimenting with a small setup to manage machines that sit behind NAT (homelab + some lightweight infra), and ran into the usual friction:
* SSH requires inbound access or a VPN
* Push-based tools (e.g. Ansible) don’t work well when nodes are intermittently online
* Monitoring tools rarely include safe remote execution
So I tried a different approach:
a **polling agent over HTTPS**, where each node checks in and pulls commands. Also, I was deeply inspired by how Moonlight/Sunshine handle the enrollment approach. :-)
That gives:
* no inbound firewall rules
* no exposed SSH
* works on unstable or occasionally-online nodes
But it also introduces tradeoffs:
* latency (bounded by poll interval)
* state lives server-side
* less “real-time” than push
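The latency tradeoff comes straight out of the loop shape. Here is a stripped-down sketch of one check-in plus the delay calculation, with backoff when the server is unreachable. It is not the code from my repo; `fetch` and `run` are injected placeholders (in practice `fetch` would be an HTTPS GET, and network errors from `urllib` are subclasses of `OSError`).

```python
import random
from typing import Callable, Optional


def next_delay(base_s: float, failures: int, cap_s: float = 300.0, jitter: float = 0.1) -> float:
    """Poll interval with exponential backoff on failures and a little jitter."""
    delay = min(cap_s, base_s * (2 ** failures))
    # Jitter keeps a fleet of agents from checking in at the same instant.
    return delay * (1 + random.uniform(-jitter, jitter))


def poll_once(fetch: Callable[[], Optional[str]], run: Callable[[str], None]) -> bool:
    """Check in once and run the pending command, if any. True means the check-in succeeded."""
    try:
        command = fetch()
    except OSError:
        return False  # server unreachable; the caller should back off
    if command is not None:
        run(command)
    return True
```

The agent's main loop is then just `poll_once(...)`, update the failure count, and `time.sleep(next_delay(...))`, so worst-case command latency is roughly one poll interval.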
I ended up building a small self-hosted system around this model (repo linked below), mainly to test the idea in practice.
One extra detail: I used AI tools (mostly for boilerplate, refactoring, and some API plumbing), but the architecture decisions and tradeoffs were manual. Curious how others are approaching this - especially whether AI is actually helping in infra tooling, or just speeding up iteration...
Repo: [https://github.com/tyxak/remotepower](https://github.com/tyxak/remotepower)
Questions I’m interested in:
* Are people still defaulting to SSH/VPN, or moving toward pull-based models?
* How do you handle remote execution for nodes you don’t want exposed?
* Is the polling model a dead end at scale, or still valid for small/medium setups?
* Where (if anywhere) has AI actually been useful in your infra/tooling work?
Interested in real-world approaches and tradeoffs.
https://redd.it/1swi91i
@r_devops
We took production down for 20 minutes because of a DB migration, how do you prevent this?
Yesterday we had a migration that added an index to a large table without thinking much about it.
Turns out it locked the table and took the whole app down for 20 minutes.
It wasn’t caught in code review, and our CI didn’t flag anything.
Now we’re trying to figure out how to prevent this kind of thing from happening again.
For teams that run migrations regularly:
* Do you have any safeguards in place?
* Do you rely on code review only?
* Have you had similar incidents?
Curious what’s actually working in practice.
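One cheap safeguard is a CI check that flags risky statements before review. The sketch below assumes PostgreSQL (the post does not say which database this was), where a plain `CREATE INDEX` blocks writes to the table while it builds and `CREATE INDEX CONCURRENTLY` avoids that. The `risky_statements` helper is a hypothetical name, and splitting on `;` is deliberately naive.

```python
import re

# CREATE [UNIQUE] INDEX that is not immediately followed by CONCURRENTLY.
BLOCKING_INDEX = re.compile(
    r"\bCREATE\s+(?:UNIQUE\s+)?INDEX\b(?!\s+CONCURRENTLY\b)", re.IGNORECASE
)


def risky_statements(sql: str) -> list[str]:
    """Return the statements that build an index without CONCURRENTLY."""
    # Naive split: good enough for simple migration files, not for SQL with ';' inside strings.
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    return [s for s in statements if BLOCKING_INDEX.search(s)]
```

Failing the build when this list is non-empty forces a conscious decision, although `CONCURRENTLY` has its own constraint: it cannot run inside a transaction block, which some migration tools use by default.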
https://redd.it/1swbj6e
@r_devops
OSS project: deterministic cloud + LLM testing locally. Would this be useful?
Biggest gap I’ve been running into lately is deterministic testing for cloud + LLM workflows without calling real services. Curious how others are solving this.
I ended up building a small runtime for my own use that:
* emulates AWS, Azure, and GCP APIs locally
* works for SDK calls, Terraform runs, and CI testing (SQLite or in-memory)
* includes a local dashboard to inspect resources and verify state changes
One thing I focused on was LLM workflows. It has a config-driven simulation for Bedrock-style APIs that lets you:
* simulate responses (text, schema, static)
* inject errors (throttling, failures)
* control latency + streaming behavior
* define prompt-based rules
Basically lets you test retry logic, routing, and edge cases without calling real models.
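As a rough illustration of the kind of test this enables (not the project's actual API), here is a fake model call that throttles a fixed number of times and a retry wrapper exercised against it. `Throttled`, `flaky`, and `with_retry` are made-up names for the sketch.

```python
import time
from typing import Callable


class Throttled(Exception):
    """Stand-in for a provider's throttling error."""


def flaky(failures: int, reply: str) -> Callable[[str], str]:
    """Fake model call that throttles `failures` times, then answers."""
    remaining = [failures]

    def call(prompt: str) -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise Throttled()
        return reply

    return call


def with_retry(call, prompt, attempts=3, base_delay_s=0.0, sleep=time.sleep):
    """Retry on Throttled with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return call(prompt)
        except Throttled:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)
```

Because the failure count is fixed, the test is deterministic: two injected throttles with three attempts succeeds, three injected throttles does not.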
[Screenshot of the Bedrock dashboard showing simulated responses which can be from fixed JSON, schema generated data, and lorem ipsum text](https://preview.redd.it/15sntwy21jxg1.png?width=2940&format=png&auto=webp&s=5142d6fbfedf0ff8f3046224f73d93a187f95081)
Not trying to recreate everything, just cover the common integration/testing paths I kept running into.
Would be interested in how others are approaching this, and if something like this would actually be useful in your workflows.
There’s also a lightweight Rust version I’ve been working on, and I’m considering moving the full runtime there to keep the footprint small.
Would love any feedback.
Project:
[https://github.com/creocorp/cloud-twin](https://github.com/creocorp/cloud-twin)
Docker:
[https://hub.docker.com/repository/docker/creogroup/cloudtwin](https://hub.docker.com/repository/docker/creogroup/cloudtwin)
https://redd.it/1sw6hzw
@r_devops
Declarative identity on Kubernetes: an operator approach for GitOps workflows
I've been running Kanidm (a Rust-based identity provider) on Kubernetes for a while now, and
eventually wrapped it in an operator because the manual setup got tedious.
This is about what I learned making identity infrastructure declarative, and why it matters if you
run GitOps-style clusters.
## The practical problem
If you self-host Kanidm the normal way, you end up with:
- Manual container setup and config generation
- Identity objects (persons, groups, OAuth2 clients) created through CLI or web UI
- No real integration with how you already manage the rest of your cluster
That works, but it doesn't fit a GitOps workflow. You want identity changes to go through the same
pipeline as everything else: manifests, commits, PRs, review.
## What the operator does
Kaniop handles:
- Kanidm deployment as a StatefulSet with proper replication
- CRDs for persons, groups, OAuth2 clients, service accounts
- Generated child resources that stay in sync with the parent spec
- Day-2 ops: upgrades, status conditions, cleanup on deletion
The idea is that you define identity objects in YAML, apply them with Flux or ArgoCD, and the
controller reconciles them against the actual Kanidm instance.
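The reconcile step boils down to comparing what the manifests say with what the identity provider currently has. A generic sketch of that comparison follows; it is not Kaniop's code, and the `plan` helper and its dict-shaped inputs are assumptions for illustration.

```python
def plan(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compare desired specs with observed state and list the actions needed."""
    return {
        # In the manifests but not in the identity provider yet.
        "create": sorted(desired.keys() - actual.keys()),
        # Present on both sides but with a different spec.
        "update": sorted(k for k in desired.keys() & actual.keys() if desired[k] != actual[k]),
        # Still in the identity provider but removed from the manifests.
        "delete": sorted(actual.keys() - desired.keys()),
    }
```

Running this on every change and on a timer is what makes the setup converge: an empty plan means the cluster manifests and the identity provider agree.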
## What surprised me
Users didn't care much about the operator existing. They cared about whether it survived noisy
cluster conditions, whether updates were understandable, and whether CRDs mapped cleanly to what
they actually do.
That pushed most of the work toward boring things:
- Status handling and condition reporting
- Finalizers and cleanup paths
- Patching edge cases when Kanidm API behavior changed
- Reducing surprise in generated child resources
## Real usage (not just my testing)
One user on r/kubernetes mentioned they've been running Kaniop for months with Flux GitOps,
managing OAuth2 for Grafana, Nextcloud, and NixOS hosts. They said it "works flawlessly" for their
setup.
That kind of feedback matters more than feature lists. If someone actually uses it day-to-day and
it stays stable, that's the proof.
## Honest limitations
- This is only useful if you already run Kubernetes and want Kanidm there
- If you want the simplest Kanidm deployment, an operator is overkill
- Kanidm is still evolving, so the operator has to chase API changes sometimes
- Not a drop-in replacement for enterprise IAM solutions
## Repo and docs
- https://github.com/pando85/kaniop
- https://pando85.github.io/
I built and maintain it. Not trying to sell anything here - just interested in the discussion about
whether operators make sense for identity infrastructure or whether people prefer thinner
deployment patterns.
https://redd.it/1sw2ezh
@r_devops
How do you debug when the same workflow behaves differently across environments?
Ran into something odd recently.
Same workflow, same inputs. Staging and prod both return 200s, CI is green, but the actual behavior is different. Logs didn’t really help. Everything looked “fine”, but clearly something was taking a different path under the hood.
Eventually tracked it down to a small difference in data that changed the execution path, but it took way longer than it should have. Curious how people usually approach this kind of thing. Do you rely on tracing tools? Add more logging? Replay requests locally? Something else?
Feels like this is one of those cases where logs just aren’t enough.
https://redd.it/1svyfj1
@r_devops
What do you use as the source of truth for fixes across release branches?
Had a small annoyance at work recently.
A fix had to be tracked across a couple of release versions, and it got surprisingly messy to tell what landed where.
For teams with multiple release branches, what do you usually rely on as the source of truth? Tickets, PRs, commits, release notes, or something else?
https://redd.it/1svup2s
@r_devops
3rd Year Engineering Student Seeking DevOps / Cloud Opportunity (Immediate Joiner)
Hi everyone,
I have completed my 3rd year of engineering and I am currently looking for an opportunity in DevOps / Cloud / IT Infrastructure / Support roles.
I have knowledge of:
• Amazon Web Services basics
• Linux
• Docker
• Jenkins
• Kubernetes basics
• Git / GitHub
• Shell scripting
I am fully available at any time and can join immediately. I have no other commitments and can dedicate myself completely to the role. I am ready to learn quickly, work hard, and grow in this field.
I urgently need an opportunity to support myself financially and build my career. If anyone has openings, internships, freelance work, or can provide a referral, I would truly appreciate it.
Please comment or DM me. Thank you.
https://redd.it/1svjamy
@r_devops
devops python course: what actually helped you go from basic scripting to real usage?
i work mostly in linux and bash. i can use python, read it, fix things, write small scripts etc
but in reality i just default back to bash or copy paste python and move on
every time i try to “get better” it’s either super basic tutorials or full dev courses building apps and frameworks which i don’t really need
what i actually want is:
* automation i understand
* using APIs properly
* python in pipelines instead of hacking things together
for people already in devops/sre did this just come from doing the job over time or was there something that made it click?
https://redd.it/1svg0pg
@r_devops
We discovered 40+ shadow APIs in production that nobody on the team knew existed. How does this happen?
After the Trivy supply chain hit we decided to do a full review of our cloud exposure. Started with the usual: workload scans, credential audits, misconfig checks.
Then someone asked: do we actually know all our APIs? Turns out we didn't. Found 40+ API endpoints in production that weren't documented anywhere. Some were deployed by teams that have since reorged. Some were spun up for a proof of concept that never got decommissioned. A few were publicly exposed, serving data they absolutely shouldn't have been.
Our API gateway only covers the endpoints we explicitly route through it. Everything else is invisible. And our existing API security tool requires agents and traffic mirroring, which means anything not instrumented is a blind spot.
What are people using for continuous API discovery across multicloud? Need something that finds what we don't know about, not just monitors what we do.
https://redd.it/1suz9ip
@r_devops
I spent quite a few late nights trying to build an extension that draws your entire infra topology inside your IDE and hope it helps someone else too 🙂
https://redd.it/1surlms
@r_devops
Do you trust AI agents running code on your machine?
I've been experimenting a lot with AI agents (Claude Code, etc.) that can execute code locally. Yesterday I ran into a situation where the agent suggested a command that I didn’t fully understand. It made me pause for a second because once you hit enter, it's already too late.
It got me thinking: there’s basically no control layer between what the agent decides to do and what actually runs on your system.
Curious how others are dealing with this.
Do you:
* just trust the agent?
* manually review everything?
* restrict what it can do somehow?
Have you ever had a moment where you thought “this could go wrong” 🤔?
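For the "restrict what it can do" option, the simplest control layer is an allowlist check that sits between the agent's suggestion and the shell. A minimal sketch is below; the `ALLOWED` set and the `is_allowed` helper are hypothetical, and this is a first filter, not a sandbox.

```python
import shlex

ALLOWED = {"ls", "cat", "git", "pytest"}  # hypothetical allowlist of programs
SHELL_OPERATORS = (";", "|", "&", ">", "<", "`", "$(")


def is_allowed(command: str) -> bool:
    """Allow only simple commands whose program is on the allowlist."""
    # Reject shell operators outright rather than trying to parse them.
    if any(op in command for op in SHELL_OPERATORS):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes and similar
    return bool(argv) and argv[0] in ALLOWED
```

Anything that fails the check gets shown to a human instead of being run. Allowed programs can still do damage through their arguments, so this narrows the review burden rather than removing it.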
https://redd.it/1suhoss
@r_devops