Reddit DevOps. #devops Thanks @reddit2telegram and @r_channels
Not convinced CI and IaC fully solve config drift in real environments
Been thinking about this after a few recent releases and I might be off here
We put a lot of effort into CI checks, terraform, and keeping infra defined as code. on paper it feels like environment drift should basically be solved
In practice it still shows up during incidents in small ways
* a config value changed during a past incident and never fully rolled back
* a regional setting added as a quick fix that never got synced elsewhere
* a service behaving slightly differently between staging and prod even though pipelines are green
What makes it harder is that none of this breaks deployments. Everything still passes validation and deploys cleanly
You only notice it when behavior starts diverging and then it turns into comparing logs, configs, and metrics across multiple systems trying to spot what is actually different
I know there's no single solution for this, but how do others handle this in their environments?
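One way to make this kind of drift visible before an incident is to diff the effective config of two environments key by key. The following is a minimal sketch, assuming each environment's config can be exported as a nested dict; the `staging` and `prod` values are made up for illustration and none of the names come from the post.

```python
from typing import Any

MISSING = "<missing>"  # placeholder shown when a key exists on one side only


def flatten(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys so values can be compared one by one."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def drift(left: dict[str, Any], right: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {dotted_key: (left_value, right_value)} for every key that differs."""
    a, b = flatten(left), flatten(right)
    result: dict[str, tuple[Any, Any]] = {}
    for key in sorted(set(a) | set(b)):
        lv, rv = a.get(key, MISSING), b.get(key, MISSING)
        if lv != rv:
            result[key] = (lv, rv)
    return result


# Hypothetical example data: a timeout changed during an incident,
# and a regional quick fix that only exists in prod.
staging = {"timeouts": {"connect_s": 5}, "region": {"retry_budget": 3}}
prod = {"timeouts": {"connect_s": 30}, "region": {"retry_budget": 3, "failover": "eu-west-1"}}

for key, (s, p) in drift(staging, prod).items():
    print(f"{key}: staging={s!r} prod={p!r}")
```

Running something like this on a schedule against the live values (not the values in the repo) is what catches changes that never went through the pipeline.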
https://redd.it/1sui92n
@r_devops
We have 31 GitHub org owners. The entire reason is that our base member permissions made creating a repo require org owner.
Took over GitHub administration 8 months ago. First thing I did was pull the org owner list expecting maybe 4 or 5 people. 31 org owners.
Went back through the audit log to figure out how. The pattern is completely consistent. Developer needs to create a repo. Default member permissions in our org were set to none which means members cannot create repos at all. Dev opens a ticket. IT or whoever had org owner at the time just elevated them to org owner rather than creating the repo for them or figuring out a delegated permission model. Easiest path. Repeated 31 times over 3 years.
Org owner in GitHub is not a limited role. Those 31 people can delete any repo, change branch protection rules on anything, invite or remove members, modify Actions settings org wide, access the audit log, and probably a few other things I am forgetting. We have production repos in this org. We have repos with deployment secrets configured.
The actual fix for the original problem takes about 10 minutes. Create a team with repo creation permissions or set base permissions to allow members to create private repos. We did this. Nobody has needed org owner since.
Now the question is how to safely remove it from 31 people without someone screaming that a workflow broke. A few of them definitely have automations or webhooks configured under their personal tokens with org owner scope. No way to know which ones without going person by person.
Anyone done a safe org owner reduction at this scale? Specifically interested in how you identified who was actually using the permissions versus who just had them sitting there.
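One way to approach the "who actually uses it" question is to count, per owner, the audit log events that look like elevated actions. This is a minimal sketch, assuming the audit log has been exported as a list of JSON records with `actor` and `action` fields; the `OWNER_ACTIONS` set is an illustrative subset that should be checked against your own export, and the logins are made up.

```python
from collections import Counter

# Assumption: an illustrative subset of actions that suggest elevated rights
# were exercised. Verify each name against your own audit log export.
OWNER_ACTIONS = {
    "repo.destroy",
    "org.add_member",
    "org.remove_member",
    "org.update_member",
}


def owner_activity(events: list[dict], owners: list[str]) -> dict[str, Counter]:
    """Count elevated actions per owner login."""
    usage = {login: Counter() for login in owners}
    for event in events:
        actor, action = event.get("actor"), event.get("action")
        if actor in usage and action in OWNER_ACTIONS:
            usage[actor][action] += 1
    return usage


def idle_owners(events: list[dict], owners: list[str]) -> list[str]:
    """Owners with no elevated actions in the exported window."""
    usage = owner_activity(events, owners)
    return sorted(login for login, counts in usage.items() if not counts)


# Hypothetical export covering the review window.
events = [
    {"actor": "alice", "action": "org.add_member"},
    {"actor": "alice", "action": "repo.create"},
    {"actor": "bob", "action": "repo.create"},
]
print(idle_owners(events, ["alice", "bob", "carol"]))
```

Owners that come back idle over a long enough window are the low-risk candidates to downgrade first; the rest need the person-by-person conversation.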
https://redd.it/1sucqeo
@r_devops
What’s your take on FinOps?
What’s your take on FinOps, have you seen value from it or is it nothing but noise?
Looking at our cloud spend and wondering if it’s worth going down this path more seriously than just regular cost deep dives every 2-3 months.
What’s been your experience?
https://redd.it/1subkzg
@r_devops
Want to create a homelab for Kubernetes. How much do I need to spend?
Hey, folks!
I do not want to build a Kubernetes cluster on a laptop. I want to buy a machine and develop a Kubernetes lab on it. How much do I need to spend? Would anyone be able to help me? I already have monitors.
Like 32 GB ram, hard disk, etc (I live in the US)
A multi-node environment with a budget of less than 500 USD. For basic projects.
https://redd.it/1su0qur
@r_devops
Analysis and IOCs for the @bitwarden/cli@2026.4.0 Supply Chain Attack
https://www.endorlabs.com/learn/shai-hulud-the-third-coming----inside-the-bitwarden-cli-2026-4-0-supply-chain-attack
https://redd.it/1styq6i
@r_devops
Brainstorming ideas for my final thesis. HELP.
To make it short, my project is about provisioning and deployment using Ansible and Terraform and I was most likely going to use AWS for ec2 instances but I'm not quite sure.
So, I have the main idea down; I just want someone to help me come up with a sufficiently complicated use case.
Something like using Ansible+Terraform for AWS infrastructure, but I feel like this idea is just a little too broad and I'd like help! Thanks.
https://redd.it/1sttw2e
@r_devops
Approaches and tooling for Infrastructure Automation, not just IaC, in real life?
I want to understand what you use in your on-prem environment for infrastructure automation: provisioning, configuring, and managing infrastructure, including networking, network security, and compute/virtualization components. I am kind of looking for one solution/tool to rule them all that covers infrastructure day 0/1/2... I'm trying to get an as-centralized-as-possible model instead of spreading the tasks across several tools.
I am semi-good with Terraform and Git for building/provisioning the infrastructure, but I keep hearing that I am wrong to use Terraform for Day 2 or configuration management... that I need Ansible... But I never get a sense of why... In my mind, with state built into Terraform, wouldn't it be a more suitable solution for configuration management?
Anyway, what do you guys use in real life or in production on-prem? No public IaaS.
https://redd.it/1stpje5
@r_devops
Survey for end-of-studies project
Hi everyone,
I'm a student working on my end-of-studies research project on how engineers actually build the skills to diagnose and resolve technical problems: things like production incidents, weird bugs, outages, systems you inherited that break in ways you've never seen before.
What I'm trying to understand: when engineers feel under-prepared or stuck in these moments, what actually helps them get better? Formal training? Hands-on practice? Mentorship? Just experience? Something else?
The reason I'm asking here: the existing research I found is mostly about tools and processes, not about the human learning side. I'd like to hear from people who actually deal with this.
What I'd love from you:
- 4 minutes of your time for a survey (link below)
- No product, no pitch, no mailing list signup
- Anonymous by default; optional email at the end if you'd be open to a 15-min chat
- I'll share the anonymized results back to this subreddit once I have 30+ responses
The survey asks about your role, your experience with incidents, what you've tried to get better, and what would actually help. It's structured so you can skip parts that don't apply to you.
https://forms.gle/S9mMfcuYf3dn6s9r8
Thanks so much, even if you don't fill it out
https://redd.it/1stlbmz
@r_devops
Anyone here learning DevOps and actually building stuff? Looking for people to team up with.
Hey everyone,
I don't know if this is the right space to post this but I’m currently transitioning into the DevOps space and I’ve been spending a lot of time learning and building projects.
But honestly, doing this alone is starting to feel a bit slow and kind of isolating. I feel like it would be way better to have a few people in the same phase where we can just share what we’re working on, talk through problems, maybe even build small stuff together or just keep each other accountable.
A bit about me:
I’ve covered Linux, Networking, AWS fundamentals (SAA level), Containers (Docker) and Kubernetes (cleared CKA)
Currently exploring things like CI/CD, infrastructure as code and Observability
I’m trying to focus more on building hands-on projects instead of just consuming content.
This isn’t meant to be anything formal. Just a small group or a few people trying to push each other, stay accountable, grow together and exchange ideas :))
If this sounds like you, drop a comment or DM. Would love to connect.
https://redd.it/1sthth4
@r_devops
how long did it take your team to admit the GPU K8s cluster was a mistake
asking because we just had the conversation and it took us about eighteen months longer than it should have
the signs were there pretty early. the failure recovery logic kept growing. driver version management across heterogeneous nodes was a constant background tax. utilization numbers that looked fine at the cluster level but masked a lot of waste at the workload level
the thing that kept us going was sunk cost and the feeling that we were almost there with the custom orchestration. we weren’t. we were just adding more bash
eventually did the math on senior engineer hours going into infrastructure maintenance vs what those hours were worth on product work. the answer was embarrassing
curious how long this took for other teams and what finally pushed you to make the switch
https://redd.it/1staz0r
@r_devops
Will I do well as a cloud architect?
I’m a DevOps engineer (CI/CD, Kubernetes, some cloud work). I enjoy DevOps and the hands-on work. I recently got an offer for a Cloud Architect role (it was mentioned that it might require some DevOps/hands-on capabilities). Their team has multiple architects (security, network, platform), so I’d focus on cloud/platform. I care about growth, but I know I’m not the strongest DevOps engineer yet and still have a lot to learn. That said, I do feel I have some mid-level understanding of architecture and system design.
For those who moved from DevOps to Architect, was it worth it at this stage? Did you lose hands-on work too quickly? Or were you able to stay technical while growing into the role? Also, for people who genuinely enjoy DevOps work, did you still enjoy the architect role and responsibilities? Trying to decide if I should take this or deepen my DevOps skills first.
https://redd.it/1stb0iq
@r_devops
ALB returns 503 Service Unavailable even though EC2 + Nginx + Docker app works via public IP
https://www.reddit.com/user/Dependent_Leek_6655/comments/1sssl41/alb_returns_503_service_unavailable_even_though/#lightbox
https://redd.it/1sssscp
@r_devops
I feel like I am behind in DevOps after this conversation
I had a nice chat with my teammate, who does not have any coding background. I built a brand new CI/CD pipeline that is used to deploy resources in AWS. He told me that I am doing it the old way. He said that the new way our team must work is to use an existing tool like ArgoCD and then teach our developers to use it. Am I really behind? I feel like I am building automation tools based on what developers would like to have, and I was told I'm doing it the old way. Am I missing something? Please let me know. TIA!
Oh he also said, 'programming is dead, it's thing from the past' LMAO
https://redd.it/1sst8hn
@r_devops
Needed an OTel trace analyzer that detects N+1 and other anti-patterns from OTLP, Jaeger, Zipkin and Tempo, and wondering about the reliability ceiling of passive capture
https://redd.it/1ssob4p
@r_devops
Some incident management tool for alerts deduplication and Slack notifications with SSO?
Hey guys, I'm looking for a tool that would deduplicate alerts, create posts in a specific Slack channel, and update the alerts and the posts bi-directionally. No on-call schedule, calls, SMS, AIOps, and similar stuff is needed.
For the "bi-directionally", I'll clarify what I mean with an example. When an engineer marks an alert as acknowledged or resolved in Slack, it's updated accordingly in the tool backend. When it's done on UI/API side, the alert message is updated in Slack.
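To make the requirement concrete, here is a minimal sketch of the two behaviours described above: deduplication by a label fingerprint, and status changes that only get pushed to chat when they did not originate there. It is not how goalert or PagerDuty work internally; `AlertStore`, `outbox`, and the label names are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(labels: dict[str, str]) -> str:
    """Stable ID for an alert: hash of its labels, independent of key order."""
    canonical = json.dumps(labels, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


@dataclass
class Alert:
    labels: dict[str, str]
    status: str = "firing"
    count: int = 1


@dataclass
class AlertStore:
    alerts: dict[str, Alert] = field(default_factory=dict)
    # (fingerprint, status) updates that still need to be sent to the chat channel
    outbox: list[tuple[str, str]] = field(default_factory=list)

    def ingest(self, labels: dict[str, str]) -> tuple[str, bool]:
        """Return (fingerprint, is_new). Repeats of an open alert only bump a counter."""
        key = fingerprint(labels)
        existing = self.alerts.get(key)
        if existing and existing.status != "resolved":
            existing.count += 1
            return key, False
        self.alerts[key] = Alert(labels=dict(labels))
        self.outbox.append((key, "firing"))
        return key, True

    def set_status(self, key: str, status: str, source: str) -> None:
        """Apply a status change; echo it to chat unless it came from chat."""
        alert = self.alerts[key]
        if alert.status == status:
            return
        alert.status = status
        if source != "slack":
            self.outbox.append((key, status))
```

A real tool would also edit the original Slack message in place, which needs the message timestamp stored next to the fingerprint.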
OIDC integration for SSO is highly desirable, but I think that it's possible to live without it, if everything else is good.
Open-source solutions are preferred, but I'm okay with a paid option if it's not too expensive. Right now I'm looking at target/goalert and PD as possible options.
I'd appreciate any suggestions and insights from engineers that had experience with such a tool
https://redd.it/1ssjh25
@r_devops
Curious how DevOps/platform teams are handling AI pipeline security right now.
For teams building with LLMs, agents, copilots, RAG, etc., where is security actually getting enforced?
Things like:
* what data gets pulled into the pipeline
* what context/data gets sent to models or external tools
* what agents are allowed to do (actions, permissions)
* how secrets, PII, and internal context are protected
* where controls live (app code, gateways, sidecars, containers, K8s policy, etc.)
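As an illustration of a control that could live in a gateway or in app code, here is a minimal sketch of redacting obvious secrets and PII from a prompt before it leaves the pipeline. The two patterns are deliberately simple assumptions (an email regex and the common `AKIA` access key ID prefix); a real deployment would need a much broader detector.

```python
import re

# Assumption: two illustrative patterns only; real detectors cover far more.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace matches with a tagged placeholder and report how many were found."""
    findings: dict[str, int] = {}
    for name, pattern in PATTERNS.items():
        text, hits = pattern.subn(f"[REDACTED:{name}]", text)
        if hits:
            findings[name] = hits
    return text, findings


clean, found = redact("contact bob@example.com key AKIAABCDEFGHIJKLMNOP")
print(clean)
print(found)
```

Returning the findings alongside the cleaned text makes it possible to log or block on them, which is where the ownership question (app team versus platform/security) tends to surface.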
Also curious who owns this in practice.
Is this usually starting with developers/app teams because they are building the AI workflows first, then getting handed off to platform/security later?
Or are platform/security teams setting standards upfront?
I’m also seeing a pattern where teams start with hosted API tools for speed, then move toward containerized or self-managed deployments once governance, auditability, and data control matter more.
It feels like the tooling path may be developer-led early on, but long-term ownership shifts to platform/security once things move beyond experimentation. These days it might just all sit with the developers though, not sure.
Is that actually happening in real orgs, or are most teams still figuring this out case by case?
Would love to hear what this looks like in different orgs from people running or supporting these systems.
https://redd.it/1suiumk
@r_devops
What happens to your cloud setup when the engineer who built it leaves?
Our lead infrastructure engineer quit in January, and three months later we are still finding things we don't understand: not just undocumented services, but design decisions that made sense to him and that nobody else can explain. We had an outage last week that took us six hours to resolve because the person who would have known exactly where to look wasn't there anymore.
The worst part is there's no list of what's missing. we only find out something exists when it breaks. Every time we touch something, we find another dependency that isn't written down anywhere.
how do other teams handle this, is there a way to get ahead of it before someone leaves or do you just find out the hard way?
https://redd.it/1suau0a
@r_devops
Need advice, I'll be in devops role soon
Hey people,
My manager asked me to work on automation and he wants to promote me to a role there.
It is a devops role based on python is what he told me.
I can write snippets in python to receive responses from APIs.
What else should I know?
I'm pretty excited as devops is something I wanted to be in for a long time.
And it's a premature promotion. I have not reached the expected months of experience yet. So my manager is doing a lot of heavy lifting here. I don't know what made him do this for me, did I overachieve? Idk lol.
https://redd.it/1su8kzj
@r_devops
Which is more of a concern today.. Security? Or Cost?
I think the bigger you are, the less cost is a concern and the more security is. Why... the larger you are, the more you attract the hackers, and the less 'organized' your organization is just given the fact that many different people touch the same systems (many different ways of doing things, no 100% cohesiveness, much older systems still in use.. hence vulnerabilities (think airports)). But the larger you are, the more you can 'absorb' fluctuations in costs. On the contrary.. the smaller you are, the more you are susceptible to market cycles (less cash, less credit, etc).. but the more secure you are given merely by the fact that not as many people touch your systems = not as many mistakes, plus hackers prefer catching the bigger fish.. over the smaller.. AND smaller organizations can improve systems and operations MUCH faster than a larger one with less chance of using outdated vulnerable infrastructure. IMHO.
https://redd.it/1su077d
@r_devops
Analysed 2,000+ developer sites - Cloudflare on 38%, Azure and GCP nearly invisible
https://redd.it/1stwu24
@r_devops
SWE with frontend background pivoting toward cloud/security — is DevOps/platform the right on-ramp, and do CCNA/RHCSA matter here?
Background
BS in SWE (2023), ~2 years frontend / React / UI-UX since. No sysadmin, no on-call, no infra ownership.
Laid off ~2 months ago. Using the runway to pivot.
Done since layoff: Security+, AWS SAA (Cantrill). C
Building a homelab to get actual hands-on time
What I'm actually trying to figure out
Long-term target is cloud security engineer. The common advice on security subs is help desk → sysadmin → security, but that feels like a detour given I can already code and ship. DevOps/platform keeps coming up as a more direct route that uses my existing skills (CI/CD, IaC, code review, automation) while forcing me to actually learn the infra side on the job.
So my questions for this sub specifically:
1. Is DevOps/platform realistically a better on-ramp than help desk → sysadmin for someone with a SWE background aiming at cloud security? Or am I romanticizing it because it sounds more like what I already do?
2. What does a junior/associate DevOps resume actually need to look like coming from pure frontend? I can write Terraform and GitHub Actions, I've touched Docker, but I've never owned a production pipeline or been paged at 3am. What closes that gap fastest — homelab projects, OSS contributions, something else?
3. Cert question, honestly: I'm weighing CCNA, RHCSA, and AWS Security Specialty as the next thing. I want a sanity check from people actually doing hiring. If one of them is worth it, which?
4. Any tools or areas where spending a focused month would meaningfully change how my resume reads? Kubernetes is the obvious one. Considering also going deeper on Terraform + a real multi-account AWS setup, or picking up something like Snyk / Trivy / OPA to start bridging toward the security side.
Runway isn't the bottleneck (moved back home, months of savings). Direction is. I'd rather spend the next 3 months building one thing that actually demonstrates platform/security-adjacent capability than stacking certs that hiring managers skim past.
Appreciate any honest takes — including "you're not ready, go do help desk" if that's genuinely the read.
https://redd.it/1stt4rg
@r_devops
Months of flaky CI, and the RCA was waiting in CloudTrail the whole time
We had a bug in our self-hosted GitHub Actions runners that failed jobs every other day for two and a half months. The failure was intermittent, the workaround was a one-click rerun, and nobody made it a priority - until our CTO pinged the security channel asking "is this a known problem?"
The first RCA was wrong. A teammate used an AI assistant to analyze the error and it produced a plausible, internally consistent, specific theory involving warm-pool hibernation. The problem was that the AI was working from **only** an error message and a handful of instance IDs - evidence thin enough to support several different mechanisms.
What actually caught the bug was querying CloudTrail and feeding the results into the LLM. We hadn't set up Athena against our Control Tower log-archive bucket. A day of tedious Terraform later, I had a partitioned table with partition projection over three years of org-wide CloudTrail events. One query, and the real race condition was visible to the second.
Writeup covers:
* The two-stage wrong RCA (from observation theory to AI-refined theory)
* The Athena-over-CloudTrail setup (the pattern that probably works for your org too)
* The CloudTrail timeline of the actual race, to the second
* The "make the race survivable" design decision (rather than trying to close it)
* Four PRs across two repos, including three silent systemd bugs we fixed along the way
[https://infrahouse.com/blog/2026-04-20-ci-was-failing-every-other-day-for-months/](https://infrahouse.com/blog/2026-04-20-ci-was-failing-every-other-day-for-months/)
Happy to answer questions about the scale-in race, the Athena setup, or the systemd side.
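For readers who want to try the same kind of timeline check on their own exported events, here is a minimal sketch that flags cases where one event follows another on the same instance within a few seconds. It is not the query from the writeup: the `instance_id` field is assumed to have been extracted from each record beforehand, and the event name `JobAssigned` is a hypothetical stand-in for whatever the first half of the race is in your setup.

```python
from datetime import datetime


def parse_time(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2026-04-01T12:00:02Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def close_pairs(events: list[dict], first: str, second: str, window_s: float) -> list[tuple[str, float]]:
    """Return (instance_id, gap_seconds) where `second` follows `first`
    on the same instance within `window_s` seconds."""
    last_first: dict[str, datetime] = {}
    hits: list[tuple[str, float]] = []
    for event in sorted(events, key=lambda e: parse_time(e["eventTime"])):
        instance = event["instance_id"]  # assumption: pre-extracted field
        when = parse_time(event["eventTime"])
        if event["eventName"] == first:
            last_first[instance] = when
        elif event["eventName"] == second and instance in last_first:
            gap = (when - last_first[instance]).total_seconds()
            if gap <= window_s:
                hits.append((instance, gap))
    return hits


# Hypothetical sample: one instance is terminated two seconds after getting work.
events = [
    {"eventTime": "2026-04-01T12:00:00Z", "eventName": "JobAssigned", "instance_id": "i-aaa"},
    {"eventTime": "2026-04-01T12:00:02Z", "eventName": "TerminateInstances", "instance_id": "i-aaa"},
    {"eventTime": "2026-04-01T12:00:00Z", "eventName": "JobAssigned", "instance_id": "i-bbb"},
    {"eventTime": "2026-04-01T12:10:00Z", "eventName": "TerminateInstances", "instance_id": "i-bbb"},
]
print(close_pairs(events, "JobAssigned", "TerminateInstances", window_s=5))
```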
https://redd.it/1stkqn7
@r_devops
What keeps you going as a DevOps Engineer?
Hi all, I have an assignment for university where I have to create 2 personas of people in an IT related field. I decided to go with a DevOps Engineer for one of them.
Google and personal experience with my homelab only gets me so far in creating this persona, it gives an indication of what the job might entail, but it doesn't give much insight in the experience of a DevOps Engineer and the methods of a professional DevOps Engineer.
So as a starting point to creating a persona I am interested to know what motivates you guys to be a DevOps Engineer? After having worked in this field for a while, do you experience the job the same as when you started? Do you have any worries for the future? Is there anything you're still working towards?
I appreciate any and all input.
Thanks!
https://redd.it/1stiz5v
@r_devops
Why I chose Amazon ECS over k8s
I decided to go with AWS ECS instead of k8s because of k8s's complexity and steep learning curve.
Our server is a monolith, not microservices.
I just want to deploy easily to EC2.
Our deploy flow is like below.
1. trigger github action using a slack command
2. github action: builds spring boot docker image
3. github action: uploads the image to aws ecr
4. github action: tells aws ecs to pull and run the image
Is this a good choice? or are there better alternatives I should consider?
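Steps 2 to 4 of the flow above can be sketched as the commands a workflow step would run. This is a minimal illustration, assuming the runner is already logged in to ECR and that the task definition points at a mutable tag, so forcing a new deployment makes ECS pull the fresh image; the repository URI, cluster, and service names are placeholders.

```python
import subprocess


def deploy_commands(image_repo: str, tag: str, cluster: str, service: str) -> list[list[str]]:
    """Build, push, and roll out an image; returns argv lists in execution order."""
    image = f"{image_repo}:{tag}"
    return [
        ["docker", "build", "-t", image, "."],
        ["docker", "push", image],
        # Restarts the service's tasks so they pull the image again.
        ["aws", "ecs", "update-service",
         "--cluster", cluster,
         "--service", service,
         "--force-new-deployment"],
    ]


def run_all(commands: list[list[str]]) -> None:
    """Run each command and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Placeholder values for illustration only.
commands = deploy_commands(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/app", "latest", "prod", "api"
)
for argv in commands:
    print(" ".join(argv))
```

If the task definition pins an immutable tag or digest instead, the last step would have to register a new task definition revision and point the service at it rather than only forcing a redeploy.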
https://redd.it/1stfni5
@r_devops
When did you come to the realisation that it's all just bs, and you should just nod along?
I said that we have a few Linux servers, and the Senior SRE "corrected" me saying they are not Linux, but Ubuntu servers.
lol
https://redd.it/1stci76
@r_devops
Would an incident-focused copilot actually be useful?
Hey folks,
I've been working in incident-heavy environments (NOC/SOC / on-call rotations), and one thing that still feels pretty painful is the investigation workflow.
Even with tools like Grafana, logs, and alerting systems, I still find myself constantly jumping between systems to piece together:
* what happened
* what changed (deploys, config, infra)
* and what the likely cause is
So I’ve been thinking about a more structured approach to this.
**Question 1:**
Would you find value in a dedicated *incident-focused copilot* (not a general LLM like ChatGPT), that:
* builds a timeline automatically (alerts + deploys + logs)
* surfaces possible correlations / change windows
* suggests investigation steps based on past incidents
The idea is not “AI finds the root cause,” but more like:
→ reducing investigation entropy and speeding up decision-making
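The timeline and change-window parts of this idea can be sketched without any AI at all. Below is a minimal sketch, assuming every source can be normalised into records with `kind`, `at`, and `what` fields; those field names and the sample events are hypothetical.

```python
from datetime import datetime, timedelta


def build_timeline(*sources: list[dict]) -> list[dict]:
    """Merge event lists from several systems into one list ordered by time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: e["at"])


def changes_before(timeline: list[dict], alert: dict, window: timedelta) -> list[dict]:
    """Deploys and config changes that landed within `window` before the alert."""
    start = alert["at"] - window
    return [
        e for e in timeline
        if e["kind"] in {"deploy", "config"} and start <= e["at"] <= alert["at"]
    ]


# Hypothetical events from three systems.
t = datetime(2026, 4, 1, 12, 0)
alerts = [{"kind": "alert", "at": t, "what": "5xx spike"}]
deploys = [
    {"kind": "deploy", "at": t - timedelta(minutes=10), "what": "api v42"},
    {"kind": "deploy", "at": t - timedelta(hours=3), "what": "api v41"},
]
logs = [{"kind": "log", "at": t - timedelta(minutes=1), "what": "upstream timeout"}]

timeline = build_timeline(alerts, deploys, logs)
suspects = changes_before(timeline, alerts[0], timedelta(minutes=30))
print([e["what"] for e in suspects])
```

The hard part in practice is the normalisation step, since each tool has its own timestamp format and notion of what a "change" is.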
**Question 2:**
Let’s say such a system improves over time by learning from past incidents across teams (anonymized / abstracted, not directly exposing raw data).
How would you feel about that?
* Would you be okay with certain data contributing to model improvement?
* Where would you personally draw the line? (e.g. architecture, logs, incident timelines, resolution steps, etc.)
Curious how people here approach this in practice:
* Mostly manual digging?
* Heavy reliance on tooling?
* Any trust in AI-assisted investigation today?
Appreciate any thoughts — especially from folks dealing with frequent incidents.
https://redd.it/1st939v
@r_devops
Feeling overwhelmed.
I landed a "junior devops" role having a modest background in web development. I'm about a couple months in and still haven't finished onboarding. I still don't have admin access to our eks clusters, but am getting tickets that require me to test against them, so I have to bother someone else to check the cluster for me for every little thing I want to test.
I'm leagues behind my teammates who have been doing this for decades, they're very helpful when I ask questions but they're typically busy. I'm also getting paired with an even newer employee and feel like I'm the blind leading the blind. I'm finally starting to wrap my head around our platform on a high level and feel a bit more confident navigating everything, but this whole experience has felt disorganized and overwhelming. I'm just trying to take it one day at a time and learn as much as I can, I just feel like I'm gonna randomly get fired lol.
Is this pretty normal?
https://redd.it/1ssuznf
@r_devops
podman - verify cosign signature
i'm going in circles. i need to sign images, and to make podman pull and run them only if signature is verified.
i have local docker repo, zot.
i have signed images
signed with
FLAGS=(
  "--key" "$KEY_FILE"
  "--tlog-upload=false"
  "--use-signing-config=false"
  "--allow-http-registry=true"
  "--registry-referrers-mode=legacy"
  "${ANNOTATIONS[@]}"
)
cosign sign "${FLAGS[@]}" "$IMAGE"
(i also tried without "--registry-referrers-mode=legacy", no difference)
cosign verify works just fine
"The following checks were performed on each of these signatures:
- The cosign claims were validated
- Existence of the claims in the transparency log was verified offline
- The signatures were verified against the specified public key
"
i have policy
"docker": {
  "gooseberry.home:5000": [
    {
      "type": "sigstoreSigned",
      "keyPath": ".cosign.pub",
      "signedIdentity": { "type": "matchRepository" }
    }
  ]
and registry
❯ batcat --plain registries.d/gooseberry.yaml
docker:
  gooseberry.home:5000:
    use-sigstore-attachments: true
podman refuses to pull
Error: Source image rejected: A signature was required, but no signature exists
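For reference, here is a sketch of how the policy fragment above would sit inside a complete policy file, assuming the usual containers `policy.json` layout where per-registry rules live under `transports`. The `default` rule of `reject` is an assumption (the post does not show it), and the key path is passed through exactly as it appears in the post.

```python
import json


def build_policy(registry: str, key_path: str) -> dict:
    """Assemble a full policy document around a single sigstoreSigned rule."""
    return {
        # Assumption: reject anything that has no explicit rule.
        "default": [{"type": "reject"}],
        "transports": {
            "docker": {
                registry: [
                    {
                        "type": "sigstoreSigned",
                        "keyPath": key_path,
                        "signedIdentity": {"type": "matchRepository"},
                    }
                ]
            }
        },
    }


policy = build_policy("gooseberry.home:5000", ".cosign.pub")
print(json.dumps(policy, indent=2))
```

Comparing the generated document with the file podman actually reads is a quick way to rule out a nesting mistake before looking at how the signature is stored in the registry.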
https://redd.it/1ssgghf
@r_devops
Anyone else frustrated with Google Meet chat disappearing? Built a quick fix
https://chromewebstore.google.com/detail/keoflebbbfemdfgggclhimpfcnnckpmk?utm_source=item-share-cb
https://redd.it/1ssmja2
@r_devops
Should I hide my previous experience?
Hi
I have 6+ years of experience as a DevOps engineer and 11 years of experience in total. Before that I was in IT infrastructure: I started as a network engineer and then moved to senior system administration.
My concern is that if I show more experience, it will be harder to find a new job. Recruiters may think about budget constraints.
https://redd.it/1ssgre3
@r_devops