
Telegram channel r_devops - Reddit DevOps


Reddit DevOps. #devops Thanks @reddit2telegram and @r_channels


Reddit DevOps

Need advice on my resume

So I’m a DevOps engineer with 9 years of experience. I used to get a lot of interviews and never had problems landing jobs.

I’m on a bit of a dry streak right now. I’m currently working a full-time position at a healthcare company, but I feel like I’ve plateaued and I’m trying to see what my options are.

Is there anyone who would like to take a look at my resume and give me feedback?

Thank you!

https://redd.it/1t5x7rv
@r_devops


Reddit DevOps

Lead engineer said I lack communication skills after I let every dev vote on which new observability tool to use, even though all devs had already acked my post 1 month before the vote day. He said that if I didn’t inform him directly, I lack communication skills, regardless of whether the devs knew

Please feel free to give your honest opinion. I don’t mind if I’m wrong, because I’m here to learn, not to debate.

Context: I'm DevOps in a small tech company, 7 devs in the entire company, no product, no clients

Full details

Our company is currently using an expensive enterprise observability tool.

My lead told me in the office that he wanted me to switch to a self-hosted solution to reduce costs.

So I deployed 2 self-hosted observability tools and let every dev test them for 1 month. The plan was for the devs to vote on which tool they wanted to use.

I didn’t inform my lead directly about the voting plan because I thought that as long as it significantly reduced costs, as he wanted, it should be okay.

When the vote day came, the devs voted normally without any issue.

But my lead urgently messaged me and said this was not the right way to work... He said I should have informed him first, created a formal architecture decision document, and scheduled a meeting with him before doing this.

TLDR: My lead wanted a formal architecture decision document, even though the new tool was a clear win and our company only has 7 devs with zero clients.

My opinion: I think a lead should strike a balance between "innovation" and "policy", especially in a small company.

In this case, we were moving from an expensive enterprise tool to a cheaper self-hosted internal tool. Since it was only for internal use, had no legal or compliance issues, and the devs were okay with it, I thought it should be fine. Overly rigid policy for a 7-dev company with zero clients feels like overkill for this case...

https://redd.it/1t5q0jn
@r_devops


Reddit DevOps

sshc – a minimal CLI for managing SSH configs

Hey everyone,

I wrote a small utility called sshc to help manage ~/.ssh/config.

It helps with dealing with a single giant SSH config file or multiple SSH config files.

It’s just a simpler way to stay organized if you have a lot of servers to manage.

I’m a fan of low-ceremony tools that stay out of the way, so I kept this pretty minimal.

If you want to check out the code or use it, it's here: https://github.com/cortiz/sshc

Feedback is always welcome.
Cheers.

https://redd.it/1t5f2lp
@r_devops


Reddit DevOps

Pre-push hook that catches AI-IDE leaks Gitleaks misses. Looking for genuine feedback

Hey devops folks.

I built something small and I'm wondering if you've seen this problem in your org.

THE PROBLEM:

Last week someone on my team pushed an OpenRouter API key to env.example. They said Claude Code generated too much code and they just missed it.

Looked around and found it happens... constantly. People using Cursor/Claude/Copilot are leaking:

- .cursor/ workspace folders (codebase structure)
- CLAUDE.md files (architecture thinking)
- Copilot cached contexts (conversation history)
- .env mistakes (hallucinated secrets)

Gitleaks catches API keys great. Misses all of this.

THE SOLUTION:

Built a small pre-push hook called ghst. One command: `ghst install`

Runs on every push. Blocks if it finds leaks. Silent if clean.

Detects 68+ token patterns + AI-IDE specific patterns.
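The path-based part of such a check is easy to picture. Below is a minimal sketch of the idea, not ghst's actual implementation: `BLOCKED_DIRS`, `BLOCKED_NAMES`, and `flag_paths` are hypothetical names, and the rules only mirror the categories listed in this post. It assumes the hook has already collected the list of file paths about to be pushed.

```python
from pathlib import PurePosixPath

# Illustrative rules only (assumption): they mirror the categories named in
# the post, not the real rule set of any tool.
BLOCKED_DIRS = {".cursor"}
BLOCKED_NAMES = {"CLAUDE.md", ".env"}


def flag_paths(paths):
    """Return the paths that sit under a blocked directory or have a blocked name."""
    flagged = []
    for raw in paths:
        path = PurePosixPath(raw)
        parent_dirs = path.parts[:-1]  # every component except the file name
        if path.name in BLOCKED_NAMES or BLOCKED_DIRS.intersection(parent_dirs):
            flagged.append(raw)
    return flagged
```

A hook built on this would exit non-zero when `flag_paths` returns anything. Note that a name-based rule would not have caught the `env.example` case described above; that one needs content scanning for token patterns.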

REAL QUESTION:

If you manage dev teams, have you seen this? Do you think this is actually a problem worth solving?

I don't have a huge sample size yet, so honest feedback would be helpful. What's missing? What would make this useful for your team?

Looking for a DevOps perspective specifically: does this fit your security posture?

[Comment with the link]

https://redd.it/1t5eew7
@r_devops


Reddit DevOps

Random ContainersNotReady build helper failures on GitLab Kubernetes runners after switching to custom CI Docker image

Hi everyone,

Following advice from my previous post, I switched from restoring a ~3 GB node_modules cache in every Nx CI job to using a custom Docker image with dependencies preinstalled.

It solved part of the cache overhead, but now my test stage (13 parallel jobs on GitLab Kubernetes runners) randomly fails on some jobs with:

ContainersNotReady: "containers with unready status: [build helper]"

Not all jobs fail, and it’s different every pipeline.

I have no access to runner/K8s logs, can’t pin pods to the same node, and don’t control runner resources.

Could this be caused by:
- image too heavy?
- too many parallel pulls/startups?
- runner/Kubernetes saturation?

Has anyone faced this random build helper issue after moving to a heavy custom CI image?

Thanks.

https://redd.it/1t57i9t
@r_devops


Reddit DevOps

docker request truncation bug bypasses AuthZ plugins (CVE-2026-34040)

Docker v29.3.1 dropped in March with a fix for CVE-2026-34040 (CVSS 8.8)

the bug is weird. Docker's middleware strips request bodies over ~1 MB before AuthZ plugins see them, but the daemon still processes the full thing. so the plugin evaluates an empty body, approves it, and the daemon runs whatever was actually in the request

the AuthZ plugin and daemon are literally looking at different requests

craft an oversized request, plugin sees nothing suspicious and approves it, daemon executes the full payload with elevated access. could spin up privileged containers, read bind mounted host files, maybe even break out depending on how things are configured

this is supposedly related to CVE-2024-41110 from 2024, which was "fixed" but apparently not really. i'm starting to think nobody actually tests these patches

mainly a problem if you expose the Docker API over TCP (even internally), run CI/CD that talks to Docker remotely, or lean on AuthZ plugins for access control

check your version:

`docker version --format '{{.Server.Version}}'`

anything under 29.3.1 has the bug
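if you are checking many hosts, the comparison can be scripted. a minimal sketch, assuming the version string comes from the command above and that 29.3.1 is the first fixed release as stated in this post; `parse_version` and `is_affected` are hypothetical helper names:

```python
import re

# First fixed release according to this post (assumption taken from the text).
FIXED_VERSION = (29, 3, 1)


def parse_version(text):
    """Extract the leading major.minor.patch numbers from a version string."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(server_version, fixed=FIXED_VERSION):
    """True when the server version sorts strictly below the fixed release.

    Suffixes such as "-rc1" are ignored, so a pre-release of the fixed
    version is treated the same as the fixed version itself.
    """
    return parse_version(server_version) < fixed
```

tuples compare element by element, so `(29, 3, 0) < (29, 3, 1)` behaves the way a version comparison should without any extra library.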

if your Docker API is network accessible this is one to actually fix rather than add to the backlog and forget about

just ran into this while auditing our infra and would love to hear your thoughts

https://redd.it/1t55mr0
@r_devops


Reddit DevOps

Media Compression in user's browser

I want some suggestions on how best to handle media compression in the user's browser for my web app, since my server can't handle compression: it only has 1 vCPU and 2 GB of RAM. I'm willing to limit videos to 30 seconds and compress them to 480p in the user's browser. Is this a good technique, or is there a better one? I'm not willing to pay hundreds of bucks for Cloudflare Stream and the like.

https://redd.it/1t4o2e2
@r_devops


Reddit DevOps

How do you structure DevOps for personal projects?

I’ve been spinning up a lot more personal projects lately, partly thanks to our friend AI, and it made me reflect on how I structure the DevOps side of things.

I wanted to start a discussion and see how others are setting up their local dev, CI/CD, deployment, and infrastructure for side projects or small personal apps.

For my own projects, I usually start pretty simple. I tend to use a `Makefile` for common tasks like building, packaging, running, and deploying. From there, I’ll add more structure only when the project starts to feel real.

Curious how others approach this:

* Do you use Terraform from the beginning, or only after a project gets serious?
* What’s your go-to setup for running multiple apps?
* Do you have a default cloud provider or tech stack?
* Are you running everything on a single VPS/EC2 instance, containers, Kubernetes, serverless, something else?
* Are you using a personal cluster or homelab?
* How much CI/CD do you set up for side projects?
* What do you do to keep costs down?

Basically, what does your personal project DevOps setup look like, and what have you found to be worth the effort vs. overkill?

https://redd.it/1t4e5zp
@r_devops


Reddit DevOps

Your .env files are under attack
https://netflux.io/posts/your-env-files-are-under-attack/

https://redd.it/1t4ra9p
@r_devops


Reddit DevOps

Dealing with AI in Devops

Hello guys, how are you dealing with the pressure of AI lately? I have been in the field for almost a decade and am part of a team that is quickly adopting it, for example using AI agents to write IaC and frontier agents for debugging.

All I seem to do now is use AI to debug and plan future projects, without using enough of the skills I relied on earlier. It feels like AI may replace us soon, even though we are the ones implementing it.

https://redd.it/1t4legi
@r_devops


Reddit DevOps

What are your favorite Cloud Custodian policies?

I just implemented Cloud Custodian across our environment with checks for unused IAM roles and users. What are your favorite use cases for the tool? Looking for cool ideas on how to use the tooling to increase security.

https://redd.it/1t4it2b
@r_devops


Reddit DevOps

OpsGenie is shutting down April 2027 — wrote a migration guide covering the main alternatives

Atlassian confirmed OpsGenie EOL. A lot of teams are now figuring out what to move to and the options aren't obviously equivalent — some are SaaS-only, some are self-hosted, some require ripping out your whole alerting stack.

Wrote a breakdown covering the main options across both categories:

SaaS: PagerDuty, Grafana OnCall (Cloud), incident.io, Better Stack

Self-hosted / Open Source: Wachd, Grafana OnCall (OSS)

Each one covers: what it does well, what it doesn't, and who it's actually for.

Full guide: https://wachd.io/blog/opsgenie-alternatives-2026

Happy to answer questions — been through a few of these migrations.

https://redd.it/1t4geiv
@r_devops


Reddit DevOps

BSc CS student aiming for DevOps — how to move forward after Linux basics?

Hi everyone,

I’m currently pursuing a BSc in Computer Science and I’ve recently decided to focus on Cloud and DevOps.

So far, I’ve completed basic Linux (CLI, permissions, package management), and now I’m moving into networking. However, I’m feeling a bit overwhelmed because there’s a lot of scattered information online, and different people suggest different paths.

I’d really appreciate some guidance from those already working in this field:

What should I focus on next after basic Linux?
How deep should I go into networking for DevOps?
Which resources (courses, docs, or practice methods) actually helped you?
How can I avoid wasting time on unnecessary topics?

My goal is to build strong fundamentals and eventually become job-ready, not just follow tutorials blindly.

Any advice or roadmap suggestions would be really helpful. Thanks in advance!

https://redd.it/1t4de4o
@r_devops


Reddit DevOps

New to Control-M

I am new to Control-M. I am currently monitoring alerts and working in the planning section to schedule jobs and find their dependencies.

Should I continue learning Control-M? Is it worth it, or should I switch to Tidal or another similar, more prominent tool?

How’s the market for Control-M currently?

https://redd.it/1t47u6n
@r_devops


Reddit DevOps

What are some good alternatives to SendGrid (supporting the China Region)?

We are building a global application, and SendGrid is not supported in the China region. Can anyone suggest a good alternative?

https://redd.it/1t46gm8
@r_devops


Reddit DevOps

Rate my level as a first-year master's student and suggest how to improve

note: i used ai to correct typos

Hey, I hope I don't get a lot of hate for sharing this, but I'm a first-year master's student who is interested in DevOps and cloud. I know DevOps is not an entry-level field, but companies in my country hire junior DevOps engineers after completing a mandatory 4-month internship if they perform well.

I have a background in web development because I completed a bachelor's degree in web development, so I understand all layers of web applications.

My plan for now is to get the AWS SAA certification and improve my troubleshooting skills through labs.

If anyone has suggestions on how to improve, that would be helpful.

Some of the projects I worked on:

You will find more details in the GitHub repo linked with each project.

• Built a full GitOps-based DevSecOps platform on AWS EKS with Jenkins, ArgoCD, Argo Rollouts, Terraform, Vault, Prometheus, Loki, and Kustomize.

* Implemented dual CI/CD pipelines for application delivery and infrastructure changes, with integrated security scanning using Trivy, Snyk, and Gitleaks.
* Added canary deployments with automated rollback analysis using Prometheus metrics.
* Provisioned AWS infrastructure using modular Terraform.

LINK : [GitHub - saberBenhamda0/monolithic-devops-project · GitHub](https://github.com/saberBenhamda0/monolithic-devops-project)

• Built a private cloud / homelab platform on Proxmox that replicates cloud concepts like ECS, EKS, and EC2 using LXC, Docker, K3s, Traefik, Ansible, and Python automation.

* Automated VM/container provisioning and inventory management through Proxmox and pfSense APIs.

LINK: [GitHub - saberBenhamda0/homelab · GitHub](https://github.com/saberBenhamda0/homelab/tree/main)

• Built a Kubernetes microservices platform with Istio service mesh and runtime security using Tetragon (eBPF-based threat detection).

* Implemented mTLS, observability, distributed tracing, and policy enforcement.

LINK: [https://github.com/saberBenhamda0/secure_microservices_architecture](https://github.com/saberBenhamda0/secure_microservices_architecture)

https://redd.it/1t5szk2
@r_devops


Reddit DevOps

Is 50 SSRS on one VM possible or a staged demo??

We tested a new SSRS product this week and I genuinely can’t tell if this is actually viable long term or if we just had a really good lab run.

We had 50+ separate SSRS environments running on one VM at the same time for testing and training purposes.

Everything stayed responsive, startup times were quick, and resource usage was way lower than I expected.

I’ve spent enough time dealing with SSRS environments to know this normally becomes a mess pretty fast, so I’m curious if anyone else has seen setups like this actually hold up outside of demos. Apparently the company works with Novartis and American Family Insurance, which sounds promising.

Would LOVE someone to help me out and figure out this product.

https://redd.it/1t5pc35
@r_devops


Reddit DevOps

Can we ban "I built .... " posts?

I know many folks here build cool stuff, but every time someone starts with "I built..", I think

Yeah.... you and Claude (but mostly Claude).

https://redd.it/1t5elr3
@r_devops


Reddit DevOps

Which DevOps tool do you think is under-documented for learners?

I've been building free audio courses on DevOps topics: things like Docker, Kubernetes, Terraform, GitHub Actions, Helm. Structured episodes you can follow without staring at a screen.

Now planning what to cover next and I'm curious what this community feels is underserved in terms of learning resources. Not necessarily the trendiest topic, but the one where you actually struggled to find something decent when you needed it.

A few I'm weighing up:

- ArgoCD and GitOps workflows
- OpenTelemetry and observability
- Platform engineering basics
- Backstage developer portals
- eBPF for DevOps engineers

What would you actually find useful?

https://redd.it/1t5crpi
@r_devops


Reddit DevOps

Private Registry

Hey guys, I know this might sound silly, but why does my team actually need a private registry? How does it help us? Can't we live without it? From your experience, what pain does it solve?

https://redd.it/1t564c0
@r_devops


Reddit DevOps

Is Single Pane of Glass a myth?

I feel like I've been chasing the 'Single Pane of Glass' (SPoG) for years, but the more I build towards it, the more fractured things feel. We have the 'big players' for metrics and logs, then specialized tools for traces, then a different dashboard for Kubernetes health, and maybe another one just for incident handling.

Instead of a single pane, I just have a dozen different 'panes' open in Chrome, and each one is screaming at me with its own version of the truth. An alert in one tool doesn't always talk to the trace in another. It takes huge mental energy to switch contexts between five different dashboards.

Ingesting everything into one platform is becoming a massive budget line item.

Is the SPoG just a marketing myth sold to management? Or has anyone actually achieved a 'quiet' and unified observability stack that doesn't feel like a part-time job just to maintain?

What’s your current un-shittiest setup?

https://redd.it/1t50a8o
@r_devops


Reddit DevOps

Do you separate CI and CD in your architecture?

Most setups I’ve seen treat CI as the release system too (build, test, deploy, rollback...).

I don't like this approach. GitHub, GitLab, Jenkins — these are great services but they are not really designed to manage releases.

Do you separate CI/CD? If yes, what tools do you use, & why? Or do you keep everything in one pipeline? Curious to hear different thoughts

https://redd.it/1t4vm5f
@r_devops


Reddit DevOps

How do you keep infrastructure understandable as it scales?

As systems grow, I keep running into the same problem: what started as a clean and understandable setup slowly turns into something only a few people fully understand. Even with IaC, diagrams, and documentation, there is always some drift between what is written and what actually exists.
We try to keep Terraform modules structured, naming consistent, and docs updated, but after enough services, environments, and edge cases, it still gets messy. New team members struggle to build a mental model, and even experienced ones sometimes rely on tribal knowledge.
I am curious how others deal with this in practice. Do you rely more on strict conventions, better tooling, or just accept a certain level of complexity? Have you found anything that actually scales long term, not just in theory but in day-to-day work?
Also interested in how you keep diagrams or topology views relevant. Do you automate them from source of truth, or are they more of a rough reference?
Would be great to hear what has worked and what has not, especially in teams that have gone through multiple growth stages.

https://redd.it/1t4qgfu
@r_devops


Reddit DevOps

Career Dilemma? Need Advice

I am an intern at a startup, and I work at an assistant level because, as we know, in a startup everyone pitches in irrespective of position. As a result, I have learned a lot in these 5 months. Next month my internship ends and I will become a permanent employee. I have a 2-year bond, which includes the 6 months of internship, and it is on my mind a lot. If I leave the organization before it ends, I'll have to pay XYZ amount. I am confused: should I leave within the 2 years and pay the XYZ amount, or learn as much as I can and leave after the 2 years?

Because what if, after 2 years, they ask me to sign another bond for appraisal?

https://redd.it/1t4nyoc
@r_devops


Reddit DevOps

How to handle pushback from team for any improvement?

I moved into a team around 6 months ago; they have been in production for more than 8 years. The Terraform code and Python scripts were never updated and look like they were written by someone who learnt Python or Terraform yesterday.

Long lived infra, ec2 instances which are never terminated, only stopped, even though they are doing blue-green deployment. ECS cluster had its plethora of issues. But hey, they have made it work for years.

I have been trying to improve the Terraform code and Infra but I am getting pushback for almost everything and have to explain every change, some examples below:



1. Why do we have to move to capacity providers, why can't we keep ASG associated directly to Cluster?

2. Why do we have to use count/variables to conditionally create terraform modules, why can't we keep using -target option?

3. Why do we have to commit variables determining module creation in Git, why can't we keep using -target option?

4. Why did you write new Terraform modules, why could you not make modifications to existing modules?

5. Why create versions of the Terraform modules?



How do I handle this? I am getting exhausted: for every 10 changes I propose or make, 5 are blocked. I don't want to be on bad terms, but I can't ignore everything that could be improved or corrected.





https://redd.it/1t4lkcn
@r_devops


Reddit DevOps

We built a skills registry + CLI to distribute them across our engineering team
https://newsletter.port.io/p/how-to-build-a-skills-library-for-your-engineering-team

https://redd.it/1t4i6vs
@r_devops


Reddit DevOps

Seeking advice on how to approach a complex multi-service webapp

I'm dealing with a large, complex multi-service web app that has been around for over 25 years. It has a mix of legacy stuff and some newer, more modern approaches. It was originally a PHP-based web app, which has sprawled into a PHP web app with a ton of legacy code plus many microservices in various git repos, mostly Node/Next.js. There is also an ongoing migration from old to new platforms and a lot of path-based routing in the mix, so we have requests going everywhere. There's a big dependency on a PostgreSQL DB but also heavy use of Cloud Firestore/Firebase. It also relies on several other third-party services: analytics, Sentry, Mailchimp and others. There's also a mix of testing and deployment strategies.

Development, Staging and Production are so fractured. Staging/prod was originally a home grown setup on-prem and production was moved to AWS using various services. Staging remains in-house using its own custom setup, there's barely any parity with production.

Development is a major pain because its super complex and developers don't have an easy way to work on the services or connect them together in a way that looks even remotely close to production.

95% of the production AWS infra is in Terraform, but mostly as an infrastructure inventory. It's not set up for multiple environments or with modules, and it is very unlikely that it could bring up the infrastructure from the ground up. It was a good attempt to get pre-existing infra tracked in code.

We are especially wanting to be able to spin up environments for features but that always leads to talk about duplication of environments which feels very complex and heavy.

Essentially I'm dealing with a web app that has grown out of control over many, many years that has some good bits and a lot of ugly stuff, and I'm trying to find a starting point to fix some of this but I'm having a hard time figuring out what I should tackle first and what tooling I should be looking at.

Part of me feels like full adoption of kubernetes would make sense along with getting 100% of infrastructure into terraform with proper module use and environments. A state where an entire copy of the production infrastructure could be deployed to a new environment.

Kubernetes for local dev feels like it could have a lot of pain points, although I'm aware of several tools that are supposed to help with them (Skaffold, Tilt, etc.).

We have one team that has been experimenting with Nix for local development and it's showing some promise; however, it also feels complex and perhaps heavy-handed for our use case.

I was hoping to get some advice from pros who have been through this in the past with similar environments with a ton of tech debt and get some bearings on where to start focusing efforts

https://redd.it/1t4fb4f
@r_devops


Reddit DevOps

pocketos lost their prod db + backups to a cursor agent in 9 seconds. the ai isn’t the main story

been reading the pocketos incident and the takes feel off. everyone is focused on "ai agent deleted a startup", but if you remove the ai part, this is just infra failing hard:

- one api key had delete access to prod + backups
- backups were in the same railway env as prod
- no confirmation step before destructive actions
- ~30 hours of downtime, some data gone for good

this could’ve been a bad script, leaked key, or someone half-asleep running a command. the agent just did it faster.

the only actually new thing here is the transcript after. it literally says it ignored instructions, guessed instead of checking, then explains what it did. that part is wild.

if you’re running agents on prod right now without hard guardrails, this is on you. what are people actually doing here?

read only by default? scoped creds per task? manual approval for deletes? or just hoping nothing goes wrong?

trying to figure out what’s overkill vs what should already be standard after this.
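to make the three options concrete, here is a minimal sketch of how they could combine into one check that runs before an agent's action is executed. everything in it is hypothetical: `DESTRUCTIVE_VERBS`, `authorize`, and the verb names are illustration only, not any platform's api.

```python
# Hypothetical verb list for illustration; a real system would derive this
# from its own API surface.
DESTRUCTIVE_VERBS = frozenset({"delete", "drop", "truncate", "destroy"})


def authorize(verb, granted_scopes=frozenset({"read"}), approved_by=None):
    """Decide whether an agent action may run.

    - read only by default: the default scope set contains only "read"
    - scoped creds per task: the caller passes the scopes granted for this task
    - manual approval for deletes: destructive verbs also need a named approver
    Returns (allowed, reason).
    """
    verb = verb.lower()
    if verb not in granted_scopes:
        return False, f"credential is not scoped for '{verb}'"
    if verb in DESTRUCTIVE_VERBS and not approved_by:
        return False, f"'{verb}' needs manual approval"
    return True, "ok"
```

the point of the ordering is that a destructive call fails twice over: a task that was never granted `delete` is rejected on scope alone, and one that was granted it still stops until a human is named as approver.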

https://redd.it/1t4au5h
@r_devops


Reddit DevOps

how is your team dealing with alert fatigue these days

been thinking about how alerting setups always start clean and turn into noise within a year. every page either gets ignored or acted on by reflex without anyone actually reading the alert. tuning thresholds, dedup, separating slack alerts from real pages, all of it helps for a while but it always drifts back
curious what is actually working long term. is there a real discipline that keeps the signal to noise ratio reasonable, or is constant pruning just part of the job and people accept that
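for reference, the dedup piece mentioned above is the simplest of those mechanisms. a minimal sketch, assuming each alert already carries a stable fingerprint (for example rule name plus target); `AlertDeduper` and `should_notify` are hypothetical names, not any alerting tool's api:

```python
class AlertDeduper:
    """Suppress repeats of the same alert inside a fixed time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._last_sent = {}  # fingerprint -> timestamp of last notification

    def should_notify(self, fingerprint, now):
        """Return True if this alert should page, False if it is a recent repeat."""
        last = self._last_sent.get(fingerprint)
        if last is not None and now - last < self.window_seconds:
            return False
        self._last_sent[fingerprint] = now
        return True
```

this only reduces volume. it does nothing about the drift described above, because the set of fingerprints itself keeps growing unless someone prunes the rules.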

https://redd.it/1t47noq
@r_devops


Reddit DevOps

Follow-up: are teams building session budgets for SQL agents?

Follow-up from my previous post about per-query safeguards for AI agents accessing databases:



https://www.reddit.com/r/devops/comments/1t19czm/are_perquery_safeguards_sufficient_for/



A lot of the replies pointed out the obvious baseline:



- read replicas
- RLS
- curated views
- no direct prod access
- API/gateway in front



I agree with that.



The part I’m still trying to understand is what happens inside the allowed sandbox.



If an agent is allowed to query a read replica or a curated view, do teams also track cumulative access during the session?



Things like:



- total rows returned
- number of queries
- columns/tables touched
- repeated pagination
- time window
- circuit breaker when it crosses a threshold



One comment put it better than I did:



“Is this query allowed?” and “should this principal still be allowed to keep asking?” are different questions.



That distinction stuck with me.
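To make the second question concrete, here is a minimal sketch of what a session budget could look like. It is an illustration under assumptions, not a description of any existing tool: `SessionBudget`, `BudgetExceeded`, and the limits are hypothetical, and it assumes the gateway in front of the database can report the row count and tables touched for each query.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a session crosses one of its cumulative limits."""


@dataclass
class SessionBudget:
    max_rows: int
    max_queries: int
    window_seconds: float
    started_at: float = field(default_factory=time.monotonic)
    rows: int = 0
    queries: int = 0
    tables: set = field(default_factory=set)
    tripped: bool = False

    def record(self, row_count, tables_touched, now=None):
        """Account for one executed query; trip the breaker past any limit."""
        now = time.monotonic() if now is None else now
        if self.tripped:
            raise BudgetExceeded("session budget already exhausted")
        self.queries += 1
        self.rows += row_count
        self.tables.update(tables_touched)
        over_limit = (
            self.rows > self.max_rows
            or self.queries > self.max_queries
            or now - self.started_at > self.window_seconds
        )
        if over_limit:
            self.tripped = True  # circuit breaker: stays open for the session
            raise BudgetExceeded(
                f"rows={self.rows}, queries={self.queries}, tables={len(self.tables)}"
            )
```

Each individual query can pass RLS and the curated view and still push the session over its budget, which is exactly the gap between the two questions quoted above.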



So for teams actually using SQL agents / Postgres MCP / database tools:



Are you building session budgets yourselves, or is this usually solved well enough with RLS/read replicas/API boundaries?

https://redd.it/1t40u5o
@r_devops
