Reddit DevOps. #devops Thanks @reddit2telegram and @r_channels
Why our canary didn't roll back: hardcoded Prometheus endpoint in a separate HelmRelease
So we had a fun Tuesday. “Fun.”
Background: we run a fairly standard k8s setup on EKS, 40 services, nothing crazy.
deploy pipeline goes through ArgoCD, we do canary via flagger with a prometheus-based analysis template. has worked fine for like eight months.
2:17 PM I get paged. Error rate on checkout-service climbing past 2%, Flagger’s canary analysis is failing but it’s not rolling back. Just… sitting there. Suspended state. Customers hitting errors, canary still live at a 20% traffic split.
First thing I do is check the Flagger logs. The analysis is failing because the Prometheus query is returning no data. Not bad data, no data. So Flagger can’t evaluate the metric and it’s going into this weird limbo instead of failing safe. OK, Prometheus issue then. I go check Prometheus: it’s up, scraping looks fine, I run the query manually and it returns results no problem.
15 minutes wasted there.
So I’m thinking maybe it’s a namespace thing, that Flagger can’t reach the Prometheus endpoint; we had an ingress policy change a few weeks back. I start tracing network policy rules and Slack the platform team asking if anything changed on the observability namespace. They say no. I spend another twenty minutes here going through netpol YAML diffs. Nothing.
What actually got me looking in the right place was one of our senior engineers asking “which prometheus?” offhand in the thread. We have two. The one Flagger was configured to query had been migrated off its old service name as part of a prometheus-operator upgrade in staging that got promoted to prod last Thursday. The internal DNS name changed. Flagger’s HelmRelease still had the old address hardcoded.
Our runbook for “flagger canary stuck in suspended” literally says step 3 is “Verify prometheus connectivity using the endpoint in values.yaml”. We did that, and that endpoint was fine. It just wasn’t the one Flagger was actually using, because the Flagger config lives in a separate HelmRelease that nobody remembered has its own Prometheus URL override.
So we manually rolled back the canary, patched the Flagger config, and redeployed. Took about four minutes once we knew what it was.
Now the most frustrating part: I had a vague memory of something similar happening before I joined this team. There’s a Slack thread from like 14 months ago that references “prometheus endpoint mismatch”, but it’s not in any runbook and nobody who was here then is still on the rotation. So we just wasted time rediscovering it.
Updating the runbook NOW to explicitly call out that Flagger has its own Prometheus config independent of everything else, and adding a config validation step to the deploy pipeline that actually curls the configured endpoint from within the Flagger pod. Probably should have done that the first time, whoever that was.
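The validation step looks roughly like this. It's a sketch with assumed names: the namespace, labels, and the location of the `-metrics-server` flag all depend on how your Flagger chart is deployed, so treat every identifier below as an assumption to adjust.

```shell
#!/bin/sh
# Sketch: verify the Prometheus URL Flagger is ACTUALLY configured with, not the
# one in someone else's values.yaml. Namespace/labels are assumptions.
set -eu
NS=flagger-system

# Pull the -metrics-server flag off the running Flagger deployment.
PROM_URL=$(kubectl -n "$NS" get deploy flagger \
  -o jsonpath='{.spec.template.spec.containers[0].args}' \
  | tr ',' '\n' | tr -d '[]"' | sed -n 's/^-metrics-server=//p')
echo "Flagger is configured to query: $PROM_URL"

# Hit it from inside the Flagger pod, so DNS and network policy are exercised
# exactly as Flagger sees them. If the image has no wget/curl, use an ephemeral
# debug container instead.
POD=$(kubectl -n "$NS" get pod -l app.kubernetes.io/name=flagger \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n "$NS" exec "$POD" -- wget -q -O- -T 5 "$PROM_URL/-/healthy" \
  || { echo "ERROR: Flagger cannot reach its Prometheus at $PROM_URL"; exit 1; }
```

The point of running the check from inside the pod, rather than from the CI runner, is that it catches exactly this class of failure: an endpoint that resolves fine from everywhere except where Flagger actually lives.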
https://redd.it/1te9r16
@r_devops
Data classification as a one time project is basically guaranteed to rot
Treating data classification like a cleanup project feels doomed. You label a bunch of stuff, write a taxonomy, maybe hook it into policy and then six months later the world has changed:
new buckets, new tables, new services, new pipelines, new SaaS apps, new AI use cases, new temporary exports that somehow became permanent. From a platform/DevOps perspective, the problem is not just “what is this data?”
It is “where did it move, who can access it, what deploy created it, who owns the service, and what action is safe to take?” Has anyone made classification/remediation part of the workflow instead of a periodic audit exercise?
https://redd.it/1te05ie
@r_devops
Project-Based Mentor / Senior Dev to help me build an MVP (Custom Curriculum + Async Support)
Hello people of Reddit 👋🏾, my name is Taí and I need some help with a few "weird" requests.
I’m looking for a technical mentor to help me build out a specific project from the ground up. In the past, I’ve hired devs to build for me or "Frankensteined" pieces together myself, but this time I want to actually learn the "why" and the "how" behind building a scalable MVP. I’m looking for a collaborative partner, teacher, and mentor, not just a freelancer.
I want to do this for a few reasons. The primary one is that I want to improve myself and actually learn the ins and outs of development, but I also want to take the responsibility of development into my own hands. I want to be the factor that determines how good the project is, how fast the project gets done, and what it is capable of. Lastly, I think this might be a better structure for skilled devs, such as yourselves, who might want to help but just don’t have the time to commit to another full project. 😭
How I envision the workflow:
• Strategy Sessions: We start with a call to brainstorm the architecture and tech stack. This is really just a conversation to figure out how to move forward on the roadmap; as I complete each syllabus and you review each module, we have another call for further iteration.
• Custom Curriculum: Like I said before, after our calls, you would build/iterate a modular "syllabus" or roadmap of tasks for me to execute for the project. After I complete them, you would review them and we would schedule another meeting for further iteration (kind of similar to school).
• Execution: I build out the modules based on your roadmap. This is what I was talking about earlier where I take the process into my own hands; how fast the project moves towards completion will depend on my will, which is a really important thing for me.
• Async Support: I can text/message you when I hit a wall for quick guidance or a "nudge" in the right direction. I really need a responsive and communicative person for this part.
• Code Review & Pivot: Once a module is done, you review it. Once a syllabus is done, we meet back up to review, iterate, refine the code, and adapt the next steps of the curriculum.
Project “curriculum”:
Module > Syllabus > Roadmap
I’m really looking for an… Ironman 😅. What I mean by that is someone who has not only the technical skills and knowledge, but also the people skills and patience to work with me, but most importantly, the belief and willingness to do something like this. I’m a very Type A person and I 100% believe if I can see it in my head, then I can build it in reality. If you’re the kind of person who thinks it can’t be done, rather than finding a way to get it done, we probably won’t work well together… because I believe everything is possible.
Anyway, if you made it this far, thank you for reading this. I would appreciate any resources you may have; if you know where I can find a person to assist me with this or if you are that person, please shoot me a DM or leave a comment.
Thank you in advance! :)
https://redd.it/1tdv7lc
@r_devops
Beginner in DevOps, review my Bitbucket pipeline (AWS ECR -> EC2)
Hi everyone, I’m a beginner DevOps engineer working with Bitbucket Pipelines, AWS ECR, and an EC2 Ubuntu instance.
This pipeline builds my Flask backend Docker image, pushes it to ECR, then SSHes into EC2 to restart the container. It's working, but I know the env management could be better.
Could you please review it and suggest improvements?
```
image: atlassian/default-image:3

pipelines:
  branches:
    main:
      - step:
          name: Build and Push to ECR
          services:
            - docker
          script:
            # Login to ECR
            - aws ecr get-login-password ... | docker login ...awscli
            # Build and push
            - docker build -t "$AWS_ECR_URI:latest" backend
            - docker push "$AWS_ECR_URI:latest"
      - step:
          name: Deploy to EC2
          script:
            # SSH Setup
            - mkdir -p ~/.ssh
            - echo "$EC2_SSH_KEY" | base64 --decode > ~/.ssh/id_rsa
            - chmod 600 ~/.ssh/id_rsa
            # Copy env file
            - scp -o StrictHostKeyChecking=no -i ~/.ssh/id_rsa backend/.env.staging ubuntu@$EC2_INSTANCE_IP:/home/ubuntu/.env
            # Deploy container
            - |
              ssh -o StrictHostKeyChecking=no -i ~/.ssh/id_rsa ubuntu@$EC2_INSTANCE_IP <<EOF
              aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin <AWS_ACCOUNT_ID>.dkr.ecr.$AWS_REGION.amazonaws.com
              docker stop my_app || true
              docker rm my_app || true
              docker pull "$AWS_ECR_URI:latest"
              docker run -d --name my_app \
                --env-file /home/ubuntu/.env \
                -p 5000:5000 \
                --restart unless-stopped \
                "$AWS_ECR_URI:latest"
              sudo systemctl restart nginx
              EOF
```
https://redd.it/1tdork6
@r_devops
How do i start learning?
Hi, I am currently a 3rd year in telecommunications engineering and I'm curious about getting into DevOps. I know some Linux and some networking, but not a whole lot. I know there are a lot of tools used, but what exactly do I start with? If anyone can help me with a roadmap and some direction, and maybe recommend some courses, I would be very grateful.
https://redd.it/1tdcl06
@r_devops
Building a Windows-first orchestration layer for distributed GPU compute using consumer hardware
Over the last few months I’ve been building a backend orchestration system focused on coordinating distributed GPU workloads across multiple runtime/provider environments.
Current systems include:
- workload routing
- telemetry arbitration
- heartbeat/recovery logic
- failover handling
- sandboxed execution
- provider orchestration
- operator HUD tooling
The long-term goal is making fragmented GPU resources easier to coordinate and utilize across future compute markets.
Still early, but the orchestration layer is finally starting to behave like a real distributed system instead of isolated components.
Would genuinely love feedback from people with experience in:
- distributed systems
- orchestration
- homelab clusters
- GPU infrastructure
- runtime/container systems
https://redd.it/1tda66e
@r_devops
What’s the worst retry-related production issue you’ve seen?
I’ve been spending a lot of time learning about retries, background jobs, and failure handling in distributed systems lately.
One thing that surprised me is how many “successful” production incidents are actually partial failures:
* payment succeeded but DB update failed
* email sent twice after retry
* webhook timed out but still processed
* job crashed after side effect completed
The scary part is that retries often make systems look reliable while silently creating duplicate side effects underneath.
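The standard answer to the duplicate-side-effect problem is an idempotency key: the retry re-sends the same key, and the side effect runs at most once. A minimal sketch (the names here are mine, not from any particular library; in production the dict would be a database table with a unique constraint):

```python
import uuid

# Toy "payments" store keyed by idempotency key. In production this would be a
# DB table with a unique constraint, written in the same transaction as the
# business state change.
_processed: dict[str, str] = {}

def charge(idempotency_key: str, amount_cents: int) -> str:
    """Charge at most once per key; retries with the same key replay the result."""
    if idempotency_key in _processed:
        return _processed[idempotency_key]      # duplicate retry: no new side effect
    charge_id = f"ch_{uuid.uuid4().hex[:8]}"    # the side effect happens exactly once
    _processed[idempotency_key] = charge_id
    return charge_id

key = "order-42-attempt"
first = charge(key, 1999)
second = charge(key, 1999)   # simulated client retry after a timeout
assert first == second        # no duplicate charge
```

This turns "did the webhook time out before or after processing?" from an unanswerable question into a safe one: the caller can always retry.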
Curious what retry-related production issue taught you the hardest lesson?
https://redd.it/1td63t1
@r_devops
How to handle multiple job processes as a DevOps Engineer
I work in IT as a DevOps Engineer. I'm currently unemployed and a little bit desperate to get a job, but there's no real rush or pressure, as I have unemployment benefits and some savings in my bank account.
I'm currently going through different interview processes. In this field, interviews can take months, with at least 3 rounds, and that makes it very difficult to have multiple offers at the same time so I can make a good decision with all of the options available.
Last time I accepted an offer I had to withdraw from the ongoing interview processes, because I was tired of the hustle of juggling them while trying to keep everything quiet at my then-current job.
If anyone has ever been in this type of situation, could you please give me some advice?
1. If I manage to land an offer, is it OK to make the company wait 2-3 weeks while I finish all of my other interview processes?
2. Can I decline an offer and come back later if the other interviews don't go as expected?
3. What strategies can I use to line up multiple offers at the same time so I can make the best decision?
4. Do you have strategies to buy more time?
https://redd.it/1td2jva
@r_devops
Hosted git options these days?
I see a lot of hate on GitHub, I see GitLab recently announced a lot of layoffs and it seems they've joined the 'people you love to hate' club in terms of public opinion.
That leaves who for hosting private repos? Bitbucket?
Who does everyone actively recommend someone use for their private git repos if self-hosting is not an option?
Our company was thinking about migrating off of Bitbucket and moving to GitHub, but recently opinions have splintered on where to go.
https://redd.it/1tcwqgb
@r_devops
Storage types that trip up engineers...explained simply
After working with a lot of AWS environments I still see engineers mixing these up regularly.
EBS vs EFS vs S3: they're not interchangeable.
EBS is a virtual hard drive attached to one EC2 instance at a time. Fast, low latency, tied to a single AZ; a root volume typically dies with its instance, though other volumes can be detached and re-attached.
EFS is a shared network drive. Multiple instances can read and write simultaneously. Great for shared filesystems across containers or services.
S3 is object storage. Not a filesystem at all. Store objects, retrieve them by key or URL. Effectively infinitely scalable, but not meant to be mounted like a disk or hit with low-latency random reads.
The mistake I see most: teams use EBS when they need shared access across multiple instances and wonder why it doesn't work. Or they treat S3 like a filesystem and hit latency issues.
https://redd.it/1tcw3as
@r_devops
What's your CI/CD flow?
I was talking to a colleague yesterday and realized people have pretty different CI flows. Basically, he merges all his PRs into a release branch and then into main, so that he gets very clear release notes from every release branch. He was also building a fresh artifact for each deploy, so one build for dev, then staging, then prod; obviously that part is problematic, since the artifact you tested isn't the artifact you ship.
How many of you do this?
Here's my flow:
I basically do trunk-based without release branches, and every merge is a new version release that builds both prod and staging artifacts in the same job, deploys only staging, and when we're happy with staging we manually deploy prod.
I've had some deployments in the past that were fully automated with Argo Rollouts, but that needs very good testing and observability.
I've also seen some people create a release candidate branch when they want to release to prod, with all relevant merges, so that they keep track of what's released.
Interested to know what people here do?
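For what it's worth, the build-once/promote-many idea behind my flow can be sketched generically like this (pseudoconfig for illustration, not any specific vendor's syntax):

```yaml
# Pseudoconfig sketch: one immutable artifact per merge, promoted across envs.
stages:
  - build:              # runs once per merge to main
      script: docker build -t app:$GIT_SHA . && docker push app:$GIT_SHA
  - deploy-staging:     # automatic, deploys the artifact just built
      script: deploy app:$GIT_SHA staging
  - deploy-prod:
      when: manual      # human gate once staging looks good
      script: deploy app:$GIT_SHA prod   # same image digest, no rebuild
```

The point is that prod receives the exact bytes staging validated, which is what the rebuild-per-environment approach loses.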
https://redd.it/1tcprf8
@r_devops
Building an AWS Cost Management Dashboard as a class project - roasting and suggestions welcome
Hey
I'm a Cloud Engineering grad student building a full-stack AWS cost visibility and resource management dashboard. Think a lightweight self-hosted alternative to AWS Cost Explorer
**Tech stack:** React frontend, Flask backend, SQLite for caching, boto3 for AWS, Gemini API for AI features
**Services I'm covering:** EC2, S3, RDS, Lambda, EKS, with CloudWatch CPU utilization and Cost Explorer data tied together
**Features I'm building:**
* Fleet overview with CPU utilization vs cost per instance
* Idle/zombie resource detector that shows you exactly how much you're wasting in dollars
* Bill shock predictor projects end of month spend based on current trajectory
* Cost anomaly detection with AI explanation of what caused the spike
* Natural language querying, type "which instances are costing most and barely used" and get a plain English answer
* One-click bulk stop for idle resources with savings preview
* Auto-generated monthly cost report in plain English
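For the bill-shock predictor above, the simplest trajectory projection is linear: month-to-date spend divided by days elapsed, times days in the month. A rough sketch of that math (function name is mine; a real predictor would weight recent days or fit a trend):

```python
import calendar
from datetime import date

def project_month_end_spend(mtd_spend: float, today: date) -> float:
    """Linearly project end-of-month spend from month-to-date spend."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_rate = mtd_spend / today.day      # average spend per elapsed day
    return daily_rate * days_in_month

# $150 spent by the 10th of a 30-day month projects to $450.
print(project_month_end_spend(150.0, date(2026, 6, 10)))  # -> 450.0
```

The linear version is also a good baseline to compare the AI anomaly explanations against: if actuals diverge sharply from it, that's the spike worth explaining.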
**My questions for the community:**
1. What's the most annoying thing about the native AWS Cost Explorer you wish was better?
2. Is there a feature you constantly wish existed when managing AWS costs?
3. Anything here that seems pointless or that you'd never use?
4. What would make you actually use this over the AWS console?
https://redd.it/1tcfqdy
@r_devops
I’ve reached peak DevOps: I spent 6 hours automating a 30-second deployment task because "manual work is a technical debt." 🤡
The logic was sound: why do it manually when I can spend a whole afternoon fighting with a dependency graph and a custom script? Now, the task takes 2 seconds to run, but it requires 3 different monitoring tools just to make sure the "automation" doesn't have a mental breakdown.
Is it still "efficiency" if the maintenance of the automation takes more time than the original task ever did? Or are we all just collectively addicted to building complex systems for simple problems in 2026?
https://redd.it/1tc9ew3
@r_devops
Need help understanding GoReleaser + GitHub Actions + Homebrew/Scoop automation for my CLI tool
I built a Go CLI tool https://github.com/aminshahid573/gen
Now I am trying to set up proper distribution (Homebrew, Scoop, releases), but I’m honestly confused about how the whole pipeline works.
Right now I've used AI to generate my setup, but I don't fully understand it. I feel like I'm just copy-pasting things without knowing what I'm doing, and I'm not even sure whether it's working or not.
What I want: every time I create a new version tag (v1.0.0, v1.0.1, etc.), GitHub Actions should automatically:
1. run tests
2. build binaries for Linux/macOS/Windows
3. create a GitHub release
4. update Homebrew tap
5. optionally update Scoop manifest
---
my current setup is .github/workflows/release.yml
https://github.com/aminshahid573/gen/blob/main/.github%2Fworkflows%2Frelease.yml
---.goreleaser.yml
https://github.com/aminshahid573/gen/blob/main/.goreleaser.yml
---
So far I've created the GitHub repo for the CLI, created the Homebrew tap repo (homebrew-gen), and written the formula manually.
What I'm confused about: do I actually need HOMEBREW_GITHUB_TOKEN, or is GITHUB_TOKEN enough?
How exactly does GoReleaser update the Homebrew tap repo?
Do I need to manually create formulas, or does GoReleaser handle everything?
What is the correct minimal setup for GitHub Actions, GoReleaser, the Homebrew tap, and the Scoop bucket?
If someone can explain the correct, clean workflow from scratch with a minimal working config, and what each part actually does,
that would really help me understand instead of just copy-pasting AI configs.
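On the token question: in GitHub Actions, the default GITHUB_TOKEN is scoped to the repository running the workflow, so GoReleaser generally needs a separate personal access token to push the generated formula into a different repo (your homebrew-gen tap). The relevant piece of .goreleaser.yml looks roughly like this (a sketch, not your exact config; field names vary by GoReleaser version, e.g. `repository:` was called `tap:` in older releases, so verify against the GoReleaser docs):

```yaml
# Sketch: with a brews section, GoReleaser renders the formula from the built
# release artifacts and commits it to the tap repo on every tag, so you do not
# maintain the formula by hand after this.
brews:
  - name: gen
    repository:                 # older GoReleaser versions call this "tap"
      owner: aminshahid573
      name: homebrew-gen
      # A PAT with write access to homebrew-gen; the workflow's default
      # GITHUB_TOKEN generally cannot push to a second repository.
      token: "{{ .Env.HOMEBREW_GITHUB_TOKEN }}"
    homepage: "https://github.com/aminshahid573/gen"
```

The Scoop side works the same way: a scoop/scoops section pointing at your bucket repo, using the same token.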
https://redd.it/1tc506c
@r_devops
Open-source byte-exact dedup tool — EOSE Labs picked it up as a Terraform state hygiene layer for their sovereign language fleet
Disclosure: I'm the author. Posting because the use case isn't one I would've pitched to r/devops on my own, but a production team picked it up for exactly that and figured this sub might find it useful.
**The case study**
EOSE Labs ([pemos.ca/pemgraphs](https://pemos.ca/pemgraphs)) built a sovereign infrastructure architecture they call PEMGRAPHS — eight language nodes (Python, Rust, Go, TypeScript, Bash, Perl, Lean 4 for formal verification, HCL/Terraform), AKS + Azure Key Vault, multi-tier ADA vault secret rotation, 3,051-theorem Lean 4 proof engine. They're using my dedup tool (Merlin) at the Terraform state node and the Python v4 lineage tier as a hygiene primitive.
Their public shoutout: *"Corbenic didn't know they were unlocking sovereign fleet context optimization. They were just building really good dedup."* [pemos.ca/community-shoutout](https://pemos.ca/community-shoutout)
**What the tool is**
* Byte-exact line-level dedup using xxHash3-64. Deterministic, lossless.
* Single-threaded, ~250 KB Windows x64 binary, statically linked (no msys2 / runtime deps).
* 1.10 µs median latency per chunk (their measurement, not mine).
* MIT-licensed integrations: CLI, MCP server, HTTP proxy.
* Hard-enforced caps in community tier: 50 MB/run · 200 MB/day · 2 GB/month. Refuses oversized work cleanly.
**Why DevOps might care (beyond AI workloads)**
* Terraform state files balloon fast in large infra. EOSE uses it as a hygiene layer at the HCL/Terraform node.
* CI/CD log pipelines accumulate duplicate noise. `./merlin-lite input.log --output-dedup=clean.log` is a standalone primitive.
* Anything where you process append-heavy streams and want a fast pre-archival pass.
**Install**
```
curl -LO https://github.com/corbenicai/merlin-community/releases/latest/download/merlin-community.zip
unzip merlin-community.zip && cd merlin-community
python shared/install_helpers.py <integration> enable
```
Or just use the binary standalone: positional input file + `--output-dedup=PATH`.
**Honest caveats**
* Line-level exact-match. Not semantic, not fuzzy. If you need near-duplicate detection on rephrased content, this isn't it.
* Open-core: there's a closed-source Pro engine for high-throughput servers. What's in the public repo is what runs in the community edition: same MCP interface, same caps.
* Windows x64 binary in v0.2.1 release. Linux + macOS coming once cross-platform CI is up.
Repo: [**github.com/corbenicai/merlin-community**](http://github.com/corbenicai/merlin-community)
Zero telemetry. Curious if anyone here has Terraform state hygiene pain that this might fit, or if EOSE's pattern (using a dedup primitive as a building block in a bigger system) maps to something you're already doing.
https://redd.it/1tc1rbg
@r_devops
Upstream covariance reshaping produces consistent BPP reduction across four independent codec architectures — reproducible results on Kodak PCD0992
https://redd.it/1te2io4
@r_devops
Anyone use PagerDuty?
I'm looking to give away 5-10 free lifetime accounts for my app, which is essentially a PagerDuty integration that automates calculating your on-call pay each month (or however frequently you do it).
The idea is that engineers and analysts work hard enough and strain their brains enough without having to whip up a spreadsheet manually correlating multiple calendars, alert times, alert counts, schedules, etc. at the last minute each month to submit for their own on-call pay. I've also found from experience that this manual method is prone to human error more often than you'd realise.
All I want in exchange is feedback.
If you'd be interested please drop a comment and let me know your Role if you don't mind.
If you want to check it out it's at calloutpay.com
Thanks 👍
https://redd.it/1tdv99d
@r_devops
Dependency Track, notifications not triggering.
Hello everyone!
I am working with Dependency Track (version 4.14.2), which I've only just started to use. I now have a couple of projects and some policies in place. The policies correctly scan the projects and label the policy violations with the severity I defined.
I want to enable email notifications when a new policy violation is found, but they don't seem to trigger, though the test email is correctly received.
I have tried forcing a re-scan, deleting the project and starting over (so all policy violations are "new"), scoping the notification to just certain tags, and changing the project version, and I am running out of ideas.
If anyone has any tips on where to look, I would really appreciate it.
Thanks!
https://redd.it/1tdtdm7
@r_devops
How do you actually catch security issues in Terraform PRs when you're doing solo reviews?
The pattern I keep seeing: security groups too open, S3 buckets publicly accessible, encryption disabled on databases, IAM policies wider than they need to be. I catch some of it in manual review, but I know I'm missing things.
Question for the room: what's actually working for you?
* Are you using any automated tooling? (Checkov, tfsec, something else?)
* Has anyone tried running infrastructure changes through ChatGPT or Claude to catch gaps before merge?
* If you haven't automated this, what's the blocker: company policy, trust in the output, or just not having found the right tool?
Curious what's actually practical at the startup/small-team scale where you can't afford enterprise solutions.
https://redd.it/1tdh4vj
@r_devops
Initial full backup concerns with Azure DevOps 2020 on-prem. Need advice
Hi everyone,
I have recently taken over the administration of an Azure DevOps Server 2020 (on-premises) environment. The previous administrator is no longer with the company, and unfortunately, there is no existing documentation regarding the backup strategy. It appears that no automated backups have been configured via the Administration Console so far.
Environment Details:
Version: Azure DevOps Server 2020.
Scope: Single server instance containing one Collection with two active projects.
Content: Includes source code (TFVC/Git) and several CI/CD YAML/Classic pipelines.
Status: The environment is live and business-critical.
My Goal:
I want to use the built-in Scheduled Backups tool within the Azure DevOps Administration Console to create a backup plan, including an initial full backup and subsequent scheduled increments.
My Concerns:
Since I am new to this specific instance, I want to ensure that enabling the backup plan won't inadvertently disrupt the production services or lock any databases in a way that affects the pipelines or developer access.
Specific Questions:
Impact on Live Environment: Does the initial full backup via the Admin Console trigger any significant downtime or "Read-Only" states for the collections?
Permissions: Besides the service account having sysadmin rights on SQL Server, are there any easily overlooked folder permissions required for the backup network share?
TFS Integration: As there is still legacy source code on the instance, are there specific metadata files outside of the SQL databases that I need to manually include, or does the wizard cover all necessary components (databases + reporting + encryption keys)?
Common Pitfalls: Are there any known issues when running the backup wizard for the first time on a "neglected" 2020 instance?
I want to avoid breaking anything while securing the data. Any advice or checklists from experienced Azure DevOps admins would be greatly appreciated.
Thanks in advance!
https://redd.it/1td6vzo
@r_devops
How are you securing AI-generated / “vibe-coded” internal apps built by non-dev teams?
I work as a DevOps engineer at an AI startup, and we are running into a new problem.
With tools like Cursor and Claude Code, more people across the company are building small internal apps on their own — not just developers, but also folks from marketing, product, and sales. These apps often get deployed quickly on platforms like Vercel, Cloudflare Pages, or Netlify.
The concern is that this can become a security and governance mess very fast.
Right now, I am trying to figure out a practical way to make sure:
- Every internal app is behind authentication from day one
- Apps are hosted under the company’s domain only, not random public preview URLs
- We can discover if someone has deployed an internal app outside approved company accounts
- Sensitive internal data is not exposed through a personally created Vercel/Cloudflare/Netlify project
- Security controls do not kill the speed and productivity that made these tools useful in the first place
For “normal” dev-built apps, we usually put them behind SSO, auth gateways, or internal access controls. But that is harder when apps are being created outside the engineering team by non-dev teams.
I would like to know what has actually worked in practice, especially in environments where people are moving fast and experimenting with AI-assisted development.
https://redd.it/1td7mxp
@r_devops
How to provide Database Schemas either empty or with masked/obfuscated data to non-production environments?
Hi Everyone, I'm working on an initiative to revamp our cloud infrastructure and address some of the challenges our dev teams are facing.
One of the core issues I want to address is the lack of non-production database availability for testing/development in lower environment tiers.
We do have Redgate available as a tool and I'm wondering how best to provide the non-prod database instances both to lower environment tiers within our primary ADDS domain as well as to off-domain sandbox instances that developers have open access to play around in. This would be both for PaaS and IaaS database (primarily SQL) types.
Ideally I would like the developers to be able to easily redeploy the non-production db schema without DBA intervention once the solution has been setup.
Has anyone had experience setting this sort of thing up?
My background is primarily infrastructure and while I have experience with a breadth of cloud services I've never dealt with this type of issue specifically before.
Thanks in advance for any input or suggestions!
https://redd.it/1tcw5b6
@r_devops
NGINX CVE-2026-42945 (ngx_http_rewrite_module) — patched boundary is 1.30.1 / 1.31.0
Disclosure: I work on Forkline, which maintains a fork of the retired Kubernetes ingress-nginx controller.
NGINX published a security advisory for ngx_http_rewrite_module. The affected versions are NGINX Open Source below 1.30.1 and 1.31.0.
Advisory: https://nginx.org/en/security_advisories.html
CVE-2026-42945 (NVD): https://nvd.nist.gov/vuln/detail/CVE-2026-42945
NGINX labels it medium, but NVD lists CVSS v4.0 9.2 / v3.1 8.1.
Trigger condition: a `rewrite` directive that uses unnamed PCRE captures (`$1`, `$2`) with a `?` in the replacement string, and is followed by another `rewrite`, `if`, or `set` in the same scope. DepthFirst has a solid technical breakdown: https://depthfirst.com/nginx-rift
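Based on that trigger description, the vulnerable shape to grep your configs for looks roughly like this (a hypothetical config for illustration only; verify against the advisory and the DepthFirst writeup before relying on it):

```nginx
# Vulnerable shape (sketch): unnamed capture ($1) plus a '?' in the replacement,
# followed by another rewrite/if/set directive in the same scope.
location /old {
    rewrite ^/old/(.*)$ /new/$1?from=legacy last;
    set $redirected 1;
}
```

If nothing in your configs combines all three elements (unnamed capture, `?` in the replacement, and a following `rewrite`/`if`/`set` in the same scope), this specific CVE should not be triggerable per the advisory's description.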
For plain NGINX: upgrade to 1.30.1+ or 1.31.0+. If your config does not use the rewrite pattern above, you are not directly affected by this specific CVE — but check the full advisory batch.
For Kubernetes ingress-nginx: upstream kubernetes/ingress-nginx is archived. The last controller line embeds NGINX 1.27.1. The host NGINX version does not matter here — what matters is what is compiled into the controller image.
```
kubectl exec -n ingress-nginx <controller-pod> -- /nginx-ingress-controller --version
```
Options for ingress-nginx operators:
- Migrate to a Gateway API implementation (long-term path)
- Run a maintained fork that tracks current NGINX (we publish one at https://github.com/forkline/ingress-nginx)
- Accept the risk if your Ingress rules do not hit the rewrite trigger
https://redd.it/1td0es6
@r_devops
Should I go for these DevOps courses to start with, or YouTube only?
DevOps courses to start in 2026:
KodeKloud - best for hands-on labs + Kubernetes
Udemy DevOps Courses - cheap + beginner-friendly
Coursera DevOps Courses - structured + certifications
Intellipaat DevOps Course - live classes + projects + placement support
TechWorld with Nana YouTube - free practical DevOps learning
https://redd.it/1tcwsw1
@r_devops
DevOps/SRE/IT + more | Discord servers out there?
Does anybody know of any good Discord servers out there for actual IT work?
Not looking for a beginner coding server or a vendor/community server centered around one product. Nor a basic low-level tech troubleshooting discord.
More like a place where you’ve got a mix of:
* sysadmins
* devops/sre
* cloud infra
* terraform/sam/iac nerds
* network guys
* dbas
* full stack/front end/backend/devs
* security people
* hell, even PMs or POs
Basically the kind of people you’d normally only run into at work in the IT department.
I’ve looked around but most servers I find are either dead, super junior or basic tech troubleshooting focused, or just product support channels.
Curious if there’s a well-known one people actually use that is more high level for general tech discussions.
https://redd.it/1tcolzu
@r_devops
Entrepreneurial DevOps Developers
I would like to connect with some experienced folks and brainstorm some ideas on multi-cloud environments for enterprises, involving some security too. Please comment and DM freely.
https://redd.it/1tcoquy
@r_devops
Why do we still treat EBS storage like a one-way street?
Curious how other teams handle this.
When compute gets oversized, we rightsize it.
When an EBS volume gets close to full, we expand it.
But once storage is provisioned, it feels like nobody wants to touch it again.
At my last company we had volumes that were massively overprovisioned because everyone was more afraid of a storage-related outage than wasting money. Once systems were live, storage basically became "set and forget."
Shrinking volumes always sounded like one of those things that was technically possible but operationally too risky to justify.
So the result was:
* storage only kept growing
* unused space piled up
* nobody wanted to be the person touching production disks
Feels weird that in 2026 we still handle storage so differently from the rest of infrastructure.
Are most teams just accepting the waste as the safer option, or are people actually solving this somehow now?
https://redd.it/1tcb0i3
@r_devops
For application configurations in OCI
Do you use Cloud Shell a lot for application configuration in OCI? I'm new here and I'd like to know if people actually use it.
https://redd.it/1tc843e
@r_devops
MCP servers just showed up in our infrastructure and I genuinely have no idea how to secure them, anyone been through this?
Not panicking, but definitely out of my depth, and I'd rather admit that now than figure it out after something breaks.
I've been doing DevOps for about three years at a mid-sized SaaS company. Pipelines, containers, infra automation, the usual. Last month our engineering team started integrating MCP servers to power some internal AI agent tooling, and it landed in my lap to manage the deployment and infra side of it.
The problem is that everything I know about securing infrastructure doesn't map cleanly onto this. I can lock down a container. I can harden a CI/CD pipeline. But MCP is a different thing entirely. The servers expose tools that AI agents can call autonomously, and some of those tools have filesystem access, shell execution, and database connectors. The blast radius of a misconfigured permission scope here feels genuinely significant, and I don't have a framework for thinking about it systematically yet.
What's been keeping me up is the agentic side of it. These aren't just APIs sitting behind auth. The agents decide what tools to call and chain them together without a human approving each step. Our current pipeline validation has already started flagging permission scope warnings on three of the deployed MCP tools, and I blocked the deployment because I didn't know what the acceptable threshold even was. I've been piecing things together from blog posts and the handful of MCP security write-ups that exist, but nothing gives me a repeatable methodology I can actually build a process around.
https://preview.redd.it/ay2kuyzyrw0h1.png?width=648&format=png&auto=webp&s=4b178a766229d4914c4927874e1c81e9757aa850
This is basically what my week has looked like. The pass rate dropped from 96% to 81% since we started integrating MCP servers, and almost all of the failures are permission or schema validation errors I don't fully understand yet.
Has anyone here gone through this? Specifically curious whether there's any structured training that actually covers MCP security mechanics rather than AI security broadly, and how you're handling scope definition in your engagement agreements when the blast radius of these servers isn't obvious even to the people who built them.
https://redd.it/1tc01ui
@r_devops
Deployment advice for early stage startup!
Hello everyone,
We are running a small startup, and the problem I am facing right now is a single point of failure. Since we don't have much budget, we've hosted on a cheap VPS for now.
We have multiple services (Python, Node, DB, Redis, etc.) and everything is dockerized inside a Compose file. We run staging and production environments behind an nginx reverse proxy, with both environments hosted on a single VPS. We don't have any monitoring or observability tooling right now. The way we deploy is: build the Docker image via GitHub Actions, push it to the VPS, and run it.
So for our setup, how can we improve our deployment, and what are the best strategies we can adopt?
Thank you.
https://redd.it/1tc0h4j
@r_devops