Reddit DevOps. #devops Thanks @reddit2telegram and @r_channels
What’s one DevOps tool you tried but just didn’t click with?
I really wanted to love Terraform when I first picked it up. Everyone was hyping it up, and it is powerful—but I kept getting tripped up by state files and weird syntaxes. I probably broke my infra more times than I’d like to admit before things started making sense.
It made me wonder—do some tools just not fit the way certain people think?
Then I also worked with Pulumi, and its use of Python helped me learn a lot about IaC.
What’s a tool you tried (Ansible, Helm, whatever) that you wanted to love but just couldn’t vibe with?
Was it the learning curve, docs, or something else?
https://redd.it/1kh3iwb
@r_devops
How can I let devs update their lower environment terraform while protecting production environments?
I know the title is a rather open ended question, but let me lay out where I am now, in the hopes of getting ideas on how to do this better.
For a given service, we'll have one directory per environment. We have a directory called `production` that holds the production configuration, a directory called `dev` for the dev environment, and a folder called `banana` for the banana environment. You get the picture. The terraform is stored in GitHub in the same repo as the service's code. I have GitHub Actions set up so that whenever a Pull Request is made that touches the terraform code, it does a terraform plan and puts the plan output into the pull request as a comment. We require approvals for PRs, so someone else will have to approve the PR. Once it's merged, GitHub Actions will do a terraform apply, potentially using approvals in GitHub Environments depending on the environment (I've generally set these up on production environments but not lower environments, with people able to approve their own deployments).
The sticking point right now is that if a developer wants to update a lower environment (usually this is things like adding a new environment variable to a service, not totally restructuring the service), they have to go through the PR approval process, even though it's generally just serving as a rubber stamp rather than a true review at this point.
I'm trying to figure out some way to utilize GitHub's branch protection rules and/or rulesets to allow commits directly to main for those lower environment directories, but still require review when making changes to the production environment.
I've been thinking about this for a while, and been playing around with it a bit this morning. The best I've come up with is
1. Moving the terraform code out of the service repo into a dedicated repo (aka out of `corp/service-name` into `corp/terraform-service-name`)
2. Creating a `CODEOWNERS` file that requires reviewers for the `production` directory
3. Setting up a branch ruleset (not a branch protection rule) that requires PRs, requires 0 reviews, but requires approvals from Code owners.
This appears to work in my very quick exploration, but my ~~spidey~~ devops sense is tingling, telling me that this isn't the right way.
So, with as little re-engineering of our entire process as possible, how else can I solve this?
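The CODEOWNERS-plus-ruleset idea above boils down to one rule: a change needs a reviewer only if it touches the `production` directory. Here is a minimal Python sketch of that rule, which could run as an extra CI check next to the ruleset. The function name `needs_review`, the `PROTECTED_DIRS` tuple, and the assumption that environment directories sit at the top level of the repo are all hypothetical, not part of GitHub's own tooling.

```python
from pathlib import PurePosixPath

# Hypothetical: top-level environment directories that always need a reviewer.
PROTECTED_DIRS = ("production",)


def needs_review(changed_paths, protected_dirs=PROTECTED_DIRS):
    """Return True if any changed file lives under a protected directory."""
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        # Only the first path component (the environment directory) is checked.
        if parts and parts[0] in protected_dirs:
            return True
    return False


if __name__ == "__main__":
    print(needs_review(["dev/main.tf", "banana/variables.tf"]))
    print(needs_review(["dev/main.tf", "production/main.tf"]))
```

Checking only the first path component keeps the rule easy to audit, but it would have to change if the environment directories were nested deeper in the repo.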
https://redd.it/1kh15h9
@r_devops
Americans working in majority Indian workplaces. What do you need to know to succeed?
I’ve been working at my company for a year or so and it’s been great. I’ve learned a lot of new tech as well as practice old tech (Django). My team is also quite strong and I can’t really complain.
I’ve been getting more responsibilities, such as integrating with other teams cross functionally. I’m starting to come up against my own professional expertise.
On top of the standard cross functionality challenges, I’m finding I didn’t know many cultural facts about communication.
If you’re in a similar boat, what are some tips/tricks you know for people in this situation, where I find my cultural knowledge is limiting my professional abilities?
https://redd.it/1kgxisp
@r_devops
SOC maturity tool for small teams — assess detection, IR, and automation readiness
We struggled to get a clear read on how mature our SOC really was — especially with a lean team and cloud-first stack.
So we put together a free tool to assess:
* Logging & telemetry coverage
* Alert fidelity & escalation paths
* Response playbooks
* Security automation maturity
* Lessons learned and feedback loops
It’s not a compliance tool — just a fast way to self-assess and align the team before audits or roadmap planning.
🔗 [https://soc.tools.ssojet.com/](https://soc.tools.ssojet.com/)
No login required.
Curious what others in DevOps/SecOps are using to track security ops maturity — especially in hybrid or cloud-native environments?
https://redd.it/1kgvbng
@r_devops
microservices ci/cd and git branching
We are working on a microservice application and we are supposed to have 3 environments: development, staging and production.
As a DevSecOps intern engineer, I'm thinking that the devs should work on feature/* branches and open merge requests to the development branch only, and then we will merge to staging and then to main (for prod).
And we will have a manifests repo from which we will deploy to the appropriate environment.
My question is: is that strategy possible and doable? And how will the .gitlab-ci.yml differ in the backend microservices that the devs work on in different branches? I mean, in the end we will get the Docker image pushed to our Harbor registry... Will we have an image pushed on development, staging, and main? And what about feature branches and merge request pipelines?
And how about the manifests repo? Should it also have 3 branches or what?
https://redd.it/1kgsl3u
@r_devops
Site Reliability Engineering Internship at S&P Global
Hey guys, I have an interview for a Site Reliability Engineering internship at S&P Global. What should I expect? Has anyone ever interviewed for this role? Also, what kind of questions did you get? Again, I’m big on the questions to expect. Also, do they retain you after internships? I am done with school this summer so I’m looking for something that can transition to a full-time role.
https://redd.it/1kgql86
@r_devops
What to do about poor performing team member that isn't contributing?
I've got a very full roadmap and a team member that is openly working on a "skunk works" project that provides limited value and is deprecated by the next version from one of our vendors. However, this person is really playing the political game: claiming that tickets that take a few weeks max are taking 6 months plus, talking a lot in meetings, throwing people under the bus, etc. How would you approach this situation?
https://redd.it/1kgpro4
@r_devops
docker_pull.py: Script to pull lots of container images in parallel
[https://github.com/joshzcold/docker_pull](https://github.com/joshzcold/docker_pull)
Not sure who needs this, but I wrote it as part of my work, and this task seems to be lacking from the docker CLI or equivalent.
Pulls lots of images in parallel using Python multiprocessing and the Docker Engine API.
Requirement is that you supply the full image like `docker.io/nginx:latest` instead of `nginx:latest`
At work we use this to consistently update a series of images from our private registry.
Supports auth through plaintext in ~/.docker/config.json or through the `secretservice` credential helper from [https://github.com/docker/docker-credential-helpers](https://github.com/docker/docker-credential-helpers)
[https://github.com/user-attachments/assets/98832e30-0a05-4789-b055-a825cbba1ba5](https://github.com/user-attachments/assets/98832e30-0a05-4789-b055-a825cbba1ba5)
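For readers who only want the general shape of a parallel pull, here is a minimal sketch. It is not the script's implementation: it swaps in a thread pool and the `docker pull` CLI in place of multiprocessing and the Docker Engine API, and the names `pull_with_cli` and `pull_all` are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def pull_with_cli(image):
    """Pull one image with the docker CLI; return True on success."""
    try:
        result = subprocess.run(["docker", "pull", image], capture_output=True, text=True)
    except FileNotFoundError:
        return False  # docker CLI is not installed on this machine
    return result.returncode == 0


def pull_all(images, pull=pull_with_cli, max_workers=4):
    """Pull every image concurrently and map each image name to a success flag."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs each image
        # with its own result.
        results = list(pool.map(pull, images))
    return dict(zip(images, results))
```

Calling `pull_all([...])` with a list of full image references returns a dict such as `{image: True_or_False}`. The `pull` parameter is only there so the pull step can be replaced, for example by a fake in tests.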
https://redd.it/1kghgi7
@r_devops
Backstage feels like a fool's errand
The employee I replaced was promoting Backstage and now it's all my company wants to talk about.
Recently I looked up the custom runner he had to develop in React to get templates to run bash scripts, and now script updates require a full upgrade of Backstage.
I've also decided that I'd like to add some bash one-liners to my templates, but of course there's no runner for that, so I can either develop my own or find a 3rd-party one (not approved by the security team, so it won't ever see the light of day).
Context aside, why are so many people advocating for making a react app handle all of my infra provisioning?
https://redd.it/1kgfqys
@r_devops
Services which don't quite mesh with devops
Hey folks,
Do you have stories about teams or products which don't quite fit into devops? - for any reason.
How did you or your org approach these?
-
At my current org (midsized insurance enterprise) there are many teams with valid "buts" for why devops as a culture and bag of methods/technologies is not, or at least not fully, applicable. While I will always argue that devops can be at least partially useful for them, or that it is only about changing the team's processes or boundaries, there are some external factors which can dampen acceptance.
for example:
- product releases/deployment are tied to a quarterly rhythm because of accounting rules / deployment frequency is flat. It could be increased with feature flags and decoupling of release and deployment, but the mindset of "why bother, we only need to deploy it every quarter" is strong
- on-premise infrastructure services / these are in various states, in between "send me a Jira ticket for your postgres" and "here is the self service/endpoint". In some of these, the day to day includes very little development. Base on-prem infra teams are currently not in the nearest thing we have to a "platform team/product"
My first impulse tells me these, or others similar to these, are just valid and have to be looked at on a case-by-case basis, or need an org restructure to see if and what of devops fits.
Would love to hear your thoughts on this.
Cheers
https://redd.it/1kgd7iq
@r_devops
Junior sysadmin looking for project ideas to modernize a simple infra
Junior sysadmin looking for project ideas to modernize a simple on-prem infra
Hey everyone,
I’m a junior sysadmin working with a fairly basic on-prem infrastructure with about 45 users, and I’m looking for ideas to improve, automate, and modernize it, ideally to make it more secure, more efficient, and a bit more DevOps-friendly.
The current setup is kind of “freestyle”: backups aren’t really solid yet, and a lot of things could be more structured
Here’s the current setup:
• 5 Ubuntu servers on-prem, used by data scientists to run AI/GPU workloads and experiments.
• Users currently have sudo access, which isn’t very secure - I’m looking for ways to improve that.
• 1 Proxmox server, where I run personal/admin VMs for Docker apps (Grafana, Prometheus, etc.).
• I occasionally spin up temporary VMs for test environments (no GPU) and give users access.
• Using Snipe-IT for asset management and Intune for endpoints.
Some project ideas I’m considering:
• Securing user access more effectively (e.g. removing full sudo, implementing access control or centralized auth).
• Setting up a Proxmox cluster for better flexibility and redundancy — not sure how well that works with GPU passthrough yet.
• Building a web portal where users can request or deploy their own VMs (via Proxmox API) and get direct access (ansible+terraform?).
• Improving asset and VM lifecycle management, to track what’s running, who owns it, and clean up unused resources automatically.
If you’ve done similar projects or have any ideas especially around automation, user access control, or Proxmox + GPU setups, I’d love to hear your thoughts!
https://redd.it/1kg8puy
@r_devops
What really makes an Internal Developer Platform succeed?
Hey, I work at Pulumi as a community engineer and as we are doubling down on IDP features I’ve been looking around at various other platform tools and it's hard for me to tell which features are great for demos and which are really the important pieces of an ongoing platform effort.
so, in your experience what features are essential for a real world internal developer platform? and how are you handling infrastructure lifecycle management or how would you like to be handling it? I’m more interested in the day-2-and-beyond messy bits of a platform approach but if you are successfully using a 1-click to provision portals I'd love to hear about that as well.
https://redd.it/1kg3gj4
@r_devops
Strategies for scaling out MySQL/MariaDB when database gets too large for a single host?
What are your preferred strategies when a MySQL/MariaDB database server grows to have too much traffic for a single host to handle, i.e. scaling CPU/RAM or using regular replication is not an option anymore? Do you deploy ProxySQL to start splitting the traffic according to some rule to two different hosts?
Has anyone migrated to TiDB? In that case, what was the strategy to detect if the SQL your app uses is fully compatible with TiDB?
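On the idea of splitting traffic between two hosts by some rule, here is a minimal application-side sketch of one such rule: hash a shard key and pick a host from the result. This is not ProxySQL configuration, and the host names, the `SHARDS` tuple, and the `shard_for` function are hypothetical.

```python
import zlib

# Hypothetical host names, for illustration only.
SHARDS = ("db-a.internal", "db-b.internal")


def shard_for(key, shards=SHARDS):
    """Pick a host for a shard key so the same key always maps to the same host."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return shards[zlib.crc32(str(key).encode("utf-8")) % len(shards)]


if __name__ == "__main__":
    for customer_id in (1, 2, 3, 4):
        print(customer_id, shard_for(customer_id))
```

A plain modulo rule like this moves most keys when a host is added, so it suits a fixed number of shards better than a growing one.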
https://redd.it/1kfwpne
@r_devops
Restart Operator: Schedule K8s Workload Restarts
github: [https://github.com/archsyscall/restart-operator](https://github.com/archsyscall/restart-operator)
Built a simple K8s operator that lets you schedule periodic restarts of Deployments, StatefulSets, and DaemonSets using cron expressions.
apiVersion: restart-operator.k8s/v1alpha1
kind: RestartSchedule
metadata:
  name: nightly-restart
spec:
  schedule: "0 3 * * *" # 3am daily
  targetRef:
    kind: Deployment
    name: my-application
It works by adding an annotation to the pod template spec, triggering Kubernetes to perform a rolling restart. Useful for apps that need periodic restarts to clear memory, refresh connections, or apply config changes.
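To make the annotation mechanism concrete, here is a minimal sketch of the kind of patch body that triggers a rolling restart. It uses the `kubectl.kubernetes.io/restartedAt` annotation that `kubectl rollout restart` sets. The operator's own annotation key is not stated in the post and may differ, and `restart_patch` is a hypothetical helper name.

```python
import json
from datetime import datetime, timezone

# Annotation key used by `kubectl rollout restart`; the operator's own key is
# not stated in the post and may differ.
RESTART_ANNOTATION = "kubectl.kubernetes.io/restartedAt"


def restart_patch(now=None):
    """Build a patch body that changes the pod template, forcing a rolling restart."""
    now = now or datetime.now(timezone.utc)
    return {
        "spec": {
            "template": {
                "metadata": {
                    # A new timestamp makes the pod template differ from the
                    # previous revision, which is what starts the rollout.
                    "annotations": {RESTART_ANNOTATION: now.isoformat()}
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(restart_patch(), indent=2))
```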
helm repo add archsyscall https://archsyscall.github.io/restart-operator
helm repo update
helm install restart-operator archsyscall/restart-operator
Look, we all know restarts aren't always the most elegant solution, but they're surprisingly effective at solving tricky problems in a pinch.
Thank you!
https://redd.it/1kfbkfl
@r_devops
IBM Event Notifications question
Hello everyone,
I am having difficulty configuring my alerts with different templates.
Maybe someone can help me?
In Event Notifications I have created a source.
In this source I have 2 topics.
I have 2 subscriptions and 2 templates.
But only one of the templates is used to send the alerts to Slack.
How can I change that?
Ideally, I would write the template query to pull the alert description into Slack.
Is this possible?
https://redd.it/1kf8w1x
@r_devops
how do you manage browser cache control after a version update?
here's the problem:
obviously we don't want to disrupt our clients while they are working, so a new version should be released in a way that won't conflict with the previous version, which has been loaded from the browser's cache or local storage.
but if we really don't want to interfere with their work and we update the app without breaking their session at all, the new version will conflict with the one they are currently using, unless we force them to reload. right now a reload can mean too much loading time mid-work and can break their workflow, which is horrible.
the only solution i could come up with is scheduled "downtime", which seems harsh.
but it is perhaps necessary: that way we don't cause conflicts for our clients, everyone communicates with each other seamlessly on the new version, and there are no "inner" conflicts between the local/previous version and the updated one.
how do you manage this?
there is cache busting, but i'm not entirely sure it's the correct policy for us.
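For reference, cache busting usually means putting a content hash in each asset's filename, so a new release gets new URLs while tabs that are already open keep using the old ones. Here is a minimal sketch of the naming step, assuming the old files stay on the server for a while after a release. The function name `busted_name` and the example paths are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def busted_name(filename, content, digest_len=8):
    """Return the filename with a short content hash inserted before the extension."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


if __name__ == "__main__":
    # Different content gives a different name, so the browser fetches the
    # new file instead of reusing a cached copy.
    print(busted_name("static/app.js", b"console.log('v1');"))
    print(busted_name("static/app.js", b"console.log('v2');"))
```

This only covers static assets. It does not solve API compatibility between an old client and a new backend, which is the other half of the problem described above.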
https://redd.it/1kh0ezy
@r_devops
Log / Metrics / APM for SaaS Solutions with minimal / no Selfhosting
I'm currently looking into a tool for our developers to get metrics and logs from our Azure App Services and Azure SQL services into. I'm currently using Azure Managed Grafana for Alerting and Datadog for infrastructure log ingestion and SIEM, the theme being minimal selfhosting, as I'm the sole devops. The reason I'm not using either for our app services is that Azure Managed Grafana doesn't have Loki in its stack and Datadog would simply be too expensive.
I've looked a bit into SigNoz, but that requires a Centralized Collector setup for it to work (which is an AKS cluster or VM custom setup), which for me defeats the purpose of a cloud solution. I also looked briefly into Splunk but I found their interface and setup very confusing.
In my ideal scenario, I'd use one tool for both alerting, SIEM / infrastructure logs and App Service logs / metrics, but with cost constraints that seems like a pipe dream.
I'm not sure if I'm being too stubborn on the whole no-selfhosting thing, but I'd really like to avoid having to deal with storage management when I'm the sole devops. For reference, there are about 30+ developers.
https://redd.it/1kgy3ix
@r_devops
Built an SRE job website to list the newest jobs
I put together a simple site for SRE job listings: https://newsrejobs.com/. Most listings don’t have tech filters, so I added a basic feature to filter by technology. Might be useful to some.
https://redd.it/1kgwjl0
@r_devops
I am going to give my first ever interview and it's for an Azure SRE intern position. What should I expect?
After applying for around 400+ intern positions, I've finally got this one interview. I don't wanna mess it up. I have 24 hours to prepare for it. I have a basic idea about Azure. Where should I start and what should I focus on? Any other interview tips would be great too!
https://redd.it/1kgtbg7
@r_devops
Is Cloud & DevOps right for a non-coder with an IT degree?
Hi all,
I have a B.Tech in IT but I’m not a strong coder. I took a year break for SSC/RRB prep, but now I want to restart my career in tech.
I’m considering an offline Cloud and DevOps course, but I’m unsure if it’s beginner-friendly. I’m hoping to work abroad in the future — maybe in countries like Germany, the UK, or Canada.
Is this a good path for someone with limited coding skills?
How is the job/internship scope after completing such a course?
What kind of technical knowledge is expected before starting?
Would love to hear from anyone who started out like me or is working in this field. Thanks in advance!
https://redd.it/1kgsbxu
@r_devops
How do you promote kubernetes environments using ArgoCD?
I've watched a video by Anton Putra, https://www.youtube.com/watch?v=_G_RY5trQao, on a production-grade setup with Argo.
The video is great and I've learned a lot, but I'm curious about his method of promoting environments.
His suggestion is that you let developers deploy their applications to a development environment, and then at a scheduled time you freeze this environment, promote it to staging, run your tests, then promote it to production when ready.
All of this is done with a python script that he created.
My question is, is this best practice? Having a Python script loop through your manifests, make an annotation change, do a git push, etc. all seems like a bit of an anti-pattern to me.
Also, if I understand it correctly, how do you make changes to all environments to ensure they are consistent? In the video he is mostly demonstrating the image updater, which makes sense because once staging is unfrozen it can pull the latest image. But do you have to copy your manifest files from your development folder to your staging folder, check all changes have been copied correctly, then unfreeze? Then do the same for production?
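The "check all changes have been copied correctly" step can at least be automated. Here is a minimal sketch that compares two environment folders with the standard library, assuming a flat folder of manifests per environment. `manifest_drift` is a hypothetical helper and not part of the video's script or of Argo CD.

```python
import filecmp


def manifest_drift(source_dir, target_dir):
    """List files that differ or exist in only one of two environment folders.

    Only the top level of each folder is compared, and filecmp's default
    (shallow) comparison is used.
    """
    cmp = filecmp.dircmp(source_dir, target_dir)
    return {
        "only_in_source": sorted(cmp.left_only),
        "only_in_target": sorted(cmp.right_only),
        "different": sorted(cmp.diff_files),
    }
```

Running `manifest_drift("development", "staging")` before unfreezing gives a quick list of what still differs. Differences that are meant to stay per environment, such as replica counts, would still show up and need a human decision.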
Curious how others handle this, and what they think of the above?
https://redd.it/1kgrgfe
@r_devops
How Liquibase Simplifies Schema Management
If you've ever deployed schema changes manually, you know the pain: tracking SQL scripts, guessing what's applied where, and praying nothing breaks in prod.
I recently wrote a post on how **Liquibase** helps database admins and DevOps teams version-control and automate PostgreSQL migrations—like Git for your database schema.
It covers:
* Why traditional schema management breaks at scale
* How Liquibase tracks, applies, and rolls back changes safely
* Real YAML examples for PostgreSQL
* CI/CD automation tips
* Rollback strategies and changelog best practices
Check it out here 👉 [https://blog.sonichigo.com/how-liquibase-makes-life-easy-for-db-admins](https://blog.sonichigo.com/how-liquibase-makes-life-easy-for-db-admins)
Would love feedback from folks using other tools too—Flyway, Alembic, etc.
https://redd.it/1kgpm70
@r_devops
Is there sometimes no hope?
Good afternoon, DevOps people of Reddit. I want to know if anyone else is feeling this. I have been brought onto a project to help this company adopt DevOps practices. My main issue is that I am getting pushback on all my suggestions. I am looking at how things are done and thinking to myself that to even begin to achieve anything, everything would need to be changed. So my question to everyone is: as I see it, this place will never achieve anything close to a DevOps mindset, so is there any point in trying? Or do I just give up, roll with the insanity that is sanity, and look for a new role?
https://redd.it/1kgguq6
@r_devops
What does Fastly need to do to be more enticing to developers?
I've seen a lot of people praise Fastly for having great tech, but Cloudflare is much more popular.
What makes Cloudflare so much better than Fastly, and what can Fastly do to be better?
https://redd.it/1kgdc9l
@r_devops
What Platform Engineering Really Means (and How It Differs from DevOps and SRE)
Hey all,
I just wrote a piece breaking down what Platform Engineering is — not just as a buzzword, but as a real discipline that’s emerging in many engineering organizations.
🔧 Key takeaways:
Platform Engineering is not just “DevOps rebranded.” It's about productizing the platform for developers — treating the internal developer platform (IDP) like a real product.
It focuses on golden paths, developer self-service, and abstracting complex infra behind sensible defaults.
It complements SRE by focusing on enablement, not just reliability.
The role is deeply cross-functional — blending infrastructure, developer experience, automation, and even elements of UX.
I also share real-world examples and tools/platforms that embody these ideas (e.g., Backstage, Kratix, Humanitec, etc.).
If you're navigating the gray area between DevOps, SRE, and Platform roles — or building an internal platform yourself — I’d love your thoughts.
👉 Full post here
Would love to hear:
How do you define platform engineering in your org?
What tooling or practices have helped you build your IDP?
https://redd.it/1kg7q10
@r_devops
Was pushed into a Devops role. Never got the chance to learn properly
I was pushed into a DevOps role, and since then there has always been a deadline over my head, so I was never able to learn things properly. I am still good at my job and can do what is required, but I feel like I don't know things in depth, and I don't know some non-trivial things like Istio or monitoring tools.
I want to change that. But because DevOps moves so fast, I don't have the slightest clue where to begin or how to start. Should I follow some roadmaps? Or implement things? If yes, what?
https://redd.it/1kg53p9
@r_devops
LogWhisperer – AI-powered log summarizer that runs locally (no OpenAI keys, no cloud)
I built an open-source CLI tool called LogWhisperer that uses a local LLM to summarize Linux system logs into human-readable summaries. It’s useful for triaging noisy logs, quick postmortems, or just getting a sense of what the hell happened without manually parsing `journalctl`.
Key features:
Uses a local model (via [Ollama](https://ollama.com)) — supports `mistral`, `phi`, etc.
Parses logs from `journalctl` or file paths (e.g. `/var/log/syslog`)
CLI-friendly with flags for source, priority, model, entries
Outputs markdown reports for easy archiving
Includes a spinner so it doesn't feel frozen when summarizing large logs
100% offline (after install) — no OpenAI keys or cloud dependencies
Use case: you're SSH'd into a flaky VM, and you just want a summary of the last 500 `err`-level logs without sifting through pages of noise.
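As a rough illustration of the input side of that use case, here is a minimal sketch that builds the `journalctl` call for the last 500 entries at `err` priority or worse. It is not LogWhisperer's code, and `journalctl_command` and `read_logs` are hypothetical names. The flags `-p`, `-n`, and `--no-pager` are standard journalctl options.

```python
import subprocess


def journalctl_command(priority="err", entries=500):
    """Build a journalctl invocation for the most recent entries at or above a priority."""
    return ["journalctl", "-p", priority, "-n", str(entries), "--no-pager"]


def read_logs(priority="err", entries=500):
    """Run journalctl and return its output as text (needs a systemd host)."""
    cmd = journalctl_command(priority, entries)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    print(" ".join(journalctl_command()))
```

The text returned by `read_logs()` is what would then be handed to the local model for summarizing.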
Install it with a one-liner shell script — it sets up the Python env, installs Ollama, and pulls the model.
GitHub: https://github.com/binary-knight/logwhisperer
Would love feedback from fellow infra folks. I'm also thinking of extending this into scheduled cron-based summaries, Slack alerts, and anomaly tagging, if anyone’s interested in contributing or has ideas.
https://redd.it/1kfyv61
@r_devops
Does anyone here use Humanitec? Feedback wanted!
I’ve been looking into Humanitec and I’m curious to hear from people who are actually using it.
What use case(s) are you solving with it?
How is it integrated into your workflows?
Any wins or challenges you've encountered?
Would you recommend it to others building platform tooling?
I’m especially interested in any honest pros and cons.
Appreciate any insight you can share!
https://redd.it/1kfcpze
@r_devops
Passive FTP into Kubernetes? Sounds cursed. Works great.
“talk about forcing some ancient tech into some very new tech wow... surely there's a better way” said a VMware admin watching my counter FTP strategy😅
Challenge accepted
I recently needed to run a passive-mode FTP server inside a Kubernetes cluster and quickly hit all the usual problems: random ports, sticky control sessions, health checks failing for no reason… you know the drill.
So I built a Helm chart that deploys `vsftpd`, exposes everything via stable NodePorts, and even generates a full `haproxy.cfg` based on your cluster’s node IPs, following the official HAProxy best practices for passive FTP.
You drop that file on your HAProxy box, restart the service, and FTP/FTPS just work.
https://github.com/adrghph/kubeftp-proxy-helm
Originally, this came out of a painful Tanzu/TKG setup (where the built-in HAProxy is locked down), but the chart is generic enough to be used in any Kubernetes cluster with a HAProxy VM in front.
Let me know if anyone else is fighting with FTP in modern infra. bye!
https://redd.it/1kfa7mz
@r_devops
LLMs ('AI') are coming for our jobs whether or not they work - Chris's Wiki
From here:
> In most non-tech organizations, both internal development and system administration is something similar to janitorial services; you have to have it because otherwise your organization falls over, but you don't like it and you're happy to spend as little on it as possible.
https://redd.it/1kf2kzc
@r_devops