Reddit DevOps. #devops Thanks @reddit2telegram and @r_channels
Why our canary didn't roll back: hardcoded Prometheus endpoint in a separate HelmRelease
So we had a fun Tuesday. “Fun.”
Background: we run a fairly standard k8s setup on EKS, 40 services, nothing crazy.
deploy pipeline goes through ArgoCD, we do canary via flagger with a prometheus-based analysis template. has worked fine for like eight months.
2:17 PM I get paged. Error rate on checkout-service climbing past 2%, Flagger’s canary analysis is failing but it’s not rolling back. Just… sitting there. Suspended state. Customers hitting errors, canary still live at a 20% traffic split.
First thing I do is check the Flagger logs. The analysis is failing because the Prometheus query is returning no data. Not bad data, no data. So Flagger can’t evaluate the metric and it’s going into this weird limbo instead of failing safe. OK, Prometheus issue then. I go check Prometheus: it’s up, scraping looks fine, I run the query manually and it returns results no problem.
15 minutes wasted there.
So I’m thinking maybe it’s a namespace thing, that Flagger can’t reach the Prometheus endpoint; we had an ingress policy change a few weeks back. I start tracing network policy rules and Slack the platform team asking if anything changed on the observability namespace. They say no. I spend another twenty minutes here going through netpol YAML diffs. Nothing.
What actually got me looking in the right place was one of our senior engineers asking “which prometheus?” offhand in the thread. We have two. The one Flagger was configured to query had been migrated off its old service name as part of a prometheus-operator upgrade in staging that got promoted to prod last Thursday. The internal DNS name changed. Flagger’s HelmRelease still had the old address hardcoded.
Our runbook for “flagger canary stuck in suspended” literally says step 3 is “Verify prometheus connectivity using the endpoint in values.yaml”. We did that, and that endpoint was fine. It just wasn’t the one Flagger was actually using, because the Flagger config lives in a separate HelmRelease that nobody remembered has its own Prometheus URL override.
So we manually rolled back the canary, patched the Flagger config, and redeployed. Took about four minutes once we knew what it was.
Now the most frustrating part: I had a vague memory of something similar happening before I joined this team. There’s a Slack thread from like 14 months ago that references “prometheus endpoint mismatch”, but it’s not in any runbook and nobody who was here then is still on the rotation. So we just wasted time rediscovering it.
Updating the runbook NOW to explicitly call out that Flagger has its own Prometheus config independent of everything else, and adding a config validation step to the deploy pipeline that actually curls the configured endpoint from within the Flagger pod. Probably should have done that the first time, whoever that was.
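The validation step looks roughly like this. It's a sketch with assumed names: the namespace, labels, and the location of the `-metrics-server` flag all depend on how your Flagger chart is deployed, so treat every identifier below as an assumption to adjust.

```shell
#!/bin/sh
# Sketch: verify the Prometheus URL Flagger is ACTUALLY configured with, not the
# one in someone else's values.yaml. Namespace/labels are assumptions.
set -eu
NS=flagger-system

# Pull the -metrics-server flag off the running Flagger deployment.
PROM_URL=$(kubectl -n "$NS" get deploy flagger \
  -o jsonpath='{.spec.template.spec.containers[0].args}' \
  | tr ',' '\n' | tr -d '[]"' | sed -n 's/^-metrics-server=//p')
echo "Flagger is configured to query: $PROM_URL"

# Hit it from inside the Flagger pod, so DNS and network policy are exercised
# exactly as Flagger sees them. If the image has no wget/curl, use an ephemeral
# debug container instead.
POD=$(kubectl -n "$NS" get pod -l app.kubernetes.io/name=flagger \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n "$NS" exec "$POD" -- wget -q -O- -T 5 "$PROM_URL/-/healthy" \
  || { echo "ERROR: Flagger cannot reach its Prometheus at $PROM_URL"; exit 1; }
```

The point of running the check from inside the pod, rather than from the CI runner, is that it catches exactly this class of failure: an endpoint that resolves fine from everywhere except where Flagger actually lives.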
https://redd.it/1te9r16
@r_devops
Data classification as a one time project is basically guaranteed to rot
Treating data classification like a cleanup project feels doomed. You label a bunch of stuff, write a taxonomy, maybe hook it into policy and then six months later the world has changed:
new buckets, new tables, new services, new pipelines, new SaaS apps, new AI use cases, new temporary exports that somehow became permanent. From a platform/DevOps perspective, the problem is not just “what is this data?”
It is “where did it move, who can access it, what deploy created it, who owns the service, and what action is safe to take?” Has anyone made classification/remediation part of the workflow instead of a periodic audit exercise?
https://redd.it/1te05ie
@r_devops
Project-Based Mentor / Senior Dev to help me build an MVP (Custom Curriculum + Async Support)
Hello people of Reddit 👋🏾, my name is Taí and I need some help with a few "weird" requests.
I’m looking for a technical mentor to help me build out a specific project from the ground up. In the past, I’ve hired devs to build for me or "Frankensteined" pieces together myself, but this time I want to actually learn the "why" and the "how" behind building a scalable MVP. I’m looking for a collaborative partner, teacher, and mentor, not just a freelancer.
I want to do this for a few reasons. The primary one is that I want to improve myself and actually learn the ins and outs of development, but I also want to take the responsibility of development into my own hands. I want to be the factor that determines how good the project is, how fast the project gets done, and what it is capable of. Lastly, I think this might be a better structure for skilled devs, such as yourselves, who might want to help but just don’t have the time to commit to another full project. 😭
How I envision the workflow:
• Strategy Sessions: We start with a call to brainstorm the architecture and tech stack. This is really just a conversation to figure out how to move forward on the roadmap; as I complete each syllabus and you review each module, we have another call for further iteration.
• Custom Curriculum: Like I said before, after our calls, you would build/iterate a modular "syllabus" or roadmap of tasks for me to execute for the project. After I complete them, you would review them and we would schedule another meeting for further iteration (kind of similar to school).
• Execution: I build out the modules based on your roadmap. This is what I was talking about earlier where I take the process into my own hands; how fast the project moves towards completion will depend on my will, which is a really important thing for me.
• Async Support: I can text/message you when I hit a wall for quick guidance or a "nudge" in the right direction. I really need a responsive and communicative person for this part.
• Code Review & Pivot: Once a module is done, you review it. Once a syllabus is done, we meet back up to review, iterate, refine the code, and adapt the next steps of the curriculum.
Project “curriculum”:
Module > Syllabus > Roadmap
I’m really looking for an… Ironman 😅. What I mean by that is someone who has not only the technical skills and knowledge, but also the people skills and patience to work with me, but most importantly, the belief and willingness to do something like this. I’m a very Type A person and I 100% believe if I can see it in my head, then I can build it in reality. If you’re the kind of person who thinks it can’t be done, rather than finding a way to get it done, we probably won’t work well together… because I believe everything is possible.
Anyway, if you made it this far, thank you for reading this. I would appreciate any resources you may have; if you know where I can find a person to assist me with this or if you are that person, please shoot me a DM or leave a comment.
Thank you in advance! :)
https://redd.it/1tdv7lc
@r_devops
Beginner in DevOps, review my Bitbucket pipeline (AWS ECR -> EC2)
Hi everyone, I’m a beginner DevOps engineer working with Bitbucket Pipelines, AWS ECR, and an EC2 Ubuntu instance.
This pipeline builds my Flask backend Docker image, pushes it to ECR, then SSHes into EC2 to restart the container. It's working, but I know the env management could be better.
Could you please review it and suggest improvements?
```
image: atlassian/default-image:3

pipelines:
  branches:
    main:
      - step:
          name: Build and Push to ECR
          services:
            - docker
          script:
            # Login to ECR
            - aws ecr get-login-password ... | docker login ...awscli
            # Build and push
            - docker build -t "$AWS_ECR_URI:latest" backend
            - docker push "$AWS_ECR_URI:latest"
      - step:
          name: Deploy to EC2
          script:
            # SSH Setup
            - mkdir -p ~/.ssh
            - echo "$EC2_SSH_KEY" | base64 --decode > ~/.ssh/id_rsa
            - chmod 600 ~/.ssh/id_rsa
            # Copy env file
            - scp -o StrictHostKeyChecking=no -i ~/.ssh/id_rsa backend/.env.staging ubuntu@$EC2_INSTANCE_IP:/home/ubuntu/.env
            # Deploy container
            - |
              ssh -o StrictHostKeyChecking=no -i ~/.ssh/id_rsa ubuntu@$EC2_INSTANCE_IP <<EOF
              aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin <AWS_ACCOUNT_ID>.dkr.ecr.$AWS_REGION.amazonaws.com
              docker stop my_app || true
              docker rm my_app || true
              docker pull "$AWS_ECR_URI:latest"
              docker run -d --name my_app \
                --env-file /home/ubuntu/.env \
                -p 5000:5000 \
                --restart unless-stopped \
                "$AWS_ECR_URI:latest"
              sudo systemctl restart nginx
              EOF
```
https://redd.it/1tdork6
@r_devops
How do i start learning?
Hi, I am currently a 3rd year in telecommunications engineering and I'm curious about getting into DevOps. I know some Linux and some networking, but not a whole lot. I know there are a lot of tools used, but what exactly do I start with? If anyone can help me with a roadmap and some direction, and maybe recommend some courses, I would be very grateful.
https://redd.it/1tdcl06
@r_devops
Building a Windows-first orchestration layer for distributed GPU compute using consumer hardware
Over the last few months I’ve been building a backend orchestration system focused on coordinating distributed GPU workloads across multiple runtime/provider environments.
Current systems include:
- workload routing
- telemetry arbitration
- heartbeat/recovery logic
- failover handling
- sandboxed execution
- provider orchestration
- operator HUD tooling
The long-term goal is making fragmented GPU resources easier to coordinate and utilize across future compute markets.
Still early, but the orchestration layer is finally starting to behave like a real distributed system instead of isolated components.
Would genuinely love feedback from people with experience in:
- distributed systems
- orchestration
- homelab clusters
- GPU infrastructure
- runtime/container systems
https://redd.it/1tda66e
@r_devops
What’s the worst retry-related production issue you’ve seen?
I’ve been spending a lot of time learning about retries, background jobs, and failure handling in distributed systems lately.
One thing that surprised me is how many “successful” production incidents are actually partial failures:
* payment succeeded but DB update failed
* email sent twice after retry
* webhook timed out but still processed
* job crashed after side effect completed
The scary part is that retries often make systems look reliable while silently creating duplicate side effects underneath.
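The standard answer to the duplicate-side-effect problem is an idempotency key: the retry re-sends the same key, and the side effect runs at most once. A minimal sketch (the names here are mine, not from any particular library; in production the dict would be a database table with a unique constraint):

```python
import uuid

# Toy "payments" store keyed by idempotency key. In production this would be a
# DB table with a unique constraint, written in the same transaction as the
# business state change.
_processed: dict[str, str] = {}

def charge(idempotency_key: str, amount_cents: int) -> str:
    """Charge at most once per key; retries with the same key replay the result."""
    if idempotency_key in _processed:
        return _processed[idempotency_key]      # duplicate retry: no new side effect
    charge_id = f"ch_{uuid.uuid4().hex[:8]}"    # the side effect happens exactly once
    _processed[idempotency_key] = charge_id
    return charge_id

key = "order-42-attempt"
first = charge(key, 1999)
second = charge(key, 1999)   # simulated client retry after a timeout
assert first == second        # no duplicate charge
```

This turns "did the webhook time out before or after processing?" from an unanswerable question into a safe one: the caller can always retry.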
Curious what retry-related production issue taught you the hardest lesson?
https://redd.it/1td63t1
@r_devops
How to handle multiple job processes as a DevOps Engineer
I work in IT as a DevOps Engineer. I'm currently unemployed and a little bit desperate to get a job, but there's no real rush or pressure, as I have unemployment benefits and some savings in my bank account.
I'm currently going through different interview processes. In this field, interviews can take months, with at least 3 rounds, and that makes it very difficult to have multiple offers at the same time so I can make a good decision with all of the options available.
Last time I accepted an offer I had to withdraw from the ongoing interview processes, because I was tired of the hustle of juggling them while trying to keep everything quiet at my then-current job.
If anyone has ever been in this type of situation, could you please give me some advice?
1. If I manage to land an offer, is it OK to make the company wait 2-3 weeks while I finish all of my other interview processes?
2. Can I decline an offer and come back later if the other interviews don't go as expected?
3. What strategies can I use to line up multiple offers at the same time so I can make the best decision?
4. Do you have strategies to buy more time?
https://redd.it/1td2jva
@r_devops
Hosted git options these days?
I see a lot of hate on GitHub, I see GitLab recently announced a lot of layoffs and it seems they've joined the 'people you love to hate' club in terms of public opinion.
That leaves who for hosting private repos? Bitbucket?
Who does everyone actively recommend someone use for their private git repos if self-hosting is not an option?
Our company was thinking about migrating off of Bitbucket and moving to GitHub, but recently opinions have splintered on where to go.
https://redd.it/1tcwqgb
@r_devops
Storage types that trip up engineers...explained simply
After working with a lot of AWS environments I still see engineers mixing these up regularly.
EBS vs EFS vs S3: they're not interchangeable.
EBS is a virtual hard drive attached to one EC2 instance at a time. Fast, low latency, tied to a single AZ; a root volume typically dies with its instance, though other volumes can be detached and re-attached.
EFS is a shared network drive. Multiple instances can read and write simultaneously. Great for shared filesystems across containers or services.
S3 is object storage. Not a filesystem at all. Store objects, retrieve them by key or URL. Effectively infinitely scalable, but not meant to be mounted like a disk or hit with low-latency random reads.
The mistake I see most: teams use EBS when they need shared access across multiple instances and wonder why it doesn't work. Or they treat S3 like a filesystem and hit latency issues.
https://redd.it/1tcw3as
@r_devops
What's your CI/CD flow?
I was talking to a colleague yesterday and realized people have pretty different CI flows. Basically, he merges all his PRs into a release branch and then into main, so that he gets very clear release notes from every release branch. He was also building a fresh artifact for each deploy, so one build for dev, then staging, then prod; obviously that part is problematic, since the artifact you tested isn't the artifact you ship.
How many of you do this?
Here's my flow:
I basically do trunk-based without release branches, and every merge is a new version release that builds both prod and staging artifacts in the same job, deploys only staging, and when we're happy with staging we manually deploy prod.
I've had some deployments in the past that were fully automated with Argo Rollouts, but that needs very good testing and observability.
I've also seen some people create a release candidate branch when they want to release to prod, with all relevant merges, so that they keep track of what's released.
Interested to know what people here do?
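For what it's worth, the build-once/promote-many idea behind my flow can be sketched generically like this (pseudoconfig for illustration, not any specific vendor's syntax):

```yaml
# Pseudoconfig sketch: one immutable artifact per merge, promoted across envs.
stages:
  - build:              # runs once per merge to main
      script: docker build -t app:$GIT_SHA . && docker push app:$GIT_SHA
  - deploy-staging:     # automatic, deploys the artifact just built
      script: deploy app:$GIT_SHA staging
  - deploy-prod:
      when: manual      # human gate once staging looks good
      script: deploy app:$GIT_SHA prod   # same image digest, no rebuild
```

The point is that prod receives the exact bytes staging validated, which is what the rebuild-per-environment approach loses.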
https://redd.it/1tcprf8
@r_devops
Building an AWS Cost Management Dashboard as a class project - roasting and suggestions welcome
Hey
I'm a Cloud Engineering grad student building a full-stack AWS cost visibility and resource management dashboard. Think a lightweight self-hosted alternative to AWS Cost Explorer
**Tech stack:** React frontend, Flask backend, SQLite for caching, boto3 for AWS, Gemini API for AI features
**Services I'm covering:** EC2, S3, RDS, Lambda, EKS, with CloudWatch CPU utilization and Cost Explorer data tied together
**Features I'm building:**
* Fleet overview with CPU utilization vs cost per instance
* Idle/zombie resource detector that shows you exactly how much you're wasting in dollars
* Bill shock predictor projects end of month spend based on current trajectory
* Cost anomaly detection with AI explanation of what caused the spike
* Natural language querying, type "which instances are costing most and barely used" and get a plain English answer
* One-click bulk stop for idle resources with savings preview
* Auto-generated monthly cost report in plain English
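For the bill-shock predictor above, the simplest trajectory projection is linear: month-to-date spend divided by days elapsed, times days in the month. A rough sketch of that math (function name is mine; a real predictor would weight recent days or fit a trend):

```python
import calendar
from datetime import date

def project_month_end_spend(mtd_spend: float, today: date) -> float:
    """Linearly project end-of-month spend from month-to-date spend."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_rate = mtd_spend / today.day      # average spend per elapsed day
    return daily_rate * days_in_month

# $150 spent by the 10th of a 30-day month projects to $450.
print(project_month_end_spend(150.0, date(2026, 6, 10)))  # -> 450.0
```

The linear version is also a good baseline to compare the AI anomaly explanations against: if actuals diverge sharply from it, that's the spike worth explaining.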
**My questions for the community:**
1. What's the most annoying thing about the native AWS Cost Explorer you wish was better?
2. Is there a feature you constantly wish existed when managing AWS costs?
3. Anything here that seems pointless or that you'd never use?
4. What would make you actually use this over the AWS console?
https://redd.it/1tcfqdy
@r_devops
I’ve reached peak DevOps: I spent 6 hours automating a 30-second deployment task because "manual work is a technical debt." 🤡
The logic was sound: why do it manually when I can spend a whole afternoon fighting with a dependency graph and a custom script? Now, the task takes 2 seconds to run, but it requires 3 different monitoring tools just to make sure the "automation" doesn't have a mental breakdown.
Is it still "efficiency" if the maintenance of the automation takes more time than the original task ever did? Or are we all just collectively addicted to building complex systems for simple problems in 2026?
https://redd.it/1tc9ew3
@r_devops
Need help understanding GoReleaser + GitHub Actions + Homebrew/Scoop automation for my CLI tool
I built a Go CLI tool https://github.com/aminshahid573/gen
Now I am trying to set up proper distribution (Homebrew, Scoop, releases), but I’m honestly confused about how the whole pipeline works.
Right now I've used AI to generate my setup, but I don't fully understand it. I feel like I'm just copy-pasting things without knowing what I'm doing, and I'm not even sure whether it's working or not.
What I want: every time I create a new version tag (v1.0.0, v1.0.1, etc.), GitHub Actions should automatically:
1. run tests
2. build binaries for Linux/macOS/Windows
3. create a GitHub release
4. update Homebrew tap
5. optionally update Scoop manifest
---
my current setup is .github/workflows/release.yml
https://github.com/aminshahid573/gen/blob/main/.github%2Fworkflows%2Frelease.yml
---.goreleaser.yml
https://github.com/aminshahid573/gen/blob/main/.goreleaser.yml
---
So far I've created the GitHub repo for the CLI, created the Homebrew tap repo (homebrew-gen), and written the formula manually.
What I'm confused about: do I actually need HOMEBREW_GITHUB_TOKEN, or is GITHUB_TOKEN enough?
How exactly does GoReleaser update the Homebrew tap repo?
Do I need to manually create formulas, or does GoReleaser handle everything?
What is the correct minimal setup for GitHub Actions, GoReleaser, the Homebrew tap, and the Scoop bucket?
If someone can explain the correct, clean workflow from scratch with a minimal working config, and what each part actually does,
that would really help me understand instead of just copy-pasting AI configs.
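On the token question: in GitHub Actions, the default GITHUB_TOKEN is scoped to the repository running the workflow, so GoReleaser generally needs a separate personal access token to push the generated formula into a different repo (your homebrew-gen tap). The relevant piece of .goreleaser.yml looks roughly like this (a sketch, not your exact config; field names vary by GoReleaser version, e.g. `repository:` was called `tap:` in older releases, so verify against the GoReleaser docs):

```yaml
# Sketch: with a brews section, GoReleaser renders the formula from the built
# release artifacts and commits it to the tap repo on every tag, so you do not
# maintain the formula by hand after this.
brews:
  - name: gen
    repository:                 # older GoReleaser versions call this "tap"
      owner: aminshahid573
      name: homebrew-gen
      # A PAT with write access to homebrew-gen; the workflow's default
      # GITHUB_TOKEN generally cannot push to a second repository.
      token: "{{ .Env.HOMEBREW_GITHUB_TOKEN }}"
    homepage: "https://github.com/aminshahid573/gen"
```

The Scoop side works the same way: a scoop/scoops section pointing at your bucket repo, using the same token.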
https://redd.it/1tc506c
@r_devops
Open-source byte-exact dedup tool — EOSE Labs picked it up as a Terraform state hygiene layer for their sovereign language fleet
Disclosure: I'm the author. Posting because the use case isn't one I would've pitched to r/devops on my own, but a production team picked it up for exactly that and figured this sub might find it useful.
**The case study**
EOSE Labs ([pemos.ca/pemgraphs](https://pemos.ca/pemgraphs)) built a sovereign infrastructure architecture they call PEMGRAPHS — eight language nodes (Python, Rust, Go, TypeScript, Bash, Perl, Lean 4 for formal verification, HCL/Terraform), AKS + Azure Key Vault, multi-tier ADA vault secret rotation, 3,051-theorem Lean 4 proof engine. They're using my dedup tool (Merlin) at the Terraform state node and the Python v4 lineage tier as a hygiene primitive.
Their public shoutout: *"Corbenic didn't know they were unlocking sovereign fleet context optimization. They were just building really good dedup."* [pemos.ca/community-shoutout](https://pemos.ca/community-shoutout)
**What the tool is**
* Byte-exact line-level dedup using xxHash3-64. Deterministic, lossless.
* Single-threaded, ~250 KB Windows x64 binary, statically linked (no msys2 / runtime deps).
* 1.10 µs median latency per chunk (their measurement, not mine).
* MIT-licensed integrations: CLI, MCP server, HTTP proxy.
* Hard-enforced caps in community tier: 50 MB/run · 200 MB/day · 2 GB/month. Refuses oversized work cleanly.
**Why DevOps might care (beyond AI workloads)**
* Terraform state files balloon fast in large infra. EOSE uses it as a hygiene layer at the HCL/Terraform node.
* CI/CD log pipelines accumulate duplicate noise. `./merlin-lite input.log --output-dedup=clean.log` is a standalone primitive.
* Anything where you process append-heavy streams and want a fast pre-archival pass.
**Install**
```
curl -LO https://github.com/corbenicai/merlin-community/releases/latest/download/merlin-community.zip
unzip merlin-community.zip && cd merlin-community
python shared/install_helpers.py <integration> enable
```
Or just use the binary standalone: positional input file + `--output-dedup=PATH`.
**Honest caveats**
* Line-level exact-match. Not semantic, not fuzzy. If you need near-duplicate detection on rephrased content, this isn't it.
* Open-core: there's a closed-source Pro engine for high-throughput servers. What's in the public repo is what runs in the community edition: same MCP interface, same caps.
* Windows x64 binary in v0.2.1 release. Linux + macOS coming once cross-platform CI is up.
Repo: [**github.com/corbenicai/merlin-community**](http://github.com/corbenicai/merlin-community)
Zero telemetry. Curious if anyone here has Terraform state hygiene pain that this might fit, or if EOSE's pattern (using a dedup primitive as a building block in a bigger system) maps to something you're already doing.
https://redd.it/1tc1rbg
@r_devops
Upstream covariance reshaping produces consistent BPP reduction across four independent codec architectures — reproducible results on Kodak PCD0992
https://redd.it/1te2io4
@r_devops
Anyone use PagerDuty?
I'm looking to give away 5-10 free lifetime accounts for my app, which is essentially a PagerDuty integration that automates calculating your on-call pay each month (or however frequently you do it).
The idea is that engineers and analysts work hard enough and strain their brains enough without having to whip up a spreadsheet manually correlating multiple calendars, alert times, alert counts, schedules, etc. at the last minute each month to submit for their own on-call pay. I've also found from experience that this manual method is prone to human error more often than you'd realise.
All I want in exchange is feedback.
If you'd be interested please drop a comment and let me know your Role if you don't mind.
If you want to check it out it's at calloutpay.com
Thanks 👍
https://redd.it/1tdv99d
@r_devops
Dependency Track, notifications not triggering.
Hello everyone!
I am working with Dependency Track (version 4.14.2), which I've only just started to use. I now have a couple of projects and some policies in place. The policies correctly scan the projects and label the policy violations with the severity I defined.
I want to enable email notifications when a new policy violation is found, but they don't seem to trigger, though the test email is correctly received.
I have tried forcing a re-scan, deleting the project and starting over (so all policy violations are "new"), scoping the notification to just certain tags, and changing the project version, and I am running out of ideas.
If anyone has any tips on where to look, I would really appreciate it.
Thanks!
https://redd.it/1tdtdm7
@r_devops
How do you actually catch security issues in Terraform PRs when you're doing solo reviews?
The pattern I keep seeing: security groups too open, S3 buckets publicly accessible, encryption disabled on databases, IAM policies wider than they need to be. I catch some of it in manual review, but I know I'm missing things.
Question for the room: what's actually working for you?
* Are you using any automated tooling? (Checkov, tfsec, something else?)
* Has anyone tried running infrastructure changes through ChatGPT or Claude to catch gaps before merge?
* If you haven't automated this, what's the blocker: company policy, trust in the output, or just not having found the right tool?
Curious what's actually practical at the startup/small-team scale where you can't afford enterprise solutions.
https://redd.it/1tdh4vj
@r_devops
Initial full backup concerns with Azure DevOps 2020 on-prem. Need advice
Hi everyone,
I have recently taken over the administration of an Azure DevOps Server 2020 (on-premises) environment. The previous administrator is no longer with the company, and unfortunately, there is no existing documentation regarding the backup strategy. It appears that no automated backups have been configured via the Administration Console so far.
Environment Details:
Version: Azure DevOps Server 2020.
Scope: Single server instance containing one Collection with two active projects.
Content: Includes source code (TFVC/Git) and several CI/CD YAML/Classic pipelines.
Status: The environment is live and business-critical.
My Goal:
I want to use the built-in Scheduled Backups tool within the Azure DevOps Administration Console to create a backup plan, including an initial full backup and subsequent scheduled increments.
My Concerns:
Since I am new to this specific instance, I want to ensure that enabling the backup plan won't inadvertently disrupt the production services or lock any databases in a way that affects the pipelines or developer access.
Specific Questions:
Impact on Live Environment: Does the initial full backup via the Admin Console trigger any significant downtime or "Read-Only" states for the collections?
Permissions: Besides the service account having sysadmin rights on SQL Server, are there any easily overlooked folder permissions required for the backup network share?
TFS Integration: As there is still legacy source code on the instance, are there specific metadata files outside of the SQL databases that I need to manually include, or does the wizard cover all necessary components (databases + reporting + encryption keys)?
Common Pitfalls: Are there any known issues when running the backup wizard for the first time on a "neglected" 2020 instance?
I want to avoid breaking anything while securing the data. Any advice or checklists from experienced Azure DevOps admins would be greatly appreciated.
Thanks in advance!
https://redd.it/1td6vzo
@r_devops
How are you securing AI-generated / “vibe-coded” internal apps built by non-dev teams?
I work as a DevOps engineer at an AI startup, and we are running into a new problem.
With tools like Cursor and Claude Code, more people across the company are building small internal apps on their own — not just developers, but also folks from marketing, product, and sales. These apps often get deployed quickly on platforms like Vercel, Cloudflare Pages, or Netlify.
The concern is that this can become a security and governance mess very fast.
Right now, I am trying to figure out a practical way to make sure:
- Every internal app is behind authentication from day one
- Apps are hosted under the company’s domain only, not random public preview URLs
- We can discover if someone has deployed an internal app outside approved company accounts
- Sensitive internal data is not exposed through a personally created Vercel/Cloudflare/Netlify project
- Security controls do not kill the speed and productivity that made these tools useful in the first place
For “normal” dev-built apps, we usually put them behind SSO, auth gateways, or internal access controls. But that is harder when apps are being created outside the engineering team by non-dev teams.
I would like to know what has actually worked in practice, especially in environments where people are moving fast and experimenting with AI-assisted development.
https://redd.it/1td7mxp
@r_devops
How to provide Database Schemas either empty or with masked/obfuscated data to non-production environments?
Hi Everyone, I'm working on an initiative to revamp our cloud infrastructure and address some of the challenges our dev teams are facing.
One of the core issues I want to address is the lack of non-production database availability for testing/development in lower environment tiers.
We do have Redgate available as a tool and I'm wondering how best to provide the non-prod database instances both to lower environment tiers within our primary ADDS domain as well as to off-domain sandbox instances that developers have open access to play around in. This would be both for PaaS and IaaS database (primarily SQL) types.
Ideally I would like the developers to be able to easily redeploy the non-production db schema without DBA intervention once the solution has been setup.
Has anyone had experience setting this sort of thing up?
My background is primarily infrastructure and while I have experience with a breadth of cloud services I've never dealt with this type of issue specifically before.
Thanks in advance for any input or suggestions!
https://redd.it/1tcw5b6
@r_devops
NGINX CVE-2026-42945 (ngx_http_rewrite_module) — patched boundary is 1.30.1 / 1.31.0
Disclosure: I work on Forkline, which maintains a fork of the retired Kubernetes ingress-nginx controller.
NGINX published a security advisory for ngx_http_rewrite_module. The affected versions are NGINX Open Source below 1.30.1 and 1.31.0.
Advisory: https://nginx.org/en/security_advisories.html
CVE-2026-42945 (NVD): https://nvd.nist.gov/vuln/detail/CVE-2026-42945
NGINX labels it medium, but NVD lists CVSS v4.0 9.2 / v3.1 8.1.
Trigger condition: a `rewrite` directive that uses unnamed PCRE captures (`$1`, `$2`) with a `?` in the replacement string, and is followed by another `rewrite`, `if`, or `set` in the same scope. DepthFirst has a solid technical breakdown: https://depthfirst.com/nginx-rift
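Based on that trigger description, the vulnerable shape to grep your configs for looks roughly like this (a hypothetical config for illustration only; verify against the advisory and the DepthFirst writeup before relying on it):

```nginx
# Vulnerable shape (sketch): unnamed capture ($1) plus a '?' in the replacement,
# followed by another rewrite/if/set directive in the same scope.
location /old {
    rewrite ^/old/(.*)$ /new/$1?from=legacy last;
    set $redirected 1;
}
```

If nothing in your configs combines all three elements (unnamed capture, `?` in the replacement, and a following `rewrite`/`if`/`set` in the same scope), this specific CVE should not be triggerable per the advisory's description.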
For plain NGINX: upgrade to 1.30.1+ or 1.31.0+. If your config does not use the rewrite pattern above, you are not directly affected by this specific CVE — but check the full advisory batch.
For Kubernetes ingress-nginx: upstream kubernetes/ingress-nginx is archived. The last controller line embeds NGINX 1.27.1. The host NGINX version does not matter here — what matters is what is compiled into the controller image.
```
kubectl exec -n ingress-nginx <controller-pod> -- /nginx-ingress-controller --version
```
Options for ingress-nginx operators:
- Migrate to a Gateway API implementation (long-term path)
- Run a maintained fork that tracks current NGINX (we publish one at https://github.com/forkline/ingress-nginx)
- Accept the risk if your Ingress rules do not hit the rewrite trigger
https://redd.it/1td0es6
@r_devops
Should I go for these DevOps courses to start with, or YouTube only?
DevOps courses to start in 2026:
KodeKloud - best for hands-on labs + Kubernetes
Udemy DevOps Courses - cheap + beginner-friendly
Coursera DevOps Courses - structured + certifications
Intellipaat DevOps Course - live classes + projects + placement support
TechWorld with Nana YouTube - free practical DevOps learning
https://redd.it/1tcwsw1
@r_devops
DevOps/SRE/IT + more | Discord servers out there?
Does anybody know of any good Discord servers out there for actual IT work?
Not looking for a beginner coding server or a vendor/community server centered around one product. Nor a basic low-level tech troubleshooting discord.
More like a place where you’ve got a mix of:
* sysadmins
* devops/sre
* cloud infra
* terraform/sam/iac nerds
* network guys
* dbas
* full stack/front end/backend/devs
* security people
* hell, even PMs or POs
Basically the kind of people you’d normally only run into at work in the IT department.
I’ve looked around but most servers I find are either dead, super junior or basic tech troubleshooting focused, or just product support channels.
Curious if there’s a well-known one people actually use that is more high level for general tech discussions.
https://redd.it/1tcolzu
@r_devops
Entrepreneurial DevOps Developers
I would like to connect with some experienced folks and brainstorm some ideas on multi-cloud environments for enterprises, involving some security too. Please comment and DM freely.
https://redd.it/1tcoquy
@r_devops
Why do we still treat EBS storage like a one-way street?
Curious how other teams handle this.
When compute gets oversized, we rightsize it.
When an EBS volume gets close to full, we expand it.
But once storage is provisioned, it feels like nobody wants to touch it again.
At my last company we had volumes that were massively overprovisioned because everyone was more afraid of a storage-related outage than wasting money. Once systems were live, storage basically became "set and forget."
Shrinking volumes always sounded like one of those things that was technically possible but operationally too risky to justify.
So the result was:
* storage only kept growing
* unused space piled up
* nobody wanted to be the person touching production disks
Feels weird that in 2026 we still handle storage so differently from the rest of infrastructure.
Are most teams just accepting the waste as the safer option, or are people actually solving this somehow now?
https://redd.it/1tcb0i3
@r_devops
For application configurations in OCI
Do you use Cloud Shell a lot for application configuration in OCI? I'm new here and I'd like to know if people actually use it.
https://redd.it/1tc843e
@r_devops
MCP servers just showed up in our infrastructure and I genuinely have no idea how to secure them, anyone been through this?
Not panicking, but definitely out of my depth, and I'd rather admit that now than figure it out after something breaks.
I've been doing DevOps for about three years at a mid-sized SaaS company. Pipelines, containers, infra automation, the usual. Last month our engineering team started integrating MCP servers to power some internal AI agent tooling, and it landed in my lap to manage the deployment and infra side of it.
The problem is that everything I know about securing infrastructure doesn't map cleanly onto this. I can lock down a container. I can harden a CI/CD pipeline. But MCP is a different thing entirely. The servers expose tools that AI agents can call autonomously, and some of those tools have filesystem access, shell execution, and database connectors. The blast radius of a misconfigured permission scope here feels genuinely significant, and I don't have a framework for thinking about it systematically yet.
What's been keeping me up is the agentic side of it. These aren't just APIs sitting behind auth. The agents decide what tools to call and chain them together without a human approving each step. Our current pipeline validation has already started flagging permission scope warnings on three of the deployed MCP tools, and I blocked the deployment because I didn't know what the acceptable threshold even was. I've been piecing things together from blog posts and the handful of MCP security write-ups that exist, but nothing gives me a repeatable methodology I can actually build a process around.
https://preview.redd.it/ay2kuyzyrw0h1.png?width=648&format=png&auto=webp&s=4b178a766229d4914c4927874e1c81e9757aa850
This is basically what my week has looked like. The pass rate dropped from 96% to 81% since we started integrating MCP servers, and almost all of the failures are permission or schema validation errors I don't fully understand yet.
Has anyone here gone through this? Specifically curious whether there's any structured training that actually covers MCP security mechanics rather than AI security broadly, and how you're handling scope definition in your engagement agreements when the blast radius of these servers isn't obvious even to the people who built them.
https://redd.it/1tc01ui
@r_devops
Deployment advice for early stage startup!
Hello everyone,
We are running a small startup, and the problem I am facing right now is a single point of failure. Since we don't have much budget, we've hosted on a cheap VPS for now.
We have multiple services (Python, Node, DB, Redis, etc.) and everything is dockerized inside a Compose file. We run staging and production environments behind an nginx reverse proxy, with both environments hosted on a single VPS. We don't have any monitoring or observability tooling right now. The way we deploy is: build the Docker image via GitHub Actions, push it to the VPS, and run it.
So for our setup, how can we improve our deployment, and what are the best strategies we can adopt?
Thank you.
https://redd.it/1tc0h4j
@r_devops