Real mistakes, real incident post-mortems, and the fixes that actually work.
Tap to start →
1
Gotcha #1
imagePullPolicy: Always slows every deploy
With Always, every container start goes back to your registry. On spot instances with high churn this adds 30-60s per restart and can get you rate-limited by your registry.
imagePullPolicy: IfNotPresent  # cached image is fine
# Always is only needed for :latest or other mutable tags
# Pin your tags (my-app:1.4.0); then IfNotPresent is safe
2
Gotcha #2
No resource limits = noisy neighbour OOMKills your app
Without limits, one runaway Pod can exhaust node memory and get every other Pod on the node OOMKilled. Without requests, the scheduler can't plan capacity and overcommits nodes.
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    memory: "256Mi"
# No CPU limit on purpose: excess CPU is throttled, not OOMKilled
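To keep unbounded Pods from slipping into a namespace at all, a LimitRange can apply defaults like these automatically. A sketch; the object name and values here are illustrative:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults  # illustrative name
spec:
  limits:
  - type: Container
    defaultRequest:         # used when a container sets no requests
      cpu: "100m"
      memory: "128Mi"
    default:                # used when a container sets no limits
      memory: "256Mi"
```

Containers that declare their own requests/limits are unaffected; only missing fields get the defaults.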
3
Gotcha #3
livenessProbe kills app before it finishes starting
livenessProbe starts checking as soon as the container starts. JVM apps, apps loading ML models, apps running migrations: all get killed mid-startup and end up in CrashLoopBackOff.
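The fix is a startupProbe: the kubelet holds off liveness (and readiness) checks until it succeeds. A sketch assuming an HTTP health endpoint at /healthz on port 8080; adjust for your app:

```yaml
startupProbe:
  httpGet:
    path: /healthz   # assumed endpoint
    port: 8080
  failureThreshold: 30  # 30 × 10s = up to 5 min to start
  periodSeconds: 10
livenessProbe:          # only runs after startupProbe passes
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
```

failureThreshold × periodSeconds is your startup budget; once the startup probe succeeds, the normal liveness cadence takes over.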
4
Gotcha #4
PVC provisioned in the wrong AZ can never be mounted
Cloud block disks (EBS, Azure Disk) are zone-specific. If the PVC is provisioned before the Pod is scheduled, the disk may land in a different AZ and the Pod can never mount it.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata: {name: gp3}
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
# Waits until Pod is scheduled, then creates disk in same AZ
⚡
Track every Kubernetes release
releaserun.com monitors Kubernetes, Node.js, Go, Python, PostgreSQL, and 13+ more technologies. Get the story behind every version bump.