Introduction
Kubernetes needs to know when your applications are healthy and ready to serve traffic. Health probes enable Kubernetes to automatically detect and recover from application failures, ensuring high availability.
Understanding Health Probes
The Problem Without Probes:
- Crashed applications continue receiving traffic
- Deadlocked processes never recover
- Slow-starting apps receive traffic too early
- Failed containers remain in “Running” state
The Solution: Health probes provide automated health monitoring and recovery:
- Automatic restart of failed containers
- Traffic routing only to healthy pods
- Graceful startup for slow applications
- Self-healing without manual intervention
Types of Probes - Complete Comparison
Probe Type | Purpose | Action on Failure | When to Use
---|---|---|---
Liveness | Is the container alive? | Restart container | Detect deadlocks, crashes
Readiness | Ready for traffic? | Remove from Service endpoints | Temporary unavailability
Startup | Has it started? | Restart if startup times out | Slow-starting apps
Probe Execution Order:
- Startup Probe runs first (if configured)
- Once startup succeeds, Liveness and Readiness probes begin
- Liveness and Readiness run continuously in parallel
Key Differences:
Aspect | Liveness | Readiness | Startup
---|---|---|---
Failure Action | Restart container | Remove from endpoints | Restart container
During Startup | Disabled until startup succeeds (if configured) | Disabled until startup succeeds (if configured) | Runs first
Use Case | Permanent failures | Temporary issues | Initial startup
Example | Deadlock detection | Database connection | App initialization
Liveness Probe - Detecting When to Restart
What is a Liveness Probe? A liveness probe checks if a container is still running properly. If it fails, Kubernetes kills and restarts the container.
Why Use Liveness Probes?
- Deadlock Detection: Restart hung processes
- Memory Leak Recovery: Restart before OOM
- Crash Recovery: Detect silent failures
- Automatic Healing: No manual intervention needed
When to Use:
- ✅ Application can deadlock
- ✅ Memory leaks cause gradual degradation
- ✅ Process can enter an unrecoverable state
- ❌ Temporary failures (use readiness instead)
- ❌ Slow startup (use a startup probe instead)
⚠️ Warning: Aggressive liveness probes can cause restart loops!
Probe Mechanisms:
1. HTTP GET Probe
apiVersion: v1
kind: Pod
metadata:
  name: liveness-http
spec:
  containers:
  - name: app
    image: myapp:1.0
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: Custom-Header
          value: Health-Check
        scheme: HTTP  # or HTTPS
      initialDelaySeconds: 3
      periodSeconds: 3
      timeoutSeconds: 1
      failureThreshold: 3
Success Criteria: HTTP status code 200-399
Use When: App exposes an HTTP endpoint
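For reference, here is one way the app side of such an endpoint might look. This is a minimal sketch in Go, not taken from the manifest above: it assumes a hypothetical background worker that records a heartbeat, so a deadlocked worker makes /healthz fail and triggers a restart (the 30-second staleness threshold is an arbitrary choice).

package main

import (
    "net/http"
    "sync/atomic"
    "time"
)

// lastHeartbeat stores the Unix time of the worker's most recent iteration.
var lastHeartbeat atomic.Int64

func healthzHandler(w http.ResponseWriter, r *http.Request) {
    // Fail the liveness probe if the worker hasn't checked in recently,
    // which suggests a deadlock or a hung loop.
    if time.Since(time.Unix(lastHeartbeat.Load(), 0)) > 30*time.Second {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

func main() {
    lastHeartbeat.Store(time.Now().Unix())
    go func() {
        for {
            // ... do one unit of work ...
            lastHeartbeat.Store(time.Now().Unix())
            time.Sleep(time.Second)
        }
    }()
    http.HandleFunc("/healthz", healthzHandler)
    http.ListenAndServe(":8080", nil)
}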
2. Command/Exec Probe
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      if [ -f /tmp/healthy ]; then
        exit 0
      else
        exit 1
      fi
  initialDelaySeconds: 5
  periodSeconds: 5
Success Criteria: Exit code 0
Use When: Custom health logic is needed
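The marker file an exec probe inspects has to be managed by the application itself. A minimal sketch of that side in Go (markHealthy and markUnhealthy are hypothetical helpers, not part of any standard API):

package main

import (
    "log"
    "os"
)

// markHealthy creates the file the exec probe checks for.
func markHealthy() error {
    return os.WriteFile("/tmp/healthy", []byte("ok"), 0o644)
}

// markUnhealthy removes the file, so the probe's next run exits non-zero
// and Kubernetes restarts the container.
func markUnhealthy() {
    os.Remove("/tmp/healthy")
}

func main() {
    if err := markHealthy(); err != nil {
        log.Fatal(err)
    }
    // ... run the app; call markUnhealthy() if it reaches an
    // unrecoverable state without crashing outright ...
    select {} // placeholder for the real main loop
}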
3. TCP Socket Probe
livenessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
Success Criteria: TCP connection succeeds
Use When: Non-HTTP services (databases, message queues)
4. gRPC Probe (Kubernetes 1.24+)
livenessProbe:
  grpc:
    port: 9090
    service: my.service.v1.Health
  initialDelaySeconds: 10
Success Criteria: gRPC health check returns SERVING
Use When: gRPC services
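On the server side, the kubelet queries the standard gRPC health service. A minimal sketch in Go using the google.golang.org/grpc health package (the service name must match the one in the probe config; error handling trimmed for brevity):

package main

import (
    "net"

    "google.golang.org/grpc"
    "google.golang.org/grpc/health"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
    lis, _ := net.Listen("tcp", ":9090")
    srv := grpc.NewServer()

    // Register the standard health service the gRPC probe calls.
    h := health.NewServer()
    healthpb.RegisterHealthServer(srv, h)

    // Report SERVING for the service name configured in the probe.
    h.SetServingStatus("my.service.v1.Health", healthpb.HealthCheckResponse_SERVING)

    srv.Serve(lis)
}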
Readiness Probe - Controlling Traffic Flow
What is a Readiness Probe? A readiness probe determines if a container is ready to serve requests. Failed readiness removes the Pod from Service endpoints.
Why Use Readiness Probes?
- Graceful Startup: Don’t send traffic until ready
- Dependency Checks: Wait for database connection
- Temporary Unavailability: Handle maintenance mode
- Zero-Downtime Deployments: Smooth rolling updates
Readiness vs Liveness:
Scenario | Liveness | Readiness
---|---|---
App deadlocked | ✅ Restart | ❌ Won’t help
Database down | ❌ Don’t restart | ✅ Remove from service
Slow startup | ❌ May restart prematurely | ✅ Wait until ready
Memory leak | ✅ Restart eventually | ❌ Won’t help
HTTP Readiness Probe:
apiVersion: v1
kind: Pod
metadata:
  name: readiness-http
spec:
  containers:
  - name: app
    image: myapp:1.0
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      failureThreshold: 3
Readiness Endpoint Example (Go):
// db and cache are assumed to be package-level clients initialized at startup.
func readinessHandler(w http.ResponseWriter, r *http.Request) {
    // Check database connection
    if err := db.Ping(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    // Check cache connection
    if err := cache.Ping(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("Ready"))
}
Database Readiness Check:
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - pg_isready -h localhost -U postgres -d mydb
  initialDelaySeconds: 5
  periodSeconds: 10
Startup Probe - Handling Slow Initialization
What is a Startup Probe? A startup probe gives containers extra time to start before liveness/readiness probes begin. It runs only during startup.
Why Use Startup Probes?
- Slow Initialization: Apps that take minutes to start
- Prevent Premature Restarts: Avoid liveness killing slow apps
- Legacy Applications: Apps with long startup times
- Data Loading: Apps that load large datasets on start
When to Use:
- ✅ App takes >30 seconds to start
- ✅ Startup time varies significantly
- ✅ Loading large datasets on initialization
- ❌ Fast-starting apps (use initialDelaySeconds instead)
Startup Probe Example:
apiVersion: v1
kind: Pod
metadata:
  name: startup-probe
spec:
  containers:
  - name: app
    image: slow-app:1.0
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      failureThreshold: 30  # 30 attempts
      periodSeconds: 10     # Every 10 seconds
      # Total: 300 seconds (5 minutes) to start
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
How it works:
- Startup probe runs first (up to 300 seconds in example)
- Liveness/readiness probes disabled during startup
- Once startup succeeds, liveness/readiness begin
- If startup fails after 300s, container restarts
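On the application side, /startup only needs to report whether initialization has finished. A minimal sketch in Go, where loadDataset is a hypothetical stand-in for the slow initialization work:

package main

import (
    "net/http"
    "sync/atomic"
    "time"
)

var started atomic.Bool

func startupHandler(w http.ResponseWriter, r *http.Request) {
    // Return 200 only once initialization has completed.
    if !started.Load() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

// loadDataset stands in for slow startup work (e.g. loading a large dataset).
func loadDataset() {
    time.Sleep(2 * time.Minute)
}

func main() {
    go func() {
        loadDataset()
        started.Store(true)
    }()
    http.HandleFunc("/startup", startupHandler)
    http.ListenAndServe(":8080", nil)
}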
Probe Parameters Explained
initialDelaySeconds: 0 # Wait before first probe (default: 0)
periodSeconds: 10 # How often to probe (default: 10)
timeoutSeconds: 1 # Probe timeout (default: 1)
successThreshold: 1 # Successes needed (default: 1; must be 1 for liveness/startup, readiness can be >1)
failureThreshold: 3 # Failures before action (default: 3)
Parameter Details:
Parameter | Liveness | Readiness | Startup | Notes
---|---|---|---|---
initialDelaySeconds | Common | Common | Rare | Use a startup probe instead
periodSeconds | 10-30s | 5-10s | 10-30s | Balance responsiveness vs load
timeoutSeconds | 1-5s | 1-5s | 1-5s | Network latency + processing
successThreshold | Always 1 | 1-3 | Always 1 | Readiness can require multiple
failureThreshold | 3-5 | 3-5 | 10-30 | Higher for startup
Calculating Total Timeout:
Maximum time before action ≈ initialDelaySeconds + (failureThreshold × periodSeconds)
Example:
failureThreshold: 30
periodSeconds: 10
Total: 30 × 10 = 300 seconds (5 minutes)
Production Example
Complete application with all probes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: app
        image: myapp:1.0
        ports:
        - containerPort: 8080
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          failureThreshold: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
            httpHeaders:
            - name: Custom-Header
              value: Awesome
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 3
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
Common Probe Patterns
1. Database Connection Check (PostgreSQL):
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - pg_isready -h localhost -U postgres -d mydb
  initialDelaySeconds: 5
  periodSeconds: 10
2. MySQL Connection Check:
readinessProbe:
  exec:
    command:
    - mysqladmin
    - ping
    - -h
    - localhost
  initialDelaySeconds: 5
  periodSeconds: 10
3. Redis Connection Check:
readinessProbe:
  exec:
    command:
    - redis-cli
    - ping
  initialDelaySeconds: 5
  periodSeconds: 5
4. File Existence Check:
livenessProbe:
  exec:
    command:
    - cat
    - /app/healthy
  periodSeconds: 10
5. gRPC Health Check:
livenessProbe:
  grpc:
    port: 9090
  initialDelaySeconds: 10
  periodSeconds: 10
6. Multi-Dependency Check:
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      # Check database
      pg_isready -h db-host -U postgres || exit 1
      # Check cache
      redis-cli -h cache-host ping || exit 1
      # Check API dependency
      curl -f http://api-host/health || exit 1
      exit 0
  initialDelaySeconds: 10
  periodSeconds: 15
  timeoutSeconds: 5
Best Practices
1. Always Implement Readiness Probes
- Essential for zero-downtime deployments
- Prevents traffic to unready pods
- Required for rolling updates
2. Use Liveness Probes Carefully
- Avoid false positives (causes restart loops)
- Don’t check external dependencies
- Keep checks lightweight
- Set conservative thresholds
3. Set Appropriate Timeouts
# Good: Conservative settings
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
# 3 failures × 10s = 30s of consecutive failures before a restart

# Bad: Aggressive settings
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 3
  timeoutSeconds: 1
  failureThreshold: 1
# A single failed check triggers a restart - may cause restart loops!
4. Use Startup Probes for Slow Apps
# Instead of a high initialDelaySeconds
startupProbe:
  httpGet:
    path: /startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
5. Separate Health Endpoints
/healthz - Liveness (is app alive?)
/ready - Readiness (can serve traffic?)
/startup - Startup (has app started?)
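Wiring the three endpoints onto one HTTP server might look like this in Go (the handlers here are trivial placeholders; in practice each would carry the checks sketched in the earlier sections):

package main

import "net/http"

func main() {
    mux := http.NewServeMux()
    // Keep each concern on its own endpoint so the probes stay independent.
    mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK) // liveness: the process is alive
    })
    mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK) // readiness: add dependency checks here
    })
    mux.HandleFunc("/startup", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK) // startup: flip to 200 once initialized
    })
    http.ListenAndServe(":8080", mux)
}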
6. Monitor Probe Failures
- Set up alerts for probe failures
- Track restart counts
- Monitor probe latency
7. Test Probe Endpoints
# Test locally
curl http://localhost:8080/healthz
curl http://localhost:8080/ready
# Test in pod
kubectl exec -it mypod -- curl localhost:8080/healthz
Troubleshooting Health Probe Issues
Problem 1: Pod Constantly Restarting
Symptoms:
kubectl get pods
NAME      READY   STATUS             RESTARTS   AGE
myapp-0   0/1     CrashLoopBackOff   5          3m
Diagnosis:
kubectl describe pod myapp-0
# Look for: "Liveness probe failed"
kubectl logs myapp-0 --previous
# Check logs from before restart
Common Causes:
Cause 1: Liveness probe too aggressive
# Problem: Probe starts too early
livenessProbe:
  initialDelaySeconds: 5  # App needs 30s to start!

# Solution: Use a startup probe
startupProbe:
  httpGet:
    path: /startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
Cause 2: Timeout too short
# Problem
livenessProbe:
  timeoutSeconds: 1  # Endpoint takes 2s to respond

# Solution
livenessProbe:
  timeoutSeconds: 5
Cause 3: Checking external dependencies
// Bad: Liveness checks the database
func livenessHandler(w http.ResponseWriter, r *http.Request) {
    if err := db.Ping(); err != nil { // Don't do this!
        w.WriteHeader(http.StatusInternalServerError)
        return
    }
    w.WriteHeader(http.StatusOK)
}

// Good: Liveness checks only app health
func livenessHandler(w http.ResponseWriter, r *http.Request) {
    // Just check that the process can serve requests
    w.WriteHeader(http.StatusOK)
}
Problem 2: Pod Not Receiving Traffic
Symptoms:
kubectl get pods
NAME      READY   STATUS    RESTARTS   AGE
myapp-0   0/1     Running   0          5m

kubectl get endpoints myservice
NAME        ENDPOINTS   AGE
myservice   <none>      5m
Diagnosis:
kubectl describe pod myapp-0
# Look for: "Readiness probe failed"
kubectl logs myapp-0
# Check application logs
Common Causes:
Cause 1: Readiness endpoint not implemented
# Test endpoint
kubectl exec -it myapp-0 -- curl localhost:8080/ready
# Returns 404 - endpoint doesn't exist!
# Solution: Implement /ready endpoint
Cause 2: Dependencies not ready
# App waiting for database
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - pg_isready -h db-host -U postgres
# Fails until the database is ready
Cause 3: Wrong port
# Problem
readinessProbe:
  httpGet:
    port: 8080  # App listens on 3000!

# Solution
readinessProbe:
  httpGet:
    port: 3000
Problem 3: Slow Rolling Updates
Symptoms:
- Deployment takes very long
- Pods stay in “Not Ready” state
Diagnosis:
kubectl rollout status deployment myapp
# Waiting for deployment "myapp" rollout to finish...
kubectl describe pod myapp-xxx
# Check readiness probe status
Solution:
# Optimize readiness probe
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5  # Reduce if app starts fast
  periodSeconds: 5        # Check more frequently
  failureThreshold: 2     # Fail faster
Problem 4: Probe Endpoint Slow/Timing Out
Diagnosis:
kubectl describe pod myapp-0
# Events: Readiness probe failed: Get "http://10.0.0.1:8080/ready": context deadline exceeded
Solutions:
1. Increase timeout:
readinessProbe:
  timeoutSeconds: 5  # Increase from the default of 1
2. Optimize endpoint:
// Bad: Slow health check
func healthHandler(w http.ResponseWriter, r *http.Request) {
    // Don't do expensive operations in a probe handler!
    runComplexQuery() // Takes 3 seconds
    w.WriteHeader(http.StatusOK)
}

// Good: Fast health check
func healthHandler(w http.ResponseWriter, r *http.Request) {
    // Quick check only
    w.WriteHeader(http.StatusOK)
}
Problem 5: Probe Causing High Load
Symptoms:
- High CPU usage from health checks
- Probe endpoints slow down app
Solution:
# Reduce probe frequency
livenessProbe:
  periodSeconds: 30  # Instead of 10
readinessProbe:
  periodSeconds: 10  # Instead of 5
Debugging Commands
# Check pod events
kubectl describe pod myapp-0
# View probe configuration
kubectl get pod myapp-0 -o yaml | grep -A 10 livenessProbe
# Check logs
kubectl logs myapp-0
kubectl logs myapp-0 --previous # Before restart
# Test probe endpoint manually
kubectl exec -it myapp-0 -- curl localhost:8080/healthz
kubectl exec -it myapp-0 -- wget -O- localhost:8080/ready
# Watch pod status
kubectl get pods -w
# Check endpoints
kubectl get endpoints myservice
# Force probe execution (for testing)
kubectl exec -it myapp-0 -- /bin/sh
# Then manually run probe command
Conclusion
Health probes are essential for production reliability, enabling automatic failure detection and recovery.