Introduction

Kubernetes needs to know when your applications are healthy and ready to serve traffic. Health Probes enable Kubernetes to automatically detect and recover from application failures, ensuring high availability.

Understanding Health Probes

The Problem Without Probes:

  • Crashed applications continue receiving traffic
  • Deadlocked processes never recover
  • Slow-starting apps receive traffic too early
  • Failed containers remain in “Running” state

The Solution: Health probes provide automated health monitoring and recovery:

  • Automatic restart of failed containers
  • Traffic routing only to healthy pods
  • Graceful startup for slow applications
  • Self-healing without manual intervention

Types of Probes - Complete Comparison

Probe Type | Purpose             | Action on Failure   | When to Use
Liveness   | Is container alive? | Restart container   | Detect deadlocks, crashes
Readiness  | Ready for traffic?  | Remove from Service | Temporary unavailability
Startup    | Has it started?     | Restart if timeout  | Slow-starting apps

Probe Execution Order:

  1. Startup Probe runs first (if configured)
  2. Once startup succeeds, Liveness and Readiness probes begin
  3. Liveness and Readiness run continuously in parallel

Key Differences:

Aspect         | Liveness                        | Readiness                       | Startup
Failure Action | Kill & restart container        | Remove from endpoints           | Kill & restart container
During Startup | Disabled until startup succeeds | Disabled until startup succeeds | Runs first
Use Case       | Permanent failures              | Temporary issues                | Initial startup
Example        | Deadlock detection              | Database connection             | App initialization

Liveness Probe - Detecting When to Restart

What is a Liveness Probe? A liveness probe checks if a container is still running properly. If it fails, Kubernetes kills and restarts the container.

Why Use Liveness Probes?

  • Deadlock Detection: Restart hung processes
  • Memory Leak Recovery: Restart before OOM
  • Crash Recovery: Detect silent failures
  • Automatic Healing: No manual intervention needed

When to Use:
✅ Application can deadlock
✅ Memory leaks cause gradual degradation
✅ Process can enter an unrecoverable state
❌ Temporary failures (use a readiness probe instead)
❌ Slow startup (use a startup probe instead)

⚠️ Warning: Aggressive liveness probes can cause restart loops!

Probe Mechanisms:

1. HTTP GET Probe

apiVersion: v1
kind: Pod
metadata:
  name: liveness-http
spec:
  containers:
  - name: app
    image: nginx
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: Custom-Header
          value: Health-Check
        scheme: HTTP  # or HTTPS
      initialDelaySeconds: 3
      periodSeconds: 3
      timeoutSeconds: 1
      failureThreshold: 3

Success Criteria: HTTP status code 200-399
Use When: the app exposes an HTTP health endpoint

2. Command/Exec Probe

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      if [ -f /tmp/healthy ]; then
        exit 0
      else
        exit 1
      fi
  initialDelaySeconds: 5
  periodSeconds: 5

Success Criteria: exit code 0
Use When: custom health logic is needed
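
For the exec probe above to be meaningful, the application itself has to maintain the marker file. A minimal app-side sketch in Go (the file path must match the probe command; isHealthy is a hypothetical placeholder for your own check):

package main

import (
    "os"
    "time"
)

const healthFile = "/tmp/healthy" // must match the path the exec probe checks

// isHealthy stands in for whatever internal check the app performs.
func isHealthy() bool {
    return true
}

func main() {
    // Periodically create or remove the marker file the exec probe looks for.
    for {
        if isHealthy() {
            if f, err := os.Create(healthFile); err == nil {
                f.Close()
            }
        } else {
            // Removing the file makes the next probe run exit 1.
            os.Remove(healthFile)
        }
        time.Sleep(5 * time.Second)
    }
}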

3. TCP Socket Probe

livenessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20

Success Criteria: TCP connection succeeds
Use When: non-HTTP services (databases, message queues)

4. gRPC Probe (Kubernetes 1.24+)

livenessProbe:
  grpc:
    port: 9090
    service: my.service.v1.Health
  initialDelaySeconds: 10

Success Criteria: gRPC health check returns SERVING
Use When: gRPC services
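
On the server side, the kubelet’s gRPC probe calls the standard gRPC Health Checking Protocol, so the container must expose that service. A minimal sketch using the grpc-go health package (the port and the "my.service.v1.Health" name are assumptions and must match the probe’s grpc.port and grpc.service fields):

package main

import (
    "log"
    "net"

    "google.golang.org/grpc"
    "google.golang.org/grpc/health"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
    lis, err := net.Listen("tcp", ":9090") // must match the probe's grpc.port
    if err != nil {
        log.Fatal(err)
    }

    s := grpc.NewServer()

    // Register the standard health service that the gRPC probe queries.
    healthServer := health.NewServer()
    healthpb.RegisterHealthServer(s, healthServer)

    // The registered name must match the probe's grpc.service field.
    healthServer.SetServingStatus("my.service.v1.Health", healthpb.HealthCheckResponse_SERVING)

    log.Fatal(s.Serve(lis))
}

If the probe’s service field is omitted, the kubelet sends an empty service name, which by convention reports overall server health.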

Readiness Probe - Controlling Traffic Flow

What is a Readiness Probe? A readiness probe determines if a container is ready to serve requests. If it fails, the Pod is removed from the Service’s endpoints and stops receiving traffic; the container is not restarted.

Why Use Readiness Probes?

  • Graceful Startup: Don’t send traffic until ready
  • Dependency Checks: Wait for database connection
  • Temporary Unavailability: Handle maintenance mode
  • Zero-Downtime Deployments: Smooth rolling updates

Readiness vs Liveness:

Scenario       | Liveness                   | Readiness
App deadlocked | ✅ Restart                 | ❌ Won’t help
Database down  | ❌ Don’t restart           | ✅ Remove from service
Slow startup   | ❌ May restart prematurely | ✅ Wait until ready
Memory leak    | ✅ Restart eventually      | ❌ Won’t help

HTTP Readiness Probe:

apiVersion: v1
kind: Pod
metadata:
  name: readiness-http
spec:
  containers:
  - name: app
    image: myapp:1.0
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      failureThreshold: 3

Readiness Endpoint Example (Go):

// readinessHandler checks the app's dependencies before reporting ready.
// db and cache are assumed package-level clients (e.g., *sql.DB and a Redis
// client) initialized at startup.
func readinessHandler(w http.ResponseWriter, r *http.Request) {
    // Check database connection
    if err := db.Ping(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    
    // Check cache connection
    if err := cache.Ping(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("Ready"))
}

Database Readiness Check:

readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - pg_isready -h localhost -U postgres -d mydb
  initialDelaySeconds: 5
  periodSeconds: 10

Startup Probe - Handling Slow Initialization

What is a Startup Probe? A startup probe gives containers extra time to start before liveness/readiness probes begin. It runs only during startup.

Why Use Startup Probes?

  • Slow Initialization: Apps that take minutes to start
  • Prevent Premature Restarts: Avoid liveness killing slow apps
  • Legacy Applications: Apps with long startup times
  • Data Loading: Apps that load large datasets on start

When to Use:
✅ App takes >30 seconds to start
✅ Startup time varies significantly
✅ Loading large datasets on initialization
❌ Fast-starting apps (use initialDelaySeconds instead)

Startup Probe Example:

apiVersion: v1
kind: Pod
metadata:
  name: startup-probe
spec:
  containers:
  - name: app
    image: slow-app:1.0
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      failureThreshold: 30  # 30 attempts
      periodSeconds: 10     # Every 10 seconds
      # Total: 300 seconds (5 minutes) to start
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3

How it works:

  1. Startup probe runs first (up to 300 seconds in example)
  2. Liveness/readiness probes disabled during startup
  3. Once startup succeeds, liveness/readiness begin
  4. If startup fails after 300s, container restarts

Probe Parameters Explained

initialDelaySeconds: 0    # Wait before first probe (default: 0)
periodSeconds: 10         # How often to probe (default: 10)
timeoutSeconds: 1         # Probe timeout (default: 1)
successThreshold: 1       # Successes needed (default: 1; must be 1 for liveness and startup)
failureThreshold: 3       # Failures before action (default: 3)

Parameter Details:

Parameter           | Liveness | Readiness | Startup  | Notes
initialDelaySeconds | Common   | Common    | Rare     | Use a startup probe instead
periodSeconds       | 10-30s   | 5-10s     | 10-30s   | Balance responsiveness vs load
timeoutSeconds      | 1-5s     | 1-5s      | 1-5s     | Network latency + processing
successThreshold    | Always 1 | 1-3       | Always 1 | Readiness can require multiple
failureThreshold    | 3-5      | 3-5       | 10-30    | Higher for startup

Calculating Total Timeout:

Maximum time before the failure action ≈ initialDelaySeconds + (failureThreshold × periodSeconds)

Example (initialDelaySeconds: 0):
failureThreshold: 30
periodSeconds: 10
Total: 300 seconds (5 minutes)

Production Example

Complete application with all probes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: app
        image: myapp:1.0
        ports:
        - containerPort: 8080
        
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          failureThreshold: 30
          periodSeconds: 10
        
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
            httpHeaders:
            - name: Custom-Header
              value: Awesome
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 3
        
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"

Common Probe Patterns

1. Database Connection Check (PostgreSQL):

readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - pg_isready -h localhost -U postgres -d mydb
  initialDelaySeconds: 5
  periodSeconds: 10

2. MySQL Connection Check:

readinessProbe:
  exec:
    command:
    - mysqladmin
    - ping
    - -h
    - localhost
  initialDelaySeconds: 5
  periodSeconds: 10

3. Redis Connection Check:

readinessProbe:
  exec:
    command:
    - redis-cli
    - ping
  initialDelaySeconds: 5
  periodSeconds: 5

4. File Existence Check:

livenessProbe:
  exec:
    command:
    - cat
    - /app/healthy
  periodSeconds: 10

5. gRPC Health Check:

livenessProbe:
  grpc:
    port: 9090
  initialDelaySeconds: 10
  periodSeconds: 10

6. Multi-Dependency Check:

readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      # Requires pg_isready, redis-cli, and curl to be present in the container image
      # Check database
      pg_isready -h db-host -U postgres || exit 1
      # Check cache
      redis-cli -h cache-host ping || exit 1
      # Check API dependency
      curl -f http://api-host/health || exit 1
      exit 0
  initialDelaySeconds: 10
  periodSeconds: 15
  timeoutSeconds: 5

Best Practices

1. Always Implement Readiness Probes

  • Essential for zero-downtime deployments
  • Prevents traffic to unready pods
  • Required for rolling updates

2. Use Liveness Probes Carefully

  • Avoid false positives (causes restart loops)
  • Don’t check external dependencies
  • Keep checks lightweight
  • Set conservative thresholds

3. Set Appropriate Timeouts

# Good: Conservative settings
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
  # Total: 30s before restarts

# Bad: Aggressive settings
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 3
  timeoutSeconds: 1
  failureThreshold: 1
  # May cause restart loops!

4. Use Startup Probes for Slow Apps

# Instead of high initialDelaySeconds
startupProbe:
  httpGet:
    path: /startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

5. Separate Health Endpoints

/healthz  - Liveness (is app alive?)
/ready    - Readiness (can serve traffic?)
/startup  - Startup (has app started?)
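
A minimal sketch of wiring all three endpoints in one Go service (the paths match the examples above; the startup flag and the omitted dependency checks are placeholders for your own logic):

package main

import (
    "log"
    "net/http"
    "sync/atomic"
)

func main() {
    var initialized atomic.Bool // flipped once one-time startup work completes

    go func() {
        // Placeholder for slow startup work (loading data, warming caches, ...).
        initialized.Store(true)
    }()

    mux := http.NewServeMux()

    // Liveness: is the process responsive? No dependency checks here.
    mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })

    // Readiness: dependency checks (database, cache, ...) would go here.
    mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })

    // Startup: succeeds only after initialization has finished.
    mux.HandleFunc("/startup", func(w http.ResponseWriter, r *http.Request) {
        if !initialized.Load() {
            w.WriteHeader(http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })

    log.Fatal(http.ListenAndServe(":8080", mux))
}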

6. Monitor Probe Failures

  • Set up alerts for probe failures
  • Track restart counts
  • Monitor probe latency
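
Probe failures are recorded as Kubernetes events with reason Unhealthy, and restarts are visible in pod status, so one starting point for monitoring (the column selections are illustrative):

# Probe failures show up as events with reason "Unhealthy"
kubectl get events --field-selector reason=Unhealthy

# Track restart counts per pod
kubectl get pods -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount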

7. Test Probe Endpoints

# Test locally
curl http://localhost:8080/healthz
curl http://localhost:8080/ready

# Test in pod
kubectl exec -it mypod -- curl localhost:8080/healthz

Troubleshooting Health Probe Issues

Problem 1: Pod Constantly Restarting

Symptoms:

kubectl get pods
NAME      READY   STATUS             RESTARTS   AGE
myapp-0   0/1     CrashLoopBackOff   5          3m

Diagnosis:

kubectl describe pod myapp-0
# Look for: "Liveness probe failed"

kubectl logs myapp-0 --previous
# Check logs from before restart

Common Causes:

Cause 1: Liveness probe too aggressive

# Problem: Probe starts too early
livenessProbe:
  initialDelaySeconds: 5  # App needs 30s to start!
  
# Solution: Use startup probe
startupProbe:
  httpGet:
    path: /startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

Cause 2: Timeout too short

# Problem
livenessProbe:
  timeoutSeconds: 1  # Endpoint takes 2s to respond
  
# Solution
livenessProbe:
  timeoutSeconds: 5

Cause 3: Checking external dependencies

// Bad: Liveness checks database
func livenessHandler(w http.ResponseWriter, r *http.Request) {
    if err := db.Ping(); err != nil {  // Don't do this!
        w.WriteHeader(500)
        return
    }
    w.WriteHeader(200)
}

// Good: Liveness checks only app health
func livenessHandler(w http.ResponseWriter, r *http.Request) {
    // Just check if app is responsive
    w.WriteHeader(200)
}

Problem 2: Pod Not Receiving Traffic

Symptoms:

kubectl get pods
NAME      READY   STATUS    RESTARTS   AGE
myapp-0   0/1     Running   0          5m

kubectl get endpoints myservice
NAME        ENDPOINTS   AGE
myservice   <none>      5m

Diagnosis:

kubectl describe pod myapp-0
# Look for: "Readiness probe failed"

kubectl logs myapp-0
# Check application logs

Common Causes:

Cause 1: Readiness endpoint not implemented

# Test endpoint
kubectl exec -it myapp-0 -- curl localhost:8080/ready
# Returns 404 - endpoint doesn't exist!

# Solution: Implement /ready endpoint

Cause 2: Dependencies not ready

# App waiting for database
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - pg_isready -h db-host -U postgres
  # Database not ready yet

Cause 3: Wrong port

# Problem
readinessProbe:
  httpGet:
    port: 8080  # App listens on 3000!
    
# Solution
readinessProbe:
  httpGet:
    port: 3000

Problem 3: Slow Rolling Updates

Symptoms:

  • Deployment takes very long
  • Pods stay in “Not Ready” state

Diagnosis:

kubectl rollout status deployment myapp
# Waiting for deployment "myapp" rollout to finish...

kubectl describe pod myapp-xxx
# Check readiness probe status

Solution:

# Optimize readiness probe
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5  # Reduce if app starts fast
  periodSeconds: 5        # Check more frequently
  failureThreshold: 2     # Fail faster

Problem 4: Probe Endpoint Slow/Timing Out

Diagnosis:

kubectl describe pod myapp-0
# Events: Readiness probe failed: Get "http://10.0.0.1:8080/ready": context deadline exceeded

Solutions:

1. Increase timeout:

readinessProbe:
  timeoutSeconds: 5  # Increase from 1

2. Optimize endpoint:

// Bad: Slow health check
func healthHandler(w http.ResponseWriter, r *http.Request) {
    // Don't do expensive operations!
    runComplexQuery()  // Expensive query, takes ~3 seconds
    w.WriteHeader(200)
}

// Good: Fast health check
func healthHandler(w http.ResponseWriter, r *http.Request) {
    // Quick check only
    w.WriteHeader(200)
}

Problem 5: Probe Causing High Load

Symptoms:

  • High CPU usage from health checks
  • Probe endpoints slow down app

Solution:

# Reduce probe frequency
livenessProbe:
  periodSeconds: 30  # Instead of 10
  
readinessProbe:
  periodSeconds: 10  # Instead of 5

Debugging Commands

# Check pod events
kubectl describe pod myapp-0

# View probe configuration
kubectl get pod myapp-0 -o yaml | grep -A 10 livenessProbe

# Check logs
kubectl logs myapp-0
kubectl logs myapp-0 --previous  # Before restart

# Test probe endpoint manually
kubectl exec -it myapp-0 -- curl localhost:8080/healthz
kubectl exec -it myapp-0 -- wget -O- localhost:8080/ready

# Watch pod status
kubectl get pods -w

# Check endpoints
kubectl get endpoints myservice

# Run the probe check manually from inside the container (for testing)
kubectl exec -it myapp-0 -- /bin/sh
# Then run the probe command by hand

Conclusion

Health probes are essential for production reliability, enabling automatic failure detection and recovery.

Resources