Introduction

Kubernetes needs to know when your applications are healthy and ready to serve traffic. Health Probes enable Kubernetes to automatically detect and recover from application failures, ensuring high availability.

Understanding Health Probes

The Problem Without Probes:

  • Crashed applications continue receiving traffic
  • Deadlocked processes never recover
  • Slow-starting apps receive traffic too early
  • Failed containers remain in “Running” state

The Solution: Health probes provide automated health monitoring and recovery:

  • Automatic restart of failed containers
  • Traffic routing only to healthy pods
  • Graceful startup for slow applications
  • Self-healing without manual intervention

Types of Probes - Complete Comparison

Probe Type | Purpose             | Action on Failure   | When to Use
Liveness   | Is container alive? | Restart container   | Detect deadlocks, crashes
Readiness  | Ready for traffic?  | Remove from Service | Temporary unavailability
Startup    | Has it started?     | Restart if timeout  | Slow-starting apps

Probe Execution Order:

  1. Startup Probe runs first (if configured)
  2. Once startup succeeds, Liveness and Readiness probes begin
  3. Liveness and Readiness run continuously in parallel

Key Differences:

Aspect         | Liveness                        | Readiness                       | Startup
Failure Action | Kill & restart container        | Remove from endpoints           | Kill & restart container
During Startup | Disabled until startup succeeds | Disabled until startup succeeds | Runs first
Use Case       | Permanent failures              | Temporary issues                | Initial startup
Example        | Deadlock detection              | Database connection             | App initialization

Liveness Probe - Detecting When to Restart

What is a Liveness Probe? A liveness probe checks if a container is still running properly. If it fails, Kubernetes kills and restarts the container.

Why Use Liveness Probes?

  • Deadlock Detection: Restart hung processes
  • Memory Leak Recovery: Restart before OOM
  • Crash Recovery: Detect silent failures
  • Automatic Healing: No manual intervention needed

When to Use:
✅ Application can deadlock
✅ Memory leaks cause gradual degradation
✅ Process can enter an unrecoverable state
❌ Temporary failures (use a readiness probe instead)
❌ Slow startup (use a startup probe instead)

⚠️ Warning: Aggressive liveness probes can cause restart loops!

Probe Mechanisms:

1. HTTP GET Probe

apiVersion: v1
kind: Pod
metadata:
  name: liveness-http
spec:
  containers:
  - name: app
    image: nginx
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: Custom-Header
          value: Health-Check
        scheme: HTTP  # or HTTPS
      initialDelaySeconds: 3
      periodSeconds: 3
      timeoutSeconds: 1
      failureThreshold: 3

Success Criteria: HTTP status code 200-399
Use When: the app exposes an HTTP health endpoint

2. Command/Exec Probe

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      if [ -f /tmp/healthy ]; then
        exit 0
      else
        exit 1
      fi
  initialDelaySeconds: 5
  periodSeconds: 5

Success Criteria: exit code 0
Use When: custom health logic is needed
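
For the exec probe above to be meaningful, the application itself has to maintain the marker file. A minimal app-side sketch in Go (the file path must match the probe command; isHealthy is a hypothetical placeholder for your own check):

package main

import (
    "os"
    "time"
)

const healthFile = "/tmp/healthy" // must match the path the exec probe checks

// isHealthy stands in for whatever internal check the app performs.
func isHealthy() bool {
    return true
}

func main() {
    // Periodically create or remove the marker file the exec probe looks for.
    for {
        if isHealthy() {
            if f, err := os.Create(healthFile); err == nil {
                f.Close()
            }
        } else {
            // Removing the file makes the next probe run exit 1.
            os.Remove(healthFile)
        }
        time.Sleep(5 * time.Second)
    }
}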

3. TCP Socket Probe

livenessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20

Success Criteria: TCP connection succeeds
Use When: non-HTTP services (databases, message queues)

4. gRPC Probe (Kubernetes 1.24+)

livenessProbe:
  grpc:
    port: 9090
    service: my.service.v1.Health
  initialDelaySeconds: 10

Success Criteria: gRPC health check returns SERVING
Use When: gRPC services
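
On the server side, the kubelet’s gRPC probe calls the standard gRPC Health Checking Protocol, so the container must expose that service. A minimal sketch using the grpc-go health package (the port and the "my.service.v1.Health" name are assumptions and must match the probe’s grpc.port and grpc.service fields):

package main

import (
    "log"
    "net"

    "google.golang.org/grpc"
    "google.golang.org/grpc/health"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
    lis, err := net.Listen("tcp", ":9090") // must match the probe's grpc.port
    if err != nil {
        log.Fatal(err)
    }

    s := grpc.NewServer()

    // Register the standard health service that the gRPC probe queries.
    healthServer := health.NewServer()
    healthpb.RegisterHealthServer(s, healthServer)

    // The registered name must match the probe's grpc.service field.
    healthServer.SetServingStatus("my.service.v1.Health", healthpb.HealthCheckResponse_SERVING)

    log.Fatal(s.Serve(lis))
}

If the probe’s service field is omitted, the kubelet sends an empty service name, which by convention reports overall server health.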

Readiness Probe - Controlling Traffic Flow

What is a Readiness Probe? A readiness probe determines if a container is ready to serve requests. If it fails, the Pod is removed from the Service’s endpoints and stops receiving traffic; the container is not restarted.

Why Use Readiness Probes?

  • Graceful Startup: Don’t send traffic until ready
  • Dependency Checks: Wait for database connection
  • Temporary Unavailability: Handle maintenance mode
  • Zero-Downtime Deployments: Smooth rolling updates

Readiness vs Liveness:

Scenario       | Liveness                   | Readiness
App deadlocked | ✅ Restart                 | ❌ Won’t help
Database down  | ❌ Don’t restart           | ✅ Remove from service
Slow startup   | ❌ May restart prematurely | ✅ Wait until ready
Memory leak    | ✅ Restart eventually      | ❌ Won’t help

HTTP Readiness Probe:

apiVersion: v1
kind: Pod
metadata:
  name: readiness-http
spec:
  containers:
  - name: app
    image: myapp:1.0
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      failureThreshold: 3

Readiness Endpoint Example (Go):

// readinessHandler checks the app's dependencies before reporting ready.
// db and cache are assumed package-level clients (e.g., *sql.DB and a Redis
// client) initialized at startup.
func readinessHandler(w http.ResponseWriter, r *http.Request) {
    // Check database connection
    if err := db.Ping(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    
    // Check cache connection
    if err := cache.Ping(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("Ready"))
}

Database Readiness Check:

readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - pg_isready -h localhost -U postgres -d mydb
  initialDelaySeconds: 5
  periodSeconds: 10

Startup Probe - Handling Slow Initialization

What is a Startup Probe? A startup probe gives containers extra time to start before liveness/readiness probes begin. It runs only during startup.

Why Use Startup Probes?

  • Slow Initialization: Apps that take minutes to start
  • Prevent Premature Restarts: Avoid liveness killing slow apps
  • Legacy Applications: Apps with long startup times
  • Data Loading: Apps that load large datasets on start

When to Use:
✅ App takes >30 seconds to start
✅ Startup time varies significantly
✅ Loading large datasets on initialization
❌ Fast-starting apps (use initialDelaySeconds instead)

Startup Probe Example:

apiVersion: v1
kind: Pod
metadata:
  name: startup-probe
spec:
  containers:
  - name: app
    image: slow-app:1.0
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      failureThreshold: 30  # 30 attempts
      periodSeconds: 10     # Every 10 seconds
      # Total: 300 seconds (5 minutes) to start
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3

How it works:

  1. Startup probe runs first (up to 300 seconds in example)
  2. Liveness/readiness probes disabled during startup
  3. Once startup succeeds, liveness/readiness begin
  4. If startup fails after 300s, container restarts

Probe Parameters Explained

initialDelaySeconds: 0    # Wait before first probe (default: 0)
periodSeconds: 10         # How often to probe (default: 10)
timeoutSeconds: 1         # Probe timeout (default: 1)
successThreshold: 1       # Successes needed (default: 1; must be 1 for liveness and startup)
failureThreshold: 3       # Failures before action (default: 3)

Parameter Details:

Parameter           | Liveness | Readiness | Startup  | Notes
initialDelaySeconds | Common   | Common    | Rare     | Use a startup probe instead
periodSeconds       | 10-30s   | 5-10s     | 10-30s   | Balance responsiveness vs load
timeoutSeconds      | 1-5s     | 1-5s      | 1-5s     | Network latency + processing
successThreshold    | Always 1 | 1-3       | Always 1 | Readiness can require multiple
failureThreshold    | 3-5      | 3-5       | 10-30    | Higher for startup

Calculating Total Timeout:

Maximum time before the failure action ≈ initialDelaySeconds + (failureThreshold × periodSeconds)

Example (initialDelaySeconds: 0):
failureThreshold: 30
periodSeconds: 10
Total: 300 seconds (5 minutes)

Production Example

Complete application with all probes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: app
        image: myapp:1.0
        ports:
        - containerPort: 8080
        
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          failureThreshold: 30
          periodSeconds: 10
        
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
            httpHeaders:
            - name: Custom-Header
              value: Awesome
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 3
        
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"

Common Probe Patterns

1. Database Connection Check (PostgreSQL):

readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - pg_isready -h localhost -U postgres -d mydb
  initialDelaySeconds: 5
  periodSeconds: 10

2. MySQL Connection Check:

readinessProbe:
  exec:
    command:
    - mysqladmin
    - ping
    - -h
    - localhost
  initialDelaySeconds: 5
  periodSeconds: 10

3. Redis Connection Check:

readinessProbe:
  exec:
    command:
    - redis-cli
    - ping
  initialDelaySeconds: 5
  periodSeconds: 5

4. File Existence Check:

livenessProbe:
  exec:
    command:
    - cat
    - /app/healthy
  periodSeconds: 10

5. gRPC Health Check:

livenessProbe:
  grpc:
    port: 9090
  initialDelaySeconds: 10
  periodSeconds: 10

6. Multi-Dependency Check:

readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      # Requires pg_isready, redis-cli, and curl to be present in the container image
      # Check database
      pg_isready -h db-host -U postgres || exit 1
      # Check cache
      redis-cli -h cache-host ping || exit 1
      # Check API dependency
      curl -f http://api-host/health || exit 1
      exit 0
  initialDelaySeconds: 10
  periodSeconds: 15
  timeoutSeconds: 5

Best Practices

1. Always Implement Readiness Probes

  • Essential for zero-downtime deployments
  • Prevents traffic to unready pods
  • Required for rolling updates

2. Use Liveness Probes Carefully

  • Avoid false positives (causes restart loops)
  • Don’t check external dependencies
  • Keep checks lightweight
  • Set conservative thresholds

3. Set Appropriate Timeouts

# Good: Conservative settings
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
  # Total: 30s before restarts

# Bad: Aggressive settings
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 3
  timeoutSeconds: 1
  failureThreshold: 1
  # May cause restart loops!

4. Use Startup Probes for Slow Apps

# Instead of high initialDelaySeconds
startupProbe:
  httpGet:
    path: /startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

5. Separate Health Endpoints

/healthz  - Liveness (is app alive?)
/ready    - Readiness (can serve traffic?)
/startup  - Startup (has app started?)
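
A minimal sketch of wiring all three endpoints in one Go service (the paths match the examples above; the startup flag and the omitted dependency checks are placeholders for your own logic):

package main

import (
    "log"
    "net/http"
    "sync/atomic"
)

func main() {
    var initialized atomic.Bool // flipped once one-time startup work completes

    go func() {
        // Placeholder for slow startup work (loading data, warming caches, ...).
        initialized.Store(true)
    }()

    mux := http.NewServeMux()

    // Liveness: is the process responsive? No dependency checks here.
    mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })

    // Readiness: dependency checks (database, cache, ...) would go here.
    mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })

    // Startup: succeeds only after initialization has finished.
    mux.HandleFunc("/startup", func(w http.ResponseWriter, r *http.Request) {
        if !initialized.Load() {
            w.WriteHeader(http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })

    log.Fatal(http.ListenAndServe(":8080", mux))
}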

6. Monitor Probe Failures

  • Set up alerts for probe failures
  • Track restart counts
  • Monitor probe latency
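
Probe failures are recorded as Kubernetes events with reason Unhealthy, and restarts are visible in pod status, so one starting point for monitoring (the column selections are illustrative):

# Probe failures show up as events with reason "Unhealthy"
kubectl get events --field-selector reason=Unhealthy

# Track restart counts per pod
kubectl get pods -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount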

7. Test Probe Endpoints

# Test locally
curl http://localhost:8080/healthz
curl http://localhost:8080/ready

# Test in pod
kubectl exec -it mypod -- curl localhost:8080/healthz

Troubleshooting Health Probe Issues

Problem 1: Pod Constantly Restarting

Symptoms:

kubectl get pods
NAME      READY   STATUS             RESTARTS   AGE
myapp-0   0/1     CrashLoopBackOff   5          3m

Diagnosis:

kubectl describe pod myapp-0
# Look for: "Liveness probe failed"

kubectl logs myapp-0 --previous
# Check logs from before restart

Common Causes:

Cause 1: Liveness probe too aggressive

# Problem: Probe starts too early
livenessProbe:
  initialDelaySeconds: 5  # App needs 30s to start!
  
# Solution: Use startup probe
startupProbe:
  httpGet:
    path: /startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

Cause 2: Timeout too short

# Problem
livenessProbe:
  timeoutSeconds: 1  # Endpoint takes 2s to respond
  
# Solution
livenessProbe:
  timeoutSeconds: 5

Cause 3: Checking external dependencies

// Bad: Liveness checks database
func livenessHandler(w http.ResponseWriter, r *http.Request) {
    if err := db.Ping(); err != nil {  // Don't do this!
        w.WriteHeader(500)
        return
    }
    w.WriteHeader(200)
}

// Good: Liveness checks only app health
func livenessHandler(w http.ResponseWriter, r *http.Request) {
    // Just check if app is responsive
    w.WriteHeader(200)
}

Problem 2: Pod Not Receiving Traffic

Symptoms:

kubectl get pods
NAME      READY   STATUS    RESTARTS   AGE
myapp-0   0/1     Running   0          5m

kubectl get endpoints myservice
NAME        ENDPOINTS   AGE
myservice   <none>      5m

Diagnosis:

kubectl describe pod myapp-0
# Look for: "Readiness probe failed"

kubectl logs myapp-0
# Check application logs

Common Causes:

Cause 1: Readiness endpoint not implemented

# Test endpoint
kubectl exec -it myapp-0 -- curl localhost:8080/ready
# Returns 404 - endpoint doesn't exist!

# Solution: Implement /ready endpoint

Cause 2: Dependencies not ready

# App waiting for database
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - pg_isready -h db-host -U postgres
  # Database not ready yet

Cause 3: Wrong port

# Problem
readinessProbe:
  httpGet:
    port: 8080  # App listens on 3000!
    
# Solution
readinessProbe:
  httpGet:
    port: 3000

Problem 3: Slow Rolling Updates

Symptoms:

  • Deployment takes very long
  • Pods stay in “Not Ready” state

Diagnosis:

kubectl rollout status deployment myapp
# Waiting for deployment "myapp" rollout to finish...

kubectl describe pod myapp-xxx
# Check readiness probe status

Solution:

# Optimize readiness probe
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5  # Reduce if app starts fast
  periodSeconds: 5        # Check more frequently
  failureThreshold: 2     # Fail faster

Problem 4: Probe Endpoint Slow/Timing Out

Diagnosis:

kubectl describe pod myapp-0
# Events: Readiness probe failed: Get "http://10.0.0.1:8080/ready": context deadline exceeded

Solutions:

1. Increase timeout:

readinessProbe:
  timeoutSeconds: 5  # Increase from 1

2. Optimize endpoint:

// Bad: Slow health check
func healthHandler(w http.ResponseWriter, r *http.Request) {
    // Don't do expensive operations!
    runComplexQuery()  // Expensive query, takes ~3 seconds
    w.WriteHeader(200)
}

// Good: Fast health check
func healthHandler(w http.ResponseWriter, r *http.Request) {
    // Quick check only
    w.WriteHeader(200)
}

Problem 5: Probe Causing High Load

Symptoms:

  • High CPU usage from health checks
  • Probe endpoints slow down app

Solution:

# Reduce probe frequency
livenessProbe:
  periodSeconds: 30  # Instead of 10
  
readinessProbe:
  periodSeconds: 10  # Instead of 5

Debugging Commands

# Check pod events
kubectl describe pod myapp-0

# View probe configuration
kubectl get pod myapp-0 -o yaml | grep -A 10 livenessProbe

# Check logs
kubectl logs myapp-0
kubectl logs myapp-0 --previous  # Before restart

# Test probe endpoint manually
kubectl exec -it myapp-0 -- curl localhost:8080/healthz
kubectl exec -it myapp-0 -- wget -O- localhost:8080/ready

# Watch pod status
kubectl get pods -w

# Check endpoints
kubectl get endpoints myservice

# Run the probe check manually from inside the container (for testing)
kubectl exec -it myapp-0 -- /bin/sh
# Then run the probe command by hand

Conclusion

Health probes are essential for production reliability, enabling automatic failure detection and recovery.

Resources