Introduction

By default, the Kubernetes scheduler automatically places Pods on available nodes. But what if you need more control? Perhaps you want GPU-intensive workloads on specific nodes, or need to isolate production from development workloads. Node Affinity, Taints, and Tolerations give you fine-grained control over Pod placement.

Understanding Kubernetes Scheduling

Default Scheduler Behavior: The Kubernetes scheduler automatically assigns Pods to nodes based on:

  • Available resources (CPU, memory)
  • Node conditions (ready, disk pressure, memory pressure)
  • Pod resource requests and limits (see the minimal example after this list)
  • Quality of Service (QoS) class
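
For reference, here is a minimal sketch of a Pod that relies entirely on default scheduling: it declares only resource requests and limits (the names and values are illustrative), and the scheduler may place it on any node that can satisfy them.

apiVersion: v1
kind: Pod
metadata:
  name: default-scheduled-pod
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:        # the scheduler filters and scores nodes using these
        cpu: "250m"
        memory: "128Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"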

Why Advanced Scheduling? Default scheduling doesn’t consider:

  • Hardware requirements (SSD, GPU, specific CPU)
  • Workload isolation (production vs development)
  • Data locality (co-locate with data source)
  • High availability (spread across zones)
  • Cost optimization (use spot instances)

Advanced Scheduling Tools:

Tool              | Purpose                     | Use Case
Node Affinity     | Attract pods to nodes       | GPU workloads, SSD storage
Pod Affinity      | Co-locate pods              | Cache near application
Pod Anti-Affinity | Spread pods apart           | High availability
Taints            | Repel pods from nodes       | Dedicated nodes
Tolerations       | Allow pods on tainted nodes | Special workloads
Node Selector     | Simple node selection       | Basic requirements

Node Affinity - Attracting Pods to Specific Nodes

What is Node Affinity? Node Affinity is a set of rules that constrain which nodes your Pod can be scheduled on, based on node labels. It’s like saying “I prefer/require nodes with these characteristics.”

Why Use Node Affinity?

  • Hardware Requirements: Schedule ML workloads on GPU nodes
  • Performance: Place databases on SSD-backed nodes
  • Compliance: Keep sensitive data in specific regions
  • Cost Optimization: Use cheaper nodes for dev workloads
  • Workload Isolation: Separate production from development

Node Affinity vs Node Selector:

Feature            | Node Selector    | Node Affinity
Syntax             | Simple key-value | Complex expressions
Operators          | Equality only    | In, NotIn, Exists, etc.
Required/Preferred | Required only    | Both supported
Multiple Rules     | AND only         | AND/OR combinations
Use Case           | Simple selection | Complex requirements
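
For comparison, this is what the simple SSD requirement looks like with nodeSelector alone; a minimal sketch that assumes the disktype=ssd label applied in Step 1 below.

apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod-selector
spec:
  nodeSelector:
    disktype: ssd      # simple equality match; no operators or preferences
  containers:
  - name: app
    image: nginx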

How Node Affinity Works:

  1. Label nodes with characteristics
  2. Define affinity rules in Pod spec
  3. Scheduler evaluates rules during placement
  4. Pod scheduled on matching node (or pending if none match)

Step 1: Label Your Nodes

# Label nodes with hardware characteristics
kubectl label nodes node1 disktype=ssd
kubectl label nodes node1 cpu-type=high-performance
kubectl label nodes node2 gpu=true
kubectl label nodes node2 gpu-type=nvidia-v100
kubectl label nodes node3 environment=production
kubectl label nodes node3 zone=us-east-1a

# View node labels
kubectl get nodes --show-labels
kubectl describe node node1 | grep Labels

Common Node Labels:

  • disktype: ssd, hdd, nvme
  • gpu: true, false
  • environment: production, staging, development
  • zone: us-east-1a, us-east-1b (cloud zones)
  • instance-type: t3.large, m5.xlarge
  • dedicated: database, ml-training

Type 1: Required Affinity (Hard Requirement)

What it means: The Pod MUST be scheduled on a matching node; otherwise it stays Pending.

When to use:

  • Critical hardware requirements (GPU for ML)
  • Compliance requirements (data must stay in region)
  • License restrictions (software tied to specific nodes)

Example 1: Simple Required Affinity

apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: app
    image: nginx

Example 2: Multiple Requirements (AND logic)

apiVersion: v1
kind: Pod
metadata:
  name: gpu-ml-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values:
            - "true"
          - key: gpu-type
            operator: In
            values:
            - nvidia-v100
            - nvidia-a100
          - key: environment
            operator: In
            values:
            - production
  containers:
  - name: ml-training
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1

Example 3: Multiple Options (OR logic)

apiVersion: v1
kind: Pod
metadata:
  name: flexible-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:  # Option 1: SSD nodes
          - key: disktype
            operator: In
            values:
            - ssd
        - matchExpressions:  # OR Option 2: NVMe nodes
          - key: disktype
            operator: In
            values:
            - nvme
  containers:
  - name: app
    image: nginx

Type 2: Preferred Affinity (Soft Requirement)

What it means: The scheduler PREFERS matching nodes but will schedule the Pod elsewhere if no matching node is available.

When to use:

  • Performance optimization (prefer SSD but HDD acceptable)
  • Cost optimization (prefer cheaper nodes)
  • Best-effort placement

How weights work:

  • Higher weight = stronger preference
  • Weights range: 1-100
  • Scheduler calculates total score for each node
  • Node with the highest score wins (see the worked example after this list)
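
As a worked example, take the three preferences from Example 2 below (weights 80, 60, and 20) and three hypothetical nodes that all pass any required filters:

Node A (disktype=ssd, zone=us-east-1a):   80 + 20 = 100
Node B (environment=production only):     60
Node C (no matching labels):              0

All else being equal, the Pod lands on Node A. Keep in mind that this affinity score is only one input to the scheduler's overall node scoring.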

Example 1: Single Preference

apiVersion: v1
kind: Pod
metadata:
  name: preferred-ssd-pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: app
    image: nginx

Example 2: Multiple Preferences with Weights

apiVersion: v1
kind: Pod
metadata:
  name: weighted-preferences-pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80  # Strongly prefer SSD
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      - weight: 60  # Moderately prefer production
        preference:
          matchExpressions:
          - key: environment
            operator: In
            values:
            - production
      - weight: 20  # Slightly prefer us-east-1a
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-east-1a
  containers:
  - name: app
    image: nginx

Example 3: Combined Required + Preferred

apiVersion: v1
kind: Pod
metadata:
  name: combined-affinity-pod
spec:
  affinity:
    nodeAffinity:
      # MUST have GPU
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values:
            - "true"
      # PREFER V100 over other GPUs
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - nvidia-v100
  containers:
  - name: ml-app
    image: tensorflow/tensorflow:latest-gpu

Available Operators:

Operator     | Description              | Example
In           | Label value in list      | disktype In [ssd, nvme]
NotIn        | Label value not in list  | environment NotIn [dev]
Exists       | Label key exists         | gpu Exists
DoesNotExist | Label key doesn't exist  | spot-instance DoesNotExist
Gt           | Greater than (numeric)   | cpu-cores Gt 8
Lt           | Less than (numeric)      | memory-gb Lt 32

Operator Examples:

# Exists - any node with GPU label
- key: gpu
  operator: Exists

# DoesNotExist - avoid spot instances
- key: spot-instance
  operator: DoesNotExist

# Gt - nodes with more than 16 CPU cores
- key: cpu-cores
  operator: Gt
  values:
  - "16"

# NotIn - avoid development nodes
- key: environment
  operator: NotIn
  values:
  - development
  - testing

Pod Affinity & Anti-Affinity - Co-locating or Spreading Pods

What is Pod Affinity/Anti-Affinity? Pod Affinity/Anti-Affinity controls Pod placement based on OTHER Pods already running on nodes, not node labels.

Pod Affinity vs Anti-Affinity:

Feature  | Pod Affinity       | Pod Anti-Affinity
Purpose  | Co-locate pods     | Spread pods apart
Use Case | Cache near app     | High availability
Example  | Redis near web app | Spread replicas across nodes
Topology | Same node/zone     | Different nodes/zones

Why Use Pod Affinity?

  • Performance: Place cache near application (reduce latency)
  • Data Locality: Co-locate data processing with data source
  • Communication: Keep tightly-coupled services together
  • Cost: Reduce inter-zone data transfer costs

Why Use Pod Anti-Affinity?

  • High Availability: Spread replicas across failure domains
  • Resource Distribution: Avoid resource contention
  • Fault Tolerance: Survive node/zone failures
  • Load Balancing: Distribute load across infrastructure

Understanding Topology Keys:

The topologyKey defines the scope of affinity/anti-affinity:

Topology Key                  | Scope             | Use Case
kubernetes.io/hostname        | Single node       | Co-locate on same node
topology.kubernetes.io/zone   | Availability zone | Spread across zones
topology.kubernetes.io/region | Cloud region      | Multi-region deployment
Custom labels                 | Custom domains    | Custom topology
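
Affinity rules only behave as intended if nodes actually carry the chosen topology label (zone and region labels are normally set by the cloud provider or kubelet). One way to check, shown here as an illustrative command:

# Show each node's zone and region labels as extra columns
kubectl get nodes -L topology.kubernetes.io/zone -L topology.kubernetes.io/region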

Pod Affinity - Co-locating Pods

Example 1: Cache Near Application (Required)

apiVersion: v1
kind: Pod
metadata:
  name: cache-pod
  labels:
    app: cache
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web
        topologyKey: kubernetes.io/hostname  # Same node
  containers:
  - name: redis
    image: redis:7

How it works:

  1. Scheduler finds nodes running Pods with label app=web
  2. Schedules cache-pod on same node as web Pod
  3. If no web Pods exist, cache-pod stays Pending

Example 2: Data Processing Near Data Source (Preferred)

apiVersion: v1
kind: Pod
metadata:
  name: processor-pod
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - database
          topologyKey: topology.kubernetes.io/zone  # Same zone
  containers:
  - name: processor
    image: data-processor:latest

Example 3: Multi-tier Application Co-location

apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 3
  selector:
    matchLabels:
      tier: backend
  template:
    metadata:
      labels:
        tier: backend
        app: myapp
    spec:
      affinity:
        podAffinity:
          # Co-locate with frontend
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  tier: frontend
                  app: myapp
              topologyKey: kubernetes.io/hostname
      containers:
      - name: backend
        image: backend:latest

Pod Anti-Affinity - Spreading Pods Apart

Example 1: High Availability Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web
            topologyKey: kubernetes.io/hostname  # Different nodes
      containers:
      - name: web
        image: nginx

Result: Each replica runs on a different node for fault tolerance.
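
To confirm the spread after applying the Deployment above, list the Pods with their node assignments (the NODE column should show three different nodes):

kubectl get pods -l app=web -o wide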

Example 2: Spread Across Availability Zones

apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical
  template:
    metadata:
      labels:
        app: critical
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: critical
            topologyKey: topology.kubernetes.io/zone  # Different zones
      containers:
      - name: app
        image: critical-app:latest

Result: Each replica runs in a different availability zone (survives a zone failure).

Example 3: Preferred Anti-Affinity (Soft)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flexible-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: flexible
  template:
    metadata:
      labels:
        app: flexible
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: flexible
              topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: app:latest

Result: The scheduler tries to spread replicas across nodes, but allows multiple replicas on the same node if needed.

Example 4: Combined Affinity + Anti-Affinity

apiVersion: apps/v1
kind: Deployment
metadata:
  name: smart-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: smart
  template:
    metadata:
      labels:
        app: smart
        tier: backend
    spec:
      affinity:
        # Co-locate with database (same zone for low latency)
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: database
              topologyKey: topology.kubernetes.io/zone
        # Spread replicas across nodes (high availability)
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: smart
            topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: smart-app:latest

Result:

  • Replicas spread across different nodes (HA)
  • All replicas prefer same zone as database (performance)

Taints and Tolerations - Repelling and Allowing Pods

What are Taints and Tolerations?

  • Taints: Applied to nodes to repel Pods (unless they tolerate the taint)
  • Tolerations: Applied to Pods to allow scheduling on tainted nodes

Taints vs Affinity:

Feature    | Taints          | Node Affinity
Applied to | Nodes           | Pods
Purpose    | Repel pods      | Attract pods
Default    | Blocks all pods | Allows all pods
Use Case   | Dedicated nodes | Hardware requirements

Why Use Taints?

  • Dedicated Nodes: Reserve nodes for specific workloads (GPU, database)
  • Node Maintenance: Prevent scheduling during maintenance
  • Special Hardware: Isolate expensive resources
  • Workload Isolation: Separate production from development
  • Node Problems: Mark problematic nodes

Taint Format:

key=value:effect

Taint Effects Explained

1. NoSchedule (Hard Restriction)

  • What it does: Prevents NEW Pods from scheduling
  • Existing Pods: Continue running
  • Use when: Want to dedicate node but keep existing workloads

kubectl taint nodes node1 dedicated=database:NoSchedule

2. PreferNoSchedule (Soft Restriction)

  • What it does: Tries to avoid scheduling, but not guaranteed
  • Existing Pods: Continue running
  • Use when: Prefer to avoid node but allow if necessary

kubectl taint nodes node1 maintenance=soon:PreferNoSchedule

3. NoExecute (Eviction)

  • What it does: Prevents new Pods AND evicts existing Pods
  • Existing Pods: Evicted unless they tolerate
  • Use when: Need to clear node immediately

kubectl taint nodes node1 hardware=faulty:NoExecute

Managing Taints

Add taint:

# Basic taint
kubectl taint nodes node1 key=value:NoSchedule

# GPU node
kubectl taint nodes gpu-node1 nvidia.com/gpu=true:NoSchedule

# Production node
kubectl taint nodes prod-node1 environment=production:NoSchedule

# Maintenance
kubectl taint nodes node1 maintenance=true:NoExecute

View taints:

kubectl describe node node1 | grep Taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

Remove taint:

# Remove specific taint
kubectl taint nodes node1 key:NoSchedule-

# Remove all taints with key
kubectl taint nodes node1 key-

Tolerations in Pods

Toleration Operators:

  • Equal: Key, value, and effect must match
  • Exists: Only key and effect must match (any value)

Example 1: Equal Operator

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: cuda-app
    image: nvidia/cuda:11.0

Example 2: Exists Operator (Tolerate Any Value)

apiVersion: v1
kind: Pod
metadata:
  name: flexible-pod
spec:
  tolerations:
  - key: "dedicated"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx

Example 3: Tolerate All Taints

apiVersion: v1
kind: Pod
metadata:
  name: system-pod
spec:
  tolerations:
  - operator: "Exists"  # Tolerates everything
  containers:
  - name: monitoring
    image: prometheus:latest

Example 4: Toleration with Timeout (NoExecute)

apiVersion: v1
kind: Pod
metadata:
  name: graceful-pod
spec:
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300  # Stay for 5 minutes before eviction
  containers:
  - name: app
    image: nginx

Example 5: Multiple Tolerations

apiVersion: v1
kind: Pod
metadata:
  name: multi-tolerant-pod
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  - key: "environment"
    operator: "Equal"
    value: "production"
    effect: "NoSchedule"
  - key: "node.kubernetes.io/disk-pressure"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: app
    image: myapp:latest

Built-in Taints

Kubernetes automatically adds taints for node conditions:

Taint Key                              | Condition          | Effect
node.kubernetes.io/not-ready           | Node not ready     | NoExecute
node.kubernetes.io/unreachable         | Node unreachable   | NoExecute
node.kubernetes.io/memory-pressure     | Low memory         | NoSchedule
node.kubernetes.io/disk-pressure       | Low disk           | NoSchedule
node.kubernetes.io/pid-pressure        | Too many processes | NoSchedule
node.kubernetes.io/network-unavailable | Network issue      | NoSchedule
node.kubernetes.io/unschedulable       | Cordoned           | NoSchedule

DaemonSets automatically tolerate these taints!
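
Kubernetes typically also adds some of these tolerations to ordinary Pods at admission time (for example, not-ready and unreachable tolerations with a 300-second timeout). To see which tolerations a Pod actually ended up with, inspect its spec; my-pod is a placeholder name:

kubectl get pod my-pod -o jsonpath='{.spec.tolerations}'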

Production Examples

Example 1: Dedicated GPU Nodes

# Setup GPU nodes
kubectl taint nodes gpu-node1 nvidia.com/gpu=true:NoSchedule
kubectl taint nodes gpu-node2 nvidia.com/gpu=true:NoSchedule
kubectl label nodes gpu-node1 gpu-type=v100
kubectl label nodes gpu-node2 gpu-type=v100

apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - v100
  containers:
  - name: training
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1

Example 2: Production/Development Isolation

# Taint production nodes
kubectl taint nodes prod-node1 environment=production:NoSchedule
kubectl taint nodes prod-node2 environment=production:NoSchedule

# Label for identification
kubectl label nodes prod-node1 environment=production
kubectl label nodes prod-node2 environment=production

apiVersion: apps/v1
kind: Deployment
metadata:
  name: production-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: prod-app
  template:
    metadata:
      labels:
        app: prod-app
    spec:
      tolerations:
      - key: environment
        operator: Equal
        value: production
        effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: environment
                operator: In
                values:
                - production
      containers:
      - name: app
        image: prod-app:v1.0

Example 3: Node Maintenance

# Mark node for maintenance
kubectl taint nodes node1 maintenance=true:NoExecute

# Pods without toleration will be evicted
# Add toleration to critical pods that should stay

apiVersion: v1
kind: Pod
metadata:
  name: critical-monitoring
spec:
  tolerations:
  - key: maintenance
    operator: Equal
    value: "true"
    effect: NoExecute
    tolerationSeconds: 3600  # Stay for 1 hour
  containers:
  - name: monitor
    image: monitoring:latest
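
When maintenance is finished, remove the taint the same way as any other (see "Remove taint" above) so the node can accept normal workloads again:

kubectl taint nodes node1 maintenance=true:NoExecute-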

Combining Strategies - Complete Example

Scenario: High-availability web application with database

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
        tier: frontend
    spec:
      # Tolerate production taint
      tolerations:
      - key: environment
        operator: Equal
        value: production
        effect: NoSchedule
      
      affinity:
        # MUST run on production nodes
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: environment
                operator: In
                values:
                - production
              - key: disktype
                operator: In
                values:
                - ssd
        
        # PREFER same zone as database
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: database
              topologyKey: topology.kubernetes.io/zone
        
        # SPREAD across nodes for HA
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      
      containers:
      - name: web
        image: web-app:v2.0
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"

Best Practices

  1. Use Taints for Dedicated Nodes

    • GPU nodes, high-memory nodes, special hardware
  2. Combine with Node Affinity

    • Taints repel, affinity attracts
    • Use both for precise control
  3. Use Pod Anti-Affinity for HA

    • Spread replicas across nodes/zones
    • Survive node/zone failures
  4. Test Before Production

    • Verify scheduling behavior
    • Check Pod placement
  5. Document Taints

    • Label nodes clearly
    • Document taint purposes
  6. Monitor Pending Pods

    • Watch for scheduling failures
    • Check affinity/toleration mismatches (see the commands after this list)
  7. Use Preferred When Possible

    • Soft requirements more flexible
    • Fallback options available
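
A quick way to investigate Pods stuck in Pending (the pod name below is a placeholder):

# List Pods that have not been scheduled yet
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# The Events section explains why scheduling failed (e.g. unmatched affinity or untolerated taints)
kubectl describe pod <pod-name>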

Conclusion

Advanced scheduling gives you precise control over Pod placement. Use Node Affinity to attract Pods to specific nodes, Taints and Tolerations to repel them from dedicated nodes, and Pod Affinity/Anti-Affinity to co-locate or spread Pods relative to each other.

Next: DaemonSets and Jobs

Resources