Introduction
By default, the Kubernetes scheduler automatically places Pods on available nodes. But what if you need more control? Perhaps you want GPU-intensive workloads on specific nodes, or you need to isolate production from development workloads. Node Affinity, Taints, and Tolerations give you fine-grained control over Pod placement.
Understanding Kubernetes Scheduling
Default Scheduler Behavior: The Kubernetes scheduler automatically assigns Pods to nodes based on:
- Available resources (CPU, memory)
- Node conditions (ready, disk pressure, memory pressure)
- Pod resource requests and limits
- Quality of Service (QoS) class
Why Advanced Scheduling? Default scheduling doesn’t consider:
- Hardware requirements (SSD, GPU, specific CPU)
- Workload isolation (production vs development)
- Data locality (co-locate with data source)
- High availability (spread across zones)
- Cost optimization (use spot instances)
Advanced Scheduling Tools:
Tool | Purpose | Use Case |
---|---|---|
Node Affinity | Attract pods to nodes | GPU workloads, SSD storage |
Pod Affinity | Co-locate pods | Cache near application |
Pod Anti-Affinity | Spread pods apart | High availability |
Taints | Repel pods from nodes | Dedicated nodes |
Tolerations | Allow pods on tainted nodes | Special workloads |
Node Selector | Simple node selection | Basic requirements |
Node Affinity - Attracting Pods to Specific Nodes
What is Node Affinity? Node Affinity is a set of rules that constrain which nodes your Pod can be scheduled on, based on node labels. It’s like saying “I prefer/require nodes with these characteristics.”
Why Use Node Affinity?
- Hardware Requirements: Schedule ML workloads on GPU nodes
- Performance: Place databases on SSD-backed nodes
- Compliance: Keep sensitive data in specific regions
- Cost Optimization: Use cheaper nodes for dev workloads
- Workload Isolation: Separate production from development
Node Affinity vs Node Selector:
Feature | Node Selector | Node Affinity |
---|---|---|
Syntax | Simple key-value | Complex expressions |
Operators | Equality only | In, NotIn, Exists, etc. |
Required/Preferred | Required only | Both supported |
Multiple Rules | AND only | AND/OR combinations |
Use Case | Simple selection | Complex requirements |
How Node Affinity Works:
- Label nodes with characteristics
- Define affinity rules in Pod spec
- Scheduler evaluates rules during placement
- Pod scheduled on matching node (or pending if none match)
Step 1: Label Your Nodes
```bash
# Label nodes with hardware characteristics
kubectl label nodes node1 disktype=ssd
kubectl label nodes node1 cpu-type=high-performance
kubectl label nodes node2 gpu=true
kubectl label nodes node2 gpu-type=nvidia-v100
kubectl label nodes node3 environment=production
kubectl label nodes node3 zone=us-east-1a

# View node labels
kubectl get nodes --show-labels
kubectl describe node node1 | grep Labels
```
Common Node Labels:
- disktype: ssd, hdd, nvme
- gpu: true, false
- environment: production, staging, development
- zone: us-east-1a, us-east-1b (cloud zones)
- instance-type: t3.large, m5.xlarge
- dedicated: database, ml-training
Type 1: Required Affinity (Hard Requirement)
What it means: Pod MUST be scheduled on matching node, or stays Pending.
When to use:
- Critical hardware requirements (GPU for ML)
- Compliance requirements (data must stay in region)
- License restrictions (software tied to specific nodes)
Example 1: Simple Required Affinity
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: app
    image: nginx
```
Example 2: Multiple Requirements (AND logic)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-ml-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values:
            - "true"
          - key: gpu-type
            operator: In
            values:
            - nvidia-v100
            - nvidia-a100
          - key: environment
            operator: In
            values:
            - production
  containers:
  - name: ml-training
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1
```
Example 3: Multiple Options (OR logic)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flexible-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:      # Option 1: SSD nodes
          - key: disktype
            operator: In
            values:
            - ssd
        - matchExpressions:      # OR Option 2: NVMe nodes
          - key: disktype
            operator: In
            values:
            - nvme
  containers:
  - name: app
    image: nginx
```
Type 2: Preferred Affinity (Soft Requirement)
What it means: Scheduler PREFERS matching nodes but will schedule elsewhere if needed.
When to use:
- Performance optimization (prefer SSD but HDD acceptable)
- Cost optimization (prefer cheaper nodes)
- Best-effort placement
How weights work:
- Higher weight = stronger preference
- Weights range: 1-100
- Scheduler calculates total score for each node
- Node with highest score wins
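The scoring step above can be sketched in a few lines of Python. This is only a toy illustration of the idea (the node names, labels, and weights are invented for the example, and the real scheduler combines many more scoring plugins): each node that passes the hard filters earns the weight of every preferred term it matches, and the highest total wins.

```python
# Toy illustration of preferred-affinity scoring (not real scheduler code).
# Each candidate node earns the weight of every preference it satisfies.

preferences = [
    {"weight": 80, "labels": {"disktype": "ssd"}},
    {"weight": 60, "labels": {"environment": "production"}},
    {"weight": 20, "labels": {"zone": "us-east-1a"}},
]

nodes = {
    "node1": {"disktype": "ssd", "environment": "production"},
    "node2": {"disktype": "hdd", "zone": "us-east-1a"},
}

def score(node_labels):
    # Sum the weight of every preference whose labels all match the node.
    return sum(
        p["weight"]
        for p in preferences
        if all(node_labels.get(k) == v for k, v in p["labels"].items())
    )

scores = {name: score(labels) for name, labels in nodes.items()}
best = max(scores, key=scores.get)
print(scores)  # node1 matches ssd + production; node2 only the zone preference
print(best)    # node1
```

Here node1 scores 80 + 60 = 140 against node2's 20, so node1 wins even though node2 also matches a preference.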
Example 1: Single Preference
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: preferred-ssd-pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: app
    image: nginx
```
Example 2: Multiple Preferences with Weights
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: weighted-preferences-pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80   # Strongly prefer SSD
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      - weight: 60   # Moderately prefer production
        preference:
          matchExpressions:
          - key: environment
            operator: In
            values:
            - production
      - weight: 20   # Slightly prefer us-east-1a
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-east-1a
  containers:
  - name: app
    image: nginx
```
Example 3: Combined Required + Preferred
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: combined-affinity-pod
spec:
  affinity:
    nodeAffinity:
      # MUST have a GPU
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values:
            - "true"
      # PREFER V100 over other GPUs
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - nvidia-v100
  containers:
  - name: ml-app
    image: tensorflow/tensorflow:latest-gpu
```
Available Operators:
Operator | Description | Example |
---|---|---|
In | Label value is in the list | disktype In [ssd, nvme] |
NotIn | Label value is not in the list | environment NotIn [dev] |
Exists | Label key exists | gpu Exists |
DoesNotExist | Label key doesn't exist | spot-instance DoesNotExist |
Gt | Greater than (numeric) | cpu-cores Gt 8 |
Lt | Less than (numeric) | memory-gb Lt 32 |
Operator Examples:
```yaml
# Exists - any node with a gpu label
- key: gpu
  operator: Exists

# DoesNotExist - avoid spot instances
- key: spot-instance
  operator: DoesNotExist

# Gt - nodes with more than 16 CPU cores
- key: cpu-cores
  operator: Gt
  values:
  - "16"

# NotIn - avoid development and testing nodes
- key: environment
  operator: NotIn
  values:
  - development
  - testing
```
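The operator semantics can be mimicked in a short Python sketch. This is an illustration only, with invented node labels; note that Kubernetes label values are strings, so Gt/Lt compare them as numbers, and NotIn/DoesNotExist also match nodes that lack the key entirely.

```python
# Illustrative evaluation of nodeAffinity operators against a node's labels.
def matches(expr, labels):
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":  # also matches nodes without the key
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op == "Gt":  # label value and operand are numeric strings
        return key in labels and int(labels[key]) > int(values[0])
    if op == "Lt":
        return key in labels and int(labels[key]) < int(values[0])
    raise ValueError(f"unknown operator: {op}")

node = {"gpu": "true", "cpu-cores": "32", "environment": "production"}
print(matches({"key": "gpu", "operator": "Exists"}, node))                      # True
print(matches({"key": "cpu-cores", "operator": "Gt", "values": ["16"]}, node))  # True
print(matches({"key": "environment", "operator": "NotIn",
               "values": ["development", "testing"]}, node))                    # True
print(matches({"key": "spot-instance", "operator": "Exists"}, node))            # False
```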
Pod Affinity & Anti-Affinity - Co-locating or Spreading Pods
What is Pod Affinity/Anti-Affinity? Pod Affinity/Anti-Affinity controls Pod placement based on OTHER Pods already running on nodes, not node labels.
Pod Affinity vs Anti-Affinity:
Feature | Pod Affinity | Pod Anti-Affinity |
---|---|---|
Purpose | Co-locate pods | Spread pods apart |
Use Case | Cache near app | High availability |
Example | Redis near web app | Spread replicas across nodes |
Topology | Same node/zone | Different nodes/zones |
Why Use Pod Affinity?
- Performance: Place cache near application (reduce latency)
- Data Locality: Co-locate data processing with data source
- Communication: Keep tightly-coupled services together
- Cost: Reduce inter-zone data transfer costs
Why Use Pod Anti-Affinity?
- High Availability: Spread replicas across failure domains
- Resource Distribution: Avoid resource contention
- Fault Tolerance: Survive node/zone failures
- Load Balancing: Distribute load across infrastructure
Understanding Topology Keys:
The topologyKey defines the scope of affinity/anti-affinity:
Topology Key | Scope | Use Case |
---|---|---|
kubernetes.io/hostname | Single node | Co-locate on same node |
topology.kubernetes.io/zone | Availability zone | Spread across zones |
topology.kubernetes.io/region | Cloud region | Multi-region deployment |
Custom labels | Custom domains | Custom topology |
Pod Affinity - Co-locating Pods
Example 1: Cache Near Application (Required)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cache-pod
  labels:
    app: cache
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web
        topologyKey: kubernetes.io/hostname  # Same node
  containers:
  - name: redis
    image: redis:7
```
How it works:
- Scheduler finds nodes running Pods with the label app=web
- Schedules cache-pod on the same node as a web Pod
- If no web Pods exist, cache-pod stays Pending
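Those steps amount to: collect the topology domains that already host a matching Pod, then restrict scheduling to nodes in those domains. A toy sketch with invented Pod and node data (not actual scheduler code):

```python
# Toy sketch of required podAffinity: find nodes whose topology domain
# already hosts a Pod matching the label selector.

running_pods = [
    {"labels": {"app": "web"}, "node": "node1"},
    {"labels": {"app": "db"}, "node": "node2"},
]
nodes = ["node1", "node2", "node3"]

def eligible_nodes(selector, topology_of):
    # Topology domains that already host a Pod matching the selector.
    domains = {
        topology_of(p["node"])
        for p in running_pods
        if all(p["labels"].get(k) == v for k, v in selector.items())
    }
    return [n for n in nodes if topology_of(n) in domains]

# With kubernetes.io/hostname as the topologyKey, each node is its own domain.
print(eligible_nodes({"app": "web"}, topology_of=lambda node: node))    # ['node1']
print(eligible_nodes({"app": "cache"}, topology_of=lambda node: node))  # [] -> Pending
```

With a zone-level topologyKey, `topology_of` would map each node to its zone instead, so any node in the same zone as a matching Pod becomes eligible.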
Example 2: Data Processing Near Data Source (Preferred)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: processor-pod
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - database
          topologyKey: topology.kubernetes.io/zone  # Same zone
  containers:
  - name: processor
    image: data-processor:latest
```
Example 3: Multi-tier Application Co-location
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 3
  selector:
    matchLabels:
      tier: backend
  template:
    metadata:
      labels:
        tier: backend
        app: myapp
    spec:
      affinity:
        podAffinity:
          # Co-locate with the frontend
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  tier: frontend
                  app: myapp
              topologyKey: kubernetes.io/hostname
      containers:
      - name: backend
        image: backend:latest
```
Pod Anti-Affinity - Spreading Pods Apart
Example 1: High Availability Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web
            topologyKey: kubernetes.io/hostname  # Different nodes
      containers:
      - name: web
        image: nginx
```
Result: Each replica runs on a different node for fault tolerance.
Example 2: Spread Across Availability Zones
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical
  template:
    metadata:
      labels:
        app: critical
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: critical
            topologyKey: topology.kubernetes.io/zone  # Different zones
      containers:
      - name: app
        image: critical-app:latest
```
Result: Each replica in different availability zone (survives zone failure).
Example 3: Preferred Anti-Affinity (Soft)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flexible-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: flexible
  template:
    metadata:
      labels:
        app: flexible
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: flexible
              topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: app:latest
```
Result: Tries to spread across nodes, but allows multiple on same node if needed.
Example 4: Combined Affinity + Anti-Affinity
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: smart-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: smart
  template:
    metadata:
      labels:
        app: smart
        tier: backend
    spec:
      affinity:
        # Co-locate with the database (same zone for low latency)
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: database
              topologyKey: topology.kubernetes.io/zone
        # Spread replicas across nodes (high availability)
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: smart
            topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: smart-app:latest
```
Result:
- Replicas spread across different nodes (HA)
- All replicas prefer same zone as database (performance)
Taints and Tolerations - Repelling and Allowing Pods
What are Taints and Tolerations?
- Taints: Applied to nodes to repel Pods (unless they tolerate the taint)
- Tolerations: Applied to Pods to allow scheduling on tainted nodes
Taints vs Affinity:
Feature | Taints | Node Affinity |
---|---|---|
Applied to | Nodes | Pods |
Purpose | Repel pods | Attract pods |
Default | Blocks all pods | Allows all pods |
Use Case | Dedicated nodes | Hardware requirements |
Why Use Taints?
- Dedicated Nodes: Reserve nodes for specific workloads (GPU, database)
- Node Maintenance: Prevent scheduling during maintenance
- Special Hardware: Isolate expensive resources
- Workload Isolation: Separate production from development
- Node Problems: Mark problematic nodes
Taint Format:
key=value:effect
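The key=value:effect string is easy to take apart. A small illustrative parser (my own sketch, not kubectl's implementation; note that the value part is optional in real kubectl, which this handles too):

```python
# Illustrative parser for the key=value:effect taint string used by
# `kubectl taint`. The value is optional, so "maintenance:NoExecute"
# is also valid.

def parse_taint(spec):
    body, _, effect = spec.rpartition(":")  # effect follows the last colon
    key, _, value = body.partition("=")     # value is optional
    return {"key": key, "value": value or None, "effect": effect}

print(parse_taint("dedicated=database:NoSchedule"))
# {'key': 'dedicated', 'value': 'database', 'effect': 'NoSchedule'}
print(parse_taint("maintenance:NoExecute"))
# {'key': 'maintenance', 'value': None, 'effect': 'NoExecute'}
```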
Taint Effects Explained
1. NoSchedule (Hard Restriction)
- What it does: Prevents NEW Pods from scheduling
- Existing Pods: Continue running
- Use when: You want to dedicate a node but keep existing workloads

```bash
kubectl taint nodes node1 dedicated=database:NoSchedule
```

2. PreferNoSchedule (Soft Restriction)
- What it does: Tries to avoid scheduling, but not guaranteed
- Existing Pods: Continue running
- Use when: You prefer to avoid the node but allow it if necessary

```bash
kubectl taint nodes node1 maintenance=soon:PreferNoSchedule
```

3. NoExecute (Eviction)
- What it does: Prevents new Pods AND evicts existing Pods
- Existing Pods: Evicted unless they tolerate the taint
- Use when: You need to clear the node immediately

```bash
kubectl taint nodes node1 hardware=faulty:NoExecute
```
Managing Taints
Add taint:
```bash
# Basic taint
kubectl taint nodes node1 key=value:NoSchedule

# GPU node
kubectl taint nodes gpu-node1 nvidia.com/gpu=true:NoSchedule

# Production node
kubectl taint nodes prod-node1 environment=production:NoSchedule

# Maintenance
kubectl taint nodes node1 maintenance=true:NoExecute
```

View taints:
```bash
kubectl describe node node1 | grep Taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```

Remove taint:
```bash
# Remove a specific taint
kubectl taint nodes node1 key:NoSchedule-

# Remove all taints with this key
kubectl taint nodes node1 key-
```
Tolerations in Pods
Toleration Operators:
- Equal: Key, value, and effect must match
- Exists: Only key and effect must match (any value)
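The two operators reduce to a simple predicate. A sketch of the matching rule (illustrative, with invented example data; per the Kubernetes rules, an empty key with Exists tolerates every taint, and an empty effect matches all effects):

```python
# Illustrative check of whether a toleration tolerates a taint.
# Mirrors the Equal/Exists rules; empty key + Exists tolerates all taints.

def tolerates(toleration, taint):
    # An empty effect on the toleration matches any taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if not toleration.get("key"):  # empty key: only valid with Exists
        return toleration.get("operator") == "Exists"
    if toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator") == "Exists":
        return True
    return toleration.get("value") == taint["value"]  # Equal (the default)

taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
print(tolerates({"key": "dedicated", "operator": "Equal",
                 "value": "gpu", "effect": "NoSchedule"}, taint))   # True
print(tolerates({"key": "dedicated", "operator": "Exists",
                 "effect": "NoSchedule"}, taint))                   # True
print(tolerates({"operator": "Exists"}, taint))                    # True (tolerate all)
print(tolerates({"key": "dedicated", "operator": "Equal",
                 "value": "db", "effect": "NoSchedule"}, taint))    # False
```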
Example 1: Equal Operator
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: cuda-app
    image: nvidia/cuda:11.0
```
Example 2: Exists Operator (Tolerate Any Value)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flexible-pod
spec:
  tolerations:
  - key: "dedicated"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx
```
Example 3: Tolerate All Taints
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: system-pod
spec:
  tolerations:
  - operator: "Exists"  # Tolerates everything
  containers:
  - name: monitoring
    image: prometheus:latest
```
Example 4: Toleration with Timeout (NoExecute)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-pod
spec:
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300  # Stay for 5 minutes before eviction
  containers:
  - name: app
    image: nginx
```
Example 5: Multiple Tolerations
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: multi-tolerant-pod
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  - key: "environment"
    operator: "Equal"
    value: "production"
    effect: "NoSchedule"
  - key: "node.kubernetes.io/disk-pressure"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: app
    image: myapp:latest
```
Built-in Taints
Kubernetes automatically adds taints for node conditions:
Taint Key | Condition | Effect |
---|---|---|
node.kubernetes.io/not-ready | Node not ready | NoExecute |
node.kubernetes.io/unreachable | Node unreachable | NoExecute |
node.kubernetes.io/memory-pressure | Low memory | NoSchedule |
node.kubernetes.io/disk-pressure | Low disk | NoSchedule |
node.kubernetes.io/pid-pressure | Too many processes | NoSchedule |
node.kubernetes.io/network-unavailable | Network issue | NoSchedule |
node.kubernetes.io/unschedulable | Node cordoned | NoSchedule |
DaemonSets automatically tolerate these taints!
Production Examples
Example 1: Dedicated GPU Nodes
```bash
# Set up GPU nodes
kubectl taint nodes gpu-node1 nvidia.com/gpu=true:NoSchedule
kubectl taint nodes gpu-node2 nvidia.com/gpu=true:NoSchedule
kubectl label nodes gpu-node1 gpu-type=v100
kubectl label nodes gpu-node2 gpu-type=v100
```
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - v100
  containers:
  - name: training
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1
```
Example 2: Production/Development Isolation
```bash
# Taint production nodes
kubectl taint nodes prod-node1 environment=production:NoSchedule
kubectl taint nodes prod-node2 environment=production:NoSchedule

# Label for identification
kubectl label nodes prod-node1 environment=production
kubectl label nodes prod-node2 environment=production
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: production-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: prod-app
  template:
    metadata:
      labels:
        app: prod-app
    spec:
      tolerations:
      - key: environment
        operator: Equal
        value: production
        effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: environment
                operator: In
                values:
                - production
      containers:
      - name: app
        image: prod-app:v1.0
```
Example 3: Node Maintenance
```bash
# Mark the node for maintenance
kubectl taint nodes node1 maintenance=true:NoExecute

# Pods without a matching toleration will be evicted.
# Add a toleration to critical pods that should stay:
```
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-monitoring
spec:
  tolerations:
  - key: maintenance
    operator: Equal
    value: "true"
    effect: NoExecute
    tolerationSeconds: 3600  # Stay for 1 hour
  containers:
  - name: monitor
    image: monitoring:latest
```
Combining Strategies - Complete Example
Scenario: High-availability web application with database
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
        tier: frontend
    spec:
      # Tolerate the production taint
      tolerations:
      - key: environment
        operator: Equal
        value: production
        effect: NoSchedule
      affinity:
        # MUST run on production SSD nodes
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: environment
                operator: In
                values:
                - production
              - key: disktype
                operator: In
                values:
                - ssd
        # PREFER the same zone as the database
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: database
              topologyKey: topology.kubernetes.io/zone
        # SPREAD replicas across nodes for HA
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: web-app:v2.0
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
```
Best Practices
1. Use Taints for Dedicated Nodes: GPU nodes, high-memory nodes, special hardware.
2. Combine with Node Affinity: taints repel, affinity attracts; use both for precise control.
3. Use Pod Anti-Affinity for HA: spread replicas across nodes/zones to survive node or zone failures.
4. Test Before Production: verify scheduling behavior and check actual Pod placement.
5. Document Taints: label nodes clearly and document each taint's purpose.
6. Monitor Pending Pods: watch for scheduling failures and check for affinity/toleration mismatches.
7. Use Preferred When Possible: soft requirements are more flexible and leave fallback options.
Conclusion
Advanced scheduling ensures optimal Pod placement. Use Node Affinity to attract Pods to the right nodes, Taints and Tolerations to keep them off dedicated nodes, and Pod Affinity/Anti-Affinity to co-locate or spread them.
Next: DaemonSets and Jobs