Introduction
By default, the Kubernetes scheduler automatically places Pods on available nodes. But what if you need more control? Perhaps you want GPU-intensive workloads on specific nodes, or you need to isolate production from development workloads. Node Affinity, Taints, and Tolerations give you fine-grained control over Pod placement.
Understanding Kubernetes Scheduling
Default Scheduler Behavior: The Kubernetes scheduler automatically assigns Pods to nodes based on:
- Available resources (CPU, memory)
- Node conditions (ready, disk pressure, memory pressure)
- Pod resource requests and limits
- Quality of Service (QoS) class
Why Advanced Scheduling? Default scheduling doesn’t consider:
- Hardware requirements (SSD, GPU, specific CPU)
- Workload isolation (production vs development)
- Data locality (co-locate with data source)
- High availability (spread across zones)
- Cost optimization (use spot instances)
Advanced Scheduling Tools:
Tool | Purpose | Use Case |
---|---|---|
Node Affinity | Attract pods to nodes | GPU workloads, SSD storage |
Pod Affinity | Co-locate pods | Cache near application |
Pod Anti-Affinity | Spread pods apart | High availability |
Taints | Repel pods from nodes | Dedicated nodes |
Tolerations | Allow pods on tainted nodes | Special workloads |
Node Selector | Simple node selection | Basic requirements |
Node Affinity - Attracting Pods to Specific Nodes
What is Node Affinity? Node Affinity is a set of rules that constrain which nodes your Pod can be scheduled on, based on node labels. It’s like saying “I prefer/require nodes with these characteristics.”
Why Use Node Affinity?
- Hardware Requirements: Schedule ML workloads on GPU nodes
- Performance: Place databases on SSD-backed nodes
- Compliance: Keep sensitive data in specific regions
- Cost Optimization: Use cheaper nodes for dev workloads
- Workload Isolation: Separate production from development
Node Affinity vs Node Selector:
Feature | Node Selector | Node Affinity |
---|---|---|
Syntax | Simple key-value | Complex expressions |
Operators | Equality only | In, NotIn, Exists, etc. |
Required/Preferred | Required only | Both supported |
Multiple Rules | AND only | AND/OR combinations |
Use Case | Simple selection | Complex requirements |
How Node Affinity Works:
- Label nodes with characteristics
- Define affinity rules in Pod spec
- Scheduler evaluates rules during placement
- Pod scheduled on matching node (or pending if none match)
Step 1: Label Your Nodes
```bash
# Label nodes with hardware characteristics
kubectl label nodes node1 disktype=ssd
kubectl label nodes node1 cpu-type=high-performance
kubectl label nodes node2 gpu=true
kubectl label nodes node2 gpu-type=nvidia-v100
kubectl label nodes node3 environment=production
kubectl label nodes node3 zone=us-east-1a

# View node labels
kubectl get nodes --show-labels
kubectl describe node node1 | grep Labels
```
Common Node Labels:
- disktype: ssd, hdd, nvme
- gpu: true, false
- environment: production, staging, development
- zone: us-east-1a, us-east-1b (cloud zones)
- instance-type: t3.large, m5.xlarge
- dedicated: database, ml-training
Type 1: Required Affinity (Hard Requirement)
What it means: Pod MUST be scheduled on matching node, or stays Pending.
When to use:
- Critical hardware requirements (GPU for ML)
- Compliance requirements (data must stay in region)
- License restrictions (software tied to specific nodes)
Example 1: Simple Required Affinity
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: app
    image: nginx
```
Example 2: Multiple Requirements (AND logic)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-ml-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values:
            - "true"
          - key: gpu-type
            operator: In
            values:
            - nvidia-v100
            - nvidia-a100
          - key: environment
            operator: In
            values:
            - production
  containers:
  - name: ml-training
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1
```
Example 3: Multiple Options (OR logic)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flexible-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:      # Option 1: SSD nodes
          - key: disktype
            operator: In
            values:
            - ssd
        - matchExpressions:      # OR Option 2: NVMe nodes
          - key: disktype
            operator: In
            values:
            - nvme
  containers:
  - name: app
    image: nginx
```
Type 2: Preferred Affinity (Soft Requirement)
What it means: Scheduler PREFERS matching nodes but will schedule elsewhere if needed.
When to use:
- Performance optimization (prefer SSD but HDD acceptable)
- Cost optimization (prefer cheaper nodes)
- Best-effort placement
How weights work:
- Higher weight = stronger preference
- Weights range: 1-100
- Scheduler calculates total score for each node
- Node with highest score wins
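The scoring step above can be sketched in a few lines of Python. This is only a toy illustration of the idea (the node names, labels, and weights are invented for the example, and the real scheduler combines many more scoring plugins): each node that passes the hard filters earns the weight of every preferred term it matches, and the highest total wins.

```python
# Toy illustration of preferred-affinity scoring (not real scheduler code).
# Each candidate node earns the weight of every preference it satisfies.

preferences = [
    {"weight": 80, "labels": {"disktype": "ssd"}},
    {"weight": 60, "labels": {"environment": "production"}},
    {"weight": 20, "labels": {"zone": "us-east-1a"}},
]

nodes = {
    "node1": {"disktype": "ssd", "environment": "production"},
    "node2": {"disktype": "hdd", "zone": "us-east-1a"},
}

def score(node_labels):
    # Sum the weight of every preference whose labels all match the node.
    return sum(
        p["weight"]
        for p in preferences
        if all(node_labels.get(k) == v for k, v in p["labels"].items())
    )

scores = {name: score(labels) for name, labels in nodes.items()}
best = max(scores, key=scores.get)
print(scores)  # node1 matches ssd + production; node2 only the zone preference
print(best)    # node1
```

Here node1 scores 80 + 60 = 140 against node2's 20, so node1 wins even though node2 also matches a preference.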
Example 1: Single Preference
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: preferred-ssd-pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: app
    image: nginx
```
Example 2: Multiple Preferences with Weights
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: weighted-preferences-pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80   # Strongly prefer SSD
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      - weight: 60   # Moderately prefer production
        preference:
          matchExpressions:
          - key: environment
            operator: In
            values:
            - production
      - weight: 20   # Slightly prefer us-east-1a
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-east-1a
  containers:
  - name: app
    image: nginx
```
Example 3: Combined Required + Preferred
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: combined-affinity-pod
spec:
  affinity:
    nodeAffinity:
      # MUST have a GPU
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values:
            - "true"
      # PREFER V100 over other GPUs
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - nvidia-v100
  containers:
  - name: ml-app
    image: tensorflow/tensorflow:latest-gpu
```
Available Operators:
Operator | Description | Example |
---|---|---|
In | Label value is in the list | disktype In [ssd, nvme] |
NotIn | Label value is not in the list | environment NotIn [dev] |
Exists | Label key exists | gpu Exists |
DoesNotExist | Label key doesn't exist | spot-instance DoesNotExist |
Gt | Greater than (numeric) | cpu-cores Gt 8 |
Lt | Less than (numeric) | memory-gb Lt 32 |
Operator Examples:
```yaml
# Exists - any node with a gpu label
- key: gpu
  operator: Exists

# DoesNotExist - avoid spot instances
- key: spot-instance
  operator: DoesNotExist

# Gt - nodes with more than 16 CPU cores
- key: cpu-cores
  operator: Gt
  values:
  - "16"

# NotIn - avoid development and testing nodes
- key: environment
  operator: NotIn
  values:
  - development
  - testing
```
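The operator semantics can be mimicked in a short Python sketch. This is an illustration only, with invented node labels; note that Kubernetes label values are strings, so Gt/Lt compare them as numbers, and NotIn/DoesNotExist also match nodes that lack the key entirely.

```python
# Illustrative evaluation of nodeAffinity operators against a node's labels.
def matches(expr, labels):
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":  # also matches nodes without the key
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op == "Gt":  # label value and operand are numeric strings
        return key in labels and int(labels[key]) > int(values[0])
    if op == "Lt":
        return key in labels and int(labels[key]) < int(values[0])
    raise ValueError(f"unknown operator: {op}")

node = {"gpu": "true", "cpu-cores": "32", "environment": "production"}
print(matches({"key": "gpu", "operator": "Exists"}, node))                      # True
print(matches({"key": "cpu-cores", "operator": "Gt", "values": ["16"]}, node))  # True
print(matches({"key": "environment", "operator": "NotIn",
               "values": ["development", "testing"]}, node))                    # True
print(matches({"key": "spot-instance", "operator": "Exists"}, node))            # False
```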
Pod Affinity & Anti-Affinity - Co-locating or Spreading Pods
What is Pod Affinity/Anti-Affinity? Pod Affinity/Anti-Affinity controls Pod placement based on OTHER Pods already running on nodes, not node labels.
Pod Affinity vs Anti-Affinity:
Feature | Pod Affinity | Pod Anti-Affinity |
---|---|---|
Purpose | Co-locate pods | Spread pods apart |
Use Case | Cache near app | High availability |
Example | Redis near web app | Spread replicas across nodes |
Topology | Same node/zone | Different nodes/zones |
Why Use Pod Affinity?
- Performance: Place cache near application (reduce latency)
- Data Locality: Co-locate data processing with data source
- Communication: Keep tightly-coupled services together
- Cost: Reduce inter-zone data transfer costs
Why Use Pod Anti-Affinity?
- High Availability: Spread replicas across failure domains
- Resource Distribution: Avoid resource contention
- Fault Tolerance: Survive node/zone failures
- Load Balancing: Distribute load across infrastructure
Understanding Topology Keys:
The topologyKey defines the scope of affinity/anti-affinity:
Topology Key | Scope | Use Case |
---|---|---|
kubernetes.io/hostname | Single node | Co-locate on same node |
topology.kubernetes.io/zone | Availability zone | Spread across zones |
topology.kubernetes.io/region | Cloud region | Multi-region deployment |
Custom labels | Custom domains | Custom topology |
Pod Affinity - Co-locating Pods
Example 1: Cache Near Application (Required)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cache-pod
  labels:
    app: cache
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web
        topologyKey: kubernetes.io/hostname  # Same node
  containers:
  - name: redis
    image: redis:7
```
How it works:
- Scheduler finds nodes running Pods with the label app=web
- Schedules cache-pod on the same node as a web Pod
- If no web Pods exist, cache-pod stays Pending
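Those steps amount to: collect the topology domains that already host a matching Pod, then restrict scheduling to nodes in those domains. A toy sketch with invented Pod and node data (not actual scheduler code):

```python
# Toy sketch of required podAffinity: find nodes whose topology domain
# already hosts a Pod matching the label selector.

running_pods = [
    {"labels": {"app": "web"}, "node": "node1"},
    {"labels": {"app": "db"}, "node": "node2"},
]
nodes = ["node1", "node2", "node3"]

def eligible_nodes(selector, topology_of):
    # Topology domains that already host a Pod matching the selector.
    domains = {
        topology_of(p["node"])
        for p in running_pods
        if all(p["labels"].get(k) == v for k, v in selector.items())
    }
    return [n for n in nodes if topology_of(n) in domains]

# With kubernetes.io/hostname as the topologyKey, each node is its own domain.
print(eligible_nodes({"app": "web"}, topology_of=lambda node: node))    # ['node1']
print(eligible_nodes({"app": "cache"}, topology_of=lambda node: node))  # [] -> Pending
```

With a zone-level topologyKey, `topology_of` would map each node to its zone instead, so any node in the same zone as a matching Pod becomes eligible.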
Example 2: Data Processing Near Data Source (Preferred)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: processor-pod
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - database
          topologyKey: topology.kubernetes.io/zone  # Same zone
  containers:
  - name: processor
    image: data-processor:latest
```
Example 3: Multi-tier Application Co-location
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 3
  selector:
    matchLabels:
      tier: backend
  template:
    metadata:
      labels:
        tier: backend
        app: myapp
    spec:
      affinity:
        podAffinity:
          # Co-locate with the frontend
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  tier: frontend
                  app: myapp
              topologyKey: kubernetes.io/hostname
      containers:
      - name: backend
        image: backend:latest
```
Pod Anti-Affinity - Spreading Pods Apart
Example 1: High Availability Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web
            topologyKey: kubernetes.io/hostname  # Different nodes
      containers:
      - name: web
        image: nginx
```
Result: Each replica runs on a different node for fault tolerance.
Example 2: Spread Across Availability Zones
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical
  template:
    metadata:
      labels:
        app: critical
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: critical
            topologyKey: topology.kubernetes.io/zone  # Different zones
      containers:
      - name: app
        image: critical-app:latest
```
Result: Each replica in different availability zone (survives zone failure).
Example 3: Preferred Anti-Affinity (Soft)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flexible-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: flexible
  template:
    metadata:
      labels:
        app: flexible
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: flexible
              topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: app:latest
```
Result: Tries to spread across nodes, but allows multiple on same node if needed.
Example 4: Combined Affinity + Anti-Affinity
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: smart-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: smart
  template:
    metadata:
      labels:
        app: smart
        tier: backend
    spec:
      affinity:
        # Co-locate with the database (same zone for low latency)
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: database
              topologyKey: topology.kubernetes.io/zone
        # Spread replicas across nodes (high availability)
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: smart
            topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: smart-app:latest
```
Result:
- Replicas spread across different nodes (HA)
- All replicas prefer same zone as database (performance)
Taints and Tolerations - Repelling and Allowing Pods
What are Taints and Tolerations?
- Taints: Applied to nodes to repel Pods (unless they tolerate the taint)
- Tolerations: Applied to Pods to allow scheduling on tainted nodes
Taints vs Affinity:
Feature | Taints | Node Affinity |
---|---|---|
Applied to | Nodes | Pods |
Purpose | Repel pods | Attract pods |
Default | Blocks all pods | Allows all pods |
Use Case | Dedicated nodes | Hardware requirements |
Why Use Taints?
- Dedicated Nodes: Reserve nodes for specific workloads (GPU, database)
- Node Maintenance: Prevent scheduling during maintenance
- Special Hardware: Isolate expensive resources
- Workload Isolation: Separate production from development
- Node Problems: Mark problematic nodes
Taint Format:
key=value:effect
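The key=value:effect string is easy to take apart. A small illustrative parser (my own sketch, not kubectl's implementation; note that the value part is optional in real kubectl, which this handles too):

```python
# Illustrative parser for the key=value:effect taint string used by
# `kubectl taint`. The value is optional, so "maintenance:NoExecute"
# is also valid.

def parse_taint(spec):
    body, _, effect = spec.rpartition(":")  # effect follows the last colon
    key, _, value = body.partition("=")     # value is optional
    return {"key": key, "value": value or None, "effect": effect}

print(parse_taint("dedicated=database:NoSchedule"))
# {'key': 'dedicated', 'value': 'database', 'effect': 'NoSchedule'}
print(parse_taint("maintenance:NoExecute"))
# {'key': 'maintenance', 'value': None, 'effect': 'NoExecute'}
```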
Taint Effects Explained
1. NoSchedule (Hard Restriction)
- What it does: Prevents NEW Pods from scheduling
- Existing Pods: Continue running
- Use when: You want to dedicate a node but keep existing workloads

```bash
kubectl taint nodes node1 dedicated=database:NoSchedule
```

2. PreferNoSchedule (Soft Restriction)
- What it does: Tries to avoid scheduling, but not guaranteed
- Existing Pods: Continue running
- Use when: You prefer to avoid the node but allow it if necessary

```bash
kubectl taint nodes node1 maintenance=soon:PreferNoSchedule
```

3. NoExecute (Eviction)
- What it does: Prevents new Pods AND evicts existing Pods
- Existing Pods: Evicted unless they tolerate the taint
- Use when: You need to clear the node immediately

```bash
kubectl taint nodes node1 hardware=faulty:NoExecute
```
Managing Taints
Add taint:
```bash
# Basic taint
kubectl taint nodes node1 key=value:NoSchedule

# GPU node
kubectl taint nodes gpu-node1 nvidia.com/gpu=true:NoSchedule

# Production node
kubectl taint nodes prod-node1 environment=production:NoSchedule

# Maintenance
kubectl taint nodes node1 maintenance=true:NoExecute
```

View taints:
```bash
kubectl describe node node1 | grep Taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```

Remove taint:
```bash
# Remove a specific taint
kubectl taint nodes node1 key:NoSchedule-

# Remove all taints with this key
kubectl taint nodes node1 key-
```
Tolerations in Pods
Toleration Operators:
- Equal: Key, value, and effect must match
- Exists: Only key and effect must match (any value)
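The two operators reduce to a simple predicate. A sketch of the matching rule (illustrative, with invented example data; per the Kubernetes rules, an empty key with Exists tolerates every taint, and an empty effect matches all effects):

```python
# Illustrative check of whether a toleration tolerates a taint.
# Mirrors the Equal/Exists rules; empty key + Exists tolerates all taints.

def tolerates(toleration, taint):
    # An empty effect on the toleration matches any taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if not toleration.get("key"):  # empty key: only valid with Exists
        return toleration.get("operator") == "Exists"
    if toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator") == "Exists":
        return True
    return toleration.get("value") == taint["value"]  # Equal (the default)

taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
print(tolerates({"key": "dedicated", "operator": "Equal",
                 "value": "gpu", "effect": "NoSchedule"}, taint))   # True
print(tolerates({"key": "dedicated", "operator": "Exists",
                 "effect": "NoSchedule"}, taint))                   # True
print(tolerates({"operator": "Exists"}, taint))                    # True (tolerate all)
print(tolerates({"key": "dedicated", "operator": "Equal",
                 "value": "db", "effect": "NoSchedule"}, taint))    # False
```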
Example 1: Equal Operator
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: cuda-app
    image: nvidia/cuda:11.0
```
Example 2: Exists Operator (Tolerate Any Value)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flexible-pod
spec:
  tolerations:
  - key: "dedicated"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx
```
Example 3: Tolerate All Taints
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: system-pod
spec:
  tolerations:
  - operator: "Exists"  # Tolerates everything
  containers:
  - name: monitoring
    image: prometheus:latest
```
Example 4: Toleration with Timeout (NoExecute)
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-pod
spec:
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300  # Stay for 5 minutes before eviction
  containers:
  - name: app
    image: nginx
```
Example 5: Multiple Tolerations
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: multi-tolerant-pod
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  - key: "environment"
    operator: "Equal"
    value: "production"
    effect: "NoSchedule"
  - key: "node.kubernetes.io/disk-pressure"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: app
    image: myapp:latest
```
Built-in Taints
Kubernetes automatically adds taints for node conditions:
Taint Key | Condition | Effect |
---|---|---|
node.kubernetes.io/not-ready | Node not ready | NoExecute |
node.kubernetes.io/unreachable | Node unreachable | NoExecute |
node.kubernetes.io/memory-pressure | Low memory | NoSchedule |
node.kubernetes.io/disk-pressure | Low disk | NoSchedule |
node.kubernetes.io/pid-pressure | Too many processes | NoSchedule |
node.kubernetes.io/network-unavailable | Network issue | NoSchedule |
node.kubernetes.io/unschedulable | Node cordoned | NoSchedule |
DaemonSets automatically tolerate these taints!
Production Examples
Example 1: Dedicated GPU Nodes
```bash
# Set up GPU nodes
kubectl taint nodes gpu-node1 nvidia.com/gpu=true:NoSchedule
kubectl taint nodes gpu-node2 nvidia.com/gpu=true:NoSchedule
kubectl label nodes gpu-node1 gpu-type=v100
kubectl label nodes gpu-node2 gpu-type=v100
```
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - v100
  containers:
  - name: training
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1
```
Example 2: Production/Development Isolation
```bash
# Taint production nodes
kubectl taint nodes prod-node1 environment=production:NoSchedule
kubectl taint nodes prod-node2 environment=production:NoSchedule

# Label for identification
kubectl label nodes prod-node1 environment=production
kubectl label nodes prod-node2 environment=production
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: production-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: prod-app
  template:
    metadata:
      labels:
        app: prod-app
    spec:
      tolerations:
      - key: environment
        operator: Equal
        value: production
        effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: environment
                operator: In
                values:
                - production
      containers:
      - name: app
        image: prod-app:v1.0
```
Example 3: Node Maintenance
```bash
# Mark the node for maintenance
kubectl taint nodes node1 maintenance=true:NoExecute

# Pods without a matching toleration will be evicted.
# Add a toleration to critical pods that should stay:
```
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-monitoring
spec:
  tolerations:
  - key: maintenance
    operator: Equal
    value: "true"
    effect: NoExecute
    tolerationSeconds: 3600  # Stay for 1 hour
  containers:
  - name: monitor
    image: monitoring:latest
```
Combining Strategies - Complete Example
Scenario: High-availability web application with database
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
        tier: frontend
    spec:
      # Tolerate the production taint
      tolerations:
      - key: environment
        operator: Equal
        value: production
        effect: NoSchedule
      affinity:
        # MUST run on production SSD nodes
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: environment
                operator: In
                values:
                - production
              - key: disktype
                operator: In
                values:
                - ssd
        # PREFER the same zone as the database
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: database
              topologyKey: topology.kubernetes.io/zone
        # SPREAD replicas across nodes for HA
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: web-app:v2.0
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
```
Best Practices
1. Use Taints for Dedicated Nodes: GPU nodes, high-memory nodes, special hardware.
2. Combine with Node Affinity: taints repel, affinity attracts; use both for precise control.
3. Use Pod Anti-Affinity for HA: spread replicas across nodes/zones to survive node or zone failures.
4. Test Before Production: verify scheduling behavior and check actual Pod placement.
5. Document Taints: label nodes clearly and document each taint's purpose.
6. Monitor Pending Pods: watch for scheduling failures and check for affinity/toleration mismatches.
7. Use Preferred When Possible: soft requirements are more flexible and leave fallback options.
Conclusion
Advanced scheduling ensures optimal Pod placement. Use Node Affinity to attract Pods to the right nodes, Taints and Tolerations to keep them off dedicated nodes, and Pod Affinity/Anti-Affinity to co-locate or spread them.
Next: DaemonSets and Jobs