Kubernetes Scheduling
Node Selectors, Affinity, and Taints Explained
I’m not going to waste your time with a fluffy introduction.
You’re here because your pods are sitting in Pending, or landing on the wrong nodes, or you’re trying to figure out why the scheduler ignores your carefully crafted affinity rules.
By the end of this, you’ll understand:
What node selectors, node affinity, and taints/tolerations actually do
When to use each one (and when not to)
How they interact (and why that breaks your scheduling)
The production failure modes nobody warns you about
Let’s go.
Node Selectors
Node selectors are the oldest, simplest way to control pod placement.
You label your nodes. You reference those labels in your pod spec. Done.
nodeSelector:
  disktype: ssd

This pod will only schedule on nodes labeled disktype=ssd. If no such node exists, the pod sits in Pending. Forever.
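For reference, the label goes on the node first, then the pod references it. A minimal sketch; the node name, pod name, and image are placeholders:

kubectl label nodes <node-name> disktype=ssd

apiVersion: v1
kind: Pod
metadata:
  name: fast-disk-pod        # placeholder name
spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: app
    image: nginx             # placeholder image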
Why it seems smart:
You have GPU nodes. You have CPU nodes. You have nodes with fast disks. Node selectors let you be explicit. Database pods go on SSD nodes. ML workloads go on GPU nodes. No guessing.
Why it breaks in production:
Node selectors are binary. Either the node matches or it doesn’t. There’s no fallback. No “prefer this but tolerate that.” If you label your database pods with disktype=ssd and all your SSD nodes are full, your database doesn’t start. Even if you have 50 empty CPU nodes sitting idle.
Worse, node selectors compose poorly with autoscaling. Your autoscaler spins up a new node. But that node doesn’t have the disktype=ssd label yet. Maybe it takes 30 seconds for your labeling script to run. Maybe it never runs because you forgot to add it to your node provisioning template. Your pod sits in Pending on a cluster that has plenty of capacity.
And then there’s label drift. Someone manually changes a label. Or a node replacement happens and the new node doesn’t get labeled correctly. Your production database suddenly can’t schedule because the label isn’t there.
When to use node selectors:
Hard requirements. GPU workloads that literally cannot run without a GPU. Compliance workloads that must run in a specific zone or on specific hardware. Binary decisions.
When not to use node selectors:
Preferences. Performance optimizations. Anything where “this would be better, but that works too” applies. Don’t use node selectors for soft requirements. That’s what affinity is for.
Node Affinity
Node affinity is node selectors with nuance.
Instead of “this node or nothing,” you get “prefer this node, but I’ll take that one if needed.”
There are two types:
requiredDuringSchedulingIgnoredDuringExecution: Hard requirement. Acts like a node selector.
preferredDuringSchedulingIgnoredDuringExecution: Soft preference. The scheduler tries, but doesn't guarantee.
Here’s what “preferred” looks like:
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: disktype
          operator: In
          values:
          - ssd

The scheduler scores nodes. Nodes matching your preference get bonus points. But if no matching nodes exist, your pod still schedules somewhere.
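For contrast, the hard-requirement form uses requiredDuringSchedulingIgnoredDuringExecution with nodeSelectorTerms. A minimal sketch of the same disktype rule as a hard requirement:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype
          operator: In
          values:
          - ssd

If no node matches, the pod stays Pending, exactly like a node selector.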
Why it seems smart:
Flexibility. You’re saying, “I want SSD nodes, but if they’re all full, I’ll take spinning disk.” You’re being pragmatic. Your pods don’t get stuck in Pending just because the ideal node isn’t available.
Why it breaks in production:
The scheduler doesn’t care about your operational reality. It cares about scoring.
Your preferred affinity says, “I prefer nodes in us-east-1a.” Great. The scheduler puts your pod there. But us-east-1a already has 90% of your pods. You just created a single point of failure. One AZ outage takes down your entire service.
Or you set a weight. You think “weight: 100 is high priority.” But the scheduler also considers other factors. Node resource availability. Spreading. Taints. Your weight might get overwhelmed by other scoring rules. Your pod lands somewhere you didn’t expect.
And then there’s the “Ignored During Execution” part. That’s not a detail. That’s a landmine. Once your pod is running, the affinity rules stop mattering. You can change the node labels. You can remove the nodes entirely. Your pod stays where it is. It doesn’t reschedule. It doesn’t care that it no longer matches your affinity rules.
This means your cluster drifts. Over time, your pods end up on nodes that don’t match your intent. You have to manually drain and reschedule to fix it.
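The usual way to force that reschedule is to drain the node so its pods go back through the scheduler. A sketch, with <node-name> as a placeholder:

# Evict pods from the node so they get rescheduled against current rules
kubectl drain <node-name> --ignore-daemonsets
# When the node is back in the state you want, let it accept pods again
kubectl uncordon <node-name>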
When to use node affinity:
Soft preferences that improve performance but aren’t critical. “I’d like to be near the database nodes, but I’ll survive elsewhere.” “I prefer high-memory nodes, but I can run on standard nodes.”
When not to use node affinity:
Critical placement decisions. If it matters for correctness, use a hard requirement. If it matters for availability, don’t use affinity at all—use pod anti-affinity or topology spread constraints to force distribution.
Taints and Tolerations
Taints are the opposite of affinity. Instead of pulling pods toward nodes, they push pods away.
You taint a node. Pods without a matching toleration won’t schedule there.
# On the node
taints:
- key: workload
  value: batch
  effect: NoSchedule

# In your pod spec
tolerations:
- key: workload
  operator: Equal
  value: batch
  effect: NoSchedule

Only pods with this toleration can schedule on that node. Everything else is blocked.
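In practice the taint is usually applied from the CLI rather than by editing the node object. The equivalent of the node snippet above, with the node name as a placeholder:

kubectl taint nodes <node-name> workload=batch:NoSchedule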
Why it seems smart:
Isolation. You have expensive GPU nodes. You don’t want random pods landing there and wasting resources. You taint the GPU nodes. Only ML workloads tolerate the taint. Problem solved.
Or you have nodes that are about to be drained. You taint them with NoSchedule. New pods avoid them. Existing pods stay until you’re ready to evict them.
Why it breaks in production:
Taints are invisible to most engineers. Node selectors and affinity are in the pod spec. You can see them. Taints live on the node. Someone adds a taint to “temporarily” isolate a node for debugging. They forget to remove it. Now your pods can’t schedule, and nobody knows why.
Worse, taints stack. You taint nodes for GPU workloads. You taint nodes for specific teams. You taint nodes during maintenance. Suddenly, your pod needs three tolerations just to find a node. And if it’s missing even one, it sits in Pending.
And then there's the NoExecute effect. This one is brutal. NoSchedule prevents new pods from landing on a node. NoExecute kicks off existing pods. If you taint a node with NoExecute and your pods don't tolerate it, they get evicted. Immediately. No delay. No warning.
You’re doing a rolling node upgrade. You taint the old nodes with NoExecute to force pods onto new nodes. But you forgot that some pods have long startup times. They get evicted, reschedule, start initializing, and then get evicted again because you’re draining the next node. You’re in an eviction loop.
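If pods genuinely need to survive a NoExecute taint for a while (long startup, in-flight work), tolerationSeconds gives them a bounded window before eviction. A sketch; the 300-second value is an arbitrary example:

tolerations:
- key: workload
  operator: Equal
  value: batch
  effect: NoExecute
  tolerationSeconds: 300   # tolerate the taint for 5 minutes, then get evicted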
When to use taints:
Dedicated workloads. GPU nodes that should only run ML jobs. Compliance nodes that should only run regulated workloads. Temporary isolation during maintenance (but set a reminder to remove the taint).
When not to use taints:
General workload separation that could be handled by affinity. Anything where you’re just trying to “encourage” certain placement. Taints are exclusionary. They’re a hard boundary. Don’t use them for soft preferences.
How They Fight Each Other
Here’s where it gets messy.
You can use all three at once. Node selectors, affinity, and tolerations. Each one filters the set of nodes your pod can land on.
The scheduler evaluates them in order:
Node selectors eliminate nodes that don’t match labels.
Required affinity eliminates more nodes.
Taints eliminate nodes without matching tolerations.
Preferred affinity scores the remaining nodes.
Each filter shrinks the pool. If your pool goes to zero, your pod is Pending.
The production trap:
You add a node selector for disktype=ssd. Makes sense. You need a fast disk.
Then someone adds a taint to the SSD nodes to reserve them for databases. Your pod isn’t a database, so it doesn’t tolerate the taint.
Now your pod requires SSD nodes (node selector), but is blocked from SSD nodes (taint). Pending forever.
Or you use preferred affinity to stay in us-east-1a. But us-east-1a nodes are tainted during a maintenance window. Your pod can't schedule there. It falls back to us-east-1b. But you also have anti-affinity rules that prevent multiple replicas on the same node. All the us-east-1b nodes already have a replica. Pending.
The filters are stacked. Each one was reasonable. Together, they created an unsolvable constraint.
What to do instead:
Minimize rules. Every selector, every affinity, every taint is a constraint. The more constraints, the smaller your scheduling surface.
Ask: Does this rule prevent an outage? If not, make it a soft preference or remove it.
Use anti-affinity and topology spread for availability. Use taints for hard isolation. Use affinity sparingly. Use node selectors only when something literally cannot work elsewhere.
And test your constraints. Simulate node failures. Simulate maintenance windows. Make sure your pods can still schedule when half your cluster is unavailable.
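A cheap way to simulate that without touching labels or taints: cordon a chunk of nodes and force a test workload back through the scheduler. Names below are placeholders:

# Mark a node unschedulable to simulate losing it
kubectl cordon <node-name>
# Force the workload back through the scheduler
kubectl rollout restart deployment/<your-deployment>
# Undo when you're done
kubectl uncordon <node-name>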
The Real Failure Modes
Pending Pods With “Plenty” of Capacity
60% CPU utilization. 50% memory. Plenty of room.
Your pod sits in Pending.
kubectl describe pod says: 0/12 nodes are available: 4 node(s) didn't match node selector, 5 node(s) had taint that the pod didn't tolerate, 3 node(s) didn't match pod affinity rules.
You have 12 nodes. Zero of them pass all your filters.
This happens because your rules were written in isolation. Node selector was added six months ago. Taint added last week. Affinity from a copied template. Nobody checked if they’re compatible.
The fix: Audit your constraints. For every critical pod, ask: If I lose half my nodes, can this still schedule? If no, you’re fragile.
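A few commands make that audit concrete; the pod name is a placeholder:

# What labels do the nodes actually carry?
kubectl get nodes --show-labels
# What taints are in play?
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Why exactly is this pod Pending? The Events section spells out each failed filter.
kubectl describe pod <pod-name>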
The Autoscaler That Makes Things Worse
Traffic spikes. Autoscaler adds nodes. Pods still Pending.
The new nodes don’t match your constraints. Missing labels. Wrong zone. Wrong instance type.
The autoscaler sees Pending pods. Adds more nodes. They don’t help either.
The fix: Match node provisioning to pod constraints. Apply labels during bootstrap. Before the node joins the cluster. Don’t rely on manual labeling.
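The exact mechanism depends on your provisioner, but the kubelet can self-label (and self-taint) at registration time, so the labels exist before the scheduler ever sees the node. A sketch of the flags:

# Kubelet flags, set in your node provisioning template / bootstrap config
--node-labels=disktype=ssd,workload=batch
--register-with-taints=workload=batch:NoSchedule

Most managed node groups and autoscaler node templates expose the same idea as configuration.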
The Slow Rollout That Never Finishes
Rolling update starts. First pod terminates. Replacement goes Pending.
The rollout stalls. New pod can’t schedule because all your nodes are full, and your required affinity won’t let it land anywhere else.
Old pods running. New pods Pending. Neither makes progress.
The fix: Leave headroom. Don’t run at 90% utilization with strict placement rules. Aim for 70-75%. Or relax your rules during deployments.
The Maintenance Window That Cascades
Node maintenance. Nodes get drained. Pods reschedule.
But your anti-affinity rules won’t let replicas share nodes. Not enough nodes remain. Pods sit in Pending.
Service degrades for 20 minutes until maintenance ends.
The fix: Use topologySpreadConstraints with whenUnsatisfiable: ScheduleAnyway. Try to spread, but if you can’t, schedule the pod somewhere instead of leaving it Pending.
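A minimal sketch, assuming the replicas carry an app: my-service label (a placeholder):

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway   # spread if possible, but schedule regardless
  labelSelector:
    matchLabels:
      app: my-service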
The Taint You Forgot to Remove
Someone taints a node for debugging. “I’ll remove it in 5 minutes.”
They don’t.
Three weeks later, the pods are Pending. You have capacity but can’t use it. Hours wasted before you notice the taint.
The fix: Automate taint removal with TTLs. Or use labels instead. Reserve taints for permanent isolation only.
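And when the taint is already there, removing it is the same kubectl taint command with a trailing dash, which is easy to forget exists:

kubectl taint nodes <node-name> workload=batch:NoSchedule-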
What the Docs Don’t Tell You
The scheduler is optimistic. It doesn’t warn you when you create unsatisfiable rules. It just marks pods Pending.
Preferred affinity is a lie. When the preferred nodes have capacity, it behaves like a hard requirement: everything piles onto them. You think you're being flexible. You're not.
Taints don't propagate. A NoSchedule taint leaves existing pods where they are. Only new pods are blocked. Your cluster drifts.
Node selectors are always AND. You can't create OR logic. Multiple selectors must all match. (Node affinity can express OR; see the sketch after this list.)
The scheduler doesn’t re-evaluate. Pods don’t move when labels or taints change. They stay where they are unless you force a reschedule.
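On the OR point: node affinity gets you there, because multiple nodeSelectorTerms are ORed, and multiple values under an In operator are ORed too. A sketch; the nvme, node-class, and high-io values are hypothetical labels:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      # Terms are ORed: match this one...
      - matchExpressions:
        - key: disktype
          operator: In
          values:          # values are ORed as well
          - ssd
          - nvme
      # ...or this one
      - matchExpressions:
        - key: node-class
          operator: In
          values:
          - high-io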
The Rules That Actually Work
1. Default to no constraints. Only add them when preventing a specific failure.
2. Use topology spread for availability. Not anti-affinity. Use whenUnsatisfiable: ScheduleAnyway.
3. Taint only for hard isolation. GPU nodes. Compliance workloads. Never temporarily.
4. Use soft affinity for performance, not correctness. Required affinity is for correctness only.
5. Label nodes automatically. During provisioning. Before they join the cluster.
6. Test constraint combinations. Count valid nodes. Simulate failures. Ensure enough headroom.
7. Monitor pending pods. Alert when Pending for more than 2 minutes. Fix the constraint, not the pod.
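For point 7, a sketch of a Prometheus alert, assuming kube-state-metrics is installed; the rule name and threshold are placeholders:

groups:
- name: scheduling
  rules:
  - alert: PodStuckPending
    expr: sum by (namespace, pod) (kube_pod_status_phase{phase="Pending"}) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.namespace }}/{{ $labels.pod }} has been Pending for over 2 minutes"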
That’s it.
Node selectors, affinity, and taints give you control. But control is a trap. The more you take, the more ways you break scheduling.
Most clusters are better with fewer rules. Let the scheduler work.
Only constrain when you have to. And when you do, test it.
Because in production, things always go wrong.
Consider subscribing, or else your production will be 😂.