We have a production readiness policy to add
limits to every pod.
- Some person
Why is our p99 latency so bad? Oh, wait! Our pods seem to be consistently throttled on CPU way before reaching their CPU limits.
- Same person
Wow, our response time improved by 80%+ after removing
limits - without increasing CPU usage.
- Same person, after reading this post
The short version
CPU limits is not about protecting processes from CPU-hungry neighbours, nor is it about defining a guaranteed CPU bandwidth quota1. It is about limiting CPU utilization.
The long version
I believe CPU limits is the most misunderstood (and abused) feature in Kubernetes - requests being a close second. I suspect that's because its trade-off has consequences that are subtle and counter-intuitive.
I often hear people explain CPU resource management in Kubernetes somewhat like this:
- requests defines a guaranteed lower bound on access to CPU bandwidth (and also controls pod scheduling).
- limits defines a guaranteed upper bound on access to CPU bandwidth.
That would mean a container will have at least requests.cpu / node's total CPU percent of CPU bandwidth available at all times, but will use at most limits.cpu / node's total CPU at all times - making the combination of requests and limits a mechanism to control the CPU usage of a container, and to keep it within a certain range at all times.
And because it seems to give predictability to resource usage at no cost but configuration, it quickly becomes company policy, industry best practice, and interview question. However, this half-understanding ignores important aspects of requests and limits, frequently leading people to misuse them.
The truth about CPU limits
The “Managing Resources for Containers” documentation page clearly explains what limits actually does:
spec.containers.resources.limits.cpu is converted to its millicore value and multiplied by 100. The resulting value is the total amount of CPU time that a container can use every 100ms [emphasis added]. A container cannot use more than its share of CPU time during this interval.
Can you spot the difference between this definition and the problematic mental model? Limiting the amount of CPU time during a specific time interval is very different from enforcing a maximum CPU usage at all times.
If a container has limits.cpu=200m, it will have access to the CPU for 0.2 * 100ms = 20ms of every 100ms interval, and after that 20ms it is throttled.
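The conversion above can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not the kernel's implementation; it assumes the default 100ms CFS enforcement period that Kubernetes uses:

```python
# Sketch of how a CPU limit maps to a CFS quota, assuming the
# default 100ms enforcement period.
CFS_PERIOD_US = 100_000  # 100ms, in microseconds


def cfs_quota_us(limit_millicores: int) -> int:
    """CPU time (in microseconds) the container may use per 100ms period."""
    return limit_millicores * CFS_PERIOD_US // 1000


print(cfs_quota_us(200))   # 20000us = 20ms per period
print(cfs_quota_us(2000))  # 200000us = 200ms per period
```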
```
[_x___x____]
      ^
      |------ container is throttled

x = busy
_ = idle
```
If the container's CPU usage is more evenly distributed across the interval (e.g. busy for 2ms, then idle for 8ms), its observed CPU usage should be ~20% (i.e. it was busy on CPU in 20% of the samples). However, if it bursts to use as much CPU as it can, it will tend to consume its whole quota at the beginning of the period and then sit idle - even if the CPU is available!
What if your node has 2 vCPUs, and your runtime defaults to using as many threads in parallel as possible2? In that case, two threads can consume the 20ms of CPU quota in the first 10ms, and the observed CPU usage will likely be even lower than 20%.
```
[x_________]
[x_________]
  ^
  |------ container is throttled
```
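A toy simulation makes the burst penalty visible. This models only the quota accounting (the real CFS scheduler is far more involved); it assumes a 20ms quota per 100ms period and threads that are either fully busy or throttled:

```python
# Toy model: each runnable thread burns 1ms of quota per wall-clock ms
# until the quota is exhausted; the rest of the period is throttled.
def simulate_period(threads: int, quota_ms: int = 20, period_ms: int = 100) -> str:
    timeline = []
    remaining = quota_ms
    for _ in range(period_ms):
        if remaining > 0:
            remaining -= min(threads, remaining)
            timeline.append("x")  # busy on CPU
        else:
            timeline.append("_")  # throttled, even if the CPU sits idle
    return "".join(timeline)


print(simulate_period(threads=1))  # busy for 20ms, then throttled for 80ms
print(simulate_period(threads=2))  # quota gone after 10ms: throttled for 90ms
```

Note how adding parallelism makes the container run out of quota earlier in the period, lowering observed utilization while making the throttled stretch longer.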
What users expect is that when they set the limit to 1 CPU, they can continue to use 1 CPU continuously and not more than that. In reality though with CPU Quota applied, in a slot of CPU scheduling time, a pod can use more than 1 CPU then get unscheduled and not get access to CPU for the rest of the slot. This means the spikes are penalized very heavily which is not what users expect [emphasis added].
CPU thrashing can be handled with just requests. If requests are set appropriately, a single user cannot thrash other users sharing a node. […] In-order to guarantee latency, you need to set CPU requests to be equal to 99%ile usage. […] Unless a pod is Guaranteed, it can still affect other pods running on the node because limits is over-committed, while requests is not [emphasis added].
If you are still unconvinced, think about this example: a container with limits.cpu=2000m running on a node with 4 vCPUs can use 2 * 100ms = 200ms of CPU time every 100ms period. That does not mean 2 CPUs will be available all the time: the container can burn through all the node's CPU bandwidth in half the period (4 vCPUs * 50ms = 200ms), monopolizing the node's CPUs for the first half and leaving everyone else competing for all CPUs during the other half.
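The arithmetic, spelled out (again assuming the default 100ms CFS period):

```python
# A 2000m limit on a 4-vCPU node, under the default 100ms CFS period.
vcpus = 4
limit_cpus = 2                     # limits.cpu=2000m
period_ms = 100

quota_ms = limit_cpus * period_ms  # 200ms of CPU time granted per period
wallclock_ms = quota_ms / vcpus    # gone in 50ms if all 4 vCPUs spin

print(quota_ms, wallclock_ms)      # 200 50.0
```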
limits does not impose an upper bound on continued access to CPU bandwidth, nor does it protect containers from thrashing due to CPU-hungry neighbours. It will heavily penalize bursts of CPU and hurt your latency. It will keep CPU utilization low, though - if that's your only priority.
The truth about CPU requests
requests influence two aspects of Kubernetes: scheduling and quality of service (QoS).
From a scheduling perspective, requests is used as a filtering criterion3 for pod placement decisions. By making sure a pod is not scheduled on a node with less than its total requests available, Kubernetes guarantees that nodes are not over-committed in regard to requests. This aspect is usually well understood by practitioners: because its impact is limited to scheduling time, the feedback loop is short and it hardly goes unnoticed.
From a QoS perspective, requests determines both the lower bound for access to CPU bandwidth and how excess CPU resources are distributed. Quoting the “Resource Quality of Service in Kubernetes” design document:
Pods are guaranteed to get the amount of CPU they request, they may or may not get additional CPU time (depending on the other jobs running). […] Excess CPU resources will be distributed based on the amount of CPU requested. For example, suppose container A requests for 600 milli CPUs, and container B requests for 300 milli CPUs. Suppose that both containers are trying to use as much CPU as they can. Then the extra 100 milli CPUs will be distributed to A and B in a 2:1 ratio (implementation discussed in later sections).
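The 2:1 split in the quoted example follows directly from proportional sharing - the same weighting cgroups implement via cpu.shares. A minimal sketch of the arithmetic:

```python
# Excess CPU is divided in proportion to each container's requests,
# mirroring cgroup cpu.shares weighting.
def split_excess(excess_millicores: float, requests: dict) -> dict:
    total = sum(requests.values())
    return {name: excess_millicores * r / total for name, r in requests.items()}


# A requests 600m, B requests 300m; 100m of spare CPU splits 2:1.
print(split_excess(100, {"A": 600, "B": 300}))
```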
That means CPU requests actually provides guarantees of both minimum and maximum bounds for CPU bandwidth when multiple processes are competing for CPU. But if that's the case, why do people still use CPU limits?
Why do people still use CPU limits?
My main hypothesis is that people are introduced to limits under high-stakes incidents. And fear is a helluva disincentive.
For some reason, their nodes start to become unresponsive, and they need to sprinkle some stability magic dust. Right now.
They ignore that it might have been caused by their last cost-saving4 initiative, which reduced node count at the expense of increased pod density. Or that it might be related to some previous initiative, when they outsmarted the EKS recommendation for kubeReserved because they wanted to increase pod density. They are afraid; they wanted to have their cake and eat it too, and have been delivering cost savings on borrowed resources - but now the bill is due.
At that moment, they trade off throughput and latency for lower CPU utilization AND higher pod density. They save the day, and it becomes best practice. At least until they need higher throughput and lower latency again.
What shall I do?
- Don't use limits unless you are willing to trade off throughput and latency for lower CPU utilization.
- Properly use requests to define CPU bandwidth5 requirements.
- Don’t take shortcuts to cost savings.
- Read the documentation, even when it omits examples for the sake of brevity.
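In practice, that boils down to a pod spec like the following sketch - names and values are purely illustrative; the point is requests derived from observed usage, and no CPU limit:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: latency-sensitive-app      # illustrative name
spec:
  containers:
    - name: app
      image: example.com/app:1.0   # illustrative image
      resources:
        requests:
          cpu: 500m   # e.g. the p95 CPU usage over a representative window
        # intentionally no limits.cpu: the container may burst into idle
        # CPU, yet is still guaranteed its requested share under contention
```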
As in “this pod shall use at max 2 vCPUs at all time!”. ↩︎
Golang 1.5+ defaults to GOMAXPROCS=runtime.NumCPU, for example. See: https://github.com/golang/go/issues/33803 and https://twitter.com/embano1/status/1149654812595646471 ↩︎
kube-scheduler selects a node in two steps: filtering and scoring. The default scheduling policy contains the PodFitsResources predicate, which filters out nodes that don't match the pod's resource requirements. See the “Scheduling Policies” documentation for more details. ↩︎
Because nothing is better for Fridays than that nice “X% node count reduction” graph on Slack. ↩︎
The p90/p95 CPU usage at a 5m/1m resolution is usually a good starting point. See: