We have a production readiness policy to add limits to every pod.
- Some person

Why is our p99 latency so bad? Oh, wait! Our pods seem to be consistently throttled on CPU way before reaching their CPU limits!
- Same person

Wow, our response time improved by 80%+ after removing limits - without increasing CPU usage.
- Same person, after reading this post


CPU limits is not about protecting processes from CPU-hungry neighbours, nor is it about defining a guaranteed CPU bandwidth quota [1]. It is about limiting CPU utilization.

Instead of limits, set appropriate CPU requests along with reasonably generous Kube Reserved and System Reserved.

The long version

I believe CPU limits is the most misunderstood (and abused) feature in Kubernetes - requests being a close second. I suspect that’s because its trade-off has consequences that are subtle and counter-intuitive.

I often hear people explain CPU resource management in Kubernetes somewhat like this:

  • requests defines a guaranteed lower bound on access to CPU bandwidth (and also controls pod scheduling).
  • limits defines a guaranteed upper bound on access to CPU bandwidth.

That would mean a container always has at least requests.cpu / (node’s total CPU) of the CPU bandwidth available, and can always use up to limits.cpu / (node’s total CPU). The combination of requests and limits would then be a mechanism to control the CPU usage of a container and keep it within a certain range at all times.

And because it seems to give predictability to resource use at no cost but configuration, it quickly becomes company policy, industry best-practice, and interview question. However, this half-understanding ignores important aspects of limits and requests, frequently leading people to misuse them.

The truth about CPU limits

The “Managing Resources for Containers” documentation page clearly explains what limits does:

The spec.containers[].resources.limits.cpu is converted to its millicore value and multiplied by 100. The resulting value is the total amount of CPU time that a container can use every 100ms [emphasis added]. A container cannot use more than its share of CPU time during this interval.

Can you spot the difference between this definition and the problematic mental model? Limiting the amount of CPU time during a specific time interval is very different than enforcing a maximum CPU usage at all times.
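The conversion described in the documentation quote can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the parser assumes the limit is written either as plain CPUs ("2", "0.5") or with the "m" suffix ("200m").

```python
def cpu_to_millicores(cpu: str) -> int:
    """Parse a CPU quantity such as "200m", "2" or "0.5" into millicores.

    Assumption: only plain numbers and the "m" suffix are handled here.
    """
    cpu = cpu.strip()
    if cpu.endswith("m"):
        return int(cpu[:-1])
    return round(float(cpu) * 1000)


def quota_us_per_period(limit_cpu: str) -> int:
    """CPU time, in microseconds, the container may use in each 100ms period.

    Follows the quoted rule: millicore value multiplied by 100.
    """
    return cpu_to_millicores(limit_cpu) * 100


if __name__ == "__main__":
    for limit in ("200m", "0.5", "2"):
        print(limit, "->", quota_us_per_period(limit), "us per 100ms period")
```

Note that the result is a budget of CPU time per period, not a cap on how many CPUs can be in use at any instant.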

If a container has limits.cpu=200m it will have access to CPU during 0.2 * 100ms = 20ms on every 100ms interval, and after that 20ms it is throttled.

[Figure: timeline of one 100ms period (x = busy, _ = idle), marking the point at which the container is throttled.]

If the container’s CPU usage is more evenly distributed during the interval (e.g. busy for 2ms, then idle for 8ms), its observed CPU usage should be ~20% (i.e. it was busy on CPU in 20% of the samples). However, if it bursts to use as much CPU as it can, it will tend to consume its quota at the beginning of the period and then sit idle for the rest of it - even if the CPU is available!
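The difference between the two usage patterns can be illustrated with a toy simulation of a single 100ms period in 1ms ticks. It is a rough sketch of the accounting, not of the kernel scheduler: it assumes a single-threaded container, and the function names and the extra "T" symbol (wanted CPU but was throttled) are made up for this example.

```python
def simulate_period(demand, quota_ms, period_ms=100):
    """Simulate one period for a single-threaded container.

    demand(t) returns True if the container wants CPU during tick t.
    Returns (timeline, throttled_ms) with x = busy, _ = idle and
    T = wanted CPU but was throttled.
    """
    used_ms = 0
    throttled_ms = 0
    timeline = []
    for t in range(period_ms):
        if not demand(t):
            timeline.append("_")
        elif used_ms < quota_ms:
            used_ms += 1
            timeline.append("x")
        else:
            throttled_ms += 1
            timeline.append("T")
    return "".join(timeline), throttled_ms


def evenly_spread(t):
    # Busy for 2ms, then idle for 8ms, repeated.
    return t % 10 < 2


def bursty(t):
    # Wants CPU continuously for the first 60ms of the period.
    return t < 60


if __name__ == "__main__":
    for name, demand in (("even", evenly_spread), ("bursty", bursty)):
        timeline, throttled = simulate_period(demand, quota_ms=20)
        print(f"{name:7s} {timeline} throttled={throttled}ms")
```

With limits.cpu=200m (a 20ms quota), the evenly spread workload is never throttled, while the bursty one runs for 20ms and is then held back for the remaining 40ms of its burst.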

What if your node has 2 vCPUs, and your runtime defaults to using as many threads in parallel as possible [2]? In that case, two threads can consume 20ms of CPU time in the first 10ms of the period, and the observed CPU usage will likely be lower than 20%.

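[Figure: timeline of one 100ms period with two threads running in parallel, marking the point at which the container is throttled.]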

As mentioned by @vishh in kubernetes#51135, this is usually not what most people expect:

What users expect is that when they set the limit to 1 CPU, they can continue to use 1 CPU continuously and not more than that. In reality though with CPU Quota applied, in a slot of CPU scheduling time, a pod can use more than 1 CPU then get unscheduled and not get access to CPU for the rest of the slot. This means the spikes are penalized very heavily which is not what users expect [emphasis added].

An equally wrong consequence of that mental model is the belief that using limits is how one implements protection against CPU thrashing. And again, @vishh in kubernetes#51135 to the rescue:

CPU thrashing can be handled with just requests. If requests are set appropriately, a single user cannot thrash other users sharing a node. […] In-order to guarantee latency, you need to set CPU requests to be equal to 99%ile usage. […] Unless a pod is Guaranteed, it can still affect other pods running on the node because limits is over-committed, while requests is not [emphasis added].

If you are still unconvinced, think about this example: a container with requests.cpu=500m and limits.cpu=2000m running on a node with 4 vCPUs can use 2 * 100ms = 200ms of CPU time in every 100ms period. That does not mean it is held to 2 CPUs at all times: the container can still use all of the node’s CPU bandwidth for 50ms (4 * 50ms = 200ms), monopolizing the node’s CPUs for half of the period, and leaving everyone else competing for all CPUs during the other half.
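The arithmetic in this example, and in the earlier two-thread case, comes down to dividing the quota by the number of threads burning it in parallel. A minimal sketch, assuming every thread gets its own vCPU and runs flat-out from the start of the period (the function name is hypothetical):

```python
def busy_window_ms(limit_millicores, parallel_threads, period_ms=100):
    """Wall-clock milliseconds per period before the quota is exhausted.

    Assumption: each thread runs uninterrupted on its own vCPU, so the
    quota is consumed `parallel_threads` times faster than wall-clock time.
    """
    quota_ms = limit_millicores * period_ms / 1000
    return min(period_ms, quota_ms / parallel_threads)


if __name__ == "__main__":
    # limits.cpu=200m with 2 threads on a 2 vCPU node
    print(busy_window_ms(200, 2))
    # limits.cpu=2000m with 4 threads on a 4 vCPU node
    print(busy_window_ms(2000, 4))
```

The more parallelism a container has, the earlier in each period it runs out of quota, and the longer it waits for the next period to start.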

In summary, limits does not impose an upper bound on continued access to CPU bandwidth, nor does it protect containers from thrashing due to CPU-hungry neighbours. It will heavily penalize bursts of CPU, and affect your latency. It will keep CPU utilization low, though - if that’s your only priority.

The truth about CPU requests

CPU requests influence two aspects of Kubernetes: scheduling and quality of service (QoS).

From a scheduling perspective, requests is used as a filtering criterion [3] for pod placement decisions. By making sure a pod is not scheduled on a node with less than its total requests available, Kubernetes guarantees that nodes are not over-committed with regard to requests. This aspect is usually well understood by practitioners. Because its impact is limited to scheduling time, the feedback loop is short and mistakes rarely go unnoticed.
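The filtering idea can be sketched as follows. This is a deliberately simplified illustration, not kube-scheduler’s actual code: it only looks at CPU, all names and node data are hypothetical, and values are in millicores.

```python
def fits(allocatable_m, scheduled_requests_m, pod_request_m):
    """True if the pod's CPU request fits next to the requests already placed.

    Only requests are considered; limits and actual usage play no role.
    """
    return sum(scheduled_requests_m) + pod_request_m <= allocatable_m


def feasible_nodes(nodes, pod_request_m):
    """Filter step: keep the nodes that can hold the pod's request."""
    return [
        name
        for name, (allocatable_m, scheduled_m) in nodes.items()
        if fits(allocatable_m, scheduled_m, pod_request_m)
    ]


if __name__ == "__main__":
    # Hypothetical cluster: node name -> (allocatable CPU, requests already scheduled)
    nodes = {
        "node-a": (4000, [1500, 2000]),
        "node-b": (2000, [1800]),
    }
    print(feasible_nodes(nodes, 500))
```

Because the sum of requests on a node can never exceed what the node offers, requests cannot be over-committed, whereas nothing in this check looks at limits.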

From a QoS perspective, requests determines both the lower bound for access to CPU bandwidth and how excess CPU resources shall be distributed. Quoting the “Resource Quality of Service in Kubernetes” design document:

Pods are guaranteed to get the amount of CPU they request, they may or may not get additional CPU time (depending on the other jobs running). […] Excess CPU resources will be distributed based on the amount of CPU requested. For example, suppose container A requests for 600 milli CPUs, and container B requests for 300 milli CPUs. Suppose that both containers are trying to use as much CPU as they can. Then the extra 100 milli CPUs will be distributed to A and B in a 2:1 ratio (implementation discussed in later sections).
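The 2:1 split in the quoted example can be reproduced with a small calculation. The sketch below assumes a node with 1000 milli CPUs (which is what makes the "extra" amount 100) and that every container is trying to use as much CPU as it can; the function name is hypothetical.

```python
def cpu_under_contention(requests_m, node_cpu_m):
    """Split a node's CPU among containers that all want as much as possible.

    Each container gets its request, plus a share of the excess that is
    proportional to its request.
    """
    total_requested = sum(requests_m.values())
    excess = node_cpu_m - total_requested
    return {
        name: request + excess * request / total_requested
        for name, request in requests_m.items()
    }


if __name__ == "__main__":
    shares = cpu_under_contention({"A": 600, "B": 300}, node_cpu_m=1000)
    for name, share in shares.items():
        print(f"{name}: {share:.1f}m")
```

Container A ends up with roughly 667 milli CPUs and container B with roughly 333, so the ratio between their requests also sets the ratio of what they get under full contention.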

That means CPU requests actually provides guarantees of minimum and maximum bounds for CPU bandwidth when multiple processes are competing for CPU. But if that’s the case, why do people still use CPU limits?

Why do people still use CPU limits?

My main hypothesis is that people are introduced to limits during high-stakes incidents. And fear is a helluva disincentive. For some reason, their nodes start to become unresponsive, and they need to sprinkle stability magic dust. Right now.

They ignore that it might have been caused by their last cost-saving [4] initiative, which reduced node count at the expense of increased pod density. Or that it might be related to some previous initiative, when they outsmarted the EKS recommendation for kubeReserved because they wanted to increase pod density. They are afraid: they wanted to have their cake and eat it too, they have been delivering cost savings on borrowed resources, and now the bill is due.

At that moment, they trade off throughput and latency for lower CPU utilization AND higher pod density. They save the day, and it becomes best practice. At least, until they need higher throughput and lower latency again.

What shall I do?

  1. Don’t use limits unless you are willing to trade throughput and latency for lower CPU utilization.
  2. Properly use requests to define CPU bandwidth [5] requirements.
  3. Don’t take shortcuts to cost savings.
  4. Read the documentation, even when it omits examples for the sake of brevity.
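For step 2, the footnote suggests the p90/p95 CPU usage as a starting point for requests. A minimal sketch of that calculation is shown below, using the nearest-rank percentile method (one of several common definitions). The sample values are made up; in practice they would come from your monitoring system at the resolution you choose.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of samples, with pct as an integer 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, to avoid floating-point surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Hypothetical CPU usage samples for one container, in millicores.
    usage_m = [120, 150, 130, 400, 140, 135, 160, 145, 155, 900]
    print("p90:", percentile(usage_m, 90), "m")
    print("p95:", percentile(usage_m, 95), "m")
```

The median of these samples is about 145m, far below the p90 and p95 values, which is why sizing requests from averages tends to under-provision bursty workloads.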

Have fun!

  1. As in “this pod shall use at max 2 vCPUs at all times!”.

  2. Golang 1.5+ has a default GOMAXPROCS=runtime.NumCPU, for example. See: https://github.com/golang/go/issues/33803 and https://twitter.com/embano1/status/1149654812595646471

  3. kube-scheduler selects a node in two steps: filtering and scoring. The default scheduling policy contains the PodFitsResources predicate, which filters out nodes that don’t match the pod’s resource requirements. See the “Scheduling Policies” documentation for more details.

  4. Because nothing is better for Fridays than that nice “X% node count reduction” graph on Slack.

  5. The p90/p95 CPU usage at a 5m/1m resolution is usually a good starting point. See: