Best Practices for Ramps in Performance Tests

One of the most common objectives in performance-testing containerized environments is to see how much load a pod can handle before you need to autoscale more pods. Since many autoscaling Kubernetes clusters trigger scaling on CPU utilization, in practice this means testing how much throughput a pod can take before it exceeds a certain level of CPU utilization. (We think a business metric like failed transactions or response latency is better than pure CPU utilization, but that's a story for another post.)

One way people find this limit is to gradually ramp up the load on a machine and watch for the point where it exceeds a given CPU utilization. This is exactly what the TUSSLE framework does, although with the more valuable metric of max latency. However, there are pitfalls in this approach that can leave you basing business decisions on garbage data if you're not careful. Ramps in performance tests are very useful, but only if you follow a few best practices.

Give each step in your ramp enough time

When you're running a ramp capacity test, the main knobs you can turn are:

  • Initial load
  • Eventual load
  • The rate at which you grow the load
  • How long you stay at each load

Of these, the most critical one for drawing valid conclusions is how long you stay at each load. The effects of handling a higher load, things like rising allocation rates and increased garbage collection, often take a long time to show up, and live sets can take a long time to build up to their breaking point. So if each step in your ramp takes 5 minutes, but it takes 15 minutes for the bad effects of the new load to show up, your real failure point will be at a much lower load than your test results suggest.
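
To make the knobs concrete, here is a minimal sketch of a ramp driver, assuming a hypothetical load generator with applyLoad() and recordMetrics() hooks; it is not TUSSLE's API, just an illustration of exposing the four knobs and holding each step long enough for the slow-building effects to surface.

```java
import java.time.Duration;

// Minimal sketch of a ramp driver exposing the four knobs above.
// applyLoad() and recordMetrics() are hypothetical hooks into your
// own load generator, not any particular framework's API.
public class RampDriver {
    static final double INITIAL_LOAD_TPS = 1_000;    // initial load
    static final double EVENTUAL_LOAD_TPS = 10_000;  // eventual load
    static final double STEP_INCREMENT_TPS = 500;    // rate at which the load grows
    static final Duration STEP_DURATION = Duration.ofMinutes(15); // how long you stay at each load

    public static void main(String[] args) throws InterruptedException {
        for (double tps = INITIAL_LOAD_TPS; tps <= EVENTUAL_LOAD_TPS; tps += STEP_INCREMENT_TPS) {
            applyLoad(tps);                          // point the load generator at the new rate
            Thread.sleep(STEP_DURATION.toMillis());  // hold long enough for GC/live-set effects to appear
            recordMetrics(tps);                      // capture CPU, latency, throughput for this step
        }
    }

    static void applyLoad(double tps)     { /* hook into your load generator */ }
    static void recordMetrics(double tps) { /* scrape Prometheus, GC logs, etc. */ }
}
```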


“So how do I know that I’ve given enough time at each step in my ramp?” you might ask. Here are a couple ideas.

Check your GC logs

Look at various things in your GC logs. It's fairly simple to tell whether JIT compilation has settled by looking at the compiler's CPU utilization and the Tier 2 compile queue. You can also check whether the live set is still increasing or has stabilized, whether overall memory utilization has stabilized, and so forth.
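
As one rough way to automate the "is the live set still growing?" check, the sketch below scans a GC log for the common before->after(total) heap summary pattern and compares early versus late post-GC occupancy. The regex assumes OpenJDK-style unified logging output and a log path passed as the first argument; adapt it to whatever your JVM actually writes.

```java
import java.nio.file.*;
import java.util.*;
import java.util.regex.*;

// Rough sketch: parse "before->after(total)" heap figures from a GC log
// and check whether the post-GC live set is still growing.
public class LiveSetTrend {
    // Matches e.g. "GC(42) Pause Young (Normal) ... 300M->120M(512M) 4.567ms"
    private static final Pattern HEAP = Pattern.compile("(\\d+)M->(\\d+)M\\((\\d+)M\\)");

    public static void main(String[] args) throws Exception {
        List<Long> afterGc = new ArrayList<>();
        for (String line : Files.readAllLines(Path.of(args[0]))) {
            Matcher m = HEAP.matcher(line);
            if (m.find()) {
                afterGc.add(Long.parseLong(m.group(2))); // heap occupancy after GC, in MB
            }
        }
        if (afterGc.size() < 20) {
            System.out.println("Not enough GC events to judge a trend yet.");
            return;
        }
        // Compare the average post-GC occupancy of the first and last quarters of the run.
        int q = afterGc.size() / 4;
        double early = afterGc.subList(0, q).stream().mapToLong(Long::longValue).average().orElse(0);
        double late = afterGc.subList(afterGc.size() - q, afterGc.size())
                             .stream().mapToLong(Long::longValue).average().orElse(0);
        System.out.printf("Post-GC occupancy: early avg %.0f MB, late avg %.0f MB%n", early, late);
        System.out.println(late > early * 1.10 ? "Live set still growing" : "Live set looks stable");
    }
}
```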

But really, the proof is in the pudding: you should be getting repeatable data for each step and from run to run, with your key business metrics (transactions per second, orders completed, latency numbers) stabilized. If you're still seeing noisy results or high variance between runs, keep extending the overall test run and the time you spend at each step.
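
One simple way to put a number on "noisy" is the coefficient of variation of a key metric across repeated runs of the same step. The sketch below is a generic helper with made-up sample data; the 10% threshold is only an illustrative cut-off, not a standard.

```java
import java.util.Arrays;

// Sketch: coefficient of variation (stddev / mean) across repeated runs
// at the same load step. A high value suggests the step time (or the
// overall run) is still too short to produce repeatable results.
public class RunVariance {
    static double coefficientOfVariation(double[] samples) {
        double mean = Arrays.stream(samples).average().orElse(0);
        double variance = Arrays.stream(samples)
                                .map(s -> (s - mean) * (s - mean))
                                .average().orElse(0);
        return mean == 0 ? 0 : Math.sqrt(variance) / mean;
    }

    public static void main(String[] args) {
        // e.g. p99 latency in ms from five runs of the same ramp step (made-up numbers)
        double[] p99Latency = {41.0, 44.5, 39.8, 71.2, 42.3};
        double cv = coefficientOfVariation(p99Latency);
        System.out.printf("CV = %.1f%%%n", cv * 100);
        if (cv > 0.10) { // illustrative threshold, not a standard
            System.out.println("High run-to-run variance: extend step time and rerun.");
        }
    }
}
```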

Warm up fully before starting your ramp

Make sure you handle JIT warmup explicitly as a separate step. Don't assume that JIT compilation will just take care of itself as part of the ramp-up. The best practice is to run at a reasonable rate, wait for JIT CPU utilization to calm down and the Tier 2 compile queue to settle, and then start the ramp test.
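
A minimal sketch of such a warmup gate, assuming it runs inside the JVM under test (or is adapted to a remote JMX connection): it polls the standard CompilationMXBean and treats a small increase in cumulative JIT time over a window as "settled." The 30-second window and 500 ms threshold are illustrative assumptions, not recommendations.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

// Sketch: block until JIT activity settles before the ramp is allowed to start.
public class WarmupGate {
    public static void awaitJitSettled() throws InterruptedException {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (!jit.isCompilationTimeMonitoringSupported()) {
            return; // fall back to a fixed warmup period if the JVM can't report this
        }
        long previous = jit.getTotalCompilationTime(); // cumulative ms spent in JIT so far
        while (true) {
            Thread.sleep(30_000);                      // illustrative 30-second window
            long current = jit.getTotalCompilationTime();
            long deltaMs = current - previous;
            System.out.printf("JIT time in last 30s: %d ms%n", deltaMs);
            if (deltaMs < 500) {                       // almost no new compilation: call it warmed up
                return;
            }
            previous = current;
        }
    }
}
```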

Allowing for warmup is especially important on an optimized JDK like Azul Platform Prime, which spends longer on JIT compilation in order to produce more performant code. Eventual warmed-up capacity is almost always higher with Prime, but your test could show a lower breaking point because the JVM is spending CPU on both JIT compilation and business logic. Consider using ReadyNow, Cloud Native Compiler, or other warmup optimizations to shorten the warmup curve.


Evaluating results

In general, we don't like focusing too narrowly on a black-and-white test like "When did I reach 75%?" We prefer to look at CPU behavior over the life of the test. Consider how you are actually collecting CPU utilization. If you're looking at a Grafana chart, Prometheus is probably using top, ps, or something similar to collect that data, and it's usually averaging CPU utilization over a window of 30 seconds to 2 minutes rather than reporting instantaneous one-second samples. That means lots of 75% spikes could be hidden inside your Grafana chart.
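
If you want to see what a coarse scrape interval hides, you can sample the JVM's own process CPU load once a second and compare the peaks against a rolling average. A sketch, using the com.sun.management OperatingSystemMXBean, with a 60-second window chosen only as an example:

```java
import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: sample process CPU load every second and compare the 1-second peak
// against a rolling 60-second average, to show what a coarse scrape interval hides.
public class CpuSampler {
    public static void main(String[] args) throws InterruptedException {
        OperatingSystemMXBean os =
            (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        Deque<Double> window = new ArrayDeque<>();
        while (true) {
            double load = os.getProcessCpuLoad();   // 0.0 - 1.0, this JVM's share of CPU
            if (load >= 0) {                        // negative means "not available yet"
                window.addLast(load);
                if (window.size() > 60) window.removeFirst();
                double avg  = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
                double peak = window.stream().mapToDouble(Double::doubleValue).max().orElse(0);
                System.out.printf("1s sample %.0f%%  60s avg %.0f%%  60s peak %.0f%%%n",
                                  load * 100, avg * 100, peak * 100);
            }
            Thread.sleep(1_000);
        }
    }
}
```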

Also, look at the trends throughout the whole run, not just the moment something tripped a trigger. Is CPU sustained right up against the threshold, or are the spikes isolated? Can you account for spikes in CPU by correlating them with things like GC cycles? Are your spikes pathological, evidence of thrashing, or controlled one-offs? You may need to look at more granular result data so the real picture isn't hidden inside long averaging windows. Good luck with your ramps in performance tests.
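
As an example of correlating CPU spikes with GC cycles: if you have per-second CPU samples and GC pause timestamps (however you captured them, e.g. parsed from the GC log), a simple join on timestamps shows which spikes line up with collections. The data structures and the two-second window below are placeholder assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

// Sketch: flag CPU spikes and check whether each one falls within a couple of
// seconds of a recorded GC pause. cpuSamples and gcPauses are placeholders for
// whatever your own collection pipeline produces.
public class SpikeCorrelation {
    public static void report(NavigableMap<Instant, Double> cpuSamples,
                              List<Instant> gcPauses,
                              double spikeThreshold) {
        for (Map.Entry<Instant, Double> sample : cpuSamples.entrySet()) {
            if (sample.getValue() < spikeThreshold) continue;
            boolean nearGc = gcPauses.stream()
                .anyMatch(gc -> Duration.between(gc, sample.getKey()).abs().getSeconds() <= 2);
            System.out.printf("%s CPU %.0f%% %s%n",
                sample.getKey(), sample.getValue() * 100,
                nearGc ? "(coincides with a GC pause)" : "(no nearby GC pause)");
        }
    }
}
```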

Performance Testing Is Hard

Our performance experts can help you get better performance at lower cloud costs.