You may have heard that profiling can make your services faster, cheaper, and more reliable, but if you haven’t used a profiler before, it can seem like a dark art.
In this guide, we’ll explain profiling, then profile a sample service that has a performance problem and see how we can use Continuous Profiler to understand and fix the problem!
A profiler shows how much “work” each function is doing by collecting data about the program as it’s running. For example, infrastructure monitoring might show that your app servers are using 80% of their CPU but you may not know why. Profiling could then show you that function
doSomeWork is using 48% of the CPU, function
renderGraph is using another 19%, and so on. This is important when working on performance problems because many programs spend a lot of time in very few places and it’s often not obvious which places those are. Guessing at which parts of a program to optimize can often cause engineers to spend a lot of time and not get much in the way of results. By using a profiler, we find exactly which parts of the code we should optimize to get the most bang for our buck.
If you’ve used an APM tool, you might think of profiling like a “deeper” tracer that provides a very fine grained view of your code without needing any instrumentation.
Datadog Continuous Profiler can track various types of “work”, including CPU usage, amount and types of objects being allocated in memory, time spent waiting to acquire locks, amount of network or file I/O, and more! The profile types available depend on the language being profiled.
We’ve got an example service with a performance problem that you can easily experiment with. The example service has an API which allows searching a “database” of 5000 movies. We’ll fix a performance problem with it! This guide shows you the process, but for the full experience you can follow along with your own shell, browser, and IDE.
Get the example service up and running with:
git clone https://github.com/DataDog/dd-continuous-profiler-example.git cd dd-continuous-profiler-example echo "DD_API_KEY=YOUR_API_KEY" > docker.env docker-compose up -d
Once all the containers are built and running, you can jump into a “toolbox” container to explore:
docker exec -it dd-continuous-profiler-example_toolbox_1 bash
You can try out the API with:
curl -s http://movies-api-java:8080/movies?q=wars | jq
There’s also a Python version of the example service, called
movies-api-py. If that’s more your style, you can adjust the commands throughout the tutorial accordingly.
Let’s generate more traffic using the ApacheBench tool, ab. We’ll have it run 10 concurrent HTTP clients sending requests for 20 seconds. Still inside the toolbox container:
ab -c 10 -t 20 http://movies-api-java:8080/movies?q=the ... Reported latencies by ab: Percentage of the requests served within a certain time (ms) 50% 464 66% 503 75% 533 80% 553 90% 614 95% 683 98% 767 99% 795 100% 867 (longest request)
Head on over to Profile Search and look for a profile covering the period in which we were generating traffic. It may take a minute or so. You’ll be able to tell which profile includes the load test because the CPU usage will be higher:
When you open it, you’ll see a visualization of the profile that looks like this:
This is a flame graph. The most important things it shows are how much CPU each method used (since this is a CPU profile) and how each method was called. For example, reading from the second row from the top, we see that
QueuedThreadPool$2.run() (amongst other things), which called
QueuedThreadPool.runjob(Runnable), which called
ReservedTheadExecutor$ReservedThread.run(), and so on.
Zooming in to one area on the bottom of the flame graph, we see this:
The tooltip shows us that roughly 390ms (0.90%) of CPU time was spent within this
String.length() is right below
parse(), which means that
parse() calls it.
If we hover over
String.length(), we see it took about 112ms of CPU time. So we can also tell that 278ms seconds were spent directly in
parse(): 390ms - 112ms. That’s visually represented by the part of the
parse() box that doesn’t have anything below it.
It’s worth calling out that the flame graph does not represent the progression of time. Looking at this bit of the profile,
Gson$1.write() didn’t run before
TypeAdapters$16.write() but it may not have run after it either. They could have been running concurrently, or the program could have run several calls of one, then several calls of the other, and kept switching back and forth. The flame graph merges together all the times that a program was running the same series of functions so you can tell at a glance which parts of the code were using the most CPU without tons of tiny boxes showing each time a function was called.
Zooming back out, we can see that about 87% of CPU usage was within the
replyJSON() method. And if we look lower down,
we can see that
replyJSON(), and the methods it calls, eventually branch into four main code paths (“stack traces”), each of which run functions that have something to do with sorting and date parsing.
We also see a part of the CPU profile that looks like this:
The fact that it spent almost 6% of CPU time in garbage collection suggests that we may be producing a lot of garbage. So hop over to the Allocated Memory profile type:
On an Allocated Memory profile, the size of the boxes shows how much memory each function allocated, and again, the call stack that led to the function doing the allocating. Here we can see that during this one minute profile, the
replyJSON() method and other methods that it called, allocated 17.47 GiB, mostly related to the same date parsing code that we saw in the CPU profile above:
Time to look at the code and see what’s going on! Looking again at the CPU flame graph, we can see that these expensive code paths go through a lambda on line 66, which calls
That corresponds to this bit of code in
dd-continuous-profiler-example/java/src/main/java/movies-api/Server.java, where we see the call to
LocalDate.parse() on line 66:
This is the sorting logic in the API, to return results in order of descending release date. It does this by using the release date converted to a
LocalDate as the sorting key. We could cache these
LocalDates so we only parse each movie’s release date once rather than on every request, but there’s an even easier fix. We can notice that we’re parsing dates in ISO 8601 format (yyyy-mm-dd), which means we can sort them as strings and not parse them at all!
Go ahead and replace the whole try/catch with
return m.releaseDate; like this:
Then rebuild and restart the service:
docker-compose build movies-api-java docker-compose up -d
Let’s check the results! Generate traffic once more like we did earlier:
docker exec -it dd-continuous-profiler-example_toolbox_1 bash ab -c 10 -t 20 http://movies-api-java:8080/movies?q=the
The new results should look something like:
Reported latencies by ab: Percentage of the requests served within a certain time (ms) 50% 82 66% 103 75% 115 80% 124 90% 145 95% 171 98% 202 99% 218 100% 315 (longest request)
p99 went from 795ms to 218ms, and overall, this is four to six times faster than before!
If you find the profile containing this new load test and look at the CPU profile, you’ll see that the
replyJSON parts of the flame graph that we were looking at are now a much smaller percentage of the total CPU usage.
When you’re done exploring, you can clean up with:
Improving CPU usage like this can easily translate into saving money. If this had been a real service, this small improvement might have enabled us to scale down to half the servers, potentially saving thousands of dollars a year. Not bad for about 10 minutes of work!
We’ve only skimmed the surface here but this should give you a sense of how to get started. Give it a shot on your services!