I created two flamegraphs: one profiling a simple Ringo web app and one profiling an optimized version of the same app.
What an awesome visualization tool! I'll definitely create a flamegraph next time I need to profile something.
The tree visualization encouraged me to think about what actually triggers code to get executed, why some frames widen into broad plateaus (which is bad), and at which point I could most easily cut off such mountains.
One glaring example is Reinhardt's autoescaping feature.
Unoptimized application flamegraph. Highlighting the template autoescape stack.
Autoescaping requires Reinhardt to match every string against a regex to neutralize possible injections. The cost of autoescaping is clearly visible as a wide plateau in the graph above.
Below are the two links to the full SVG flamegraph files. The SVG allows you to double click on part of a stack to zoom in on it:
I'll annotate the two flamegraphs to make it clearer what's going on. The optimization clearly changes which code paths are hot. Some traces disappear completely from the sampling.
The profiled code is the example application for Reinhardt and as such it is tuned for easy development and not speed. Thus, it was an easy target to optimize and the tweaks made are fairly obvious; this isn't about tricky optimizations but should give an example of how to interpret a Ringo flamegraph.
If you don't know how to read flamegraphs, the flamegraphs website provides a useful summary:
The colors aren't significant, and are usually picked at random to be warm colors.
Unoptimized application: main areas of computation annotated
Looking at the overall graph, six types of stack traces can be distinguished. Going from right to left, we see:
By default, the Reinhardt template engine will read the template from disk, then split the string into semantic chunks. The chunks are interpreted by a parser assembling the actual template function. Rendering the template is done by simply calling the template function with a rendering context.
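The pipeline above can be sketched in a few lines of JavaScript. The function names (`loadSource`, `lex`, `parse`) and the toy `{{ name }}` syntax are illustrative stand-ins, not Reinhardt's actual API:

```javascript
// [1] Load: read the raw template string (inlined here instead of disk).
function loadSource() {
  return "Hello {{ name }}!";
}

// [2] Lex: split the string into semantic chunks (text vs. variable tags).
function lex(source) {
  return source.split(/({{.*?}})/).filter(function (c) { return c.length; });
}

// [3] Parse: assemble the chunks into the actual template function.
function parse(chunks) {
  return function render(context) {
    return chunks.map(function (chunk) {
      var m = chunk.match(/^{{\s*(\w+)\s*}}$/);
      return m ? context[m[1]] : chunk;
    }).join("");
  };
}

// [4] Render: call the template function with a rendering context.
var template = parse(lex(loadSource()));
console.log(template({name: "world"})); // prints "Hello world!"
```

Each of the first three steps shows up as its own stack in the flamegraph; only the final call is the actual rendering.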
The optimized version uses a cached loader which performs all steps except rendering only once for each template. After the first request, the template function is cached in a ConcurrentHashMap and filesystem loading, lexing and parsing are no longer performed. Note that I do not cache the result of passing the context to the template function; the template is rendered in both versions.
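The caching idea boils down to memoizing the compiled template function by name. In Ringo the cache would be a `java.util.concurrent.ConcurrentHashMap` accessed via Java interop; a plain object stands in for it in this sketch, and `compile` is a hypothetical stand-in for the load/lex/parse steps:

```javascript
var cache = {};
var compileCount = 0;

// Stand-in for the expensive load + lex + parse steps.
function compile(name) {
  compileCount++;
  return function (context) {
    return "rendered " + name + " for " + context.user;
  };
}

function getTemplate(name) {
  if (!cache[name]) {
    cache[name] = compile(name); // load, lex, parse happen only once
  }
  return cache[name]; // rendering still happens on every request
}

getTemplate("index.html")({user: "a"});
getTemplate("index.html")({user: "b"});
console.log(compileCount); // prints 1: the second request hit the cache
```

Note that only the template *function* is cached; its output is not, so every request still pays the rendering cost, as in the profiled app.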
Additionally, I disabled autoescaping. Autoescaping helps to prevent XSS attacks, but it has significant performance costs: every string must be checked for unescaped HTML/XHTML tags, and additional checks are necessary to prevent double-escaping.
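A rough sketch of where that cost comes from: every rendered string goes through a regex replacement, and a "safe" marker has to be tracked so the same string is not escaped twice. This mimics the general technique, not Reinhardt's implementation:

```javascript
// Wrapper marking a string as already escaped.
function SafeString(value) {
  this.value = value;
}

function escapeHtml(input) {
  if (input instanceof SafeString) {
    return input.value; // already escaped: prevent double-escaping
  }
  // The regex pass over every string is the plateau in the flamegraph.
  return String(input).replace(/[&<>"']/g, function (ch) {
    return {"&": "&amp;", "<": "&lt;", ">": "&gt;",
            '"': "&quot;", "'": "&#39;"}[ch];
  });
}

var escaped = escapeHtml('<script>alert("xss")</script>');
console.log(escaped); // prints &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
console.log(escapeHtml(new SafeString(escaped)) === escaped); // prints true
```

Applied to every string in every rendered chunk, this per-string work adds up, which is exactly why it shows as a wide plateau.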
One oddity in the graph is that the first three template steps are present twice: loading [1], lexing [2] and parsing [3] are the steps taken to render a template. But due to template inclusion, the sub-template triggers the same first three steps: [1b], [2b] and [3b]. Autoescaping and rendering, on the other hand, are only done once for the final template function.
Unoptimized application, detailed annotation of computation areas
In the unoptimized app, the autoescaping feature is also easy to spot: a wide plateau in the middle of the graph.
Enter the optimized version:
Optimized application, detailed annotation of computation areas
Both runs were performed with the following wrk call: ten connections are kept open, all handled asynchronously by one thread. The benchmarks ran for 120 seconds without warmup (a warmed-up JVM would give better numbers).
./wrk -c 10 -d 120 -t 1 --latency http://localhost:8080
This setup kept one of my four CPU cores busy with wrk and Java's load was between 1.5 and 2.5. My RAM usage was always below 10%. I didn't want to saturate any part of my machine.
I also removed stack traces for which I received fewer than 10 samples, to draw a cleaner picture.
Simon Oberhammer, 05/Jan/2015