Flamegraphing a Ringo web application

I did two flamegraphs profiling a simple Ringo web app and an optimized version of the same app.

What an awesome visualization tool! I'll definitely create a flamegraph next time I need to profile something.

The tree visualization encouraged me to think about what actually triggers code to get executed and why some frames sometimes go into wide plateous (which is bad) and at which point I could most easily cut of such mountains.

Quick example to get you hungry

One glaring example is Reinhardt's autoescaping feature.

In the unoptimized app, autoescaping is a wide plateau in the middle of the graph. You can see it below labeled as "Template autoescaping" within the bigger block of the JavaScript application:

Unoptimized application flamegraph. Highlighting the template autoescape stack.

Autoescaping requires that Reinhardt matches every string against a regex to get rid of possible injections. The cost of autoescaping is clearly visible as a wide plateau in the graph above.

Compare the above graph with the optimized version below. I disabled autoescaping (among other things). The autoescaping plateau is completely gone. Due to this and other optimization, we now see less than 30% of the total time is spent in the applicaton's JavaScript code compared to 80% in the original application. Now, for the most part we see JSGI and Jetty writing responses.

Optimized application flamegraph: more time is spent writing the response instead of generating it in JavaScript.

Below are the two links to the full SVG flamegraph files. The SVG allows you to double click on part of a stack to zoom in on it:

About Flamegraphs
The big picture
Optimizations performed
Annotated flamegraphs
Benchmark

What's this about?

I'll annotate the two flamegraphs to make clearer what's going on. The optimization clearly changes which code paths are hot. Some traces dissapear completely from the sampling.

The profiled code is the example application for Reinhardt and as such it is tuned for easy development and not speed. Thus, it was an easy target to optimize and the tweaks made are fairly obvious; this isn't about tricky optimizations but should give an example of how to interpret a Ringo flamegraph.

About Flamegraphs

If you don't know how to read flamegraphs, the flamegraphs website provides a useful summary:

Each box represents a function in the stack (a "stack frame").
The y-axis shows stack depth (number of frames on the stack). The top box shows the function that was on-CPU. Everything beneath that is ancestry. The function beneath a function is its parent.
The x-axis spans the sample population. It does not show the passing of time from left to right, as most graphs do. The left to right ordering has no meaning (it's sorted alphabetically).
The width of the box shows the total time it was on-CPU or part of an ancestry that was on-CPU (based on sample count). Wider box functions may be slower than narrow box functions, or, they may simply be called more often. The call count is not shown (or known via sampling).

The colors aren't significant, and are usually picked at random to be warm colors.

The big picture

Unoptimized application: main areas of computation annotated

Looking at the overall graph, six types of stack traces can be distinguished. Going from right to left, we see:

[1] After launching Jetty, Ringo takes a step back and only manages a set of JavaScript Workers. Jetty triggers JavaScript invocations in those Workers through the HTTP Servlet.
[2] Jetty launches its own ThreadPool which is the big bar at the bottom which is the parent frame of most of the code executed.
[3] Jetty's Selector thread figures out which connection needs servicing by a Thread.
[4] A Jetty servlet triggers the response generation. In our simple example, there's only one path straight up through all Jetty handlers.
[5] Ringo code is triggered by the JSGI servlet. Once RingoWorker.invoke is called, we are in JavaScript land.
[6] Jetty reads the response from the network ands writes our application's response.

Optimizations performed

By default, the Reinhardt template engine will read the template from disk, then split the string into semantic chunks. The chunks are interpreted by a parser assembling the actual template function. Rendering the template is done by simply calling the template function with a rendering context.

The optimized version uses a cached loader which performs all steps except rendering only once for each template. After the first request, the template function is cached in a ConcurrentHashMap and filesystem loading, lexing and parsing are no longer performed. Note that I do not cache the result of passing the context to the template function; the template is rendered in both versions.

Additionally, I disabled autoescaping. Autoescaping helps to prevent XSS attacks but it has significant performance costs: every string must be checked for unescaped HTML/XHTML tags and additionally checks are necessary to prevent double-escaping.

Annotated flamegraphs

In the default application, most time is spent running code in JavaScript, particularly in the template system. All steps from [1] to [5] are related to templating. That's not surpising because the Reinhardt example application contains only one line which renders a template with a predefined context. Everything the application does, is this:

templates.renderResponse('base.html', context);

One oddity in the graph is that the first three template steps are present twice: loading [1], lexing [2] and parsing [3] are the steps taken to render a template. But due to template inclusion, the sub-template triggers the same first three steps: [1b], [2b] and [3b]. Autoescaping [4] and rendering [5], on the other hand, are only done once for the final template function.

Unoptimized application, detailed annotation of computation areas

Also in the unoptimized app, we can easily see the autoescaping feature. It's a wide plateau in the middle of the graph [4].

What we would want for better throughput, is less time spent executing JavaScript to figure out which bytes to send as a response. Jetty and JSGI should be writing responses most of the time. Subsequently, if we get faster servicing of connections, new requests should arrive faster, which would require more network reading by Jetty.

Enter the optimized version:

Optimized application, detailed annotation of computation areas

In the optimized version, the lexing and parsing steps are no longer visible (they only happened once) and most time is spent in our containers writing out the response bytes ([1] and [2]). This is good! The tasks "ringo housekeeping" and "jetty nio selector" now look wide and expensive but this was expected: the less time we spend running JavaScript, the more the containers around it will act and become visible in the sampling.

Benchmark

Both runs were performed with the following wrk call: ten connections are kept open, all connections are asynchronously handled by one thread. The benchmarks ran for 120 seconds without warmup (JVM warmup gives better performance).

./wrk -c 10 -d 120 -t 1 --latency http://localhost:8080

This setup kept one of my four CPU cores busy with wrk and Java's load was between 1.5 and 2.5. My RAM usage was always below 10%. I didn't want to saturate any part of my machine.

Software used

RingoJs
lightweight-java-profiler to sample stack traces
flamegraph
wrk benchmark

Stack cleanup

I replaced or removed some parts of the stack traces to make it more readable. Note especially how all calls to Rhino scriping ('org.mozilla.javascript'), which is the glue between our JS code and Java, are completely removed:

removed all lines matching the following regex:
- \t_home_simon_reinhardt_examples_.*\n
- \torg.mozilla.javascript.*\n
removed prefix "org.mozilla.javascript.gen."
renamed "reinhardt_lib_" to "reinhardt_"
removed prefix "org.eclipse."

I also removed stack traces for which I recieved less then 10 samples to draw a cleaner picture.

Simon Oberhammer, 05/Jan/2015