grpc-java/benchmarks
Carl Mastrangelo 5c37a8360c core: cache Accept-Encoding headers (#2766)
* core: cache Accept-Encoding headers

This avoids rebuilding the raw bytes for each RPC.  The decode path
is not yet optimized, to avoid pulling too much into a single commit.
Decompressors may still use invalid names, but this is equivalent to
previous behavior, as cleaning happens later in the caching.

Also, internal accessors on DecompressorRegistry are now hidden.
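
The idea, in a minimal sketch with hypothetical names (not the actual grpc-java implementation): compute the joined accept-encoding bytes once per registry instance and hand the same array to every RPC, instead of re-encoding the advertised names each time.

import java.nio.charset.StandardCharsets;
import java.util.Set;

final class AdvertisedEncodings {
  // Computed once, whenever the set of advertised decompressors changes.
  private final byte[] cachedAcceptEncodingBytes;

  AdvertisedEncodings(Set<String> advertisedEncodings) {
    // Join the advertised codec names once, e.g. "gzip,identity".
    StringBuilder sb = new StringBuilder();
    for (String encoding : advertisedEncodings) {
      if (sb.length() > 0) {
        sb.append(',');
      }
      sb.append(encoding);
    }
    cachedAcceptEncodingBytes = sb.toString().getBytes(StandardCharsets.US_ASCII);
  }

  // Called per RPC: returns the precomputed bytes instead of rebuilding them.
  byte[] acceptEncodingBytes() {
    return cachedAcceptEncodingBytes;
  }
}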

Before:
Benchmark                                                    (extraEncodings)    Mode      Cnt        Score    Error  Units
DecompressorRegistryBenchmark.marshalOld                                    0  sample   928744      124.104 ± 11.159  ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.00                   0  sample                84.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.50                   0  sample                94.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.90                   0  sample               107.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.95                   0  sample               114.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.99                   0  sample               202.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.999                  0  sample              4944.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.9999                 0  sample             12178.008           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p1.00                   0  sample           2056192.000           ns/op
DecompressorRegistryBenchmark.marshalOld                                    1  sample  1345050      150.123 ±  6.952  ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.00                   1  sample               109.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.50                   1  sample               127.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.90                   1  sample               142.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.95                   1  sample               152.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.99                   1  sample               243.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.999                  1  sample              4640.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.9999                 1  sample             11472.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p1.00                   1  sample           2101248.000           ns/op
DecompressorRegistryBenchmark.marshalOld                                    2  sample  1130903      175.846 ±  1.392  ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.00                   2  sample               131.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.50                   2  sample               148.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.90                   2  sample               164.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.95                   2  sample               174.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.99                   2  sample               311.000           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.999                  2  sample              6048.768           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.9999                 2  sample             12349.107           ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p1.00                   2  sample            112000.000           ns/op

After:
Benchmark                                                    (extraEncodings)    Mode      Cnt        Score   Error  Units
DecompressorRegistryBenchmark.marshalOld                                    0  sample  1095005       67.555 ± 5.529  ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.00                   0  sample                42.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.50                   0  sample                52.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.90                   0  sample                69.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.95                   0  sample                84.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.99                   0  sample               133.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.999                  0  sample              3324.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.9999                 0  sample             11056.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p1.00                   0  sample           1820672.000          ns/op
DecompressorRegistryBenchmark.marshalOld                                    1  sample  1437034       78.089 ± 0.723  ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.00                   1  sample                60.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.50                   1  sample                69.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.90                   1  sample                79.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.95                   1  sample                83.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.99                   1  sample                96.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.999                  1  sample              2728.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.9999                 1  sample             11104.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p1.00                   1  sample            105344.000          ns/op
DecompressorRegistryBenchmark.marshalOld                                    2  sample  1203782       95.213 ± 0.864  ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.00                   2  sample                68.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.50                   2  sample                85.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.90                   2  sample                98.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.95                   2  sample               101.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.99                   2  sample               119.000          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.999                  2  sample              3209.736          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p0.9999                 2  sample             11257.947          ns/op
DecompressorRegistryBenchmark.marshalOld:marshalOld·p1.00                   2  sample             63168.000          ns/op
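
The numbers above are JMH sample-time results. For reference, a benchmark of this shape looks roughly like the sketch below; the class name and workload are illustrative, not the actual DecompressorRegistryBenchmark source under src/.

import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class AcceptEncodingMarshalBenchmark {
  // Matches the (extraEncodings) column in the output above: how many
  // non-default encodings are registered besides gzip.
  @Param({"0", "1", "2"})
  public int extraEncodings;

  private Set<String> encodings;

  @Setup
  public void setUp() {
    encodings = new LinkedHashSet<>();
    encodings.add("gzip");
    for (int i = 0; i < extraEncodings; i++) {
      encodings.add("custom-" + i);
    }
  }

  @Benchmark
  @BenchmarkMode(Mode.SampleTime)
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  public byte[] marshal() {
    // Rebuilds the header bytes on every call; this is the per-RPC cost
    // that the caching change removes from the hot path.
    return String.join(",", encodings).getBytes(StandardCharsets.US_ASCII);
  }
}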

README.md

grpc Benchmarks

QPS Benchmark

The "Queries Per Second Benchmark" allows you to get a quick overview of the throughput and latency characteristics of grpc.

To build the benchmark, type

$ ./gradlew :grpc-benchmarks:installDist

from the grpc-java directory.

You can now find the client and the server executables in benchmarks/build/install/grpc-benchmarks/bin.

The C++ counterpart can be found at https://github.com/grpc/grpc/tree/master/test/cpp/qps
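
A typical run starts the server first and then points the client at it. The script names and flags below are illustrative and may differ between versions, so check the usage output printed by the scripts in benchmarks/build/install/grpc-benchmarks/bin:

$ ./benchmarks/build/install/grpc-benchmarks/bin/qps_server --address=localhost:8080
$ ./benchmarks/build/install/grpc-benchmarks/bin/qps_client --address=localhost:8080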

Visualizing the Latency Distribution

The QPS client comes with the option --save_histogram=FILE. If set, it serializes the histogram to FILE, which can then be used with a plotter to visualize the latency distribution. The histogram is stored in the file format of HdrHistogram, so it can be plotted very easily using a browser-based tool like http://hdrhistogram.github.io/HdrHistogram/plotFiles.html. Simply upload the generated file and it will generate a graph for you. It also allows you to plot two or more histograms on the same surface in order to easily compare latency distributions.
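
For example, assuming a server is already running (the address is illustrative):

$ ./benchmarks/build/install/grpc-benchmarks/bin/qps_client --address=localhost:8080 --save_histogram=latency.hgrm

The resulting latency.hgrm file can then be uploaded at http://hdrhistogram.github.io/HdrHistogram/plotFiles.html to inspect the distribution.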

JVM Options

When running a benchmark it's often useful to adjust some JVM options to improve performance and to gain some insight into what's happening. Passing JVM options to the QPS server and client is as easy as setting the JAVA_OPTS environment variable. Below are some options that I find very useful; a combined example follows the list:

  • -Xms gives a lower bound on the heap to allocate and -Xmx gives an upper bound. If your program uses more than what's specified in -Xmx, the JVM will exit with an OutOfMemoryError. When setting those, always set -Xms and -Xmx to the same value. The reason for this is that the young and old generations are sized according to the total available heap space, so if the total heap gets resized, they will also have to be resized and this will then trigger a full GC.
  • -verbose:gc prints some basic information about garbage collection. It will log to stdout whenever a GC happens and will tell you about the kind of GC, pause time and memory compaction.
  • -XX:+PrintGCDetails prints out very detailed GC and heap usage information before the program terminates.
  • -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=path: when you are pushing the JVM hard, it sometimes happens that it crashes due to the lack of available heap space. These options allow you to dive into the details of why it happened. The heap dump can be viewed with e.g. the Eclipse Memory Analyzer.
  • -XX:+PrintCompilation will give you a detailed overview of what gets compiled, when it gets compiled, by which HotSpot compiler it gets compiled and such. It's a lot of output. I usually just redirect it to a file and look at it with less and grep.
  • -XX:+PrintInlining will give you a detailed overview of what gets inlined and why some methods didn't get inlined. The output is very verbose and, like -XX:+PrintCompilation, useful to look at after some major changes or when a drop in performance occurs.
  • It sometimes happens that a benchmark just doesn't make any progress: no bytes are transferred over the network, there is hardly any CPU utilization and memory usage is low, but the benchmark is still running. In that case it's useful to get a thread dump and see what's going on. HotSpot ships with two tools called jps and jstack. jps tells you the process ids of all running JVMs on the machine, which you can then pass to jstack, and it will print a thread dump of that JVM.
  • Taking a heap dump of a running JVM is similarly straightforward. First get the process id with jps and then use jmap to take the heap dump. You will almost always want to run it with -dump:live in order to only dump live objects. If possible, try to size the heap of your JVM (-Xmx) as small as possible in order to also keep the heap dump small. Large heap dumps are very painful and slow to analyze.
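
A combined example of the options above; the heap sizes, paths and client flags are illustrative, so pick values that fit your machine:

$ export JAVA_OPTS="-Xms4g -Xmx4g -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/qps.hprof"
$ ./benchmarks/build/install/grpc-benchmarks/bin/qps_client --address=localhost:8080

To take a thread dump or a heap dump of a running benchmark, first find the process id with jps and then point the JDK tools at it:

$ jps
$ jstack <pid>
$ jmap -dump:live,format=b,file=heap.hprof <pid>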

Profiling

Newer JVMs come with a built-in profiler called Java Flight Recorder. It's an excellent profiler and it can be used to start a recording directly on the command line, from within Java Mission Control or with jcmd.

Good introductions to how it works and how to use it are http://hirt.se/blog/?p=364 and http://hirt.se/blog/?p=370.
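
For example, to start a time-bounded recording of a running benchmark JVM with jcmd (on Oracle JDK 8 the JVM additionally has to be started with -XX:+UnlockCommercialFeatures -XX:+FlightRecorder):

$ jcmd <pid> JFR.start duration=60s filename=recording.jfr

The resulting recording.jfr file can then be opened in Java Mission Control.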