mirror of https://github.com/grpc/grpc.io.git
395 lines
18 KiB
Markdown
395 lines
18 KiB
Markdown
---
|
|
title: So You Want to Optimize gRPC - Part 1
|
|
date: 2018-03-06
|
|
spelling: cSpell:ignore kvstore Mastrangelo MILLIS OOMs
|
|
author:
|
|
name: Carl Mastrangelo
|
|
link: https://github.com/carl-mastrangelo
|
|
position: Google
|
|
---
|
|
|
|
A common question with gRPC is how to make it fast. The gRPC library offers users access to high
|
|
performance RPCs, but it isn't always clear how to achieve this. Because this question is common
|
|
enough I thought I would try to show my thought process when tuning programs.
|
|
|
|
<!--more-->
|
|
|
|
## Setup
|
|
|
|
Consider a basic key-value service that is used by multiple other programs. The service needs to
|
|
be safe for concurrent access in case multiple updates happen at the same time. It needs to be
|
|
able to scale up to use the available hardware. Lastly, it needs to be fast. gRPC is a perfect
|
|
fit for this type of service; let's look at the best way to implement it.
|
|
|
|
For this blog post, I have written an example
|
|
[client and server](https://github.com/carl-mastrangelo/kvstore) using gRPC Java. The program is
|
|
split into three main classes, and a protobuf file describing the API:
|
|
|
|
* [KvClient](https://github.com/carl-mastrangelo/kvstore/blob/01-start/src/main/java/io/grpc/examples/KvClient.java)
|
|
is a simulated user of the key value system. It randomly creates, retrieves, updates,
|
|
and deletes keys and values. The size of keys and values it uses is also randomly decided
|
|
using an [exponential distribution](https://en.wikipedia.org/wiki/Exponential_distribution).
|
|
* [KvService](https://github.com/carl-mastrangelo/kvstore/blob/01-start/src/main/java/io/grpc/examples/KvService.java)
|
|
is an implementation of the key value service. It is installed by the gRPC Server to handle
|
|
the requests issued by the client. To simulate storing the keys and values on disk, it adds
|
|
short sleeps while handling the request. Reads and writes will experience a 10 and 50
|
|
millisecond delay to make the example act more like a persistent database.
|
|
* [KvRunner](https://github.com/carl-mastrangelo/kvstore/blob/01-start/src/main/java/io/grpc/examples/KvRunner.java)
|
|
orchestrates the interaction between the client and the server. It is the main entry point,
|
|
starting both the client and server in process, and waiting for the the client to execute its
|
|
work. The runner does work for 60 seconds and then records how many RPCs were completed.
|
|
* [kvstore.proto](https://github.com/carl-mastrangelo/kvstore/blob/01-start/src/main/proto/kvstore.proto)
|
|
is the protocol buffer definition of our service. It describes exactly what clients can expect
|
|
from the service. For the sake of simplicity, we will use Create, Retrieve, Update, and Delete
|
|
as the operations (commonly known as CRUD). These operations work with keys and values made up
|
|
of arbitrary bytes. While they are somewhat REST like, we reserve the right to diverge and
|
|
add more complex operations in the future.
|
|
|
|
[Protocol buffers](https://developers.google.com/protocol-buffers/) (protos) aren't required to use
|
|
gRPC, they are a very convenient way to define service interfaces and generate client and server
|
|
code. The generated code acts as glue code between the application logic and the core gRPC
|
|
library. We refer to the code called by a gRPC client the _stub_.
|
|
|
|
## Starting Point
|
|
|
|
### Client
|
|
|
|
Now that we know what the program _should_ do, we can start looking at how the program performs.
|
|
As mentioned above, the client makes random RPCs. For example, here is the code that makes the
|
|
[creation](https://github.com/carl-mastrangelo/kvstore/blob/f422b1b6e7c69f8c07f96ed4ddba64757242352c/src/main/java/io/grpc/examples/KvClient.java#L80)
|
|
request:
|
|
|
|
```java
|
|
private void doCreate(KeyValueServiceBlockingStub stub) {
|
|
ByteString key = createRandomKey();
|
|
try {
|
|
CreateResponse res = stub.create(
|
|
CreateRequest.newBuilder()
|
|
.setKey(key)
|
|
.setValue(randomBytes(MEAN_VALUE_SIZE))
|
|
.build());
|
|
if (!res.equals(CreateResponse.getDefaultInstance())) {
|
|
throw new RuntimeException("Invalid response");
|
|
}
|
|
} catch (StatusRuntimeException e) {
|
|
if (e.getStatus().getCode() == Code.ALREADY_EXISTS) {
|
|
knownKeys.remove(key);
|
|
logger.log(Level.INFO, "Key already existed", e);
|
|
} else {
|
|
throw e;
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
A random key is created, along with a random value. The request is sent to the server, and the
|
|
client waits for the response. When the response is returned, the code checks that it is as
|
|
expected, and if not, throws an exception. While the keys are chosen randomly, they need to be
|
|
unique, so we need to make sure that each key isn't already in use. To address this, the code
|
|
keeps track of keys it has created, so as not to create the same key twice. However, it's
|
|
possible that another client already created a particular key, so we log it and move on.
|
|
Otherwise, an exception is thrown.
|
|
|
|
We use the **blocking** gRPC API here, which issues a requests and waits for a response.
|
|
This is the simplest gRPC stub, but it blocks the thread while running. This means that at most
|
|
**one** RPC can be in progress at a time from the client's point of view.
|
|
|
|
### Server
|
|
|
|
On the server side, the request is received by the
|
|
[service handler](https://github.com/carl-mastrangelo/kvstore/blob/f422b1b6e7c69f8c07f96ed4ddba64757242352c/src/main/java/io/grpc/examples/KvService.java#L34):
|
|
|
|
```java
|
|
private final Map<ByteBuffer, ByteBuffer> store = new HashMap<>();
|
|
|
|
@Override
|
|
public synchronized void create(
|
|
CreateRequest request, StreamObserver<CreateResponse> responseObserver) {
|
|
ByteBuffer key = request.getKey().asReadOnlyByteBuffer();
|
|
ByteBuffer value = request.getValue().asReadOnlyByteBuffer();
|
|
simulateWork(WRITE_DELAY_MILLIS);
|
|
if (store.putIfAbsent(key, value) == null) {
|
|
responseObserver.onNext(CreateResponse.getDefaultInstance());
|
|
responseObserver.onCompleted();
|
|
return;
|
|
}
|
|
responseObserver.onError(Status.ALREADY_EXISTS.asRuntimeException());
|
|
}
|
|
```
|
|
|
|
The service extracts the key and value as `ByteBuffer`s from the request. It acquires the lock
|
|
on the service itself to make sure concurrent requests don't corrupt the storage. After
|
|
simulating the disk access of a write, it stores it in the `Map` of keys to values.
|
|
|
|
Unlike the client code, the service handler is **non-blocking**, meaning it doesn't return a
|
|
value like a function call would. Instead, it invokes `onNext()` on the `responseObserver` to
|
|
send the response back to the client. Note that this call is also non-blocking, meaning that
|
|
the message may not yet have been sent. To indicate we are done with the message, `onCompleted()`
|
|
is called.
|
|
|
|
### Performance
|
|
|
|
Since the code is safe and correct, let's see how it performs. For my measurement I'm using my
|
|
Ubuntu system with a 12 core processor and 32 GB of memory. Let's build and run the code:
|
|
|
|
```sh
|
|
./gradlew installDist
|
|
time ./build/install/kvstore/bin/kvstore
|
|
Feb 26, 2018 1:10:07 PM io.grpc.examples.KvRunner runClient
|
|
INFO: Starting
|
|
Feb 26, 2018 1:11:07 PM io.grpc.examples.KvRunner runClient
|
|
INFO: Did 16.55 RPCs/s
|
|
|
|
real 1m0.927s
|
|
user 0m10.688s
|
|
sys 0m1.456s
|
|
```
|
|
|
|
Yikes! For such a powerful machine, it can only do about 16 RPCs per second. It hardly used any
|
|
of our CPU, and we don't know how much memory it was using. We need to figure out why it's so
|
|
slow.
|
|
|
|
|
|
## Optimization
|
|
|
|
### Analysis
|
|
|
|
Let's understand what the program is doing before we make any changes. When optimizing, we need
|
|
to know where the code is spending its time in order to know what we can optimize. At this early
|
|
stage, we don't need profiling tools yet, we can just reason about the program.
|
|
|
|
The client is started and serially issues RPCs for about a minute. Each iteration, it [randomly
|
|
decides](https://github.com/carl-mastrangelo/kvstore/blob/f422b1b6e7c69f8c07f96ed4ddba64757242352c/src/main/java/io/grpc/examples/KvClient.java#L49)
|
|
what operation to do:
|
|
|
|
```java
|
|
void doClientWork(AtomicBoolean done) {
|
|
Random random = new Random();
|
|
KeyValueServiceBlockingStub stub = KeyValueServiceGrpc.newBlockingStub(channel);
|
|
|
|
while (!done.get()) {
|
|
// Pick a random CRUD action to take.
|
|
int command = random.nextInt(4);
|
|
if (command == 0) {
|
|
doCreate(stub);
|
|
continue;
|
|
}
|
|
/* ... */
|
|
rpcCount++;
|
|
}
|
|
}
|
|
```
|
|
|
|
This means that **at most one RPC can be active at any time**. Each RPC has to wait for the
|
|
previous one to complete. And how long does each RPC take to complete? From reading the server
|
|
code, most of the operations are doing a write which takes about 50 milliseconds. At top
|
|
efficiency, the most operations this code can do per second is about 20:
|
|
|
|
20 queries = 1000ms / (50 ms / query)
|
|
|
|
Our code can do about 16 queries in a second, so that seems about right. We can spot check this
|
|
assumption by looking at the output of the `time` command used to run the code. The server goes
|
|
to sleep when running queries in the
|
|
[simulateWork](https://github.com/carl-mastrangelo/kvstore/blob/f422b1b6e7c69f8c07f96ed4ddba64757242352c/src/main/java/io/grpc/examples/KvService.java#L88)
|
|
method. This implies that the program should be mostly idle while waiting for the RPCs to
|
|
complete.
|
|
|
|
We can confirm this is the case by looking at the `real` and `user` times of the command above.
|
|
They say that the amount of _wall clock_ time was 1 minute, while the amount of _cpu_ time
|
|
was 10 seconds. My powerful, multicore CPU was only busy 16% of the time. Thus, if we could
|
|
get the program to do more work during that time, it seems like we could get more RPCs complete.
|
|
|
|
### Hypothesis
|
|
|
|
Now we can state clearly what we think is the problem, and propose a solution. One way to speed
|
|
up programs is to make sure the CPU is not idling. To do this, we issue work concurrently.
|
|
|
|
In gRPC Java, there are three types of stubs: blocking, non-blocking, and listenable future. We
|
|
have already seen the blocking stub in the client, and the non-blocking stub in the server. The
|
|
listenable future API is a compromise between the two, offering both blocking and non-blocking
|
|
like behavior. As long as we don't block a thread waiting for work to complete, we can start
|
|
new RPCs without waiting for the old ones to complete.
|
|
|
|
### Experiment
|
|
|
|
To test our hypothesis, let's modify the client code to use the listenable future API. This
|
|
means that we need to think more about concurrency in our code. For example, when keeping track
|
|
of known keys client-side, we need to safely read, modify, and write the keys. We also need to
|
|
make sure that in case of an error, we stop making new RPCs (proper error handling will be covered
|
|
in a future post). Lastly, we need to update the number of RPCs made concurrently, since the
|
|
update could happen in another thread.
|
|
|
|
Making all these changes increases the complexity of the code. This is a trade off you will need
|
|
to consider when optimizing your code. In general, code simplicity is at odds with optimization.
|
|
Java is not known for being terse. That said, the code below is still readable, and program flow
|
|
is still roughly from top to bottom in the function. Here is the
|
|
[doCreate()](https://github.com/carl-mastrangelo/kvstore/blob/f0113912c01ac4ea48a80bb7a4736ddcb3f21e24/src/main/java/io/grpc/examples/KvClient.java#L92)
|
|
method revised:
|
|
|
|
```java
|
|
private void doCreate(KeyValueServiceFutureStub stub, AtomicReference<Throwable> error) {
|
|
ByteString key = createRandomKey();
|
|
ListenableFuture<CreateResponse> res = stub.create(
|
|
CreateRequest.newBuilder()
|
|
.setKey(key)
|
|
.setValue(randomBytes(MEAN_VALUE_SIZE))
|
|
.build());
|
|
res.addListener(() -> rpcCount.incrementAndGet(), MoreExecutors.directExecutor());
|
|
Futures.addCallback(res, new FutureCallback<CreateResponse>() {
|
|
@Override
|
|
public void onSuccess(CreateResponse result) {
|
|
if (!result.equals(CreateResponse.getDefaultInstance())) {
|
|
error.compareAndSet(null, new RuntimeException("Invalid response"));
|
|
}
|
|
synchronized (knownKeys) {
|
|
knownKeys.add(key);
|
|
}
|
|
}
|
|
|
|
@Override
|
|
public void onFailure(Throwable t) {
|
|
Status status = Status.fromThrowable(t);
|
|
if (status.getCode() == Code.ALREADY_EXISTS) {
|
|
synchronized (knownKeys) {
|
|
knownKeys.remove(key);
|
|
}
|
|
logger.log(Level.INFO, "Key already existed", t);
|
|
} else {
|
|
error.compareAndSet(null, t);
|
|
}
|
|
}
|
|
});
|
|
}
|
|
```
|
|
|
|
The stub has been modified to be a `KeyValueServiceFutureStub`, which produces a `Future` when
|
|
called instead of the response itself. gRPC Java uses an extension of this called `ListenableFuture`,
|
|
which allows adding a callback when the future completes. For the sake of this program, we are
|
|
not as concerned with getting the response. Instead we care more if the RPC succeeded or not.
|
|
With that in mind, the code mainly checks for errors rather than processing the response.
|
|
|
|
The first change made is how the number of RPCs is recorded. Instead of incrementing the counter
|
|
outside of the main loop, we increment it when the RPC completes.
|
|
|
|
Next, we create a new object
|
|
for each RPC which handles both the success and failure cases. Because `doCreate()` will already
|
|
be completed by the time RPC callback is invoked, we need a way to propagate errors other than
|
|
by throwing. Instead, we try to update an reference atomically. The main loop will occasionally
|
|
check if an error has occurred and stop if there is a problem.
|
|
|
|
Lastly, the code is careful to only add a key to `knownKeys` when the RPC is actually complete,
|
|
and only remove it when known to have failed. We synchronize on the variable to make sure two
|
|
threads don't conflict. Note: although the access to `knownKeys` is threadsafe, there are still
|
|
[race conditions](https://en.wikipedia.org/wiki/Race_condition). It is possible that one thread
|
|
could read from `knownKeys`, a second thread delete from `knownKeys`, and then the first thread
|
|
issue an RPC using the first key. Synchronizing on the keys only ensures that it is consistent,
|
|
not that it is correct. Fixing this properly is outside of the scope of this post, so instead we
|
|
just log the event and move on. You will see a few such log statements if you run this program.
|
|
|
|
### Running the Code
|
|
|
|
If you start up this program and run it, you'll notice that it doesn't work:
|
|
|
|
```sh
|
|
WARNING: An exception was thrown by io.grpc.netty.NettyClientStream$Sink$1.operationComplete()
|
|
java.lang.OutOfMemoryError: unable to create new native thread
|
|
at java.lang.Thread.start0(Native Method)
|
|
at java.lang.Thread.start(Thread.java:714)
|
|
...
|
|
```
|
|
|
|
What?! Why would I show you code that fails? The reason is that in real life making a change often
|
|
doesn't work on the first try. In this case, the program ran out of memory. Odd things begin to
|
|
happen when a program runs out of memory. Often, the root cause is hard to find, and red herrings
|
|
abound. A confusing error message says "unable to create new native thread"
|
|
even though we didn't create any new threads in our code. Experience is very helpful in fixing
|
|
these problems rather than debugging. Since I have debugged many OOMs, I happen to know Java tells
|
|
us about the straw that broke the camel's back. Our program started using way more memory, but the
|
|
final allocation that failed happened, by chance, to be in thread creation.
|
|
|
|
So what happened? _There was no pushback to starting new RPCs._ In the blocking version, a new
|
|
RPC couldn't start until the last one completed. While slow, it also prevented us from creating
|
|
tons of RPCs that we didn't have memory for. We need to account for this in the listenable
|
|
future version.
|
|
|
|
To solve this, we can apply a self-imposed limit on the number of active RPCs. Before starting a
|
|
new RPC, we will try to acquire a permit. If we get one, the RPC can start. If not, we will wait
|
|
until one is available. When an RPC completes (either in success or failure), we return the
|
|
permit. To [accomplish](https://github.com/carl-mastrangelo/kvstore/blob/02-future-client/src/main/java/io/grpc/examples/KvClient.java#L94)
|
|
this, we will using a `Semaphore`:
|
|
|
|
```java
|
|
private final Semaphore limiter = new Semaphore(100);
|
|
|
|
private void doCreate(KeyValueServiceFutureStub stub, AtomicReference<Throwable> error)
|
|
throws InterruptedException {
|
|
limiter.acquire();
|
|
ByteString key = createRandomKey();
|
|
ListenableFuture<CreateResponse> res = stub.create(
|
|
CreateRequest.newBuilder()
|
|
.setKey(key)
|
|
.setValue(randomBytes(MEAN_VALUE_SIZE))
|
|
.build());
|
|
res.addListener(() -> {
|
|
rpcCount.incrementAndGet();
|
|
limiter.release();
|
|
}, MoreExecutors.directExecutor());
|
|
/* ... */
|
|
}
|
|
```
|
|
|
|
Now the code runs successfully, and doesn't run out of memory.
|
|
|
|
### Results
|
|
|
|
Building and running the code again looks a lot better:
|
|
|
|
```sh
|
|
./gradlew installDist
|
|
time ./build/install/kvstore/bin/kvstore
|
|
Feb 26, 2018 2:40:47 PM io.grpc.examples.KvRunner runClient
|
|
INFO: Starting
|
|
Feb 26, 2018 2:41:47 PM io.grpc.examples.KvRunner runClient
|
|
INFO: Did 24.283 RPCs/s
|
|
|
|
real 1m0.923s
|
|
user 0m12.772s
|
|
sys 0m1.572s
|
|
```
|
|
|
|
Our code does **46%** more RPCs per second than previously. We can also see that we used about 20%
|
|
more CPU than previously. As we can see our hypothesis turned out to be correct and the fix
|
|
worked. All this happened without making any changes to the server. Also, we were able to
|
|
measure without using any special profilers or tracers.
|
|
|
|
Do the numbers make sense? We expect to issue mutation (create, update, and delete) RPCs each
|
|
about with 1/4 probability. Reads are also issue 1/4 of the time, but don't take as long. The
|
|
mean RPC time should be about the weighted average RPC time:
|
|
|
|
```nocode
|
|
.25 * 50ms (create)
|
|
.25 * 10ms (retrieve)
|
|
.25 * 50ms (update)
|
|
+.25 * 50ms (delete)
|
|
------------
|
|
40ms
|
|
```
|
|
|
|
At 40ms on average per RPC, we would expect the number of RPCs per second to be:
|
|
|
|
25 queries = 1000ms / (40 ms / query)
|
|
|
|
That's approximately what we see with the new code. The server is still serially handling
|
|
requests, so it seems like we have more work to do in the future. But for now, our optimizations
|
|
seem to have worked.
|
|
|
|
|
|
## Conclusion
|
|
|
|
There are a lot of opportunities to optimize your gRPC code. To take advantage of these, you
|
|
need to understand what your code is doing, and what your code is supposed to do. This post shows
|
|
the very basics of how to approach and think about optimization. Always make sure to measure
|
|
before and after your changes, and use these measurements to guide your optimizations.
|
|
|
|
In [Part 2](/blog/optimizing-grpc-part-2/), we will continue optimizing the server part of the code.
|