--- title: So You Want to Optimize gRPC - Part 2 date: 2018-04-16 author: name: Carl Mastrangelo link: https://carlmastrangelo.com/ position: Google --- How fast is gRPC? Pretty fast if you understand how modern clients and servers are built. In [part 1](/blog/optimizing-grpc-part-1), I showed how to get an easy **60%** improvement. In this post I show how to get a **10000%** improvement. ## Setup As in [part 1](/blog/optimizing-grpc-part-1), we will start with an existing, Java based, key-value service. The service will offer concurrent access for creating, reading, updating, and deleting keys and values. All the code can be seen [here](https://github.com/carl-mastrangelo/kvstore/tree/03-nonblocking-server) if you want to try it out. ## Server Concurrency Let's look at the [KvService](https://github.com/carl-mastrangelo/kvstore/blob/f422b1b6e7c69f8c07f96ed4ddba64757242352c/src/main/java/io/grpc/examples/KvService.java) class. This service handles the RPCs sent by the client, making sure that none of them accidentally corrupt the state of storage. To ensure this, the service uses the `synchronized` keyword to ensure only one RPC is active at a time: ```java private final Map store = new HashMap<>(); @Override public synchronized void create( CreateRequest request, StreamObserver responseObserver) { ByteBuffer key = request.getKey().asReadOnlyByteBuffer(); ByteBuffer value = request.getValue().asReadOnlyByteBuffer(); simulateWork(WRITE_DELAY_MILLIS); if (store.putIfAbsent(key, value) == null) { responseObserver.onNext(CreateResponse.getDefaultInstance()); responseObserver.onCompleted(); return; } responseObserver.onError(Status.ALREADY_EXISTS.asRuntimeException()); } ``` While this code is thread safe, it comes at a high price: only one RPC can ever be active! We need some way of allowing multiple operations to happen safely at the same time. Otherwise, the program can't take advantage of all the available processors. ### Breaking the Lock To solve this, we need to know a little more about the _semantics_ of our RPCs. The more we know about how the RPCs are supposed to work, the more optimizations we can make. For a key-value service, we notice that _operations to different keys don't interfere with each other_. When we update key 'foo', it has no bearing on the value stored for key 'bar'. But, our server is written such that operations to any key must be synchronized with respect to each other. If we could make operations to different keys happen concurrently, our server could handle a lot more load. With the idea in place, we need to figure out how to modify the server. The `synchronized` keyword causes Java to acquire a lock on `this`, which is the instance of `KvService`. The lock is acquired when the `create` method is entered, and released on return. The reason we need synchronization is to protect the `store` Map. Since it is implemented as a [`HashMap`](https://en.wikipedia.org/wiki/Hash_table), modifications to it change the internal arrays. Because the internal state of the `HashMap` will be corrupted if not properly synchronized, we can't just remove the synchronization on the method. However, Java offers a solution here: `ConcurrentHashMap`. This class offers the ability to safely access the contents of the map concurrently. For example, in our usage we want to check if a key is present. If not present, we want to add it, else we want to return an error. The `putIfAbsent` method atomically checks if a value is present, adds it if not, and tells us if it succeeded. Concurrent maps provide stronger guarantees about the safety of `putIfAbsent`, so we can swap the `HashMap` to a `ConcurrentHashMap` and remove `synchronized`: ```java private final ConcurrentMap store = new ConcurrentHashMap<>(); @Override public void create( CreateRequest request, StreamObserver responseObserver) { ByteBuffer key = request.getKey().asReadOnlyByteBuffer(); ByteBuffer value = request.getValue().asReadOnlyByteBuffer(); simulateWork(WRITE_DELAY_MILLIS); if (store.putIfAbsent(key, value) == null) { responseObserver.onNext(CreateResponse.getDefaultInstance()); responseObserver.onCompleted(); return; } responseObserver.onError(Status.ALREADY_EXISTS.asRuntimeException()); } ``` ### If at First You Don't Succeed Updating `create` was pretty easy. Doing the same for `retrieve` and `delete` is easy too. However, the `update` method is a little trickier. Let's take a look at what it's doing: ```java @Override public synchronized void update( UpdateRequest request, StreamObserver responseObserver) { ByteBuffer key = request.getKey().asReadOnlyByteBuffer(); ByteBuffer newValue = request.getValue().asReadOnlyByteBuffer(); simulateWork(WRITE_DELAY_MILLIS); ByteBuffer oldValue = store.get(key); if (oldValue == null) { responseObserver.onError(Status.NOT_FOUND.asRuntimeException()); return; } store.replace(key, oldValue, newValue); responseObserver.onNext(UpdateResponse.getDefaultInstance()); responseObserver.onCompleted(); } ``` Updating a key to a new value needs two interactions with the `store`: 1. Check to see if the key exists at all. 2. Update the previous value to the new value. Unfortunately `ConcurrentMap` doesn't have a straightforward method to do this. Since we may not be the only ones modifying the map, we need to handle the possibility that our assumptions have changed. We read the old value out, but by the time we replace it, it may have been deleted. To reconcile this, let's retry if `replace` fails. It returns true if the replace was successful. (`ConcurrentMap` asserts that the operations will not corrupt the internal structure, but doesn't say that they will succeed!) We will use a do-while loop: ```java @Override public void update( UpdateRequest request, StreamObserver responseObserver) { // ... ByteBuffer oldValue; do { oldValue = store.get(key); if (oldValue == null) { responseObserver.onError(Status.NOT_FOUND.asRuntimeException()); return; } } while (!store.replace(key, oldValue, newValue)); responseObserver.onNext(UpdateResponse.getDefaultInstance()); responseObserver.onCompleted(); } ``` The code wants to fail if it ever sees null, but never if there is a non-null previous value. One thing to note is that if _another_ RPC modifies the value between the `store.get()` call and the `store.replace()` call, it will fail. This is a non-fatal error for us, so we will just try again. Once it has successfully put the new value in, the service can respond back to the user. There is one other possibility that could happen: two RPCs could update the same value and overwrite each other's work. While this may be okay for some applications, it would not be suitable for APIs that provide transactionality. It is out of scope for this post to show how to fix this, but be aware it can happen. ## Measuring the Performance In the last post, we modified the client to be asynchronous and use the gRPC ListenableFuture API. To avoid running out of memory, the client was modified to have at most **100** active RPCs at a time. As we now see from the server code, performance was bottlenecked on acquiring locks. Since we have removed those, we expect to see a 100x improvement. The same amount of work is done per RPC, but a lot more are happening at the same time. Let's see if our hypothesis holds: Before: ```sh $ ./gradlew installDist $ time ./build/install/kvstore/bin/kvstore Apr 16, 2018 10:38:42 AM io.grpc.examples.KvRunner runClient INFO: Did 24.067 RPCs/s real 1m0.886s user 0m9.340s sys 0m1.660s ``` After: ```sh Apr 16, 2018 10:36:48 AM io.grpc.examples.KvRunner runClient INFO: Did 2,449.8 RPCs/s real 1m0.968s user 0m52.184s sys 0m20.692s ``` Wow! From 24 RPCs per second to 2,400 RPCs per second. And we didn't have to change our API or our client. This is why understanding your code and API semantics is important. By exploiting the properties of the key-value API, namely the independence of operations on different keys, the code is now much faster. One noteworthy artifact of this code is the `user` timing in the results. Previously the user time was only 9 seconds, meaning that the CPU was active only 9 of the 60 seconds the code was running. Afterwards, the usage went up by more than 5x to 52 seconds. The reason is that more CPU cores are active. The `KvServer` is simulating work by sleeping for a few milliseconds. In a real application, it would be doing useful work and not have such a dramatic change. Rather than scaling per the number of RPCs, it would scale per the number of cores. Thus, if your machine had 12 cores, you would expect to see a 12x improvement. Still not bad though! ### More Errors If you run this code yourself, you will see a lot more log spam in the form: ```sh Apr 16, 2018 10:38:40 AM io.grpc.examples.KvClient$3 onFailure INFO: Key not found io.grpc.StatusRuntimeException: NOT_FOUND ``` The reason is that the new version of the code makes API level race conditions more apparent. With 100 times as many RPCs happening, the chance of updates and deletes colliding with each other is more likely. To solve this we will need to modify the API definition. Stay tuned for the next post showing how to fix this. ## Conclusion There are a lot of opportunities to optimize your gRPC code. To take advantage of these, you need to understand what your code is doing. This post shows how to convert a lock-based service into a low-contention, lock-free service. Always make sure to measure before and after your changes. In Part 3, we will optimize the code even further. 2,400 RPC/s is just the beginning!