Java AWS Lambda Valkey Client Hangs: Fix For V2.2.0+
Hey everyone! Let's dive into a head-scratcher that some of you might have run into if you're using the Valkey Java client with AWS Lambda, especially after bumping up to versions 2.2.0-rc3 or later. So, picture this: you've got your Java code happily running on AWS Lambda, using the java21 runtime on x86_64. Everything's groovy. Then, you decide to update the Valkey client from, say, 2.1.1 to 2.2.0. Suddenly, your tests start timing out, and it's not just a little hiccup – all requests seem to hang indefinitely. No matter what you do, the futures you're getting back from the client never seem to complete. Pretty annoying, right? We're talking about simple calls like setting a key with some options, where the .get() call just sits there, lost in the void. And the crazy part? Even turning on trace logging doesn't give you any juicy clues. This definitely isn't the behavior we expect: the Java client's contract is that every future eventually completes, either with a result or an error.
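To make the failure mode concrete, here's a minimal sketch of the kind of call that hangs. The class and builder names follow the valkey-glide Java API as I understand it (GlideClient, GlideClientConfiguration, NodeAddress), and the endpoint is a placeholder, so treat this as illustrative rather than copy-paste ready:

```java
import glide.api.GlideClient;
import glide.api.models.configuration.GlideClientConfiguration;
import glide.api.models.configuration.NodeAddress;

public class HangRepro {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; point this at your own Valkey instance.
        GlideClientConfiguration config = GlideClientConfiguration.builder()
                .address(NodeAddress.builder().host("my-valkey-host").port(6379).build())
                .build();
        GlideClient client = GlideClient.createClient(config).get();

        // On 2.1.1 this resolves to "OK"; on >= 2.2.0-rc3 under Lambda's custom
        // classloader the future never completes and .get() blocks forever.
        String result = client.set("some-key", "some-value").get();
        System.out.println(result);

        client.close();
    }
}
```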
When Futures Go AWOL: The Mystery of the Hanging Requests
So, what's actually happening when your Java AWS Lambda requests hang indefinitely with these newer Valkey client versions? The core of the problem lies in how the client, specifically the valkey-glide component, interacts with Java's classloading system within certain environments, like AWS Lambda. When you're running code in environments where the glide client isn't loaded by the standard ClassLoader.getSystemClassLoader(), but rather by a custom classloader (which is exactly what happens in AWS Lambda), things can get a bit tangled. You'll notice that simple calls to any data plane operation, like set or get, result in futures that just never complete. It's like sending a message in a bottle and never getting a reply. We've seen this specifically with versions >= 2.2.0-rc3. The expected behavior, of course, is that these futures should resolve, either successfully or by throwing an exception, giving you some kind of feedback. But in this scenario, they just hang there, leaving your application in a perpetual state of waiting. This is a pretty critical issue because it breaks the fundamental contract of asynchronous operations – they are supposed to finish eventually! If you're trying to debug this, you might be tempted to enable trace logging, hoping for a smoking gun. However, you'll likely find that the logs are surprisingly quiet on the actual client operations, adding to the mystery. The problem doesn't stem from a lack of communication with the Valkey server itself, but rather from an internal mechanism within the client failing to finalize the operation and return the result to your application thread. It's a subtle but significant breakdown in the client's internal callback and completion handling.
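If you want your tests to fail fast instead of sitting on a blocked .get(), one simple mitigation while debugging is to bound the wait. This is plain java.util.concurrent usage, nothing Valkey-specific; the client parameter is assumed to be an already-created glide client like the one in the sketch above:

```java
import glide.api.GlideClient;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class FutureGuard {
    // Bound the wait so a stuck future surfaces as a TimeoutException instead of
    // hanging the Lambda until the function-level timeout kills it.
    static void setWithTimeout(GlideClient client) throws Exception {
        CompletableFuture<String> pending = client.set("some-key", "some-value");
        try {
            System.out.println("Completed: " + pending.get(5, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            // On the affected versions this branch fires: the future simply never completes.
            System.err.println("set() future did not complete within 5s");
        }
    }
}
```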
Unraveling the ClassLoader Conundrum: Why Things Break
Okay, guys, let's dig a bit deeper into why these requests hang indefinitely on AWS Lambda when you're using the Valkey client. The main culprit here is a classic Java classloading issue, exacerbated by the specific environment of AWS Lambda. When the valkey-glide client needs to interact with certain internal classes, like glide.internal.AsyncRegistry, it uses JNI (Java Native Interface) to call back into Java. The JNI FindClass function has a peculiar behavior: when there's no current native method on the stack to borrow a classloader from (for example, when the call comes from a native thread attached to the JVM), it falls back to ClassLoader.getSystemClassLoader(). Now, in a standard application, this might be fine. But in AWS Lambda, your code and its dependencies are often loaded by a custom classloader – something like com.amazonaws.services.lambda.runtime.api.client.CustomerClassLoader. When FindClass is invoked from the native side and it defaults to the system classloader, it's looking for AsyncRegistry in the wrong place. It can't find it there because it's actually loaded by the custom Lambda classloader. This results in a java.lang.ClassNotFoundException: glide.internal.AsyncRegistry, which, unfortunately, was being silently swallowed by the client before the patch. Because the client couldn't find this crucial class, it couldn't properly set up the asynchronous registry needed to handle callbacks and complete the futures. That's why your futures never resolve – the mechanism to complete them is broken due to this ClassNotFoundException. The patch we identified addresses this in two ways: 1. when FindClass fails, it no longer ignores the exception but logs it properly so you can see what's going on, and 2. it prints out the exception details, which was key in diagnosing the NoClassDefFoundError. Understanding this classloader behavior is super important for anyone deploying Java applications in serverless or other non-standard classloader environments. It's not always obvious, but these subtle differences in how classes are loaded can cause hard-to-debug issues like hanging requests.
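You can see this split for yourself from inside a Lambda handler. The sketch below (a hypothetical ClassLoaderCheck helper, not part of the glide API) compares the system classloader with the loader that loaded your own code, and then tries to resolve glide.internal.AsyncRegistry through the system loader, roughly the way the JNI fallback does:

```java
public class ClassLoaderCheck {
    public static void run() {
        ClassLoader system = ClassLoader.getSystemClassLoader();
        ClassLoader app = ClassLoaderCheck.class.getClassLoader();
        // Inside Lambda the second line prints the CustomerClassLoader, not the system loader.
        System.out.println("System classloader:  " + system);
        System.out.println("Handler classloader: " + app);

        try {
            // Mirrors what the JNI FindClass fallback effectively does; in Lambda this
            // fails because AsyncRegistry lives in the custom classloader, not the system one.
            Class.forName("glide.internal.AsyncRegistry", false, system);
            System.out.println("System loader can see AsyncRegistry");
        } catch (ClassNotFoundException e) {
            System.out.println("System loader cannot see AsyncRegistry: " + e);
        }
    }
}
```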
The Smoking Gun: A Silent Exception and a ClassLoader Mishap
So, we've pinpointed the issue causing all requests to hang indefinitely on AWS Lambda for Valkey client versions >= 2.2.0-rc3. The core problem was a java.lang.NoClassDefFoundError for glide/internal/AsyncRegistry, which was being thrown but then silently ignored. Imagine that! A critical error happening, but your program just carries on as if nothing's wrong, leading to those perpetual hangs. The actual root cause of this NoClassDefFoundError is, as we discussed, related to classloaders. When the Valkey Java client, specifically the valkey-glide part, needs to load the AsyncRegistry class via JNI, the FindClass method doesn't look in the correct classloader context within the AWS Lambda environment. Instead of using the custom CustomerClassLoader that loaded your application code, it falls back to ClassLoader.getSystemClassLoader(). Since AsyncRegistry isn't available via the system classloader in this setup, FindClass fails, throwing a ClassNotFoundException, which then leads to the NoClassDefFoundError. The crucial insight came from patching the code so these errors are no longer ignored but actually logged. Once the first patch (which simply logs the error) was in place, the message glide_rs::jni_client: Unable to complete callback: Failed to find AsyncRegistry class: Java exception was thrown started appearing, and that was a huge clue. The subsequent patch, which ensured exceptions from env.find_class were properly handled and described, revealed the underlying java.lang.NoClassDefFoundError and its Caused by: java.lang.ClassNotFoundException: glide.internal.AsyncRegistry. Printing the classloader of AsyncRegistry.class in the application code confirmed the suspicion: it showed com.amazonaws.services.lambda.runtime.api.client.CustomerClassLoader@..., proving it was loaded by Lambda's custom loader, not the system one. This mismatch is the direct cause of the futures never completing, because the client can't initialize its necessary internal components.
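Here's that confirmation step as a snippet you can drop into your handler. HandlerProbe is just a hypothetical stand-in for your own handler class; resolving AsyncRegistry through the same loader that loaded the application code and printing the result is what produced the CustomerClassLoader@... output mentioned above:

```java
public class HandlerProbe {
    public static void printAsyncRegistryLoader() throws ClassNotFoundException {
        // Resolve AsyncRegistry through the loader that loaded the application code.
        Class<?> registry = Class.forName(
                "glide.internal.AsyncRegistry", false, HandlerProbe.class.getClassLoader());
        // In the report this printed com.amazonaws.services.lambda.runtime.api.client.CustomerClassLoader@...
        System.out.println("AsyncRegistry loaded by: " + registry.getClassLoader());
    }
}
```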
The Fix: Patching for Robustness and Visibility
Alright, let's talk about the fix, or rather, the patches that were instrumental in solving this request hang issue in AWS Lambda for the Valkey client. We actually have two key patches that work together to resolve the problem. The first patch is about making things more visible and robust. It modifies the process_callback_job function in jni_client.rs. Previously, if complete_java_callback returned an error, it was just ignored. This patch changes that behavior. It now checks the result of complete_java_callback and, if there's an error, it logs it using log::error!. This is super important because it surfaces the previously hidden errors, like the Unable to complete callback: Failed to find AsyncRegistry class message. Without this logging, you'd just be left with hanging requests and no idea why. The second, and arguably more critical, patch addresses the root cause: the FindClass failure for glide/internal/AsyncRegistry. In the get_method_cache function, when `env.find_class("glide/internal/AsyncRegistry")` fails, the error is no longer swallowed: the pending Java exception is checked and described before the failure is reported, which is exactly what surfaced the java.lang.NoClassDefFoundError and its Caused by: java.lang.ClassNotFoundException: glide.internal.AsyncRegistry shown above. Together, the two patches turn a silent, indefinite hang into a visible, diagnosable error, which is what made it possible to trace the problem back to the Lambda classloader mismatch in the first place.