Wednesday, 11 September 2013

Controlling Time in Java

What's the point of this post?

I felt compelled to post something about controlling time in Java after helping out a colleague who had been tenaciously trying to get to the bottom of an apparent "problem" with Netflix's Hystrix Circuit Breaker.

Now I'll forgive you if you haven't heard of a Circuit Breaker in software terms, or Hystrix for that matter. See this Netflix tech blog for an insight  For those who will simply want the executive summary, it's Java library which assists with ensuring that a chain of inter-operating back-end services don't all "jam-up" in the event of too much happening at once, some service runs so slow that consuming services all hang waiting for the slow service to catch up with them, or when a singular or compound failure occurs.  Put even simpler, when a back-end service fails or stalls in an interconnected web of disparate services, a cascading failure scenario becomes ever more likely.   Consider a circuit breaker in your home - when too much load on the electric system of your house occurs (or even when lightning strikes), an appliance meltdown isn't far away; if you're lucky, the fuse blows, and everything switches off.  The worst-case scenario - a direct-hit lightning strike will cause appliances to actually blow up as the fuse didn't quite catch the power surge quickly enough.   The same thing can happen in distributed systems - but it's any combination of high-demand, service failure, timeouts and sluggish responses that will cause a cascading effect, rather than lightning itself.

So that's the context around what we were trying to test and prove - that the Circuit Breaker trips under the combination of aforementioned conditions.

So, for a successful test, we have the following options at our disposal:
  1. Create Unit Test cases to prove our scenarios behave as expected.
  2. Create integration test cases to prove everything hangs together and behaves in the way we expect.
  3. Manually test the application, contriving failure modes, in order to verify the automated tests are valid.
In our case, we had written a large number of test cases for the first two points in order to prove things worked as expected.   Yet my colleague, almost at the end of his tether, was baffled why he couldn't get the circuit to trip open when he manually tested the scenarios.  In his mind, it was the Netflix code was to blame as he didn't completely understand what was going on inside it.  

After we stepped through the Hystrix code it became apparent that there were many threads running and the Circuit Breaker state was changing as we were debugging.  For me, this was the smoking gun.

So, delving a bit deeper we discovered another thread was resetting state periodically.  We needed to understand why this was happening, so we had a quick exchange of ideas and hypothesised that something was configured to clean up state more frequently than the manual test could possibly keep up with.  No problem with the automated tests, so it must be a timing issue, right?  

We looked at the code and learned about rolling statistical windows.    After a bit more code inspection, we learned that the rouge thread was using this value to reset failure statistics which our tests relied upon.

We finally nailed the cause when we looked at Hystrix's default configuration items for "metricsRollingStatisticalWindowInMilliseconds".

We patched this value on-the-fly in a debugger to 100 times it's default value to ensure our manual tests could complete within the time frame.   and re-tested manually.  We successfully managed to contrive the scenario we were expecting.  Bingo!  Our hypothesis was bang on the money.   The problem turned out to be that the default window size didn't really help us whilst testing manually.  It also boiled down to the fact that we weren't initially aware that Hystrix kept a rolling statistical window, but as always, the truth is in the code.  It always pays to get your hands really dirty and getting into the guts of any library you're working with will - most of the time - reward you with the purest truth.  For astute software engineers with a fearless attitude, using Open Source Software will save the need expensive support contracts.  It avoids waiting for patch fixes, as you can fix it yourself (if there's a bug of course!),  It also negates the worrying whether your problem genuinely did get fixed.  

So that's the reason for this blog.  I felt compelled to share with the world that there are ways and means of controlling time, which incidentally, Hystrix doesn't do.  In this example I'll use Java.

How do you control time?

First, you don't use System.currentTimeMs() or System.nanotime() to get the time.
Second, you write a mediator class which enables you to wire-in how you want to control the time. Granted, this only works for your own applications, and where you use third-party libraries, you're kinda stuffed unless you start looking at using dynamic proxies over the System class (I may post that in a later blog, or not, depending on whether people ask for it - not even entirely sure whether ASM or CGLIB would even work on such a core package, we'll find out soon I suppose!).

So here are the classes that will assist us (I won't show the Unit Tests as they are simple enough to understand).


First of all, the mediator class, namely the TimeProvider

 package net.gf.time;  
 import java.util.Date;  
 public class TimeProvider {  
   private static RealTimeClock clock = new RealTimeClock();  
   public static long getTime() {  
     return clock.getTime();  
   }  
   public static void setClock(RealTimeClock theClock) {  
     clock = theClock;  
   }  
   public static Date getDate() {  
     return new Date(getTime());  
   }  
 }  

Secondly, the RealTimeClock (default clock)

 package net.gf.time;  
 public class RealTimeClock {  
   public long getTime() {  
     return System.currentTimeMillis();  
   }  
 }  



Finally, the clocks that change how we see time in Java

A Fixed time clock for precise checking of edge cases.


 package net.gf.time;  
 public class FixedTimeClock extends RealTimeClock {  
   private long fixedTime;  
   public long getTime() {  
     return fixedTime;  
   }  
   public void setTime(long time) {  
     this.fixedTime = time;  
   }  
 }  


A skewing clock for simulating real-time, but skewed by a provided delta in milliseconds.  Useful for testing threads that should do something after midnight, for example.

 package net.gf.time;   
  public class TimeShiftingClock extends RealTimeClock {   
   private long deltaMsFromNow;   
   public long getTime() {   
    return super.getTime() + deltaMsFromNow;   
   }   
   public void setTime(long deltaMsFromNow) {   
    this.deltaMsFromNow = deltaMsFromNow;   
   }   
  }   

So how do these classes help me?

Consider you're testing a timeout edge case in your unit tests.   So to test this, the simplest thing to do would be to set the timeout to some ridiculously low number, and use Thread.sleep() for a little bit longer than that timeout value and you're good to go.   This however, as a compound effect, will eventually make your Unit Test suite take much longer than it needs to.

Also, because we're using a mediator, we can interchange the types of clock we're using as and when necessary.  This makes a very powerful time machine at your disposal.

For example:


 @Test  
 public void testTimeout() throws InterruptedException {  
   thing.setTimeout(1); //ms  
   thing.initiateSomething();  
   Thread.sleep(2);  
   assertTrue(thing.hasTimedOut());  
 }  

Instead, consider this:


 @Test  
 public void testTimeout() throws InterruptedException {  
   FixedTimeClock ftc = new FixedTimeClock();  
   TimeProvider.setClock(ftc);  
   thing.setTimeout(1000); //ms  
   ftc.setTime(0);  
   thing.initiateSomething();  
   ftc.setTime(1000);  
   assertTrue(thing.hasTimedOut());  
 }  

Pretend the method thing.initiateSomething() does something, and it's only complete when thing.complete() is called.   Given that thing.complete() never gets called, thing.hasTimedOut() will check how long it has been active for.  If it's been active for too long, it will return true.

On the surface, this test case looks slightly more complicated, it is inherently clear that you're shifting time and you are able to precisely test timeout edge cases.

We can obviously make that test case cleaner by making the FixedTimeClock a member variable and initialise it in a setup method, but that's not the point.


How does my Thing class get the time to work out if a timeout has occurred?

To tie things up, all the Thing class has to do is call the following, rather than System.currentTimeMillis() 

TimeProvider.getTime();



Conclusion

Using the Mediator design pattern has a massive advantage here when your application requires something that you want to control rather than relying on Java's underlying classes and introducing "fudge factors" to get your application to work as expected.

The fact that Hystrix doesn't use a TimeProvider motivates me to contribute back this concept into their code base in order to improve their testing capabilities, and also help other developers who choose to use it!  

I just hope that one day, we all see time the same way!

Thanks for reading.


No comments:

Post a Comment