Wednesday, 18 September 2013

Fear Driven Development

If you're an agile practitioner, you'll have certainly heard of most, if not all, of the following:

  • Test Driven Development
  • Behaviour Driven Development
  • Feature Driven Development
  • Acceptance Test Driven Development
  • ... and there are probably a few more which I haven't mentioned

But have you heard of Fear Driven Development?  What does this mean?


Let me be clear, I've worked in many organisations, from start-ups to Blue Chips to some heavy-weight dot coms.  FDD isn't something new and isn't attributable to any particular place I have worked.  It's not some manifestation of a new methodology.  FDD is the product of a programmer's insecurity fuelled by code complexity, constant pressure to deliver new features, company culture and a resignation that things will never change.  That's just how things are, and there's no point in going against the grain.

An anecdotal story of how FDD manifests


Assume you're the new guy in the team.  They're all pretty friendly, but incredibly busy.  "Wow, these guys are hard working!" you think to yourself.



In an effort to make a good impression during your probation period, you learn as much as you can about the systems you will be working on.  Any documentation you find is either incomplete, out of date, or even worse, just plain wrong.    You have a look at the code and find multiple layers of abstraction [which I'll cover in a future post].   So, it kind of looks well-written, if a bit complicated to understand. You find some Unit Tests, but they're a bit hazy and incomplete.   Some don't even attempt to tell you what the test is trying to prove or disprove.  Confusion ensues.

After a few days of familiarising yourself with the code base, you find there's code in there where you're not completely sure what it is doing, or even why it's doing it.

As you're still in the honeymoon period in your new job, you say to yourself "I'm new, so it will take a while to figure out, so it's okay not to have the full understanding."

You're given your first real-world task.  


"That looks easy enough" you say to yourself.   Then when you actually delve into the code and step through what's going on around the feature you're implementing or enhancing, you find yourself in a jungle of code.   You find yourself wandering aimlessly in the code trying to figure out what the hell is going on.   You have no idea what's going on.   Okay, you can use a debugger to slowly get there, but with all this abstraction and all these little things that happen it gets even more confusing.

Then you realise you've eaten up half your time finding your feet when you should have been writing good, testable code.    Instead, panic ensues as you don't want to look like an idiot.   You've already asked every member of your team a couple of questions each (so as not to use up all your credit with any particular individual.  And hey, it's an excuse to get to know your colleagues, right?)

Going against all your best-practice principles, you're getting nervous that you're not going to meet the deadline.   So what do you do?  Well, there's no choice but to dive in and write a bit of code where you *think* you need to put the code.   Dammit.  That's broken things.   You aim again.  Fire!   Doh!  That's the wrong place.   Confusion turns to nervousness turns to panic.  You've only got a few hours left to complete this task.

Your team leader asks "How's it going?".  You say you're still wrestling with the code base and it's a bit confusing.  You've got to own up otherwise you're in for it when they find out you've not written a single line of production-ready code.  You're off the hook - temporarily. 

The next day you get the same question posed to you.  You nervously reply "Err, yeah, I think I'm on to something, but I'm not quite sure."

Your team leader is beginning to look a bit impatient.  He's under pressure to deliver a bunch of features.   You find yourself working late, and surprisingly, all your colleagues are working late too.  You don't want to let the team down.  Neither do you want to go through the hassle of looking for another job.

And so the vicious cycle continues.  The more pressure you are under to deliver, the more hacky your code becomes.  It's the only way to get things done and not look bad.   You hope that nobody will inspect your code, for fear that you'll get that tap on the shoulder for a quick chat.  You're also getting worried that somebody will discover that you didn't write comprehensive unit tests (if any for that matter!)

Every feature you subsequently deliver envelops you in panic.   How long will it be before you're found out?

For the enlightened ones, it dawns on them that the reason the team are so hard-working is that they all feel exactly the same way as you.  Nobody wants to be found out and they all work their backsides off in silence hoping nobody finds out about their hacks.

So there you have it.   That's how fear drives a team to deliver code.   That code base you're working on will have dire consequences for your company's future agility.

If you stick around long enough, you will eventually see the proverbial hit the fan.  

My advice to you is that if you see code that sucks and a bunch of scared developers, either be brave enough to encourage better practices within your team and push back to your dev lead that this code needs serious refactoring, or do your best to start looking for a job elsewhere. 

Otherwise you will find that someone will be tapping on your shoulder one day.



Wednesday, 11 September 2013

Controlling Time in Java

What's the point of this post?

I felt compelled to post something about controlling time in Java after helping out a colleague who had been tenaciously trying to get to the bottom of an apparent "problem" with Netflix's Hystrix Circuit Breaker.

Now I'll forgive you if you haven't heard of a Circuit Breaker in software terms, or Hystrix for that matter. See this Netflix tech blog for an insight.  For those who simply want the executive summary, it's a Java library which helps ensure that a chain of inter-operating back-end services doesn't "jam up" when too much happens at once, when some service runs so slowly that its consuming services all hang waiting for it, or when a singular or compound failure occurs.  Put even more simply, when a back-end service fails or stalls in an interconnected web of disparate services, a cascading failure becomes ever more likely.   Consider a circuit breaker in your home: when there is too much load on your house's electrical system (or when lightning strikes), an appliance meltdown isn't far away; if you're lucky, the fuse blows and everything switches off.  In the worst-case scenario, a direct-hit lightning strike will cause appliances to actually blow up because the fuse didn't catch the power surge quickly enough.   The same thing can happen in distributed systems, but it's any combination of high demand, service failure, timeouts and sluggish responses that causes the cascading effect, rather than lightning.

So that's the context around what we were trying to test and prove - that the Circuit Breaker trips under the combination of aforementioned conditions.

So, for a successful test, we have the following options at our disposal:
  1. Create Unit Test cases to prove our scenarios behave as expected.
  2. Create integration test cases to prove everything hangs together and behaves in the way we expect.
  3. Manually test the application, contriving failure modes, in order to verify the automated tests are valid.
In our case, we had written a large number of test cases for the first two points in order to prove things worked as expected.   Yet my colleague, almost at the end of his tether, was baffled why he couldn't get the circuit to trip open when he manually tested the scenarios.  In his mind, the Netflix code was to blame, as he didn't completely understand what was going on inside it.

After we stepped through the Hystrix code it became apparent that there were many threads running and the Circuit Breaker state was changing as we were debugging.  For me, this was the smoking gun.

So, delving a bit deeper we discovered another thread was resetting state periodically.  We needed to understand why this was happening, so we had a quick exchange of ideas and hypothesised that something was configured to clean up state more frequently than the manual test could possibly keep up with.  No problem with the automated tests, so it must be a timing issue, right?  

We looked at the code and learned about rolling statistical windows.    After a bit more code inspection, we learned that the rogue thread was using the window size to reset the failure statistics which our tests relied upon.

We finally nailed the cause when we looked at Hystrix's default configuration items for "metricsRollingStatisticalWindowInMilliseconds".

We patched this value on-the-fly in a debugger to 100 times its default value to ensure our manual tests could complete within the time frame, and re-tested manually.  We successfully managed to contrive the scenario we were expecting.  Bingo!  Our hypothesis was bang on the money.   The problem turned out to be that the default window size didn't really help us whilst testing manually.  It also boiled down to the fact that we weren't initially aware that Hystrix kept a rolling statistical window, but as always, the truth is in the code.  It always pays to get your hands really dirty, and getting into the guts of any library you're working with will - most of the time - reward you with the purest truth.  For astute software engineers with a fearless attitude, using Open Source Software will save the need for expensive support contracts.  It avoids waiting for patch fixes, as you can fix it yourself (if there's a bug, of course!).  It also negates the worry about whether your problem genuinely did get fixed.
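To make the idea of a rolling statistical window a bit more concrete, here's a minimal sketch of my own.  It is not Hystrix's implementation: the class RollingFailureCounter, its methods and the 10-second window are all made up for illustration.  It simply counts failures and forgets any that are older than the window, which is enough to show why slow, manual testing never accumulated enough failures, while a window 100 times longer did.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only: NOT how Hystrix implements its metrics.
class RollingFailureCounter {
    private final long windowMs;
    private final Deque<Long> failureTimes = new ArrayDeque<Long>();

    RollingFailureCounter(long windowMs) {
        this.windowMs = windowMs;
    }

    // The caller passes the time in, so the sketch needs no real clock.
    void recordFailure(long nowMs) {
        failureTimes.addLast(nowMs);
    }

    // Drops failures that have aged out of the window, then counts the rest.
    int failuresInWindow(long nowMs) {
        while (!failureTimes.isEmpty()
                && nowMs - failureTimes.peekFirst() >= windowMs) {
            failureTimes.removeFirst();
        }
        return failureTimes.size();
    }
}

class RollingWindowDemo {
    public static void main(String[] args) {
        RollingFailureCounter shortWindow = new RollingFailureCounter(10_000);
        RollingFailureCounter longWindow = new RollingFailureCounter(1_000_000);

        // A slow manual tester provokes one failure every four seconds.
        for (long t = 0; t <= 8_000; t += 4_000) {
            shortWindow.recordFailure(t);
            longWindow.recordFailure(t);
        }

        System.out.println(shortWindow.failuresInWindow(12_000)); // prints 2
        System.out.println(longWindow.failuresInWindow(12_000));  // prints 3
    }
}
```

If a circuit needed, say, three failures inside the window before tripping, the short window would never get there at this pace, because the oldest failure has already aged out by the time the third one is counted.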

So that's the reason for this blog.  I felt compelled to share with the world that there are ways and means of controlling time, which incidentally, Hystrix doesn't do.  In this example I'll use Java.

How do you control time?

First, you don't call System.currentTimeMillis() or System.nanoTime() directly to get the time.
Second, you write a mediator class which enables you to wire-in how you want to control the time. Granted, this only works for your own applications, and where you use third-party libraries, you're kinda stuffed unless you start looking at using dynamic proxies over the System class (I may post that in a later blog, or not, depending on whether people ask for it - not even entirely sure whether ASM or CGLIB would even work on such a core package, we'll find out soon I suppose!).

So here are the classes that will assist us (I won't show the Unit Tests as they are simple enough to understand).


First of all, the mediator class, namely the TimeProvider

 package net.gf.time;  
 import java.util.Date;  
 public class TimeProvider {  
   private static RealTimeClock clock = new RealTimeClock();  
   public static long getTime() {  
     return clock.getTime();  
   }  
   public static void setClock(RealTimeClock theClock) {  
     clock = theClock;  
   }  
   public static Date getDate() {  
     return new Date(getTime());  
   }  
 }  

Secondly, the RealTimeClock (default clock)

 package net.gf.time;  
 public class RealTimeClock {  
   public long getTime() {  
     return System.currentTimeMillis();  
   }  
 }  



Finally, the clocks that change how we see time in Java

A Fixed time clock for precise checking of edge cases.


 package net.gf.time;  
 public class FixedTimeClock extends RealTimeClock {  
   private long fixedTime;  
   public long getTime() {  
     return fixedTime;  
   }  
   public void setTime(long time) {  
     this.fixedTime = time;  
   }  
 }  


A skewing clock for simulating real-time, but skewed by a provided delta in milliseconds.  Useful for testing threads that should do something after midnight, for example.

 package net.gf.time;   
  public class TimeShiftingClock extends RealTimeClock {   
   private long deltaMsFromNow;   
   public long getTime() {   
    return super.getTime() + deltaMsFromNow;   
   }   
   public void setTime(long deltaMsFromNow) {   
    this.deltaMsFromNow = deltaMsFromNow;   
   }   
  }   
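As a rough sketch of that "after midnight" use, the snippet below skews the clock so that "now" becomes one minute past the next local midnight.  The helper justAfterNextMidnight and the class MidnightShiftDemo are hypothetical names of my own, and the two clock classes are repeated (without the package line) only so that the snippet compiles on its own.

```java
import java.util.Calendar;
import java.util.Date;

// Compact copies of the two clocks above, repeated so this compiles on its own.
class RealTimeClock {
    public long getTime() {
        return System.currentTimeMillis();
    }
}

class TimeShiftingClock extends RealTimeClock {
    private long deltaMsFromNow;

    @Override
    public long getTime() {
        return super.getTime() + deltaMsFromNow;
    }

    public void setTime(long deltaMsFromNow) {
        this.deltaMsFromNow = deltaMsFromNow;
    }
}

class MidnightShiftDemo {
    // Hypothetical helper: the timestamp one minute past the next local midnight.
    static long justAfterNextMidnight(long nowMs) {
        Calendar cal = Calendar.getInstance();
        cal.setTimeInMillis(nowMs);
        cal.add(Calendar.DAY_OF_MONTH, 1);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 1);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        return cal.getTimeInMillis();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long target = justAfterNextMidnight(now);

        TimeShiftingClock clock = new TimeShiftingClock();
        clock.setTime(target - now); // skew "now" forward to just after midnight

        // The skewed clock keeps ticking in real time from its new starting point.
        System.out.println(new Date(clock.getTime()));
    }
}
```

Unlike the fixed clock, the skewed clock still advances, so code that polls the time in a loop sees it moving forward as usual.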

So how do these classes help me?

Suppose you're testing a timeout edge case in your unit tests.   The simplest thing to do would be to set the timeout to some ridiculously low number, call Thread.sleep() for a little bit longer than that timeout value, and you're good to go.   Repeated across many tests, however, those sleeps add up and will eventually make your Unit Test suite take much longer than it needs to.

Also, because we're using a mediator, we can interchange the types of clock we're using as and when necessary.  This puts a very powerful time machine at your disposal.

For example, here is the Thread.sleep() approach:


 @Test  
 public void testTimeout() throws InterruptedException {  
   thing.setTimeout(1); //ms  
   thing.initiateSomething();  
   Thread.sleep(2);  
   assertTrue(thing.hasTimedOut());  
 }  

Instead, consider this:


 @Test  
 public void testTimeout() {  
   FixedTimeClock ftc = new FixedTimeClock();  
   TimeProvider.setClock(ftc);  
   thing.setTimeout(1000); //ms  
   ftc.setTime(0);  
   thing.initiateSomething();  
   ftc.setTime(1000);  
   assertTrue(thing.hasTimedOut());  
 }  

Pretend the method thing.initiateSomething() does something, and it's only complete when thing.complete() is called.   Given that thing.complete() never gets called, thing.hasTimedOut() will check how long it has been active for.  If it's been active for too long, it will return true.

Although on the surface this test case looks slightly more complicated, it is inherently clear that you're shifting time, and you are able to precisely test timeout edge cases.

We can obviously make that test case cleaner by making the FixedTimeClock a member variable and initialising it in a setup method, but that's not the point.


How does my Thing class get the time to work out if a timeout has occurred?

To tie things up, all the Thing class has to do is call the following, rather than System.currentTimeMillis():

TimeProvider.getTime();
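For completeness, here's a rough sketch of what such a Thing class might look like.  Thing was only ever a pretend class, so its fields and the ">=" comparison are assumptions of mine, chosen so that the fixed-clock test above would pass.  Compact copies of the time classes are repeated (without the package line) only so that the snippet compiles on its own.

```java
// Compact copies of the classes above, repeated so this compiles on its own.
class RealTimeClock {
    public long getTime() { return System.currentTimeMillis(); }
}

class FixedTimeClock extends RealTimeClock {
    private long fixedTime;
    @Override public long getTime() { return fixedTime; }
    public void setTime(long time) { this.fixedTime = time; }
}

class TimeProvider {
    private static RealTimeClock clock = new RealTimeClock();
    public static long getTime() { return clock.getTime(); }
    public static void setClock(RealTimeClock theClock) { clock = theClock; }
}

// Hypothetical Thing: the fields and the ">=" comparison are assumptions.
class Thing {
    private long timeoutMs;
    private long startedAtMs;
    private boolean active;

    void setTimeout(long timeoutMs) { this.timeoutMs = timeoutMs; }

    void initiateSomething() {
        startedAtMs = TimeProvider.getTime(); // never System.currentTimeMillis()
        active = true;
    }

    void complete() { active = false; }

    // True once the Thing has been active for at least the timeout.
    boolean hasTimedOut() {
        return active && TimeProvider.getTime() - startedAtMs >= timeoutMs;
    }
}

class ThingDemo {
    public static void main(String[] args) {
        FixedTimeClock ftc = new FixedTimeClock();
        TimeProvider.setClock(ftc);
        try {
            Thing thing = new Thing();
            thing.setTimeout(1000); // ms
            ftc.setTime(0);
            thing.initiateSomething();
            ftc.setTime(999);
            System.out.println(thing.hasTimedOut()); // prints false
            ftc.setTime(1000);
            System.out.println(thing.hasTimedOut()); // prints true
        } finally {
            // The clock lives in a static field, so put the real one back.
            TimeProvider.setClock(new RealTimeClock());
        }
    }
}
```

Because TimeProvider keeps its clock in a static field, it's worth restoring the real clock after each test (as the finally block does here), otherwise a fixed clock can leak into whichever test runs next.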