Sunday 29 March 2015

What is Hystrix?

Hystrix is a Netflix library. The definition provided on GitHub reads:

"Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable."

To grasp what that implies, one has to think of a “distributed environment”. Today, most applications are moving towards a modular architecture: a big monolithic application encapsulating everything is no longer preferred. Instead, it is broken down into smaller, more manageable modules or microservices, each dealing with a specific chunk of the application. To present a crude example, let’s say we have an online shopping application. Different chunks, like maintaining data on products and registered users, authenticating users, processing payments etc., could be exposed via different services, modules or third-party libraries. Any call to such a service or client library that might send a request over the network is a potential source of latency or, worse, failure. This is where Hystrix comes in.

Consider an application that handles heavy user traffic in such a distributed environment with a lot of dependencies. If a certain service is down or is too slow to respond, the requests waiting on it pile up and could slow down or choke the entire application. The following diagram from the Hystrix site draws a picture.



Fig. 1 (courtesy: GitHub)


What Hystrix does is create a separate pool of threads for each dependency in the application. A slow or failing service can then exhaust only its own pool, not the threads serving the rest of the application, so the system continues to function even if one service is not behaving as expected. Take a look at the following picture offered by Netflix to explain this scenario.




Fig. 2 (courtesy: GitHub)

Thus, it helps to isolate such points of access between services, thereby avoiding cascading failure across the different application layers. It also provides fallback options, facilitates monitoring of the system state and offers many other desirable features, thus improving the application’s fault tolerance and resiliency.
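To make the isolation idea a little more concrete, here is a rough sketch in plain Java. It does not use Hystrix at all and is not how Hystrix is implemented internally; it only assumes that each dependency gets its own small thread pool, that every call has a timeout, and that a fallback value is returned when the call fails or takes too long. The class name, the dependency names and the timeout values are all made up for illustration.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: plain java.util.concurrent, not Hystrix's own code.
final class IsolationSketch {
    // One small, bounded pool per dependency, keyed by a (hypothetical) dependency name.
    private final Map<String, ExecutorService> pools = new ConcurrentHashMap<>();
    private final int threadsPerDependency;
    private final long timeoutMillis;

    IsolationSketch(int threadsPerDependency, long timeoutMillis) {
        this.threadsPerDependency = threadsPerDependency;
        this.timeoutMillis = timeoutMillis;
    }

    // Runs the call on the dependency's own pool and waits at most timeoutMillis.
    // On timeout or failure the caller gets the fallback instead of hanging.
    <T> T call(String dependency, Callable<T> remoteCall, T fallback) {
        ExecutorService pool = pools.computeIfAbsent(dependency,
                name -> Executors.newFixedThreadPool(threadsPerDependency));
        Future<T> future = pool.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (Exception e) { // the call timed out or threw an exception
            future.cancel(true);
            return fallback;
        }
    }

    void shutdown() {
        pools.values().forEach(ExecutorService::shutdownNow);
    }

    public static void main(String[] args) {
        IsolationSketch app = new IsolationSketch(2, 500);
        // The "payments" dependency is slow, so the caller gets the fallback after 500 ms.
        String payment = app.call("payments",
                () -> { Thread.sleep(2_000); return "paid"; }, "payment pending");
        // The "products" dependency has its own pool and answers normally.
        String product = app.call("products", () -> "laptop", "unknown product");
        System.out.println(payment); // payment pending
        System.out.println(product); // laptop
        app.shutdown();
    }
}
```

The point to notice is that the slow "payments" call only ties up a thread in its own pool, while the "products" call is served from a different pool and is unaffected.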

In fact, Hystrix was born out of the resilience engineering work undertaken by Netflix around 2011. Yes, modular programming has its own price tag, but according to the data collected and analysed, the value it offers far exceeds its cost.

Hope that summarizes the basics of what Hystrix is all about. Let me wrap up with a few pieces of jargon.

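a) Commands -- any request to a dependency has to be wrapped in a Command. Think of it as a Java class whose constructor takes the arguments needed to make the request. Both command types are abstract, so you extend one of them and put the call to the dependency inside your subclass. There are two types of commands:
    i) HystrixCommand -- used when a single response is expected from the dependency
                      // GetProductNameCommand is a hypothetical subclass of HystrixCommand<String>
                      HystrixCommand<String> cmd = new GetProductNameCommand(arg1, arg2);

   ii) HystrixObservableCommand -- used when the dependency is expected to return an Observable that could emit one or more responses
                      // GetReviewsCommand is a hypothetical subclass of HystrixObservableCommand<String>
                      HystrixObservableCommand<String> cmd = new GetReviewsCommand(arg1);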


b) Command Execution -- a command can be executed in one of the following four ways. The first two are available only on HystrixCommand; the last two work on both command types.
    i) execute() -- makes a blocking, synchronous call that either returns a single response or throws an exception
   ii) queue() -- returns a Future from which the single response can be retrieved later
  iii) observe() -- subscribes to the Observable that represents the response(s) from the dependency and returns an Observable that replicates it
   iv) toObservable() -- returns an Observable that, when subscribed to, executes the command and emits the response(s)
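The difference between the blocking and the Future-based styles can be shown with a small stand-in written in plain Java. This is not the Hystrix API: CommandSketch and GreetingCommand are made-up classes that only borrow the method names run(), getFallback(), execute() and queue(), and the real HystrixCommand adds timeouts, thread-pool isolation, circuit breaking and metrics on top. The two Observable-based styles need RxJava and are left out here.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stdlib-only stand-in for the command idea (hypothetical class, not Hystrix itself).
abstract class CommandSketch<R> {
    // Daemon threads so that the demo JVM can exit without an explicit shutdown.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    // The call to the dependency goes here.
    protected abstract R run() throws Exception;

    // What to return when run() fails.
    protected abstract R getFallback();

    // Asynchronous style: hand back a Future right away, the response arrives later.
    Future<R> queue() {
        return POOL.submit(() -> {
            try {
                return run();
            } catch (Exception e) {
                return getFallback();
            }
        });
    }

    // Blocking style: simply wait on the Future returned by queue().
    R execute() {
        try {
            return queue().get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return getFallback();
        } catch (ExecutionException e) {
            return getFallback();
        }
    }
}

// Hypothetical command: pretends that a missing name means the dependency failed.
final class GreetingCommand extends CommandSketch<String> {
    private final String name;

    GreetingCommand(String name) {
        this.name = name;
    }

    @Override
    protected String run() {
        if (name == null) {
            throw new IllegalStateException("dependency failed");
        }
        return "Hello " + name;
    }

    @Override
    protected String getFallback() {
        return "Hello guest";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(new GreetingCommand("Bob").execute()); // Hello Bob
        Future<String> pending = new GreetingCommand("Ann").queue();
        System.out.println(pending.get());                        // Hello Ann
        System.out.println(new GreetingCommand(null).execute());  // Hello guest
    }
}
```

In this sketch execute() is nothing more than queue() followed by a blocking get(), which is a handy way to remember how the two relate.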

c) Circuit-Breaker Pattern -- This much talked about feature of Hystrix helps to check cascading failure across the different application layers. If enough requests have gone to a dependency within a rolling window and the percentage of them that failed (errors, timeouts, rejections) exceeds a configured threshold, the circuit is considered "open", meaning no further requests are routed to that dependency for a certain window period. Once this period elapses, a single request is let through to see if the service is ready to entertain further requests. If it succeeds, the circuit is closed and normal traffic resumes; if not, the circuit stays "open" for another window period. The good thing is that it is all configurable: the threshold at which the circuit should be opened, the window period etc. In fact, one could just force the circuit "open" and check how the application behaves.
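The open / wait / retry cycle is easier to follow in code, so here is a stripped-down sketch in plain Java. It is deliberately simpler than the real thing: it trips after a fixed number of consecutive failures, whereas Hystrix works with an error percentage over a rolling window. The class name, the threshold of 2 and the 5-second window are assumptions chosen for the demo, and the clock is passed in so that the example does not have to sleep.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Simplified circuit breaker sketch (hypothetical class, not Hystrix's implementation).
final class CircuitBreakerSketch {
    private final int failureThreshold;   // consecutive failures that open the circuit
    private final long sleepWindowMillis; // how long the circuit stays open
    private final LongSupplier clock;     // injected time source, in milliseconds
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0L;

    CircuitBreakerSketch(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    // Kept synchronized for brevity; a real breaker would not hold a lock during the request.
    synchronized <T> T call(Supplier<T> request, T fallback) {
        if (open && clock.getAsLong() - openedAt < sleepWindowMillis) {
            return fallback; // circuit is open: do not touch the dependency at all
        }
        // Either the circuit is closed, or the window has elapsed and this is the trial request.
        try {
            T result = request.get();
            consecutiveFailures = 0;
            open = false; // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureThreshold) {
                open = true; // trip, or stay open for another window
                openedAt = clock.getAsLong();
            }
            return fallback;
        }
    }

    synchronized boolean isOpen() {
        return open;
    }

    public static void main(String[] args) {
        long[] now = {0L}; // fake clock so the demo runs instantly
        CircuitBreakerSketch breaker = new CircuitBreakerSketch(2, 5_000, () -> now[0]);
        Supplier<String> down = () -> { throw new IllegalStateException("service down"); };

        breaker.call(down, "cached");
        breaker.call(down, "cached");
        System.out.println(breaker.isOpen());                     // true
        System.out.println(breaker.call(() -> "live", "cached")); // cached
        now[0] = 6_000; // the window period has elapsed
        System.out.println(breaker.call(() -> "live", "cached")); // live
        System.out.println(breaker.isOpen());                     // false
    }
}
```

Note how the third call returns the cached value even though the service would have answered: while the circuit is open, the dependency is not contacted at all.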


I think this much should suffice for now. More details and examples on using it will be taken up another time.


2 comments:

  1. great. a bit more info on circuit breaker pls like how the cascading affect can be checked

    Replies
    1. Hi!
      Well, as explained in the post, once a service is unresponsive, further requests to it are suspended. That is, the circuit is "opened". Imagine a switch being turned off so that no current flows through the circuit. Since no more requests are sent to the service and none are queued up either, the failure at the unresponsive service does not propagate to all the application layers. Consider a DAO layer that makes a call to an external service which is down. Calls from the DAO would fail, but in the face of that failure a cached response or some other degraded response could be offered instead. The point is, this failure at the DAO would not propagate up to the Service or Business layer of the application.
      Thus, the "cascading effect" is checked. Hope that clarifies it :)
