Traditionally, HA (high availability) has been done at the service level. This has worked well enough thanks to various workarounds. We used queues within the services, quorum drives and other techniques to try to avoid data loss. These methods still work just fine, but are they the best way forward in a world of micro-services and cloud?
Containerisation (at the most basic level) enables a single process to live inside its own instance. When the process ends, the container dies with it, which allows an “ever-green” service to be provided, since each process starts with the latest code. This removes many of the traditional reasons for HA. These needs were:
Data centre moves
So let’s see what we need to think about in a world of containers. Software patching is (kind of) irrelevant because each container starts on the current stable code and lives only for the duration of one process. This means we can just do rolling upgrades, which are by their very nature non-disruptive, since no user should be caught in an upgrade cycle. Firmware, data centre moves, cabling, power and server upgrades are all taken care of at the host level, whether that means hypervisor or container host. These can simply be carried out by draining the container hosts or migrating them to a different hypervisor host. It may take a while to drain a busy host, but that’s fine in a world of virtualisation, where CPU and memory are released and reassigned as the host drains. We don’t even need to patch a host: we can just spawn a new one and kill the old one once all its processes are complete. This takes the container concept up a level to the container host, or up two levels to the hypervisor. With automation and monitoring there’s no extra work involved, and we can allow for compliance by using deadlines, where a host is forced offline after a set period of time whether it has drained or not.
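To make the drain-with-a-deadline idea concrete, here is a minimal sketch in Python. It is a toy simulation, not a real orchestrator API: the `Host` class, the `drain` function and the idea of measuring time in "ticks" are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical container host; each entry in `running` is the number
    of ticks a container still needs before its process completes."""
    name: str
    running: list = field(default_factory=list)
    accepting: bool = True


def drain(host, deadline_ticks):
    """Stop scheduling onto the host, let running containers finish, and
    force the host offline once the deadline passes.

    Returns (ticks_waited, containers_killed)."""
    host.accepting = False  # no new containers land on this host
    ticks = 0
    while host.running and ticks < deadline_ticks:
        ticks += 1
        # One tick passes; containers that reach zero have completed.
        host.running = [t - 1 for t in host.running if t - 1 > 0]
    killed = len(host.running)  # whatever is left is cut off at the deadline
    host.running = []
    return ticks, killed


quiet = Host("host-a", running=[1, 2, 3])
busy = Host("host-b", running=[5, 1])
quiet_result = drain(quiet, deadline_ticks=10)  # drains fully before the deadline
busy_result = drain(busy, deadline_ticks=2)     # deadline hits with one container left
```

In this sketch the quiet host drains by itself, while the busy host is forced offline with one container still running. That killed container is exactly the case the retry approach further down is meant to cover.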
So finally we have server failure scenarios. This is what most people think of when you mention HA, although in traditional data centre scenarios it’s actually the least important and least commonly used. If the host dies then the container dies. Unless you want to design a crazy complex clustered container image (go for it if you want nerd kudos!), the container and its process will die and lose data, or at least fail to process some data and return results. The solution here is no longer to make the process resilient; it’s to make the starting and completion of a process resilient. Try, try, try again is the mantra we need here. Processes should not be called directly but through a service bus layer or similar resilient queue, which will spawn a new container if the old one doesn’t report completion in a timely manner.
With a service bus (or clever FIFO queue, as I like to think of it) requests are queued and each request starts a process. If that process completes, it tells the service bus of its success and the service bus deletes the request. If anything else happens, such as a delay, a failure or no response at all, the service bus simply starts another process with the same data and tries again. Naturally this can be configured with a retry limit and hard failure scenarios, so if the data is bad we can give up and log a failure.
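The retry loop described above can be sketched in a few lines of Python. This is an in-memory toy rather than a real service bus: `run_queue`, `flaky` and `max_attempts` are hypothetical names, a raised exception stands in for any failure, delay or missing response, and a real bus would also persist the queue and enforce timeouts.

```python
from collections import deque


def run_queue(requests, process, max_attempts=3):
    """Run each request through `process`, retrying failures up to
    `max_attempts` times. Returns (completed, failed)."""
    queue = deque((req, 1) for req in requests)  # FIFO of (request, attempt number)
    completed, failed = [], []
    while queue:
        req, attempt = queue.popleft()
        try:
            result = process(req)  # stands in for spawning a container
        except Exception:
            if attempt < max_attempts:
                queue.append((req, attempt + 1))  # try again with the same data
            else:
                failed.append(req)  # retry limit hit: give up and log it
        else:
            completed.append((req, result))  # success: the request is deleted
    return completed, failed


attempts_seen = {}


def flaky(req):
    """Hypothetical process: the first attempt always dies with its host,
    and the request "bad" can never be processed."""
    attempts_seen[req] = attempts_seen.get(req, 0) + 1
    if req == "bad":
        raise ValueError("unprocessable data")
    if attempts_seen[req] == 1:
        raise TimeoutError("container host died")
    return req.upper()


completed, failed = run_queue(["a", "bad", "b"], flaky)
```

Here the two good requests succeed on their second attempt, and the bad one is abandoned after three. Note that retrying with the same data means a process may run more than once for a single request, so it helps if the work is safe to repeat.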
Is HA dead? Of course not. We just don’t need to care quite so much about failures these days.