re·sil·ience
noun \ri-ˈzil-yən(t)s\
: the ability to become strong, healthy, or successful again after something bad happens
: the ability of something to return to its original shape after it has been pulled, stretched, pressed, bent, etc.
A person's character is sometimes defined by the negative, life-changing events and challenges they are able to rise from before going on to establish themselves successfully. The character trait which allows someone to face adversity and rise up from it repeatedly is resilience.
My goal here is to analyze how one can capture the essence of this character trait and apply it directly to build software which could truly be called resilient. Such software should be able to pinpoint where errors lie in the system, and should also act in a consistent manner in the day-to-day operations which govern the working of the software at the lower layers.
The big question now is: how can one capture such a character trait and implement it in software?
To answer this question, one should understand which basic ingredients, when mixed together, would create resilience.
From the above definition I can deduce some of the absolutely required ingredients to create resilience. These are as follows, in no particular order:
- intelligence
- reliability
- communication
- discipline
- ownership
- understanding
Let me explain how all these ingredients, mixed in the right quantities in the hands of a developer who pays attention to detail and is diligent with their code, could create the ultimate resilient distributed software.
One of the biggest fallacies a distributed system can be designed around is that "the network is stable". Any good engineer who has worked hands-on with networks in any capacity would agree that the most unstable and unreliable part of any computing infrastructure is the network it runs on.
Software should therefore be designed with the understanding that any control or data traffic which has to traverse the network could be lost, and that such loss could affect the internal workings of components in the system. That understanding should make engineers realize how expensive a single network I/O is, and lead them to design components that are not fatally affected when network interruptions do happen. The software should be resilient enough to survive network glitches, as well as some loss of control or data traffic, without being completely annihilated.
Let me now start mixing the ingredients defined above and see what we end up with:
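To make this concrete, here is a minimal sketch of a caller that treats every network operation as something that may fail. The names `NetworkError`, `call_with_retries` and `make_flaky` are my own placeholders for illustration, and the fake remote call stands in for a real one. It is one possible way to bound the cost of a lost message, not the only one.

```python
import time


class NetworkError(Exception):
    """Hypothetical stand-in for any timeout or connection failure."""


def call_with_retries(operation, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Run `operation`, retrying on NetworkError with exponential backoff.

    Returns (ok, value). When every attempt fails, ok is False and value is
    the last error, so the caller can degrade instead of crashing.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return True, operation()
        except NetworkError as err:
            last_error = err
            # Wait before the next attempt, but not after the final one.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return False, last_error


def make_flaky(failures_before_success):
    """Build a fake remote call that fails a fixed number of times first."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise NetworkError(f"dropped request #{state['calls']}")
        return "payload"

    return operation
```

The important design choice is that a failure comes back as an ordinary return value, so the calling component has to decide what to do about it rather than being taken down by an unhandled exception.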
Intelligence
A sure way to create resilience in software is to design a layer of intelligence into every component. What I mean by intelligence here is that every component has to be aware of its own state as well as the states of the other components it interacts with. If a defined sequence in a component causes a state change and then fails, the intelligence in the software should first of all be able to:
- realize which state the sequence affected and what caused the failure
- determine how to recover from this state after a failure
- clean up after itself, so it can execute the sequence again if the conditions which caused the failure have been corrected or have corrected themselves over time.
If the conditions which caused the failure still exist, then the software should be intelligent enough to tell:
- which element of the sequence of operations failed
- what the recovery options are
- what effect certain operations have on the availability of the system and/or of the components which interact with each other.
The states of each of the components should be advertised constantly, so everyone is aware of each other and everyone operates with known "rules of engagement" with each other.
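The points above can be sketched as a small component that knows its own state, remembers which step of a sequence failed, cleans up after itself, and advertises every state change. Everything here (`State`, `Component`, `run_sequence`, the subscriber callbacks) is a hypothetical illustration under my own assumptions, not a reference design.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "idle"
    RUNNING = "running"
    FAILED = "failed"


@dataclass
class Component:
    name: str
    state: State = State.IDLE
    failed_step: Optional[str] = None
    subscribers: list = field(default_factory=list)

    def _advertise(self, new_state):
        # Record the new state and tell every interested peer about it.
        self.state = new_state
        for notify in self.subscribers:
            notify(self.name, new_state)

    def run_sequence(self, steps):
        """Run steps given as (step_name, do, undo) tuples.

        On failure, remember which step failed, undo the completed steps
        in reverse order, and advertise the FAILED state.
        """
        self.failed_step = None
        self._advertise(State.RUNNING)
        completed_undos = []
        for step_name, do, undo in steps:
            try:
                do()
                completed_undos.append(undo)
            except Exception:
                self.failed_step = step_name
                for undo_completed in reversed(completed_undos):
                    undo_completed()
                self._advertise(State.FAILED)
                return False
        self._advertise(State.IDLE)
        return True
```

Because the cleanup leaves nothing half done, the same sequence can simply be run again once the condition that caused the failure has gone away.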
Reliability
The first and foremost way to achieve reliability is by being a good citizen who is aware of all other citizens around. Reliability also has to be ensured by making sure that each line of code that is written is unit tested. Each component has to have its own way to close the loop and never get into a state from which it doesn't know how to recover itself.
Communication
Communication is the key to everything. Each component in software has to be able to communicate with other components and understand the states of each component (as described above in the intelligence section).
Software has to be able to communicate effectively through logs and error messages which are legible. Admins should be able to pinpoint the components, and the elements within those components, which have failed. This communication is key for debugging the system, and also key to understanding where in the hierarchy of a sequence the failure occurred, so engineers can then easily drill down to the piece of code which could have caused it.
Communication among the engineers writing code for the various components is just as important, so that each piece of code is written to follow predefined "rules of engagement" and the pieces don't step on each other.
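As a rough illustration of a legible failure message, the sketch below puts the component, the sequence and the failing step into every error line. The field names and the helpers `failure_message` and `report_failure` are assumptions of mine. Only the standard `logging` module is used.

```python
import logging

logger = logging.getLogger("resilience.example")


def failure_message(component, sequence, step, error):
    """Build one line that says where in the hierarchy the failure happened."""
    return (
        f"component={component} sequence={sequence} step={step} "
        f"error={type(error).__name__}: {error}"
    )


def report_failure(component, sequence, step, error):
    # Emit the message at ERROR level so it stands out in the logs.
    logger.error(failure_message(component, sequence, step, error))
```

With a fixed set of fields like this, an admin can search for one component or one step, and an engineer can go from the log line to the code that runs that step.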
Discipline
This is the most basic and holistic principle, and it should guide how every line of code is written. I've talked about "rules of engagement"; discipline is both the implementation of those rules and the inherent principle of following them once they are defined.
Just as there are constructs, variables and certain rules to follow when a piece of software is written, discipline is an invisible but important trait to follow when writing code.
Ownership
Every component should take ownership of executing the sequences it's responsible for, and should also make sure that each transition from state to state happens in a well-defined, stateful manner. In case of failure, the component itself should take ownership of cleaning up its state and returning to a known good state. The known good state is important because it is a state that has been advertised before, which ensures that all other components talking to this component know which state it is in and which specific rule should be used to engage with it.
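One simple way to picture this ownership is a component that keeps a private copy of its last known good state and restores it whenever a change fails partway through. The class `OwnedState` and its `apply` method are hypothetical names for this sketch, and copying the whole state is only reasonable when that state is small.

```python
import copy


class OwnedState:
    """Holds a component's state together with its last known good copy."""

    def __init__(self, initial):
        self._known_good = copy.deepcopy(initial)
        self.current = copy.deepcopy(initial)

    def apply(self, change):
        """Apply `change` to the current state.

        If the change raises, discard the partial result and restore the
        last known good state. Otherwise the new state becomes known good.
        """
        try:
            change(self.current)
        except Exception:
            self.current = copy.deepcopy(self._known_good)
            return False
        self._known_good = copy.deepcopy(self.current)
        return True
```

Whatever happens inside `change`, other components only ever see a state that was known good at some point, which is the state they already have rules of engagement for.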
Understanding
Over and above everything, engineers, and hence the components they write, should have an understanding of the other pieces of code they have to interoperate with. This can only be achieved if there is a coding discipline which forces engineers to talk to each other and to understand all the components in the system and every interdependency between them. Many times this basic principle is not followed, and engineers design in a vacuum, without interacting and without understanding how the other components in the system depend on each other. As a result, a failure in one component can cause disruptive failures in other, seemingly unrelated components, because they used elements of each other's components without fully understanding under what circumstances those elements, or a state change in them, would affect others.
I believe these aspects are necessary and should be considered overall when designing a distributed system.
In my next blog post, I'll talk about how best to code for consistency and predictability to achieve the final goal of creating resilient software.