
Resilience - Part 2

In my post Resilience - Part 1, I wrote about what resilience means to me holistically, and what I consider to be the ingredients that make it up.

In this post we will consider one way, out of many, to achieve resilience, consistency and predictability when programming distributed systems.

Coding in an FSM (finite state machine) style is one way to maintain clear, defined boundaries for where each component begins and ends, and for the rules of engagement between components.

For example, the components could be:

- network
- storage
- storage controller 
- memory
- cpu

Take the network component as an example. It has a multitude of changes, configuration parameters and states that need to be maintained.

With an FSM we can break all these changes and configuration parameters down into two lists:

- disruptive
- non-disruptive
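One way to record this split is as plain data that every other part of the system can query. The sketch below is only an illustration: the change names and the impact assigned to each are hypothetical, not taken from any real system.

```python
from enum import Enum


class Impact(Enum):
    DISRUPTIVE = "disruptive"
    NON_DISRUPTIVE = "non-disruptive"


# Hypothetical network changes. The impact assigned to each one is an
# assumption made for illustration only.
NETWORK_CHANGES = {
    "hostname": Impact.DISRUPTIVE,
    "ip_address": Impact.DISRUPTIVE,
    "dns_search_domain": Impact.NON_DISRUPTIVE,
    "interface_description": Impact.NON_DISRUPTIVE,
}


def changes_by_impact(changes, impact):
    """Return the sorted names of all changes with the given impact."""
    return sorted(name for name, kind in changes.items() if kind is impact)
```

Calling `changes_by_impact(NETWORK_CHANGES, Impact.DISRUPTIVE)` returns the list of changes that need coordination with other components before they run.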

If 15 config changes can be listed for the network component, 7 of these could be disruptive and 8 non-disruptive. If an FSM were implemented covering all these changes, with the sequence of operations for each one listed clearly, it would be easier to build "resilience" into the overall system, because the components could then be compared and contrasted and their interdependence established.
If a non-disruptive sequence needs to be executed, the component should be able to execute it without any need to communicate with the rest of the components in the system. If the sequence is a disruptive change, on the other hand, all the components identified as dependent on that component need to be sent an intention sequence describing the proposed change. Then the change should be executed, and the affected components should be probed again to confirm that they have received the new information.
All components can execute in parallel, but each sequence would be executed sequentially inside its component, so the point at which a sequence fails can be traced and logged.
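The flow described above can be sketched as a small state machine per component. This is a minimal illustration under my own assumptions, not a definitive design: the state names, the `Component` class, and the `notify` and `apply` methods are all hypothetical, and acknowledgment is modelled as a simple boolean return value.

```python
import logging
from enum import Enum, auto

log = logging.getLogger("fsm")


class State(Enum):
    IDLE = auto()
    INTENT_SENT = auto()
    EXECUTING = auto()
    VERIFYING = auto()
    DONE = auto()
    FAILED = auto()


# Allowed transitions; any other move is rejected.
TRANSITIONS = {
    State.IDLE: {State.INTENT_SENT, State.EXECUTING},
    State.INTENT_SENT: {State.EXECUTING, State.FAILED},
    State.EXECUTING: {State.VERIFYING, State.DONE, State.FAILED},
    State.VERIFYING: {State.DONE, State.FAILED},
    State.DONE: {State.IDLE},
    State.FAILED: {State.IDLE},
}


class Component:
    def __init__(self, name, dependents=()):
        self.name = name
        self.dependents = list(dependents)
        self.state = State.IDLE
        self.failed_step = None
        self.seen = []  # messages received from other components

    def notify(self, message):
        """Record a message and acknowledge it (always True in this sketch)."""
        self.seen.append(message)
        return True

    def _move(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise RuntimeError(
                f"{self.name}: illegal {self.state.name} -> {new_state.name}")
        log.info("%s: %s -> %s", self.name, self.state.name, new_state.name)
        self.state = new_state

    def _broadcast(self, message):
        """Send a message to every dependent; True only if all acknowledge."""
        return all([d.notify(message) for d in self.dependents])

    def apply(self, change, steps, disruptive):
        """Run steps, a list of (label, callable), strictly in order."""
        if self.state is not State.IDLE:
            self._move(State.IDLE)
        self.failed_step = None
        if disruptive:
            self._move(State.INTENT_SENT)
            if not self._broadcast(("intent", change)):
                self._move(State.FAILED)
                return False
        self._move(State.EXECUTING)
        for label, step in steps:
            try:
                step()
            except Exception:
                self.failed_step = label  # where the sequence stopped
                log.exception("%s: step %r failed", self.name, label)
                self._move(State.FAILED)
                return False
        if disruptive:
            self._move(State.VERIFYING)
            if not self._broadcast(("applied", change)):
                self._move(State.FAILED)
                return False
        self._move(State.DONE)
        return True
```

Because each `Component` keeps its own state and runs its own steps in order, separate components could be driven from separate threads or processes, while `failed_step` and the log show exactly where a sequence stopped.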

In my mind, a system implemented with the qualities listed above would behave resiliently under most user-driven operations. It would also create a framework of code that is easier to extend with newer and improved conditions and elements inside each component.

As an example, a hostname change on the system due to a bodged upgrade should not under any circumstances bring down the Cassandra backend responsible for the metadata and data in our system.

If all the principles discussed above are implemented, a hostname change would proceed as follows:

Hostname would be classified as an element under the network component, and a certain set of rules would be implemented in the system for this operation.

First, the sequence of operations would be defined, indicating the commands to run and the files to edit to make this change possible. For example:
- run the hostname command (required)
- edit the file /etc/hosts
- edit the file /etc/sysconfig/network
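The three steps above can be written down as an ordered plan plus pure functions that compute the new file contents, so the sequence can be reviewed and tested before anything on the system is touched. This is a sketch under assumptions: the function names are hypothetical, `/etc/sysconfig/network` is assumed to carry a `HOSTNAME=` line as on older Red Hat style systems, and nothing here runs a command or writes a file.

```python
import re


def hostname_plan(new_name):
    """Ordered steps for the change; nothing is executed here."""
    return [
        ("run", ["hostname", new_name]),
        ("edit", "/etc/hosts"),
        ("edit", "/etc/sysconfig/network"),
    ]


def rewrite_sysconfig_network(text, new_name):
    """Replace the HOSTNAME= line, or append one if it is missing."""
    line = f"HOSTNAME={new_name}"
    pattern = re.compile(r"^HOSTNAME=.*$", flags=re.M)
    if pattern.search(text):
        return pattern.sub(lambda m: line, text)
    sep = "" if text == "" or text.endswith("\n") else "\n"
    return text + sep + line + "\n"


def rewrite_hosts(text, old_name, new_name):
    """Swap old_name for new_name where it appears as a whole host token."""
    # The lookarounds keep "old" from matching inside "old.example.com".
    pattern = re.compile(rf"(?<![\w.-]){re.escape(old_name)}(?![\w.-])")
    rows = []
    for row in text.splitlines():
        if row.lstrip().startswith("#"):
            rows.append(row)  # leave comments untouched
        else:
            rows.append(pattern.sub(lambda m: new_name, row))
    return "\n".join(rows) + ("\n" if text.endswith("\n") else "")
```

Keeping the file rewrites as pure functions on strings means the disruptive part (actually writing the files and running the command) can be kept to a small, clearly marked section of the sequence.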

Now that we have identified what is required for the hostname to be changed, we have to exhaustively find all the other components in the system which rely on “hostname”. Once these components have been identified, we would also need to understand whether the hostname change would be disruptive to each of them, or to the whole cluster/system.
We would need to make sure the intention of changing the hostname is propagated to all these components, and that proper acknowledgment has been received from each of them before moving forward with the change.
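A rough sketch of this gate follows: look up who depends on a setting, send each one the intent, and only apply the change once every dependent has acknowledged. The dependency map, the message fields, and the `send` and `apply_change` callables are all hypothetical placeholders for whatever transport and change mechanism a real system uses.

```python
# Hypothetical dependency map: for each setting, the components that rely on
# it and whether a change to it is disruptive for them. The entries are
# assumptions for illustration only.
DEPENDS_ON = {
    "hostname": {"cassandra": True, "monitoring_agent": False},
}


def propagate_intent(setting, new_value, send):
    """Send the intent to every dependent; return those that did not ack."""
    missing = []
    for component, disruptive in sorted(DEPENDS_ON.get(setting, {}).items()):
        message = {
            "intent": setting,
            "value": new_value,
            "disruptive": disruptive,
        }
        if not send(component, message):
            missing.append(component)
    return missing


def change_if_acknowledged(setting, new_value, send, apply_change):
    """Apply the change only when every dependent has acknowledged it."""
    missing = propagate_intent(setting, new_value, send)
    if missing:
        return False, missing  # change is not attempted
    apply_change(new_value)
    return True, []
```

If any dependent fails to acknowledge, the change is never attempted and the caller gets back the names of the components that still need attention.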

Now we can move forward with the 3 steps mentioned above to change the hostname of the system and make sure the change is persisted across reboots.

This kind of checking for error conditions before making disruptive changes to the system has to be adopted as a religion by each of our engineers, so that all interdependent components operate with predefined and well-understood “rules of engagement” and do not cause disruptions when transitions through states happen.
