Mapping Technology Trends to Enterprise Product Innovation

Scope: Focuses on enterprise platform software: Big Data, Cloud platforms, software-defined, micro-services, DevOps.
Why: We are living in an era of continuous change and low barriers to entry. Net result: a lot of noise!
What: Sharing my expertise gained over nearly two decades in the skill of extracting the signal from the noise! More precisely, identifying shifts in ground realities before they become cited trends and pain-points.
How: NOT based on reading tea leaves! Instead, synthesizing technical and business understanding of the domain at 500 ft., 5,000 ft., and 50K ft.

(Disclaimer: Personal views not representing my employer)

Wednesday, November 12, 2014

Part IV: Nuts-and-bolts of a Scale-out Distributed Storage System

This post covers Availability/Fault Tolerance

There are multiple metrics to quantify the availability of a system. The most common metric is the number of 9s of availability; 5 9s of availability translates to about 5.26 minutes of downtime in a year. A related concept to availability is data durability or reliability. It represents the probability that a data object is lost due to hardware failures (quantified via metrics such as MTBF and MTTDL). Traditional enterprise deployments used Recovery Point Objective (RPO) and Recovery Time Objective (RTO), mostly to represent a Disaster Recovery (DR) scenario. RPO represents the recent updates that can potentially be lost, while RTO represents the time to get the system back online.
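
As a quick sanity check on that arithmetic, here is a minimal Python sketch (the helper name downtime_per_year and the constants are mine, purely illustrative) that converts a number of nines into the allowed downtime per year:

MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def downtime_per_year(nines: int) -> float:
    """Maximum allowed downtime (minutes/year) for the given number of 9s."""
    availability = 1 - 10 ** (-nines)          # e.g., 5 nines -> 0.99999
    return (1 - availability) * MINUTES_PER_YEAR

for n in (3, 4, 5):
    print(f"{n} nines -> {downtime_per_year(n):.2f} minutes of downtime/year")
# 5 nines -> ~5.26 minutes/year, matching the figure above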

In order to design for high availability, the system needs fault tolerance built in at each level of the stack. The four common building principles are:

  • Data Consistency during failures: Most common failures involve sub-components failing permanently or becoming transiently unavailable. It is critical to ensure system consistency by making committed operations durable in the correct order and rolling back partial updates. In distributed systems, if the minimum number of replicas is not updated (i.e., no quorum), the update should either be rolled back or the system should remember it for later repair. Partial failures increase the probability of data loss, since successive failures could knock out the non-quorum updated copies (analogous to a double disk failure in a RAID 1 system); a small quorum-sizing sketch follows this list.
  • Failure detection for failover: This ensures that permanent and transient failures trigger appropriate recovery actions.
  • Failover orchestration: Steps involved in redirecting the incoming requests to the appropriate target. This could essentially be client-side logic that retries the request to the replica location.
  • Repair and Failback: The steps involved in getting the failed component back online. This is typically a distributed recovery process. It is typically a rebuild-versus-repair decision. Once the node fully recovers, it rejoins the cluster and serves either the previous data or new data.
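
As a rough illustration of the quorum arithmetic referenced above, here is a minimal sketch, assuming a majority-quorum policy over N replicas (the class and method names are illustrative, not taken from any particular system):

class QuorumWrite:
    """Tracks acknowledgements for one update across N replicas."""
    def __init__(self, num_replicas: int):
        self.num_replicas = num_replicas
        self.quorum = num_replicas // 2 + 1   # majority, e.g., 2 of 3
        self.acked = set()

    def ack(self, replica_id: str) -> None:
        self.acked.add(replica_id)

    def committed(self) -> bool:
        # Commit only once a majority of replicas have durably applied the
        # update; otherwise roll back or remember the update for later repair.
        return len(self.acked) >= self.quorum

w = QuorumWrite(num_replicas=3)
w.ack("replica-a")
print(w.committed())   # False: 1 of 3 acked -> fragile, roll back or repair
w.ack("replica-b")
print(w.committed())   # True: 2 of 3 acked -> majority quorum reached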

The rest of the post fleshes out additional details for each of these dimensions:

  
1. Ensuring Data Consistency during failures
This is essentially related to the write-commit protocol, i.e., how updates are persisted across replicas. An ideal solution is non-buffered writes that are strongly consistent across all replicas. Such an ideal solution is not practical in most scenarios, given the extreme trade-offs in latency, availability, and scalability. Most practical systems implement data buffering (similar to group-commit semantics) and update replicas asynchronously using eventually consistent models.
  • Crash consistency: This is a traditional topic, required even in non-distributed systems. Typical strategies are journaling, shadow pages, soft updates, and LFS (journaling + soft updates); a minimal journaling sketch follows this list.
  • Handling non-quorum updates: This has multiple sub-topics:
    • Conflicting replicas: In a model where any replica can be updated, a non-quorum update can lead to two replicas having completely different versions of the data.
    • Fragile updates: If an update reaches only a non-quorum set of replicas (worst case: a single replica), loss of that replica can lead to data loss.
    • Rollback of replicas: In the absence of 2PC or an equivalent protocol, enforcing quorum requires rollback of the replicas to be explicitly implemented.
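
To make the crash-consistency bullet concrete, here is a minimal write-ahead-journal sketch, assuming a single journal file and a JSON record format of my own (purely illustrative, not any real system's on-disk format): the intent record is flushed to the journal before the data is touched, so a crash mid-update can be recovered by replaying the journal and ignoring a torn tail record.

import json, os

JOURNAL = "journal.log"

def apply_to_store(key: str, value: str) -> None:
    # Stand-in for the real data path.
    print(f"applied {key}={value}")

def write_update(key: str, value: str) -> None:
    # 1. Append the intent record to the journal and force it to stable
    #    storage *before* touching the data, so a crash leaves a replayable record.
    with open(JOURNAL, "a") as j:
        j.write(json.dumps({"op": "put", "key": key, "value": value}) + "\n")
        j.flush()
        os.fsync(j.fileno())
    # 2. Only now apply the update to the data store.
    apply_to_store(key, value)

def recover() -> None:
    # On restart, replay every complete journal record in order; a partially
    # written (torn) trailing record from the crash is simply ignored.
    if not os.path.exists(JOURNAL):
        return
    with open(JOURNAL) as j:
        for line in j:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                break  # torn tail record; stop replay here
            apply_to_store(rec["key"], rec["value"])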

2. Failure detection:
A critical aspect of fault tolerance is the ability to detect failures. There are two broad categories of failure detection: centralized versus peer-to-peer gossip-style detection. There are several challenging aspects of failure detection (a minimal heartbeat-based sketch follows the list):
  • Types of faults: While most systems are designed for fail-stop faults, there are other classes of faults that are much more complex to deal with, namely Byzantine faults.
  • Distinguishing between permanent and transient failures: The detection threshold needs to be adaptive, since the fault could be transient.
  • Differentiating between a failed node and a network-partitioned node
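
Here is a minimal heartbeat-based detector sketch, assuming centralized detection with a crude adaptive timeout (all names and thresholds are illustrative assumptions, not drawn from a specific implementation):

import time

class HeartbeatDetector:
    def __init__(self, base_timeout: float = 5.0, suspect_factor: float = 3.0):
        self.last_seen = {}   # node_id -> timestamp of last heartbeat
        self.intervals = {}   # node_id -> smoothed heartbeat interval
        self.base_timeout = base_timeout
        self.suspect_factor = suspect_factor

    def heartbeat(self, node_id: str) -> None:
        now = time.time()
        if node_id in self.last_seen:
            sample = now - self.last_seen[node_id]
            prev = self.intervals.get(node_id, sample)
            # Exponentially weighted moving average of the observed interval,
            # so the timeout adapts per node instead of being a fixed constant.
            self.intervals[node_id] = 0.8 * prev + 0.2 * sample
        self.last_seen[node_id] = now

    def status(self, node_id: str) -> str:
        if node_id not in self.last_seen:
            return "unknown"
        timeout = self.suspect_factor * self.intervals.get(node_id, self.base_timeout)
        if time.time() - self.last_seen[node_id] > timeout:
            # Missed heartbeats only make the node *suspect*: it may be dead,
            # merely slow, or on the wrong side of a network partition.
            return "suspect"
        return "alive"

A gossip-style detector would spread the same bookkeeping across peers instead of a single coordinator.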

3. Failover orchestration:
This is the process of redirecting requests to redundant services/data copies while the original node is being repaired. Failover orchestration depends on the combination of the following three aspects:
  • Type of component failure: Disk, Node, Network, or Site failure. Note that these component failures are not black-and-white: a node can appear to have failed due to a network partition or congestion.
  • Type of service affected: A scale-out storage system comprises different services such as the Metadata Master, data nodes, the lock service, the operation scheduler, etc.
  • Topology of the failure:
    • Scenario where the coordinator service is partitioned from the rest of the nodes
    • Scenario where the nodes are partitioned into a quorum and a non-quorum cluster

The appropriate failover workflow is typically implemented as a combination of the above three aspects.
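
One way to structure that combination, sketched below, is a dispatch table keyed on the failed component and the affected service; every name and workflow string here is an illustrative assumption rather than a prescribed design:

FAILOVER_WORKFLOWS = {
    ("disk", "data_node"):       "re-replicate the affected extents from surviving copies",
    ("node", "data_node"):       "redirect clients to replica locations; schedule re-replication",
    ("node", "metadata_master"): "promote a standby master before resuming writes",
    ("network", "data_node"):    "mark as suspect; fence the node before reassigning its work",
    ("site", "any"):             "invoke the DR runbook (fail over to the remote site per RPO/RTO)",
}

def pick_workflow(component: str, service: str) -> str:
    return (FAILOVER_WORKFLOWS.get((component, service))
            or FAILOVER_WORKFLOWS.get((component, "any"))
            or "escalate to an operator")

print(pick_workflow("node", "metadata_master"))
print(pick_workflow("site", "lock_service"))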

4. Repair and Failback: The steps involved in getting the failed component back online. This is typically a distributed recovery process where multiple existing nodes participate to apply the update journal. Once the node fully recovers, it rejoins the cluster and serves either the previous data or is commissioned for new data. The following scenarios need to be addressed:

  • Handling a stale replica when it returns online. The key challenge is to decide between repairing it with the missing updates versus a complete discard and rebuild (see the sketch after this list).
  • Bootstrapping a new service instance: This is typically the case for hardware failures, where the system is repaired and re-commissioned. A related scenario is where the existing disks are plugged into the node.
  • Cluster re-merge: Rejoining the cluster after the partition heals. This workflow will generate several repair tasks, namely bringing the replicas on the non-quorum side up to date.
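
The repair-versus-rebuild decision for a stale replica can be sketched as below, assuming the cluster tracks a commit index and retains a bounded update journal (the threshold logic and field names are illustrative assumptions):

def failback_action(replica_last_applied: int,
                    cluster_commit_index: int,
                    journal_oldest_retained: int) -> str:
    """Decide how to bring a returning replica back up to date."""
    lag = cluster_commit_index - replica_last_applied
    if lag == 0:
        return "rejoin immediately: replica is already current"
    if replica_last_applied >= journal_oldest_retained:
        # The missing updates are still in the retained journal, so an
        # incremental catch-up (apply the delta) is cheaper than a rebuild.
        return f"incremental repair: apply {lag} journaled updates"
    # The journal has been truncated past this replica's position, so the
    # only safe option is to discard it and rebuild from a healthy copy.
    return "discard and rebuild from a healthy replica"

print(failback_action(replica_last_applied=950,
                      cluster_commit_index=1000,
                      journal_oldest_retained=900))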

1 comment:

  1. Great blog. I prefer application crash consistency vs data crash consistency (aka black magic done by the underlying FS/storage). The former will always recover/restart from a sane point, and in the latter case I've seen quite a few disastrous cases. Sure, there are always some use cases where you will benefit from not making every app support freeze APIs; stateless apps don't need those. I wish developers understood the importance of checkpointing. I think every developer should get a chance to work on a crash-recovery customer case. Once you experience the pain, you will program your subconscious to think differently.
