Mapping Technology Trends to Enterprise Product Innovation

Scope: Focuses on enterprise platform software: Big Data, Cloud platforms, software-defined, micro-services, DevOps.
Why: We are living in an era of continuous change and a low barrier to entry. Net result: a lot of noise!
What: Sharing my expertise gained over nearly two decades in the skill of extracting the signal from the noise! More precisely, identifying shifts in ground realities before they become cited trends and pain-points.
How: NOT based on reading tea leaves! Instead, by synthesizing technical and business understanding of the domain at 500 ft., 5,000 ft., and 50K ft.

(Disclaimer: Personal views not representing my employer)

Friday, October 16, 2015

A new tip of the iceberg in Enterprise Storage!

Several decades ago, enterprise storage started as a naive layer in the software stack, responsible mainly for persistence and retrieval of data. The storage layer never understood what the data meant -- the knowledge of data semantics (e.g., this directory is an Exchange mailbox) was higher up in the stack, closer to the application. We are now seeing the emergence of storage that is scratching the surface of becoming “data-aware” -- DataGravity, Cohesity, Rubrik, to name a few front-runners. Imagine the storage administrator now being able to define policies or optimize traditional workflows for backup, DR, tiering, etc., based on the “contents” of the data! The storage industry categorizes this segment as “Data Governance,” which is applicable to both primary and secondary/backup data. Data Governance addresses the business pain-point of managing data assets under security, availability, integrity, compliance, and accessibility policies. But why accomplish Data Governance at the lowest common denominator (instead of with specialized application-specific tools)? Also, why am I referring to this trend as the tip of the iceberg? Read on and your questions will be answered!


Let’s level-set with a few examples of how data-awareness addresses pain-points in existing storage workflows:
  • Let’s assume you are in an industry where regulatory compliance requires files containing a social security number to be encrypted -- today, this requires a periodic read of the changed data at the application level with some ad-hoc rule processing, which can be a nightmare to manage and debug when errors occur (to say the least). A minimal sketch of such a content-based rule follows this list.
  • Assume there is data corruption in your primary data, and you need to find the backup in which a specific data table was last written. If the storage system indexes the data contents of a backup, this could be a breeze versus mounting every snapshot or nosing through the database transaction logs.
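
As a thought experiment, here is a minimal sketch of what the first example might look like as a content-based rule inside a data-aware storage layer. Everything here (the ChangedFile structure, the SSN regex, the needs_encryption flag) is hypothetical and for illustration only -- it is not the API of any of the products mentioned above.

```python
import re
from dataclasses import dataclass

# Hypothetical representation of a changed file surfaced by the storage layer.
@dataclass
class ChangedFile:
    path: str
    contents: str
    needs_encryption: bool = False

# Illustrative pattern for US social security numbers (NNN-NN-NNNN).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def apply_governance_rule(changed_files):
    """Flag changed files containing an SSN so the storage layer can
    encrypt (or alert on) them, instead of ad-hoc application-level scans."""
    for f in changed_files:
        if SSN_PATTERN.search(f.contents):
            f.needs_encryption = True
    return [f for f in changed_files if f.needs_encryption]

if __name__ == "__main__":
    batch = [ChangedFile("/hr/offer.txt", "SSN: 123-45-6789"),
             ChangedFile("/eng/readme.md", "no sensitive data here")]
    print([f.path for f in apply_governance_rule(batch)])
```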


So, why implement Data Governance at the lowest common denominator? Especially when richer application-specific tools or data management platforms have existed for ages! There is one catch with these rich tools -- they are silo’ed. For the last decade, the mantra of enterprise storage was eliminating silos, but rapid innovation in hardware technologies and CAP variants is forcing enterprises to adopt multiple technologies optimized for specific usage and data models (i.e., the end of one-size-fits-all). The specialized application-specific tools operate within specific data-model silos. It's not a question of either/or; they are actually targeting different technical buyers (I hope) within the enterprise -- a storage administrator enforcing governance across all data silos versus a data scientist requiring detailed control and data analytics within a specific silo.


So why now? Could this not have been done two years back? The mainstream adoption of Software-defined Storage (SDS) is the catalyst making data analysis within storage increasingly plausible. SDS represents a new breed of scale-out storage systems that are being built from the ground up, often internally leveraging a micro-services architecture. Data management services are much easier to incorporate as a micro-service, either in the IO path (ingestion/retrieval) or as background post-processing. Further, since SDS runs on commodity building blocks that are essentially a combination of compute, storage, memory, and network, the balance can be adjusted easily to accommodate compute-intensive activities (no sweat!).
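
To make the micro-service point concrete, here is a minimal sketch of a hypothetical SDS that lets a data-management service register either inline on the ingestion path or as a background post-processing task. The registry and hook names are invented for illustration and do not correspond to any specific SDS product.

```python
from typing import Callable, Dict, List

class DataServiceRegistry:
    """Hypothetical registry: an SDS could route writes through inline
    hooks (ingestion path) and run background hooks as post-processing."""
    def __init__(self):
        self.inline_hooks: List[Callable[[bytes, Dict], bytes]] = []
        self.background_hooks: List[Callable[[bytes, Dict], None]] = []

    def register_inline(self, hook):      # e.g., compression, inline indexing
        self.inline_hooks.append(hook)

    def register_background(self, hook):  # e.g., content classification
        self.background_hooks.append(hook)

    def ingest(self, data: bytes, metadata: Dict) -> bytes:
        for hook in self.inline_hooks:
            data = hook(data, metadata)   # runs in the IO path
        for hook in self.background_hooks:
            # In a real system this would be queued to worker nodes.
            hook(data, metadata)
        return data
```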


So finally, why is this the tip of the iceberg? There is an inherent blurring of database and storage technologies that will increasingly make richer data processing available as part of the storage layer. It's the convergence of a perfect storm: the SDS micro-services architecture, the commoditization of data analytics libraries, and the business pain-point of getting better control of -- and deriving differentiated value from -- the data. It wouldn't be too far-fetched to envision a marketplace of data analysis micro-services being built around 2-3 leading SDS platforms (analogous to the marketplace of applications for iOS, Android, etc.).

Are we there yet? In my mind, we are far from the tipping point and require some solid use-cases in multiple industry verticals demonstrating the value of building data-centric functionality at the lowest common denominator within the stack.   

Friday, September 11, 2015

Engineering an A+ Team for Software-defined Storage & Big Data Platform Innovations -- Stop chasing unicorns!

There is a growing appetite among enterprises to gain agility and cloud-like self-service capabilities for on-premise IT operations. The appetite is especially being fueled by Big Data analytics, which is re-defining the role of IT into a critical differentiator for business decision-making -- and in some cases IT now contributes directly to the bottom line! This new IT stack is broadly described as “Software-defined”, implying the ability to programmatically provision, manage, and reconfigure the behavior of the infrastructure. Software-defined is not just a technology disruption, but holistically a combination of people, process, and technology disruptions. So, how does one innovate in such a dynamic, global environment to translate a Technology into a profitable Product in a hyper-growth Market segment? Yes, the most critical element is an A+ team, which is also one of the most difficult things to get right!

So, what does it mean to have an A+ team? I have often heard statements like “let’s hire rock-star distributed systems and scale-out engineers.” Hmmn... there are many different dimensions to software-defined architectures -- such statements are hardly actionable, and can send recruiters searching for unicorns (the odds of winning a lottery might seem higher than the odds of recruiting such developers). I have been a great believer in “engineering” the right team with the appropriate skill-set balance -- one of the best articles ever on this topic is from Vinod Khosla, where he coined the term “gene pool engineering” (a must-read if you haven't already!). In my nearly two decades of experience translating technology disruptions into next-generation enterprise products, I have applied and learnt tremendously from the concepts of gene pool engineering. The objective of this post is to map the gene pool engineering concepts into actionable guidance for teams pursuing software-defined innovations -- to stay grounded, I will focus on Software-defined Storage and Big Data solutions. As the market continues to get more competitive, the effects of the skill-set shortage are already visible -- hopefully this article gets you out of being stuck looking for unicorns, and gets you putting the right skill-sets in place.

Let’s level-set on the principles of Software-defined Storage (SDS) for the purpose of this discussion. In contrast to traditional hardware-defined storage, which was designed mainly as a one-size-fits-all solution for POSIX-based enterprise applications, software-defined storage is designed to support a wide spectrum of application semantics (POSIX as well as emerging non-POSIX applications), providing a rich set of APIs for QoS trade-offs and cluster scaling that can all be controlled programmatically from the comfort of a policy-based interface. The foundation of SDS is scale-out (instead of scale-up) architectures, which need to address distributed-systems concepts related to asynchrony, distributed state, and failover scenarios that are the norm rather than the exception. Broadly, the SDS modules can be categorized into the IO Plane and the Management Plane. In this post, I will focus mainly on building the IO Plane, and cover the Management Plane in a future post.


For the gene pool discussion, an SDS architecture can be broadly represented (at 50,000 ft) as a layered architecture:


  • Cluster Management Logic: This layer provides the core services that define the basic personality of the overall solution. The services implemented within this layer, such as state management, transport, consensus management, etc., are analogous to the foundation of a building.
  • Namespace Management Logic: This layer represents how the physical resources get exposed to applications as a logical namespace. For instance, the SDS could expose a logical volume, a filesystem, a key-value store, etc. This layer defines the properties of the namespace, such as whether reads and writes are mutually exclusive/isolated, etc.
  • IO Coordination Logic: This is analogous to the runtime manager where operations across different modules need to be coordinated to service IO operations or housekeeping tasks such as replica repair, garbage collection, etc.
For inquisitive minds, also check out the 5000 ft view of the architecture, where I have spelled out the key modules within each layer. A minimal interface sketch of these three layers follows.
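
For illustration only, the three layers could be captured as rough interfaces like the following -- a sketch of my own under simplifying assumptions, not code from any particular SDS:

```python
from abc import ABC, abstractmethod

class ClusterManagementLayer(ABC):
    """Foundation services: membership, state management, transport, consensus."""
    @abstractmethod
    def members(self) -> list: ...
    @abstractmethod
    def propose(self, key: str, value: bytes) -> bool:
        """Reach consensus on a piece of cluster state."""

class NamespaceLayer(ABC):
    """Exposes physical resources as a logical namespace (volume, filesystem, KV)."""
    @abstractmethod
    def lookup(self, name: str):
        """Map a logical name to its shard/replica locations."""

class IOCoordinationLayer(ABC):
    """Coordinates modules to service IO and housekeeping (replica repair, GC)."""
    @abstractmethod
    def read(self, name: str, offset: int, length: int) -> bytes: ...
    @abstractmethod
    def write(self, name: str, offset: int, data: bytes) -> None: ...
```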


The essence of gene pool engineering is to transform team-building and hiring from an art into a science. The key idea is to clearly articulate risks, and map them to skill-set hiring by targeting the right centers of excellence. The layered representation brings out the point that a different mix of experience and domain expertise is required at each layer. In this post, we cover three key areas:
  • Prioritization of implementation risks: Pinpointing the most complex piece of your solution and focusing on it first. For instance, if you have a requirement to support linearizable global transactions across objects in multiple shards, and if this is linked to the core USP of the solution, it is critical to de-risk this aspect (instead of trying to optimize the on-disk layout format, which is important but a more bounded problem).
  • Diversity in team: A good balance is required between storage domain experts, distributed systems experts, and ninja generalist programmers -- a team made up entirely of distributed systems or storage rock-stars cannot go far. Similarly, diversity in experience and backgrounds is key.
  • Culture for optimizing cycle times: Agility to iterate is key -- basically building a culture of getting the 2-star version out early and validating it with customers, instead of over-designing a 5-star solution with a longer gestation. In the domain of scale-out storage, this requires a lot of foresight and leadership to get the iteration right!
The rest of the post covers additional details for each of these aspects.


Prioritization of implementation risks:
  • Knowing what is possible in distributed systems: A good understanding of related work is absolutely critical! Knowing what is possible versus where a leap of faith is required helps in targeting (and going all out for) the right skill-set. You will be amazed by the investments that already exist in the form of related work -- this can significantly shorten your design exploration phase and lets you leverage the design experience of academia and others in a similar problem domain.
  • Keep it real: The biggest risk is treating all product requirements with equal importance. It is critical to identify where we would need to defy the laws of physics -- for instance, strong consistency of metadata with ultra-low latency -- and to prioritize such tricky requirements and focus on mitigating their risks quickly. Another approach is to start with a standard vanilla scale-out design, and analyze where specialization of module design is required to meet the product requirements. Being able to scope out a requirement is equally critical, i.e., attempting to keep all stakeholders happy at the expense of complicating the design of the first release is not a winning strategy. The following are the top 5 modules that commonly represent the elephants in the room.
  • Data-driven selection of technology choices: Data-driven understanding is key, especially for the core components that are extremely difficult to rip-and-replace otherwise. It is important to document these aspects (on the wiki) -- these questions often get revisited several times, especially as newer team members may attempt to reinvent the wheel.


Diversity in team
  • Diversity in domain knowledge: Each layer in the architecture represents a different mix of expertise w.r.t. distributed systems, the storage domain, generalist ninja coders, API and manageability, UX/UI, and tools and automation. For instance, the Cluster Management layer represents core services, and it is critical to get the best distributed systems expertise there. Similarly, higher up the stack, the storage domain and enterprise use-cases become critical to understand.
  • Prior experience diversity: Instead of focusing solely on enterprise product developers, have a good balance of folks with prior experience running Web 2.0 services as well as Cloud provider services. The background of owning a service helps tremendously in baking APIs, management metrics, and profiling into the code. It also helps avoid mistakes, drawing on lessons learnt from a broader set of scale-out design experiences.
  • Years of experience diversity: It’s typically a good idea to mix “white hair” with “high energy levels.” Core services in the lower layers are difficult to rip-and-replace, and typically have more dependencies from services in the higher layers -- it's critical to have the most experienced folks driving these modules.


Culture to reduce iteration/cycle times
  • Culture of throw-away code for quick experimentation: For core components, it is better to do a lot of experimentation to eliminate design choices -- it is OK to have throw-away code, and this needs to be appropriately reflected in sprint planning.
  • Phased execution culture: It is difficult to find engineers who can help define the balance between module functionality and time to ship -- I refer to this as the 5-star versus 2-star version of the product. Given the intense competition in the marketplace, it is critical to get a product iteration into alpha/beta, versus delaying it for a fully functional, polished offering. As a part of the phased model, be ruthless in scoping out costly features such as distributed transactions, serializability, distributed recovery, etc. from the MVP (unless absolutely critical for the USP).
  • Don't go overboard with agile: The sprint-based model for execution can sometimes be at odds with distributed systems implementation. The core services get deeply rooted in the design and are difficult to retrofit -- don't always chunk work into 1-2 week sprint windows without factoring in time to eliminate technical debt (or specifically design debt). This is critical to avoid significantly more costly failures in the future.
  • Leverage open-source for early validation: Use open-source components and focus on the glue code to prove the end-to-end domain logic -- the idea is to mitigate the execution risks, and then incrementally replace the open-source vanilla blocks with customized homegrown components.


In summary, there is no secret recipe for putting together an A+ team, but the approach of chasing rock-star distributed systems or storage developers is almost always a losing formula.

6 key hardware trends in the context of Software-defined Storage and Big Data Platforms

Oftentimes, I get into discussions where fellow engineers/architects are solely fixated on the properties of the emerging storage hierarchy across disks, flash, and NVM. We are in an era where there is significant diversity in storage devices in terms of $/GB, $/IOPS, latency, capacity, throughput, and write endurance. While storage is an important part of the overall solution puzzle, it is easy to forget that it's still just a “part” and not the whole puzzle. In going from disk to flash to NVM, IO latencies have shifted from milliseconds to microseconds to nanoseconds, respectively. Does this mean that we can simply add NVM or SLC flash to storage arrays and completely harness the radically improved service times? Of course not. To generalize the case in point, the design of Software-defined Storage and Big Data Platforms needs to be vetted with a holistic analysis across the core building blocks, i.e., CPU, memory, network, and storage. The goal of this blog post is to highlight relevant trends that engineers, system administrators, and CIOs should keep in perspective while developing or adopting next-generation data platform solutions.

  1. CPU Scaling: Slowdown of Moore’s law: Some camps believe that Moore’s law is already dead, while others believe its death is imminent. Irrespective, the key point is that we are no longer able to double compute every 18 months. Instead, we have settled for multi-core, which from the software standpoint provides roughly linear (30-40%) scaling per year, given the limited ability to truly exploit parallel programming models. With data growing at an exponential rate, the key takeaway is the shrinking availability of compute cycles per unit of data.
  2. Memory gets faster and cheaper: At the beginning of the decade, volatile memory within servers was a scarce resource (I remember spending endless hours on metadata optimizations to avoid paging overheads). The maturity of fabrication technologies has turned the dial w.r.t. average size and $/GB. Standard server configurations are available today with 256GB of memory, compared to a few years back when 8-16 GB was the norm.
  3. Latency lags Bandwidth: This is an observation from David Patterson in 2004, based on his detailed analysis across CPU, network, memory, and storage. To illustrate, from the mid-1980s to 2009, disk bandwidth increased from roughly 2 MB/s to 100 MB/s (about a 50X improvement), but latencies only improved from 20 ms to 10 ms (2X). Discontinuous innovation in storage technologies with NVM is creating an outlier for storage latency, but bandwidth can potentially see a similar jump with the emergence of the NVMe over Fabrics effort. Overall, latency lagging bandwidth will continue to hold true in the long run.
  4. Distinct Performance and Capacity tiers: The diversity in storage technologies is creating distinct tiers of storage w.r.t. $/IOPS and $/GB. Rotating disks will continue to be the leaders w.r.t. $/GB, but are expensive on a $/IOPS basis compared to flash and NVM. The distinction clearly lends itself to innovation across caching and tiering technologies.

  5. Network is the new bottleneck: With disks at 6-8 msec, latency optimization in the rest of the IO stack was inconsequential. With NVM at a few hundred nanoseconds, every optimization, particularly the network round trip, is now critical. In fact, the best-case network round trip of 1-2 microseconds is now the bottleneck in the overall NVM-based IO path (a back-of-the-envelope sketch follows this list).
     
  6. Closing of the latency gap: In the disk era, the latency to access data on disk was on the order of milliseconds. There was a multiple-orders-of-magnitude latency difference between a cache/memory hit (order of nanoseconds) and an access from disk (order of milliseconds). This was referred to as the “latency gap.” With the adoption of NAND flash within enterprises, a miss became roughly 1000X faster (order of microseconds). With the upcoming byte-addressable NVM, another 1000X speedup (order of nanoseconds) is now imminent, essentially closing the “latency gap” from a technology physics standpoint! A standard server will be available in different configuration mixes of NVM and DDR RAM on the memory controller, flash on the PCIe bus, and disks on the SAS/SATA controller.
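
To make trend 5 concrete, here is a back-of-the-envelope latency-budget sketch. The media latencies, network round trip, and software overhead below are rough illustrative assumptions, not measurements:

```python
# Rough, illustrative media latencies in seconds -- assumptions, not benchmarks.
MEDIA_LATENCY = {"disk": 7e-3, "flash": 100e-6, "nvm": 500e-9}
NETWORK_RTT = 1.5e-6      # assumed best-case round trip within a rack
SOFTWARE_OVERHEAD = 5e-6  # assumed per-IO stack overhead

for media, t_media in MEDIA_LATENCY.items():
    total = t_media + NETWORK_RTT + SOFTWARE_OVERHEAD
    share = 100 * (NETWORK_RTT + SOFTWARE_OVERHEAD) / total
    print(f"{media:5s}: total {total * 1e6:8.1f} us, "
          f"network + software = {share:4.1f}% of the IO path")
```

With these assumed numbers, the non-media portion is a rounding error for disk, a few percent for flash, and the dominant cost for NVM -- which is exactly why the network round trip becomes the new bottleneck.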


To summarize, remember these trends next time you have a discussion on enterprise storage and Big Data Platforms -- try to analyze how future-proof the solutions are in the context of the lowest common denominator, i.e., the hardware trends across CPU, memory, network, and storage.

Wednesday, March 11, 2015

Software Defined Storage (SDS) -- the fever catches on!

IBM recently announced an overhaul of their go-to-market strategy for storage. Several of the existing products have been packaged together under the IBM Spectrum umbrella. Overall, it does have the sound bites of "old wine in a new bottle" -- the part that stood out to me was the Spectrum Accelerate announcement, in which their premier storage array product XIV is also going to be available as software-only! XIV has been a cash cow for IBM, and this is certainly a bold move to cannibalize a high-margin hardware offering! The move certainly makes a lot of sense, given XIV's highly scalable, shared-nothing internal architecture -- having worked on XIV since the acquisition in 2008, it was definitely ahead of its time, and I have great respect for this product!

Close on the heels of the IBM announcement, SolidFire also jumped on the SDS bandwagon with a similar software-based offering of their product.

I think it's very clear that the broader storage industry is recognizing the larger opportunity with SDS, even if it translates to cannibalizing hardware-based margins in the short run. I wouldn't be surprised if there are similar announcements across the board, with eventually every existing storage array also being available in a software-only version (certainly not trivial to accomplish, and it won't happen overnight given the exponential complexity introduced by the hardware qualification permutations).

To sum it up, SDS is a classic example of the "Innovator's Dilemma!"

Monday, December 8, 2014

The implicit guarantees applications want from the storage layer?

Application programmers require "predictable" behavior from persistent storage. So how do we define predictable? Traditionally, this dilemma has been addressed by the POSIX standard for file access. In the domain of data management, ACID has been the de-facto standard for how a database is expected to persist updates and handle concurrent read-write and write-write behavior.

So why do we need alternatives? Applications have evolved from enterprise scale to internet scale. The semantics of POSIX and ACID consistency may sometimes not be worth the tradeoff in scale, performance, and availability. Essentially, there is no "one size fits all" -- for instance, when you are shopping on Amazon and browsing the catalog or adding to the cart, the operations are not ACID; the moment you hit checkout, the transaction is handled with ACID guarantees.

Defining the ingredients of predictable behavior -- the à la carte model: Instead of getting stuck with monolithic ACID/POSIX semantics, programmers can now "pick-and-choose" -- this is possible since data management systems are increasingly being co-designed with the persistence layer, e.g., Google File System (GFS) was developed to address very specific workload characteristics related to the persistence of URLs, indexes, etc. The following defines some of the key dimensions an application programmer would use to describe predictable behavior for storage (a hypothetical client API sketch follows the list).

  • Atomicity guarantee: This is considered a pre-requisite to ensure all-or-nothing (0 or 1) semantics for a data update. Most systems also support atomic test-and-set operations on a single table row or object.
  • Ordering guarantee: This ensures that updates are visible in the order in which they are processed by the system -- defined by Leslie Lamport's happens-before causal relationship/logical clocks. On one extreme, there can be global ordering for all update operations across sites (e.g., Google's Spanner). On the other extreme, the ordering can be w.r.t. individual object updates (e.g., Amazon S3). Alvaro et al. propose an interesting taxonomy for the granularity of ordering: Object-, Dataflow-, and Language-level consistency.
  • Data Freshness guarantee + Monotonicity guarantee: Given the inherent asynchrony in distributed systems, a write followed by a read may not always reflect the latest value of the data. This can be an artifact of two aspects: 1) update propagation to replicas and how the replicas are accessed during the read operation; 2) read-write mutual exclusion semantics enforced by the system -- defined by Lamport's registers and also memory consistency models such as sequential consistency, MESI, etc.
  • WW mutual exclusion guarantee: As the name suggests, this describes how concurrent writes are handled. On one extreme, the system can support Last Writer Wins (e.g., Cassandra); on the other extreme, it may enforce 2PL (with variants such as majority voting) or timestamp ordering.
  • Transactions guarantee: This is typically in the context of multi-object operations. The key aspects are the atomicity of the updates and the isolation guarantee (linearizable, serializable, repeatable read, read committed, read uncommitted).
  • Data integrity guarantee (unavailable rather than corrupt): Most scale-out systems continuously scrub data internally, and compare the replicas to ensure integrity -- it may make the data unavailable rather than serve corrupt data. Having this guarantee relieves the application from the role of verifying data integrity.
  • Replica fidelity guarantee: For systems that allow applications to access replicas directly, this guarantee defines what the application can expect -- the strongest guarantee is byte-wise fidelity (e.g., Windows Azure). On the other extreme, at-least-once semantics with potentially different ordering within each replica (e.g., Google GFS, which requires applications to define record identifiers to access updates across the replicas).
  • Mutability model: While this is internal to the persistence layer implementation, the mutability model helps shape the crash consistency and transactional semantics implemented at the application layer. A few common mutability models are in-place, versioning, immutable, out-of-place (append-only, CoW), journaled.
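
To illustrate the pick-and-choose model, here is a hypothetical client API sketch in which the application states the guarantees it needs per operation. The enums, defaults, and method signatures are invented for illustration and are not the API of any real store:

```python
from enum import Enum

class Ordering(Enum):
    GLOBAL = "global"          # total order across sites (Spanner-style)
    PER_OBJECT = "per_object"  # ordering only per individual object (S3-style)

class Freshness(Enum):
    READ_YOUR_WRITES = "read_your_writes"
    EVENTUAL = "eventual"

class WriteConflict(Enum):
    LAST_WRITER_WINS = "lww"
    LOCKED = "two_phase_locking"

class StoreClient:
    """Hypothetical client where guarantees are chosen a la carte per call."""
    def put(self, key: str, value: bytes,
            ordering: Ordering = Ordering.PER_OBJECT,
            conflict: WriteConflict = WriteConflict.LAST_WRITER_WINS) -> None:
        ...  # the store would map these hints onto its replication protocol

    def get(self, key: str,
            freshness: Freshness = Freshness.EVENTUAL) -> bytes:
        ...  # EVENTUAL may read any replica; READ_YOUR_WRITES pins a session
```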

Wednesday, November 12, 2014

Part V: Nuts-and-bolts of a Scale-out Distributed Storage System

This post covers Elasticity.


The configuration of a distributed system is constantly changing. The workload access characteristics shift the hotspots to different parts of the namespace.

Elasticity represents the capability of the system to automatically manage changes in cluster configuration and workload. The actuators to accomplish elasticity are:

  • Re-sharding: Splitting the namespace for better load balancing across the nodes (a small re-sharding sketch follows this list)
  • Re-mapping:
    • Re-assigning Shard-to-node mapping: This is typically for a subset of the namespace, and could be a result of nodes added/deleted, newly created shards due to splitting, and other configuration changes.
    • Re-assigning replica-to-node mapping: Changes in access can also affect the replica placement
    • Re-assigning namespace-to-caching-server mapping
  • Auto-scaling: This is the eventual goal, where the system can dynamically model the load and spin up/spin down service instances, nodes, network routers on the fly.
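
Here is a minimal sketch of the first two actuators: splitting the hottest shard and re-mapping the new half to the least-loaded node. The load model and data structures are simplified assumptions for illustration:

```python
from collections import defaultdict

class ShardMap:
    """Toy shard map: shard -> node placement plus an observed load counter."""
    def __init__(self):
        self.placement = {}            # shard_id -> node
        self.load = defaultdict(int)   # shard_id -> observed ops/sec

    def least_loaded_node(self, nodes):
        per_node = defaultdict(int)
        for shard, node in self.placement.items():
            per_node[node] += self.load[shard]
        return min(nodes, key=lambda n: per_node[n])

    def split_hottest_shard(self, nodes):
        """Re-sharding: split the hottest shard, then re-map one half."""
        hot = max(self.placement, key=lambda s: self.load[s])
        new_shard = f"{hot}.split"
        self.load[hot] //= 2
        self.load[new_shard] = self.load[hot]
        # Re-mapping: assign the new half to the least-loaded node.
        self.placement[new_shard] = self.least_loaded_node(nodes)
        return new_shard
```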


Part IV: Nuts-and-bolts of a Scale-out Distributed Storage System

This post covers Availability/Fault Tolerance

There are multiple metrics to quantify the availability of a system. The most common metric is the number of 9s of availability: 5 9s of availability translates to 5.26 minutes of downtime in a year. A related concept to availability is data durability or reliability, which represents the probability that a data object is lost due to hardware failures (characterized by rates such as MTBF and MTTDL). Traditional enterprise deployments used Recovery Point Objective (RPO) and Recovery Time Objective (RTO), mostly to represent a Disaster Recovery (DR) scenario. RPO represents the recent updates that can potentially be lost, while RTO represents the time to get the system back online.
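
As a quick sanity check on the numbers above, downtime per year for N nines of availability can be computed directly:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes(nines: int) -> float:
    availability = 1 - 10 ** (-nines)
    return MINUTES_PER_YEAR * (1 - availability)

for n in (3, 4, 5):
    print(f"{n} nines -> {downtime_minutes(n):7.2f} minutes/year")
# 5 nines -> ~5.26 minutes of downtime per year
```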

In order to design for high availability, the system needs to be designed with fault tolerance at each level in the stack. The four common building principles are:

  • Data Consistency during failures: Most common failures involve sub-components failing permanently or transiently becoming unavailable. It is critical to ensure system consistency by making committed operations durable in the correct order and rolling back partial updates. In distributed systems, if the minimum number of replicas is not updated (i.e., no quorum), the updates should either be rolled back or the system should remember this for later repair. Partial failures increase the probability of data loss, since successive failures could knock off the non-quorum updated copies (analogous to a double disk failure in a RAID 1 system).
  • Failure detection for failover: This ensures that permanent and transient failures trigger appropriate recovery actions.
  • Failover orchestration: Steps involved in re-directing the incoming requests to the appropriate target. This could essentially be client-side logic that retries the request to the replica location.
  • Repair and Failback: The steps involved in getting the failed component back online. This is typically a distributed recovery process, and it's typically a rebuild-versus-repair decision. Once the node fully recovers, it rejoins the cluster and serves either the previous data or new data.

The rest of the post fleshes out additional details for each of these dimensions:

  
1. Ensuring Data Consistency during failures
This is essentially related to the write commit protocol, i.e., how updates are persisted across replicas. An ideal solution is non-buffered writes that are strongly consistent across all the replicas. Such an ideal solution is not quite practical in most scenarios given the extreme trade-offs in latency, availability, and scalability. Most practical systems implement data buffering (similar to group commit semantics), as well as update replicas asynchronously using eventually consistent models.
  • Crash consistency: This is a traditional topic, required even in non-distributed systems. Typical strategies are journaling, shadow pages, soft updates, LFS (journaling + soft-updates)
  • Handling non-quorum updates: This has multiple sub-topics (see the quorum-write sketch after this list):
    • Conflicting replicas: In a model where any replica can be updated, a non-quorum update can lead to two replicas having completely different versions of the data
    • Fragile updates: If an update lands on only a non-quorum set of replicas (worst case = 1 replica), the loss of that replica can lead to data loss.
    • Rollback of replicas: In the absence of 2PC or equivalent, enforcing quorum requires rollback of the replicas to be explicitly implemented
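
A minimal sketch of a quorum-style write commit, assuming N replicas and a write quorum W. The replica transport is faked with a simple function so the rollback/repair decision from the bullets above is visible; this is illustrative only, not any particular system's protocol:

```python
import random

def write_to_replica(replica_id: int, value: bytes) -> bool:
    """Stand-in for a network call; randomly fails to simulate faults."""
    return random.random() > 0.2

def quorum_write(value: bytes, n_replicas: int = 3, write_quorum: int = 2):
    acked = [r for r in range(n_replicas) if write_to_replica(r, value)]
    if len(acked) >= write_quorum:
        # Commit, but remember under-replicated copies for later repair.
        missing = [r for r in range(n_replicas) if r not in acked]
        return {"committed": True, "repair_queue": missing}
    # No quorum: either roll back the partial update on `acked`,
    # or record it so a later repair can reconcile conflicting replicas.
    return {"committed": False, "rollback": acked}
```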
 2. Failure detection:
A critical aspect of FT is the ability to detect failures. There are two broad categories of failure detection: Centralized versus P2P gossip style detection. There are several challenging aspects of failure detection:
  • Types of faults: While most systems are designed for fail-stop faults, there are other classes of faults that are much more complex to deal with, namely Byzantine faults.
  • Distinguishing between permanent versus transient failures: The detection threshold needs to be adaptive since the fault could be transient (see the heartbeat sketch after this list).
  • Differentiating between failed node versus a network partitioned node
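
A minimal sketch of centralized heartbeat-based detection with an adaptive timeout (a gossip-style detector would spread the same suspicion information peer to peer). The multiplier, history window, and data structures are illustrative assumptions:

```python
import time
from collections import defaultdict, deque

class HeartbeatDetector:
    """Marks a node as suspect if no heartbeat arrives within an adaptive
    timeout derived from its recent inter-arrival times."""
    def __init__(self, multiplier: float = 3.0):
        self.last_seen = {}
        self.intervals = defaultdict(lambda: deque(maxlen=100))
        self.multiplier = multiplier  # tolerance for transient hiccups

    def heartbeat(self, node: str, now: float = None):
        now = now or time.time()
        if node in self.last_seen:
            self.intervals[node].append(now - self.last_seen[node])
        self.last_seen[node] = now

    def suspects(self, now: float = None):
        now = now or time.time()
        out = []
        for node, seen in self.last_seen.items():
            history = self.intervals[node]
            avg = sum(history) / len(history) if history else 1.0
            if now - seen > self.multiplier * avg:
                out.append(node)  # suspect: may be dead or just partitioned
        return out
```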

3. Failover orchestration:
This is the process to redirect requests to redundant services/data copies, while the original node is being repaired. Failover orchestration depends on the combination of the following three aspects:
  • Type of component failure: Disk, node, network, or site failure. Note that these component failures are not black-and-white -- a node can appear to have failed due to a network partition or congestion.
  • Type of service affected: A scale-out storage system comprises different services such as the metadata master, data nodes, the lock service, the operation scheduler, etc.
  • Topology of the failure:
    • Scenario where the coordinator service is partitioned from the rest of the nodes
    • Scenario where the nodes are partitioned into a quorum and non-quorum cluster

The appropriate failover workflow is typically implemented as a combination of the above three scenarios.
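
As a toy illustration of combining these three aspects, the orchestration choice could be expressed as a decision function. The component/service/topology names and the resulting actions below are purely illustrative assumptions:

```python
def failover_action(component: str, service: str, quorum_side: bool) -> str:
    """Map (failed component, affected service, topology) to a workflow.
    Purely illustrative -- real orchestration is far more nuanced."""
    if component == "network" and not quorum_side:
        return "fence non-quorum partition; redirect clients to quorum side"
    if service == "metadata_master":
        return "promote standby master; replay its operation log"
    if component == "disk":
        return "redirect reads to surviving replicas; schedule replica repair"
    if component == "node":
        return "retry requests against replica locations; start repair timer"
    return "alert operator; keep serving from redundant copies"
```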

4. Repair and Failback: The steps involved in getting the failed component back online. This is typically a distributed recovery process where multiple existing nodes participate to apply the update journal. Once the node fully recovers, it rejoins the cluster and either serves the previous data or is commissioned for new data. The following scenarios need to be addressed:

  • Handling a stale replica when it returns online. The key challenge is deciding between a repair with the missing updates versus a complete discard and rebuild (a small repair-versus-rebuild sketch follows this list).
  • Bootstrapping a new service instance: This is typically the case for hardware failures – the system is repaired and re-commissioned.  A related scenario is where existing disks are plugged in the node.
  • Cluster re-merge: Rejoining the cluster after the partition. This workflow will generate several repair tasks namely getting the replicas on the non-quorum side up-to-date.
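
A small sketch of the repair-versus-rebuild decision for a stale replica returning online, using journal lag as the deciding signal. The threshold and fields are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class ReplicaState:
    last_applied_seq: int       # last journal entry the returning replica applied
    head_seq: int               # current head of the update journal
    journal_retained_from: int  # oldest journal entry still available

def repair_or_rebuild(replica: ReplicaState, max_lag: int = 100_000) -> str:
    lag = replica.head_seq - replica.last_applied_seq
    if replica.last_applied_seq < replica.journal_retained_from:
        return "rebuild"   # journal already garbage-collected; full copy needed
    if lag > max_lag:
        return "rebuild"   # replaying that much journal is slower than re-copying
    return "repair"        # apply the missing journal entries and rejoin
```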