Mapping Technology Trends to Enterprise Product Innovation

Scope: Focuses on enterprise platform software: Big Data, Cloud platforms, software-defined, micro-services, DevOps.
Why: We are living in an era of continuous change and a low barrier to entry. Net result: a lot of noise!
What: Sharing my expertise, gained over nearly two decades, in the skill of extracting the signal from the noise! More precisely, identifying shifts in ground realities before they become cited trends and pain-points.
How: NOT based on reading tea leaves! Instead, synthesizing technical and business understanding of the domain at 500 ft., 5000 ft., and 50K ft.

(Disclaimer: Personal views not representing my employer)

Wednesday, November 12, 2014

Part II: Nuts-and-bolts of a Scale-out Distributed Storage System

This post covers Data Durability.

Data Durability ensures minimal data loss in the event of hardware failures, component corruption, or software bugs. The two most common approaches for data durability are replication (creating multiple copies of the data) and erasure coding (storing data fragments along with parity fragments). Both of them have pros and cons. Replication imposes overheads w.r.t. space usage (e.g., 3X the capacity), but is cheaper w.r.t. partial update overheads and the amount of data required to be read during recovery. Erasure coding across nodes has the inverse pros/cons compared to replication: lower space overhead, but costlier partial updates and more data read during recovery.
With the adoption of All-Flash environments, erasure coding is getting a lot of attention in recent research.
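
To make the trade-off concrete, here is a back-of-the-envelope sketch in Python. The `Scheme` class and the 6+3 erasure-coding layout are my own illustrative assumptions, not taken from any particular system, and the recovery figure assumes a classic MDS code (e.g., Reed-Solomon) where rebuilding one lost fragment means reading k surviving fragments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    """Hypothetical model of a durability scheme (illustration only)."""
    name: str
    data_fragments: int   # k: fragments that hold the original data
    total_fragments: int  # n: fragments stored overall (data + redundancy)

    def space_overhead(self) -> float:
        # Raw bytes stored per byte of user data.
        return self.total_fragments / self.data_fragments

    def failures_tolerated(self) -> int:
        # Fragments that can be lost without losing data.
        return self.total_fragments - self.data_fragments

    def recovery_read_factor(self) -> int:
        # Bytes read per byte rebuilt: 1 for a plain copy, k for an
        # MDS erasure code (assumption: no locally repairable codes).
        return self.data_fragments


# 3-way replication is modeled as k=1, n=3.
replication = Scheme("3-way replication", data_fragments=1, total_fragments=3)
# Assumed example layout: 6 data fragments + 3 parity fragments.
erasure = Scheme("erasure coding 6+3", data_fragments=6, total_fragments=9)

for s in (replication, erasure):
    print(f"{s.name}: {s.space_overhead():.1f}X space, "
          f"tolerates {s.failures_tolerated()} failures, "
          f"reads {s.recovery_read_factor()}X data to rebuild")
```

Under these assumptions, replication stores 3X the data but rebuilds a lost copy by reading 1X, while the 6+3 layout stores 1.5X but reads 6X the lost data during recovery.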


The key building-block services are:

    Replica Placement: Replica placement needs to take into account:
        Fault domain-awareness for namespace and replica distribution
        Replica Server Allocation: Deciding the replica servers for a namespace
    Replication Orchestration: Deals with the mechanism of the data replication process itself. Some aspects overlap with data consistency:
        Read-write protocol for replicas
        Coordination (serialization and ordering) of updates to replicas
        State-based versus operation-based replication
    Replica Repair: While writes are committed across a quorum of replicas, a replica can get out-of-sync and need repair under the following scenarios:
        An offline replica connects back
        Conflicts in replica updates, especially in AP systems (i.e., any replica update model without quorum consensus)
    Data Integrity/Scrubbing: This involves storing checksums and reading the disk blocks in a background thread to verify data correctness.
    Geo-redundancy Service: Replication across sites. The aspects are similar to replication within a data center, with the additional aspect of network optimization techniques.
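
As an illustration of fault domain-aware replica placement, here is a minimal Python sketch. It assumes the fault domain is a rack and uses rendezvous (highest-random-weight) hashing to pick servers deterministically; the function names, the node-to-rack map, and the choice of hashing are my own assumptions, one of many ways to do this.

```python
import hashlib
from collections import defaultdict


def _score(key: str, name: str) -> int:
    """Deterministic pseudo-random weight for a (key, name) pair."""
    digest = hashlib.sha256(f"{key}:{name}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place_replicas(namespace: str, nodes: dict, replicas: int = 3) -> list:
    """Pick `replicas` servers for `namespace`, each in a distinct fault domain.

    `nodes` maps server name -> fault domain (here: a rack label).
    """
    by_domain = defaultdict(list)
    for node, domain in nodes.items():
        by_domain[domain].append(node)
    if len(by_domain) < replicas:
        raise ValueError("not enough fault domains for the requested replicas")

    # Rank fault domains for this namespace, keep the top `replicas`.
    domains = sorted(by_domain, key=lambda d: _score(namespace, d), reverse=True)
    # Within each chosen domain, pick the highest-scoring server.
    return [
        max(by_domain[d], key=lambda n: _score(namespace, n))
        for d in domains[:replicas]
    ]


# Hypothetical cluster: six servers across four racks.
cluster = {
    "s1": "rack-a", "s2": "rack-a",
    "s3": "rack-b", "s4": "rack-c",
    "s5": "rack-c", "s6": "rack-d",
}
chosen = place_replicas("volume-42", cluster)
print(chosen, [cluster[n] for n in chosen])
```

Because the choice depends only on the namespace and the cluster map, any node can recompute the placement without a central lookup; a real system would also weigh capacity and load, which this sketch ignores.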

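The scrubbing and repair services can be sketched in a similar spirit. This toy Python example keeps blocks in memory instead of on disk and uses SHA-256 as the checksum; the `Replica` class and `repair` helper are hypothetical names for illustration, and a real scrubber would run as a throttled background task.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class Replica:
    """Toy replica: block_id -> (data, checksum stored at write time)."""

    def __init__(self):
        self.blocks = {}

    def write(self, block_id: str, data: bytes) -> None:
        self.blocks[block_id] = (data, checksum(data))

    def scrub(self) -> list:
        """Re-read every block and return ids whose data fails its checksum."""
        return [
            block_id
            for block_id, (data, stored) in self.blocks.items()
            if checksum(data) != stored
        ]


def repair(damaged: Replica, healthy: Replica, block_ids: list) -> list:
    """Copy blocks from `healthy`, but only if they verify at the source."""
    repaired = []
    for block_id in block_ids:
        data, stored = healthy.blocks[block_id]
        if checksum(data) == stored:
            damaged.blocks[block_id] = (data, stored)
            repaired.append(block_id)
    return repaired


r1, r2 = Replica(), Replica()
for replica in (r1, r2):
    replica.write("blk-1", b"hello")
    replica.write("blk-2", b"world")

# Simulate silent corruption on r1: data changes, stored checksum does not.
_, old_sum = r1.blocks["blk-2"]
r1.blocks["blk-2"] = (b"w0rld", old_sum)

bad = r1.scrub()
fixed = repair(r1, r2, bad)
print("corrupt:", bad, "repaired:", fixed)
```

Verifying the source block before copying matters: otherwise a repair could spread corruption from one replica to another.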


