Mapping Technology Trends to Enterprise Product Innovation

Scope: Focuses on enterprise platform software: Big Data, Cloud platforms, software-defined, micro-services, DevOps.
Why: We are living in an era of continuous change and low barriers to entry. Net result: a lot of noise!
What: Sharing my expertise gained over nearly two decades in the skill of extracting the signal from the noise! More precisely, identifying shifts in ground realities before they become cited trends and pain-points.
How: NOT based on reading tea leaves! Instead, synthesizing technical and business understanding of the domain at 500 ft., 5,000 ft., and 50K ft.

(Disclaimer: Personal views not representing my employer)

Saturday, September 14, 2013

Art of Data Sharding

Data Sharding is a common term used in the context of scale-out storage and data management systems. It refers to the ability to divide the responsibility of serving data and metadata operations among the nodes within the cluster. In our previous posts, we covered different control taxonomies, namely Master, Masterless, and Multi-Master. In this blog, we analyze sharding from the perspective of namespace distribution.

To clarify terminology: for the purposes of sharding, the namespace is viewed as a key-space derived either lexically or via hashing functions. Sharding algorithms divide this key-space into ranges and assign them to the nodes. Note that the key-space distribution across the nodes does not necessarily have to be deterministic, i.e., it can be random, but it will then incur a layer of indirection for data-location look-up (more suitable for Master/Multi-Master taxonomies).
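To make the two derivations concrete, here is a minimal Python sketch (not taken from any particular system; the node names, split points, and hash choice are hypothetical) contrasting deterministic hash-based sharding with lexical range-based sharding of the key-space:

```python
import hashlib
from bisect import bisect_right

# Hypothetical 4-node cluster; names and split points are illustrative only.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def hash_shard(key: str, nodes=NODES) -> str:
    """Hash-derived key-space: digest the key, then map the resulting
    position onto a node deterministically (here with a simple modulo)."""
    position = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[position % len(nodes)]

# Lexically derived key-space: split points partition the sorted key range.
SPLIT_POINTS = ["g", "n", "t"]   # node-a serves [start, "g"), node-b ["g", "n"), ...

def range_shard(key: str, splits=SPLIT_POINTS, nodes=NODES) -> str:
    """Range-based sharding: locate the key range the key falls into."""
    return nodes[bisect_right(splits, key)]

if __name__ == "__main__":
    for k in ("alpha", "kilo", "zulu"):
        print(f"{k}: hash -> {hash_shard(k)}, range -> {range_shard(k)}")
```

Range-based placement preserves lexical ordering (useful for scans), while hash-based placement smooths out skew at the cost of ordering; a random, non-deterministic assignment would replace both functions with a look-up table maintained via the indirection layer mentioned above.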

Before we look at various schemes for data sharding, let's start by understanding the key design considerations:

  • Infrastructure: 
    • The nodes participating within the cluster may be homogeneous or heterogeneous in terms of their CPU, memory, network, and disk/SSD configurations.
    • Single data-center versus multi-data-center; within a data-center, there can be cell-based/pod-based architecture.
    • The goal is to minimize the occupancy deviation between the most-filled and least-filled disks; skewed resource utilization can lead to nodes becoming hotspots/stragglers (see the sketch after this list).
  • Data Model & Workload access patterns
    • The application naming scheme typically dictates how data is referenced. This can lead to skewed or bimodal distributions rather than a uniform distribution of accesses across the key-space.
    • Access patterns may follow a lexical ordering of keys, or exhibit strong locality behavior in key-space access. 
    • Operations/transactions may involve multiple keys spanning different key ranges, for example backing up data or accessing large files striped across key ranges that span multiple nodes.
    • Other aspects, such as tenant-based key distribution.
  • QoS Goals
    • Criteria for optimization -- latency versus throughput
    • Scalability goals
    • Performance locality, especially in multi-data-center environments
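To illustrate the infrastructure consideration above, here is the promised sketch: a rough Python example, under made-up capacities and key names, of capacity-weighted placement on a heterogeneous cluster and of measuring the occupancy deviation between the most-filled and least-filled nodes:

```python
import hashlib
from collections import defaultdict

# Hypothetical heterogeneous cluster; capacities are in arbitrary units.
CAPACITY = {"node-a": 4, "node-b": 2, "node-c": 1, "node-d": 1}

# Capacity-weighted slot table: larger nodes receive proportionally
# more slots of the hashed key-space.
SLOTS = [node for node, weight in CAPACITY.items() for _ in range(weight)]

def place(key: str) -> str:
    """Deterministically place a key on a node, weighted by capacity."""
    position = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SLOTS[position % len(SLOTS)]

def occupancy_deviation(keys) -> float:
    """Spread between the most-filled and least-filled node, with each
    node's key count normalized by its capacity (lower is better)."""
    counts = defaultdict(int)
    for key in keys:
        counts[place(key)] += 1
    load = [counts[node] / CAPACITY[node] for node in CAPACITY]
    return (max(load) - min(load)) / max(load)

if __name__ == "__main__":
    keys = [f"object-{i}" for i in range(10_000)]
    print(f"normalized occupancy deviation: {occupancy_deviation(keys):.2%}")
```

A real system would typically track this deviation continuously and trigger rebalancing (or bias new placements) once it crosses a threshold, rather than computing it over a static key list as done here for illustration.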

A point to note: data sharding is also used in P2P environments, but the design considerations differ significantly from scale-out storage. P2P environments exhibit highly variable bandwidth, frequent reconfiguration, and untrusted participants, and require Byzantine fault tolerance rather than a fail-stop model.

The next post will delve into common design patterns for deterministic sharding, load balancing, and the correlation between sharding and replication.
