Mapping Technology Trends to Enterprise Product Innovation

Scope: Focuses on enterprise platform software: Big Data, Cloud platforms, software-defined, micro-services, DevOps.
Why: We are living in an era of continuous change and low barriers to entry. Net result: a lot of noise!
What: Sharing my expertise gained over nearly two decades in the skill of extracting the signal from the noise! More precisely, identifying shifts in ground realities before they become cited trends and pain-points.
How: NOT based on reading tea leaves! Instead, synthesizing technical and business understanding of the domain at 500 ft., 5,000 ft., and 50K ft.

(Disclaimer: Personal views not representing my employer)

Saturday, September 14, 2013

Art of Data Sharding

Data Sharding is a common term used in the context of scale-out storage and data management systems. It refers to the ability to divide the responsibility of serving data and metadata operations among the nodes within the cluster. In our previous posts, we covered different control taxonomies, namely Master, Masterless, and Multi-Master. In this blog, we analyze sharding from the perspective of namespace distribution.

To clarify terminology: for the purposes of sharding, the namespace is viewed as a key-space derived either lexically or via hashing functions. Sharding algorithms divide this key-space into ranges and assign them to the nodes. Note that the key-space distribution across the nodes does not necessarily have to be deterministic, i.e., it can be random, but it will then incur a layer of indirection for data-location look-up (more suitable for Master/Multi-Master taxonomies).
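To make the two derivations concrete, here is a minimal Python sketch (not taken from any particular system; the node names, split points, and hash choice are hypothetical) contrasting deterministic hash-based sharding with lexical range-based sharding of the key-space:

```python
import hashlib
from bisect import bisect_right

# Hypothetical 4-node cluster; names and split points are illustrative only.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def hash_shard(key: str, nodes=NODES) -> str:
    """Hash-derived key-space: digest the key, then map the resulting
    position onto a node deterministically (here with a simple modulo)."""
    position = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[position % len(nodes)]

# Lexically derived key-space: split points partition the sorted key range.
SPLIT_POINTS = ["g", "n", "t"]   # node-a serves [start, "g"), node-b ["g", "n"), ...

def range_shard(key: str, splits=SPLIT_POINTS, nodes=NODES) -> str:
    """Range-based sharding: locate the key range the key falls into."""
    return nodes[bisect_right(splits, key)]

if __name__ == "__main__":
    for k in ("alpha", "kilo", "zulu"):
        print(f"{k}: hash -> {hash_shard(k)}, range -> {range_shard(k)}")
```

Range-based placement preserves lexical ordering (useful for scans), while hash-based placement smooths out skew at the cost of ordering; a random, non-deterministic assignment would replace both functions with a look-up table maintained via the indirection layer mentioned above.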

Before we look at various schemes for data sharding, let's start by understanding the key design considerations:

  • Infrastructure: 
    • The nodes participating within the cluster may be homogeneous or heterogeneous in terms of their CPU, memory, network, and disk/SSD configurations.
    • Single data-center versus multi-data-center; within a data-center, there can be cell-based/pod-based architecture.
    • The goal is to minimize the occupancy deviation between the most-filled and least-filled disks; skewed resource utilization can lead to nodes becoming hotspots/stragglers (see the sketch after this list).
  • Data Model & Workload access patterns
    • The application naming scheme typically dictates how data is referenced. This can lead to skewed or bimodal distributions rather than a uniform distribution of accesses across the key-space.
    • Access patterns may follow a lexical ordering of keys, or exhibit strong locality behavior in key-space access. 
    • Operations/transactions may involve multiple keys spanning different key ranges, for example backing up data or accessing large files striped across key ranges that span multiple nodes.
    • Other aspects, such as tenant-based key distribution.
  • QoS Goals
    • Criteria for optimization -- latency versus throughput
    • Scalability goals
    • Performance locality, especially in multi-data-center environments
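To illustrate the infrastructure consideration above, here is the promised sketch: a rough Python example, under made-up capacities and key names, of capacity-weighted placement on a heterogeneous cluster and of measuring the occupancy deviation between the most-filled and least-filled nodes:

```python
import hashlib
from collections import defaultdict

# Hypothetical heterogeneous cluster; capacities are in arbitrary units.
CAPACITY = {"node-a": 4, "node-b": 2, "node-c": 1, "node-d": 1}

# Capacity-weighted slot table: larger nodes receive proportionally
# more slots of the hashed key-space.
SLOTS = [node for node, weight in CAPACITY.items() for _ in range(weight)]

def place(key: str) -> str:
    """Deterministically place a key on a node, weighted by capacity."""
    position = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SLOTS[position % len(SLOTS)]

def occupancy_deviation(keys) -> float:
    """Spread between the most-filled and least-filled node, with each
    node's key count normalized by its capacity (lower is better)."""
    counts = defaultdict(int)
    for key in keys:
        counts[place(key)] += 1
    load = [counts[node] / CAPACITY[node] for node in CAPACITY]
    return (max(load) - min(load)) / max(load)

if __name__ == "__main__":
    keys = [f"object-{i}" for i in range(10_000)]
    print(f"normalized occupancy deviation: {occupancy_deviation(keys):.2%}")
```

A real system would typically track this deviation continuously and trigger rebalancing (or bias new placements) once it crosses a threshold, rather than computing it over a static key list as done here for illustration.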

A point to note: data sharding is also used in P2P environments, but the design considerations differ significantly from scale-out storage. P2P environments exhibit highly variable bandwidth, frequent reconfiguration, and untrusted participants, and require Byzantine fault tolerance rather than a fail-stop model.

The next post will delve into common design patterns for deterministic sharding, load balancing, and the correlation between sharding and replication.
