Early catch: Before ground realities become trends!: Part I: Nuts-and-bolts of a Scale-out Distributed Storage System

In this post, we cover Core Services.

These represent the bare-bones services for a distributed management framework.

· Cluster Management: Responsible for tracking the state and membership of the nodes within the cluster. It involves the following aspects:

Tracking cluster membership: This could be centralized or distributed, and is responsible for tracking the state (up, failed, partitioned) and membership of each node within the cluster.
Role Selection process: Depending on the cluster taxonomy (symmetric versus asymmetric), the service needs to decide how the data and metadata coordination tasks are distributed among the nodes. The service could trigger a re-assignment of namespace or election of a new master/coordinator using a consensus protocol among the cluster components.
Communication middleware between cluster components: This is the middleware by which components communicate within the cluster. Ideally, the middleware should provide in-order, guaranteed delivery of messages.
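As a rough illustration of centralized membership tracking, here is a minimal Python sketch. It assumes nodes send periodic heartbeats and that a node silent for longer than a timeout is treated as failed. The class name `MembershipTracker`, the `timeout_s` parameter, and the lowest-id coordinator rule are hypothetical choices for this example. Picking the lowest live id is only a stand-in for role selection, not a consensus protocol, and a timeout alone cannot tell a failed node from a partitioned one.

```python
import time
from enum import Enum


class NodeState(Enum):
    UP = "up"
    FAILED = "failed"


class MembershipTracker:
    """Centralized tracker: nodes report heartbeats, silence means failure."""

    def __init__(self, timeout_s=5.0, clock=time.monotonic):
        self.timeout_s = timeout_s  # hypothetical failure-detection window
        self.clock = clock          # injectable so the sketch can be tested
        self.last_seen = {}         # node id -> time of last heartbeat

    def heartbeat(self, node_id):
        # Record (or refresh) the time we last heard from this node.
        self.last_seen[node_id] = self.clock()

    def state(self, node_id):
        age = self.clock() - self.last_seen[node_id]
        return NodeState.UP if age <= self.timeout_s else NodeState.FAILED

    def live_nodes(self):
        return sorted(n for n in self.last_seen if self.state(n) is NodeState.UP)

    def coordinator(self):
        # Deterministic role selection: the lowest live node id wins.
        live = self.live_nodes()
        return live[0] if live else None
```

In a distributed design, each node would run a similar detector and the nodes would have to agree on the result, which is where a consensus protocol comes in.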

· Namespace Metadata Management: Responsible for tracking the namespace exported to applications, and the persistence and consistency of the metadata.

• Sharding: Dividing the namespace serving responsibility among the nodes.

• Deciding the Namespace-to-Resource Mapping

• Object lookup strategies: How the client accesses the object

• Handling heterogeneous HW: Accounting for different hardware capabilities

• Metadata persistence and durability: Division of responsibility among cluster components in maintaining metadata: a) Metadata structures for tracking (superblock, inode, dentry, fd); b) Free-space tracking and allocation (allocating space in parallel, book-keeping of free space); c) Operation log structure. This also includes metadata durability, i.e., how metadata is replicated within the system.

• Metadata serialization: Coordinating concurrent metadata updates so that they are applied in a well-defined order.

• Metadata Crash Recovery: Ensuring a consistent metadata state after a crash: a) Transactional updates across metadata components; b) Rollback for failed updates.
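To make sharding and heterogeneous hardware a little more concrete, here is a minimal Python sketch of one possible approach: consistent hashing with weighted virtual nodes. This is an assumed technique chosen for illustration, not the only way to map a namespace to nodes. The names `HashRing`, `vnodes_per_weight`, and the integer `weight` (a stand-in for hardware capability) are hypothetical.

```python
import bisect
import hashlib


def _hash(key):
    # Stable 64-bit hash; Python's built-in hash() is randomized per process.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Maps namespace paths to nodes; heavier nodes own more of the ring."""

    def __init__(self, vnodes_per_weight=64):
        self.vnodes_per_weight = vnodes_per_weight
        self._points = []  # sorted list of (hash, node) ring positions

    def add_node(self, node, weight=1):
        # A node with weight w gets w times as many ring positions.
        for i in range(weight * self.vnodes_per_weight):
            bisect.insort(self._points, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._points = [p for p in self._points if p[1] != node]

    def lookup(self, path):
        if not self._points:
            raise LookupError("ring is empty")
        # First ring position at or after the path's hash, wrapping around.
        i = bisect.bisect_left(self._points, (_hash(path),))
        return self._points[i % len(self._points)][1]
```

A client holding a copy of the ring can resolve any path to its owning node locally, and removing a node only remaps the paths that node owned.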

· Data Persistence: Deals with making sure that committed data is reliably persisted, with reasonable parallelism and latency (given that disks have traditionally been the slowest component in the IO path).

o Write Buffering: Deferring updates to persistent media, essentially equivalent to group commit semantics in databases. In the era of hard-disk latencies, buffering was critical for application performance. Buffering must respect the coherence semantics for read-write operations: a read issued after a write must reflect that update, even if the update has not yet reached persistent media.
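A minimal Python sketch of the buffering idea, under simplifying assumptions: a plain dict stands in for persistent media, and a flush is triggered purely by the number of pending blocks. The names `WriteBuffer` and `max_pending` are hypothetical.

```python
class WriteBuffer:
    """Buffers block writes in memory and flushes them as one batch."""

    def __init__(self, backing, max_pending=4):
        self.backing = backing          # dict standing in for persistent media
        self.max_pending = max_pending  # hypothetical flush threshold
        self.pending = {}               # block id -> latest unflushed data
        self.flushes = 0

    def write(self, block, data):
        # A later write to the same block replaces the earlier buffered one.
        self.pending[block] = data
        if len(self.pending) >= self.max_pending:
            self.flush()

    def read(self, block):
        # Coherence: serve buffered data first, then fall back to media.
        if block in self.pending:
            return self.pending[block]
        return self.backing.get(block)

    def flush(self):
        if self.pending:
            self.backing.update(self.pending)  # one batched "I/O"
            self.pending.clear()
            self.flushes += 1
```

Note that a write acknowledged before the flush would be lost on a crash, so a real system has to decide whether to acknowledge on buffering or on flush.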

o Caching: Maintaining commonly accessed data in memory to reduce traffic to the persistent storage. Caching typically refers to reads, while buffering refers to writes. The aspects involved in the caching design are:

§ What: Data versus Metadata layout

§ Where: Clients, metadata nodes, Data nodes

§ When: Prefetching, on reads, on writes, -access,..

§ How:

· Algorithms to cache the most active data;

· In-memory layout

§ Invalidation protocol:

· Handling Data updates

· Handling Metadata changes (ACL, resource map,..)

§ Multi data-center caching coordination

§ Bolt-on caching models: Memcached, IMDG (in-memory data grids),..
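The "How" and "Invalidation protocol" aspects above can be sketched with a small read cache. This Python example assumes LRU as the replacement algorithm (one of many options) and explicit per-key invalidation. The names `LRUCache`, `loader`, and `invalidate` are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Read cache that evicts the least recently used entry when full."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # callable that reads from persistent storage
        self.entries = OrderedDict()  # key -> value, least recently used first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = self.loader(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the LRU entry
        return value

    def invalidate(self, key):
        # Called when the underlying data or metadata changes.
        self.entries.pop(key, None)
```

Without the invalidation call, a reader keeps seeing the old value after the backing store changes, which is exactly the problem an invalidation protocol has to solve across clients and nodes.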

o Persistent Storage Writer: This service is responsible for writing data to disk. There are multiple aspects involved:

§ Layout on disk: This involves the following:

· On-disk format

· Striping data across disks and nodes

· Tiering of data across memory/flash/disks

· Space allocation across storage devices via LVM or a local FS

§ Data mutability model: Some of the aspects involved are: a) how new writes are persisted; b) tracking updates to existing data; c) tracking deletes in immutable storage

o Garbage collection: For out-of-place updates, reclaiming the space for the original blocks
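To show how out-of-place updates create garbage and how it can be reclaimed, here is a minimal Python sketch. It assumes an append-only log with an in-memory index, tombstone records for deletes, and a copy-based compaction pass as the garbage collector. The names `AppendOnlyStore`, `TOMBSTONE`, and `compact` are hypothetical.

```python
class AppendOnlyStore:
    """Out-of-place updates over an append-only log, with compaction as GC."""

    TOMBSTONE = object()  # marker record for a delete

    def __init__(self):
        self.log = []    # list of (key, value) records, never modified in place
        self.index = {}  # key -> position of the live record in the log

    def put(self, key, value):
        # New writes and updates are both appended; the old record becomes garbage.
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def delete(self, key):
        if key in self.index:
            self.log.append((key, self.TOMBSTONE))
            del self.index[key]

    def get(self, key):
        pos = self.index.get(key)
        return None if pos is None else self.log[pos][1]

    def garbage_records(self):
        # Everything in the log that is not a live record.
        return len(self.log) - len(self.index)

    def compact(self):
        # Copy live records, in log order, to a new log; drop the rest.
        new_log, new_index = [], {}
        for key, pos in sorted(self.index.items(), key=lambda kv: kv[1]):
            new_index[key] = len(new_log)
            new_log.append(self.log[pos])
        self.log, self.index = new_log, new_index
```

A real system would typically trigger compaction when the garbage ratio crosses a threshold and would have to keep serving reads while it runs.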


Mapping Technology Trends to Enterprise Product Innovation

Wednesday, November 12, 2014
