Mapping Technology Trends to Enterprise Product Innovation

Scope: Focuses on enterprise platform software: Big Data, Cloud platforms, software-defined, micro-services, DevOps.
Why: We are living in an era of continuous change and low barriers to entry. Net result: a lot of noise!
What: Sharing the expertise I have gained over nearly two decades in extracting the signal from the noise! More precisely, identifying shifts in ground realities before they become widely cited trends and pain points.
How: NOT based on reading tea leaves! Instead, synthesizing technical and business understanding of the domain at 500 ft., 5,000 ft., and 50K ft.

(Disclaimer: Personal views not representing my employer)

Monday, August 26, 2013

Master-based Taxonomy continued...


This is a continuation of the previous blog post, which introduced the concept of a Master-based taxonomy. In this post, we cover the design patterns used to implement the typical workflows:


  • Read workflow:
    • The client queries the Master for a particular object/file name; the Master looks up the node storing the data and responds to the client.
    • The client then connects to that node. The client caches this mapping, so future accesses go directly to the node.
    • If the node becomes unavailable or is no longer responsible for the particular file/object, the client gets an invalid response; it flushes its cache entry and re-queries the Master.
  • Write workflow:
    • A write can be an in-place update or a new write requiring space allocation.
    • An in-place update is similar to the read workflow w.r.t. client-server interaction: the client queries the Master with the object/file name and the offset.
  • Space Allocation workflow/New writes:
    • The Master decides where new objects are placed and where additional space for an existing object is allocated. It also decides where the replicas will be placed. The node selection can be based on several heuristics, such as current load, available space, and ongoing disk and node repairs.
    • An object is striped across multiple nodes, typically in a round-robin fashion. In GFS, the stripe (chunk) size is 64 MB.
    • After the allocation, the client caches the identities of the assigned primary node and its replicas; all future writes are sent directly to the primary node without querying the Master.
  • Delete workflow:
    • Delete requests are typically handled by the Master. The metadata is updated with a delete flag, and the nodes responsible for the file/object are notified.
    • A lazy garbage-collection process later purges the deleted objects and reclaims the space.
  • Data Replication:
    • The node with the primary copy is responsible for coordinating with the replicas, via 2PC or Paxos, to ensure that either all the replicas commit the update or none of them do.
    • The actual data transfer is orchestrated either by the client or by the primary node. Another common approach is pipelined (chained) replication, where each replica forwards the data to the next replica in the chain; this allows better utilization of the network bandwidth.
  • Concurrency control:
    • Clients request read/write locks from the Master. The locks are tracked by leases, i.e., a client needs to continuously renew its lease, else it loses the lock.
  • Master Bootstrapping:
    • The metadata of the cluster is either maintained solely on the Master (as in the case of an NFS server) or split among the nodes (as in the case of GFS). In the latter case, the Master only tracks the mapping of files/objects to nodes; each node in turn maintains the physical location within its local file-system.
    • Typically, the Master metadata is constructed by scanning the nodes and listing their objects. This keeps the consolidated Master state consistent with the actual placement of the physical objects in the cluster.
    • The transient state of the Master (in-flight operations, locks, repair operations) is typically maintained in a Write-Ahead Log (WAL) that can be replayed.
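The read workflow above can be sketched in a few lines of Python. The class and method names (Master, Node, Client, lookup, read) are hypothetical stand-ins for the RPC interfaces a real system would expose; the point is the client-side caching and invalidation behavior:

```python
class Master:
    """Maps object/file names to the node currently responsible for them."""

    def __init__(self, placement):
        self.placement = placement  # name -> Node

    def lookup(self, name):
        return self.placement[name]


class Node:
    """Stores object data; rejects requests for objects it no longer owns."""

    def __init__(self, objects):
        self.objects = objects  # name -> bytes

    def read(self, name):
        if name not in self.objects:
            return None  # "invalid response": ownership has moved
        return self.objects[name]


class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # name -> Node, populated lazily

    def read(self, name):
        # First access: query the Master, then cache the node mapping.
        if name not in self.cache:
            self.cache[name] = self.master.lookup(name)
        data = self.cache[name].read(name)
        if data is None:
            # Stale cache entry: flush it and re-query the Master.
            del self.cache[name]
            self.cache[name] = self.master.lookup(name)
            data = self.cache[name].read(name)
        return data
```

Note that the Master is only on the critical path for the first access and after an invalidation; steady-state reads go straight from client to node.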

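Pipelined (chained) replication can also be sketched briefly. Here the client sends data only to the primary, and each replica forwards it to the next one in the chain before acknowledging, so no single sender's outbound bandwidth becomes the bottleneck. The Replica class below is a hypothetical in-process stand-in for nodes talking over the network:

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}   # key -> bytes
        self.next = None  # next replica in the chain, if any

    def write(self, key, data):
        self.store[key] = data
        # Forward down the chain before acknowledging, so an ack from the
        # primary implies every replica in the chain holds the data.
        if self.next is not None:
            self.next.write(key, data)
        return "ack"


# Build a chain: primary -> r2 -> r3
primary, r2, r3 = Replica("primary"), Replica("r2"), Replica("r3")
primary.next, r2.next = r2, r3
```

A client only ever calls `primary.write(...)`; the chain topology is invisible to it.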

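Finally, a minimal sketch of lease-based locking: the Master grants a lock with an expiry time, and the holder must renew it before expiry or lose it. The LockTable API and the two-second lease duration below are assumptions for illustration, not any particular system's interface:

```python
import time

LEASE_SECONDS = 2.0


class LockTable:
    """Tracks per-object locks as leases held on the Master."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.leases = {}  # name -> (client_id, expiry_time)

    def acquire(self, name, client_id):
        holder = self.leases.get(name)
        now = self.clock()
        # Grant if the lock is unheld, its lease has expired, or this is a
        # renewal by the current holder.
        if holder is None or holder[1] <= now or holder[0] == client_id:
            self.leases[name] = (client_id, now + LEASE_SECONDS)
            return True
        return False

    renew = acquire  # renewal is just re-acquisition by the current holder
```

Injecting the clock makes the expiry behavior easy to exercise deterministically, and a crashed client simply stops renewing, so its locks free themselves after one lease period.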