Mapping Technology Trends to Enterprise Product Innovation

Scope: Focusses on enterprise platform software: Big Data, Cloud platforms, software-defined, micro-services, DevOps.
Why: We are living in an era of continuous change and a low barrier to entry. Net result: a lot of noise!
What: Sharing my expertise gained over nearly two decades in the skill of extracting the signal from the noise! More precisely, identifying shifts in ground realities before they become cited trends and pain-points.
How: NOT based on reading tea leaves! Instead, synthesizing technical and business understanding of the domain at 500 ft., 5,000 ft., and 50K ft.

(Disclaimer: Personal views not representing my employer)

Wednesday, September 18, 2013

Locating data within the Scale-out Storage Cluster

In previous posts, we discussed namespace sharding across the nodes of the scale-out cluster -- in this post, we discuss design patterns that clients use to locate data for read/write operations.


  • Directory-based look-up: 
    • This is the most common pattern -- the client queries a centralized node (i.e., Master) for the location of data. This information is typically cached by the clients. This approach is applicable to Master or Multi-Master systems that track the global cluster state w.r.t. namespace -> resource allocation. GFS/HDFS, GPFS, Lustre, Isilon OneFS, BigTable, etc. are all examples of this model.
  • Key-based Routing: 
    • This is a common pattern in Masterless systems. The hash of the key is used to route the request to the appropriate cluster node. 
    • The look-up can be done by any node within the cluster or can be computed directly by the client (e.g., Redis, Microsoft's Flat Data-center Storage). In the worst case, the request is redirected O(n) times, where n is the number of nodes within the cluster.
    • A variant of key-based routing is Content-based addressing -- in this approach, the logical address of the data within the cluster is derived by hashing the entire data contents. Two data blocks with the same data will hash to the same logical address. Thus, this scheme naturally provides deduplication or single-instance storage semantics. 
  • Request Broadcasting:
    • This is not very common (though I have seen a commercial product implement it). The client request is broadcast to all the nodes within the cluster -- the appropriate node and its replicas respond. Typically, this does not scale, and is applicable only to small clusters.
  • Proxy-based Routing:
    • The client requests are sent to a load-balancer that distributes the requests to the appropriate nodes. The proxy can be DNS-based (Microsoft Azure Storage) or general purpose such as HAProxy. This approach is especially applicable when the namespace is sharded across multiple data-centers or requires multi-tenant authentication.
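
To make the first two patterns concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any of the products named above actually implement look-up: the node names, the dictionary standing in for the Master, and the function and class names are all hypothetical, and SHA-256 is just a convenient hash choice.

```python
import hashlib

# Hypothetical cluster membership; real systems discover this dynamically.
NODES = ["node-a", "node-b", "node-c", "node-d"]


def node_for_key(key: str, nodes=NODES) -> str:
    """Key-based routing: hash the key and map the digest onto a node.

    Any client or node holding the same node list computes the same
    answer, so no central look-up is needed.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def content_address(data: bytes) -> str:
    """Content-based addressing: the logical address is the hash of the
    data itself, so identical blocks get identical addresses."""
    return hashlib.sha256(data).hexdigest()


class DirectoryClient:
    """Directory-based look-up with a client-side cache.

    `master_table` is a plain dict standing in for the Master's
    namespace -> location map (an assumption for this sketch).
    """

    def __init__(self, master_table):
        self._master = master_table
        self._cache = {}
        self.master_queries = 0  # counts round-trips to the "Master"

    def locate(self, path: str) -> str:
        if path not in self._cache:
            self.master_queries += 1
            self._cache[path] = self._master[path]
        return self._cache[path]


if __name__ == "__main__":
    print(node_for_key("/videos/cat.mp4"))
    print(content_address(b"same block") == content_address(b"same block"))
    client = DirectoryClient({"/logs/a": "node-b"})
    client.locate("/logs/a")
    client.locate("/logs/a")  # served from the cache
    print(client.master_queries)
```

One caveat on the design: the simple modulo mapping in `node_for_key` reassigns most keys whenever the node list changes, which is why Masterless systems generally use consistent hashing or a similar scheme instead. The sketch also ignores cache invalidation, which a real directory client needs once data moves between nodes.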

