There is a growing appetite among enterprises to gain agility and cloud-like self-service capabilities for on-premises IT operations. This appetite is fueled especially by Big Data analytics, which is redefining the role of IT as a critical differentiator for business decision-making and, in some cases, now contributes directly to the bottom line! This new IT stack is broadly described as “Software-defined”, implying the ability to programmatically provision, manage, and reconfigure the behavior of the infrastructure. Software-defined is not just a technology disruption, but holistically a combination of people, process, and technology disruption. So, how does one innovate in such a dynamic, global environment to translate a Technology into a profitable Product in a hyper-growth Market segment? The most critical element is an A+ team, which is also one of the most difficult to get right!
So, what does it mean to have an A+ team? I have often heard statements like “let’s hire rock-star distributed systems and scale-out engineers.” Hmm... there are many different dimensions to software-defined architectures -- such statements are hardly actionable, and can get recruiters searching for unicorns (their odds of winning a lottery might seem higher than their odds of recruiting such developers). I have been a great believer in “engineering” the right team with the appropriate skill-set balance -- one of the best articles on this topic is from Vinod Khosla, where he coined the term “gene pool engineering” (a must-read if you haven't already!). In my nearly two decades of experience translating technology disruptions into next-generation enterprise products, I have applied and learnt tremendously from the concepts of gene pool engineering. The objective of this post is to map the gene pool engineering concepts into actionable insights for teams pursuing Software-defined innovations -- to stay grounded, I will focus on Software-defined Storage and Big Data Solutions. As the market continues to get more competitive, the effects of the skill-set shortage are already visible -- hopefully this article gets you unstuck from looking for unicorns, and gets you putting the right skill-set in place.
Let’s level-set on the principles of Software-defined Storage (SDS) for the purpose of this discussion. In contrast to traditional hardware-defined storage, which was designed mainly as a one-size-fits-all solution for POSIX-based enterprise applications, Software-defined Storage is designed to support a wide spectrum of application semantics (POSIX as well as emerging non-POSIX applications), providing a rich set of APIs for QoS trade-offs and cluster scaling that can all be controlled programmatically from the comfort of a policy-based interface. The foundation of SDS is scale-out (instead of scale-up) architectures that need to address distributed concepts related to asynchrony, distributed state, and failover scenarios that are the norm rather than the exception. Broadly, the SDS modules can be categorized into the IO Plane and the Management Plane. In this post, I will focus mainly on building the IO Plane, and cover the Management Plane in a future post.
For the gene pool discussion, an SDS architecture can be broadly represented (at 50,000 ft) as a layered architecture:
Cluster Management Logic: This layer provides the core services that define the basic personality of the overall solution. The services implemented within this layer, such as state management, transport, and consensus management, are analogous to the foundation of a building.
Namespace Management Logic: This layer represents how the physical resources get exposed to the applications as a logical namespace. For instance, the SDS could expose a logical volume, a filesystem, or a Key-Value store. This layer defines the properties of the namespace, such as whether reads and writes are mutually exclusive/isolated.
IO Coordination Logic: This is analogous to the runtime manager where operations across different modules need to be coordinated to service IO operations or housekeeping tasks such as replica repair, garbage collection, etc.
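To make the layering concrete, here is a minimal Python sketch of how the three layers might relate. Everything in it is an assumption for illustration only: the class names (ClusterManager, KeyValueNamespace, IOCoordinator), the toy placement rule, and the in-memory stores are hypothetical and not taken from any real SDS product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClusterManager:
    """Cluster Management Logic: core services such as cluster state (hypothetical)."""
    nodes: List[str] = field(default_factory=list)

    def replicas_for(self, key: str, count: int) -> List[str]:
        # Toy deterministic placement over the node list. A real system would
        # derive this from consensus-backed cluster state. Assumes count <= len(nodes).
        start = sum(key.encode()) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(count)]


@dataclass
class KeyValueNamespace:
    """Namespace Management Logic: exposes resources as a logical key-value namespace."""
    name: str
    replica_count: int = 2


class IOCoordinator:
    """IO Coordination Logic: coordinates the other layers to service IO requests."""

    def __init__(self, cluster: ClusterManager, namespace: KeyValueNamespace) -> None:
        self.cluster = cluster
        self.namespace = namespace
        # One in-memory dict per node stands in for physical storage.
        self.stores: Dict[str, Dict[str, bytes]] = {n: {} for n in cluster.nodes}

    def put(self, key: str, value: bytes) -> List[str]:
        # Write the value to every replica chosen by the cluster layer.
        targets = self.cluster.replicas_for(key, self.namespace.replica_count)
        for node in targets:
            self.stores[node][key] = value
        return targets

    def get(self, key: str) -> bytes:
        # Read from the first replica that holds the key.
        for node in self.cluster.replicas_for(key, self.namespace.replica_count):
            if key in self.stores[node]:
                return self.stores[node][key]
        raise KeyError(key)
```

The point of the sketch is the dependency direction: the IO coordinator leans on both lower layers, which is why mistakes in the cluster management layer are the hardest to fix later.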
For inquisitive minds, also check out the 5000 ft view of the architecture where I have spelled out the key modules within each layer.
The essence of gene pool engineering is to transform team-building and hiring from an art into a science. The key idea is to clearly articulate risks, and map them to appropriate skill-set hiring by targeting appropriate centers of excellence. The layered representation brings out the point that a different mix of experience and domain expertise is required at different levels. In this post, we cover three key areas:
Prioritization of implementation risks: Pinpointing the most complex piece of your solution and focusing on it first. For instance, if you have a requirement to support linearizable global transactions across objects in multiple shards, and if this is linked to the core USP of the solution, it is critical to de-risk this aspect (instead of trying to optimize the on-disk layout format, which is important but a more bounded problem).
Diversity in team: A good balance is required between storage domain experts, distributed systems experts, and ninja generalist programmers -- a team made up entirely of distributed systems or storage rock-stars cannot go very far. Similarly, diversity in experience and backgrounds is key.
Culture for optimizing cycle times: Agility to iterate is key -- basically building a culture of getting the 2-star version out early and validating it with customers, instead of over-designing a 5-star solution with a longer gestation. In the domain of scale-out storage, this requires a lot of foresight and leadership to get the iteration right!
The rest of the post covers additional details for each of these aspects.
Prioritization of implementation risks:
Knowing what is possible in distributed systems: A good understanding of related work is absolutely critical! Knowing what is possible versus where a leap of faith is required helps in targeting (and going all out for) the right skill-set. You will be amazed by the investments that already exist in the form of related work -- this can significantly shorten your design exploration phase, and lets you leverage the design experience of academia and others working in a similar problem domain.
Keep it real: The biggest risk is treating all product requirements with equal importance. It is critical to identify where we would need to defy the laws of physics. For instance, strong consistency of metadata with ultra-low latency -- prioritize such tricky requirements and focus on mitigating their risks quickly. Another approach is to start with a standard vanilla scale-out design, and analyze where specialization of module design is required to meet the product requirements. Being able to scope out a requirement is equally critical, i.e., attempting to keep all stakeholders happy at the expense of complicating the design of the first release is not a winning strategy. Following are the top 5 modules that commonly represent the elephants in the room.
Data-driven selection of technology choices: Data-driven understanding is key, especially for the core components that are extremely difficult to rip-and-replace later. It is important to document these aspects (on the wiki) -- oftentimes these questions get revisited several times, especially as newer team members may attempt to reinvent the wheel.
Diversity in team
Diversity in domain knowledge: Each layer in the architecture represents a different mix of expertise w.r.t. distributed systems, storage domain, generalist ninja coders, API and manageability, UX/UI, tools and automation. For instance, the Cluster Management layer represents core services, so it is critical to get the best distributed systems expertise there. Similarly, higher up the stack, the storage domain and enterprise use-cases become critical to understand.
Prior experience diversity: Instead of focusing solely on enterprise product developers, have a good balance of folks with prior experience running Web 2.0 services, as well as Cloud provider services. A background in owning a service helps tremendously in baking APIs and management metrics & profiling into the code. This also helps avoid mistakes, based on lessons learnt from a broader set of scale-out design experiences.
Years of experience diversity: It’s typically a good idea to mix “white hair” with “high energy levels.” Core services in the lower layers are difficult to rip-and-replace, and typically have more dependencies from services in the higher layers -- it’s critical to have the most experienced folks driving these modules.
Culture to reduce iteration/cycle times
Phased execution culture: It is difficult to find engineers who can help define the balance between module functionality and time to ship. I refer to this as a 5-star versus 2-star version of the product. Given the intense competition in the marketplace, it is critical to get a product iteration into alpha/beta, versus delaying it for a fully functional, polished offering. As a part of the phased model, be ruthless about scoping out costly features such as distributed transactions, serializability, distributed recovery, etc. from the MVP (unless absolutely critical for the USP).
Don’t go overboard with agile: The sprint-based model for execution can sometimes be at odds with distributed systems implementation. The core services get deeply rooted in the design, and are difficult to retrofit -- so rather than always chunking work into 1-2 week sprint windows, factor in time to eliminate technical debt (or, more specifically, design debt). This is critical to avoid significantly costly rework in the future.
In summary, there is no secret recipe for putting together an A+ team, but the approach of hunting for rock-star distributed systems or storage developers is almost always a losing formula.