Untangling the data center from complexity and human oversight

Our investment thesis at Khosla Ventures is that simplicity through abstraction and automation through autonomic behavior will rule in the enterprise’s “New Stack,” a concept that embraces several industry changes:

  • The move to distributed, open-source-centric, web-era stacks and architectures for new applications. New Stack examples include Apache web stacks, NoSQL engines, and Hadoop/Spark, deployed on open source infrastructure such as Docker and Linux/KVM/OpenStack.

  • The emergence of DevOps (a role that didn’t even exist 10 years ago) and of general “developer velocity” as a priority: giving developers better control of infrastructure and the ability to rapidly build, deploy, and manage services.

  • Cloud-style hardware infrastructure that provides the cost and flexibility advantages of commodity compute pools in both private data centers and public cloud services, giving enterprises the same benefits that Google and Facebook have gained through in-house efforts.

The most profound New Stack efficiency gains will come from radically streamlining developer and operator interactions with the entire application/infrastructure stack, and from embracing new abstractions and automation concepts that hide complexity. The point isn’t to remove humans from IT; it’s to remove them from overseeing areas that are beyond human reasoning, and to simplify their interactions with complex systems.

The operation of today’s enterprise data centers is inefficient and unnecessarily complex because we have standardized on manual oversight. For example, in spite of vendors’ promises of automation, most applications and services today are manually placed on specific machines, as human operators reason across the entire infrastructure and address dynamic constraints like failure events, upgrades, traffic surges, resource contention and service levels.
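
To make concrete the kind of reasoning this pushes onto operators, here is a minimal sketch (in Python, with hypothetical service names, resource figures, and a simple greedy strategy; not any particular vendor’s scheduler) of the placement decision that automated schedulers take over from humans:

    # Minimal sketch: greedy first-fit placement of services onto machines by
    # resource fit. Names and numbers are hypothetical; a real scheduler would
    # also react to failures, upgrades, surges, and service-level constraints.
    from dataclasses import dataclass, field

    @dataclass
    class Machine:
        name: str
        cpu_free: float                      # cores still available
        mem_free: float                      # GB still available
        services: list = field(default_factory=list)

    @dataclass
    class Service:
        name: str
        cpu: float
        mem: float

    def place(services, machines):
        """Place each service on the machine with the most free CPU that fits it."""
        for svc in services:
            candidates = [m for m in machines
                          if m.cpu_free >= svc.cpu and m.mem_free >= svc.mem]
            if not candidates:
                raise RuntimeError(f"no machine can host {svc.name}")
            target = max(candidates, key=lambda m: m.cpu_free)
            target.cpu_free -= svc.cpu
            target.mem_free -= svc.mem
            target.services.append(svc.name)
        return machines

    if __name__ == "__main__":
        cluster = [Machine("node-1", cpu_free=16, mem_free=64),
                   Machine("node-2", cpu_free=8, mem_free=32)]
        workload = [Service("web", cpu=4, mem=8),
                    Service("cache", cpu=2, mem=16),
                    Service("analytics", cpu=10, mem=40)]
        for m in place(workload, cluster):
            print(m.name, m.services)

Even this toy version has to re-evaluate every machine for every service; doing the same by hand, continuously, across thousands of nodes and in the face of failures and traffic surges, is exactly the work that lies beyond human reasoning.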

The best practice in data center optimization for the last 10 years has been to take physical machines and carve them into virtual machines. This made sense when servers were big and applications were small and static: virtual machines let us squeeze a lot of applications onto larger machines. But today’s applications have outgrown servers and now run across multitudes of nodes, on premises or in the cloud. That means more machines and more partitions for humans to reason about as they manage their growing pool of services. The automation that enterprises try to script over this environment amounts to linear acceleration of existing manual processes, and it adds fragility on top of abstractions that are misfits for these new applications and the underlying cloud hardware model. Similarly, typical “cloud orchestration” vendor products increase complexity rather than simplifying management, layering on more management components that themselves need to be managed.