Delivery Fog

“Cloud” != using hypervisors

I spend a lot of time speaking with customers, vendors, analysts, and “cloud providers” about IT Cloud services. It never ceases to amaze me how many think Cloud is nothing more than a guest operating system running under a VMware or Xen hypervisor. This incredibly narrow view entirely misses the point of Cloud and its real potential.

Cloud == On-demand consumable IT service

At its core, “Cloud” is really nothing more than an IT service with one or more published APIs and a known set of capabilities, consumable on-demand. It can be as simple as an infrastructure service such as a virtual machine, storage, or networking. It can also be far more complex, such as a CRM, billing, or a full-blown set of integrated telecommunications services. Its power is that it is all there immediately, it is delivered within a set of agreed service levels, and consumers do not need to worry about any of the details involved in maintaining the service.

Having a bank of servers hosting guest operating systems under a hypervisor that you manage is equivalent to buying a single bus and calling yourself a “transportation service”. Sure, you can provide passengers a primitive form of transportation, but its flexibility and “on-demand”-ness are limited by the number of buses you have and your ability to reach the desired destination in a reasonable amount of time. If there are more passengers than your bus can carry, if passengers want to go to different places than you or the other passengers do, or if the destinations are across oceans, the service is not particularly “on-demand”.

Building a cloud service

In order to build a successful cloud service, you need to consider several things.

First, you need to be absolutely clear about what capabilities you are going to provide, and at what sort of service level. Doing this in a “cloudy” way means being able to build every aspect of your service ecosystem in a known and entirely reproducible way. Variation creates unpredictability, which makes it difficult to maintain consistently adequate service levels.

You also need to know how your service scales, and how much friction is involved in growing and shrinking all aspects of service capacity. The more friction there is, the less “on-demand” you can be without overbuilding.

Your APIs need to be clear, stable, published, versioned, and (optimally) secure. They are the contract that your users rely upon to contact and consume your services. Any variation can prevent users from consuming your service, which limits its value. Poor security damages trust with your user community, and with it the likelihood that they will be willing to rely upon your service for anything important.
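One way to make that contract explicit is to put the version in the API path and in every response, so clients can detect changes rather than discover them by breaking. A minimal sketch, with an assumed semantic version number and a hypothetical handler:

```python
# Hypothetical sketch: the version lives in the request path and is
# echoed in every response; unsupported versions fail loudly instead of
# silently behaving differently.
API_VERSION = "1.2.0"  # assumed semantic version: major.minor.patch

def process_v1(path: str, body: dict) -> dict:
    # Placeholder for the real v1 behavior.
    return {"echo": body, "path": path}

def handle_request(path: str, body: dict) -> dict:
    """Route a request, rejecting versions we do not support."""
    if path.startswith("/v1/"):
        return {"api_version": API_VERSION, "result": process_v1(path, body)}
    return {"api_version": API_VERSION, "error": "unsupported API version"}

print(handle_request("/v1/widgets", {"name": "demo"}))
print(handle_request("/v3/widgets", {}))
```

The design choice worth noting is that the error path still reports the server's version, giving the client something actionable instead of a bare failure.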

Finally, you need to instrument your service ecosystem so that you understand its dynamics and its ability to provide the capabilities your customers rely upon. Instrumentation both provides evidence to back the assumptions you made when building your service and offers clues to potentially dangerous capability gaps that may need to be closed.
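In code, the simplest form of this is counting and timing every operation your service performs. A sketch, assuming an in-process metrics table and a made-up `create_volume` operation; a real service would export these numbers to a metrics system:

```python
import time
from collections import defaultdict

# Hypothetical in-process metrics store: per-operation call counts,
# error counts, and cumulative wall-clock time.
metrics = defaultdict(lambda: {"calls": 0, "errors": 0, "total_secs": 0.0})

def instrumented(name):
    """Decorator that records calls, errors, and duration for one operation."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                metrics[name]["errors"] += 1
                raise
            finally:
                metrics[name]["calls"] += 1
                metrics[name]["total_secs"] += time.perf_counter() - start
        return inner
    return wrap

@instrumented("create_volume")
def create_volume(size_gb):
    # Stand-in for real provisioning work.
    return {"size_gb": size_gb}

create_volume(10)
create_volume(20)
print(metrics["create_volume"]["calls"])  # 2
```

Even this crude version gives you evidence for capacity assumptions (calls per period, mean duration) and a leading indicator of capability gaps (the error count).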

Consuming a Cloud Service

When consuming a cloud service, it is always a good idea to minimize the blind spots that might exist. This might sound antithetical to using a cloud service, but it is important to ensuring success.

The way that I do this is by trying to ensure everything about the service that I rely upon clearly fits into one of two categories:

Known Knowns

Known knowns are the details you know and for which you have evidence of their full nature. These are the functional and performance characteristics that you can count on always being true.

When you are running the service ecosystem yourself, the set of known knowns can be quite rich. You know the load levels on the servers, can instrument and run traces to watch all the activity hitting your code and how it behaves, and can test out all the ways your service can fail and see how it will behave.

Of course, you have no such visibility with a cloud service. The best you can do is instrument your side to track the observed behavior of the API and ensure that it meets your needs within the thresholds you require. In an ideal world, the known knowns will match exactly the full contract of capabilities and service levels provided by the service.
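Consumer-side, that instrumentation can be as small as timing every call and comparing it against the threshold you require. A sketch, where the latency threshold and the `fake_cloud_api` stand-in are both assumptions:

```python
import time

# Assumed service-level requirement: calls must finish within this bound.
LATENCY_THRESHOLD_SECS = 0.5

def timed_call(fn, *args):
    """Call fn, returning its result, the observed latency, and whether
    the call stayed within the threshold we require."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= LATENCY_THRESHOLD_SECS

def fake_cloud_api(x):
    # Stand-in for a real cloud API client call.
    return x * 2

result, elapsed, within_threshold = timed_call(fake_cloud_api, 21)
```

Logged over time, `elapsed` becomes your evidence that the observed behavior matches the contract; a rising rate of out-of-threshold calls is the early warning that it no longer does.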

Known Unknowns

Known unknowns are the details in the ecosystem that you know you do not know, and that will impact the functioning and performance of the areas that matter to what you are trying to accomplish. They include abnormal behavior from the cloud service, security attacks by malware or bad actors, traffic spikes, lost packets, excessive latency, and the like. You need to make sure that the areas you depend on are resilient to any problems that might arise from these known unknowns.

Chaos engineering is an emerging discipline that helps developers and organizations alike think about, and design in, the resilience needed to defend against known unknowns.
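The core idea can be sketched in a few lines: deliberately inject failures into a dependency call so you can verify that the caller's recovery logic actually works. Everything here is hypothetical, including the failure rate and the `fetch_record` operation:

```python
import random

# Assumed probability of an injected failure on any given call.
FAILURE_RATE = 0.3

def flaky(fn):
    """Chaos-style wrapper: randomly fail a fraction of calls."""
    def inner(*args):
        if random.random() < FAILURE_RATE:
            raise ConnectionError("injected failure")
        return fn(*args)
    return inner

def call_with_retries(fn, *args, attempts=5):
    """The resilience under test: retry on connection failures."""
    for _ in range(attempts):
        try:
            return fn(*args)
        except ConnectionError:
            continue  # a real client would back off between attempts
    raise RuntimeError("service unavailable after retries")

@flaky
def fetch_record(record_id):
    return {"id": record_id}

print(call_with_retries(fetch_record, 7))
```

Running the retry logic against the deliberately flaky dependency, rather than against a healthy one, is what turns a known unknown ("the service sometimes fails") into tested, evidenced behavior.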

Unknowns

Anything consequential that does not fall into those two categories is an unknown. Unknowns are dangerous: they can cause problems you never knew were possible and have not prepared for. Unfortunately, most of the IT community is exceedingly optimistic about everything to do with delivering a service, from how long it will take to build to how it will behave once deployed. As cloud services grow in popularity, this over-optimism has increased both the likelihood of encountering an unknown and the potential severity of the damage it might wreak.