So You Think You Can Compute In the Cloud
TL;DR
- The goal of cloud computing is to unlock the potential of all computer programmers in your organisation.
- As a technology leader you want to achieve that goal using the fewest cloud computing products possible.
- Your cloud infrastructure footprint should be a function of what you can maintain, not what you can build.
Prelude
The essential goal of playing golf is to play the least amount of golf. Who do you think will win? A professional golf player who is only allowed to bring five clubs, or an amateur player who can bring a bag full of clubs? Now imagine a world where a new type of golf club is introduced every week. And all new models offer a slightly different approach to an existing situation. In this world, being able to effectively play the game using a set of only five clubs instead of the allowed fourteen becomes a competitive advantage. Welcome to cloud computing.
Essential Cloud Computing
The essential goal of cloud computing is to allow every individual computer programmer in your organisation to independently write, build, and deploy a software service in a single day regardless of whether that service will serve 10, 10 thousand, or 10 million users.
Your mission as a technology leader is to achieve the essential goal of cloud computing using the smallest viable number of cloud computing products.
There are two kinds of people working in cloud computing: the kind who do not understand this and the kind who do not want to tell you this. That friendly AWS Solutions Architect? That's the second kind. That one speaker at a conference who did a great talk about how easy it is to set up AWS CodeBuild? The first kind.
Often, the move to cloud computing is driven by a business need to improve engineering agility in your organisation, because it is difficult to prototype ideas quickly when every change contends with interdepartmental dependencies on systems operations, network operations, or data center operations. More often, cloud computing is necessitated by the fact that your business does not have any of these departments in the first place. Because, well, you know, you are a cloud native startup. Congratulations! You are now lucky enough to manage networking, systems operations, data center placement, authentication and authorisation, and (virtual) "hardware" selection yourself. Without deep expertise in any of these areas. What could possibly go wrong?
What typically goes wrong is that the number of cloud computing products you use turns out to be a function of time and the number of engineers on your team, not a function of the problems you are solving. You fail to balance simplicity and capability. If you use too few cloud products, a single developer cannot build a meaningful service in one day. If you use too many, your team cannot realistically support the cloud infrastructure that you run. But adding yet another cloud product is always the easiest way to quickly solve the problem at hand. Unfortunately, that is how complexity gets out of hand as well.
As a result, teams of senior engineers still end up with over 25 cloud computing products in their infrastructure. And in practice it does matter whether there are 10, 10 thousand, or 10 million users.
Let us challenge ourselves by answering some simple questions…
About Networking
Q: How does an outside web request reach the backend service that responds to it?
To appreciate the breadth and depth of that question, consider the following example. You have a state-of-the-art React frontend that will perform on the order of 100 backend requests to populate the landing page for a user (in case you believe that is outrageous: an unauthenticated request to a Facebook page yields more than 140 requests). How many TCP connections is that? How many TLS handshakes? How many internal and external name resolutions? Which protocols are used at each hop? How many L4 routing hops within your (virtual) network? How many L7 routing hops within your (virtual) network? How many of those 100 L7 requests are multiplexed onto a single L4 connection and for what part of the transit? Are they de-multiplexed at your edge or deeper into your network? Is your load balancer DNS based or does it use an anycast IP?
If you are serious about doing business on the internet in 2020 then you care about all of this, because you care about your tail end-to-end latency. During normal operation, under contention, and in the presence of failures.
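To make the connection and handshake questions concrete, here is a rough back-of-envelope sketch. It assumes one round trip for a TCP handshake and one for a full TLS 1.3 handshake (TLS 1.2 needs two), and it ignores DNS, session resumption, and everything behind your edge. The names `HopCosts` and `setup_round_trips` are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HopCosts:
    """Round trips spent before the first request byte can be sent (assumed values)."""
    tcp_handshake_rtts: int = 1  # SYN / SYN-ACK
    tls_handshake_rtts: int = 1  # full TLS 1.3 handshake; use 2 for TLS 1.2


def setup_round_trips(requests: int, requests_per_connection: int,
                      costs: HopCosts = HopCosts()) -> dict:
    """Count connections and handshake round trips for a batch of requests."""
    # Ceiling division: a partially filled connection still has to be opened.
    connections = -(-requests // requests_per_connection)
    per_connection = costs.tcp_handshake_rtts + costs.tls_handshake_rtts
    return {
        "tcp_connections": connections,
        "tls_handshakes": connections,
        # Summed over all connections: total handshake work, not wall-clock
        # latency, because connections are usually opened in parallel.
        "setup_rtts_summed": connections * per_connection,
    }


# 100 landing-page requests, one connection each versus one shared connection.
one_per_request = setup_round_trips(100, requests_per_connection=1)
multiplexed = setup_round_trips(100, requests_per_connection=100)
print(one_per_request)
print(multiplexed)
```

The numbers are only as good as the assumptions, and they cover a single hop. Repeat the exercise for every hop between the browser and the backend and you start to see where your tail latency comes from.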
About Provisioning
Q: How do you guarantee idempotent partial (re-)provisioning?
Today is the day that your cloud computing vendor has decided to be a nuisance and kills the VM that runs your VPN server without bringing it back up. No more access for you. Your best option is to run your Terraform / CloudFormation / Deployment Manager code. How certain are you that this action will recreate the VPN server and only the VPN server? Is the VPN VPC being touched as a cascading action? Are there any resource identifiers that change as a result of the partial deployment? Humour yourself and schedule your infrastructure code to run once every ten minutes. Now kill a random VM instance once a day. Not just the ones that are in auto scaling groups. Really, kill anything. Still confident?
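The property you are testing for can be stated in a few lines. The sketch below is a toy model, not how Terraform, CloudFormation, or Deployment Manager work internally: `plan` and `apply` are hypothetical helpers that diff a desired state against an actual state held in plain dictionaries, and the resource names are invented. It leaves out provider-assigned identifiers and dependency graphs, which is exactly where cascading changes come from in real tools.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Return the actions needed to move `actual` to `desired`."""
    actions = {}
    for name, spec in desired.items():
        if name not in actual:
            actions[name] = "create"
        elif actual[name] != spec:
            actions[name] = "update"
    for name in actual:
        if name not in desired:
            actions[name] = "delete"
    return actions


def apply(desired: dict, actual: dict) -> dict:
    """Execute the plan and return the resulting state (the input is not mutated)."""
    new_state = dict(actual)
    for name, action in plan(desired, actual).items():
        if action == "delete":
            del new_state[name]
        else:
            new_state[name] = desired[name]
    return new_state


# Hypothetical resources, for illustration only.
desired = {
    "vpn-vpc": {"cidr": "10.0.0.0/24"},
    "vpn-server": {"vpc": "vpn-vpc", "size": "small"},
}

actual = apply(desired, {})
steady_state_plan = plan(desired, actual)  # idempotent: nothing left to do

del actual["vpn-server"]                   # the vendor kills the VPN server
recovery_plan = plan(desired, actual)      # should touch the server and nothing else
print(steady_state_plan, recovery_plan)
```

If your real tooling cannot give you the same two guarantees, an empty plan in steady state and a plan limited to the missing resource after a failure, you have found the work to do before adding anything else.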
About Zones and Regions
Q: How do you survive or degrade in response to availability zone or region failures?
Spoiler: "we tested killing all instances in a single region and the app still worked" is not the answer. What happens when instances stay alive, but the zone's network is partitioned off? What happens when the inter-region network still works but loses 20% of all packets?
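To get a feel for why 20% packet loss is so different from a clean outage, consider this small calculation. It assumes every packet is lost independently, which is a simplification because real loss tends to come in bursts, and the function name `p_any_loss` is made up for this sketch.

```python
def p_any_loss(packets: int, loss_rate: float) -> float:
    """Probability that at least one of `packets` is lost, assuming independent loss."""
    return 1.0 - (1.0 - loss_rate) ** packets


# How likely is an exchange of N packets to need at least one retransmission?
for n in (1, 3, 10):
    print(n, round(p_any_loss(n, 0.20), 4))
```

Under these assumptions almost every multi-packet exchange needs a retransmission, yet simple health checks may keep passing. The link is up, but your tail latency is gone.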
About Attack Vectors
Q: What is your path to full recovery from different types of compromise?
One of your employees goes rogue. How long until you have manually revoked their access to all accounts? I mean all of them: email, AWS, Slack, your SaaS CI/CD solution, GitHub, Atlassian Suite, VPN. All of it. Could they have left an authorised key for SSH access on one of the machines?
Someone has compromised your root AWS account. How do you get it back? What if you do not? The best way to defend against compromise is to not be compromised in the first place. The realistic way to defend against compromise is to be able to recover from complete loss of access. How do you achieve that?
You Are No Exception
A team of senior engineers takes at least six to eight months to formulate and implement satisfactory answers to the questions above. We have not yet covered secrets management, deployment and delivery, backup and restore, scaling and auto-scaling, authentication and authorisation, perimeter security (DDoS, etc.), and several other considerations.
Going Forward
Infrastructure is only as good as your ability to manage it. So here is a simple plan for your next cloud deployment: start with networking, compute (VMs, containers, or otherwise), and storage (any database, really). Then do not allow yourself to add any further services until you have formulated and implemented an answer to every question above for the services that you are already using.
This means that you cannot bring product features to production any faster than you can master the infrastructure they run on. This has always been the case. If you need to go faster, find someone who helps you master your existing infrastructure. Not someone who tells you to add more infrastructure to the mix.