Why You Should Not Neglect Your Developer’s Kubernetes Clusters
So you’ve finally succeeded in convincing your organization to use Kubernetes and you’ve even gotten your first services into production. Congratulations!
You know the uptime of your production workloads is of utmost importance, so you set up your production cluster(s) to be as reliable as possible. You add all kinds of monitoring and alerting, so that if something breaks, your SREs get notified and can fix it with the highest priority.
But this is expensive and you want to have staging and development clusters, too - maybe even some playgrounds. And as budgets are always tight, you start thinking…
DEV? Certainly can’t be as important as PROD, right? Wrong!
The main goal with all of these nice new buzzwordy technologies and processes was Developer Productivity. We want to empower developers and enable them to ship better software faster.
But if you put less importance on the reliability of your DEV clusters, you are basically saying “It’s ok to block my developers”, which indirectly translates to “It’s ok to pay good money for developers (internal and external) and let them sit around half a day without being able to work productively”.
Ah yes, the SAP DEV Cluster is also sooo important because of that many external and expensive consultants. Fix DEV first, than PROD which is earning all the money. — Andreas Lehr (@shakalandy), September 13, 2018
Furthermore, no developer likes to hear that they are less important than your customers.
We consider our dev cluster a production environment, just for a different set of users (internal vs external). — Greg Taylor (@gctaylor), September 18, 2018
What could go wrong?
Let’s look at some of the issues you could run into when putting less importance on DEV, and the impact they might have.
I did not come up with these - we’ve witnessed all of them happen over the last 2+ years.
Scenario 1: K8s API of the DEV cluster is down
Your nicely built CI/CD pipeline is now spitting out a mountain of errors. Almost all your developers are blocked, as they can’t deploy or test anything they are building.
This is actually much more impactful in DEV than in production clusters, as in PROD your most important assets are your workloads, and those should still be running when the Kubernetes API is down. That is, if you did not build any strong dependencies on the API. You might not be able to deploy a new version, but your workloads are fine.
Scenario 2: Cluster is full / Resource pressure
Some developers are now blocked from deploying their apps. And if they try (or the pipeline just pushes new versions), they might increase the resource pressure.
Pods start to get killed. Now your priority and QoS classes kick in - you did remember to set those, right? Or was that something that was not important in DEV? Hopefully, you have at least protected your Kubernetes components and critical addons. If not, you’ll see nodes going down, which again increases resource pressure. Thought DEV clusters could do with less buffer? Think again.
This sadly happens much more in DEV because of two things:
- Heavy CI running in DEV.
- Less emphasis on clean definition of resources, priorities, and QoS classes.
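As a minimal sketch of what those priority and QoS classes might look like (the names and numbers here are illustrative, not prescriptive): a PriorityClass, plus identical requests and limits on every container, gives a workload Guaranteed QoS and makes it one of the last candidates for eviction under pressure.

```yaml
# A PriorityClass for workloads that must survive resource pressure.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dev-critical
value: 1000000
globalDefault: false
description: "DEV workloads that should be evicted last."
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app
spec:
  priorityClassName: dev-critical
  containers:
    - name: app
      image: example/app:latest   # placeholder image
      resources:
        # requests == limits for cpu and memory => Guaranteed QoS
        requests:
          cpu: 500m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 256Mi
```

Without the matching requests and limits, the Pod lands in a lower QoS class (Burstable or BestEffort) and gets killed earlier when a node runs out of memory.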
Scenario 3: Critical addons failing
In most clusters, CNI and DNS are critical to your workloads. If you use an Ingress Controller to access them, then that also counts as critical. You’re really cutting edge and are already running a service mesh? Congratulations, you added another critical component (or rather a whole bunch of them - looking at you, Istio).
Now if any of the above starts having issues (and they do partly depend on each other), you’ll start seeing workloads breaking left and right, or, in the case of the Ingress Controller, no longer being reachable from outside the cluster. This might sound small on the impact scale, but looking at our past postmortems, I must say that the Ingress Controller (we run the community NGINX variant) has the biggest share of them.
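One small safeguard for such critical addons is a PodDisruptionBudget, so that node drains and other voluntary disruptions can never take down all Ingress Controller replicas at once. A sketch (the namespace and labels are assumptions - match them to your actual deployment):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ingress-nginx-pdb
  namespace: ingress-nginx   # adjust to where your controller runs
spec:
  # Keep at least one controller Pod running during voluntary disruptions.
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
```

This obviously assumes you run more than one replica in the first place; a PDB cannot conjure availability a single-replica deployment doesn’t have.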
A multitude of thinkable and unthinkable things can happen and lead to one of the scenarios above.
Most often we’ve seen issues arising because of misconfigured workloads. Maybe you’ve seen one of the below (the list is not exhaustive).
- CI is running wild and filling up your cluster with Pods without any limits set
- CI “DoSing” your API
- Faulty TLS certs messing up your Ingress Controller
- Java containers taking over whole nodes and killing them
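For the Java case, for instance, a hedged sketch of a fix: cap the container’s memory and tell the JVM about the cap, so the heap is sized relative to the limit instead of the node’s full memory (the flag below is supported on reasonably recent JDKs; the image name and numbers are placeholders).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: java-app
spec:
  containers:
    - name: app
      image: example/java-app:latest   # placeholder image
      env:
        # Size the JVM heap as a fraction of the container memory limit.
        - name: JAVA_TOOL_OPTIONS
          value: "-XX:MaxRAMPercentage=75.0"
      resources:
        requests:
          memory: 512Mi
        limits:
          # Without this limit, the JVM sees the node's full memory
          # and can take the whole node down with it.
          memory: 1Gi
```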
Sharing DEV between a lot of teams? Gave each team cluster-admin rights? You’re in for some fun. We’ve seen pretty much everything, from “small” edits to the Ingress Controller template file, to someone accidentally deleting the whole cluster.
If it wasn’t clear from the above: DEV clusters are important!
Just consider this: if you use a cluster to work productively, then it should be considered similarly important in terms of reliability as PROD.
DEV clusters usually need to be reliable at all times. Having them reliable only at business hours is tricky. First, you might have distributed teams and externals working at odd hours. Second, an issue that happens at off-hours might just get bigger and then take longer to fix once business hours start. The latter is one of the reasons why we always do 24/7 support, even if we could offer only business hours support for a cheaper price.
Some things you should consider (not only for DEV):
- Be aware of issues with resource pressure when sizing your clusters. Include buffers.
- Separate teams with namespaces (with access controls!) or even different clusters to decrease the blast radius of misuse.
- Configure your workloads with the right requests and limits (especially for CI jobs!).
- Harden your Kubernetes and Addon components against resource pressure.
- Restrict access to critical components and do not give out cluster-admin rights lightly.
- Have your SREs on standby. That means people will get paged for DEV issues, too.
- If possible, enable your developers to easily rebuild DEV or spin up clusters for development by themselves.
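To make two of these points concrete, here is a sketch of fencing in a CI namespace with a ResourceQuota and default limits (the namespace name and all numbers are assumptions - size them for your own cluster):

```yaml
# Hard ceiling on what the CI namespace can consume in total.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ci-quota
  namespace: ci
spec:
  hard:
    pods: "50"
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
---
# Defaults applied to CI containers that forget to set their own values.
apiVersion: v1
kind: LimitRange
metadata:
  name: ci-defaults
  namespace: ci
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
```

With this in place, a CI run gone wild exhausts its quota and starts failing its own Pods instead of starving everyone else’s.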
Why don’t devs have the capability to rebuild dev 🤷‍♂️ — Chris Love (@chrislovecnm), September 14, 2018
If you really need to save money, you can experiment with downscaling in off-hours. If you’re really good at spinning up or rebuilding DEV, i.e. have it all automated from cluster creation to app deployments, then you could experiment with “throw-away clusters”, i.e. clusters that get thrown away at the end of the day, with a new one spun up shortly before business hours.
Whatever you decide: please, please, please do not block your developers. They will be much happier, and you will get better software, believe me.
P.S. Thanks to everyone responding and giving feedback on Twitter!
Image created using https://xkcd-excuse.com/ by Mislav Cimperšak. Original image created by Randall Munroe from XKCD. Released under Creative Commons Attribution-NonCommercial 2.5 License.