A Take On Distributed Systems and a bit of Kubernetes

November 13, 2020

Or, “Monolith First, Complicated Thing Later, Also Plan Before Switching, And How Those Things Have Been Known For Ages, HELP”

I want to start this with Gall's law, from General Systemantics:

A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with a working simple system.

When I see large-scale distributed systems being written with Kubernetes or whatever, I have the gut feeling Gall's law wasn't followed, though my sample size is very small: 1.5 systems. The 0.5 comes in because the other system I'm managing isn't on Kubernetes (yet, I'm sure it'll become that after a while).

I consider that jumping from a monolith line of thought to a microservice line of thought, for any team or system, is a very large jump that must be scrutinized before actually doing it. Microservices are no silver bullet, mostly because developers are taught to write monoliths their whole lives, and then you push microservices on them, which come with a lot of specific tooling. As years pass, even more tooling is created, to the point that enumerating it all can be absolute sensory overload. Really.

My Bad Kubernetes Experience

In one of the systems I was developing on, I wasn't actually messing with Kubernetes directly, but, regardless, I had a very poor experience integrating with it. The monolith app I developed was inside a VPS, and had to push data to MongoDB and Redis nodes which were living inside the Kubernetes cluster. The person that managed it told us to use kubectl port-forward, and sure, we tested it out, and it worked, at least at the start, but we all know how it goes from here.

While the system was operating at spike load, the cluster restarted itself, and kubectl port-forward lost connectivity with it. In my mental model, it would reconnect to the cluster: it has authentication, it doesn't need to hold any state. Yet it didn't, and our app stayed unable to connect to Redis (with a very cryptic ECONNRESET in the app logs) for a long time while we attempted to debug it. We asked the person operating the cluster how Redis was doing, and they told us it was up. After that, a restart of the systemd unit that ran the port-forward “fixed” it.

After talking to a friend about that situation (a long time later), they told me port-forward is “a broken ssh tunnel that might not even work”, and “it's not meant to be used for anything other than testing”. So I could have used some other Kubernetes solution which wouldn't have made me write three paragraphs.
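For what it's worth, a failure like the one above is easier to spot with a check that goes through the tunnel and talks to Redis, instead of only checking that the local port is open. The forwarding process may still be listening locally while the far end is gone, which would be consistent with the ECONNRESET we saw. Below is a minimal sketch in Python, not what we actually ran: the function name tunnel_is_alive, the host, and the port are all made up for illustration, and it assumes the thing behind the tunnel speaks the Redis protocol.

```python
import socket


def tunnel_is_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True only if something Redis-like answers a PING through the tunnel.

    Hypothetical helper: host/port are whatever the forward listens on locally.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")  # Redis accepts this as an inline command
            reply = conn.recv(64)
    except OSError:
        # Covers connection refused, ECONNRESET, broken pipes, and timeouts.
        return False
    # "+PONG" on success; an error reply such as "-NOAUTH ..." still proves
    # that the far end is reachable. An empty reply means the peer hung up.
    return reply.startswith((b"+", b"-"))


if __name__ == "__main__":
    # Assumed local end of the forward; adjust to your own setup.
    print("tunnel alive:", tunnel_is_alive("127.0.0.1", 6379))
```

Something like this could run on a timer and restart the forwarding unit when it returns False, though that only papers over the real problem of using port-forward for production traffic.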

My Bad Microservices Experience

I think I can say this one is caused by bad management wanting to put “novel” things on teams that are very new, that don't know what a container actually is, or that don't know what distributed system design actually entails, especially the failure modes of such a system.

When you jump into microservices, and design with them in mind, you have to know that you're designing a distributed system, and while the units composing that system are small and simple and understandable, the whole can be much, much more complicated (see: natural systems in the whole world, including human-made ones!). You must also know all the “fallacies of distributed systems”, which can become somewhat cliché, but are actually quite true. At this rate, if I'm developing a distributed system, I should print out the list and put it next to me and every other developer. I've seen them happen, even though the fallacies have been enumerated since... 1997? It's weird how we're still having issues that come from those fallacies in 2020. We should do better.

I think this boils down to “people should learn a little bit before jumping on the microservice hype”, and in most cases, Gall's law still applies. I believe that things should be written first as a monolith, so the developers can learn the business logic in an environment where function calls aren't remote everywhere. And if scaling is required, then investigate whether a distributed architecture would actually let you scale, or do something else.

Offloading to the database

Another data point from overall system design is that you shouldn't offload everything to the database, as in, make every microservice need it for any operation. Once things like autoscaling jump in and you have 10 times the number of microservices, your database will suffer 10 times the load. That might catch fire.
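The arithmetic behind that is simple enough to sketch. Assuming every instance keeps its own connection pool to the database, the total number of connections grows linearly with the instance count. All the numbers and names below are hypothetical, picked only to show the shape of the problem.

```python
def total_db_connections(instances: int, pool_size: int) -> int:
    """Connections the database sees if every instance opens a full pool."""
    return instances * pool_size


def exceeds_limit(instances: int, pool_size: int, max_connections: int) -> bool:
    """True when the fleet as a whole wants more connections than the DB allows."""
    return total_db_connections(instances, pool_size) > max_connections


if __name__ == "__main__":
    POOL_SIZE = 20          # hypothetical per-instance pool
    MAX_CONNECTIONS = 500   # hypothetical database limit

    for instances in (5, 50):  # before and after a 10x autoscale
        total = total_db_connections(instances, POOL_SIZE)
        over = exceeds_limit(instances, POOL_SIZE, MAX_CONNECTIONS)
        print(f"{instances} instances -> {total} connections, over limit: {over}")
```

With these made-up numbers, 5 instances fit comfortably and 50 do not, and nothing in the individual services changed in between.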

Though if the database does catch fire AND is a managed database solution, you can blame the cloud company providing it instead of yourself (hrrrrm, Discord loved to do this in their postmortems, but nowadays they don't even give any postmortem. It's sad).