Designing Systems - Efficiency vs Flexibility

Or why you should consider taking a course in System Theory.

Posted by Holger Reinhardt on November 22, 2019 in Dev tagged with Devops, Development

This blog post was inspired by a deployment tool discussion while drafting engineering blueprints to converge the organization on common application patterns. Since I have been in these kinds of discussions repeatedly, I wanted to take the time to write down some thoughts about tradeoffs in system design. While the example at hand is about deployment tooling, it applies equally to, for instance, system integration and system architecture.

On a side note, I think that software engineering as a profession can learn a lot by studying System Theory and Control Theory - I wish those two subjects were taught as the foundational subjects in software engineering at school. It feels like we are wasting a lot of time in software engineering rediscovering fundamental truths about complex systems which every social science and physics major learns in his or her first couple of years.

But coming back to the discussion at hand. We discussed the pros and cons of encouraging the use of the native tooling of cloud providers, like AWS CodeStar and Azure DevOps, vs standalone tooling like Jenkins, Terraform, and Serverless.

The main argument for the latter is flexibility. In my experience it is not the code which creates a lock-in to a particular vendor, but the tooling we use to deploy the code. Think about this for a second - while the code representing the actual business value is mostly indifferent to where we run it, it is the hidden investment in tooling and the associated process which can, and in most cases does, create a vendor lock-in. And the reason is quite trivial - it is much easier to argue for an investment in changing code to create business value for customers. It is much harder to argue for investing in changing tools only to end up in the exact same place from a business and customer perspective. Tooling is a hidden enabler, whereas code represents real value to the customer in the form of features. Yet it is the hidden enabler which creates the lock-in, not the code itself. If you picked the cloud vendor's native tooling, do you really think you are ever going to change vendors again? Even when at some point your and your vendor's interests start to diverge? Or if the vendor decides to monetize a previously free service, or becomes a reputational risk which is detrimental to your business?

On the other hand, maybe you are trying to optimize for the short to medium term and that risk is acceptable for you. Because if you spend too much time designing for hidden flexibility and correspondingly too little on visible customer value, you might be too late to capture the market opportunity. Dying in beauty is still dead.

On the other side the argument is efficiency. Stuff just works, frictionless and convenient. Fast time to value, as most vendors would claim. And they are not wrong. Developer tooling from Microsoft and AWS simply works with their respective services.

And you won’t encounter issues like the breaking changes to your Terraform configuration when upgrading from Terraform 0.11 to 0.12. This is a perfect example of the costs hidden in flexibility. The trigger for the upgrade is actually quite trivial: AWS has given notice that the node8 serverless runtime will be EOL at the end of the year. If you have deployed your serverless functions with Terraform 0.11, you might be surprised to learn that in order to deploy node10 you need to upgrade to Terraform 0.12. All is good until the first local run of the newly updated Terraform: you learn that you need to change your Terraform files to match the new syntax. And conveniently it supplies a script for that too. All good so far. Since you might use Jenkins, you will need to upgrade the Terraform version there too. But at that point at the latest, some yellow flags start to go up - you are no longer changing the deployment of a single service of your distributed application, but the global deployment tooling for all services of your application. And you start to wonder what will happen to all those non-JavaScript services you have been deploying with Terraform 0.11. Now you are in a tough place - will you upgrade the Terraform files for all those unaffected services too? Who will spend the time to test all this functionality, when really you should be busy building the features customers are paying you for? You realize with a sinking heart that, unless you are willing to do a forklift upgrade of your entire application, the only way out is to switch to containerized builds in your Jenkins.

But here is the catch - once you create such a containerized deployment pipeline, you are playing - from a systems perspective - in a different league. You are now able to deploy each service in true isolation to any provider and system of your choosing. Any subsequent update will no longer have any knock-on effect on other services.
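To make the isolation argument a bit more concrete, here is a minimal sketch of what "each service pins its own tooling" could look like. It is an illustration under stated assumptions, not the pipeline discussed above: the service names, directory paths and exact version tags are hypothetical, and it assumes the build agent runs Terraform through the public `hashicorp/terraform` Docker image rather than a globally installed binary.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServiceDeployment:
    """One service together with the Terraform version it is pinned to."""
    name: str
    terraform_dir: str      # absolute path on the build agent (hypothetical)
    terraform_version: str  # Docker image tag (assumed to exist)


def docker_command(svc: ServiceDeployment, action: str = "plan") -> List[str]:
    """Build the `docker run` invocation for one service.

    The tool version travels with the service, so upgrading one service
    never changes the tooling used by any other service. A real pipeline
    would run `init` before `plan` or `apply`.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{svc.terraform_dir}:/workspace",  # mount the service's config
        "-w", "/workspace",
        f"hashicorp/terraform:{svc.terraform_version}",
        action,
    ]


# Hypothetical services: only the function that needs the newer runtime
# moves to 0.12, the others stay on 0.11 until there is a reason to touch them.
SERVICES = [
    ServiceDeployment("image-resizer", "/ci/image-resizer/terraform", "0.12.16"),
    ServiceDeployment("billing-api", "/ci/billing-api/terraform", "0.11.14"),
    ServiceDeployment("report-worker", "/ci/report-worker/terraform", "0.11.14"),
]

if __name__ == "__main__":
    for service in SERVICES:
        print(service.name, "->", " ".join(docker_command(service)))
```

The design choice is simply that the version is data attached to the service instead of a property of the build server. That adds a little overhead per service, which is exactly the friction the open-system argument accepts in exchange for independent upgrades.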

So while you incurred additional short-term costs for simple system hygiene, the end result is actually a much better system for the mid to long term. This is not a random outcome, but a systemic effect in open systems, described as ‘anti-fragile’ by Nassim Taleb in his manifesto Antifragile: Things that Gain from Disorder. Closed systems decay over time, whereas open systems can be reconfigured and actually improved over time, but this comes at the cost of additional friction and overhead.

As engineers we are always faced with this tradeoff between efficiency and flexibility. Between a system which is open but incurs overhead and contains redundancies, and a tightly integrated closed system which is highly efficient. And as I mentioned before, this is not just applicable to tooling, but to almost every other dimension of a reasonably complex software system. Think Monolith vs Microservices in system architecture, or ESB vs Smart Endpoints and Dumb Pipes in systems integration.

And software engineering is not the first field to encounter this tradeoff. In his second book, The Innovator’s Solution, Clayton Christensen described the Law of Conservation of Attractive Profits and the related Theory of Interdependence and Modularity, trying to explain the economic fundamentals behind the success of companies producing tightly integrated closed products vs companies thriving with modular and flexible products. His insight was that when demand is higher than what technology can deliver, closed and tightly integrated systems tend to have a performance edge and win. But the moment technology delivers more than what is demanded, modular and open systems gain the edge by being more innovative. As we discussed above, open and modular systems have an inherent overhead, which at that point can be absorbed by the performance surplus generated by the technology. Performance is no longer critical, since the technology at this point produces adequate (= ‘good enough’) performance in both. This allows other factors like speed of innovation and costs to become decisive.

If you are curious to explore this topic more, you should check out the Technology Adoption Cycle as a jumping-off point or read through Christensen’s two-part paper on ‘Exploring the limits of the Technology S-curve’: Component Technologies and Architectural Technologies.

For a somewhat counter-intuitive argument on being able to combine both efficiency AND flexibility, check out Flexibility vs Efficiency. This paper analyzes the efforts by a Toyota subsidiary in Fremont, California (known as one of the first plants to successfully master Lean production outside of Japan) to combine flexibility with efficiency. It draws from a wide variety of research in organizational design and social sciences. The story of this plant is probably worth a separate blog post, as it was the worst performing GM facility before it became the best performing facility with the same workers, but with a new management system and philosophy from Toyota (the home of Lean).

I just do not think that there is a simple right or wrong. There is a time and place for each. If it is about time to value and meeting well-defined short-term objectives, you might decide to choose efficiency over flexibility. On the other hand, if you are looking longer term and/or want to preserve options for the future in the face of uncertainty, investing in more flexibility might be the better choice. Just make sure to not ‘die in beauty’.

Maybe next time you are asked to make such a tradeoff, answer ‘It Depends’ and think through the consequences of your choice from a larger system perspective. You might still not get it completely right, but at least you had a chance to think about it. And hopefully some of the books and articles I referenced above can help you make a better informed choice - we are Creators of Worlds after all. ;)

Feedback from @mamund:

This choice between “all-in” on a platform and mix and match your own also has a maturity element. When you first begin, let others make choices for you while you focus on what’s important. As you grow more adept, start to see patterns and “blocks” of functionality in the system (big picture) where you can make changes (add my own monitoring tools, etc.) and, finally, when you are well-versed in the way the system behaves, you can better afford to take on added complexity of choice and flexibility. So in addition there is a maturity-over-time dimension affecting your trade-off.

From @patkua in his latest email newsletter:

He references two articles which express the need for systems thinking much better than I ever could: Russell Ackoff: A Lifetime of Systems Thinking and Will Larson’s Notes on Evolutionary Architecture.