> the ultimate goal of software design should be (organizational) knowledge building
One thing to add: the author talks about reviving a system as a “slow and difficult process”, and it is. However, the concrete example described is not worthy of this kind of hand-wringing: a system that could have been built by a single competent engineer in 6 months (inevitably of alpha quality, at best) could be resurrected by a competent team of several programmers and brought to, say, beta quality, while keeping the lights on for the alpha system, in how long? Let’s say 9-12 months. No biggie, really.
Most companies routinely discard man-years of programmers' effort, so those 9-12 months are likely just a blip in the lifetime of that firm.
The article mentions Zach Tellman's newsletter "Explaining Software Design" (https://explaining.software/) which I highly recommend reading. I have found his works to provide deep insight into the process of software design.
I believe the only reason I've been successful in this is because I agonize over simplicity. There are times during the development of any project where one might be tempted to hack around an issue, or commit the ugly code that seems to work. These are the rough edges that inheritors of a codebase use as evidence that a blank slate would be preferable. They're also the bits where the underlying business logic becomes murky. My goal is for the code to be so clear that documentation would feel redundant.
This approach of course takes more time and requires that your management trusts you and is willing to compromise on timelines. It's extremely rewarding if you can sell it and deliver.
And getting "compromise on timelines" is a most sublime political art. It requires the combination of both a humble, competent manager and an established, successful engineer worthy of trust.
Congratulations on your success on those two varied fronts!
If the question is what does the software actually do, then of course the code, toolchain, and runtime are the authority.
It also doesn't have to be that detailed: just a one-line comment at the top of a file saying why it's there and what it's for can make an immense difference to the time it takes to understand code.
Class comments are great too if they have in them everything that's NOT in a ChatGPT summary :-). I.e. I can paste code into ChatGPT myself to get a summary if I really wanted to, so I don't need that; what I need is all the things it doesn't tell you, which is basically why the class exists and what it's intended for.
The lower down the hierarchy it goes, the less comments matter, IMO.
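To make that concrete, here's the kind of file-top comment I mean (the file name and rationale are invented for illustration):

```python
# payout_batcher.py
#
# WHY this file exists: the payment provider rate-limits us to one payout
# call per merchant per minute, so payouts are batched and staggered here
# instead of being fired directly from the order-handling code.
# (Hypothetical example; the value is the one-paragraph "why", not the details.)
```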
> There are times during the development of any project where one might be tempted to hack around an issue, or commit the ugly code that seems to work. These are the rough edges that inheritors of a codebase use as evidence that a blank slate would be preferable.
Yes, one thing I've learned is to never underestimate the power of inertia in a codebase. When adding functionality, 99% of developers will go for the path of least resistance, which is mimicking whatever patterns already exist. To loop back to the article, this is often due to lack of full understanding; the default assumption is that because something is written in a certain way, it's the best way. This isn't true; it may not even be the correct way! But copying what already exists has an element of safety built into it, without needing to spend the effort to deeply understand existing code (which tends to be developers' least favorite activity).
So if you put in an ugly hack, or have a code structure which doesn't make sense, expect that to persist for years, or decades.
Or, more importantly, explain to others why you're deviating from "standard"
IMO setting up reasonable patterns for others to follow is part of a good design. I'm not saying that I personally am great at it - it's an ideal!
I think you are right though - very non-understandable things tend to persist because nobody wants to touch them.
Which is why we used to study a book called Design Patterns.
I struggle to believe this. Perhaps it's my personal situation: inheriting a 150k-line embedded C programme which started sprouting weird bugs when ported from x86 to ARM.
> minimal oversight or documentation
Why? Why do you not have documentation?
> I've been successful in this is because I agonize over simplicity
I will break this down. "I've been successful in this": I do not believe this statement.
> I agonize over simplicity
I wonder if the subordinates in your organisation, who are not allowed to criticise you, wish you had agonised over documentation. (I do not know what power you have over the folks who follow you; I am hypothesising it is a lot.)
Documentation is very hard. It is harder than writing code because there is no parsing of documentation, no demonstration of correctness.
Inaccurate or lazy documentation can be worse than useless, but no documentation condemns the system to a slow death.
I wish my fellow computer programmers would stop making excuses for not doing the extremely hard work of documenting what they were thinking when they (inevitably) did something slightly different.
> Documentation is very hard. It is harder than writing code because there is no parsing of documentation, no demonstration of correctness.
You answered your own question :)
So "do not do the hard parts"?
That is very unprofessional
No. It is geeking out, part of the job...
> achieving desired result without doing the hard parts is not only professional, but smart and actually kind of awesome.
That is a menace. I think I am working on code you wrote
It is the opposite of professional. It is amateur, irresponsible dilettantism
Most employers have less than zero interest in paying coders to document in my experience. If they want documentation to exist, they hire a technical writer.
Sadly, I've never met an employed tech writer (and no, journalists don't count).
That is the problem
Not that it is true; it is not, for many reasons. The problem is that it is believed.
The information that is in your head might be nonsense to this person, and there is a chance that it does not reduce the time it takes to understand in any meaningful way.
Every codebase is going to have different definitions of "professional standards".
Will also note I have no subordinates. In most cases I've handed these services off to teams with more seniority/higher rank than myself.
Re: documentation, I suspect the embedded C and adjacent systems you work on warrant docs more than the web app plumbing work that I do. I've done brief write-ups with some diagrams, but I wouldn't know how to document further without just restating what is already clear from the code.
Every company I've worked had parts of the codebase that were full of complicated business logic whose purpose was totally non-obvious, or complex interactions with outside APIs etc. I took care to document those things carefully so they would be understandable.
I also agree with this person for the most part. For all I know the original poster might indeed be successful with their approach, but in general having docs of some sort is a good idea.
I think most devs have the sometimes mistaken belief (coupled with some arrogance/cargo culting) that code should be self-documenting, skipping over the fact that code can document the WHAT but not the WHY in as much detail as would be needed to tell the full story.
Sometimes a simple comment explaining the basis for doing things a certain way, a Markdown README/ADR in the same repo, or even a link to a particular Jira issue (to at least indicate that one with useful context exists, in the midst of thousands of others) will all help, and save someone a headache when they would otherwise miss important context.
The correct amount of documentation is as little as you can get away with (without being apathetic or ignorant of the developer experience of others on the project who don't know all that you do), but not zero. Code naming conventions and structure, tests (both for correctness, how it should work, and for how to use it), and automation (e.g. Dockerfiles that detail the dependencies, Ansible playbooks that detail the needed environment, systemd service definitions, or even your project files and build scripts) can explain a lot about it, but not all.
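For instance, a short rationale comment next to the code it explains often carries the WHY that the code itself can't (the constant, the numbers, and the ticket ID below are all hypothetical):

```python
# Retry at most 3 times: the upstream API locks the record for ~2 seconds,
# and in testing a 4th attempt always hit its hard rate limit.
# Background and incident history: PROJ-1234 (hypothetical ticket).
MAX_RETRIES = 3
```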
My tooling is not showing anything.
What tooling do you recommend?
> This approach of course takes more time and requires that your management trusts you and is willing to compromise on timelines.
I would say that most management and even most programmers don't see the value in this. In my experience focusing on simplicity gives much better long-term results but it has higher and more unpredictable upfront cost. Blasting code onto main is seen as being more productive even though long-term it seems to have much higher overall costs.
Getting v0.1 out, albeit with murky code, and iterating to v2.5 with 10 paying customers is the way to progress. The hard, non-science part is getting management to spend billable hours for no short-term benefit. That's the key skill.
It’s more than the usual software maintenance too: it’s the entire operation of a piece of software. Scaling it out, being on call for it, adding monitoring, alerting and logging. Inter-operating with other software in the company. Developing libraries and services for other developers to consume. Security. Understanding and deploying the dependencies of the software. And more.
The clever recsys in my example is only the tiny kernel of the actual challenge of delivering this to users. It's the complex care and feeding of a live service that matters.
In this example, it isn't entirely clear if this service ('saas middleware') is deeply integrated with org's core competency or value. But I'll assume it isn't.
They do not understand the service domain well enough and cannot staff or motivate the people to build and maintain it correctly. This is exactly why SaaS exists and is so ubiquitous. You're just not going to be able to build something as good or better for less in the long run.
If they had properly understood the cost of building and maintaining this, compared to the (probably) insignificant increase in enterprise value to the org, they probably would have just RIF'd these spare engineers and kept paying the SaaS provider.
I deal with the internals of many engineering teams across companies of all sizes, and sure as shit, every. single. one. of them has multiple of these internal failed creations. I just don't think people truly understand how much of a liability these systems are to orgs.
But yeah, once they made the first mistake, the rest of the blog pretty much hits the nail on the head.
> 3 ...ORG spends an egregious amount of money on middleware SaaS
> 4 ...executive figures they should be able to replace SaaS with in-house system
> 5 ...manager tasks one of ORG ’s finest engineers with the job of building it
If in point 4 it was determined this was a low-value, high-TCO project with many replacements, the stud engineer doesn't work on the project, and no events past this point occur.
If point 3 was that they had an opportunity to build a flagship product/feature in their wheelhouse and drastically grow market share, nothing past 6 and/or 7 happens.
Like I said, there are interesting things here about knowledge transfer, but the root cause seems to be missed in the analysis. Maybe there are some other real-world scenarios where teams behind critical software are getting replaced wholesale and remain confused, but I'm not convinced most of these issues would come up in a situation that wasn't the one described in 3-5.
I'll argue that the higher-level context introduced in point 2 is even more important here: "ORG shifts from assume we have infinite budget mode to we need to break even next year or we’ll die". I.e. the whole scenario exists not because the business can't accurately evaluate TCO, but because the business is in do-or-die mode and long-term TCO doesn't matter NOW.
That is, this whole scenario takes place in a situation where there is an organizationally vital need to cut costs. What happens afterwards is a trade of long-term risk (internalizing an essential business function and giving it a bus factor of one) for immediate financial improvement (no more SaaS spend). Long-term TCO doesn't matter if the company collapses next quarter, right?
And in that short-term frame, the project is an unqualified success: X10 delivers exactly what was needed, and the SaaS spend is eliminated. But the risk hits: X10 leaves the company.
[So, pointing out this hypothetical company isn't correctly estimating TCO is correct, but irrelevant; they're in a position where having to pay the long term costs will be a better problem than the one they have now - a reasonable business decision, though not a great one to have to make.]
For what it's worth, I completely agree with your original point: organizations really do systemically underestimate the total cost of ownership of a service. Within the example in the article, the flawed assumption is pretty explicitly laid out in point 7: "For all intents and purposes, development is done, they only need to keep the lights on." - and exploring WHY this assumption is flawed is the core of the article (section 3).
So, ultimately I agree with dambi0 in the GP comment - the lede hasn't been buried here; rather, the whole article is a discussion of one aspect of the very point you make. Why DO organizations systematically underestimate service TCO? Because, at least in part, there is not yet a widespread understanding that a service is not "software" in and of itself; rather, a service is the organizational understanding of a solution to an organizational problem domain, and maintaining organizations is orders of magnitude more expensive than maintaining tools in and of themselves.
This is spot on. I was never able to put it in such precise words.
This theory has Brooks's Law as a corollary:
"Adding manpower to a late software project makes it later."
Because the developers need time to develop this mental model before they can meaningfully contribute to the codebase.
Presumably for code you would get enough of the why/process via comments, but that seems unlikely in practice. Perhaps coding needs to borrow some tools from project management or something? Knowledge sharing/transfer is hard.
It’s one of my favorite parts of the process.
People just want to build the app that they want to build. I’ve talked to engineers who just say “I don’t really care until we can start coding.”
I got into engineering because I like building things that are useful.
How about "Solution Architect"?
I think a collection of nonsense job titles for computer programmers would be fun...
It’s not like physical products are immune to this problem. I could list you a billion poorly designed products that don’t seem to meet the correct requirements.
At the end of the day, some people just like to build stuff without understanding who they are building for. It could be because they like engineering. It could be because they think they will make money because “people will come if you build it.” Both strategies make poor solutions.
When it should be “the users have these specific problems and the product should make their life easier.”
I am perfectly happy to be called a "programmer" though. IMO that's a very adequate description and honourable. No need to steal anyone else's glory.
TL;DR, I have direct experience of: “I don’t really care until we can start coding.”
The truth is that requirements gathering is also a moment of discovery for the client.
As someone who _really_ enjoyed requirements gathering for many years and now has become one of the "I don't care let's just build it" people I can assure you that some of us crashed out thanks to Scrum Masters™, Project Managers™, Product Owners™, or any of the other big "A" Agile™ cronies.
And surprisingly, this is an aspect in which I see very, very little progress.
The most we have are tools like Confluence or Jira, which are actually quite bad in my opinion.
The bad part is how knowledge is shared. At the moment it is just formatted text with a questionable search.
I believe LLMs can help synthesize what knowledge is there and what is missing.
Moreover, it would be possible to ask what is missing or what could be improved. And it would be possible to continuously test the knowledge base, asking the model questions about the topic and checking the answers.
I am working on a prototype and it is looking great. If someone is interested, please let me know.
If you put knowledge in a wiki, no one will read it and they will keep asking about stuff anyway.
Then if you put it there and keep it up to date, you open yourself to a bunch of attacks from unhappy coworkers who might use it as a weapon, nagging that you did not do a good job or finding some gaps they can nag about.
How could the LLM help?
Given that it is missing the critical context and knowledge described in the article, wouldn’t it be (at best) on par with a new developer making guesses about a codebase?
https://philarchive.org/rec/DIEEOT-2
While humans and computers both suffer from the frame problem, LLMs do not have access to semantic properties, let alone the open domain.
This is related to why pair programming and self organizing cross functional teams work so well btw.
Knowledge is organised into topics, and each topic has a title and a goal. Topics are made of markdown chunks.
I see the model being able to generate insightful questions about what is missing from the chunks, as well as synthesise good answers to specific queries.
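As a rough sketch of that structure (the names and the prompt below are my own invention, not any real API):

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    """One unit of knowledge: a title, a goal, and markdown chunks."""
    title: str
    goal: str
    chunks: list[str] = field(default_factory=list)

def gap_finding_prompt(topic: Topic) -> str:
    """Build a prompt asking a model what the topic's chunks leave unanswered."""
    body = "\n\n".join(topic.chunks)
    return (
        f"Topic: {topic.title}\nGoal: {topic.goal}\n\n{body}\n\n"
        "List the questions a newcomer would still have after reading this."
    )
```

The same Topic objects could then be quizzed periodically, comparing the model's answers against the chunks to spot drift or gaps.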
An LLM that was trained up on these sources might be very powerful at helping people not to solve the same problem many times over.
I built a chatbot under the same assumption you have, for a large ad agency in 2017: an "analyst assistant" for pointing to work that's already been done, offering to run scripts that were written years ago so you don't have to write them from scratch.
Through user testing the chat interface was essentially reduced to drop-down menus of various categories of documentation, but actually it was the hype of having a chatbot that justified the funding to pull all the resources together into one database with the proper access controls.
I would expect that after you went through the trouble of training an LLM on all that data, people using the system would just use the search function on the database itself instead of chatting with it, but would be grateful management finally lifted all the information siloing.
I love your point about the chatbot being the catalyst for doing something obvious. I curate a page for my team with all the common links to important documentation and services and find myself nevertheless posting that link over and over again to the same people because nobody can be bothered to bookmark the blasted thing. Sometimes I feel it's pointless making any effort to improve but I think you have a clever solution.
The other aspect of it, IMO is that searching for the obvious terms doesn't always return the critical information. That might be my company's penchant for frequently changing the term it likes to use for something - as Architects decide on "better terminology". I imagine an LLM somehow helping to get past this need for absolute precision in search terms - but perhaps that's just wishful thinking.
As software developers we’re intimately familiar with these ideas. But the industry still treats them as “folk knowledge”, despite decades of academic work [1] and systemization attempts like the original Agile.
We really need more connective work, relating the theoretical ideas to the observed behavior of real-life software projects, and to the subsequent damage and dysfunction. I liked this essay because it scratches that itch for me. But we need this work to go beyond personal blogs/newsletters/dev.to articles. It needs to be recognized & accepted as formal “scientific” knowledge, and to be seen and grokked by industry and corporate leadership.
[1] https://dl.acm.org/doi/pdf/10.5555/257734.257788
- The size and complexity of the code base (for some definition of size and complexity)
- The quality of the code and docs (for some definition of quality)
- The skill and experience of the people involved
In four years in a big tech role, my team twice inherited and had to modify a code base without any input from the original authors. One was a quagmire, the other was a resounding success:
- The first was a media player control that we had to update to support a new COM interface and have a new UI. We decided that it was too complicated, and nobody understood it, so we’d reimplement it from scratch. One year later it mostly worked, but still had bugs and performance issues that the original version didn’t have. In hindsight, I suspect it would’ve been cheaper to try to revive the original code base.
- The second was a music database for an app running on a mobile device. Our current one was based on the version of SQL available, but some principal engineers on another team suggested replacing it with a custom in-memory database that already shipped in another device. We argued that the original authors had left and the code was unwieldy; they argued that “it’s just code, we can read it” and its performance was known to be better. They did the work to revive it and successfully integrated it into our app. Wild success.
The flip side of “it’s impossible to revive a dead system” is “don’t rewrite a working system from scratch”. Absent more research, the only way to correctly guess which situation you’re actually in is to have tons of experience.
1. Get the original dev to explain his theories (keep employees longer or engage them as consultants). 2. Make and keep a "diary" of the original dev's theory building.
In this story, and probably in many places, the business environment nevertheless supports the outcome described.
Over the years, I've built a number of contraptions under a similar set of circumstances: a technical problem usually created by an organizational issue suddenly appears that is both severe enough to threaten a project yet falls outside the core business, so it needs to be fixed both yesterday and on the cheap.
Inevitably, I get saddled with it and produce a kludge that is equally effective and cursed before going back to business as usual. More than once I've learned to my horror that years later the thing is not only unmaintained yet still in place, but its usage expanded to the point where it became load-bearing, because the underlying organizational issue was never solved.
In a manner of speaking, it is the opposite of the situation described in the article: a complete lack of software design that somehow manages to survive in spite of a lack of knowledge building.
I work a lot in the transition area between commando and infantry, a.k.a. X10 and TEAM. I’ve also found myself on TEAM++ coming in to replace TEAM.
It is difficult to explain to customers that SVC was built on a set of assumptions which in turn informed the design. Once the assumptions change, the design typically needs to change as well.
You need a history of the assumptions so that new developers can know what's legacy and what isn't.
I've never yet had a set of requirements that didn't change.
Isn't this (at least in part (and perhaps only approximately)) what Architecture Decision Records are for?
I know this is not what the article is about. But perhaps the exec should have spent resources and time trying to increase revenue instead of cutting costs marginally and creating an expensive system down the road, derailing the team's focus.
Build vs buy… Build is almost always not cheaper. Many other reasons to build though.
I’m not sure that most developers are willing to revive software, based on my observation that very few read anything at all, especially the source code. Instead I see a lot of adjusting the input and output of the existing program by adding a new layer. This new code is totally understood by the new dev, and they can modify/maintain it easily without worrying that they broke the existing system. It also usually duplicates something that already exists inside the system. As the process repeats more and more layers are added.
I think a few lucky teams have developed a culture that encourages learning the existing code. (Popular web frameworks come to mind as an example.)
Hence, write down your thought process, mental model, and assumptions alongside the code. Tip: Call the process "writing documentation" instead of "commenting code".
Yeah right. "We don't know how it works so we're going to ditch it and create it again."
It works in some cases but by no means should that be the default.
SVC was an unwanted child. It wasn't their "product". One employee was tasked to write it to save paying money to a "seemingly innocuous middleware SaaS".
To anyone in ORG working on it, it was a dead end. No one wanted to own it and perhaps no one did. A team was asked to add features to it.
Doing the groundwork of actually understanding SVC had many negative consequences:
* It would take a very long time, making managers unhappy. It would also be largely a wasted effort, since no further work was then needed on SVC.
* If you became an expert on SVC, it would be yours to keep, and no one wanted that.
A more cynical take (that I'm inclined towards): the median software developer is simply not very good. X10 was a good developer; the people on TEAM and TEAM++ were not.
Knowledge exists in mental models and team structures. Small components and systems can be understood by a person, but team structure also embodies knowledge of larger systems. People will need complementary mental models to understand a large system together.
That's why adding manpower to a late project makes it later. That's why maintenance is hard and handover is harder. That's why systems devolve into big balls of mud. Because companies and managers do not respect the fact that you need people and teams who have good mental models and that mental models take time to build and share.
No amount or quality of code can make up for this fact. And simulacra of ownership - having "product owners" or whatever - won't cut it. You need people to own their systems, understand them deeply. Moving people around, churn, treating developers as interchangeable, substituting rituals for deep work, accumulating 'technical debt' (deferred work as in ship now and think later) etc are all detrimental to building and sharing sound mental models.
How does it end up like this? Why doesn't the last committer just delete everything and write it in a single line instead?
Code can embody knowledge, but it is not the embodiment of knowledge. It can express functionality, but it is not a functional component of a system. I think "aspect" is the best description: when you look at a system from the source code, you see some of it. It is not a projection of the system over a set of dimensions, as some people seem to treat it, and it is not a textual description of the system. It is the part of the system you can see when you come at it from that side.
Speaking of tests, I've many times learnt more about how some code is supposed to work from the tests than from the documentation. Yet another reason to test everything you can.
And if you take the time to write a series of high-level cases, they can show the full expected behavior of a process. E.g.: "Don't accept another request on an object while a request on that object is already in the queue."
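A minimal sketch of such a high-level case, assuming a hypothetical RequestQueue that enforces that rule (pytest style; all names invented):

```python
import pytest

class DuplicateRequestError(Exception):
    """Raised when an object already has a pending request."""

class RequestQueue:
    """Hypothetical queue enforcing one pending request per object."""
    def __init__(self):
        self._pending = set()

    def submit(self, object_id: str) -> None:
        if object_id in self._pending:
            raise DuplicateRequestError(object_id)
        self._pending.add(object_id)

# The test states the business rule in domain terms, which is exactly the
# knowledge a reader often can't recover from the implementation alone.
def test_rejects_second_request_on_same_object():
    queue = RequestQueue()
    queue.submit("invoice-42")
    with pytest.raises(DuplicateRequestError):
        queue.submit("invoice-42")
```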
A unit test is great, but I've seen people delete unit tests rather than try to understand what's going on.
It's rare that just reading the code will actually capture the spirit of what it means, that you could skip the step of asking the folks who wrote it why things are the way they are or the step of experimenting with it yourself to get a feel for it.
Or in other words, it doesn't really matter who reads the code. You still don't get to skip the knowledge building.
Software is a cog. Your code can't be so self-documenting that it becomes a domain expert for the domain it is trying to fill. That's like documenting how to train for a marathon by looking at running shoes.
(I'm not even a Test-Driven Development evangelist. There's just no other way to "prove" that things kind of, almost, sort of, work in a possible environment.)
Around 12 years ago, my employer tasked me with building a quote engine for a new product they wanted to sell online. The engine needed to produce 4 additional quotes (2 lower, 2 higher) to either prevent potential walk-aways or to offer upsells to capture potential additional revenue.
And it struck me at the time that this sort of hand-wavey selling tactic would be just the sort of thing that was likely to change, so I put all this logic into a pure function: pass in an original quote request and the function returns a collection of alternative quote requests.
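Something like this, in spirit; QuoteRequest and the tier arithmetic below are invented for illustration:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class QuoteRequest:
    coverage_level: int  # stand-in for whatever the real request carried

def alternative_quote_requests(original: QuoteRequest) -> list[QuoteRequest]:
    """Pure function: alternatives derive from the ORIGINAL request only.

    Policy at the time: 2 lower, 2 higher. If the business later wants
    1 lower and 3 upsells, this list of offsets is the only thing to change.
    """
    offsets = [-2, -1, +1, +2]
    return [
        replace(original, coverage_level=original.coverage_level + delta)
        for delta in offsets
    ]
```

Because it takes the original request and returns a collection of derived requests, changing the selling tactic means editing one function, not untangling the call sites.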
And, sure enough, a couple of years later, the business decides to change the approach and offer 1 lower quote and 3 upsells. And they give the job of implementing this to another developer.
I was still at the company and was known as the original developer (my name was in a comment at the top of the file, for a start) so I was asked to review the code changes.
I was surprised to learn that the other dev had left the pure function untouched and had, instead, written a bunch of new logic to generate the alternative quotes - not based on the original quote request but on the collection of alternative quotes returned from the original function. Furthermore, this new logic was placed in the main procedure - mama's finest spaghetti in the making, right there.
So I rejected the change and told the dev where to put the actual logic. Then I waited for the re-review request to come in.
What happened instead is that the code went live anyway - the dev had simply re-raised the review request and assigned it to another dev who rubber stamped it.
Looking back, I don't think all the documentation in the world would have prevented this behaviour. A better approach would be for the company to pass the changes to the original developer and to pair with another dev - like Fred Brooks' Chief Programmer plus Assistant recommendation.
I was never approached for an end of year review for this developer and I left the company before them. It's not personal but I'll resign on the spot if a company I work for employs this developer in future.
i.e. explaining it to those who make decisions doesn't work that well because it doesn't fit their mental model very well - they don't understand that a large part of their asset is sitting in people's heads.
That guy wants 5k more...? Pay it. The cost of hiring someone new, training them up, and relearning it all will be far higher. Don't force people back to the office, be relaxed about everything, and keep the knowledge.
At the same time make sure other people are learning it so you do have replacements.
At the same time make sure you have proper tests so you can work with code you don't understand yet if someone leaves.
At the same time, try to gather all the correct and up-to-date documentation and explanations somewhere.
It might have been possible to say, great job you delivered on time, now you have as much time as you need to write your mental model down. On full pay of course, and we won't count that against you as non-technical work in the next performance review.
That increases the upfront cost of SVC but is an investment that pays back interest as soon as someone else has to fix anything.
I totally believe this. I see zombie programs at my employer, including ones that are currently making a profit, that just don't know yet that they are dead. The Jira model of non-ownership software development is a leading cause, in my opinion...