There's an infinite number of nice-to-haves. A good deadline makes it super easy to clarify what you actually need vs what you only want.
Delaying or just not doing certain features that have low ROI can drastically shorten the development time without really affecting quality.
This is something that as an industry we seem to have unlearned. Sure, it still exists in the startup space, with MVPs, but elsewhere it's very difficult. In the last 20 years I feel like engineers have been pushed more and more away from the client, and very often you just get "overspecified everything" from non-technical Product Managers and have to sacrifice quality instead.
The designer was working on a redesign in their own free time, so they were using this "new design" as a template for all recent mockups. The Product Manager just created tickets with the new design and was adamant that changing the design was part of the requirements. The feature itself was simple, but the redesign was significantly harder.
Talking with the business person revealed that they were not even aware of the redesign and that it was blocked until next year.
In my experience, risk has two dimensions (probability and severity), and three ways to be handled (prevention, mitigation, and remediation).
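A minimal sketch of how I'd write that down, with illustrative names and numbers (not from any particular framework):

```python
from dataclasses import dataclass
from enum import Enum

class Handling(Enum):
    PREVENTION = "prevention"    # stop it from happening at all
    MITIGATION = "mitigation"    # reduce the damage when it happens
    REMEDIATION = "remediation"  # clean up / recover after the fact

@dataclass
class Risk:
    name: str
    probability: float  # 0.0 .. 1.0, how likely it is
    severity: int       # 1 .. 10, how bad it is if it happens
    handling: Handling  # which lever we plan to pull

    def exposure(self) -> float:
        # a crude single number for sorting risks against each other
        return self.probability * self.severity

risks = [
    Risk("bad config deploy", probability=0.2, severity=6, handling=Handling.MITIGATION),
    Risk("region outage", probability=0.01, severity=9, handling=Handling.REMEDIATION),
]
for r in sorted(risks, key=Risk.exposure, reverse=True):
    print(f"{r.name}: exposure={r.exposure():.2f}, handled via {r.handling.value}")
```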
The line in your carefulness graph starts out with no slope. Which means you’re basically telling X that we can turn carefulness to 6 with no real change in delivery date. Are you sure that’s the message you’re trying to send?
Managers go through the five stages of grief every time they ask for a pony and you counteroffer with a donkey. And the charts often offer them a pony instead of a donkey. Doing the denial, anger and bargaining in a room full of people becomes toxic over time. It’s an own goal, except you’re bouncing it off the other team’s head. Don’t do that.
I don’t think the boss will find it funny though.
This strikes me as a pedantic argument, since the graph was clearly drawn by hand and is meant to illustrate an upward curving line. Now, maybe there's essentially no clear difference between 5 and 5.1, but when you extrapolate out to where 6 would be (about 65 pixels to the right of 5, if I can be pedantic for a moment), there actually is a difference.
A flat line will lead to bargaining, as I said. Don’t paint yourself into uncomfortable conversations.
If you don’t want the wolf in the barn don’t open the door.
But if not then my original thesis that this is needlessly asking for a pointless argument that costs social capital stands.
Mostly free is not the same thing as free, and the scale of ‘carefulness’ is already completely arbitrary. How much more careful is 6 than 5?
You don’t have to argue about it, because the scale doesn’t represent anything. The only thing to say is, sure, we’ll set the carefulness to 6.
I agree with the original comment: as professionals, we can do better than simplified analogies (or at least we should strive to).
Good insight, thanks for that
So goddamned many times. And multiply your 2-10 minute sidebar by the number of people in the room. You just spent over $200 on a wobble in a line.
Plus you’ve usually lost the plot by that point.
"I said the minimum is at 5, but if you want me to trace the line more accurately then let's take a 2 minute break and I'll do that".
Principles of Product Development makes the point that a lot of real world tradeoffs in your development process are U-shaped curves, which implies that you will have very small costs for missing the optimum by a little. A single decision that you get wrong by a lot is likely to dominate those small misses.
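To make that concrete, here's a toy cost curve (the shape and numbers are invented) showing how little a small miss costs near the bottom of the U:

```python
# Toy U-shaped cost curve: total cost = delay cost + incident-risk cost.
# The exact shape is made up; the point is what happens near the minimum.
def total_cost(carefulness: float) -> float:
    delay_cost = 2.0 * carefulness       # more carefulness -> slower delivery
    incident_cost = 50.0 / carefulness   # less carefulness -> more incident pain
    return delay_cost + incident_cost

optimum = 5.0  # minimises the toy curve above (sqrt(50/2))
for c in (optimum, optimum * 1.1, optimum * 1.5, optimum * 3):
    miss = total_cost(c) - total_cost(optimum)
    print(f"carefulness={c:>4.1f}  extra cost vs optimum: {miss:5.2f}")
# A 10% miss barely moves the cost; a 3x miss is what dominates.
```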
See the quote below.
That said, while I didn’t think it was a confusing drawing, I now wish he’d drawn the rest of the parabola, because it would’ve prevented this whole conversation.
> EM: Woah! That’s no good. Wait, if we turn the carefulness knob down, does that mean that we can go even faster?
> TL: If we did that, we’d just be YOLO’ing our changes, not doing validation. Which means we’d increase the probability of incidents significantly, which end up taking a lot of time to deal with. I don’t think we’d actually end up delivering any faster if we chose to be less careful than we normally are.
If you centered it on the y axis, that would mean your carefulness scale goes from -5 to 5, and that's just confusing.
It could have put that on the y-axis, and labeled the left “extremely rushed” and the right side “extremely careful”. Maybe that would’ve been clearer, though I really think it’s clear if you are charitable and don’t assume the author has made a mistake.
Also, parabolas are algebra, not calculus, but the same counterargument stands.
One of my guys made a mistake while deploying some config changes to Production and caused a short outage for a Client.
There's a post-incident meeting and the client asks "what are we going to do to prevent this from happening in the future?" - probably wanting to tick some meeting boxes.
My response: "Nothing. We're not going to do anything."
The entire room (incl. my side) looks at me. What do I mean, "Nothing?!?".
I said something like "Look, people make mistakes. This is the first time that this kind of mistake has happened. I could tell people to double-check everything, but then everything will be done twice as slowly. Inventing new policies based on a one-off like this feels like an overreaction to me. For now I'd prefer to close this one as human error - wontfix. If we see a pattern of mistakes being made then we can talk about taking steps to prevent them."
In the end they conceded that yeah, the outage wasn't so bad and what I said made sense. Felt a bit proud for pushing back :)
When you have zero incidents using the temporary process people will automatically start to assume it’s due to the temporary process, and nobody will want to take responsibility for taking it out.
"Wanting to tick some meeting boxes" feels a bit ungenerous. Ideally, a production outage shouldn't be a single mistake away, and it seems reasonable to suggest adding additional safeguards to prevent that from happening again[1]. Generally, I don't think you need to wait until after multiple incidents to identify and address potential classes of problems.
While it is good and admirable to stand up for your team, I think that creating a safety net that allows your team to make mistakes is just as important.
I didn't want to add a wall of text for context :) And that was the only time I've said something like that to a client. I was not being confrontational, just telling them how it is.
I suppose my point was that there's a cost associated with increasing reliability, and sometimes it's just not worth paying it. And that people will usually appreciate candor rather than vague promises or hand-wavy explanations.
A corollary is that Risk Management is a specialist field. The least risky thing to do is always to close down the business (can't cause an incident if you have no customers).
Engineers and product folk in particular, I find, struggle to understand Risk Management.
When juniors ask me what technical skill I think they should learn next, my answer is always: Risk Management.
(Heavily recommended reading: "Risk: The Science and Politics of Fear")
How do you do engineering without risk management? Not the capitalized version, but you’re basically constantly making tradeoffs. I find it really hard to believe that even a junior is unfamiliar with the concept (though the risk they manage tends to be skewed towards risk to their reputation).
...but that's not really nothing? You're acknowledging the error, and saying the action is going to be watch for a repeat, and if there is one in a short-ish amount of time, then you'll move to mitigation. From a human standpoint alone, I know if I was the client in the situation, I'd be a lot happier hearing someone say this instead of a blanket 'nothing'.
Don't get me wrong; I agree with your assessment. But don't sell non-technical actions short!
Which is important, but it isn't taking an action.
> and saying the action is going to be watch for a repeat
That watching was already happening. Keeping the status quo of watching is below the level of meaningful action here.
> if there is one in a short-ish amount of time, then you'll move to mitigation.
And that would be an action, but it would be a response to the repeat.
> I'd be a lot happier hearing someone say this instead of a blanket 'nothing'.
They did say roughly those things, worded in a different way. It's not like they planned to say "nothing" and then walk out without elaborating!
The client was satisfied after we owned the mistake, explained that we have a number of measures in place for preventing various mistakes, and that making a test for this particular one doesn't make sense. Like, nothing will prevent me from creating a cron job that does "rm -rf * .o". But lights will start flashing and fixing that kind of blunder won't take long.
You basically took the ROAM approach, apparently without knowing it. This is a good thing. https://blog.planview.com/managing-risks-with-roam-in-agile/
The final assessment in the Incident Review was that we should have a multi-cloud strategy. Luckily we had a very reasonable CTO who prevented the team from doing that.
He said something along the lines that he would not spend 3/4 of a million plus 40% of our engineering time to cover something that rarely happens.
I’d go further and say that it’s a trap to try. It’s obvious that you can’t get 100% reliability, but people still feel uneasy about doing nothing.
Like, sure, people with access to the servers can run <ansible 'all' -m cmd -a 'shutdown now' -b> and worse. And we've had people nuke production servers, so there is some impact involved in our work style -- though redundancy and gradually ramping up people from non-critical systems to more critical systems mitigates this a lot.
But some people got a bit concerned about the potential impact.
However, if you realistically look at the number of changes people push into the infrastructure on a daily basis, the chance of this occurring seems to be very low - and errors mostly happen due to pressure and stress. And our team is already over capacity, so adding more controls on this would slow all of our internal customers down a lot too.
So now it is just a documented and accepted risk that we're able to burn production to the ground in one or two shell commands.
The amount of deliberate damage anyone on my team can do is pretty much catastrophic. But we accept this as risk. It is appropriate for the environment. If we were running a bank, it would be inappropriate, but we're not running a bank.
I pushed back on risk management one time when The New Guy rebuilt our CI system. It was great, all bells and whistles and tests, except now deploying a change took 5 minutes. Same for rolling back a change. I said "Dude, this used to take 20 seconds. If I made a mistake I would know, and fix it in 20 seconds. Now we have all these tests which still allow me to cause a total outage, but now it takes 10 minutes to fix it." He did make it faster in the end :)
It’s also a ridiculously low bar for engineering managers to not even understand the most fundamental of tradeoffs in software. Of course they want things done faster, but then they can go escalate to the common boss/director and argue about prioritization against other things on the agenda. Not just “work faster”. Then they can go manage those whose work output is proportional to stress, not programmers.
Safety is a business/management decision, even in structural engineering. A pedestrian bridge could be constructed to support tanks and withstand nuclear explosions, but why? Many engineered structures are actually extremely dangerous - for example, mountain climbing trails.
Also yes, you have many opportunities to just YOLO without significant consequences in software. A hackathon is a good example - I love them, always great to see the incredible projects at the end. The last one I visited was sponsored by a corporation and they straight up incorporated a startup the next day with the winning team.
If management intends expected use to be a low-load-quick-and-dirty-temporary-use prototype to be delivered in days, it seems the engineers are not doing their job if they calibrate their safety process to a heavy-duty-life-critical application. And vice versa.
Making the decision about the levels of use, durability, reuse-ability, scalability, AND RISK is all management. Implementing those decisions as decided by management is on engineering. It is not on engineering to fix a bad-trade-off management decision beyond what is reasonably possible (if you can, great, but go look to work someplace less exploitative).
If management allocates time and resources only for a quick-&-dirty prototype not for public use, then releases it to the public with bad consequences, they will definitely ask the engineers about it. If the engineers properly covered their paper trail, i.e., kept receipts for when management refused their requests for resources to build-in greater safety, then engineering will be not responsible. Ethically, this is the correct model.
But if he & you are saying that management will try to exploit engineering and then blame failures on engineering when it was really bad management, yup, you should expect that kind of ethical failure from management. Yes, there are exceptions, but the structure definitely encourages such unethical management behaviors.
Think about it this way: the person who suffers the consequences of the decision should be making the decision. That's not management; they will never, ever accept any level of blame for anything. They'll immediately pass that buck right on to you. So that makes it your decision. Fuck management; build what needs building instead.
Look at what happened when "management" started making decisions at Boeing about risk, instead of engineers making the decisions.
And yet, "manager" is usually[1] only responsible for ensuring the boards get carried from the truck to the construction site and that two workers don't shoot at each other with nail guns, not "we, collectively, are building the right house."
I freely admit that my cynicism is based on working in startups, where who knows what the right thing actually is, but my life experience is that managers for sure do not: they're just having meetings to ensure the workers are executing on the plan that the manager heard in their meeting
1: I am also 1000000% open to the fact that I fall into the camp of not having seen this mythical "competent manager" you started with
LT: Get it done quick, and don't break anything either, or else we're all out of a job.
EM: Got it, yes sir, good idea!
[EM surreptitiously turns the 'panic' dial to 10, which reduces a corresponding 'illusion of agency' dial down to 'normal']
I’ve seen code with next to no verification turn out great.
Stop negotiating quality; negotiate scope and set a realistic time. Shipping a lot of crap faster is actually slower. 99% of the companies out there can't focus on doing _one_ thing _well_; that's how you beat the odds.
This is a really critical property that doesn't get highlighted nearly often enough, and I'm glad to see it reinforced here. Slow is smooth, smooth is fast. And predictable.
Has that become forgotten lore? (It might well be. It's old, and our profession doesn't do well with knowledge transmission. )
cheap and cheerful : cheap + fast (not good)
best of breed      : good + fast (not cheap)
mature technology  : good + cheap (not fast)
It's almost as if I'm questioning their skill as an engineer.
I don't know about you, but when I'm driving a road and there is black ice around the corner, a warning from a fellow driver is welcome.
I have no idea which situations you’re finding yourself in. It may help to sit back and see if you could word things differently. I have gotten better at communicating by asking the people I’m working with if there was a better way I could have said something. Some managers I’ve had had good advice. (I’ve also gotten myself dragged into their office for said advice.)
I have no idea how you approached it, but you could let them decide if they want your advice and have specific examples on how things went wrong if they do. “Hey, I noticed you’re working on this. We’ve had some problems in the past. If you have time, I can go into more detail.”
Then again you could just be working with assholes.
I have been failing quite successfully at communicating my tone over text for some time now, so I confess that and admit it upfront.
That’s not how our jobs work. We don’t “adjust a carefulness meter.” We make conscious choices day to day based on the work we’re doing and past experience. As an EM, I’d be very disappointed if the incident post mortem was reduced to “your team needs to be more careful.”
What I want from a post mortem is to know how we could prevent, detect or mitigate similar incidents in future and to make those changes to code or process. We then need to lean on data and experience of what the trade offs of those changes would be. Adding a test? Go for it. Adding extra layers of approval before shipping? I’ll need to see some very strong reasons for that.
The answer this post gives to that bizarre question that always gets asked is ‘nothing’, unless you want to significantly adjust the speed that we deliver features.
Any added process or check is going to impose overhead and make the team a little bit less happy. Occasionally you’ll have a unicorn situation where there is actually a relatively simple fix, but those are few and far between.
In extremis, you’re reduced to a situation in which you have zero incidents, but you also have zero work getting done.
Perhaps we have different backgrounds, but even in late stage startups I find there is an abundance of low hanging fruit and simple fixes. I'm sure it's different at Google, though.
On the other hand, enforcing a manual external QA check on every release WILL slow things down.
You’re repeating the same mistake as the article by assuming “process” sits on a grade that naturally slows work down. This is because you’re not being precise in your reasoning. Look at the specifics and make a decision based on the facts in front of you.
I agree with the premise here, but in my experience running incident reviews, the issue I see is a mixture of performative safetyism and reactivity.
Processes are the cheap bandaid to fix design, architectural and cultural issues.
Most of the net positive micro-reforms that we had after incident reviews were the ones that invested in safety nets, faster recoveries, and guardrailing, rather than a new process that will tax everyone.
They can be, yes. I have a friend that thinks I'm totally insane by wanting to release code to production multiple times a day. His sweet spot is once every 2 weeks because he wants QA to check over every change. Most of his employers can manage once a month at best, and once a quarter is more typical.
> Most of the net positive micro-reforms that we had after incident reviews were the ones that invested in safety nets, faster recoveries, and guardrailing, rather than a new process that will tax everyone.
I 100% agree with this. Your comment also reminded me to say that incident reviews are necessary but not sufficient. You also need engineering leadership reviewing at a higher-level to make bigger organisational or technical changes to further improve things.
We used to have an issue of deployments breaking in production, and one of the reasons was that we did not have any kind of smoke test as a post-deployment step (in our case we only had a rolling update as a strategy).
The rational solution was simply to create that post-deployment step. The solution that our German managers demanded: cut access to deployment for the entire team, an “on-call” to check the deployments, and a deployment spreadsheet to track it.
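To be clear about how small that rational fix is, something like this as the last pipeline step would have done it (the health endpoint and retry numbers here are made up for illustration):

```python
#!/usr/bin/env python3
"""Tiny post-deployment smoke test: fail the pipeline if the service
doesn't come back healthy after the rolling update."""
import sys
import time
import urllib.request

HEALTH_URL = "https://example.internal/healthz"  # hypothetical endpoint
ATTEMPTS = 10
DELAY_SECONDS = 15

def healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # covers URLError/HTTPError and timeouts
        return False

for attempt in range(1, ATTEMPTS + 1):
    if healthy():
        print(f"smoke test passed on attempt {attempt}")
        sys.exit(0)
    time.sleep(DELAY_SECONDS)

print("smoke test failed: service never became healthy; roll back")
sys.exit(1)
```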
It leads to situations where you need a certificate from your landlord that you take to the certified locksmith that your landlord contracted and show a piece of ID to order a key double that arrives 3 business days later at a cost of 60€. A smart German knows that there’s a locksmith in the basement of a nearby shopping mall that will gladly duplicate any key without a fuss, but even then the price is inflated by the authorised locksmiths.
I document German bureaucracy for immigrants. Everything is like this. Every time I think “it can’t really be this ridiculous, I’m being uncharitable”, a colleague has a story that confirms that the truth is even more absurd.
It’s funny until you realise the cost it has for society at large. All the wasted labour, all the bottlenecks, and little to show for it.
When Grandpa was 20 years old he left the house and forgot to take his keys, so every time he left the house he checked his pockets for his keys.
When he was 24 he left the house and left the stove on. He learned to check the stove before leaving the house.
When he was 28 he left his wallet at home. He learned to check for his wallet.
...
Now Grandpa is 80. His leaving-home routine includes checking for his keys, his phone, his wallet. He ensures the lights are off, the stove is off, the microwave door is closed, the iron is off, the windows are closed in case it rains...
Grandpa has learned from his mistakes so well that it now takes him roughly an hour to leave the house. Also, he finds he doesn't tend to look forward to going out as much as he once did...
> We then need to lean on data and experience of what the trade offs of those changes would be.
As engineering leaders, this is a key part of our job. We don't just blindly add processes to prevent every issue. I should add that we also need to analyse our existing processes to see what should change or is not needed any more.
However, you say you're agreeing with scott_w, and scott_w is criticizing the article. So this is confusing.
I'm going to strongly disagree with this (when it's done well, not bureaucratically).
We review what can be improved due to problems and we incorporate it into our basic understanding of everything we do. It's the gaining of experience and muscle memory to execute fast while also accounting for things proactively.
It's a long-term process but the payoff is great. Reduced time and effort on problems after the fact ends up increasing the amount of valuable work produced over the long term.
The key is to balance this process pragmatically.
As far as I can tell, the author doesn't really give any generalized advice on how careful you should be, he's just pointing at the "carefulness dial" and telling people to make an informed decision.
there's a tradeoff on shipping garbage fast which won't explode in your hands and getting promoted. and there's also the political art of selling what you want to other people by masking it as what they want.
you and i and most people here will never understand any of that. good luck. people who do understand will have the careless knob stuck at 11.
these analogies help us rational people point out the BS at least, without having to fight the BSer.
Your comment starts with an ableist slur so I’m sure it’s going to be good /s
> you and i and most people here will never understand any of that. good luck. people who do understand will have the careless knob stuck at 11.
Nah, reading this comment wasn’t worthwhile after all.
> these analogies help us rational people point out the BS at least, without having to fight the BSer.
How cute, you think you’re “rational.”
What is your objection? Oversimplification? Lack of utility?
Same energy here. “Be more careful” is extraordinarily hand-wavy for a profession that calls itself engineering.
Yeah, sure, that never happens. That's why "I told you so" is not at all a common phrase amongst folks working on reliability-related topics ;)
The key here is automating your "carefulness" processes. This is how you push that effectiveness curve to the right. And, the corollary here is that a lack of IC carefulness is not to blame when things break. It is always, always, always process.
And to reemphasize the main point of TFA, things breaking is often totally fine. The optimal position on the curve is almost never "things never break". The gulf between "things never break" and "things only break .0001% of the time" is a gulf of gazillions of dollars, if you can even find engineers motivated enough and processes effective enough to get you anywhere close to there. This is what SLAs are for: don't give your stakeholders false impressions that things will always work because you're the smartest and most dedicated amongst all your competitors. All I want is an SLA and a compensation policy. That's professional; that's engineering.
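To put rough numbers on that gulf, the back-of-the-envelope availability arithmetic looks like this (targets only; the cost side is the hard part):

```python
# Allowed downtime per year for a few availability targets.
MINUTES_PER_YEAR = 365 * 24 * 60

for target in (0.99, 0.999, 0.9999, 0.99999):
    allowed = MINUTES_PER_YEAR * (1 - target)
    print(f"{target:.3%} availability -> {allowed:8.1f} minutes of downtime/year")

# 99%     -> ~5256 minutes (~3.7 days)
# 99.9%   -> ~526 minutes  (~8.8 hours)
# 99.99%  -> ~53 minutes
# 99.999% -> ~5 minutes
# Each extra nine is typically far more than incrementally harder to deliver.
```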
But if you can trust ICs a bit, you can move faster
This is not hard.
Enumerate risks. List them. Talk about them.
If you want to turn it into something prioritisable, for each one, quantify it: on a scale of 1 to 10, what's the likelihood? On a scale of 1 to 10, what's the impact? Multiply the numbers. Communicate these numbers and see if others agree with your assessment of them. As a team, if the product is more than 15, spend some time thinking about mitigation work you can do to reduce either likelihood or impact or both. The higher the number, the more important it is to put mitigations into your backlog or "definition of done". Below 15? Check with the team that you're going to ignore this.
Mitigations are extra work. They add time. They slow down delivery. That's fine, you add them to your backlog as dependent tasks, and your completion estimates move out. Need to hit a deadline? Look at descoping, and include in that descoping a conversation about removing some of the risk mitigations and accepting the risk likelihood and impact.
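A minimal sketch of that scoring, with made-up risks and the same arbitrary 1-10 scales and 15 threshold:

```python
# Likelihood x impact scoring, straight from the process described above.
THRESHOLD = 15  # above this, plan mitigation work; below, agree to ignore

risks = {
    # name: (likelihood 1-10, impact 1-10) -- purely illustrative entries
    "third-party API rate limits us during launch": (6, 5),
    "data migration corrupts a handful of rows":    (3, 8),
    "logo renders slightly off-centre on IE11":     (7, 1),
}

for name, (likelihood, impact) in sorted(
    risks.items(), key=lambda kv: kv[1][0] * kv[1][1], reverse=True
):
    score = likelihood * impact
    action = ("add mitigation tasks to the backlog"
              if score > THRESHOLD
              else "check with the team, then accept")
    print(f"{score:3d}  {name}: {action}")
```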
Having been EM, TL and X in this story (and the TPM, PM, CTO and other roles), I don't want a "knob" that people are turning in their heads about their subjective measure of "careful".
I want enumerated risks with quantified impact and likelihood and adult conversations about appropriate mitigations that lead to clear decisions.
The whole point is whether or not to turn it up, not what the process is once it's turned up. Not everyone can afford to waste time planning so much.
If you have a list of risks and you don't know how to mitigate them, you're just yolo'ing.
"We don't have time to plan" is the biggest source of nonsense in this industry. The process I just described takes about 15 minutes to go through for a month's worth of work. Nobody is so busy they can't spare 15 minutes to think about things that might cause major problems that soak up far, far, far more time.
Some tasks are hard to estimate because they have an element of experimentation or research. Here a working model is the "run-break-fix" model, where you expect to require an unknown number of attempts to solve the problem. In that case there are two variables you can control: (1) be able to solve the problem in fewer tries, and (2) take less time per try.
The RBF model points out various problems with carelessness as an ideology. First of all, being careless can cause you to require more tries. Being careless can cause you to ship something that doesn't work. Secondly, and more importantly, the royal road to (2) is automation and the realization that slow development tools cause slow development.
That is, careless people don't care if they have a 20-minute build. It's a very fast way to make your project super-late.
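To put made-up numbers on it, the RBF arithmetic is just attempts times cycle time:

```python
# Run-break-fix: total time ~= (number of attempts) x (time per attempt).
# Carelessness tends to push both factors up; slow builds push the second up.
def total_hours(attempts: int, build_minutes: int, think_minutes: int = 10) -> float:
    return attempts * (build_minutes + think_minutes) / 60

print(total_hours(attempts=15, build_minutes=20))  # 7.5 hours of wall clock
print(total_hours(attempts=15, build_minutes=2))   # 3.0 hours with a fast build
print(total_hours(attempts=25, build_minutes=20))  # 12.5 hours if sloppiness adds retries
```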
I worked at a place that organized a 'Hackathon' where we were supposed to implement something with our project in two hours. I told them, "that's alright, but it takes 20 minutes for us to build our system, so if we are maximally efficient we get 6 tries at this". The eng manager says "it doesn't take 20 minutes to build!" (he also says we "write unit tests" and we don't, he says we "handle errors with Either in Scala" which we usually don't, and says "we do code reviews", which I don't believe) I set my stopwatch, it takes 18 minutes. (It is creating numerous Docker images for various parts of the system that all need to get booted up)
That organization was struggling with challenging requirements from multiple blue chip customers -- it's not quite true that turning that 20-minute build into a 2-minute build will accelerate development 10x, but putting some care in this area should pay for itself.
[1] https://www.amazon.com/Have-Fun-at-Work-Livingston/dp/093706...
If your company is needing to have conversations like this more than rarely—let alone experiencing the actual issue being discussed—then that's a fundamental problem with leadership.