LOL i wish
They are able to staff 24x7 by spreading the cost over multiple customers, and working through the process of making your application manageable by a 3rd party is super beneficial.
Most of these companies will also do performance monitoring and analysis as well.
They see issues and optimization opportunities across multiple applications and know more than a single team who's only built one.
We typically split our teams, so we have ~16 split across two time zones so that our shifts are just 12 hours during the day. It works well, but it is expensive, so we support a lot of services (or a small number of very high priority services) as a result.
I'm finding surprisingly little discussion on HN regarding the costs/benefits of MSPs. Or rather, under which conditions (such as company size) they make sense.
Any big players or companies you would recommend?
[1] also Wednesday/Thursdays. Wednesdays were my favorite in good working environments, it felt like running a successful marathon, but it was more prone to falling apart due to short-term thinking.
You start off over the weekend, when you have energy and can survive the two days alone. Ideally no Friday releases so the transition is calm, but as the writer says the batches might fail.
You spend the week fixing whatever breaks. You’re cleanly off the Monday to Monday sprint, just doing on-call/ops.
You finish Friday evening and immediately get Friday night and the weekend to recover when you need it most.
One thing I really liked in a previous job was a split daytime-vs-nighttime rotation. It was well worth a little annoyance to set up in our tools. One week you'd be the 'daytime' oncall for business hours (something like 9-5 Mon-Fri, though we might have tweaked those hours a bit; it might have been 10-6 or something). The next you'd be on call for the complementary time (5-9, weekends). You were on call for the same total amount of time, just smeared over two different weeks. It ended up being less of a burden to optimize your schedule for a reasonable response time, but operational work still got done. And in practice awareness of operational issues was not too hard to maintain between the two members of the split.
(I think the best thing, if you can swing it, is probably a follow-the-sun rotation where there are three teams distributed 8 hours apart around the globe, and they trade off 8-hour workday shifts. But a lot of uncommon things probably have to be true of your organization for that idea to even be on the radar.)
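For anyone wanting to picture the split daytime-vs-nighttime rotation described above, here is a rough sketch. The team names, the 9-17 business hours, and the choice of week 0 are all assumptions for illustration; the comment only says the hours were "something like 9-5 Mon-Fri".

```python
from datetime import datetime, date

# Hypothetical team and hours, not taken from the original comment.
TEAM = ["ana", "ben", "chi", "dev"]
DAY_START, DAY_END = 9, 17
EPOCH = date(2024, 1, 1)  # a Monday, chosen arbitrarily as week 0


def on_call(when: datetime, team=TEAM) -> str:
    """Return who holds the pager at `when` under the split rotation."""
    week = (when.date() - EPOCH).days // 7
    business_hours = when.weekday() < 5 and DAY_START <= when.hour < DAY_END
    if business_hours:
        return team[week % len(team)]
    # Off-hours go to whoever held the daytime shift the previous week.
    return team[(week - 1) % len(team)]


print(on_call(datetime(2024, 1, 1, 10)))  # ana (daytime, week 0)
print(on_call(datetime(2024, 1, 8, 22)))  # ana (off-hours, week 1)
```

Each person carries the same total hours as in a plain weekly rotation, just spread over two consecutive weeks.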
Hand-off meetings with the whole team work really well (in my opinion!) when you have a relatively small team--we have 9 FT teammates. Often someone else may have been delegated the page or bug that arose and can discuss how they handled it, or someone who wasn't involved may have insight for how to handle a situation better the next time. Since we're all going to be on rotation at least once during a quarter, it's great to know what happened in case a similar page pops up later.
Finally, we also fill out a running Doc before/during the meeting with links to the pages/bugs, along with short descriptions of how they were handled. This forms a great living memory of how to deal with incidents, and is also often the birthplace of new playbooks for handling new types of incidents.
That reminds me of Amazon's abysmally bad employee discount, which was "10% off anything on the site, up to $100 / year".
But let's align incentives. Any time spent fixing issues on-call is compensated 4-to-1. Workers may accrue compensation time, and any compensation time in excess of 20 hours is paid 10-to-1 when the employee leaves. The idea here isn't for workers to accrue and cash out comp time, but instead to give an incentive to the organization to ensure workers use their comp time.
Let's align incentives, what's hard on the worker should be hard on the owners and management.
When that's done, chill for a while, do some recruiting and education in the workplace and think about what the next realistic change ought to be.
Oncall should be compensated, always. The oncall person should get a flat rate just for being on standby, and should also receive a per-page payout, and that amount should be larger if the page happens outside regular business hours.
Then management will actually realize there's a cost to pushing features and pulling in deadlines at the expense of robust engineering practices. Or they can decide they are fine with that, and paying the oncall person is a cost of doing business the way they want to.
I've seen too many instances where issues that come up during oncall never get fixed, and just page and page and page.
I will never again work at a company where oncall is "just a part of the job". I value my own time too much.
I was going to say, this would almost certainly be the outcome. Companies have no problem throwing millions at AWS, DataDog, etc. They certainly aren’t going to blink at an employee making a couple hundred bucks extra per day.
In my company we get approximately 800€ for every week of on-call and each hour of intervention is also compensated with salary.
From my point of view this should be high enough for the company to be willing to focus on on-call issues. After years of being on-call I must admit the salary is comfortable but it doesn't cover the pain and constraints of being on-call: being kinda "stuck" at home basically, lots of consequences on private life etc.
I would be interested in how response times are?
Mine is 15mins. So I have to respond and be in an incident call within 15mins.
If there was an actual incident, you'd get paid as if that was overtime worked, and it depended on when it occurred (e.g. weekends and public holidays carried a higher than normal multiplier). There were also limits on how much rest you'd need to be guaranteed, etc.
On average, our actual incidents were relatively infrequent, and the pay out mostly depended on the size of the team, which dictated how often you got rotated in. It worked out to something like +10% salary though.
If you got paged you'd get 150% of your hourly pay, per started hour. So if you got paged at 22:00 and again at 3:00 that's three hours of pay, regardless of each issue only taking 5 or 10 minutes to fix.
That's roughly $1000/€950 per week of on-call, plus the hours. You'd have four/five of these per year and you could pick up an extra month of pay per year with the standby pay, plus the hours and maybe pick up another day here and there.
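The "per started hour" rule above can be sketched like this. It assumes each page is billed separately and that started hours are counted from the moment of the page rather than by clock hour; the function name is made up for the example.

```python
import math

OVERTIME_MULTIPLIER = 1.5  # "150% of your hourly pay"


def paid_hours(page_durations_minutes):
    """Hours of pay owed for a list of page durations, billing each started hour."""
    # Assumption: every page counts for at least one started hour.
    started = sum(max(1, math.ceil(m / 60)) for m in page_durations_minutes)
    return started * OVERTIME_MULTIPLIER


# Two pages (22:00 and 03:00), fixed in 10 and 5 minutes.
print(paid_hours([10, 5]))  # 3.0
```

Two started hours at 150% come to three hours of pay, which matches the figure in the comment.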
Holidays were normally distributed on a volunteer basis (you'd still get paid, but you opted-in to those days). So maybe I'd be home on New Years, but out on Christmas, so I'd offer to cover Christmas, while another colleague might care more about being able to drink on New Years.
Originally we were almost 50 people handling on-call, so you'd have one week per year, but that's not sustainable, you forget how to handle common issues or how to fill out incident reports and handoff correctly.
The most stupid on-call schedule I ever did was midnight to midnight, every other day... It was only me and my boss. That was incredibly stupid, because you couldn't go out on one day, and the day where you could go out, you had to be careful about having one drink too many.
I really wish we'd gotten paid for hours worked rather than TOIL, not for personal preference but because it would have aligned the company's incentives better. We might actually have fixed some of the problems if not doing so cost the business a tangible sum of money.
Still, it was better than working for free.
Worked well until E&Y came in and "fixed" things with a strategic plan.
Anyway, I prefer Mon - Thu, Fri - Sun shifts.
My team have put a lot of effort into only rarely being paged: a normal week on-call won't have any out-of-hours activity at all.
I'm not even sure if doing on-call duty without compensation is legal in my country.
In the past, there were some cases of "fake" incidents, but the amount of documentation makes sure that the company is able to crack down on this.
Big ecommerce companies and even global brands like AWS and BMW require 24x7 on call without any compensation.
Best on-call I’ve had.
You’re making yourself available 24/7. That has a non trivial lifestyle impact which I’ve always thought deserves more than is typically rewarded.
If I receive 100 total units of compensation, I'd way rather get 100 units of base pay (and 0 on-call pay) than 90 units of base pay and 10 units of specific on-call pay. (What if the company eliminates on-call? What if I get injured and my insurance only covers base pay? Severance is usually based only on base pay; I would not be paid on-call while I'm on PTO or other paid leave, annual raise percentages typically apply to base pay, etc...)
What will financially encourage my company to stop paging me overnight if there isn't a labor cost to the company every time an on-call incident occurs?
> What if I get injured and my insurance only covers base pay?
Insurance payouts can be easily based on wages that include reported commissions, tips, and overtime. They can very easily be based on an average of past actual wages paid in the last handful of months at the company.
> Severance is usually based only on base pay
Severance is a completely optional practice that is based entirely on what the company wants to do. I would argue that severance is more accurately based on "The lowest safe number to pay to this particular employee to make sure their termination does not become a legal risk."
> I would not be paid on-call while I'm on PTO or other paid leave
But also, PTO days and on-call days don't intersect. If you took time off during an on-call shift you would be trading it with a team member, so you would never lose that extra wage.
Example: I'm taking a week off, it's during my scheduled on-call shift. I would normally get paid my on-call hours but I didn't this week. But when I get back from my vacation, I'm picking up an extra on-call shift because my team member covered my shift when I was on vacation.
Now, I'm taking a week off, but it's not during my on-call shift. I wouldn't have been paid on-call hours this week anyway. When I get back from my vacation, I am going on my normally scheduled on-call shift.
I personally have never felt compensated dynamically enough for on-call schedules. Most corporate jobs seem to pay for a sliver of the life disruption, maybe paying for half my phone and Internet bill or something like that. They all say that the on-call is baked into the compensation, but I'm not so sure.
I think this is true in _most cases_, but is not a given. I myself have encountered scenarios where it isn’t true: switching with someone much later in the rotation, only to then end up having to switch again for instance. You could envision a nefarious teammate weaseling out of their fair share with sneaky switches like this, too, though paying for it would maybe incentivize them not to!
Almost right! I see it as an extension of what I call the basic rules, "I am as nice to you as you are to me", and "I care exactly as much as you do."
That does, in some cases, expand severance a little beyond the cold risk calculation. If the severance is going to someone who helped the company make it, then helping make sure they make it to their next gig is part of the equation.
Not everyone boils it all down that far, but a whole lot of us do!
Which makes your comment solid, and mine a quibble, but one I consider worthy of some discussion.
If you have any national holidays, somebody still ends up being on-call for that holiday. I've been on-call for almost every US holiday this year.
Lugging around a laptop and the on-call phone when going anywhere, checking every now and then when the phone was not with you for a while (e.g. pool, gym etc), making sure you don't go places with no signal was enough of a PITA that knowing we were paid every hour of that had a nice psychological effect.
If I’m not paid to be reachable, nobody gets to complain when I don’t pick up the phone though.
Most don’t even like the question. For them such questions are red flags or the candidate is not “motivated enough”. Rarely, some even have a follow-the-sun policy. They might have one in their HQ, true for a lot of US/EU firms, but their offices in a developing country like India - it’s always something on the lines of “oh, engineers here take full ownership; they are the owners”.
Also, I have seen — 2-3 days rotation with follow the sun is best, week long or longer being worst.
Then there are companies where it could be forever on-call with no follow the sun - e.g. Amazon, Uber (in India at least). That’s another world altogether.
Cooperatives really should be more common.
It's just rhetorical trickery.
Tech entrepreneurs should give more weight to choosing markets that don’t require this
Tech entrepreneurs should give no weight to this. The market seems to support engineers doing on-call rotations, and a service that can’t tolerate any downtime is (theoretically) a service that is worth a lot to a lot of people- which is perfect for monetizing.
Tech entrepreneurs should stop giving excessive “nines” of availability. Even 99% is probably enough for most customers to never notice, and significantly easier to engineer than 99.999….
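To put numbers on the "nines", here is the downtime budget each target implies. This is plain arithmetic on a 365-day year and says nothing about what customers will actually notice.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Downtime budget implied by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.1f} min/year ({minutes / 60:.1f} h)")
```

99% leaves about 87.6 hours (roughly 3.65 days) of downtime a year, while five nines leaves a little over 5 minutes.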
The connection makes sense but one must not think in this order. One must think “people will pay for this” and then consider “does this need to be highly available?”
If you have more than one road to choose from, and one of them doesn’t require high availability, then give that one some bonus points for that.
There’s a time and a place for heroics, but we go to it for shit that doesn’t really matter, or worse, allow the culture of heroics to cover up the real problems that are much harder to fix.
I’m not saying no one should create a highly available web service. I am saying that this is one of those things that techies assume, and shouldn’t, because it’s a huge plus to hiring, business and engineering simplification, and morale if you can define away non-business-hour problems.
For example, they may not want to fix quality issues as long as their consequences can be pushed to the weekend. Or they may start to demand people work weekends to do maintenance.
Or -- worst of all -- they realise they can avoid deployments entirely on weekdays, and then do these big bang deployments on weekends.
This makes engineers' lives miserable but looks like rational optimisation to management.
- Mon/Tue - Wed/Thu - Fri - Sat/Sun
Original reason for this schedule was that on-call was paid by days per quarter in a tiered system so this guaranteed that all members got the 5% on-call for 10 days/quarter rather than one person hitting 9 days and dropping to 3%, but I stand by this as a better on-call rotation.
The number of people does need to not divide evenly, so that the days rotate; if you run into this you can combine Fri into Sat/Sun or break Sat/Sun apart. It’s a bit complex to set up but the mental impact of on-call is greatly reduced and if you need a week for vacation you can much more easily find someone to cover your shift for a couple days in a nearby week rather than ending up with 2 weeks back to back 6 weeks from now. And if you pull a weekend you get the week off rather than losing your weekend to on-call and going into a work week still on-call.
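The divisibility point above can be checked with a small sketch. It assumes shifts are handed out round-robin in a fixed order of people, which the comment does not spell out. Under that assumption everyone cycles through all four shifts only when the team size shares no common factor with the number of shifts.

```python
from math import gcd

SHIFTS = ["Mon/Tue", "Wed/Thu", "Fri", "Sat/Sun"]


def shifts_seen(person: int, team_size: int, shifts=SHIFTS):
    """Set of shifts one person ever lands on when shifts are dealt round-robin."""
    seen = set()
    # After team_size * len(shifts) slots the pattern repeats.
    for slot in range(person, team_size * len(shifts), team_size):
        seen.add(shifts[slot % len(shifts)])
    return seen


for n in (4, 5, 6):
    rotates = gcd(n, len(SHIFTS)) == 1
    print(n, "people:", sorted(shifts_seen(0, n)), "| fully rotates:", rotates)
```

With 4 people the first person is stuck on Mon/Tue forever, with 6 they alternate between Mon/Tue and Fri, and with 5 they see every shift. Merging Fri into Sat/Sun or splitting Sat/Sun changes the number of shifts, which is why it fixes the problem.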
I'm pretty sure that'd be illegal here in .au
On call coverage while an employee is on vacation is a management problem, not an employee problem.
At my current job we have an automated scheduler which uses our gcal to ensure that it never schedules if people have an AFK entry. It also schedules fairly based on how long since the person was last on-call, not putting them on on a weekend if they were on last weekend etc (we do 24hr shifts).
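A scheduler along these lines could look roughly like the sketch below. It is not the actual tool: the calendar lookup is replaced by a plain dict of AFK dates, and the function and parameter names are invented for the example.

```python
from datetime import date


def pick_on_call(day, people, afk, last_shift):
    """Pick the person for a 24-hour shift on `day`.

    afk:        dict name -> set of dates marked AFK (stand-in for calendar data)
    last_shift: dict name -> date of that person's most recent shift
    """
    def worked_recent_weekend(name):
        # True if their last shift was a Sat/Sun within the previous 8 days.
        last = last_shift.get(name)
        return (last is not None and last.weekday() >= 5
                and 0 < (day - last).days <= 8)

    candidates = [p for p in people if day not in afk.get(p, set())]
    if day.weekday() >= 5:
        rested = [p for p in candidates if not worked_recent_weekend(p)]
        candidates = rested or candidates  # fall back if nobody is rested
    if not candidates:
        return None
    # Longest time since last shift wins; never-scheduled people go first.
    return min(candidates, key=lambda p: last_shift.get(p, date.min))


people = ["ana", "ben", "chi"]
saturday = date(2024, 6, 8)
afk = {"ana": {saturday}}
last_shift = {"ana": date(2024, 5, 20), "ben": date(2024, 6, 1), "chi": date(2024, 6, 5)}
print(pick_on_call(saturday, people, afk, last_shift))  # chi
```

Here ana is skipped because of the AFK entry and ben because he covered the previous Saturday, so chi gets the shift even though ben has waited longer.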
That's exactly the time where "finding your own cover" is the most stressful.
Mon-Mon is nice because it’s a logical time to start something fresh at the start of the week. Tuesday is good for the reasons in the post, Wednesday is similar. Thursday is nice because after you’re done you can relax on Friday. Friday-Friday is less common but can be nice because you get the satisfaction of being done on the last day of the week.
"- Step 1: handling it
- Step 2: making sure it doesn’t happen again
So when a major issue happens over the weekend only Step 1 happens during the weekend. Step 2 involves following up with other teams, creating new alarms and updating the runbook. And all that usually happens during the week. The oncall is going to spend at minimum their Monday doing that so it’s better if the schedule reflects that."
We previously had a week long rotation, and some folks were initially skeptical of the idea to change, saying they were worried they'd feel like they were "oncall all the time". But, they agreed to try it for a month. That was a bit over a year ago now, and no complaints.
I think it ends up being a lower stress configuration, because it just becomes part of your normal expected work-week routine, and generally isn't as mentally draining. It does make end of year PTO/holiday time a bit more complex to work out, but so far my team has been okay with that tradeoff.
What about the person who has Friday? Do they never go out on a Friday evening?
Sounds a nice idea in theory but not all week days are equally inconvenient.
When we talk about on-call, we’re not referring to systems like Netflix streaming a major fight for 65 million users, but rather essential infrastructure like healthcare systems, nuclear power plants, military operations, financial markets, and the vast array of SCADA (Supervisory Control and Data Acquisition) systems that monitor and control industrial processes.
These systems are crucial to our safety, economy, and everyday lives, and downtime or failure is not an option.
Before Apple, I worked at Microsoft in the Azure team, where I logged over 2,016 hours of on-call support each year. This involved six 24/7 on-call rotations, each lasting two weeks, with responsibilities alternating between primary and secondary support. While there were certainly tough moments and challenging issues during those rotations, they also provided valuable learning experiences and helped me develop problem-solving skills under pressure.
On-call support is a necessary evil.
If the company/product isn’t large enough to be distributed, is it really important that it have a 10 minute time-to-acknowledge?
I can understand on-call hours if you're a literal firefighter or paramedic who saves lives. I understand that, as a building superintendent, every once in a long while you have to run out and fix a burst pipe before property is destroyed. I don't understand why some of these tech companies have on-call responsibilities like there was some hazard to life or property.
They need five nines of availability to make sure they don't lose one cent of potential ad revenue? Good luck with that, I guess, but I'll be over here actually sleeping through the night.
What is much better, though, is splitting the week into a 4/3 or 5/2 split, with a primary and backup on-call. Primary takes the weekdays, then switches with Backup for the weekend. You’re still sharp and aware of any current issues should the need arise, but the odds of a weekend page are (hopefully) lower, so you can relax a bit.
This of course requires enough people to have a reasonable rotation; 6 at a minimum, but 8 is better.
Expectation is you are 100% oncall during the working day, so it works out pretty well between weekend vs non-weekend shifts.
I much prefer the shorter shifts to a full week. A full week on-call usually means delaying important project work, etc. for a full week.
2) You said they want to. They don't. If you offer same pay for a job with and without it, exactly nobody would choose the job with extra on call duties.
The obvious part you are missing is that people do it because they are paid to do it, and they like money.
So let's say that it just magically happens that when YOU are on call, stuff breaks all the time but when it's your coworkers it doesn’t. You are all paid the same, does it seem fair to you now?
Unless it’s written on paper where a salaried worker will be getting X extra per hour, you are just working for free. The definition of a salaried worker in the US is having 40hrs of total work time averaged throughout a year.
I think we got to the heart of things. This is absolutely not true! Not legally, and not in practice. There are Overtime exempt salaried positions and non-exempt positions [1]. An exempt salary position pays more than $685/week and means you do "the role" however your employer defines it. That can be 40 hours, 80 hours, or whatever they choose. It can require you live on-site for the whole year.
After a weekend of on call, it sucks to have yet another day of on call on Monday. This overpowered all other reasons (most of them listed in the blog post) for us.
Outside work-hours? Most alarms (if they happen) are due to bad alarm configurations. Because nothing ever happens. There was one alert this month, and it was because a randomly generated ID contained the string "ERROR" and was logged due to a warning.
I know that my company isn't the "biggest" (only a few hundred requests per minute) and traffic amount is mostly correlated to usual business hours in my country, so there's just not much happening at night (but never zero traffic). Still, I'm always surprised that other companies seem to have really stressful on-call shifts, because the most annoying part to me is having to carry my laptop if I leave my home for more than 20 minutes.
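The false alarm described above (an ID that happened to contain "ERROR") is the classic failure of substring matching on log lines. A minimal sketch of the difference, assuming a made-up log format of timestamp, level, message:

```python
import re

# Hypothetical log line format: "<timestamp> <LEVEL> <message>".
LEVEL_RE = re.compile(r"^\S+\s+(?P<level>[A-Z]+)\s")


def should_page_naive(line: str) -> bool:
    return "ERROR" in line  # fires on any occurrence, including inside IDs


def should_page(line: str) -> bool:
    # Only look at the level field, never at the message body.
    match = LEVEL_RE.match(line)
    return bool(match) and match.group("level") == "ERROR"


line = "2024-06-08T02:14:07Z WARN retrying request id=x7ERRORq2"
print(should_page_naive(line), should_page(line))  # True False
```

Alerting on a structured field keeps random payload content from waking anyone up.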
I refuse to accept on-call duties, full stop. If a job posting expects it, I don't apply. If a hiring manager says they have it, I do not accept the offer. If management starts talking about maybe implementing it, I protest. If it becomes enacted, I resign.
There is absolutely no situation in which I will ever participate in another on-call shift. I've been there, I've done it, now that chapter of my life is closed. Find some younger kid, pay them better than you paid me for the miserable intrusion on their life. I'm done.
Just wanted to be the voice who says what, hopefully, some of the more seasoned and battle-scarred readers here are thinking.
Once, while traveling in an RV for some work related marketing thing, the discussion turned to the lack of fuel economy...
The RV might perform better if the engine powered the RV by blowing fuel right out the tail pipe. Horrible efficiency, terrible for the planet, and all the negatives packed right into a quick expression.
Your comment is on point. Solid and I just felt like sharing my appreciation for the morbid fun it contains.
Nice work. Worth a healthy chuckle. Thanks.
Sometimes, yes, Devs get called out for stuff outside their control like infrastructure failing. However, at my job, we just had two devs that quit over on call and guess what, their service was one of the worst offenders in "Oops, we pushed a bug to production."
secondly, many if not most of the issues that arise are part of some infrastructure automation or third party service or database. expecting me to be fluent in all of those to be useful in the hot seat is a pretty substantial investment and qualifies me to be an SRE on top of my other duties
thirdly, one major reason why my code might fail in production is that it wasn't sufficiently tested, probably because the service as a whole is basically untestable, and even if it were, building test and test infrastructure is likely not at all valued. in many places just filling in that hole would take a year.
on to the fourth, the story is supposed to be that by operating the service, I'll be incentivized to fix automation and come up with solutions to make it more robust. I actually know how to do this, and every week I'm on call is time that I _don't_ spend doing this. furthermore, getting permission to do so is often like pulling teeth. sounds complicated. sure that would be nice, look at that when you have time in the indefinite future.
so what this often looks like from a development perspective is that I'm being paid to be a developer, I was judged based on my ability to be a developer, but at the end of the day I'm not building the service. I _am_ the service.
I get all political reasons that your code may not work. However, refusing to be on call doesn't fix any of those reasons, it's just ignoring work. Flip side as SRE, I ask if Devs are on call. If they are not, I don't take the job because there is zero incentive for them to fix anything vs churn out 5 features, chuck it over the fence and be like "Ops problem now"
I agree. For me though, it gives me pride to own my services and be fully accountable to the business, especially as part of a team with whom I build comradery, and of course our value to the business justifies our good compensation. It only works because we are empowered to make decisions that keep our on calls sustainable.
Making people work 24/7 is not conducive to good anything, thus on call is a terrible way to do things.
If on call balloons your 40 hours to 70 hours, yes, you have an issue. That's not normal and you should consider changing jobs.
The corporate gaslighting is strong with this one.
I always take responsibility for my own work, even after hours fixes, etc. But active on-call orgs usually are just reaping tech debt that others sowed. Sorry not going to rally for that.
Not trying to dunk on you, I’m honestly glad you get to do this, it must make your life considerably better.
If I'm getting paged for a legitimate issue that is related to something I built or maintain, then, yes, I am going to respond on-call. Because it's a fucking privilege to get paid this much money to sit on my ass and type into a screen.
If I'm getting paged repeatedly, or for an issue that isn't my responsibility, then I will get pissed off, and yell and scream until I'm no longer on-call (or they fix the issue, whichever comes first). But I am grateful to be able to have this life. I can spend an hour or two after hours to fix my shit that broke.
In a more healthy situation an on-call rotation is the price of being able to move quickly, get stuff out the door, and have compensation that reflects that the company isn't paying a whole team of extra people to stare at dashboards 24/7 just for the rare situations that things break after-hours.
Gigs with low-overhead + customers that don't expect 24/7 operations are kinda the real sweet-spot dev compensation + role-wise, but ... pretty rare.
1) gigs without 24/7 operations are rare, because there is no good reason for a tech product not to be 24/7. it's not costing extra electricity to keep the lights on overnight, nor more staff. there are a bunch of these gigs (my last gig had no customers for 2+ years) but you shouldn't expect them, because part of the reason we're paid so much money is we're expected to deliver "continuous value". most devs would agree with this, because they all want to be able to deploy continuously, whenever they want. (which is a terrible idea, but it is the status quo.) furthermore, if you're doing your job right (and so is Ops), supporting a 24/7 product should not result in on-call pages, because nothing should be breaking outside regular business hours. if it is breaking outside regular hours, somebody sucks at their job. and Ops' job is pretty simple, so...
2) you do have lots of control over the roadmap, planning, etc. but nobody is going to walk up to you and say "hey we were just thinking of maybe doing this in the roadmap, is that okay with you?" you have to get involved, early, and consistently. you have to show you're not going to rock the boat, but that you will have good suggestions, and can show they will turn into better outcomes. you have to play a little politics, a little product ownership, and also an engineering role, in order to influence what the business decides to do. as you get more senior this gets easier because people will defer to you more, but even an extremely likeable junior can influence the roadmap.
on the off-chance that you're just trapped in engineering hell, with hostile management, a terrible product, and a completely apathetic and terrified staff, quit immediately. this isn't normal and you shouldn't think "oh, I'm trapped here." people don't stay in abusive relationships because there's no other choice, they stay because they've justified their own abuse.
Oh, and that weekend is the weekend before Christmas.
I have never been at a job where on-call was done as well as it could be, and most were/are pretty bad in general. But I could always get changes made to on-call, so that when shit started rolling down hill, it didn't hit me.
(… and I'd like to avoid distracting arguments that amount to "my company does on-call badly" — yeah, those problems do exist and we should strive to fix them. But if I'm to not categorize the argument here as the baby with the bathwater, then we need something to replace on-call with. Prod goes down on a Saturday afternoon; are you going to tell management "tough cookies" until Monday?)
I'm responsible for the software I put into production from 9 AM to 5 PM for about 200 days a year. At 3 AM, I am responsible for taking care of myself by getting a good night's sleep.
If you need 24 hour coverage, taking into account vacations and weekends, you need 5 or 6 people.
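The "5 or 6 people" figure falls out of simple arithmetic if you take the 9-to-5, roughly-200-days-a-year numbers mentioned in this thread as the working time of one person. Those inputs are assumptions; plug in your own.

```python
import math

HOURS_TO_COVER = 24 * 365  # every hour of the year


def people_for_full_coverage(working_days=200, hours_per_day=8):
    """People needed so that someone is on shift every hour of the year."""
    return HOURS_TO_COVER / (working_days * hours_per_day)


needed = people_for_full_coverage()
print(f"{needed:.1f}", math.ceil(needed))  # 5.5 6
```

That is one person per hour with no overlap for handoffs, so it is a lower bound rather than a comfortable staffing level.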
I am responsible for my code, but we need to be realistic about the impact. Not all outages are created equal.
I used to work nights watching over the hardware, operating systems, and applications running in it. We’d do upgrades and break/fix stuff. Some things were worth waking someone up for, but a lot of things weren’t. We’d do what we could to fix it on our own, but for a non-prod environment, it could wait until morning if we couldn’t do it on our own. This idea seems to be lost on people now. I get that 100% uptime of 100% of the systems would be nice, but not at the expense of your employees’ sanity.
I haven’t actually been called yet with the new rotation, but any week I’m on-call I’m a bit on edge. In the past I had some pretty horrible on-call experiences that pushed me close to quitting, which I won’t get into, so I’m preparing for the worst. I worked my ass off to get into a position where I didn’t need to be on-call and put in my time working nights so other people could sleep. Being back on-call feels like a demotion.
First: IT seems to be rather the exception - most professions have no on-call. Eg. even if my car mechanic screws up a service job, they'll have me bring the car back into the garage during their normal working hours, regardless of how and where stranded I am in the middle of the night.
A second comment: I'll be responsible for anything I have created in my own way. The reality of software development is that we implement functional requirements we've been given with which we disagree, we implement non-functional requirements which don't achieve the goal, we are made to use frameworks and tools we're not familiar with, on a short timeline, a low budget and inadequate infrastructure and we're supposed to take responsibility for code our co-workers wrote.
I think there’s actually a fair number of jobs where some level of this is expected.
Doctors are one obvious example — they have on call responsibilities often more onerous than IT, and depending on the situation don’t always receive additional compensation for it.
If you manage people who work different hours from you, in a lot of jobs it’s not uncommon to be called in if shit hits the fan when you’re not working (for example if you’re a hotel manager, to just name one).
I’ve found that any good lawyer I’ve worked with will answer my calls and help me work through things at basically any time of day (their firm might be billing me for the time, but that doesn’t necessarily directly translate to their comp).
Lots of reporters are expected to cover news that breaks on their beat, no matter when it happens.
My doctor (primary care physician) doesn’t work outside of business hours. In an emergency the recorded message says to call an ambulance and go to the emergency department at the hospital, which is staffed by a different set of people.
So it seems they do have at least some separation of the oncall aspect?
Lawyers are another story, there’s a lot of things wrong with that profession and we shouldn’t be trying to copy them.
Even in family practice, it’s not uncommon to be able to get a call back from the on call doctor at the practice on weekends or off hours — if you’ve got a situation that maybe doesn’t warrant the ER, but you’re not sure if it can wait until Monday.
Only if you're dying.
Come in late Friday and you're going to be sitting in a bed until Monday even if your gall bladder is about to explode.
The point was that "on call" is specifically confined as an expectation only to certain types of doctors or under very urgent circumstances.
In addition, doctors have extra special dysfunctions like "too many hours in a shift".
However, many of these are because doctors also have been fighting various efforts to teach more of them which would enable distributing the required extra labor across more people.
But what you’re talking about is a person whose job it is to be oncall. It’s the equivalent of an SRE, rather than a SWE. They’re not doing it because they believe in “you build it, you run it” or anything like that.
No they don't.
I know plenty of people who have had to sit around for 8+ hours because the particular type of doctor is not available. The on call only really applies if you're bleeding out.
In my 20+ years of development and support, there has only been once that I was paged due to an actual catastrophic failure. Most pages are because shitty "SREs" want monitoring on everything, even if it's stuff that I have no control over.
I mean.... On call doctors literally save lives. Most on-call software engineers don't. So.
24/7 coverage is expensive, and mandating that someone is on call 24/7 doesn’t actually provide it.
For the purposes of this exercise presume that our theoretical on-call process is no worse than Google's SRE structure: You are on-call for a 12 hour shift that is more or less aligned with your waking hours, and you are compensated extra for the time you are on-call outside of normal working hours, whether or not you are called in. You are on-call at most one week per month, on average, and usually less.
You are on-call for a 12 hour shift that is more or less aligned with your waking hours
I suppose if you're Google they can theoretically make it so it's more aligned with your waking hours? Do they do it? Most companies don't or can't. I.e. it's _less_ aligned.
you are compensated extra for the time you are on-call outside of normal working hours, whether or not you are called in
How much? In way too many on-call processes this is nothing but a few dollars, just to be able to say "see, we do pay for this, even when you're not called!". As in, nowhere near enough for the number that being on-call does on how you go about your day. Always on edge, always awaiting that call / alert that requires you to drop whatever you are currently doing. Preventing you from actually doing/starting certain things.
You haven't even mentioned the expected reaction and resolution time, and that alone can make a huge difference.
You are on-call at most one week per month, on average, and usually less.
Great, only one week out of four /s That's crazy if you ask me. Going back to preventing you from going about your day in a normal way. There's no "doing on-call well" in how you describe it.
The on-call compensation varies depending on what tier of service they're offering. Tier 1 (5 minute response time) is 2/3 of your effective hourly pay for on-call time outside of local business hours, and tier 2 (30 min response time) is 1/3. Or time off in lieu.
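To put rough numbers on those tiers, here is a minimal sketch. The 2/3 and 1/3 multipliers come from the comment above; the hourly rate, the count of off-hours, and the function name are made up for illustration.

```python
from fractions import Fraction

# Multipliers as described in the comment; all other values are hypothetical.
TIER_MULTIPLIER = {
    1: Fraction(2, 3),  # tier 1: 5 minute response time
    2: Fraction(1, 3),  # tier 2: 30 minute response time
}

def standby_pay(hourly_rate, off_hours, tier):
    """Pay for on-call hours outside local business hours, paged or not."""
    return hourly_rate * off_hours * TIER_MULTIPLIER[tier]

# Assumed week: 12-hour shifts, 8 of them inside business hours on weekdays,
# so 5 * 4 + 2 * 12 = 44 off-hours. Assumed effective rate: 60 per hour.
week_off_hours = 5 * 4 + 2 * 12
print(standby_pay(60, week_off_hours, tier=1))  # 1760
print(standby_pay(60, week_off_hours, tier=2))  # 880
```

Whether that amount is "enough" is exactly what the thread disagrees about; the sketch only shows how strongly the tier and the number of off-hours drive the total.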
If the customer is awake at 3am on a Sunday, it's the customer's problem that they were awake at 3am on a Sunday. If it's a social network, I frankly couldn't care; the customer should go to bed. If it's going to be deployed in the emergency room, fine, we should care, but YOU, management, should find people who are actually willing to take that shift (for extra money, or are based in other time zones).
Plus any large enough company should have teams spread across time zones, eliminating the need for on call if it’s correctly managed.
Have a generalist ops team that is staffed 24x7, or has paid on call as part of the job. They get run books to respond to whatever goes on.
I’ve set this up twice. The first time, we had a team in the Philippines that would cover overnights.
They could start and rollback deployments and do most stuff via the runbook they were provided. Most callouts (5% of escalations) to product teams were due to bad or missing documentation.
The US based team did similar work, just during the day. Both could escalate quality issues for the product team to fix.
The other model was all US, on-call based. We used junior and low-skill folks, who had rotating on-call. They were paid 20% of hourly rate for standby pay and had a minimum pay threshold when they got called. All of that hit the cost center of the offending product or service, so there was both a financial incentive to not get calls, and a human incentive as the engineers didn’t want to get called for escalations. Again, documentation is key.
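A minimal sketch of how that charge-back could be computed. Only the 20% standby rate comes from the description above; the two-hour minimum, the rate, the service names, and the function names are assumptions for illustration, and attributing standby pay to a cost center is left out because the comment does not say how it was split.

```python
from collections import defaultdict
from fractions import Fraction

STANDBY_FRACTION = Fraction(1, 5)  # 20% of hourly rate, per the comment
MIN_CALLOUT_HOURS = 2              # hypothetical minimum pay threshold

def standby_pay(hourly_rate, standby_hours):
    """Pay for simply being on standby, whether or not a call comes in."""
    return hourly_rate * STANDBY_FRACTION * standby_hours

def callout_pay(hourly_rate, hours_worked):
    """Pay for one callout, never less than the minimum threshold."""
    return hourly_rate * max(hours_worked, MIN_CALLOUT_HOURS)

def charge_back(hourly_rate, callouts):
    """Sum callout cost per offending service.

    callouts is a list of (service_name, hours_worked) pairs.
    """
    costs = defaultdict(int)
    for service, hours in callouts:
        costs[service] += callout_pay(hourly_rate, hours)
    return dict(costs)

calls = [("checkout", Fraction(1, 2)), ("checkout", 3), ("search", 1)]
print(standby_pay(40, 100))    # 800
print(charge_back(40, calls))  # {'checkout': 200, 'search': 80}
```

With a minimum threshold, even a 30-minute call costs the owning service two hours of pay, which is the financial incentive described above.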
I have survived 2 cardiac arrests (almost died) during high-stress times. I've been stable for a few years now, but only after I enacted VERY HARD boundaries around work/life and never cut down on sleep for any reason (among other health-first changes I made). I have a significant increase in cardiac arrhythmias any time I don't sleep enough.
I consider myself at this point as having a disability that prevents me from overworking, and I absolutely need my employers to respect that and accommodate that.
I can work normal hours, and that's my offer. If you want to pay me less, that's okay, but I'm not doing on-call unless it's business hours only.
If customers give a shit about uptime at 2am then it's management's responsibility to find people in other time zones to deal with it, or pay extra for people who are willing to sacrifice and risk their health for a customer (I won't take that deal though).
It's not an issue because we don't break prod. I also feel I'm well compensated. When there have been issues at inconvenient hours, my manager has encouraged me to take it easy after resolving the incident. We've also prioritized improving our integration tests and addressing other issues noted during root cause analysis (RCA), which I suspect is why we haven't had any incidents in recent memory.
If on call duties are this frustrating, I'd argue it's team/organizational dysfunction that is the real problem, and bad on call shifts are just one of the symptoms.
Ultimately, somebody needs to be available to fix a production incident. One person suffering from on call duties is better than thousands of paying customers suffering from broken software.
75% on call even if I was never called would be profoundly unhealthy for me. So I wouldn’t dismiss the toll of just being available 24/7.
EDIT: I forgot to mention I am on an on call rotation but it is 1 week on and 7 weeks off. So, not too horrible.
I would expect stricter accountability with a more reasonable on-call schedule.
I was like you, and deep down I probably still am, and I will fall into the same trap again. But please, try flipping your point of view: instead of seeing your manager as generous and your company as good for overlooking the times you failed to answer while on duty, think about how you are being exploited by covering 75% of the on-duty time alone, and about the money the company didn't lose just because of you. And how much of that money you got.
The job can be excruciating though. Don't do it without proper compensation.
I've held several management positions where I've carried a pager. At one employer I helped keep trading databases operational in 7 time-zones on 3 continents. At another I helped fix a backup issue on Christmas eve. Helping those customers was a core part of my responsibility. I fully understood that and took great pride in it.
But as a developer I too will never accept another on-call rotation.
Companies which assign on-call duties to developers make the mistake of forgetting that development, management and operations are different kinds of work which require different environments and skill-sets. Other engineering tasks include testing, documentation, training, and maintenance. At small startups the founders and early employees may do some or all of these, but that becomes impractical at larger established businesses.
Engineers should learn and do all these things in the course of their career but not all at the same time unless quality isn't a concern.
My experience at a unicorn a few years ago convinced me that companies which assign developers to an on-call rotation either don't understand or don't care about the quality or sustainability of their business. In that company senior management was replaced by folks from Google and Facebook shortly after I joined. I was moved into a team where I had no role in the design, development or deployment of its services. I had no say in the hiring or firing of the so-called engineers who rushed failing services into place past a wholly ineffective QA department.
I should have seen the writing on the wall when I began to be pressured by managers and recruiters to rubber-stamp candidates who couldn't pass our coding tests but had spent lots of time on-call. The company's priorities slowly became clearer to me as they grew ever more desperate to live up to their promises. Ultimately I suffered an ischemic attack from the stress of this environment and left the company to focus on my health.
Oh and the company? It let go of most of its engineers a year later and was eventually acquired by a competitor for a few hundred million after having raised over a billion dollars.
They don't get paid extra, but they seem to be very happy with the setup.
Different people deal with being on-call differently, but personally I don't do what I normally would when on-call, whether that's long motorbike rides or hiking etc., because it's not practical to guarantee cell coverage and also the threat of a page ruins the experience. A "day off" whilst on-call isn't equal to a day off.
And since in my country you must have at least 11 hours between shifts, if you get paged at night, you get PTO for the next 11 hours on top.
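As a sketch of how a scheduling tool might apply that rest rule: the 11-hour figure is from the comment above, while the function name and the example timestamps are made up, and real labor rules will have details this ignores.

```python
from datetime import datetime, timedelta

MIN_REST = timedelta(hours=11)  # rest period mentioned in the comment

def earliest_next_start(page_resolved_at, scheduled_start):
    """Return when the next shift may begin after a page is resolved."""
    return max(scheduled_start, page_resolved_at + MIN_REST)

# Hypothetical night page resolved at 03:30 with a 09:00 shift scheduled.
resolved = datetime(2024, 1, 10, 3, 30)
scheduled = datetime(2024, 1, 10, 9, 0)
print(earliest_next_start(resolved, scheduled))  # 2024-01-10 14:30:00
```

A page resolved early the previous evening leaves the scheduled start untouched, because the 11 hours have already passed by then.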
If engineers have blanket control to define what is important enough to get interrupted and to prioritize fixing frequent offenders, then sure, it's a perverse incentive.
If, on the other hand, engineering doesn't have very much control over the roadmap and/or isn't allowed to make their own judgment calls about what really matters for pages, then the arrangement that OP describes makes a ton of sense: it gets pages onto the budget as a separate line item, which is a good way to get the people who are really in charge on board with investing in permanent fixes.
Significantly reduces the number of pages.
I thought this was a pretty good system and despite the cycles being shorter, we had enough engineers to fill a rotation pretty well so that at most you were on call once a month, alternating months between weekend and weekday on-call cycles.
I still do not enjoy being forced into on call and wish I could opt in. We traded weeks a lot, but with smaller rotations or really finicky paging it's awful. I still have a sinking feeling in my gut when I hear the work phone ringtone from somebody else's phone in public, and Murphy's law definitely applies to being on call -- you always get paged the minute after your beer gets delivered at a restaurant.
Wait, what? Don't run your sprints Monday to Monday either. That's been the eventual conclusion on all scrum teams I've been on.
I have never spent this much time on non-development tasks. For 4 weeks every 1.5 months I can't have a life at all. This just screams to me that we are forcing broken or incomplete software out the gate and building huge piles of technical debt that will never get the focus. I remember a time when I would start at 9am and end at 6pm every day and never heard a peep about production issues unless the support engineers couldn't figure it out. Which maybe happened twice a year. To make matters worse, most things are not allowed to be touched in production, at the risk of being fired for making changes. So if you want to "fix" any data or call xyz service you need high ranking approval. It's like being tortured!
As a 50-something-year-old software engineer: it's always been like this. I'm kinda shocked at how reluctant the new generation is to support the systems. Sure, we'd all prefer strict 9-5 hours, but most companies rely on software to stay in business and you need experts available in case things go wrong.
Oncall opens up so many avenues for abuse, don't even ask me how I know. Saying that rejecting oncall means refusing to support your system is bollocks.
I'm happy that younger engineers mostly laugh at that concept and leave. One of the few lessons they are teaching us (the old pricks), especially in the self-care and respect space.
We don't have an on-call rotation but I desperately want one. Because if no one is on-call - then all of us are on-call. Any one of us could be called at any time if one of our larger commerce projects falls over.
To me on-call is a necessary burden that means when I'm not on duty I am completely free to ignore my phone.
I'll certainly feel more positive about helping outside of the 9-5. I do like to be helpful, but perhaps that'll wear off like some kind of honeymoon period, juxtaposed to my current situation.
I'm always looking for more positives in such a system because I want it to work. Tuesday to Tuesday sounds great. Other comments here highlight the difference between critical fixes and patch it laters. Any other insights are welcome.
I still think of each error as a possibility of improvement to asymptotic zero error and I prefer working with people like that.
Others prefer other systems and I think that it’s fine for each of these groups to select for members appropriately aligned.