Why we wake up for incidents.
We commit to answering critical incidents with a real engineer in under thirty minutes. This is the case for why we made that promise, what on-call actually looks like inside a small team, and what we give up to keep it.
When we tell a prospective client that a real engineer will answer a critical incident in under thirty minutes, the most common reaction is a polite nod. The kind of nod you give when someone says "free coffee on Mondays." The promise is so common in vendor decks that it has lost its meaning. We make it anyway. Here is the case for why.
01 What the promise actually contains.
Three words do the work here, and all three matter.
"Critical." Not every page is a critical incident. A failed image upload is not. A misconfigured Mailchimp DKIM record is not. A critical incident is anything that has stopped the business from making money, anything that has exposed customer data, and anything that breaks an SLA we wrote into a contract. We define this list explicitly with every client before we onboard them. The list lives in a single markdown file. If something is not on it, we will respond — but the clock does not start.
"A real engineer." Not a ticket-acknowledgement bot. Not an account manager who will "loop in the team." A specific human with commit access to the affected system, who has either written the offending code or read it carefully, opens a session and starts diagnosing. We do not consider the clock stopped by an automated reply.
"Under thirty minutes." Measured from the moment the alert fires to the moment the engineer's first keystroke lands in the affected system. Not the moment they wake up. Not the moment they read the page. The moment they start work.
These three definitions are the entire commitment. The promise is small. The work to hold it is not.
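One way to make the three definitions concrete is a small sketch of how the clock could be computed. This is illustrative only, not a description of any real tooling: the `Incident` record, its field names, and the two helper functions are hypothetical. It assumes "under thirty minutes" means strictly less than thirty, and it records an automated reply without ever letting it stop the clock.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RESPONSE_WINDOW = timedelta(minutes=30)  # "under thirty minutes"


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    alert_fired_at: datetime                       # the clock starts here
    is_critical: bool                              # on the client's agreed list?
    first_keystroke_at: Optional[datetime] = None  # engineer starts work in the system
    auto_reply_at: Optional[datetime] = None       # logged, but never stops the clock


def response_time(incident: Incident) -> Optional[timedelta]:
    """Elapsed time from alert to first keystroke, or None if no clock applies yet."""
    if not incident.is_critical:
        return None  # non-critical pages get a response, but the clock does not start
    if incident.first_keystroke_at is None:
        return None  # nobody has started work; an automated reply does not count
    return incident.first_keystroke_at - incident.alert_fired_at


def within_window(incident: Incident) -> Optional[bool]:
    """True or False once the clock has stopped, None when there is nothing to judge."""
    elapsed = response_time(incident)
    if elapsed is None:
        return None
    return elapsed < RESPONSE_WINDOW
```

Note that `auto_reply_at` is deliberately ignored by both functions: in this sketch only a human's first action in the affected system ends the measurement.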
02 Why most agencies cannot hold this line.
Three structural reasons.
The first is outsourcing. A typical agency's "24/7 support" is a managed service desk in a different timezone, staffed by people who do not know your codebase and who have a runbook that boils down to "escalate to the agency's account manager during business hours." The clock starts when you open a ticket. It does not stop until someone at the agency reads it. That window is rarely under thirty minutes. It is rarely under three hours.
The second is bench depth. A genuine on-call rotation needs a primary, a secondary, and a tertiary engineer with overlapping context across every system on the contract. At an agency of twenty-five people split across forty clients, that depth does not exist for any individual account. There are two people who know your system, and if either one is unreachable, the page becomes a Tuesday-morning conversation.
The third is financial. Real on-call is expensive. The engineer carrying the pager cannot deploy anything new during their week. They cannot drink. They cannot fly. They cannot mentor a junior on a long block of focused work. At a typical agency, that cost shows up as forty percent of an engineer's time billed at zero. The pricing model does not absorb this. So the practice quietly stops.
"24/7 support" without these three things is theatre. The light is on. No one is at the desk.
03 On-call inside a small team.
Here is how it actually runs, in concrete terms.
The rotation is one week, alternating. The primary on-call engineer does not deploy anything new to production during their week. They do not start anything they cannot finish in a sitting. Their job is to be alert, to know what is shipping, and to be reachable.
The secondary on-call engineer covers the primary if they cannot respond within ten minutes. If the secondary also cannot respond, the rotation collapses to whoever is awake. We carry physical pagers as well as phone alerts, because the failure mode we are most worried about is a phone on silent next to a sleeping engineer.
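The escalation order described above can be sketched as a small function. This is a minimal illustration under stated assumptions, not the actual paging configuration: the function name and the role labels are hypothetical, and the ten-minute wait before widening from the secondary to everyone is an assumption, since the text only gives the primary's ten-minute limit.

```python
from datetime import timedelta
from typing import List

PRIMARY_TIMEOUT = timedelta(minutes=10)    # from the text: secondary covers after ten minutes
SECONDARY_TIMEOUT = timedelta(minutes=10)  # assumption: the text does not state this value


def escalation_targets(elapsed: timedelta, acknowledged: bool) -> List[str]:
    """Return who should be paged, given time since the alert fired."""
    if acknowledged:
        return []  # someone is already working the incident
    targets = ["primary"]
    if elapsed >= PRIMARY_TIMEOUT:
        targets.append("secondary")
    if elapsed >= PRIMARY_TIMEOUT + SECONDARY_TIMEOUT:
        # the rotation collapses to whoever can be reached
        targets.append("whoever is awake")
    return targets
```

For example, `escalation_targets(timedelta(minutes=12), acknowledged=False)` returns `["primary", "secondary"]`. The list only grows while the alert is unacknowledged, so the primary keeps being paged even after the secondary is added.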
After a wake-up, the on-call engineer is off client work for the following twenty-four hours. This is not a perk. It is a constraint. Decisions made on three hours of sleep are decisions we have to undo a week later, and the cost of undoing them is higher than the cost of the lost day.
We write a postmortem for every incident that touched production, no matter how brief. We sign the postmortem with two names, not one. The pattern of postmortems is reviewed quarterly. We look at the same things every time: time-to-detect, time-to-mitigate, time-to-resolve, and how the proximate cause traces back to a root in the system. The numbers stay private. The patterns inform what we charge for next.
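The three timing figures in that review can be illustrated with a short sketch. Everything here is hypothetical: the `Postmortem` record, its field names, and the choice to measure all three intervals from the moment the fault began (definitions vary, and the text does not fix one). The median is likewise just one reasonable way to summarise a quarter.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Dict, List, Tuple


@dataclass
class Postmortem:
    """Hypothetical postmortem record; field names are assumptions."""
    started_at: datetime     # when the fault began
    detected_at: datetime    # when an alert or a person noticed it
    mitigated_at: datetime   # when the impact stopped
    resolved_at: datetime    # when the underlying fix landed
    signed_by: Tuple[str, str]

    def __post_init__(self) -> None:
        # "two names, not one": require exactly two distinct signers
        if len(self.signed_by) != 2 or self.signed_by[0] == self.signed_by[1]:
            raise ValueError("a postmortem needs two distinct signers")


def minutes_from_start(pm: Postmortem) -> Dict[str, float]:
    """All three intervals, in minutes, measured from the start of the fault."""
    def mins(t: datetime) -> float:
        return (t - pm.started_at).total_seconds() / 60
    return {
        "time_to_detect": mins(pm.detected_at),
        "time_to_mitigate": mins(pm.mitigated_at),
        "time_to_resolve": mins(pm.resolved_at),
    }


def quarterly_medians(postmortems: List[Postmortem]) -> Dict[str, float]:
    """Median of each interval across a quarter's postmortems."""
    rows = [minutes_from_start(pm) for pm in postmortems]
    if not rows:
        return {}
    return {key: median(row[key] for row in rows) for key in rows[0]}
```

A median is used rather than a mean so that a single long outage does not dominate the quarter's picture.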
04 What it costs.
The honest answer: roughly half the work we could otherwise sell.
We refuse projects that would put us over our on-call ratio. We refuse clients whose systems run in stacks we cannot get on the phone with at three in the morning. We refuse to onboard a new account inside the two weeks before a major holiday, because the new engineer is not yet a viable primary.
We pay people more than the local market for the same engineering work, because part of what we are buying is their willingness to be unavailable for things they would otherwise enjoy. We do not romanticise this. It is a real trade. Some engineers don't take the trade twice.
05 What it earns.
One thing: the right to be trusted with the work that hurts when it breaks.
A site outage at 11pm Friday is a different conversation than a roadmap meeting on Wednesday. It is also where most agency relationships actually live, even when neither side admits it. The work that buyers will pay a premium for is the work they cannot afford to have done badly. Everything else is interchangeable.
If you cannot answer the page, you are competing on price for the parts that don't matter. If you can, you are not competing at all.
We publish the response window publicly because it is the most expensive thing we do and the easiest thing to verify. The day we cannot hold it is the day we close the company or fix what broke. There is no third option.