30-second verdict
Every automation you build will eventually hit an error: an expired login, a rate limit, a renamed field. By default, Zapier, Make, and n8n handle that error by emailing the builder and moving on, which in practice means nobody finds out. Four patterns fix this: retries for temporary failures, a dead-letter path for records that fail anyway, alerts routed to a human who will act, and run logs you check. All three tools support all four. Most stacks we audit have none of them turned on. Setting up the minimal version takes about two hours.
What error handling means, in plain words
Error handling is the part of an automation that decides what happens when a step does not work. Not the happy path where the form submission becomes a CRM contact and everyone is pleased. The other path: the one where the CRM was down for forty seconds, or someone changed a password, or a lead typed their email with a trailing space and the API rejected it. Error handling answers four questions in advance. Should we try again? If trying again fails, where does this record go? Who finds out, and how fast? And a month from now, can anyone see what happened?
Why automations fail silently by default
No-code tools are optimistic by design. They are built to show you the run that worked, because that is the moment you decide to pay for the tool. The failure behaviour is quieter.
Here is what each tool does when a step errors and you have configured nothing. Zapier marks the run as errored in Zap History and emails the person who built the Zap. If the Zap keeps erroring, Zapier turns it off entirely and sends another email. Make stores the failed run and, after repeated consecutive errors, deactivates the whole scenario. n8n logs the failed execution and keeps going, waiting for someone to open the executions list.
Notice the pattern: in every case, the safety net is an email to one inbox, usually the inbox of whoever built the thing two years ago. That person filters the emails after week one, changes roles, or leaves. The automation then fails in total silence. And the hard part of silent failure is that nobody complains, because nobody knows what is missing. A lead fills out your form, the webhook returns a 500, the lead never reaches HubSpot, and sales never follows up. The lead does not call to report that your Zap errored. They buy from someone else.
The errors themselves are boring and predictable. Across 600+ workflows we have built, the same five account for most failures: 401 Unauthorized after a password change or OAuth token expiry, 429 rate limit exceeded during a busy hour, record not found because someone renamed or deleted a CRM property, a timeout during a vendor's bad deploy, and malformed input data, usually a blank required field or a date in the wrong format. The first four are temporary or fixable in minutes. The fifth fails forever no matter how many times you retry it. That distinction drives everything below.
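That split between failures that pass on their own, failures a person can fix in minutes, and records that never pass can be written down as a small lookup. The sketch below is illustrative only: the three category names and the exact status-code mapping (for example, treating "record not found" as a 404) are our assumptions, and real APIs report these conditions in different ways.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    TRANSIENT = "transient"    # likely to pass on a later attempt
    NEEDS_FIX = "needs_fix"    # a human must change something first
    BAD_DATA = "bad_data"      # this record fails on every attempt


def classify_failure(status_code: Optional[int], timed_out: bool = False) -> FailureKind:
    """Map an HTTP-style failure to a handling category (assumed mapping)."""
    if timed_out or status_code == 429:
        return FailureKind.TRANSIENT
    if status_code is not None and status_code >= 500:
        return FailureKind.TRANSIENT
    if status_code in (401, 403, 404):
        # Expired login or a renamed/deleted field: retrying will not help
        # until someone reconnects the app or fixes the mapping.
        return FailureKind.NEEDS_FIX
    # Everything else (400, 422, unknown) is treated as a bad payload so it
    # ends up in front of a human rather than in a retry loop.
    return FailureKind.BAD_DATA
```

Only the first category is worth retrying automatically; the other two need a person, which is where the remaining patterns come in.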
The four patterns, mechanically
Pattern 1: retries with backoff
A retry means the tool attempts the failed step again instead of giving up. Backoff means it waits longer between each attempt: a minute, then five, then twenty-five. The waiting matters. If an API is rate-limiting you, retrying instantly just earns you another 429. Spacing the attempts out gives the vendor outage or the rate-limit window time to pass.
Retries fix temporary failures only. A timeout, yes. A rate limit, yes. A contact record with no email address, no. That record will fail on attempt one and attempt fifty, which is why retries need the next pattern behind them.
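For readers who think in code, here is a minimal sketch of the same idea outside any of the three tools. It reproduces the one, five, twenty-five minute schedule described above and retries only errors marked as temporary. `TransientError`, `backoff_delays`, and `run_with_retries` are hypothetical names made up for this example, not part of Zapier, Make, or n8n.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def backoff_delays(attempts: int, first_delay: float = 60.0, factor: float = 5.0) -> List[float]:
    """Seconds to wait between attempts: 60, 300, 1500 for four attempts."""
    return [first_delay * factor ** i for i in range(attempts - 1)]


def run_with_retries(step: Callable[[], T], attempts: int = 4,
                     first_delay: float = 60.0, factor: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Run step, retrying only TransientError with growing waits in between."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts, first_delay, factor)
    for attempt in range(attempts):
        try:
            return step()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: hand the record to the next pattern
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Any other exception, such as a validation error for a missing email, is not caught and fails on the first attempt, which matches the point above that bad data should not be retried.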
One warning we give every client: retries plus money steps can create duplicates. If your automation creates an invoice and the run errors after the invoice exists but before the run records success, a retry creates a second invoice. The fix is to make the step idempotent: search for the record first and only create it if it is missing, or use a unique key so the second attempt updates instead of inserts. We learned this the hard way on an AI shift-allocation build, where we added ownership locks so two runs could never claim the same shift. We wrote up that double-booking fix separately.
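The search-before-create approach can be sketched in a few lines. Everything here is a stand-in: `InMemoryInvoices` plays the role of a billing system, and the `order-<id>` key format is an assumption chosen for the example.

```python
from typing import Dict, Optional


class InMemoryInvoices:
    """Stand-in for a billing system; a real one would be an API client."""

    def __init__(self) -> None:
        self._by_key: Dict[str, dict] = {}

    def find_by_key(self, key: str) -> Optional[dict]:
        return self._by_key.get(key)

    def create(self, key: str, amount_cents: int) -> dict:
        invoice = {"key": key, "amount_cents": amount_cents}
        self._by_key[key] = invoice
        return invoice

    def count(self) -> int:
        return len(self._by_key)


def ensure_invoice(store: InMemoryInvoices, order_id: str, amount_cents: int) -> dict:
    """Create the invoice only if none exists yet for this order."""
    key = f"order-{order_id}"  # assumed unique key derived from the order
    existing = store.find_by_key(key)
    if existing is not None:
        return existing  # a retry lands here instead of creating a duplicate
    return store.create(key, amount_cents)
```

Calling `ensure_invoice` twice for the same order returns the same invoice and leaves one record in the store. Note that a plain search-then-create still leaves a gap if two runs execute at the same moment; closing that gap takes a uniqueness constraint or a lock on the target system.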
Pattern 2: dead-letter paths
A dead-letter path is a place where failed records land instead of vanishing. The name comes from postal services: a letter that cannot be delivered goes to the dead-letter office instead of the bin. Yours does not need to be fancy. A Google Sheet or Airtable table called "Failed runs" with five columns: timestamp, which automation, the record's key data, the error message, and a "resolved" checkbox.
The dead-letter path catches what retries cannot. The lead with the malformed email still exists in that sheet. Someone can fix the email by hand and push the row through, and the lead gets followed up three hours late instead of never. Without the sheet, the only record of that lead is buried in a run log that expires.
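As a rough illustration of what one dead-letter row contains, the sketch below writes the five columns to CSV. The column names, the `write_dead_letter` helper, and the sample lead are all hypothetical; in practice the row is written by a Google Sheets or Airtable step inside the tool.

```python
import csv
import io
import json
from datetime import datetime, timezone
from typing import TextIO

COLUMNS = ["timestamp", "automation", "record_key_data", "error_message", "resolved"]


def write_dead_letter(out: TextIO, automation: str, record: dict,
                      error: Exception, write_header: bool = False) -> None:
    """Append one failed record to the 'Failed runs' table."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    if write_header:
        writer.writeheader()
    writer.writerow({
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "automation": automation,
        # Keep enough of the payload that someone can fix and replay it by hand.
        "record_key_data": json.dumps(record, sort_keys=True),
        "error_message": str(error),
        "resolved": "FALSE",
    })


# Example: a lead whose email has a trailing space and was rejected by the CRM.
sheet = io.StringIO()
write_dead_letter(sheet, "lead-to-crm", {"email": "ana@example.com "},
                  ValueError("invalid email"), write_header=True)
```

The payload is stored as JSON in a single column so the sheet keeps the same five columns no matter which automation wrote the row.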
Pattern 3: alert routing to a human
An error that nobody sees is an error that nobody fixes. The default email-the-builder behaviour fails for a structural reason: the person who built the automation is rarely the person who feels its failure. The marketing coordinator feels the missing leads. The bookkeeper feels the missing invoices. The builder just gets another email.
Routing means errors go to a shared place the affected team reads, almost always a Slack or Teams channel, with the automation name, the error message, and a link to the failed run. Route by stakes, not by volume. Anything touching money or customers goes to a channel with a named owner. Internal conveniences, like the Zap that posts birthdays to Slack, can keep the default email. Nobody needs to be paged for a missed birthday.
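Routing by stakes amounts to one rule, sketched below. The tag names, the `Automation` record, the `default-email` destination, and the channel names are assumptions for illustration; the returned dictionary is only a description of the alert, not a call to Slack or Teams.

```python
from dataclasses import dataclass
from typing import FrozenSet

HIGH_STAKES_TAGS = frozenset({"money", "customers"})


@dataclass(frozen=True)
class Automation:
    name: str
    tags: FrozenSet[str]
    owner_channel: str  # the owning team's channel, e.g. "#finance-automation"


def route_alert(automation: Automation, error_message: str, run_url: str) -> dict:
    """Decide where an error goes and what the message says."""
    if automation.tags & HIGH_STAKES_TAGS:
        destination = automation.owner_channel
    else:
        destination = "default-email"  # low stakes: keep the tool's default
    return {
        "destination": destination,
        "text": f"{automation.name} failed: {error_message}\nRun: {run_url}",
    }
```

An invoice sync tagged `money` would post to its owner's channel with the name, error, and run link, while a birthday bot tagged `internal` stays on the default email.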
Pattern 4: run logs you can audit
All three tools keep a history of every run: what triggered it, what data passed through each step, what succeeded and what failed. The log already exists. The pattern is the habit of reading it, because logs nobody opens provide no safety.
Two things to know about run logs. First, retention is plan-shaped: lower tiers keep days, higher tiers keep weeks or months, and self-hosted n8n keeps whatever you configure on your own disk. If a failure surfaces six weeks late and your retention is seven days, the evidence is gone. Second, logs answer the question alerts cannot: not "what broke loudly" but "what quietly stopped running at all." An automation with zero runs in a month, when it should run daily, never throws an error. Only the log shows the absence.
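The "zero runs" check is simple enough to express as a comparison between when each automation last ran and how often it should run. This is a sketch under assumptions: the two input dictionaries would have to be filled in by hand or from an export of the run history, and the `grace` multiplier of 2 is an arbitrary starting point.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_silent_automations(last_run_at: Dict[str, datetime],
                            expected_every: Dict[str, timedelta],
                            now: datetime, grace: float = 2.0) -> List[str]:
    """Names whose latest run is older than grace x the expected interval,
    or that have no recorded run at all."""
    silent = []
    for name, interval in expected_every.items():
        last = last_run_at.get(name)
        if last is None or now - last > interval * grace:
            silent.append(name)
    return sorted(silent)
```

A daily sync whose last run was two weeks ago is flagged, and so is any automation on the expected list with no runs at all, which is the case an error alert can never report.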
What each pattern looks like in Zapier, Make, and n8n
The concepts are identical across tools. The buttons are not. Here is where each pattern lives. If you are still choosing between the three platforms, our comparison covers that decision; this table assumes you have picked one.
| Pattern | Zapier | Make | n8n |
|---|---|---|---|
| Retries with backoff | Autoreplay (paid plans) re-runs errored Zap runs automatically over the following hours, with growing gaps between attempts. On free plans you replay manually from Zap History. | Add an error handler to a module with the Break directive: you set the number of attempts and the wait between them. Requires the scenario setting "Allow storing of incomplete executions." | Per node: enable "Retry On Fail" and set max tries and the wait between tries. The built-in waits are short, seconds not hours, so for longer outages you re-run from the executions list instead. |
| Dead-letter path | Attach an error handler path to the failing step and have it write the payload to a "Failed runs" sheet or Airtable base. | Two options: the incomplete executions store is a built-in dead-letter queue you can resolve and resume, or add a sheet-write module on the error route before the directive. | Set the node's on-error behaviour to "Continue (using error output)", which gives the node a second output branch. Route that branch to your sheet or database. |
| Alert routing | Build one Zap using the Zapier Manager app's error trigger that posts every Zap error to a Slack channel. Manager also has a trigger for when a Zap gets turned off, which is the alert that matters most. | Put a Slack module on the error handler route, before the Break or Resume directive, posting scenario name and error text. | Build one workflow starting with the Error Trigger node, posting workflow name, error message, and the execution link to Slack. Assign it as the error workflow in each workflow's settings. |
| Run logs | Zap History, filterable by status. Retention depends on plan tier. | Scenario History shows every execution with the data bundles in and out of each module. Retention depends on plan tier. | Executions list. Cloud retention depends on plan; self-hosted retention is whatever you configure. |
One asymmetry worth naming: in Zapier and n8n, alert routing is centralized, one error Zap or one error workflow covers everything. In Make, error handlers attach per module, so coverage is something you add deliberately to the scenarios that matter rather than getting globally for free.
What this looks like at 5 people vs 20 people
At a 5-person company, the builder and the operator are usually the same person. If your Zap breaks, you feel it yourself within a day or two. The honest minimum here is small: default error notifications pointed at an inbox or channel you genuinely read, retries on any step that calls an external API, and a dead-letter sheet for the two or three automations that touch money or customers. That is it. You do not need an error workflow for the Slack birthday bot, and you do not need a consultant to add a retry checkbox.
At 20 people, the structure breaks, because automations now cross departments. The person who built the lead-routing flow sits in ops, while the person who notices missing leads sits in sales, and the error emails go to neither of them. At this size you need three things you did not need before. First, a named owner per automation, written down. Second, alerts routed to the owning team's channel rather than one person's inbox. Third, a registry: a list of every automation, what it does, what it touches, and who owns it. If that list does not exist, the error handling conversation is premature, and we have written about why the registry comes first.
The 20-person version also raises the duplicate-record stakes. With one builder, you know which steps create invoices. With four builders across two years, you do not, and a well-meaning retry policy can double-charge a customer. Idempotency checks on money steps stop being optional here.
The minimal version worth doing this week
If you do nothing else, do these four things. Budget about two hours total.
- Point error notifications somewhere watched. Create a #automation-errors channel. In Zapier, build the one Manager-app Zap that posts errors and turn-offs there. In n8n, build the one Error Trigger workflow and assign it everywhere. In Make, add Slack-posting error handlers to your highest-stakes scenarios. Thirty minutes.
- Turn on retries for API-calling steps. Autoreplay in Zapier, Break directives in Make, Retry On Fail in n8n. Skip retries on any step that creates invoices or charges until you have checked it is idempotent. Twenty minutes.
- Create one "Failed runs" sheet and wire dead-letter writes into your top three automations only: the ones touching leads, money, or customer communication. Forty minutes.
- Put a 15-minute monthly check on the calendar with a named owner. The checklist is below. Five minutes to schedule, and the recurring slot is the part most teams skip.
The monthly 15-minute health check
This is the audit habit that makes the other three patterns real. One person, once a month, fifteen minutes, in this order.
- Minutes 1 to 5: scan the error logs. Open Zap History, Make's scenario History, or n8n's executions list. Filter to errored runs in the last 30 days. You are looking for repeat offenders: the same automation erroring weekly is a design problem, not bad luck.
- Minutes 5 to 8: check for silence. Look for automations with zero runs that should have run. A daily sync with no executions in two weeks did not get reliable. It died. This is the check no alert can do for you.
- Minutes 8 to 11: empty the dead-letter sheet. Fix and replay what can be fixed, mark the rest resolved with a note. If the sheet has forty rows from one automation, that automation needs rework, not patience.
- Minutes 11 to 13: check connections. Every platform has a connections page showing expired or failing app logins. Reauthorize anything red before it takes a workflow down mid-month.
- Minutes 13 to 15: confirm nothing was auto-disabled. Zapier turns off Zaps that error too often. Make deactivates scenarios after consecutive failures. A disabled automation throws no errors at all, which is exactly why it goes unnoticed.
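If the error list or the dead-letter sheet is exported as rows, spotting repeat offenders is a count per automation. The sketch below assumes each row is a dictionary with an `automation` field, and uses a threshold of four failures in the period as a rough proxy for "erroring weekly"; both are assumptions to adjust.

```python
from collections import Counter
from typing import Dict, List, Tuple


def repeat_offenders(failed_rows: List[Dict[str, str]],
                     threshold: int = 4) -> List[Tuple[str, int]]:
    """Automations with at least `threshold` failures, most frequent first."""
    counts = Counter(row["automation"] for row in failed_rows)
    return [(name, n) for name, n in counts.most_common() if n >= threshold]
```

Anything this returns is a candidate for rework rather than another replay.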
If this check keeps surfacing the same fires, the problem is upstream of error handling, and a broader look at the stack is worth more than another retry. That stack review is the core of our automation and AI work: flat $150 per hour, scope quoted in writing before we start, and if the fix is a 20-minute settings change we will say so instead of inventing a project.
Misconceptions we hear constantly
"No error emails for three months means it's working." It might mean the trigger died, the Zap got auto-disabled, or the emails are going to a filtered folder. Absence of errors is not presence of health. Only the zero-runs check catches a dead trigger.
"Retries make the automation reliable." Retries make it resilient to temporary failures. Bad data fails every attempt, and retries on non-idempotent steps create duplicates. Retries are one pattern of four, not a substitute for the other three.
"We'll add error handling once we scale." Backwards. The patterns cost two hours when you have eight workflows and a week of archaeology when you have eighty. The cheapest day to add them was the day you built the automation. The second cheapest is today.
"Self-hosting n8n makes this safer." Self-hosting gives you control of retention and data, and it adds a new failure mode: now your server can go down too, and there is no vendor status page to blame. The four patterns apply identically. You just also own the uptime.
"We need a monitoring tool for this." Below roughly fifty workflows, you do not. The built-in histories, one alert channel, one spreadsheet, and a calendar slot cover it. Buy tooling when the monthly check stops fitting in fifteen minutes, not before.
Frequently asked questions
Do all my automations need all four patterns?
No. Apply all four to anything touching money, leads, or customer communication. Internal conveniences need only the defaults plus the monthly check. Over-engineering a Slack notification bot wastes the hours you should spend on the invoice flow.
Will retries create duplicate records or double charges?
They can, whenever a run fails after the create step succeeded but before the run finished. Protect money steps by searching before creating, or by using a unique key so a second attempt updates instead of inserting. Until a step is idempotent, leave automatic retries off for it and replay failures by hand.
How do I catch an automation that stopped running entirely?
Dead automations throw no errors, so alerts cannot help. Two defences: the monthly zero-runs check in your run history, and turn-off alerts where the platform offers them. Zapier's Manager app can trigger a notification the moment a Zap gets turned off, which is the single highest-value alert in the whole stack.
I'm on a free plan without automatic retries. What should I do?
Lean harder on the other three patterns. Without Autoreplay, a Zapier error stays failed until a human replays it from Zap History, so the alert channel and the monthly check carry more weight. If a free-plan automation touches revenue, that alone usually justifies the paid tier, and if you want a second opinion on whether it does, ask us before paying anyone for a rebuild.
We can handle this for you
We scope this exact work in hours, quote it in writing, and ship it in weeks. The 30-minute call is free and useful either way.
Book a 30-minute call
$150/hr flat · published pricing · no retainers