How to build an AI email triage agent with n8n and Claude

30-second verdict

An email triage agent reads each incoming email, assigns one label from a short list you define, routes it, and logs the decision. You can build it in n8n with a Gmail Trigger, one Claude API call on a small model, a Switch node, and a Google Sheet as the decision log. Anything customer-facing stays a draft until a human approves it. Expect 6 to 10 hours to build it properly and single-digit dollars a month to run it. If you get fewer than about 30 emails a day, skip the whole thing and use Gmail filters.

What you're building

The flow is short: Gmail Trigger fires on a new message, a Code node cleans the email, one API call to Claude returns a label, a Switch node routes it, Gmail nodes apply labels or create drafts, and every decision lands in a Google Sheet. Six nodes plus branches. No agent framework, no vector database, no memory.

After 600+ workflows, the pattern we trust for AI in operations is narrow: the AI step reads unstructured text and turns it into one structured decision, then deterministic rules act on that decision. We used the same split in an AI shift-allocation system that parsed messy, free-text scheduling requests with an AI step and used ownership locks to act on them. The AI never books anything. It only reads. Email triage works the same way: Claude classifies, n8n routes, a human sends.

This sits squarely in our automation and AI practice, and it is one of the highest-return builds we know, because inbox triage is daily, boring, and measurable.

Prerequisites

Honest list of what this needs:

n8n. Self-hosted Community Edition is free and is what we describe here. n8n Cloud works too, and the entry-level plan is enough for one mailbox. If you are choosing a platform from scratch, read our Zapier vs Make vs n8n comparison first.
A Gmail account. On self-hosted n8n you need a Google Cloud project with the Gmail API enabled and an OAuth client ID. Budget 30 minutes for that alone. n8n Cloud has a built-in OAuth connection.
An Anthropic API key from the Claude Console, with billing enabled. The lowest usage tier covers this workload with room to spare.
A Google Sheet for the decision log, with a header row you create before wiring anything.
Optional: Slack, for the human review channel.

Step 1: Pick your labels before you touch n8n

Most triage builds fail at the taxonomy, not the technology. Sit with two weeks of real inbox history and write down 5 to 8 labels that are mutually exclusive. Ours for a typical small business inbox:

invoice: a bill, receipt, or payment question
support: an existing customer asking for help
sales: a new prospect asking about services or pricing
recruiting: a job application or candidate question
newsletter: bulk marketing or digest content
needs_human: everything else, on purpose

The needs_human label is not a failure bucket. It is the design. Every classification system needs a legal exit for ambiguity, or the model will force bad fits. Create matching Gmail labels now (Gmail Settings, Labels, Create new label): AI/Invoice, AI/Support, AI/Sales, AI/Recruiting, AI/Newsletter, AI/Needs-Human. The Gmail node in n8n needs these to exist before you can select them.

Step 2: Set up the Gmail Trigger

Add a Gmail Trigger node. Set Event to Message Received and Poll Times to Every Minute. Under Filters, set Label Names or IDs to INBOX so you classify inbound mail only, not your own sent messages. Turn Simplify off. You want the full payload, including the plain-text body, not the trimmed summary.

One mailbox per workflow. If you have three inboxes, build three copies. Shared logic, separate credentials, separate logs. It is duller and far easier to debug.

Step 3: Clean the email before it hits the model

Add a Code node named Prepare Email. Three jobs:

Strip quoted history. Cut everything from the first line that starts with a quote marker or an "On [date], [name] wrote:" line. If you skip this, the model classifies the thread history instead of the new message. A customer reply on a six-message sales thread will get labeled sales when it is in fact a support request, and you will not understand why until you read the raw input.
Truncate the body to about 4,000 characters. Classification does not improve past that point. Cost and latency do.
Build one compact object with exactly these fields: messageId, threadId, from, subject, body. Everything downstream reads from this object, so later nodes never touch the raw Gmail payload.

Step 4: The classification call to Claude

n8n ships an Anthropic node and LangChain-style AI nodes. They work. We use a plain HTTP Request node for triage because every field is visible, and the execution log shows the exact request and response. That matters when you are debugging a misclassification in week three.

Configure the node, named Classify with Claude:

Method: POST. URL: https://api.anthropic.com/v1/messages
Headers: x-api-key (store it as a Header Auth credential in n8n, never paste it into the node), anthropic-version set to 2023-06-01, content-type set to application/json
Body: JSON with model set to claude-haiku-4-5, max_tokens set to 300, a system field holding the classifier prompt, and a messages array with one user message containing the sender, subject, and cleaned body

Use the small model. Triage is a closed-list classification task, and the cheapest Claude tier handles it well. Save the big models for work that needs reasoning.

The prompt is where the quality lives. Ours, lightly trimmed:

You are an email triage classifier for a small business inbox. Classify the email into exactly one of these labels: invoice (a bill, receipt, or payment question), support (an existing customer asking for help), sales (a new prospect asking about services or pricing), recruiting (a job application or candidate question), newsletter (bulk marketing or digest content), needs_human (anything that fits none of these, fits two equally well, or mentions a legal issue, a refund, or a complaint). Respond with JSON only, no other text, in this shape: {"label": "...", "confidence": 0.0 to 1.0, "reason": "one short sentence"}. Never invent a label that is not on the list.

Three design choices to copy: every label gets a one-line definition so the model is not guessing your intent. Sensitive topics (legal, refund, complaint) are routed to a human by rule, written into the prompt itself. And the output is a fixed JSON shape, so the next node can parse it instead of reading prose. Treat the confidence number as a rough flag, not a calibrated probability. It is useful for routing, not for reporting.

Where this breaks: invented labels and broken JSON

Sooner or later the model will return "Invoice question" instead of "invoice", or wrap the JSON in markdown code fences, or add a friendly sentence before the braces. If your workflow assumes clean output, it fails silently on those runs. The catch is structural, not hopeful: parse the response in a Code node inside a try/catch, lowercase and trim the label, and check it against a hardcoded allowlist array of your six labels. Anything that fails any of those checks becomes needs_human, and the raw model output gets written to the log so you can see what came back.

Step 5: Validate the output and route with a Switch node

Add a Code node named Validate Decision that does the parsing and allowlist check above, plus one more rule: if confidence is below 0.7, overwrite the label to needs_human and keep the original label in a separate field called model_label for the log.

Then add a Switch node named Route by Label, Mode set to Rules, with one output per label. Each rule compares the expression {{ $json.label }} against the exact string: invoice, support, sales, recruiting, newsletter. Set the fallback output to the needs_human branch, so an unmatched value can never vanish.

Step 6: Actions per label, with human review gates

Now the deterministic half. Per branch:

invoice, sales, recruiting: a Gmail node with Resource Message, Operation Add Label, applying the matching AI/ label. The email stays in the inbox. Triage means sorted, not hidden.
newsletter: add the label, then a second Gmail node that removes the INBOX label. Archived, not deleted, so nothing is unrecoverable.
support: add the label, then a Gmail node with Resource Draft, Operation Create, holding a suggested reply, plus a Slack node posting the sender, subject, label, and confidence to a #email-triage channel.
needs_human: add the label and post to the same Slack channel with the model's reason field included.

The hard rule, and the one we will not negotiate on client builds: the agent never sends email to a customer. It creates drafts. A person opens the draft, edits it or not, and presses send. That send button is the review gate. The day you remove it, a hallucinated discount or a wrong name can go out under your domain, and the damage to trust outweighs the hours the agent saves.

Where this breaks: the trigger and the API

Two failure modes that do not announce themselves. First, the polling trigger after downtime: if your n8n instance is offline for an hour, messages from the gap may never be picked up. Test it deliberately. Stop n8n, send yourself three emails, restart, and watch whether all three process. Whatever you observe, add a daily sweep: a second scheduled workflow that searches the mailbox for messages older than one hour with no AI/ label and pushes them through the same classify-and-route path. Second, API errors: rate limits (429) and overload responses (529) happen. On the HTTP Request node, enable Retry On Fail with 3 tries and a wait between tries, and register an n8n Error Workflow that posts every failed execution to Slack. The dangerous version of this failure is silent: the workflow just stops classifying, nobody notices, and the inbox quietly goes back to manual.

Step 7: Log every decision

Add a Google Sheets node named Log Decision, Operation Append Row, on every branch including needs_human. Columns: timestamp, message_id, from, subject (first 100 characters), label, model_label, confidence, reason, model, prompt_version, action_taken.

The log is not bureaucracy. It is the only way to improve the prompt with evidence instead of guesswork. Once a week, filter to the needs_human rows and read them. Patterns appear fast: maybe half are vendors that belong under invoice, so you tighten that definition. Bump prompt_version every time you change the prompt, so you can see whether accuracy moved after the change. We treat this the same way we treat any operations audit: measure first, then adjust.

How you know it worked

Run this acceptance test before you trust the agent:

Replay 50 historical emails. Label them by hand first, then run them through the workflow and compare. On a six-label set, 45 or more matching out of 50 is a working triage agent. Below 40 usually means your labels overlap. Fix the definitions, not the model.
One row per email. After three days of live running, the sheet row count should equal the inbound email count. Any gap means the trigger or the error path is leaking.
One test email per label. Send them to yourself. Each should carry the right AI/ label within two minutes.
Zero sent messages. Search the Sent folder for anything the workflow produced. The correct number is zero. Drafts only.
Measure against hours. Before launch, time yourself triaging 20 emails and count your weekly volume. Multiply. That number, in minutes per week, is the baseline the agent has to beat, and the log sheet proves whether it did. If you skipped this measurement, you cannot claim the agent saved anything.

What this costs to run

The API cost is the part people overestimate most. Work the arithmetic for a busy small-business inbox at 150 emails a day. After cleaning, each classification is roughly 1,200 input tokens and 120 output tokens. Over a month that is about 4,500 emails, 5.4 million input tokens, and 0.5 million output tokens. As of this writing, Claude Haiku 4.5 is priced at 1 dollar per million input tokens and 5 dollars per million output tokens, which puts this workload under 10 dollars a month. Check the current Anthropic pricing page before you budget, but the shape holds: small model, short inputs, tiny outputs. The model is not the cost.

The real costs are elsewhere. Self-hosting n8n means a small server, often one you already run. n8n Cloud's entry plan costs more per month than the API will. And the dominant cost is build time: yours or someone else's.

When to stop DIYing

Three honest tiers:

Do not build this at all if you get under about 30 emails a day. Gmail filters plus 15 minutes of manual triage beat any agent at that volume. You do not need a consultant, and you do not need this workflow.
Build it yourself if you are comfortable in n8n, you have one mailbox, and the steps above read as doable. It is a solid weekend project, and the worst failure mode (mislabeled email) is recoverable.
Get help when the scope grows: shared inboxes with assignment rules, a sales label that should create a CRM deal automatically, multiple mailboxes with different taxonomies, or compliance constraints on what the model may see. That is where the edge cases multiply.

If we build it: our rate is a flat $150 per hour CAD, and this build typically lands between 6 and 10 hours, so 900 to 1,500 dollars. That includes the sweep workflow, the error workflow, the decision log, the acceptance test, and one tuning pass on the prompt after the first week of live data. We quote the scope in writing before we start, hours never expire, and there is no retainer. A 10-hour build does not need a monthly retainer.

FAQ

Can I build this in Zapier or Make instead of n8n?

Yes. The architecture is identical: trigger, clean, classify, validate, route, log. n8n wins for this specific build because the Code node makes the cleanup and validation steps easy, and self-hosting makes a poll-every-minute trigger cheap at volume. On Zapier, per-task pricing adds up fast at 150 emails a day. Our platform comparison covers the tradeoffs in detail.

Which Claude model should I use for email triage?

The smallest current one, which is Claude Haiku as of this writing. Triage is closed-list classification with the answers written into the prompt, so the cheap tier performs nearly identically to the expensive one at a fraction of the cost. If your replay test shows persistent misses, fix the label definitions first. Upgrading the model to compensate for an ambiguous taxonomy adds cost without fixing the problem.

Can the agent reply to customers automatically?

It can, and it should not. Keep every customer-facing message behind a human review gate: the agent creates a Gmail draft, a person reads it and presses send. Auto-labeling and archiving are safe to automate because mistakes are recoverable. A wrong email sent under your name is not. Revisit the question after months of clean logs, not before, and even then only for narrow cases like acknowledgment receipts.

What about sensitive data in the emails?

Every inbound email body goes to the model provider's API, so treat this as a data-processing decision, not just a technical one. Review the provider's commercial terms: as of this writing, Anthropic's commercial API terms do not use your inputs to train models by default, but verify against the current terms yourself. If the mailbox receives regulated data such as health or financial records, keep it out of scope or add a redaction step in the Prepare Email node before the API call. When in doubt, route that mailbox to needs_human by rule.

We can handle this for you

We scope this exact work in hours, quote it in writing, and ship it in weeks. The 30-minute call is free and useful either way.

Book a Discovery Call

$150/hr flat · published pricing · no retainers