Invoice Agent — Agentic Office Automation
What an honest agentic workflow looks like.
2026
Every office processes invoices, and the work is the same in every industry: open a PDF, read the fields, check the vendor, check the math, check that it is not a duplicate of last month's, pick a category, type the relevant numbers into a spreadsheet. Each one takes three to eight minutes if everything is in order. A small business handling a hundred invoices a month spends five to thirteen hours doing exactly this, every month. UiPath's 2021 Office Worker Survey put the average global office worker at four and a half hours a week on tasks they themselves believed could be automated, with manual data entry ranking second on the list of tasks workers most wanted automated, behind only email triage. Invoice processing is, at its core, exactly that: typing fields from a document into a spreadsheet.
There is no skill in this work. It is also not something anyone is paid to do well — it is the kind of task that gets done badly because it is boring, and the badly-done version creates downstream problems: payments to wrong vendors, missed early-payment discounts, the same invoice booked twice, budget categories that nobody can audit later. The cost of a manual error is $10 to $25 per invoice according to AppZen's 2025 AP benchmarks. The cost of the time spent doing it correctly is roughly the same. Either way, money walks out the door for work that should not require human attention.
WHAT “AGENTIC” ACTUALLY MEANS
A workflow becomes agentic the moment the program stops following a fixed script and starts deciding for itself which tool to use, in what order, and how to handle the cases the script did not anticipate. That is the dividing line between this project and a chatbot-with-extraction. A single LLM call returning a JSON object is not an agent. An agent is what happens when a model can call functions, observe their results, and choose its next move on that basis.
Concretely, for every invoice this system receives: it reads the PDF (text layer if present, vision if it is a scan); it extracts the structured fields; it calls lookup_vendor against the master records, fuzzy-matched; it calls verify_math to confirm the arithmetic; it calls check_duplicate against rows already in the spreadsheet. Then it decides. If everything passes, it categorises and appends. If anything failed, it routes to review with a structured reason. The decision in that last step is the part that justifies the word agent. The model is not following a switch statement; it is choosing a branch based on three independent typed tool results and producing a human-readable justification for the path it took.
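As a rough illustration of what "typed tool results" can look like, here is a minimal sketch of two of the checks named above. Only the tool names verify_math and check_duplicate and the 0.02 tolerance come from this write-up; the result classes, the field names, the inclusive comparison, and the case-insensitive matching of invoice numbers are assumptions made for the example, not the project's actual code.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

# Tolerance taken from KEY_NUMBERS; treating it as inclusive is an assumption.
MATH_TOLERANCE = Decimal("0.02")


@dataclass(frozen=True)
class MathResult:
    ok: bool
    delta: Decimal  # (net + tax) - gross, signed


@dataclass(frozen=True)
class DuplicateResult:
    is_duplicate: bool
    matched_row: Optional[int]  # hypothetical: index of the already-booked row


def verify_math(net: Decimal, tax: Decimal, gross: Decimal) -> MathResult:
    """Check net + tax against gross, allowing the tolerance in both directions."""
    delta = (net + tax) - gross
    return MathResult(ok=abs(delta) <= MATH_TOLERANCE, delta=delta)


def check_duplicate(vendor_id: str, invoice_number: str,
                    booked: list[tuple[str, str]]) -> DuplicateResult:
    """Look for the same (vendor, invoice number) pair among booked rows."""
    key = (vendor_id, invoice_number.strip().upper())
    for row, (booked_vendor, booked_number) in enumerate(booked):
        if (booked_vendor, booked_number.strip().upper()) == key:
            return DuplicateResult(True, row)
    return DuplicateResult(False, None)
```

The checks themselves stay deterministic and cheap to test; the agent's contribution is deciding what to do with the results and explaining that decision in words.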
ARCHITECTURE
The agent runs on the Claude Agent SDK. Three reasons drove the choice. First, the SDK reuses Claude Code's authentication, which means a Claude Pro subscription covers all model calls — no separate API key, no per-token billing surprise. Second, native tool use: each tool is a small Python function with a typed input schema, bundled into an in-process MCP server so there is no IPC overhead. Third, the SDK streams every tool call back to the controller, which is what makes the autonomy visible in the demo: the agent thinking, calling, observing, deciding.
Six business-logic tools, plus Claude Code's built-in Read for PDFs, is the entire tool surface. Validation is done in plain Python; Pydantic models normalise records before they reach the Excel writer; one JSON file per flagged invoice is the audit trail. None of those individual pieces is novel. The novelty is putting them behind an agent that decides when to call which piece.
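The one-JSON-file-per-flagged-invoice audit trail could be sketched roughly as follows, using only the standard library. The record fields, the reason codes, and the file-naming scheme are hypothetical; the write-up only states that each flagged invoice gets its own JSON file with a structured reason.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class FlaggedInvoice:
    invoice_number: str
    vendor_id: Optional[str]  # None when the vendor could not be matched
    reason_code: str          # hypothetical codes, e.g. "DUPLICATE", "MATH_ERROR"
    agent_note: str           # the agent's human-readable justification


def write_flag(record: FlaggedInvoice, review_dir: Path) -> Path:
    """Write one JSON file per flagged invoice and return its path."""
    review_dir.mkdir(parents=True, exist_ok=True)
    # Assumes invoice numbers are safe to use as file names.
    path = review_dir / f"{record.invoice_number}.json"
    path.write_text(
        json.dumps(asdict(record), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return path
```

Keeping the note as plain text next to a machine-readable reason code lets a reviewer read the file directly while still allowing the flagged set to be filtered by reason.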
AGENT_FLOW.SVG
DEMO_RUN.MP4 — INBOX TO SPREADSHEET, NO CUTS ON THE RUN ITSELF
EDGE CASES BUILT IN
The synthetic invoice generator does not produce ten clean invoices and call it done. It injects targeted edge cases on top of the clean set, chosen because they are the ones that hurt accounts payable in real operations. Two invoices share the same number from the same vendor: a duplicate pair. One has net + tax ≠ gross (netto and brutto in the invoice's own terms), off by a small but detectable delta. One comes from a company that is not in the master records. It also bills in CHF rather than EUR, but the agent never gets that far: the unknown-vendor check fires first, and the invoice is routed to review before currency becomes a question. Two of the ten are rendered as image-only PDFs made to look like scans, with rotation, JPEG artefacts, and Gaussian noise. They have no text layer at all, so the agent has to fall back to vision to read them. And to mirror real-world format diversity, one of the ten arrives not as a PDF but as an HTML order confirmation, the way many webshops actually deliver them. That is the kind of variance that breaks naive regex-and-PDF-parser pipelines in production.
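To make the injection idea concrete, here is a small sketch of how three of these defects could be layered onto a clean batch. It is not the project's generator: the dictionary keys, the 0.50 delta, and the placeholder vendor name are all invented for illustration, and the scan and HTML variants are left out because they concern rendering rather than data.

```python
import copy
from decimal import Decimal


def inject_edge_cases(clean_batch: list[dict]) -> list[dict]:
    """Return a copy of the batch with three targeted defects injected."""
    batch = copy.deepcopy(clean_batch)  # leave the clean set untouched
    if len(batch) < 4:
        raise ValueError("need at least four invoices to inject all cases")

    # Duplicate pair: invoice 1 reuses invoice 0's vendor and number.
    batch[1]["vendor"] = batch[0]["vendor"]
    batch[1]["invoice_number"] = batch[0]["invoice_number"]

    # Math error: gross no longer equals net + tax (hypothetical 0.50 delta).
    batch[2]["gross"] = batch[2]["net"] + batch[2]["tax"] + Decimal("0.50")

    # Unknown vendor that also bills in CHF (placeholder name).
    batch[3]["vendor"] = "Unlisted Example GmbH"
    batch[3]["currency"] = "CHF"
    return batch
```

Injecting defects into known positions keeps the expected outcome of a run predictable, so a changed result points at the agent rather than at the test data.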
RESULTS
The recorded run processes the ten-invoice batch end-to-end on a Claude Pro plan at around fifty seconds per invoice. Seven invoices land in the spreadsheet as PROCESSED records; three are routed to review with structured reasons — the duplicate, the math-error invoice, and the unknown vendor. The agent's note for the duplicate read:
Invoice INV-2026-DUP-5557 from Pixelpunk Online AG (VND-002) is already present in processed.xlsx. Math and vendor identity check out, but the duplicate invoice number requires human review before any further action.
That sentence is what makes this useful in practice. A reviewer reading the flagged JSON does not have to guess what the agent thought; the justification is there, with the specific reference to the already-booked record. A second behaviour worth recording: when a vendor's default booking category disagrees with what is actually being bought, the agent overrides. In one validation run a webshop vendor registered as Fachliteratur (books) appeared with line items of coffee, a branded mug, and a desk organiser. The agent re-categorised the booking as Bürobedarf (office supplies) and recorded its reasoning. That is the contextual judgement that justifies a tool over a hard-coded vendor-to-category map.
WHAT THIS DOES NOT YET DO
The honest list. A real deployment would require an audit log that retains the agent's full reasoning trail rather than just the summary, because GoBD, the German rulebook for electronic bookkeeping, requires immutable storage of the rationale. It would need a payment-approval workflow gating anything above a configurable threshold, because processing an invoice is not the same as authorising payment. It would need an integration with a real accounting system (DATEV, lexoffice, sevDesk); the Excel export here is a deliberate showcase choice because it makes the data immediately legible. It would need a vendor master that supports onboarding new vendors mid-flow rather than only flagging unknowns. And it would need a rate-limit-aware batcher for queues larger than ~50 items, where the current one-query-per-invoice model would need a different pattern.
None of those are exotic. They are the gap between a demo that works and a system in production — and the gap is where the engineering judgement actually lives.
KEY_NUMBERS
manual_per_invoice = 3–8 min // industry baseline (UiPath, 2021)
agent_per_invoice ≈ 50 s // Claude Pro, Sonnet-class model, 5–7 tool calls
duplicate_detection = 1 / 1 // injected pair, second occurrence flagged with reason
math_tolerance = 0.02 // both directions, currency-agnostic
unknown_vendor_threshold = 0.78 // fuzzy-match acceptance score
vision_fallback_share ≈ 20% // scan variants in the synthetic batch
The point is not the percentages. The point is that every decision the agent makes has a number behind it that can be tuned, audited, and explained.
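As one example of a number that can be tuned, the fuzzy vendor lookup might be sketched like this. The 0.78 threshold and the VND-002 record come from this write-up; the use of difflib's SequenceMatcher is an assumption (the actual matcher is not named here), as are the function signature and the case-folding step.

```python
from difflib import SequenceMatcher
from typing import Optional

# Acceptance score from KEY_NUMBERS; at or above it, the match is accepted.
UNKNOWN_VENDOR_THRESHOLD = 0.78


def lookup_vendor(name: str, master: dict[str, str]) -> tuple[Optional[str], float]:
    """Return (vendor_id, score) for the best match, or (None, score) if too weak."""
    needle = name.casefold().strip()
    best_id: Optional[str] = None
    best_score = 0.0
    for vendor_id, vendor_name in master.items():
        candidate = vendor_name.casefold().strip()
        score = SequenceMatcher(None, needle, candidate).ratio()
        if score > best_score:
            best_id, best_score = vendor_id, score
    if best_score >= UNKNOWN_VENDOR_THRESHOLD:
        return best_id, best_score
    return None, best_score  # below threshold: treat as unknown vendor


master_records = {"VND-002": "Pixelpunk Online AG"}
print(lookup_vendor("Pixelpunk Online AG", master_records))  # exact name scores 1.0
```

Returning the score alongside the decision is what makes the threshold auditable: a reviewer can see how close a rejected vendor came, and the cut-off can be adjusted with evidence rather than by feel.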