Add an approval gate before tool execution. Pause the graph, surface the planned action to a human, and resume on approval — with state preserved across the wait.
Your agent is good. It can plan refunds, draft emails, schedule meetings, deploy code. The problem is that "good" is not "good enough to let it execute autonomously on a production system." One bug in the planner — one hallucinated order ID, one prompt-injected user message — and the agent does the wrong thing in the real world.
Logging the action after the fact does not help. Audit logs tell you what happened; they do not stop it. You need a way for the agent to plan, then pause, then ask a human "are you sure?" before the action fires.
The complication: the agent might pause for seconds, minutes, hours, or days. Whatever runtime you build has to survive process restarts, deploys, on-call handoffs. It has to remember exactly what the agent was about to do and resume from that exact point. That is what LangGraph's interrupt + checkpointer system gives you.
One new node, Approval, and two re-wired edges: Planner now routes to Approval instead of Tool, and Approval routes on to Tool. That is the entire change. The minimalism is intentional.
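To make "one node, two edges" concrete, here is a framework-free sketch that treats the graph as a plain set of `(source, target)` edges. The `insert_gate` helper is hypothetical and is not a LangGraph API. In LangGraph itself you would make the same change with `add_node` and `add_edge` on your `StateGraph` builder. The point is that inserting the gate both adds two edges and removes one.

```python
# Hypothetical, framework-agnostic sketch: a graph is a set of (source, target) edges.
Edge = tuple[str, str]


def insert_gate(edges: set[Edge], src: str, dst: str, gate: str) -> set[Edge]:
    """Return a new edge set with `gate` placed between `src` and `dst`."""
    if (src, dst) not in edges:
        raise ValueError(f"no direct edge {src} -> {dst} to gate")
    rewired = set(edges)
    rewired.discard((src, dst))  # the old direct edge must go
    rewired.add((src, gate))     # re-wired edge 1: src now routes to the gate
    rewired.add((gate, dst))     # re-wired edge 2: the gate routes on to dst
    return rewired


before = {("START", "Planner"), ("Planner", "Tool"), ("Tool", "END")}
after = insert_gate(before, "Planner", "Tool", "Approval")
print(sorted(after - before))  # [('Approval', 'Tool'), ('Planner', 'Approval')]
print(sorted(before - after))  # [('Planner', 'Tool')]
```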
The Approval node sits between the planner and the tool because that is where the trust boundary is. Anything upstream of the gate is "the agent thinking"; anything downstream is "the agent acting in the world." Human-in-the-loop (HITL) approval formalizes that boundary.
The pattern generalizes by adding more gates, not by making each gate smarter. A complex agent might have HITL on tool calls, HITL on cross-account writes, HITL on customer-facing communications — three Approval nodes, each guarding a specific risk surface. Compose by repetition, not by complexity.
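"Compose by repetition" can be sketched as a list of identically shaped gates. Each gate asks one narrow question about the planned action, and the action must pass every gate that applies. The `Action` fields, gate names, and predicates below are hypothetical illustrations, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    kind: str              # hypothetical label, e.g. "tool_call" or "message"
    account: str           # account the action writes to
    caller_account: str    # account the agent is acting on behalf of
    customer_facing: bool  # does a customer see the result?


@dataclass(frozen=True)
class Gate:
    name: str
    applies: Callable[[Action], bool]  # does this gate guard this action's risk surface?


# Three simple gates with the same shape, each guarding one risk surface.
GATES = [
    Gate("tool_call", lambda a: a.kind == "tool_call"),
    Gate("cross_account_write", lambda a: a.account != a.caller_account),
    Gate("customer_comms", lambda a: a.customer_facing),
]


def required_approvals(action: Action, gates: list[Gate]) -> list[str]:
    """Names of every gate this action must pass, in declaration order."""
    return [g.name for g in gates if g.applies(action)]


refund = Action(kind="tool_call", account="acct-2", caller_account="acct-1", customer_facing=False)
print(required_approvals(refund, GATES))  # ['tool_call', 'cross_account_write']
```

Adding a fourth risk surface means appending one more `Gate`. None of the existing gates has to get smarter.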
Notice the gate is synchronous in the graph but asynchronous in wall time. The graph code looks like a function call; the runtime turns it into a persisted pause. That separation — synchronous semantics, asynchronous execution — is what makes HITL feel like normal code while behaving like a durable workflow.
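Here is one way a call that reads like it blocks can become a resumable pause. This toy sketch is not LangGraph's implementation, and `Paused`, `Context`, and `run_node` are hypothetical names. On the first run, `interrupt` raises an exception that unwinds the node and carries the payload out to the human. On resume, the node runs again from the top and `interrupt` returns the human's answer. LangGraph's `interrupt()` (from `langgraph.types`), resumed with `Command(resume=...)`, plays a similar role, and it likewise re-executes the interrupted node from its start. Keep side effects after the interrupt, or make them idempotent.

```python
from typing import Any, Callable

_NO_ANSWER = object()  # sentinel: the human has not responded yet


class Paused(Exception):
    """Unwinds the node when it needs human input; carries what to show the human."""

    def __init__(self, payload: Any):
        super().__init__("run paused for human input")
        self.payload = payload


class Context:
    def __init__(self, answer: Any = _NO_ANSWER):
        self._answer = answer

    def interrupt(self, payload: Any) -> Any:
        # Reads like a blocking call inside the node. With no answer yet it
        # raises, so nothing after this line runs; on resume it returns the answer.
        if self._answer is _NO_ANSWER:
            raise Paused(payload)
        return self._answer


def approval_node(state: dict, ctx: Context) -> dict:
    decision = ctx.interrupt({"planned_action": state["planned_action"]})
    return {**state, "approved": decision == "approve"}


def run_node(node: Callable[[dict, Context], dict], state: dict, answer: Any = _NO_ANSWER):
    """Run one node. Returns ("paused", payload) or ("done", new_state)."""
    try:
        return "done", node(state, Context(answer))
    except Paused as p:
        return "paused", p.payload


state = {"planned_action": "issue_refund(4421, 84.99)"}
first = run_node(approval_node, state)                     # ("paused", {...})
# ...the state would be checkpointed here; minutes or days pass...
second = run_node(approval_node, state, answer="approve")  # ("done", {..., "approved": True})
```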
MemorySaver loses state on every process restart. Your service redeploys, every interrupted run silently dies, and approvals from real users vanish. Use SqliteSaver or PostgresSaver the moment your code leaves a notebook. Dev/prod parity matters here: run the same kind of durable checkpointer in development that you run in production, so restart behavior is tested before it ships.
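The difference between an in-memory saver and a durable one shows up the moment you simulate a restart. In the sketch below, `DictCheckpointer` and `SqliteCheckpointer` are hypothetical stand-ins built on the standard library. They are not the APIs of LangGraph's `MemorySaver` or `SqliteSaver`; the real savers ship in `langgraph` and the separate `langgraph-checkpoint-sqlite` and `langgraph-checkpoint-postgres` packages. A "restart" here just means creating fresh objects, the way a new process would.

```python
import json
import os
import sqlite3
import tempfile


class DictCheckpointer:
    """Like an in-memory saver: state lives only as long as this object."""

    def __init__(self):
        self._data = {}

    def put(self, thread_id, state):
        self._data[thread_id] = json.dumps(state)

    def get(self, thread_id):
        raw = self._data.get(thread_id)
        return None if raw is None else json.loads(raw)


class SqliteCheckpointer:
    """State lives in a file, so a new process opening the same path sees it."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints (thread_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        self._conn.commit()

    def put(self, thread_id, state):
        self._conn.execute(
            "INSERT OR REPLACE INTO checkpoints (thread_id, state) VALUES (?, ?)",
            (thread_id, json.dumps(state)),
        )
        self._conn.commit()

    def get(self, thread_id):
        row = self._conn.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return None if row is None else json.loads(row[0])

    def close(self):
        self._conn.close()


paused = {"planned_action": "issue_refund(4421, 84.99)", "status": "awaiting_approval"}
db_path = os.path.join(tempfile.mkdtemp(), "checkpoints.db")

# Before the "redeploy": both savers hold the paused run.
mem = DictCheckpointer()
mem.put("thread-1", paused)
disk = SqliteCheckpointer(db_path)
disk.put("thread-1", paused)
disk.close()

# Simulated restart: fresh objects, as a new process would create.
mem = DictCheckpointer()
disk = SqliteCheckpointer(db_path)
after_restart_mem = mem.get("thread-1")    # None: the pending approval is gone
after_restart_disk = disk.get("thread-1")  # the paused run, ready to resume
disk.close()
```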
When you insert an Approval node, you must also delete the old direct edge from Planner to Tool. A surprising number of 'my approval gate is being skipped' bugs are simply a leftover edge that bypasses the gate. The Approval node is a wall, not a suggestion; make sure no route goes around it.
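A leftover edge is easy to catch mechanically. The sketch below is a hypothetical `bypasses_gate` check that you could run in a unit test against the edge list of your own graph definition. It does a breadth-first search from the planner, refuses to walk through the gate, and reports whether the tool is still reachable. Extracting the edge list from a real LangGraph graph is left out here, because how you do that depends on how your graph is defined.

```python
from collections import deque


def bypasses_gate(edges, src, dst, gate):
    """True if `dst` is reachable from `src` along some path that avoids `gate`."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt == gate or nxt in seen:
                continue  # paths through the gate are fine, so don't explore past it
            if nxt == dst:
                return True
            seen.add(nxt)
            queue.append(nxt)
    return False


good = {("START", "Planner"), ("Planner", "Approval"), ("Approval", "Tool"), ("Tool", "END")}
leftover = good | {("Planner", "Tool")}  # the forgotten direct edge

print(bypasses_gate(good, "Planner", "Tool", "Approval"))      # False
print(bypasses_gate(leftover, "Planner", "Tool", "Approval"))  # True
```

Searching for paths instead of only looking for the literal `Planner -> Tool` edge also catches indirect bypasses, such as a retry node that routes straight to the tool.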
Resume requires the original thread_id. If your UI loses it (page refresh, lost session), you cannot resume the interrupted run. Persist thread_id in your session store, URL, or a queue — never just in browser memory. Treat it the same as a payment intent ID.
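A minimal way to keep `thread_id` out of browser memory is to mint it server-side, store it against the user's session, and also put it in a resume link. In the sketch below, the `runs` table, `start_run`, `resume_config`, and the `/approvals/...` route are hypothetical. The `{"configurable": {"thread_id": ...}}` shape is the config dict LangGraph reads the thread from. The sketch uses an in-memory SQLite database only so that it runs anywhere; in practice this would be your real session store.

```python
import sqlite3
import uuid
from typing import Optional

# Stand-in for a durable session store (use a real file or database in practice).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (session_id TEXT PRIMARY KEY, thread_id TEXT NOT NULL)")


def start_run(session_id: str) -> dict:
    """Mint a thread_id server-side and persist it before the graph ever runs."""
    thread_id = str(uuid.uuid4())
    conn.execute(
        "INSERT OR REPLACE INTO runs (session_id, thread_id) VALUES (?, ?)",
        (session_id, thread_id),
    )
    conn.commit()
    return {"configurable": {"thread_id": thread_id}}


def resume_config(session_id: str) -> Optional[dict]:
    """After a page refresh, recover the run's config from the session, not the browser."""
    row = conn.execute(
        "SELECT thread_id FROM runs WHERE session_id = ?", (session_id,)
    ).fetchone()
    return None if row is None else {"configurable": {"thread_id": row[0]}}


def approval_url(thread_id: str) -> str:
    """Hypothetical route: a link that carries the thread_id, e.g. in an email or ticket."""
    return f"/approvals/{thread_id}"


config = start_run("session-abc")
restored = resume_config("session-abc")  # same config, even after the UI forgets it
```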
Surfacing 'approve issue_refund(4421, 84.99)?' to a human is barely better than executing without approval. The human cannot tell whether the agent's reasoning was sound. Always include the model's rationale, retrieved context, and any relevant alternatives the agent considered. The point of HITL is informed approval, not rubber-stamping.
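One way to enforce informed approval is to make the payload a structured object that refuses to render without its reasoning. The `ApprovalRequest` fields and validation rules below are a hypothetical sketch, and the refund details are made-up example data. What matters is that the rationale and supporting context travel with the tool call, so the human sees them alongside it.

```python
from dataclasses import dataclass, field


@dataclass
class ApprovalRequest:
    tool: str
    args: dict
    rationale: str                                         # why the model chose this action
    context: list[str] = field(default_factory=list)       # retrieved facts it relied on
    alternatives: list[str] = field(default_factory=list)  # options it considered and set aside

    def validate(self) -> None:
        # Refuse to surface a bare tool call: without reasoning there is no informed approval.
        if not self.rationale.strip():
            raise ValueError("approval request has no rationale")
        if not self.context:
            raise ValueError("approval request has no supporting context")

    def render(self) -> str:
        self.validate()
        arg_str = ", ".join(f"{k}={v!r}" for k, v in self.args.items())
        lines = [f"Proposed action: {self.tool}({arg_str})", f"Why: {self.rationale}", "Based on:"]
        lines += [f"  - {c}" for c in self.context]
        if self.alternatives:
            lines.append("Considered instead:")
            lines += [f"  - {a}" for a in self.alternatives]
        return "\n".join(lines)


request = ApprovalRequest(
    tool="issue_refund",
    args={"order_id": 4421, "amount": 84.99},
    rationale="Customer reports the item arrived damaged.",
    context=["Order 4421 total: 84.99", "Refund policy excerpt retrieved for damaged items"],
    alternatives=["Ship a replacement instead of refunding"],
)
print(request.render())
```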
Gate everything and humans approve everything without reading. Gate nothing and the agent runs wild. The middle path: gate only actions whose worst-case outcome is costly or hard to reverse, and automate the rest. Measure approval rates per gate; if a gate approves more than 99% of requests, consider whether it needs to be a gate at all.
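Measuring approval rates per gate only takes a decision log of `(gate, approved)` pairs. The sketch below flags gates above the 99% line. The `min_samples` floor is an added assumption, not part of the rule above: it keeps a gate that has seen only a handful of requests from being flagged on noise. Gate names and counts are synthetic.

```python
from collections import Counter


def approval_rates(decisions):
    """decisions: iterable of (gate_name, approved: bool). Returns {gate: approval rate}."""
    total, approved = Counter(), Counter()
    for gate, ok in decisions:
        total[gate] += 1
        approved[gate] += ok  # True counts as 1, False as 0
    return {g: approved[g] / total[g] for g in total}


def rubber_stamp_candidates(decisions, threshold=0.99, min_samples=100):
    """Gates that approved more than `threshold` of at least `min_samples` decisions."""
    decisions = list(decisions)
    counts = Counter(g for g, _ in decisions)
    rates = approval_rates(decisions)
    return sorted(g for g, r in rates.items() if counts[g] >= min_samples and r > threshold)


# Synthetic log: 99.5% approvals on one gate, 90% on another, and a small third gate.
log = (
    [("internal_note", True)] * 995 + [("internal_note", False)] * 5
    + [("refund", True)] * 180 + [("refund", False)] * 20
    + [("deploy", True)] * 50
)
print(rubber_stamp_candidates(log))  # ['internal_note']
```

Here `deploy` has a 100% approval rate, but it is not flagged because 50 decisions fall below the assumed sample floor.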
Human approves, tool fires, tool fails. What now? Many implementations have no path for this case and the agent silently 'succeeds' with a broken side effect. After every gated tool, check for failure and either retry, re-prompt the human, or escalate. Do not assume approval implies execution.
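The sketch below handles the failure path after approval with a bounded retry that escalates when retries run out. `ToolError`, `execute_approved`, and the example tools are hypothetical. Re-prompting the human is an equally valid branch and would replace the `escalate` call. Retrying is only safe if the tool is idempotent, for example if it accepts an idempotency key, so that a retry cannot issue the same refund twice.

```python
import time
from typing import Any, Callable


class ToolError(Exception):
    """Raised by a tool when its side effect did not happen."""


def execute_approved(tool: Callable[..., Any], args: dict, *, max_attempts: int = 3,
                     backoff_s: float = 0.0, escalate: Callable[[str], None]) -> dict:
    """Run an approved tool call; never report success unless the tool actually succeeded."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool(**args)
            return {"status": "succeeded", "attempts": attempt, "result": result}
        except ToolError as exc:
            last_error = exc
            if attempt < max_attempts:
                time.sleep(backoff_s * attempt)  # linear backoff between attempts
    escalate(f"{tool.__name__} failed after {max_attempts} attempts: {last_error}")
    return {"status": "escalated", "attempts": max_attempts, "error": str(last_error)}


calls = {"n": 0}


def issue_refund(order_id, amount):
    """Example tool: fails once (say, a timeout), then succeeds."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise ToolError("payment API timeout")
    return f"refunded {amount} on order {order_id}"


def charge_card(amount):
    """Example tool that always fails."""
    raise ToolError("card declined")


alerts = []
ok = execute_approved(issue_refund, {"order_id": 4421, "amount": 84.99}, escalate=alerts.append)
bad = execute_approved(charge_card, {"amount": 10}, escalate=alerts.append)
```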