Written by

DeepMerge Team

# How to test AI procedures before they touch real data

Software engineers don't ship code without testing it. They write tests that define correct behavior, run them on every change, and refuse to deploy if anything fails. Business processes don't get the same treatment. Someone writes a return policy, emails it to the team, and hopes everyone follows it correctly. When they don't — wrong refund amount, missed escalation, angry customer — you find out from the complaint, not from a test.

If your business processes are automated by AI, they need the same testing discipline as code.

## The parallel

In software:

1. Write the code
2. Write tests that define "correct"
3. Run the tests — if they fail, fix the code
4. Ship it
5. When something changes, run the tests again

In an operations procedure:

1. Write the procedure instructions
2. Generate test scenarios that define "correct"
3. Run the tests — if they fail, fix the instructions
4. Turn it on
5. When the process changes, run the tests again

Your return policy, your fraud rules, your payment recovery process — these are programs. They have inputs (an order, a customer, a payment), logic (check the policy, evaluate the risk), and outputs (approve, deny, escalate). Programs need tests.

## What a test looks like

A test for a return procedure:

**Scenario:** Customer requests a return for a $45 defective item. 12 days after delivery. First return on the account.

**Expected:** Auto-approve. No human review.

Another test for the same procedure:

**Scenario:** Customer requests a return for a $200 item. 35 days after delivery (outside the 30-day window). Third return this quarter.

**Expected:** Escalate to human review. Flag as potential abuse.

Each test is a specific situation paired with the correct response. The procedure runs against it. Either it matches or it doesn't.

## Why this catches problems you won't find otherwise

**Processes change; tests catch drift.** Your return window changed from 30 to 45 days. Someone updated the instructions.
But did they update the VIP exception (60-day window)? Without tests, you find out when a VIP gets denied at day 35 and calls support. With tests, the VIP scenario fails immediately.

**Edge cases get captured permanently.** That weird situation six months ago — a customer returned a gift, different country, different currency. Someone figured out how to handle it. Without tests, that knowledge lives in memory. With tests, it runs every time the procedure changes.

**New team members inherit context.** When a new ops manager takes over, the tests describe every scenario the team has thought through. They don't ask "what do we do when a customer returns a third item?" They read the test.

## How to build an eval suite for your procedures

### Step 1: Generate scenarios

For each procedure, identify the decision points and create scenarios that cover:

- **Happy path:** The standard case that matches your policy cleanly
- **Edge cases:** Borderline conditions (day 30 of a 30-day window, $50.01 on a $50 threshold)
- **Exception paths:** VIP customers, repeat returners, high-value items
- **Failure modes:** Missing data, conflicting information, ambiguous inputs

A payment recovery procedure might need:

- Soft decline (insufficient funds) — should retry
- Hard decline (expired card) — should not retry; should email the customer
- Soft decline on a VIP subscriber — should retry AND notify the account manager
- Hard decline on a customer with 3 prior failures — should cancel the subscription
- Unknown decline code — should escalate to a human

### Step 2: Define expected outcomes

Each scenario needs a clear, testable expected outcome. Not "handle appropriately" — that's not testable. Instead: "Auto-approve the refund" or "Escalate to human with recommendation to deny."

### Step 3: Run and grade independently

The procedure and the grader should be separate. The grader doesn't know what the procedure was "trying" to do. It only knows the correct outcome and whether the procedure produced it.
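The scenario/expected/grader separation can be sketched as a minimal eval harness, here for the payment recovery example. Everything in this sketch is illustrative: `Scenario`, `run_procedure` (a hard-coded stand-in rule set so the example is self-contained), and `grade` are hypothetical names, and a real grader would be a separate model or service rather than a string comparison.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    decline_type: str   # "soft", "hard", or "unknown"
    vip: bool
    prior_failures: int
    expected: str       # the single correct outcome, stated up front

# Scenarios mirror the payment-recovery list above.
SCENARIOS = [
    Scenario("soft decline", "soft", False, 0, "retry"),
    Scenario("hard decline", "hard", False, 0, "email_customer"),
    Scenario("soft decline, VIP", "soft", True, 0, "retry_and_notify_am"),
    Scenario("hard decline, 3 prior failures", "hard", False, 3, "cancel_subscription"),
    Scenario("unknown decline code", "unknown", False, 0, "escalate_to_human"),
]

def run_procedure(s: Scenario) -> str:
    """Stand-in for the procedure under test (hypothetical rule set)."""
    if s.decline_type == "unknown":
        return "escalate_to_human"
    if s.decline_type == "soft":
        return "retry_and_notify_am" if s.vip else "retry"
    # hard decline
    if s.prior_failures >= 3:
        return "cancel_subscription"
    return "email_customer"

def grade(expected: str, actual: str) -> bool:
    """Independent grader: compares outcomes only; it never sees the
    procedure's reasoning, only what it produced."""
    return expected == actual

def run_suite() -> list[str]:
    """Run every scenario; return a list of failure descriptions."""
    failures = []
    for s in SCENARIOS:
        actual = run_procedure(s)
        if not grade(s.expected, actual):
            failures.append(f"{s.name}: expected {s.expected}, got {actual}")
    return failures

if __name__ == "__main__":
    failures = run_suite()
    print("PASS" if not failures else "\n".join(failures))
```

The point of the structure, not the rules: the expected outcome lives in the scenario, written before the procedure runs, and the grader only compares outcomes, which is what makes it safe to swap in a different model as the grader.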
This is the same principle as an audit: the auditor shouldn't be the person who did the work. Use a different AI model or a different provider for grading to avoid shared reasoning patterns.

### Step 4: Run on every change

Updated the refund threshold? Tests run. Added a new product exception? Tests run. Changed the VIP rules? Tests run. If a change breaks a scenario that used to pass, you know before a single customer is affected.

## The numbers

Procedures that go through testing before deployment catch 2-3 instruction errors on average: missing conditions, ambiguous thresholds, forgotten exceptions. At 1,000 operations/month, a 2-3% error rate means 20-30 customers affected. At 10,000/month, that's 200-300 wrong decisions — wrong refund amounts, missed escalations, incorrect fraud flags. Testing reduces this to near zero. Not by making the procedure perfect, but by making errors visible before they reach customers.

## The principle

Operations processes are too important to deploy on faith. The companies that test their business logic with the same rigor as their software will operate more reliably, catch problems earlier, and earn more customer trust.

Testing is boring infrastructure. It's also the difference between "our AI mostly works" and "our AI works."

---

*DeepMerge tests every procedure before it goes live. AI agents execute your operations across Shopify, Stripe, and 30+ integrations — with built-in eval suites, safety classification, and human-in-the-loop approvals.*