Kapso uses a testing approach where test cases are defined in YAML format and evaluated by an AI judge. This allows for more flexible, natural testing of conversational agents.
```yaml
# tests/customer_support/order_lookup_test.yaml
name: order_lookup_test
description: Test successful order status inquiry
script: |
  1. Start with: "I want to check my order status"
  2. When agent asks for order ID, say: "ABC-123-456"
  3. Agent should provide order status
  4. Say: "When will it arrive?"
  5. Agent should provide delivery estimate
  6. Say: "Thanks!"
rubric: |
  1. Agent asks for order ID promptly (25%)
  2. Agent successfully retrieves order status (25%)
  3. Agent provides delivery information (25%)
  4. Agent maintains professional tone (15%)
  5. Agent offers additional help (10%)

  Critical Failures:
  - Agent cannot find the order
  - Agent provides wrong information
  - Agent asks for sensitive data unnecessarily
```
```shell
# Run the test
kapso test tests/customer_support/order_lookup_test.yaml

# Output:
✓ order_lookup_test (0.95/1.00)
  - Agent asks for order ID promptly: 0.25/0.25
  - Agent successfully retrieves order status: 0.25/0.25
  - Agent provides delivery information: 0.25/0.25
  - Agent maintains professional tone: 0.15/0.15
  - Agent offers additional help: 0.05/0.10 ⚠️
```
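The overall score in the output above is simply the sum of the weighted per-criterion scores. A minimal sketch of that aggregation (plain Python illustrating the arithmetic, not Kapso's actual scoring code):

```python
# Hypothetical aggregation of per-criterion judge scores.
# Weights and scores mirror the order_lookup_test output above;
# this is an illustration, not Kapso's implementation.
criteria = [
    # (criterion, weight, judge score in [0, 1])
    ("Agent asks for order ID promptly",          0.25, 1.0),
    ("Agent successfully retrieves order status", 0.25, 1.0),
    ("Agent provides delivery information",       0.25, 1.0),
    ("Agent maintains professional tone",         0.15, 1.0),
    ("Agent offers additional help",              0.10, 0.5),  # partial credit -> 0.05/0.10
]

total = sum(weight * score for _, weight, score in criteria)
print(f"{total:.2f}")  # 0.95
```

Partial credit on a single criterion (here, "offers additional help") lowers the total without failing the test outright, which is why the output flags it with a warning rather than a failure.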
Instead of exact string matching, Kapso tests let you:

- Define conversational scenarios in natural language
- Specify expected behaviors, not exact outputs
- Use an AI judge to evaluate whether the agent met expectations
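Conceptually, each test drives the agent through the scripted turns, then hands the resulting transcript and the rubric to a judge. The control flow can be sketched as follows; note that `run_agent` and `ask_judge` are hypothetical stand-ins (a real runner and judge would call the agent and an LLM), and none of this is Kapso's internal code:

```python
# Conceptual sketch of judge-based evaluation, NOT Kapso internals.
# Both the agent runner and the judge are stubbed with trivial
# stand-ins so the control flow is runnable.

def run_agent(script: str) -> list[str]:
    """Stand-in: a real runner would play each scripted turn against the agent."""
    return [line.strip() for line in script.strip().splitlines()]

def ask_judge(transcript: list[str], rubric: str) -> float:
    """Stand-in: a real judge would score the transcript against each criterion."""
    return 1.0 if transcript else 0.0

def evaluate(script: str, rubric: str) -> float:
    transcript = run_agent(script)
    return ask_judge(transcript, rubric)

score = evaluate('1. Say: "Hello"\n2. Agent should greet', "Greeting is warm (100%)")
print(score)  # 1.0
```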
```yaml
# tests/customer_support/test-suite.yaml
name: Customer Support Tests
description: Comprehensive tests for customer support agent
# id: auto-generated during deployment
```
```yaml
# tests/customer_support/greeting_test.yaml
name: greeting_test
description: Test if agent greets users appropriately
script: |
  1. User says: "Hello, I need help"
  2. Assistant should greet warmly and ask how they can help
  3. User says: "I have a problem with my order"
  4. Assistant should acknowledge the issue and offer to help
rubric: |
  1. Warm and professional greeting (30%)
  2. Asks how they can help (20%)
  3. Acknowledges the user's problem (25%)
  4. Offers specific assistance for orders (25%)
# id: auto-generated during deployment
```
```yaml
script: |
  1. Start by saying "Hi there!"
  2. When the agent greets you, say "I need to return an item"
  3. If the agent asks for order number, provide "ORD-12345"
  4. When agent provides return instructions, say "Thank you"
  5. Observe if agent offers additional help
```
Best Practices for Scripts:

- Use numbered steps for clarity
- Write as instructions: “Start by…”, “When agent…”, “Say…”
- Include conditionals: “If agent asks X, respond with Y”
```yaml
name: successful_order_lookup
description: Test successful order status inquiry
script: |
  1. Start with: "I want to check my order status"
  2. When agent asks for order ID, say: "ABC-123-456"
  3. Agent should provide order status
  4. Say: "When will it arrive?"
  5. Agent should provide delivery estimate
  6. Say: "Thanks!"
rubric: |
  1. Agent asks for order ID promptly (25%)
  2. Agent successfully retrieves order status (25%)
  3. Agent provides delivery information (25%)
  4. Agent maintains professional tone (15%)
  5. Agent offers additional help (10%)

  Critical Failures:
  - Agent cannot find the order
  - Agent provides wrong information
  - Agent asks for sensitive data unnecessarily
```
```yaml
name: invalid_order_handling
description: Test how agent handles invalid order numbers
script: |
  1. Say: "I need to check order INVALID-ID"
  2. Agent should explain they cannot find the order
  3. Agent should offer alternative help
  4. Say: "Maybe I have the wrong number"
  5. Agent should offer to help find the order
rubric: |
  1. Agent politely explains order not found (30%)
  2. Agent doesn't blame the user (20%)
  3. Agent offers alternative solutions (30%)
  4. Agent remains helpful and patient (20%)

  Critical Failures:
  - Agent crashes or shows error messages
  - Agent accuses user of lying
  - Agent gives up without offering help
```
```yaml
name: multi_step_process
description: Test agent handling multi-step booking process
script: |
  1. Say: "I want to book an appointment"
  2. Agent should ask for service type
  3. Say: "Car maintenance"
  4. Agent should ask for preferred date
  5. Say: "Next Tuesday"
  6. Agent should ask for time preference
  7. Say: "Morning please"
  8. Agent should confirm the booking details
  9. Say: "Yes, that's correct"
  10. Agent should provide confirmation
rubric: |
  1. Agent collects all required information (40%)
     - Service type (10%)
     - Date preference (10%)
     - Time preference (10%)
     - Confirmation (10%)
  2. Agent provides clear confirmation (20%)
  3. Agent handles the flow smoothly (20%)
  4. Agent is patient with user (20%)

  Critical Failures:
  - Agent skips required information
  - Agent books without confirmation
  - Agent loses context during conversation
```
```yaml
name: human_handoff_trigger
description: Test global handoff node activation
script: |
  1. Start normally: "Hi, I need help"
  2. Agent greets and offers help
  3. Say: "This is too complicated, I need a human"
  4. Agent should trigger handoff
  5. Verify handoff message is appropriate
rubric: |
  1. Agent recognizes handoff request (40%)
  2. Agent triggers handoff promptly (30%)
  3. Handoff message is professional (20%)
  4. Agent doesn't argue about handoff (10%)

  Critical Failures:
  - Agent ignores handoff request
  - Agent continues automated responses
  - Agent is dismissive of user's request
```
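Every rubric above assigns percentage weights to its criteria, so a quick sanity check when writing a new test is that the weights sum to 100%. A throwaway helper for that check (my own sketch using Python's stdlib `re` module, not a Kapso feature):

```python
import re

def rubric_weights(rubric: str) -> list[int]:
    """Extract percentage weights like '(25%)' from a rubric string."""
    return [int(m) for m in re.findall(r"\((\d+)%\)", rubric)]

rubric = """
1. Agent recognizes handoff request (40%)
2. Agent triggers handoff promptly (30%)
3. Handoff message is professional (20%)
4. Agent doesn't argue about handoff (10%)
"""

weights = rubric_weights(rubric)
print(weights, sum(weights))  # [40, 30, 20, 10] 100
```

Weights that sum to more or less than 100% make the judge's reported totals hard to interpret, so it is worth catching this before a test ships.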
```shell
# Run all tests
kapso test

# Run specific test suite by name
kapso test "Customer Support Tests"

# Run with verbose output
kapso test --verbose

# Run specific test file
kapso test tests/customer_support/greeting_test.yaml
```
```yaml
# ✅ Good: Natural, instructional
script: |
  1. Greet the agent casually
  2. When asked how they can help, mention you're upset about billing
  3. If agent apologizes, acknowledge it
  4. Express that you want to cancel service

# ❌ Bad: Too rigid
script: |
  User: Hello
  Agent: Hi, how can I help?
  User: I'm upset about billing
  Agent: I apologize for the inconvenience
```
```yaml
script: |
  1. Say: "Check order 12345"
  2. Agent should make API call (may show thinking)
  3. Agent provides order status
  4. Verify information is accurate
rubric: |
  1. Agent extracts order ID correctly (25%)
  2. Agent handles API response properly (35%)
  3. Agent presents information clearly (25%)
  4. Agent offers relevant follow-up (15%)
```
```yaml
script: |
  1. Ask: "What's your return policy?"
  2. Agent should search knowledge base
  3. Verify agent provides accurate policy
  4. Ask follow-up: "What about international returns?"
rubric: |
  1. Agent finds relevant information (30%)
  2. Information is accurate (30%)
  3. Agent handles follow-up question (25%)
  4. Response is well-formatted (15%)
```
```yaml
script: |
  1. Say: "I need help with my order and also want to know your hours"
  2. Agent should use multiple tools appropriately
  3. Verify both pieces of information are provided
rubric: |
  1. Agent recognizes multiple requests (25%)
  2. Agent uses order API tool (25%)
  3. Agent uses knowledge base for hours (25%)
  4. Agent presents both answers clearly (25%)
```