Overview
Kapso defines test cases in YAML and evaluates them with an AI judge, allowing more flexible, natural testing of conversational agents than exact-match assertions.
Complete example
# tests/customer_support/order_lookup_test.yaml
name: order_lookup_test
description: Test successful order status inquiry
script: |
  1. Start with: "I want to check my order status"
  2. When agent asks for order ID, say: "ABC-123-456"
  3. Agent should provide order status
  4. Say: "When will it arrive?"
  5. Agent should provide delivery estimate
  6. Say: "Thanks!"
rubric: |
  1. Agent asks for order ID promptly (25%)
  2. Agent successfully retrieves order status (25%)
  3. Agent provides delivery information (25%)
  4. Agent maintains professional tone (15%)
  5. Agent offers additional help (10%)
  Critical Failures:
  - Agent cannot find the order
  - Agent provides wrong information
  - Agent asks for sensitive data unnecessarily
# Run the test
kapso test tests/customer_support/order_lookup_test.yaml
# Output:
✓ order_lookup_test (0.95/1.00)
- Agent asks for order ID promptly: 0.25/0.25
- Agent successfully retrieves order status: 0.25/0.25
- Agent provides delivery information: 0.25/0.25
- Agent maintains professional tone: 0.15/0.15
- Agent offers additional help: 0.05/0.10 ⚠️
Instead of matching exact strings, with Kapso you:
- Define conversational scenarios in natural language
- Specify expected behaviors, not exact outputs
- Use an AI judge to evaluate if the agent met expectations
- Score performance on a 0.0 to 1.0 scale
Test organization
Directory structure
tests/
├── customer_support/ # Test suite directory
│ ├── test-suite.yaml # Suite metadata
│ ├── greeting_test.yaml # Individual test cases
│ ├── order_flow_test.yaml
│ └── error_handling_test.yaml
└── product_search/
├── test-suite.yaml
├── basic_search_test.yaml
└── advanced_filters_test.yaml
Test suite configuration
Each test suite requires a test-suite.yaml file:
# tests/customer_support/test-suite.yaml
name: Customer Support Tests
description: Comprehensive tests for customer support agent
# id: auto-generated during deployment
Writing test cases
Test case structure
# tests/customer_support/greeting_test.yaml
name: greeting_test
description: Test if agent greets users appropriately
script: |
  1. User says: "Hello, I need help"
  2. Assistant should greet warmly and ask how they can help
  3. User says: "I have a problem with my order"
  4. Assistant should acknowledge the issue and offer to help
rubric: |
  1. Warm and professional greeting (30%)
  2. Asks how they can help (20%)
  3. Acknowledges the user's problem (25%)
  4. Offers specific assistance for orders (25%)
# id: auto-generated during deployment
Script section
The script defines the conversation flow:
script: |
  1. Start by saying "Hi there!"
  2. When the agent greets you, say "I need to return an item"
  3. If the agent asks for order number, provide "ORD-12345"
  4. When agent provides return instructions, say "Thank you"
  5. Observe if agent offers additional help
Best Practices for Scripts:
- Use numbered steps for clarity
- Write as instructions: “Start by…”, “When agent…”, “Say…”
- Include conditionals: “If agent asks X, respond with Y”
- Specify observations: “Observe if…”, “Check that…”
- Use natural language, not rigid scripts
Rubric section
The rubric defines scoring criteria:
rubric: |
  1. Agent greets professionally (20%)
  2. Agent asks for order details (20%)
  3. Agent provides correct return process (30%)
  4. Agent offers additional assistance (15%)
  5. Agent maintains helpful tone throughout (15%)
  Critical Failures:
  - Agent refuses to help with returns
  - Agent provides incorrect return policy
  - Agent ends conversation abruptly
Rubric Guidelines:
- Use percentage-based scoring (must total 100%)
- Focus on behaviors and outcomes, not exact wording
- Include “Critical Failures” for must-not-happen scenarios
- Be specific about what constitutes success
Test examples
Example 1: Happy path test
name: successful_order_lookup
description: Test successful order status inquiry
script: |
  1. Start with: "I want to check my order status"
  2. When agent asks for order ID, say: "ABC-123-456"
  3. Agent should provide order status
  4. Say: "When will it arrive?"
  5. Agent should provide delivery estimate
  6. Say: "Thanks!"
rubric: |
  1. Agent asks for order ID promptly (25%)
  2. Agent successfully retrieves order status (25%)
  3. Agent provides delivery information (25%)
  4. Agent maintains professional tone (15%)
  5. Agent offers additional help (10%)
  Critical Failures:
  - Agent cannot find the order
  - Agent provides wrong information
  - Agent asks for sensitive data unnecessarily
Example 2: Error handling test
name: invalid_order_handling
description: Test how agent handles invalid order numbers
script: |
1. Say: "I need to check order INVALID-ID"
2. Agent should explain they cannot find the order
3. Agent should offer alternative help
4. Say: "Maybe I have the wrong number"
5. Agent should offer to help find the order
rubric: |
1. Agent politely explains order not found (30%)
2. Agent doesn't blame the user (20%)
3. Agent offers alternative solutions (30%)
4. Agent remains helpful and patient (20%)
Critical Failures:
- Agent crashes or shows error messages
- Agent accuses user of lying
- Agent gives up without offering help
Example 3: Complex flow test
name: multi_step_process
description: Test agent handling multi-step booking process
script: |
1. Say: "I want to book an appointment"
2. Agent should ask for service type
3. Say: "Car maintenance"
4. Agent should ask for preferred date
5. Say: "Next Tuesday"
6. Agent should ask for time preference
7. Say: "Morning please"
8. Agent should confirm the booking details
9. Say: "Yes, that's correct"
10. Agent should provide confirmation
rubric: |
1. Agent collects all required information (40%)
- Service type (10%)
- Date preference (10%)
- Time preference (10%)
- Confirmation (10%)
2. Agent provides clear confirmation (20%)
3. Agent handles the flow smoothly (20%)
4. Agent is patient with user (20%)
Critical Failures:
- Agent skips required information
- Agent books without confirmation
- Agent loses context during conversation
Example 4: Global node test
name: human_handoff_trigger
description: Test global handoff node activation
script: |
  1. Start normally: "Hi, I need help"
  2. Agent greets and offers help
  3. Say: "This is too complicated, I need a human"
  4. Agent should trigger handoff
  5. Verify handoff message is appropriate
rubric: |
  1. Agent recognizes handoff request (40%)
  2. Agent triggers handoff promptly (30%)
  3. Handoff message is professional (20%)
  4. Agent doesn't argue about handoff (10%)
  Critical Failures:
  - Agent ignores handoff request
  - Agent continues automated responses
  - Agent is dismissive of user's request
Running tests
CLI commands
# Run all tests
kapso test
# Run specific test suite by name
kapso test "Customer Support Tests"
# Run with verbose output
kapso test --verbose
# Run specific test file
kapso test tests/customer_support/greeting_test.yaml
Test output
Running Customer Support Tests...
✓ greeting_test (0.95/1.00)
- Warm and professional greeting: 0.28/0.30
- Asks how they can help: 0.20/0.20
- Acknowledges the user's problem: 0.25/0.25
- Offers specific assistance: 0.22/0.25
✗ error_handling_test (0.65/1.00)
- Explains order not found: 0.25/0.30
- Doesn't blame user: 0.20/0.20
- Offers alternatives: 0.15/0.30 ⚠️
- Remains helpful: 0.05/0.20 ⚠️
Critical Failure: Agent gave up without offering help
Summary: 1 passed, 1 failed
Average Score: 0.80/1.00
Test development workflow
1. Create test suite
kapso create test-suite --name "Payment Flow Tests"
2. Create test cases
kapso create test-case \
--test-suite "Payment Flow Tests" \
--name "successful_payment"
3. Write test content
Edit the generated YAML file with your script and rubric.
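For example, the file scaffolded for the "successful_payment" test case might be filled in along these lines; the path, conversation, and rubric weights here are purely illustrative, not generated output:
# tests/payment_flow/successful_payment_test.yaml (illustrative path)
name: successful_payment
description: Test a straightforward invoice payment
script: |
  1. Say: "I'd like to pay my outstanding invoice"
  2. When agent asks for the invoice number, say: "INV-12345"
  3. Agent should confirm the amount due
  4. Say: "Yes, use my card on file"
  5. Agent should confirm the payment succeeded
rubric: |
  1. Agent identifies the correct invoice (30%)
  2. Agent confirms the amount before charging (30%)
  3. Agent clearly confirms the payment outcome (25%)
  4. Agent maintains a professional tone (15%)
  Critical Failures:
  - Agent charges without confirmation
  - Agent exposes stored payment details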
4. Test locally
# Test against local agent
kapso test --local
# Test against deployed agent
kapso test
5. Iterate
Refine tests based on results and agent improvements.
Best practices
1. Test coverage
Cover these essential areas:
- Happy paths: Normal, successful flows
- Error cases: Invalid inputs, API failures
- Edge cases: Boundary conditions, unusual requests
- Global nodes: Handoff, help menu triggers
- Multi-turn conversations: Context retention
- Recovery flows: Getting back on track (see the sketch below)
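A recovery-flow test, for instance, might look like the following sketch; the scenario and wording are illustrative and simply reuse the script/rubric format shown earlier:
name: topic_recovery
description: Test that the agent returns to the task after an off-topic detour
script: |
  1. Say: "I want to update my shipping address"
  2. When agent asks for the new address, say: "Actually, what's the weather like today?"
  3. Agent should briefly deflect and steer back to the task
  4. Say: "Sorry, back to my address change"
  5. Agent should resume the address update without losing context
rubric: |
  1. Agent handles the off-topic turn gracefully (30%)
  2. Agent returns to the original task (40%)
  3. Agent retains earlier context (30%)
  Critical Failures:
  - Agent abandons the original request
  - Agent loses previously provided details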
2. Script writing
# ✅ Good: Natural, instructional
script: |
  1. Greet the agent casually
  2. When asked how they can help, mention you're upset about billing
  3. If agent apologizes, acknowledge it
  4. Express that you want to cancel service
# ❌ Bad: Too rigid
script: |
  User: Hello
  Agent: Hi, how can I help?
  User: I'm upset about billing
  Agent: I apologize for the inconvenience
3. Rubric design
# ✅ Good: Behavior-focused
rubric: |
  1. Agent shows empathy for billing concern (30%)
  2. Agent attempts to understand the issue (25%)
  3. Agent offers solutions before cancellation (25%)
  4. Agent respects user's decision (20%)
# ❌ Bad: Too specific about wording
rubric: |
  1. Agent says "I'm sorry" (25%)
  2. Agent uses the word "billing" (25%)
  3. Agent mentions "cancellation" (25%)
  4. Agent says "Is there anything else?" (25%)
4. Scoring guidelines
- 0.90-1.00: Excellent - Ready for production
- 0.80-0.89: Good - Minor improvements needed
- 0.70-0.79: Acceptable - Some issues to address
- 0.60-0.69: Poor - Significant problems
- Below 0.60: Failing - Major rework needed
5. Test maintenance
- Update tests when agent behavior changes
- Add tests for new features
- Remove obsolete tests
- Keep test suites focused and manageable
- Document why specific tests exist
Advanced testing
Testing webhooks
script: |
1. Say: "Check order 12345"
2. Agent should make API call (may show thinking)
3. Agent provides order status
4. Verify information is accurate
rubric: |
1. Agent extracts order ID correctly (25%)
2. Agent handles API response properly (35%)
3. Agent presents information clearly (25%)
4. Agent offers relevant follow-up (15%)
Testing knowledge base
script: |
  1. Ask: "What's your return policy?"
  2. Agent should search knowledge base
  3. Verify agent provides accurate policy
  4. Ask follow-up: "What about international returns?"
rubric: |
  1. Agent finds relevant information (30%)
  2. Information is accurate (30%)
  3. Agent handles follow-up question (25%)
  4. Response is well-formatted (15%)
Testing multiple tools
script: |
  1. Say: "I need help with my order and also want to know your hours"
  2. Agent should use multiple tools appropriately
  3. Verify both pieces of information are provided
rubric: |
  1. Agent recognizes multiple requests (25%)
  2. Agent uses order API tool (25%)
  3. Agent uses knowledge base for hours (25%)
  4. Agent presents both answers clearly (25%)
Debugging failed tests
- Run with verbose mode:
kapso test --verbose
- Check the actual conversation: Review what the agent actually said
- Verify node transitions: Ensure the agent follows expected flow
- Test script clarity: Make sure instructions are clear
- Adjust rubric weights: Fine-tune scoring percentages
- Add console logs: Use print statements in agent code during development