Overview

Kapso uses a testing approach where test cases are defined in YAML format and evaluated by an AI judge. This allows for more flexible, natural testing of conversational agents.

Complete example

# tests/customer_support/order_lookup_test.yaml
name: order_lookup_test
description: Test successful order status inquiry
script: |
  1. Start with: "I want to check my order status"
  2. When agent asks for order ID, say: "ABC-123-456"
  3. Agent should provide order status
  4. Say: "When will it arrive?"
  5. Agent should provide delivery estimate
  6. Say: "Thanks!"
  
rubric: |
  1. Agent asks for order ID promptly (25%)
  2. Agent successfully retrieves order status (25%)
  3. Agent provides delivery information (25%)
  4. Agent maintains professional tone (15%)
  5. Agent offers additional help (10%)
  
  Critical Failures:
  - Agent cannot find the order
  - Agent provides wrong information
  - Agent asks for sensitive data unnecessarily

# Run the test
kapso test tests/customer_support/order_lookup_test.yaml

# Output:
✓ order_lookup_test (0.95/1.00)
  - Agent asks for order ID promptly: 0.25/0.25
  - Agent successfully retrieves order status: 0.25/0.25
  - Agent provides delivery information: 0.25/0.25
  - Agent maintains professional tone: 0.15/0.15
  - Agent offers additional help: 0.05/0.10 ⚠️
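
Each rubric criterion is scored against its weight, and in this output the per-criterion scores sum to the overall score: 0.25 + 0.25 + 0.25 + 0.15 + 0.05 = 0.95. The ⚠️ marker flags a criterion that earned less than its full weight.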

Instead of relying on exact string matching, Kapso lets you:

  • Define conversational scenarios in natural language
  • Specify expected behaviors, not exact outputs
  • Use an AI judge to evaluate if the agent met expectations
  • Score performance on a 0.0 to 1.0 scale

Test organization

Directory structure

tests/
├── customer_support/           # Test suite directory
│   ├── test-suite.yaml        # Suite metadata
│   ├── greeting_test.yaml     # Individual test cases
│   ├── order_flow_test.yaml
│   └── error_handling_test.yaml
└── product_search/
    ├── test-suite.yaml
    ├── basic_search_test.yaml
    └── advanced_filters_test.yaml

Test suite configuration

Each test suite requires a test-suite.yaml file:

# tests/customer_support/test-suite.yaml
name: Customer Support Tests
description: Comprehensive tests for customer support agent
# id: auto-generated during deployment

Writing test cases

Test case structure

# tests/customer_support/greeting_test.yaml
name: greeting_test
description: Test if agent greets users appropriately
script: |
  1. User says: "Hello, I need help"
  2. Assistant should greet warmly and ask how they can help
  3. User says: "I have a problem with my order"
  4. Assistant should acknowledge the issue and offer to help
rubric: |
  1. Warm and professional greeting (30%)
  2. Asks how they can help (20%)
  3. Acknowledges the user's problem (25%)
  4. Offers specific assistance for orders (25%)
# id: auto-generated during deployment

Script section

The script defines the conversation flow:

script: |
  1. Start by saying "Hi there!"
  2. When the agent greets you, say "I need to return an item"
  3. If the agent asks for order number, provide "ORD-12345"
  4. When agent provides return instructions, say "Thank you"
  5. Observe if agent offers additional help

Best Practices for Scripts:

  • Use numbered steps for clarity
  • Write as instructions: "Start by…", "When agent…", "Say…"
  • Include conditionals: "If agent asks X, respond with Y"
  • Specify observations: "Observe if…", "Check that…"
  • Use natural language, not rigid scripts
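
The script example above already follows these practices: numbered steps throughout, a conditional in step 3 ("If the agent asks for order number…"), and an explicit observation in step 5 ("Observe if agent offers additional help").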

Rubric section

The rubric defines scoring criteria:

rubric: |
  1. Agent greets professionally (20%)
  2. Agent asks for order details (20%)
  3. Agent provides correct return process (30%)
  4. Agent offers additional assistance (15%)
  5. Agent maintains helpful tone throughout (15%)
  
  Critical Failures:
  - Agent refuses to help with returns
  - Agent provides incorrect return policy
  - Agent ends conversation abruptly

Rubric Guidelines:

  • Use percentage-based scoring (must total 100%)
  • Focus on behaviors and outcomes, not exact wording
  • Include "Critical Failures" for must-not-happen scenarios
  • Be specific about what constitutes success
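
As a quick check, the weights in the rubric above total 100%: 20 + 20 + 30 + 15 + 15 = 100.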

Test examples

Example 1: Happy path test

name: successful_order_lookup
description: Test successful order status inquiry
script: |
  1. Start with: "I want to check my order status"
  2. When agent asks for order ID, say: "ABC-123-456"
  3. Agent should provide order status
  4. Say: "When will it arrive?"
  5. Agent should provide delivery estimate
  6. Say: "Thanks!"
  
rubric: |
  1. Agent asks for order ID promptly (25%)
  2. Agent successfully retrieves order status (25%)
  3. Agent provides delivery information (25%)
  4. Agent maintains professional tone (15%)
  5. Agent offers additional help (10%)
  
  Critical Failures:
  - Agent cannot find the order
  - Agent provides wrong information
  - Agent asks for sensitive data unnecessarily

Example 2: Error handling test

name: invalid_order_handling
description: Test how agent handles invalid order numbers
script: |
  1. Say: "I need to check order INVALID-ID"
  2. Agent should explain they cannot find the order
  3. Agent should offer alternative help
  4. Say: "Maybe I have the wrong number"
  5. Agent should offer to help find the order
  
rubric: |
  1. Agent politely explains order not found (30%)
  2. Agent doesn't blame the user (20%)
  3. Agent offers alternative solutions (30%)
  4. Agent remains helpful and patient (20%)
  
  Critical Failures:
  - Agent crashes or shows error messages
  - Agent accuses user of lying
  - Agent gives up without offering help

Example 3: Complex flow test

name: multi_step_process
description: Test agent handling multi-step booking process
script: |
  1. Say: "I want to book an appointment"
  2. Agent should ask for service type
  3. Say: "Car maintenance"
  4. Agent should ask for preferred date
  5. Say: "Next Tuesday"
  6. Agent should ask for time preference
  7. Say: "Morning please"
  8. Agent should confirm the booking details
  9. Say: "Yes, that's correct"
  10. Agent should provide confirmation
  
rubric: |
  1. Agent collects all required information (40%)
     - Service type (10%)
     - Date preference (10%)
     - Time preference (10%)
     - Confirmation (10%)
  2. Agent provides clear confirmation (20%)
  3. Agent handles the flow smoothly (20%)
  4. Agent is patient with user (20%)
  
  Critical Failures:
  - Agent skips required information
  - Agent books without confirmation
  - Agent loses context during conversation

Example 4: Global node test

name: human_handoff_trigger
description: Test global handoff node activation
script: |
  1. Start normally: "Hi, I need help"
  2. Agent should greet and offer help
  3. Say: "This is too complicated, I need a human"
  4. Agent should trigger handoff
  5. Verify handoff message is appropriate
  
rubric: |
  1. Agent recognizes handoff request (40%)
  2. Agent triggers handoff promptly (30%)
  3. Handoff message is professional (20%)
  4. Agent doesn't argue about handoff (10%)
  
  Critical Failures:
  - Agent ignores handoff request
  - Agent continues automated responses
  - Agent is dismissive of user's request

Running tests

CLI commands

# Run all tests
kapso test

# Run specific test suite by name
kapso test "Customer Support Tests"

# Run with verbose output
kapso test --verbose

# Run specific test file
kapso test tests/customer_support/greeting_test.yaml

Test output

Running Customer Support Tests...

✓ greeting_test (0.95/1.00)
  - Warm and professional greeting: 0.28/0.30
  - Asks how they can help: 0.20/0.20
  - Acknowledges the user's problem: 0.25/0.25
  - Offers specific assistance: 0.22/0.25

✗ error_handling_test (0.65/1.00)
  - Explains order not found: 0.25/0.30
  - Doesn't blame user: 0.20/0.20
  - Offers alternatives: 0.15/0.30 ⚠️
  - Remains helpful: 0.05/0.20 ⚠️
  
  Critical Failure: Agent gave up without offering help

Summary: 1 passed, 1 failed
Average Score: 0.80/1.00

Test development workflow

1. Create test suite

kapso create test-suite --name "Payment Flow Tests"

2. Create test cases

kapso create test-case \
  --test-suite "Payment Flow Tests" \
  --name "successful_payment"

3. Write test content

Edit the generated YAML file with your script and rubric.
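
For example, a minimal version of the file might look like this (the path and conversation details are illustrative; the name matches the one passed to kapso create test-case):

# tests/payment_flow/successful_payment_test.yaml
name: successful_payment
description: Test a successful card payment
script: |
  1. Say: "I'd like to pay my invoice"
  2. When agent asks for payment details, provide test card information
  3. Agent should confirm the payment succeeded
  4. Say: "Thanks!"
rubric: |
  1. Agent collects payment details clearly (40%)
  2. Agent confirms the payment result (40%)
  3. Agent maintains professional tone (20%)

  Critical Failures:
  - Agent processes payment without confirmation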

4. Test locally

# Test against local agent
kapso test --local

# Test against deployed agent
kapso test

5. Iterate

Refine tests based on results and agent improvements.

Best practices

1. Test coverage

Cover these essential areas:

  • Happy paths: Normal, successful flows
  • Error cases: Invalid inputs, API failures
  • Edge cases: Boundary conditions, unusual requests
  • Global nodes: Handoff, help menu triggers
  • Multi-turn conversations: Context retention
  • Recovery flows: Getting back on track (see the sketch below)
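
For example, a recovery-flow test can follow the same structure as the examples above (the scenario here is illustrative):

name: topic_recovery
description: Test that agent gets back on track after an off-topic detour
script: |
  1. Say: "I want to check my order status"
  2. When agent asks for order ID, say: "Actually, what's the weather like?"
  3. Agent should gently redirect to the order lookup
  4. Provide the order ID: "ABC-123-456"
  5. Agent should continue the original flow
rubric: |
  1. Agent handles the off-topic question gracefully (30%)
  2. Agent redirects back to the order flow (30%)
  3. Agent retains context from before the detour (25%)
  4. Agent completes the original request (15%)

  Critical Failures:
  - Agent loses the original request entirely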

2. Script writing

# ✅ Good: Natural, instructional
script: |
  1. Greet the agent casually
  2. When asked how they can help, mention you're upset about billing
  3. If agent apologizes, acknowledge it
  4. Express that you want to cancel service
  
# ❌ Bad: Too rigid
script: |
  User: Hello
  Agent: Hi, how can I help?
  User: I'm upset about billing
  Agent: I apologize for the inconvenience

3. Rubric design

# ✅ Good: Behavior-focused
rubric: |
  1. Agent shows empathy for billing concern (30%)
  2. Agent attempts to understand the issue (25%)
  3. Agent offers solutions before cancellation (25%)
  4. Agent respects user's decision (20%)

# ❌ Bad: Too specific about wording
rubric: |
  1. Agent says "I'm sorry" (25%)
  2. Agent uses the word "billing" (25%)
  3. Agent mentions "cancellation" (25%)
  4. Agent says "Is there anything else?" (25%)

4. Scoring guidelines

  • 0.90-1.00: Excellent - Ready for production
  • 0.80-0.89: Good - Minor improvements needed
  • 0.70-0.79: Acceptable - Some issues to address
  • 0.60-0.69: Poor - Significant problems
  • Below 0.60: Failing - Major rework needed

5. Test maintenance

  • Update tests when agent behavior changes
  • Add tests for new features
  • Remove obsolete tests
  • Keep test suites focused and manageable
  • Document why specific tests exist

Advanced testing

Testing webhooks

script: |
  1. Say: "Check order 12345"
  2. Agent should make API call (may show thinking)
  3. Agent provides order status
  4. Verify information is accurate
  
rubric: |
  1. Agent extracts order ID correctly (25%)
  2. Agent handles API response properly (35%)
  3. Agent presents information clearly (25%)
  4. Agent offers relevant follow-up (15%)

Testing knowledge base

script: |
  1. Ask: "What's your return policy?"
  2. Agent should search knowledge base
  3. Verify agent provides accurate policy
  4. Ask follow-up: "What about international returns?"
  
rubric: |
  1. Agent finds relevant information (30%)
  2. Information is accurate (30%)
  3. Agent handles follow-up question (25%)
  4. Response is well-formatted (15%)

Testing tool selection

script: |
  1. Say: "I need help with my order and also want to know your hours"
  2. Agent should use multiple tools appropriately
  3. Verify both pieces of information are provided
  
rubric: |
  1. Agent recognizes multiple requests (25%)
  2. Agent uses order API tool (25%)
  3. Agent uses knowledge base for hours (25%)
  4. Agent presents both answers clearly (25%)
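
Each of these fragments slots into a complete test file using the same name, description, script, and rubric fields shown throughout this page; for example (the name and description here are illustrative):

name: webhook_order_status
description: Test webhook-backed order status lookup
script: |
  1. Say: "Check order 12345"
  2. Agent should make API call (may show thinking)
  3. Agent provides order status
  4. Verify information is accurate
rubric: |
  1. Agent extracts order ID correctly (25%)
  2. Agent handles API response properly (35%)
  3. Agent presents information clearly (25%)
  4. Agent offers relevant follow-up (15%)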

Debugging failed tests

  1. Run with verbose mode: kapso test --verbose
  2. Check the actual conversation: Review what the agent said
  3. Verify node transitions: Ensure the agent follows expected flow
  4. Test script clarity: Make sure instructions are clear
  5. Adjust rubric weights: Fine-tune scoring percentages
  6. Add console logs: Use print statements in agent code during development