Overview
Kapso defines test cases in YAML and evaluates them with an AI judge, allowing more flexible, natural testing of conversational agents than exact-match assertions.
Complete example
# tests/customer_support/order_lookup_test.yaml
name: order_lookup_test
description: Test successful order status inquiry
script: |
  1. Start with: "I want to check my order status"
  2. When agent asks for order ID, say: "ABC-123-456"
  3. Agent should provide order status
  4. Say: "When will it arrive?"
  5. Agent should provide delivery estimate
  6. Say: "Thanks!"
rubric: |
  1. Agent asks for order ID promptly (25%)
  2. Agent successfully retrieves order status (25%)
  3. Agent provides delivery information (25%)
  4. Agent maintains professional tone (15%)
  5. Agent offers additional help (10%)
  Critical Failures:
  - Agent cannot find the order
  - Agent provides wrong information
  - Agent asks for sensitive data unnecessarily
# Run the test
kapso test tests/customer_support/order_lookup_test.yaml
# Output:
✓ order_lookup_test (0.95/1.00)
- Agent asks for order ID promptly: 0.25/0.25
- Agent successfully retrieves order status: 0.25/0.25
- Agent provides delivery information: 0.25/0.25
- Agent maintains professional tone: 0.15/0.15
- Agent offers additional help: 0.05/0.10 ⚠️
Instead of matching exact strings, with Kapso you:
- Define conversational scenarios in natural language
- Specify expected behaviors, not exact outputs
- Use an AI judge to evaluate if the agent met expectations
- Score performance on a 0.0 to 1.0 scale
Test organization
Directory structure
tests/
├── customer_support/ # Test suite directory
│ ├── test-suite.yaml # Suite metadata
│ ├── greeting_test.yaml # Individual test cases
│ ├── order_flow_test.yaml
│ └── error_handling_test.yaml
└── product_search/
├── test-suite.yaml
├── basic_search_test.yaml
└── advanced_filters_test.yaml
Test suite configuration
Each test suite requires a test-suite.yaml file:
# tests/customer_support/test-suite.yaml
name: Customer Support Tests
description: Comprehensive tests for customer support agent
# id: auto-generated during deployment
Writing test cases
Test case structure
# tests/customer_support/greeting_test.yaml
name: greeting_test
description: Test if agent greets users appropriately
script: |
  1. User says: "Hello, I need help"
  2. Assistant should greet warmly and ask how they can help
  3. User says: "I have a problem with my order"
  4. Assistant should acknowledge the issue and offer to help
rubric: |
  1. Warm and professional greeting (30%)
  2. Asks how they can help (20%)
  3. Acknowledges the user's problem (25%)
  4. Offers specific assistance for orders (25%)
# id: auto-generated during deployment
Script section
The script defines the conversation flow:
script: |
  1. Start by saying "Hi there!"
  2. When the agent greets you, say "I need to return an item"
  3. If the agent asks for order number, provide "ORD-12345"
  4. When agent provides return instructions, say "Thank you"
  5. Observe if agent offers additional help
Best Practices for Scripts:
- Use numbered steps for clarity
- Write as instructions: “Start by…”, “When agent…”, “Say…”
- Include conditionals: “If agent asks X, respond with Y”
- Specify observations: “Observe if…”, “Check that…”
- Use natural language, not rigid scripts
Rubric section
The rubric defines scoring criteria:
rubric: |
  1. Agent greets professionally (20%)
  2. Agent asks for order details (20%)
  3. Agent provides correct return process (30%)
  4. Agent offers additional assistance (15%)
  5. Agent maintains helpful tone throughout (15%)
  Critical Failures:
  - Agent refuses to help with returns
  - Agent provides incorrect return policy
  - Agent ends conversation abruptly
Rubric Guidelines:
- Use percentage-based scoring (must total 100%)
- Focus on behaviors and outcomes, not exact wording
- Include “Critical Failures” for must-not-happen scenarios
- Be specific about what constitutes success
Test examples
Example 1: Happy path test
name: successful_order_lookup
description: Test successful order status inquiry
script: |
  1. Start with: "I want to check my order status"
  2. When agent asks for order ID, say: "ABC-123-456"
  3. Agent should provide order status
  4. Say: "When will it arrive?"
  5. Agent should provide delivery estimate
  6. Say: "Thanks!"
rubric: |
  1. Agent asks for order ID promptly (25%)
  2. Agent successfully retrieves order status (25%)
  3. Agent provides delivery information (25%)
  4. Agent maintains professional tone (15%)
  5. Agent offers additional help (10%)
  Critical Failures:
  - Agent cannot find the order
  - Agent provides wrong information
  - Agent asks for sensitive data unnecessarily
Example 2: Error handling test
name: invalid_order_handling
description: Test how agent handles invalid order numbers
script: |
1. Say: "I need to check order INVALID-ID"
2. Agent should explain they cannot find the order
3. Agent should offer alternative help
4. Say: "Maybe I have the wrong number"
5. Agent should offer to help find the order
rubric: |
1. Agent politely explains order not found (30%)
2. Agent doesn't blame the user (20%)
3. Agent offers alternative solutions (30%)
4. Agent remains helpful and patient (20%)
Critical Failures:
- Agent crashes or shows error messages
- Agent accuses user of lying
- Agent gives up without offering help
Example 3: Complex flow test
name: multi_step_process
description: Test agent handling multi-step booking process
script: |
1. Say: "I want to book an appointment"
2. Agent should ask for service type
3. Say: "Car maintenance"
4. Agent should ask for preferred date
5. Say: "Next Tuesday"
6. Agent should ask for time preference
7. Say: "Morning please"
8. Agent should confirm the booking details
9. Say: "Yes, that's correct"
10. Agent should provide confirmation
rubric: |
1. Agent collects all required information (40%)
- Service type (10%)
- Date preference (10%)
- Time preference (10%)
- Confirmation (10%)
2. Agent provides clear confirmation (20%)
3. Agent handles the flow smoothly (20%)
4. Agent is patient with user (20%)
Critical Failures:
- Agent skips required information
- Agent books without confirmation
- Agent loses context during conversation
Example 4: Global node test
name: human_handoff_trigger
description: Test global handoff node activation
script: |
  1. Start normally: "Hi, I need help"
  2. Agent greets and offers help
  3. Say: "This is too complicated, I need a human"
  4. Agent should trigger handoff
  5. Verify handoff message is appropriate
rubric: |
  1. Agent recognizes handoff request (40%)
  2. Agent triggers handoff promptly (30%)
  3. Handoff message is professional (20%)
  4. Agent doesn't argue about handoff (10%)
  Critical Failures:
  - Agent ignores handoff request
  - Agent continues automated responses
  - Agent is dismissive of user's request
Running tests
CLI commands
# Run all tests
kapso test
# Run specific test suite by name
kapso test "Customer Support Tests"
# Run with verbose output
kapso test --verbose
# Run specific test file
kapso test tests/customer_support/greeting_test.yaml
Test output
Running Customer Support Tests...
✓ greeting_test (0.95/1.00)
- Warm and professional greeting: 0.28/0.30
- Asks how they can help: 0.20/0.20
- Acknowledges the user's problem: 0.25/0.25
- Offers specific assistance: 0.22/0.25
✗ error_handling_test (0.65/1.00)
- Explains order not found: 0.25/0.30
- Doesn't blame user: 0.20/0.20
- Offers alternatives: 0.15/0.30 ⚠️
- Remains helpful: 0.05/0.20 ⚠️
Critical Failure: Agent gave up without offering help
Summary: 1 passed, 1 failed
Average Score: 0.80/1.00
Test development workflow
1. Create test suite
kapso create test-suite --name "Payment Flow Tests"
2. Create test cases
kapso create test-case \
--test-suite "Payment Flow Tests" \
--name "successful_payment"
3. Write test content
Edit the generated YAML file with your script and rubric.
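For example, the file scaffolded for the "successful_payment" test case might be filled in along these lines; the path, conversation, and rubric weights here are purely illustrative, not generated output:
# tests/payment_flow/successful_payment_test.yaml (illustrative path)
name: successful_payment
description: Test a straightforward invoice payment
script: |
  1. Say: "I'd like to pay my outstanding invoice"
  2. When agent asks for the invoice number, say: "INV-12345"
  3. Agent should confirm the amount due
  4. Say: "Yes, use my card on file"
  5. Agent should confirm the payment succeeded
rubric: |
  1. Agent identifies the correct invoice (30%)
  2. Agent confirms the amount before charging (30%)
  3. Agent clearly confirms the payment outcome (25%)
  4. Agent maintains a professional tone (15%)
  Critical Failures:
  - Agent charges without confirmation
  - Agent exposes stored payment details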
4. Test locally
# Test against local agent
kapso test --local
# Test against deployed agent
kapso test
5. Iterate
Refine tests based on results and agent improvements.
Best practices
1. Test coverage
Cover these essential areas:
- Happy paths: Normal, successful flows
- Error cases: Invalid inputs, API failures
- Edge cases: Boundary conditions, unusual requests
- Global nodes: Handoff, help menu triggers
- Multi-turn conversations: Context retention
- Recovery flows: Getting back on track (see the sketch below)
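A recovery-flow test, for instance, might look like the following sketch; the scenario and wording are illustrative and simply reuse the script/rubric format shown earlier:
name: topic_recovery
description: Test that the agent returns to the task after an off-topic detour
script: |
  1. Say: "I want to update my shipping address"
  2. When agent asks for the new address, say: "Actually, what's the weather like today?"
  3. Agent should briefly deflect and steer back to the task
  4. Say: "Sorry, back to my address change"
  5. Agent should resume the address update without losing context
rubric: |
  1. Agent handles the off-topic turn gracefully (30%)
  2. Agent returns to the original task (40%)
  3. Agent retains earlier context (30%)
  Critical Failures:
  - Agent abandons the original request
  - Agent loses previously provided details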
2. Script writing
# ✅ Good: Natural, instructional
script: |
  1. Greet the agent casually
  2. When asked how they can help, mention you're upset about billing
  3. If agent apologizes, acknowledge it
  4. Express that you want to cancel service
# ❌ Bad: Too rigid
script: |
  User: Hello
  Agent: Hi, how can I help?
  User: I'm upset about billing
  Agent: I apologize for the inconvenience
3. Rubric design
# ✅ Good: Behavior-focused
rubric: |
  1. Agent shows empathy for billing concern (30%)
  2. Agent attempts to understand the issue (25%)
  3. Agent offers solutions before cancellation (25%)
  4. Agent respects user's decision (20%)
# ❌ Bad: Too specific about wording
rubric: |
  1. Agent says "I'm sorry" (25%)
  2. Agent uses the word "billing" (25%)
  3. Agent mentions "cancellation" (25%)
  4. Agent says "Is there anything else?" (25%)
4. Scoring guidelines
- 0.90-1.00: Excellent - Ready for production
- 0.80-0.89: Good - Minor improvements needed
- 0.70-0.79: Acceptable - Some issues to address
- 0.60-0.69: Poor - Significant problems
- Below 0.60: Failing - Major rework needed
5. Test maintenance
- Update tests when agent behavior changes
- Add tests for new features
- Remove obsolete tests
- Keep test suites focused and manageable
- Document why specific tests exist
Advanced testing
Testing webhooks
script: |
1. Say: "Check order 12345"
2. Agent should make API call (may show thinking)
3. Agent provides order status
4. Verify information is accurate
rubric: |
1. Agent extracts order ID correctly (25%)
2. Agent handles API response properly (35%)
3. Agent presents information clearly (25%)
4. Agent offers relevant follow-up (15%)
Testing knowledge base
script: |
  1. Ask: "What's your return policy?"
  2. Agent should search knowledge base
  3. Verify agent provides accurate policy
  4. Ask follow-up: "What about international returns?"
rubric: |
  1. Agent finds relevant information (30%)
  2. Information is accurate (30%)
  3. Agent handles follow-up question (25%)
  4. Response is well-formatted (15%)
Testing multiple tools
script: |
  1. Say: "I need help with my order and also want to know your hours"
  2. Agent should use multiple tools appropriately
  3. Verify both pieces of information are provided
rubric: |
  1. Agent recognizes multiple requests (25%)
  2. Agent uses order API tool (25%)
  3. Agent uses knowledge base for hours (25%)
  4. Agent presents both answers clearly (25%)
Debugging failed tests
- Run with verbose mode:
kapso test --verbose
- Check the actual conversation: Review what the agent actually said
- Verify node transitions: Ensure the agent follows expected flow
- Test script clarity: Make sure instructions are clear
- Adjust rubric weights: Fine-tune scoring percentages
- Add console logs: Use print statements in agent code during development