Testing Strategy

Capstan has 3800+ tests across 222+ files. The test suite covers both the framework itself and the applications it scaffolds, verifies, and helps operate. Agent tests are the primary focus, organized into 7 categories.

Agent Testing

1. Unit Tests with Mock LLM

Use the mockLLM() pattern to create deterministic LLM responses for fast, offline testing. No API keys needed.

import { describe, it, expect } from "bun:test";
import { createSmartAgent } from "@zauso-ai/capstan-ai";
import type { LLMProvider, LLMMessage, LLMResponse, LLMOptions } from "@zauso-ai/capstan-ai";

function mockLLM(responses: string[]): LLMProvider {
  let i = 0;
  return {
    name: "mock",
    async chat(_msgs: LLMMessage[], _opts?: LLMOptions): Promise<LLMResponse> {
      return { content: responses[i++] ?? "done", model: "mock" };
    },
  };
}

// Tool hooks such as validate() can be exercised directly, with no agent or LLM involved.
it("validates tool arguments before execution", () => {
  const tool = {
    name: "write_file",
    description: "Write a file",
    parameters: {
      type: "object",
      properties: { path: { type: "string" } },
      required: ["path"],
    },
    validate(args: Record<string, unknown>) {
      if (typeof args.path !== "string" || args.path.includes(".."))
        return { valid: false, error: "Invalid path" };
      return { valid: true };
    },
    async execute(args: Record<string, unknown>) {
      return { written: args.path };
    },
  };

  expect(tool.validate({ path: "../etc/passwd" })).toEqual({
    valid: false,
    error: "Invalid path",
  });
});

What to test with mock LLM:

  • Tool execution and result handling
  • Tool input validation (validate hook)
  • Context compression (snip, microcompact, autocompact triggers)
  • Token budget nudge and force-complete behavior
  • Tool result budget truncation and disk persistence
  • Lifecycle hooks (afterIteration, afterToolCall, onRunComplete)
  • Checkpoint serialization and resume
  • Error withholding and retry logic
  • Memory staleness annotations
  • Model fallback with thinking block stripping
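The scripted-response idea behind mockLLM() can be sketched on its own, without the framework. The types below are simplified local stand-ins (an assumption; the real LLMProvider, LLMMessage, and LLMResponse types in @zauso-ai/capstan-ai likely carry more fields), and recordingMockLLM and collectReplies are hypothetical names used only for illustration. The variant also records each call so a test can assert on what the agent sent.

```typescript
// Simplified local stand-ins for the capstan-ai types (assumption: the real
// types carry more fields than shown here).
interface LLMMessage { role: "system" | "user" | "assistant"; content: string }
interface LLMResponse { content: string; model: string }
interface LLMProvider {
  name: string;
  chat(messages: LLMMessage[]): Promise<LLMResponse>;
}

// Same scripted-response idea as mockLLM(), plus a call log for assertions.
function recordingMockLLM(
  responses: string[],
): LLMProvider & { calls: LLMMessage[][] } {
  let i = 0;
  const calls: LLMMessage[][] = [];
  return {
    name: "mock",
    calls,
    async chat(messages: LLMMessage[]): Promise<LLMResponse> {
      calls.push([...messages]); // snapshot of what the caller sent this turn
      // Once the script is exhausted, fall back to "done".
      return { content: responses[i++] ?? "done", model: "mock" };
    },
  };
}

// Hypothetical driver: requests `turns` replies in sequence, appending each
// reply to the running history before the next call.
async function collectReplies(llm: LLMProvider, turns: number): Promise<string[]> {
  const history: LLMMessage[] = [{ role: "user", content: "start" }];
  const replies: string[] = [];
  for (let t = 0; t < turns; t++) {
    const res = await llm.chat(history);
    replies.push(res.content);
    history.push({ role: "assistant", content: res.content });
  }
  return replies;
}
```

Because the script is a plain array, the same test always sees the same replies in the same order, which is what keeps these tests fast and offline.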

2. Real LLM End-to-End Tests

Configure .env.test with real LLM credentials. The describeWithLLM helper runs test suites once per configured provider, skipping gracefully when no credentials are available.

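import { it, expect } from "bun:test";
import { createSmartAgent } from "@zauso-ai/capstan-ai";
import { describeWithLLM } from "./helpers/env.js";

// multiplyTool is assumed to be defined elsewhere in the test file.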

describeWithLLM("Smoke -- agent basics", (provider) => {
  it("calls a single tool correctly", async () => {
    const agent = createSmartAgent({
      llm: provider,
      tools: [multiplyTool],
      maxIterations: 10,
    });
    const result = await agent.run("What is 17 * 23? Use the multiply tool.");
    expect(result.status).toBe("completed");
    expect(result.toolCalls.find(c => c.tool === "multiply")!.result).toBe(391);
  }, 120_000);
});
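The "once per configured provider, skip when none" behavior can be sketched as a small planning function. This is not the actual describeWithLLM implementation: the environment variable names, ProviderConfig, configuredProviders, and planSuites are all assumptions made for illustration.

```typescript
// Hypothetical provider descriptor; the real helper in ./helpers/env.js may differ.
interface ProviderConfig { name: string; apiKey: string }

// Assumed variable names, for illustration only.
const PROVIDER_ENV_KEYS: Record<string, string> = {
  openai: "OPENAI_API_KEY",
  anthropic: "ANTHROPIC_API_KEY",
};

// Returns every provider whose credential is present and non-blank.
function configuredProviders(env: Record<string, string | undefined>): ProviderConfig[] {
  const found: ProviderConfig[] = [];
  for (const name of Object.keys(PROVIDER_ENV_KEYS)) {
    const envKey = PROVIDER_ENV_KEYS[name];
    const apiKey = envKey ? env[envKey]?.trim() : undefined;
    if (apiKey) found.push({ name, apiKey });
  }
  return found;
}

// Plans one suite per provider, or a single skipped placeholder when no
// credentials are configured, so the run reports a skip instead of failing.
function planSuites(title: string, env: Record<string, string | undefined>): string[] {
  const providers = configuredProviders(env);
  if (providers.length === 0) return [`SKIP ${title} (no LLM credentials)`];
  return providers.map((p) => `RUN ${title} [${p.name}]`);
}
```

A real helper would call the test runner's describe and skip functions where this sketch returns strings; keeping the decision in a pure function makes the skip logic itself easy to unit test.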

Three test layers:

  • Smoke (tests/e2e/llm/smoke.test.ts) -- 7 tests, 2 min timeout each. Single tool calls, tool chaining, tool selection, error recovery, skill activation.
  • Long-run (tests/e2e/llm/long-run.test.ts) -- 4 tests, 10 min timeout each. 25+ sequential lookups with context compression, stop hook rejection loops, cross-run evolution.
  • Scenario (tests/e2e/llm/scenario.test.ts) -- 2 tests, 10 min timeout each. Real filesystem workspace: code generation with test verification, bug diagnosis and fix.

3. Adversarial Tests (18 cases)

File: tests/unit/adversarial-llm.test.ts. These tests check that the agent runtime handles malformed or hostile LLM output without crashing:

  • Malformed JSON in tool call arguments
  • Nonexistent tools -- LLM calls a tool name not in the registry
  • Huge responses exceeding toolResultBudget.maxChars
  • Null arguments, missing required fields
  • Path traversal -- ../ in tool arguments; validate hook should reject
  • Type coercion -- wrong argument types (string where number expected)

4. Long-Chain Tests (6 cases)

Test agent behavior over 50+ iterations with context compression:

it("survives 50+ iterations with autocompact", async () => {
  const agent = createSmartAgent({
    llm: provider,
    tools: longRunningTools,
    maxIterations: 60,
    contextWindowSize: 16_000,
    compaction: {
      autocompact: { threshold: 0.85, maxFailures: 3 },
      microcompact: { maxToolResultChars: 2000, protectedTail: 4 },
      snip: { preserveTail: 6 },
    },
  });
  const result = await agent.run("Process all 50 items in the queue.");
  expect(result.status).toBe("completed");
  expect(result.iterations).toBeGreaterThan(30);
}, 600_000);

What long-chain tests verify:

  • Compression fires correctly
  • Accumulated state survives compaction
  • Multiple compression tiers activate in sequence
  • Goal coherence across compact boundaries
  • No infinite loops when compression cannot free enough space
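To make the configuration in the long-chain example concrete: with a contextWindowSize of 16,000 and an autocompact threshold of 0.85, one plausible reading is that autocompact triggers at roughly 13,600 tokens. The sketch below illustrates that arithmetic and a tail-preserving snip. The 4-characters-per-token estimate, the function names, and the exact meaning of preserveTail are assumptions, not Capstan's documented behavior.

```typescript
interface ChatMessage { role: string; content: string }

// Rough heuristic (assumption): about 4 characters per token.
function estimateTokens(messages: ChatMessage[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}

// Fires once usage reaches the configured fraction of the context window.
function shouldAutocompact(
  usedTokens: number,
  contextWindowSize: number,
  threshold: number,
): boolean {
  return usedTokens >= contextWindowSize * threshold;
}

// Snip sketch: keep the first message (the goal) plus the last
// `preserveTail` messages, dropping everything in between.
function snip(messages: ChatMessage[], preserveTail: number): ChatMessage[] {
  if (messages.length <= preserveTail + 1) return messages.slice();
  const tail = messages.slice(messages.length - preserveTail);
  return [messages[0]!, ...tail];
}
```

Keeping the first message is one way to protect goal coherence across a compact boundary; a test can assert that it is still present after any number of snips.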

5. Idempotency Tests (6 cases)

Test checkpoint resume and deterministic replay:

  • Save a checkpoint mid-run via onCheckpoint hook
  • Resume from saved checkpoint with agent.resume(checkpoint, message)
  • Verify the resumed run produces the same outcome as a fresh run
  • Checkpoint serialization round-trips cleanly (JSON parse/stringify)
  • Evolution store deduplication (same experience recorded twice produces one entry)
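The last two items can be illustrated without the framework. The Checkpoint shape and ExperienceStore class below are hypothetical (the real checkpoint format and evolution store are not shown in this document); the sketch only demonstrates the two properties under test: a JSON round trip that loses nothing, and a store that collapses identical records.

```typescript
// Hypothetical checkpoint shape, limited to JSON-safe values.
interface Checkpoint {
  iteration: number;
  messages: { role: string; content: string }[];
}

// Serializes and parses again, as a persistence layer would.
// Note: JSON drops undefined fields and turns Date objects into strings,
// so a checkpoint must hold only JSON-safe values to round-trip cleanly.
function roundTrip(cp: Checkpoint): Checkpoint {
  return JSON.parse(JSON.stringify(cp)) as Checkpoint;
}

interface Experience { task: string; outcome: string }

// Hypothetical dedup store: identical experiences map to the same key,
// so recording one twice leaves a single entry.
class ExperienceStore {
  private entries = new Map<string, Experience>();

  record(exp: Experience): void {
    // Serializing the pair as a JSON array gives an unambiguous key.
    this.entries.set(JSON.stringify([exp.task, exp.outcome]), exp);
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A round-trip test compares the serialized forms rather than object identity, since the restored checkpoint is always a fresh object.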

6. Smoke Tests (55+ cases)

Test every public API and configuration combination: with and without fallbackLlm, skills, evolution, llmTimeout, compaction at each tier, toolResultBudget, and tokenBudget as number vs. object. The suite also covers all AgentTool option permutations, the skill lifecycle, and the evolution lifecycle.
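One way to cover "with and without" combinations systematically is to generate them as a cartesian product and loop over the result in a table-driven test. The sketch below is an illustration, not the suite's actual structure; combinations and agentOptionMatrix are hypothetical names, and each boolean simply stands for "option supplied" vs. "option omitted".

```typescript
// Each key maps to the values that option should take across the matrix.
type OptionMatrix = Record<string, readonly unknown[]>;

// Builds the cartesian product of all option values, one key at a time.
function combinations(matrix: OptionMatrix): Record<string, unknown>[] {
  let out: Record<string, unknown>[] = [{}];
  for (const key of Object.keys(matrix)) {
    const next: Record<string, unknown>[] = [];
    for (const partial of out) {
      for (const value of matrix[key] ?? []) {
        next.push({ ...partial, [key]: value });
      }
    }
    out = next;
  }
  return out;
}

// Booleans stand for "option supplied" vs. "option omitted".
const agentOptionMatrix: OptionMatrix = {
  fallbackLlm: [false, true],
  skills: [false, true],
  evolution: [false, true],
  llmTimeout: [false, true],
};

// Four on/off options yield 2^4 = 16 configurations to run.
const optionCases = combinations(agentOptionMatrix);
```

Each generated case can then be mapped to a real agent configuration inside the test body, so adding an option to the matrix doubles coverage without new test code.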

7. Engineering Maturity Tests (50 cases)

File: tests/unit/engineering-maturity.test.ts. Comprehensive tests for production-grade agent infrastructure:

  • Integration -- validation pipeline, tool timeout, LLM watchdog, disk persistence of large results, microcompact with caching
  • Adversarial -- path traversal in validate, edge cases in JSON Schema validation, malformed tool parameters
  • Contract -- afterToolCall status values, LLMTimeoutConfig shape, streaming idle timeout behavior, persistDir file creation
  • Regression -- backward compatibility with minimal config, default timeout values, compaction with zero-length messages
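The tool timeout item can be illustrated with the standard Promise.race pattern. This is a generic sketch rather than Capstan's implementation; withTimeout and ToolTimeoutError are hypothetical names.

```typescript
class ToolTimeoutError extends Error {
  constructor(toolName: string, ms: number) {
    super(`Tool "${toolName}" timed out after ${ms} ms`);
    this.name = "ToolTimeoutError";
  }
}

// Rejects with ToolTimeoutError if `work` has not settled within `ms`.
// The underlying work is not cancelled; only the wait is abandoned.
async function withTimeout<T>(
  toolName: string,
  ms: number,
  work: Promise<T>,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new ToolTimeoutError(toolName, ms)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so a fast tool does not keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

A timeout test pairs a deliberately slow promise with a short limit and asserts on the error name, while a second case confirms that a fast tool still returns its value unchanged.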

Web App Testing

Testing defineAPI Routes

Test an API handler directly by calling the handler function:

import { test, expect } from "bun:test";
import { GET } from "../app/routes/api/tickets/index.api.ts";

test("returns tickets filtered by status", async () => {
  const result = await GET.handler({
    input: { status: "open" },
    params: {},
    ctx: {
      auth: { isAuthenticated: true, type: "human", userId: "u1" },
      request: new Request("http://localhost/api/tickets?status=open"),
      env: {},
    },
  });
  expect(result.tickets).toBeArray();
  expect(result.tickets.every(t => t.status === "open")).toBe(true);
});

Testing Policies

import { test, expect } from "bun:test";
import { definePolicy } from "@zauso-ai/capstan-core";

const policy = definePolicy({
  key: "testPolicy",
  title: "Test",
  effect: "deny",
  async check({ ctx }) {
    if (!ctx.auth.isAuthenticated) return { effect: "deny", reason: "Not authenticated" };
    return { effect: "allow" };
  },
});

test("denies unauthenticated requests", async () => {
  const result = await policy.check({
    ctx: { auth: { isAuthenticated: false, type: "anonymous" } },
    input: {},
  });
  expect(result.effect).toBe("deny");
});

Testing Approval Workflows

import { beforeEach, test, expect } from "bun:test";
import { createApproval, resolveApproval, clearApprovals } from "@zauso-ai/capstan-core";

beforeEach(() => clearApprovals());

test("approval lifecycle", async () => {
  const approval = createApproval({
    route: "POST /tickets",
    input: { title: "New ticket" },
    reason: "Agent write requires review",
    requestedBy: { type: "agent", agentId: "agent_1" },
  });
  expect(approval.status).toBe("pending");

  const resolved = resolveApproval(approval.id, {
    action: "approve",
    reviewedBy: { type: "human", userId: "admin_1" },
  });
  expect(resolved.status).toBe("approved");
});

Performance Benchmarks

Capstan keeps a committed benchmark suite under benchmarks/. The current gates cover React SSR render hot paths, route scanning and matching on synthetic app trees, in-memory runtime request handling, and page runtime document and navigation payload generation.

Run with npm run perf:check to enforce budgets.
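How perf:check enforces its budgets is not described here, but the general shape of a budget gate can be sketched: take a robust statistic such as the median over several samples, compare it to a committed limit, and fail when any benchmark exceeds its limit or has no measurement. All names below are hypothetical.

```typescript
interface BenchResult { name: string; medianMs: number }
interface Budget { name: string; maxMs: number }

// Median is less sensitive to a single slow outlier than the mean.
function median(samples: number[]): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]!
    : (sorted[mid - 1]! + sorted[mid]!) / 2;
}

// Returns one message per violated budget; an empty list means the gate passes.
function checkBudgets(results: BenchResult[], budgets: Budget[]): string[] {
  const failures: string[] = [];
  for (const budget of budgets) {
    const result = results.find((r) => r.name === budget.name);
    if (!result) {
      failures.push(`${budget.name}: no measurement`);
    } else if (result.medianMs > budget.maxMs) {
      failures.push(`${budget.name}: ${result.medianMs} ms > ${budget.maxMs} ms`);
    }
  }
  return failures;
}
```

A gate script would exit with a non-zero status when the returned list is non-empty, which is what lets CI block a regression.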

Running Tests

# Full suite (3800+ tests)
npm test

# Real LLM tests (requires .env.test)
npm run test:llm

# Smoke only (fast, 2 min timeout)
npm run test:llm:smoke

# Long-run + scenario (slow, 10 min timeout)
npm run test:llm:long

# Node contract suite (Vitest)
npm run test:node

# Browser e2e (Playwright)
npm run test:e2e

# Single file
bun test tests/unit/engineering-maturity.test.ts

# Performance gate
npm run perf:check