Testing
The project has 1,184+ tests across 102 test files in backend, frontend, and SDK packages.
Commands
| Command | What It Runs |
|---|---|
make test |
All tests (server + frontend) |
make test.server |
Backend tests only (apps/server) |
make test.frontend |
Frontend tests only (apps/frontend) |
make test.client |
SDK tests only (packages/rule-client) |
make test.unit |
Unit tests only (no external services required) |
make test.integration |
Integration tests (requires Docker Compose services) |
make test.e2e |
End-to-end tests (starts stack, uses real Gemini) |
make test.verbose |
Backend tests with verbose output |
make test.cov |
Backend tests with coverage report |
Running tests directly
Backend:
cd apps/server
uv run pytest # all server tests
uv run pytest tests/unit # unit tests only
uv run pytest tests/integration # integration tests only
uv run pytest -k "test_evaluate" # filter by name
Frontend:
cd apps/frontend
pnpm test # all frontend tests
pnpm test -- --watch # watch mode
Python SDKs:
cd packages/rule-client
uv run pytest
cd packages/agentic-client
uv run pytest
Test Categories
Unit Tests
Pure logic tests with no external dependencies. These test domain models, utility functions, and business logic in isolation.
- Located in
tests/unit/within each package - No database, no Elasticsearch, no Neo4j, no network calls
- Fast: should complete in seconds
- Covers: domain models, evaluation pipeline stages, diff parsing, context assembly, verdict aggregation, conflict aggregation, PII sanitization, health scoring, gateway normalization, discovery analyzers, playground, context delivery formatting, ABAC, classification/RLS, departments, subjects, compliance, fact store, operability, plugins, eval harness, contract parser/comparator, clause aggregator, event sequences, document discovery, conflict scanner, cost tracker, LLM providers
Integration Tests
Tests that run against the Docker Compose services (PostgreSQL, Elasticsearch, Neo4j).
- Located in
tests/integration/within each package - Require
docker compose upto be running - Test actual database queries, search indexing, and graph operations
- Covers: rules CRUD API, search API, intent API, relationships API, proposals lifecycle, agent governance, Tier 1 end-to-end (Postgres-only), tenant isolation
Subject-Specific Tests
Each SubjectKind has dedicated tests validating the adapter, prompt rendering, and aggregation logic.
- Located in
tests/evaluation/subjects/test_<kind>_subject.py - Test that the subject adapter correctly renders facts for LLM, extracts features, and parses remediations
Classification Tests
Every endpoint that returns classified data has tests verifying access control in both directions:
- High-clearance users see all data (PUBLIC through RESTRICTED)
- Low-clearance users see only what their classification level permits
Audit Tests
Every action that should be audit-logged has tests verifying:
- The audit entry is created with correct fields
- Hash chain integrity is maintained after the action
LLM Tests
Tests involving the Gemini API are split into two categories:
Mocked LLM tests (run in CI):
- All tests that exercise LLM-driven features (extraction, evaluation, conflict detection) use a mock LLM client by default
- The mock returns deterministic responses based on fixtures
- These tests verify that the integration code correctly handles LLM responses
Live LLM tests (eval suite, gated):
- Gated behind the
RULEREPO_LIVE_LLM=1environment variable - Call the real Gemini API to verify extraction quality, conflict detection precision/recall, and evaluation accuracy
- Run nightly, not on every PR
- Require
GEMINI_API_KEYto be set
RULEREPO_LIVE_LLM=1 GEMINI_API_KEY=... uv run pytest tests/eval/
End-to-End Tests
Full-stack tests that run against a live Docker Compose stack with real Gemini API calls:
- Located in
tests/e2e/underapps/server - Gated behind
RULEREPO_LIVE_LLM=1 - Tests: document extraction, code evaluation, and full workflow (create rule, evaluate, verify verdict)
- Run with
make test.e2e(starts the stack automatically)
Safety Tests
Dedicated security tests under tests/safety/:
- Prompt injection defense: 31 tests covering 20+ injection patterns (role injection, system override, encoding evasion, Unicode tricks, delimiter attacks). All must be blocked.
- Run as part of the normal test suite
Eval Harness
Nightly regression suite under apps/server/eval_harness/:
- 90 golden cases across 8 domains (engineering, legal, HR, finance, IT security, sales, communications, governance)
- Computes precision/recall/F1 per domain
- CI regression gates block merges on quality drops
- Run with
make eval.harness
Frontend Tests
- Component tests: Vitest + React Testing Library. Test individual components in isolation.
- End-to-end tests: Playwright (if added). Test full user flows through the frontend.
Writing Tests
Mocking the LLM
Always mock the LLM in unit and integration tests:
from unittest.mock import AsyncMock
async def test_evaluate_rule(mock_llm_client):
mock_llm_client.generate.return_value = MockResponse(
verdict="ALLOW",
reasoning="The change follows all applicable rules.",
)
result = await evaluation_service.evaluate(context, mock_llm_client)
assert result.verdict == "ALLOW"
Test Data
Use fixtures in tests/fixtures/ for:
- Sample rules and rule sets
- Document content (PDF, markdown, text)
- Expected extraction results
- Expected evaluation verdicts
See Also
- Contributing -- setup and coding conventions
- CLAUDE.md -- full testing policy and conventions