Testing

The project has 1,184+ tests across 102 test files in backend, frontend, and SDK packages.

Commands

Command	What It Runs
`make test`	All tests (server + frontend)
`make test.server`	Backend tests only (`apps/server`)
`make test.frontend`	Frontend tests only (`apps/frontend`)
`make test.client`	SDK tests only (`packages/rule-client`)
`make test.unit`	Unit tests only (no external services required)
`make test.integration`	Integration tests (requires Docker Compose services)
`make test.e2e`	End-to-end tests (starts stack, uses real Gemini)
`make test.verbose`	Backend tests with verbose output
`make test.cov`	Backend tests with coverage report

Running tests directly

Backend:

cd apps/server
uv run pytest                       # all server tests
uv run pytest tests/unit            # unit tests only
uv run pytest tests/integration     # integration tests only
uv run pytest -k "test_evaluate"    # filter by name

Frontend:

cd apps/frontend
pnpm test                           # all frontend tests
pnpm test -- --watch                # watch mode

Python SDKs:

cd packages/rule-client
uv run pytest

cd packages/agentic-client
uv run pytest

Test Categories

Unit Tests

Pure logic tests with no external dependencies. These test domain models, utility functions, and business logic in isolation.

Located in tests/unit/ within each package
No database, no Elasticsearch, no Neo4j, no network calls
Fast: should complete in seconds
Covers: domain models, evaluation pipeline stages, diff parsing, context assembly, verdict aggregation, conflict aggregation, PII sanitization, health scoring, gateway normalization, discovery analyzers, playground, context delivery formatting, ABAC, classification/RLS, departments, subjects, compliance, fact store, operability, plugins, eval harness, contract parser/comparator, clause aggregator, event sequences, document discovery, conflict scanner, cost tracker, LLM providers

Integration Tests

Tests that run against the Docker Compose services (PostgreSQL, Elasticsearch, Neo4j).

Located in tests/integration/ within each package
Require docker compose up to be running
Test actual database queries, search indexing, and graph operations
Covers: rules CRUD API, search API, intent API, relationships API, proposals lifecycle, agent governance, Tier 1 end-to-end (Postgres-only), tenant isolation

Subject-Specific Tests

Each SubjectKind has dedicated tests validating the adapter, prompt rendering, and aggregation logic.

Located in tests/evaluation/subjects/test_<kind>_subject.py
Test that the subject adapter correctly renders facts for LLM, extracts features, and parses remediations

Classification Tests

Every endpoint that returns classified data has tests verifying access control in both directions:

High-clearance users see all data (PUBLIC through RESTRICTED)
Low-clearance users see only what their classification level permits

Audit Tests

Every action that should be audit-logged has tests verifying:

The audit entry is created with correct fields
Hash chain integrity is maintained after the action

LLM Tests

Tests involving the Gemini API are split into two categories:

Mocked LLM tests (run in CI):

All tests that exercise LLM-driven features (extraction, evaluation, conflict detection) use a mock LLM client by default
The mock returns deterministic responses based on fixtures
These tests verify that the integration code correctly handles LLM responses

Live LLM tests (eval suite, gated):

Gated behind the RULEREPO_LIVE_LLM=1 environment variable
Call the real Gemini API to verify extraction quality, conflict detection precision/recall, and evaluation accuracy
Run nightly, not on every PR
Require GEMINI_API_KEY to be set

RULEREPO_LIVE_LLM=1 GEMINI_API_KEY=... uv run pytest tests/eval/

End-to-End Tests

Full-stack tests that run against a live Docker Compose stack with real Gemini API calls:

Located in tests/e2e/ under apps/server
Gated behind RULEREPO_LIVE_LLM=1
Tests: document extraction, code evaluation, and full workflow (create rule, evaluate, verify verdict)
Run with make test.e2e (starts the stack automatically)

Safety Tests

Dedicated security tests under tests/safety/:

Prompt injection defense: 31 tests covering 20+ injection patterns (role injection, system override, encoding evasion, Unicode tricks, delimiter attacks). All must be blocked.
Run as part of the normal test suite

Eval Harness

Nightly regression suite under apps/server/eval_harness/:

90 golden cases across 8 domains (engineering, legal, HR, finance, IT security, sales, communications, governance)
Computes precision/recall/F1 per domain
CI regression gates block merges on quality drops
Run with make eval.harness

Frontend Tests

Component tests: Vitest + React Testing Library. Test individual components in isolation.
End-to-end tests: Playwright (if added). Test full user flows through the frontend.

Writing Tests

Mocking the LLM

Always mock the LLM in unit and integration tests:

from unittest.mock import AsyncMock

async def test_evaluate_rule(mock_llm_client):
    mock_llm_client.generate.return_value = MockResponse(
        verdict="ALLOW",
        reasoning="The change follows all applicable rules.",
    )
    result = await evaluation_service.evaluate(context, mock_llm_client)
    assert result.verdict == "ALLOW"

Test Data

Use fixtures in tests/fixtures/ for:

Sample rules and rule sets
Document content (PDF, markdown, text)
Expected extraction results
Expected evaluation verdicts