Batched Evaluation

Overview

The evaluation engine supports batched evaluation: sending all selected rules to the LLM in a single call instead of making one API call per rule. This yields 5-20x fewer API calls, lower latency, and better verdicts, since the model sees every rule in one context and can account for interactions between rules.

How It Works

┌─────────────────────────────────┐
│        EvaluationService        │
│  service.py → evaluate_batch()  │
└───────────┬─────────────────────┘
            │
    ┌───────▼─────────┐
    │ batch_evaluator │
    │                 │
    │  ┌────────────┐ │
    │  │ Build      │ │   All rules + context in one prompt
    │  │ prompt     │ │
    │  └─────┬──────┘ │
    │        │        │
    │  ┌─────▼──────┐ │
    │  │ Gemini     │ │   Single Flash API call
    │  │ Flash      │ │   → JSON array of per-rule verdicts
    │  └─────┬──────┘ │
    │        │        │
    │  ┌─────▼──────┐ │
    │  │ Pro        │ │   Only for DENY + CRITICAL rules
    │  │ confirm    │ │   (confirmation pass)
    │  └─────┬──────┘ │
    │        │        │
    │  ┌─────▼──────┐ │
    │  │ Fallback   │ │   On any failure → per-rule asyncio.gather()
    │  │ per-rule   │ │
    │  └────────────┘ │
    └─────────────────┘
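
The first stage folds every rule into one prompt, numbering the rules so the model can echo each index back in its verdicts. A minimal sketch of that step, assuming a build_batch_prompt helper and rule objects with id, severity, and text fields (all names hypothetical):

def build_batch_prompt(rules, context_text: str) -> str:
    # Number each rule so verdicts can be matched back by rule_index.
    rule_lines = [
        f"Rule {i} (id={rule.id}, severity={rule.severity}): {rule.text}"
        for i, rule in enumerate(rules)
    ]
    return (
        "Evaluate the context against every rule below. Return a JSON "
        "object with a 'verdicts' array, one entry per rule, in order.\n\n"
        f"Context:\n{context_text}\n\nRules:\n" + "\n".join(rule_lines)
    )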

Structured Output Schema

The batch call requests a JSON object with a verdicts array:

{
  "verdicts": [
    {
      "rule_index": 0,
      "rule_id": "uuid",
      "verdict": "ALLOW",
      "confidence": 0.95,
      "reasoning": "The code change does not affect...",
      "issue_description": "",
      "fix_suggestion": null,
      "locations": []
    }
  ]
}

Each entry corresponds to one rule, in the same order the rules were listed in the prompt.
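
For illustration, the payload could be deserialized into per-rule records along these lines (a sketch; the real service may use Pydantic or the SDK's structured-output types instead):

from dataclasses import dataclass, field

@dataclass
class RuleVerdict:
    rule_index: int
    rule_id: str
    verdict: str                      # "ALLOW" or "DENY"
    confidence: float
    reasoning: str
    issue_description: str = ""
    fix_suggestion: str | None = None
    locations: list = field(default_factory=list)

def parse_batch_response(payload: dict) -> list[RuleVerdict]:
    # Malformed output raises here, which triggers the per-rule fallback.
    return [RuleVerdict(**v) for v in payload["verdicts"]]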

Tiered Model Strategy

Severity + verdict    Batch (Flash)   Confirmation (Pro)
LOW / MEDIUM          Evaluated       Not re-evaluated
HIGH                  Evaluated       Not re-evaluated
CRITICAL + DENY       Evaluated       Re-evaluated with Pro model
CRITICAL + ALLOW      Evaluated       Not re-evaluated

Only rules that receive a DENY verdict and have CRITICAL severity get a Pro confirmation pass. This keeps Pro costs minimal while ensuring high-severity denials are accurate.
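
Expressed as code, the confirmation filter is a single predicate over the Flash verdicts. A sketch, assuming the RuleVerdict records above and rule objects with a severity field:

def select_for_confirmation(verdicts, rules):
    # Pair each verdict with its rule; only CRITICAL denials go to Pro.
    return [
        (verdict, rule)
        for verdict, rule in zip(verdicts, rules)
        if verdict.verdict == "DENY" and rule.severity == "CRITICAL"
    ]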

Fallback Behavior

If the batch call fails for any reason (API error, timeout, response parsing failure, prompt too large), the system transparently falls back to per-rule evaluation using asyncio.gather() — the same behavior as before batching was introduced.
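
A sketch of that wrapper, assuming evaluate_batch_call and evaluate_rule coroutines for the batch and single-rule paths (both names hypothetical):

import asyncio

async def evaluate_with_fallback(rules, context):
    try:
        # One Flash call covering every rule.
        return await evaluate_batch_call(rules, context)
    except Exception:
        # API error, timeout, parse failure, oversized prompt: degrade to
        # the original per-rule path, one call per rule, run concurrently.
        return await asyncio.gather(
            *(evaluate_rule(rule, context) for rule in rules)
        )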

Caching

  • Batch cache key: hash(sorted_rule_ids + context_hash + model_id) (see the sketch after this list)
  • If any rule in the batch has been revised since the cache entry was stored, the cache entry is invalidated
  • Pro confirmation results are cached individually using the existing per-rule cache
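
A minimal sketch of the batch key derivation, assuming SHA-256 as the hash (the specific hash function is an assumption):

import hashlib

def batch_cache_key(rule_ids: list[str], context_hash: str, model_id: str) -> str:
    # Sorting makes the key order-independent: the same rule set maps to
    # the same cache entry regardless of selection order.
    payload = "|".join(sorted(rule_ids)) + "|" + context_hash + "|" + model_id
    return hashlib.sha256(payload.encode()).hexdigest()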

Configuration

Batching is the default evaluation path. No configuration flag is needed to enable it.

  • Max prompt size: 30,000 characters (configurable in batch_evaluator.py)
  • Max diff size: 8,000 characters (truncated if longer; see the sketch after this list)
  • Thinking level: medium for batches, high for Pro confirmation
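
A sketch of how the size limits might be applied (constant names are assumptions; the real values live in batch_evaluator.py):

MAX_PROMPT_CHARS = 30_000   # larger prompts fall back to per-rule evaluation
MAX_DIFF_CHARS = 8_000

def truncate_diff(diff: str) -> str:
    # Keep the head of an oversized diff and mark the cut.
    if len(diff) <= MAX_DIFF_CHARS:
        return diff
    return diff[:MAX_DIFF_CHARS] + "\n... [diff truncated]"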

Surface-Based Template Routing

The batch evaluator routes to surface-specific prompt templates based on the surface field on EvaluationContext. Rather than branching on if context.diff: (the code path) versus everything else (the facts path), the evaluator selects a template dynamically:

from pathlib import Path

PROMPTS_DIR = Path(__file__).parent / "prompts"  # assumed template location

def _select_template(context: EvaluationContext) -> str:
    # An explicit surface wins; a bare diff implies the code surface.
    surface = context.surface or ("code" if context.diff else "generic")
    template_path = PROMPTS_DIR / f"evaluate_batch_{surface}.txt"
    if template_path.exists():
        return template_path.read_text()
    # Unknown surface or missing template file: domain-neutral fallback.
    return (PROMPTS_DIR / "evaluate_batch_generic.txt").read_text()
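
Routing behavior, shown with a minimal stand-in for EvaluationContext (illustration only; the real class is defined elsewhere in the service):

from dataclasses import dataclass

@dataclass
class EvaluationContext:   # stand-in for illustration only
    surface: str | None = None
    diff: str | None = None

_select_template(EvaluationContext(surface="contract"))  # contract template
_select_template(EvaluationContext(diff="+ x = 1"))      # code (diff, no surface)
_select_template(EvaluationContext())                    # generic fallback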

Available batch templates:

Template                          Surface              Key Instructions
evaluate_batch_code.txt           code                 Diff references, file paths, line numbers
evaluate_batch_contract.txt       contract             Clause references, span offsets, clause revisions
evaluate_batch_transaction.txt    transaction          JSON paths, field-level remediations, approval routing
evaluate_batch_document.txt       document             Document spans, text_rewrite, section references
evaluate_batch_message.txt        message              Message segments, tone guidance, disclaimers
evaluate_batch_human_action.txt   human_action         Process compliance, authorization, event context
evaluate_batch_generic.txt        generic (fallback)   Domain-neutral facts-based evaluation

Non-code templates do not reference code concepts (file paths, line numbers, function names). Each template defines its own location format and remediation kinds appropriate to its domain.

For backward compatibility, contexts without an explicit surface that have a diff field default to the code template.