Batched Evaluation
Overview
The evaluation engine supports batched evaluation: all selected rules are sent to the LLM in a single call instead of one API call per rule. This results in 5-20x fewer API calls, lower latency, and better verdicts, because the LLM can see how the rules interact.
How It Works
    ┌─────────────────────────────────┐
    │ EvaluationService               │
    │ service.py → evaluate_batch()   │
    └───────────┬─────────────────────┘
                │
        ┌───────▼─────────┐
        │ batch_evaluator │
        │                 │
        │ ┌────────────┐  │
        │ │   Build    │  │  All rules + context in one prompt
        │ │   prompt   │  │
        │ └─────┬──────┘  │
        │       │         │
        │ ┌─────▼──────┐  │
        │ │   Gemini   │  │  Single Flash API call
        │ │   Flash    │  │  → JSON array of per-rule verdicts
        │ └─────┬──────┘  │
        │       │         │
        │ ┌─────▼──────┐  │
        │ │    Pro     │  │  Only for DENY + CRITICAL rules
        │ │  confirm   │  │  (confirmation pass)
        │ └─────┬──────┘  │
        │       │         │
        │ ┌─────▼──────┐  │
        │ │  Fallback  │  │  On any failure → per-rule asyncio.gather()
        │ │  per-rule  │  │
        │ └────────────┘  │
        └─────────────────┘
Structured Output Schema
The batch call requests a JSON object with a verdicts array:
    {
      "verdicts": [
        {
          "rule_index": 0,
          "rule_id": "uuid",
          "verdict": "ALLOW",
          "confidence": 0.95,
          "reasoning": "The code change does not affect...",
          "issue_description": "",
          "fix_suggestion": null,
          "locations": []
        }
      ]
    }
Each entry corresponds to one rule, in the order they were listed in the prompt.
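The ordering guarantee above suggests a simple validation step before the verdicts are used. The following is a minimal sketch, not the engine's actual parser: `RuleVerdict`, `parse_batch_response`, and the set of valid verdict values (only ALLOW and DENY, the two named in this document) are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional

# Assumption: only the two verdict values named in this document are valid.
VALID_VERDICTS = {"ALLOW", "DENY"}


@dataclass
class RuleVerdict:
    """Hypothetical container mirroring one entry of the verdicts array."""
    rule_index: int
    rule_id: str
    verdict: str
    confidence: float
    reasoning: str
    issue_description: str = ""
    fix_suggestion: Optional[str] = None
    locations: list[Any] = field(default_factory=list)


def parse_batch_response(raw: str, rule_ids: list[str]) -> list[RuleVerdict]:
    """Parse the batch JSON and check that it covers every rule, in prompt order.

    Raises ValueError on any mismatch so that a caller can treat the
    response as a parsing failure.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc

    entries = payload.get("verdicts") if isinstance(payload, dict) else None
    if not isinstance(entries, list) or len(entries) != len(rule_ids):
        raise ValueError("verdicts array is missing or has the wrong length")

    verdicts = []
    for position, (entry, expected_id) in enumerate(zip(entries, rule_ids)):
        try:
            parsed = RuleVerdict(**entry)
        except TypeError as exc:
            # Raised for non-object entries and for missing or unexpected keys.
            raise ValueError(f"entry {position} has an unexpected shape") from exc
        if parsed.rule_index != position or parsed.rule_id != expected_id:
            raise ValueError(f"entry {position} does not match the prompt order")
        if parsed.verdict not in VALID_VERDICTS:
            raise ValueError(f"entry {position} has unknown verdict {parsed.verdict!r}")
        verdicts.append(parsed)
    return verdicts
```

Raising a single exception type keeps the calling code simple: any malformed response can be handled the same way as an API error.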
Tiered Model Strategy
| Rule Severity | Batch (Flash) | Confirmation (Pro) |
|---|---|---|
| LOW / MEDIUM | Evaluated | Not re-evaluated |
| HIGH | Evaluated | Not re-evaluated |
| CRITICAL + DENY | Evaluated | Re-evaluated with Pro model |
| CRITICAL + ALLOW | Evaluated | Not re-evaluated |
Only rules that receive a DENY verdict and have CRITICAL severity get a Pro confirmation pass. This keeps Pro costs minimal while ensuring high-severity denials are accurate.
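The table reduces to a single predicate. A minimal sketch follows, assuming severities and verdicts are plain strings; `FlashResult`, `needs_pro_confirmation`, and `select_for_confirmation` are hypothetical names, not part of the engine's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlashResult:
    """Hypothetical pairing of a rule's severity with its Flash verdict."""
    rule_id: str
    severity: str  # "LOW", "MEDIUM", "HIGH", or "CRITICAL"
    verdict: str   # "ALLOW" or "DENY"


def needs_pro_confirmation(result: FlashResult) -> bool:
    """True only for the CRITICAL + DENY row of the table."""
    return result.severity == "CRITICAL" and result.verdict == "DENY"


def select_for_confirmation(results: list[FlashResult]) -> list[FlashResult]:
    """Return the subset of Flash results that should be re-evaluated with Pro."""
    return [r for r in results if needs_pro_confirmation(r)]
```

With this filter, a batch in which Flash denies nothing at CRITICAL severity makes no Pro calls at all.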
Fallback Behavior
If the batch call fails for any reason (API error, timeout, response parsing failure, prompt too large), the system transparently falls back to per-rule evaluation using asyncio.gather(), which is the same behavior as before batching was introduced.
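The fallback pattern can be sketched as a thin wrapper around the two evaluation paths. This is an illustration under stated assumptions rather than the engine's code: `evaluate_with_fallback` and its two callable parameters are hypothetical stand-ins for the batch and per-rule evaluators.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

Rule = TypeVar("Rule")
Verdict = TypeVar("Verdict")


async def evaluate_with_fallback(
    rules: Sequence[Rule],
    evaluate_batch: Callable[[Sequence[Rule]], Awaitable[list[Verdict]]],
    evaluate_rule: Callable[[Rule], Awaitable[Verdict]],
) -> list[Verdict]:
    """Try one batched call; on any failure, evaluate each rule concurrently."""
    try:
        return await evaluate_batch(rules)
    except Exception:
        # Covers API errors, timeouts, and parsing failures alike.
        # Task cancellation is not an Exception subclass, so it still propagates.
        # asyncio.gather() returns results in the same order as its inputs.
        return list(await asyncio.gather(*(evaluate_rule(rule) for rule in rules)))
```

Because asyncio.gather() preserves input order, the fallback returns verdicts in the same rule order as the batch path would.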
Caching
- Batch cache key: hash(sorted_rule_ids + context_hash + model_id)
- If any rule in the batch has been revised since the cache entry was stored, the cache entry is invalidated
- Pro confirmation results are cached individually using the existing per-rule cache
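The key formula and the revision check can be illustrated as follows. This is a sketch under assumptions: the document does not name the hash function, so SHA-256 is used here, and the separator character, the integer revision numbers, and the names `batch_cache_key` and `is_stale` are all hypothetical.

```python
import hashlib


def batch_cache_key(rule_ids: list[str], context_hash: str, model_id: str) -> str:
    """Build a key from the sorted rule IDs, the context hash, and the model ID."""
    # Sorting makes the key independent of the order in which rules were selected.
    parts = sorted(rule_ids) + [context_hash, model_id]
    # Assumption: a unit-separator character keeps adjacent parts from running
    # together (so ["ab", "c"] and ["a", "bc"] produce different keys).
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def is_stale(cached_revisions: dict[str, int], current_revisions: dict[str, int]) -> bool:
    """True if any rule in the cached batch has a different revision now.

    Assumption: each cache entry stores the revision number of every rule
    it covered at the time it was written.
    """
    return any(
        current_revisions.get(rule_id) != revision
        for rule_id, revision in cached_revisions.items()
    )
```

Including the model ID in the key means a model change never serves verdicts produced by a different model.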
Configuration
Batching is the default evaluation path. No configuration flag is needed to enable it.
- Max prompt size: 30,000 characters (configurable in batch_evaluator.py)
- Max diff size: 8,000 characters (truncated if longer)
- Thinking level: medium for batches, high for Pro confirmation
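The two size limits might be applied along these lines. The constant and function names are hypothetical, and the sketch assumes truncation keeps the start of the diff, which the document does not specify.

```python
# Limits taken from the list above; the constant names are hypothetical.
MAX_PROMPT_CHARS = 30_000
MAX_DIFF_CHARS = 8_000


def truncate_diff(diff: str, limit: int = MAX_DIFF_CHARS) -> str:
    """Cut an oversized diff down to the limit.

    Assumption: the leading characters are the ones kept.
    """
    return diff if len(diff) <= limit else diff[:limit]


def fits_in_batch(prompt: str, limit: int = MAX_PROMPT_CHARS) -> bool:
    """True if the assembled prompt is small enough for a single batch call."""
    return len(prompt) <= limit
```

A prompt that fails the second check is one of the "prompt too large" cases that lead to per-rule fallback.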
Surface-Based Template Routing
The batch evaluator routes to surface-specific prompt templates based on the surface field on EvaluationContext. Instead of branching on if context.diff: (code) vs else (facts), the evaluator selects the template dynamically:
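    def _select_template(context: EvaluationContext) -> str:
        """Return the prompt template text for the context's surface."""
        # Contexts without an explicit surface default to "code" when a diff is present.
        surface = context.surface or ("code" if context.diff else "generic")
        template_path = PROMPTS_DIR / f"evaluate_batch_{surface}.txt"
        if template_path.exists():
            return template_path.read_text()
        # Surfaces without a dedicated template fall back to the generic one.
        return (PROMPTS_DIR / "evaluate_batch_generic.txt").read_text()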
Available batch templates:
| Template | Surface | Key Instructions |
|---|---|---|
| evaluate_batch_code.txt | code | Diff references, file paths, line numbers |
| evaluate_batch_contract.txt | contract | Clause references, span offsets, clause revisions |
| evaluate_batch_transaction.txt | transaction | JSON paths, field-level remediations, approval routing |
| evaluate_batch_document.txt | document | Document spans, text_rewrite, section references |
| evaluate_batch_message.txt | message | Message segments, tone guidance, disclaimers |
| evaluate_batch_human_action.txt | human_action | Process compliance, authorization, event context |
| evaluate_batch_generic.txt | generic (fallback) | Domain-neutral facts-based evaluation |
Non-code templates do not reference code concepts (file paths, line numbers, function names). Each template defines its own location format and remediation kinds appropriate to its domain.
For backward compatibility, contexts without an explicit surface that have a diff field default to the code template.
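The three routing outcomes (explicit surface, legacy diff-only context, and unknown surface) can be checked with a small self-contained demonstration. Everything here is a stand-in for illustration: the `EvaluationContext` dataclass carries only the two fields the routing reads, and `resolve_template_name` is a hypothetical variant of the selection logic that takes the prompts directory as a parameter and returns the file name instead of the file contents.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class EvaluationContext:
    """Stand-in with only the two fields the routing logic reads."""
    surface: Optional[str] = None
    diff: Optional[str] = None


def resolve_template_name(context: EvaluationContext, prompts_dir: Path) -> str:
    """Return the name of the template file that would be used for this context."""
    surface = context.surface or ("code" if context.diff else "generic")
    candidate = prompts_dir / f"evaluate_batch_{surface}.txt"
    return candidate.name if candidate.exists() else "evaluate_batch_generic.txt"


with tempfile.TemporaryDirectory() as tmp:
    prompts_dir = Path(tmp)
    # Create a few placeholder templates; "invoice" deliberately has none.
    for name in ("code", "contract", "generic"):
        (prompts_dir / f"evaluate_batch_{name}.txt").write_text(f"{name} template")

    explicit = resolve_template_name(EvaluationContext(surface="contract"), prompts_dir)
    legacy = resolve_template_name(EvaluationContext(diff="--- a/x"), prompts_dir)
    unknown = resolve_template_name(EvaluationContext(surface="invoice"), prompts_dir)

print(explicit, legacy, unknown)
```

Here the explicit surface resolves to the contract template, the diff-only context resolves to the code template, and the surface with no template file resolves to the generic one.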