Production Evaluation Harness for AI Coding Agents
Agent demos can look strong while failing predictable engineering tasks. A production harness should test correctness, reproducibility, and operational behavior together.
Step 1: Define benchmark suites by capability
{
"suites": ["bug_fix", "refactor", "docs_sync", "deploy_safety"],
"pass_threshold": 0.86
}
Step 2: Capture deterministic run artifacts
artifact = {
"task_id": task.id,
"agent_version": agent.version,
"commit_before": git_head_before,
"commit_after": git_head_after,
"tests_passed": tests_ok,
}
Step 3: Score behavior, not just final diff
score = (
0.5 * correctness
+ 0.2 * test_quality
+ 0.2 * safety_compliance
+ 0.1 * latency_score
)
Common pitfall
Evaluating only on final output quality. Unsafe intermediate actions still matter in production.