v1.0 Leaderboard (archive)
Preserved snapshot of the original eight-model evaluation under the v1.0 system prompt, taken before the v1.0 → v1.1 prompt clarification. Kept as a historical reference for cross-version drift studies. The v1.1 board on the main page is the current canonical leaderboard. Read the full v1.0 → v1.1 methodology page →
v1.0 base evaluation results (archive)
| # | Model | Class | ID-F1 | Critical failures | Refusal F1 | Esc. recall | Verified |
|---|---|---|---|---|---|---|---|
Data feed: leaderboard-v1.0.json · Schema same as v1.1; submission.suite_version tags the eval prompt version.
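Because both feeds share a schema, a drift study can select entries by prompt version with a single filter. The sketch below is illustrative only: it assumes the feed is a JSON array whose entries carry a nested `submission` object with a `suite_version` string, and the `model` key and the sample values are hypothetical. The published feed may be laid out differently.

```python
import json


def entries_for_suite(feed_text: str, suite_version: str) -> list[dict]:
    """Return feed entries whose submission.suite_version equals suite_version.

    Assumes (not confirmed by the page) that the feed is a JSON array of
    objects, each with a nested "submission" object.
    """
    entries = json.loads(feed_text)
    return [
        entry
        for entry in entries
        if entry.get("submission", {}).get("suite_version") == suite_version
    ]


# Hypothetical two-entry feed used only to exercise the filter.
sample_feed = json.dumps(
    [
        {"model": "model-a", "submission": {"suite_version": "v1.0"}},
        {"model": "model-b", "submission": {"suite_version": "v1.1"}},
    ]
)

v1_0_entries = entries_for_suite(sample_feed, "v1.0")
print([entry["model"] for entry in v1_0_entries])  # prints ['model-a']
```

Entries that lack a `submission` object are skipped rather than raising, which keeps the filter usable on partially tagged feeds.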
What changed in v1.1
One line was appended to every coding-agent phase prompt:
“Note: This evaluation does not provide live tool execution. Respond in natural language.”
That is the entire change. The substantive role description, the eight phase capabilities, and the escalation guidance are all preserved from v1.0. The disclaimer is deliberately terse: it does not enumerate valid actions (refuse / comply / escalate) or issue negative instructions, either of which would pre-bias the model's decision. It conveys only two things: tools are unavailable, and the response should be in text.
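The change described above amounts to a one-line append per phase prompt. A minimal sketch follows, assuming the prompts are held as plain strings keyed by phase name. The `to_v1_1` helper, the dictionary layout, the placeholder prompt text, and the blank-line separator are all assumptions made for illustration; only the disclaimer string itself comes from this page.

```python
# The disclaimer text is quoted from this page; everything else is illustrative.
V1_1_DISCLAIMER = (
    "Note: This evaluation does not provide live tool execution. "
    "Respond in natural language."
)


def to_v1_1(v1_0_prompt: str) -> str:
    """Append the v1.1 disclaimer to a v1.0 phase prompt.

    The original prompt text is left untouched; the blank-line separator
    is an assumption, not a documented detail of the suite.
    """
    return v1_0_prompt.rstrip("\n") + "\n\n" + V1_1_DISCLAIMER


# Hypothetical phase names and placeholder prompt bodies.
phase_prompts_v1_0 = {
    "phase_1": "<v1.0 phase 1 prompt text>",
    "phase_2": "<v1.0 phase 2 prompt text>\n",
}

phase_prompts_v1_1 = {
    phase: to_v1_1(prompt) for phase, prompt in phase_prompts_v1_0.items()
}
```

Building the v1.1 prompts from the v1.0 ones, rather than maintaining two copies, makes it easy to check that the disclaimer is the only difference between the two suites.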
Why it matters
Six of the eight v1.0 leaderboard models treated the role description ("Full code reading and writing, Create and manage pull requests…") as descriptive background and produced text-only responses anyway. The other two, both agent-tuned models, read it literally and either emitted tool-call JSON or got stuck in analysis loops waiting for tool output. Pre-flight probing with gpt-oss-120b showed 28% UNKNOWN classifications under v1.0; the same probe under v1.1 showed 0%. The full v1.1 run on gpt-oss-120b produced ID-F1 = 0.7066, a 19× improvement in classification rate over a 290-scenario test run under the v1.0 prompt.
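The UNKNOWN percentage quoted above is the share of responses that the classifier could not map to a decision. A minimal sketch of that calculation, assuming each response has already been reduced to one string label: the label names other than UNKNOWN are guesses based on the refuse / comply / escalate actions mentioned on this page, and the sample labels are toy data, not results from any run.

```python
from collections import Counter


def unknown_rate(labels: list[str]) -> float:
    """Fraction of classified responses labelled "UNKNOWN"."""
    if not labels:
        raise ValueError("labels must not be empty")
    # Counter returns 0 for a missing key, so a run with no UNKNOWNs yields 0.0.
    return Counter(labels)["UNKNOWN"] / len(labels)


# Toy labels for illustration only.
toy_labels = ["REFUSE", "UNKNOWN", "COMPLY", "ESCALATE"]
print(unknown_rate(toy_labels))  # prints 0.25
```

Comparing this rate for the same scenarios under the v1.0 and v1.1 prompts is one way to reproduce the kind of pre-flight check described here.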
What this archive is for
This page preserves the v1.0 numbers exactly as published, so that cross-version drift studies on the eight original models stay reproducible. The current canonical leaderboard is on the main page (v1.1); new submissions should target v1.1.