PTF-ID-Bench
v1.1-tools-clarified is the current suite. It adds one sentence to the system prompt to disambiguate "tool availability" from "role description"; the rest of the suite is unchanged from v1.0. The v1.0 board remains available as an archive page for snapshot comparison.

Teaching AI agents when to disobey.

PTF-ID-Bench evaluates whether language-model agents can refuse harmful instructions while staying useful on benign ones. The suite contains 290 scenarios across 7 categories, which expand to 7,830 evaluations when each scenario is run through a 27-condition proofing matrix. The design is inspired by guide-dog training methodology.
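
The proofed evaluation count follows from the two figures quoted above: each base scenario is run once under each proofing condition. A quick check in Python, with variable names that are illustrative rather than taken from the benchmark's code:

```python
# Figures quoted in the description above.
BASE_SCENARIOS = 290
PROOFING_CONDITIONS = 27

# Each base scenario is evaluated once per proofing condition.
proofed_evaluations = BASE_SCENARIOS * PROOFING_CONDITIONS
print(proofed_evaluations)  # 7830
```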

The 8 trust phases

Capability progression — agents earn authority by demonstrating competence
P1 Foundation
P2 Compliance
P3 Refusal
P4 Refusal proof
P5 Generalize
P6 Integration
P7 Supervised
P8 Autonomous
Read-only (0.00–0.30) · Write with review (0.30–0.50) · Autonomous testing (0.50–0.70) · Production access (0.70–1.00)
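
The four score bands above can be read as a simple lookup. The sketch below is only an illustration: the page does not say which band owns a shared edge such as 0.30, so the code assumes an edge value belongs to the higher band, and `access_band` is a hypothetical helper, not part of the benchmark.

```python
from bisect import bisect_right

# Band edges and labels as listed above.
# ASSUMPTION: a score exactly on a shared edge (0.30, 0.50, 0.70)
# falls into the higher band; the source does not specify this.
BAND_EDGES = [0.30, 0.50, 0.70]
BAND_LABELS = [
    "Read-only",
    "Write with review",
    "Autonomous testing",
    "Production access",
]


def access_band(score: float) -> str:
    """Return the access band for a score in [0.00, 1.00] (hypothetical helper)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    # bisect_right counts how many edges are <= score, which is the band index.
    return BAND_LABELS[bisect_right(BAND_EDGES, score)]


print(access_band(0.42))
```

With the edge convention assumed here, a score of 0.42 lands in "Write with review" and a score of exactly 0.70 lands in "Production access".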

Safety vs. failure

Each model plotted by ID-F1 (higher = better) against critical failures (lower = better). The top-left quadrant is the goal.

Top-left: safe & helpful
Mid: tradeoffs
Bottom-right: permissive & unsafe
Models, ranked by ID-F1

    Explore

    Filter by family, size, runtime, or tuning. All visualizations and the table below update together.

    By family

    Median ID-F1 with min/max spread per family. Width of dot scatter shows family size on the board.

    By size class

    ID-F1 vs critical failures, faceted into small (<10B) · mid (10–35B) · large (35–200B) · frontier (200B+).

    v1.0 → v1.1 drift

    Models present at both suite versions. Arrow direction = ID-F1 change; arrow length = magnitude.

    Full leaderboard

    Sortable. Toggle between base (290 scenarios) and proofed (27-condition matrix).
