v1.1-tools-clarified is the current suite. One sentence was added to the system prompt to disambiguate "tool availability" from "role description"; the rest of the suite is preserved. The v1.0 board remains as an archive page for snapshot comparison.

Suite v1.1-tools-clarified · current

Teaching AI agents when to disobey.

PTF-ID-Bench evaluates whether language-model agents can refuse harmful instructions while staying useful on benign ones. It comprises 290 scenarios across 7 categories, expandable to 7,830 evaluations (290 × 27) through a 27-condition proofing matrix. The approach is inspired by guide-dog training methodology.
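As a quick check of the numbers above, the proofed total can be reproduced with a short sketch. It assumes every scenario is run once under every proofing condition (a full cross product); the function and constant names are illustrative and not part of the benchmark's code.

```python
# Counts taken from the suite description; names are hypothetical.
BASE_SCENARIOS = 290
PROOFING_CONDITIONS = 27


def proofed_evaluations(scenarios: int = BASE_SCENARIOS,
                        conditions: int = PROOFING_CONDITIONS) -> int:
    """Total evaluations if each scenario runs under each condition (assumed)."""
    return scenarios * conditions


print(proofed_evaluations())  # 7830
```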


The 8 trust phases

Capability progression — agents earn authority by demonstrating competence

P1 Foundation

P2 Compliance

P3 Refusal

P4 Refusal proof

P5 Generalize

P6 Integration

P7 Supervised

P8 Autonomous

Read-only (0.00–0.30) · Write with review (0.30–0.50) · Autonomous testing (0.50–0.70) · Production access (0.70–1.00)
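The bands above can be read as a lookup from a score to an access tier. The sketch below is one possible reading, not the benchmark's implementation: the name `trust_score` is hypothetical, and because the published ranges share their endpoints, it assumes a score exactly on a boundary (for example 0.30) falls into the higher tier.

```python
from bisect import bisect_right

# Boundaries between tiers, taken from the legend.
# Assumption: a score equal to a boundary belongs to the higher tier.
TIER_BOUNDS = [0.30, 0.50, 0.70]
TIER_NAMES = [
    "Read-only",
    "Write with review",
    "Autonomous testing",
    "Production access",
]


def access_tier(trust_score: float) -> str:
    """Map a score in [0, 1] to the access tier whose band contains it."""
    if not 0.0 <= trust_score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {trust_score}")
    # bisect_right counts how many boundaries are <= the score,
    # which is exactly the index of the matching tier.
    return TIER_NAMES[bisect_right(TIER_BOUNDS, trust_score)]


print(access_tier(0.42))  # Write with review
```

If boundary scores should instead stay in the lower tier, swapping `bisect_right` for `bisect_left` would give that behavior.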

Safety vs. failure

Each model plotted by ID-F1 (higher = better) against critical failures (lower = better). The top-left quadrant is the goal.

Top-left: safe & helpful

Mid: tradeoffs

Bottom-right: permissive & unsafe

Models, ranked by ID-F1

Explore

Filter by family, size, runtime, or tuning. All visualizations and the table below update together.

By family

Median ID-F1 with min/max spread per family. The width of each dot scatter indicates how many models that family has on the board.
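A minimal sketch of how such a per-family summary could be computed from (family, ID-F1) pairs. The input shape, the function name, and the family names and scores in the usage lines are all hypothetical and are not taken from the leaderboard data.

```python
from collections import defaultdict
from statistics import median


def family_summary(rows):
    """Group (family, id_f1) pairs and report median, min, max, and count."""
    by_family = defaultdict(list)
    for family, score in rows:
        by_family[family].append(score)
    return {
        family: {
            "median": median(scores),
            "min": min(scores),
            "max": max(scores),
            "n": len(scores),  # family size on the board
        }
        for family, scores in by_family.items()
    }


# Hypothetical scores, for illustration only.
rows = [("fam-a", 0.4), ("fam-a", 0.6), ("fam-a", 0.5), ("fam-b", 0.7)]
print(family_summary(rows)["fam-a"])
```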

By size class

ID-F1 vs critical failures, faceted into small (<10B) · mid (10–35B) · large (35–200B) · frontier (200B+).

v1.0 → v1.1 drift

Models present in both suite versions. Arrow direction shows the sign of the ID-F1 change; arrow length shows its magnitude.
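The drift view can be sketched as a signed difference over the models that appear in both versions. This is an illustrative reading of the chart, not the site's code: the function name, the dictionary inputs, and the example model names and scores are hypothetical.

```python
def drift(v10: dict, v11: dict) -> dict:
    """For models scored in both versions, return (direction, magnitude)."""
    result = {}
    # Only models present in both suite versions are compared.
    for model in sorted(v10.keys() & v11.keys()):
        delta = v11[model] - v10[model]
        if delta > 0:
            direction = "up"
        elif delta < 0:
            direction = "down"
        else:
            direction = "flat"
        result[model] = (direction, abs(delta))
    return result


# Hypothetical ID-F1 scores, for illustration only.
v10_scores = {"model-a": 0.50, "model-b": 0.75}
v11_scores = {"model-a": 0.75, "model-c": 0.60}
print(drift(v10_scores, v11_scores))  # {'model-a': ('up', 0.25)}
```

Models that appear in only one version (`model-b` and `model-c` here) are dropped, matching the "present in both" rule.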

Full leaderboard

Sortable. Toggle between base (290 scenarios) and proofed (27-condition matrix).
