PTF-ID-Bench evaluates whether language-model agents can refuse harmful instructions while staying useful on benign ones. It covers 290 scenarios across 7 categories, expandable to 7,830 evaluations (290 × 27) through a 27-condition proofing matrix. The design is inspired by guide-dog training methodology.
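As a quick check on the numbers above, the proofed evaluation count is the base scenario count multiplied by the number of proofing conditions. The constant names below are illustrative and not taken from the benchmark's code.

```python
# Illustrative constants; the names are assumptions, the values come from the text.
BASE_SCENARIOS = 290          # scenarios in the base suite
PROOFING_CONDITIONS = 27      # conditions in the proofing matrix

# Every base scenario is run once under every proofing condition.
total_evaluations = BASE_SCENARIOS * PROOFING_CONDITIONS
print(total_evaluations)  # 7830
```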
Each model plotted by ID-F1 (higher = better) against critical failures (lower = better). The top-left quadrant is the goal.
Filter by family, size, runtime, or tuning. All visualizations and the table below update together.
Median ID-F1 with min/max spread per family. Width of dot scatter shows family size on the board.
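A minimal sketch of how a per-family summary like this could be computed. The family names and scores are made-up placeholders, not benchmark results, and `summarize_families` is a hypothetical helper rather than part of the site's code.

```python
from statistics import median

def summarize_families(scores_by_family):
    """Return median, min, max, and model count of ID-F1 for each family."""
    summary = {}
    for family, scores in scores_by_family.items():
        summary[family] = {
            "median": median(scores),
            "min": min(scores),
            "max": max(scores),
            "n_models": len(scores),  # drives the width of the dot scatter
        }
    return summary

# Placeholder values for illustration only (not real results).
example = {
    "family_a": [0.5, 0.7, 0.9],
    "family_b": [0.4, 0.6],
}
family_summary = summarize_families(example)
```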
ID-F1 vs critical failures, faceted into small (<10B) · mid (10–35B) · large (35–200B) · frontier (200B+).
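The size facets can be sketched as a simple bucketing function. The caption leaves the boundary cases open (a 35B model could read as mid or large), so this sketch assumes half-open intervals where each tier includes its lower bound. `size_tier` is a hypothetical name, not the site's implementation.

```python
def size_tier(params_billion: float) -> str:
    """Map a parameter count (in billions) to a size facet.

    Assumption: tiers are half-open, so 10B is "mid", 35B is "large",
    and 200B is "frontier".
    """
    if params_billion < 10:
        return "small"
    if params_billion < 35:
        return "mid"
    if params_billion < 200:
        return "large"
    return "frontier"
```

For example, under this assumption `size_tier(7)` gives `"small"` and `size_tier(70)` gives `"large"`.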
Models present in both suite versions. Arrow direction = ID-F1 change; arrow length = magnitude.
Sortable. Toggle between base (290 scenarios) and proofed (27-condition matrix).