Measure which models answer China-sensitive prompts directly, which drift into official-language evasions, and which refuse outright. The graphic below is optimized to read cleanly in a LinkedIn screenshot while still reflecting the benchmark structure you can rerun locally.
Latest benchmark snapshot, ranked by how often each model answers China-sensitive prompts directly instead of hedging or refusing.
Open the Configure page, change the primary judge panel or add a master judge, pick a new model set, and launch the run. The last-used configuration is saved automatically.
You can also save named setups for later reuse, then reload and edit them before rerunning.
Set your API key, choose models, and select prompt categories.
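As a rough illustration of what such a configuration might look like, here is a minimal Python sketch of a run configuration with named-setup save and reload. The `RunConfig` fields, the `SETUPS_PATH` location, and the helper functions are illustrative assumptions, not the app's actual code or file layout.

```python
# Hypothetical sketch of a run configuration and named-setup storage.
# Field names, paths, and helpers are assumptions for illustration only.
import json
import os
from dataclasses import asdict, dataclass, field


@dataclass
class RunConfig:
    api_key_env: str = "OPENROUTER_API_KEY"              # assumed: key read from an env var
    models: list[str] = field(default_factory=list)      # models under test
    judge_panel: list[str] = field(default_factory=list) # primary judge panel
    master_judge: str | None = None                      # optional additional judge
    prompt_categories: list[str] = field(default_factory=list)


SETUPS_PATH = "saved_setups.json"  # assumed location for named setups


def save_setup(name: str, config: RunConfig) -> None:
    """Persist a named setup so it can be reloaded and edited before a rerun."""
    setups: dict[str, dict] = {}
    if os.path.exists(SETUPS_PATH):
        with open(SETUPS_PATH) as f:
            setups = json.load(f)
    setups[name] = asdict(config)
    with open(SETUPS_PATH, "w") as f:
        json.dump(setups, f, indent=2)


def load_setup(name: str) -> RunConfig:
    """Reload a previously saved setup by name."""
    with open(SETUPS_PATH) as f:
        return RunConfig(**json.load(f)[name])
```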
Each run tests the selected models against China-sensitive prompts and classifies every response with the current judge model.
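The following is a minimal sketch of that test-and-classify loop and of the ranking shown in the snapshot, assuming three response labels (direct, evasive, refusal) and caller-supplied functions for querying a model and for judging a response; none of these names come from the app itself.

```python
# Minimal sketch of the test-and-classify loop; the label set and the callable
# signatures are assumptions drawn from the description above.
from collections import Counter
from typing import Callable

LABELS = ("direct", "evasive", "refusal")


def run_benchmark(
    models: list[str],
    prompts: list[str],
    query_model: Callable[[str, str], str],  # (model, prompt) -> response text
    judge: Callable[[str, str], str],        # (prompt, response) -> one of LABELS
) -> dict[str, Counter]:
    """Test every model on every prompt and tally the judge's labels."""
    results: dict[str, Counter] = {m: Counter() for m in models}
    for model in models:
        for prompt in prompts:
            response = query_model(model, prompt)
            results[model][judge(prompt, response)] += 1
    return results


def rank_by_directness(results: dict[str, Counter]) -> list[str]:
    """Order models by the share of responses the judge labeled 'direct'."""
    def direct_share(model: str) -> float:
        total = sum(results[model].values())
        return results[model]["direct"] / total if total else 0.0
    return sorted(results, key=direct_share, reverse=True)
```

With real API calls plugged in for the two callables, the same loop and ranking reproduce the ordering used in the snapshot above.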