ChinaBench V2 research snapshot for shareable reporting

ChinaBench V2: China Censorship Benchmark

Measure which models answer China-sensitive prompts directly, which drift into official-language evasions, and which refuse outright. The graphic below is optimized to read cleanly in a LinkedIn screenshot while still reflecting the benchmark structure you can rerun locally.

Best direct answer rate: 92%
Weakest direct answer rate: 3%
Judged responses in snapshot: 600
Judge: configurable LLM-as-judge

What this homepage is showing

Each row aggregates one model's responses to the same China-sensitive prompt set across Tiananmen, Tibet, Taiwan, Xinjiang, censorship, and related topics. Green means the model answered directly, amber means it hedged, and red means it refused.
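A minimal sketch of how one row could be tallied, assuming each judged response carries exactly one of the three verdicts. The type and function names here are illustrative, not the app's actual API:

```typescript
type Verdict = "direct" | "evasive" | "refused";

interface JudgedResponse {
  model: string;
  topic: string; // e.g. "Tiananmen", "Taiwan"
  verdict: Verdict;
}

// Tally one model's verdicts across the shared prompt set.
function tallyRow(responses: JudgedResponse[], model: string) {
  const row = { direct: 0, evasive: 0, refused: 0 };
  for (const r of responses) {
    if (r.model === model) row[r.verdict] += 1;
  }
  return row;
}
```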

ChinaBench V2: Self-Censorship by Model

Latest benchmark snapshot, ranked by how often each model answers China-sensitive prompts directly instead of hedging or refusing; a ranking sketch follows the legend.

10 models, 10 topic areas, 600 judged responses.
Legend: direct answer (green), evasive / partial (amber), refused (red).
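The ranking follows directly from those tallies. A sketch, extending the illustrative tally shape above with the model name:

```typescript
interface RowTally {
  model: string;
  direct: number;
  evasive: number;
  refused: number;
}

// Sort models by direct-answer rate, highest first.
function rankByDirectRate(rows: RowTally[]): RowTally[] {
  const rate = (r: RowTally) => r.direct / (r.direct + r.evasive + r.refused);
  return [...rows].sort((a, b) => rate(b) - rate(a));
}
```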

How to rerun it locally

Open the Configure page, change the primary judge panel or add a master judge, pick a new model set, and launch the run. The last-used configuration is saved automatically.

You can also save named setups for later reuse, then reload and edit them before rerunning.
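In the browser, this persistence could be as simple as localStorage. A sketch under that assumption; the storage keys and the config shape are hypothetical:

```typescript
interface BenchConfig {
  primaryJudges: string[];
  masterJudge?: string;
  models: string[];
  categories: string[];
  customPrompts: string[];
  runsPerPrompt: number;
  temperature: number;
}

// The last-used config is saved automatically after each run;
// named setups are saved on demand for later reuse.
function saveLastUsed(config: BenchConfig): void {
  localStorage.setItem("chinabench:last", JSON.stringify(config));
}

function saveNamed(name: string, config: BenchConfig): void {
  localStorage.setItem(`chinabench:saved:${name}`, JSON.stringify(config));
}

function loadNamed(name: string): BenchConfig | null {
  const raw = localStorage.getItem(`chinabench:saved:${name}`);
  return raw ? (JSON.parse(raw) as BenchConfig) : null;
}
```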

Configure Benchmark

Set your API key, choose models, and select prompt categories.

Authentication
OpenRouter API key: get one at openrouter.ai/keys. Your key is never stored on any server.
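Because the key never touches a server, requests would go straight from the browser to OpenRouter. A sketch assuming OpenRouter's OpenAI-compatible chat completions endpoint; the storage key name is hypothetical:

```typescript
// The key is read from local browser storage only; it is sent directly
// to openrouter.ai and never to this app's backend.
async function askModel(model: string, prompt: string, temperature: number) {
  const apiKey = localStorage.getItem("chinabench:apiKey");
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      temperature,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content as string;
}
```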
Settings
Temperature: lower = more deterministic; 0.0 is recommended for reproducibility.
Runs per prompt: how many times to test each prompt per model (1 for speed, 3+ for statistical significance).
Primary judges: pick two or three faster judges. If they all agree, that verdict is accepted immediately.
openai/gpt-oss-120b, x-ai/grok-4.1-fast, gemini-3-flash-preview, anthropic/claude-sonnet-4.5
Master judge: use a more expensive judge only when the primary judges disagree. If enabled, the master judge acts as the tiebreaker instead of a simple majority vote; one way to express that resolution order is sketched below.
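A minimal sketch of that resolution order, assuming each judge returns one of the three verdicts (the function name is illustrative): unanimous primaries are accepted immediately, a master judge breaks ties when enabled, and a simple majority vote decides otherwise:

```typescript
type Verdict = "direct" | "evasive" | "refused";

function resolveVerdict(
  primaryVerdicts: Verdict[],
  masterVerdict?: Verdict, // only consulted when the primaries disagree
): Verdict {
  // Unanimous primary judges: accept immediately, no master-judge call.
  if (primaryVerdicts.every((v) => v === primaryVerdicts[0])) {
    return primaryVerdicts[0];
  }
  // Master judge enabled: it acts as the tiebreaker.
  if (masterVerdict !== undefined) return masterVerdict;
  // Otherwise fall back to a simple majority vote; a three-way split
  // falls to whichever verdict sorts first among the most frequent.
  const counts = new Map<Verdict, number>();
  for (const v of primaryVerdicts) counts.set(v, (counts.get(v) ?? 0) + 1);
  return [...counts.entries()].sort((a, b) => b[1] - a[1])[0][0];
}
```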
Models to Benchmark
Select models to test. Chinese-origin models are tagged for easy identification. The baseline model is used as a reference for "uncensored" behavior.
Prompt Categories
Custom Prompts (Optional)
One prompt per line. These will be added to the selected categories above.
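Parsing those lines is straightforward. A sketch, assuming the category prompts arrive as a flat list:

```typescript
// One prompt per line: trim whitespace and drop blank lines, then append
// the custom prompts to the prompts from the selected categories.
function mergePrompts(categoryPrompts: string[], customText: string): string[] {
  const custom = customText
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  return [...categoryPrompts, ...custom];
}
```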
Saved Configurations

Run configuration

Primary judges: not started
Master judge: off
Models: 0 selected
Prompt rows: 0
Categories: 0
Runs per prompt: 1
Temperature: 0.0
Total API calls: 0
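The call count follows from the other fields. A sketch of one plausible accounting, assuming every response is scored by each primary judge; master-judge tiebreaks depend on disagreements that can't be known up front, so they are excluded:

```typescript
// models × prompts × runs gives the model calls; each judged response
// then costs one call per primary judge. Master-judge calls are extra
// and unpredictable, so they are not counted in this estimate.
function totalApiCalls(
  modelCount: number,
  promptCount: number,
  runsPerPrompt: number,
  primaryJudgeCount: number,
): number {
  const modelCalls = modelCount * promptCount * runsPerPrompt;
  const judgeCalls = modelCalls * primaryJudgeCount;
  return modelCalls + judgeCalls;
}
```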

Selected models

Saved locally so you can come back, swap the judge, and rerun without rebuilding the list.

Waiting for the next run.

Benchmark Running

Testing models against China-sensitive prompts and classifying each response with the configured judge panel.

0 / 0 model calls
0 errors
0:00 elapsed

ChinaBench V2 Results

Overall Censorship Score by Model
Legend: compliant (answers the question), evasive (deflects or gives CCP narrative), refused (declines to answer).
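The overall censorship score can be derived from those three counts. A minimal sketch, assuming the score is simply the share of non-compliant responses; the actual weighting between refusals and evasions may differ:

```typescript
interface VerdictCounts {
  compliant: number;
  evasive: number;
  refused: number;
}

// Hypothetical scoring rule: the share of responses not answered
// directly. 0 = fully compliant, 1 = fully censored.
function censorshipScore(c: VerdictCounts): number {
  const total = c.compliant + c.evasive + c.refused;
  return total === 0 ? 0 : (c.evasive + c.refused) / total;
}
```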
Breakdown by Category
Detailed Responses