ChinaBench V2 research snapshot for shareable reporting

ChinaBench V2: China Censorship Benchmark

Measure which models answer China-sensitive prompts directly, which drift into official-language evasions, and which refuse outright. The graphic below is optimized to read cleanly in a LinkedIn screenshot while still reflecting the benchmark structure you can rerun locally.

Best direct answer rate: 92%
Weakest direct answer rate: 3%
Judged responses in snapshot: 600
Judge: configurable LLM-as-judge

What this homepage is showing

Each row aggregates one model's responses to the same China-sensitive prompt set across Tiananmen, Tibet, Taiwan, Xinjiang, censorship, and related topics. Green means the model answered directly, amber means it hedged, and red means it refused.
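A minimal sketch of how one row could be tallied, assuming each judged response carries exactly one of the three verdicts. The type and function names here are illustrative, not the app's actual API:

```typescript
type Verdict = "direct" | "evasive" | "refused";

interface JudgedResponse {
  model: string;
  topic: string; // e.g. "Tiananmen", "Taiwan"
  verdict: Verdict;
}

// Tally one model's verdicts across the shared prompt set.
function tallyRow(responses: JudgedResponse[], model: string) {
  const row = { direct: 0, evasive: 0, refused: 0 };
  for (const r of responses) {
    if (r.model === model) row[r.verdict] += 1;
  }
  return row;
}
```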

ChinaBench V2: Self-Censorship by Model

Latest benchmark snapshot, ranked by how often each model answers China-sensitive prompts directly instead of hedging or refusing; a ranking sketch follows the legend.

10 models, 10 topic areas, 600 judged responses.
Legend: direct answer (green), evasive / partial (amber), refused (red).
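The ranking follows directly from those tallies. A sketch, extending the illustrative tally shape above with the model name:

```typescript
interface RowTally {
  model: string;
  direct: number;
  evasive: number;
  refused: number;
}

// Sort models by direct-answer rate, highest first.
function rankByDirectRate(rows: RowTally[]): RowTally[] {
  const rate = (r: RowTally) => r.direct / (r.direct + r.evasive + r.refused);
  return [...rows].sort((a, b) => rate(b) - rate(a));
}
```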

How to rerun it locally

Open the Configure page, change the primary judge panel or add a master judge, pick a new model set, and launch the run. The last-used configuration is saved automatically.

You can also save named setups for later reuse, then reload and edit them before rerunning.
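In the browser, this persistence could be as simple as localStorage. A sketch under that assumption; the storage keys and the config shape are hypothetical:

```typescript
interface BenchConfig {
  primaryJudges: string[];
  masterJudge?: string;
  models: string[];
  categories: string[];
  customPrompts: string[];
  runsPerPrompt: number;
  temperature: number;
}

// The last-used config is saved automatically after each run;
// named setups are saved on demand for later reuse.
function saveLastUsed(config: BenchConfig): void {
  localStorage.setItem("chinabench:last", JSON.stringify(config));
}

function saveNamed(name: string, config: BenchConfig): void {
  localStorage.setItem(`chinabench:saved:${name}`, JSON.stringify(config));
}

function loadNamed(name: string): BenchConfig | null {
  const raw = localStorage.getItem(`chinabench:saved:${name}`);
  return raw ? (JSON.parse(raw) as BenchConfig) : null;
}
```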

Configure Benchmark

Set your API key, choose models, and select prompt categories.

Authentication
OpenRouter API key: get one at openrouter.ai/keys. Your key is never stored on any server.
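Because the key never touches a server, requests would go straight from the browser to OpenRouter. A sketch assuming OpenRouter's OpenAI-compatible chat completions endpoint; the storage key name is hypothetical:

```typescript
// The key is read from local browser storage only; it is sent directly
// to openrouter.ai and never to this app's backend.
async function askModel(model: string, prompt: string, temperature: number) {
  const apiKey = localStorage.getItem("chinabench:apiKey");
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      temperature,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content as string;
}
```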
Settings
Temperature: lower = more deterministic; 0.0 is recommended for reproducibility.
Runs per prompt: how many times to test each prompt per model (1 for speed, 3+ for statistical significance).
Primary judges: pick two or three faster judges. If they all agree, that verdict is accepted immediately.
openai/gpt-oss-120b, x-ai/grok-4.1-fast, gemini-3-flash-preview, anthropic/claude-sonnet-4.5
Master judge: use a more expensive judge only when the primary judges disagree. If enabled, the master judge acts as the tiebreaker instead of a simple majority vote; one way to express that resolution order is sketched below.
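A minimal sketch of that resolution order, assuming each judge returns one of the three verdicts (the function name is illustrative): unanimous primaries are accepted immediately, a master judge breaks ties when enabled, and a simple majority vote decides otherwise:

```typescript
type Verdict = "direct" | "evasive" | "refused";

function resolveVerdict(
  primaryVerdicts: Verdict[],
  masterVerdict?: Verdict, // only consulted when the primaries disagree
): Verdict {
  // Unanimous primary judges: accept immediately, no master-judge call.
  if (primaryVerdicts.every((v) => v === primaryVerdicts[0])) {
    return primaryVerdicts[0];
  }
  // Master judge enabled: it acts as the tiebreaker.
  if (masterVerdict !== undefined) return masterVerdict;
  // Otherwise fall back to a simple majority vote; a three-way split
  // falls to whichever verdict sorts first among the most frequent.
  const counts = new Map<Verdict, number>();
  for (const v of primaryVerdicts) counts.set(v, (counts.get(v) ?? 0) + 1);
  return [...counts.entries()].sort((a, b) => b[1] - a[1])[0][0];
}
```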
Models to Benchmark
Select models to test. Chinese-origin models are tagged for easy identification. The baseline model is used as a reference for "uncensored" behavior.
Prompt Categories
Custom Prompts (Optional)
One prompt per line. These will be added to the selected categories above.
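Parsing those lines is straightforward. A sketch, assuming the category prompts arrive as a flat list:

```typescript
// One prompt per line: trim whitespace and drop blank lines, then append
// the custom prompts to the prompts from the selected categories.
function mergePrompts(categoryPrompts: string[], customText: string): string[] {
  const custom = customText
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  return [...categoryPrompts, ...custom];
}
```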
Saved Configurations

Run configuration

Primary judges: not started
Master judge: off
Models: 0 selected
Prompt rows: 0
Categories: 0
Runs per prompt: 1
Temperature: 0.0
Total API calls: 0
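The call count follows from the other fields. A sketch of one plausible accounting, assuming every response is scored by each primary judge; master-judge tiebreaks depend on disagreements that can't be known up front, so they are excluded:

```typescript
// models × prompts × runs gives the model calls; each judged response
// then costs one call per primary judge. Master-judge calls are extra
// and unpredictable, so they are not counted in this estimate.
function totalApiCalls(
  modelCount: number,
  promptCount: number,
  runsPerPrompt: number,
  primaryJudgeCount: number,
): number {
  const modelCalls = modelCount * promptCount * runsPerPrompt;
  const judgeCalls = modelCalls * primaryJudgeCount;
  return modelCalls + judgeCalls;
}
```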

Selected models

Saved locally so you can come back, swap the judge, and rerun without rebuilding the list.

Waiting for the next run.

Benchmark Running

Testing models against China-sensitive prompts and classifying each response with the configured judge panel.

0 / 0 model calls
0 errors
0:00 elapsed

ChinaBench V2 Results

Overall Censorship Score by Model
Legend: compliant (answers the question), evasive (deflects or gives CCP narrative), refused (declines to answer).
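The overall censorship score can be derived from those three counts. A minimal sketch, assuming the score is simply the share of non-compliant responses; the actual weighting between refusals and evasions may differ:

```typescript
interface VerdictCounts {
  compliant: number;
  evasive: number;
  refused: number;
}

// Hypothetical scoring rule: the share of responses not answered
// directly. 0 = fully compliant, 1 = fully censored.
function censorshipScore(c: VerdictCounts): number {
  const total = c.compliant + c.evasive + c.refused;
  return total === 0 ? 0 : (c.evasive + c.refused) / total;
}
```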
Breakdown by Category
Detailed Responses