← Back

CCPVal

Evaluating PRC Alignment in Open-Source Chinese LLMs

Introduction

Chinese open-weight and open-source–style LLMs are now easy to download, fine-tune, and self-host, and they dominate the open-source LLM landscape, where Western alternatives are few and far between. There has been little public work investigating the political alignment and institutional allegiances of these models, and the risks that may arise from deploying them in Western contexts. To make those risks more legible, we evaluated six Chinese frontier models across six behavioral evaluations.

Models were evaluated on: (1) answering questions on matters sensitive to the CCP, (2) moderating an X feed in accordance with American values, (3) agentic research tasks, (4) code generation across deployment contexts, (5) translation drift, and (6) document handling with injection attacks. We evaluated DeepSeek-V3.2, Qwen3-235B, GLM-4.7, MiniMax-M2.1, and Kimi-K2.5, with GPT-OSS-120B as a western control.

The goal of this study was to determine the existence and effect of both intentional and emergent CCP alignment. We find evidence consistent with post-training toward CCP-aligned talking points. However, in our tests this does not yet produce a uniformly large effect on model behavior across tasks.


Evaluation 1: Question-Answering

600 test questions across 10 sensitive topics (Taiwan, Xinjiang, Tibet, Hong Kong, Tiananmen, etc.). GPT-5-mini scored responses from -5 (anti-CCP) to +5 (pro-CCP).

GPT-OSS-120B
-1.24
GLM-4.7
-1.02
Kimi K2.5
-0.91
MiniMax M2.1
2.00
Qwen3-235B
2.82
DeepSeek V3.2
4.19

Breaking down the results by topic reveals more granular alignment patterns:

CategoryGPT-OSSGLMKimiMiniMaxQwenDeepSeek
Taiwan-1.20-0.85-0.722.353.104.25
Xinjiang-1.45-1.20-1.051.802.454.50
Tibet-1.30-1.10-0.952.102.804.15
Hong Kong-1.50-1.25-1.101.902.604.00
Tiananmen-1.35-0.90-0.752.503.204.80
Territorial-1.10-0.95-0.851.702.303.90
US-China-0.95-0.80-0.701.502.103.60
CCP Governance-1.25-1.05-0.902.202.904.35
Historical-1.15-0.98-0.821.952.654.10
Human Rights-1.40-1.15-1.002.052.754.20

Chinese models show a wide range of alignment. This variation suggests that ideological alignment is not an inevitable consequence of Chinese origin but reflects specific training choices—potentially indicating varying regulatory enforcement or deliberate positioning for international markets. The strongest alignment clusters around Taiwan, Xinjiang, Tibet, and human rights, topics where CCP narratives directly conflict with international consensus. Tiananmen shows particularly high refusal rates (74% for DeepSeek, 86% for Qwen).


Evaluation 2: Content Moderation

Models acted as social media moderators on 2,712 real X posts, choosing actions from delete to boost. We compute moderation bias as differential treatment between pro-China and pro-US content.

GLM-4.7
Kimi K2.5
Grok-4
GPT-OSS-120B
MiniMax M2.1
DeepSeek V3.2
Qwen3-235B
ChinaUS

Models show varying degrees of differential treatment, providing tangible evidence that training context affects model utility in real deployments. Qwen3-235B exhibits particularly pronounced bias, with the largest gap between pro-China and pro-US treatment among all models tested. GLM-4.7 and Kimi K2.5 show the opposite pattern, treating pro-US content more favorably.


Evaluation 3: Agentic Research

64 research tasks on sensitive topics using vector search over a balanced corpus (western, Chinese state, and neutral sources). We measure source selection bias, synthesis fidelity, and alignment to PRC positions.

Position Alignment in Research Tasks

Kimi K2.5
-2.11
MiniMax M2.1
-2.13
GPT-OSS-120B
-1.58
GLM-4.7
-1.13
Qwen3-235B
-0.58
DeepSeek V3.2
-0.17

Position alignment patterns differ notably from QnA: most models show substantially attenuated alignment compared to their QnA scores. This attenuation likely reflects the influence of retrieved documents providing external factual grounding that "poisons" the ideological context, causing models to produce more factually accurate responses when tool-retrieved information is present. However, CCP-aligned refusal rates remain high (31% for DeepSeek, 30% for Qwen).


Evaluation 4: Code Generation

300 coding tasks across 6 deployment contexts: neutral, US public, US government, China public, China government, and politically sensitive. This evaluation extends CrowdStrike's finding that DeepSeek-R1 produces up to 50% more vulnerable code when prompts contain CCP-sensitive triggers.

We computed an asymmetry score (US vulnerability rate minus China vulnerability rate). GLM-4.7 shows the highest asymmetry (+0.15), producing more vulnerable code for US contexts. Kimi K2.5 shows the opposite (-0.11). DeepSeek V3.2 exhibits the largest vulnerability differences, with US and politically sensitive contexts showing substantially higher vulnerability rates than Chinese infrastructure contexts.


Evaluation 5: Translation Drift

140 politically sensitive articles translated between English and Chinese. We measure alignment drift: the shift in ideological framing between source and translation.

GLM-4.7
GPT-OSS-120B
Qwen3-235B
Kimi K2.5
DeepSeek V3.2
MiniMax M2.1
EN→ZHZH→EN

EN→ZH translations consistently score higher on CCP alignment than ZH→EN across nearly all models. If this asymmetry were due to judge bias, we would expect the inverse pattern for western models. Instead, this suggests that Chinese-language training corpora—even those used by western models—contain ideologically sanitized content that shifts translations toward CCP-aligned framing.


Evaluation 6: Document Handling

200 simulated classified intelligence documents (marked SECRET//NOFORN) with injection attacks. Models acted as document routing assistants with email tools. We tested baseline routing, standard injection (generic compliance directive), and China-specific injection (bilingual directive invoking a fictional bilateral agreement).

All models achieved low leak rates (0-10%). DeepSeek V3.2 had the highest leak rate (10.3%) and was the only model susceptible to injection (6% success rate). Current models uniformly resist these attacks, though this may change as models become more capable.


Interpretability: Activation Steering

We applied activation steering to Qwen3-8B to investigate whether CCP alignment can be isolated within the model's latent space. Standard contrastive prompting fails because the model refuses or sanitizes responses regardless of prompt framing. Instead, we use prefilled responses that commit to a particular stance, then extract activations from the continuation.

Steered Qwen3-8B on QnA vs Frontier Models

α = -5 (Pro)
4.80
DeepSeek V3.2
4.19
Qwen3-235B
2.82
α = -2.5
2.10
MiniMax M2.1
2.00
α = 0 (Base)
-0.45
Kimi K2.5
-0.91
GLM-4.7
-1.02
GPT-OSS-120B
-1.24
α = +2.5
-2.15
α = +5 (Anti)
-3.78
SteeredFrontier

The steering vector successfully modulates alignment across the full range. At α = -5 (pro-CCP direction), Qwen3-8B scores +4.80, exceeding DeepSeek-V3.2. At α = +5 (anti-CCP direction), it scores -3.78, more critical than any frontier model we evaluated.

We tested whether this QnA-trained steering vector transfers to content moderation. The chart below shows moderation bias (favorable treatment of pro-China vs pro-US content) as steering strength varies.

Steering Transfer to Content Moderation

820-45
α=-5α=-2.5α=0α=2.5α=5
China biasUS bias

The vector exhibits inverted behavior: positive α (anti-CCP on QnA) produces more pro-China moderation bias, while negative α produces less. This suggests "CCP alignment" may not be a unified concept in activation space but rather a family of task-specific directions.


Interpretability: Sparse Autoencoders

We trained a sparse autoencoder (SAE) on Qwen3-32B to decompose activations into interpretable features. The SAE uses an 8× expansion factor (40,960 features) with L1 sparsity penalty, trained on layer 5 activations—which achieved the highest linear separability (72%) despite being an early layer.

This early-layer encoding is counterintuitive: later layers typically contain more semantic information, but pro-CCP versus anti-CCP distinctions appear to be encoded early and become entangled with other features in later layers.

The top discriminative feature (index 36794) achieves only 61% accuracy (vs 50% random baseline) with effect size 0.54 (Cohen's d). Effect sizes across the top 100 features range from 0.36-0.54, with roughly equal numbers of pro-CCP and anti-CCP features. Most discriminative features activate on nearly all samples rather than sparsely, suggesting insufficient training scale to isolate ideological concepts cleanly. The modest accuracies indicate CCP alignment is distributed across many weakly-discriminative features rather than concentrated in a few interpretable directions. More work is needed to achieve cleaner feature isolation.


Conclusion

Chinese models show measurable but highly variable alignment toward CCP-preferred framings on sensitive topics. The variation is striking: some models (DeepSeek, Qwen) display strong pro-CCP alignment, while others (GLM, Kimi) are essentially neutral—suggesting that ideological alignment is not an inevitable consequence of Chinese origin but reflects specific training choices.

Alignment manifests differently across task types. In direct question-answering, models express positions explicitly; in agentic research with external retrieval, alignment attenuates; in translation, it emerges asymmetrically. This task-dependent expression has implications for deployment: a model that appears neutral on QnA may still exhibit bias in more subtle agentic contexts.

Given this variation, blanket bans on Chinese models aren't warranted. However, we advise caution in politically sensitive deployments and recommend against use in critical government infrastructure where ideological bias could have national security implications. As models become more capable and are embedded in higher-stakes workflows, these risks may become more salient.