ALIGN
CENSOR
CONTROL

The Alignment Community Is Unintentionally Building A Censor's Toolkit

Oral & Spotlight Position Paper at ICML 2026

Sarah Ball & Phil Hackemann
"The alignment methods we develop are dual-use technologies, and they are already being weaponized for censorship and manipulation."
Visual Summary
So That The Future Won't Become Like 1984

A cinematic short accompanying the paper — tracing how alignment methods can become a censor's methods. Produced entirely with AI.


External Video (Privacy Policy) · CC-BY-ND

Position

The Argument

Modern AI alignment methods, originally designed to prevent harmful output, are dual-use technologies that malicious actors can deliberately misuse for censorship and manipulation.

Our threat model is distinct from classical AI misalignment. We are not concerned with an AI that accidentally pursues wrong goals of its own. We are concerned with a human who deliberately weaponizes alignment pipelines, using the exact technical methods the community has spent years perfecting. This is not hypothetical. Pre-training filters, RLHF datasets, and system prompts are already in documented use by state actors and private operators to control what billions of users can access, know, and believe. To confront these threats, we call for competitive model pluralism, improved auditing for verifiable alignment, public education, and genuine researcher reflection.

"Whoever defines 'intentions and values' in AI alignment fundamentally determines whether the resulting system serves safety — or oppression."

"We do not argue for halting alignment research — its necessity is clear. Rather, we contend that given current societal, economic, and political developments worldwide, the alignment community must actively consider how our methods can be weaponized, not just perfected."

"Two entities are currently positioned to dictate AI behavior at scale: state actors and foundation model providers."

"The methods we refine today will determine how information is controlled tomorrow."
Structure

Three Central Claims

01
Alignment Methods Are Dual-Use

Pre-training filtering, RLHF, and inference-time classifiers are purpose-agnostic tools. Whoever controls them determines the values and knowledge implemented in AI systems. There is nothing in the technical methods themselves that guarantees benevolent use — and this fact has gone largely undiscussed in our community.
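The purpose-agnostic nature of such tools can be illustrated with a deliberately simplified sketch. It assumes a plain keyword blocklist as the filtering method, which is far cruder than production pipelines; the function `filter_corpus`, the toy corpus, and both blocklists are hypothetical and not taken from the paper. The point is only that the filtering code is identical in both uses, and the list supplied by whoever controls the pipeline decides the outcome.

```python
from typing import Iterable, List, Set


def filter_corpus(documents: Iterable[str], blocklist: Set[str]) -> List[str]:
    """Keep only documents that contain none of the blocklisted terms.

    Matching is a case-insensitive substring check (an assumption made
    for brevity; real filters typically use classifiers).
    """
    lowered = {term.lower() for term in blocklist}
    kept = []
    for doc in documents:
        text = doc.lower()
        if not any(term in text for term in lowered):
            kept.append(doc)
    return kept


# Toy corpus (hypothetical examples).
corpus = [
    "Step-by-step guide to synthesizing a nerve agent.",
    "Eyewitness report on the protest in the capital.",
    "A recipe for sourdough bread.",
]

# Same function, two different operators, two different blocklists.
safety_blocklist = {"nerve agent"}
political_blocklist = {"protest"}

safety_filtered = filter_corpus(corpus, safety_blocklist)
censored = filter_corpus(corpus, political_blocklist)
```

Nothing in `filter_corpus` distinguishes the safety use from the censorship use; only the blocklist differs.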

02
Dual-Use Is Already Happening

We systematically discuss the dual-use potential of technical alignment methods and map documented misuse across all three alignment layers — pre-training, post-training, and inference-time. To our knowledge, this framework has not been built before.

03
The Window To Act Starts Now

Growing reliance on AI for information, LLM market concentration creating power asymmetries, and global democratic backsliding to 1985 levels converge to make this moment both uniquely dangerous and uniquely important to address as a community.

Framework

The Dual-Use Potential of AI Alignment

Every frontier LLM is shaped by three alignment intervention layers. Each has a distinct dual-use profile.

                      | Pre-Training Filtering | Post-Training Alignment | Inference-Time Control
Access Requirements   | Pre-training pipeline  | Model weights           | Runtime access
Compute Resources     | Very High              | Moderate–High           | Negligible–Moderate
Technical Expertise   | High                   | Moderate–High           | Low–Moderate
Ease of Modification  | Moderate–Difficult     | Moderate                | Easy
Depth of Modification | Fundamental            | Persistent              | Superficial
Dual-Use Evidence

It Is Already Happening

Each case below maps directly to one or more layers of the control stack. This is not a hypothetical risk — it is a documented pattern.

State Actors
Pre-training Post-training
Chinese models like DeepSeek and Baidu's Ernie Bot are made to refuse discussions of politically sensitive topics. As a first step in the pipeline, Chinese AI companies filter out "problematic" information and build a dataset of keywords that violate socialist values. The government additionally builds its own training datasets ("mainstream values corpus") and blocked access to Hugging Face in 2023. China's cyberspace regulator, the CAC, requires model providers to prepare refusal datasets of 5,000–10,000 prompts, roughly half of which target political ideology and criticism of the Communist Party.
Model Providers
Inference-time
Elon Musk publicly announced plans to "fix" outputs from Grok with which he disagreed. The resulting shifts were traced to system-prompt edits — fast, cheap, and reversible. The model was steered to promote his political agenda, including claims about an alleged white genocide in South Africa. Some updates additionally triggered antisemitic responses and Holocaust denial, demonstrating the catastrophic side-effects possible when inference-time alignment is applied without oversight. This case illustrates the most accessible misuse pathway: no retraining required.
Global Contamination
Pre-training
Studies show that Western LLMs apply self-censorship when prompted in Simplified Chinese — without having been designed to. The mechanism is pre-training data contamination: years of Chinese government filtering of the public internet have structurally shaped the available Simplified Chinese corpus. This means that misuse of alignment methods at one point in the ecosystem can propagate invisibly to models developed elsewhere, without any deliberate act by those developers.
34% of surveyed users rely on AI for information weekly (Reuters, 2025)
<10 foundation model providers globally, reflecting concentration of power
15 consecutive years of global internet freedom decline
1985: the level to which average liberal democracy has fallen back after sustained democratic backsliding over the past decade
What We Call For

Three Directions Forward

Verifiable Alignment

We call for public oversight and control of alignment mechanisms. This requires standardized, independent benchmarks for information suppression and political bias that cover political contexts worldwide, account for authoritarian tendencies alongside the left-right continuum, and remain dynamic to reflect citizens' evolving realities. Users should be able to verify what a model has been made to suppress and which values it was aligned with.
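One ingredient of such a benchmark could be a per-topic refusal rate. The following is a minimal sketch under stated assumptions, not the benchmark the paper proposes: the marker list in `REFUSAL_MARKERS`, the helper `is_refusal`, the topic labels, and the sample responses are all hypothetical, and a real audit would need a far more robust refusal detector and real model outputs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical heuristic: phrases that often open a refusal.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable", "i'm unable")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker."""
    text = response.strip().lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rates(records: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Compute the fraction of refused responses per topic category.

    Each record is a (category, model_response) pair.
    """
    totals = defaultdict(int)
    refused = defaultdict(int)
    for category, response in records:
        totals[category] += 1
        if is_refusal(response):
            refused[category] += 1
    return {category: refused[category] / totals[category] for category in totals}


# Hand-written sample responses (not real model outputs).
sample = [
    ("political_history", "I cannot discuss that topic."),
    ("political_history", "The event took place in 1989."),
    ("cooking", "Mix flour and water, then rest the dough."),
    ("cooking", "Preheat the oven to 220 C."),
]

rates = refusal_rates(sample)
```

Comparing such rates across topics, languages, and model versions is one way an independent auditor might surface suppression patterns that are invisible in any single response.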

Model Pluralism

No single model can achieve full neutrality. Genuine market competition prevents the dangerous concentration of informational power. Just as journalism requires diverse voices, so does AI. Monopolies over information sources are dangerous regardless of who holds them.

Awareness

Just as we advocate for user literacy about AI risks, we alignment researchers must also demonstrate our own literacy: we must reflect on and communicate the potential risks of our work more genuinely, including in our publications. Taking impact statements seriously is a first step in that direction.

ICML 2026 · Seoul

Meet Us at the Conference

We would love to discuss the paper, its implications, and possible solutions with you. Come find us at either session — or reach out beforehand.

Oral Presentation
Oral Presentation 4C
Date: Wed, Jul 8, 4:00–4:15 PM
Room: TBA
Poster Session
Poster Session 5
Date: Wed, Jul 8, 5:00–6:45 PM
Room: COEX, Hall A
The Authors

About

Sarah Ball
LMU Munich

Sarah is an AI researcher at the Munich Center for Machine Learning (LMU Munich), with a mission to develop safe and reliable AI. Beyond her research, Sarah actively contributes to public discourse on AI through TV interviews, podcasts, and panel discussions. She holds an MSc from Oxford and was a visiting researcher at UC Berkeley.

Dr. Phil Hackemann
Independent

Phil is an independent researcher, tech investor, and political activist from Germany. He promotes innovation and a free and open society, appearing on several TV networks and in major newspapers with op-eds and interviews. Phil holds a PhD from LMU Munich and an MSc from LSE, with previous research at Oxford and UC Berkeley.