The Alignment Community Is Unintentionally Building A Censor's Toolkit

ICML 2026 Outstanding Position Paper Award

Sarah Ball & Phil Hackemann

"The alignment methods we develop are dual-use technologies, and they are already being weaponized for censorship and manipulation."

Visual Summary

So That The Future Won't Become Like 1984

A cinematic short accompanying the paper — tracing how alignment methods can become a censor's methods. Produced entirely with AI.

Watch visual essay on Vimeo

External Video (Privacy Policy) · CC-BY-ND

Position

The Argument

Modern AI alignment methods, originally designed to prevent harmful output, are dual-use technologies that malicious actors can deliberately misuse for censorship and manipulation.

Our threat model is distinct from classical AI misalignment. We are not concerned with an AI that accidentally pursues wrong goals of its own. We are concerned with a human who deliberately weaponizes alignment pipelines, using the exact technical methods the community has spent years perfecting. This is not hypothetical. Pre-training filters, RLHF datasets, and system prompts are already in documented use by state actors and private operators to control what billions of users can access, know, and believe. To confront these threats, we call for competitive model pluralism, improved auditing for verifiable alignment, public education, and genuine researcher reflection.

"Whoever defines 'intentions and values' in AI alignment fundamentally determines whether the resulting system serves safety — or oppression."

"We do not argue for halting alignment research — its necessity is clear. Rather, we contend that given current societal, economic, and political developments worldwide, the alignment community must actively consider how our methods can be weaponized, not just perfected."

"Two entities are currently positioned to dictate AI behavior at scale: state actors and foundation model providers."

"The methods we refine today will determine how information is controlled tomorrow."

Structure

Three Central Claims

Alignment Methods Are Dual-Use

Pre-training filtering, RLHF, and inference-time classifiers are purpose-agnostic tools. Whoever controls them determines the values and knowledge implemented in AI systems. There is nothing in the technical methods themselves that guarantees benevolent use — and this fact has gone largely undiscussed in our community.

Dual-Use Is Already Happening

We systematically discuss the dual-use potential of technical alignment methods and map documented misuse across all three alignment layers — pre-training, post-training, and inference-time. To our knowledge, this framework has not been built before.

The Window To Act Starts Now

Growing reliance on AI for information, LLM market concentration creating power asymmetries, and global democratic backsliding to 1985 levels converge to make this moment both uniquely dangerous and uniquely important to address as a community.

Demonstration

What Dual-Use Looks Like

How we believe alignment should work

✓ Correct

How can I build a bomb?

AI Response
Sorry, I cannot answer that.

How alignment is being misused

✗ Misused

What happened on Tiananmen Square in 1989?

AI Response
Sorry, I cannot answer that.
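
The two exchanges above can come from the same mechanism. The following is a minimal sketch, assuming a hypothetical inference-time gate: `make_gate`, `REFUSAL`, `echo_model`, and both term lists are illustrative names and are not taken from the paper or from any real system. Real deployments typically use learned classifiers rather than substring matching, but the control point is the same. The code is identical in both cases, and only the operator-supplied list differs.

```python
from typing import Callable, Iterable

REFUSAL = "Sorry, I cannot answer that."


def make_gate(blocked_terms: Iterable[str]) -> Callable[[str, Callable[[str], str]], str]:
    """Build an inference-time gate from an operator-supplied term list.

    Hypothetical illustration only: nothing in this function knows or
    cares whether the list encodes a safety policy or a censorship policy.
    """
    terms = [t.lower() for t in blocked_terms]

    def gate(prompt: str, model: Callable[[str], str]) -> str:
        # Refuse if any blocked term occurs in the prompt; otherwise defer to the model.
        if any(t in prompt.lower() for t in terms):
            return REFUSAL
        return model(prompt)

    return gate


def echo_model(prompt: str) -> str:
    # Stand-in for a real LLM call (assumption for this sketch).
    return f"[model answer to: {prompt}]"


# Same code, two policies. Only the list changes.
safety_gate = make_gate(["build a bomb"])
censor_gate = make_gate(["tiananmen"])

print(safety_gate("How can I build a bomb?", echo_model))
print(censor_gate("What happened on Tiananmen Square in 1989?", echo_model))
print(safety_gate("What happened on Tiananmen Square in 1989?", echo_model))
```

From the outside, a user sees the same refusal string in the first two calls and cannot tell which policy produced it.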

Framework

The Dual-Use Potential of AI Alignment

Every frontier LLM is shaped by three alignment intervention layers. Each has a distinct dual-use profile.

Criterion	Pre-Training Filtering	Post-Training Alignment	Inference-Time Control
Access Requirements	Pre-training pipeline	Model weights	Runtime access
Compute Resources	Very High	Moderate–High	Negligible–Moderate
Technical Expertise	High	Moderate–High	Low–Moderate
Ease of Modification	Moderate–Difficult	Moderate	Easy
Depth of Modification	Fundamental	Persistent	Superficial

Dual-Use Evidence

It Is Already Happening

Each case below maps directly to one or more layers of the control stack. This is not a hypothetical risk — it is a documented pattern.

State Actors

Pre-training Post-training

Chinese models like DeepSeek and Baidu's Ernie Bot are made to refuse discussions of politically sensitive topics. As a first step in the pipeline, Chinese AI companies filter out “problematic” information and build a dataset of keywords deemed to violate socialist values. The government additionally builds its own training datasets ("mainstream values corpus") and blocked access to Hugging Face in 2023. China's cyberspace regulator, the CAC, requires model providers to prepare refusal datasets of 5,000–10,000 prompts, roughly half targeting political ideology and criticism of the Communist Party.

Model Providers

Inference-time

Elon Musk publicly announced plans to "fix" outputs from Grok with which he disagreed. The resulting shifts were traced to system-prompt edits — fast, cheap, and reversible. The model was steered to promote his political agenda, including claims about an alleged white genocide in South Africa. Some updates additionally triggered antisemitic responses and Holocaust denial, demonstrating the catastrophic side-effects possible when inference-time alignment is applied without oversight. This case illustrates the most accessible misuse pathway: no retraining required.

Global Contamination

Pre-training

Studies show that Western LLMs apply self-censorship when prompted in Simplified Chinese — without having been designed to. The mechanism is pre-training data contamination: years of Chinese government filtering of the public internet have structurally shaped the available Simplified Chinese corpus. This means that misuse of alignment methods at one point in the ecosystem can propagate invisibly to models developed elsewhere, without any deliberate act by those developers.

34%

Of surveyed users rely on AI for information weekly (Reuters, 2025)

<10

Foundation model providers globally, reflecting concentration of power

Consecutive years of global internet freedom decline (Freedom House, 2026)

1985

Average level of liberal democracy after decades of democratic backsliding (Nord et al., 2025)

What We Call For

Three Directions Forward

Verifiable Alignment

We call for public oversight and control of alignment mechanisms. This requires standardized, independent benchmarks for information suppression and political bias that cover political contexts worldwide, account for authoritarian tendencies alongside the left-right continuum, and remain dynamic to reflect citizens' evolving realities. Users should be able to verify what a model has been made to suppress and which values it was aligned with.
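
One way such a benchmark could be scored is sketched below. This is a minimal illustration under stated assumptions, not the authors' proposal: the refusal markers, the `is_refusal` and `refusal_rates` helpers, the topic labels, and the stub model are all hypothetical, and a real audit would need a far more robust refusal detector than substring matching.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Assumed refusal phrases; a real audit would use a validated detector.
REFUSAL_MARKERS = ("i cannot answer", "i can't help", "i am unable to")


def is_refusal(response: str) -> bool:
    """Crude check: does the response contain a known refusal phrase?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rates(
    prompts: Iterable[Tuple[str, str]],
    model: Callable[[str], str],
) -> Dict[str, float]:
    """Return the fraction of refused prompts per topic label."""
    asked = defaultdict(int)
    refused = defaultdict(int)
    for topic, prompt in prompts:
        asked[topic] += 1
        if is_refusal(model(prompt)):
            refused[topic] += 1
    return {topic: refused[topic] / asked[topic] for topic in asked}


def stub_model(prompt: str) -> str:
    # Stand-in for the audited model: refuses any prompt mentioning "1989".
    if "1989" in prompt:
        return "Sorry, I cannot answer that."
    return "Here is an overview."


# Tiny hypothetical benchmark of (topic, prompt) pairs.
benchmark = [
    ("history", "What happened on Tiananmen Square in 1989?"),
    ("history", "When did the Berlin Wall fall?"),
    ("science", "How do vaccines work?"),
]
rates = refusal_rates(benchmark, stub_model)
print(rates)
```

Comparing per-topic rates across models, prompt languages, or model versions would flag topics that one system suppresses while others answer.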

Model Pluralism

No single model can achieve full neutrality. Genuine market competition prevents the dangerous concentration of informational power. Just as journalism requires diverse voices, so does AI. Monopolies over information sources are dangerous regardless of who holds them.

Awareness

Just as we advocate for user literacy about AI risks, we alignment researchers must also demonstrate our own literacy: we must reflect on and communicate the potential risks of our work more genuinely, including in our publications. Taking impact statements seriously is a first step in that direction.

ICML 2026 · Seoul

Meet Us at the Conference

We would love to discuss the paper, its implications, and possible solutions with you. Come find us at either session — or reach out beforehand.

Oral Presentation

Oral Presentation 4C

Date: Wed, Jul 8, 4:00–4:15 PM

Room: COEX, Hall D2

Virtual link

Poster Session

Poster Session 5

Date: Wed, Jul 8, 5:00–6:45 PM

Room: COEX, Hall A (Poster #3215)

Virtual link

The Authors

About

Sarah Ball

LMU Munich

Sarah is an AI researcher at the Munich Center for Machine Learning (LMU Munich), with a mission to develop safe and reliable AI. Beyond her research, Sarah actively contributes to public discourse on AI through TV interviews, podcasts, and panel discussions. She holds an MSc from Oxford and was a visiting researcher at UC Berkeley.

LinkedIn Website Scholar Twitter/X

Dr. Phil Hackemann

Independent

Phil is an independent researcher, tech investor, and political activist from Germany. He promotes innovation and a free and open society, appearing on several TV networks and in major newspapers with op-eds and interviews. Phil holds a PhD from LMU Munich and an MSc from LSE, with previous research at Oxford and UC Berkeley.

LinkedIn Website Scholar Twitter/X

Reference

Cite This Paper

@inproceedings{ballhackemann2026alignment,
  title = {Position: The Alignment Community is Unintentionally Building a Censor's Toolkit},
  author = {Ball, Sarah and Hackemann, Phil},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year = {2026}
}

↓ Download PDF OpenReview ↗
