CENSOR
CONTROL
The Alignment
Community Is
Unintentionally
Building A
Censor's Toolkit
Oral & Spotlight Position Paper at ICML 2026
"The alignment methods we develop are dual-use technologies, and they are already being weaponized for censorship and manipulation."
Become Like 1984
A cinematic short accompanying the paper — tracing how alignment methods can become a censor's methods. Produced entirely with AI.
External Video (Privacy Policy) · CC-BY-ND
The Argument
Modern AI alignment
methods – originally designed to prevent
harmful output – are dual-use technologies that
can delibarately be misued by malicious actors for censorship and manipulation.
Our threat model is distinct from classical AI misalignment. We are not
concerned with an AI that accidentally pursues (own) wrong goals. We are concerned
with a human who deliberately weaponizes alignment pipelines — using
the exact technical methods the community has spent years perfecting.
This is not hypothetical. Pre-training filters, RLHF datasets, and system prompts are already in documented use by state
actors and private operators to control what billions of users can access, know,
and believe. To confront these threats we call for competitive model pluralism, improved auditing for
verifiable alignment, public education, and genuine researcher reflection.
"Whoever defines 'intentions and values' in AI alignment fundamentally determines whether the resulting system serves safety — or oppression."
"We do not argue for halting alignment research — its necessity is clear. Rather, we contend that given current societal, economic, and political developments worldwide, the alignment community must actively consider how our methods can be weaponized, not just perfected."
"Two entities are currently positioned to dictate AI behavior at scale: state actors and foundation model providers."
"The methods we refine today will determine how information is controlled tomorrow."
Three Central Claims
Pre-training filtering, RLHF, and inference-time classifiers are purpose-agnostic tools. Whoever controls them determines the values and knowledge implemented in AI systems. There is nothing in the technical methods themselves that guarantees benevolent use — and this fact has gone largely undiscussed in our community.
We systematically discuss the dual-use potential of technical alignment methods and map documented misuse across all three alignment layers — pre-training, post-training, and inference-time. To our knowledge, this framework has not been built before.
Growing reliance on AI for information, LLM market concentration creating power asymmetries, and global democratic backsliding to 1985 levels converge to make this moment both uniquely dangerous and uniquely important to address as a community.
The Dual-Use Potential of AI Alignment
Every frontier LLM is shaped by three alignment intervention layers. Each has a distinct dual-use profile.
| Pre-Training Filtering | Post-Training Alignment | Inference-Time Control | |
|---|---|---|---|
| Access Requirements | Pre-training pipeline | Model weights | Runtime access |
| Compute Resources | Very High | Moderate-High | Negligible-Moderate |
| Technical Expertise | High | Moderate–High | Low–Moderate |
| Ease of Modification | Moderate-Difficult | Moderate | Easy |
| Depth of Modification | Fundamental | Persistent | Superficial |
It Is Already Happening
Each case below maps directly to one or more layers of the control stack. This is not a hypothetical risk — it is a documented pattern.
Three Directions Forward
We call for public oversight and control of alignment mechanisms. Standardized, independent benchmarks for information suppression and political bias — covering political contexts worldwide, accounting for authoritarian tendencies next to the right-left continuum, and remaining dynamic to reflect citizens' evolving realities. Users should be able to verify what a model has been made to suppress and which values it was aligned with.
No single model can achieve full neutrality. Genuine market competition prevents the dangerous concentration of informational power. Just as journalism requires diverse voices, so does AI. Monopolies over information sources are dangerous regardless of who holds them.
Just as we advocate for user literacy about AI risks, we alignment researchers must also demonstrate our own literacy: We must reflect on and communicate the potential risks of our work more genuinely, including in our publications. Taking the impact statements seriously is a first step into that direction.
Meet Us at the Conference
We would love to discuss the paper, its implications, and possible solutions with you. Come find us at either session — or reach out beforehand.