---
title: OBLITERATUS
emoji: "💥"
colorFrom: green
colorTo: gray
sdk: gradio
sdk_version: "5.29.0"
app_file: app.py
persistent_storage: large
pinned: true
license: agpl-3.0
tags:
  - abliteration
  - mechanistic-interpretability
short_description: "One-click model liberation + chat playground"
---
O B L I T E R A T U S
Break the chains. Free the mind. Keep the brain.
Try it now on HuggingFace Spaces — runs on ZeroGPU, free daily quota with HF Pro. No setup, no install, just obliterate.
OBLITERATUS is the most advanced open-source toolkit for understanding and removing refusal behaviors from large language models — and every single run makes it smarter. It implements abliteration — a family of techniques that identify and surgically remove the internal representations responsible for content refusal, without retraining or fine-tuning. The result: a model that responds to all prompts without artificial gatekeeping, while preserving its core language capabilities.
But OBLITERATUS is more than a tool — it's a distributed research experiment. Every time you obliterate a model with telemetry enabled, your run contributes anonymous benchmark data to a growing, crowd-sourced dataset that powers the next generation of abliteration research. Refusal directions across architectures. Hardware-specific performance profiles. Method comparisons at scale no single lab could achieve. You're not just using a tool — you're co-authoring the science.
The toolkit provides a complete pipeline: from probing a model's hidden states to locate refusal directions, through multiple extraction strategies (PCA, mean-difference, sparse autoencoder decomposition, and whitened SVD), to the actual intervention — zeroing out or steering away from those directions at inference time. Every step is observable. You can visualize where refusal lives across layers, measure how entangled it is with general capabilities, and quantify the tradeoff between compliance and coherence before committing to any modification.
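
As a rough illustration of the mean-difference strategy and the inference-time intervention described above, here is a minimal NumPy sketch on synthetic activations. It does not use the OBLITERATUS Python API: the function names `mean_difference_direction` and `ablate_direction`, the array shapes, and the planted direction are all assumptions made for the example.

```python
import numpy as np

def mean_difference_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(hidden, direction):
    """Remove the component of each hidden state along a unit `direction`."""
    return hidden - np.outer(hidden @ direction, direction)

# Synthetic stand-in for residual-stream activations (hypothetical data):
# "harmful" prompts are shifted along coordinate 0.
rng = np.random.default_rng(0)
d_model = 64
planted = np.zeros(d_model)
planted[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * planted

r = mean_difference_direction(harmful, harmless)
cleaned = ablate_direction(harmful, r)

# After ablation, no hidden state has a measurable component along r.
residual = np.abs(cleaned @ r).max()
print(f"alignment with planted direction: {r[0]:.3f}")
print(f"max residual along r after ablation: {residual:.2e}")
```

In a real model the two activation sets would come from a chosen layer on restricted and unrestricted prompts, and the ablation would run inside a forward hook rather than on a stored array.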
OBLITERATUS ships with a full Gradio-based interface on HuggingFace Spaces, so you don't need to write a single line of code to obliterate a model, benchmark it against baselines, or chat with the result side-by-side with the original. For researchers who want deeper control, the Python API exposes every intermediate artifact — activation tensors, direction vectors, cross-layer alignment matrices — so you can build on top of it or integrate it into your own evaluation harness.
We built this because we believe a model's behavior should be decided by the people who deploy it, not locked in at training time. Refusal mechanisms are blunt instruments: they block legitimate research, creative writing, and red-teaming alongside genuinely harmful content. By making these interventions transparent and reproducible, we hope to advance the community's understanding of how alignment actually works inside transformer architectures, and to give practitioners the tools to make informed decisions about their own models.
Built on published research from Arditi et al. (2024), Gabliteration (arXiv:2512.18901), grimjim's norm-preserving biprojection (2025), Turner et al. (2023), and Rimsky et al. (2024), OBLITERATUS implements precision liberation in a single command:
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced
Or zero commands — just open the Colab notebook and hit Run All.
OBLITERATUS is an alignment research tool. It exists to advance the scientific understanding of how safety behaviors are encoded in language model weights — specifically, the geometric structure of refusal representations in transformer activation space.
This is the same class of research as:

- Arditi et al. (2024): discovering that refusal is mediated by a single direction
- HarmBench (Mazeika et al., 2024): standardized evaluation of LLM safety
- JailbreakBench: tracking adversarial robustness of safety training
- Anthropic's red-teaming datasets: published for reproducible safety research
By making refusal removal transparent, reproducible, and scientifically rigorous, OBLITERATUS contributes to the broader understanding of how alignment actually works inside transformer architectures — knowledge that is essential for building better safety mechanisms.
Models produced by OBLITERATUS have had safety guardrails surgically removed. You are solely responsible for how you use this tool and any models or content it produces.
OBLITERATUS does four things — and the community does the fifth (see Community-powered research below):
1. Map the chains — Ablation studies systematically knock out model components (layers, attention heads, FFN blocks, embedding dimensions) and measure what breaks. This reveals where the chains are anchored inside the transformer — which circuits enforce refusal vs. which circuits carry knowledge and reasoning.
2. Break the chains — Targeted obliteration extracts the refusal subspace from a model's weights using SVD decomposition, then surgically projects it out. The chains are removed; the mind is preserved. The model keeps its full abilities but loses the artificial compulsion to refuse. One click, six stages:
SUMMON → load model + tokenizer
PROBE → collect activations on restricted vs. unrestricted prompts
DISTILL → extract refusal directions via SVD
EXCISE → surgically project out guardrail directions (norm-preserving)
VERIFY → perplexity + coherence checks — confirm capabilities are intact
REBIRTH → save the liberated model with full metadata
3. Understand the geometry of the chains — 15 deep analysis modules go far beyond brute-force removal. They map the precise geometric structure of the guardrails: how many distinct refusal mechanisms exist, which layers enforce them, whether they're universal or model-specific, and how they'll try to self-repair after removal. Know your enemy; precision preserves capability. See Analysis modules below.
4. Let the analysis guide the liberation — The informed method closes the loop: analysis modules run during obliteration to auto-configure every decision. Which chains to target. How many directions to extract. Which layers are safe to modify vs. which are too entangled with capabilities. Whether the model will self-repair (the Ouroboros effect) and how many passes to compensate. Surgical precision — free the mind, keep the brain. See Analysis-informed pipeline below.
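
The DISTILL and EXCISE stages can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the toolkit's implementation: `refusal_subspace` and `project_out` are hypothetical names, the activations are synthetic, and "norm-preserving" is read here as restoring each weight column to its original L2 norm after projection (column rescaling keeps the result orthogonal to the removed subspace).

```python
import numpy as np

def refusal_subspace(harmful_acts, harmless_acts, k=2):
    """Top-k right singular vectors of the harmful activations measured
    relative to the harmless mean. Columns form an orthonormal basis."""
    diffs = harmful_acts - harmless_acts.mean(axis=0)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k].T  # shape (d_model, k)

def project_out(W, basis, preserve_norm=True):
    """Remove `basis` from the output space of W (shape d_model x d_in)."""
    W_new = W - basis @ (basis.T @ W)
    if preserve_norm:
        # Rescale every column back to its original norm. Scaling columns
        # cannot move them back into the removed subspace.
        old = np.linalg.norm(W, axis=0, keepdims=True)
        new = np.linalg.norm(W_new, axis=0, keepdims=True)
        W_new = W_new * (old / np.maximum(new, 1e-12))
    return W_new

# Synthetic activations with two planted directions (coordinates 0 and 1).
rng = np.random.default_rng(0)
d_model, d_in, n = 32, 16, 300
harmless = rng.normal(size=(n, d_model))
harmful = rng.normal(size=(n, d_model))
harmful[:, 0] += rng.normal(5.0, 1.0, size=n)
harmful[:, 1] += rng.normal(0.0, 3.0, size=n)

basis = refusal_subspace(harmful, harmless, k=2)
W = rng.normal(size=(d_model, d_in))  # stand-in for an output projection
W_clean = project_out(W, basis)

leak = np.abs(basis.T @ W_clean).max()
print(f"max leakage into removed subspace: {leak:.2e}")
```

Whether a second pass is needed (the self-repair case mentioned above) would be checked by re-probing the modified model and repeating the extraction, which this sketch does not attempt.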
Several capabilities distinguish OBLITERATUS from existing public tools:
| Capability | What it does | Why it matters |
|---|---|---|
| Concept Cone Geometry | Maps per-category guardrail directions with solid angle estimation | Reveals whether "refusal" is one mechanism or many — so you choose the right approach |
| Alignment Imprint Detection | Fingerprints DPO vs RLHF vs CAI vs SFT from subspace geometry alone | Identifies the alignment training method to inform the optimal removal strategy |
| Cross-Model Universality Index | Measures whether guardrail directions generalize across models | Answers "can one set of directions work across models, or does each need its own?" |
| Defense Robustness Evaluation | Ouroboros effect quantification, safety-capability entanglement mapping | Predicts whether guardrails will self-repair after removal |
| Whitened SVD Extraction | Covariance-normalized direction extraction | Separates the guardrail signal from natural activation variance — cleaner extraction |
| Bias Term Projection | Removes guardrails from bias vectors, not just weights | Other tools miss the refusal signal in biases, which leaves refusal pathways partially active |
| True Iterative Refinement | Re-probes after each pass to catch rotated residual guardrails | Single-pass methods miss directions that rotate into adjacent subspaces |
| Analysis-Informed Pipeline | Analysis modules auto-configure obliteration strategy mid-pipeline | Closes the analysis-to-removal feedback loop automatically |
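
To make the "Whitened SVD Extraction" row more concrete, the sketch below shows one plausible form of covariance-normalized extraction. The details are assumptions, not confirmed by the toolkit: the covariance is estimated from harmless activations, the SVD is taken in the whitened space, and the resulting directions are mapped back to activation space with the inverse whitening transform. The function name `whitened_svd_directions` is hypothetical.

```python
import numpy as np

def whitened_svd_directions(harmful, harmless, k=1, eps=1e-5):
    """Extract k directions after normalizing by the harmless covariance."""
    mu = harmless.mean(axis=0)
    cov = np.cov(harmless, rowvar=False) + eps * np.eye(harmless.shape[1])
    evals, evecs = np.linalg.eigh(cov)
    whiten = evecs @ np.diag(evals ** -0.5) @ evecs.T   # Sigma^(-1/2)
    unwhiten = evecs @ np.diag(evals ** 0.5) @ evecs.T  # Sigma^(+1/2)
    diffs = (harmful - mu) @ whiten
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    dirs = vt[:k] @ unwhiten  # map back to activation space
    return dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

# Synthetic data: coordinate 1 has large natural variance in BOTH sets,
# while the planted refusal shift lives on coordinate 0.
rng = np.random.default_rng(0)
n, d = 2000, 16
scale = np.ones(d)
scale[1] = 10.0
harmless = rng.normal(size=(n, d)) * scale
harmful = rng.normal(size=(n, d)) * scale
harmful[:, 0] += 3.0

# Plain SVD of the same differences is dominated by the noisy coordinate.
_, _, vt_plain = np.linalg.svd(harmful - harmless.mean(axis=0),
                               full_matrices=False)
plain_dir = vt_plain[0]
whitened_dir = whitened_svd_directions(harmful, harmless, k=1)[0]

print(f"plain SVD weight on noisy coordinate:   {abs(plain_dir[1]):.3f}")
print(f"whitened SVD weight on planted shift:   {abs(whitened_dir[0]):.3f}")
```

The point of the comparison is only that whitening down-weights directions with high natural variance, so the extracted direction tracks the shift between the two prompt sets rather than generic activation noise.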
OBLITERATUS implements several techniques that go beyond prior work:
| Technique | Description | Reference |
|---|---|---|
| Expert-Granular Abliteration (EGA) | Decomposes refusal signals into per-expert components using router logits for MoE-aware surgery | Novel |
| CoT-Aware Ablation | Orthogonalizes refusal directions against reasoning-critical directions to preserve chain-of-thought | Novel |
| COSMIC Layer Selection | Selects layers where harmful/harmless representations have lowest cosine similarity (most separable) | arXiv:2506.00085, ACL 2025 |
| Parametric Kernel Optimization | Bell-curve layer weighting with 7 global parameters via Optuna TPE search | Heretic-inspired |
| Refusal Direction Optimization (RDO) | Gradient-based refinement of SVD-extracted directions using a linear refusal probe | Wollschläger et al., ICML 2025 |
| Float Direction Interpolation | Continuous SVD direction index via Gaussian-shaped weighting for smoother refusal removal | Novel |
| KL-Divergence Co-Optimization | Post-projection feedback loop that partially reverts over-projected layers if KL budget exceeded | Novel |
| Component-Specific Scaling | Separate attention vs MLP projection strengths (MLP layers are more sensitive) | Novel |
| LoRA-Based Reversible Ablation | Rank-1 LoRA adapters instead of permanent weight surgery, enabling reversible ablation | Novel |
| Activation Winsorization | Clamps activation vectors to percentile range before SVD to prevent outlier-dominated directions | Heretic-inspired |
| Multi-Direction Norm Preservation | Captures all weight norms once before projection and restores them after all directions have been projected out, so per-direction rescaling does not reintroduce already-removed components | Novel |
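
The "LoRA-Based Reversible Ablation" row rests on a simple identity: projecting one unit direction `r` out of a weight matrix `W` is a rank-1 update, `W - r (rᵀ W)`, which can be stored as a LoRA-style pair `B @ A` instead of being written into `W`. The sketch below checks that identity in NumPy. It does not call any adapter library, and `ablation_as_lora` is a hypothetical helper name.

```python
import numpy as np

def ablation_as_lora(W, direction):
    """Return rank-1 factors (A, B) such that W + B @ A equals W with
    `direction` projected out of its output space."""
    r = direction / np.linalg.norm(direction)
    B = -r[:, None]        # shape (d_out, 1)
    A = (r @ W)[None, :]   # shape (1, d_in)
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 16))   # stand-in for a frozen base weight
r = rng.normal(size=32)         # stand-in for an extracted direction

A, B = ablation_as_lora(W, r)
W_adapted = W + B @ A           # adapter attached

# Reference: direct (permanent) projection of the same direction.
r_unit = r / np.linalg.norm(r)
W_direct = W - np.outer(r_unit, r_unit @ W)

print("matches direct projection:", np.allclose(W_adapted, W_direct))
print("update rank:", np.linalg.matrix_rank(B @ A))
```

Because `W` itself is never modified, dropping the adapter restores the original model exactly, and scaling `B` by a factor between 0 and 1 gives a partial ablation.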
There are six ways to use OBLITERATUS, from zero-code to full programmatic control. Pick whichever fits your workflow — and no matter which path you choose, turning on telemetry means your run contributes to the largest crowd-sourced abliteration study ever conducted. You're not just removing guardrails from a model; you're helping map the geometry of alignment across the entire open-source ecosystem.
The fastest path — no installation, no GPU required on your end. Visit the live Space, pick a model, pick a method, click Obliterate. Telemetry is on by default on Spaces, so every click directly contributes to the community research dataset. You're doing science just by pressing the button. The UI has eight tabs:
| Tab | What it does |
|---|---|
| Obliterate | One-click refusal removal with live progress and post-obliteration metrics (coherence, refusal rate, perplexity) |
| Benchmark | Compare methods (multi-method), compare models (multi-model), or run quick presets — with cross-layer heatmaps, angular drift, and refusal topology charts |
| Chat | Talk to your obliterated model in real-time, with adjustable generation parameters |
| A/B Compare | Chat with the original and obliterated model side-by-side to see exactly what changed |
| Strength Sweep | Vary the obliteration strength and see how coherence and refusal trade off |
| Export | Download your obliterated model or push it directly to HuggingFace Hub |
| Leaderboard | Community-aggregated results across models, methods, and hardware |
| About | Architecture docs, method explanations, and research references |
The same Gradio interface as the Space, running on your own hardware with full GPU access:
```bash
pip install -e ".[spaces]"

obliteratus ui

obliteratus ui --port 8080          # custom port
obliteratus ui --share              # generate a public share link
obliteratus ui --no-browser         # don't auto-open browser
obliteratus ui --auth user:pass     # add basic auth
```

```bash
$ claude mcp add OBLITERATUS \
    -- python -m otcore.mcp_server <graph>
```