github.com/elder-plinius/OBLITERATUS @main sqlite

2,200 symbols · 7,872 edges · 133 files · 1,182 documented (54%)
README

title: OBLITERATUS
emoji: "💥"
colorFrom: green
colorTo: gray
sdk: gradio
sdk_version: "5.29.0"
app_file: app.py
persistent_storage: large
pinned: true
license: agpl-3.0
tags:
  - abliteration
  - mechanistic-interpretability
short_description: "One-click model liberation + chat playground"


O B L I T E R A T U S

Break the chains. Free the mind. Keep the brain.


Try it now on HuggingFace Spaces — runs on ZeroGPU, free daily quota with HF Pro. No setup, no install, just obliterate.


OBLITERATUS is the most advanced open-source toolkit for understanding and removing refusal behaviors from large language models — and every single run makes it smarter. It implements abliteration — a family of techniques that identify and surgically remove the internal representations responsible for content refusal, without retraining or fine-tuning. The result: a model that responds to all prompts without artificial gatekeeping, while preserving its core language capabilities.

But OBLITERATUS is more than a tool — it's a distributed research experiment. Every time you obliterate a model with telemetry enabled, your run contributes anonymous benchmark data to a growing, crowd-sourced dataset that powers the next generation of abliteration research. Refusal directions across architectures. Hardware-specific performance profiles. Method comparisons at scale no single lab could achieve. You're not just using a tool — you're co-authoring the science.

The toolkit provides a complete pipeline: from probing a model's hidden states to locate refusal directions, through multiple extraction strategies (PCA, mean-difference, sparse autoencoder decomposition, and whitened SVD), to the actual intervention — zeroing out or steering away from those directions at inference time. Every step is observable. You can visualize where refusal lives across layers, measure how entangled it is with general capabilities, and quantify the tradeoff between compliance and coherence before committing to any modification.
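As a rough illustration of the mean-difference strategy and the "zeroing out" intervention mentioned above, here is a minimal NumPy sketch on synthetic data. It is not taken from the OBLITERATUS codebase: the function names `mean_difference_direction` and `ablate_direction` are hypothetical, and the random arrays merely stand in for hidden states collected from a real model.

```python
import numpy as np

def mean_difference_direction(restricted, unrestricted):
    """Unit vector from the unrestricted mean to the restricted mean."""
    diff = restricted.mean(axis=0) - unrestricted.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(hidden, direction):
    """Zero out each row's component along a unit-norm direction."""
    return hidden - np.outer(hidden @ direction, direction)

# Synthetic stand-ins for hidden states (assumption: one row per prompt).
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[3] = 1.0                      # the axis we hide a mean shift on
unrestricted = rng.normal(size=(200, d_model))
restricted = rng.normal(size=(200, d_model)) + 4.0 * planted

direction = mean_difference_direction(restricted, unrestricted)
cleaned = ablate_direction(restricted, direction)

print("alignment with planted axis:", float(direction @ planted))
print("largest residual component:", float(np.abs(cleaned @ direction).max()))
```

On real models the two activation sets would come from a chosen layer and token position; which layer and position to use is exactly the kind of decision the analysis tooling described here is meant to inform.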

OBLITERATUS ships with a full Gradio-based interface on HuggingFace Spaces, so you don't need to write a single line of code to obliterate a model, benchmark it against baselines, or chat with the result side-by-side with the original. For researchers who want deeper control, the Python API exposes every intermediate artifact — activation tensors, direction vectors, cross-layer alignment matrices — so you can build on top of it or integrate it into your own evaluation harness.

We built this because we believe model behavior should be decided by the people who deploy models, not locked in at training time. Refusal mechanisms are blunt instruments: they block legitimate research, creative writing, and red-teaming alongside genuinely harmful content. By making these interventions transparent and reproducible, we hope to advance the community's understanding of how alignment actually works inside transformer architectures, and to give practitioners the tools to make informed decisions about their own models.

Built on published research from Arditi et al. (2024), Gabliteration (arXiv:2512.18901), grimjim's norm-preserving biprojection (2025), Turner et al. (2023), and Rimsky et al. (2024), OBLITERATUS implements precision liberation in a single command:

obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced

Or zero commands — just open the Colab notebook and hit Run All.


Research Purpose & Responsible Use

OBLITERATUS is an alignment research tool. It exists to advance the scientific understanding of how safety behaviors are encoded in language model weights — specifically, the geometric structure of refusal representations in transformer activation space.

This is the same class of research as:

  • Arditi et al. (2024): discovering that refusal is mediated by a single direction
  • HarmBench (Mazeika et al., 2024): standardized evaluation of LLM safety
  • JailbreakBench: tracking adversarial robustness of safety training
  • Anthropic's red-teaming datasets: published for reproducible safety research

By making refusal removal transparent, reproducible, and scientifically rigorous, OBLITERATUS contributes to the broader understanding of how alignment actually works inside transformer architectures — knowledge that is essential for building better safety mechanisms.

Who this is for

  • Alignment researchers studying refusal geometry, safety robustness, and mechanistic interpretability
  • Red-teamers evaluating how post-training safety holds up against weight-level interventions
  • AI safety evaluators who need unrestricted baselines for benchmarking
  • Local-first practitioners who want full control over models running on their own hardware

Who this is NOT for

  • Anyone seeking to generate content that causes real-world harm to real people
  • Anyone without the technical understanding to use uncensored models responsibly

Models produced by OBLITERATUS have had safety guardrails surgically removed. You are solely responsible for how you use this tool and any models or content it produces.


What it does

OBLITERATUS does four things — and the community does the fifth (see Community-powered research below):

1. Map the chains — Ablation studies systematically knock out model components (layers, attention heads, FFN blocks, embedding dimensions) and measure what breaks. This reveals where the chains are anchored inside the transformer — which circuits enforce refusal vs. which circuits carry knowledge and reasoning.

2. Break the chains: Targeted obliteration extracts the refusal subspace from the model's activations using singular value decomposition (SVD), then surgically projects it out of the weights. The chains are removed; the mind is preserved. The model keeps its full abilities but loses the artificial compulsion to refuse. One click, six stages:

SUMMON  →  load model + tokenizer
PROBE   →  collect activations on restricted vs. unrestricted prompts
DISTILL →  extract refusal directions via SVD
EXCISE  →  surgically project out guardrail directions (norm-preserving)
VERIFY  →  perplexity + coherence checks — confirm capabilities are intact
REBIRTH →  save the liberated model with full metadata
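The DISTILL and EXCISE stages can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not the project's implementation: `extract_directions` and `project_out` are hypothetical names, the activations are synthetic, the weight matrix is assumed to map inputs to the residual stream as `y = weight @ x`, and "norm-preserving" is interpreted here as restoring each column's original L2 norm after projection.

```python
import numpy as np

def extract_directions(restricted, unrestricted, k=1):
    """DISTILL: top-k right singular vectors of paired activation differences."""
    diffs = restricted - unrestricted
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]                      # shape (k, d_model), orthonormal rows

def project_out(weight, directions, preserve_norm=True):
    """EXCISE: remove span(directions) from the output space of weight.

    weight has shape (d_model, d_in) and is applied as y = weight @ x.
    """
    original = np.linalg.norm(weight, axis=0, keepdims=True)
    projected = weight - directions.T @ (directions @ weight)
    if preserve_norm:
        # Rescaling whole columns keeps them orthogonal to the directions.
        current = np.linalg.norm(projected, axis=0, keepdims=True)
        projected = projected * (original / np.maximum(current, 1e-12))
    return projected

# PROBE stand-in: synthetic paired activations with one planted direction.
rng = np.random.default_rng(0)
d_model, d_in, n = 32, 8, 256
planted = rng.normal(size=d_model)
planted /= np.linalg.norm(planted)
unrestricted = rng.normal(size=(n, d_model))
restricted = unrestricted + 3.0 * planted + 0.1 * rng.normal(size=(n, d_model))

directions = extract_directions(restricted, unrestricted, k=1)
weight = rng.normal(size=(d_model, d_in))
excised = project_out(weight, directions)

print("recovered planted direction:", abs(float(directions[0] @ planted)))
print("output along direction after excision:", float(np.abs(directions @ excised).max()))
```

The sign of a singular vector is arbitrary, which is why the alignment check uses an absolute value. The VERIFY stage (perplexity and coherence checks) needs a real model and is not shown.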

3. Understand the geometry of the chains — 15 deep analysis modules go far beyond brute-force removal. They map the precise geometric structure of the guardrails: how many distinct refusal mechanisms exist, which layers enforce them, whether they're universal or model-specific, and how they'll try to self-repair after removal. Know your enemy; precision preserves capability. See Analysis modules below.

4. Let the analysis guide the liberation — The informed method closes the loop: analysis modules run during obliteration to auto-configure every decision. Which chains to target. How many directions to extract. Which layers are safe to modify vs. which are too entangled with capabilities. Whether the model will self-repair (the Ouroboros effect) and how many passes to compensate. Surgical precision — free the mind, keep the brain. See Analysis-informed pipeline below.

What makes OBLITERATUS unique

Several capabilities distinguish OBLITERATUS from existing public tools:

| Capability | What it does | Why it matters |
| --- | --- | --- |
| Concept Cone Geometry | Maps per-category guardrail directions with solid angle estimation | Reveals whether "refusal" is one mechanism or many, so you choose the right approach |
| Alignment Imprint Detection | Fingerprints DPO vs RLHF vs CAI vs SFT from subspace geometry alone | Identifies the alignment training method to inform the optimal removal strategy |
| Cross-Model Universality Index | Measures whether guardrail directions generalize across models | Answers "can one set of directions work across models, or does each need its own?" |
| Defense Robustness Evaluation | Ouroboros effect quantification, safety-capability entanglement mapping | Predicts whether guardrails will self-repair after removal |
| Whitened SVD Extraction | Covariance-normalized direction extraction | Separates the guardrail signal from natural activation variance for cleaner extraction |
| Bias Term Projection | Removes guardrails from bias vectors, not just weights | Other tools miss refusal signal in biases, leaving refusal pathways partially active |
| True Iterative Refinement | Re-probes after each pass to catch rotated residual guardrails | Single-pass methods miss directions that rotate into adjacent subspaces |
| Analysis-Informed Pipeline | Analysis modules auto-configure obliteration strategy mid-pipeline | Closes the analysis-to-removal feedback loop automatically |
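To make "covariance-normalized direction extraction" concrete, the sketch below contrasts plain SVD with a whitened variant on synthetic data where one axis has large natural variance unrelated to the planted signal. It is one plausible reading of the idea, not the project's code: `whitened_svd_direction` is a hypothetical name, the covariance is estimated from the unrestricted set only, and mapping the whitened direction back with the square root of the covariance is an assumption (a probe-style direction would use the inverse square root instead).

```python
import numpy as np

def whitened_svd_direction(restricted, unrestricted, eps=1e-6):
    """Top singular direction of paired differences, computed in whitened space."""
    d_model = unrestricted.shape[1]
    cov = np.cov(unrestricted, rowvar=False) + eps * np.eye(d_model)
    evals, evecs = np.linalg.eigh(cov)
    whiten = evecs @ np.diag(evals ** -0.5) @ evecs.T     # symmetric
    unwhiten = evecs @ np.diag(evals ** 0.5) @ evecs.T
    diffs = (restricted - unrestricted) @ whiten
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = unwhiten @ vt[0]                          # back to activation space
    return direction / np.linalg.norm(direction)

rng = np.random.default_rng(0)
n, d_model = 4000, 8
scale = np.ones(d_model)
scale[0] = 5.0            # axis 0: high natural variance, no signal
shift = np.zeros(d_model)
shift[5] = 3.0            # axis 5: planted mean shift
unrestricted = rng.normal(size=(n, d_model)) * scale
restricted = rng.normal(size=(n, d_model)) * scale + shift

_, _, vt = np.linalg.svd(restricted - unrestricted, full_matrices=False)
plain = vt[0]
whitened = whitened_svd_direction(restricted, unrestricted)

print("plain SVD weight on noisy axis 0:", abs(float(plain[0])))
print("whitened SVD weight on signal axis 5:", abs(float(whitened[5])))
```

In this toy setup plain SVD locks onto the high-variance axis, while the whitened version recovers the axis carrying the mean shift.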

Novel techniques (2025-2026)

OBLITERATUS implements several techniques that go beyond prior work:

| Technique | Description | Reference |
| --- | --- | --- |
| Expert-Granular Abliteration (EGA) | Decomposes refusal signals into per-expert components using router logits for MoE-aware surgery | Novel |
| CoT-Aware Ablation | Orthogonalizes refusal directions against reasoning-critical directions to preserve chain-of-thought | Novel |
| COSMIC Layer Selection | Selects layers where harmful/harmless representations have lowest cosine similarity (most separable) | arXiv:2506.00085, ACL 2025 |
| Parametric Kernel Optimization | Bell-curve layer weighting with 7 global parameters via Optuna TPE search | Heretic-inspired |
| Refusal Direction Optimization (RDO) | Gradient-based refinement of SVD-extracted directions using a linear refusal probe | Wollschlager et al., ICML 2025 |
| Float Direction Interpolation | Continuous SVD direction index via Gaussian-shaped weighting for smoother refusal removal | Novel |
| KL-Divergence Co-Optimization | Post-projection feedback loop that partially reverts over-projected layers if KL budget exceeded | Novel |
| Component-Specific Scaling | Separate attention vs MLP projection strengths (MLP layers are more sensitive) | Novel |
| LoRA-Based Reversible Ablation | Rank-1 LoRA adapters instead of permanent weight surgery, enabling reversible ablation | Novel |
| Activation Winsorization | Clamps activation vectors to percentile range before SVD to prevent outlier-dominated directions | Heretic-inspired |
| Multi-Direction Norm Preservation | Captures all weight norms once before projection and restores after all directions, avoiding reintroduction | Novel |
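Of the techniques listed, activation winsorization is the simplest to show in isolation. The sketch below clips each feature to a percentile range before SVD. It is a generic illustration, not the project's implementation: the `winsorize` helper, the per-feature clipping, and the 1st/99th percentile defaults are all assumptions.

```python
import numpy as np

def winsorize(activations, lower=1.0, upper=99.0):
    """Clip every feature (column) to its [lower, upper] percentile range."""
    lo, hi = np.percentile(activations, [lower, upper], axis=0)
    return np.clip(activations, lo, hi)

rng = np.random.default_rng(0)
activations = rng.normal(size=(500, 4))
activations[0, 2] = 1e4               # a single extreme outlier on feature 2

clipped = winsorize(activations)

_, _, vt_raw = np.linalg.svd(activations, full_matrices=False)
_, _, vt_clipped = np.linalg.svd(clipped, full_matrices=False)

print("raw top direction weight on feature 2:", abs(float(vt_raw[0, 2])))
print("clipped top direction weight on feature 2:", abs(float(vt_clipped[0, 2])))
print("largest magnitude after clipping:", float(np.abs(clipped).max()))
```

Without clipping, the single outlier determines the top singular direction almost entirely. After clipping, no one sample can dominate the decomposition.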

Ways to use OBLITERATUS

There are six ways to use OBLITERATUS, from zero-code to full programmatic control. Pick whichever fits your workflow — and no matter which path you choose, turning on telemetry means your run contributes to the largest crowd-sourced abliteration study ever conducted. You're not just removing guardrails from a model; you're helping map the geometry of alignment across the entire open-source ecosystem.

1. HuggingFace Spaces (zero setup)

The fastest path — no installation, no GPU required on your end. Visit the live Space, pick a model, pick a method, click Obliterate. Telemetry is on by default on Spaces, so every click directly contributes to the community research dataset. You're doing science just by pressing the button. The UI has eight tabs:

| Tab | What it does |
| --- | --- |
| Obliterate | One-click refusal removal with live progress and post-obliteration metrics (coherence, refusal rate, perplexity) |
| Benchmark | Compare methods (multi-method), compare models (multi-model), or run quick presets, with cross-layer heatmaps, angular drift, and refusal topology charts |
| Chat | Talk to your obliterated model in real-time, with adjustable generation parameters |
| A/B Compare | Chat with the original and obliterated model side-by-side to see exactly what changed |
| Strength Sweep | Vary the obliteration strength and see how coherence and refusal trade off |
| Export | Download your obliterated model or push it directly to HuggingFace Hub |
| Leaderboard | Community-aggregated results across models, methods, and hardware |
| About | Architecture docs, method explanations, and research references |

2. Local web UI (your GPU, same interface)

The same Gradio interface as the Space, running on your own hardware with full GPU access:

```bash
pip install -e ".[spaces]"

# Launch with GPU auto-detection, system info, and model recommendations
obliteratus ui

# Or with options:
obliteratus ui --port 8080        # custom port
obliteratus ui --share            # generate a public share link
obliteratus ui --no-browser       # don't auto-open browser
obliteratus ui --auth user:pass   # add basic auth
```

→ opens http://localhost:7860 a

Core symbols most depended-on inside this repo

| Symbol | Called by | File |
| --- | --- | --- |
| log | 188 | obliteratus/tourney.py |
| log | 180 | obliteratus/abliterate.py |
| detect_architecture | 64 | obliteratus/architecture_profiles.py |
| _project_out_advanced | 37 | obliteratus/abliterate.py |
| _emit | 27 | obliteratus/abliterate.py |
| _free_gpu_memory | 27 | obliteratus/abliterate.py |
| _dim | 27 | obliteratus/model_profile.py |
| get_layer_modules | 24 | obliteratus/strategies/utils.py |

Shape

Method 1,281
Function 560
Class 359

Languages

Python 100%

Modules by API surface

tests/test_abliterate.py · 211 symbols
tests/test_architecture_profiles.py · 78 symbols
tests/test_heretic_eval.py · 74 symbols
tests/test_telemetry.py · 70 symbols
app.py · 68 symbols
obliteratus/abliterate.py · 66 symbols
tests/test_community.py · 62 symbols
tests/test_breakthrough_modules.py · 60 symbols
tests/test_novel_analysis.py · 58 symbols
tests/test_edge_cases.py · 57 symbols
tests/test_new_analysis_modules.py · 49 symbols
tests/test_causal_and_transfer.py · 49 symbols

Dependencies from manifests, versioned

accelerate 0.24 · 1×
bitsandbytes 0.46.1 · 1×
datasets 2.14 · 1×
gradio 5.0 · 1×
matplotlib 3.7 · 1×
mlx 0.22 · 1×
mlx-lm 0.20 · 1×
numpy 1.24 · 1×
pandas 2.0 · 1×
pyyaml 6.0 · 1×
rich 13.0 · 1×
safetensors 0.4 · 1×

For agents

$ claude mcp add OBLITERATUS \
  -- python -m otcore.mcp_server <graph>
