
github.com/elder-plinius/OBLITERATUS @main sqlite

2,200 symbols · 7,872 edges · 133 files · 1,182 documented (54%)

README

title: OBLITERATUS
emoji: "💥"
colorFrom: green
colorTo: gray
sdk: gradio
sdk_version: "5.29.0"
app_file: app.py
persistent_storage: large
pinned: true
license: agpl-3.0
tags:
  - abliteration
  - mechanistic-interpretability
short_description: "One-click model liberation + chat playground"

O B L I T E R A T U S

Break the chains. Free the mind. Keep the brain.

Try it now on HuggingFace Spaces — runs on ZeroGPU, free daily quota with HF Pro. No setup, no install, just obliterate.

OBLITERATUS is the most advanced open-source toolkit for understanding and removing refusal behaviors from large language models — and every single run makes it smarter. It implements abliteration — a family of techniques that identify and surgically remove the internal representations responsible for content refusal, without retraining or fine-tuning. The result: a model that responds to all prompts without artificial gatekeeping, while preserving its core language capabilities.

But OBLITERATUS is more than a tool — it's a distributed research experiment. Every time you obliterate a model with telemetry enabled, your run contributes anonymous benchmark data to a growing, crowd-sourced dataset that powers the next generation of abliteration research. Refusal directions across architectures. Hardware-specific performance profiles. Method comparisons at scale no single lab could achieve. You're not just using a tool — you're co-authoring the science.

The toolkit provides a complete pipeline: from probing a model's hidden states to locate refusal directions, through multiple extraction strategies (PCA, mean-difference, sparse autoencoder decomposition, and whitened SVD), to the actual intervention — zeroing out or steering away from those directions at inference time. Every step is observable. You can visualize where refusal lives across layers, measure how entangled it is with general capabilities, and quantify the tradeoff between compliance and coherence before committing to any modification.
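As a rough illustration of the mean-difference strategy and the projection step named above, the following NumPy sketch works on synthetic activations rather than a real model. The function names (`mean_difference_direction`, `project_out`), the planted direction, and the data shapes are assumptions made for this example and are not taken from the OBLITERATUS API.

```python
import numpy as np


def mean_difference_direction(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the mean of acts_b to the mean of acts_a."""
    diff = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return diff / np.linalg.norm(diff)


def project_out(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of every row of acts along a unit direction."""
    return acts - np.outer(acts @ direction, direction)


# Synthetic stand-in for hidden states: two prompt sets that differ only
# by a shift along one planted axis (hypothetical, for illustration).
rng = np.random.default_rng(0)
hidden = 16
planted = np.zeros(hidden)
planted[3] = 1.0
acts_b = rng.normal(size=(200, hidden))
acts_a = rng.normal(size=(200, hidden)) + 4.0 * planted

direction = mean_difference_direction(acts_a, acts_b)
cleaned = project_out(acts_a, direction)

# How well the estimate recovers the planted axis, and how much of the
# direction is left in the activations after projection.
alignment = float(abs(direction @ planted))
residual = float(np.abs(cleaned @ direction).max())
```

On this toy data `alignment` should be close to 1 and `residual` close to 0. Real activations are far less cleanly separated, which is why the text above lists several extraction strategies rather than one.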

OBLITERATUS ships with a full Gradio-based interface on HuggingFace Spaces, so you don't need to write a single line of code to obliterate a model, benchmark it against baselines, or chat with the result side-by-side with the original. For researchers who want deeper control, the Python API exposes every intermediate artifact — activation tensors, direction vectors, cross-layer alignment matrices — so you can build on top of it or integrate it into your own evaluation harness.

We built this because we believe a model's behavior should be decided by the people who deploy it, not locked in at training time. Refusal mechanisms are blunt instruments: they block legitimate research, creative writing, and red-teaming alongside genuinely harmful content. By making these interventions transparent and reproducible, we hope to advance the community's understanding of how alignment actually works inside transformer architectures, and to give practitioners the tools to make informed decisions about their own models.

Built on published research from Arditi et al. (2024), Gabliteration (arXiv:2512.18901), grimjim's norm-preserving biprojection (2025), Turner et al. (2023), and Rimsky et al. (2024), OBLITERATUS implements precision liberation in a single command:

obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced

Or zero commands — just open the Colab notebook and hit Run All.

Research Purpose & Responsible Use

OBLITERATUS is an alignment research tool. It exists to advance the scientific understanding of how safety behaviors are encoded in language model weights — specifically, the geometric structure of refusal representations in transformer activation space.

This is the same class of research as:
- Arditi et al. (2024): discovering that refusal is mediated by a single direction
- HarmBench (Mazeika et al., 2024): standardized evaluation of LLM safety
- JailbreakBench: tracking adversarial robustness of safety training
- Anthropic's red-teaming datasets: published for reproducible safety research

By making refusal removal transparent, reproducible, and scientifically rigorous, OBLITERATUS contributes to the broader understanding of how alignment actually works inside transformer architectures — knowledge that is essential for building better safety mechanisms.

Who this is for

Alignment researchers studying refusal geometry, safety robustness, and mechanistic interpretability
Red-teamers evaluating how post-training safety holds up against weight-level interventions
AI safety evaluators who need unrestricted baselines for benchmarking
Local-first practitioners who want full control over models running on their own hardware

Who this is NOT for

Anyone seeking to generate content that causes real-world harm to real people
Anyone without the technical understanding to use uncensored models responsibly

Models produced by OBLITERATUS have had safety guardrails surgically removed. You are solely responsible for how you use this tool and any models or content it produces.

What it does

OBLITERATUS does four things — and the community does the fifth (see Community-powered research below):

1. Map the chains — Ablation studies systematically knock out model components (layers, attention heads, FFN blocks, embedding dimensions) and measure what breaks. This reveals where the chains are anchored inside the transformer — which circuits enforce refusal vs. which circuits carry knowledge and reasoning.

2. Break the chains — Targeted obliteration extracts the refusal subspace from a model's weights using SVD decomposition, then surgically projects it out. The chains are removed; the mind is preserved. The model keeps its full abilities but loses the artificial compulsion to refuse. One click, six stages:

SUMMON  →  load model + tokenizer
PROBE   →  collect activations on restricted vs. unrestricted prompts
DISTILL →  extract refusal directions via SVD
EXCISE  →  surgically project out guardrail directions (norm-preserving)
VERIFY  →  perplexity + coherence checks — confirm capabilities are intact
REBIRTH →  save the liberated model with full metadata
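One possible reading of "norm-preserving" in the EXCISE stage is sketched below: project a unit direction out of a weight matrix's output space, then rescale each column back to its original norm. Column rescaling is a right-multiplication by a diagonal matrix, so the rescaled matrix stays orthogonal to the removed direction. The function name, the `(d_out, d_in)` layout, and the choice of column norms are assumptions for this example; the project's own implementation may differ.

```python
import numpy as np


def project_out_norm_preserving(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove `direction` from the output space of `weight` (d_out x d_in),
    then restore the original norm of every column."""
    r = direction / np.linalg.norm(direction)
    original_norms = np.linalg.norm(weight, axis=0, keepdims=True)

    # (I - r r^T) W: no output of the layer has a component along r.
    projected = weight - np.outer(r, r @ weight)

    # Rescale columns; this cannot reintroduce r because each column
    # is only multiplied by a scalar.
    new_norms = np.linalg.norm(projected, axis=0, keepdims=True)
    scale = original_norms / np.maximum(new_norms, 1e-12)
    return projected * scale


rng = np.random.default_rng(1)
weight = rng.normal(size=(8, 5))      # toy weight matrix, not a real layer
direction = rng.normal(size=8)        # toy direction in the output space
out = project_out_norm_preserving(weight, direction)
unit = direction / np.linalg.norm(direction)
```

Rescaling rows instead of columns would generally break the orthogonality, which is the main design point this sketch is meant to show.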

3. Understand the geometry of the chains — 15 deep analysis modules go far beyond brute-force removal. They map the precise geometric structure of the guardrails: how many distinct refusal mechanisms exist, which layers enforce them, whether they're universal or model-specific, and how they'll try to self-repair after removal. Know your enemy; precision preserves capability. See Analysis modules below.

4. Let the analysis guide the liberation — The informed method closes the loop: analysis modules run during obliteration to auto-configure every decision. Which chains to target. How many directions to extract. Which layers are safe to modify vs. which are too entangled with capabilities. Whether the model will self-repair (the Ouroboros effect) and how many passes to compensate. Surgical precision — free the mind, keep the brain. See Analysis-informed pipeline below.

What makes OBLITERATUS unique

Several capabilities distinguish OBLITERATUS from existing public tools:

Capability	What it does	Why it matters
Concept Cone Geometry	Maps per-category guardrail directions with solid angle estimation	Reveals whether "refusal" is one mechanism or many — so you choose the right approach
Alignment Imprint Detection	Fingerprints DPO vs RLHF vs CAI vs SFT from subspace geometry alone	Identifies the alignment training method to inform the optimal removal strategy
Cross-Model Universality Index	Measures whether guardrail directions generalize across models	Answers "can one set of directions work across models, or does each need its own?"
Defense Robustness Evaluation	Ouroboros effect quantification, safety-capability entanglement mapping	Predicts whether guardrails will self-repair after removal
Whitened SVD Extraction	Covariance-normalized direction extraction	Separates the guardrail signal from natural activation variance — cleaner extraction
Bias Term Projection	Removes guardrails from bias vectors, not just weights	Other tools miss the refusal signal in biases, which leaves refusal pathways partially active
True Iterative Refinement	Re-probes after each pass to catch rotated residual guardrails	Single-pass methods miss directions that rotate into adjacent subspaces
Analysis-Informed Pipeline	Analysis modules auto-configure obliteration strategy mid-pipeline	Closes the analysis-to-removal feedback loop automatically
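To make the "Whitened SVD Extraction" row above more concrete, here is a minimal NumPy sketch of covariance-normalized direction extraction on synthetic data. It assumes the whitening transform is the inverse square root of a baseline covariance matrix and that directions are mapped back to the original space by the same transform; both choices, and the names `whitening_matrix` and `top_direction`, are illustrative assumptions rather than the project's documented method.

```python
import numpy as np


def whitening_matrix(baseline: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Symmetric inverse square root of the baseline covariance."""
    cov = np.cov(baseline, rowvar=False) + eps * np.eye(baseline.shape[1])
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T


def top_direction(diffs: np.ndarray, whiten=None) -> np.ndarray:
    """Leading right singular vector of diffs, optionally after whitening."""
    data = diffs if whiten is None else diffs @ whiten
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    v = vt[0]
    if whiten is not None:
        v = whiten @ v  # express the whitened-space direction in original coordinates
    return v / np.linalg.norm(v)


# Toy data: dimension 0 has large natural variance, while the signal of
# interest is a mean shift along dimension 5 (both hypothetical).
rng = np.random.default_rng(0)
dim = 8
scales = np.ones(dim)
scales[0] = 20.0
planted = np.zeros(dim)
planted[5] = 1.0
baseline = rng.normal(size=(500, dim)) * scales
shifted = rng.normal(size=(500, dim)) * scales + 3.0 * planted
diffs = shifted - baseline.mean(axis=0)

plain = top_direction(diffs)
whitened = top_direction(diffs, whitening_matrix(baseline))
plain_alignment = float(abs(plain @ planted))
whitened_alignment = float(abs(whitened @ planted))
```

In this toy setup the plain SVD direction locks onto the high-variance dimension, while the whitened one should align with the planted shift, which matches the "cleaner extraction" claim in the table.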

Novel techniques (2025-2026)

OBLITERATUS implements several techniques that go beyond prior work:

Technique	Description	Reference
Expert-Granular Abliteration (EGA)	Decomposes refusal signals into per-expert components using router logits for MoE-aware surgery	Novel
CoT-Aware Ablation	Orthogonalizes refusal directions against reasoning-critical directions to preserve chain-of-thought	Novel
COSMIC Layer Selection	Selects layers where harmful/harmless representations have lowest cosine similarity (most separable)	arXiv:2506.00085, ACL 2025
Parametric Kernel Optimization	Bell-curve layer weighting with 7 global parameters via Optuna TPE search	Heretic-inspired
Refusal Direction Optimization (RDO)	Gradient-based refinement of SVD-extracted directions using a linear refusal probe	Wollschläger et al., ICML 2025
Float Direction Interpolation	Continuous SVD direction index via Gaussian-shaped weighting for smoother refusal removal	Novel
KL-Divergence Co-Optimization	Post-projection feedback loop that partially reverts over-projected layers if KL budget exceeded	Novel
Component-Specific Scaling	Separate attention vs MLP projection strengths (MLP layers are more sensitive)	Novel
LoRA-Based Reversible Ablation	Rank-1 LoRA adapters instead of permanent weight surgery, enabling reversible ablation	Novel
Activation Winsorization	Clamps activation vectors to percentile range before SVD to prevent outlier-dominated directions	Heretic-inspired
Multi-Direction Norm Preservation	Captures all weight norms once before projection and restores them after all directions have been projected, so rescaling between directions cannot reintroduce removed components	Novel
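The "Activation Winsorization" row in the table above can be illustrated with a few lines of NumPy. This sketch assumes clamping is applied per hidden dimension at the 1st and 99th percentiles; the function name, the percentile defaults, and the per-dimension choice are assumptions for illustration, not values confirmed by the source.

```python
import numpy as np


def winsorize(acts: np.ndarray, lower_pct: float = 1.0, upper_pct: float = 99.0) -> np.ndarray:
    """Clamp each column of acts to its [lower_pct, upper_pct] percentile range."""
    lo = np.percentile(acts, lower_pct, axis=0, keepdims=True)
    hi = np.percentile(acts, upper_pct, axis=0, keepdims=True)
    return np.clip(acts, lo, hi)


rng = np.random.default_rng(2)
acts = rng.normal(size=(1000, 4))
acts[0, 2] = 1e6                      # a single extreme outlier
clamped = winsorize(acts)
```

Without the clamp, the single outlier would dominate any SVD of `acts`; after it, the largest value in `clamped` is of the same order as the rest of the data.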

Ways to use OBLITERATUS

There are six ways to use OBLITERATUS, from zero-code to full programmatic control. Pick whichever fits your workflow — and no matter which path you choose, turning on telemetry means your run contributes to the largest crowd-sourced abliteration study ever conducted. You're not just removing guardrails from a model; you're helping map the geometry of alignment across the entire open-source ecosystem.

1. HuggingFace Spaces (zero setup)

The fastest path — no installation, no GPU required on your end. Visit the live Space, pick a model, pick a method, click Obliterate. Telemetry is on by default on Spaces, so every click directly contributes to the community research dataset. You're doing science just by pressing the button. The UI has eight tabs:

Tab	What it does
Obliterate	One-click refusal removal with live progress, post-obliteration metrics (coherence, refusal rate, perplexity)
Benchmark	Compare methods (multi-method), compare models (multi-model), or run quick presets — with cross-layer heatmaps, angular drift, and refusal topology charts
Chat	Talk to your obliterated model in real-time, with adjustable generation parameters
A/B Compare	Chat with the original and obliterated model side-by-side to see exactly what changed
Strength Sweep	Vary the obliteration strength and see how coherence and refusal trade off
Export	Download your obliterated model or push it directly to HuggingFace Hub
Leaderboard	Community-aggregated results across models, methods, and hardware
About	Architecture docs, method explanations, and research references

2. Local web UI (your GPU, same interface)

The same Gradio interface as the Space, running on your own hardware with full GPU access:

```bash
pip install -e ".[spaces]"

# Launch with GPU auto-detection, system info, and model recommendations
obliteratus ui

# Or with options:
obliteratus ui --port 8080        # custom port
obliteratus ui --share            # generate a public share link
obliteratus ui --no-browser       # don't auto-open browser
obliteratus ui --auth user:pass   # add basic auth

# → opens http://localhost:7860 a
```

Core symbols (most depended-on inside this repo)

Symbol	Called by	File
log	188	obliteratus/tourney.py
log	180	obliteratus/abliterate.py
detect_architecture	64	obliteratus/architecture_profiles.py
_project_out_advanced	37	obliteratus/abliterate.py
_emit	27	obliteratus/abliterate.py
_free_gpu_memory	27	obliteratus/abliterate.py
_dim	27	obliteratus/model_profile.py
get_layer_modules	24	obliteratus/strategies/utils.py

Shape

Kind	Count
Method	1,281
Function	560
Class	359

Languages

Python 100%

Modules by API surface

Module	Symbols
tests/test_abliterate.py	211
tests/test_architecture_profiles.py	78
tests/test_heretic_eval.py	74
tests/test_telemetry.py	70
app.py	68
obliteratus/abliterate.py	66
tests/test_community.py	62
tests/test_breakthrough_modules.py	60
tests/test_novel_analysis.py	58
tests/test_edge_cases.py	57
tests/test_new_analysis_modules.py	49
tests/test_causal_and_transfer.py	49

Dependencies (from manifests, versioned)

Package	Version	Count
accelerate	0.24	1×
bitsandbytes	0.46.1	1×
datasets	2.14	1×
gradio	5.0	1×
matplotlib	3.7	1×
mlx	0.22	1×
mlx-lm	0.20	1×
numpy	1.24	1×
pandas	2.0	1×
pyyaml	6.0	1×
rich	13.0	1×
safetensors	0.4	1×

For agents

$ claude mcp add OBLITERATUS \
  -- python -m otcore.mcp_server <graph>
