
github.com/kyegomez/OpenMythos @main sqlite

249 symbols 701 edges 15 files 122 documented · 49%
README

OpenMythos


Disclaimer: OpenMythos is an independent, community-driven theoretical reconstruction based solely on publicly available research and speculation. It is not affiliated with, endorsed by, or connected to Anthropic or any of their proprietary systems.

OpenMythos is an open-source, theoretical implementation of the Claude Mythos model. It implements a Recurrent-Depth Transformer (RDT) with three stages: Prelude (transformer blocks), a looped Recurrent Block (up to max_loop_iters), and a final Coda. Attention is switchable between MLA and GQA, and the feed-forward layer is a sparse MoE with routed and shared experts. The design is suited to exploring compute-adaptive, depth-variable reasoning.

Installation

pip install open-mythos

# or, with uv: uv pip install open-mythos

To enable Flash Attention 2 in GQAttention (requires CUDA and build tools):

pip install open-mythos[flash]

Usage


import torch
from open_mythos.main import OpenMythos, MythosConfig


attn_type = "mla"  # or "gqa"

base = {
    "vocab_size": 1000,
    "dim": 256,
    "n_heads": 8,
    "max_seq_len": 128,
    "max_loop_iters": 4,
    "prelude_layers": 1,
    "coda_layers": 1,
    "n_experts": 8,
    "n_shared_experts": 1,
    "n_experts_per_tok": 2,
    "expert_dim": 64,
    "lora_rank": 8,
    "attn_type": attn_type,
}

if attn_type == "gqa":
    cfg = MythosConfig(**base, n_kv_heads=2)
else:
    cfg = MythosConfig(
        **base,
        n_kv_heads=8,
        kv_lora_rank=32,
        q_lora_rank=64,
        qk_rope_head_dim=16,
        qk_nope_head_dim=16,
        v_head_dim=16,
    )

model = OpenMythos(cfg)
total = sum(p.numel() for p in model.parameters())
print(f"\n[{attn_type.upper()}] Parameters: {total:,}")

ids = torch.randint(0, cfg.vocab_size, (2, 16))
logits = model(ids, n_loops=4)
print(f"[{attn_type.upper()}] Logits shape: {logits.shape}")

out = model.generate(ids, max_new_tokens=8, n_loops=8)
print(f"[{attn_type.upper()}] Generated shape: {out.shape}")

A = model.recurrent.injection.get_A()
rho = torch.linalg.eigvals(A).abs().max().item()
print(
    f"[{attn_type.upper()}] Spectral radius ρ(A) = {rho:.4f} (must be < 1)"
)

Model Variants

Pre-configured scales from 1B to 1T parameters:

from open_mythos import (
    mythos_1b,
    mythos_3b,
    mythos_10b,
    mythos_50b,
    mythos_100b,
    mythos_500b,
    mythos_1t,
    OpenMythos,
)

cfg = mythos_1b()  # returns a MythosConfig
model = OpenMythos(cfg)

total = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total:,}")
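
| Variant | dim | Experts | expert_dim | Loop iters | Context | Max output |
| --- | --- | --- | --- | --- | --- | --- |
| mythos_1b | 2048 | 64 | 2048 | 16 | 4k | 4k |
| mythos_3b | 3072 | 64 | 4096 | 16 | 4k | 4k |
| mythos_10b | 4096 | 128 | 5632 | 24 | 8k | 4k |
| mythos_50b | 6144 | 256 | 9728 | 32 | 8k | 4k |
| mythos_100b | 8192 | 256 | 13568 | 32 | 1M | 128k |
| mythos_500b | 12288 | 512 | 23040 | 48 | 1M | 128k |
| mythos_1t | 16384 | 512 | 34560 | 64 | 1M | 128k |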

Training

The training script for the 3B model on FineWeb-Edu is at training/3b_fine_web_edu.py.

Single GPU:

python training/3b_fine_web_edu.py

Multi-GPU (auto-detects GPU count):

torchrun --nproc_per_node=$(python -c "import torch; print(torch.cuda.device_count())") training/3b_fine_web_edu.py

Key design choices:

| Feature | Detail |
| --- | --- |
| Optimizer | AdamW |
| Dataset | HuggingFaceFW/fineweb-edu (sample-10BT by default, swap to sample-100BT or default for full run) |
| Tokenizer | openai/gpt-oss-20b via MythosTokenizer |
| Parallelism | PyTorch DDP via torchrun, sharded streaming dataset |
| Precision | bfloat16 on H100/A100, float16 + GradScaler on older GPUs |
| Schedule | Linear warmup (2000 steps) → cosine decay |
| Target | 30B tokens (~Chinchilla-adjusted for looped architecture) |
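
The warmup-then-cosine schedule listed above can be sketched in a few lines. This is an illustrative sketch, not the code in training/3b_fine_web_edu.py: only the 2000-step linear warmup and the cosine shape come from the table, while `peak_lr`, `min_lr`, and `total_steps` are hypothetical values chosen for the example.

```python
import math


def lr_at_step(step, peak_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=100_000):
    """Learning rate at a given optimizer step.

    peak_lr, min_lr, and total_steps are hypothetical; only the
    2000-step warmup and the cosine decay come from the README.
    """
    if step < warmup_steps:
        # Linear warmup: ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 1000, 2000, 50_000, 100_000):
    print(s, lr_at_step(s))
```

With PyTorch, a function of this shape would typically be wrapped in a scheduler rather than called by hand; the standalone form is used here only to make the curve easy to inspect.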

Documentation

| Page | Description |
| --- | --- |
| docs/open_mythos.md | Full API reference for the OpenMythos class: constructor, forward, generate, all sub-modules, configuration reference, and usage examples |
| docs/datasets.md | Recommended training datasets with token budget guidance per model size |

The Central Hypothesis

Claude Mythos is suspected to be a Recurrent-Depth Transformer (RDT) — also called a Looped Transformer (LT). Rather than stacking hundreds of unique layers, a subset of layers is recycled and run through multiple times per forward pass. Same weights. More loops. Deeper thinking.

This is not chain-of-thought. There is no intermediate token output. All of this reasoning happens silently, inside a single forward pass, in continuous latent space.


Architecture

A looped transformer divides its layers into three functional blocks:

Input
  ↓
[Prelude P]        — standard transformer layers, run once
  ↓
[Recurrent Block R] — looped T times
  ↑_______↓         (hidden state h updated each loop with input injection e)
  ↓
[Coda C]           — standard transformer layers, run once
  ↓
Output

The recurrent block update rule at each loop step t:

h_{t+1} = A·h_t + B·e + Transformer(h_t, e)

Where:

  • h_t is the hidden state after loop t
  • e is the encoded input (from the Prelude), injected at every loop
  • A and B are learned injection parameters
  • The Transformer blocks apply attention and MLP as usual

The injection of e at every step is what prevents the model from drifting — it keeps the original input signal alive throughout the entire recurrence depth.
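
To make the update rule concrete, here is a minimal pure-Python sketch of the loop with diagonal A and B. It is not the code in open_mythos/main.py: `toy_block` is a hypothetical stand-in for the Transformer term, the zero initial state is an assumption, and the numeric values are arbitrary.

```python
import math


def toy_block(h, e):
    """Hypothetical stand-in for Transformer(h_t, e): a small bounded nonlinearity."""
    return [0.1 * math.tanh(hi + ei) for hi, ei in zip(h, e)]


def run_recurrence(e, a_diag, b_diag, n_loops):
    """Iterate h_{t+1} = A*h_t + B*e + block(h_t, e) with diagonal A and B."""
    h = [0.0] * len(e)  # assumption: the hidden state starts at zero
    for _ in range(n_loops):
        update = toy_block(h, e)
        h = [a * hi + b * ei + u
             for a, b, hi, ei, u in zip(a_diag, b_diag, h, e, update)]
    return h


e = [1.0, -0.5, 0.25]        # encoded input from the Prelude (toy values)
a_diag = [0.9, 0.8, 0.7]     # diagonal of A, all entries below 1 in magnitude
b_diag = [0.1, 0.2, 0.3]     # diagonal of B

h4 = run_recurrence(e, a_diag, b_diag, n_loops=4)
h8 = run_recurrence(e, a_diag, b_diag, n_loops=8)
print("after 4 loops:", h4)
print("after 8 loops:", h8)
```

Because e enters every iteration through B·e, the state keeps being pulled toward an input-dependent value instead of depending only on its own history; with e set to zero in this toy version, the state never leaves zero.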

The full implementation is in open_mythos/main.py. See the OpenMythos class reference for a detailed API walkthrough, configuration options, and usage examples.

Attention Implementations

The attention layer is switchable via cfg.attn_type:

| Option | Class | Description |
| --- | --- | --- |
| "gqa" | GQAttention | Grouped Query Attention (Ainslie et al., 2023): fewer KV heads than Q heads (n_kv_heads < n_heads), reducing KV-cache memory by a factor of n_heads / n_kv_heads. Uses Flash Attention 2 (Dao et al., 2023) when flash-attn>=2.8.3 is installed; GQA is then handled natively (no KV head expansion) by I/O-aware kernels. Falls back transparently to manual scaled dot-product attention when the package is absent. |
| "mla" | MLAttention | Multi-head Latent Attention (DeepSeek-V2): caches a compressed KV latent (kv_lora_rank) rather than full K/V, with split RoPE / no-RoPE head dims for position-aware compression. |

RoPE is applied to Q and K before caching, so cached values do not need to be re-rotated on retrieval.
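
The cache savings can be checked with simple arithmetic, using the values from the Usage example (dim=256, n_heads=8, n_kv_heads=2, kv_lora_rank=32, qk_rope_head_dim=16). Two assumptions are made here and are not confirmed by the source: that the GQA head dimension is dim / n_heads, and that the MLA cache holds the latent plus one shared RoPE key per token, as in DeepSeek-V2. The helper names are hypothetical.

```python
def kv_cache_elems_per_token(n_kv_heads, head_dim):
    """Cached elements per token per layer when K and V are stored per KV head."""
    return 2 * n_kv_heads * head_dim


def mla_cache_elems_per_token(kv_lora_rank, qk_rope_head_dim):
    """Assumed DeepSeek-V2-style cache: compressed latent plus a shared RoPE key."""
    return kv_lora_rank + qk_rope_head_dim


dim, n_heads, n_kv_heads = 256, 8, 2
head_dim = dim // n_heads  # assumption: head_dim = dim / n_heads

mha_elems = kv_cache_elems_per_token(n_heads, head_dim)     # one KV head per Q head
gqa_elems = kv_cache_elems_per_token(n_kv_heads, head_dim)  # grouped KV heads
mla_elems = mla_cache_elems_per_token(kv_lora_rank=32, qk_rope_head_dim=16)

print("MHA:", mha_elems, "GQA:", gqa_elems, "reduction:", mha_elems / gqa_elems)
print("MLA (assumed layout):", mla_elems)
```

The GQA reduction factor equals n_heads / n_kv_heads regardless of the head dimension, since head_dim cancels in the ratio.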


Why This Explains Mythos

1. Systematic Generalization

Vanilla transformers fail to combine knowledge in ways they have never seen during training. Looped transformers pass this test. The ability emerges through a three-stage grokking process:

  1. Memorization — model fits training distribution
  2. In-distribution generalization — model handles known compositions
  3. Systematic generalization — model handles novel out-of-distribution (OOD) compositions, and the ability appears abruptly

This is why Mythos feels qualitatively different from other models on novel questions — the capability phase-transitions in, rather than emerging gradually.

2. Depth Extrapolation

Train on 5-hop reasoning chains. Test on 10-hop. Vanilla transformer fails. Looped transformer succeeds — by running more inference-time loops. This maps directly to the observation that Mythos handles deeply compositional problems (multi-step math, long-horizon planning, layered arguments) without explicit chain-of-thought.

More loops at inference = deeper reasoning chains = harder problems solved.

3. Latent Thoughts as Implicit Chain-of-Thought

Each loop iteration is the functional equivalent of one step of chain-of-thought, but operating in continuous latent space rather than token space. A looped model running T loops implicitly simulates T steps of CoT reasoning. This has been formally proven (Saunshi et al., 2025).

Furthermore, continuous latent thoughts — unlike discrete token outputs — can encode multiple alternative next steps simultaneously. This allows something closer to breadth-first search over the reasoning space, rather than a single committed reasoning path. The model is effectively exploring many possible directions inside each forward pass before converging.

4. No Parameter Explosion

A looped model with k layers run L times achieves the quality of a kL-layer non-looped model, with only k layers worth of parameters. For Mythos-scale deployments, this matters enormously:

  • Memory footprint does not grow with reasoning depth
  • Inference-time compute scales with loop count, not model size
  • This makes deeper reasoning "free" in terms of parameters
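
A rough count illustrates the point. The sketch below assumes a plain dense block (four attention projections plus a two-matrix MLP with a 4x hidden width), which is simpler than the MoE blocks this repository uses; `block_params` and the chosen sizes are hypothetical.

```python
def block_params(dim, ffn_mult=4):
    """Approximate parameters in one dense transformer block (norms and biases ignored)."""
    attn = 4 * dim * dim             # Q, K, V, and output projections
    mlp = 2 * dim * (ffn_mult * dim)  # up and down projections
    return attn + mlp


def looped_vs_unrolled(dim, k, n_loops):
    """Parameters for k shared layers looped n_loops times vs. k * n_loops unique layers."""
    looped = k * block_params(dim)
    unrolled = k * n_loops * block_params(dim)
    return looped, unrolled


looped, unrolled = looped_vs_unrolled(dim=2048, k=4, n_loops=16)
print(f"looped:   {looped:,} parameters")
print(f"unrolled: {unrolled:,} parameters")
print(f"ratio:    {unrolled // looped}x")
```

Both configurations execute the same number of block applications per forward pass, so compute is comparable; only the stored weights differ, by a factor equal to the loop count.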

The Stability Problem (and How It Was Likely Solved)

Training looped models is notoriously unstable. Two failure modes dominate:

  • Residual explosion — the hidden state h_t grows unboundedly across loops
  • Loss spikes — training diverges suddenly due to large spectral norms in injection parameters

The Dynamical Systems View

Recast looping as a discrete linear time-invariant (LTI) dynamical system over the residual stream. Ignoring the nonlinear Transformer contribution, the recurrence becomes:

h_{t+1} = A·h_t + B·e

For this LTI system, stability is governed entirely by the spectral radius of A:

  • ρ(A) < 1 → stable, convergent
  • ρ(A) ≥ 1 → unstable, divergent

Empirically, every divergent training run learns ρ(A) ≥ 1. Every convergent run maintains ρ(A) < 1.
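
The stability condition is easy to see in the scalar case, where A reduces to a single number a and ρ(A) = |a|. The sketch below iterates the linear recurrence for one stable and one unstable value; the specific numbers are arbitrary and `iterate_lti` is a hypothetical helper.

```python
def iterate_lti(a, b, e, n_steps, h0=0.0):
    """Iterate the scalar recurrence h_{t+1} = a*h_t + b*e."""
    h = h0
    for _ in range(n_steps):
        h = a * h + b * e
    return h


b, e = 0.5, 1.0

# |a| < 1: the state converges to the fixed point b*e / (1 - a).
stable = iterate_lti(0.9, b, e, n_steps=200)
fixed_point = b * e / (1 - 0.9)

# |a| > 1: the state grows geometrically with the number of loops.
unstable = iterate_lti(1.1, b, e, n_steps=200)

print("stable:  ", stable, "(fixed point", fixed_point, ")")
print("unstable:", unstable)
```

For a diagonal A the same argument applies to each coordinate independently, so the spectral radius is simply the largest absolute diagonal entry.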

The Fix

Constrain the injection parameters so that stability is guaranteed by construction:

  1. Parameterize A as a continuous negative diagonal matrix
  2. Discretize using ZOH/Euler schemes; under ZOH, A_discrete = exp(Δt · A_continuous)
  3. Enforce negativity via A_continuous := Diag(-exp(log_A)), with a learned scalar Δt
  4. With Δt > 0, every diagonal entry of A_discrete is the exponential of a negative number and so lies in (0, 1); hence ρ(A_discrete) < 1 always holds, regardless of learning rate or batch noise
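
The construction above can be sketched for a diagonal A in plain Python. This is an illustration rather than the repository's get_A: the function names are hypothetical, and keeping Δt positive through a softplus is an assumption, since the source only says Δt is a learned scalar.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def discrete_a_diag(log_a, raw_dt):
    """Diagonal of A_discrete = exp(dt * A_continuous), A_continuous = -exp(log_a)."""
    dt = softplus(raw_dt)                        # assumption: dt > 0 via softplus
    a_cont = [-math.exp(v) for v in log_a]       # strictly negative for any log_a
    return [math.exp(dt * a) for a in a_cont]    # ZOH: exp of a negative number


def spectral_radius_diag(a_diag):
    """For a diagonal matrix, the spectral radius is the largest |entry|."""
    return max(abs(a) for a in a_diag)


a_disc = discrete_a_diag(log_a=[-3.0, 0.0, 3.0], raw_dt=0.0)
rho = spectral_radius_diag(a_disc)
print("A_discrete diagonal:", a_disc)
print("spectral radius:", rho)
```

Whatever values the optimizer assigns to log_a and raw_dt, each entry is the exponential of a negative number, so in exact arithmetic it lies strictly between 0 and 1. In floating point, extreme parameter values can round an entry to exactly 0.0 or 1.0, so a practical implementation may want to clamp the parameters.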

The result: the looped model becomes significantly more robust to hyperparameter selection and trains cleanly even at high learning rates. This is the Parcae architecture (Prairie et al., 2026), and it represents the most likely class of solution Anthropic used to make Mythos trainable.


Scaling Laws for Looped Models

Parcae establishes the first predictable scaling laws for looped training:

  • Training: For a fixed FLOP budget with fixed parameters, increasing mean recurrence and reducing token count yields a lower loss than training with minimal loops on more data. Optimal recurrence and optimal token count both follow power laws with consistent exponents across scales.
  • Inference: More test-time loops improve quality following a predictable, saturating exponential decay: gains are real but diminishing. This mirrors the inference-time scaling of chain-of-thought.
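
The shape of a saturating exponential decay can be illustrated with a toy curve. The functional form below is one common way to write such a curve, and the constants `floor`, `gap`, and `rate` are hypothetical; none of them are fitted values from Parcae.

```python
import math


def loss_at_loops(n_loops, floor=2.0, gap=1.0, rate=0.25):
    """Hypothetical curve: loss decays exponentially toward `floor` as loops increase."""
    return floor + gap * math.exp(-rate * n_loops)


loops = list(range(1, 17))
losses = [loss_at_loops(t) for t in loops]

# Improvement bought by each additional loop.
gains = [prev - cur for prev, cur in zip(losses, losses[1:])]

for t, loss in zip(loops, losses):
    print(f"loops={t:2d}  loss={loss:.4f}")
print("first gain:", gains[0], "last gain:", gains[-1])
```

Each extra loop still lowers the loss, but by a constant fraction less than the previous one, which is the "real but diminishing" behavior described above.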

At 770M parameters, a looped model achieves the downstream quality of a 1.3B fixed-depth Transformer trained on the same data: about 60% of the parameters for the same quality.

Applied to Mythos: if trained under these scaling laws, Mythos could be dramatically more parameter-efficient than it appears, with a large fraction of its apparent "capability" coming from loop depth rather than raw parameter count.


The Loop Index Embedding Hypothesis

A key open question is whether the looped block behaves identically on every iteration, or whether it can learn to do different things at different loop depths.

Without any positional signal across loops, the same weights must handle both early-stage pattern matching and late-stage refinement — a tight constraint. A RoPE-like embedding of the loop index injected alongside the input at each step would allow the same parameters to implement functionally distinct operations across iterations, much like how RoPE allows the same attention heads to behave differently at different sequence positions.

If Mythos uses this technique, each loop is not a repetition — it is a distinct computational phase, all sharing weights but operating in different representational regimes. This would substantially increase the expressiveness of the recurrent block without increasing parameter count.
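
One way such a signal could look is a RoPE-style rotation whose angle depends on the loop index rather than the sequence position. The sketch below is hypothetical and is not the repository's loop_index_embedding function; the name `rotate_by_loop_index`, the base of 10000, and the choice to rotate the injected input are all assumptions made for illustration.

```python
import math


def rotate_by_loop_index(x, loop_idx, base=10000.0):
    """Rotate consecutive pairs of x by angles proportional to the loop index."""
    assert len(x) % 2 == 0, "feature dimension must be even"
    out = []
    for i in range(0, len(x), 2):
        # Pair i/2 uses frequency base^(-i/dim), as in RoPE.
        theta = loop_idx * base ** (-i / len(x))
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out


e = [1.0, 0.0, 0.5, -0.5]  # toy encoded input
for t in range(3):
    print(f"loop {t}:", rotate_by_loop_index(e, t))
```

A rotation keeps the norm of the input unchanged while giving each iteration a distinguishable view of it, so the shared weights can condition their behavior on the loop depth without any extra parameters.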


The Overthinking Problem

More loops is not always

Core symbols most depended-on inside this repo

precompute_rope_freqs · called by 32 · open_mythos/main.py
apply_rope · called by 25 · open_mythos/main.py
encode · called by 7 · open_mythos/tokenizer.py
get_A · called by 7 · open_mythos/main.py
reset_mem · called by 5 · tests/bench_vs_transformer.py
print_header · called by 5 · tests/bench_vs_transformer.py
loop_index_embedding · called by 5 · open_mythos/main.py
evaluate · called by 4 · tests/small_benchmark.py

Shape

Method 149
Function 55
Class 45

Languages

Python 100%

Modules by API surface

tests/test_main.py · 96 symbols
open_mythos/main.py · 41 symbols
open_mythos/moda.py · 36 symbols
tests/small_benchmark.py · 23 symbols
tests/bench_vs_transformer.py · 20 symbols
tests/test_tokenizer.py · 10 symbols
training/3b_fine_web_edu.py · 8 symbols
open_mythos/variants.py · 7 symbols
open_mythos/tokenizer.py · 5 symbols
tests/test_rope_debug.py · 3 symbols

Dependencies from manifests, versioned

datasets 2.18.0 · 1×
loguru 0.7.3 · 1×
pytest 7.0.0 · 1×
torch 2.1.0 · 1×
transformers 4.40.0 · 1×

For agents

$ claude mcp add OpenMythos \
  -- python -m otcore.mcp_server <graph>
