hub / github.com/ArchiveBox/ArchiveBox

github.com/ArchiveBox/ArchiveBox @v0.7.4 sqlite

repository ↗ · DeepWiki ↗ · release v0.7.4 ↗

818 symbols 3,792 edges 122 files 125 documented · 15%

README

ArchiveBox _{Open-source self-hosted web archiving.}

ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view websites offline.

Without active preservation effort, everything on the internet eventually dissapears or degrades. Archive.org does a great job as a free central archive, but they require all archives to be public, and they can't save every type of content.

ArchiveBox is an open source tool that helps you archive web content on your own (or privately within an organization): save copies of browser bookmarks, preserve evidence for legal cases, backup photos from FB / Insta / Flickr, download your media from YT / Soundcloud / etc., snapshot research papers & academic citations, and more...

➡️ Use ArchiveBox as a command-line package and/or self-hosted web app on Linux, macOS, or in Docker.

📥 You can feed ArchiveBox URLs one at a time, or schedule regular imports from browser bookmarks or history, feeds like RSS, bookmark services like Pocket/Pinboard, and more. See input formats for a full list.

snapshot detail page

💾 It saves snapshots of the URLs you feed it in several redundant formats.
It also detects any content featured inside each webpage & extracts it out into a folder: - HTML/Generic websites -> HTML, PDF, PNG, WARC, Singlefile - YouTube/SoundCloud/etc. -> MP3/MP4 + subtitles, description, thumbnail - News articles -> article body TXT + title, author, featured images - Github/Gitlab/etc. links -> git cloned source code - and more...

It uses normal filesystem folders to organize archives (no complicated proprietary formats), and offers a CLI + web UI.

🏛️ ArchiveBox is used by many professionals and hobbyists who save content off the web, for example:

Individuals: backing up browser bookmarks/history, saving FB/Insta/etc. content, shopping lists
Journalists: crawling and collecting research, preserving quoted material, fact-checking and review
Lawyers: evidence collection, hashing & integrity verifying, search, tagging, & review
Researchers: collecting AI training sets, feeding analysis / web crawling pipelines

The goal is to sleep soundly knowing the part of the internet you care about will be automatically preserved in durable, easily accessible formats for decades after it goes down.

bookshelf graphic logo

Demo | Screenshots | Usage

_{. . . . . . . . . . . . . . . . . . . . . . . . . . . .}

📦 Get ArchiveBox with docker / apt / brew / pip3 / nix / etc. (see Quickstart below).

# Get ArchiveBox with Docker or Docker Compose (recommended)
docker run -v $PWD/data:/data -it archivebox/archivebox:dev init --setup

# Or install with your preferred package manager (see Quickstart below for apt, brew, and more)
pip3 install archivebox

# Or use the optional auto setup script to install it
curl -sSL 'https://get.archivebox.io' | sh

🔢 Example usage: adding links to archive.

archivebox add 'https://example.com'                                   # add URLs one at a time
archivebox add < ~/Downloads/bookmarks.json                            # or pipe in URLs in any text-based format
archivebox schedule --every=day --depth=1 https://example.com/rss.xml  # or auto-import URLs regularly on a schedule

🔢 Example usage: viewing the archived content.

archivebox server 0.0.0.0:8000            # use the interactive web UI
archivebox list 'https://example.com'     # use the CLI commands (--help for more)
ls ./archive/*/index.json                 # or browse directly via the filesystem

cli init screenshot server snapshot admin screenshot server snapshot details page screenshot

Key Features

Free & open source, doesn't require signing up online, stores all data locally
Powerful, intuitive command line interface with modular optional dependencies
Comprehensive documentation, active development, and rich community
Extracts a wide variety of content out-of-the-box: media (yt-dlp), articles (readability), code (git), etc.
Supports scheduled/realtime importing from many types of sources
Uses standard, durable, long-term formats like HTML, JSON, PDF, PNG, MP4, TXT, and WARC
Usable as a oneshot CLI, self-hosted web UI, Python API (BETA), REST API (ALPHA), or desktop app (ALPHA)
Saves all pages to archive.org as well by default for redundancy (can be disabled for local-only mode)
Advanced users: support for archiving content requiring login/paywall/cookies (see wiki security caveats!)
Planned: support for running JS during archiving to adblock, autoscroll, modal-hide, thread-expand

🤝 Professional Integration

Contact us if your non-profit institution/org wants to use ArchiveBox professionally.

setup & support, team permissioning, hashing, audit logging, backups, custom archiving etc.
for individuals, NGOs, academia, governments, journalism, law, and more...

All our work is open-source and primarily geared towards non-profits.
Support/consulting pays for hosting and funds new ArchiveBox open-source development.

grass grass

Quickstart

🖥 Supported OSs: Linux/BSD, macOS, Windows (Docker) 👾 CPUs: amd64 (x86_64), arm64 (arm8), arm7 ^(raspi>=3)

_{Note: On arm7 the playwright package is not available, so chromium must be installed manually if needed.}

✳️ Easy Setup

docker-compose (macOS/Linux/Windows) 👈 recommended (click to expand)

👍 Docker Compose is recommended for the easiest install/update UX + best security + all the extras out-of-the-box.

Install Docker and Docker Compose on your system (if not already installed).

Download the docker-compose.yml file into a new empty directory (can be anywhere).

mkdir ~/archivebox && cd ~/archivebox
curl -O 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/dev/docker-compose.yml'

Run the initial setup and create an admin user.
```
docker compose run archivebox init --setup
```

Optional: Start the server then login to the Web UI http://127.0.0.1:8000 ⇢ Admin.

docker compose up
# completely optional, CLI can always be used without running a server
# docker compose run [-T] archivebox [subcommand] [--args]

See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.

docker run (macOS/Linux/Windows)

Install Docker on your system (if not already installed).

Create a new empty directory and initialize your collection (can be anywhere).

mkdir ~/archivebox && cd ~/archivebox
docker run -v $PWD:/data -it archivebox/archivebox init --setup

Optional: Start the server then login to the Web UI http://127.0.0.1:8000 ⇢ Admin.

docker run -v $PWD:/data -p 8000:8000 archivebox/archivebox
# completely optional, CLI can always be used without running a server
# docker run -v $PWD:/data -it [subcommand] [--args]

See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.

bash<

Core symbols most depended-on inside this repo

archivebox/core/views.py

filter

called by 44

archivebox/core/settings.py

end

called by 25

archivebox/logging_util.py

Shape

Function 581

Method 160

Class 70

Route 7

Languages

Python73%

TypeScript27%

Modules by API surface

archivebox/templates/static/jquery.dataTables.min.js126 symbols

archivebox/templates/static/jquery.min.js92 symbols

archivebox/index/schema.py48 symbols

archivebox/logging_util.py39 symbols

archivebox/core/models.py35 symbols

archivebox/index/__init__.py34 symbols

archivebox/core/admin.py33 symbols

archivebox/config.py32 symbols

archivebox/util.py20 symbols

archivebox/cli/tests.py19 symbols

archivebox/main.py18 symbols

tests/test_extractors.py16 symbols

Dependencies from manifests, versioned

@postlight/parser2.2.3 · 1×

readability-extractorgithub:ArchiveBox/re · 1×

single-file-cli2.0.83 · 1×

asgiref3.11.1 · 1×

asttokens3.0.1 · 1×

certifi2026.4.22 · 1×

charset-normalizer3.4.7 · 1×

colorama0.4.6 · 1×

croniter6.2.2 · 1×

dateparser1.2.2 · 1×

decorator5.3.1 · 1×

django3.1.14 · 1×

For agents

$ claude mcp add ArchiveBox \
  -- python -m otcore.mcp_server <graph>

⬇ download graph artifact