github.com/ucbepic/docetl @0.3.0 sqlite

2,774 symbols 10,946 edges 333 files 954 documented · 34%
README

DocETL: Declarative & Agentic Map-Reduce

Website · Documentation · Discord · License: MIT

What is DocETL · Install · Python API · YAML · DocWrangler UI · Docs


What is DocETL

DocETL helps you process large collections of data (structured and unstructured) with LLMs. You write each operation in natural language, e.g., "pull out every complaint in this ticket," and DocETL

  • provides the operators you need (map, reduce, filter, and more) and orchestrates them, parallelizing work across your data,
  • optimizes your pipeline automatically, swapping models, rewriting prompts, decomposing operations, and replacing subtasks with code wherever possible, to raise accuracy and cut cost, and
  • returns tables, easy to query in your favorite database.

Without DocETL, you write each LLM call yourself, wire them together, and tune the result for accuracy, cost, and latency by hand.
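
To make the contrast concrete, here is a minimal sketch of a hand-wired classify-then-summarize job in plain Python. It is not DocETL's implementation: `fake_llm` is a hypothetical stand-in for a real model call, and the sample tickets are invented. It only illustrates the map, group-by-key, and reduce steps you would otherwise write, parallelize, and tune yourself.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; a real one would hit a provider API."""
    return "billing" if "refund" in prompt.lower() else "technical"


def classify(ticket: dict) -> dict:
    # "Map" step: one model call per document, adding a new field.
    category = fake_llm(f"Classify this support ticket: {ticket['text']}")
    return {**ticket, "category": category}


def summarize(category: str, tickets: list[dict]) -> dict:
    # "Reduce" step: one output row per group.
    # A real version would send the grouped tickets to the model.
    return {"category": category, "summary": f"{len(tickets)} ticket(s)"}


# Invented sample data for illustration only.
tickets = [
    {"text": "I want a refund for last month."},
    {"text": "The app crashes on login."},
    {"text": "Refund still not received."},
]

# Parallelize the map step by hand.
with ThreadPoolExecutor(max_workers=4) as pool:
    classified = list(pool.map(classify, tickets))

# Group by the reduce key by hand.
groups = defaultdict(list)
for ticket in classified:
    groups[ticket["category"]].append(ticket)

rows = [summarize(cat, docs) for cat, docs in sorted(groups.items())]
print(rows)
```

Even in this toy version, the orchestration code outweighs the two prompts, and none of it addresses retries, rate limits, or cost tracking.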

[Figure: DocETL pipeline overview]

[Figures: DocETL CLI · DocWrangler UI]

Install

pip install docetl
export OPENAI_API_KEY=your_key   # or any LLM provider key

Need Help Writing Your Pipeline?

Use Claude Code (recommended): run docetl install-skill and describe your task. See the quickstart.

If you'd rather use ChatGPT or the Claude app, copy the prompt at docetl.org/llms-full.txt into the chat before describing your task.


Python API (recommended)

Best for production code, notebooks, and scripting. See the full guide.

import docetl

docetl.default_model = "gpt-4o-mini"
docetl.rate_limits = {
    "llm_call": [{"count": 500, "per": 1, "unit": "minute"}],
    "llm_tokens": [{"count": 200_000, "per": 1, "unit": "minute"}],
}

# Classify support tickets, then summarize each category
pipeline = docetl.read_json("tickets.json")

pipeline = pipeline.map(
    prompt="Classify this support ticket: {{ input.text }}",
    output={"schema": {"category": "str", "priority": "str"}},
)

pipeline = pipeline.reduce(
    reduce_key="category",
    prompt="Summarize these tickets: {% for t in inputs %}{{ t.text }}{% endfor %}",
    output={"schema": {"summary": "str"}},
)

pipeline.schema()  # {'category': 'str', 'summary': 'str'}
pipeline.show()  # run on 5 docs and print results
rows = pipeline.collect()  # full run
print(f"Cost: ${pipeline.total_cost:.4f}")
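
The example reads `tickets.json` and its prompts reference `input.text` and `t.text`, so each record needs a `text` field. A minimal sketch for creating a small test input, assuming the file is a JSON array of objects (the ticket contents are made up for illustration):

```python
import json
from pathlib import Path

# Hypothetical sample tickets; only the "text" field is used by the prompts above.
sample_tickets = [
    {"text": "I was charged twice for my subscription this month."},
    {"text": "The mobile app crashes every time I open settings."},
    {"text": "How do I change the email address on my account?"},
]

path = Path("tickets.json")
path.write_text(json.dumps(sample_tickets, indent=2), encoding="utf-8")

# Sanity check: the file round-trips and every record carries a "text" field.
loaded = json.loads(path.read_text(encoding="utf-8"))
assert all("text" in ticket for ticket in loaded)
print(f"Wrote {len(loaded)} tickets to {path}")
```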

YAML (low-code)

Declare your pipeline in a config file, no Python needed. See the tutorial.

datasets:
  tickets:
    type: file
    path: tickets.json

default_model: gpt-4o-mini

operations:
  - name: classify
    type: map
    prompt: "Classify this support ticket and assign a priority level."
    output:
      schema:
        category: str
        priority: str

pipeline:
  steps:
    - name: triage
      input: tickets
      operations: [classify]
  output:
    type: file
    path: output.json

docetl run pipeline.yaml
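
Because the result is a table of rows, it can be loaded into a database for querying. The following is a minimal sketch using Python's built-in `sqlite3`, assuming `output.json` holds a JSON array of objects carrying the `category` and `priority` fields from the schema above. The rows here are invented so the snippet runs on its own; in practice you would read them from `output.json`.

```python
import json
import sqlite3

# Invented stand-in for the contents of output.json
# (assumption: a JSON array of row objects).
output_json = """
[
  {"text": "Charged twice", "category": "billing", "priority": "high"},
  {"text": "App crashes on login", "category": "technical", "priority": "high"},
  {"text": "How do I change my avatar?", "category": "account", "priority": "low"}
]
"""
rows = json.loads(output_json)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (text TEXT, category TEXT, priority TEXT)")
# Named placeholders map directly onto the keys of each row dict.
conn.executemany(
    "INSERT INTO tickets VALUES (:text, :category, :priority)", rows
)

# Count high-priority tickets per category.
high_priority = conn.execute(
    "SELECT category, COUNT(*) FROM tickets "
    "WHERE priority = 'high' GROUP BY category ORDER BY category"
).fetchall()
print(high_priority)
conn.close()
```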

DocWrangler UI

Visual playground for interactive prompt development. Edit prompts, see results in real time. Try it at docetl.org/playground or run it locally.


Documentation

Python API Guide: Frame API reference (operations, config, optimization)
YAML Tutorial: Step-by-step walkthrough of declarative pipelines
Operators: Map, filter, reduce, resolve, split, gather, extract, and more
Optimization: Automatic cost-accuracy optimization with MOAR
DocWrangler Setup: Run the interactive UI locally or via Docker
Claude Code Quick Start: Describe your task and let Claude build the pipeline

Community

Discord · Conversation Generator · Text-to-Speech · YouTube Transcript Topics


Development

git clone https://github.com/ucbepic/docetl.git && cd docetl
make install
make tests-basic  # < $0.01 with OpenAI

Papers

DocETL was created at the EPIC Data Lab and Data Systems and Foundations group at UC Berkeley.

DocETL, VLDB 2025 (paper)

@article{shankar2025docetl,
  title={DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing},
  author={Shankar, Shreya and Chambers, Tristan and Shah, Tarak and Parameswaran, Aditya G and Wu, Eugene},
  journal={Proceedings of the VLDB Endowment},
  volume={18}, number={9}, pages={3035--3048}, year={2025}
}

DocWrangler, UIST 2025, Best Paper Honorable Mention (paper)

@inproceedings{shankar2025docwrangler,
  title={Steering Semantic Data Processing With DocWrangler},
  author={Shankar*, Shreya and Chopra*, Bhavya and Hasan, Mawil and Lee, Stephen and Hartmann, Bj{\"o}rn and Hellerstein, Joseph M and Parameswaran, Aditya G and Wu, Eugene},
  booktitle={Proceedings of the ACM Symposium on User Interface Software and Technology (UIST)},
  year={2025}
}

MOAR, VLDB 2026 (paper)

@article{wei2026moar,
  title={Multi-Objective Agentic Rewrites for Unstructured Data Processing},
  author={Wei*, Lindsey Linxi and Shankar*, Shreya and Zeighami, Sepanta and Chung, Yeounoh and Ozcan, Fatma and Parameswaran, Aditya G},
  journal={Proceedings of the VLDB Endowment}, year={2026}
}

*Co-first authors

Extension points (exported contracts: how you extend this code)

ObservabilityIndicatorProps (Interface) · website/src/components/ColumnDialog.tsx · (no doc)
ColumnDialogProps (Interface) · website/src/components/ColumnDialog.tsx · (no doc)
ValueStatsProps (Interface) · website/src/components/ColumnDialog.tsx · (no doc)
BlockOp (Interface) · website/src/components/VisualizationBuilder.tsx · (no doc)
StackOp (Interface) · website/src/components/VisualizationBuilder.tsx · (no doc)

Core symbols (most depended-on inside this repo)

get · called by 1074 · docetl/progress/events.py
map · called by 287 · docetl/frame.py
print · called by 157 · docetl/console.py
cn · called by 157 · website/src/lib/utils.ts
filter · called by 103 · docetl/frame.py
split · called by 88 · docetl/frame.py
load · called by 72 · docetl/runner.py
toast · called by 64 · website/src/hooks/use-toast.ts

Shape

Function 1,257
Method 1,119
Class 245
Interface 115
Route 38

Languages

Python 81%
TypeScript 19%

Modules by API surface

docetl/reasoning_optimizer/instantiate_schemas.py · 76 symbols
tests/test_frame.py · 74 symbols
docetl/frame.py · 63 symbols
website/src/components/operations/components.tsx · 50 symbols
docetl/tui/app.py · 39 symbols
tests/test_moar_multistep.py · 35 symbols
docetl/runner.py · 35 symbols
tests/test_pandas_accessors.py · 34 symbols
tests/test_api.py · 32 symbols
tests/test_agentic_operations.py · 32 symbols
docetl/optimizers/join_optimizer.py · 32 symbols
experiments/reasoning/run_simple_agent.py · 31 symbols

Dependencies (from manifests, versioned)

@agbishop/react-ansi-18 4.0.6 · 1×
@ai-sdk/azure 1.0.13 · 1×
@ai-sdk/openai 0.0.70 · 1×
@cyntler/react-doc-viewer 1.17.0 · 1×
@eslint/js 9.13.0 · 1×
@hookform/resolvers 3.9.0 · 1×
@monaco-editor/react 4.6.0 · 1×
@next/third-parties 14.2.11 · 1×
@radix-ui/react-accordion 1.2.0 · 1×
@radix-ui/react-alert-dialog 1.1.2 · 1×
@radix-ui/react-checkbox 1.1.2 · 1×
@radix-ui/react-collapsible 1.1.0 · 1×

For agents

$ claude mcp add docetl \
  -- python -m otcore.mcp_server <graph>
