
github.com/datajuicer/data-juicer @v1.5.3 sqlite


7,096 symbols · 32,413 edges · 806 files · 2,848 documented (40%)

README

Data-Juicer: The Data Operating System for the Foundation Model Era

Multimodal | Cloud-Native | AI-Ready | Large-Scale

Data-Juicer (DJ) transforms raw data chaos into AI-ready intelligence. It treats data processing as composable infrastructure—providing modular building blocks to clean, synthesize, and analyze data across the entire AI lifecycle, unlocking latent value in every byte.

Whether you're deduplicating web-scale pre-training corpora, curating agent interaction traces, or preparing domain-specific RAG indices, DJ scales seamlessly from your laptop to thousand-node clusters—no glue code required.

Alibaba Cloud PAI has deeply integrated Data-Juicer into its data processing products. See Quickly submit a DataJuicer job.

🚀 Quick Start

Zero-install exploration:
- JupyterLab Playground with Tutorials
- Ask DJ Copilot

Install & run:

uv pip install py-data-juicer
dj-process --config demos/process_simple/process.yaml

Or compose in Python:

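from data_juicer.core.data import NestedDataset
from data_juicer.ops.filter import TextLengthFilter
from data_juicer.ops.mapper import WhitespaceNormalizationMapper

# Build a small in-memory dataset with a single "text" field
ds = NestedDataset.from_dict({
    "text": ["Short", "This passes the filter.", "Text   with   spaces"]
})
# Apply the operators in order: length filter first, then whitespace normalization
res_ds = ds.process([
    TextLengthFilter(min_len=10),
    WhitespaceNormalizationMapper()
])

# Print every sample that survived the pipeline
for s in res_ds:
    print(s)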

✨ Why Data-Juicer?

1. Modular & Extensible Architecture

200+ operators spanning text, image, audio, video, and multimodal data
Recipe-first: Reproducible YAML pipelines you can version, share, and fork like code
Composable: Drop in a single operator, chain complex workflows, or orchestrate full pipelines
Hot-reload: Iterate on operators without pipeline restarts
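The recipe-first idea above can be sketched in plain Python: a recipe is just data (an ordered list of operator names and parameters) that is resolved against a registry at run time. This is a minimal illustration, not Data-Juicer's implementation. The `OPS` registry, the `register` decorator, `run_recipe`, and the `whitespace_collapse_mapper` name are all hypothetical; only the general shape of a `process` list is modeled on the idea of a YAML recipe.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]
Op = Callable[[List[Sample]], List[Sample]]

# Hypothetical registry of operator factories (not Data-Juicer's real registry).
OPS: Dict[str, Callable[..., Op]] = {}


def register(name: str):
    """Register an operator factory under a recipe-friendly name."""
    def decorator(factory: Callable[..., Op]) -> Callable[..., Op]:
        OPS[name] = factory
        return factory
    return decorator


@register("text_length_filter")
def text_length_filter(min_len: int = 0) -> Op:
    # Keep samples whose text has at least `min_len` characters.
    return lambda samples: [s for s in samples if len(s["text"]) >= min_len]


@register("whitespace_collapse_mapper")
def whitespace_collapse_mapper() -> Op:
    # Collapse runs of whitespace into single spaces (illustrative behavior only).
    return lambda samples: [{**s, "text": " ".join(s["text"].split())} for s in samples]


def run_recipe(recipe: dict, samples: List[Sample]) -> List[Sample]:
    """Apply each step of the recipe in order; each step is {op_name: params}."""
    for step in recipe["process"]:
        (name, params), = step.items()
        samples = OPS[name](**(params or {}))(samples)
    return samples


# The recipe is plain data, so it can be stored as YAML or JSON and versioned.
recipe = {
    "process": [
        {"text_length_filter": {"min_len": 10}},
        {"whitespace_collapse_mapper": None},
    ]
}
samples = [
    {"text": "Short"},
    {"text": "This passes the filter."},
    {"text": "Text   with   spaces"},
]
result = run_recipe(recipe, samples)
```

Because the recipe carries no code, two runs with the same recipe and the same operator versions apply the same steps in the same order, which is what makes such pipelines easy to share and fork.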

2. Full-Spectrum Data Intelligence

Foundation Models: Pre-training, fine-tuning, RL, and evaluation-grade curation
Agent Systems: Clean tool traces, structure context, de-identify sensitive data, and gate on quality
RAG & Analytics: Extraction, normalization, semantic chunking, deduplication, and data profiling

3. Production-Ready Performance

Scale: Process 70B samples in 2h on 50 Ray nodes (6400 cores)
Efficiency: Deduplicate 5TB in 2.8h using 1280 cores
Optimization: Automatic OP fusion (2-10x speedup), adaptive parallelism, CUDA acceleration, robustness
Observability: Built-in tracing for debugging, auditing, and iterative improvement
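To make the OP fusion bullet concrete, here is a toy sketch of one way fusion can save work: filters that need the same intermediate result (here, a word list) share a per-sample context, so the intermediate is computed once per sample instead of once per filter. How Data-Juicer actually fuses operators is not described in this README, so treat this as an assumption. `get_words`, `min_words`, `max_avg_word_len`, `run_unfused`, and `run_fused` are hypothetical names.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]
Filter = Callable[[Sample, dict], bool]

# Counts how often the (pretend-expensive) tokenization step actually runs.
stats = {"tokenize_calls": 0}


def get_words(sample: Sample, context: dict) -> List[str]:
    """Tokenize once per context and reuse the cached result afterwards."""
    if "words" not in context:
        stats["tokenize_calls"] += 1
        context["words"] = sample["text"].split()
    return context["words"]


def min_words(n: int) -> Filter:
    return lambda sample, ctx: len(get_words(sample, ctx)) >= n


def max_avg_word_len(limit: float) -> Filter:
    def check(sample: Sample, ctx: dict) -> bool:
        words = get_words(sample, ctx)
        return bool(words) and sum(map(len, words)) / len(words) <= limit
    return check


def run_unfused(samples: List[Sample], filters: List[Filter]) -> List[Sample]:
    # Each filter is a separate pass with a fresh context, so nothing is shared.
    for f in filters:
        samples = [s for s in samples if f(s, {})]
    return samples


def run_fused(samples: List[Sample], filters: List[Filter]) -> List[Sample]:
    # One pass: all filters see the same per-sample context.
    kept = []
    for s in samples:
        ctx: dict = {}
        if all(f(s, ctx) for f in filters):
            kept.append(s)
    return kept


samples = [
    {"text": "one two three four"},
    {"text": "hi"},
    {"text": "supercalifragilistic words everywhere here"},
]
filters = [min_words(3), max_avg_word_len(6)]

stats["tokenize_calls"] = 0
unfused = run_unfused(samples, filters)
unfused_calls = stats["tokenize_calls"]

stats["tokenize_calls"] = 0
fused = run_fused(samples, filters)
fused_calls = stats["tokenize_calls"]
```

Both variants keep the same samples; the fused pass simply tokenizes each sample at most once. The speedup figures quoted above come from the README, not from this sketch.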

⭐ If Data-Juicer saved you time or improved your data work, please consider starring the repo. It helps more people discover the project and keeps you notified of new releases and features.

📰 News

[2026-06-26] Release v1.5.3: VLA Ops Enhancements; Ray Repartition Pipeline; Scalability & Robustness

🤖 VLA Ops Enhancements — Expanded embodied-AI processing with 10+ new/renamed VLA operators (camera calibration via DeepCalib/DroidCalib/MoGe, atomic action segmentation, hand action computation & motion smoothing, clip reassembly, trajectory overlay, LeRobot export) and a complete VLA pipeline demo.
🔄 Ray Repartition Pipeline — New ray_repartition_pipeline for dataset-level block repartitioning in Ray mode.
⚡ Scalable Ray Data Reads — Wired override_num_blocks through the full call chain for controlling block parallelism on PB-scale datasets.
🧪 Test Coverage Expansion — Added 409 new test cases across 18 test files.
🐳 Stability & Robustness Fixes — JSONStreamDatasource schema unification, OP env version resolution, FUSE-safe rmtree for PartitionedRayExecutor, deprecated model name updates, and num_proc handling fixes.

[2026-05-29] Release v1.5.2: Semantic LLM OPs, Cross-doc Line Dedup & Leaner Dependencies

🧹 New Deduplicator — Added DocumentLineDeduplicator for cross-document line-level dedup, removing boilerplate lines (templates, copyright notices, navigation bars) by global document frequency.
🤖 Agent Data Quality Toolkit — Shipped interaction-quality OPs & recipe, a bad-case HTML report, and more robust JSONL / HuggingFace meta loading.
📦 Leaner & Faster Install — Slimmed the default dependency set (Ray, audio, spaCy, av, etc. moved to on-demand extras) to speed up installation.
🐳 Stability & Robustness Fixes — Library-safe error handling (raise over exit(1)), Ray init/temp-dir fixes, valid API params (drop invalid max_new_tokens), PyArrow 20+ batch JSON reading, local-path aesthetics model support, and more performance/bug fixes.
🧠 Semantic LLM Operators — Introduced llm_extract_mapper, llm_condition_filter, and llm_structured_ops with unified llm_* naming and configurable inference strategies (join/agg/top-k planned).
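As an illustration of cross-document line-level dedup by document frequency (the idea behind the new deduplicator above), the sketch below counts in how many documents each line occurs and drops lines that occur in too many of them. It is not the `DocumentLineDeduplicator` implementation; the function name `drop_boilerplate_lines` and the `max_doc_fraction` parameter are assumptions made for this example.

```python
from collections import Counter
from typing import List


def drop_boilerplate_lines(docs: List[str], max_doc_fraction: float = 0.5) -> List[str]:
    """Remove lines that appear in more than `max_doc_fraction` of all documents."""
    # Document frequency: count each distinct non-empty line once per document.
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update({line.strip() for line in doc.splitlines() if line.strip()})

    limit = max_doc_fraction * len(docs)
    cleaned = []
    for doc in docs:
        kept = [
            line for line in doc.splitlines()
            # Keep blank lines and any line that is rare enough across documents.
            if not line.strip() or doc_freq[line.strip()] <= limit
        ]
        cleaned.append("\n".join(kept))
    return cleaned


docs = [
    "Home | About | Contact\nApples are red.\nCopyright 2024 Example Corp",
    "Home | About | Contact\nBananas are yellow.\nCopyright 2024 Example Corp",
    "Home | About | Contact\nCherries are small.",
]
cleaned = drop_boilerplate_lines(docs, max_doc_fraction=0.5)
```

Counting each line once per document (rather than once per occurrence) is what separates this from ordinary within-document line dedup: a line repeated inside a single document is not treated as boilerplate.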

[2026-03-17] Release v1.5.1: LaTeX OPs; Compressed Format Support; Operator Robustness Fixes

📄 Two new LaTeX-focused mapper OPs shipped, extending data-juicer's document processing capabilities to handle .tex archives and figure contexts.
🗜️ Compressed dataset format support: json[l].gz files can now be loaded directly, and Ray datasets gain proper support for reading compressed JSON files.
📚 New documentation added covering cache, export, and tracing workflows to help users better understand and debug data processing pipelines.
🤖 Major refactor and upgrade of data-juicer-agents completed: The project architecture and CLI/session capabilities were comprehensively redesigned for better maintainability and extensibility. See data-juicer-agents for more details.
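For readers unfamiliar with the compressed format mentioned in this release, a `.jsonl.gz` file is simply gzip-compressed text with one JSON object per line. The sketch below writes and reads such a file with the Python standard library only; it does not use Data-Juicer's loader, and `read_jsonl_gz` is a hypothetical helper name.

```python
import gzip
import json
import os
import tempfile
from typing import Iterator


def read_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


# Write a tiny demo file into a temporary directory, then read it back.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as fh:
    for text in ["first sample", "second sample"]:
        fh.write(json.dumps({"text": text}) + "\n")

records = list(read_jsonl_gz(path))
```

Reading line by line keeps memory use flat, since the file is decompressed as a stream rather than loaded whole.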

[2026-02-12] Release v1.5.0: Partitioned Ray Executor, OP-level Env Management, and More Embodied-AI OPs

🚀 Enhanced Distributed Execution Framework -- Introduced partitioned Ray executor and OP-level isolated environments to improve fault tolerance, scalability, and dependency conflict resolution.
🤖 Expanded Embodied AI Video Processing -- Added specialized operators for camera calibration, video undistortion, hand reconstruction, and pose estimation to strengthen multi-view video handling.
💪🏻 System Performance & Developer Experience Optimizations -- Enabled batch inference, memory/log reduction, core logic refactoring, and updated documentation/templates.
🐳 Critical Bug Fixes & Stability Improvements -- Resolved duplicate tracking, parameter conflicts, homepage rendering issues, and outdated docs for higher reliability.

[2026-02-02] Release v1.4.6: Copilot, Video Bytes I/O & Ray Tracing

🤖 Q&A Copilot -- Now live on our Doc Site | DingTalk | Discord. Feel free to ask anything related to the Data-Juicer ecosystem!
- Check 🤖 Data-Juicer Agents | 📃 Deploy-ready codes | 🎬 More demos for more details.
🎬 Video Bytes I/O — Direct bytes processing for video pipelines
🫆 Ray Mode Tracer — Track changed samples in distributed processing
🐳 Enhancements & fixes — refreshed Docker image, small perf boosts, GitHub Insights traffic workflow, Ray compatibility updates, and bug/doc fixes.

[2026-01-15] Release v1.4.5: 20+ New OPs, Ray vLLM Pipelines & Sphinx Docs Upgrade

Embodied-AI OPs: added/enhanced mappers for video captioning (VLM), video object segmentation (YOLOE+SAM2), video depth estimation (viz + point cloud), human pose (MMPose), image tagging (VLM), single-image 3D body mesh recovery (SAM 3D Body), plus S3 upload/download.
New Pipeline OP: compose multiple OPs into one pipeline; introduced Ray + vLLM pipelines for LLM/VLM inference.
Docs upgrade: moved to a unified Sphinx-based documentation build/deploy workflow with isolated theme/architecture repo.
Enhancements & fixes: dependency updates, improved Ray deduplication and S3 loading, OpenAI Responses API support, tracer consistency, Docker base updated to CUDA 12.6.3 + Ubuntu 24.04 + Py3.11, and multiple bug fixes.

[2025-12-01] Release v1.4.4: NeurIPS’25 Spotlight, 6 New Video/MM OPs & S3 I/O

NeurIPS'25 Spotlight for Data-Juicer 2.0
Repo split: sandbox/recipes/agents moved to standalone repos
S3 I/O added to loader/exporter
6 new video & multimodal OPs (character detection, VGGT, whole-body pose, hand reconstruction) + docs/Ray/video I/O improvements and bug fixes

View All Release and News Archive

🔌 Users & Ecosystems

The list below focuses on developer-facing integrations and usage, in alphabetical order.
Is your project or name missing? Feel free to open a PR or reach out.

Data-Juicer plugs into your existing stack and evolves with community contributions:

Extensions

data-juicer-agents — DJ Copilot and agentic workflows
data-juicer-hub — Community recipes and best practices
data-juicer-sandbox — Data-model co-development with feedback loops

Frameworks & Platforms

AgentScope · Apache Arrow · Apache HDFS · Apache Hudi · Apache Iceberg · Apache Paimon · Alibaba PAI · Delta Lake · DiffSynth-Studio · EasyAnimate · Eval-Scope · Huawei Ascend · Hugging Face · LanceDB · LLaMA-Factory · ModelScope · ModelScope Swift · NVIDIA NeMo · Ray · RM-Gallery · Trinity-RFT · Volcano Engine

Industry

Alibaba Group, Ant Group, BYD Auto, ByteDance, DTSTACK, JD.com, NVIDIA, OPPO, Xiaohongshu, Xiaomi, Ximalaya, and more.

Academia

CAS, Nanjing University, Peking University, RUC, Tsinghua University, UCAS, Zhejiang University, and more.

Contributing & Community

We believe in building together. Whether you're fixing a typo, crafting a new operator, or sharing a breakthrough recipe, every contribution shapes the future of data processing.

We welcome contributions at all levels:
- Good First Issues: Add operators, improve docs, report issues, or fix bugs
- Developer Guide: Optimize engines, add features, or enhance core infrastructure
- DJ-Hub: Share knowledge: recipes, papers, and best practices
- Connect: [Slack](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3

Core symbols most depended-on inside this repo

get · called by 1187 · data_juicer/utils/registry.py
append · called by 1176 · tools/mm_eval/inception_metrics/video_metrics/metric_utils.py
from_list · called by 623 · data_juicer/core/data/dj_dataset.py
get · called by 338 · data_juicer/core/data/dj_dataset.py
map · called by 274 · data_juicer/core/data/dj_dataset.py
split · called by 241 · demos/tool_dataset_splitting_by_language/app.py
to_list · called by 210 · data_juicer/core/data/dj_dataset.py
write · called by 157 · tools/mm_eval/inception_metrics/util.py

Shape

Method 4,807

Function 1,229

Class 980

Route 80

Languages

Python 100%

Modules by API surface

demos/agent/scripts/generate_bad_case_report.py · 134 symbols
data_juicer/ops/common/hawor_func.py · 113 symbols
tests/ops/test_op_env.py · 79 symbols
tests/config/test_config_functions.py · 79 symbols
data_juicer/ops/base_op.py · 66 symbols
data_juicer/utils/model_utils.py · 63 symbols
tests/ops/test_mixins.py · 61 symbols
tests/core/executor/test_pipeline_dag.py · 61 symbols
data_juicer/ops/common/prompt2prompt_pipeline.py · 61 symbols
tests/format/test_formatter.py · 58 symbols
tests/utils/test_file_utils.py · 57 symbols
tests/utils/test_ckpt_utils.py · 57 symbols

Dependencies from manifests, versioned

datasets 4.7.0 · 1×
fsspec 2023.5.0 · 1×
loguru · 1×
pandas · 1×
tqdm · 1×

Datastores touched

(mysql) Database · 1 repo

For agents

$ claude mcp add data-juicer \
  -- python -m otcore.mcp_server <graph>
