Data-Juicer: The Data Operating System for the Foundation Model Era


Multimodal | Cloud-Native | AI-Ready | Large-Scale
Data-Juicer (DJ) transforms raw data chaos into AI-ready intelligence. It treats data processing as composable infrastructure—providing modular building blocks to clean, synthesize, and analyze data across the entire AI lifecycle, unlocking latent value in every byte.
Whether you're deduplicating web-scale pre-training corpora, curating agent interaction traces, or preparing domain-specific RAG indices, DJ scales seamlessly from your laptop to thousand-node clusters—no glue code required.
Alibaba Cloud PAI has deeply integrated Data-Juicer into its data processing products. See Quickly submit a DataJuicer job.
🚀 Quick Start
Zero-install exploration:
- JupyterLab Playground with Tutorials
- Ask DJ Copilot
Install & run:
uv pip install py-data-juicer
dj-process --config demos/process_simple/process.yaml
Or compose in Python:
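from data_juicer.core.data import NestedDataset
from data_juicer.ops.filter import TextLengthFilter
from data_juicer.ops.mapper import WhitespaceNormalizationMapper

# Build a tiny in-memory dataset with a single "text" field.
ds = NestedDataset.from_dict({
    "text": ["Short", "This passes the filter.", "Text with spaces"]
})

# Apply the operators in order: drop texts shorter than 10 characters,
# then normalize whitespace in the remaining samples.
res_ds = ds.process([
    TextLengthFilter(min_len=10),
    WhitespaceNormalizationMapper()
])

# Print each surviving sample.
for s in res_ds:
    print(s)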
✨ Why Data-Juicer?
1. Modular & Extensible Architecture
- 200+ operators spanning text, image, audio, video, and multimodal data
- Recipe-first: Reproducible YAML pipelines you can version, share, and fork like code
- Composable: Drop in a single operator, chain complex workflows, or orchestrate full pipelines
- Hot-reload: Iterate on operators without pipeline restarts
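To make the recipe-first idea concrete, here is a minimal sketch that writes a two-operator recipe to disk and prints the matching dj-process command. The operator names mirror the Python example in the Quick Start; the project name, dataset path, export path, and np value are placeholders chosen for illustration, not values taken from this repository.

```python
import tempfile
from pathlib import Path

# A small recipe as YAML text. All paths and the project name are
# hypothetical placeholders; adjust them to your own data.
RECIPE = """\
project_name: quickstart-recipe
dataset_path: ./data/input.jsonl      # hypothetical input path
export_path: ./outputs/output.jsonl   # hypothetical output path
np: 2                                 # number of worker processes
process:
  - text_length_filter:
      min_len: 10
  - whitespace_normalization_mapper:
"""


def write_recipe(directory):
    """Write the recipe into `directory` and return the file path."""
    path = Path(directory) / "quickstart.yaml"
    path.write_text(RECIPE, encoding="utf-8")
    return path


# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    recipe_path = write_recipe(tmp)
    written = recipe_path.read_text(encoding="utf-8")
    print(f"dj-process --config {recipe_path}")
```

Because the recipe is plain text, it can be committed, diffed, and reviewed like any other source file.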
2. Full-Spectrum Data Intelligence
- Foundation Models: Pre-training, fine-tuning, RL, and evaluation-grade curation
- Agent Systems: Tool-trace cleaning, context structuring, de-identification, and quality gating
- RAG & Analytics: Extraction, normalization, semantic chunking, deduplication, and data profiling
3. Production-Ready Performance
- Scale: Process 70B samples in 2h on 50 Ray nodes (6400 cores)
- Efficiency: Deduplicate 5TB in 2.8h using 1280 cores
- Optimization: Automatic OP fusion (2-10x speedup), adaptive parallelism, CUDA acceleration, robustness
- Observability: Built-in tracing for debugging, auditing, and iterative improvement
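The idea behind operator fusion can be illustrated with a small, library-independent sketch. It assumes two hypothetical filters that both need the same intermediate result (a word list) and compares running them one after another with running them in a single fused pass. This is a generic illustration of the technique, not Data-Juicer's actual fusion logic; every class and function name below is made up for the example.

```python
class CountingTokenizer:
    """Whitespace tokenizer that counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, text):
        self.calls += 1
        return text.lower().split()


class WordCountFilter:
    """Keep samples with at least `min_words` words (hypothetical OP)."""

    def __init__(self, min_words):
        self.min_words = min_words

    def keep(self, words):
        return len(words) >= self.min_words


class WordRepetitionFilter:
    """Keep samples whose share of repeated words is small (hypothetical OP)."""

    def __init__(self, max_ratio):
        self.max_ratio = max_ratio

    def keep(self, words):
        if not words:
            return True  # nothing to repeat in an empty sample
        ratio = 1 - len(set(words)) / len(words)
        return ratio <= self.max_ratio


def run_unfused(samples, filters, tokenizer):
    """Each filter makes its own pass and re-tokenizes every sample it sees."""
    kept = samples
    for f in filters:
        kept = [s for s in kept if f.keep(tokenizer(s))]
    return kept


def run_fused(samples, filters, tokenizer):
    """Tokenize each sample once and share the words across all filters."""
    kept = []
    for s in samples:
        words = tokenizer(s)
        if all(f.keep(words) for f in filters):
            kept.append(s)
    return kept


samples = [
    "spam spam spam spam",
    "too short",
    "a clean sentence with enough distinct words",
]
filters = [WordCountFilter(min_words=3), WordRepetitionFilter(max_ratio=0.5)]

unfused_tok, fused_tok = CountingTokenizer(), CountingTokenizer()
unfused = run_unfused(samples, filters, unfused_tok)
fused = run_fused(samples, filters, fused_tok)

print(unfused == fused)                    # same surviving samples
print(unfused_tok.calls, fused_tok.calls)  # tokenizer calls per strategy
```

Both strategies keep the same samples, but the fused pass calls the tokenizer once per sample. The saving grows with the number of operators that share the intermediate result and with how expensive that result is to compute.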
⭐ If Data-Juicer saved you time or improved your data work, please consider starring the repo. It helps more people discover the project and keeps you notified of new releases and features.
📰 News
[2026-06-26] Release v1.5.3: VLA Ops Enhancements; Ray Repartition Pipeline; Scalability & Robustness
- 🤖 VLA Ops Enhancements — Expanded embodied-AI processing with 10+ new/renamed VLA operators (camera calibration via DeepCalib/DroidCalib/MoGe, atomic action segmentation, hand action computation & motion smoothing, clip reassembly, trajectory overlay, LeRobot export) and a complete VLA pipeline demo.
- 🔄 Ray Repartition Pipeline — New ray_repartition_pipeline for dataset-level block repartitioning in Ray mode.
- ⚡ Scalable Ray Data Reads — Wired override_num_blocks through the full call chain for controlling block parallelism on PB-scale datasets.
- 🧪 Test Coverage Expansion — Added 409 new test cases across 18 test files.
- 🐳 Stability & Robustness Fixes — JSONStreamDatasource schema unification, OP env version resolution, FUSE-safe rmtree for PartitionedRayExecutor, deprecated model name updates, and num_proc handling fixes.
[2026-05-29] Release v1.5.2: Semantic LLM OPs, Cross-doc Line Dedup & Leaner Dependencies
- 🧹 New Deduplicator — Added DocumentLineDeduplicator for cross-document line-level dedup, removing boilerplate lines (templates, copyright notices, navigation bars) by global document frequency.
- 🤖 Agent Data Quality Toolkit — Shipped interaction-quality OPs & recipe, a bad-case HTML report, and more robust JSONL / HuggingFace meta loading.
- 📦 Leaner & Faster Install — Slimmed the default dependency set (Ray, audio, spaCy, av, etc. moved to on-demand extras) to speed up installation.
- 🐳 Stability & Robustness Fixes — Library-safe error handling (raise over exit(1)), Ray init/temp-dir fixes, valid API params (drop invalid max_new_tokens), PyArrow 20+ batch JSON reading, local-path aesthetics model support, and more performance/bug fixes.
- 🧠 Semantic LLM Operators — Introduced llm_extract_mapper, llm_condition_filter, and llm_structured_ops with unified llm_* naming and configurable inference strategies (join/agg/top-k planned).
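The cross-document line dedup mentioned in the v1.5.2 notes removes lines by global document frequency. The sketch below shows that general technique in plain Python: count how many documents contain each line, then drop lines that appear in too large a share of them. It is an illustration only and does not reproduce DocumentLineDeduplicator; the function names and the max_doc_ratio parameter are assumptions made for this example.

```python
from collections import Counter


def line_document_frequency(docs):
    """Count, for each non-empty stripped line, how many documents contain it."""
    df = Counter()
    for doc in docs:
        # A set ensures a line repeated inside one document counts only once.
        df.update({ln.strip() for ln in doc.splitlines() if ln.strip()})
    return df


def drop_boilerplate_lines(docs, max_doc_ratio=0.5):
    """Remove lines that occur in more than `max_doc_ratio` of all documents."""
    df = line_document_frequency(docs)
    limit = max_doc_ratio * len(docs)
    cleaned = []
    for doc in docs:
        kept = [
            ln for ln in doc.splitlines()
            if not ln.strip() or df[ln.strip()] <= limit  # keep blank lines
        ]
        cleaned.append("\n".join(kept))
    return cleaned


docs = [
    "Home | About | Contact\nRay scales Python workloads.\n(c) Example Corp",
    "Home | About | Contact\nParquet is a columnar format.\n(c) Example Corp",
    "Home | About | Contact\nJSON Lines stores one record per line.",
]

cleaned = drop_boilerplate_lines(docs, max_doc_ratio=0.5)
for doc in cleaned:
    print(doc)
```

In this toy corpus the navigation bar appears in all three documents and the copyright line in two of them, so both exceed the 50% threshold and are removed, while each document's unique sentence is kept.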
[2026-03-17] Release v1.5.1: LaTeX OPs; Compressed Format Support; Operator Robustness Fixes
- 📄 Two new LaTeX-focused mapper OPs shipped, extending data-juicer's document processing capabilities to handle .tex archives and figure contexts.
- 🗜️ Compressed dataset format support: json[l].gz files can now be loaded directly, and Ray datasets gain proper support for reading compressed JSON files.
- 📚 New documentation added covering cache, export, and tracing workflows to help users better understand and debug data processing pipelines.
- 🤖 Major refactor and upgrade of data-juicer-agents completed: The project architecture and CLI/session capabilities were comprehensively redesigned for better maintainability and extensibility. See data-juicer-agents for more details.
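For readers unfamiliar with the compressed format mentioned in the v1.5.1 notes, a .jsonl.gz file is a gzip-compressed text file with one JSON object per line. The sketch below writes and reads such a file using only the Python standard library. It shows the file format itself, not Data-Juicer's loader; the helper names and the sample records are made up for the example.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_jsonl_gz(path, records):
    """Write one JSON object per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl_gz(path):
    """Yield the JSON objects stored in a .jsonl.gz file, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


records = [
    {"text": "This passes the filter."},
    {"text": "Text with spaces"},
]

# Round-trip through a temporary file so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "demo.jsonl.gz"
    write_jsonl_gz(gz_path, records)
    loaded = list(read_jsonl_gz(gz_path))

print(loaded == records)
```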
[2026-02-12] Release v1.5.0: Partitioned Ray Executor, OP-level Env Management, and More Embodied-AI OPs
- 🚀 Enhanced Distributed Execution Framework -- Introduced partitioned Ray executor and OP-level isolated environments to improve fault tolerance, scalability, and dependency conflict resolution.
- 🤖 Expanded Embodied AI Video Processing -- Added specialized operators for camera calibration, video undistortion, hand reconstruction, and pose estimation to strengthen multi-view video handling.
- 💪🏻 System Performance & Developer Experience Optimizations -- Enabled batch inference, memory/log reduction, core logic refactoring, and updated documentation/templates.
- 🐳 Critical Bug Fixes & Stability Improvements -- Resolved duplicate tracking, parameter conflicts, homepage rendering issues, and outdated docs for higher reliability.
[2026-02-02] Release v1.4.6: Copilot, Video Bytes I/O & Ray Tracing
- 🤖 Q&A Copilot — Now live on our Doc Site | DingTalk | Discord. Feel free to ask anything related to the Data-Juicer ecosystem!
- 🎬 Video Bytes I/O — Direct bytes processing for video pipelines
- Ray Mode Tracer — Track changed samples in distributed processing
- 🐳 Enhancements & fixes — refreshed Docker image, small perf boosts, GitHub Insights traffic workflow, Ray compatibility updates, and bug/doc fixes.
[2026-01-15] Release v1.4.5: 20+ New OPs, Ray vLLM Pipelines & Sphinx Docs Upgrade
- Embodied-AI OPs: added/enhanced mappers for video captioning (VLM), video object segmentation (YOLOE+SAM2), video depth estimation (viz + point cloud), human pose (MMPose), image tagging (VLM), single-image 3D body mesh recovery (SAM 3D Body), plus S3 upload/download.
- New Pipeline OP: compose multiple OPs into one pipeline; introduced Ray + vLLM pipelines for LLM/VLM inference.
- Docs upgrade: moved to a unified Sphinx-based documentation build/deploy workflow with isolated theme/architecture repo.
- Enhancements & fixes: dependency updates, improved Ray deduplication and S3 loading, OpenAI Responses API support, tracer consistency, Docker base updated to CUDA 12.6.3 + Ubuntu 24.04 + Py3.11, and multiple bug fixes.
[2025-12-01] Release v1.4.4: NeurIPS’25 Spotlight, 6 New Video/MM OPs & S3 I/O
- NeurIPS'25 Spotlight for Data-Juicer 2.0
- Repo split: sandbox/recipes/agents moved to standalone repos
- S3 I/O added to loader/exporter
- 6 new video & multimodal OPs (character detection, VGGT, whole-body pose, hand reconstruction) + docs/Ray/video I/O improvements and bug fixes
View All Releases and News Archive
🔌 Users & Ecosystems
The list below focuses on developer-facing integrations and usage, in alphabetical order.
Missing your project / name? Feel free to open a PR or reach out.
Data-Juicer plugs into your existing stack and evolves with community contributions:
Extensions
Frameworks & Platforms
AgentScope · Apache Arrow · Apache HDFS · Apache Hudi · Apache Iceberg · Apache Paimon · Alibaba PAI · Delta Lake · DiffSynth-Studio · EasyAnimate · Eval-Scope · Huawei Ascend · Hugging Face · LanceDB · LLaMA-Factory · ModelScope · ModelScope Swift · NVIDIA NeMo · Ray · RM-Gallery · Trinity-RFT · Volcano Engine
Industry
Alibaba Group, Ant Group, BYD Auto, ByteDance, DTSTACK, JD.com, NVIDIA, OPPO, Xiaohongshu, Xiaomi, Ximalaya, and more.
Academia
CAS, Nanjing University, Peking University, RUC, Tsinghua University, UCAS, Zhejiang University, and more.
Contributing & Community
We believe in building together. Whether you're fixing a typo, crafting a new operator, or sharing a breakthrough recipe, every contribution shapes the future of data processing.
We welcome contributions at all levels:
- Good First Issues — Add operators, improve docs, report issues, or fix bugs
- Developer Guide — Optimize engines, add features, or enhance core infrastructure
- DJ-Hub — Share knowledge: recipes, papers, and best practices
- Connect: [Slack](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3