github.com/datajuicer/data-juicer @v1.5.3 sqlite

7,096 symbols · 32,413 edges · 806 files · 2,848 documented (40%)
README

Data-Juicer: The Data Operating System for the Foundation Model Era

Multimodal | Cloud-Native | AI-Ready | Large-Scale

Data-Juicer (DJ) transforms raw data chaos into AI-ready intelligence. It treats data processing as composable infrastructure—providing modular building blocks to clean, synthesize, and analyze data across the entire AI lifecycle, unlocking latent value in every byte.

Whether you're deduplicating web-scale pre-training corpora, curating agent interaction traces, or preparing domain-specific RAG indices, DJ scales seamlessly from your laptop to thousand-node clusters—no glue code required.

Alibaba Cloud PAI has deeply integrated Data-Juicer into its data processing products. See Quickly submit a DataJuicer job.


🚀 Quick Start

Zero-install exploration:

  • JupyterLab Playground with Tutorials
  • Ask DJ Copilot

Install & run:

uv pip install py-data-juicer
dj-process --config demos/process_simple/process.yaml

Or compose in Python:

from data_juicer.core.data import NestedDataset
from data_juicer.ops.filter import TextLengthFilter
from data_juicer.ops.mapper import WhitespaceNormalizationMapper

ds = NestedDataset.from_dict({
    "text": ["Short", "This passes the filter.", "Text   with   spaces"]
})
res_ds = ds.process([
    TextLengthFilter(min_len=10),
    WhitespaceNormalizationMapper()
])

for s in res_ds:
    print(s)

✨ Why Data-Juicer?

1. Modular & Extensible Architecture

  • 200+ operators spanning text, image, audio, video, and multimodal data
  • Recipe-first: Reproducible YAML pipelines you can version, share, and fork like code
  • Composable: Drop in a single operator, chain complex workflows, or orchestrate full pipelines
  • Hot-reload: Iterate on operators without pipeline restarts
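The composable idea above can be illustrated with a minimal, standard-library sketch. It is not Data-Juicer's implementation: `min_length_filter`, `collapse_spaces_mapper`, and `run_pipeline` are hypothetical names invented for this example, and the whitespace behavior shown here is an assumption rather than what any Data-Juicer operator does.

```python
from typing import Callable, Iterable, List, Tuple

Sample = dict  # one record, e.g. {"text": "..."}


def min_length_filter(min_len: int) -> Callable[[Sample], bool]:
    """Hypothetical filter: keep samples whose text has at least min_len characters."""
    return lambda sample: len(sample["text"]) >= min_len


def collapse_spaces_mapper(sample: Sample) -> Sample:
    """Hypothetical mapper: return a copy with runs of whitespace collapsed to one space."""
    return {**sample, "text": " ".join(sample["text"].split())}


def run_pipeline(samples: Iterable[Sample],
                 steps: List[Tuple[str, Callable]]) -> List[Sample]:
    """Apply ("filter" | "map", fn) steps in order and return the surviving samples."""
    data = list(samples)
    for kind, fn in steps:
        if kind == "filter":
            data = [s for s in data if fn(s)]
        elif kind == "map":
            data = [fn(s) for s in data]
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return data


samples = [
    {"text": "Short"},
    {"text": "This passes the filter."},
    {"text": "Text   with   spaces"},
]
result = run_pipeline(samples, [
    ("filter", min_length_filter(10)),
    ("map", collapse_spaces_mapper),
])
for s in result:
    print(s)
```

Because every step shares one small interface, a single operator can be dropped in, reordered, or removed without touching the others.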

2. Full-Spectrum Data Intelligence

  • Foundation Models: Pre-training, fine-tuning, RL, and evaluation-grade curation
  • Agent Systems: Tool-trace cleaning, context structuring, de-identification, and quality gating
  • RAG & Analytics: Extraction, normalization, semantic chunking, deduplication, and data profiling

3. Production-Ready Performance

  • Scale: Process 70B samples in 2h on 50 Ray nodes (6400 cores)
  • Efficiency: Deduplicate 5TB in 2.8h using 1280 cores
  • Optimization: Automatic OP fusion (2-10x speedup), adaptive parallelism, CUDA acceleration, robustness
  • Observability: Built-in tracing for debugging, auditing, and iterative improvement
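As a rough intuition for operator fusion, the sketch below merges several filters into one predicate so the data is traversed once instead of once per filter. This is one generic way to fuse, shown under the assumption that the filters are independent predicates. It does not describe how Data-Juicer's automatic OP fusion is implemented, and all function names are hypothetical.

```python
from typing import Callable, Iterable, List

Predicate = Callable[[str], bool]


def run_unfused(samples: Iterable[str], predicates: List[Predicate]) -> List[str]:
    """Baseline: one full pass over the data per filter."""
    data = list(samples)
    for keep in predicates:
        data = [s for s in data if keep(s)]
    return data


def fuse_filters(predicates: List[Predicate]) -> Predicate:
    """Combine predicates into one; all() short-circuits on the first failure."""
    preds = tuple(predicates)
    return lambda sample: all(keep(sample) for keep in preds)


def run_fused(samples: Iterable[str], predicates: List[Predicate]) -> List[str]:
    """Fused: a single pass that evaluates every filter per sample."""
    keep = fuse_filters(predicates)
    return [s for s in samples if keep(s)]


predicates = [
    lambda s: len(s) >= 5,                      # long enough
    lambda s: not any(c.isdigit() for c in s),  # contains no digits
]
texts = ["hello world", "hi", "abc123", "clean text"]

unfused_result = run_unfused(texts, predicates)
fused_result = run_fused(texts, predicates)
print(fused_result)
```

Both paths keep the same samples; the fused version avoids materializing an intermediate list between filters.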

⭐ If Data-Juicer saved you time or improved your data work, please consider starring the repo. It helps more people discover the project and keeps you notified of new releases and features.


📰 News

[2026-06-26] Release v1.5.3: VLA Ops Enhancements; Ray Repartition Pipeline; Scalability & Robustness

  • 🤖 VLA Ops Enhancements — Expanded embodied-AI processing with 10+ new/renamed VLA operators (camera calibration via DeepCalib/DroidCalib/MoGe, atomic action segmentation, hand action computation & motion smoothing, clip reassembly, trajectory overlay, LeRobot export) and a complete VLA pipeline demo.
  • 🔄 Ray Repartition Pipeline — New ray_repartition_pipeline for dataset-level block repartitioning in Ray mode.
  • Scalable Ray Data Reads — Wired override_num_blocks through the full call chain for controlling block parallelism on PB-scale datasets.
  • 🧪 Test Coverage Expansion — Added 409 new test cases across 18 test files.
  • 🐳 Stability & Robustness Fixes — JSONStreamDatasource schema unification, OP env version resolution, FUSE-safe rmtree for PartitionedRayExecutor, deprecated model name updates, and num_proc handling fixes.

[2026-05-29] Release v1.5.2: Semantic LLM OPs, Cross-doc Line Dedup & Leaner Dependencies

  • 🧹 New Deduplicator — Added DocumentLineDeduplicator for cross-document line-level dedup, removing boilerplate lines (templates, copyright notices, navigation bars) by global document frequency.
  • 🤖 Agent Data Quality Toolkit — Shipped interaction-quality OPs & recipe, a bad-case HTML report, and more robust JSONL / HuggingFace meta loading.
  • 📦 Leaner & Faster Install — Slimmed the default dependency set (Ray, audio, spaCy, av, etc. moved to on-demand extras) to speed up installation.
  • 🐳 Stability & Robustness Fixes — Library-safe error handling (raise over exit(1)), Ray init/temp-dir fixes, valid API params (drop invalid max_new_tokens), PyArrow 20+ batch JSON reading, local-path aesthetics model support, and more performance/bug fixes.
  • 🧠 Semantic LLM Operators — Introduced llm_extract_mapper, llm_condition_filter, and llm_structured_ops with unified llm_* naming and configurable inference strategies (join/agg/top-k planned).
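The cross-document line dedup described above (dropping boilerplate lines by global document frequency) can be sketched with the standard library. This is an illustrative approximation, not the code of `DocumentLineDeduplicator`; the function name and the `max_doc_fraction` threshold are assumptions made for this example.

```python
from collections import Counter
from typing import List


def drop_frequent_lines(docs: List[str], max_doc_fraction: float = 0.5) -> List[str]:
    """Remove lines that occur in more than max_doc_fraction of all documents."""
    doc_lines = [doc.splitlines() for doc in docs]

    # Document frequency: count each distinct non-empty line once per document.
    doc_freq: Counter = Counter()
    for lines in doc_lines:
        doc_freq.update({ln.strip() for ln in lines if ln.strip()})

    limit = max_doc_fraction * len(docs)
    return [
        "\n".join(ln for ln in lines if doc_freq[ln.strip()] <= limit)
        for lines in doc_lines
    ]


docs = [
    "Home | About\nFirst article body.\n(c) Example Corp",
    "Home | About\nSecond article body.\n(c) Example Corp",
    "Home | About\nThird article body.",
]
cleaned = drop_frequent_lines(docs, max_doc_fraction=0.5)
for doc in cleaned:
    print(repr(doc))
```

Counting per document rather than per occurrence matters here: a line repeated many times inside a single document is not treated as cross-document boilerplate.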

[2026-03-17] Release v1.5.1: LaTeX OPs; Compressed Format Support; Operator Robustness Fixes

  • 📄 Two new LaTeX-focused mapper OPs shipped, extending data-juicer's document processing capabilities to handle .tex archives and figure contexts.
  • 🗜️ Compressed dataset format support: json[l].gz files can now be loaded directly, and Ray datasets gain proper support for reading compressed JSON files.
  • 📚 New documentation added covering cache, export, and tracing workflows to help users better understand and debug data processing pipelines.
  • 🤖 Major refactor and upgrade of data-juicer-agents completed: The project architecture and CLI/session capabilities were comprehensively redesigned for better maintainability and extensibility. See data-juicer-agents for more details.
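For readers unfamiliar with the `jsonl.gz` layout mentioned above, the following standard-library sketch writes and reads a gzip-compressed JSON Lines file. It only illustrates the file format; it is not Data-Juicer's loader, and the helper names are hypothetical.

```python
import gzip
import json
import os
import tempfile
from typing import Iterable, Iterator


def write_jsonl_gz(path: str, records: Iterable[dict]) -> None:
    """Write one JSON object per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl_gz(path: str) -> Iterator[dict]:
    """Stream records back, decompressing on the fly and skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


records = [{"text": "first sample"}, {"text": "second sample"}]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "demo.jsonl.gz")
    write_jsonl_gz(sample_path, records)
    loaded = list(read_jsonl_gz(sample_path))

print(loaded)
```

Reading line by line keeps memory use proportional to a single record rather than to the whole decompressed file.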

[2026-02-12] Release v1.5.0: Partitioned Ray Executor, OP-level Env Management, and More Embodied-AI OPs

  • 🚀 Enhanced Distributed Execution Framework -- Introduced partitioned Ray executor and OP-level isolated environments to improve fault tolerance, scalability, and dependency conflict resolution.
  • 🤖 Expanded Embodied AI Video Processing -- Added specialized operators for camera calibration, video undistortion, hand reconstruction, and pose estimation to strengthen multi-view video handling.
  • 💪🏻 System Performance & Developer Experience Optimizations -- Enabled batch inference, memory/log reduction, core logic refactoring, and updated documentation/templates.
  • 🐳 Critical Bug Fixes & Stability Improvements -- Resolved duplicate tracking, parameter conflicts, homepage rendering issues, and outdated docs for higher reliability.

[2026-02-02] Release v1.4.6: Copilot, Video Bytes I/O & Ray Tracing

  • 🤖 Q&A Copilot — Now live on our Doc Site | DingTalk | Discord. Feel free to ask anything related to Data-Juicer ecosystem!
  • 🎬 Video Bytes I/O — Direct bytes processing for video pipelines
  • 🫆 Ray Mode Tracer — Track changed samples in distributed processing
  • 🐳 Enhancements & fixes — refreshed Docker image, small perf boosts, GitHub Insights traffic workflow, Ray compatibility updates, and bug/doc fixes.

[2026-01-15] Release v1.4.5: 20+ New OPs, Ray vLLM Pipelines & Sphinx Docs Upgrade

  • Embodied-AI OPs: added/enhanced mappers for video captioning (VLM), video object segmentation (YOLOE+SAM2), video depth estimation (viz + point cloud), human pose (MMPose), image tagging (VLM), single-image 3D body mesh recovery (SAM 3D Body), plus S3 upload/download.
  • New Pipeline OP: compose multiple OPs into one pipeline; introduced Ray + vLLM pipelines for LLM/VLM inference.
  • Docs upgrade: moved to a unified Sphinx-based documentation build/deploy workflow with isolated theme/architecture repo.
  • Enhancements & fixes: dependency updates, improved Ray deduplication and S3 loading, OpenAI Responses API support, tracer consistency, Docker base updated to CUDA 12.6.3 + Ubuntu 24.04 + Py3.11, and multiple bug fixes.

[2025-12-01] Release v1.4.4: NeurIPS’25 Spotlight, 6 New Video/MM OPs & S3 I/O

  • NeurIPS'25 Spotlight for Data-Juicer 2.0
  • Repo split: sandbox/recipes/agents moved to standalone repos
  • S3 I/O added to loader/exporter
  • 6 new video & multimodal OPs (character detection, VGGT, whole-body pose, hand reconstruction) + docs/Ray/video I/O improvements and bug fixes

View all releases and the news archive


🔌 Users & Ecosystems

The list below focuses on developer-facing integrations and usage, in alphabetical order.
Missing your project / name? Feel free to open a PR or reach out.

Data-Juicer plugs into your existing stack and evolves with community contributions:

Extensions

Frameworks & Platforms

AgentScope · Apache Arrow · Apache HDFS · Apache Hudi · Apache Iceberg · Apache Paimon · Alibaba PAI · Delta Lake · DiffSynth-Studio · EasyAnimate · Eval-Scope · Huawei Ascend · Hugging Face · LanceDB · LLaMA-Factory · ModelScope · ModelScope Swift · NVIDIA NeMo · Ray · RM-Gallery · Trinity-RFT · Volcano Engine

Industry

Alibaba Group, Ant Group, BYD Auto, ByteDance, DTSTACK, JD.com, NVIDIA, OPPO, Xiaohongshu, Xiaomi, Ximalaya, and more.

Academia

CAS, Nanjing University, Peking University, RUC, Tsinghua University, UCAS, Zhejiang University, and more.

Contributing & Community

We believe in building together. Whether you're fixing a typo, crafting a new operator, or sharing a breakthrough recipe, every contribution shapes the future of data processing.

We welcome contributions at all levels:

  • Good First Issues: Add operators, improve docs, report issues, or fix bugs
  • Developer Guide: Optimize engines, add features, or enhance core infrastructure
  • DJ-Hub: Share knowledge: recipes, papers, and best practices
  • Connect: [Slack](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3

Core symbols most depended-on inside this repo

get · called by 1187 · data_juicer/utils/registry.py
append · called by 1176 · tools/mm_eval/inception_metrics/video_metrics/metric_utils.py
from_list · called by 623 · data_juicer/core/data/dj_dataset.py
get · called by 338 · data_juicer/core/data/dj_dataset.py
map · called by 274 · data_juicer/core/data/dj_dataset.py
split · called by 241 · demos/tool_dataset_splitting_by_language/app.py
to_list · called by 210 · data_juicer/core/data/dj_dataset.py
write · called by 157 · tools/mm_eval/inception_metrics/util.py

Shape

Method 4,807
Function 1,229
Class 980
Route 80

Languages

Python 100%

Modules by API surface

demos/agent/scripts/generate_bad_case_report.py · 134 symbols
data_juicer/ops/common/hawor_func.py · 113 symbols
tests/ops/test_op_env.py · 79 symbols
tests/config/test_config_functions.py · 79 symbols
data_juicer/ops/base_op.py · 66 symbols
data_juicer/utils/model_utils.py · 63 symbols
tests/ops/test_mixins.py · 61 symbols
tests/core/executor/test_pipeline_dag.py · 61 symbols
data_juicer/ops/common/prompt2prompt_pipeline.py · 61 symbols
tests/format/test_formatter.py · 58 symbols
tests/utils/test_file_utils.py · 57 symbols
tests/utils/test_ckpt_utils.py · 57 symbols

Dependencies from manifests, versioned

datasets 4.7.0 · 1×
fsspec 2023.5.0 · 1×
tqdm

Datastores touched

(mysql) Database · 1 repo

For agents

$ claude mcp add data-juicer \
  -- python -m otcore.mcp_server <graph>
