DataStudio Documentation

Industrial-Grade Multimodal Data Processing Pipeline for MLLMs

DataStudio is the data curation engine behind the Bee project and the Honey-Data-15M dataset.

Note

DataStudio is part of the Bee project, accepted to ICLR 2026. See the paper for details.

Quick Links

📄 Paper	Read the research paper on arXiv
🏠 Project Page	Visit the Bee project homepage
🤗 Models & Data	Access Bee-8B model and Honey-Data-15M dataset
💻 GitHub	Source code and issue tracker

Key Features

🧩 Modular Architecture: 16 built-in rule operators plus MLLM operators, all sharing a unified interface for easy composition
🤖 MLLM-Powered Processing: Native integration with vision-language models for intelligent filtering and rewriting
⚡ High Performance: Multi-process, asynchronous API requests (8192+ concurrent requests via MPOpenAIAPI) with LMDB image caching
📊 Flexible Data Handling: JSON/JSONL support, multi-image samples, format preservation
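
To illustrate what "a unified interface for easy composition" can look like, here is a minimal, self-contained sketch. It is not DataStudio's actual API: the `Operator` base class, the `MinTextLength` and `NormalizeWhitespace` operators, the `run_pipeline` helper, and the sample layout (`text` and `images` keys) are all hypothetical names chosen for illustration. The assumption is that every operator takes one sample and returns either the (possibly rewritten) sample or `None` to drop it.

```python
from typing import Iterable, List, Optional


class Operator:
    """Hypothetical unified interface: return the sample to keep it, None to drop it."""

    def __call__(self, sample: dict) -> Optional[dict]:
        raise NotImplementedError


class NormalizeWhitespace(Operator):
    """Rewriting operator: collapse runs of whitespace in the text field."""

    def __call__(self, sample: dict) -> Optional[dict]:
        return {**sample, "text": " ".join(sample.get("text", "").split())}


class MinTextLength(Operator):
    """Filtering operator: drop samples whose text is shorter than min_chars."""

    def __init__(self, min_chars: int) -> None:
        self.min_chars = min_chars

    def __call__(self, sample: dict) -> Optional[dict]:
        return sample if len(sample.get("text", "")) >= self.min_chars else None


def run_pipeline(samples: Iterable[dict], operators: List[Operator]) -> List[dict]:
    """Apply the operators in order; stop at the first one that drops the sample."""
    kept = []
    for sample in samples:
        for op in operators:
            sample = op(sample)
            if sample is None:
                break
        else:  # no operator dropped the sample
            kept.append(sample)
    return kept


# Hypothetical multi-image samples; the field names are assumptions.
samples = [
    {"text": "  A   cat sits on\ta mat.  ", "images": ["cat.jpg"]},
    {"text": "ok", "images": []},
]
result = run_pipeline(samples, [NormalizeWhitespace(), MinTextLength(min_chars=5)])
print(result)
```

Because filters and rewriters share one call signature in this sketch, reordering or swapping operators only requires changing the list passed to `run_pipeline`.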

Getting Started

Install DataStudio:

git clone https://github.com/Open-Bee/DataStudio.git
cd DataStudio
pip install -r requirements.txt
pip install -e .

Try the built-in examples:

# Rule filtering example (no MLLM required)
python run.py -c configs/examples/rule_filter_only.py

# Text normalization example
python run.py -c configs/examples/text_normalization.py

Documentation Contents

User Guide

API Reference

Chinese Documentation

Community

Bug Reports: GitHub Issues
Feature Requests: GitHub Discussions
Contributing: See CONTRIBUTING.md

Citation

If you use DataStudio in your research, please cite:

@article{zhang2025bee,
  title={Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs},
  author={Zhang, Yi and Ni, Bolin and Chen, Xin-Sheng and Zhang, Heng-Rui and Rao, Yongming and Peng, Houwen and Lu, Qinglin and Hu, Han and Guo, Meng-Hao and Hu, Shi-Min},
  journal={arXiv preprint arXiv:2510.13795},
  year={2025}
}