DataStudio Documentation
Industrial-Grade Multimodal Data Processing Pipeline for MLLMs
DataStudio is the data curation engine behind the Bee project and the Honey-Data-15M dataset.
Note
DataStudio is part of the Bee project, accepted to ICLR 2026. See the paper for details.
Quick Links
- 📄 Paper: Read the research paper on arXiv
- Visit the Bee project homepage
- Access Bee-8B model and Honey-Data-15M dataset
- 💻 GitHub: Source code and issue tracker
Key Features
- 🧩 Modular Architecture
16 built-in rule operators plus MLLM operators, all behind a unified interface for easy composition
- 🤖 MLLM-Powered Processing
Native integration with vision-language models for intelligent filtering and rewriting
- ⚡ High Performance
Multi-process, asynchronous, concurrent API requests (8192+ via MPOpenAIAPI) with LMDB image caching
- 📊 Flexible Data Handling
JSON/JSONL support, multi-image samples, format preservation
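To make the "async concurrent API requests" feature more concrete, here is a minimal sketch of bounded-concurrency request dispatch using only the Python standard library. It is not DataStudio's implementation and does not use MPOpenAIAPI: `fake_request`, `run_batch`, and `max_concurrency` are hypothetical names chosen for illustration, and the network call is replaced by a short sleep. A multi-process setup would presumably run one such event loop per worker process, and the LMDB image cache is left out here.

```python
import asyncio


async def fake_request(sample_id: int) -> str:
    """Hypothetical stand-in for one API call; a real client would do network I/O here."""
    await asyncio.sleep(0.01)
    return f"response-{sample_id}"


async def run_batch(sample_ids, max_concurrency=8):
    """Send one request per sample with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0  # highest number of simultaneous requests observed

    async def bounded(sample_id):
        nonlocal in_flight, peak
        async with semaphore:  # blocks while the concurrency limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_request(sample_id)
            finally:
                in_flight -= 1

    # gather() returns results in the same order as the input samples.
    results = await asyncio.gather(*(bounded(i) for i in sample_ids))
    return results, peak


results, peak = asyncio.run(run_batch(range(32), max_concurrency=8))
```

The semaphore caps how many requests are outstanding at once, so the limit can be raised or lowered without changing the rest of the code.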
Getting Started
Install DataStudio:
git clone https://github.com/Open-Bee/DataStudio.git
cd DataStudio
pip install -r requirements.txt
pip install -e .
Try the built-in examples:
# Rule filtering example (no MLLM required)
python run.py -c configs/examples/rule_filter_only.py
# Text normalization example
python run.py -c configs/examples/text_normalization.py
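For readers new to rule-based curation, the following sketch shows the general idea behind a rule filter and a text-normalization step applied to JSONL records. It does not reproduce the contents of the example configs above or DataStudio's operator API: the function names, the `text` field, and the `min_chars` threshold are all assumptions made for illustration.

```python
import json
import unicodedata


def min_length_filter(sample, min_chars=5):
    """Hypothetical rule filter: drop samples whose text is too short."""
    return sample if len(sample["text"]) >= min_chars else None


def normalize_text(sample):
    """Hypothetical rewriter: NFKC-normalize and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", sample["text"])
    return {**sample, "text": " ".join(text.split())}


def run_pipeline(jsonl_lines, operators):
    """Apply each operator in order; an operator returning None drops the sample."""
    kept = []
    for line in jsonl_lines:
        sample = json.loads(line)
        for op in operators:
            sample = op(sample)
            if sample is None:
                break
        if sample is not None:
            kept.append(sample)
    return kept


# Two in-memory JSONL records; the "text" field name is an assumption.
lines = [
    '{"id": 1, "text": "A  cat   on a mat."}',
    '{"id": 2, "text": "ok"}',
]
kept = run_pipeline(lines, [min_length_filter, normalize_text])
```

Because every operator takes a sample and returns either a sample or `None`, filters and rewriters can be chained in any order, which is one simple way to get the kind of composition a unified operator interface allows.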
Documentation Contents
User Guide
API Reference
Community
Bug Reports: GitHub Issues
Feature Requests: GitHub Discussions
Contributing: See CONTRIBUTING.md
Citation
If you use DataStudio in your research, please cite:
@article{zhang2025bee,
title={Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs},
author={Zhang, Yi and Ni, Bolin and Chen, Xin-Sheng and Zhang, Heng-Rui and Rao, Yongming and Peng, Houwen and Lu, Qinglin and Hu, Han and Guo, Meng-Hao and Hu, Shi-Min},
journal={arXiv preprint arXiv:2510.13795},
year={2025}
}