DataStudio Documentation

DataStudio

Industrial-Grade Multimodal Data Processing Pipeline for MLLMs

DataStudio is the data curation engine behind the Bee project and the Honey-Data-15M dataset.

Note

DataStudio is part of the Bee project, whose paper was accepted to ICLR 2026. See the paper for details.

Key Features

🧩 Modular Architecture

16 built-in rule operators plus MLLM operators, all sharing a unified interface for easy composition

🤖 MLLM-Powered Processing

Native integration with vision-language models for intelligent filtering and rewriting

⚡ High Performance

Multi-process, asynchronous API requests (8192+ concurrent requests via MPOpenAIAPI) with LMDB image caching

📊 Flexible Data Handling

JSON/JSONL support, multi-image samples, format preservation
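
To make the rule-operator and JSON/JSONL ideas above concrete, here is a minimal standalone sketch in plain Python. It does not use DataStudio's API: the helper names (`load_samples`, `max_images`, `min_text_length`, `apply_rules`) and the sample fields (`text`, `images`) are hypothetical and chosen only for illustration. It assumes a rule is a function that takes one sample and returns whether to keep it, so rules can be composed as a list.

```python
import json
from typing import Callable

# Hypothetical types for illustration; not DataStudio's own classes.
Sample = dict
Rule = Callable[[Sample], bool]


def load_samples(text: str) -> list[Sample]:
    """Parse either a JSON array or JSONL (one JSON object per line)."""
    stripped = text.strip()
    if stripped.startswith("["):
        return json.loads(stripped)
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


def max_images(limit: int) -> Rule:
    """Keep samples that reference at most `limit` images."""
    return lambda sample: len(sample.get("images", [])) <= limit


def min_text_length(n: int) -> Rule:
    """Keep samples whose text has at least `n` characters."""
    return lambda sample: len(sample.get("text", "")) >= n


def apply_rules(samples: list[Sample], rules: list[Rule]) -> list[Sample]:
    """Keep only the samples that pass every rule."""
    return [s for s in samples if all(rule(s) for rule in rules)]


# Assumed sample layout: an id, a text field, and a list of image paths.
JSONL = "\n".join([
    '{"id": 1, "text": "Describe the chart.", "images": ["a.png"]}',
    '{"id": 2, "text": "Hi", "images": ["b.png", "c.png", "d.png"]}',
    '{"id": 3, "text": "Compare both photos.", "images": ["e.png", "f.png"]}',
])

kept = apply_rules(load_samples(JSONL), [max_images(2), min_text_length(5)])
print([s["id"] for s in kept])  # [1, 3]
```

Because every rule shares the same call signature, adding or removing a filter only changes the list passed to `apply_rules`. The actual operators and their configuration are described in the DataStudio documentation and example configs.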

Getting Started

Install DataStudio:

git clone https://github.com/Open-Bee/DataStudio.git
cd DataStudio
pip install -r requirements.txt
pip install -e .

Try the built-in examples:

# Rule filtering example (no MLLM required)
python run.py -c configs/examples/rule_filter_only.py

# Text normalization example
python run.py -c configs/examples/text_normalization.py

Citation

If you use DataStudio in your research, please cite:

@article{zhang2025bee,
  title={Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs},
  author={Zhang, Yi and Ni, Bolin and Chen, Xin-Sheng and Zhang, Heng-Rui and Rao, Yongming and Peng, Houwen and Lu, Qinglin and Hu, Han and Guo, Meng-Hao and Hu, Shi-Min},
  journal={arXiv preprint arXiv:2510.13795},
  year={2025}
}