Frequently Asked Questions

This page answers common questions about DataStudio.

General Questions

What is DataStudio?

DataStudio is an industrial-grade multimodal data processing pipeline designed for preparing training data for Multimodal Large Language Models (MLLMs). It’s the data curation engine behind the Bee project and the Honey-Data-15M dataset.

What makes DataStudio different from other data processing tools?

DataStudio is specifically designed for multimodal (image + text) data with:

  • Native MLLM integration: Use vision-language models for intelligent filtering and rewriting

  • Massive parallelism: Multi-process async concurrent API requests (8192+ via MPOpenAIAPI)

  • LMDB image caching: Blazing-fast image I/O with sharded caching

  • Config-driven pipelines: Reproducible, shareable processing workflows

Is DataStudio free to use?

Yes! DataStudio is open-source under the Apache License 2.0. You can use it freely for both research and commercial purposes.

Installation & Setup

What Python version is required?

DataStudio requires Python 3.10 or higher.

python --version  # Should be 3.10+

How do I install DataStudio?

git clone https://github.com/Open-Bee/DataStudio.git
cd DataStudio
pip install -r requirements.txt
pip install -e .

Do I need a GPU?

No GPU is required for running DataStudio itself. However:

  • If you’re using locally deployed MLLMs (e.g., vLLM, SGLang), you’ll need GPUs

  • If you’re using cloud APIs (OpenAI, etc.), no GPU is needed

How much disk space do I need?

It depends on your dataset:

  • LMDB cache: Roughly 50-100% of your original image size

  • Output data: Similar to input data size

  • Working space: ~10GB for checkpoints and logs

We recommend having at least 2x your dataset size in free disk space.

Data Format

What data formats does DataStudio support?

DataStudio supports:

  • JSON: Single file with array of samples

  • JSONL: One sample per line (recommended for large datasets)
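
JSONL is recommended for large datasets because it can be read one sample at a time instead of loading the whole file into memory. A minimal, DataStudio-independent sketch of reading and writing JSONL:

```python
import json

def read_jsonl(path):
    """Yield one sample per line, so large files never need to fit in memory."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def write_jsonl(samples, path):
    """Write one JSON object per line, newline-separated."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```
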

What should my data look like?

{
    "id": "unique_sample_id",
    "image": "path/to/image.jpg",
    "conversations": [
        {"from": "human", "value": "What is in this image?"},
        {"from": "gpt", "value": "This is a cat."}
    ]
}

Can I process multi-image samples?

Yes! Use a list for the image field:

{
    "id": "multi_image_sample",
    "image": ["image1.jpg", "image2.jpg", "image3.jpg"],
    "conversations": [
        {"from": "human", "value": "Compare these three images."},
        {"from": "gpt", "value": "The first image shows..."}
    ]
}

Can I process text-only data?

Yes, simply omit the image field or set use_image=False in your config:

dataloader = dict(
    use_image=False,
    # ...
)

Pipeline & Operators

How do I choose which operators to use?

Start with these common patterns:

  1. Basic cleaning: ConvLengthFilter + RemoveThinkRewriter

  2. Quality filtering: Add MLLMFilter with a quality prompt

  3. Content enhancement: Add MLLMRewriter for CoT enrichment

What’s the difference between Filter and Rewriter?

  • Filter: Decides whether to keep or remove a sample (binary decision)

  • Rewriter: Modifies the content of a sample (transformation)
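
The split can be illustrated with two toy operators. These minimal classes are NOT DataStudio's actual base classes (see the Development Guide for those); the class and parameter names here are made up purely to show the keep/drop vs. transform contract:

```python
# Illustrative only: hypothetical operators mimicking the Filter/Rewriter split.

class LengthFilter:
    """Filter: returns True to keep a sample, False to drop it (binary decision)."""
    def __init__(self, min_chars=10):
        self.min_chars = min_chars

    def __call__(self, sample):
        text = " ".join(turn["value"] for turn in sample["conversations"])
        return len(text) >= self.min_chars

class LowercaseRewriter:
    """Rewriter: returns a modified copy of the sample (transformation)."""
    def __call__(self, sample):
        sample = dict(sample)
        sample["conversations"] = [
            {**turn, "value": turn["value"].lower()}
            for turn in sample["conversations"]
        ]
        return sample
```
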

Can I combine multiple operators?

Yes! That’s the core design of DataStudio:

pipeline = Pipeline([
    Filter1(),
    Filter2(),
    Rewriter1(),
    Rewriter2(),
])

Operators execute in order. Filtered samples are removed before reaching subsequent operators.

How do I create custom operators?

See the Development Guide for detailed instructions.

MLLM Integration

Which MLLM providers are supported?

DataStudio supports any OpenAI-compatible API, including:

  • OpenAI (GPT-4o, GPT-4V)

  • Anthropic (Claude)

  • Local deployments (vLLM, SGLang, Ollama)

  • Cloud services (Azure OpenAI, AWS Bedrock)

How do I use a local model?

Deploy your model with vLLM or SGLang, then configure DataStudio:

model = dict(
    model="Qwen3-VL-30B-A3B-Instruct",
    api_base="http://localhost:8000/v1",
    key="not-needed",
    thread_num=512,
)

How many concurrent API calls can I make?

This depends on your API provider:

  • OpenAI: 50-200 (rate limits apply)

  • Local vLLM: 256-1024 (depends on GPU memory)

  • SGLang cluster: 2048-8192 (highest throughput)

Why are my MLLM calls slow?

Common causes:

  1. Low thread_num: Increase thread_num in model config

  2. API rate limiting: Reduce thread_num or upgrade API tier

  3. Large images: Use resize_image_size to reduce image size

  4. Network latency: Consider local model deployment
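
Causes 1 and 3 can both be addressed in the model config. This fragment extends the local-deployment example above; note that placing resize_image_size in the model config is an assumption here, so check where that option lives in your installed version:

```python
# Sketch only: resize_image_size placement is an assumption.
model = dict(
    model="Qwen3-VL-30B-A3B-Instruct",
    api_base="http://localhost:8000/v1",
    key="not-needed",
    thread_num=1024,         # raise for local deployments, lower if rate-limited
    resize_image_size=1024,  # shrink images before sending to cut latency
)
```
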

Performance

How do I speed up processing?

See the Quick Start guide for details. The key points:

  1. Pre-cache images with --cache-images

  2. Use rule-based filters before MLLM operators

  3. Increase thread_num for MLLM operations

  4. Use SSD storage for LMDB cache

What batch_size should I use?

Available Memory    Recommended batch_size
< 16 GB             1,000 - 5,000
16-64 GB            5,000 - 20,000
> 64 GB             20,000 - 100,000
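
If you want to encode the table above in a config script, a small helper works; the thresholds come from the table, and the function simply returns the low end of each range as a conservative default:

```python
def recommended_batch_size(mem_gb):
    """Map available memory (GB) to a batch_size per the table above.

    Returns the low end of each recommended range as a safe default.
    """
    if mem_gb < 16:
        return 1_000
    elif mem_gb <= 64:
        return 5_000
    else:
        return 20_000
```
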

How do I handle very large datasets (10M+ samples)?

  1. Pre-cache images before running the pipeline

  2. Use config-driven pipelines for automatic checkpointing

  3. Split processing across multiple machines if needed

  4. Monitor progress with W&B integration

Troubleshooting

My pipeline crashed. How do I resume?

Simply re-run the same command:

python run.py -c my_config.py  # Automatically resumes

DataStudio saves checkpoints after each batch.

Why is my output empty?

Common causes:

  1. Filters too aggressive: Check filter_ops in rejected samples

  2. Wrong data format: Verify input matches expected format

  3. Path issues: Check image paths are correct
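
A quick way to rule out cause 3 is to verify image paths before running the pipeline. This check uses only the sample format described above, no DataStudio APIs:

```python
import os

def find_missing_images(samples, image_root="."):
    """Return ids of samples whose image files don't exist on disk.

    Handles single-image (str), multi-image (list), and text-only samples.
    """
    missing = []
    for sample in samples:
        paths = sample.get("image")
        if paths is None:  # text-only sample, nothing to check
            continue
        if isinstance(paths, str):
            paths = [paths]
        for p in paths:
            if not os.path.exists(os.path.join(image_root, p)):
                missing.append(sample["id"])
                break
    return missing
```
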

How do I debug filtering decisions?

Examine the filtered samples:

kept, filtered = pipeline(data)

for item in filtered[:10]:
    print(f"ID: {item['id']}")
    print(f"Filtered by: {item.get('filter_ops', {})}")
    print()

See the Troubleshooting Guide for more detailed debugging steps.

Contributing

How can I contribute?

We welcome contributions! See CONTRIBUTING.md for guidelines:

  • Bug reports and feature requests

  • Documentation improvements

  • New operators

  • Performance optimizations

How do I report a bug?

Open an issue on GitHub with:

  1. DataStudio version

  2. Python version and OS

  3. Minimal reproducible example

  4. Error message and traceback

Citation

How should I cite DataStudio?

If you use DataStudio in your research, please cite:

@article{zhang2025bee,
  title={Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs},
  author={Zhang, Yi and others},
  journal={arXiv preprint arXiv:2510.13795},
  year={2025}
}

Still Have Questions?