Frequently Asked Questions

This page answers common questions about DataStudio.

General Questions

What is DataStudio?

DataStudio is an industrial-grade multimodal data processing pipeline designed for preparing training data for Multimodal Large Language Models (MLLMs). It’s the data curation engine behind the Bee project and the Honey-Data-15M dataset.

What makes DataStudio different from other data processing tools?

DataStudio is specifically designed for multimodal (image + text) data with:

  • Native MLLM integration: Use vision-language models for intelligent filtering and rewriting

  • Massive parallelism: Multi-process async concurrent API requests (8192+ via MPOpenAIAPI)

  • LMDB image caching: Blazing-fast image I/O with sharded caching

  • Config-driven pipelines: Reproducible, shareable processing workflows

Is DataStudio free to use?

Yes! DataStudio is open-source under the Apache License 2.0. You can use it freely for both research and commercial purposes.

Installation & Setup

What Python version is required?

DataStudio requires Python 3.10 or higher.

python --version  # Should be 3.10+

How do I install DataStudio?

git clone https://github.com/Open-Bee/DataStudio.git
cd DataStudio
pip install -r requirements.txt
pip install -e .

Do I need a GPU?

No GPU is required for running DataStudio itself. However:

  • If you’re using locally deployed MLLMs (e.g., vLLM, SGLang), you’ll need GPUs

  • If you’re using cloud APIs (OpenAI, etc.), no GPU is needed

How much disk space do I need?

It depends on your dataset:

  • LMDB cache: Roughly 50-100% of your original image size

  • Output data: Similar to input data size

  • Working space: ~10GB for checkpoints and logs

We recommend having at least 2x your dataset size in free disk space.

Data Format

What data formats does DataStudio support?

DataStudio supports:

  • JSON: Single file with array of samples

  • JSONL: One sample per line (recommended for large datasets)
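
JSONL is recommended for large datasets because it can be read one sample at a time instead of loading the whole file into memory. A minimal, DataStudio-independent sketch of reading and writing JSONL:

```python
import json

def read_jsonl(path):
    """Yield one sample per line, so large files never need to fit in memory."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def write_jsonl(samples, path):
    """Write one JSON object per line, newline-separated."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```
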

What should my data look like?

{
    "id": "unique_sample_id",
    "image": "path/to/image.jpg",
    "conversations": [
        {"from": "human", "value": "What is in this image?"},
        {"from": "gpt", "value": "This is a cat."}
    ]
}

Can I process multi-image samples?

Yes! Use a list for the image field:

{
    "id": "multi_image_sample",
    "image": ["image1.jpg", "image2.jpg", "image3.jpg"],
    "conversations": [
        {"from": "human", "value": "Compare these three images."},
        {"from": "gpt", "value": "The first image shows..."}
    ]
}

Can I process text-only data?

Yes, simply omit the image field or set use_image=False in your config:

dataloader = dict(
    use_image=False,
    # ...
)

Pipeline & Operators

How do I choose which operators to use?

Start with these common patterns:

  1. Basic cleaning: ConvLengthFilter + RemoveThinkRewriter

  2. Quality filtering: Add MLLMFilter with a quality prompt

  3. Content enhancement: Add MLLMRewriter for CoT enrichment

What’s the difference between Filter and Rewriter?

  • Filter: Decides whether to keep or remove a sample (binary decision)

  • Rewriter: Modifies the content of a sample (transformation)
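
The split can be illustrated with two toy operators. These minimal classes are NOT DataStudio's actual base classes (see the Development Guide for those); the class and parameter names here are made up purely to show the keep/drop vs. transform contract:

```python
# Illustrative only: hypothetical operators mimicking the Filter/Rewriter split.

class LengthFilter:
    """Filter: returns True to keep a sample, False to drop it (binary decision)."""
    def __init__(self, min_chars=10):
        self.min_chars = min_chars

    def __call__(self, sample):
        text = " ".join(turn["value"] for turn in sample["conversations"])
        return len(text) >= self.min_chars

class LowercaseRewriter:
    """Rewriter: returns a modified copy of the sample (transformation)."""
    def __call__(self, sample):
        sample = dict(sample)
        sample["conversations"] = [
            {**turn, "value": turn["value"].lower()}
            for turn in sample["conversations"]
        ]
        return sample
```
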

Can I combine multiple operators?

Yes! That’s the core design of DataStudio:

pipeline = Pipeline([
    Filter1(),
    Filter2(),
    Rewriter1(),
    Rewriter2(),
])

Operators execute in order. Filtered samples are removed before reaching subsequent operators.

How do I create custom operators?

See the Development Guide for detailed instructions.

MLLM Integration

Which MLLM providers are supported?

DataStudio supports any OpenAI-compatible API, including:

  • OpenAI (GPT-4o, GPT-4V)

  • Anthropic (Claude)

  • Local deployments (vLLM, SGLang, Ollama)

  • Cloud services (Azure OpenAI, AWS Bedrock)

How do I use a local model?

Deploy your model with vLLM or SGLang, then configure DataStudio:

model = dict(
    model="Qwen3-VL-30B-A3B-Instruct",
    api_base="http://localhost:8000/v1",
    key="not-needed",
    thread_num=512,
)

How many concurrent API calls can I make?

This depends on your API provider:

  • OpenAI: 50-200 (rate limits apply)

  • Local vLLM: 256-1024 (depends on GPU memory)

  • SGLang cluster: 2048-8192 (highest throughput)

Why are my MLLM calls slow?

Common causes:

  1. Low thread_num: Increase thread_num in model config

  2. API rate limiting: Reduce thread_num or upgrade API tier

  3. Large images: Use resize_image_size to reduce image size

  4. Network latency: Consider local model deployment
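
Causes 1 and 3 can both be addressed in the model config. This fragment extends the local-deployment example above; note that placing resize_image_size in the model config is an assumption here, so check where that option lives in your installed version:

```python
# Sketch only: resize_image_size placement is an assumption.
model = dict(
    model="Qwen3-VL-30B-A3B-Instruct",
    api_base="http://localhost:8000/v1",
    key="not-needed",
    thread_num=1024,         # raise for local deployments, lower if rate-limited
    resize_image_size=1024,  # shrink images before sending to cut latency
)
```
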

Performance

How do I speed up processing?

See the Quick Start guide for details. The key points:

  1. Pre-cache images with --cache-images

  2. Use rule-based filters before MLLM operators

  3. Increase thread_num for MLLM operations

  4. Use SSD storage for LMDB cache

What batch_size should I use?

Available Memory    Recommended batch_size
< 16 GB             1,000 - 5,000
16-64 GB            5,000 - 20,000
> 64 GB             20,000 - 100,000
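
If you want to encode the table above in a config script, a small helper works; the thresholds come from the table, and the function simply returns the low end of each range as a conservative default:

```python
def recommended_batch_size(mem_gb):
    """Map available memory (GB) to a batch_size per the table above.

    Returns the low end of each recommended range as a safe default.
    """
    if mem_gb < 16:
        return 1_000
    elif mem_gb <= 64:
        return 5_000
    else:
        return 20_000
```
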

How do I handle very large datasets (10M+ samples)?

  1. Pre-cache images before running the pipeline

  2. Use config-driven pipelines for automatic checkpointing

  3. Split processing across multiple machines if needed

  4. Monitor progress with W&B integration

Troubleshooting

My pipeline crashed. How do I resume?

Simply re-run the same command:

python run.py -c my_config.py  # Automatically resumes

DataStudio saves checkpoints after each batch.

Why is my output empty?

Common causes:

  1. Filters too aggressive: Check filter_ops in rejected samples

  2. Wrong data format: Verify input matches expected format

  3. Path issues: Check image paths are correct
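
A quick way to rule out cause 3 is to verify image paths before running the pipeline. This check uses only the sample format described above, no DataStudio APIs:

```python
import os

def find_missing_images(samples, image_root="."):
    """Return ids of samples whose image files don't exist on disk.

    Handles single-image (str), multi-image (list), and text-only samples.
    """
    missing = []
    for sample in samples:
        paths = sample.get("image")
        if paths is None:  # text-only sample, nothing to check
            continue
        if isinstance(paths, str):
            paths = [paths]
        for p in paths:
            if not os.path.exists(os.path.join(image_root, p)):
                missing.append(sample["id"])
                break
    return missing
```
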

How do I debug filtering decisions?

Examine the filtered samples:

kept, filtered = pipeline(data)

for item in filtered[:10]:
    print(f"ID: {item['id']}")
    print(f"Filtered by: {item.get('filter_ops', {})}")
    print()

See the Troubleshooting Guide for more detailed debugging steps.

Contributing

How can I contribute?

We welcome contributions! See CONTRIBUTING.md for guidelines:

  • Bug reports and feature requests

  • Documentation improvements

  • New operators

  • Performance optimizations

How do I report a bug?

Open an issue on GitHub with:

  1. DataStudio version

  2. Python version and OS

  3. Minimal reproducible example

  4. Error message and traceback

Citation

How should I cite DataStudio?

If you use DataStudio in your research, please cite:

@article{zhang2025bee,
  title={Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs},
  author={Zhang, Yi and others},
  journal={arXiv preprint arXiv:2510.13795},
  year={2025}
}

Still Have Questions?