Frequently Asked Questions
===========================

This page answers common questions about DataStudio.

.. contents:: Table of Contents
   :local:
   :depth: 2

General Questions
-----------------

What is DataStudio?
~~~~~~~~~~~~~~~~~~~

DataStudio is an industrial-grade multimodal data processing pipeline designed for preparing training data for Multimodal Large Language Models (MLLMs). It is the data curation engine behind the Bee project and the Honey-Data-15M dataset.

What makes DataStudio different from other data processing tools?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DataStudio is specifically designed for multimodal (image + text) data with:

- **Native MLLM integration**: Use vision-language models for intelligent filtering and rewriting
- **Massive parallelism**: Multi-process async concurrent API requests (8192+ via ``MPOpenAIAPI``)
- **LMDB image caching**: Fast image I/O with sharded caching
- **Config-driven pipelines**: Reproducible, shareable processing workflows

Is DataStudio free to use?
~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes! DataStudio is open-source under the Apache License 2.0. You can use it freely for both research and commercial purposes.

Installation & Setup
--------------------

What Python version is required?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DataStudio requires **Python 3.10 or higher**.

.. code-block:: bash

   python --version  # Should be 3.10+

How do I install DataStudio?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   git clone https://github.com/Open-Bee/DataStudio.git
   cd DataStudio
   pip install -r requirements.txt
   pip install -e .

Do I need a GPU?
~~~~~~~~~~~~~~~~

**No GPU is required** for running DataStudio itself. However:

- If you're using locally deployed MLLMs (e.g., vLLM, SGLang), you'll need GPUs for the model server
- If you're using cloud APIs (OpenAI, etc.), no GPU is needed

How much disk space do I need?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It depends on your dataset:

- **LMDB cache**: roughly 50-100% of your original image size
- **Output data**: similar to the input data size
- **Working space**: ~10 GB for checkpoints and logs

We recommend having **at least 2x your dataset size** in free disk space.
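A quick way to sanity-check the 2x rule before a long run is to compare the dataset's size against the free space on its disk. This is a plain-Python sketch, not part of DataStudio:

```python
import shutil
from pathlib import Path

def has_enough_space(dataset_dir: str, safety_factor: float = 2.0) -> bool:
    """Return True if free space on the dataset's disk is at least
    safety_factor times the dataset's current on-disk size."""
    root = Path(dataset_dir)
    dataset_bytes = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    free_bytes = shutil.disk_usage(root).free
    return free_bytes >= safety_factor * dataset_bytes
```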

Data Format
-----------

What data formats does DataStudio support?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DataStudio supports:

- **JSON**: a single file containing an array of samples
- **JSONL**: one sample per line (recommended for large datasets)
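If your data arrives as a single JSON array, converting it to JSONL is straightforward. This helper is a stand-alone sketch using only the standard library, not a DataStudio API:

```python
import json
from pathlib import Path

def json_to_jsonl(json_path: str, jsonl_path: str) -> int:
    """Convert a JSON array of samples into JSONL (one sample per line).
    Returns the number of samples written."""
    samples = json.loads(Path(json_path).read_text(encoding="utf-8"))
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
    return len(samples)
```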

What should my data look like?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: json

   {
     "id": "unique_sample_id",
     "image": "path/to/image.jpg",
     "conversations": [
       {"from": "human", "value": "What is in this image?"},
       {"from": "gpt", "value": "This is a cat."}
     ]
   }

Can I process multi-image samples?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes! Use a list for the ``image`` field:

.. code-block:: json

   {
     "id": "multi_image_sample",
     "image": ["image1.jpg", "image2.jpg", "image3.jpg"],
     "conversations": [
       {"from": "human", "value": "Compare these three images."},
       {"from": "gpt", "value": "The first image shows..."}
     ]
   }

Can I process text-only data?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes, simply omit the ``image`` field or set ``use_image=False`` in your config:

.. code-block:: python

   dataloader = dict(
       use_image=False,
       # ...
   )

Pipeline & Operators
--------------------

How do I choose which operators to use?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Start with these common patterns:

1. **Basic cleaning**: ``ConvLengthFilter`` + ``RemoveThinkRewriter``
2. **Quality filtering**: add ``MLLMFilter`` with a quality prompt
3. **Content enhancement**: add ``MLLMRewriter`` for CoT enrichment

What's the difference between Filter and Rewriter?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Filter**: decides whether to keep or remove a sample (a binary decision)
- **Rewriter**: modifies the content of a sample (a transformation)
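The distinction can be illustrated with a toy example in plain Python (these are not DataStudio's actual operator classes; see the development guide for those):

```python
# A filter is a predicate: it decides keep vs. remove.
def length_filter(sample: dict, min_chars: int = 10) -> bool:
    text = " ".join(turn["value"] for turn in sample["conversations"])
    return len(text) >= min_chars

# A rewriter transforms the sample and returns the modified version.
def strip_whitespace_rewriter(sample: dict) -> dict:
    for turn in sample["conversations"]:
        turn["value"] = turn["value"].strip()
    return sample

sample = {"id": "s1", "conversations": [
    {"from": "human", "value": "  What is in this image?  "},
    {"from": "gpt", "value": "This is a cat."},
]}

if length_filter(sample):                        # binary decision: keep
    sample = strip_whitespace_rewriter(sample)   # transformation
```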

Can I combine multiple operators?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes! That's the core design of DataStudio:

.. code-block:: python

   pipeline = Pipeline([
       Filter1(),
       Filter2(),
       Rewriter1(),
       Rewriter2(),
   ])

Operators execute in order. Filtered samples are removed before reaching subsequent operators.

How do I create custom operators?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See the :doc:`guide/development` guide for detailed instructions.

MLLM Integration
----------------

Which MLLM providers are supported?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DataStudio supports any **OpenAI-compatible API**, including:

- OpenAI (GPT-4o, GPT-4V)
- Anthropic (Claude)
- Local deployments (vLLM, SGLang, Ollama)
- Cloud services (Azure OpenAI, AWS Bedrock)

How do I use a local model?
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Deploy your model with vLLM or SGLang, then configure DataStudio:

.. code-block:: python

   model = dict(
       model="Qwen3-VL-30B-A3B-Instruct",
       api_base="http://localhost:8000/v1",
       key="not-needed",
       thread_num=512,
   )

How many concurrent API calls can I make?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This depends on your API provider:

- **OpenAI**: 50-200 (rate limits apply)
- **Local vLLM**: 256-1024 (depends on GPU memory)
- **SGLang cluster**: 2048-8192 (highest throughput)

Why are my MLLM calls slow?
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Common causes:

1. **Low thread_num**: increase ``thread_num`` in the model config
2. **API rate limiting**: reduce ``thread_num`` or upgrade your API tier
3. **Large images**: use ``resize_image_size`` to reduce image size
4. **Network latency**: consider a local model deployment
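Putting the first three fixes together, a tuned config for a local deployment might look like the following sketch. The exact placement of ``resize_image_size`` and the values shown are illustrative assumptions; check your operator's documented options:

```python
# Illustrative tuning sketch; values depend on your deployment.
model = dict(
    model="Qwen3-VL-30B-A3B-Instruct",
    api_base="http://localhost:8000/v1",
    key="not-needed",
    thread_num=1024,         # raise for local servers, lower if rate-limited
    resize_image_size=1024,  # cap image size to cut upload and encode time
)
```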

Performance
-----------

How do I speed up processing?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See the :doc:`guide/quick_start` guide for details. Key tips:

1. Pre-cache images with ``--cache-images``
2. Use rule-based filters before MLLM operators
3. Increase ``thread_num`` for MLLM operations
4. Use SSD storage for the LMDB cache

What batch_size should I use?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1

   * - Available Memory
     - Recommended batch_size
   * - < 16 GB
     - 1,000 - 5,000
   * - 16-64 GB
     - 5,000 - 20,000
   * - > 64 GB
     - 20,000 - 100,000
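As a rule of thumb, the table above can be encoded as a small helper that returns the conservative end of each range (illustrative only, not a DataStudio API):

```python
def suggest_batch_size(mem_gb: float) -> int:
    """Map available memory (GB) to the low end of the recommended
    batch_size range from the table above."""
    if mem_gb < 16:
        return 1_000
    if mem_gb <= 64:
        return 5_000
    return 20_000
```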

How do I handle very large datasets (10M+ samples)?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. **Pre-cache images** before running the pipeline
2. **Use config-driven pipelines** for automatic checkpointing
3. **Split processing** across multiple machines if needed
4. **Monitor progress** with W&B integration

Troubleshooting
---------------

My pipeline crashed. How do I resume?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Simply re-run the same command:

.. code-block:: bash

   python run.py -c my_config.py  # Automatically resumes

DataStudio saves checkpoints after each batch.

Why is my output empty?
~~~~~~~~~~~~~~~~~~~~~~~

Common causes:

1. **Filters too aggressive**: check ``filter_ops`` in the rejected samples
2. **Wrong data format**: verify the input matches the expected format
3. **Path issues**: check that image paths are correct

How do I debug filtering decisions?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Examine the filtered samples:

.. code-block:: python

   kept, filtered = pipeline(data)

   for item in filtered[:10]:
       print(f"ID: {item['id']}")
       print(f"Filtered by: {item.get('filter_ops', {})}")
       print()

See :doc:`troubleshooting` for more detailed debugging guides.

Contributing
------------

How can I contribute?
~~~~~~~~~~~~~~~~~~~~~

We welcome contributions! See ``CONTRIBUTING.md`` in the repository for guidelines. Contributions can include:

- Bug reports and feature requests
- Documentation improvements
- New operators
- Performance optimizations

How do I report a bug?
~~~~~~~~~~~~~~~~~~~~~~

Open an issue on GitHub with:

1. DataStudio version
2. Python version and OS
3. Minimal reproducible example
4. Error message and traceback

Citation
--------

How should I cite DataStudio?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you use DataStudio in your research, please cite:

.. code-block:: bibtex

   @article{zhang2025bee,
     title={Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs},
     author={Zhang, Yi and others},
     journal={arXiv preprint arXiv:2510.13795},
     year={2025}
   }

Still Have Questions?
---------------------

- Check the :doc:`troubleshooting` guide
- Browse the issues on the `DataStudio GitHub repository <https://github.com/Open-Bee/DataStudio>`_
- Start a discussion on the repository