Getting Started

This guide will help you install DataStudio and run your first data processing pipeline.

Prerequisites

Before installing DataStudio, ensure you have:

  • Python 3.10 or higher

  • pip package manager

  • Git for cloning the repository

  • Sufficient disk space for image caching (varies by dataset)

Installation

Step 1: Clone the Repository

git clone https://github.com/Open-Bee/DataStudio.git
cd DataStudio

Step 2: Install Dependencies

pip install -r requirements.txt

Step 3: Install DataStudio

pip install -e .

Step 4: Verify Installation

python -c "import datastudio; print('✅ DataStudio installed successfully!')"

Optional: Weights & Biases

For experiment tracking:

pip install wandb
wandb login

Quick Start

Option 1: Try Built-in Examples

After installation, run the built-in examples to verify your setup (no additional data needed):

# Rule filtering example (CPU only, no MLLM required)
python run.py -c configs/examples/rule_filter_only.py

# Text normalization example (remove think tags, etc.)
python run.py -c configs/examples/text_normalization.py

See the Examples Guide for all 5 examples covering different scenarios.

Option 2: Config-Driven (Production)

For production use, create a config file:

# my_pipeline.py
_base_ = ["@/_base_/models/local_api_model.py", "@/_base_/dataset.py"]

work_dir = "./work_dirs/my_experiment"

logger = dict(type="Logger", log_file="logs/process.log")

dataset_yaml = "/path/to/dataset.yaml"

dataloader = dict(
    dataset=dataset_yaml,
    batch_size=1000,
    use_image=True,
    cache_dir="~/cache/images_lmdb",
)

datasaver = dict(
    dataset=dataset_yaml,
    output_dir="./output",
    save_yaml_name="processed",
)

pipeline = dict(
    type="Pipeline",
    operations={
        "filter": {
            "type": "FILTERS",
            "priority": 1,
            "cfg": {"type": "ConvLengthFilter", "min_length": 1, "max_length": 20}
        },
    }
)

Then run:

python run.py -c my_pipeline.py

Core Concepts

Data Format

DataStudio uses a standard multimodal conversation format:

{
    "id": "unique_id",
    "image": "path/to/image.jpg",
    "conversations": [
        {"from": "human", "value": "Question text"},
        {"from": "gpt", "value": "Answer text"}
    ]
}
  • id: Unique sample identifier

  • image: Path to image file (string or list for multi-image)

  • conversations: List of messages, alternating between human and gpt
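For a quick sanity check, the shape above can be validated with a few lines of standard-library Python. The helper below is our sketch, not part of DataStudio's API:

```python
import json

# A sample in DataStudio's conversation format (illustrative values).
sample = {
    "id": "unique_id",
    "image": "path/to/image.jpg",
    "conversations": [
        {"from": "human", "value": "Question text"},
        {"from": "gpt", "value": "Answer text"},
    ],
}

def is_valid_sample(s):
    """Check the basic shape: required keys and alternating human/gpt turns."""
    if "id" not in s or "conversations" not in s:
        return False
    expected = ["human", "gpt"]
    return all(
        m.get("from") == expected[i % 2]
        for i, m in enumerate(s["conversations"])
    )

print(is_valid_sample(sample))  # True
```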

Operators

DataStudio provides two types of operators:

| Type | Purpose | Returns |
| --- | --- | --- |
| Filter | Decide whether to keep or remove samples | `(rejected: bool, reason: str)` |
| Rewriter | Modify content without removing samples | `new_answer` or `None` |
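The two return contracts can be illustrated with a pair of hypothetical stand-in functions. These are our sketches (base classes and registration are omitted); only the return shapes follow the table above:

```python
def conv_length_filter(sample, min_length=1, max_length=20):
    """Filter contract: return (rejected: bool, reason: str)."""
    n_turns = len(sample["conversations"]) // 2  # one turn = human + gpt
    if n_turns < min_length:
        return True, f"too few turns ({n_turns})"
    if n_turns > max_length:
        return True, f"too many turns ({n_turns})"
    return False, ""

def strip_whitespace_rewriter(sample):
    """Rewriter contract: return a new answer, or None to leave it unchanged."""
    answer = sample["conversations"][-1]["value"]
    stripped = answer.strip()
    return stripped if stripped != answer else None
```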

Pipeline

A Pipeline combines multiple operators:

pipeline = Pipeline([
    filter1,    # Executes first
    filter2,    # Executes second
    rewriter1,  # Executes third
])

Samples rejected by an early operator skip all later operators.
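That short-circuit behavior can be sketched with plain functions standing in for DataStudio's Pipeline class (all names here are illustrative):

```python
def run_pipeline(sample, operators):
    """Apply filters in order; stop at the first one that rejects the sample."""
    for op in operators:
        rejected, reason = op(sample)
        if rejected:
            return None, reason  # later operators are skipped
    return sample, ""

# Two toy filters: reject empty conversations, then reject very long ones.
def non_empty(sample):
    return (len(sample["conversations"]) == 0, "empty")

def short_enough(sample):
    return (len(sample["conversations"]) > 40, "too long")

kept, why = run_pipeline({"conversations": []}, [non_empty, short_enough])
# kept is None, why == "empty" -- short_enough never ran
```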

Built-in Operators

Filters

| Operator | Description |
| --- | --- |
| ConvLengthFilter | Filter by conversation turns |
| ImageSizeFilter | Filter by image dimensions |
| ImageAspectRatioFilter | Filter by aspect ratio |
| TextRepeatFilter | Detect text repetition |
| MLLMFilter | MLLM-powered filtering |

Rewriters

| Operator | Description |
| --- | --- |
| RemoveThinkRewriter | Remove `<think>` tags |
| NormPromptRewriter | Normalize prompts |
| SplitRewriter | Split multi-turn conversations into single-turn samples |
| MLLMRewriter | MLLM-powered rewriting |
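As an illustration of the kind of transformation RemoveThinkRewriter performs, here is a regex-based sketch of our own (not DataStudio's implementation):

```python
import re

# Match <think>...</think> blocks, including any trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def remove_think(answer):
    """Strip think blocks; return the cleaned text, or None if nothing changed."""
    cleaned = THINK_RE.sub("", answer)
    return cleaned if cleaned != answer else None

print(remove_think("<think>reasoning...</think>The answer is 42."))
# The answer is 42.
```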

Dataset Configuration

Create a YAML file to define your dataset:

# dataset.yaml
data_root: /path/to/data

datasets:
  - file_path: train.jsonl
    source: my_dataset
    split: train

  - file_path: eval.jsonl
    source: my_dataset
    split: eval
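Each file_path points at a JSONL file (one sample object per line). A minimal standard-library reader sketch, assuming that layout:

```python
import json

def read_jsonl(path):
    """Yield one sample dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```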

What’s Next?

Now that you have DataStudio running, explore:

Need Help?