Getting Started

This guide will help you install DataStudio and run your first data processing pipeline.

Prerequisites

Before installing DataStudio, ensure you have:

  • Python 3.10 or higher

  • pip package manager

  • Git for cloning the repository

  • Sufficient disk space for image caching (varies by dataset)

Installation

Step 1: Clone the Repository

git clone https://github.com/Open-Bee/DataStudio.git
cd DataStudio

Step 2: Install Dependencies

pip install -r requirements.txt

Step 3: Install DataStudio

pip install -e .

Step 4: Verify Installation

python -c "import datastudio; print('✅ DataStudio installed successfully!')"

Optional: Weights & Biases

For experiment tracking:

pip install wandb
wandb login

Quick Start

Option 1: Try Built-in Examples

After installation, run the built-in examples to verify your setup (no additional data needed):

# Rule filtering example (CPU only, no MLLM required)
python run.py -c configs/examples/rule_filter_only.py

# Text normalization example (remove think tags, etc.)
python run.py -c configs/examples/text_normalization.py

See the Examples Guide for all 5 examples covering different scenarios.

Option 2: Config-Driven (Production)

For production use, create a config file:

# my_pipeline.py
_base_ = ["@/_base_/models/local_api_model.py", "@/_base_/dataset.py"]

work_dir = "./work_dirs/my_experiment"

logger = dict(type="Logger", log_file="logs/process.log")

dataset_yaml = "/path/to/dataset.yaml"

dataloader = dict(
    dataset=dataset_yaml,
    batch_size=1000,
    use_image=True,
    cache_dir="~/cache/images_lmdb",
)

datasaver = dict(
    dataset=dataset_yaml,
    output_dir="./output",
    save_yaml_name="processed",
)

pipeline = dict(
    type="Pipeline",
    operations={
        "filter": {
            "type": "FILTERS",
            "priority": 1,
            "cfg": {"type": "ConvLengthFilter", "min_length": 1, "max_length": 20}
        },
    }
)

Then run:

python run.py -c my_pipeline.py

Core Concepts

Data Format

DataStudio uses a standard multimodal conversation format:

{
    "id": "unique_id",
    "image": "path/to/image.jpg",
    "conversations": [
        {"from": "human", "value": "Question text"},
        {"from": "gpt", "value": "Answer text"}
    ]
}
  • id: Unique sample identifier

  • image: Path to image file (string or list for multi-image)

  • conversations: List of messages, alternating between human and gpt
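For a quick sanity check, the shape above can be validated with a few lines of standard-library Python. The helper below is our sketch, not part of DataStudio's API:

```python
import json

# A sample in DataStudio's conversation format (illustrative values).
sample = {
    "id": "unique_id",
    "image": "path/to/image.jpg",
    "conversations": [
        {"from": "human", "value": "Question text"},
        {"from": "gpt", "value": "Answer text"},
    ],
}

def is_valid_sample(s):
    """Check the basic shape: required keys and alternating human/gpt turns."""
    if "id" not in s or "conversations" not in s:
        return False
    expected = ["human", "gpt"]
    return all(
        m.get("from") == expected[i % 2]
        for i, m in enumerate(s["conversations"])
    )

print(is_valid_sample(sample))  # True
```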

Operators

DataStudio provides two types of operators:

| Type | Purpose | Returns |
| --- | --- | --- |
| Filter | Decide whether to keep or remove samples | `(rejected: bool, reason: str)` |
| Rewriter | Modify content without removing samples | `new_answer` or `None` |
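The two return contracts can be illustrated with a pair of hypothetical stand-in functions. These are our sketches (base classes and registration are omitted); only the return shapes follow the table above:

```python
def conv_length_filter(sample, min_length=1, max_length=20):
    """Filter contract: return (rejected: bool, reason: str)."""
    n_turns = len(sample["conversations"]) // 2  # one turn = human + gpt
    if n_turns < min_length:
        return True, f"too few turns ({n_turns})"
    if n_turns > max_length:
        return True, f"too many turns ({n_turns})"
    return False, ""

def strip_whitespace_rewriter(sample):
    """Rewriter contract: return a new answer, or None to leave it unchanged."""
    answer = sample["conversations"][-1]["value"]
    stripped = answer.strip()
    return stripped if stripped != answer else None
```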

Pipeline

A Pipeline combines multiple operators:

pipeline = Pipeline([
    filter1,    # Executes first
    filter2,    # Executes second
    rewriter1,  # Executes third
])

Samples rejected by an early operator skip all later operators.
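That short-circuit behavior can be sketched with plain functions standing in for DataStudio's Pipeline class (all names here are illustrative):

```python
def run_pipeline(sample, operators):
    """Apply filters in order; stop at the first one that rejects the sample."""
    for op in operators:
        rejected, reason = op(sample)
        if rejected:
            return None, reason  # later operators are skipped
    return sample, ""

# Two toy filters: reject empty conversations, then reject very long ones.
def non_empty(sample):
    return (len(sample["conversations"]) == 0, "empty")

def short_enough(sample):
    return (len(sample["conversations"]) > 40, "too long")

kept, why = run_pipeline({"conversations": []}, [non_empty, short_enough])
# kept is None, why == "empty" -- short_enough never ran
```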

Built-in Operators

Filters

| Operator | Description |
| --- | --- |
| ConvLengthFilter | Filter by conversation turns |
| ImageSizeFilter | Filter by image dimensions |
| ImageAspectRatioFilter | Filter by aspect ratio |
| TextRepeatFilter | Detect text repetition |
| MLLMFilter | MLLM-powered filtering |

Rewriters

| Operator | Description |
| --- | --- |
| RemoveThinkRewriter | Remove `<think>` tags |
| NormPromptRewriter | Normalize prompts |
| SplitRewriter | Split multi-turn conversations into single-turn samples |
| MLLMRewriter | MLLM-powered rewriting |
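As an illustration of the kind of transformation RemoveThinkRewriter performs, here is a regex-based sketch of our own (not DataStudio's implementation):

```python
import re

# Match <think>...</think> blocks, including any trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def remove_think(answer):
    """Strip think blocks; return the cleaned text, or None if nothing changed."""
    cleaned = THINK_RE.sub("", answer)
    return cleaned if cleaned != answer else None

print(remove_think("<think>reasoning...</think>The answer is 42."))
# The answer is 42.
```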

Dataset Configuration

Create a YAML file to define your dataset:

# dataset.yaml
data_root: /path/to/data

datasets:
  - file_path: train.jsonl
    source: my_dataset
    split: train

  - file_path: eval.jsonl
    source: my_dataset
    split: eval
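Each file_path points at a JSONL file (one sample object per line). A minimal standard-library reader sketch, assuming that layout:

```python
import json

def read_jsonl(path):
    """Yield one sample dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```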

What’s Next?

Now that you have DataStudio running, explore:

Need Help?