# Getting Started

This guide will help you install DataStudio and run your first data-processing pipeline.
## Prerequisites

Before installing DataStudio, ensure you have:

- Python 3.10 or higher
- pip package manager
- Git for cloning the repository
- Sufficient disk space for image caching (varies by dataset)
## Installation

### Step 1: Clone the Repository

```bash
git clone https://github.com/Open-Bee/DataStudio.git
cd DataStudio
```
### Step 2: Create a Virtual Environment (Recommended)

```bash
python -m venv venv
source venv/bin/activate  # Linux/macOS
# or: venv\Scripts\activate  # Windows
```
### Step 3: Install Dependencies

```bash
pip install -r requirements.txt
```
### Step 4: Install DataStudio

```bash
pip install -e .
```
### Step 5: Verify the Installation

```bash
python -c "import datastudio; print('✅ DataStudio installed successfully!')"
```
### Optional: Weights & Biases

For experiment tracking:

```bash
pip install wandb
wandb login
```
## Quick Start

### Option 1: Try the Built-in Examples

After installation, run the built-in examples to verify your setup (no additional data needed):

```bash
# Rule-filtering example (CPU only, no MLLM required)
python run.py -c configs/examples/rule_filter_only.py

# Text-normalization example (remove think tags, etc.)
python run.py -c configs/examples/text_normalization.py
```
See the Examples Guide for all 5 examples covering different scenarios.
### Option 2: Config-Driven (Production)

For production use, create a config file:

```python
# my_pipeline.py
_base_ = ["@/_base_/models/local_api_model.py", "@/_base_/dataset.py"]

work_dir = "./work_dirs/my_experiment"
logger = dict(type="Logger", log_file="logs/process.log")

dataset_yaml = "/path/to/dataset.yaml"

dataloader = dict(
    dataset=dataset_yaml,
    batch_size=1000,
    use_image=True,
    cache_dir="~/cache/images_lmdb",
)

datasaver = dict(
    dataset=dataset_yaml,
    output_dir="./output",
    save_yaml_name="processed",
)

pipeline = dict(
    type="Pipeline",
    operations={
        "filter": {
            "type": "FILTERS",
            "priority": 1,
            "cfg": {"type": "ConvLengthFilter", "min_length": 1, "max_length": 20},
        },
    },
)
```
Then run:

```bash
python run.py -c my_pipeline.py
```
## Core Concepts

### Data Format

DataStudio uses a standard multimodal conversation format:

```json
{
  "id": "unique_id",
  "image": "path/to/image.jpg",
  "conversations": [
    {"from": "human", "value": "Question text"},
    {"from": "gpt", "value": "Answer text"}
  ]
}
```

- `id`: Unique sample identifier
- `image`: Path to the image file (a string, or a list for multi-image samples)
- `conversations`: List of messages, alternating between `human` and `gpt`
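A record in this format can be checked with a small standalone helper. This is a sketch for illustration; `validate_sample` is not part of DataStudio's API:

```python
def validate_sample(sample: dict) -> bool:
    """Check that a record follows the conversation format above."""
    if "id" not in sample or "conversations" not in sample:
        return False
    convs = sample["conversations"]
    if not convs:
        return False
    # Turns must alternate sides, starting with "human".
    sides = ["human", "gpt"]
    return all(
        turn.get("from") == sides[i % 2] and isinstance(turn.get("value"), str)
        for i, turn in enumerate(convs)
    )

sample = {
    "id": "unique_id",
    "image": "path/to/image.jpg",
    "conversations": [
        {"from": "human", "value": "Question text"},
        {"from": "gpt", "value": "Answer text"},
    ],
}
print(validate_sample(sample))  # True
```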
### Operators

DataStudio provides two types of operators:

| Type | Purpose | Returns |
|---|---|---|
| Filter | Decide to keep or remove samples | A keep/drop decision |
| Rewriter | Modify content without removing it | The modified sample |
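The two contracts can be illustrated with a hypothetical minimal pair; the class names, base classes, and registration mechanism in real DataStudio operators may differ:

```python
class MinTurnFilter:
    """Filter: returns True to keep a sample, False to drop it."""

    def __init__(self, min_turns: int = 2):
        self.min_turns = min_turns

    def __call__(self, sample: dict) -> bool:
        return len(sample.get("conversations", [])) >= self.min_turns


class StripWhitespaceRewriter:
    """Rewriter: returns the modified sample instead of a decision."""

    def __call__(self, sample: dict) -> dict:
        for turn in sample.get("conversations", []):
            turn["value"] = turn["value"].strip()
        return sample
```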
### Pipeline

A `Pipeline` combines multiple operators:

```python
pipeline = Pipeline([
    filter1,    # Executes first
    filter2,    # Executes second
    rewriter1,  # Executes third
])
```
Samples filtered out by early operators skip later operators.
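That short-circuit behavior can be sketched in a few lines. This is a conceptual model, not DataStudio's actual `Pipeline` implementation:

```python
def run_pipeline(samples, operators):
    """Apply operators in order; a filter returning False skips the rest."""
    kept = []
    for sample in samples:
        for op in operators:
            result = op(sample)
            if result is False:            # filter dropped the sample
                break
            if isinstance(result, dict):   # rewriter returned a modified sample
                sample = result
        else:
            kept.append(sample)            # survived every operator
    return kept


# Toy operators: a filter (returns bool) and a rewriter (returns dict).
keep_long = lambda s: len(s["text"]) > 3
upper = lambda s: {**s, "text": s["text"].upper()}

print(run_pipeline([{"text": "hi"}, {"text": "hello"}], [keep_long, upper]))
# [{'text': 'HELLO'}]
```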
## Built-in Operators

### Filters

| Operator | Description |
|---|---|
| `ConvLengthFilter` | Filter by conversation turns |
|  | Filter by image dimensions |
|  | Filter by aspect ratio |
|  | Detect text repetition |
|  | MLLM-powered filtering |
Rewriters
Operator |
Description |
|---|---|
|
Remove |
|
Normalize prompts |
|
Split multi-turn to single |
|
MLLM-powered rewriting |
## Dataset Configuration

Create a YAML file to define your dataset:

```yaml
# dataset.yaml
data_root: /path/to/data
datasets:
  - file_path: train.jsonl
    source: my_dataset
    split: train
  - file_path: eval.jsonl
    source: my_dataset
    split: eval
```
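The layout above is plain YAML, so you can sanity-check it yourself before running a pipeline. The helper below is a sketch (it assumes PyYAML and is not DataStudio's actual loader); it joins `data_root` with each `file_path` for one split:

```python
import os

import yaml  # PyYAML


def resolve_dataset_paths(yaml_path: str, split: str = "train") -> list:
    """Return absolute file paths for every dataset entry in the given split."""
    with open(yaml_path) as f:
        cfg = yaml.safe_load(f)
    root = cfg["data_root"]
    return [
        os.path.join(root, d["file_path"])
        for d in cfg["datasets"]
        if d.get("split") == split
    ]
```

For the example file above, `resolve_dataset_paths("dataset.yaml", "train")` would yield `["/path/to/data/train.jsonl"]`.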
## What’s Next?

Now that you have DataStudio running, explore:

- Quick Start - Complete quick-start guide
- Examples Guide - Ready-to-run example configurations
- Development Guide - Creating custom operators
- DataStudio Architecture Guide - Deep dive into DataStudio internals
## Need Help?

- Frequently Asked Questions
- Troubleshooting Guide - Common issues and solutions
- GitHub Issues - Report bugs