Examples Guide

DataStudio includes 151 demo data samples (multimodal QA with images) and 5 ready-to-run example configs. You can try the examples immediately after installation — no additional data download required.

Demo data location: configs/examples/demo_data/
Example configs:    configs/examples/*.py


Demo Data

The demo dataset contains 151 multimodal QA samples covering diverse sources (Caption, VQA, OCR, Math, Chart, etc.). Each sample includes:

  • Multi-turn conversations in messages format (OpenAI-style)

  • Image references (265 images included in demo_data/images/)

  • <think> tags in answers (useful for testing normalization)

  • ori_answer fields in 126 samples (original answers before rewriting)

Data format:

{
  "messages": [
    {"role": "user", "content": "<image>\nWhat color is the phone?"},
    {"role": "assistant", "content": "<think>\n\n</think>\nThe phone is red."}
  ],
  "images": ["images/xxx.jpg"],
  "id": "sample_001",
  "source": "st_vqa"
}
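A record in this format can be inspected with a few lines of plain Python. The sketch below reuses the sample shown above and checks for the <think> tags that the normalization examples later remove (the helper name `has_think_tags` is illustrative, not a DataStudio API):

```python
import re

# The demo record from the format example above.
sample = {
    "messages": [
        {"role": "user", "content": "<image>\nWhat color is the phone?"},
        {"role": "assistant", "content": "<think>\n\n</think>\nThe phone is red."},
    ],
    "images": ["images/xxx.jpg"],
    "id": "sample_001",
    "source": "st_vqa",
}

def has_think_tags(record):
    """Return True if any assistant turn contains a <think>...</think> block."""
    return any(
        msg["role"] == "assistant"
        and re.search(r"<think>.*?</think>", msg["content"], re.DOTALL)
        for msg in record["messages"]
    )

print(has_think_tags(sample))  # True — all 151 demo answers carry think tags
```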

Example Overview

#   Config                    MLLM Required   Description
1   rule_filter_only.py       No              Rule filtering (length, repetition, image validation)
2   mllm_quality_filter.py    Yes             Rule filters + MLLM question-image consistency check
3   mllm_answer_rewrite.py    Yes             MLLM answer rewriting (plain text mode)
4   honeypipe.py              Yes             5-stage end-to-end pipeline (HoneyPipe)
5   text_normalization.py     No              Remove think tags, normalize image tags and prompts

Recommended order: Start with Example 1 or 5 (no MLLM needed), then try 2–4 if you have a deployed inference service.


Examples Without MLLM (Run Directly)

Example 1: Rule-Based Filtering

Filters out samples with abnormal conversation lengths, text length anomalies, or repetitive content. Text-only, no image loading. Inherits rule filters from _base_/filters/filter_rule_base_for_answer.py.

python run.py -c configs/examples/rule_filter_only.py

Inherited operators (from _base_):

  • ImageSizeFilter, ImageAspectRatioFilter, ImageExtFilter (no-ops when use_image=False)

  • ConvLengthFilter: Keep conversations with 1–300 turns

  • LengthAnomalyFilter: Reject questions < 2 or > 4096 tokens; reject answers < 1 or > 4096 tokens

  • TextRepeatFilter: Detect repetitive patterns in questions and answers
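The length-based checks above can be approximated in a few lines. The thresholds below match the ones listed, but the function names and logic are an illustrative sketch, not DataStudio's actual implementation:

```python
def conv_length_ok(messages, min_turns=1, max_turns=300):
    # Mirrors ConvLengthFilter: keep conversations with 1-300 turns.
    return min_turns <= len(messages) <= max_turns

def length_anomaly_ok(question_tokens, answer_tokens):
    # Mirrors LengthAnomalyFilter: reject questions < 2 or > 4096 tokens
    # and answers < 1 or > 4096 tokens.
    return 2 <= question_tokens <= 4096 and 1 <= answer_tokens <= 4096
```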

Output: ./output/examples/rule_filter_only/


Example 5: Text Normalization

Cleans up data formatting without filtering. The demo data contains <think>...</think> tags in all 151 answers, making it a good test case.

python run.py -c configs/examples/text_normalization.py

What it does:

  • RemoveThinkRewriter: Remove <think>...</think> tags and their content

  • NormImageTagRewriter: Move <image> tags to the beginning of questions

  • NormPromptRewriter: Standardize prompt format

  • RemoveReasonRewriter: Remove reasoning prefix content
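The first two rewriters can be pictured as simple string transforms. The sketch below shows one plausible implementation; it is not DataStudio's code, and edge-case handling (nested tags, multiple images) is simplified:

```python
import re

def remove_think(text):
    # Like RemoveThinkRewriter: strip <think>...</think> blocks and their content.
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

def norm_image_tag(question):
    # Like NormImageTagRewriter: move <image> tags to the start of the question.
    n = question.count("<image>")
    stripped = question.replace("<image>", "").strip()
    return "<image>" * n + ("\n" if n else "") + stripped

print(remove_think("<think>\n\n</think>\nThe phone is red."))  # The phone is red.
print(norm_image_tag("What color is the phone?\n<image>"))     # <image>\nWhat color is the phone?
```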

Output: ./output/examples/text_normalization/


Examples With MLLM (Require Inference Service)

These examples require an OpenAI-compatible inference service. Deploy one first:

# vLLM
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-VL-30B-A3B-Instruct \
    --port 8000 --tensor-parallel-size 4

# or SGLang
python -m sglang.launch_server \
    --model Qwen/Qwen3-VL-30B-A3B-Instruct \
    --port 8000 --tp 4

Then update the model section in the config if your setup differs (model name, port, etc.).


Example 2: MLLM Quality Filtering

Rule-based filters run first (inherited from _base_), then an MLLM checks question-image consistency.

python run.py -c configs/examples/mllm_quality_filter.py

Pipeline:

  1. priority=0 — Rule filters (from _base_/filters/filter_rule_base_for_question.py)

  2. priority=10 — MLLM question-image consistency filter

Key config:

qi_consist_request = dict(
    type="RequestBuilder",
    prompt="prompts/filter/question_image_consist_v2.txt",
    key_templates={"result": "q{idx}", "reason": "q{idx}_reason"},
    with_image=True,
    with_question=True,
)
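One plausible reading of key_templates is that {idx} is filled in per question, so the MLLM's JSON reply is expected to contain keys like q0 / q0_reason, q1 / q1_reason, and so on. The expansion below is an assumption about RequestBuilder's behavior, not its actual code:

```python
def expand_key_templates(key_templates, num_questions):
    # Hypothetical expansion: fill {idx} for each question index.
    return [
        {name: tmpl.format(idx=i) for name, tmpl in key_templates.items()}
        for i in range(num_questions)
    ]

keys = expand_key_templates(
    {"result": "q{idx}", "reason": "q{idx}_reason"}, num_questions=2
)
# keys == [{"result": "q0", "reason": "q0_reason"},
#          {"result": "q1", "reason": "q1_reason"}]
```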

Output: ./output/examples/mllm_quality_filter/


Example 3: MLLM Answer Rewriting

Sends the original question + image to the MLLM, which generates a new answer. Uses plain text mode (no JSON parsing).

python run.py -c configs/examples/mllm_answer_rewrite.py

Key config:

model = dict(
    return_dict=False,   # Plain text output — model response is the new answer
)

rewriter_request = dict(
    type="RequestBuilder",
    prompt=None,          # No template, send original question directly
    key_templates=None,   # Plain text mode
    with_image=True,
    with_question=True,
)

After rewriting, the original answer is saved in the ori_answer field.
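That bookkeeping can be pictured as a small in-place update on each record; the sketch below is illustrative only and not DataStudio's implementation:

```python
def apply_rewrite(record, new_answer):
    # Preserve the original answer in ori_answer, then swap in the MLLM output.
    for msg in record["messages"]:
        if msg["role"] == "assistant":
            record["ori_answer"] = msg["content"]
            msg["content"] = new_answer
            break
    return record

rec = {"messages": [
    {"role": "user", "content": "<image>\nWhat color is the phone?"},
    {"role": "assistant", "content": "The phone is red."},
]}
apply_rewrite(rec, "The phone in the image is red.")
# rec["ori_answer"] == "The phone is red."
```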

Output: ./output/examples/mllm_answer_rewrite/


Example 4: HoneyPipe Full Pipeline (5 Stages)

End-to-end processing from raw data to clean training data. All configuration is expanded inline, serving as a reference for understanding every available option.

python run.py -c configs/examples/honeypipe.py

Execution order:

Priority   Stage               Description
0          Rule Filters        ConvLength, ImageSize, AspectRatio, ImageExt, LengthAnomaly, TextRepeat
10         MLLM Filter         Question-image consistency check
20         MLLM Rewriter       Regenerate answers (plain text mode)
30         Consistency Check   Compare old vs new answers
40         Normalization       Remove think tags, normalize image tags

Output: ./output/examples/honeypipe/
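The priority-ordered execution above amounts to sorting stages by priority and running them in turn. The loop and toy stages below are a sketch of that idea, not HoneyPipe's actual scheduler:

```python
def run_pipeline(samples, stages):
    # Stages run in ascending priority order, as in the table above.
    for priority, stage_fn in sorted(stages, key=lambda s: s[0]):
        samples = stage_fn(samples)
    return samples

# Two toy stages (illustrative only): priority 0 filters, priority 40 normalizes.
stages = [
    (40, lambda xs: [x.replace("<think></think>", "") for x in xs]),
    (0, lambda xs: [x for x in xs if x]),
]
print(run_pipeline(["", "<think></think>ok"], stages))  # ['ok']
```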


Customizing Examples

To adapt any example for your own data:

  1. Replace the dataset: Change dataset_yaml to point to your own YAML file

  2. Set data_root: Set to the base directory for resolving relative image paths in your data

  3. Adjust model config: Update model, api_base, port to match your inference service

  4. Tune parameters: Adjust batch_size, thread_num, filter thresholds, etc.

# Example: use your own data
dataset_yaml = "/path/to/your/dataset.yaml"

dataloader = dict(
    dataset=dataset_yaml,
    data_root="/path/to/your/data/root",   # Base path for image resolution
    batch_size=5000,
    use_image=True,
)