Examples Guide

DataStudio includes 151 demo data samples (multimodal QA with images) and 5 ready-to-run example configs. You can try the examples immediately after installation — no additional data download required.

Demo data location: configs/examples/demo_data/
Example configs:    configs/examples/*.py


Demo Data

The demo dataset contains 151 multimodal QA samples covering diverse sources (Caption, VQA, OCR, Math, Chart, etc.). Each sample includes:

  • Multi-turn conversations in messages format (OpenAI-style)

  • Image references (265 images included in demo_data/images/)

  • <think> tags in answers (useful for testing normalization)

  • ori_answer fields in 126 samples (original answers before rewriting)

Data format:

{
  "messages": [
    {"role": "user", "content": "<image>\nWhat color is the phone?"},
    {"role": "assistant", "content": "<think>\n\n</think>\nThe phone is red."}
  ],
  "images": ["images/xxx.jpg"],
  "id": "sample_001",
  "source": "st_vqa"
}
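A record in this format can be inspected with a few lines of plain Python. The sketch below reuses the sample shown above and checks for the <think> tags that the normalization examples later remove (the helper name `has_think_tags` is illustrative, not a DataStudio API):

```python
import re

# The demo record from the format example above.
sample = {
    "messages": [
        {"role": "user", "content": "<image>\nWhat color is the phone?"},
        {"role": "assistant", "content": "<think>\n\n</think>\nThe phone is red."},
    ],
    "images": ["images/xxx.jpg"],
    "id": "sample_001",
    "source": "st_vqa",
}

def has_think_tags(record):
    """Return True if any assistant turn contains a <think>...</think> block."""
    return any(
        msg["role"] == "assistant"
        and re.search(r"<think>.*?</think>", msg["content"], re.DOTALL)
        for msg in record["messages"]
    )

print(has_think_tags(sample))  # True — all 151 demo answers carry think tags
```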

Example Overview

#   Config                    MLLM Required   Description
1   rule_filter_only.py       No              Rule filtering (length, repetition, image validation)
2   mllm_quality_filter.py    Yes             Rule filters + MLLM question-image consistency check
3   mllm_answer_rewrite.py    Yes             MLLM answer rewriting (plain text mode)
4   honeypipe.py              Yes             5-stage end-to-end pipeline (HoneyPipe)
5   text_normalization.py     No              Remove think tags, normalize image tags and prompts

Recommended order: Start with Example 1 or 5 (no MLLM needed), then try 2–4 if you have a deployed inference service.


Examples Without MLLM (Run Directly)

Example 1: Rule-Based Filtering

Filters out samples with abnormal conversation lengths, text length anomalies, or repetitive content. Text-only, no image loading. Inherits rule filters from _base_/filters/filter_rule_base_for_answer.py.

python run.py -c configs/examples/rule_filter_only.py

Inherited operators (from _base_):

  • ImageSizeFilter, ImageAspectRatioFilter, ImageExtFilter (no-ops when use_image=False)

  • ConvLengthFilter: Keep conversations with 1–300 turns

  • LengthAnomalyFilter: Reject questions < 2 or > 4096 tokens; reject answers < 1 or > 4096 tokens

  • TextRepeatFilter: Detect repetitive patterns in questions and answers
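The length-based checks above can be approximated in a few lines. The thresholds below match the ones listed, but the function names and logic are an illustrative sketch, not DataStudio's actual implementation:

```python
def conv_length_ok(messages, min_turns=1, max_turns=300):
    # Mirrors ConvLengthFilter: keep conversations with 1-300 turns.
    return min_turns <= len(messages) <= max_turns

def length_anomaly_ok(question_tokens, answer_tokens):
    # Mirrors LengthAnomalyFilter: reject questions < 2 or > 4096 tokens
    # and answers < 1 or > 4096 tokens.
    return 2 <= question_tokens <= 4096 and 1 <= answer_tokens <= 4096
```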

Output: ./output/examples/rule_filter_only/


Example 5: Text Normalization

Cleans up data formatting without filtering. The demo data contains <think>...</think> tags in all 151 answers, making it a good test case.

python run.py -c configs/examples/text_normalization.py

What it does:

  • RemoveThinkRewriter: Remove <think>...</think> tags and their content

  • NormImageTagRewriter: Move <image> tags to the beginning of questions

  • NormPromptRewriter: Standardize prompt format

  • RemoveReasonRewriter: Remove reasoning prefix content
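The first two rewriters can be pictured as simple string transforms. The sketch below shows one plausible implementation; it is not DataStudio's code, and edge-case handling (nested tags, multiple images) is simplified:

```python
import re

def remove_think(text):
    # Like RemoveThinkRewriter: strip <think>...</think> blocks and their content.
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

def norm_image_tag(question):
    # Like NormImageTagRewriter: move <image> tags to the start of the question.
    n = question.count("<image>")
    stripped = question.replace("<image>", "").strip()
    return "<image>" * n + ("\n" if n else "") + stripped

print(remove_think("<think>\n\n</think>\nThe phone is red."))  # The phone is red.
print(norm_image_tag("What color is the phone?\n<image>"))     # <image>\nWhat color is the phone?
```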

Output: ./output/examples/text_normalization/


Examples With MLLM (Require Inference Service)

These examples require an OpenAI-compatible inference service. Deploy one first:

# vLLM
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-VL-30B-A3B-Instruct \
    --port 8000 --tensor-parallel-size 4

# or SGLang
python -m sglang.launch_server \
    --model Qwen/Qwen3-VL-30B-A3B-Instruct \
    --port 8000 --tp 4

Then update the model section in the config if your setup differs (model name, port, etc.).


Example 2: MLLM Quality Filtering

Rule-based filters run first (inherited from _base_), then an MLLM checks question-image consistency.

python run.py -c configs/examples/mllm_quality_filter.py

Pipeline:

  1. priority=0 — Rule filters (from _base_/filters/filter_rule_base_for_question.py)

  2. priority=10 — MLLM question-image consistency filter

Key config:

qi_consist_request = dict(
    type="RequestBuilder",
    prompt="prompts/filter/question_image_consist_v2.txt",
    key_templates={"result": "q{idx}", "reason": "q{idx}_reason"},
    with_image=True,
    with_question=True,
)
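One plausible reading of key_templates is that {idx} is filled in per question, so the MLLM's JSON reply is expected to contain keys like q0 / q0_reason, q1 / q1_reason, and so on. The expansion below is an assumption about RequestBuilder's behavior, not its actual code:

```python
def expand_key_templates(key_templates, num_questions):
    # Hypothetical expansion: fill {idx} for each question index.
    return [
        {name: tmpl.format(idx=i) for name, tmpl in key_templates.items()}
        for i in range(num_questions)
    ]

keys = expand_key_templates(
    {"result": "q{idx}", "reason": "q{idx}_reason"}, num_questions=2
)
# keys == [{"result": "q0", "reason": "q0_reason"},
#          {"result": "q1", "reason": "q1_reason"}]
```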

Output: ./output/examples/mllm_quality_filter/


Example 3: MLLM Answer Rewriting

Sends the original question + image to the MLLM, which generates a new answer. Uses plain text mode (no JSON parsing).

python run.py -c configs/examples/mllm_answer_rewrite.py

Key config:

model = dict(
    return_dict=False,   # Plain text output — model response is the new answer
)

rewriter_request = dict(
    type="RequestBuilder",
    prompt=None,          # No template, send original question directly
    key_templates=None,   # Plain text mode
    with_image=True,
    with_question=True,
)

After rewriting, the original answer is saved in the ori_answer field.
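That bookkeeping can be pictured as a small in-place update on each record; the sketch below is illustrative only and not DataStudio's implementation:

```python
def apply_rewrite(record, new_answer):
    # Preserve the original answer in ori_answer, then swap in the MLLM output.
    for msg in record["messages"]:
        if msg["role"] == "assistant":
            record["ori_answer"] = msg["content"]
            msg["content"] = new_answer
            break
    return record

rec = {"messages": [
    {"role": "user", "content": "<image>\nWhat color is the phone?"},
    {"role": "assistant", "content": "The phone is red."},
]}
apply_rewrite(rec, "The phone in the image is red.")
# rec["ori_answer"] == "The phone is red."
```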

Output: ./output/examples/mllm_answer_rewrite/


Example 4: HoneyPipe Full Pipeline (5 Stages)

End-to-end processing from raw data to clean training data. All configuration is expanded inline, serving as a reference for understanding every available option.

python run.py -c configs/examples/honeypipe.py

Execution order:

Priority   Stage               Description
0          Rule Filters        ConvLength, ImageSize, AspectRatio, ImageExt, LengthAnomaly, TextRepeat
10         MLLM Filter         Question-image consistency check
20         MLLM Rewriter       Regenerate answers (plain text mode)
30         Consistency Check   Compare old vs new answers
40         Normalization       Remove think tags, normalize image tags

Output: ./output/examples/honeypipe/
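The priority-ordered execution above amounts to sorting stages by priority and running them in turn. The loop and toy stages below are a sketch of that idea, not HoneyPipe's actual scheduler:

```python
def run_pipeline(samples, stages):
    # Stages run in ascending priority order, as in the table above.
    for priority, stage_fn in sorted(stages, key=lambda s: s[0]):
        samples = stage_fn(samples)
    return samples

# Two toy stages (illustrative only): priority 0 filters, priority 40 normalizes.
stages = [
    (40, lambda xs: [x.replace("<think></think>", "") for x in xs]),
    (0, lambda xs: [x for x in xs if x]),
]
print(run_pipeline(["", "<think></think>ok"], stages))  # ['ok']
```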


Customizing Examples

To adapt any example for your own data:

  1. Replace the dataset: Change dataset_yaml to point to your own YAML file

  2. Set data_root: Set to the base directory for resolving relative image paths in your data

  3. Adjust model config: Update model, api_base, port to match your inference service

  4. Tune parameters: Adjust batch_size, thread_num, filter thresholds, etc.

# Example: use your own data
dataset_yaml = "/path/to/your/dataset.yaml"

dataloader = dict(
    dataset=dataset_yaml,
    data_root="/path/to/your/data/root",   # Base path for image resolution
    batch_size=5000,
    use_image=True,
)