# Examples Guide
DataStudio includes 151 demo data samples (multimodal QA with images) and 5 ready-to-run example configs. You can try the examples immediately after installation — no additional data download required.
- Demo data location: `configs/examples/demo_data/`
- Example configs: `configs/examples/*.py`
## Demo Data
The demo dataset contains 151 multimodal QA samples covering diverse sources (Caption, VQA, OCR, Math, Chart, etc.). Each sample includes:
- Multi-turn conversations in `messages` format (OpenAI-style)
- Image references (265 images included in `demo_data/images/`)
- `<think>` tags in answers (useful for testing normalization)
- `ori_answer` fields in 126 samples (original answers before rewriting)
Data format:

```json
{
  "messages": [
    {"role": "user", "content": "<image>\nWhat color is the phone?"},
    {"role": "assistant", "content": "<think>\n\n</think>\nThe phone is red."}
  ],
  "images": ["images/xxx.jpg"],
  "id": "sample_001",
  "source": "st_vqa"
}
```
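If you want to sanity-check samples programmatically, here is a minimal sketch based on the format above (the `validate_sample` helper and its checks are illustrative, not part of DataStudio):

```python
def validate_sample(sample: dict) -> bool:
    """Check that a sample matches the demo data format sketched above."""
    if not isinstance(sample.get("messages"), list) or not sample["messages"]:
        return False
    # Every conversation turn needs both a "role" and a "content" key.
    if any(set(m) < {"role", "content"} for m in sample["messages"]):
        return False
    # Image references are relative paths (resolved against the data root later).
    return all(isinstance(p, str) for p in sample.get("images", []))

sample = {
    "messages": [
        {"role": "user", "content": "<image>\nWhat color is the phone?"},
        {"role": "assistant", "content": "<think>\n\n</think>\nThe phone is red."},
    ],
    "images": ["images/xxx.jpg"],
    "id": "sample_001",
    "source": "st_vqa",
}
print(validate_sample(sample))  # True
```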
## Example Overview
| # | Config | MLLM Required | Description |
|---|---|---|---|
| 1 | `rule_filter_only.py` | No | Rule filtering (length, repetition, image validation) |
| 2 | `mllm_quality_filter.py` | Yes | Rule filters + MLLM question-image consistency check |
| 3 | `mllm_answer_rewrite.py` | Yes | MLLM answer rewriting (plain text mode) |
| 4 | `honeypipe.py` | Yes | 5-stage end-to-end pipeline (HoneyPipe) |
| 5 | `text_normalization.py` | No | Remove think tags, normalize image tags and prompts |
Recommended order: Start with Example 1 or 5 (no MLLM needed), then try 2–4 if you have a deployed inference service.
## Examples Without MLLM (Run Directly)

### Example 1: Rule-Based Filtering
Filters out samples with abnormal conversation lengths, text length anomalies, or repetitive content. Text-only, no image loading. Inherits rule filters from `_base_/filters/filter_rule_base_for_answer.py`.

```bash
python run.py -c configs/examples/rule_filter_only.py
```
Inherited operators (from `_base_`):

- `ImageSizeFilter`, `ImageAspectRatioFilter`, `ImageExtFilter` (no-ops when `use_image=False`)
- `ConvLengthFilter`: keep conversations with 1–300 turns
- `LengthAnomalyFilter`: reject questions < 2 or > 4096 tokens; reject answers < 1 or > 4096 tokens
- `TextRepeatFilter`: detect repetitive patterns in questions and answers
Output: `./output/examples/rule_filter_only/`
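To make the thresholds concrete, here is a rough sketch of the `LengthAnomalyFilter` logic (illustrative only; it uses naive whitespace tokenization, not DataStudio's actual token counting):

```python
def length_anomaly_ok(question: str, answer: str,
                      q_min: int = 2, q_max: int = 4096,
                      a_min: int = 1, a_max: int = 4096) -> bool:
    """Keep a sample only if question/answer token counts fall inside the
    bounds described above (whitespace tokenization for illustration)."""
    q_tokens = len(question.split())
    a_tokens = len(answer.split())
    return q_min <= q_tokens <= q_max and a_min <= a_tokens <= a_max

print(length_anomaly_ok("What color is the phone?", "Red."))  # True
print(length_anomaly_ok("Hi", "Red."))  # False: question has only 1 token
```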
### Example 5: Text Normalization
Cleans up data formatting without filtering. The demo data contains `<think>...</think>` tags in all 151 answers, making it a good test case.

```bash
python run.py -c configs/examples/text_normalization.py
```
What it does:
- `RemoveThinkRewriter`: remove `<think>...</think>` tags and their content
- `NormImageTagRewriter`: move `<image>` tags to the beginning of questions
- `NormPromptRewriter`: standardize prompt format
- `RemoveReasonRewriter`: remove reasoning prefix content
Output: `./output/examples/text_normalization/`
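The first two rewriters can be approximated with plain regex and string handling (a sketch of the behavior described above, not the actual implementations):

```python
import re

def remove_think(text: str) -> str:
    """Strip <think>...</think> blocks and their content, like RemoveThinkRewriter."""
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

def norm_image_tag(text: str) -> str:
    """Move <image> tags to the start of the question, like NormImageTagRewriter."""
    n = text.count("<image>")
    body = text.replace("<image>", "").strip()
    return "<image>" * n + ("\n" + body if n else body)

print(remove_think("<think>\n\n</think>\nThe phone is red."))  # The phone is red.
print(norm_image_tag("What color is the phone?\n<image>"))
```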
## Examples With MLLM (Require Inference Service)
These examples require an OpenAI-compatible inference service. Deploy one first:
```bash
# vLLM
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-VL-30B-A3B-Instruct \
    --port 8000 --tensor-parallel-size 4

# or SGLang
python -m sglang.launch_server \
    --model Qwen/Qwen3-VL-30B-A3B-Instruct \
    --port 8000 --tp 4
```
Then update the `model` section in the config if your setup differs (model name, port, etc.).
### Example 2: MLLM Quality Filtering
Rule-based filters run first (inherited from `_base_`), then an MLLM checks question-image consistency.

```bash
python run.py -c configs/examples/mllm_quality_filter.py
```
Pipeline:

- `priority=0`: rule filters (from `_base_/filters/filter_rule_base_for_question.py`)
- `priority=10`: MLLM question-image consistency filter
Key config:
```python
qi_consist_request = dict(
    type="RequestBuilder",
    prompt="prompts/filter/question_image_consist_v2.txt",
    key_templates={"result": "q{idx}", "reason": "q{idx}_reason"},
    with_image=True,
    with_question=True,
)
```
Output: `./output/examples/mllm_quality_filter/`
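One way to read `key_templates`: the model is expected to return JSON keyed per question as `q{idx}`, with a matching `q{idx}_reason` field. Assuming that layout (an interpretation for illustration, not documented behavior), parsing the reply might look like:

```python
import json

def parse_consistency_reply(reply: str, n_questions: int) -> list[dict]:
    """Extract per-question pass/fail plus reasons from a JSON reply keyed
    by the q{idx} / q{idx}_reason templates (assumed response layout)."""
    data = json.loads(reply)
    return [
        {"keep": data.get(f"q{i}"), "reason": data.get(f"q{i}_reason")}
        for i in range(n_questions)
    ]

reply = '{"q0": true, "q0_reason": "ok", "q1": false, "q1_reason": "mismatch"}'
print(parse_consistency_reply(reply, 2))
```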
### Example 3: MLLM Answer Rewriting
Sends the original question + image to the MLLM, which generates a new answer. Uses plain text mode (no JSON parsing).

```bash
python run.py -c configs/examples/mllm_answer_rewrite.py
```
Key config:
```python
model = dict(
    return_dict=False,  # Plain text output; model response is the new answer
)

rewriter_request = dict(
    type="RequestBuilder",
    prompt=None,          # No template, send original question directly
    key_templates=None,   # Plain text mode
    with_image=True,
    with_question=True,
)
```
After rewriting, the original answer is saved in the `ori_answer` field.
Output: `./output/examples/mllm_answer_rewrite/`
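A sketch of that bookkeeping: the raw model reply becomes the new answer and the old one moves to `ori_answer` (simplified; the exact field handling is an assumption):

```python
def apply_rewrite(sample: dict, model_reply: str) -> dict:
    """Replace the last assistant turn with the model's plain-text reply,
    keeping the original answer under ori_answer."""
    out = dict(sample)
    out["ori_answer"] = sample["messages"][-1]["content"]
    out["messages"] = sample["messages"][:-1] + [
        {"role": "assistant", "content": model_reply}
    ]
    return out

sample = {"messages": [
    {"role": "user", "content": "<image>\nWhat color is the phone?"},
    {"role": "assistant", "content": "Red."},
]}
rewritten = apply_rewrite(sample, "The phone in the image is red.")
print(rewritten["ori_answer"])  # Red.
```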
### Example 4: HoneyPipe Full Pipeline (5 Stages)
End-to-end processing from raw data to clean training data. All configuration is expanded inline, serving as a reference for understanding every available option.

```bash
python run.py -c configs/examples/honeypipe.py
```
Execution order:
| Priority | Stage | Description |
|---|---|---|
| 0 | Rule Filters | ConvLength, ImageSize, AspectRatio, ImageExt, LengthAnomaly, TextRepeat |
| 10 | MLLM Filter | Question-image consistency check |
| 20 | MLLM Rewriter | Regenerate answers (plain text mode) |
| 30 | Consistency Check | Compare old vs new answers |
| 40 | Normalization | Remove think tags, normalize image tags |
Output: `./output/examples/honeypipe/`
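The Priority column determines execution order: lower priorities run first. A toy scheduler illustrating the idea (hypothetical operator interface, not DataStudio's API):

```python
def run_pipeline(samples: list, operators: list) -> list:
    """Apply (priority, operator) pairs in ascending priority order; each
    operator maps a list of samples to a (possibly smaller) list."""
    for _, op in sorted(operators, key=lambda pair: pair[0]):
        samples = op(samples)
    return samples

# Toy stages standing in for the real filters/rewriters.
operators = [
    (40, lambda xs: [x.replace("<think></think>", "") for x in xs]),  # normalization
    (0, lambda xs: [x for x in xs if x]),                             # rule filter
]
print(run_pipeline(["", "<think></think>ok"], operators))  # ['ok']
```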
## Customizing Examples
To adapt any example for your own data:
1. **Replace the dataset**: change `dataset_yaml` to point to your own YAML file.
2. **Set `data_root`**: set to the base directory for resolving relative image paths in your data.
3. **Adjust model config**: update `model`, `api_base`, `port` to match your inference service.
4. **Tune parameters**: adjust `batch_size`, `thread_num`, filter thresholds, etc.
```python
# Example: use your own data
dataset_yaml = "/path/to/your/dataset.yaml"

dataloader = dict(
    dataset=dataset_yaml,
    data_root="/path/to/your/data/root",  # Base path for image resolution
    batch_size=5000,
    use_image=True,
)
```