# Troubleshooting Guide
This guide helps you diagnose and fix common issues with DataStudio.
## Installation Issues

### ImportError: No module named 'datastudio'

**Symptom:**

```python
>>> from datastudio.operators import Pipeline
ImportError: No module named 'datastudio'
```

**Solution:**

Install DataStudio in development mode:

```bash
cd /path/to/DataStudio
pip install -e .
```

Or ensure you're in the correct virtual environment:

```bash
source venv/bin/activate      # Linux/Mac
# or: venv\Scripts\activate   # Windows
```
### Missing Dependencies

**Symptom:**

```text
ModuleNotFoundError: No module named 'lmdb'
```

**Solution:**

```bash
pip install -r requirements.txt
```

For specific packages:

```bash
pip install lmdb pillow tqdm requests
```
### Python Version Incompatibility

**Symptom:**

```text
SyntaxError: invalid syntax
# or
TypeError: 'type' object is not subscriptable
```

**Solution:**

DataStudio requires Python 3.10+. Check your version:

```bash
python --version
```

Upgrade if needed:

```bash
# Using pyenv
pyenv install 3.10.12
pyenv local 3.10.12
```
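For scripts that may be launched with an arbitrary interpreter, it can also help to fail fast at startup. A minimal sketch (the 3.10 floor comes from the requirement above; the function name is illustrative, not part of DataStudio):

```python
import sys

def check_python_version(min_version=(3, 10)):
    """Raise early if the running interpreter is older than min_version."""
    if sys.version_info < min_version:
        raise RuntimeError(
            f"Python {min_version[0]}.{min_version[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
```

Calling this at the top of your entry point turns a cryptic `SyntaxError` deep in the library into a clear message.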
## Data Loading Issues

### FileNotFoundError for Images

**Symptom:**

```text
FileNotFoundError: [Errno 2] No such file or directory: 'images/sample.jpg'
```

**Causes & Solutions:**

- **Relative paths:** Use absolute paths in your data, or set `data_root`:

  ```yaml
  # In dataset YAML
  data_root: /absolute/path/to/dataset
  ```

- **Path format:** Ensure paths match your OS (forward slashes vs. backslashes)
- **Missing files:** Verify that the images exist at the specified paths
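To locate broken paths up front rather than mid-run, you can scan the dataset first. A sketch, assuming each JSONL record lists its image files under an `images` key — adjust the key name to match your schema:

```python
import json
import os

def find_missing_images(jsonl_path, data_root=""):
    """Return (line_number, path) pairs for image files that don't exist.

    Assumes each JSONL record stores image paths under an 'images' key;
    change the key to match your dataset's actual structure.
    """
    missing = []
    with open(jsonl_path) as f:
        for i, line in enumerate(f, 1):
            record = json.loads(line)
            for rel_path in record.get("images", []):
                full_path = os.path.join(data_root, rel_path)
                if not os.path.exists(full_path):
                    missing.append((i, full_path))
    return missing
```

Run it with the same `data_root` you set in the dataset YAML so the resolved paths match what the pipeline will see.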
### Invalid JSON Format

**Symptom:**

```text
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes
```

**Solution:**

Check your JSON/JSONL format:

```bash
# Validate JSON
python -m json.tool your_data.json

# For JSONL, check each line
python -c "
import json
with open('data.jsonl') as f:
    for i, line in enumerate(f, 1):
        try:
            json.loads(line)
        except json.JSONDecodeError as e:
            print(f'Error on line {i}: {e}')
"
```
### LMDB Cache Issues

**Symptom:**

```text
lmdb.Error: /path/to/cache: No such file or directory
```

**Solution:**

Ensure the parent directory exists:

```bash
mkdir -p ~/cache/images_lmdb
```

**Symptom:**

```text
lmdb.MapFullError: mdb_put: MDB_MAP_FULL: Environment mapsize limit reached
```

**Solution:**

Your LMDB database is full. Delete and recreate it:

```bash
rm -rf ~/cache/images_lmdb
python run.py -c config.py --cache-images
```

Or use a different cache directory:

```python
dataloader = dict(
    cache_dir="~/cache/images_lmdb_v2",  # New location
    # ...
)
```
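Before deleting a cache, it can be worth confirming how large it has actually grown. A small stdlib-only sketch (the helper name is illustrative):

```python
import os

def dir_size_bytes(path):
    """Total on-disk size of all files under path, in bytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total

# e.g. dir_size_bytes(os.path.expanduser("~/cache/images_lmdb")) / 2**30  # GiB
```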
## Pipeline Execution Issues

### No Output / All Samples Filtered

**Symptom:**

All samples are filtered out and the output is empty.

**Diagnosis:**

Check why samples were filtered:

```python
from collections import Counter

kept, filtered = pipeline(data)
print(f"Kept: {len(kept)}, Filtered: {len(filtered)}")

# Examine filter reasons
reasons = []
for item in filtered:
    if 'filter_ops' in item:
        for op, qa_reasons in item['filter_ops'].items():
            for qa_idx, reason in qa_reasons.items():
                reasons.append(f"{op}: {reason}")
for reason, count in Counter(reasons).most_common(10):
    print(f"  {count:5d}x {reason}")
```

**Common causes:**

- **Filters too strict:** Relax the filter parameters
- **Data format mismatch:** Check the `conversations` structure
- **Missing images:** Verify the image paths
### Pipeline Hangs

**Symptom:**

The pipeline stops responding, with no progress for an extended time.

**Causes & Solutions:**

- **MLLM API timeout:** Reduce `thread_num` or increase `timeout`:

  ```python
  model = dict(
      thread_num=64,       # Reduce from 512
      timeout=(60, 3600),  # Increase timeout
  )
  ```

- **Deadlock in threading:** Restart and reduce parallelism
- **Memory exhaustion:** Reduce `batch_size`:

  ```python
  dataloader = dict(
      batch_size=1000,  # Reduce from 10000
  )
  ```
### Out of Memory (OOM)

**Symptom:**

```text
MemoryError
# or
Killed  (process terminated by the OS)
```

**Solutions:**

Reduce the batch size:

```python
dataloader = dict(batch_size=1000)
```

Resize images:

```python
dataloader = dict(resize_image_size=1024)  # Smaller images
```

Disable image loading if it isn't needed:

```python
dataloader = dict(use_image=False)
```
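When tuning `batch_size`, a back-of-envelope estimate of decoded-image memory can save some trial and error. A sketch, assuming uncompressed RGB images (the function name and default sizes are illustrative):

```python
def estimate_batch_memory_gib(batch_size, images_per_sample=1,
                              height=1024, width=1024, channels=3):
    """Rough floor on decoded-image memory for one batch, in GiB.

    A decoded RGB image costs height * width * channels bytes; this
    ignores JSON text and framework overhead, so real usage is higher.
    """
    total_bytes = batch_size * images_per_sample * height * width * channels
    return total_bytes / 2**30

# A batch of 10000 samples with one 1024x1024 RGB image each needs
# at least ~29.3 GiB just for the decoded pixels.
```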
## MLLM Issues

### API Connection Errors

**Symptom:**

```text
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8000)
```

**Solutions:**

1. Check that the API server is running:

   ```bash
   curl http://localhost:8000/v1/models
   ```

2. Verify the API base URL in your config:

   ```python
   model = dict(
       api_base="http://localhost:8000/v1",  # Include /v1
   )
   ```

3. Check firewall/network settings
### Rate Limiting

**Symptom:**

```text
HTTP 429: Too Many Requests
# or
Rate limit exceeded
```

**Solution:**

Reduce concurrent requests:

```python
model = dict(
    thread_num=50,  # Reduce from a higher value
    retry=10,       # Add retries for transient errors
)
```
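If the config-level `retry` is not enough, individual API calls can be wrapped with exponential backoff and jitter. A generic sketch, not part of the DataStudio API — replace the broad `except Exception` with the specific rate-limit error your HTTP client raises:

```python
import random
import time

def call_with_backoff(fn, retries=10, base_delay=1.0, max_delay=60.0):
    """Retry fn() with exponential backoff and jitter.

    Catches all exceptions for illustration; narrow this to the
    rate-limit error (HTTP 429) your client actually raises.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * (0.5 + random.random() / 2))  # add jitter
```

Jitter spreads retries out so many worker threads do not hammer the server in lockstep after a shared rate-limit window resets.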
### Invalid API Response

**Symptom:**

```text
json.decoder.JSONDecodeError: Expecting value
```

**Diagnosis:**

Check the raw API response:

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"model": "test", "messages": [{"role": "user", "content": "hi"}]},
)
print(response.status_code)
print(response.text)
```

**Common causes:**

- Wrong API endpoint
- Model not loaded on the server
- Invalid request format
## Checkpoint & Resume Issues

### Checkpoint Not Working

**Symptom:**

Re-running doesn't resume from the checkpoint.

**Diagnosis:**

Check whether the checkpoint file exists:

```bash
ls -la work_dirs/my_experiment/checkpoint.json
```

**Solutions:**

- **Ensure the same work_dir:** The config must specify the same `work_dir`
- **Check the checkpoint content:**

  ```python
  import json
  with open("work_dirs/my_experiment/checkpoint.json") as f:
      print(json.dumps(json.load(f), indent=2))
  ```

- **Manual checkpoint reset:** Delete the checkpoint to restart:

  ```bash
  rm work_dirs/my_experiment/checkpoint.json
  ```
### Corrupted Checkpoint

**Symptom:**

`json.decoder.JSONDecodeError` when loading the checkpoint.

**Solution:**

Delete the corrupted checkpoint and restart:

```bash
rm work_dirs/my_experiment/checkpoint.json
python run.py -c config.py  # Restart from the beginning
```
## Output Issues

### Output Format Changed

**Symptom:**

The output data has unexpected fields or structure.

**Explanation:**

DataStudio adds metadata fields:

- `filter_ops`: Which filters were applied and why
- `rewrite_ops`: Which rewriters modified the content
- `ori_answer`: The original answer before rewriting
- `rejected`: Whether the sample was filtered out

To get clean output, post-process:

```python
def clean_output(item):
    """Remove DataStudio metadata from an output item."""
    clean = item.copy()
    for key in ['filter_ops', 'rewrite_ops', 'ori_answer', 'rejected', 'source_file']:
        clean.pop(key, None)
    return clean
```
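To apply the same stripping to an entire output file, a file-level sketch (the key list mirrors the metadata fields described above; `clean_jsonl` is an illustrative name, not a DataStudio function):

```python
import json

# Metadata fields added by DataStudio, per the list above
METADATA_KEYS = ('filter_ops', 'rewrite_ops', 'ori_answer', 'rejected', 'source_file')

def clean_jsonl(in_path, out_path):
    """Strip DataStudio metadata fields from every record in a JSONL file."""
    with open(in_path) as fin, open(out_path, 'w') as fout:
        for line in fin:
            item = json.loads(line)
            for key in METADATA_KEYS:
                item.pop(key, None)
            fout.write(json.dumps(item, ensure_ascii=False) + '\n')
```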
### Duplicate Samples in Output

**Symptom:**

The same sample appears multiple times in the output.

**Possible causes:**

- **Re-running without clearing the output:** Output is appended
- **Input has duplicates:** Check the input data

**Solution:**

Clear the output directory before re-running:

```bash
rm -rf output/my_dataset/*
python run.py -c config.py
```
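If the input itself contains duplicates, an exact-duplicate pass over the records is a quick check. A sketch that hashes a canonical JSON serialization (dedupe on a single field such as `id` instead, if that better defines "same sample" for your data):

```python
import hashlib
import json

def dedupe_records(records):
    """Drop exact-duplicate records, keeping the first occurrence."""
    seen = set()
    unique = []
    for item in records:
        # Canonical serialization so key order doesn't affect the hash
        key = hashlib.sha256(
            json.dumps(item, sort_keys=True).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```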
## Performance Issues

### Slow Image Loading

**Symptom:**

The pipeline is slow, mostly waiting on image loading.

**Solution:**

Pre-cache images to LMDB:

```bash
python run.py -c config.py --cache-images
```

This only needs to be done once per dataset.
### MLLM Throughput Low

**Symptom:**

MLLM operations are slower than expected.

**Diagnosis:**

Measure the actual throughput:

```python
import time

start = time.time()
kept, filtered = pipeline(data[:100])
elapsed = time.time() - start
print(f"Throughput: {100 / elapsed:.2f} samples/sec")
```

**Solutions:**

- Increase `thread_num` (if the API allows it):

  ```python
  model = dict(thread_num=1024)
  ```

- Use a local model instead of an API
- Optimize prompts (shorter prompts are faster)
## Getting More Help

If you can't resolve your issue:

1. **Check GitHub Issues:** github.com/Open-Bee/DataStudio/issues
2. **Enable debug logging:**

   ```python
   import logging
   logging.basicConfig(level=logging.DEBUG)
   ```

3. **Collect diagnostic info:**

   ```bash
   python -c "
   import sys
   import datastudio
   print(f'Python: {sys.version}')
   print(f'DataStudio: {datastudio.__version__}')
   "
   ```

4. **Open a new issue** with:
   - DataStudio version
   - Python version and OS
   - Full error traceback
   - Minimal config to reproduce
   - Sample data (if possible)
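A minimal reproduction config might look like the sketch below. The field names follow the examples earlier in this guide, and the values are illustrative; trim it to the smallest pipeline that still triggers the error:

```python
# minimal_repro.py - smallest config that still shows the bug (illustrative)
dataloader = dict(
    batch_size=100,    # small batch so the repro runs quickly
    use_image=False,   # drop images unless they are part of the bug
)
model = dict(
    api_base="http://localhost:8000/v1",
    thread_num=1,      # single-threaded for simpler tracebacks
)
```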
## See Also

- Frequently Asked Questions
- Quick Start - Getting started guide