Troubleshooting Guide

This guide helps you diagnose and fix common issues with DataStudio.

Installation Issues

ImportError: No module named 'datastudio'

Symptom:

>>> from datastudio.operators import Pipeline
ImportError: No module named 'datastudio'

Solution:

Install DataStudio in development mode:

cd /path/to/DataStudio
pip install -e .

Or ensure you’re in the correct virtual environment:

source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

Missing Dependencies

Symptom:

ModuleNotFoundError: No module named 'lmdb'

Solution:

pip install -r requirements.txt

For specific packages:

pip install lmdb pillow tqdm requests

Python Version Incompatibility

Symptom:

SyntaxError: invalid syntax
# or
TypeError: 'type' object is not subscriptable

Solution:

DataStudio requires Python 3.10+. Check your version:

python --version

Upgrade if needed:

# Using pyenv
pyenv install 3.10.12
pyenv local 3.10.12

Data Loading Issues

FileNotFoundError for Images

Symptom:

FileNotFoundError: [Errno 2] No such file or directory: 'images/sample.jpg'

Causes & Solutions:

  1. Relative paths: Use absolute paths in your data or set data_root

    # In dataset YAML
    data_root: /absolute/path/to/dataset
    
  2. Path format: Ensure paths match your OS (forward vs backslashes)

  3. Missing files: Verify images exist at specified paths
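A quick way to apply check 3 across a whole dataset is to scan for missing files up front, before running the pipeline. A minimal sketch — the `image` key and list-of-dicts layout are assumptions, so adjust them for your actual schema:

```python
import os

def find_missing_images(records, data_root=""):
    """Return full paths from `records` whose image file does not exist.

    Assumes each record stores a relative or absolute path under an
    'image' key; records without that key are skipped.
    """
    missing = []
    for rec in records:
        path = rec.get("image")
        if path is None:
            continue
        full = os.path.join(data_root, path)
        if not os.path.isfile(full):
            missing.append(full)
    return missing
```

Run it once over your loaded data and fix or drop the reported paths before launching a long job.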

Invalid JSON Format

Symptom:

json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes

Solution:

Check your JSON/JSONL format:

# Validate JSON
python -m json.tool your_data.json

# For JSONL, check each line
python -c "
import json
with open('data.jsonl') as f:
    for i, line in enumerate(f, 1):
        try:
            json.loads(line)
        except json.JSONDecodeError as e:
            print(f'Error on line {i}: {e}')
"

LMDB Cache Issues

Symptom:

lmdb.Error: /path/to/cache: No such file or directory

Solution:

Ensure the parent directory exists:

mkdir -p ~/cache/images_lmdb

Symptom:

lmdb.MapFullError: mdb_put: MDB_MAP_FULL: Environment mapsize limit reached

Solution:

The LMDB cache has reached its configured map size. Delete it and rebuild:

rm -rf ~/cache/images_lmdb
python run.py -c config.py --cache-images

Or use a different cache directory:

dataloader = dict(
    cache_dir="~/cache/images_lmdb_v2",  # New location
    # ...
)

Pipeline Execution Issues

No Output / All Samples Filtered

Symptom:

All samples are filtered out, output is empty.

Diagnosis:

Check why samples were filtered:

kept, filtered = pipeline(data)

print(f"Kept: {len(kept)}, Filtered: {len(filtered)}")

# Examine filter reasons
from collections import Counter
reasons = []
for item in filtered:
    if 'filter_ops' in item:
        for op, qa_reasons in item['filter_ops'].items():
            for qa_idx, reason in qa_reasons.items():
                reasons.append(f"{op}: {reason}")

for reason, count in Counter(reasons).most_common(10):
    print(f"  {count:5d}x {reason}")

Common causes:

  1. Filters too strict: Relax filter parameters

  2. Data format mismatch: Check conversations structure

  3. Missing images: Verify image paths
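For cause 2, a rough structural check can flag malformed records before they reach the filters. A sketch assuming the common ShareGPT-style layout (a `conversations` list of dicts with `from`/`value` keys) — adjust the field names for your actual format:

```python
def check_conversations(item):
    """Return a list of structural problems in one record's 'conversations'."""
    problems = []
    conv = item.get("conversations")
    if not isinstance(conv, list) or not conv:
        problems.append("missing or empty 'conversations'")
        return problems
    for i, turn in enumerate(conv):
        if not isinstance(turn, dict):
            problems.append(f"turn {i} is not a dict")
        elif "from" not in turn or "value" not in turn:
            problems.append(f"turn {i} missing 'from' or 'value'")
    return problems
```

Records that come back with a non-empty problem list are likely being rejected for format reasons rather than content quality.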

Pipeline Hangs

Symptom:

Pipeline stops responding, with no progress for an extended time.

Causes & Solutions:

  1. MLLM API timeout: Reduce thread_num or increase timeout

    model = dict(
        thread_num=64,  # Reduce from 512
        timeout=(60, 3600),  # Increase timeout
    )
    
  2. Deadlock in threading: Restart and reduce parallelism

  3. Memory exhaustion: Reduce batch_size

    dataloader = dict(
        batch_size=1000,  # Reduce from 10000
    )
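
To distinguish a true deadlock from slow progress, you can run the pipeline on a small slice under a hard timeout. A diagnostic sketch — note the worker thread cannot actually be killed, so this is for probing, not production use:

```python
import concurrent.futures

def run_with_timeout(fn, args=(), timeout=300):
    """Run fn(*args) in a worker thread; raise TimeoutError if it exceeds `timeout`."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout)
    finally:
        # Don't block waiting on a possibly hung worker.
        executor.shutdown(wait=False)
```

For example, `run_with_timeout(pipeline, args=(data[:10],), timeout=120)` — if a 10-sample slice that should finish in seconds times out, suspect a deadlock rather than slow I/O.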
    

Out of Memory (OOM)

Symptom:

MemoryError
# or
Killed (process terminated by the OS)

Solutions:

  1. Reduce batch size:

    dataloader = dict(batch_size=1000)
    
  2. Resize images:

    dataloader = dict(resize_image_size=1024)  # Smaller images
    
  3. Disable image loading if not needed:

    dataloader = dict(use_image=False)
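
If you drive the pipeline from your own script, you can also bound peak memory by processing in fixed-size chunks. A sketch, assuming the `kept, filtered = pipeline(batch)` call convention used elsewhere in this guide:

```python
def run_in_chunks(pipeline, data, chunk_size=1000):
    """Process `data` in fixed-size chunks to bound peak memory usage.

    `pipeline` is any callable returning a (kept, filtered) pair of lists.
    """
    kept_all, filtered_all = [], []
    for i in range(0, len(data), chunk_size):
        kept, filtered = pipeline(data[i:i + chunk_size])
        kept_all.extend(kept)
        filtered_all.extend(filtered)
    return kept_all, filtered_all
```

Writing each chunk's results to disk as you go (instead of accumulating them) lowers the footprint further.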
    

MLLM Issues

API Connection Errors

Symptom:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8000)

Solutions:

  1. Check API server is running:

    curl http://localhost:8000/v1/models
    
  2. Verify API base URL in config:

    model = dict(
        api_base="http://localhost:8000/v1",  # Include /v1
    )
    
  3. Check firewall/network settings
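
The same health check from step 1 can be done from Python, which also surfaces the exact exception. A sketch — the endpoint path follows the OpenAI-compatible convention, so adjust it if your server differs:

```python
import requests

def check_api(api_base="http://localhost:8000/v1", timeout=5):
    """Return True if the server answers GET {api_base}/models, else print why."""
    try:
        resp = requests.get(f"{api_base}/models", timeout=timeout)
        print(f"HTTP {resp.status_code}")
        return resp.ok
    except requests.exceptions.RequestException as exc:
        print(f"Cannot reach {api_base}: {exc}")
        return False
```

A `ConnectionError` here means the server is unreachable (not running, wrong host/port, or blocked); a non-200 status means it is reachable but misconfigured.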

Rate Limiting

Symptom:

HTTP 429: Too Many Requests
# or
Rate limit exceeded

Solution:

Reduce concurrent requests:

model = dict(
    thread_num=50,  # Reduce from higher value
    retry=10,       # Add retries for transient errors
)
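
If you still hit 429s, client-side exponential backoff helps absorb bursts. A generic sketch — this is not a DataStudio API; `call` is any zero-argument function that raises on a rate-limit response:

```python
import random
import time

def with_backoff(call, max_retries=10, base_delay=1.0, max_delay=60.0):
    """Retry call() with exponential backoff plus jitter until it succeeds."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay))
```

The jitter spreads retries from concurrent threads so they do not all hammer the API at the same instant.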

Invalid API Response

Symptom:

json.decoder.JSONDecodeError: Expecting value

Diagnosis:

Check the raw API response:

import requests
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"model": "test", "messages": [{"role": "user", "content": "hi"}]}
)
print(response.status_code)
print(response.text)

Common causes:

  1. Wrong API endpoint

  2. Model not loaded on server

  3. Invalid request format

Checkpoint & Resume Issues

Checkpoint Not Working

Symptom:

Re-running doesn’t resume from checkpoint.

Diagnosis:

Check if checkpoint file exists:

ls -la work_dirs/my_experiment/checkpoint.json

Solutions:

  1. Ensure same work_dir: Config must specify the same work_dir

  2. Check checkpoint content:

    import json
    with open("work_dirs/my_experiment/checkpoint.json") as f:
        print(json.dumps(json.load(f), indent=2))
    
  3. Manual checkpoint reset: Delete checkpoint to restart:

    rm work_dirs/my_experiment/checkpoint.json
    

Corrupted Checkpoint

Symptom:

json.decoder.JSONDecodeError when loading checkpoint

Solution:

Delete the corrupted checkpoint and restart:

rm work_dirs/my_experiment/checkpoint.json
python run.py -c config.py  # Restart from beginning
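
To avoid deleting a checkpoint that is actually fine, you can verify it before removing it. A small sketch:

```python
import json
import os

def reset_if_corrupt(path):
    """Delete the file at `path` only if it is not valid JSON.

    Returns True if a corrupted checkpoint was removed.
    """
    try:
        with open(path) as f:
            json.load(f)
        return False          # valid checkpoint, keep it
    except FileNotFoundError:
        return False          # nothing to reset
    except json.JSONDecodeError:
        os.remove(path)
        return True
```

This makes the reset safe to run unconditionally before every launch.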

Output Issues

Output Format Changed

Symptom:

Output data has unexpected fields or structure.

Explanation:

DataStudio adds metadata fields:

  • filter_ops: Which filters applied and why

  • rewrite_ops: Which rewriters modified the content

  • ori_answer: Original answer before rewriting

  • rejected: Whether sample was filtered out

To get clean output, post-process:

def clean_output(item):
    """Remove DataStudio metadata from output."""
    clean = item.copy()
    for key in ['filter_ops', 'rewrite_ops', 'ori_answer', 'rejected', 'source_file']:
        clean.pop(key, None)
    return clean

Duplicate Samples in Output

Symptom:

Same sample appears multiple times in output.

Possible causes:

  1. Re-running without clearing output: Output is appended

  2. Input has duplicates: Check input data

Solution:

Clear output directory before re-running:

rm -rf output/my_dataset/*
python run.py -c config.py
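
If duplicates are already baked into an output file, they can also be dropped after the fact. A sketch that keys each record on its canonical JSON, ignoring `source_file` since DataStudio adds it and it may differ between runs — adjust the ignored keys for your metadata:

```python
import json

def dedupe(items):
    """Drop exact duplicates, keeping the first occurrence of each record.

    Assumes records are JSON-serializable dicts.
    """
    seen = set()
    unique = []
    for item in items:
        key = json.dumps(
            {k: v for k, v in item.items() if k != "source_file"},
            sort_keys=True,
        )
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

If your input data itself has duplicates, run this on the input instead so you do not pay pipeline cost twice for the same sample.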

Performance Issues

Slow Image Loading

Symptom:

Pipeline is slow, mostly waiting on image loading.

Solution:

Pre-cache images to LMDB:

python run.py -c config.py --cache-images

This only needs to be done once per dataset.

MLLM Throughput Low

Symptom:

MLLM operations are slower than expected.

Diagnosis:

Calculate actual throughput:

import time

start = time.time()
kept, filtered = pipeline(data[:100])
elapsed = time.time() - start

print(f"Throughput: {100/elapsed:.2f} samples/sec")

Solutions:

  1. Increase thread_num (if API allows):

    model = dict(thread_num=1024)
    
  2. Use local model instead of API

  3. Optimize prompts (shorter prompts = faster)

Getting More Help

If you can’t resolve your issue:

  1. Check GitHub Issues: github.com/Open-Bee/DataStudio/issues

  2. Enable debug logging:

    import logging
    logging.basicConfig(level=logging.DEBUG)
    
  3. Collect diagnostic info:

    python -c "
    import sys
    import datastudio
    print(f'Python: {sys.version}')
    print(f'DataStudio: {datastudio.__version__}')
    "
    
  4. Open a new issue with:

    • DataStudio version

    • Python version and OS

    • Full error traceback

    • Minimal config to reproduce

    • Sample data (if possible)

See Also