Troubleshooting Guide

This guide helps you diagnose and fix common issues with DataStudio.

Installation Issues

ImportError: No module named 'datastudio'

Symptom:

>>> from datastudio.operators import Pipeline
ImportError: No module named 'datastudio'

Solution:

Install DataStudio in development mode:

cd /path/to/DataStudio
pip install -e .

Or ensure you’re in the correct virtual environment:

source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

Missing Dependencies

Symptom:

ModuleNotFoundError: No module named 'lmdb'

Solution:

pip install -r requirements.txt

For specific packages:

pip install lmdb pillow tqdm requests

Python Version Incompatibility

Symptom:

SyntaxError: invalid syntax
# or
TypeError: 'type' object is not subscriptable

Solution:

DataStudio requires Python 3.10+. Check your version:

python --version

Upgrade if needed:

# Using pyenv
pyenv install 3.10.12
pyenv local 3.10.12

Data Loading Issues

FileNotFoundError for Images

Symptom:

FileNotFoundError: [Errno 2] No such file or directory: 'images/sample.jpg'

Causes & Solutions:

  1. Relative paths: Use absolute paths in your data or set data_root

    # In dataset YAML
    data_root: /absolute/path/to/dataset
    
  2. Path format: Ensure paths match your OS (forward vs backslashes)

  3. Missing files: Verify images exist at specified paths
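A quick way to apply check 3 across a whole dataset is to scan for missing files up front, before running the pipeline. A minimal sketch — the `image` key and list-of-dicts layout are assumptions, so adjust them for your actual schema:

```python
import os

def find_missing_images(records, data_root=""):
    """Return full paths from `records` whose image file does not exist.

    Assumes each record stores a relative or absolute path under an
    'image' key; records without that key are skipped.
    """
    missing = []
    for rec in records:
        path = rec.get("image")
        if path is None:
            continue
        full = os.path.join(data_root, path)
        if not os.path.isfile(full):
            missing.append(full)
    return missing
```

Run it once over your loaded data and fix or drop the reported paths before launching a long job.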

Invalid JSON Format

Symptom:

json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes

Solution:

Check your JSON/JSONL format:

# Validate JSON
python -m json.tool your_data.json

# For JSONL, check each line
python -c "
import json
with open('data.jsonl') as f:
    for i, line in enumerate(f, 1):
        try:
            json.loads(line)
        except json.JSONDecodeError as e:
            print(f'Error on line {i}: {e}')
"

LMDB Cache Issues

Symptom:

lmdb.Error: /path/to/cache: No such file or directory

Solution:

Ensure the parent directory exists:

mkdir -p ~/cache/images_lmdb

Symptom:

lmdb.MapFullError: mdb_put: MDB_MAP_FULL: Environment mapsize limit reached

Solution:

The LMDB cache has reached its configured map size. Delete it and rebuild:

rm -rf ~/cache/images_lmdb
python run.py -c config.py --cache-images

Or use a different cache directory:

dataloader = dict(
    cache_dir="~/cache/images_lmdb_v2",  # New location
    # ...
)

Pipeline Execution Issues

No Output / All Samples Filtered

Symptom:

All samples are filtered out, output is empty.

Diagnosis:

Check why samples were filtered:

kept, filtered = pipeline(data)

print(f"Kept: {len(kept)}, Filtered: {len(filtered)}")

# Examine filter reasons
from collections import Counter
reasons = []
for item in filtered:
    if 'filter_ops' in item:
        for op, qa_reasons in item['filter_ops'].items():
            for qa_idx, reason in qa_reasons.items():
                reasons.append(f"{op}: {reason}")

for reason, count in Counter(reasons).most_common(10):
    print(f"  {count:5d}x {reason}")

Common causes:

  1. Filters too strict: Relax filter parameters

  2. Data format mismatch: Check conversations structure

  3. Missing images: Verify image paths
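For cause 2, a rough structural check can flag malformed records before they reach the filters. A sketch assuming the common ShareGPT-style layout (a `conversations` list of dicts with `from`/`value` keys) — adjust the field names for your actual format:

```python
def check_conversations(item):
    """Return a list of structural problems in one record's 'conversations'."""
    problems = []
    conv = item.get("conversations")
    if not isinstance(conv, list) or not conv:
        problems.append("missing or empty 'conversations'")
        return problems
    for i, turn in enumerate(conv):
        if not isinstance(turn, dict):
            problems.append(f"turn {i} is not a dict")
        elif "from" not in turn or "value" not in turn:
            problems.append(f"turn {i} missing 'from' or 'value'")
    return problems
```

Records that come back with a non-empty problem list are likely being rejected for format reasons rather than content quality.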

Pipeline Hangs

Symptom:

Pipeline stops responding, with no progress for an extended time.

Causes & Solutions:

  1. MLLM API timeout: Reduce thread_num or increase timeout

    model = dict(
        thread_num=64,  # Reduce from 512
        timeout=(60, 3600),  # Increase timeout
    )
    
  2. Deadlock in threading: Restart and reduce parallelism

  3. Memory exhaustion: Reduce batch_size

    dataloader = dict(
        batch_size=1000,  # Reduce from 10000
    )
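
To distinguish a true deadlock from slow progress, you can run the pipeline on a small slice under a hard timeout. A diagnostic sketch — note the worker thread cannot actually be killed, so this is for probing, not production use:

```python
import concurrent.futures

def run_with_timeout(fn, args=(), timeout=300):
    """Run fn(*args) in a worker thread; raise TimeoutError if it exceeds `timeout`."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout)
    finally:
        # Don't block waiting on a possibly hung worker.
        executor.shutdown(wait=False)
```

For example, `run_with_timeout(pipeline, args=(data[:10],), timeout=120)` — if a 10-sample slice that should finish in seconds times out, suspect a deadlock rather than slow I/O.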
    

Out of Memory (OOM)

Symptom:

MemoryError
# or
Killed (process terminated by the OS)

Solutions:

  1. Reduce batch size:

    dataloader = dict(batch_size=1000)
    
  2. Resize images:

    dataloader = dict(resize_image_size=1024)  # Smaller images
    
  3. Disable image loading if not needed:

    dataloader = dict(use_image=False)
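
If you drive the pipeline from your own script, you can also bound peak memory by processing in fixed-size chunks. A sketch, assuming the `kept, filtered = pipeline(batch)` call convention used elsewhere in this guide:

```python
def run_in_chunks(pipeline, data, chunk_size=1000):
    """Process `data` in fixed-size chunks to bound peak memory usage.

    `pipeline` is any callable returning a (kept, filtered) pair of lists.
    """
    kept_all, filtered_all = [], []
    for i in range(0, len(data), chunk_size):
        kept, filtered = pipeline(data[i:i + chunk_size])
        kept_all.extend(kept)
        filtered_all.extend(filtered)
    return kept_all, filtered_all
```

Writing each chunk's results to disk as you go (instead of accumulating them) lowers the footprint further.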
    

MLLM Issues

API Connection Errors

Symptom:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8000)

Solutions:

  1. Check API server is running:

    curl http://localhost:8000/v1/models
    
  2. Verify API base URL in config:

    model = dict(
        api_base="http://localhost:8000/v1",  # Include /v1
    )
    
  3. Check firewall/network settings
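
The same health check from step 1 can be done from Python, which also surfaces the exact exception. A sketch — the endpoint path follows the OpenAI-compatible convention, so adjust it if your server differs:

```python
import requests

def check_api(api_base="http://localhost:8000/v1", timeout=5):
    """Return True if the server answers GET {api_base}/models, else print why."""
    try:
        resp = requests.get(f"{api_base}/models", timeout=timeout)
        print(f"HTTP {resp.status_code}")
        return resp.ok
    except requests.exceptions.RequestException as exc:
        print(f"Cannot reach {api_base}: {exc}")
        return False
```

A `ConnectionError` here means the server is unreachable (not running, wrong host/port, or blocked); a non-200 status means it is reachable but misconfigured.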

Rate Limiting

Symptom:

HTTP 429: Too Many Requests
# or
Rate limit exceeded

Solution:

Reduce concurrent requests:

model = dict(
    thread_num=50,  # Reduce from higher value
    retry=10,       # Add retries for transient errors
)
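
If you still hit 429s, client-side exponential backoff helps absorb bursts. A generic sketch — this is not a DataStudio API; `call` is any zero-argument function that raises on a rate-limit response:

```python
import random
import time

def with_backoff(call, max_retries=10, base_delay=1.0, max_delay=60.0):
    """Retry call() with exponential backoff plus jitter until it succeeds."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay))
```

The jitter spreads retries from concurrent threads so they do not all hammer the API at the same instant.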

Invalid API Response

Symptom:

json.decoder.JSONDecodeError: Expecting value

Diagnosis:

Check the raw API response:

import requests
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"model": "test", "messages": [{"role": "user", "content": "hi"}]}
)
print(response.status_code)
print(response.text)

Common causes:

  1. Wrong API endpoint

  2. Model not loaded on server

  3. Invalid request format

Checkpoint & Resume Issues

Checkpoint Not Working

Symptom:

Re-running doesn’t resume from checkpoint.

Diagnosis:

Check if checkpoint file exists:

ls -la work_dirs/my_experiment/checkpoint.json

Solutions:

  1. Ensure same work_dir: Config must specify the same work_dir

  2. Check checkpoint content:

    import json
    with open("work_dirs/my_experiment/checkpoint.json") as f:
        print(json.dumps(json.load(f), indent=2))
    
  3. Manual checkpoint reset: Delete checkpoint to restart:

    rm work_dirs/my_experiment/checkpoint.json
    

Corrupted Checkpoint

Symptom:

json.decoder.JSONDecodeError when loading checkpoint

Solution:

Delete the corrupted checkpoint and restart:

rm work_dirs/my_experiment/checkpoint.json
python run.py -c config.py  # Restart from beginning
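
To avoid deleting a checkpoint that is actually fine, you can verify it before removing it. A small sketch:

```python
import json
import os

def reset_if_corrupt(path):
    """Delete the file at `path` only if it is not valid JSON.

    Returns True if a corrupted checkpoint was removed.
    """
    try:
        with open(path) as f:
            json.load(f)
        return False          # valid checkpoint, keep it
    except FileNotFoundError:
        return False          # nothing to reset
    except json.JSONDecodeError:
        os.remove(path)
        return True
```

This makes the reset safe to run unconditionally before every launch.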

Output Issues

Output Format Changed

Symptom:

Output data has unexpected fields or structure.

Explanation:

DataStudio adds metadata fields:

  • filter_ops: Which filters applied and why

  • rewrite_ops: Which rewriters modified the content

  • ori_answer: Original answer before rewriting

  • rejected: Whether sample was filtered out

To get clean output, post-process:

def clean_output(item):
    """Remove DataStudio metadata from output."""
    clean = item.copy()
    for key in ['filter_ops', 'rewrite_ops', 'ori_answer', 'rejected', 'source_file']:
        clean.pop(key, None)
    return clean

Duplicate Samples in Output

Symptom:

Same sample appears multiple times in output.

Possible causes:

  1. Re-running without clearing output: Output is appended

  2. Input has duplicates: Check input data

Solution:

Clear output directory before re-running:

rm -rf output/my_dataset/*
python run.py -c config.py
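
If duplicates are already baked into an output file, they can also be dropped after the fact. A sketch that keys each record on its canonical JSON, ignoring `source_file` since DataStudio adds it and it may differ between runs — adjust the ignored keys for your metadata:

```python
import json

def dedupe(items):
    """Drop exact duplicates, keeping the first occurrence of each record.

    Assumes records are JSON-serializable dicts.
    """
    seen = set()
    unique = []
    for item in items:
        key = json.dumps(
            {k: v for k, v in item.items() if k != "source_file"},
            sort_keys=True,
        )
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

If your input data itself has duplicates, run this on the input instead so you do not pay pipeline cost twice for the same sample.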

Performance Issues

Slow Image Loading

Symptom:

Pipeline is slow, mostly waiting on image loading.

Solution:

Pre-cache images to LMDB:

python run.py -c config.py --cache-images

This only needs to be done once per dataset.

MLLM Throughput Low

Symptom:

MLLM operations are slower than expected.

Diagnosis:

Calculate actual throughput:

import time

start = time.time()
kept, filtered = pipeline(data[:100])
elapsed = time.time() - start

print(f"Throughput: {100/elapsed:.2f} samples/sec")

Solutions:

  1. Increase thread_num (if API allows):

    model = dict(thread_num=1024)
    
  2. Use local model instead of API

  3. Optimize prompts (shorter prompts = faster)

Getting More Help

If you can’t resolve your issue:

  1. Check GitHub Issues: github.com/Open-Bee/DataStudio/issues

  2. Enable debug logging:

    import logging
    logging.basicConfig(level=logging.DEBUG)
    
  3. Collect diagnostic info:

    python -c "
    import sys
    import datastudio
    print(f'Python: {sys.version}')
    print(f'DataStudio: {datastudio.__version__}')
    "
    
  4. Open a new issue with:

    • DataStudio version

    • Python version and OS

    • Full error traceback

    • Minimal config to reproduce

    • Sample data (if possible)

See Also