Troubleshooting Guide
=====================

This guide helps you diagnose and fix common issues with DataStudio.

.. contents:: Table of Contents
   :local:
   :depth: 2

Installation Issues
-------------------

ImportError: No module named 'datastudio'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Symptom:**

.. code-block:: python

   >>> from datastudio.operators import Pipeline
   ImportError: No module named 'datastudio'

**Solution:**

Install DataStudio in development mode:

.. code-block:: bash

   cd /path/to/DataStudio
   pip install -e .

Or ensure you're in the correct virtual environment:

.. code-block:: bash

   source venv/bin/activate       # Linux/Mac
   # or: venv\Scripts\activate    # Windows

Missing Dependencies
~~~~~~~~~~~~~~~~~~~~

**Symptom:**

.. code-block:: text

   ModuleNotFoundError: No module named 'lmdb'

**Solution:**

.. code-block:: bash

   pip install -r requirements.txt

For specific packages:

.. code-block:: bash

   pip install lmdb pillow tqdm requests

Python Version Incompatibility
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Symptom:**

.. code-block:: text

   SyntaxError: invalid syntax
   # or
   TypeError: 'type' object is not subscriptable

**Solution:**

DataStudio requires Python 3.10+. Check your version:

.. code-block:: bash

   python --version

Upgrade if needed:

.. code-block:: bash

   # Using pyenv
   pyenv install 3.10.12
   pyenv local 3.10.12

Data Loading Issues
-------------------

FileNotFoundError for Images
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Symptom:**

.. code-block:: text

   FileNotFoundError: [Errno 2] No such file or directory: 'images/sample.jpg'

**Causes & Solutions:**

1. **Relative paths**: Use absolute paths in your data or set ``data_root``

   .. code-block:: yaml

      # In dataset YAML
      data_root: /absolute/path/to/dataset

2. **Path format**: Ensure paths match your OS (forward vs. backslashes)
3. **Missing files**: Verify the images exist at the specified paths

Invalid JSON Format
~~~~~~~~~~~~~~~~~~~

**Symptom:**

.. code-block:: text

   json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes

**Solution:**

Check your JSON/JSONL format:

.. code-block:: bash

   # Validate JSON
   python -m json.tool your_data.json

   # For JSONL, check each line
   python -c "
   import json
   with open('data.jsonl') as f:
       for i, line in enumerate(f, 1):
           try:
               json.loads(line)
           except json.JSONDecodeError as e:
               print(f'Error on line {i}: {e}')
   "

LMDB Cache Issues
~~~~~~~~~~~~~~~~~

**Symptom:**

.. code-block:: text

   lmdb.Error: /path/to/cache: No such file or directory

**Solution:**

Ensure the parent directory exists:

.. code-block:: bash

   mkdir -p ~/cache/images_lmdb

**Symptom:**

.. code-block:: text

   lmdb.MapFullError: mdb_put: MDB_MAP_FULL: Environment mapsize limit reached

**Solution:**

Your LMDB database is full. Delete and recreate it:

.. code-block:: bash

   rm -rf ~/cache/images_lmdb
   python run.py -c config.py --cache-images

Or use a different cache directory:

.. code-block:: python

   dataloader = dict(
       cache_dir="~/cache/images_lmdb_v2",  # New location
       # ...
   )

Pipeline Execution Issues
-------------------------

No Output / All Samples Filtered
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Symptom:** All samples are filtered out and the output is empty.

**Diagnosis:**

Check why samples were filtered:

.. code-block:: python

   kept, filtered = pipeline(data)
   print(f"Kept: {len(kept)}, Filtered: {len(filtered)}")

   # Examine filter reasons
   from collections import Counter

   reasons = []
   for item in filtered:
       if 'filter_ops' in item:
           for op, qa_reasons in item['filter_ops'].items():
               for qa_idx, reason in qa_reasons.items():
                   reasons.append(f"{op}: {reason}")

   for reason, count in Counter(reasons).most_common(10):
       print(f"  {count:5d}x {reason}")

**Common causes:**

1. **Filters too strict**: Relax the filter parameters
2. **Data format mismatch**: Check the ``conversations`` structure
3. **Missing images**: Verify the image paths

Pipeline Hangs
~~~~~~~~~~~~~~

**Symptom:** The pipeline stops responding, with no progress for an extended time.
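
Before changing any settings, it can help to see exactly where the run is stuck. The sketch below uses only the Python standard library; ``dump_all_stacks`` is a hypothetical local helper written for this guide, not a DataStudio API. It snapshots the stack of every live thread, so you can tell whether workers are blocked on an API read, a lock, or a queue:

```python
import sys
import threading
import traceback


def dump_all_stacks() -> str:
    """Return the current stack of every live thread as text.

    A local debugging helper (not part of DataStudio): call it when the
    pipeline has made no progress for a long time to see which call each
    worker thread is blocked in.
    """
    frames = sys._current_frames()
    report = []
    for thread in threading.enumerate():
        frame = frames.get(thread.ident)
        if frame is None:
            continue
        report.append(f"--- {thread.name} ---\n")
        report.extend(traceback.format_stack(frame))
    return "".join(report)


# Print a snapshot; in a hung run, worker threads typically all sit
# inside the same blocking call (HTTP read, lock acquire, queue get).
print(dump_all_stacks())
```

Calling this from a watchdog thread, or from a signal handler while the pipeline hangs, usually makes the stuck call obvious and points at one of the causes below.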
**Causes & Solutions:**

1. **MLLM API timeout**: Reduce ``thread_num`` or increase ``timeout``

   .. code-block:: python

      model = dict(
          thread_num=64,       # Reduce from 512
          timeout=(60, 3600),  # Increase timeout
      )

2. **Deadlock in threading**: Restart and reduce parallelism
3. **Memory exhaustion**: Reduce ``batch_size``

   .. code-block:: python

      dataloader = dict(
          batch_size=1000,  # Reduce from 10000
      )

Out of Memory (OOM)
~~~~~~~~~~~~~~~~~~~

**Symptom:**

.. code-block:: text

   MemoryError
   # or
   Killed (process terminated by OS)

**Solutions:**

1. **Reduce the batch size**:

   .. code-block:: python

      dataloader = dict(batch_size=1000)

2. **Resize images**:

   .. code-block:: python

      dataloader = dict(resize_image_size=1024)  # Smaller images

3. **Disable image loading** if it is not needed:

   .. code-block:: python

      dataloader = dict(use_image=False)

MLLM Issues
-----------

API Connection Errors
~~~~~~~~~~~~~~~~~~~~~

**Symptom:**

.. code-block:: text

   requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8000)

**Solutions:**

1. **Check that the API server is running**:

   .. code-block:: bash

      curl http://localhost:8000/v1/models

2. **Verify the API base URL** in your config:

   .. code-block:: python

      model = dict(
          api_base="http://localhost:8000/v1",  # Include /v1
      )

3. **Check firewall/network settings**

Rate Limiting
~~~~~~~~~~~~~

**Symptom:**

.. code-block:: text

   HTTP 429: Too Many Requests
   # or
   Rate limit exceeded

**Solution:**

Reduce the number of concurrent requests:

.. code-block:: python

   model = dict(
       thread_num=50,  # Reduce from a higher value
       retry=10,       # Add retries for transient errors
   )

Invalid API Response
~~~~~~~~~~~~~~~~~~~~

**Symptom:**

.. code-block:: text

   json.decoder.JSONDecodeError: Expecting value

**Diagnosis:**

Check the raw API response:

.. code-block:: python

   import requests

   response = requests.post(
       "http://localhost:8000/v1/chat/completions",
       json={"model": "test", "messages": [{"role": "user", "content": "hi"}]},
   )
   print(response.status_code)
   print(response.text)

**Common causes:**

1. Wrong API endpoint
2. Model not loaded on the server
3. Invalid request format

Checkpoint & Resume Issues
--------------------------

Checkpoint Not Working
~~~~~~~~~~~~~~~~~~~~~~

**Symptom:** Re-running doesn't resume from the checkpoint.

**Diagnosis:**

Check whether the checkpoint file exists:

.. code-block:: bash

   ls -la work_dirs/my_experiment/checkpoint.json

**Solutions:**

1. **Ensure the same work_dir**: The config must specify the same ``work_dir``
2. **Check the checkpoint content**:

   .. code-block:: python

      import json

      with open("work_dirs/my_experiment/checkpoint.json") as f:
          print(json.dumps(json.load(f), indent=2))

3. **Manual checkpoint reset**: Delete the checkpoint to restart:

   .. code-block:: bash

      rm work_dirs/my_experiment/checkpoint.json

Corrupted Checkpoint
~~~~~~~~~~~~~~~~~~~~

**Symptom:**

.. code-block:: text

   json.decoder.JSONDecodeError when loading checkpoint

**Solution:**

Delete the corrupted checkpoint and restart:

.. code-block:: bash

   rm work_dirs/my_experiment/checkpoint.json
   python run.py -c config.py  # Restart from the beginning

Output Issues
-------------

Output Format Changed
~~~~~~~~~~~~~~~~~~~~~

**Symptom:** The output data has unexpected fields or structure.

**Explanation:**

DataStudio adds metadata fields:

- ``filter_ops``: Which filters were applied and why
- ``rewrite_ops``: Which rewriters modified the content
- ``ori_answer``: The original answer before rewriting
- ``rejected``: Whether the sample was filtered out

To get clean output, post-process:

.. code-block:: python

   def clean_output(item):
       """Remove DataStudio metadata from output."""
       clean = item.copy()
       for key in ['filter_ops', 'rewrite_ops', 'ori_answer',
                   'rejected', 'source_file']:
           clean.pop(key, None)
       return clean

Duplicate Samples in Output
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Symptom:** The same sample appears multiple times in the output.

**Possible causes:**

1. **Re-running without clearing the output**: Output is appended
2. **Input has duplicates**: Check the input data

**Solution:**

Clear the output directory before re-running:

.. code-block:: bash

   rm -rf output/my_dataset/*
   python run.py -c config.py

Performance Issues
------------------

Slow Image Loading
~~~~~~~~~~~~~~~~~~

**Symptom:** The pipeline is slow, mostly waiting on image loading.

**Solution:**

Pre-cache images to LMDB:

.. code-block:: bash

   python run.py -c config.py --cache-images

This only needs to be done once per dataset.

MLLM Throughput Low
~~~~~~~~~~~~~~~~~~~

**Symptom:** MLLM operations are slower than expected.

**Diagnosis:**

Calculate the actual throughput:

.. code-block:: python

   import time

   start = time.time()
   kept, filtered = pipeline(data[:100])
   elapsed = time.time() - start
   print(f"Throughput: {100 / elapsed:.2f} samples/sec")

**Solutions:**

1. **Increase thread_num** (if the API allows it):

   .. code-block:: python

      model = dict(thread_num=1024)

2. **Use a local model** instead of an API
3. **Optimize prompts** (shorter prompts are faster)

Getting More Help
-----------------

If you can't resolve your issue:

1. **Check GitHub Issues**: `github.com/Open-Bee/DataStudio/issues <https://github.com/Open-Bee/DataStudio/issues>`_
2. **Enable debug logging**:

   .. code-block:: python

      import logging

      logging.basicConfig(level=logging.DEBUG)

3. **Collect diagnostic info**:

   .. code-block:: bash

      python -c "
      import sys
      import datastudio
      print(f'Python: {sys.version}')
      print(f'DataStudio: {datastudio.__version__}')
      "

4. **Open a new issue** with:

   - DataStudio version
   - Python version and OS
   - Full error traceback
   - Minimal config to reproduce
   - Sample data (if possible)

See Also
--------

- :doc:`faq` - Frequently asked questions
- :doc:`guide/quick_start` - Quick start guide
- GitHub Discussions