DataStudio

User Guide

  • Getting Started
    • Prerequisites
    • Installation
      • Step 1: Clone the Repository
      • Step 2: Create Virtual Environment (Recommended)
      • Step 3: Install Dependencies
      • Step 4: Install DataStudio
      • Step 5: Verify Installation
      • Optional: Weights & Biases
    • Quick Start
      • Option 1: Try Built-in Examples
      • Option 2: Config-Driven (Production)
    • Core Concepts
      • Data Format
      • Operators
      • Pipeline
    • Built-in Operators
      • Filters
      • Rewriters
    • Dataset Configuration
    • What’s Next?
    • Need Help?
  • Quick Start
    • 1. Installation
      • Requirements
      • Steps
      • Optional Dependencies
      • Try It Out
    • 2. Data Format
      • Basic Format
      • Multi-Image Format
      • Schema Compatibility
    • 3. Dataset YAML Configuration
    • 4. Core Concepts
      • Operator Types
      • Pipeline Structure
      • Config Inheritance
    • 5. Built-in Operators
      • Filters
      • Rewriters
    • 6. Config File Details
      • Minimal Config Example
      • Config with MLLM Operators
    • 7. Running the Pipeline
      • Basic Run
      • Cache Images Only
      • Checkpoint Resume
      • Output Structure
    • 8. Model Backend
    • 9. MLLM Operators
      • RequestBuilder
      • MLLMFilter
      • MLLMRewriter
      • SelectiveMLLMRewriter
    • 10. Custom Operators
      • Custom Filter
      • Custom Rewriter
    • 11. Performance Tuning
      • LMDB Image Cache
      • Concurrency Tuning
      • Batch Size
    • 12. Full Example
    • 13. Deploying Inference Services
    • 14. Auxiliary Tools
      • LLMRouter
      • DataVis
  • Examples Guide
    • Demo Data
    • Example Overview
    • Examples Without MLLM (Run Directly)
      • Example 1: Rule-Based Filtering
      • Example 5: Text Normalization
    • Examples With MLLM (Require Inference Service)
      • Example 2: MLLM Quality Filtering
      • Example 3: MLLM Answer Rewriting
      • Example 4: HoneyPipe Full Pipeline (5 Stages)
    • Customizing Examples
  • DataStudio Architecture Guide
    • 1. Project Overview
    • 2. Core Data Flow and Class Relationships
      • 2.1 Overall Architecture
      • 2.2 Data Loading (StandardDataLoader)
      • 2.3 Pipeline Processing (Pipeline → SubPipeline → Operator)
      • 2.4 Operator Execution Details
      • 2.5 Data Saving (StandardDataSaver)
      • 2.6 Core Data Structure Transformations
      • 2.7 Class Dependency Diagram
    • 3. Key Class Reference
    • 4. Directory Structure
    • 5. Core Design Principles
  • Development Guide
    • Operator Architecture
    • Creating a Filter
      • Item-level vs QA-level Filtering
    • Creating a Rewriter
    • Registration and Export
      • 1. Register with the decorator
      • 2. Export in __init__.py
    • Using in Config
    • Writing Tests
  • Frequently Asked Questions
    • General Questions
      • What is DataStudio?
      • What makes DataStudio different from other data processing tools?
      • Is DataStudio free to use?
    • Installation & Setup
      • What Python version is required?
      • How do I install DataStudio?
      • Do I need a GPU?
      • How much disk space do I need?
    • Data Format
      • What data formats does DataStudio support?
      • What should my data look like?
      • Can I process multi-image samples?
      • Can I process text-only data?
    • Pipeline & Operators
      • How do I choose which operators to use?
      • What’s the difference between Filter and Rewriter?
      • Can I combine multiple operators?
      • How do I create custom operators?
    • MLLM Integration
      • Which MLLM providers are supported?
      • How do I use a local model?
      • How many concurrent API calls can I make?
      • Why are my MLLM calls slow?
    • Performance
      • How do I speed up processing?
      • What batch_size should I use?
      • How do I handle very large datasets (10M+ samples)?
    • Troubleshooting
      • My pipeline crashed. How do I resume?
      • Why is my output empty?
      • How do I debug filtering decisions?
    • Contributing
      • How can I contribute?
      • How do I report a bug?
    • Citation
      • How should I cite DataStudio?
    • Still Have Questions?
  • Troubleshooting Guide
    • Installation Issues
      • ImportError: No module named ‘datastudio’
      • Missing Dependencies
      • Python Version Incompatibility
    • Data Loading Issues
      • FileNotFoundError for Images
      • Invalid JSON Format
      • LMDB Cache Issues
    • Pipeline Execution Issues
      • No Output / All Samples Filtered
      • Pipeline Hangs
      • Out of Memory (OOM)
    • MLLM Issues
      • API Connection Errors
      • Rate Limiting
      • Invalid API Response
    • Checkpoint & Resume Issues
      • Checkpoint Not Working
      • Corrupted Checkpoint
    • Output Issues
      • Output Format Changed
      • Duplicate Samples in Output
    • Performance Issues
      • Slow Image Loading
      • MLLM Throughput Low
    • Getting More Help
    • See Also

API Reference

  • Architecture Design
    • Core Modules
    • Class Hierarchy Diagram
      • Result Types
    • Design Principles
  • API Documentation
    • Operators Module
      • Core Types (core)
        • FilterDecision
        • RewriteDecision
        • Result
        • OperatorResult
        • Class Hierarchy Diagram
      • Operator Base Classes
        • QA
        • DataItem
        • Operator
        • Filter
        • Rewriter
        • ConvLengthFilter
        • ImageSizeFilter
        • ImageAspectRatioFilter
        • ImageExtFilter
        • LengthAnomalyFilter
        • ResponseTagFilter
        • TextRepeatFilter
        • RemoveThinkRewriter
        • NormThinkRewriter
        • AddNoThinkRewriter
        • NormImageTagRewriter
        • NormPromptRewriter
        • NormMultiTurnPromptRewriter
        • RemoveAnswerRewriter
        • RemoveReasonRewriter
        • SplitRewriter
        • RequestBuilder
        • MLLMOperator
        • MLLMFilter
        • MLLMRewriter
        • SelectiveMLLMRewriter
    • Datasets Module
      • StandardDataLoader
        • StandardDataLoader.__init__()
        • StandardDataLoader.get_name()
        • StandardDataLoader.has_checkpoint()
        • StandardDataLoader.is_completed()
        • StandardDataLoader.is_empty()
        • StandardDataLoader.update_checkpoint()
      • StandardDataSaver
        • StandardDataSaver.__init__()
        • StandardDataSaver.clean()
        • StandardDataSaver.has_processed()
        • StandardDataSaver.incremental_save()
        • StandardDataSaver.save()
        • StandardDataSaver.save_yaml()
      • FormatRegistry
        • FormatRegistry.get()
        • FormatRegistry.load()
        • FormatRegistry.register()
        • FormatRegistry.save()
        • FormatRegistry.supported_extensions()
      • BaseFormat
        • BaseFormat.extensions()
        • BaseFormat.load()
        • BaseFormat.save()
      • JsonFormat
        • JsonFormat.extensions()
        • JsonFormat.load()
        • JsonFormat.save()
      • JsonlFormat
        • JsonlFormat.extensions()
        • JsonlFormat.load()
        • JsonlFormat.save()
      • ConfigLoader
        • ConfigLoader.create_config()
        • ConfigLoader.load()
        • ConfigLoader.load_file_paths()
        • ConfigLoader.load_sources_map()
        • ConfigLoader.save()
      • DatasetConfig
        • DatasetConfig.file_path
        • DatasetConfig.source
        • DatasetConfig.data_size
        • DatasetConfig.extra
        • DatasetConfig.__init__()
        • DatasetConfig.from_dict()
        • DatasetConfig.to_dict()
      • StandardConfig
        • StandardConfig.datasets
        • StandardConfig.extra
        • StandardConfig.__init__()
        • StandardConfig.from_dict()
        • StandardConfig.get_file_paths()
        • StandardConfig.get_sources_map()
        • StandardConfig.to_dict()
    • Models Module
      • OpenAIAPI
        • OpenAIAPI.__init__()
        • OpenAIAPI.generate()
        • OpenAIAPI.generate_inner()
        • OpenAIAPI.shutdown()
      • MPOpenAIAPI
        • MPOpenAIAPI.__init__()
        • MPOpenAIAPI.generate()
        • MPOpenAIAPI.generate_inner()
        • MPOpenAIAPI.shutdown()
      • BaseAPI
        • BaseAPI.allowed_types
        • BaseAPI.__init__()
        • BaseAPI.generate_inner()
        • BaseAPI.encode_image_directly()
        • BaseAPI.check_content()
        • BaseAPI.preproc_content()
        • BaseAPI.prepare_inputs()
        • BaseAPI.process_single_message()
        • BaseAPI.pre_process()
        • BaseAPI.generate()
        • BaseAPI.shutdown()
    • Pipelines Module
      • SubPipeline
        • SubPipeline.__init__()
      • Pipeline
        • Pipeline.sub_pipelines
        • Pipeline.__init__()
      • wrap_items()
      • unwrap_items()

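Chinese Documentation

  • Introduction
    • Key Features
    • Auxiliary Tools
  • Installation
  • Getting Started
    • 1. Prepare the Dataset YAML
    • 2. Write the Config File
    • 3. Run
  • Project Structure
  • About the Bee Project
  • Citation
  • Contributing
  • License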
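  • Quick Start
    • 1. Installation
      • Requirements
      • Steps
      • Optional Dependencies
      • Try It Out
    • 2. Data Format
      • Basic Format
      • Multi-Image Format
      • Schema Compatibility
    • 3. Dataset YAML Configuration
    • 4. Core Concepts
      • Operator Types
      • Pipeline Structure
      • Config Inheritance
    • 5. Built-in Operators
      • Filters
      • Rewriters
    • 6. Config File Details
      • Minimal Config Example
      • Config with MLLM Operators
    • 7. Running the Pipeline
      • Basic Run
      • Cache Images Only
      • Checkpoint Resume
      • Output Structure
    • 8. Model Backend
    • 9. MLLM Operators in Detail
      • RequestBuilder
      • MLLMFilter
      • MLLMRewriter
      • SelectiveMLLMRewriter
    • 10. Custom Operators
      • Custom Filter
      • Custom Rewriter
    • 11. Performance Tuning
      • LMDB Image Cache
      • Concurrency Tuning
      • Batch Size
    • 12. Full Example
    • 13. Deploying Inference Services
    • 14. Auxiliary Tools
      • LLMRouter
      • DataVis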
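  • Examples Guide
    • Demo Data
    • Example Overview
    • Examples Without MLLM (Run Directly)
      • Example 1: Rule-Based Filtering
      • Example 5: Text Normalization
    • Examples With MLLM (Require Inference Service)
      • Example 2: MLLM Quality Filtering
      • Example 3: MLLM Answer Rewriting
      • Example 4: HoneyPipe Full Pipeline (5 Stages)
    • Customizing Examples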
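  • DataStudio Architecture Quick Guide
    • 1. Project Positioning
    • 2. Core Data Flow and Class Call Relationships
      • 2.1 Overall Architecture Diagram
      • 2.2 Data Loading Flow (StandardDataLoader)
      • 2.3 Pipeline Processing Flow (Pipeline → SubPipeline → Operator)
      • 2.4 Operator Execution Details
      • 2.5 Data Saving Flow (StandardDataSaver)
      • 2.6 Core Data Structure Transformations
      • 2.7 Class Dependency Diagram
    • 3. Core Concepts
      • 3.1 Data Format
      • 3.2 Operator
      • 3.3 Result
      • 3.4 Pipeline
    • 4. Directory Structure Quick Reference
    • 5. Reading the Config File
    • 6. How to Run
    • 7. Key Class Quick Reference
    • 8. Extension Guide
      • Adding a New Filter
      • Adding a New Rewriter
      • Using in Config
    • 9. Core Design Principles
    • 10. FAQ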

Python Module Index

d
- datastudio
    datastudio.datasets
    datastudio.models
    datastudio.models.base
    datastudio.models.openai_api
    datastudio.operators
    datastudio.operators.core.result
    datastudio.pipelines
    datastudio.pipelines.pipeline
    datastudio.pipelines.sub_pipeline

© Copyright 2024, DataStudio Team.
