Datasets Module

Data loading, saving, and format handling.

class datastudio.datasets.StandardDataLoader[source]

Bases: object

Dataset loader with LMDB image caching and checkpoint resume.

Supports automatic format detection, parallel image loading, adaptive batch sizing, and item-level checkpointing for resumable processing.

Example:

loader = StandardDataLoader(
    data_root='/data/datasets',
    dataset={'file_path': 'train.jsonl'},
    batch_size=32,
    logger=logger,
)
for batch in loader:
    process(batch)
__init__(data_root, dataset, batch_size, parallel_loading=True, num_workers=256, logger=None, cache_dir='~/cache/images_lmdb_sharded', lmdb_num_shards=32, lmdb_map_size_per_shard=1099511627776, lmdb_readonly=False, lmdb_lock=False, resize_image=True, resize_image_size=1024, use_image=True, use_lmdb_cache=True, adjust_batch_size=True, checkpoint_manager=None, **kwargs)[source]
get_name()[source]

Get the dataset name.

Return type:

str

has_checkpoint()[source]

Check whether checkpoint tracking is enabled and progress has been recorded for this dataloader.

Return type:

bool

is_completed()[source]

Check if this dataloader has been fully processed.

Return type:

bool

is_empty()[source]

Check if there are no items to process.

Return type:

bool

update_checkpoint()[source]

Update checkpoint with current progress.

Return type:

None
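The checkpoint methods above compose into a simple resume loop. The sketch below illustrates the pattern with a stand-in `ToyLoader` (not part of datastudio): skip batches a previous run already handled, and persist progress after each batch.

```python
# Illustrative resume pattern only; ToyLoader is a stand-in for
# StandardDataLoader, not datastudio's actual implementation.

class ToyLoader:
    """Iterates batches, remembering how many were already processed."""

    def __init__(self, batches, processed=0):
        self.batches = batches
        self.processed = processed  # restored from a checkpoint on resume

    def is_completed(self):
        return self.processed >= len(self.batches)

    def has_checkpoint(self):
        return self.processed > 0

    def __iter__(self):
        # Skip batches that a previous run already handled.
        return iter(self.batches[self.processed:])

    def update_checkpoint(self):
        self.processed += 1


def run(loader, sink):
    """Process remaining batches, checkpointing after each one."""
    if loader.is_completed():
        return
    for batch in loader:
        sink.append(batch)
        loader.update_checkpoint()  # persist progress per batch
```

With `processed=1` restored from a checkpoint, only the remaining two of three batches are processed on resume.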

class datastudio.datasets.StandardDataSaver[source]

Bases: object

Dataset saver that groups output by source and tracks statistics.

Output directory structure:

output_dir/
├── source1/
│   ├── file1.json
│   └── rejected/
│       └── file1.json
├── source2/
│   └── file2.jsonl
└── config.yaml
__init__(output_dir, dataset, logger, save_yaml_config=False, save_yaml_name=None, incremental_save_threshold=8192)[source]

Initialize the data saver.

Parameters:
  • output_dir (str) – Output directory path.

  • dataset (str) – Original dataset YAML path.

  • logger – Logger instance.

  • save_yaml_config (bool) – Whether to save YAML config.

  • save_yaml_name (str) – YAML config file name.

  • incremental_save_threshold (int) – Number of items before triggering incremental save.

clean()[source]

Clear data buffers.

has_processed(dataset)[source]

Check if dataset has already been processed.

Return type:

bool

incremental_save()[source]

Incrementally save data if threshold is reached.

This enables item-level resume by saving progress periodically without waiting for the entire file to be processed.

save()[source]

Save accumulated data to disk.

save_yaml()[source]

Save YAML config and statistics.

class datastudio.datasets.FormatRegistry[source]

Bases: object

Registry for data format handlers with automatic schema conversion.

Detects file format by extension, converts between conversations and messages schemas automatically, and preserves original schema metadata for round-trip fidelity.

classmethod get(file_path)[source]

Get format handler instance based on file extension.

Parameters:

file_path (str) – Path to data file.

Returns:

Format handler instance.

Return type:

BaseFormat

Raises:

ValueError – If file extension is not supported.

classmethod load(file_path, add_source_file=True, remove_rejected=True, auto_normalize=True)[source]

Load data from file.

Parameters:
  • file_path (str) – Path to data file.

  • add_source_file (bool) – Whether to add 'file_path' field to each item.

  • remove_rejected (bool) – Whether to remove 'filtered'/'rejected' fields.

  • auto_normalize (bool) – Whether to auto-convert to standard schema (default: True). Original schema is stored in '_original_schema' for round-trip.

Returns:

List of data dictionaries in standard schema.

Return type:

list

Note

This method modifies the loaded data items in place by:
  • Converting to standard schema if auto_normalize=True
  • Adding 'file_path' field if add_source_file=True
  • Removing 'filtered' and 'rejected' fields if remove_rejected=True

If you need to preserve the original data, make a deep copy after loading.

classmethod register(format_class)[source]

Register a format handler (decorator).

Parameters:

format_class – Format class to register.

Returns:

The registered format class.

classmethod save(data, file_path, auto_denormalize=True, **kwargs)[source]

Save data to file.

Parameters:
  • data (list) – List of data dictionaries.

  • file_path (str) – Output file path.

  • auto_denormalize (bool) – Whether to auto-convert back to original schema (default: True).

  • **kwargs – Format-specific options.

Return type:

None

classmethod supported_extensions()[source]

Get all supported file extensions.

Returns:

List of supported extensions.

Return type:

list
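The register/get pair above is a standard extension-keyed registry. The generic sketch below shows the mechanics (decorator registration, suffix dispatch, ValueError on unknown extensions); `ToyRegistry` and `ToyJson` are stand-ins, not datastudio's code.

```python
# Minimal registry-pattern illustration, not FormatRegistry itself.
import os


class ToyRegistry:
    _handlers = {}

    @classmethod
    def register(cls, format_class):
        # Class decorator: map each declared extension to the handler.
        for ext in format_class.extensions():
            cls._handlers[ext.lower()] = format_class
        return format_class  # decorator returns the class unchanged

    @classmethod
    def get(cls, file_path):
        ext = os.path.splitext(file_path)[1].lower()
        try:
            return cls._handlers[ext]()
        except KeyError:
            raise ValueError(f"Unsupported extension: {ext}")


@ToyRegistry.register
class ToyJson:
    @classmethod
    def extensions(cls):
        return [".json"]
```

Dispatch is case-insensitive here because both registration and lookup lowercase the extension.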

class datastudio.datasets.BaseFormat[source]

Bases: ABC

Abstract base for data format handlers.

Subclasses implement extensions(), load(), and save(), then register via @FormatRegistry.register.

Example:

@FormatRegistry.register
class MyFormat(BaseFormat):
    @classmethod
    def extensions(cls) -> list:
        return ['.myext']
    def load(self, file_path: str) -> list: ...
    def save(self, data: list, file_path: str, **kwargs) -> None: ...
abstractmethod classmethod extensions()[source]

Return list of supported file extensions.

Returns:

List of file extensions (e.g., ['.json', '.JSON']).

Return type:

list

abstractmethod load(file_path)[source]

Load data from file.

Parameters:

file_path (str) – Path to the data file.

Returns:

List of data dictionaries.

Return type:

list

abstractmethod save(data, file_path, **kwargs)[source]

Save data to file.

Parameters:
  • data (list) – List of data dictionaries.

  • file_path (str) – Output file path.

  • **kwargs – Format-specific options.

Return type:

None

class datastudio.datasets.JsonFormat[source]

Bases: BaseFormat

JSON data format handler.

Supports loading and saving data in JSON format. Single objects are wrapped in a list for consistency.

classmethod extensions()[source]

Return supported file extensions.

Returns:

['.json']

Return type:

list

load(file_path)[source]

Load data from JSON file.

Parameters:

file_path (str) – Path to JSON file.

Returns:

Data as list of dicts (single objects are wrapped in list).

Return type:

list

save(data, file_path, indent=4, ensure_ascii=False)[source]

Save data to JSON file.

Parameters:
  • data (list) – List of data dictionaries.

  • file_path (str) – Output file path.

  • indent (int) – JSON indentation level (default: 4).

  • ensure_ascii (bool) – Whether to escape non-ASCII characters (default: False).

Return type:

None

class datastudio.datasets.JsonlFormat[source]

Bases: BaseFormat

JSONL (JSON Lines) data format handler.

Each line contains one JSON object. Invalid lines are skipped with a warning.

classmethod extensions()[source]

Return supported file extensions.

Returns:

['.jsonl']

Return type:

list

load(file_path)[source]

Load data from JSONL file.

Parameters:

file_path (str) – Path to JSONL file.

Returns:

List of data dictionaries.

Return type:

list

Note

Invalid JSON lines are skipped with a warning message.

save(data, file_path, ensure_ascii=False)[source]

Save data to JSONL file.

Parameters:
  • data (list) – List of data dictionaries.

  • file_path (str) – Output file path.

  • ensure_ascii (bool) – Whether to escape non-ASCII characters (default: False).

Return type:

None

class datastudio.datasets.ConfigLoader[source]

Bases: object

Configuration loader for loading and saving YAML configs.

Provides class methods for loading YAML configurations and converting them to a standardized format.

Example

>>> config = ConfigLoader.load("config.yaml")
>>> paths = config.get_file_paths()
>>> sources = config.get_sources_map()
classmethod create_config(file_paths, sources=None, data_sizes=None)[source]

Create a standard configuration from file paths.

Parameters:
  • file_paths (List[str]) – List of data file paths.

  • sources (Optional[dict]) – Optional mapping from path to source name.

  • data_sizes (Optional[dict]) – Optional mapping from path to data size.

Return type:

StandardConfig

Returns:

StandardConfig instance.

classmethod load(yaml_path)[source]

Load YAML configuration file.

Parameters:

yaml_path (str) – Path to YAML file.

Return type:

StandardConfig

Returns:

StandardConfig instance.

classmethod load_file_paths(yaml_path, data_root=None)[source]

Load config and return all data file paths.

Parameters:
  • yaml_path (str) – Path to YAML config file.

  • data_root (Optional[str]) – Optional root directory to prepend to relative paths.

Return type:

List[str]

Returns:

List of absolute file paths.

classmethod load_sources_map(yaml_path)[source]

Load config and return file_path to source mapping.

Parameters:

yaml_path (str) – Path to YAML config file.

Returns:

Mapping from file_path to source name.

Return type:

dict

classmethod save(config, yaml_path)[source]

Save configuration to YAML file.

Parameters:
  • config (StandardConfig) – StandardConfig to save.

  • yaml_path (str) – Output file path.

Return type:

None

class datastudio.datasets.DatasetConfig[source]

Bases: object

Standard configuration for a single dataset.

file_path

Path to the data file.

source

Data source name/identifier.

data_size

Number of samples (optional).

extra

Additional metadata fields.

__init__(file_path, source='Unknown', data_size=None, extra=<factory>)
Parameters:
  • file_path (str)

  • source (str)

  • data_size (int | None)

  • extra (dict)

Return type:

None

data_size: Optional[int] = None
classmethod from_dict(data)[source]

Create config from dict, automatically handling field aliases.

Parameters:

data (dict) – Raw configuration dictionary.

Return type:

DatasetConfig

Returns:

DatasetConfig instance with standardized fields.

source: str = 'Unknown'
to_dict()[source]

Convert to dictionary.

Returns:

Configuration as dictionary.

Return type:

dict

file_path: str
extra: dict
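The shape of this config record can be sketched as a dataclass whose from_dict extracts known fields and folds everything else into extra. Field names match the docs above; the alias handling shown ('path' → 'file_path') is a hypothetical example, not datastudio's actual alias table.

```python
# Hedged sketch of a DatasetConfig-style record; the 'path' alias is
# made up for illustration.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyDatasetConfig:
    file_path: str
    source: str = "Unknown"
    data_size: Optional[int] = None
    extra: dict = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data):
        data = dict(data)  # don't mutate the caller's dict
        if "path" in data and "file_path" not in data:  # hypothetical alias
            data["file_path"] = data.pop("path")
        known = {k: data.pop(k) for k in ("file_path", "source", "data_size")
                 if k in data}
        return cls(extra=data, **known)  # leftovers become metadata

    def to_dict(self):
        d = {"file_path": self.file_path, "source": self.source}
        if self.data_size is not None:
            d["data_size"] = self.data_size
        return {**d, **self.extra}
```

Round-tripping through to_dict flattens extra back to top-level keys.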
class datastudio.datasets.StandardConfig[source]

Bases: object

Standardized complete configuration containing multiple datasets.

datasets

List of DatasetConfig objects.

extra

Additional top-level configuration fields.

__init__(datasets=<factory>, extra=<factory>)
Parameters:
  • datasets (List[DatasetConfig])

  • extra (dict)

Return type:

None

classmethod from_dict(data)[source]

Create standard config from dictionary.

Parameters:

data (dict) – Raw configuration dictionary with ‘datasets’ key.

Return type:

StandardConfig

Returns:

StandardConfig instance.

get_file_paths()[source]

Get all data file paths.

Return type:

List[str]

Returns:

List of file paths from all datasets.

get_sources_map()[source]

Get file_path to source mapping.

Returns:

Mapping from file_path to source name.

Return type:

dict

to_dict()[source]

Convert to dictionary.

Returns:

Configuration as dictionary.

Return type:

dict

datasets: List[DatasetConfig]
extra: dict
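The two accessors above are straightforward projections over the dataset list. The sketch below shows the idea with plain dicts standing in for DatasetConfig records; these free functions are illustrations, not StandardConfig's methods.

```python
# Illustrative projections; plain dicts stand in for DatasetConfig.
def get_file_paths(datasets):
    """Collect every dataset's file path, in order."""
    return [d["file_path"] for d in datasets]


def get_sources_map(datasets):
    """Map each file path to its source, defaulting to 'Unknown'."""
    return {d["file_path"]: d.get("source", "Unknown") for d in datasets}
```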