Datasets Module

Data loading, saving, and format handling.

class datastudio.datasets.StandardDataLoader[source]

Bases: object

Dataset loader with LMDB image caching and checkpoint resume.

Supports automatic format detection, parallel image loading, adaptive batch sizing, and item-level checkpointing for resumable processing.

Example:

loader = StandardDataLoader(
    data_root='/data/datasets',
    dataset={'file_path': 'train.jsonl'},
    batch_size=32,
    logger=logger,
)
for batch in loader:
    process(batch)
__init__(data_root, dataset, batch_size, parallel_loading=True, num_workers=256, logger=None, cache_dir='~/cache/images_lmdb_sharded', lmdb_num_shards=32, lmdb_map_size_per_shard=1099511627776, lmdb_readonly=False, lmdb_lock=False, resize_image=True, resize_image_size=1024, use_image=True, use_lmdb_cache=True, adjust_batch_size=True, checkpoint_manager=None, **kwargs)[source]
get_name()[source]

Get the dataset name.

Return type:

str

has_checkpoint()[source]

Check whether checkpoint tracking is enabled and progress has been recorded for this dataloader.

Return type:

bool

is_completed()[source]

Check if this dataloader has been fully processed.

Return type:

bool

is_empty()[source]

Check if there are no items to process.

Return type:

bool

update_checkpoint()[source]

Update checkpoint with current progress.

Return type:

None
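The checkpoint methods above compose into a simple resume loop. The sketch below illustrates the pattern with a stand-in `ToyLoader` (not part of datastudio): skip batches a previous run already handled, and persist progress after each batch.

```python
# Illustrative resume pattern only; ToyLoader is a stand-in for
# StandardDataLoader, not datastudio's actual implementation.

class ToyLoader:
    """Iterates batches, remembering how many were already processed."""

    def __init__(self, batches, processed=0):
        self.batches = batches
        self.processed = processed  # restored from a checkpoint on resume

    def is_completed(self):
        return self.processed >= len(self.batches)

    def has_checkpoint(self):
        return self.processed > 0

    def __iter__(self):
        # Skip batches that a previous run already handled.
        return iter(self.batches[self.processed:])

    def update_checkpoint(self):
        self.processed += 1


def run(loader, sink):
    """Process remaining batches, checkpointing after each one."""
    if loader.is_completed():
        return
    for batch in loader:
        sink.append(batch)
        loader.update_checkpoint()  # persist progress per batch
```

With `processed=1` restored from a checkpoint, only the remaining two of three batches are processed on resume.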

class datastudio.datasets.StandardDataSaver[source]

Bases: object

Dataset saver that groups output by source and tracks statistics.

Output directory structure:

output_dir/
├── source1/
│   ├── file1.json
│   └── rejected/
│       └── file1.json
├── source2/
│   └── file2.jsonl
└── config.yaml
__init__(output_dir, dataset, logger, save_yaml_config=False, save_yaml_name=None, incremental_save_threshold=8192)[source]

Initialize the data saver.

Parameters:
  • output_dir (str) – Output directory path.

  • dataset (str) – Original dataset YAML path.

  • logger – Logger instance.

  • save_yaml_config (bool) – Whether to save YAML config.

  • save_yaml_name (str) – YAML config file name.

  • incremental_save_threshold (int) – Number of items before triggering incremental save.

clean()[source]

Clear data buffers.

has_processed(dataset)[source]

Check if dataset has already been processed.

Return type:

bool

incremental_save()[source]

Incrementally save data if threshold is reached.

This enables item-level resume by saving progress periodically without waiting for the entire file to be processed.

save()[source]

Save accumulated data to disk.

save_yaml()[source]

Save YAML config and statistics.

class datastudio.datasets.FormatRegistry[source]

Bases: object

Registry for data format handlers with automatic schema conversion.

Detects file format by extension, converts between conversations and messages schemas automatically, and preserves original schema metadata for round-trip fidelity.

classmethod get(file_path)[source]

Get format handler instance based on file extension.

Parameters:

file_path (str) – Path to data file.

Returns:

Format handler instance.

Return type:

BaseFormat

Raises:

ValueError – If file extension is not supported.

classmethod load(file_path, add_source_file=True, remove_rejected=True, auto_normalize=True)[source]

Load data from file.

Parameters:
  • file_path (str) – Path to data file.

  • add_source_file (bool) – Whether to add 'file_path' field to each item.

  • remove_rejected (bool) – Whether to remove 'filtered'/'rejected' fields.

  • auto_normalize (bool) – Whether to auto-convert to standard schema (default: True). Original schema is stored in '_original_schema' for round-trip.

Returns:

List of data dictionaries in standard schema.

Return type:

list

Note

This method modifies the loaded data items in place by:
  • Converting to standard schema if auto_normalize=True
  • Adding 'file_path' field if add_source_file=True
  • Removing 'filtered' and 'rejected' fields if remove_rejected=True

If you need to preserve the original data, make a deep copy after loading.

classmethod register(format_class)[source]

Register a format handler (decorator).

Parameters:

format_class – Format class to register.

Returns:

The registered format class.

classmethod save(data, file_path, auto_denormalize=True, **kwargs)[source]

Save data to file.

Parameters:
  • data (list) – List of data dictionaries.

  • file_path (str) – Output file path.

  • auto_denormalize (bool) – Whether to auto-convert back to original schema (default: True).

  • **kwargs – Format-specific options.

Return type:

None

classmethod supported_extensions()[source]

Get all supported file extensions.

Returns:

List of supported extensions.

Return type:

list
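The register/get pair above is a standard extension-keyed registry. The generic sketch below shows the mechanics (decorator registration, suffix dispatch, ValueError on unknown extensions); `ToyRegistry` and `ToyJson` are stand-ins, not datastudio's code.

```python
# Minimal registry-pattern illustration, not FormatRegistry itself.
import os


class ToyRegistry:
    _handlers = {}

    @classmethod
    def register(cls, format_class):
        # Class decorator: map each declared extension to the handler.
        for ext in format_class.extensions():
            cls._handlers[ext.lower()] = format_class
        return format_class  # decorator returns the class unchanged

    @classmethod
    def get(cls, file_path):
        ext = os.path.splitext(file_path)[1].lower()
        try:
            return cls._handlers[ext]()
        except KeyError:
            raise ValueError(f"Unsupported extension: {ext}")


@ToyRegistry.register
class ToyJson:
    @classmethod
    def extensions(cls):
        return [".json"]
```

Dispatch is case-insensitive here because both registration and lookup lowercase the extension.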

class datastudio.datasets.BaseFormat[source]

Bases: ABC

Abstract base for data format handlers.

Subclasses implement extensions(), load(), and save(), then register via @FormatRegistry.register.

Example:

@FormatRegistry.register
class MyFormat(BaseFormat):
    @classmethod
    def extensions(cls) -> list:
        return ['.myext']
    def load(self, file_path: str) -> list: ...
    def save(self, data: list, file_path: str, **kwargs) -> None: ...
abstractmethod classmethod extensions()[source]

Return list of supported file extensions.

Returns:

List of file extensions (e.g., ['.json', '.JSON']).

Return type:

list

abstractmethod load(file_path)[source]

Load data from file.

Parameters:

file_path (str) – Path to the data file.

Returns:

List of data dictionaries.

Return type:

list

abstractmethod save(data, file_path, **kwargs)[source]

Save data to file.

Parameters:
  • data (list) – List of data dictionaries.

  • file_path (str) – Output file path.

  • **kwargs – Format-specific options.

Return type:

None

class datastudio.datasets.JsonFormat[source]

Bases: BaseFormat

JSON data format handler.

Supports loading and saving data in JSON format. Single objects are wrapped in a list for consistency.

classmethod extensions()[source]

Return supported file extensions.

Returns:

['.json']

Return type:

list

load(file_path)[source]

Load data from JSON file.

Parameters:

file_path (str) – Path to JSON file.

Returns:

Data as list of dicts (single objects are wrapped in list).

Return type:

list

save(data, file_path, indent=4, ensure_ascii=False)[source]

Save data to JSON file.

Parameters:
  • data (list) – List of data dictionaries.

  • file_path (str) – Output file path.

  • indent (int) – JSON indentation level (default: 4).

  • ensure_ascii (bool) – Whether to escape non-ASCII characters (default: False).

Return type:

None

class datastudio.datasets.JsonlFormat[source]

Bases: BaseFormat

JSONL (JSON Lines) data format handler.

Each line contains one JSON object. Invalid lines are skipped with a warning.

classmethod extensions()[source]

Return supported file extensions.

Returns:

['.jsonl']

Return type:

list

load(file_path)[source]

Load data from JSONL file.

Parameters:

file_path (str) – Path to JSONL file.

Returns:

List of data dictionaries.

Return type:

list

Note

Invalid JSON lines are skipped with a warning message.

save(data, file_path, ensure_ascii=False)[source]

Save data to JSONL file.

Parameters:
  • data (list) – List of data dictionaries.

  • file_path (str) – Output file path.

  • ensure_ascii (bool) – Whether to escape non-ASCII characters (default: False).

Return type:

None

class datastudio.datasets.ConfigLoader[source]

Bases: object

Configuration loader for loading and saving YAML configs.

Provides class methods for loading YAML configurations and converting them to a standardized format.

Example

>>> config = ConfigLoader.load("config.yaml")
>>> paths = config.get_file_paths()
>>> sources = config.get_sources_map()
classmethod create_config(file_paths, sources=None, data_sizes=None)[source]

Create a standard configuration from file paths.

Parameters:
  • file_paths (List[str]) – List of data file paths.

  • sources (Optional[dict]) – Optional mapping from path to source name.

  • data_sizes (Optional[dict]) – Optional mapping from path to data size.

Return type:

StandardConfig

Returns:

StandardConfig instance.

classmethod load(yaml_path)[source]

Load YAML configuration file.

Parameters:

yaml_path (str) – Path to YAML file.

Return type:

StandardConfig

Returns:

StandardConfig instance.

classmethod load_file_paths(yaml_path, data_root=None)[source]

Load config and return all data file paths.

Parameters:
  • yaml_path (str) – Path to YAML config file.

  • data_root (Optional[str]) – Optional root directory to prepend to relative paths.

Return type:

List[str]

Returns:

List of absolute file paths.

classmethod load_sources_map(yaml_path)[source]

Load config and return file_path to source mapping.

Parameters:

yaml_path (str) – Path to YAML config file.

Returns:

Mapping from file_path to source name.

Return type:

dict

classmethod save(config, yaml_path)[source]

Save configuration to YAML file.

Parameters:
  • config (StandardConfig) – StandardConfig to save.

  • yaml_path (str) – Output file path.

Return type:

None

class datastudio.datasets.DatasetConfig[source]

Bases: object

Standard configuration for a single dataset.

file_path

Path to the data file.

source

Data source name/identifier.

data_size

Number of samples (optional).

extra

Additional metadata fields.

__init__(file_path, source='Unknown', data_size=None, extra=<factory>)
Parameters:
  • file_path (str)

  • source (str)

  • data_size (int | None)

  • extra (dict)

Return type:

None

data_size: Optional[int] = None
classmethod from_dict(data)[source]

Create config from dict, automatically handling field aliases.

Parameters:

data (dict) – Raw configuration dictionary.

Return type:

DatasetConfig

Returns:

DatasetConfig instance with standardized fields.

source: str = 'Unknown'
to_dict()[source]

Convert to dictionary.

Returns:

Configuration as dictionary.

Return type:

dict

file_path: str
extra: dict
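The shape of this config record can be sketched as a dataclass whose from_dict extracts known fields and folds everything else into extra. Field names match the docs above; the alias handling shown ('path' → 'file_path') is a hypothetical example, not datastudio's actual alias table.

```python
# Hedged sketch of a DatasetConfig-style record; the 'path' alias is
# made up for illustration.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyDatasetConfig:
    file_path: str
    source: str = "Unknown"
    data_size: Optional[int] = None
    extra: dict = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data):
        data = dict(data)  # don't mutate the caller's dict
        if "path" in data and "file_path" not in data:  # hypothetical alias
            data["file_path"] = data.pop("path")
        known = {k: data.pop(k) for k in ("file_path", "source", "data_size")
                 if k in data}
        return cls(extra=data, **known)  # leftovers become metadata

    def to_dict(self):
        d = {"file_path": self.file_path, "source": self.source}
        if self.data_size is not None:
            d["data_size"] = self.data_size
        return {**d, **self.extra}
```

Round-tripping through to_dict flattens extra back to top-level keys.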
class datastudio.datasets.StandardConfig[source]

Bases: object

Standardized complete configuration containing multiple datasets.

datasets

List of DatasetConfig objects.

extra

Additional top-level configuration fields.

__init__(datasets=<factory>, extra=<factory>)
Parameters:
  • datasets (List[DatasetConfig])

  • extra (dict)

Return type:

None

classmethod from_dict(data)[source]

Create standard config from dictionary.

Parameters:

data (dict) – Raw configuration dictionary with ‘datasets’ key.

Return type:

StandardConfig

Returns:

StandardConfig instance.

get_file_paths()[source]

Get all data file paths.

Return type:

List[str]

Returns:

List of file paths from all datasets.

get_sources_map()[source]

Get file_path to source mapping.

Returns:

Mapping from file_path to source name.

Return type:

dict

to_dict()[source]

Convert to dictionary.

Returns:

Configuration as dictionary.

Return type:

dict

datasets: List[DatasetConfig]
extra: dict
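The two accessors above are straightforward projections over the dataset list. The sketch below shows the idea with plain dicts standing in for DatasetConfig records; these free functions are illustrations, not StandardConfig's methods.

```python
# Illustrative projections; plain dicts stand in for DatasetConfig.
def get_file_paths(datasets):
    """Collect every dataset's file path, in order."""
    return [d["file_path"] for d in datasets]


def get_sources_map(datasets):
    """Map each file path to its source, defaulting to 'Unknown'."""
    return {d["file_path"]: d.get("source", "Unknown") for d in datasets}
```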