Datasets Module
Data loading, saving, and format handling.
- class datastudio.datasets.StandardDataLoader[source]
Bases: object
Dataset loader with LMDB image caching and checkpoint resume.
Supports automatic format detection, parallel image loading, adaptive batch sizing, and item-level checkpoint for resumable processing.
Example:
loader = StandardDataLoader(
    data_root='/data/datasets',
    dataset={'file_path': 'train.jsonl'},
    batch_size=32,
    logger=logger,
)
for batch in loader:
    process(batch)
- __init__(data_root, dataset, batch_size, parallel_loading=True, num_workers=256, logger=None, cache_dir='~/cache/images_lmdb_sharded', lmdb_num_shards=32, lmdb_map_size_per_shard=1099511627776, lmdb_readonly=False, lmdb_lock=False, resize_image=True, resize_image_size=1024, use_image=True, use_lmdb_cache=True, adjust_batch_size=True, checkpoint_manager=None, **kwargs)[source]
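The signature above shards the LMDB cache across lmdb_num_shards=32 environments. A common way to route a cache key to a stable shard is to hash it; the exact scheme used here is an assumption, and the function name below is illustrative:

```python
import hashlib

def shard_for_key(key: str, num_shards: int = 32) -> int:
    """Map a cache key to a stable shard index via MD5 (hypothetical scheme,
    not necessarily what StandardDataLoader uses internally)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# The same key always resolves to the same shard, so reads and writes agree.
idx = shard_for_key("images/000123.jpg")
```

Hash-based routing keeps each shard's write lock contention low while requiring no shared index of key locations.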
- class datastudio.datasets.StandardDataSaver[source]
Bases: object
Dataset saver that groups output by source and tracks statistics.
Output directory structure:
output_dir/
├── source1/
│   ├── file1.json
│   └── rejected/
│       └── file1.json
├── source2/
│   └── file2.jsonl
└── config.yaml
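A minimal sketch of the per-source grouping shown in the tree above, assuming each item carries a 'source' field; the function and file names here are illustrative, not the actual StandardDataSaver internals:

```python
import json
import os
import tempfile
from collections import defaultdict

def save_grouped(items, output_dir):
    """Group items by their 'source' field and write one JSON file per source
    (sketch of the directory layout, not the real saver)."""
    groups = defaultdict(list)
    for item in items:
        groups[item.get("source", "Unknown")].append(item)
    for source, group in groups.items():
        source_dir = os.path.join(output_dir, source)
        os.makedirs(source_dir, exist_ok=True)
        with open(os.path.join(source_dir, "data.json"), "w") as f:
            json.dump(group, f)
    return sorted(groups)

out = tempfile.mkdtemp()
sources = save_grouped(
    [{"source": "source1", "x": 1}, {"source": "source2", "x": 2}], out
)
```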
- __init__(output_dir, dataset, logger, save_yaml_config=False, save_yaml_name=None, incremental_save_threshold=8192)[source]
Initialize the data saver.
- Parameters:
- class datastudio.datasets.FormatRegistry[source]
Bases: object
Registry for data format handlers with automatic schema conversion.
Detects file format by extension, converts between conversations and messages schemas automatically, and preserves original schema metadata for round-trip fidelity.
- classmethod get(file_path)[source]
Get format handler instance based on file extension.
- Parameters:
file_path (str) – Path to data file.
- Returns:
Format handler instance.
- Return type:
- Raises:
ValueError – If file extension is not supported.
- classmethod load(file_path, add_source_file=True, remove_rejected=True, auto_normalize=True)[source]
Load data from file.
- Parameters:
file_path (str) – Path to data file.
add_source_file (bool) – Whether to add 'file_path' field to each item.
remove_rejected (bool) – Whether to remove 'filtered'/'rejected' fields.
auto_normalize (bool) – Whether to auto-convert to standard schema (default: True). Original schema is stored in '_original_schema' for round-trip.
- Returns:
List of data dictionaries in standard schema.
- Return type:
Note
This method modifies the loaded data items in place by:
- Converting to standard schema if auto_normalize=True
- Adding 'file_path' field if add_source_file=True
- Removing 'filtered' and 'rejected' fields if remove_rejected=True
If you need to preserve the original data, make a deep copy after loading.
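The deep-copy pattern can be sketched as follows; the in-place cleanup is simulated with a plain list, since the loader itself is not reproduced here:

```python
import copy

# Stand-in for the result of FormatRegistry.load(...).
loaded = [{"messages": [{"role": "user", "content": "hi"}], "rejected": True}]

# Preserve the originals before any in-place normalization.
originals = copy.deepcopy(loaded)

# Simulate the cleanup load() performs when remove_rejected=True.
for item in loaded:
    item.pop("rejected", None)
```

After this, `originals` still holds the 'rejected' field while `loaded` does not; a shallow copy would not be enough, because nested dicts would still be shared.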
- classmethod register(format_class)[source]
Register a format handler (decorator).
- Parameters:
format_class – Format class to register.
- Returns:
The registered format class.
- class datastudio.datasets.BaseFormat[source]
Bases: ABC
Abstract base for data format handlers.
Subclasses implement extensions(), load(), and save(), then register via @FormatRegistry.register.
Example:
@FormatRegistry.register
class MyFormat(BaseFormat):
    @classmethod
    def extensions(cls) -> list:
        return ['.myext']

    def load(self, file_path: str) -> list:
        ...

    def save(self, data: list, file_path: str, **kwargs) -> None:
        ...
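The decorator-plus-dispatch pattern above can be shown end to end with a standalone sketch; the Registry and TextFormat classes below are illustrative stand-ins, not the real FormatRegistry:

```python
import os

class Registry:
    """Standalone sketch of extension-based dispatch (not the real FormatRegistry)."""
    _formats = {}

    @classmethod
    def register(cls, format_class):
        # Decorator: record the class under each extension it declares.
        for ext in format_class.extensions():
            cls._formats[ext.lower()] = format_class
        return format_class

    @classmethod
    def get(cls, file_path):
        # Dispatch on the (case-insensitive) file extension.
        ext = os.path.splitext(file_path)[1].lower()
        if ext not in cls._formats:
            raise ValueError(f"Unsupported file extension: {ext}")
        return cls._formats[ext]()

@Registry.register
class TextFormat:
    @classmethod
    def extensions(cls):
        return [".txt"]

handler = Registry.get("notes.TXT")
```

Returning the class from `register` lets it double as a decorator, so registration lives next to the format definition.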
- abstractmethod classmethod extensions()[source]
Return list of supported file extensions.
- Returns:
List of file extensions (e.g., ['.json', '.JSON']).
- Return type:
- class datastudio.datasets.JsonFormat[source]
Bases: BaseFormat
JSON data format handler.
Supports loading and saving data in JSON format. Single objects are wrapped in a list for consistency.
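The wrap-single-objects behavior can be sketched in a few lines; the function name is illustrative, not JsonFormat's actual code:

```python
import json

def load_json_as_list(text):
    """Parse JSON text, wrapping a single top-level object in a list
    (sketch of the behavior described above)."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]
```

Callers can then iterate over the result uniformly, whether the file held one record or many.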
- class datastudio.datasets.JsonlFormat[source]
Bases: BaseFormat
JSONL (JSON Lines) data format handler.
Each line contains one JSON object. Invalid lines are skipped with a warning.
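The skip-with-warning behavior can be sketched as follows; this is a minimal stand-in for illustration, not JsonlFormat's implementation:

```python
import io
import json
import warnings

def load_jsonl(stream):
    """Parse JSON Lines, skipping invalid lines with a warning
    (sketch of the behavior described above)."""
    items = []
    for lineno, line in enumerate(stream, 1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            items.append(json.loads(line))
        except json.JSONDecodeError:
            warnings.warn(f"Skipping invalid JSON on line {lineno}")
    return items

data = load_jsonl(io.StringIO('{"a": 1}\nnot json\n{"b": 2}\n'))
```

Skipping rather than raising keeps one corrupt line from discarding an otherwise usable file.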
- classmethod extensions()[source]
Return supported file extensions.
- Returns:
['.jsonl']
- Return type:
- class datastudio.datasets.ConfigLoader[source]
Bases: object
Configuration loader for loading and saving YAML configs.
Provides static methods for loading YAML configurations and converting them to standardized format.
Example
>>> config = ConfigLoader.load("config.yaml")
>>> paths = config.get_file_paths()
>>> sources = config.get_sources_map()
- classmethod create_config(file_paths, sources=None, data_sizes=None)[source]
Create a standard configuration from file paths.
- classmethod load(yaml_path)[source]
Load YAML configuration file.
- Parameters:
yaml_path (str) – Path to YAML file.
- Return type:
- Returns:
StandardConfig instance.
- classmethod load_file_paths(yaml_path, data_root=None)[source]
Load config and return all data file paths.
- classmethod load_sources_map(yaml_path)[source]
Load config and return file_path to source mapping.
- classmethod save(config, yaml_path)[source]
Save configuration to YAML file.
- Parameters:
config (StandardConfig) – StandardConfig to save.
yaml_path (str) – Output file path.
- Return type:
- class datastudio.datasets.DatasetConfig[source]
Bases: object
Standard configuration for a single dataset.
- file_path
Path to the data file.
- source
Data source name/identifier.
- data_size
Number of samples (optional).
- extra
Additional metadata fields.
- __init__(file_path, source='Unknown', data_size=None, extra=<factory>)
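The `extra=<factory>` default in the signature above suggests a dataclass using field(default_factory=...); under that assumption, DatasetConfig could be declared roughly like this:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DatasetConfig:
    """Sketch of the dataclass implied by the __init__ signature above
    (an assumption, not the library's actual source)."""
    file_path: str
    source: str = "Unknown"
    data_size: Optional[int] = None
    # default_factory gives each instance its own dict instead of a shared one.
    extra: dict = field(default_factory=dict)

cfg = DatasetConfig(file_path="train.jsonl")
```

The factory default matters because a plain `extra={}` would be shared across all instances, a classic mutable-default pitfall.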
- class datastudio.datasets.StandardConfig[source]
Bases: object
Standardized complete configuration containing multiple datasets.
- datasets
List of DatasetConfig objects.
- extra
Additional top-level configuration fields.
- __init__(datasets=<factory>, extra=<factory>)
- Parameters:
datasets (List[DatasetConfig])
extra (dict)
- Return type:
None
- classmethod from_dict(data)[source]
Create standard config from dictionary.
- Parameters:
data (dict) – Raw configuration dictionary with 'datasets' key.
- Return type:
- Returns:
StandardConfig instance.
- get_sources_map()[source]
Get file_path to source mapping.
- Returns:
Mapping from file_path to source name.
- Return type:
- datasets: List[DatasetConfig]