openclean.data.metadata.fs module
Implementation of the metadata store that maintains metadata about dataset snapshots in files on the local file system. Metadata for each snapshot is kept in a separate directory, with a separate JSON file for each identifiable object.
- openclean.data.metadata.fs.FILE(column_id: Optional[int] = None, row_id: Optional[int] = None) str
Get the name of a metadata file. The file name depends on whether identifiers for the column and the row are given. The following are the file names of metadata files for different types of resources:
ds.json: Dataset annotations
col_{column_id}.json: Column annotations
row_{row_id}.json: Row annotations
cell_{column_id}_{row_id}.json: Dataset cell annotations.
- Parameters
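column_id (int, default=None) – Column identifier for the referenced object (None for rows or full datasets).
row_id (int, default=None) – Row identifier for the referenced object (None for columns or full datasets).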
- Return type
string
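The naming scheme above can be illustrated with a small stand-alone function. This is a minimal sketch that mirrors the documented file names; `metadata_file_name` is a hypothetical helper written for illustration and is not the library's own implementation of `FILE`.

```python
from typing import Optional


def metadata_file_name(
    column_id: Optional[int] = None,
    row_id: Optional[int] = None
) -> str:
    """Hypothetical stand-in that follows the documented naming scheme."""
    if column_id is None and row_id is None:
        # Neither identifier given: annotations for the full dataset.
        return 'ds.json'
    if row_id is None:
        # Only the column identifier given: column annotations.
        return 'col_{}.json'.format(column_id)
    if column_id is None:
        # Only the row identifier given: row annotations.
        return 'row_{}.json'.format(row_id)
    # Both identifiers given: annotations for a single cell.
    return 'cell_{}_{}.json'.format(column_id, row_id)


for args in [{}, {'column_id': 2}, {'row_id': 5}, {'column_id': 2, 'row_id': 5}]:
    print(args, '->', metadata_file_name(**args))
```

The four combinations of given and omitted identifiers map one-to-one onto the four resource types listed above.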
- class openclean.data.metadata.fs.FileSystemMetadataStore(basedir: str, encoder: Optional[json.encoder.JSONEncoder] = None, decoder: Optional[Callable] = None)
Bases: openclean.data.metadata.base.MetadataStore
Metadata store that maintains annotations for a dataset snapshot in JSON files under a given base directory. The files that maintain annotations are named using the FileSystemMetadataStoreFactory resource identifier. The following are the file names of metadata files for different types of resources:
ds.json: Dataset annotations
col_{column_id}.json: Column annotations
row_{row_id}.json: Row annotations
cell_{column_id}_{row_id}.json: Dataset cell annotations.
- read(column_id: Optional[int] = None, row_id: Optional[int] = None) Dict
Read the annotation dictionary for the specified object.
- Parameters
column_id (int, default=None) – Column identifier for the referenced object (None for rows or full datasets).
row_id (int, default=None) – Row identifier for the referenced object (None for columns or full datasets).
- Return type
dict
- write(doc: Dict, column_id: Optional[int] = None, row_id: Optional[int] = None)
Write the annotation dictionary for the specified object.
- Parameters
doc (dict) – Annotation dictionary that is being written to file.
column_id (int, default=None) – Column identifier for the referenced object (None for rows or full datasets).
row_id (int, default=None) – Row identifier for the referenced object (None for columns or full datasets).
- Return type
dict
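To make the read/write behavior concrete, the following is a minimal sketch of a file-backed annotation store built only on the standard library. The class `SketchMetadataStore` is hypothetical and is not the openclean implementation. In particular, returning an empty dictionary for an object that has no file yet is an assumption, and the encoder and decoder arguments of the real class are omitted.

```python
import json
import os
import tempfile
from typing import Dict, Optional


class SketchMetadataStore:
    """Hypothetical store that keeps one JSON file per annotated object."""

    def __init__(self, basedir: str):
        self.basedir = basedir
        os.makedirs(basedir, exist_ok=True)

    def _path(self, column_id: Optional[int], row_id: Optional[int]) -> str:
        # File names follow the scheme documented for this module.
        if column_id is None and row_id is None:
            name = 'ds.json'
        elif row_id is None:
            name = 'col_{}.json'.format(column_id)
        elif column_id is None:
            name = 'row_{}.json'.format(row_id)
        else:
            name = 'cell_{}_{}.json'.format(column_id, row_id)
        return os.path.join(self.basedir, name)

    def read(
        self, column_id: Optional[int] = None, row_id: Optional[int] = None
    ) -> Dict:
        path = self._path(column_id, row_id)
        if not os.path.isfile(path):
            # Assumption: objects without a file have no annotations.
            return {}
        with open(path, 'r') as f:
            return json.load(f)

    def write(
        self,
        doc: Dict,
        column_id: Optional[int] = None,
        row_id: Optional[int] = None
    ) -> None:
        with open(self._path(column_id, row_id), 'w') as f:
            json.dump(doc, f)


store = SketchMetadataStore(tempfile.mkdtemp())
store.write({'unit': 'cm'}, column_id=1)
column_doc = store.read(column_id=1)   # annotations written above
missing_doc = store.read(row_id=7)     # nothing written for this row
```

Keeping one file per object means that updating the annotations of a single column or cell rewrites only that object's file.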
- class openclean.data.metadata.fs.FileSystemMetadataStoreFactory(basedir: str, encoder: Optional[json.encoder.JSONEncoder] = None, decoder: Optional[Callable] = None)
Bases: openclean.data.metadata.base.MetadataStoreFactory
Factory for metadata stores that maintain annotations in files on the local file system.
- get_store(version: int) openclean.data.metadata.fs.FileSystemMetadataStore
Get the metadata store for the dataset snapshot with the given version identifier.
- Parameters
version (int) – Unique version identifier
- Return type
openclean.data.metadata.fs.FileSystemMetadataStore
- rollback(version: int)
Remove metadata for all dataset versions that are after the given rollback version.
- Parameters
version (int) – Unique identifier of the rollback version.
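The rollback semantics can be sketched as follows. This is an illustration under stated assumptions and not the library's implementation: it assumes that each snapshot's metadata lives in a subdirectory of the base directory named after its integer version identifier. The module description only states that each snapshot has its own directory, so the naming is hypothetical, as is the helper `rollback_metadata`.

```python
import os
import shutil
import tempfile


def rollback_metadata(basedir: str, version: int) -> None:
    """Hypothetical helper: delete metadata directories of later versions."""
    for name in os.listdir(basedir):
        path = os.path.join(basedir, name)
        # Assumption: version directories are named by their integer identifier.
        if os.path.isdir(path) and name.isdigit() and int(name) > version:
            shutil.rmtree(path)


# Create metadata directories for versions 0 to 3, then roll back to version 1.
basedir = tempfile.mkdtemp()
for v in range(4):
    os.makedirs(os.path.join(basedir, str(v)))

rollback_metadata(basedir, version=1)
remaining = sorted(os.listdir(basedir))
print(remaining)
```

Metadata for the rollback version itself is kept; only versions with a larger identifier are removed.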