openclean.data.metadata.fs module

Implementation of the metadata store class that maintains metadata information about dataset snapshots in files on the local file system. Metadata for each snapshot is maintained in a separate directory with different json files for each identifiable object.

openclean.data.metadata.fs.FILE(column_id: Optional[int] = None, row_id: Optional[int] = None) str

Get the name of the metadata file for a resource. The file name depends on whether identifiers for the column and the row are given. The following are the file names of metadata files for different types of resources:

  • ds.json: Dataset annotations

  • col_{column_id}.json: Column annotations

  • row_{row_id}.json: Row annotations

  • cell_{column_id}_{row_id}.json: Dataset cell annotations.

Parameters
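  • column_id (int, default=None) – Column identifier for the referenced object (None for rows or full datasets).

  • row_id (int, default=None) – Row identifier for the referenced object (None for columns or full datasets).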

Return type

str
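The naming scheme listed above can be sketched as a small standalone function. This is an illustration of the documented file names only, not the library's implementation; the name `metadata_file_name` is hypothetical.

```python
from typing import Optional


def metadata_file_name(
    column_id: Optional[int] = None, row_id: Optional[int] = None
) -> str:
    """Hypothetical helper that mirrors the documented naming scheme."""
    if column_id is None and row_id is None:
        # Neither identifier given: annotations for the full dataset.
        return 'ds.json'
    if row_id is None:
        # Only a column identifier: column annotations.
        return 'col_{}.json'.format(column_id)
    if column_id is None:
        # Only a row identifier: row annotations.
        return 'row_{}.json'.format(row_id)
    # Both identifiers given: annotations for a single cell.
    return 'cell_{}_{}.json'.format(column_id, row_id)
```

The identifiers are compared against None rather than tested for truthiness, so that a column or row with identifier 0 is still treated as given.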

class openclean.data.metadata.fs.FileSystemMetadataStore(basedir: str, encoder: Optional[json.encoder.JSONEncoder] = None, decoder: Optional[Callable] = None)

Bases: openclean.data.metadata.base.MetadataStore

Metadata store that maintains annotations for a dataset snapshot in JSON files under a given base directory. The files that maintain annotations are named using the FileSystemMetadataStoreFactory resource identifier. The following are the file names of metadata files for different types of resources:

  • ds.json: Dataset annotations

  • col_{column_id}.json: Column annotations

  • row_{row_id}.json: Row annotations

  • cell_{column_id}_{row_id}.json: Dataset cell annotations.
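The constructor accepts an optional JSON encoder and an optional decoder callable. The sketch below shows what such a pair might look like for annotation values that JSON cannot represent directly (here, datetime objects). It assumes the decoder is applied in the manner of the `object_hook` argument of `json.load`, which this page does not confirm; the names `DatetimeEncoder` and `datetime_decoder` and the `$datetime` tag are hypothetical.

```python
import json
from datetime import datetime


class DatetimeEncoder(json.JSONEncoder):
    """Hypothetical encoder that writes datetime values as tagged objects."""

    def default(self, obj):
        if isinstance(obj, datetime):
            return {'$datetime': obj.isoformat()}
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(obj)


def datetime_decoder(obj: dict):
    """Hypothetical object hook that reverses DatetimeEncoder."""
    if '$datetime' in obj:
        return datetime.fromisoformat(obj['$datetime'])
    return obj


# Round trip using only the standard library.
doc = {'checked_at': datetime(2021, 3, 1, 12, 30)}
text = json.dumps(doc, cls=DatetimeEncoder)
restored = json.loads(text, object_hook=datetime_decoder)
```

Under the stated assumption, the pair would be passed as `encoder=DatetimeEncoder` and `decoder=datetime_decoder` when creating the store.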

read(column_id: Optional[int] = None, row_id: Optional[int] = None) Dict

Read the annotation dictionary for the specified object.

Parameters
  • column_id (int, default=None) – Column identifier for the referenced object (None for rows or full datasets).

  • row_id (int, default=None) – Row identifier for the referenced object (None for columns or full datasets).

Return type

dict

write(doc: Dict, column_id: Optional[int] = None, row_id: Optional[int] = None)

Write the annotation dictionary for the specified object.

Parameters
  • doc (dict) – Annotation dictionary that is being written to file.

  • column_id (int, default=None) – Column identifier for the referenced object (None for rows or full datasets).

  • row_id (int, default=None) – Row identifier for the referenced object (None for columns or full datasets).

Return type

dict
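To illustrate how read and write address objects through the column_id and row_id arguments, here is a minimal stand-in store built only on the standard library. It is a sketch, not the FileSystemMetadataStore implementation: the class name `SketchMetadataStore` is hypothetical, and returning an empty dictionary for an object without a metadata file is an assumption.

```python
import json
import os
import tempfile
from typing import Dict, Optional


class SketchMetadataStore:
    """Hypothetical stand-in that keeps one JSON file per annotated object."""

    def __init__(self, basedir: str):
        self.basedir = basedir
        os.makedirs(basedir, exist_ok=True)

    def _path(self, column_id: Optional[int], row_id: Optional[int]) -> str:
        # File names follow the scheme documented for this module.
        if column_id is None and row_id is None:
            name = 'ds.json'
        elif row_id is None:
            name = 'col_{}.json'.format(column_id)
        elif column_id is None:
            name = 'row_{}.json'.format(row_id)
        else:
            name = 'cell_{}_{}.json'.format(column_id, row_id)
        return os.path.join(self.basedir, name)

    def read(
        self, column_id: Optional[int] = None, row_id: Optional[int] = None
    ) -> Dict:
        path = self._path(column_id, row_id)
        if not os.path.isfile(path):
            # Assumption: objects without annotations yield an empty dict.
            return dict()
        with open(path, 'r') as f:
            return json.load(f)

    def write(
        self, doc: Dict, column_id: Optional[int] = None,
        row_id: Optional[int] = None
    ) -> None:
        with open(self._path(column_id, row_id), 'w') as f:
            json.dump(doc, f)


with tempfile.TemporaryDirectory() as basedir:
    store = SketchMetadataStore(basedir)
    store.write({'source': 'survey'})                 # full dataset
    store.write({'unit': 'USD'}, column_id=2)         # one column
    store.write({'flag': 'outlier'}, column_id=2, row_id=5)  # one cell
    column_doc = store.read(column_id=2)
    missing_doc = store.read(row_id=9)
```

Omitting both identifiers addresses the dataset, giving one identifier addresses a column or a row, and giving both addresses a single cell.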

class openclean.data.metadata.fs.FileSystemMetadataStoreFactory(basedir: str, encoder: Optional[json.encoder.JSONEncoder] = None, decoder: Optional[Callable] = None)

Bases: openclean.data.metadata.base.MetadataStoreFactory

Factory pattern for file system metadata stores.

get_store(version: int) openclean.data.metadata.fs.FileSystemMetadataStore

Get the metadata store for the dataset snapshot with the given version identifier.

Parameters

version (int) – Unique version identifier.

Return type

openclean.data.metadata.fs.FileSystemMetadataStore

rollback(version: int)

Remove metadata for all dataset versions that come after the given rollback version.

Parameters

version (int) – Unique identifier of the rollback version.
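The rollback semantics can be sketched as follows. The module description states that each snapshot keeps its metadata in a separate directory; how those directories are named is not documented here, so the sketch assumes one sub-directory per version, named by the version number. The helpers `snapshot_dir` and `rollback_sketch` are hypothetical.

```python
import os
import shutil
import tempfile


def snapshot_dir(basedir: str, version: int) -> str:
    """Assumed layout: one sub-directory per snapshot, named by version."""
    return os.path.join(basedir, str(version))


def rollback_sketch(basedir: str, version: int) -> None:
    """Delete metadata directories of all versions after the given one."""
    for name in os.listdir(basedir):
        path = os.path.join(basedir, name)
        # Skip anything that does not look like a snapshot directory.
        if not os.path.isdir(path) or not name.isdigit():
            continue
        if int(name) > version:
            shutil.rmtree(path)


with tempfile.TemporaryDirectory() as basedir:
    for v in range(4):
        os.makedirs(snapshot_dir(basedir, v))
    rollback_sketch(basedir, version=1)
    remaining = sorted(os.listdir(basedir))  # ['0', '1']
```

The rollback version itself is kept; only strictly later versions are removed.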