openclean.data.archive.base module
Interfaces and base classes for the data store that is used to maintain all versions of a data frame.
- class openclean.data.archive.base.ActionHandle
Bases:
object
Interface for action handles. Defines the serialization method to_dict that is used to get a descriptor for the action that created a dataset snapshot.
- abstract to_dict() Dict
Get a dictionary serialization for the action.
- Return type
dict
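A minimal sketch of a concrete action handle. So that the example runs without openclean installed, ActionHandle is re-declared locally as a stand-in that mirrors the interface above. RenameColumn, its fields, and the layout of the returned dictionary are hypothetical and not part of the library.

```python
from abc import ABCMeta, abstractmethod
from typing import Dict


class ActionHandle(metaclass=ABCMeta):
    """Local stand-in mirroring openclean.data.archive.base.ActionHandle."""

    @abstractmethod
    def to_dict(self) -> Dict:
        """Get a dictionary serialization for the action."""
        raise NotImplementedError()


class RenameColumn(ActionHandle):
    """Hypothetical action that records a column rename."""

    def __init__(self, old_name: str, new_name: str):
        self.old_name = old_name
        self.new_name = new_name

    def to_dict(self) -> Dict:
        # The descriptor layout is an assumption; the interface only
        # requires that a dictionary is returned.
        return {
            'optype': 'rename',
            'arguments': {'from': self.old_name, 'to': self.new_name}
        }


descriptor = RenameColumn('bldg', 'building').to_dict()
```

Because to_dict is abstract, the base class cannot be instantiated directly; only subclasses that implement the method can.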
- class openclean.data.archive.base.ArchiveStore
Bases:
object
Interface for the data store that is used to maintain the different versions of a dataset that a user creates using the openclean (Jupyter) API.
- abstract apply(operators: Union[histore.document.operator.DatasetOperator, List[histore.document.operator.DatasetOperator]], origin: Optional[int] = None, validate: Optional[bool] = None) List[histore.archive.snapshot.Snapshot]
Apply a given operator or a sequence of operators on a snapshot in the archive.
The resulting snapshot(s) are merged directly into the archive. This method makes it possible to update data in an archive without first checking out a snapshot and then committing the modified version(s).
Returns list of handles for the created snapshots.
Note that this method has some limitations. Most importantly, it can neither change the order of rows nor insert new rows at this point. Columns can be added, moved, renamed, and deleted.
- Parameters
operators (histore.document.operator.DatasetOperator or list of histore.document.operator.DatasetOperator) – Operator(s) that is/are used to update the rows in a dataset snapshot to create new snapshot(s) in this archive.
origin (int, default=None) – Unique version identifier for the original snapshot that is being updated. By default the last version is updated.
validate (bool, default=False) – Validate that the resulting archive is in proper order before committing the action.
- Return type
list of histore.archive.snapshot.Snapshot
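The row-wise behavior described for apply can be sketched in plain Python. This is only an illustration under stated assumptions: rows are dictionaries rather than histore documents, operators are ordinary callables rather than DatasetOperator instances, and each operator is assumed to yield one snapshot. The names apply_operators, upper_city, and add_country are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Operator = Callable[[Row], Row]


def apply_operators(rows: List[Row], operators: List[Operator]) -> List[List[Row]]:
    """Apply each operator to the result of the previous one.

    Returns one list of rows (a toy 'snapshot') per operator. Every operator
    maps exactly one input row to one output row, so rows are never
    reordered or inserted; only the columns of a row may change.
    """
    snapshots = []
    current = rows
    for op in operators:
        # Copy each row so that earlier snapshots are not mutated.
        current = [op(dict(row)) for row in current]
        snapshots.append(current)
    return snapshots


def upper_city(row: Row) -> Row:
    """Modify an existing column."""
    row['city'] = str(row['city']).upper()
    return row


def add_country(row: Row) -> Row:
    """Add a new column."""
    row['country'] = 'US'
    return row


origin = [{'id': 1, 'city': 'nyc'}, {'id': 2, 'city': 'boston'}]
created = apply_operators(origin, [upper_city, add_country])
```

The original rows stay untouched, and the second snapshot builds on the first, which mirrors updating an archive without a separate checkout and commit.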
- abstract checkout(version: Optional[int] = None) pandas.core.frame.DataFrame
Get a specific version of the dataset. The dataset snapshot is identified by the unique version identifier.
Returns the data frame for the dataset snapshot.
Raises a ValueError if the given version is unknown.
- Parameters
version (int) – Unique dataset version identifier.
- Return type
pd.DataFrame
- Raises
ValueError –
- abstract commit(source: Union[pandas.core.frame.DataFrame, str, histore.document.base.Document], action: Optional[openclean.data.archive.base.ActionHandle] = None, checkout: Optional[bool] = False) Union[pandas.core.frame.DataFrame, str, histore.document.base.Document]
Insert a new dataset snapshot.
Returns the inserted data frame with potentially modified row indexes.
- Parameters
source (openclean.data.stream.base.Datasource) – Input data frame or stream containing the new dataset version that is being stored.
action (openclean.data.archive.base.ActionHandle, default=None) – Optional handle of the action that created the new dataset version.
checkout (bool, default=False) – Check out the committed snapshot and return the result. This option is required only if the row index of the given data frame has been modified by the commit operation, i.e., if the index of the given data frame contained non-integers, negative values, or duplicate values.
- Return type
openclean.data.stream.base.Datasource
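The condition under which commit may modify the row index (non-integers, negative values, or duplicates) can be checked up front. The helper below is a hypothetical sketch, not a library function; it works on plain Python values, so for a pandas data frame one would likely pass something like df.index.tolist().

```python
from typing import Iterable


def index_needs_rewrite(index: Iterable[object]) -> bool:
    """Return True if the index contains non-integers, negative values,
    or duplicate values.
    """
    seen = set()
    for value in index:
        # bool is a subclass of int in Python; treat it as a non-integer here.
        if isinstance(value, bool) or not isinstance(value, int):
            return True
        if value < 0 or value in seen:
            return True
        seen.add(value)
    return False
```

If the helper returns True, passing checkout=True to commit is the case the parameter description refers to.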
- abstract last_version() int
Get the version identifier for the last dataset snapshot.
- Return type
int
- Raises
ValueError –
- abstract metadata(version: Optional[int] = None) openclean.data.metadata.base.MetadataStore
Get metadata that is associated with the referenced dataset version. If no version is specified the metadata collection for the latest version is returned.
Raises a ValueError if the specified version is unknown.
- Parameters
version (int) – Unique dataset version identifier.
- Return type
openclean.data.metadata.base.MetadataStore
- Raises
ValueError –
- abstract open(version: Optional[int] = None) histore.archive.reader.SnapshotReader
Get a stream reader for a dataset snapshot.
- Parameters
version (int, default=None) – Unique version identifier. By default the last version is used.
- Return type
histore.archive.reader.SnapshotReader
- abstract rollback(version: int) pandas.core.frame.DataFrame
Roll back the archive history to the snapshot with the given version identifier.
Returns the data frame for the snapshot that is now the last snapshot in the modified archive.
- Parameters
version (int) – Unique identifier of the rollback version.
- Return type
pd.DataFrame
- abstract schema() histore.archive.schema.ArchiveSchema
Get the schema history for the archived dataset.
- Return type
histore.archive.schema.ArchiveSchema
- abstract snapshots() List[histore.archive.snapshot.Snapshot]
Get list of handles for all versions of the dataset.
- Return type
list of histore.archive.snapshot.Snapshot
- Raises
ValueError –
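The versioning semantics of the ArchiveStore interface (commit, checkout, last_version, rollback, snapshots) can be illustrated with a toy in-memory store. This is a sketch under several assumptions and not the library's implementation: rows are lists of dictionaries instead of data frames, version identifiers are assumed to be consecutive integers starting at 0, checkout without a version is assumed to return the latest snapshot, and last_version is assumed to raise a ValueError for an empty archive. InMemoryStore is a hypothetical name and does not subclass the real interface.

```python
from typing import Dict, List, Optional

Rows = List[Dict[str, object]]


class InMemoryStore:
    """Toy version store; not a subclass of the real ArchiveStore."""

    def __init__(self):
        self._versions: List[Rows] = []

    def commit(self, source: Rows) -> Rows:
        # Store a copy so later changes to `source` do not alter history.
        self._versions.append([dict(row) for row in source])
        return source

    def last_version(self) -> int:
        if not self._versions:
            raise ValueError('empty archive')
        return len(self._versions) - 1

    def checkout(self, version: Optional[int] = None) -> Rows:
        if version is None:
            version = self.last_version()
        if not 0 <= version < len(self._versions):
            raise ValueError("unknown version '{}'".format(version))
        return [dict(row) for row in self._versions[version]]

    def rollback(self, version: int) -> Rows:
        rows = self.checkout(version)
        # Drop all snapshots that were created after `version`.
        del self._versions[version + 1:]
        return rows

    def snapshots(self) -> List[int]:
        return list(range(len(self._versions)))


store = InMemoryStore()
store.commit([{'name': 'alice'}])
store.commit([{'name': 'Alice'}])
store.commit([{'name': 'ALICE'}])
restored = store.rollback(1)
```

After the rollback, version 1 is the last snapshot and version 2 is no longer part of the history.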
- openclean.data.archive.base.create(dataset: str, source: Optional[Union[pandas.core.frame.DataFrame, str, histore.document.base.Document]], primary_key: Optional[List[str]], replace: Optional[bool] = False) histore.archive.base.Archive
Create a new archive for a dataset with the given identifier. If an archive with the given identifier already exists, it is replaced if the replace flag is True; otherwise an error is raised.
- Parameters
dataset (string) – Unique dataset identifier.
source (openclean.data.archive.base.Datasource, default=None) – Initial dataset snapshot that is loaded into the created archive.
primary_key (list of string) – List of primary key attributes for merging snapshots into the created archive.
replace (bool, default=False) – Replace an existing archive with the same name if it exists.
- Return type
histore.archive.base.Archive
- Raises
ValueError –
- openclean.data.archive.base.delete(dataset: str)
Delete the existing archive for a dataset. Raises a ValueError if the dataset is unknown.
- Parameters
dataset (string) – Unique dataset identifier.
- Raises
ValueError –
- openclean.data.archive.base.get(dataset: str) histore.archive.base.Archive
Get the existing archive for a dataset. Raises a ValueError if the dataset is unknown.
- Parameters
dataset (string) – Unique dataset identifier.
- Return type
histore.archive.base.Archive
- Raises
ValueError –
- openclean.data.archive.base.manager() histore.archive.manager.persist.PersistentArchiveManager
Get instance of the archive manager that is used to maintain master datasets.
- Return type
histore.archive.manager.base.ArchiveManager