openclean.data.archive.cache module

Implementation of the datastore class that caches dataset snapshots in memory.

class openclean.data.archive.cache.CacheEntry(df: Optional[pandas.core.frame.DataFrame] = None, version: Optional[int] = None)

Bases: object

Entry in a datastore cache. Maintains the data frame and version identifier.

df: pandas.core.frame.DataFrame = None

version: int = None

class openclean.data.archive.cache.CachedDatastore(datastore: openclean.data.archive.base.ArchiveStore)

Bases: openclean.data.archive.base.ArchiveStore

Wrapper around a datastore that maintains the last dataset version that was committed or checked out in main memory. This follows the assumption that the spreadsheet view will always display (and modify) this version (and only this version).
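The caching behavior described above can be illustrated with a minimal, self-contained sketch. It does not use openclean or pandas: `ToyStore` and `ToyCachedStore` are hypothetical stand-ins, and plain lists of dictionaries stand in for data frames. The sketch assumes only what the description states, namely that the wrapper remembers the last version that was committed or checked out and serves it from memory.

```python
from typing import Dict, List, Optional

Rows = List[dict]


class ToyStore:
    """Hypothetical stand-in for an archive store (not the openclean class)."""

    def __init__(self) -> None:
        self._versions: Dict[int, Rows] = {}
        self.reads = 0  # counts how often the underlying store is read

    def commit(self, rows: Rows) -> int:
        version = len(self._versions)
        self._versions[version] = [dict(r) for r in rows]
        return version

    def last_version(self) -> int:
        if not self._versions:
            raise ValueError("empty archive")
        return len(self._versions) - 1

    def checkout(self, version: Optional[int] = None) -> Rows:
        if version is None:
            version = self.last_version()
        if version not in self._versions:
            raise ValueError(f"unknown version {version}")
        self.reads += 1
        return [dict(r) for r in self._versions[version]]


class ToyCachedStore:
    """Keeps only the last committed or checked-out version in memory."""

    def __init__(self, store: ToyStore) -> None:
        self.store = store
        self._cache_version: Optional[int] = None
        self._cache_rows: Optional[Rows] = None

    def commit(self, rows: Rows) -> int:
        version = self.store.commit(rows)
        self._cache_version, self._cache_rows = version, rows
        return version

    def checkout(self, version: Optional[int] = None, no_cache: bool = False) -> Rows:
        if version is None:
            version = self.store.last_version()
        if not no_cache and version == self._cache_version:
            return self._cache_rows  # served from memory
        rows = self.store.checkout(version)
        self._cache_version, self._cache_rows = version, rows
        return rows


cached = ToyCachedStore(ToyStore())
v0 = cached.commit([{"name": "alice"}])
v1 = cached.commit([{"name": "Alice"}])
cached.checkout(v1)                 # cache hit, no read on the store
cached.checkout(v0)                 # cache miss, replaces the cached entry
cached.checkout(v0, no_cache=True)  # bypasses the cache explicitly
```

Because only one entry is kept, alternating between two versions causes a read from the underlying store on every call. That matches the stated assumption that a single version is displayed and modified at a time.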

apply(operators: Union[histore.document.operator.DatasetOperator, List[histore.document.operator.DatasetOperator]], origin: Optional[int] = None, validate: Optional[bool] = None) → List[histore.archive.snapshot.Snapshot]

Apply a given operator or a sequence of operators on a snapshot in the archive.

The resulting snapshot(s) will be merged directly into the archive. This method updates data in an archive directly, without the need to check out the snapshot first and then commit the modified version(s).

Returns a list of handles for the created snapshots.

Note that there are some limitations for this method. Most importantly, at this point the method can neither change the order of rows nor insert new rows. Columns can be added, moved, renamed, and deleted.

Parameters

operators (histore.document.operator.DatasetOperator or list of histore.document.operator.DatasetOperator) – Operator(s) that is/are used to update the rows in a dataset snapshot to create new snapshot(s) in this archive.
origin (int, default=None) – Unique version identifier for the original snapshot that is being updated. By default the last version is updated.
validate (bool, default=None) – Validate that the resulting archive is in proper order before committing the action.

Return type

list of histore.archive.snapshot.Snapshot
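The limitations of apply() can be made concrete with a small sketch. It is not the histore implementation: `apply_operators`, `RowOperator`, and the list-based `history` are hypothetical stand-ins, and the sketch assumes that a sequence of operators is applied one after another, each producing one new snapshot. Operators are modeled as row-wise functions, which is why rows can be changed and columns added or removed, but rows cannot be reordered or inserted.

```python
from typing import Callable, Dict, List, Optional, Union

Row = Dict[str, object]
RowOperator = Callable[[Row], Row]  # hypothetical stand-in for a DatasetOperator


def apply_operators(
    history: List[List[Row]],
    operators: Union[RowOperator, List[RowOperator]],
    origin: Optional[int] = None,
) -> List[int]:
    """Append one snapshot per operator and return the new version numbers."""
    if not isinstance(operators, list):
        operators = [operators]
    if origin is None:
        origin = len(history) - 1  # default: update the last version
    if not 0 <= origin < len(history):
        raise ValueError(f"unknown version {origin}")
    rows = history[origin]
    created = []
    for op in operators:
        # A row-wise mapping keeps the number and order of rows unchanged;
        # only the columns (keys) and values of each row may differ.
        rows = [op(dict(row)) for row in rows]
        history.append(rows)
        created.append(len(history) - 1)
    return created


def upper_name(row: Row) -> Row:
    row["name"] = str(row["name"]).upper()
    return row


def add_flag(row: Row) -> Row:
    row["checked"] = True  # adding a column is allowed
    return row


history = [[{"name": "alice"}, {"name": "bob"}]]
versions = apply_operators(history, [upper_name, add_flag])
```

Each operator receives a copy of the row, so the original snapshot stays untouched while two new versions are appended.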

checkout(version: Optional[int] = None, no_cache: Optional[bool] = False) → pandas.core.frame.DataFrame

Get a specific version of a dataset. The dataset snapshot is identified by the unique version identifier.

Raises a ValueError if the given version is unknown.

Parameters

version (int) – Unique dataset version identifier.
no_cache (bool, default=False) – If True, ignore the cached dataset version and check out the dataset from the associated data store.

Return type

pd.DataFrame

Raises

ValueError – If the given version is unknown.

commit(source: Union[pandas.core.frame.DataFrame, str, histore.document.base.Document], action: Optional[openclean.data.archive.base.ActionHandle] = None, checkout: Optional[bool] = False) → Union[pandas.core.frame.DataFrame, str, histore.document.base.Document]

Insert a new version for a dataset.

Returns the inserted data frame with potentially modified row indexes.

Parameters

source (openclean.data.stream.base.Datasource) – Input data frame or stream containing the new dataset version that is being stored.
action (openclean.data.archive.base.ActionHandle, default=None) – Optional handle of the action that created the new dataset version.
checkout (bool, default=False) – Check out the committed snapshot and return the result. This option is required only if the row index of the given data frame has been modified by the commit operation, i.e., if the index of the given data frame contained non-integers, negative values, or duplicate values.

Return type

openclean.data.stream.base.Datasource
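The condition under which the checkout option matters can be expressed as a small predicate. This is a sketch based only on the description of the checkout parameter above; `index_needs_rewrite` is a hypothetical helper, not part of openclean, and it works on a plain Python sequence of index values.

```python
from typing import Hashable, Sequence


def index_needs_rewrite(index: Sequence[Hashable]) -> bool:
    """Return True if the index contains non-integers, negative values,
    or duplicate values (the three cases named in the parameter description)."""
    seen = set()
    for value in index:
        # bool is a subclass of int in Python; treat it as a non-integer here.
        if isinstance(value, bool) or not isinstance(value, int):
            return True
        if value < 0 or value in seen:
            return True
        seen.add(value)
    return False


ok_index = index_needs_rewrite([2, 0, 5])
duplicate_index = index_needs_rewrite([0, 1, 1])
negative_index = index_needs_rewrite([-1, 0])
string_index = index_needs_rewrite(["a", "b"])
```

For a pandas data frame, the values could be obtained with `df.index.tolist()`, which yields plain Python values. When the predicate is True, passing checkout=True is the case the parameter description refers to.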

last_version() → int

Get an identifier for the last version of the dataset.

Return type: int
Raises: ValueError –

metadata(version: Optional[int] = None) → openclean.data.metadata.base.MetadataStore

Get metadata that is associated with the referenced dataset version. If no version is specified the metadata collection for the latest version is returned.

Raises a ValueError if the specified version is unknown.

Parameters: version (int) – Unique dataset version identifier.
Return type: openclean.data.metadata.base.MetadataStore
Raises: ValueError – If the specified version is unknown.

open(version: Optional[int] = None) → histore.archive.reader.SnapshotReader

Get a stream reader for a dataset snapshot.

Parameters: version (int, default=None) – Unique version identifier. By default the last version is used.
Return type: openclean.data.archive.base.SnapshotReader

rollback(version: int) → pandas.core.frame.DataFrame

Rollback the archive history to the snapshot with the given version identifier.

Returns the data frame for the snapshot that is now the last snapshot in the modified archive.

Parameters: version (int) – Unique identifier of the rollback version.
Return type: pd.DataFrame
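The effect of a rollback on the version history can be sketched as follows. The function `rollback_history` and the list-based history are hypothetical stand-ins, not the openclean implementation. The sketch assumes that all snapshots after the given version are discarded, and raising a ValueError for an unknown version is also an assumption, since the documentation above does not list any exceptions for this method.

```python
from typing import List


def rollback_history(history: List[list], version: int) -> list:
    """Drop all snapshots after `version` and return the new last snapshot."""
    if not 0 <= version < len(history):
        raise ValueError(f"unknown version {version}")  # assumed behavior
    del history[version + 1:]
    return history[-1]


history = [["v0 rows"], ["v1 rows"], ["v2 rows"]]
current = rollback_history(history, 1)
```

After the call, version 1 is the last version in the history and its rows are returned, mirroring the return value described above.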

schema() → histore.archive.schema.ArchiveSchema

Get the schema history for the archived dataset.

Return type: openclean.data.archive.base.ArchiveSchema

snapshots() → List[histore.archive.snapshot.Snapshot]

Get list of handles for all versions of a given dataset.

Return type: list of histore.archive.snapshot.Snapshot
Raises: ValueError –