openclean.data.refdata module

Module that provides access to reference data sets that are downloaded from a reference data repository to the local file system.

class openclean.data.refdata.RefStore(basedir: Optional[str] = None, loader: Optional[refdata.repo.loader.RepositoryIndexLoader] = None, auto_download: Optional[bool] = None, connect_url: Optional[str] = None)

Bases: refdata.store.base.LocalStore

Default local store for the openclean package. Uses the module name and package version to set the respective properties of the created local store instance.

openclean.data.refdata.download(key: str)

Download the file with the given unique identifier to the local reference data store.

Parameters

key (string) – Unique reference data file identifier.

openclean.data.refdata.list() → List[refdata.base.DatasetDescriptor]

Get the descriptors for all datasets that have been downloaded and are available from the local dataset store.

Return type

list of refdata.base.DatasetDescriptor
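As a minimal sketch, the helper below combines list() and download() to fetch a dataset only when it is not yet in the local store. The helper name ensure_local is hypothetical, and the code assumes that each descriptor exposes its unique key as an identifier attribute. The module is passed in as an argument so the sketch can also be tried against a stand-in object.

```python
def ensure_local(refdata, key):
    """Download `key` unless it is already in the local store.

    `refdata` is expected to be the openclean.data.refdata module (or any
    object that offers compatible list() and download() functions).
    Returns True if a download was triggered and False otherwise.
    """
    # Assumption: descriptors expose their unique key as `.identifier`.
    local_keys = {ds.identifier for ds in refdata.list()}
    if key in local_keys:
        return False
    refdata.download(key)
    return True
```

With the package installed, this would be called as ensure_local(openclean.data.refdata, key), where key is an identifier taken from the repository index.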

openclean.data.refdata.load(key: str, auto_download: Optional[bool] = None) → refdata.dataset.base.DatasetHandle

Get a handle for the dataset with the given unique identifier.

If the dataset has not been downloaded to the local store yet, it will be downloaded if the auto_download flag is True or if the environment variable REFDATA_AUTODOWNLOAD is set to True. The auto_download parameter for this function will override the value in the environment variable when loading the dataset.

If the dataset is not available in the local store (and is not downloaded automatically) an error is raised.

Parameters
  • key (string) – External unique dataset identifier.

  • auto_download (bool, default=None) – Override the global auto-download flag of the local store for this call.

Return type

refdata.dataset.DatasetHandle

Raises

refdata.error.NotDownloadedError
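The precedence described above (an explicit auto_download argument wins over the REFDATA_AUTODOWNLOAD environment variable) can be sketched as follows. This is an illustration only, not the library's implementation: the function name resolve_auto_download is hypothetical, and the case-insensitive comparison against "true" is an assumption about how the variable is parsed.

```python
import os


def resolve_auto_download(auto_download=None, environ=None):
    """Decide whether a missing dataset should be downloaded automatically."""
    env = os.environ if environ is None else environ
    # An explicit True/False passed by the caller always takes precedence.
    if auto_download is not None:
        return auto_download
    # Assumption: the variable counts as set when its value equals "true",
    # ignoring case and surrounding whitespace.
    return env.get("REFDATA_AUTODOWNLOAD", "").strip().lower() == "true"
```

Passing auto_download=False therefore suppresses the download even when the environment variable is set to True, in which case a missing dataset results in refdata.error.NotDownloadedError.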

openclean.data.refdata.open(key: str, auto_download: Optional[bool] = None) → IO

Open the data file for the dataset with the given unique identifier.

If the dataset has not been downloaded to the local store yet, it will be downloaded if the auto_download flag is True or if the environment variable REFDATA_AUTODOWNLOAD is set to True. The auto_download parameter for this function will override the value in the environment variable when opening the dataset.

If the dataset is not available in the local store (and is not downloaded automatically) an error is raised.

Parameters
  • key (string) – External unique dataset identifier.

  • auto_download (bool, default=None) – Override the global auto-download flag of the local store for this call.

Return type

file-like object

Raises

refdata.error.NotDownloadedError
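Because open() returns a file-like object, the caller is responsible for closing it. The sketch below reads the first few lines of a data file and closes the handle afterwards. The helper name read_head is hypothetical. The documentation does not say whether the file is opened in text or binary mode, so the sketch handles both; the UTF-8 decoding is an assumption.

```python
from contextlib import closing
from itertools import islice


def read_head(open_dataset, key, n=5):
    """Return the first `n` lines of the data file for `key`.

    `open_dataset` is expected to be openclean.data.refdata.open (or any
    callable that returns a file-like object for a dataset key).
    """
    lines = []
    # closing() guarantees that the handle is released even on errors.
    with closing(open_dataset(key)) as f:
        for line in islice(f, n):
            # Assumption: binary content is UTF-8 encoded.
            if isinstance(line, bytes):
                line = line.decode("utf-8")
            lines.append(line.rstrip("\r\n"))
    return lines
```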

openclean.data.refdata.remove(key: str) → bool

Remove the dataset with the given unique identifier from the local store. Returns True if the dataset was removed and False if the dataset had not been downloaded before.

Parameters

key (string) – External unique dataset identifier.

Return type

bool
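Since remove() reports through its return value whether anything was deleted, it can be called safely for keys that were never downloaded. A small sketch, with the hypothetical helper name remove_all, that cleans up several datasets and reports which ones were actually present:

```python
def remove_all(refdata, keys):
    """Remove the given datasets and return the keys that were deleted.

    `refdata` is expected to be the openclean.data.refdata module (or any
    object that offers a compatible remove() function).
    """
    removed = []
    for key in keys:
        # remove() returns False for datasets that were not in the store.
        if refdata.remove(key):
            removed.append(key)
    return removed
```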

openclean.data.refdata.repository(filter: Optional[Union[str, List[str], Set[str]]] = None) → List[refdata.base.DatasetDescriptor]

Query the repository index that is associated with the local reference data store.

The filter is a single tag or a list of tags. The result will include those datasets that contain all the query tags. The search includes the dataset tags as well as the tags for individual dataset columns.

If no filter is specified, the full list of dataset descriptors in the repository is returned.

Parameters

filter (string, list of string, or set of string) – (List of) query tags.

Return type

list of refdata.base.DatasetDescriptor
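The matching rule described above can be illustrated with plain Python sets. This is a sketch of the documented semantics, not the library's code: the function names and the way tags are passed in are hypothetical, and exact (case-sensitive) tag comparison is an assumption.

```python
def normalize_query(query=None):
    """Turn a single tag, a list, or a set of tags into a set (or None)."""
    if query is None:
        return None
    if isinstance(query, str):
        return {query}
    return set(query)


def matches(dataset_tags, column_tags, query=None):
    """Check whether a dataset satisfies the query tags.

    `dataset_tags` is an iterable of tags for the dataset itself and
    `column_tags` is an iterable with one tag collection per column.
    """
    wanted = normalize_query(query)
    # Without a filter every dataset is part of the result.
    if wanted is None:
        return True
    # Tags of the dataset and of all its columns are searched together.
    available = set(dataset_tags)
    for tags in column_tags:
        available.update(tags)
    # All query tags have to be present.
    return wanted <= available
```

Under this reading, a dataset tagged "geo" whose columns carry the tag "country" would match the filter ["geo", "country"], but not ["geo", "city"].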

openclean.data.refdata.store() → openclean.data.refdata.RefStore

Get an instance of the local reference data store.

This function is used by the other functions in this module to create the local reference data store. For now, all default settings are used when creating the data store. In the future, an openclean-specific configuration may be used instead.

Return type

openclean.data.refdata.RefStore