Configuration

openclean defines several environment variables that can be used to configure the behavior of different parts of the package.

Data Storage

Some components of the openclean package store external data files. These files will be stored in sub-folders of a base directory that is specified by the environment variable OPENCLEAN_DATADIR. By default, the folder openclean/data user’s cache directory is used as the base directory.

Multi-Threading

Several tasks in openclean lend themselves well to being run using multiple threads (e.g., key collision clustering using :class:KeyCollision). If the environment variable OPENCLEAN_THREADS is set to a positive integer value, it defines the number of parallel threads that are used by default. If the variable is not set (or set to 1) a single thread is used.

Configuration for Workers for External Processes

openclean integrates data cleaning and data profiling tools that are implemented in programming languages other than Python and that are executed as external processes. For this purpose, openclean depends on the flowServ package that supports execution of sequential workflows (data processing pipelines) in different environments. The environments that are currently supported either use the Python subprocess package of Docker.

Workers for external processes are configured using configuration files that define the type of execution engine that is used for different tasks (refer to the flowServ documentation for file formats and configuration options).