Data Provenance
openclean lets users maintain the provenance of the operations performed on a dataset. Much like a version control system, it provides methods to load, commit, and checkout versions of a dataset. This section describes the methods available to achieve this.
We use a sample of NYC Open Data containing completed job codes at various locations in New York City to demonstrate some examples.
import os
path_to_file = os.path.join(os.getcwd(), 'source', 'data')
from openclean.data.load import dataset
ds = dataset(os.path.join(path_to_file, 'job_locations.csv'))
ds.head()
|   | Job # | Borough | Street Name | GIS_NTA_NAME |
|---|---|---|---|---|
| 0 | 140915936 | MANHATTAN | EAST 93RD STREET | Upper East Side-Carnegie Hill |
| 1 | 340737929 | BROOKLYN | 16TH AVE | Bensonhurst West |
| 2 | 440601733 | QUEENS | AUSTIN STREET | Kew Gardens |
| 3 | 340737901 | BROOKLYN | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
| 4 | 240275910 | BRONX | CEDAR AVENUE | University Heights-Morris Heights |
Initialize
To maintain provenance in openclean, a persistent instance of the openclean engine needs to be created first. All versions of the data files are maintained inside the base directory.
from openclean.engine.base import DB
db = DB(basedir='./archive', create=True)
Create
The user can then register their dataset with the initialized openclean engine.
df = db.create(source=ds, name='jobs', primary_key='Job #')
The create method returns a pandas dataframe object to which we will apply some transformations.
Commit
Let’s create a new version of the dataset by lowercasing the values in the Borough column.
from openclean.operator.transform.update import update
df = db.checkout('jobs') # get the dataset
lower_cased = update(df, columns='Borough', func=str.lower)
lower_cased
|   | Job # | Borough | Street Name | GIS_NTA_NAME |
|---|---|---|---|---|
| 2735 | 140915936 | manhattan | EAST 93RD STREET | Upper East Side-Carnegie Hill |
| 6822 | 340737929 | brooklyn | 16TH AVE | Bensonhurst West |
| 8988 | 440601733 | queens | AUSTIN STREET | Kew Gardens |
| 6820 | 340737901 | brooklyn | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
| 4188 | 240275910 | bronx | CEDAR AVENUE | University Heights-Morris Heights |
| ... | ... | ... | ... | ... |
| 7359 | 340744369 | brooklyn | FOUNTAIN AVENUE | East New York |
| 3503 | 201204883 | bronx | LAFAYETTE AVENUE | Soundview-Castle Hill-Clason Point-Harding Park |
| 3504 | 201204883 | bronx | LAFAYETTE AVENUE | Soundview-Castle Hill-Clason Point-Harding Park |
| 3422 | 140923847 | manhattan | MADISON AVENUE | Midtown-Midtown South |
| 7360 | 340744378 | brooklyn | AMBOY STREET | Brownsville |

10000 rows × 4 columns
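For reference, the update call above behaves much like a plain pandas column transformation. A minimal sketch of the equivalent operation, using a small hypothetical frame rather than the jobs dataset:

```python
import pandas as pd

# Small stand-in frame (hypothetical values, not the jobs dataset).
df = pd.DataFrame({'Borough': ['MANHATTAN', 'BROOKLYN', 'QUEENS']})

# Apply str.lower to every value in the column, mirroring
# update(df, columns='Borough', func=str.lower).
df['Borough'] = df['Borough'].apply(str.lower)

print(df['Borough'].tolist())  # ['manhattan', 'brooklyn', 'queens']
```

Unlike the plain pandas version, the openclean update operator records the applied function, which is what makes the operation reproducible in the provenance log.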
The new dataset version can then be committed in the engine using the commit method.
db.commit(name='jobs', source=lower_cased)
|   | Job # | Borough | Street Name | GIS_NTA_NAME |
|---|---|---|---|---|
| 2735 | 140915936 | manhattan | EAST 93RD STREET | Upper East Side-Carnegie Hill |
| 6822 | 340737929 | brooklyn | 16TH AVE | Bensonhurst West |
| 8988 | 440601733 | queens | AUSTIN STREET | Kew Gardens |
| 6820 | 340737901 | brooklyn | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
| 4188 | 240275910 | bronx | CEDAR AVENUE | University Heights-Morris Heights |
| ... | ... | ... | ... | ... |
| 7359 | 340744369 | brooklyn | FOUNTAIN AVENUE | East New York |
| 3503 | 201204883 | bronx | LAFAYETTE AVENUE | Soundview-Castle Hill-Clason Point-Harding Park |
| 3504 | 201204883 | bronx | LAFAYETTE AVENUE | Soundview-Castle Hill-Clason Point-Harding Park |
| 3422 | 140923847 | manhattan | MADISON AVENUE | Midtown-Midtown South |
| 7360 | 340744378 | brooklyn | AMBOY STREET | Brownsville |

10000 rows × 4 columns
To see the different versions of each dataset, we can simply request a log from the engine:
logs = db.dataset('jobs').log()
logs
[LogEntry(descriptor={'optype': 'load', 'columns': None}, action=None, version=0),
LogEntry(descriptor={'optype': 'commit', 'columns': None}, action=<openclean.engine.action.CommitOp object at 0x7f54fa1935d0>, version=1)]
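Each log entry exposes a descriptor and a version identifier, and the version is what checkout and rollback expect. A minimal sketch of scanning a log for the version created by the initial load, using a simple stand-in for the entries (in practice they come from `db.dataset(...).log()`):

```python
from collections import namedtuple

# Stand-in with the fields visible in the log output above.
LogEntry = namedtuple('LogEntry', ['descriptor', 'version'])

logs = [
    LogEntry(descriptor={'optype': 'load', 'columns': None}, version=0),
    LogEntry(descriptor={'optype': 'commit', 'columns': None}, version=1),
]

# Find the version created by the initial load operation.
load_version = next(e.version for e in logs if e.descriptor['optype'] == 'load')
print(load_version)  # 0
```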
Checkout
Users can checkout a previous version of a dataset to see what it looked like.
db.dataset('jobs').checkout(logs[0].version)
|   | Job # | Borough | Street Name | GIS_NTA_NAME |
|---|---|---|---|---|
| 2735 | 140915936 | MANHATTAN | EAST 93RD STREET | Upper East Side-Carnegie Hill |
| 6822 | 340737929 | BROOKLYN | 16TH AVE | Bensonhurst West |
| 8988 | 440601733 | QUEENS | AUSTIN STREET | Kew Gardens |
| 6820 | 340737901 | BROOKLYN | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
| 4188 | 240275910 | BRONX | CEDAR AVENUE | University Heights-Morris Heights |
| ... | ... | ... | ... | ... |
| 7359 | 340744369 | BROOKLYN | FOUNTAIN AVENUE | East New York |
| 3503 | 201204883 | BRONX | LAFAYETTE AVENUE | Soundview-Castle Hill-Clason Point-Harding Park |
| 3504 | 201204883 | BRONX | LAFAYETTE AVENUE | Soundview-Castle Hill-Clason Point-Harding Park |
| 3422 | 140923847 | MANHATTAN | MADISON AVENUE | Midtown-Midtown South |
| 7360 | 340744378 | BROOKLYN | AMBOY STREET | Brownsville |

10000 rows × 4 columns
Rollback
If the user is not happy with the changes, they can roll the dataset back to a previous version:
df = db.rollback('jobs', version=logs[0].version)
df.head()
|   | Job # | Borough | Street Name | GIS_NTA_NAME |
|---|---|---|---|---|
| 2735 | 140915936 | MANHATTAN | EAST 93RD STREET | Upper East Side-Carnegie Hill |
| 6822 | 340737929 | BROOKLYN | 16TH AVE | Bensonhurst West |
| 8988 | 440601733 | QUEENS | AUSTIN STREET | Kew Gardens |
| 6820 | 340737901 | BROOKLYN | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
| 4188 | 240275910 | BRONX | CEDAR AVENUE | University Heights-Morris Heights |
Register
Additionally, this functionality is complemented by a GUI provided by openclean-notebook that allows users to register custom functions and apply them across datasets and versions seamlessly. A visual example of what this looks like can be found in the Custom functions section.
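Conceptually, registering a custom function amounts to storing a named callable that can later be applied to any dataset version. A plain-Python sketch of that idea (this is an illustration of the concept, not openclean's actual registration API, which is exposed through the engine and the openclean-notebook GUI):

```python
# Plain-Python sketch of a function registry; names and structure here
# are illustrative only.
registry = {}

def register(name):
    """Decorator that stores a callable under a given name."""
    def wrapper(func):
        registry[name] = func
        return func
    return wrapper

@register('to_lower')
def to_lower(value):
    return value.lower()

# A registered function can later be looked up by name and applied
# to column values of any dataset version.
print(registry['to_lower']('BROOKLYN'))  # brooklyn
```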
Other Examples
A full example notebook performing operations and maintaining provenance on a real dataset is available here.