Data Provenance

openclean provides users the ability to maintain the provenance of the operations performed on a dataset. Just like a version control system, it has methods to load, commit, and checkout versions of the dataset. Here we talk about the methods available to achieve this.

We use a sample of NYC open data with completed job codes at various locations in New York City to demonstrate some examples.

import os
path_to_file = os.path.join(os.getcwd(), 'source', 'data')
from openclean.data.load import dataset

ds = dataset(os.path.join(path_to_file, 'job_locations.csv'))

ds.head()
Job # Borough Street Name GIS_NTA_NAME
0 140915936 MANHATTAN EAST 93RD STREET Upper East Side-Carnegie Hill
1 340737929 BROOKLYN 16TH AVE Bensonhurst West
2 440601733 QUEENS AUSTIN STREET Kew Gardens
3 340737901 BROOKLYN 3RD STREET Carroll Gardens-Columbia Street-Red Hook
4 240275910 BRONX CEDAR AVENUE University Heights-Morris Heights

Initialize

To be able to maintain provenance in openclean, a persistent instance of the openclean engine needs to be created. All versions of the data files are maintained inside the base directory.

from openclean.engine.base import DB

db = DB(basedir='./archive', create=True)

Create

The user can then register their dataset with the initialized openclean engine.

df = db.create(source=ds, name='jobs', primary_key='Job #')

The create method returns a pandas dataframe object which we shall apply some transformations on.

Commit

Let’s create a new version of the dataset by performing lowercasing the Borough column.

from openclean.operator.transform.update import update

df = db.checkout('jobs') # get the dataset

lower_cased = update(df, columns='Borough', func=str.lower)

lower_cased
Job # Borough Street Name GIS_NTA_NAME
2735 140915936 manhattan EAST 93RD STREET Upper East Side-Carnegie Hill
6822 340737929 brooklyn 16TH AVE Bensonhurst West
8988 440601733 queens AUSTIN STREET Kew Gardens
6820 340737901 brooklyn 3RD STREET Carroll Gardens-Columbia Street-Red Hook
4188 240275910 bronx CEDAR AVENUE University Heights-Morris Heights
... ... ... ... ...
7359 340744369 brooklyn FOUNTAIN AVENUE East New York
3503 201204883 bronx LAFAYETTE AVENUE Soundview-Castle Hill-Clason Point-Harding Park
3504 201204883 bronx LAFAYETTE AVENUE Soundview-Castle Hill-Clason Point-Harding Park
3422 140923847 manhattan MADISON AVENUE Midtown-Midtown South
7360 340744378 brooklyn AMBOY STREET Brownsville

10000 rows × 4 columns

The new dataset version can then be committed in the engine using the commit method.

db.commit(name='jobs', source=lower_cased)
Job # Borough Street Name GIS_NTA_NAME
2735 140915936 manhattan EAST 93RD STREET Upper East Side-Carnegie Hill
6822 340737929 brooklyn 16TH AVE Bensonhurst West
8988 440601733 queens AUSTIN STREET Kew Gardens
6820 340737901 brooklyn 3RD STREET Carroll Gardens-Columbia Street-Red Hook
4188 240275910 bronx CEDAR AVENUE University Heights-Morris Heights
... ... ... ... ...
7359 340744369 brooklyn FOUNTAIN AVENUE East New York
3503 201204883 bronx LAFAYETTE AVENUE Soundview-Castle Hill-Clason Point-Harding Park
3504 201204883 bronx LAFAYETTE AVENUE Soundview-Castle Hill-Clason Point-Harding Park
3422 140923847 manhattan MADISON AVENUE Midtown-Midtown South
7360 340744378 brooklyn AMBOY STREET Brownsville

10000 rows × 4 columns

To see the different versions of each dataset, we can simply request a log from the engine:

logs = db.dataset('jobs').log()
logs
[LogEntry(descriptor={'optype': 'load', 'columns': None}, action=None, version=0),
 LogEntry(descriptor={'optype': 'commit', 'columns': None}, action=<openclean.engine.action.CommitOp object at 0x7f54fa1935d0>, version=1)]

Checkout

Users can checkout a previous version of a dataset to see what it looked like.

db.dataset('jobs').checkout(logs[0].version)
Job # Borough Street Name GIS_NTA_NAME
2735 140915936 MANHATTAN EAST 93RD STREET Upper East Side-Carnegie Hill
6822 340737929 BROOKLYN 16TH AVE Bensonhurst West
8988 440601733 QUEENS AUSTIN STREET Kew Gardens
6820 340737901 BROOKLYN 3RD STREET Carroll Gardens-Columbia Street-Red Hook
4188 240275910 BRONX CEDAR AVENUE University Heights-Morris Heights
... ... ... ... ...
7359 340744369 BROOKLYN FOUNTAIN AVENUE East New York
3503 201204883 BRONX LAFAYETTE AVENUE Soundview-Castle Hill-Clason Point-Harding Park
3504 201204883 BRONX LAFAYETTE AVENUE Soundview-Castle Hill-Clason Point-Harding Park
3422 140923847 MANHATTAN MADISON AVENUE Midtown-Midtown South
7360 340744378 BROOKLYN AMBOY STREET Brownsville

10000 rows × 4 columns

Rollback

If the user is not happy with the changes, they can be rolled back to get the previous version of the dataset:

df = db.rollback('jobs', version=logs[0].version)

df.head()
Job # Borough Street Name GIS_NTA_NAME
2735 140915936 MANHATTAN EAST 93RD STREET Upper East Side-Carnegie Hill
6822 340737929 BROOKLYN 16TH AVE Bensonhurst West
8988 440601733 QUEENS AUSTIN STREET Kew Gardens
6820 340737901 BROOKLYN 3RD STREET Carroll Gardens-Columbia Street-Red Hook
4188 240275910 BRONX CEDAR AVENUE University Heights-Morris Heights

Register

Additionally, the functionality is complemented by a GUI provided by openclean-notebook that allows users to register custom functions and apply them across datasets and versions seamlessly. A visual example of what this looks like is present in the Custom functions section.

Other Examples

A full example notebook performing operations and maintaining provenance on a real dataset is available here.