Data Provenance
openclean lets users maintain the provenance of the operations performed on a dataset. Much like a version control system, it provides methods to load, commit, and checkout versions of a dataset. This section describes the methods available to achieve this.
We use a sample of NYC Open Data containing completed job codes at various locations in New York City to demonstrate some examples.
import os
path_to_file = os.path.join(os.getcwd(), 'source', 'data')
from openclean.data.load import dataset
ds = dataset(os.path.join(path_to_file, 'job_locations.csv'))
ds.head()
|   | Job # | Borough | Street Name | GIS_NTA_NAME |
|---|---|---|---|---|
| 0 | 140915936 | MANHATTAN | EAST 93RD STREET | Upper East Side-Carnegie Hill |
| 1 | 340737929 | BROOKLYN | 16TH AVE | Bensonhurst West |
| 2 | 440601733 | QUEENS | AUSTIN STREET | Kew Gardens |
| 3 | 340737901 | BROOKLYN | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
| 4 | 240275910 | BRONX | CEDAR AVENUE | University Heights-Morris Heights |
Initialize
To maintain provenance in openclean, a persistent instance of the openclean engine needs to be created first. All versions of the data files are maintained inside the base directory.
from openclean.engine.base import DB
db = DB(basedir='./archive', create=True)
Create
The user can then register their dataset with the initialized openclean engine.
df = db.create(source=ds, name='jobs', primary_key='Job #')
The create method returns a pandas dataframe object to which we will apply some transformations.
Commit
Let’s create a new version of the dataset by lowercasing the values in the Borough column.
from openclean.operator.transform.update import update
df = db.checkout('jobs') # get the dataset
lower_cased = update(df, columns='Borough', func=str.lower)
lower_cased
|   | Job # | Borough | Street Name | GIS_NTA_NAME |
|---|---|---|---|---|
| 2735 | 140915936 | manhattan | EAST 93RD STREET | Upper East Side-Carnegie Hill |
| 6822 | 340737929 | brooklyn | 16TH AVE | Bensonhurst West |
| 8988 | 440601733 | queens | AUSTIN STREET | Kew Gardens |
| 6820 | 340737901 | brooklyn | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
| 4188 | 240275910 | bronx | CEDAR AVENUE | University Heights-Morris Heights |
| ... | ... | ... | ... | ... |
| 7359 | 340744369 | brooklyn | FOUNTAIN AVENUE | East New York |
| 3503 | 201204883 | bronx | LAFAYETTE AVENUE | Soundview-Castle Hill-Clason Point-Harding Park |
| 3504 | 201204883 | bronx | LAFAYETTE AVENUE | Soundview-Castle Hill-Clason Point-Harding Park |
| 3422 | 140923847 | manhattan | MADISON AVENUE | Midtown-Midtown South |
| 7360 | 340744378 | brooklyn | AMBOY STREET | Brownsville |

10000 rows × 4 columns
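For reference, the update call above behaves much like a plain pandas column transformation. A minimal sketch of the equivalent operation, using a small hypothetical frame rather than the jobs dataset:

```python
import pandas as pd

# Small stand-in frame (hypothetical values, not the jobs dataset).
df = pd.DataFrame({'Borough': ['MANHATTAN', 'BROOKLYN', 'QUEENS']})

# Apply str.lower to every value in the column, mirroring
# update(df, columns='Borough', func=str.lower).
df['Borough'] = df['Borough'].apply(str.lower)

print(df['Borough'].tolist())  # ['manhattan', 'brooklyn', 'queens']
```

Unlike the plain pandas version, the openclean update operator records the applied function, which is what makes the operation reproducible in the provenance log.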
The new dataset version can then be committed in the engine using the commit method.
db.commit(name='jobs', source=lower_cased)
|   | Job # | Borough | Street Name | GIS_NTA_NAME |
|---|---|---|---|---|
| 2735 | 140915936 | manhattan | EAST 93RD STREET | Upper East Side-Carnegie Hill |
| 6822 | 340737929 | brooklyn | 16TH AVE | Bensonhurst West |
| 8988 | 440601733 | queens | AUSTIN STREET | Kew Gardens |
| 6820 | 340737901 | brooklyn | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
| 4188 | 240275910 | bronx | CEDAR AVENUE | University Heights-Morris Heights |
| ... | ... | ... | ... | ... |
| 7359 | 340744369 | brooklyn | FOUNTAIN AVENUE | East New York |
| 3503 | 201204883 | bronx | LAFAYETTE AVENUE | Soundview-Castle Hill-Clason Point-Harding Park |
| 3504 | 201204883 | bronx | LAFAYETTE AVENUE | Soundview-Castle Hill-Clason Point-Harding Park |
| 3422 | 140923847 | manhattan | MADISON AVENUE | Midtown-Midtown South |
| 7360 | 340744378 | brooklyn | AMBOY STREET | Brownsville |

10000 rows × 4 columns
To see the different versions of each dataset, we can simply request a log from the engine:
logs = db.dataset('jobs').log()
logs
[LogEntry(descriptor={'optype': 'load', 'columns': None}, action=None, version=0),
LogEntry(descriptor={'optype': 'commit', 'columns': None}, action=<openclean.engine.action.CommitOp object at 0x7f54fa1935d0>, version=1)]
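Each log entry exposes a descriptor and a version identifier, and the version is what checkout and rollback expect. A minimal sketch of scanning a log for the version created by the initial load, using a simple stand-in for the entries (in practice they come from `db.dataset(...).log()`):

```python
from collections import namedtuple

# Stand-in with the fields visible in the log output above.
LogEntry = namedtuple('LogEntry', ['descriptor', 'version'])

logs = [
    LogEntry(descriptor={'optype': 'load', 'columns': None}, version=0),
    LogEntry(descriptor={'optype': 'commit', 'columns': None}, version=1),
]

# Find the version created by the initial load operation.
load_version = next(e.version for e in logs if e.descriptor['optype'] == 'load')
print(load_version)  # 0
```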
Checkout
Users can checkout a previous version of a dataset to see what it looked like.
db.dataset('jobs').checkout(logs[0].version)
|   | Job # | Borough | Street Name | GIS_NTA_NAME |
|---|---|---|---|---|
| 2735 | 140915936 | MANHATTAN | EAST 93RD STREET | Upper East Side-Carnegie Hill |
| 6822 | 340737929 | BROOKLYN | 16TH AVE | Bensonhurst West |
| 8988 | 440601733 | QUEENS | AUSTIN STREET | Kew Gardens |
| 6820 | 340737901 | BROOKLYN | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
| 4188 | 240275910 | BRONX | CEDAR AVENUE | University Heights-Morris Heights |
| ... | ... | ... | ... | ... |
| 7359 | 340744369 | BROOKLYN | FOUNTAIN AVENUE | East New York |
| 3503 | 201204883 | BRONX | LAFAYETTE AVENUE | Soundview-Castle Hill-Clason Point-Harding Park |
| 3504 | 201204883 | BRONX | LAFAYETTE AVENUE | Soundview-Castle Hill-Clason Point-Harding Park |
| 3422 | 140923847 | MANHATTAN | MADISON AVENUE | Midtown-Midtown South |
| 7360 | 340744378 | BROOKLYN | AMBOY STREET | Brownsville |

10000 rows × 4 columns
Rollback
If the user is not happy with the changes, they can roll the dataset back to a previous version:
df = db.rollback('jobs', version=logs[0].version)
df.head()
|   | Job # | Borough | Street Name | GIS_NTA_NAME |
|---|---|---|---|---|
| 2735 | 140915936 | MANHATTAN | EAST 93RD STREET | Upper East Side-Carnegie Hill |
| 6822 | 340737929 | BROOKLYN | 16TH AVE | Bensonhurst West |
| 8988 | 440601733 | QUEENS | AUSTIN STREET | Kew Gardens |
| 6820 | 340737901 | BROOKLYN | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
| 4188 | 240275910 | BRONX | CEDAR AVENUE | University Heights-Morris Heights |
Register
Additionally, this functionality is complemented by a GUI provided by openclean-notebook that allows users to register custom functions and apply them across datasets and versions seamlessly. A visual example of what this looks like can be found in the Custom functions section.
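Conceptually, registering a custom function amounts to storing a named callable that can later be applied to any dataset version. A plain-Python sketch of that idea (this is an illustration of the concept, not openclean's actual registration API, which is exposed through the engine and the openclean-notebook GUI):

```python
# Plain-Python sketch of a function registry; names and structure here
# are illustrative only.
registry = {}

def register(name):
    """Decorator that stores a callable under a given name."""
    def wrapper(func):
        registry[name] = func
        return func
    return wrapper

@register('to_lower')
def to_lower(value):
    return value.lower()

# A registered function can later be looked up by name and applied
# to column values of any dataset version.
print(registry['to_lower']('BROOKLYN'))  # brooklyn
```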
Other Examples
A full example notebook performing operations and maintaining provenance on a real dataset is available here.