Getting Started
openclean provides functionality to identify bugs and anomalous values, make fixes, and wrangle datasets. Here, we walk through a simple example to get you acquainted with openclean in 10 minutes! Our misspellings dataset contains Street, Neighborhood, and Borough names for New York City with a bunch of spelling mistakes. The goal of this exercise is to fix those errors using some of the tools at our disposal.
Loading Data
openclean uses a dataset (a wrapped pandas dataframe) as its primary data storage object. It can be created from any source data type accepted by pandas. Compressed Gzip files (.gz) are also accepted. For large datasets, it might be desirable to perform lazy evaluation on the data instead of loading it all to memory. To allow this, openclean lets users stream their datasets. More information on Datasets and Streams is provided in the Data Model section.
import os
path_to_file = os.path.join(os.getcwd(), 'source', 'data')
from openclean.data.load import dataset
ds = dataset(os.path.join(path_to_file, 'misspellings.csv'))
ds.head()
| | Job # | Borough | Street Name | GIS_NTA_NAME |
|---|---|---|---|---|
| 0 | 140915936 | MANHATTAN | EAST 93RD STREET | Upper East Side-Carnegie Hill |
| 1 | 340737929 | BROOKLYN | 16TH AVE | Bensonhurst West |
| 2 | 440601733 | QUEENS | AUSTIN STREET | Kew Gardens |
| 3 | 340737901 | BROOKLYN | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
| 4 | 240275910 | BRONX | CEDAR AVENUE | University Heights-Morris Heights |
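The lazy-evaluation idea behind streams can be illustrated with plain Python: instead of materializing every row, read and process the file one record at a time. The following is only a stdlib sketch of the concept, not openclean's streaming API; the sample data is hypothetical.

```python
import csv
import io

# A tiny in-memory CSV standing in for 'misspellings.csv' (hypothetical sample).
RAW = "Borough,Street Name\nMANHATTAN,EAST 93RD STREET\nBROOKLYN,16TH AVE\n"

def stream_rows(fileobj):
    """Yield rows one at a time instead of loading the whole file into memory."""
    for row in csv.DictReader(fileobj):
        yield row

# Only one row is ever held in memory at a time.
first = next(stream_rows(io.StringIO(RAW)))
print(first["Borough"])  # -> MANHATTAN
```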
Profiling the Dataset
To provide insights about the dataset features, openclean comes with data profiling capabilities. A user can select the default profiler to get basic statistics or plug in their own profilers for advanced computations. More information about profilers is available in the Data Profiling section.
from openclean.profiling.dataset import dataset_profile
# generate a profile
profiles = dataset_profile(ds)
# see all stats
profiles.stats()
| | total | empty | distinct | uniqueness | entropy |
|---|---|---|---|---|---|
| Job # | 10000 | 0 | 9329 | 0.932900 | 13.136359 |
| Borough | 10000 | 2 | 11 | 0.001100 | 2.074184 |
| Street Name | 10000 | 0 | 3691 | 0.369100 | 10.990810 |
| GIS_NTA_NAME | 10000 | 29 | 192 | 0.019256 | 7.040698 |
We see that the Borough column contains 11 distinct values, even though New York City has only 5 boroughs, and that 2 of its values are empty/missing.
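The uniqueness and entropy columns in the profile follow the usual definitions: uniqueness is the ratio of distinct to total values, and entropy is the Shannon entropy of the value distribution in bits. A quick stdlib sketch on a toy column (not openclean's profiler code):

```python
import math
from collections import Counter

def uniqueness(values):
    # fraction of values that are distinct
    return len(set(values)) / len(values)

def entropy(values):
    # Shannon entropy (in bits) of the empirical value distribution
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

column = ["BRONX", "BRONX", "QUEENS", "MANHATTAN"]
print(uniqueness(column))  # -> 0.75
print(entropy(column))     # -> 1.5
```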
ds['Borough'].value_counts()
MANHATTAN 3442
BROOKLYN 2999
QUEENS 2090
BRONX 932
STATEN ISLAND 529
2
BRONKS 1
QUEINS 1
BOOKLYN 1
BROOKLIN 1
MENHATTAN 1
QEENS 1
Name: Borough, dtype: int64
Looking more closely, we realize there are a few misspelled variations of Brooklyn, Queens, Bronx, and Manhattan.
Selecting Columns
As discussed earlier, we want to fix the mistakes in the Borough column. We can separate this column from the rest of the dataset using the select operation. Before we do that, for this example, let’s assume we need to get rid of rows that have missing values. So we’ll use the filter operator with the IsNotEmpty Eval function. Eval functions are explained in Data Model.
from openclean.operator.transform.filter import filter
from openclean.function.eval.null import IsNotEmpty
ds = filter(ds, predicate=IsNotEmpty('Borough'))
ds['Borough'].value_counts()
MANHATTAN 3442
BROOKLYN 2999
QUEENS 2090
BRONX 932
STATEN ISLAND 529
BRONKS 1
QUEINS 1
BOOKLYN 1
BROOKLIN 1
MENHATTAN 1
QEENS 1
Name: Borough, dtype: int64
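Conceptually, filter keeps exactly the rows for which the predicate evaluates to true. A plain-Python sketch of that behavior (a stand-in for the Eval machinery, with hypothetical data):

```python
rows = [
    {"Borough": "MANHATTAN"},
    {"Borough": ""},           # empty value that the filter should drop
    {"Borough": "BROOKLYN"},
]

def is_not_empty(value):
    """Rough analogue of an IsNotEmpty check for a single cell."""
    return value is not None and str(value).strip() != ""

kept = [row for row in rows if is_not_empty(row["Borough"])]
print([row["Borough"] for row in kept])  # -> ['MANHATTAN', 'BROOKLYN']
```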
Now, let’s separate out the column of interest. You can read more on selecting columns and other dataset/stream transformations in the Data Transformation section.
from openclean.operator.transform.select import select
misspelled_data = select(ds, columns=['Borough'], names=['messy_borough'])
misspelled_data['messy_borough'].unique()
array(['MANHATTAN', 'BROOKLYN', 'QUEENS', 'BRONX', 'STATEN ISLAND',
'QUEINS', 'BOOKLYN', 'MENHATTAN', 'QEENS', 'BRONKS', 'BROOKLIN'],
dtype=object)
Downloading and Preparing Master Data
With openclean, a user can easily incorporate other datasets to enrich the data cleaning process. For example, let’s download an official list of borough names from the Borough Population projections dataset via Socrata to help us with the wrangling. We will use this as the ground truth for correct spellings. You can read more about master datasets in the Data Enrichment section.
After downloading the master data, we preprocess it a bit to match the letter case of our input dataset. We use the update transformation from Data Wrangling and Cleaning to achieve this; it accepts either a lookup dictionary or a function to compute the new values.
from openclean.data.source.socrata import Socrata
from openclean.operator.transform.update import update
# download the master data and select the relevant column
nyc_boroughs = Socrata().dataset('xywu-7bv9').load()
nyc_boroughs = select(nyc_boroughs, columns=['Borough'])
# uppercase and strip the values to match with the misspelled data
nyc_boroughs = update(nyc_boroughs, 'Borough', str.upper)
nyc_boroughs = update(nyc_boroughs, 'Borough', str.strip)
nyc_boroughs
| | Borough |
|---|---|
| 0 | NYC TOTAL |
| 1 | BRONX |
| 2 | BROOKLYN |
| 3 | MANHATTAN |
| 4 | QUEENS |
| 5 | STATEN ISLAND |
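In spirit, update with a function maps that function over every value in the column, so the two calls above amount to applying str.upper and then str.strip to each cell. A plain-Python sketch of that semantics (not openclean's implementation):

```python
column = ["  Bronx", "Brooklyn ", "manhattan"]

def update_column(values, fn):
    # apply fn to every value, the way update(ds, column, fn) does for a column
    return [fn(v) for v in values]

normalized = update_column(update_column(column, str.upper), str.strip)
print(normalized)  # -> ['BRONX', 'BROOKLYN', 'MANHATTAN']
```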
Identifying Fixes
We are now familiar with the mistakes in the data and have a master dataset with corrections available. openclean provides cleaning operators and repair strategies to let users fix their datasets with the minimum amount of coding involved. A list of various cleaning operators available can be accessed in the Data Wrangling and Cleaning section.
Here, we calculate fuzzy string similarity between the messy_borough values and the master data to create a mapping from misspellings to possible fixes.
from openclean.function.matching.base import DefaultStringMatcher
from openclean.function.matching.fuzzy import FuzzySimilarity
from openclean.data.mapping import Mapping
from pprint import pprint
# the master vocabulary list
VOCABULARY = nyc_boroughs['Borough']
# create a string matcher that uses the provided vocabulary and similarity algorithm
matcher = DefaultStringMatcher(
vocabulary=VOCABULARY,
similarity=FuzzySimilarity()
)
# create a mapping to store the fixes
fixes = Mapping()
# look for matches in the vocabulary
for query in set(misspelled_data['messy_borough']):
fixes.add(query, matcher.find_matches(query))
# print the fixes
pprint(fixes)
Mapping(<class 'list'>,
{'BOOKLYN': [StringMatch(term='BROOKLYN', score=0.875)],
'BRONKS': [StringMatch(term='BRONX', score=0.6666666666666667)],
'BRONX': [StringMatch(term='BRONX', score=1)],
'BROOKLIN': [StringMatch(term='BROOKLYN', score=0.875)],
'BROOKLYN': [StringMatch(term='BROOKLYN', score=1)],
'MANHATTAN': [StringMatch(term='MANHATTAN', score=1)],
'MENHATTAN': [StringMatch(term='MANHATTAN', score=0.8888888888888888)],
'QEENS': [StringMatch(term='QUEENS', score=0.8333333333333334)],
'QUEENS': [StringMatch(term='QUEENS', score=1)],
'QUEINS': [StringMatch(term='QUEENS', score=0.8333333333333334)],
'STATEN ISLAND': [StringMatch(term='STATEN ISLAND', score=1)]})
The generated fixes mapping uses the messy_borough values as keys and, as values, the matches found in the vocabulary along with their match scores.
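The idea behind the matcher is a nearest-neighbor search over the vocabulary under a string-similarity measure. As a rough stdlib analogue, difflib's ratio can stand in for the similarity function (this is not the FuzzySimilarity algorithm openclean uses, just an illustration of the technique):

```python
from difflib import get_close_matches

# the master vocabulary, as in the example above
VOCABULARY = ["BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"]

def best_match(query, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the most similar vocabulary term, or the query itself if none qualifies."""
    matches = get_close_matches(query, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else query

print(best_match("BOOKLYN"))  # -> BROOKLYN
print(best_match("QEENS"))    # -> QUEENS
```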
Making Repairs
The simplest repair strategy here would be to look up messy_borough values in the fixes map and replace them. We achieve this with the update transformation from the Data Wrangling and Cleaning section.
from openclean.operator.transform.update import update
misspelled_data = update(misspelled_data, 'messy_borough', fixes.to_lookup())
misspelled_data['messy_borough'].unique()
array(['MANHATTAN', 'BROOKLYN', 'QUEENS', 'BRONX', 'STATEN ISLAND'],
dtype=object)
We fixed it! The repair also shows up in the profile as a drop in distinct values, uniqueness, and entropy.
dataset_profile(misspelled_data).stats()
| | total | empty | distinct | uniqueness | entropy |
|---|---|---|---|---|---|
| messy_borough | 9998 | 0 | 5 | 0.0005 | 2.066639 |
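Conceptually, the lookup-based repair we just applied is plain dictionary replacement: every value found in the mapping is replaced by its correction, and everything else passes through unchanged. A minimal sketch, with a hand-written dict standing in for the result of fixes.to_lookup():

```python
# hand-written stand-in for fixes.to_lookup() (hypothetical values)
lookup = {"BOOKLYN": "BROOKLYN", "QEENS": "QUEENS", "BRONKS": "BRONX"}

column = ["MANHATTAN", "BOOKLYN", "QEENS", "BRONX"]

# replace values found in the lookup; leave everything else unchanged
repaired = [lookup.get(value, value) for value in column]
print(repaired)  # -> ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'BRONX']
```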
As this small real-world example shows, openclean makes it straightforward to load and stream datasets, profile them to identify bugs, incorporate master data, and apply a toolkit of operators to identify and make fixes.
More Examples
We provide many other Jupyter notebooks as examples to demonstrate different capabilities of openclean. All our notebooks, along with the datasets they use, can be found in the Step by Step Guides.