Getting Started
openclean provides functionality to identify bugs and anomalous values, make fixes, and wrangle datasets. Here, we walk through a simple example to get you acquainted with openclean in 10 minutes! Our misspellings dataset contains Street, Neighborhood, and Borough names for New York City with a bunch of spelling mistakes. The goal of this exercise is to fix those errors using some of the tools at our disposal.
Loading Data
openclean uses a dataset (a wrapped pandas dataframe) as its primary data storage object. It can be created from any source data type accepted by pandas. Compressed Gzip files (.gz) are also accepted. For large datasets, it might be desirable to perform lazy evaluation on the data instead of loading it all to memory. To allow this, openclean lets users stream their datasets. More information on Datasets and Streams is provided in the Data Model section.
import os
path_to_file = os.path.join(os.getcwd(), 'source', 'data')
from openclean.data.load import dataset
ds = dataset(os.path.join(path_to_file, 'misspellings.csv'))
ds.head()
| | Job # | Borough | Street Name | GIS_NTA_NAME |
|---|---|---|---|---|
| 0 | 140915936 | MANHATTAN | EAST 93RD STREET | Upper East Side-Carnegie Hill |
| 1 | 340737929 | BROOKLYN | 16TH AVE | Bensonhurst West |
| 2 | 440601733 | QUEENS | AUSTIN STREET | Kew Gardens |
| 3 | 340737901 | BROOKLYN | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
| 4 | 240275910 | BRONX | CEDAR AVENUE | University Heights-Morris Heights |
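The lazy-evaluation idea behind streams can be illustrated with plain Python: instead of materializing every row, read and process the file one record at a time. The following is only a stdlib sketch of the concept, not openclean's streaming API; the sample data is hypothetical.

```python
import csv
import io

# A tiny in-memory CSV standing in for 'misspellings.csv' (hypothetical sample).
RAW = "Borough,Street Name\nMANHATTAN,EAST 93RD STREET\nBROOKLYN,16TH AVE\n"

def stream_rows(fileobj):
    """Yield rows one at a time instead of loading the whole file into memory."""
    for row in csv.DictReader(fileobj):
        yield row

# Only one row is ever held in memory at a time.
first = next(stream_rows(io.StringIO(RAW)))
print(first["Borough"])  # -> MANHATTAN
```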
Profiling the Dataset
To provide insights about the dataset features, openclean comes with data profiling capabilities. A user can select the default profiler to get basic statistics or plug in their own profilers for advanced computations. More information about profilers is available in the Data Profiling section.
from openclean.profiling.dataset import dataset_profile
# generate a profile
profiles = dataset_profile(ds)
# see all stats
profiles.stats()
| | total | empty | distinct | uniqueness | entropy |
|---|---|---|---|---|---|
| Job # | 10000 | 0 | 9329 | 0.932900 | 13.136359 |
| Borough | 10000 | 2 | 11 | 0.001100 | 2.074184 |
| Street Name | 10000 | 0 | 3691 | 0.369100 | 10.990810 |
| GIS_NTA_NAME | 10000 | 29 | 192 | 0.019256 | 7.040698 |
We see that the Borough column contains 11 distinct values, even though New York City has only 5 boroughs, and that 2 of its values are empty/missing.
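The uniqueness and entropy columns in the profile follow the usual definitions: uniqueness is the ratio of distinct to total values, and entropy is the Shannon entropy of the value distribution in bits. A quick stdlib sketch on a toy column (not openclean's profiler code):

```python
import math
from collections import Counter

def uniqueness(values):
    # fraction of values that are distinct
    return len(set(values)) / len(values)

def entropy(values):
    # Shannon entropy (in bits) of the empirical value distribution
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

column = ["BRONX", "BRONX", "QUEENS", "MANHATTAN"]
print(uniqueness(column))  # -> 0.75
print(entropy(column))     # -> 1.5
```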
ds['Borough'].value_counts()
MANHATTAN 3442
BROOKLYN 2999
QUEENS 2090
BRONX 932
STATEN ISLAND 529
2
BRONKS 1
QUEINS 1
BOOKLYN 1
BROOKLIN 1
MENHATTAN 1
QEENS 1
Name: Borough, dtype: int64
Looking more closely, we realize there are a few misspelled variations of Brooklyn, Queens, Bronx, and Manhattan.
Selecting Columns
As discussed earlier, we want to fix the mistakes in the Borough column. We can separate this column from the rest of the dataset using the select operation. Before we do that, for this example, let’s assume we need to get rid of rows that have missing values. So we’ll use the filter operator with the IsNotEmpty Eval function. Eval functions are explained in Data Model.
from openclean.operator.transform.filter import filter
from openclean.function.eval.null import IsNotEmpty
ds = filter(ds, predicate=IsNotEmpty('Borough'))
ds['Borough'].value_counts()
MANHATTAN 3442
BROOKLYN 2999
QUEENS 2090
BRONX 932
STATEN ISLAND 529
BRONKS 1
QUEINS 1
BOOKLYN 1
BROOKLIN 1
MENHATTAN 1
QEENS 1
Name: Borough, dtype: int64
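Conceptually, filter keeps exactly the rows for which the predicate evaluates to true. A plain-Python sketch of that behavior (a stand-in for the Eval machinery, with hypothetical data):

```python
rows = [
    {"Borough": "MANHATTAN"},
    {"Borough": ""},           # empty value that the filter should drop
    {"Borough": "BROOKLYN"},
]

def is_not_empty(value):
    """Rough analogue of an IsNotEmpty check for a single cell."""
    return value is not None and str(value).strip() != ""

kept = [row for row in rows if is_not_empty(row["Borough"])]
print([row["Borough"] for row in kept])  # -> ['MANHATTAN', 'BROOKLYN']
```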
Now, let’s separate out the column of interest. You can read more on selecting columns and other dataset/stream transformations in the Data Transformation section.
from openclean.operator.transform.select import select
misspelled_data = select(ds, columns=['Borough'], names=['messy_borough'])
misspelled_data['messy_borough'].unique()
array(['MANHATTAN', 'BROOKLYN', 'QUEENS', 'BRONX', 'STATEN ISLAND',
'QUEINS', 'BOOKLYN', 'MENHATTAN', 'QEENS', 'BRONKS', 'BROOKLIN'],
dtype=object)
Downloading and Preparing Master Data
With openclean, a user can easily incorporate other datasets to enrich the data cleaning process. For example, let’s download an official list of borough names from the Borough Population projections dataset via Socrata to help us with the wrangling. We will use this as the ground truth for correct spellings. You can read more about master datasets in the Data Enrichment section.
After downloading the master data, we preprocess it a bit to match the letter case of our input dataset. We use the update transformation from Data Wrangling and Cleaning to achieve this; it accepts either a lookup dictionary or a function to compute the new values.
from openclean.data.source.socrata import Socrata
from openclean.operator.transform.update import update
# download the master data and select the relevant column
nyc_boroughs = Socrata().dataset('xywu-7bv9').load()
nyc_boroughs = select(nyc_boroughs, columns=['Borough'])
# uppercase and strip the values to match with the misspelled data
nyc_boroughs = update(nyc_boroughs, 'Borough', str.upper)
nyc_boroughs = update(nyc_boroughs, 'Borough', str.strip)
nyc_boroughs
| | Borough |
|---|---|
| 0 | NYC TOTAL |
| 1 | BRONX |
| 2 | BROOKLYN |
| 3 | MANHATTAN |
| 4 | QUEENS |
| 5 | STATEN ISLAND |
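In spirit, update with a function maps that function over every value in the column, so the two calls above amount to applying str.upper and then str.strip to each cell. A plain-Python sketch of that semantics (not openclean's implementation):

```python
column = ["  Bronx", "Brooklyn ", "manhattan"]

def update_column(values, fn):
    # apply fn to every value, the way update(ds, column, fn) does for a column
    return [fn(v) for v in values]

normalized = update_column(update_column(column, str.upper), str.strip)
print(normalized)  # -> ['BRONX', 'BROOKLYN', 'MANHATTAN']
```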
Identifying Fixes
We are now familiar with the mistakes in the data and have a master dataset with corrections available. openclean provides cleaning operators and repair strategies to let users fix their datasets with the minimum amount of coding involved. A list of various cleaning operators available can be accessed in the Data Wrangling and Cleaning section.
Here, we calculate fuzzy string similarity between the messy_borough values and the master data to create a mapping from misspellings to possible fixes.
from openclean.function.matching.base import DefaultStringMatcher
from openclean.function.matching.fuzzy import FuzzySimilarity
from openclean.data.mapping import Mapping
from pprint import pprint
# the master vocabulary list
VOCABULARY = nyc_boroughs['Borough']
# create a string matcher that uses the provided vocabulary and similarity algorithm
matcher = DefaultStringMatcher(
vocabulary=VOCABULARY,
similarity=FuzzySimilarity()
)
# create a mapping to store the fixes
fixes = Mapping()
# look for matches in the vocabulary
for query in set(misspelled_data['messy_borough']):
fixes.add(query, matcher.find_matches(query))
# print the fixes
pprint(fixes)
Mapping(<class 'list'>,
{'BOOKLYN': [StringMatch(term='BROOKLYN', score=0.875)],
'BRONKS': [StringMatch(term='BRONX', score=0.6666666666666667)],
'BRONX': [StringMatch(term='BRONX', score=1)],
'BROOKLIN': [StringMatch(term='BROOKLYN', score=0.875)],
'BROOKLYN': [StringMatch(term='BROOKLYN', score=1)],
'MANHATTAN': [StringMatch(term='MANHATTAN', score=1)],
'MENHATTAN': [StringMatch(term='MANHATTAN', score=0.8888888888888888)],
'QEENS': [StringMatch(term='QUEENS', score=0.8333333333333334)],
'QUEENS': [StringMatch(term='QUEENS', score=1)],
'QUEINS': [StringMatch(term='QUEENS', score=0.8333333333333334)],
'STATEN ISLAND': [StringMatch(term='STATEN ISLAND', score=1)]})
The generated fixes mapping uses the messy_borough values as keys and, as values, the matches found in the vocabulary along with their match scores.
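The idea behind the matcher is a nearest-neighbor search over the vocabulary under a string-similarity measure. As a rough stdlib analogue, difflib's ratio can stand in for the similarity function (this is not the FuzzySimilarity algorithm openclean uses, just an illustration of the technique):

```python
from difflib import get_close_matches

# the master vocabulary, as in the example above
VOCABULARY = ["BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"]

def best_match(query, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the most similar vocabulary term, or the query itself if none qualifies."""
    matches = get_close_matches(query, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else query

print(best_match("BOOKLYN"))  # -> BROOKLYN
print(best_match("QEENS"))    # -> QUEENS
```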
Making Repairs
The simplest repair strategy here would be to look up messy_borough values in the fixes map and replace them. We achieve this with the update transformation from the Data Wrangling and Cleaning section.
from openclean.operator.transform.update import update
misspelled_data = update(misspelled_data, 'messy_borough', fixes.to_lookup())
misspelled_data['messy_borough'].unique()
array(['MANHATTAN', 'BROOKLYN', 'QUEENS', 'BRONX', 'STATEN ISLAND'],
dtype=object)
We fixed it! The repair also shows up in the profile as a drop in distinct values, uniqueness, and entropy.
dataset_profile(misspelled_data).stats()
| | total | empty | distinct | uniqueness | entropy |
|---|---|---|---|---|---|
| messy_borough | 9998 | 0 | 5 | 0.0005 | 2.066639 |
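Conceptually, the lookup-based repair we just applied is plain dictionary replacement: every value found in the mapping is replaced by its correction, and everything else passes through unchanged. A minimal sketch, with a hand-written dict standing in for the result of fixes.to_lookup():

```python
# hand-written stand-in for fixes.to_lookup() (hypothetical values)
lookup = {"BOOKLYN": "BROOKLYN", "QEENS": "QUEENS", "BRONKS": "BRONX"}

column = ["MANHATTAN", "BOOKLYN", "QEENS", "BRONX"]

# replace values found in the lookup; leave everything else unchanged
repaired = [lookup.get(value, value) for value in column]
print(repaired)  # -> ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'BRONX']
```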
As this small real-world example shows, openclean makes it straightforward to load and stream datasets, profile them to identify bugs, incorporate master data, and apply a toolkit of operators to identify and make fixes.
More Examples
We provide many other Jupyter notebooks as examples to demonstrate different capabilities of openclean. All our notebooks, along with the datasets they use, can be found in the Step by Step Guides.