Getting Started

openclean provides functionality to identify bugs and anomalous values, make fixes, and wrangle datasets. Here, we walk through a simple example to get you acquainted with openclean in 10 minutes! Our misspellings dataset contains Street, Neighborhood, and Borough names for New York City with a number of spelling mistakes. The goal of this exercise is to fix those errors using some of the tools openclean puts at our disposal.

Loading Data

openclean uses a dataset (a wrapped pandas DataFrame) as its primary data storage object. It can be created from any source data type accepted by pandas; gzip-compressed files (.gz) are also accepted. For large datasets, it might be desirable to evaluate the data lazily instead of loading it all into memory. To allow this, openclean lets users stream their datasets. More information on Datasets and Streams is provided in the Data Model section.

import os

from openclean.data.load import dataset

path_to_file = os.path.join(os.getcwd(), 'source', 'data')

ds = dataset(os.path.join(path_to_file, 'misspellings.csv'))

ds.head()
       Job #    Borough       Street Name                              GIS_NTA_NAME
0  140915936  MANHATTAN  EAST 93RD STREET             Upper East Side-Carnegie Hill
1  340737929   BROOKLYN          16TH AVE                          Bensonhurst West
2  440601733     QUEENS     AUSTIN STREET                               Kew Gardens
3  340737901   BROOKLYN        3RD STREET  Carroll Gardens-Columbia Street-Red Hook
4  240275910      BRONX      CEDAR AVENUE         University Heights-Morris Heights
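
As mentioned above, large files can also be processed as a stream instead of being loaded fully into memory. A minimal sketch, assuming the stream helper in openclean.pipeline accepts a file path the same way dataset does:

from openclean.pipeline import stream

# open the same file as a stream; operators are applied lazily, row by row
ds_stream = stream(os.path.join(path_to_file, 'misspellings.csv'))

# materialize only the first rows for a quick look
ds_stream.head()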

Profiling the Dataset

To provide insights about the dataset features, openclean comes with data profiling capabilities. A user can select the default profiler to get basic statistics or plug in their own profilers for advanced computations. More information about profilers is available in the Data Profiling section.

from openclean.profiling.dataset import dataset_profile

# generate a profile
profiles = dataset_profile(ds)

# see all stats
profiles.stats()
              total  empty  distinct  uniqueness    entropy
Job #         10000      0      9329    0.932900  13.136359
Borough       10000      2        11    0.001100   2.074184
Street Name   10000      0      3691    0.369100  10.990810
GIS_NTA_NAME  10000     29       192    0.019256   7.040698
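
Beyond these summary statistics, the profile object exposes further per-column information. For example, a quick check of the inferred datatype counts (assuming the types() accessor of the default profiler):

# show the datatype counts inferred for each column
profiles.types()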

We see that the Borough column contains 11 distinct values, whereas there are only 5 boroughs in New York City, and that 2 of the values are empty/missing.

ds['Borough'].value_counts()
MANHATTAN        3442
BROOKLYN         2999
QUEENS           2090
BRONX             932
STATEN ISLAND     529
                    2
BRONKS              1
QUEINS              1
BOOKLYN             1
BROOKLIN            1
MENHATTAN           1
QEENS               1
Name: Borough, dtype: int64

Looking more closely, we realize there are a few misspelled variations of Brooklyn, Queens, Bronx, and Manhattan; the unlabeled row corresponds to the 2 empty values.

Selecting Columns

As discussed earlier, we want to fix the mistakes in the Borough column. We can separate this column from the rest of the dataset using the select operation. Before we do that, let's get rid of the rows that have missing values. For this we use the filter operator together with the IsNotEmpty Eval function. Eval functions are explained in Data Model.

from openclean.operator.transform.filter import filter
from openclean.function.eval.null import IsNotEmpty

ds = filter(ds, predicate=IsNotEmpty('Borough'))

ds['Borough'].value_counts()
MANHATTAN        3442
BROOKLYN         2999
QUEENS           2090
BRONX             932
STATEN ISLAND     529
BRONKS              1
QUEINS              1
BOOKLYN             1
BROOKLIN            1
MENHATTAN           1
QEENS               1
Name: Borough, dtype: int64

Now, let’s separate out the column of interest. You can read more on selecting columns and other dataset/stream transformations in the Data Transformation section.

from openclean.operator.transform.select import select

misspelled_data = select(ds, columns=['Borough'], names=['messy_borough'])

misspelled_data['messy_borough'].unique()
array(['MANHATTAN', 'BROOKLYN', 'QUEENS', 'BRONX', 'STATEN ISLAND',
       'QUEINS', 'BOOKLYN', 'MENHATTAN', 'QEENS', 'BRONKS', 'BROOKLIN'],
      dtype=object)

Downloading and Preparing Master Data

With openclean, a user can easily incorporate other datasets to enrich the data cleaning process. For example, let's download an official list of borough names from the Borough Population projections dataset on Socrata to help us with the wrangling. We shall use this as the ground truth for correct spellings. You can read more about master datasets in the Data Enrichment section.

After downloading the master data, we preprocess it a bit to match the case of our input dataset. To achieve this we use the update transformation from Data Wrangling and Cleaning, which accepts either a dictionary lookup or a function as its update argument.

from openclean.data.source.socrata import Socrata
from openclean.operator.transform.update import update

# download the master data and select the relevant column
nyc_boroughs = Socrata().dataset('xywu-7bv9').load()
nyc_boroughs = select(nyc_boroughs, columns=['Borough'])

# uppercase and strip the values to match with the misspelled data
nyc_boroughs = update(nyc_boroughs, 'Borough', str.upper)
nyc_boroughs = update(nyc_boroughs, 'Borough', str.strip)

nyc_boroughs
         Borough
0      NYC TOTAL
1          BRONX
2       BROOKLYN
3      MANHATTAN
4         QUEENS
5  STATEN ISLAND

Identifying Fixes

We are now familiar with the mistakes in the data and have a master dataset with corrections available. openclean provides cleaning operators and repair strategies to let users fix their datasets with the minimum amount of coding involved. A list of various cleaning operators available can be accessed in the Data Wrangling and Cleaning section.

Here, we calculate fuzzy string similarity between the messy_borough values and the master data to create a mapping from misspellings to possible fixes.

from openclean.function.matching.base import DefaultStringMatcher
from openclean.function.matching.fuzzy import FuzzySimilarity
from openclean.data.mapping import Mapping
from pprint import pprint

# the master vocabulary list
VOCABULARY = nyc_boroughs['Borough']

# create a string matcher that uses the provided vocabulary and similarity algorithm
matcher = DefaultStringMatcher(
    vocabulary=VOCABULARY,
    similarity=FuzzySimilarity()
)

# create a mapping to store the fixes
fixes = Mapping()

# look for matches in the vocabulary
for query in set(misspelled_data['messy_borough']):
    fixes.add(query, matcher.find_matches(query))

# print the fixes
pprint(fixes)
Mapping(<class 'list'>,
        {'BOOKLYN': [StringMatch(term='BROOKLYN', score=0.875)],
         'BRONKS': [StringMatch(term='BRONX', score=0.6666666666666667)],
         'BRONX': [StringMatch(term='BRONX', score=1)],
         'BROOKLIN': [StringMatch(term='BROOKLYN', score=0.875)],
         'BROOKLYN': [StringMatch(term='BROOKLYN', score=1)],
         'MANHATTAN': [StringMatch(term='MANHATTAN', score=1)],
         'MENHATTAN': [StringMatch(term='MANHATTAN', score=0.8888888888888888)],
         'QEENS': [StringMatch(term='QUEENS', score=0.8333333333333334)],
         'QUEENS': [StringMatch(term='QUEENS', score=1)],
         'QUEINS': [StringMatch(term='QUEENS', score=0.8333333333333334)],
         'STATEN ISLAND': [StringMatch(term='STATEN ISLAND', score=1)]})

The generated fixes mapping contains the messy_borough values as keys; the values are the matches found in the vocabulary, each with a match score.
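
Individual suggestions can be inspected directly; a quick check, assuming Mapping supports dictionary-style access as its repr above suggests:

# each key maps to a list of StringMatch results
match = fixes['BOOKLYN'][0]
print(match.term, match.score)  # expected: BROOKLYN 0.875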

Making Repairs

The simplest repair strategy here would be to look up messy_borough values in the fixes map and replace them. We achieve this with the update transformation from the Data Wrangling and Cleaning section.

from openclean.operator.transform.update import update

misspelled_data = update(misspelled_data, 'messy_borough', fixes.to_lookup())

misspelled_data['messy_borough'].unique()
array(['MANHATTAN', 'BROOKLYN', 'QUEENS', 'BRONX', 'STATEN ISLAND'],
      dtype=object)

We fixed it! One can also observe the decrease in uniqueness and entropy.

dataset_profile(misspelled_data).stats()
               total  empty  distinct  uniqueness   entropy
messy_borough   9998      0         5      0.0005  2.066639
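
As a final sanity check, we can verify that every repaired value now appears in the master vocabulary:

# every cleaned value should be a term from the master vocabulary
assert set(misspelled_data['messy_borough']) <= set(VOCABULARY)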

As we saw in this small real-world example, openclean makes it straightforward not only to load and stream datasets, but also to profile them to identify bugs, to bring in master data, and to use its toolkit of operators to identify and make fixes.

More Examples

We provide many other Jupyter notebooks as examples that demonstrate different capabilities of openclean. All of our notebooks, along with the datasets they use, can be found in the Step by Step Guides.