Data Wrangling and Cleaning

openclean comes with the following operators to help users find anomalies, make fixes and wrangle their datasets.

Note

This list is growing and will be updated periodically.

We demonstrate the examples using a sample of NYC open data that records completed job codes at various locations across New York City.

import os
path_to_file = os.path.join(os.getcwd(), 'source', 'data')
from openclean.data.load import dataset

ds = dataset(os.path.join(path_to_file, 'job_locations.csv'))

ds.head()
Job # Borough Street Name GIS_NTA_NAME
0 140915936 MANHATTAN EAST 93RD STREET Upper East Side-Carnegie Hill
1 340737929 BROOKLYN 16TH AVE Bensonhurst West
2 440601733 QUEENS AUSTIN STREET Kew Gardens
3 340737901 BROOKLYN 3RD STREET Carroll Gardens-Columbia Street-Red Hook
4 240275910 BRONX CEDAR AVENUE University Heights-Morris Heights

Functional Dependency Violations

openclean makes it easy to identify any functional dependency violations in datasets.

from openclean.operator.map.violations import fd_violations

fd1_violations = fd_violations(ds, ['Street Name', 'GIS_NTA_NAME'], ['Borough'])

print('# of violations for FD(Street Name, GIS_NTA_NAME -> Borough) is {}\n'.format(len(fd1_violations)))
for key, gr in fd1_violations.items():
    print(gr[['Street Name', 'GIS_NTA_NAME', 'Borough']])
# of violations for FD(Street Name, GIS_NTA_NAME -> Borough) is 1

     Street Name                            GIS_NTA_NAME    Borough
10      BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
11      BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy     QUEENS
61      BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
691     BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
1205    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
1398    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
1546    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
1616    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
1728    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
1781    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
2366    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
2590    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
2958    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
3535    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
3556    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
3725    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
4225    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
4441    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
5333    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
5555    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
5583    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
6561    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
7428    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
7939    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
8027    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
9426    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
9811    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN

We identify one violation in the example above. The culprit is row 11: anyone with domain knowledge can point out that FD(BROADWAY, SoHo-TriBeCa-Civic Center-Little Italy) -> QUEENS is incorrect. Let’s fix this using one of the repair strategies available to us:

from openclean.operator.collector.repair import Shortest, Vote, conflict_repair

# Define the conflict resolution strategy. We use a majority vote for the RHS attribute.
strategy = {'Borough': Vote()}

# resolve the conflicts
resolved = conflict_repair(conflicts=fd1_violations, strategy=strategy, in_order=False)

This replaces each conflicting value with the most frequent value in its group, resolving the conflict.

violation_group = resolved[(resolved['Street Name']=='BROADWAY') & (resolved['GIS_NTA_NAME']=='SoHo-TriBeCa-Civic Center-Little Italy')]

fd2_violations = fd_violations(resolved, ['Street Name', 'GIS_NTA_NAME'], ['Borough'])

print('# of violations for FD(Street Name, GIS_NTA_NAME -> Borough) is {}\n'.format(len(fd2_violations)))
print(violation_group)
# of violations for FD(Street Name, GIS_NTA_NAME -> Borough) is 0

          Job #    Borough Street Name                            GIS_NTA_NAME
10    140915990  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
11    140915992  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
61    121848590  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
691   123919901  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
1205  123512430  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
1398  140838360  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
1546  123843485  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
1616  123569352  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
1728  140911798  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
1781  140914704  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
2366  123788311  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
2590  123830514  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
2958  123070292  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
3535  140917774  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
3556  140944478  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
3725  122967021  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
4225  123569389  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
4441  140918648  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
5333  123787081  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
5555  140918620  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
5583  104115011  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
6561  120304973  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
7428  122989141  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
7939  140922072  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
8027  120004244  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
9426  121183129  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
9811  122881097  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy

A complete list of repair strategies can be found in the API Reference.
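To make the strategies concrete, here is a minimal pure-Python sketch (not the openclean API) of the two resolution rules imported above: Vote keeps the most frequent value in a conflicting group, while Shortest keeps the value of minimal length.

```python
from collections import Counter

def vote(values):
    # Vote: keep the most frequent value in the conflicting group.
    return Counter(values).most_common(1)[0][0]

def shortest(values):
    # Shortest: keep the value with the fewest characters.
    return min(values, key=len)

group = ['MANHATTAN'] * 26 + ['QUEENS']
print(vote(group))      # -> MANHATTAN
print(shortest(group))  # -> QUEENS
```

In the repair above, a majority vote maps the whole Broadway group to MANHATTAN, since 26 of the 27 conflicting rows agree on that value.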

Missing Values

Depending on the use case, missing values can be handled with the filter transformation (removing them) or the update transformation (replacing them with new values). Both are explained in Data Transformation; we demonstrate them again here.

misspelled_data = dataset(os.path.join(path_to_file, 'misspellings.csv'))
misspelled_data['Borough'].value_counts()
MANHATTAN        3442
BROOKLYN         2999
QUEENS           2090
BRONX             932
STATEN ISLAND     529
                    2
BRONKS              1
BROOKLIN            1
QUEINS              1
QEENS               1
MENHATTAN           1
BOOKLYN             1
Name: Borough, dtype: int64

We see there are empty values in this column. First, let’s update them using a lambda function that uses the is_empty value function.

from openclean.operator.transform.update import update
from openclean.function.value.null import is_empty

updated_misspelled = update(misspelled_data, 'Borough', lambda x: 'Unknown' if is_empty(x) else x)

updated_misspelled['Borough'].value_counts()
MANHATTAN        3442
BROOKLYN         2999
QUEENS           2090
BRONX             932
STATEN ISLAND     529
Unknown             2
BRONKS              1
BROOKLIN            1
QUEINS              1
QEENS               1
MENHATTAN           1
BOOKLYN             1
Name: Borough, dtype: int64

We’ve replaced the missing values with Unknown. But because there is no way to be sure of the correct value without using other columns and the data augmentation techniques from the Data Enrichment section, we will instead filter out the nulls in this example using the IsNotEmpty eval function.

from openclean.operator.transform.filter import filter
from openclean.function.eval.null import IsNotEmpty

misspelled_data = filter(misspelled_data, predicate=IsNotEmpty('Borough'))

misspelled_data['Borough'].value_counts()
MANHATTAN        3442
BROOKLYN         2999
QUEENS           2090
BRONX             932
STATEN ISLAND     529
BRONKS              1
BROOKLIN            1
QUEINS              1
QEENS               1
MENHATTAN           1
BOOKLYN             1
Name: Borough, dtype: int64

The full range of eval functions and value functions can be found in the Eval and Value packages of the API Reference, respectively, as explained in Data Model.

Misspellings and Data Entry Bugs

openclean can help identify misspellings and data entry bugs using its powerful string matcher class, which detects anomalous values using:

  • Phonetic Matching

    Phonetic algorithms transform the input strings to normalized phonetic encodings before comparing them. openclean has the following phonetic string algorithms:

    • NYSIIS

    • Soundex

    • Metaphone

  • Fuzzy Matching

    Implementation of fuzzy string matching using n-gram overlaps and Levenshtein or cosine distance.
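As an illustration of the distance underlying fuzzy matching, here is a minimal pure-Python Levenshtein implementation with a length-normalized similarity (a conceptual sketch, not openclean's FuzzySimilarity):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    # Normalize the edit distance by the longer string's length.
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

print(similarity('QUEINS', 'QUEENS'))  # -> 0.8333333333333334
```

For 'QUEINS' vs. 'QUEENS' a single substitution gives 1 - 1/6, which matches the score FuzzySimilarity reports for that pair in the example below.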

StringMatcher objects ingest a vocabulary and a matching algorithm and use them to identify dataset values that are misspelled. The resulting matches can optionally be stored in an openclean mapping and reused later with other datasets as translation tables.

from openclean.function.matching.base import DefaultStringMatcher
from openclean.function.matching.fuzzy import FuzzySimilarity
from openclean.data.mapping import Mapping

VOCABULARY = ['BROOKLYN', 'MANHATTAN', 'STATEN ISLAND', 'BRONX', 'QUEENS']

matcher = DefaultStringMatcher(
    vocabulary=VOCABULARY,
    similarity=FuzzySimilarity()
)

map = Mapping()
for query in set(misspelled_data['Borough']):
    map.add(query, matcher.find_matches(query))

print(map)
Mapping(<class 'list'>, {'QUEINS': [StringMatch(term='QUEENS', score=0.8333333333333334)], 'QUEENS': [StringMatch(term='QUEENS', score=1)], 'STATEN ISLAND': [StringMatch(term='STATEN ISLAND', score=1)], 'MANHATTAN': [StringMatch(term='MANHATTAN', score=1)], 'MENHATTAN': [StringMatch(term='MANHATTAN', score=0.8888888888888888)], 'BROOKLIN': [StringMatch(term='BROOKLYN', score=0.875)], 'BRONX': [StringMatch(term='BRONX', score=1)], 'BRONKS': [StringMatch(term='BRONX', score=0.6666666666666667)], 'BROOKLYN': [StringMatch(term='BROOKLYN', score=1)], 'QEENS': [StringMatch(term='QUEENS', score=0.8333333333333334)], 'BOOKLYN': [StringMatch(term='BROOKLYN', score=0.875)]})

The map shows that each misspelling matched exactly one value from the vocabulary, so the map can be used to fix the Borough column. The user has to intervene manually and update the map whenever a query value has zero matches, or more than one match, in the vocabulary.

Fixing is easy: we use the update operation with the Lookup eval function, which provides a default value when a key is not found in the map.

from openclean.function.eval.domain import Lookup
from openclean.operator.transform.update import update
from openclean.function.eval.base import Col


fixed = update(misspelled_data, 'Borough', Lookup(columns=['Borough'], mapping=map.to_lookup(), default=Col('Borough')))

print(fixed['Borough'].unique())
['MANHATTAN' 'BROOKLYN' 'QUEENS' 'BRONX' 'STATEN ISLAND']
  • KNN Clustering

    openclean lets users apply KNN clustering to identify values in a dataset that are potential variations of one another (e.g., values with missing or extra punctuation) and suggests a replacement for each. A well-put example demonstrating this can be found here.
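The idea can be sketched in plain Python (this greedy grouping is only an illustration; openclean's KNN clusterer is more sophisticated): each value joins the first cluster whose representative is sufficiently similar, and the most frequent member of each cluster becomes the suggested replacement. The example values and the 0.8 threshold are made up.

```python
from collections import Counter
from difflib import SequenceMatcher

def cluster(values, threshold=0.8):
    # Greedy single pass: attach each value to the first cluster whose
    # representative (first member) is similar enough, else start a new one.
    clusters = []
    for v in values:
        for c in clusters:
            if SequenceMatcher(None, v, c[0]).ratio() >= threshold:
                c.append(v)
                break
        else:
            clusters.append([v])
    return clusters

values = ['M&M CORP.', 'M&M CORP', 'M & M CORP', 'ACME LLC', 'ACME, LLC']
for c in cluster(values):
    # Suggest the most common member (ties go to the first seen) as the fix.
    print(Counter(c).most_common(1)[0][0], '<-', c)
```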

Data Standardization

Users will often face situations where a dataset contains spelling variations (with extra or missing punctuation) of, for example, a business or street name. openclean provides multiple ways to enforce consistency across datasets:

  • Token Signature Outliers

    When values in the dataset are expected to contain a signature token, this functionality helps identify anomalies. For example, an address column often requires values to contain street suffixes. The Token Signature class highlights any values that lack one. Here is a notebook demonstrating this.

  • KNN Clustering

    As mentioned in the previous section, KNN clustering can be used to identify sets of similar values. openclean takes it up a notch by suggesting a replacement value for each identified cluster.

These two tools become even more powerful when combined, as demonstrated in this notebook, where we first replace different token signature representations with a standard one and then use KNN clustering to fix the remaining value variations in a street name column.
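The combined workflow can be sketched in plain Python (an illustration only, not the openclean API): first flag values lacking a street-suffix signature token, then standardize abbreviated suffixes to one canonical form. The suffix vocabulary below is a small made-up sample.

```python
# Hypothetical suffix vocabulary; a real signature would be far larger.
SUFFIXES = {'STREET', 'ST', 'AVENUE', 'AVE', 'ROAD', 'RD', 'BOULEVARD', 'BLVD'}
STANDARD = {'ST': 'STREET', 'AVE': 'AVENUE', 'RD': 'ROAD', 'BLVD': 'BOULEVARD'}

def has_signature(value):
    # A value is unremarkable if at least one token is a known street suffix.
    return any(tok in SUFFIXES for tok in value.split())

def standardize(value):
    # Replace abbreviated suffixes with their canonical long form.
    return ' '.join(STANDARD.get(tok, tok) for tok in value.split())

streets = ['EAST 93RD STREET', '16TH AVE', 'BROADWAY']
print([s for s in streets if not has_signature(s)])  # -> ['BROADWAY']
print([standardize(s) for s in streets])             # -> ['EAST 93RD STREET', '16TH AVENUE', 'BROADWAY']
```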

Statistical Outliers

openclean provides several statistical anomaly detection operators implemented on top of the scikit-learn machine learning library: DBSCAN, Isolation Forest, Local Outlier Factor, One-Class SVM, and Robust Covariance.

Here we use a simple ensemble approach that applies all these operators to the dataset’s GIS_NTA_NAME column.

from collections import Counter

ensemble = Counter()

from openclean.embedding.feature.default import UniqueSetEmbedding
from openclean.profiling.anomalies.sklearn import (
    dbscan,
    isolation_forest,
    local_outlier_factor,
    one_class_svm,
    robust_covariance
)

for f in [dbscan, isolation_forest, local_outlier_factor, one_class_svm, robust_covariance]:
    ensemble.update(f(ds, 'GIS_NTA_NAME', features=UniqueSetEmbedding()))

We then count, for each value, the number of operators that classified it as an outlier.

# Output values that have been classified as outliers by at least three out of the
# five operators.

prev = 0
for value, count in ensemble.most_common():
    if count < 3:
        break
    if count < prev:
        print()
    if count != prev:
        print('{}\t{}'.format(count, value))
    else:
        print('\t{}'.format(value))
    prev = count
5	

4	Mariner's Harbor-Arlington-Port Ivory-Graniteville
	St. Albans
	Breezy Point-Belle Harbor-Rockaway Park-Broad Channel
	Co-op City
	Todt Hill-Emerson Hill-Heartland Village-Lighthouse Hill

3	park-cemetery-etc-Brooklyn
	Ft. Totten-Bay Terrace-Clearview
	Annadale-Huguenot-Prince's Bay-Eltingville
	park-cemetery-etc-Manhattan
	Grasmere-Arrochar-Ft. Wadsworth
	park-cemetery-etc-Queens
	Sheepshead Bay-Gerritsen Beach-Manhattan Beach

These statistically classified anomalies may be neighborhoods with fewer job requests, or misspellings; a user with domain knowledge can verify which.

Custom functions

Users can create their own data cleaning operators, apply them, and reuse them as needed. With openclean-notebook, these eval functions or callables can further be registered in a UI and applied to datasets visually. The following screengrab shows how custom functions together with openclean-notebook enhance a user’s data munging experience:
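As a minimal sketch, a custom cleaning operator is just a callable from a value to its cleaned form, which can then be passed to operators such as update (shown earlier) or registered with openclean-notebook. The alias table here is a hypothetical example:

```python
def normalize_borough(value):
    # Illustrative custom cleaner: trim, uppercase, and map known
    # (hypothetical) aliases to the canonical borough spelling.
    value = value.strip().upper()
    aliases = {'BKLYN': 'BROOKLYN', 'SI': 'STATEN ISLAND'}
    return aliases.get(value, value)

print(normalize_borough(' bklyn '))  # -> BROOKLYN
```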

../_images/custom_func.gif