Data Wrangling and Cleaning

openclean comes with the following operators to help users find anomalies, make fixes and wrangle their datasets.

Note

This list is growing and will be updated periodically.

We demonstrate the examples using a sample of NYC open data that records completed job codes at various locations across New York City.

import os
path_to_file = os.path.join(os.getcwd(), 'source', 'data')
from openclean.data.load import dataset

ds = dataset(os.path.join(path_to_file, 'job_locations.csv'))

ds.head()
Job # Borough Street Name GIS_NTA_NAME
0 140915936 MANHATTAN EAST 93RD STREET Upper East Side-Carnegie Hill
1 340737929 BROOKLYN 16TH AVE Bensonhurst West
2 440601733 QUEENS AUSTIN STREET Kew Gardens
3 340737901 BROOKLYN 3RD STREET Carroll Gardens-Columbia Street-Red Hook
4 240275910 BRONX CEDAR AVENUE University Heights-Morris Heights

Functional Dependency Violations

openclean makes it easy to identify any functional dependency violations in datasets.

from openclean.operator.map.violations import fd_violations

fd1_violations = fd_violations(ds, ['Street Name', 'GIS_NTA_NAME'], ['Borough'])

print('# of violations for FD(Street Name, GIS_NTA_NAME -> Borough) is {}\n'.format(len(fd1_violations)))
for key, gr in fd1_violations.items():
    print(gr[['Street Name', 'GIS_NTA_NAME', 'Borough']])
# of violations for FD(Street Name, GIS_NTA_NAME -> Borough) is 1

     Street Name                            GIS_NTA_NAME    Borough
10      BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
11      BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy     QUEENS
61      BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
691     BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
1205    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
1398    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
1546    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
1616    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
1728    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
1781    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
2366    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
2590    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
2958    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
3535    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
3556    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
3725    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
4225    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
4441    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
5333    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
5555    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
5583    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
6561    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
7428    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
7939    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
8027    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
9426    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN
9811    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy  MANHATTAN

We identify one violation in the example above. The culprit is row 11: anyone with domain knowledge can point out that FD(BROADWAY, SoHo-TriBeCa-Civic Center-Little Italy) -> QUEENS is incorrect. Let’s fix this using one of the repair strategies available to us:

from openclean.operator.collector.repair import Shortest, Vote, conflict_repair

# Define the conflict resolution strategy. We use a majority vote for the RHS attribute.
strategy = {'Borough': Vote()}

# resolve the conflicts
resolved = conflict_repair(conflicts=fd1_violations, strategy=strategy, in_order=False)

This replaces each conflicting value with the most frequent value in its group, resolving the conflict.

violation_group = resolved[(resolved['Street Name']=='BROADWAY') & (resolved['GIS_NTA_NAME']=='SoHo-TriBeCa-Civic Center-Little Italy')]

fd2_violations = fd_violations(resolved, ['Street Name', 'GIS_NTA_NAME'], ['Borough'])

print('# of violations for FD(Street Name, GIS_NTA_NAME -> Borough) is {}\n'.format(len(fd2_violations)))
print(violation_group)
# of violations for FD(Street Name, GIS_NTA_NAME -> Borough) is 0

          Job #    Borough Street Name                            GIS_NTA_NAME
10    140915990  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
11    140915992  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
61    121848590  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
691   123919901  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
1205  123512430  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
1398  140838360  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
1546  123843485  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
1616  123569352  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
1728  140911798  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
1781  140914704  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
2366  123788311  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
2590  123830514  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
2958  123070292  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
3535  140917774  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
3556  140944478  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
3725  122967021  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
4225  123569389  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
4441  140918648  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
5333  123787081  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
5555  140918620  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
5583  104115011  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
6561  120304973  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
7428  122989141  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
7939  140922072  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
8027  120004244  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
9426  121183129  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy
9811  122881097  MANHATTAN    BROADWAY  SoHo-TriBeCa-Civic Center-Little Italy

A complete list of repair strategies can be found in the API Reference.
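To make the strategies concrete, here is a minimal pure-Python sketch (not the openclean API) of the two resolution rules imported above: Vote keeps the most frequent value in a conflicting group, while Shortest keeps the value of minimal length.

```python
from collections import Counter

def vote(values):
    # Vote: keep the most frequent value in the conflicting group.
    return Counter(values).most_common(1)[0][0]

def shortest(values):
    # Shortest: keep the value with the fewest characters.
    return min(values, key=len)

group = ['MANHATTAN'] * 26 + ['QUEENS']
print(vote(group))      # -> MANHATTAN
print(shortest(group))  # -> QUEENS
```

In the repair above, a majority vote maps the whole Broadway group to MANHATTAN, since 26 of the 27 conflicting rows agree on that value.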

Missing Values

Depending on the use case, missing values can be handled with the filter transformation (removing them) or the update transformation (replacing them with new values). Both are explained in Data Transformation; we demonstrate them again here.

misspelled_data = dataset(os.path.join(path_to_file, 'misspellings.csv'))
misspelled_data['Borough'].value_counts()
MANHATTAN        3442
BROOKLYN         2999
QUEENS           2090
BRONX             932
STATEN ISLAND     529
                    2
BRONKS              1
BROOKLIN            1
QUEINS              1
QEENS               1
MENHATTAN           1
BOOKLYN             1
Name: Borough, dtype: int64

We see there are empty values in this column. First, let’s update them using a lambda function that uses the is_empty value function.

from openclean.operator.transform.update import update
from openclean.function.value.null import is_empty

updated_misspelled = update(misspelled_data, 'Borough', lambda x: 'Unknown' if is_empty(x) else x)

updated_misspelled['Borough'].value_counts()
MANHATTAN        3442
BROOKLYN         2999
QUEENS           2090
BRONX             932
STATEN ISLAND     529
Unknown             2
BRONKS              1
BROOKLIN            1
QUEINS              1
QEENS               1
MENHATTAN           1
BOOKLYN             1
Name: Borough, dtype: int64

We’ve replaced the missing values with Unknown. But because there is no way to be sure of the correct value without using other columns and the data augmentation techniques from the Data Enrichment section, we will instead filter out the nulls in this example using the IsNotEmpty eval function.

from openclean.operator.transform.filter import filter
from openclean.function.eval.null import IsNotEmpty

misspelled_data = filter(misspelled_data, predicate=IsNotEmpty('Borough'))

misspelled_data['Borough'].value_counts()
MANHATTAN        3442
BROOKLYN         2999
QUEENS           2090
BRONX             932
STATEN ISLAND     529
BRONKS              1
BROOKLIN            1
QUEINS              1
QEENS               1
MENHATTAN           1
BOOKLYN             1
Name: Borough, dtype: int64

The full range of eval functions and value functions can be found in the Eval and Value packages of the API Reference, respectively, as explained in Data Model.

Misspellings and Data Entry Bugs

openclean can help identify misspellings and data entry bugs using its powerful string matcher class, which detects anomalous values using:

  • Phonetic Matching

    Phonetic algorithms transform the input strings to normalized phonetic encodings before comparing them. openclean has the following phonetic string algorithms:

    • NYSIIS

    • Soundex

    • Metaphone

  • Fuzzy Matching

    Implementation of fuzzy string matching using n-gram overlaps and Levenshtein or cosine distance.
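As an illustration of the distance underlying fuzzy matching, here is a minimal pure-Python Levenshtein implementation with a length-normalized similarity (a conceptual sketch, not openclean's FuzzySimilarity):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    # Normalize the edit distance by the longer string's length.
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

print(similarity('QUEINS', 'QUEENS'))  # -> 0.8333333333333334
```

For 'QUEINS' vs. 'QUEENS' a single substitution gives 1 - 1/6, which matches the score FuzzySimilarity reports for that pair in the example below.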

StringMatcher objects ingest a vocabulary and a matching algorithm and use them to identify dataset values that are misspelled. The resulting matches can optionally be stored in an openclean mapping and reused later with other datasets as translation tables.

from openclean.function.matching.base import DefaultStringMatcher
from openclean.function.matching.fuzzy import FuzzySimilarity
from openclean.data.mapping import Mapping

VOCABULARY = ['BROOKLYN', 'MANHATTAN', 'STATEN ISLAND', 'BRONX', 'QUEENS']

matcher = DefaultStringMatcher(
    vocabulary=VOCABULARY,
    similarity=FuzzySimilarity()
)

map = Mapping()
for query in set(misspelled_data['Borough']):
    map.add(query, matcher.find_matches(query))

print(map)
Mapping(<class 'list'>, {'QUEINS': [StringMatch(term='QUEENS', score=0.8333333333333334)], 'QUEENS': [StringMatch(term='QUEENS', score=1)], 'STATEN ISLAND': [StringMatch(term='STATEN ISLAND', score=1)], 'MANHATTAN': [StringMatch(term='MANHATTAN', score=1)], 'MENHATTAN': [StringMatch(term='MANHATTAN', score=0.8888888888888888)], 'BROOKLIN': [StringMatch(term='BROOKLYN', score=0.875)], 'BRONX': [StringMatch(term='BRONX', score=1)], 'BRONKS': [StringMatch(term='BRONX', score=0.6666666666666667)], 'BROOKLYN': [StringMatch(term='BROOKLYN', score=1)], 'QEENS': [StringMatch(term='QUEENS', score=0.8333333333333334)], 'BOOKLYN': [StringMatch(term='BROOKLYN', score=0.875)]})

The map shows that each misspelling matched exactly one value from the vocabulary, so the map can be used to fix the Borough column. The user has to intervene manually and update the map whenever a query value has zero matches, or more than one match, in the vocabulary.

Fixing is easy: we use the update operation with the Lookup eval function, which provides a default value when a key is not found in the map.

from openclean.function.eval.domain import Lookup
from openclean.operator.transform.update import update
from openclean.function.eval.base import Col


fixed = update(misspelled_data, 'Borough', Lookup(columns=['Borough'], mapping=map.to_lookup(), default=Col('Borough')))

print(fixed['Borough'].unique())
['MANHATTAN' 'BROOKLYN' 'QUEENS' 'BRONX' 'STATEN ISLAND']
  • KNN Clustering

    openclean lets users apply KNN clustering to identify values in a dataset that are potential variations of one another (e.g., values with missing or extra punctuation) and suggests a replacement for each. A well-put example demonstrating this can be found here.
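The idea can be sketched in plain Python (this greedy grouping is only an illustration; openclean's KNN clusterer is more sophisticated): each value joins the first cluster whose representative is sufficiently similar, and the most frequent member of each cluster becomes the suggested replacement. The example values and the 0.8 threshold are made up.

```python
from collections import Counter
from difflib import SequenceMatcher

def cluster(values, threshold=0.8):
    # Greedy single pass: attach each value to the first cluster whose
    # representative (first member) is similar enough, else start a new one.
    clusters = []
    for v in values:
        for c in clusters:
            if SequenceMatcher(None, v, c[0]).ratio() >= threshold:
                c.append(v)
                break
        else:
            clusters.append([v])
    return clusters

values = ['M&M CORP.', 'M&M CORP', 'M & M CORP', 'ACME LLC', 'ACME, LLC']
for c in cluster(values):
    # Suggest the most common member (ties go to the first seen) as the fix.
    print(Counter(c).most_common(1)[0][0], '<-', c)
```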

Data Standardization

Users will often face situations where a dataset contains spelling variations (with extra or missing punctuation) of, for example, a business or street name. openclean provides multiple ways to enforce consistency across datasets:

  • Token Signature Outliers

    When values in the dataset are expected to contain a signature token, this functionality helps identify anomalies. For example, an address column often requires values to contain street suffixes. The Token Signature class highlights any values that lack one. Here is a notebook demonstrating this.

  • KNN Clustering

    As mentioned in the previous section, KNN clustering can be used to identify sets of similar values. openclean takes it up a notch by suggesting a replacement value for each identified cluster.

These two tools become even more powerful when combined, as demonstrated in this notebook, where we first replace different token signature representations with a standard one and then use KNN clustering to fix the remaining value variations in a street name column.
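The combined workflow can be sketched in plain Python (an illustration only, not the openclean API): first flag values lacking a street-suffix signature token, then standardize abbreviated suffixes to one canonical form. The suffix vocabulary below is a small made-up sample.

```python
# Hypothetical suffix vocabulary; a real signature would be far larger.
SUFFIXES = {'STREET', 'ST', 'AVENUE', 'AVE', 'ROAD', 'RD', 'BOULEVARD', 'BLVD'}
STANDARD = {'ST': 'STREET', 'AVE': 'AVENUE', 'RD': 'ROAD', 'BLVD': 'BOULEVARD'}

def has_signature(value):
    # A value is unremarkable if at least one token is a known street suffix.
    return any(tok in SUFFIXES for tok in value.split())

def standardize(value):
    # Replace abbreviated suffixes with their canonical long form.
    return ' '.join(STANDARD.get(tok, tok) for tok in value.split())

streets = ['EAST 93RD STREET', '16TH AVE', 'BROADWAY']
print([s for s in streets if not has_signature(s)])  # -> ['BROADWAY']
print([standardize(s) for s in streets])             # -> ['EAST 93RD STREET', '16TH AVENUE', 'BROADWAY']
```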

Statistical Outliers

openclean provides several statistical anomaly detection operators implemented on top of the scikit-learn machine learning library: DBSCAN, Isolation Forest, Local Outlier Factor, One-Class SVM, and Robust Covariance.

Here we use a simple ensemble approach that applies all these operators to the dataset’s GIS_NTA_NAME column.

from collections import Counter

ensemble = Counter()

from openclean.embedding.feature.default import UniqueSetEmbedding
from openclean.profiling.anomalies.sklearn import (
    dbscan,
    isolation_forest,
    local_outlier_factor,
    one_class_svm,
    robust_covariance
)

for f in [dbscan, isolation_forest, local_outlier_factor, one_class_svm, robust_covariance]:
    ensemble.update(f(ds, 'GIS_NTA_NAME', features=UniqueSetEmbedding()))

We then count, for each value, the number of operators that classified it as an outlier.

# Output values that have been classified as outliers by at least three out of the
# five operators.

prev = 0
for value, count in ensemble.most_common():
    if count < 3:
        break
    if count < prev:
        print()
    if count != prev:
        print('{}\t{}'.format(count, value))
    else:
        print('\t{}'.format(value))
    prev = count
5	

4	Mariner's Harbor-Arlington-Port Ivory-Graniteville
	St. Albans
	Breezy Point-Belle Harbor-Rockaway Park-Broad Channel
	Co-op City
	Todt Hill-Emerson Hill-Heartland Village-Lighthouse Hill

3	park-cemetery-etc-Brooklyn
	Ft. Totten-Bay Terrace-Clearview
	Annadale-Huguenot-Prince's Bay-Eltingville
	park-cemetery-etc-Manhattan
	Grasmere-Arrochar-Ft. Wadsworth
	park-cemetery-etc-Queens
	Sheepshead Bay-Gerritsen Beach-Manhattan Beach

These statistically classified anomalies may be neighborhoods with fewer job requests, or misspellings; a user with domain knowledge can verify which.

Custom functions

Users can create their own data cleaning operators, apply them, and reuse them as needed. With openclean-notebook, these eval functions or callables can further be registered in a UI and applied to datasets visually. The following screengrab shows how custom functions together with openclean-notebook enhance a user’s data munging experience:
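As a minimal sketch, a custom cleaning operator is just a callable from a value to its cleaned form, which can then be passed to operators such as update (shown earlier) or registered with openclean-notebook. The alias table here is a hypothetical example:

```python
def normalize_borough(value):
    # Illustrative custom cleaner: trim, uppercase, and map known
    # (hypothetical) aliases to the canonical borough spelling.
    value = value.strip().upper()
    aliases = {'BKLYN': 'BROOKLYN', 'SI': 'STATEN ISLAND'}
    return aliases.get(value, value)

print(normalize_borough(' bklyn '))  # -> BROOKLYN
```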

../_images/custom_func.gif