Data Wrangling and Cleaning
openclean comes with the following operators to help users find anomalies, make fixes and wrangle their datasets.
Note
This list is growing and will be updated periodically.
We use a sample of NYC open data with completed job codes at various locations in New York City to demonstrate some examples.
import os
path_to_file = os.path.join(os.getcwd(), 'source', 'data')
from openclean.data.load import dataset
ds = dataset(os.path.join(path_to_file, 'job_locations.csv'))
ds.head()
Job # | Borough | Street Name | GIS_NTA_NAME | |
---|---|---|---|---|
0 | 140915936 | MANHATTAN | EAST 93RD STREET | Upper East Side-Carnegie Hill |
1 | 340737929 | BROOKLYN | 16TH AVE | Bensonhurst West |
2 | 440601733 | QUEENS | AUSTIN STREET | Kew Gardens |
3 | 340737901 | BROOKLYN | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
4 | 240275910 | BRONX | CEDAR AVENUE | University Heights-Morris Heights |
Functional Dependency Violations
openclean makes it easy to identify any functional dependency violations in datasets.
from openclean.operator.map.violations import fd_violations
from openclean.operator.collector.count import distinct
fd1_violations = fd_violations(ds, ['Street Name', 'GIS_NTA_NAME'], ['Borough'])
print('# of violations for FD(Street Name, GIS_NTA_NAME -> Borough) is {}\n'.format(len(fd1_violations)))
for key, gr in fd1_violations.items():
print(gr[['Street Name', 'GIS_NTA_NAME', 'Borough']])
# of violations for FD(Street Name, GIS_NTA_NAME -> Borough) is 1
Street Name GIS_NTA_NAME Borough
10 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
11 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy QUEENS
61 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
691 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
1205 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
1398 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
1546 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
1616 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
1728 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
1781 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
2366 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
2590 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
2958 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
3535 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
3556 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
3725 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
4225 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
4441 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
5333 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
5555 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
5583 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
6561 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
7428 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
7939 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
8027 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
9426 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
9811 BROADWAY SoHo-TriBeCa-Civic Center-Little Italy MANHATTAN
We identify one violation in the above example. Clearly, it is row 11 as someone with domain knowledge should be able to point out that FD(BROADWAY SoHo-TriBeCa-Civic Center-Little Italy) -> QUEENS is incorrect. Let’s fix this using one of the repair strategies available to us:
from openclean.operator.collector.repair import Shortest, Vote, conflict_repair
# Define the conflict resolution strategy. We use a majority vote for both RHS attributes.
strategy = {'Borough': Vote()}
# resolve the conflicts
resolved = conflict_repair(conflicts=fd1_violations, strategy=strategy, in_order=False)
This should replace the violation value with the maximum frequency value of the group and resolve the conflict.
violation_group = resolved[(resolved['Street Name']=='BROADWAY') & (resolved['GIS_NTA_NAME']=='SoHo-TriBeCa-Civic Center-Little Italy')]
fd2_violations = fd_violations(resolved, ['Street Name', 'GIS_NTA_NAME'], ['Borough'])
print('# of violations for FD(Street Name, GIS_NTA_NAME -> Borough) is {}\n'.format(len(fd2_violations)))
print(violation_group)
# of violations for FD(Street Name, GIS_NTA_NAME -> Borough) is 0
Job # Borough Street Name GIS_NTA_NAME
10 140915990 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
11 140915992 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
61 121848590 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
691 123919901 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
1205 123512430 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
1398 140838360 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
1546 123843485 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
1616 123569352 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
1728 140911798 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
1781 140914704 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
2366 123788311 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
2590 123830514 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
2958 123070292 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
3535 140917774 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
3556 140944478 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
3725 122967021 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
4225 123569389 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
4441 140918648 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
5333 123787081 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
5555 140918620 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
5583 104115011 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
6561 120304973 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
7428 122989141 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
7939 140922072 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
8027 120004244 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
9426 121183129 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
9811 122881097 MANHATTAN BROADWAY SoHo-TriBeCa-Civic Center-Little Italy
A complete list of repair strategies can be accessed in the API Reference
Missing Values
Depending on the use case, missing values can be handled by the filter transformation (removing them) or by the update transformation (new values). They are both explained in Data Transformation. We demonstrate both again here.
misspelled_data = dataset(os.path.join(path_to_file, 'misspellings.csv'))
misspelled_data['Borough'].value_counts()
MANHATTAN 3442
BROOKLYN 2999
QUEENS 2090
BRONX 932
STATEN ISLAND 529
2
BRONKS 1
BROOKLIN 1
QUEINS 1
QEENS 1
MENHATTAN 1
BOOKLYN 1
Name: Borough, dtype: int64
We see there are empty values in this column. First, let’s update them using a lambda function that uses the is_empty value function.
from openclean.operator.transform.update import update
from openclean.function.value.null import is_empty
updated_misspelled = update(misspelled_data, 'Borough', lambda x: 'Unknown' if is_empty(x) else x)
updated_misspelled['Borough'].value_counts()
MANHATTAN 3442
BROOKLYN 2999
QUEENS 2090
BRONX 932
STATEN ISLAND 529
Unknown 2
BRONKS 1
BROOKLIN 1
QUEINS 1
QEENS 1
MENHATTAN 1
BOOKLYN 1
Name: Borough, dtype: int64
We’ve replaced missing values with Unknown. But, because there is no way to be sure of the correct value without using other columns and data augmentation techniques from the Data Enrichment section to get the correct Borough, we shall filter out the nulls in this example using the IsNotEmpty eval function.
from openclean.operator.transform.filter import filter
from openclean.function.eval.null import IsNotEmpty
misspelled_data = filter(misspelled_data, predicate=IsNotEmpty('Borough'))
misspelled_data['Borough'].value_counts()
MANHATTAN 3442
BROOKLYN 2999
QUEENS 2090
BRONX 932
STATEN ISLAND 529
BRONKS 1
BROOKLIN 1
QUEINS 1
QEENS 1
MENHATTAN 1
BOOKLYN 1
Name: Borough, dtype: int64
The whole range of eval functions and value functions can be accessed in the Eval package of the API Reference and Value package of the API Reference respectively as explained in Data Model.
Misspellings and Data Entry Bugs
openclean can help identify misspellings and data entry bugs using its powerful string matcher class. It helps detect anomalous values using:
- Phonetic Matching
Phonetic algorithms transform the input strings to normalized phonetic encodings before comparing them. openclean has the following phonetic string algorithms:
NYSIIS
Soundex
Metaphone
- Fuzzy Matching
Implementation of fuzzy string matching using n-gram overlaps and Levenshtein or cosine distance.
StringMatcher objects ingest a vocabulary and a matching algorithm that is used to identify dataset values that are misspelled. These can optionally be stored into an openclean mapping to be reused later with other datasets as translation tables.
from openclean.function.matching.base import DefaultStringMatcher
from openclean.function.matching.fuzzy import FuzzySimilarity
from openclean.data.mapping import Mapping
VOCABULARY = ['BROOKLYN' ,'MANHATTAN','STATEN ISLAND','BRONX', 'QUEENS']
matcher = DefaultStringMatcher(
vocabulary=VOCABULARY,
similarity=FuzzySimilarity()
)
map = Mapping()
for query in set(misspelled_data['Borough']):
map.add(query, matcher.find_matches(query))
print(map)
Mapping(<class 'list'>, {'QUEINS': [StringMatch(term='QUEENS', score=0.8333333333333334)], 'QUEENS': [StringMatch(term='QUEENS', score=1)], 'STATEN ISLAND': [StringMatch(term='STATEN ISLAND', score=1)], 'MANHATTAN': [StringMatch(term='MANHATTAN', score=1)], 'MENHATTAN': [StringMatch(term='MANHATTAN', score=0.8888888888888888)], 'BROOKLIN': [StringMatch(term='BROOKLYN', score=0.875)], 'BRONX': [StringMatch(term='BRONX', score=1)], 'BRONKS': [StringMatch(term='BRONX', score=0.6666666666666667)], 'BROOKLYN': [StringMatch(term='BROOKLYN', score=1)], 'QEENS': [StringMatch(term='QUEENS', score=0.8333333333333334)], 'BOOKLYN': [StringMatch(term='BROOKLYN', score=0.875)]})
The map shows all misspellings matched at least one value from the vocabulary so the map can be used to fix the Borough column. The user will have to manually intervene and update the map if for a query value there were zero or more than one matches from the vocabulary.
Fixing is easy, we can use the update operation with the Lookup eval function (to provide default values if key not found in the map).
from openclean.function.eval.domain import Lookup
from openclean.operator.transform.update import update
from openclean.function.eval.base import Col
fixed = update(misspelled_data, 'Borough', Lookup(columns=['Borough'], mapping=map.to_lookup(), default=Col('Borough')))
print(fixed['Borough'].unique())
['MANHATTAN' 'BROOKLYN' 'QUEENS' 'BRONX' 'STATEN ISLAND']
- KNN Clustering
openclean lets users use KNN clustering to identify values in a dataset that are potential variations of one another and suggests a replacement for each. For e.g. those with missing or extra punctuation. A well-put example demonstrating this can be found here.
Data Standardization
Many a time, users will be faced with situations where the dataset contains variations of spellings (with extra/missing punctuations) for example, a Business or a street name. openclean provides multiple ways to enforce consistency across datasets:
- Token Signature Outliers
When values in the dataset are expected to have a signature token present, this functionality helps identify anomalies. For e.g. an address column often requires values to contain street suffixes. The Token Signature class will ensure any values that don’t have one are highlighted. Here is a notebook demonstrating this.
- KNN Clustering
As mentioned in the previous section, KNN Clustering can be used to identify sets of similar values. openclean takes it up a notch by recommending a suggested value for each identified cluster.
These two tools become even more powerful when synergized with each other as demonstrated in this notebook where we use first replace different token signature representations with a standard one and then use KNN Clustering to fix the possible value variations in a street name column.
Statistical Outliers
openclean provides many statistical anomaly detection operators that are implemented by the scikit-learn machine learning library. To name them, we have:
Here we use a simple ensemble approach that applies all these operators to the dataset’s GIS_NTA_NAME column.
from collections import Counter
ensemble = Counter()
from openclean.embedding.feature.default import UniqueSetEmbedding
from openclean.profiling.anomalies.sklearn import (
dbscan,
isolation_forest,
local_outlier_factor,
one_class_svm,
robust_covariance
)
for f in [dbscan, isolation_forest, local_outlier_factor, one_class_svm, robust_covariance]:
ensemble.update(f(ds, 'GIS_NTA_NAME', features=UniqueSetEmbedding()))
We then count for each value, the number of operators that classified the value as an outlier.
# Output values that have been classified as outliers by at least three out of the
# five operators.
prev = 0
for value, count in ensemble.most_common():
if count < 3:
break
if count < prev:
print()
if count != prev:
print('{}\t{}'.format(count, value))
else:
print('\t{}'.format(value))
prev = count
5
4 Mariner's Harbor-Arlington-Port Ivory-Graniteville
St. Albans
Breezy Point-Belle Harbor-Rockaway Park-Broad Channel
Co-op City
Todt Hill-Emerson Hill-Heartland Village-Lighthouse Hill
3 park-cemetery-etc-Brooklyn
Ft. Totten-Bay Terrace-Clearview
Annadale-Huguenot-Prince's Bay-Eltingville
park-cemetery-etc-Manhattan
Grasmere-Arrochar-Ft. Wadsworth
park-cemetery-etc-Queens
Sheepshead Bay-Gerritsen Beach-Manhattan Beach
Statistically classified as anomalies, these neighborhoods can be those with fewer job requests or misspellings. Something a user with domain knowledge can verify.
Custom functions
A user can create their own data cleaning operators, apply them and reuse them as per their requirements. With openclean-notebook, these eval functions or callables can further be registered on a UI and applied to datasets visually. The following screengrab shows how custom functions together with openclean-notebook enhance a user’s data munging experience: