Statistical Outliers in City Names
This notebook demonstrates the anomaly detection operators in openclean that are implemented using the scikit-learn machine learning library. openclean includes five such operators. Here we use a simple ensemble approach: all five operators are applied to a sample of the DOB Job Application Filings dataset, and for each value we count the number of operators that classified it as an outlier.
[1]:
# Use the 'DOB Job Application Filings - Download' notebook to download the
# 'DOB Job Application Filings' dataset for this example.
datafile = './ic3t-wcy2.tsv.gz'
# As an alternative, you can also use the smaller dataset sample that is
# included in the repository.
#
# datafile = './data/ic3t-wcy2.tsv.gz'
[2]:
# Use a random sample of 10,000 records for this example.
from openclean.pipeline import stream
df = stream(datafile).select('City ').update('City ', str.upper).sample(10000, random_state=42).to_df()
[3]:
# Print (a subset of) the distinct city names in the sample.
df['City '].value_counts()
[3]:
NEW YORK 3680
BROOKLYN 1594
QUEENS 538
BRONX 470
NY 462
...
BROOKLLYN 1
CAMBRIA HEIGHTS 1
STUART 1
BRONXVILLE 1
BROOKLYM 1
Name: City , Length: 513, dtype: int64
[4]:
# Use a counter to maintain count of how many anomaly detection operators
# classified each value as an outlier.
from collections import Counter
ensemble = Counter()
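The counter acts as a simple voting mechanism: when a detector returns the values it flags, `Counter.update` adds one vote for each of them. The following is a minimal sketch of that idea, assuming each detector returns a list of flagged values. The two detectors `flag_punctuation` and `flag_short` are hypothetical stand-ins written for this illustration; they are not openclean operators.

```python
from collections import Counter


def flag_punctuation(values):
    """Hypothetical detector: flag values that contain characters
    other than letters and whitespace."""
    return [v for v in values if not all(c.isalpha() or c.isspace() for c in v)]


def flag_short(values):
    """Hypothetical detector: flag values with fewer than five characters."""
    return [v for v in values if len(v) < 5]


values = ['NEW YORK', 'BROOKLYN', 'N.Y.', 'NY', 'L.I.C.']
votes = Counter()
for detector in (flag_punctuation, flag_short):
    # Each flagged value receives one vote per detector that flags it.
    votes.update(detector(values))

print(votes.most_common())
```

Values that no detector flags never enter the counter, so looking them up yields a count of zero.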
[5]:
# Apply five different anomaly detection operators to the values in the city column.
# Here we use a default value embedding that ignores the frequency of each value (since
# in this NYC Open Data dataset, city names like NEW YORK and the five boroughs are
# more frequent than other names).
from openclean.embedding.feature.default import UniqueSetEmbedding
from openclean.profiling.anomalies.sklearn import (
    dbscan,
    isolation_forest,
    local_outlier_factor,
    one_class_svm,
    robust_covariance
)
for f in [dbscan, isolation_forest, local_outlier_factor, one_class_svm, robust_covariance]:
    ensemble.update(f(df, 'City ', features=UniqueSetEmbedding()))
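To picture what it means for an embedding to ignore value frequency, the sketch below computes one feature vector per distinct value, so a name that occurs thousands of times carries the same weight as one that occurs once. The function `char_features` and its choice of features (length and character-class fractions) are assumptions made for this illustration; this is not the implementation of openclean's `UniqueSetEmbedding`.

```python
import string


def char_features(value):
    """Map a string to a small numeric feature vector (illustrative only):
    length, then fractions of letters, digits, whitespace, and punctuation."""
    n = len(value)
    if n == 0:
        return [0, 0.0, 0.0, 0.0, 0.0]
    letters = sum(c.isalpha() for c in value)
    digits = sum(c.isdigit() for c in value)
    spaces = sum(c.isspace() for c in value)
    punct = sum(c in string.punctuation for c in value)
    return [n, letters / n, digits / n, spaces / n, punct / n]


cities = ['NEW YORK', 'NEW YORK', 'BROOKLYN', 'N.Y.', 'NEW YORK']
# One feature vector per distinct value: 'NEW YORK' appears once in the
# result even though it occurs three times in the input.
features = {value: char_features(value) for value in sorted(set(cities))}

for value, vector in features.items():
    print(value, vector)
```

An outlier detector that is fitted on such per-value vectors sees each spelling exactly once, which keeps very frequent names from dominating the model.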
[6]:
# Output values that have been classified as outliers by at least three out of the
# five operators.
prev = 0
for value, count in ensemble.most_common():
    if count < 3:
        break
    if count < prev:
        print()
    if count != prev:
        print('{}\t{}'.format(count, value))
    else:
        print('\t{}'.format(value))
    prev = count
4	L.I.C.
	SI,NY
	N Y
	N.Y.
	NEW YORK
	S.I.,NY
	L.I.CITY
	_BK
	S.I.
	SUITE 2107 NY
	L.I.C
	MIAMI
	LIC.
	BKLYN.
	B'KLYN
	QUEEN S

3	LONG ISLN. CITY
	NEW YOURK
	S.OZONE PARK
	RICHMOND-HILL
	NEW YORK\
	S. RICHMOND HIL
	HOLLIS HILLS
	NEW CANAAN
	LONG ISL.CITY
	NEW YORK,
	ROCKVILLE_CENTR
	MINEOLA,
	N.MIAMI BEACH
	QUEENS _VILLAGE
	FLUS. MEADOWS
	SO. PLAINFIELD
	MC LEAN
	S. OZONE PARK
	LONG ISL. CITY
	S. PLAINFIELD
	FLUSHING MEADOW
	JACKSON HTS.
	ST. PETERSBURG
	BROOKLYN,
	NEW YORK CITY
	NEW YORK, NY
	PHILADELPHIA
	MT.VERNON
	SO. OZONE PARK
	MT. KISCO
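If the grouped values are needed for further processing rather than for printing, the same threshold logic can return a dictionary instead. The helper `group_by_votes` below is a hypothetical name introduced for this sketch, and the input counter is a small hand-written example, not the full result of the ensemble.

```python
from collections import Counter, defaultdict


def group_by_votes(votes, min_votes):
    """Group values by vote count, keeping only counts >= min_votes."""
    groups = defaultdict(list)
    for value, count in votes.most_common():
        if count < min_votes:
            # most_common() yields items in descending order of count,
            # so all remaining items are below the threshold as well.
            break
        groups[count].append(value)
    return dict(groups)


# Small hand-written example input for illustration.
votes = Counter({'L.I.C.': 4, 'N.Y.': 4, 'NEW YOURK': 3, 'BROOKLYN': 1})
groups = group_by_votes(votes, min_votes=3)
print(groups)
```

Raising `min_votes` makes the ensemble stricter: fewer values are reported, but each one was flagged by more operators.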