Statistical Outliers in City Names

This notebook demonstrates the anomaly detection operators in openclean that are implemented using the scikit-learn machine learning library. openclean includes five such operators. Here we use a simple ensemble approach: we apply all five operators to a sample of the DOB Job Application Filing dataset and count, for each value, the number of operators that classified it as an outlier.

[1]:

# Use the 'DOB Job Application Filings - Download' notebook to download the
# 'DOB Job Application Filings' dataset for this example.

datafile = './ic3t-wcy2.tsv.gz'

# As an alternative, you can also use the smaller dataset sample that is
# included in the repository.
#
# datafile = './data/ic3t-wcy2.tsv.gz'

[2]:

# Use a random sample of 10,000 records for this example.

from openclean.pipeline import stream

df = (
    stream(datafile)
    .select('City ')
    .update('City ', str.upper)
    .sample(10000, random_state=42)
    .to_df()
)

[3]:

# Print the distinct city names in the sample together with their frequencies
# (pandas truncates the output to the most and least frequent values).

df['City '].value_counts()

[3]:

NEW YORK           3680
BROOKLYN           1594
QUEENS              538
BRONX               470
NY                  462
                   ...
BROOKLLYN             1
CAMBRIA HEIGHTS       1
STUART                1
BRONXVILLE            1
BROOKLYM              1
Name: City , Length: 513, dtype: int64

[4]:

# Use a counter to keep track of how many anomaly detection operators
# classified each value as an outlier.

from collections import Counter

ensemble = Counter()
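
When `Counter.update` is given an iterable, it adds one to the count of every element in it. This is what lets a single counter serve as a vote tally. The sketch below uses two hand-written result lists; the values are hypothetical stand-ins, not actual operator output.

```python
from collections import Counter

votes = Counter()

# Each list stands in for the outliers reported by one detector
# (hypothetical values for illustration only).
votes.update(['N.Y.', 'L.I.C.'])
votes.update(['N.Y.', 'MIAMI'])

print(votes['N.Y.'])    # flagged in both lists
print(votes['MIAMI'])   # flagged in one list
print(votes['BRONX'])   # never flagged; a Counter returns 0 for missing keys
```

Note that this differs from `dict.update`, which would overwrite existing entries instead of adding to them.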

[5]:

# Apply five different anomaly detection operators to the values in the city column.
# Here we use a default value embedding that ignores the frequency of each value (since
# in this NYC Open Dataset city names like NEW YORK and any of the five boroughs are
# more frequent than other names).

from openclean.embedding.feature.default import UniqueSetEmbedding
from openclean.profiling.anomalies.sklearn import (
    dbscan,
    isolation_forest,
    local_outlier_factor,
    one_class_svm,
    robust_covariance
)

for f in [dbscan, isolation_forest, local_outlier_factor, one_class_svm, robust_covariance]:
    ensemble.update(f(df, 'City ', features=UniqueSetEmbedding()))
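
To illustrate what "ignoring the frequency of each value" means, here is a minimal sketch of a frequency-agnostic embedding: every distinct value is mapped to exactly one feature vector, no matter how often it occurs in the column. The `toy_features` function and the features it computes are assumptions made for this illustration; the notebook does not show which features `UniqueSetEmbedding` actually uses.

```python
def toy_features(value):
    """Hypothetical per-value features: length and character-class fractions."""
    n = len(value) or 1  # avoid division by zero for empty strings
    return [
        len(value),
        sum(c.isalpha() for c in value) / n,
        sum(c.isdigit() for c in value) / n,
        sum(c.isspace() for c in value) / n,
        sum(not c.isalnum() and not c.isspace() for c in value) / n,
    ]

# Hypothetical column in which frequent and rare values are mixed.
column = ['NEW YORK'] * 5 + ['BROOKLYN'] * 3 + ['N.Y.']

# Only the set of distinct values is embedded, so 'NEW YORK' yields a
# single feature vector even though it occurs five times.
distinct = sorted(set(column))
vectors = {value: toy_features(value) for value in distinct}

for value, vector in vectors.items():
    print(value, vector)
```

With an embedding of this kind, a detector sees each spelling once, so a rare but well-formed name is not penalized merely for being rare.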

[6]:

# Output values that have been classified as outliers by at least three out of the
# five operators.

prev = 0
for value, count in ensemble.most_common():
    if count < 3:
        break
    if count < prev:
        print()
    if count != prev:
        print('{}\t{}'.format(count, value))
    else:
        print('\t{}'.format(value))
    prev = count

4       L.I.C.
        SI,NY
        N Y
        N.Y.
        NEW  YORK
        S.I.,NY
        L.I.CITY
        _BK
        S.I.
        SUITE 2107 NY
        L.I.C
        MIAMI
        LIC.
        BKLYN.
        B'KLYN
        QUEEN S

3       LONG ISLN. CITY
        NEW  YOURK
        S.OZONE PARK
        RICHMOND-HILL
        NEW YORK\
        S. RICHMOND HIL
        HOLLIS HILLS
        NEW CANAAN
        LONG ISL.CITY
        NEW YORK,
        ROCKVILLE_CENTR
        MINEOLA,
        N.MIAMI BEACH
        QUEENS _VILLAGE
        FLUS. MEADOWS
        SO. PLAINFIELD
        MC LEAN
        S. OZONE PARK
        LONG ISL. CITY
        S. PLAINFIELD
        FLUSHING MEADOW
        JACKSON HTS.
        ST. PETERSBURG
        BROOKLYN,
        NEW YORK  CITY
        NEW YORK, NY
        PHILADELPHIA
        MT.VERNON
        SO. OZONE PARK
        MT. KISCO
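
The grouped listing above can also be produced by first collecting the values per vote count and printing them afterwards. The following is a minimal sketch of that alternative; `group_by_votes` is a hypothetical helper and not part of openclean, and the sample counter contains only a few values from the output above plus one made-up low-count entry.

```python
from collections import Counter, defaultdict

def group_by_votes(ensemble, min_votes):
    """Group values by vote count, keeping only counts >= min_votes."""
    groups = defaultdict(list)
    for value, count in ensemble.items():
        if count >= min_votes:
            groups[count].append(value)
    # Order the groups from the highest to the lowest vote count.
    return dict(sorted(groups.items(), reverse=True))

# Small sample counter; the 'BRONX' entry is made up to show the filtering.
ensemble = Counter({
    'L.I.C.': 4,
    'N.Y.': 4,
    'MT. KISCO': 3,
    'NEW CANAAN': 3,
    'BRONX': 1,
})

groups = group_by_votes(ensemble, min_votes=3)
for count, values in groups.items():
    print(count, values)
```

Separating the grouping from the printing makes it easy to reuse the result, for example to review only the values flagged by four operators.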