Standardization of Street Names
Find groups of different street names that might be alternative representations of the same street. This is an example of the key collision clustering supported by openclean. It uses the NYC Parking Violations Issued - Fiscal Year 2014 dataset.
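To make the idea concrete before running openclean, here is a minimal sketch of key collision clustering in plain Python. Each value is mapped to a normalized key, and values that share a key fall into the same cluster. The key function below (lowercase, strip punctuation, sort the unique tokens) is an assumption modeled on the common fingerprint approach; it is not taken from openclean, whose default key generator may differ in detail. The names `fingerprint` and `key_collision_sketch` are hypothetical helpers for this illustration only.

```python
import string
from collections import Counter, defaultdict


def fingerprint(value: str) -> str:
    """Hypothetical key function: lowercase, drop punctuation,
    then join the sorted set of unique whitespace-separated tokens."""
    cleaned = value.lower().translate(str.maketrans('', '', string.punctuation))
    return ' '.join(sorted(set(cleaned.split())))


def key_collision_sketch(values: Counter, minsize: int = 2) -> list:
    """Group values by their key and keep groups that contain at
    least `minsize` distinct values. Each group is a Counter that
    maps a value to its frequency."""
    groups = defaultdict(Counter)
    for value, count in values.items():
        groups[fingerprint(value)][value] += count
    return [group for group in groups.values() if len(group) >= minsize]


# Small hand-made sample (frequencies copied from the output of this notebook).
sample = Counter({
    '2ND AVE': 4075,
    '2nd Ave': 67751,
    '2ND AVE.': 1,
    'AVE 2ND': 1,
    'Lawrence St': 2368,
})
sample_clusters = key_collision_sketch(sample)
for cluster in sample_clusters:
    print(dict(cluster))
```

Because tokens are sorted and deduplicated, values such as 'AVE 2ND' and '2ND AVE' collide on the same key. This also means that genuinely different streets (for example a 'JOHN ST' and a 'ST JOHN') can end up in one cluster, so the clusters are candidates for review rather than confirmed duplicates.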
[1]:
# Download the full 'NYC Parking Violations Issued - Fiscal Year 2014' dataset.
# Note that this is a file of ~ GB!
import gzip
import os
from openclean.data.source.socrata import Socrata
datafile = './jt7v-77mi.tsv.gz'
# Download file only if it does not exist already.
if not os.path.isfile(datafile):
    with gzip.open(datafile, 'wb') as f:
        ds = Socrata().dataset('jt7v-77mi')
        print('Downloading ...\n')
        print(ds.name + '\n')
        print(ds.description)
        ds.write(f)
# As an alternative, you can also use the smaller dataset sample that is
# included in the repository.
#
# datafile = './data/jt7v-77mi.tsv.gz'
[2]:
# Use streaming function to avoid having to load the full dataset
# into memory.
from openclean.pipeline import stream
df = stream(datafile)
[3]:
# Get the distinct set of street names. By computing the distinct set
# first, we avoid computing the key for the same street name multiple
# times.
streets = df.select('Street').distinct()
print('{} distinct streets (for {} total values)'.format(len(streets), sum(streets.values())))
115567 distinct streets (for 9100278 total values)
[4]:
# Cluster street names using key collision (with the default key generator).
# Remove clusters that contain fewer than seven distinct values (for display
# purposes). Use multiple threads (4) to generate value keys in parallel.
from openclean.cluster.key import key_collision
# Minimum cluster size. Use seven as the default for the full dataset (to
# limit the number of clusters that are printed in the next cell).
minsize = 7
# Use minimum cluster size of 2 when using the dataset sample
# minsize = 2
clusters = key_collision(values=streets, minsize=minsize, threads=4)
print('{} clusters of size {} or greater'.format(len(clusters), minsize))
13 clusters of size 7 or greater
[5]:
# For each cluster print cluster values, their frequency counts,
# and the suggested common value for the cluster.
def print_cluster(cnumber, cluster):
    print('Cluster {} (of size {})\n'.format(cnumber, len(cluster)))
    for val, count in cluster.items():
        print('{} ({})'.format(val, count))
    print('\nSuggested value: {}\n\n'.format(cluster.suggestion()))
# Sort clusters by decreasing number of distinct values.
clusters.sort(key=lambda c: len(c), reverse=True)
for i, cluster in enumerate(clusters, start=1):
    print_cluster(i, cluster)
Cluster 1 (of size 8)
2ND AVE (4075)
2nd Ave (67751)
2ND AVE (5)
2ND AVE. (1)
AVE 2ND (1)
2ND AVE (1)
2ND AVE (2)
2ND AVE (1)
Suggested value: 2nd Ave
Cluster 2 (of size 8)
ST NICHOLAS AVE (2451)
ST. NICHOLAS AVE (125)
St Nicholas Ave (23462)
ST, NICHOLAS AVE (1)
ST NICHOLAS AVE (9)
ST NICHOLAS AVE (1)
ST NICHOLAS AVE (4)
ST. NICHOLAS AVE (1)
Suggested value: St Nicholas Ave
Cluster 3 (of size 8)
LAWRENCE ST (165)
ST LAWRENCE (34)
LAWRENCE ST (1)
Lawrence St (2368)
ST. LAWRENCE (2)
ST LAWRENCE ST (1)
LAWRENCE ST. (1)
ST. LAWRENCE ST (1)
Suggested value: Lawrence St
Cluster 4 (of size 8)
ST NICHOLAS (847)
ST NICHOLAS ST (31)
NICHOLAS ST (27)
ST. NICHOLAS (27)
ST NICHOLAS (2)
ST NICHOLAS ST (1)
Nicholas St (79)
ST. NICHOLAS ST (1)
Suggested value: ST NICHOLAS
Cluster 5 (of size 7)
W 125 ST (3365)
W 125 ST (1)
W. 125 ST. (1)
W .125 ST (5)
W 125 ST (2)
W 125 ST (1)
W. 125 ST (3)
Suggested value: W 125 ST
Cluster 6 (of size 7)
FERRY LOT 2 (743)
FERRY LOT #2 (140)
FERRY LOT #2 (1)
FERRY LOT 2 (3)
FERRY LOT # 2 (121)
FERRY LOT # 2 (2)
FERRY LOT #2 (1)
Suggested value: FERRY LOT 2
Cluster 7 (of size 7)
3RD AVE (11554)
3rd Ave (148186)
3RD AVE (8)
3RD AVE. (1)
3RD AVE (1)
3RD AVE (2)
3RD AVE (1)
Suggested value: 3rd Ave
Cluster 8 (of size 7)
CONEY ISLAND AVE (3618)
CONEY ISLAND AVE (9)
CONEY ISLAND AVE (9)
Coney Island Ave (35776)
CONEY ISLAND AVE (1)
CONEY ISLAND AVE . (1)
CONEY ISLAND AVE. (1)
Suggested value: Coney Island Ave
Cluster 9 (of size 7)
W TREMONT AVE (110)
W. TREMONT AVE (17)
W W TREMONT AVE (1)
W Tremont Ave (848)
W TREMONT AVE (1)
W. TREMONT AVE (1)
W .TREMONT AVE (1)
Suggested value: W Tremont Ave
Cluster 10 (of size 7)
LGA TERMINAL B (26)
LGA, TERMINAL B (1)
LGA/ TERMINAL B (1)
TERMINAL B LGA (20)
TERMINAL B - LGA (2)
TERMINAL B -LGA (1)
LGA TERMINAL B, (1)
Suggested value: LGA TERMINAL B
Cluster 11 (of size 7)
EL GRANT HWY (67)
E.L GRANT HWY (10)
E.L. GRANT HWY (19)
EL GRANT HWY (1)
EL. GRANT HWY (2)
E/L/ GRANT HWY (1)
E-L GRANT HWY (1)
Suggested value: EL GRANT HWY
Cluster 12 (of size 7)
JOHN ST (186)
ST JOHN (10)
John St (4192)
ST JOHN ST (8)
ST. JOHN ST (1)
ST. JOHN (1)
JOHN ST. (1)
Suggested value: John St
Cluster 13 (of size 7)
ST JOHNS PL (1478)
ST. JOHNS PL (77)
St Johns Pl (4816)
ST JOHNS PL. (1)
ST. JOHNS PL. (1)
ST JOHNS PL (1)
ST JOHNS PL (2)
Suggested value: St Johns Pl
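Once the clusters have been reviewed, they can be turned into a replacement mapping. The following is a minimal sketch in plain Python, assuming each cluster behaves like a `collections.Counter` of value frequencies and that the most frequent value is an acceptable target (which matches the suggested values printed above, but is an assumption about how the suggestion is chosen). `build_mapping` is a hypothetical helper and not part of openclean.

```python
from collections import Counter


def build_mapping(clusters):
    """Map every non-target value in a cluster to the most
    frequent value of that cluster (assumed standard form)."""
    mapping = {}
    for cluster in clusters:
        target, _ = cluster.most_common(1)[0]
        for value in cluster:
            if value != target:
                mapping[value] = target
    return mapping


# Hand-made cluster using frequencies from Cluster 12 above.
example_clusters = [
    Counter({'JOHN ST': 186, 'ST JOHN': 10, 'John St': 4192, 'JOHN ST.': 1}),
]
mapping = build_mapping(example_clusters)

# Values without a mapping entry are left unchanged.
standardized = [mapping.get(v, v) for v in ['ST JOHN', 'John St', 'BROADWAY']]
print(standardized)
```

Note that this example would rewrite 'ST JOHN' to 'John St'. Whether those refer to the same street is exactly the kind of question to check manually before applying a mapping to the full dataset.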