Profiling - DOHMH New York City Restaurant Inspection Results
Data Profiling
In this step we start be getting an overview of the properties for the Restaurant Inspection dataset. We then focus on a few quality issues related to the data that we want to extract to plot the score distributions.
[1]:
# For this demo we make use of openclean's streaming option for large datasets (to avoid
# having to load the full dataset into memory).
import os
from openclean.pipeline import stream
ds = stream(os.path.join('data', '43nn-pn8j.tsv.gz'))
Basic Dataset Properties and Statistics
[2]:
# Get the list of column names in the dataset.
ds.columns
[2]:
['CAMIS',
'DBA',
'BORO',
'BUILDING',
'STREET',
'ZIPCODE',
'PHONE',
'CUISINE DESCRIPTION',
'INSPECTION DATE',
'ACTION',
'VIOLATION CODE',
'VIOLATION DESCRIPTION',
'CRITICAL FLAG',
'SCORE',
'GRADE',
'GRADE DATE',
'RECORD DATE',
'INSPECTION TYPE',
'Latitude',
'Longitude',
'Community Board',
'Council District',
'Census Tract',
'BIN',
'BBL',
'NTA']
[3]:
# Take a look ath the first 10 rows in the dataset.
ds.head()
[3]:
CAMIS | DBA | BORO | BUILDING | STREET | ZIPCODE | PHONE | CUISINE DESCRIPTION | INSPECTION DATE | ACTION | ... | RECORD DATE | INSPECTION TYPE | Latitude | Longitude | Community Board | Council District | Census Tract | BIN | BBL | NTA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 50001317 | D'LILI BAKERY | Manhattan | 526 | W 207TH ST | 10034 | 2123040356 | Bakery | 05/04/2018 | Violations were cited in the following area(s). | ... | 09/28/2019 | Cycle Inspection / Initial Inspection | 40.865269576991 | -73.919671779695 | 112 | 10 | 029300 | 1064780 | 1022220025 | MN01 |
1 | 41716448 | HARLEM SHAKE | Manhattan | 100 | WEST 124 STREET | 10027 | 2122228300 | American | 07/30/2019 | Violations were cited in the following area(s). | ... | 09/28/2019 | Cycle Inspection / Re-inspection | 40.807286117139 | -73.946454954277 | 110 | 09 | 022200 | 1057790 | 1019080035 | MN11 |
2 | 50044535 | HAWKERS | Manhattan | 225 | E 14TH ST | 10003 | 2129821688 | Asian | 06/14/2017 | Violations were cited in the following area(s). | ... | 09/28/2019 | Cycle Inspection / Initial Inspection | 40.732989026505 | -73.986454624452 | 106 | 02 | 004800 | 1019505 | 1008960012 | MN21 |
3 | 50041601 | 118 KITCHEN | Manhattan | 1 | E 118TH ST | 10035 | 2127226888 | Tex-Mex | 05/09/2019 | Violations were cited in the following area(s). | ... | 09/28/2019 | Cycle Inspection / Initial Inspection | 40.801870167292 | -73.945260143617 | 111 | 09 | 018400 | 1053946 | 1017450001 | MN34 |
4 | 50045596 | NEW CHINA | Bronx | 5690 | MOSHOLU AVE | 10471 | 7188841111 | Chinese | 09/07/2016 | Violations were cited in the following area(s). | ... | 09/28/2019 | Cycle Inspection / Re-inspection | 40.905434879643 | -73.900843816155 | 208 | 11 | 033700 | 2084805 | 2058481761 | BX22 |
5 | 41156678 | SUNSWICK | Queens | 3502 | 35 STREET | 11106 | 7187520620 | American | 05/02/2017 | Violations were cited in the following area(s). | ... | 09/28/2019 | Cycle Inspection / Re-inspection | 40.756372985356 | -73.925517485863 | 401 | 26 | 005700 | 4009585 | 4006380025 | QN70 |
6 | 41434036 | CANCUN MEXICAN RESTAURANT | Manhattan | 937 | 8 AVENUE | 10019 | 2123077307 | Mexican | 04/14/2016 | Violations were cited in the following area(s). | ... | 09/28/2019 | Cycle Inspection / Re-inspection | 40.765676019444 | -73.983642958358 | 104 | 03 | 013900 | 1025437 | 1010460032 | MN15 |
7 | 50058219 | HAN JOO BBQ CYCJ | Queens | 4106 | 149TH PL | 11355 | 7183596888 | Korean | 03/21/2017 | Violations were cited in the following area(s). | ... | 09/28/2019 | Pre-permit (Operational) / Initial Inspection | 40.762113093414 | -73.814688971988 | 407 | 20 | 116700 | 4447126 | 4050540032 | QN51 |
8 | 41018004 | ANGELICA PIZZERIA | Brooklyn | 30 | NEVINS STREET | 11217 | 7188522728 | Pizza/Italian | 04/18/2019 | Violations were cited in the following area(s). | ... | 09/28/2019 | Cycle Inspection / Initial Inspection | 40.687850595312 | -73.981473402889 | 302 | 33 | 003700 | 3000515 | 3001660037 | BK38 |
9 | 50064285 | E.A.K. RAMEN | Manhattan | 469 | 6TH AVE | 10011 | 6468632027 | Japanese | 06/07/2017 | Establishment Closed by DOHMH. Violations wer... | ... | 09/28/2019 | Pre-permit (Operational) / Initial Inspection | 40.73558361361 | -73.998141677927 | 102 | 03 | 007100 | 1010580 | 1006070045 | MN23 |
10 rows × 26 columns
[4]:
# Count the number of rows in the dataset.
ds.count()
[4]:
392131
[5]:
# Get basic profile information for all columns in the dataset.
profiles = ds.profile()
profiles
[5]:
[{'column': 'CAMIS',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 0,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'int': 392131})}),
'minmaxValues': {'int': {'minimum': 30075445, 'maximum': 50099136}}}},
{'column': 'DBA',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 588,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'str': 391478, 'int': 55, 'date': 10})}),
'minmaxValues': {'str': {'minimum': '#1 Chinese Restaurant',
'maximum': 'zx accounting service llc'},
'int': {'minimum': 33, 'maximum': 1976},
'date': {'minimum': datetime.datetime(1983, 1, 2, 0, 0),
'maximum': datetime.datetime(1983, 1, 2, 0, 0)}}}},
{'column': 'BORO',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 0,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'str': 392015, 'int': 116})}),
'minmaxValues': {'str': {'minimum': 'Bronx', 'maximum': 'Staten Island'},
'int': {'minimum': 0, 'maximum': 0}}}},
{'column': 'BUILDING',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 231,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'int': 381625, 'str': 10095, 'date': 180})}),
'minmaxValues': {'int': {'minimum': 0, 'maximum': 65626564},
'str': {'minimum': '.1-A', 'maximum': 'W105'},
'date': {'minimum': datetime.datetime(120, 1, 2, 0, 0),
'maximum': datetime.datetime(3906, 1, 2, 0, 0)}}}},
{'column': 'STREET',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 0,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'str': 392131})}),
'minmaxValues': {'str': {'minimum': '& GRAND CENTRAL',
'maximum': 'west 34th st'}}}},
{'column': 'ZIPCODE',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 5557,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'int': 386558, 'str': 16})}),
'minmaxValues': {'int': {'minimum': 10000, 'maximum': 12345},
'str': {'minimum': 'N/A', 'maximum': 'N/A'}}}},
{'column': 'PHONE',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 11,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'int': 391725, 'str': 395})}),
'minmaxValues': {'int': {'minimum': 0, 'maximum': 90172017170},
'str': {'minimum': '212690369_', 'maximum': '__________'}}}},
{'column': 'CUISINE DESCRIPTION',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 0,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'str': 392131})}),
'minmaxValues': {'str': {'minimum': 'Afghan',
'maximum': 'Vietnamese/Cambodian/Malaysia'}}}},
{'column': 'INSPECTION DATE',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 0,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'date': 392131})}),
'minmaxValues': {'date': {'minimum': datetime.datetime(1900, 1, 1, 0, 0),
'maximum': datetime.datetime(2019, 9, 27, 0, 0)}}}},
{'column': 'ACTION',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 1337,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'str': 390794})}),
'minmaxValues': {'str': {'minimum': 'Establishment Closed by DOHMH. Violations were cited in the following area(s) and those requiring immediate action were addressed.',
'maximum': 'Violations were cited in the following area(s).'}}}},
{'column': 'VIOLATION CODE',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 5696,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'str': 386272, 'float': 163})}),
'minmaxValues': {'str': {'minimum': '02A', 'maximum': '22G'},
'float': {'minimum': 1500.0, 'maximum': 15000.0}}}},
{'column': 'VIOLATION DESCRIPTION',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 8918,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'str': 383213})}),
'minmaxValues': {'str': {'minimum': '""No Smoking” and/or \'Smoking Permitted” sign not conspicuously posted. Health warning not present on \'Smoking Permitted”',
'maximum': '“Choking first aid” poster not posted. “Alcohol and pregnancy” warning sign not posted. Resuscitation equipment: exhaled air resuscitation masks (adult & pediatric), latex gloves, sign not posted. Inspection report sign not posted.'}}}},
{'column': 'CRITICAL FLAG',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 8918,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'str': 383213})}),
'minmaxValues': {'str': {'minimum': 'N', 'maximum': 'Y'}}}},
{'column': 'SCORE',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 17046,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'int': 375085})}),
'minmaxValues': {'int': {'minimum': -1, 'maximum': 164}}}},
{'column': 'GRADE',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 193610,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'str': 198521})}),
'minmaxValues': {'str': {'minimum': 'A', 'maximum': 'Z'}}}},
{'column': 'GRADE DATE',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 195416,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'date': 196715})}),
'minmaxValues': {'date': {'minimum': datetime.datetime(2013, 6, 7, 0, 0),
'maximum': datetime.datetime(2019, 9, 26, 0, 0)}}}},
{'column': 'RECORD DATE',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 0,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'date': 392131})}),
'minmaxValues': {'date': {'minimum': datetime.datetime(2019, 9, 28, 0, 0),
'maximum': datetime.datetime(2019, 9, 28, 0, 0)}}}},
{'column': 'INSPECTION TYPE',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 1337,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'str': 390794})}),
'minmaxValues': {'str': {'minimum': 'Administrative Miscellaneous / Compliance Inspection',
'maximum': 'Trans Fat / Second Compliance Inspection'}}}},
{'column': 'Latitude',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 430,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'float': 386144, 'int': 5557})}),
'minmaxValues': {'float': {'minimum': 40.508068517904,
'maximum': 40.912822326386},
'int': {'minimum': 0, 'maximum': 0}}}},
{'column': 'Longitude',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 430,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'float': 386144, 'int': 5557})}),
'minmaxValues': {'float': {'minimum': -74.248434473464,
'maximum': -73.700928057808},
'int': {'minimum': 0, 'maximum': 0}}}},
{'column': 'Community Board',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 5987,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'int': 386144})}),
'minmaxValues': {'int': {'minimum': 101, 'maximum': 595}}}},
{'column': 'Council District',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 5987,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'int': 386144})}),
'minmaxValues': {'int': {'minimum': 1, 'maximum': 51}}}},
{'column': 'Census Tract',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 5987,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'int': 386144})}),
'minmaxValues': {'int': {'minimum': 100, 'maximum': 162100}}}},
{'column': 'BIN',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 7670,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'int': 384461})}),
'minmaxValues': {'int': {'minimum': 1000000, 'maximum': 5799501}}}},
{'column': 'BBL',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 430,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'int': 391701})}),
'minmaxValues': {'int': {'minimum': 1, 'maximum': 5270000501}}}},
{'column': 'NTA',
'stats': {'totalValueCount': 392131,
'emptyValueCount': 5987,
'datatypes': defaultdict(collections.Counter,
{'total': Counter({'str': 386144})}),
'minmaxValues': {'str': {'minimum': 'BK09', 'maximum': 'SI54'}}}}]
[6]:
# Print number of empty cells for each column
profiles.stats()['empty']
[6]:
CAMIS 0
DBA 588
BORO 0
BUILDING 231
STREET 0
ZIPCODE 5557
PHONE 11
CUISINE DESCRIPTION 0
INSPECTION DATE 0
ACTION 1337
VIOLATION CODE 5696
VIOLATION DESCRIPTION 8918
CRITICAL FLAG 8918
SCORE 17046
GRADE 193610
GRADE DATE 195416
RECORD DATE 0
INSPECTION TYPE 1337
Latitude 430
Longitude 430
Community Board 5987
Council District 5987
Census Tract 5987
BIN 7670
BBL 430
NTA 5987
Name: empty, dtype: int64
[7]:
# How many records are there for date '09/28/2019'
from openclean.function.eval.base import Col
ds.filter(Col('RECORD DATE') == '09/28/2019').count()
[7]:
392131
[8]:
# Print column type information (as data frame)
profiles.types()
[8]:
date | float | int | str | |
---|---|---|---|---|
CAMIS | 0 | 0 | 392131 | 0 |
DBA | 10 | 0 | 55 | 391478 |
BORO | 0 | 0 | 116 | 392015 |
BUILDING | 180 | 0 | 381625 | 10095 |
STREET | 0 | 0 | 0 | 392131 |
ZIPCODE | 0 | 0 | 386558 | 16 |
PHONE | 0 | 0 | 391725 | 395 |
CUISINE DESCRIPTION | 0 | 0 | 0 | 392131 |
INSPECTION DATE | 392131 | 0 | 0 | 0 |
ACTION | 0 | 0 | 0 | 390794 |
VIOLATION CODE | 0 | 163 | 0 | 386272 |
VIOLATION DESCRIPTION | 0 | 0 | 0 | 383213 |
CRITICAL FLAG | 0 | 0 | 0 | 383213 |
SCORE | 0 | 0 | 375085 | 0 |
GRADE | 0 | 0 | 0 | 198521 |
GRADE DATE | 196715 | 0 | 0 | 0 |
RECORD DATE | 392131 | 0 | 0 | 0 |
INSPECTION TYPE | 0 | 0 | 0 | 390794 |
Latitude | 0 | 386144 | 5557 | 0 |
Longitude | 0 | 386144 | 5557 | 0 |
Community Board | 0 | 0 | 386144 | 0 |
Council District | 0 | 0 | 386144 | 0 |
Census Tract | 0 | 0 | 386144 | 0 |
BIN | 0 | 0 | 384461 | 0 |
BBL | 0 | 0 | 391701 | 0 |
NTA | 0 | 0 | 0 | 386144 |
Datatype Outliers (?)
Note that type detection is difficult and can be ambigious for some values. Let us take a closer look at some columns that appear to have values of different types.
[9]:
# Get a quick look at the columns that contain values of
# different (raw) data types.
profiles.multitype_columns().types()
[9]:
date | float | int | str | |
---|---|---|---|---|
DBA | 10 | 0 | 55 | 391478 |
BORO | 0 | 0 | 116 | 392015 |
BUILDING | 180 | 0 | 381625 | 10095 |
ZIPCODE | 0 | 0 | 386558 | 16 |
PHONE | 0 | 0 | 391725 | 395 |
VIOLATION CODE | 0 | 163 | 0 | 386272 |
Latitude | 0 | 386144 | 5557 | 0 |
Longitude | 0 | 386144 | 5557 | 0 |
[10]:
# Get minimum and maximum values for each datatype in columns 'DBA'
profiles.minmax('DBA')
[10]:
min | max | |
---|---|---|
str | #1 Chinese Restaurant | zx accounting service llc |
int | 33 | 1976 |
date | 1983-01-02 00:00:00 | 1983-01-02 00:00:00 |
[11]:
# Which values are identified as 'date' in column 'BUILDING'
from openclean.function.eval.datatype import IsDatetime
ds.select('BUILDING').filter(IsDatetime('BUILDING')).distinct()
[11]:
Counter({'271 1/2': 7,
'94 1/2': 23,
'168 1/2': 5,
'3906 1/2': 12,
'34 1/2': 37,
'120 1/2': 10,
'227-02/08': 35,
'134 1/2': 7,
'1045 1/2': 19,
'22 1-2': 6,
'42 1/2': 3,
'67 1/2': 6,
'501 1/2': 6,
'48 1/2': 3,
'27 1/2': 1})
[12]:
# Filter distinct values in column 'VIOLATION CODE' that were recognized
# as being of type float.
from openclean.function.eval.base import Col
from openclean.function.eval.datatype import IsFloat
ds.filter(IsFloat('VIOLATION CODE')).distinct('VIOLATION CODE')
[12]:
Counter({'15E2': 134, '15E3': 29})
[13]:
# List the distinct values (and their total counts) for
# column 'BORO'
ds.distinct('BORO')
[13]:
Counter({'Manhattan': 153958,
'Bronx': 35761,
'Queens': 89963,
'Brooklyn': 99317,
'Staten Island': 13016,
'0': 116})
Boroughs that are ‘0’
Take a look at the rows that have 0 as ‘BORO’. Is it possible to infer the borough from other columns soch as BBL (block-borough-lot) or NTA (Neighborhood Tabulation Area).
[14]:
# Profile only those rows that have a 'BORO' value '0'. Here we use the default
# column profiler that generates sets of distinct values for each column.
from openclean.function.eval.base import Col
from openclean.profiling.column import DefaultColumnProfiler
boro_0 = ds.filter(Col('BORO') == '0').profile(default_profiler=DefaultColumnProfiler)
boro_0.stats()
[14]:
total | empty | distinct | uniqueness | entropy | |
---|---|---|---|---|---|
CAMIS | 116 | 0 | 21 | 0.181034 | 2.758740 |
DBA | 116 | 9 | 12 | 0.112150 | 2.297430 |
BORO | 116 | 0 | 1 | 0.008621 | 0.000000 |
BUILDING | 116 | 0 | 6 | 0.051724 | 2.207016 |
STREET | 116 | 0 | 6 | 0.051724 | 2.207016 |
ZIPCODE | 116 | 0 | 4 | 0.034483 | 1.319701 |
PHONE | 116 | 0 | 21 | 0.181034 | 2.758740 |
CUISINE DESCRIPTION | 116 | 0 | 4 | 0.034483 | 1.596319 |
INSPECTION DATE | 116 | 0 | 32 | 0.275862 | 4.709064 |
ACTION | 116 | 16 | 1 | 0.010000 | 0.000000 |
VIOLATION CODE | 116 | 16 | 30 | 0.300000 | 4.367569 |
VIOLATION DESCRIPTION | 116 | 16 | 30 | 0.300000 | 4.367569 |
CRITICAL FLAG | 116 | 16 | 2 | 0.020000 | 0.976500 |
SCORE | 116 | 20 | 18 | 0.187500 | 3.881308 |
GRADE | 116 | 68 | 3 | 0.062500 | 1.199460 |
GRADE DATE | 116 | 68 | 17 | 0.354167 | 3.917481 |
RECORD DATE | 116 | 0 | 1 | 0.008621 | 0.000000 |
INSPECTION TYPE | 116 | 16 | 7 | 0.070000 | 1.986198 |
Latitude | 116 | 116 | 0 | NaN | NaN |
Longitude | 116 | 116 | 0 | NaN | NaN |
Community Board | 116 | 116 | 0 | NaN | NaN |
Council District | 116 | 116 | 0 | NaN | NaN |
Census Tract | 116 | 116 | 0 | NaN | NaN |
BIN | 116 | 116 | 0 | NaN | NaN |
BBL | 116 | 116 | 0 | NaN | NaN |
NTA | 116 | 116 | 0 | NaN | NaN |
[15]:
# There are only four different zip codes. The different values and their
# frequency are included in the profiler results.
boro_0.column('ZIPCODE')['topValues']
[15]:
[('11249', 81), ('N/A', 16), ('10168', 14), ('10285', 5)]
[16]:
# What are the boroughs that are associated with zip codes
# '10168', '10285' in our dataset.
from openclean.function.eval.domain import IsIn
ds.filter(IsIn('ZIPCODE', {'10168', '10285'})).select('BORO').distinct()
[16]:
Counter({'Manhattan': 8, '0': 19})
[17]:
# What are the boroughs that are associated with zip code
# '11249' in our dataset.
ds.filter(Col('ZIPCODE') == '11249').select('BORO').distinct()
[17]:
Counter({'Brooklyn': 2358, '0': 81})
[18]:
# NOTE (for cleaning step):
#
# It seems save to replace '0' in column 'BORO' for those rows that have
# zip codes in ['11249', '10168', '10285'] using the mapping below.Also
# delete those records that have 'ZIPCODE' 'N/A'
boro_0_mapping = {'11249': 'Brooklyn', '10168': 'Manhattan', '10285': 'Manhattan'}
New Establishments
From the dataset description: “… Records are also included for each restaurant that has applied for a permit but has not yet been inspected and for inspections resulting in no violations. Establishments with inspection date of 1/1/1900 are new establishments that have not yet received an inspection. Restaurants that received no violations are represented by a single row and coded as having no violations using the ACTION field.”
We expect that records with an ‘INSPECTION DATE’ 1900/1/1 have no value for ‘ACTION’, ‘INSPECTION TYPE’, ‘SCORE’, and ‘GRADE’.
[19]:
from datetime import datetime
from openclean.function.eval.datatype import Datetime
from openclean.profiling.column import DefaultColumnProfiler
# Date that identifies new establishments that have not
# been inspected.
new_establ_date = datetime(1900, 1, 1)
# Count total, empty, and distinct values for the columns that we
# expect to be empty for new establishments.
ds.filter(Datetime('INSPECTION DATE') == new_establ_date)\
.select(['ACTION', 'INSPECTION TYPE', 'SCORE', 'GRADE'])\
.profile(default_profiler=DefaultColumnProfiler)\
.stats()
[19]:
total | empty | distinct | uniqueness | entropy | |
---|---|---|---|---|---|
ACTION | 1337 | 1337 | 0 | None | None |
INSPECTION TYPE | 1337 | 1337 | 0 | None | None |
SCORE | 1337 | 1337 | 0 | None | None |
GRADE | 1337 | 1337 | 0 | None | None |
[20]:
# NOTE (for cleaning step):
#
# Delete records with date 1900/1/1.
Missing Scores
We are interested in the overall scores that were assigned to individual establishments as a result of the inspection. For new establishments we have seen that the score can be missing in some cases. Are there other records that do not have a ‘SCORE’?
[21]:
from openclean.function.eval.logic import And
from openclean.function.eval.datatype import Datetime
from openclean.function.eval.null import IsEmpty
df = ds.filter(And(Datetime('INSPECTION DATE') != new_establ_date, IsEmpty('SCORE'))).to_df()
[22]:
df.shape
[22]:
(15709, 26)
[23]:
df['INSPECTION TYPE'].value_counts()
[23]:
Administrative Miscellaneous / Initial Inspection 7018
Smoke-Free Air Act / Initial Inspection 2320
Administrative Miscellaneous / Re-inspection 2063
Trans Fat / Initial Inspection 1648
Calorie Posting / Initial Inspection 1123
Smoke-Free Air Act / Re-inspection 586
Trans Fat / Re-inspection 394
Calorie Posting / Re-inspection 233
Administrative Miscellaneous / Compliance Inspection 118
Administrative Miscellaneous / Reopening Inspection 102
Trans Fat / Compliance Inspection 42
Administrative Miscellaneous / Second Compliance Inspection 19
Smoke-Free Air Act / Compliance Inspection 17
Calorie Posting / Compliance Inspection 16
Trans Fat / Second Compliance Inspection 7
Smoke-Free Air Act / Limited Inspection 2
Trans Fat / Limited Inspection 1
Name: INSPECTION TYPE, dtype: int64
[24]:
# QUESTION:
#
# IS there an easy way to get the difference between distinct value counts for those records that
# have a score and tose that don't. That is, are there INSPECTION TYPE's that never result in score
# and others that always have a score?
# NOTE (for cleaning step):
#
# Delete records with empty score.
Cuisine Types
We want to compare inspection results by cuisine type. Get a quick look at the different values in that column.
[25]:
ds.distinct('CUISINE DESCRIPTION')
[25]:
Counter({'Bakery': 12305,
'American': 82709,
'Asian': 6237,
'Tex-Mex': 1786,
'Chinese': 41431,
'Mexican': 15811,
'Korean': 5431,
'Pizza/Italian': 8354,
'Japanese': 13956,
'Donuts': 5199,
'Sandwiches': 4108,
'Pizza': 17536,
'Caribbean': 14080,
'French': 4499,
'Delicatessen': 6257,
'Seafood': 3092,
'Café/Coffee/Tea': 18833,
'Spanish': 12231,
'Salads': 956,
'Juice, Smoothies, Fruit Salads': 4573,
'Afghan': 190,
'Chinese/Japanese': 884,
'Filipino': 666,
'Thai': 5334,
'Irish': 3336,
'Latin (Cuban, Dominican, Puerto Rican, South & Central American)': 17117,
'Eastern European': 1369,
'Turkish': 1179,
'Italian': 15798,
'Bottled beverages, including water, sodas, juices, etc.': 1202,
'Hamburgers': 4385,
'African': 1729,
'Chicken': 7337,
'Other': 2535,
'Barbecue': 764,
'Jewish/Kosher': 5550,
'Greek': 2162,
'English': 207,
'Bangladeshi': 1193,
'Sandwiches/Salads/Mixed Buffet': 2740,
'Middle Eastern': 2927,
'Pakistani': 651,
'Bagels/Pretzels': 2630,
'Soul Food': 1138,
'Steak': 1277,
'German': 428,
'Peruvian': 1849,
'Indian': 6841,
'Brazilian': 488,
'Tapas': 516,
'Hawaiian': 496,
'Russian': 1086,
'Vietnamese/Cambodian/Malaysia': 1895,
'Mediterranean': 4219,
'Ice Cream, Gelato, Yogurt, Ices': 3044,
'Vegetarian': 1874,
'Continental': 629,
'Soups & Sandwiches': 552,
'Hotdogs/Pretzels': 269,
'Egyptian': 188,
'Chinese/Cuban': 420,
'Creole': 526,
'Moroccan': 149,
'Polish': 542,
'Ethiopian': 231,
'Portuguese': 133,
'Australian': 378,
'Fruits/Vegetables': 90,
'Armenian': 305,
'Czech': 41,
'Hotdogs': 185,
'Scandinavian': 88,
'Soups': 56,
'Nuts/Confectionary': 34,
'Creole/Cajun': 133,
'Southwestern': 80,
'Iranian': 60,
'Pancakes/Waffles': 221,
'Not Listed/Not Applicable': 98,
'Chilean': 23,
'Cajun': 82,
'Californian': 90,
'Indonesian': 101,
'Basque': 7})