Profiling - DOHMH New York City Restaurant Inspection Results

Data Profiling

In this step we start be getting an overview of the properties for the Restaurant Inspection dataset. We then focus on a few quality issues related to the data that we want to extract to plot the score distributions.

[1]:

# For this demo we make use of openclean's streaming option for large datasets (to avoid
# having to load the full dataset into memory).

import os

from openclean.pipeline import stream

ds = stream(os.path.join('data', '43nn-pn8j.tsv.gz'))

Basic Dataset Properties and Statistics

[2]:

# Get the list of column names in the dataset.

ds.columns

[2]:

['CAMIS',
 'DBA',
 'BORO',
 'BUILDING',
 'STREET',
 'ZIPCODE',
 'PHONE',
 'CUISINE DESCRIPTION',
 'INSPECTION DATE',
 'ACTION',
 'VIOLATION CODE',
 'VIOLATION DESCRIPTION',
 'CRITICAL FLAG',
 'SCORE',
 'GRADE',
 'GRADE DATE',
 'RECORD DATE',
 'INSPECTION TYPE',
 'Latitude',
 'Longitude',
 'Community Board',
 'Council District',
 'Census Tract',
 'BIN',
 'BBL',
 'NTA']

[3]:

# Take a look ath the first 10 rows in the dataset.

ds.head()

[3]:

	CAMIS	DBA	BORO	BUILDING	STREET	ZIPCODE	PHONE	CUISINE DESCRIPTION	INSPECTION DATE	ACTION	...	RECORD DATE	INSPECTION TYPE	Latitude	Longitude	Community Board	Council District	Census Tract	BIN	BBL	NTA
0	50001317	D'LILI BAKERY	Manhattan	526	W 207TH ST	10034	2123040356	Bakery	05/04/2018	Violations were cited in the following area(s).	...	09/28/2019	Cycle Inspection / Initial Inspection	40.865269576991	-73.919671779695	112	10	029300	1064780	1022220025	MN01
1	41716448	HARLEM SHAKE	Manhattan	100	WEST 124 STREET	10027	2122228300	American	07/30/2019	Violations were cited in the following area(s).	...	09/28/2019	Cycle Inspection / Re-inspection	40.807286117139	-73.946454954277	110	09	022200	1057790	1019080035	MN11
2	50044535	HAWKERS	Manhattan	225	E 14TH ST	10003	2129821688	Asian	06/14/2017	Violations were cited in the following area(s).	...	09/28/2019	Cycle Inspection / Initial Inspection	40.732989026505	-73.986454624452	106	02	004800	1019505	1008960012	MN21
3	50041601	118 KITCHEN	Manhattan	1	E 118TH ST	10035	2127226888	Tex-Mex	05/09/2019	Violations were cited in the following area(s).	...	09/28/2019	Cycle Inspection / Initial Inspection	40.801870167292	-73.945260143617	111	09	018400	1053946	1017450001	MN34
4	50045596	NEW CHINA	Bronx	5690	MOSHOLU AVE	10471	7188841111	Chinese	09/07/2016	Violations were cited in the following area(s).	...	09/28/2019	Cycle Inspection / Re-inspection	40.905434879643	-73.900843816155	208	11	033700	2084805	2058481761	BX22
5	41156678	SUNSWICK	Queens	3502	35 STREET	11106	7187520620	American	05/02/2017	Violations were cited in the following area(s).	...	09/28/2019	Cycle Inspection / Re-inspection	40.756372985356	-73.925517485863	401	26	005700	4009585	4006380025	QN70
6	41434036	CANCUN MEXICAN RESTAURANT	Manhattan	937	8 AVENUE	10019	2123077307	Mexican	04/14/2016	Violations were cited in the following area(s).	...	09/28/2019	Cycle Inspection / Re-inspection	40.765676019444	-73.983642958358	104	03	013900	1025437	1010460032	MN15
7	50058219	HAN JOO BBQ CYCJ	Queens	4106	149TH PL	11355	7183596888	Korean	03/21/2017	Violations were cited in the following area(s).	...	09/28/2019	Pre-permit (Operational) / Initial Inspection	40.762113093414	-73.814688971988	407	20	116700	4447126	4050540032	QN51
8	41018004	ANGELICA PIZZERIA	Brooklyn	30	NEVINS STREET	11217	7188522728	Pizza/Italian	04/18/2019	Violations were cited in the following area(s).	...	09/28/2019	Cycle Inspection / Initial Inspection	40.687850595312	-73.981473402889	302	33	003700	3000515	3001660037	BK38
9	50064285	E.A.K. RAMEN	Manhattan	469	6TH AVE	10011	6468632027	Japanese	06/07/2017	Establishment Closed by DOHMH. Violations wer...	...	09/28/2019	Pre-permit (Operational) / Initial Inspection	40.73558361361	-73.998141677927	102	03	007100	1010580	1006070045	MN23

10 rows × 26 columns

[4]:

# Count the number of rows in the dataset.

ds.count()

[4]:

[5]:

# Get basic profile information for all columns in the dataset.

profiles = ds.profile()
profiles

[5]:

[{'column': 'CAMIS',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 0,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 392131})}),
   'minmaxValues': {'int': {'minimum': 30075445, 'maximum': 50099136}}}},
 {'column': 'DBA',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 588,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 391478, 'int': 55, 'date': 10})}),
   'minmaxValues': {'str': {'minimum': '#1 Chinese Restaurant',
     'maximum': 'zx accounting service llc'},
    'int': {'minimum': 33, 'maximum': 1976},
    'date': {'minimum': datetime.datetime(1983, 1, 2, 0, 0),
     'maximum': datetime.datetime(1983, 1, 2, 0, 0)}}}},
 {'column': 'BORO',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 0,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 392015, 'int': 116})}),
   'minmaxValues': {'str': {'minimum': 'Bronx', 'maximum': 'Staten Island'},
    'int': {'minimum': 0, 'maximum': 0}}}},
 {'column': 'BUILDING',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 231,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 381625, 'str': 10095, 'date': 180})}),
   'minmaxValues': {'int': {'minimum': 0, 'maximum': 65626564},
    'str': {'minimum': '.1-A', 'maximum': 'W105'},
    'date': {'minimum': datetime.datetime(120, 1, 2, 0, 0),
     'maximum': datetime.datetime(3906, 1, 2, 0, 0)}}}},
 {'column': 'STREET',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 0,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 392131})}),
   'minmaxValues': {'str': {'minimum': '& GRAND CENTRAL',
     'maximum': 'west 34th st'}}}},
 {'column': 'ZIPCODE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 5557,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 386558, 'str': 16})}),
   'minmaxValues': {'int': {'minimum': 10000, 'maximum': 12345},
    'str': {'minimum': 'N/A', 'maximum': 'N/A'}}}},
 {'column': 'PHONE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 11,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 391725, 'str': 395})}),
   'minmaxValues': {'int': {'minimum': 0, 'maximum': 90172017170},
    'str': {'minimum': '212690369_', 'maximum': '__________'}}}},
 {'column': 'CUISINE DESCRIPTION',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 0,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 392131})}),
   'minmaxValues': {'str': {'minimum': 'Afghan',
     'maximum': 'Vietnamese/Cambodian/Malaysia'}}}},
 {'column': 'INSPECTION DATE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 0,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'date': 392131})}),
   'minmaxValues': {'date': {'minimum': datetime.datetime(1900, 1, 1, 0, 0),
     'maximum': datetime.datetime(2019, 9, 27, 0, 0)}}}},
 {'column': 'ACTION',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 1337,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 390794})}),
   'minmaxValues': {'str': {'minimum': 'Establishment Closed by DOHMH.  Violations were cited in the following area(s) and those requiring immediate action were addressed.',
     'maximum': 'Violations were cited in the following area(s).'}}}},
 {'column': 'VIOLATION CODE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 5696,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 386272, 'float': 163})}),
   'minmaxValues': {'str': {'minimum': '02A', 'maximum': '22G'},
    'float': {'minimum': 1500.0, 'maximum': 15000.0}}}},
 {'column': 'VIOLATION DESCRIPTION',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 8918,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 383213})}),
   'minmaxValues': {'str': {'minimum': '""No Smoking” and/or \'Smoking Permitted” sign not conspicuously posted. Health warning not present on \'Smoking Permitted”',
     'maximum': '“Choking first aid” poster not posted. “Alcohol and pregnancy” warning sign not posted. Resuscitation equipment: exhaled air resuscitation masks (adult & pediatric), latex gloves, sign not posted. Inspection report sign not posted.'}}}},
 {'column': 'CRITICAL FLAG',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 8918,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 383213})}),
   'minmaxValues': {'str': {'minimum': 'N', 'maximum': 'Y'}}}},
 {'column': 'SCORE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 17046,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 375085})}),
   'minmaxValues': {'int': {'minimum': -1, 'maximum': 164}}}},
 {'column': 'GRADE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 193610,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 198521})}),
   'minmaxValues': {'str': {'minimum': 'A', 'maximum': 'Z'}}}},
 {'column': 'GRADE DATE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 195416,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'date': 196715})}),
   'minmaxValues': {'date': {'minimum': datetime.datetime(2013, 6, 7, 0, 0),
     'maximum': datetime.datetime(2019, 9, 26, 0, 0)}}}},
 {'column': 'RECORD DATE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 0,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'date': 392131})}),
   'minmaxValues': {'date': {'minimum': datetime.datetime(2019, 9, 28, 0, 0),
     'maximum': datetime.datetime(2019, 9, 28, 0, 0)}}}},
 {'column': 'INSPECTION TYPE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 1337,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 390794})}),
   'minmaxValues': {'str': {'minimum': 'Administrative Miscellaneous / Compliance Inspection',
     'maximum': 'Trans Fat / Second Compliance Inspection'}}}},
 {'column': 'Latitude',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 430,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'float': 386144, 'int': 5557})}),
   'minmaxValues': {'float': {'minimum': 40.508068517904,
     'maximum': 40.912822326386},
    'int': {'minimum': 0, 'maximum': 0}}}},
 {'column': 'Longitude',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 430,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'float': 386144, 'int': 5557})}),
   'minmaxValues': {'float': {'minimum': -74.248434473464,
     'maximum': -73.700928057808},
    'int': {'minimum': 0, 'maximum': 0}}}},
 {'column': 'Community Board',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 5987,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 386144})}),
   'minmaxValues': {'int': {'minimum': 101, 'maximum': 595}}}},
 {'column': 'Council District',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 5987,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 386144})}),
   'minmaxValues': {'int': {'minimum': 1, 'maximum': 51}}}},
 {'column': 'Census Tract',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 5987,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 386144})}),
   'minmaxValues': {'int': {'minimum': 100, 'maximum': 162100}}}},
 {'column': 'BIN',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 7670,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 384461})}),
   'minmaxValues': {'int': {'minimum': 1000000, 'maximum': 5799501}}}},
 {'column': 'BBL',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 430,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 391701})}),
   'minmaxValues': {'int': {'minimum': 1, 'maximum': 5270000501}}}},
 {'column': 'NTA',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 5987,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 386144})}),
   'minmaxValues': {'str': {'minimum': 'BK09', 'maximum': 'SI54'}}}}]

[6]:

# Print number of empty cells for each column

profiles.stats()['empty']

[6]:

CAMIS                         0
DBA                         588
BORO                          0
BUILDING                    231
STREET                        0
ZIPCODE                    5557
PHONE                        11
CUISINE DESCRIPTION           0
INSPECTION DATE               0
ACTION                     1337
VIOLATION CODE             5696
VIOLATION DESCRIPTION      8918
CRITICAL FLAG              8918
SCORE                     17046
GRADE                    193610
GRADE DATE               195416
RECORD DATE                   0
INSPECTION TYPE            1337
Latitude                    430
Longitude                   430
Community Board            5987
Council District           5987
Census Tract               5987
BIN                        7670
BBL                         430
NTA                        5987
Name: empty, dtype: int64

[7]:

# How many records are there for date '09/28/2019'
from openclean.function.eval.base import Col

ds.filter(Col('RECORD DATE') == '09/28/2019').count()

[7]:

[8]:

# Print column type information (as data frame)

profiles.types()

[8]:

	date	float	int	str
CAMIS	0	0	392131	0
DBA	10	0	55	391478
BORO	0	0	116	392015
BUILDING	180	0	381625	10095
STREET	0	0	0	392131
ZIPCODE	0	0	386558	16
PHONE	0	0	391725	395
CUISINE DESCRIPTION	0	0	0	392131
INSPECTION DATE	392131	0	0	0
ACTION	0	0	0	390794
VIOLATION CODE	0	163	0	386272
VIOLATION DESCRIPTION	0	0	0	383213
CRITICAL FLAG	0	0	0	383213
SCORE	0	0	375085	0
GRADE	0	0	0	198521
GRADE DATE	196715	0	0	0
RECORD DATE	392131	0	0	0
INSPECTION TYPE	0	0	0	390794
Latitude	0	386144	5557	0
Longitude	0	386144	5557	0
Community Board	0	0	386144	0
Council District	0	0	386144	0
Census Tract	0	0	386144	0
BIN	0	0	384461	0
BBL	0	0	391701	0
NTA	0	0	0	386144

Datatype Outliers (?)

Note that type detection is difficult and can be ambigious for some values. Let us take a closer look at some columns that appear to have values of different types.

[9]:

# Get a quick look at the columns that contain values of
# different (raw) data types.

profiles.multitype_columns().types()

[9]:

	date	float	int	str
DBA	10	0	55	391478
BORO	0	0	116	392015
BUILDING	180	0	381625	10095
ZIPCODE	0	0	386558	16
PHONE	0	0	391725	395
VIOLATION CODE	0	163	0	386272
Latitude	0	386144	5557	0
Longitude	0	386144	5557	0

[10]:

# Get minimum and maximum values for each datatype in columns 'DBA'

profiles.minmax('DBA')

[10]:

	min	max
str	#1 Chinese Restaurant	zx accounting service llc
int	33	1976
date	1983-01-02 00:00:00	1983-01-02 00:00:00

[11]:

# Which values are identified as 'date' in column 'BUILDING'

from openclean.function.eval.datatype import IsDatetime

ds.select('BUILDING').filter(IsDatetime('BUILDING')).distinct()

[11]:

Counter({'271 1/2': 7,
         '94 1/2': 23,
         '168 1/2': 5,
         '3906 1/2': 12,
         '34   1/2': 37,
         '120 1/2': 10,
         '227-02/08': 35,
         '134 1/2': 7,
         '1045 1/2': 19,
         '22 1-2': 6,
         '42 1/2': 3,
         '67 1/2': 6,
         '501 1/2': 6,
         '48 1/2': 3,
         '27 1/2': 1})

[12]:

# Filter distinct values in column 'VIOLATION CODE' that were recognized
# as being of type float.

from openclean.function.eval.base import Col
from openclean.function.eval.datatype import IsFloat

ds.filter(IsFloat('VIOLATION CODE')).distinct('VIOLATION CODE')

[12]:

Counter({'15E2': 134, '15E3': 29})

[13]:

# List the distinct values (and their total counts) for
# column 'BORO'

ds.distinct('BORO')

[13]:

Counter({'Manhattan': 153958,
         'Bronx': 35761,
         'Queens': 89963,
         'Brooklyn': 99317,
         'Staten Island': 13016,
         '0': 116})

Boroughs that are ‘0’

Take a look at the rows that have 0 as ‘BORO’. Is it possible to infer the borough from other columns soch as BBL (block-borough-lot) or NTA (Neighborhood Tabulation Area).

[14]:

# Profile only those rows that have a 'BORO' value '0'. Here we use the default
# column profiler that generates sets of distinct values for each column.

from openclean.function.eval.base import Col
from openclean.profiling.column import DefaultColumnProfiler

boro_0 = ds.filter(Col('BORO') == '0').profile(default_profiler=DefaultColumnProfiler)
boro_0.stats()

[14]:

	total	empty	distinct	uniqueness	entropy
CAMIS	116	0	21	0.181034	2.758740
DBA	116	9	12	0.112150	2.297430
BORO	116	0	1	0.008621	0.000000
BUILDING	116	0	6	0.051724	2.207016
STREET	116	0	6	0.051724	2.207016
ZIPCODE	116	0	4	0.034483	1.319701
PHONE	116	0	21	0.181034	2.758740
CUISINE DESCRIPTION	116	0	4	0.034483	1.596319
INSPECTION DATE	116	0	32	0.275862	4.709064
ACTION	116	16	1	0.010000	0.000000
VIOLATION CODE	116	16	30	0.300000	4.367569
VIOLATION DESCRIPTION	116	16	30	0.300000	4.367569
CRITICAL FLAG	116	16	2	0.020000	0.976500
SCORE	116	20	18	0.187500	3.881308
GRADE	116	68	3	0.062500	1.199460
GRADE DATE	116	68	17	0.354167	3.917481
RECORD DATE	116	0	1	0.008621	0.000000
INSPECTION TYPE	116	16	7	0.070000	1.986198
Latitude	116	116	0	NaN	NaN
Longitude	116	116	0	NaN	NaN
Community Board	116	116	0	NaN	NaN
Council District	116	116	0	NaN	NaN
Census Tract	116	116	0	NaN	NaN
BIN	116	116	0	NaN	NaN
BBL	116	116	0	NaN	NaN
NTA	116	116	0	NaN	NaN

[15]:

# There are only four different zip codes. The different values and their
# frequency are included in the profiler results.

boro_0.column('ZIPCODE')['topValues']

[15]:

[('11249', 81), ('N/A', 16), ('10168', 14), ('10285', 5)]

[16]:

# What are the boroughs that are associated with zip codes
# '10168', '10285' in our dataset.

from openclean.function.eval.domain import IsIn

ds.filter(IsIn('ZIPCODE', {'10168', '10285'})).select('BORO').distinct()

[16]:

Counter({'Manhattan': 8, '0': 19})

[17]:

# What are the boroughs that are associated with zip code
# '11249' in our dataset.

ds.filter(Col('ZIPCODE') == '11249').select('BORO').distinct()

[17]:

Counter({'Brooklyn': 2358, '0': 81})

[18]:

# NOTE (for cleaning step):
#
# It seems save to replace '0' in column 'BORO' for those rows that have
# zip codes in ['11249', '10168', '10285'] using the mapping below.Also
# delete those records that have 'ZIPCODE' 'N/A'

boro_0_mapping = {'11249': 'Brooklyn', '10168': 'Manhattan', '10285': 'Manhattan'}

New Establishments

From the dataset description: “… Records are also included for each restaurant that has applied for a permit but has not yet been inspected and for inspections resulting in no violations. Establishments with inspection date of 1/1/1900 are new establishments that have not yet received an inspection. Restaurants that received no violations are represented by a single row and coded as having no violations using the ACTION field.”

We expect that records with an ‘INSPECTION DATE’ 1900/1/1 have no value for ‘ACTION’, ‘INSPECTION TYPE’, ‘SCORE’, and ‘GRADE’.

[19]:

from datetime import datetime
from openclean.function.eval.datatype import Datetime
from openclean.profiling.column import DefaultColumnProfiler

# Date that identifies new establishments that have not
# been inspected.
new_establ_date = datetime(1900, 1, 1)

# Count total, empty, and distinct values for the columns that we
# expect to be empty for new establishments.
ds.filter(Datetime('INSPECTION DATE') == new_establ_date)\
    .select(['ACTION', 'INSPECTION TYPE', 'SCORE', 'GRADE'])\
    .profile(default_profiler=DefaultColumnProfiler)\
    .stats()

[19]:

	total	empty	uniqueness	entropy
ACTION	1337	1337	None	None
INSPECTION TYPE	1337	1337	None	None
SCORE	1337	1337	None	None
GRADE	1337	1337	None	None

[20]:

# NOTE (for cleaning step):
#
# Delete records with date 1900/1/1.

Missing Scores

We are interested in the overall scores that were assigned to individual establishments as a result of the inspection. For new establishments we have seen that the score can be missing in some cases. Are there other records that do not have a ‘SCORE’?

[21]:

from openclean.function.eval.logic import And
from openclean.function.eval.datatype import Datetime
from openclean.function.eval.null import IsEmpty

df = ds.filter(And(Datetime('INSPECTION DATE') != new_establ_date, IsEmpty('SCORE'))).to_df()

[22]:

df.shape

[22]:

(15709, 26)

[23]:

df['INSPECTION TYPE'].value_counts()

[23]:

Administrative Miscellaneous / Initial Inspection              7018
Smoke-Free Air Act / Initial Inspection                        2320
Administrative Miscellaneous / Re-inspection                   2063
Trans Fat / Initial Inspection                                 1648
Calorie Posting / Initial Inspection                           1123
Smoke-Free Air Act / Re-inspection                              586
Trans Fat / Re-inspection                                       394
Calorie Posting / Re-inspection                                 233
Administrative Miscellaneous / Compliance Inspection            118
Administrative Miscellaneous / Reopening Inspection             102
Trans Fat / Compliance Inspection                                42
Administrative Miscellaneous / Second Compliance Inspection      19
Smoke-Free Air Act / Compliance Inspection                       17
Calorie Posting / Compliance Inspection                          16
Trans Fat / Second Compliance Inspection                          7
Smoke-Free Air Act / Limited Inspection                           2
Trans Fat / Limited Inspection                                    1
Name: INSPECTION TYPE, dtype: int64

[24]:

# QUESTION:
#
# IS there an easy way to get the difference between distinct value counts for those records that
# have a score and tose that don't. That is, are there INSPECTION TYPE's that never result in score
# and others that always have a score?

# NOTE (for cleaning step):
#
# Delete records with empty score.

Cuisine Types

We want to compare inspection results by cuisine type. Get a quick look at the different values in that column.

[25]:

ds.distinct('CUISINE DESCRIPTION')

[25]:

Counter({'Bakery': 12305,
         'American': 82709,
         'Asian': 6237,
         'Tex-Mex': 1786,
         'Chinese': 41431,
         'Mexican': 15811,
         'Korean': 5431,
         'Pizza/Italian': 8354,
         'Japanese': 13956,
         'Donuts': 5199,
         'Sandwiches': 4108,
         'Pizza': 17536,
         'Caribbean': 14080,
         'French': 4499,
         'Delicatessen': 6257,
         'Seafood': 3092,
         'Café/Coffee/Tea': 18833,
         'Spanish': 12231,
         'Salads': 956,
         'Juice, Smoothies, Fruit Salads': 4573,
         'Afghan': 190,
         'Chinese/Japanese': 884,
         'Filipino': 666,
         'Thai': 5334,
         'Irish': 3336,
         'Latin (Cuban, Dominican, Puerto Rican, South & Central American)': 17117,
         'Eastern European': 1369,
         'Turkish': 1179,
         'Italian': 15798,
         'Bottled beverages, including water, sodas, juices, etc.': 1202,
         'Hamburgers': 4385,
         'African': 1729,
         'Chicken': 7337,
         'Other': 2535,
         'Barbecue': 764,
         'Jewish/Kosher': 5550,
         'Greek': 2162,
         'English': 207,
         'Bangladeshi': 1193,
         'Sandwiches/Salads/Mixed Buffet': 2740,
         'Middle Eastern': 2927,
         'Pakistani': 651,
         'Bagels/Pretzels': 2630,
         'Soul Food': 1138,
         'Steak': 1277,
         'German': 428,
         'Peruvian': 1849,
         'Indian': 6841,
         'Brazilian': 488,
         'Tapas': 516,
         'Hawaiian': 496,
         'Russian': 1086,
         'Vietnamese/Cambodian/Malaysia': 1895,
         'Mediterranean': 4219,
         'Ice Cream, Gelato, Yogurt, Ices': 3044,
         'Vegetarian': 1874,
         'Continental': 629,
         'Soups & Sandwiches': 552,
         'Hotdogs/Pretzels': 269,
         'Egyptian': 188,
         'Chinese/Cuban': 420,
         'Creole': 526,
         'Moroccan': 149,
         'Polish': 542,
         'Ethiopian': 231,
         'Portuguese': 133,
         'Australian': 378,
         'Fruits/Vegetables': 90,
         'Armenian': 305,
         'Czech': 41,
         'Hotdogs': 185,
         'Scandinavian': 88,
         'Soups': 56,
         'Nuts/Confectionary': 34,
         'Creole/Cajun': 133,
         'Southwestern': 80,
         'Iranian': 60,
         'Pancakes/Waffles': 221,
         'Not Listed/Not Applicable': 98,
         'Chilean': 23,
         'Cajun': 82,
         'Californian': 90,
         'Indonesian': 101,
         'Basque': 7})