Profiling - DOHMH New York City Restaurant Inspection Results

Data Profiling

In this step we start be getting an overview of the properties for the Restaurant Inspection dataset. We then focus on a few quality issues related to the data that we want to extract to plot the score distributions.

[1]:
# For this demo we make use of openclean's streaming option for large datasets (to avoid
# having to load the full dataset into memory).

import os

from openclean.pipeline import stream

ds = stream(os.path.join('data', '43nn-pn8j.tsv.gz'))

Basic Dataset Properties and Statistics

[2]:
# Get the list of column names in the dataset.

ds.columns
[2]:
['CAMIS',
 'DBA',
 'BORO',
 'BUILDING',
 'STREET',
 'ZIPCODE',
 'PHONE',
 'CUISINE DESCRIPTION',
 'INSPECTION DATE',
 'ACTION',
 'VIOLATION CODE',
 'VIOLATION DESCRIPTION',
 'CRITICAL FLAG',
 'SCORE',
 'GRADE',
 'GRADE DATE',
 'RECORD DATE',
 'INSPECTION TYPE',
 'Latitude',
 'Longitude',
 'Community Board',
 'Council District',
 'Census Tract',
 'BIN',
 'BBL',
 'NTA']
[3]:
# Take a look ath the first 10 rows in the dataset.

ds.head()
[3]:
CAMIS DBA BORO BUILDING STREET ZIPCODE PHONE CUISINE DESCRIPTION INSPECTION DATE ACTION ... RECORD DATE INSPECTION TYPE Latitude Longitude Community Board Council District Census Tract BIN BBL NTA
0 50001317 D'LILI BAKERY Manhattan 526 W 207TH ST 10034 2123040356 Bakery 05/04/2018 Violations were cited in the following area(s). ... 09/28/2019 Cycle Inspection / Initial Inspection 40.865269576991 -73.919671779695 112 10 029300 1064780 1022220025 MN01
1 41716448 HARLEM SHAKE Manhattan 100 WEST 124 STREET 10027 2122228300 American 07/30/2019 Violations were cited in the following area(s). ... 09/28/2019 Cycle Inspection / Re-inspection 40.807286117139 -73.946454954277 110 09 022200 1057790 1019080035 MN11
2 50044535 HAWKERS Manhattan 225 E 14TH ST 10003 2129821688 Asian 06/14/2017 Violations were cited in the following area(s). ... 09/28/2019 Cycle Inspection / Initial Inspection 40.732989026505 -73.986454624452 106 02 004800 1019505 1008960012 MN21
3 50041601 118 KITCHEN Manhattan 1 E 118TH ST 10035 2127226888 Tex-Mex 05/09/2019 Violations were cited in the following area(s). ... 09/28/2019 Cycle Inspection / Initial Inspection 40.801870167292 -73.945260143617 111 09 018400 1053946 1017450001 MN34
4 50045596 NEW CHINA Bronx 5690 MOSHOLU AVE 10471 7188841111 Chinese 09/07/2016 Violations were cited in the following area(s). ... 09/28/2019 Cycle Inspection / Re-inspection 40.905434879643 -73.900843816155 208 11 033700 2084805 2058481761 BX22
5 41156678 SUNSWICK Queens 3502 35 STREET 11106 7187520620 American 05/02/2017 Violations were cited in the following area(s). ... 09/28/2019 Cycle Inspection / Re-inspection 40.756372985356 -73.925517485863 401 26 005700 4009585 4006380025 QN70
6 41434036 CANCUN MEXICAN RESTAURANT Manhattan 937 8 AVENUE 10019 2123077307 Mexican 04/14/2016 Violations were cited in the following area(s). ... 09/28/2019 Cycle Inspection / Re-inspection 40.765676019444 -73.983642958358 104 03 013900 1025437 1010460032 MN15
7 50058219 HAN JOO BBQ CYCJ Queens 4106 149TH PL 11355 7183596888 Korean 03/21/2017 Violations were cited in the following area(s). ... 09/28/2019 Pre-permit (Operational) / Initial Inspection 40.762113093414 -73.814688971988 407 20 116700 4447126 4050540032 QN51
8 41018004 ANGELICA PIZZERIA Brooklyn 30 NEVINS STREET 11217 7188522728 Pizza/Italian 04/18/2019 Violations were cited in the following area(s). ... 09/28/2019 Cycle Inspection / Initial Inspection 40.687850595312 -73.981473402889 302 33 003700 3000515 3001660037 BK38
9 50064285 E.A.K. RAMEN Manhattan 469 6TH AVE 10011 6468632027 Japanese 06/07/2017 Establishment Closed by DOHMH. Violations wer... ... 09/28/2019 Pre-permit (Operational) / Initial Inspection 40.73558361361 -73.998141677927 102 03 007100 1010580 1006070045 MN23

10 rows × 26 columns

[4]:
# Count the number of rows in the dataset.

ds.count()
[4]:
392131
[5]:
# Get basic profile information for all columns in the dataset.

profiles = ds.profile()
profiles
[5]:
[{'column': 'CAMIS',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 0,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 392131})}),
   'minmaxValues': {'int': {'minimum': 30075445, 'maximum': 50099136}}}},
 {'column': 'DBA',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 588,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 391478, 'int': 55, 'date': 10})}),
   'minmaxValues': {'str': {'minimum': '#1 Chinese Restaurant',
     'maximum': 'zx accounting service llc'},
    'int': {'minimum': 33, 'maximum': 1976},
    'date': {'minimum': datetime.datetime(1983, 1, 2, 0, 0),
     'maximum': datetime.datetime(1983, 1, 2, 0, 0)}}}},
 {'column': 'BORO',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 0,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 392015, 'int': 116})}),
   'minmaxValues': {'str': {'minimum': 'Bronx', 'maximum': 'Staten Island'},
    'int': {'minimum': 0, 'maximum': 0}}}},
 {'column': 'BUILDING',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 231,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 381625, 'str': 10095, 'date': 180})}),
   'minmaxValues': {'int': {'minimum': 0, 'maximum': 65626564},
    'str': {'minimum': '.1-A', 'maximum': 'W105'},
    'date': {'minimum': datetime.datetime(120, 1, 2, 0, 0),
     'maximum': datetime.datetime(3906, 1, 2, 0, 0)}}}},
 {'column': 'STREET',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 0,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 392131})}),
   'minmaxValues': {'str': {'minimum': '& GRAND CENTRAL',
     'maximum': 'west 34th st'}}}},
 {'column': 'ZIPCODE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 5557,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 386558, 'str': 16})}),
   'minmaxValues': {'int': {'minimum': 10000, 'maximum': 12345},
    'str': {'minimum': 'N/A', 'maximum': 'N/A'}}}},
 {'column': 'PHONE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 11,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 391725, 'str': 395})}),
   'minmaxValues': {'int': {'minimum': 0, 'maximum': 90172017170},
    'str': {'minimum': '212690369_', 'maximum': '__________'}}}},
 {'column': 'CUISINE DESCRIPTION',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 0,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 392131})}),
   'minmaxValues': {'str': {'minimum': 'Afghan',
     'maximum': 'Vietnamese/Cambodian/Malaysia'}}}},
 {'column': 'INSPECTION DATE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 0,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'date': 392131})}),
   'minmaxValues': {'date': {'minimum': datetime.datetime(1900, 1, 1, 0, 0),
     'maximum': datetime.datetime(2019, 9, 27, 0, 0)}}}},
 {'column': 'ACTION',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 1337,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 390794})}),
   'minmaxValues': {'str': {'minimum': 'Establishment Closed by DOHMH.  Violations were cited in the following area(s) and those requiring immediate action were addressed.',
     'maximum': 'Violations were cited in the following area(s).'}}}},
 {'column': 'VIOLATION CODE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 5696,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 386272, 'float': 163})}),
   'minmaxValues': {'str': {'minimum': '02A', 'maximum': '22G'},
    'float': {'minimum': 1500.0, 'maximum': 15000.0}}}},
 {'column': 'VIOLATION DESCRIPTION',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 8918,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 383213})}),
   'minmaxValues': {'str': {'minimum': '""No Smoking” and/or \'Smoking Permitted” sign not conspicuously posted. Health warning not present on \'Smoking Permitted”',
     'maximum': '“Choking first aid” poster not posted. “Alcohol and pregnancy” warning sign not posted. Resuscitation equipment: exhaled air resuscitation masks (adult & pediatric), latex gloves, sign not posted. Inspection report sign not posted.'}}}},
 {'column': 'CRITICAL FLAG',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 8918,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 383213})}),
   'minmaxValues': {'str': {'minimum': 'N', 'maximum': 'Y'}}}},
 {'column': 'SCORE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 17046,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 375085})}),
   'minmaxValues': {'int': {'minimum': -1, 'maximum': 164}}}},
 {'column': 'GRADE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 193610,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 198521})}),
   'minmaxValues': {'str': {'minimum': 'A', 'maximum': 'Z'}}}},
 {'column': 'GRADE DATE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 195416,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'date': 196715})}),
   'minmaxValues': {'date': {'minimum': datetime.datetime(2013, 6, 7, 0, 0),
     'maximum': datetime.datetime(2019, 9, 26, 0, 0)}}}},
 {'column': 'RECORD DATE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 0,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'date': 392131})}),
   'minmaxValues': {'date': {'minimum': datetime.datetime(2019, 9, 28, 0, 0),
     'maximum': datetime.datetime(2019, 9, 28, 0, 0)}}}},
 {'column': 'INSPECTION TYPE',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 1337,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 390794})}),
   'minmaxValues': {'str': {'minimum': 'Administrative Miscellaneous / Compliance Inspection',
     'maximum': 'Trans Fat / Second Compliance Inspection'}}}},
 {'column': 'Latitude',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 430,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'float': 386144, 'int': 5557})}),
   'minmaxValues': {'float': {'minimum': 40.508068517904,
     'maximum': 40.912822326386},
    'int': {'minimum': 0, 'maximum': 0}}}},
 {'column': 'Longitude',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 430,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'float': 386144, 'int': 5557})}),
   'minmaxValues': {'float': {'minimum': -74.248434473464,
     'maximum': -73.700928057808},
    'int': {'minimum': 0, 'maximum': 0}}}},
 {'column': 'Community Board',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 5987,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 386144})}),
   'minmaxValues': {'int': {'minimum': 101, 'maximum': 595}}}},
 {'column': 'Council District',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 5987,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 386144})}),
   'minmaxValues': {'int': {'minimum': 1, 'maximum': 51}}}},
 {'column': 'Census Tract',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 5987,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 386144})}),
   'minmaxValues': {'int': {'minimum': 100, 'maximum': 162100}}}},
 {'column': 'BIN',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 7670,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 384461})}),
   'minmaxValues': {'int': {'minimum': 1000000, 'maximum': 5799501}}}},
 {'column': 'BBL',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 430,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'int': 391701})}),
   'minmaxValues': {'int': {'minimum': 1, 'maximum': 5270000501}}}},
 {'column': 'NTA',
  'stats': {'totalValueCount': 392131,
   'emptyValueCount': 5987,
   'datatypes': defaultdict(collections.Counter,
               {'total': Counter({'str': 386144})}),
   'minmaxValues': {'str': {'minimum': 'BK09', 'maximum': 'SI54'}}}}]
[6]:
# Print number of empty cells for each column

profiles.stats()['empty']
[6]:
CAMIS                         0
DBA                         588
BORO                          0
BUILDING                    231
STREET                        0
ZIPCODE                    5557
PHONE                        11
CUISINE DESCRIPTION           0
INSPECTION DATE               0
ACTION                     1337
VIOLATION CODE             5696
VIOLATION DESCRIPTION      8918
CRITICAL FLAG              8918
SCORE                     17046
GRADE                    193610
GRADE DATE               195416
RECORD DATE                   0
INSPECTION TYPE            1337
Latitude                    430
Longitude                   430
Community Board            5987
Council District           5987
Census Tract               5987
BIN                        7670
BBL                         430
NTA                        5987
Name: empty, dtype: int64
[7]:
# How many records are there for date '09/28/2019'
from openclean.function.eval.base import Col

ds.filter(Col('RECORD DATE') == '09/28/2019').count()
[7]:
392131
[8]:
# Print column type information (as data frame)

profiles.types()
[8]:
date float int str
CAMIS 0 0 392131 0
DBA 10 0 55 391478
BORO 0 0 116 392015
BUILDING 180 0 381625 10095
STREET 0 0 0 392131
ZIPCODE 0 0 386558 16
PHONE 0 0 391725 395
CUISINE DESCRIPTION 0 0 0 392131
INSPECTION DATE 392131 0 0 0
ACTION 0 0 0 390794
VIOLATION CODE 0 163 0 386272
VIOLATION DESCRIPTION 0 0 0 383213
CRITICAL FLAG 0 0 0 383213
SCORE 0 0 375085 0
GRADE 0 0 0 198521
GRADE DATE 196715 0 0 0
RECORD DATE 392131 0 0 0
INSPECTION TYPE 0 0 0 390794
Latitude 0 386144 5557 0
Longitude 0 386144 5557 0
Community Board 0 0 386144 0
Council District 0 0 386144 0
Census Tract 0 0 386144 0
BIN 0 0 384461 0
BBL 0 0 391701 0
NTA 0 0 0 386144

Datatype Outliers (?)

Note that type detection is difficult and can be ambigious for some values. Let us take a closer look at some columns that appear to have values of different types.

[9]:
# Get a quick look at the columns that contain values of
# different (raw) data types.

profiles.multitype_columns().types()
[9]:
date float int str
DBA 10 0 55 391478
BORO 0 0 116 392015
BUILDING 180 0 381625 10095
ZIPCODE 0 0 386558 16
PHONE 0 0 391725 395
VIOLATION CODE 0 163 0 386272
Latitude 0 386144 5557 0
Longitude 0 386144 5557 0
[10]:
# Get minimum and maximum values for each datatype in columns 'DBA'

profiles.minmax('DBA')
[10]:
min max
str #1 Chinese Restaurant zx accounting service llc
int 33 1976
date 1983-01-02 00:00:00 1983-01-02 00:00:00
[11]:
# Which values are identified as 'date' in column 'BUILDING'

from openclean.function.eval.datatype import IsDatetime

ds.select('BUILDING').filter(IsDatetime('BUILDING')).distinct()
[11]:
Counter({'271 1/2': 7,
         '94 1/2': 23,
         '168 1/2': 5,
         '3906 1/2': 12,
         '34   1/2': 37,
         '120 1/2': 10,
         '227-02/08': 35,
         '134 1/2': 7,
         '1045 1/2': 19,
         '22 1-2': 6,
         '42 1/2': 3,
         '67 1/2': 6,
         '501 1/2': 6,
         '48 1/2': 3,
         '27 1/2': 1})
[12]:
# Filter distinct values in column 'VIOLATION CODE' that were recognized
# as being of type float.

from openclean.function.eval.base import Col
from openclean.function.eval.datatype import IsFloat

ds.filter(IsFloat('VIOLATION CODE')).distinct('VIOLATION CODE')
[12]:
Counter({'15E2': 134, '15E3': 29})
[13]:
# List the distinct values (and their total counts) for
# column 'BORO'

ds.distinct('BORO')
[13]:
Counter({'Manhattan': 153958,
         'Bronx': 35761,
         'Queens': 89963,
         'Brooklyn': 99317,
         'Staten Island': 13016,
         '0': 116})

Boroughs that are ‘0’

Take a look at the rows that have 0 as ‘BORO’. Is it possible to infer the borough from other columns soch as BBL (block-borough-lot) or NTA (Neighborhood Tabulation Area).

[14]:
# Profile only those rows that have a 'BORO' value '0'. Here we use the default
# column profiler that generates sets of distinct values for each column.

from openclean.function.eval.base import Col
from openclean.profiling.column import DefaultColumnProfiler

boro_0 = ds.filter(Col('BORO') == '0').profile(default_profiler=DefaultColumnProfiler)
boro_0.stats()
[14]:
total empty distinct uniqueness entropy
CAMIS 116 0 21 0.181034 2.758740
DBA 116 9 12 0.112150 2.297430
BORO 116 0 1 0.008621 0.000000
BUILDING 116 0 6 0.051724 2.207016
STREET 116 0 6 0.051724 2.207016
ZIPCODE 116 0 4 0.034483 1.319701
PHONE 116 0 21 0.181034 2.758740
CUISINE DESCRIPTION 116 0 4 0.034483 1.596319
INSPECTION DATE 116 0 32 0.275862 4.709064
ACTION 116 16 1 0.010000 0.000000
VIOLATION CODE 116 16 30 0.300000 4.367569
VIOLATION DESCRIPTION 116 16 30 0.300000 4.367569
CRITICAL FLAG 116 16 2 0.020000 0.976500
SCORE 116 20 18 0.187500 3.881308
GRADE 116 68 3 0.062500 1.199460
GRADE DATE 116 68 17 0.354167 3.917481
RECORD DATE 116 0 1 0.008621 0.000000
INSPECTION TYPE 116 16 7 0.070000 1.986198
Latitude 116 116 0 NaN NaN
Longitude 116 116 0 NaN NaN
Community Board 116 116 0 NaN NaN
Council District 116 116 0 NaN NaN
Census Tract 116 116 0 NaN NaN
BIN 116 116 0 NaN NaN
BBL 116 116 0 NaN NaN
NTA 116 116 0 NaN NaN
[15]:
# There are only four different zip codes. The different values and their
# frequency are included in the profiler results.

boro_0.column('ZIPCODE')['topValues']
[15]:
[('11249', 81), ('N/A', 16), ('10168', 14), ('10285', 5)]
[16]:
# What are the boroughs that are associated with zip codes
# '10168', '10285' in our dataset.

from openclean.function.eval.domain import IsIn

ds.filter(IsIn('ZIPCODE', {'10168', '10285'})).select('BORO').distinct()
[16]:
Counter({'Manhattan': 8, '0': 19})
[17]:
# What are the boroughs that are associated with zip code
# '11249' in our dataset.

ds.filter(Col('ZIPCODE') == '11249').select('BORO').distinct()
[17]:
Counter({'Brooklyn': 2358, '0': 81})
[18]:
# NOTE (for cleaning step):
#
# It seems save to replace '0' in column 'BORO' for those rows that have
# zip codes in ['11249', '10168', '10285'] using the mapping below.Also
# delete those records that have 'ZIPCODE' 'N/A'

boro_0_mapping = {'11249': 'Brooklyn', '10168': 'Manhattan', '10285': 'Manhattan'}

New Establishments

From the dataset description: “… Records are also included for each restaurant that has applied for a permit but has not yet been inspected and for inspections resulting in no violations. Establishments with inspection date of 1/1/1900 are new establishments that have not yet received an inspection. Restaurants that received no violations are represented by a single row and coded as having no violations using the ACTION field.

We expect that records with an ‘INSPECTION DATE’ 1900/1/1 have no value for ‘ACTION’, ‘INSPECTION TYPE’, ‘SCORE’, and ‘GRADE’.

[19]:
from datetime import datetime
from openclean.function.eval.datatype import Datetime
from openclean.profiling.column import DefaultColumnProfiler

# Date that identifies new establishments that have not
# been inspected.
new_establ_date = datetime(1900, 1, 1)

# Count total, empty, and distinct values for the columns that we
# expect to be empty for new establishments.
ds.filter(Datetime('INSPECTION DATE') == new_establ_date)\
    .select(['ACTION', 'INSPECTION TYPE', 'SCORE', 'GRADE'])\
    .profile(default_profiler=DefaultColumnProfiler)\
    .stats()
[19]:
total empty distinct uniqueness entropy
ACTION 1337 1337 0 None None
INSPECTION TYPE 1337 1337 0 None None
SCORE 1337 1337 0 None None
GRADE 1337 1337 0 None None
[20]:
# NOTE (for cleaning step):
#
# Delete records with date 1900/1/1.

Missing Scores

We are interested in the overall scores that were assigned to individual establishments as a result of the inspection. For new establishments we have seen that the score can be missing in some cases. Are there other records that do not have a ‘SCORE’?

[21]:
from openclean.function.eval.logic import And
from openclean.function.eval.datatype import Datetime
from openclean.function.eval.null import IsEmpty

df = ds.filter(And(Datetime('INSPECTION DATE') != new_establ_date, IsEmpty('SCORE'))).to_df()
[22]:
df.shape
[22]:
(15709, 26)
[23]:
df['INSPECTION TYPE'].value_counts()
[23]:
Administrative Miscellaneous / Initial Inspection              7018
Smoke-Free Air Act / Initial Inspection                        2320
Administrative Miscellaneous / Re-inspection                   2063
Trans Fat / Initial Inspection                                 1648
Calorie Posting / Initial Inspection                           1123
Smoke-Free Air Act / Re-inspection                              586
Trans Fat / Re-inspection                                       394
Calorie Posting / Re-inspection                                 233
Administrative Miscellaneous / Compliance Inspection            118
Administrative Miscellaneous / Reopening Inspection             102
Trans Fat / Compliance Inspection                                42
Administrative Miscellaneous / Second Compliance Inspection      19
Smoke-Free Air Act / Compliance Inspection                       17
Calorie Posting / Compliance Inspection                          16
Trans Fat / Second Compliance Inspection                          7
Smoke-Free Air Act / Limited Inspection                           2
Trans Fat / Limited Inspection                                    1
Name: INSPECTION TYPE, dtype: int64
[24]:
# QUESTION:
#
# IS there an easy way to get the difference between distinct value counts for those records that
# have a score and tose that don't. That is, are there INSPECTION TYPE's that never result in score
# and others that always have a score?

# NOTE (for cleaning step):
#
# Delete records with empty score.

Cuisine Types

We want to compare inspection results by cuisine type. Get a quick look at the different values in that column.

[25]:
ds.distinct('CUISINE DESCRIPTION')
[25]:
Counter({'Bakery': 12305,
         'American': 82709,
         'Asian': 6237,
         'Tex-Mex': 1786,
         'Chinese': 41431,
         'Mexican': 15811,
         'Korean': 5431,
         'Pizza/Italian': 8354,
         'Japanese': 13956,
         'Donuts': 5199,
         'Sandwiches': 4108,
         'Pizza': 17536,
         'Caribbean': 14080,
         'French': 4499,
         'Delicatessen': 6257,
         'Seafood': 3092,
         'Café/Coffee/Tea': 18833,
         'Spanish': 12231,
         'Salads': 956,
         'Juice, Smoothies, Fruit Salads': 4573,
         'Afghan': 190,
         'Chinese/Japanese': 884,
         'Filipino': 666,
         'Thai': 5334,
         'Irish': 3336,
         'Latin (Cuban, Dominican, Puerto Rican, South & Central American)': 17117,
         'Eastern European': 1369,
         'Turkish': 1179,
         'Italian': 15798,
         'Bottled beverages, including water, sodas, juices, etc.': 1202,
         'Hamburgers': 4385,
         'African': 1729,
         'Chicken': 7337,
         'Other': 2535,
         'Barbecue': 764,
         'Jewish/Kosher': 5550,
         'Greek': 2162,
         'English': 207,
         'Bangladeshi': 1193,
         'Sandwiches/Salads/Mixed Buffet': 2740,
         'Middle Eastern': 2927,
         'Pakistani': 651,
         'Bagels/Pretzels': 2630,
         'Soul Food': 1138,
         'Steak': 1277,
         'German': 428,
         'Peruvian': 1849,
         'Indian': 6841,
         'Brazilian': 488,
         'Tapas': 516,
         'Hawaiian': 496,
         'Russian': 1086,
         'Vietnamese/Cambodian/Malaysia': 1895,
         'Mediterranean': 4219,
         'Ice Cream, Gelato, Yogurt, Ices': 3044,
         'Vegetarian': 1874,
         'Continental': 629,
         'Soups & Sandwiches': 552,
         'Hotdogs/Pretzels': 269,
         'Egyptian': 188,
         'Chinese/Cuban': 420,
         'Creole': 526,
         'Moroccan': 149,
         'Polish': 542,
         'Ethiopian': 231,
         'Portuguese': 133,
         'Australian': 378,
         'Fruits/Vegetables': 90,
         'Armenian': 305,
         'Czech': 41,
         'Hotdogs': 185,
         'Scandinavian': 88,
         'Soups': 56,
         'Nuts/Confectionary': 34,
         'Creole/Cajun': 133,
         'Southwestern': 80,
         'Iranian': 60,
         'Pancakes/Waffles': 221,
         'Not Listed/Not Applicable': 98,
         'Chilean': 23,
         'Cajun': 82,
         'Californian': 90,
         'Indonesian': 101,
         'Basque': 7})