Data Profiling

openclean comes with pre-configured tools to profile datasets that help to report some actionable metrics. It also provides a fairly easy-to-implement interface for users to create/attach their own data profilers. A user can select the default profiler to get basic statistics (min/max, entropy, distinct values, datatypes, etc) or plug in their own profilers for advanced computations.

We use a sample of NYC open data with completed job codes at various locations in New York City to demonstrate some examples.

import os
path_to_file = os.path.join(os.getcwd(), 'source', 'data')
from openclean.data.load import dataset

ds = dataset(os.path.join(path_to_file, 'job_locations.csv'))

ds.head()
Job # Borough Street Name GIS_NTA_NAME
0 140915936 MANHATTAN EAST 93RD STREET Upper East Side-Carnegie Hill
1 340737929 BROOKLYN 16TH AVE Bensonhurst West
2 440601733 QUEENS AUSTIN STREET Kew Gardens
3 340737901 BROOKLYN 3RD STREET Carroll Gardens-Columbia Street-Red Hook
4 240275910 BRONX CEDAR AVENUE University Heights-Morris Heights

Using the openclean profiler

from openclean.profiling.dataset import dataset_profile

profiles = dataset_profile(ds)
print(profiles)
[{'column': 'Job #', 'stats': {'totalValueCount': 10000, 'emptyValueCount': 0, 'datatypes': defaultdict(<class 'collections.Counter'>, {'total': Counter({'int': 10000}), 'distinct': Counter({'int': 9329})}), 'minmaxValues': {'int': {'minimum': 102020858, 'maximum': 540182354}}, 'distinctValueCount': 9329, 'entropy': 13.136358755966901, 'topValues': [('121191138', 7), ('301779396', 6), ('320621993', 6), ('121190549', 6), ('121328660', 5), ('121204428', 5), ('121204963', 5), ('121191325', 5), ('121205481', 5), ('320594120', 5)]}}, {'column': 'Borough', 'stats': {'totalValueCount': 10000, 'emptyValueCount': 0, 'datatypes': defaultdict(<class 'collections.Counter'>, {'total': Counter({'str': 10000}), 'distinct': Counter({'str': 5})}), 'minmaxValues': {'str': {'minimum': 'BRONX', 'maximum': 'STATEN ISLAND'}}, 'distinctValueCount': 5, 'entropy': 2.066625059397576, 'topValues': [('MANHATTAN', 3443), ('BROOKLYN', 3002), ('QUEENS', 2093), ('BRONX', 933), ('STATEN ISLAND', 529)]}}, {'column': 'Street Name', 'stats': {'totalValueCount': 10000, 'emptyValueCount': 0, 'datatypes': defaultdict(<class 'collections.Counter'>, {'total': Counter({'str': 9985, 'date': 15}), 'distinct': Counter({'str': 3678, 'date': 13})}), 'minmaxValues': {'str': {'minimum': '1 AVE', 'maximum': 'ZEREGA AVENUE'}, 'date': {'minimum': datetime.datetime(103, 4, 3, 0, 0), 'maximum': datetime.datetime(2065, 4, 3, 0, 0)}}, 'distinctValueCount': 3691, 'entropy': 10.99080989068581, 'topValues': [('BROADWAY', 231), ('PARK AVENUE', 112), ('FIFTH AVENUE', 73), ('MADISON AVENUE', 66), ('LEXINGTON AVENUE', 65), ('5TH AVENUE', 62), ('AVENUE OF THE AMERICAS', 49), ('WEST END AVENUE', 41), ('3RD AVENUE', 35), ('THIRD AVENUE', 35)]}}, {'column': 'GIS_NTA_NAME', 'stats': {'totalValueCount': 10000, 'emptyValueCount': 29, 'datatypes': defaultdict(<class 'collections.Counter'>, {'total': Counter({'str': 9971}), 'distinct': Counter({'str': 192})}), 'minmaxValues': {'str': {'minimum': 'Allerton-Pelham Gardens', 'maximum': 'park-cemetery-etc-Queens'}}, 'distinctValueCount': 192, 'entropy': 7.040697875315458, 'topValues': [('Midtown-Midtown South', 471), ('Hudson Yards-Chelsea-Flatiron-Union Square', 307), ('Upper East Side-Carnegie Hill', 241), ('Upper West Side', 230), ('Turtle Bay-East Midtown', 229), ('SoHo-TriBeCa-Civic Center-Little Italy', 214), ('West Village', 202), ('Bedford', 180), ('Lincoln Square', 178), ('Battery Park City-Lower Manhattan', 163)]}}]

These profiled results can be accessed directly:

# see all stats
print(profiles.stats())
              total  empty  distinct  uniqueness    entropy
Job #         10000      0      9329    0.932900  13.136359
Borough       10000      0         5    0.000500   2.066625
Street Name   10000      0      3691    0.369100  10.990810
GIS_NTA_NAME  10000     29       192    0.019256   7.040698

or can be queried:

# query the minimum and maximum in the Borough column
print(profiles.minmax('Borough'))
       min            max
str  BRONX  STATEN ISLAND

To see different data types in the column:

# look at all the datatypes in the dataset
print(profiles.types())
              date   int   str
Job #            0  9329     0
Borough          0     0     5
Street Name     13     0  3678
GIS_NTA_NAME     0     0   192

We also realize from this that one good exercise to ensure data quality could be to look at the Street Name column’s values that have been classified as dates.

Visualizing profiled results

The openclean-notebook allows profiled results to be visually seen in the notebook. The following screengrab demonstrates this using the Auctus profiler with the openclean-notebook spreadsheet UI:

../_images/auctus_profiler.gif