Data Profiling

openclean comes with pre-configured tools to profile datasets that help to report some actionable metrics. It also provides a fairly easy-to-implement interface for users to create/attach their own data profilers. A user can select the default profiler to get basic statistics (min/max, entropy, distinct values, datatypes, etc) or plug in their own profilers for advanced computations.

We use a sample of NYC open data with completed job codes at various locations in New York City to demonstrate some examples.

import os
path_to_file = os.path.join(os.getcwd(), 'source', 'data')

from openclean.data.load import dataset

ds = dataset(os.path.join(path_to_file, 'job_locations.csv'))

ds.head()

	Job #	Borough	Street Name	GIS_NTA_NAME
0	140915936	MANHATTAN	EAST 93RD STREET	Upper East Side-Carnegie Hill
1	340737929	BROOKLYN	16TH AVE	Bensonhurst West
2	440601733	QUEENS	AUSTIN STREET	Kew Gardens
3	340737901	BROOKLYN	3RD STREET	Carroll Gardens-Columbia Street-Red Hook
4	240275910	BRONX	CEDAR AVENUE	University Heights-Morris Heights

Using the openclean profiler

from openclean.profiling.dataset import dataset_profile

profiles = dataset_profile(ds)
print(profiles)

[{'column': 'Job #', 'stats': {'totalValueCount': 10000, 'emptyValueCount': 0, 'datatypes': defaultdict(<class 'collections.Counter'>, {'total': Counter({'int': 10000}), 'distinct': Counter({'int': 9329})}), 'minmaxValues': {'int': {'minimum': 102020858, 'maximum': 540182354}}, 'distinctValueCount': 9329, 'entropy': 13.136358755966901, 'topValues': [('121191138', 7), ('301779396', 6), ('320621993', 6), ('121190549', 6), ('121328660', 5), ('121204428', 5), ('121204963', 5), ('121191325', 5), ('121205481', 5), ('320594120', 5)]}}, {'column': 'Borough', 'stats': {'totalValueCount': 10000, 'emptyValueCount': 0, 'datatypes': defaultdict(<class 'collections.Counter'>, {'total': Counter({'str': 10000}), 'distinct': Counter({'str': 5})}), 'minmaxValues': {'str': {'minimum': 'BRONX', 'maximum': 'STATEN ISLAND'}}, 'distinctValueCount': 5, 'entropy': 2.066625059397576, 'topValues': [('MANHATTAN', 3443), ('BROOKLYN', 3002), ('QUEENS', 2093), ('BRONX', 933), ('STATEN ISLAND', 529)]}}, {'column': 'Street Name', 'stats': {'totalValueCount': 10000, 'emptyValueCount': 0, 'datatypes': defaultdict(<class 'collections.Counter'>, {'total': Counter({'str': 9985, 'date': 15}), 'distinct': Counter({'str': 3678, 'date': 13})}), 'minmaxValues': {'str': {'minimum': '1 AVE', 'maximum': 'ZEREGA AVENUE'}, 'date': {'minimum': datetime.datetime(103, 4, 3, 0, 0), 'maximum': datetime.datetime(2065, 4, 3, 0, 0)}}, 'distinctValueCount': 3691, 'entropy': 10.99080989068581, 'topValues': [('BROADWAY', 231), ('PARK AVENUE', 112), ('FIFTH AVENUE', 73), ('MADISON AVENUE', 66), ('LEXINGTON AVENUE', 65), ('5TH AVENUE', 62), ('AVENUE OF THE AMERICAS', 49), ('WEST END AVENUE', 41), ('3RD AVENUE', 35), ('THIRD AVENUE', 35)]}}, {'column': 'GIS_NTA_NAME', 'stats': {'totalValueCount': 10000, 'emptyValueCount': 29, 'datatypes': defaultdict(<class 'collections.Counter'>, {'total': Counter({'str': 9971}), 'distinct': Counter({'str': 192})}), 'minmaxValues': {'str': {'minimum': 'Allerton-Pelham Gardens', 'maximum': 'park-cemetery-etc-Queens'}}, 'distinctValueCount': 192, 'entropy': 7.040697875315458, 'topValues': [('Midtown-Midtown South', 471), ('Hudson Yards-Chelsea-Flatiron-Union Square', 307), ('Upper East Side-Carnegie Hill', 241), ('Upper West Side', 230), ('Turtle Bay-East Midtown', 229), ('SoHo-TriBeCa-Civic Center-Little Italy', 214), ('West Village', 202), ('Bedford', 180), ('Lincoln Square', 178), ('Battery Park City-Lower Manhattan', 163)]}}]

These profiled results can be accessed directly:

# see all stats
print(profiles.stats())

              total  empty  distinct  uniqueness    entropy
Job #         10000      0      9329    0.932900  13.136359
Borough       10000      0         5    0.000500   2.066625
Street Name   10000      0      3691    0.369100  10.990810
GIS_NTA_NAME  10000     29       192    0.019256   7.040698

or can be queried:

# query the minimum and maximum in the Borough column
print(profiles.minmax('Borough'))

       min            max
str  BRONX  STATEN ISLAND

To see different data types in the column:

# look at all the datatypes in the dataset
print(profiles.types())

              date   int   str
Job #            0  9329     0
Borough          0     0     5
Street Name     13     0  3678
GIS_NTA_NAME     0     0   192

We also realize from this that one good exercise to ensure data quality could be to look at the Street Name column’s values that have been classified as dates.

Visualizing profiled results

The openclean-notebook allows profiled results to be visually seen in the notebook. The following screengrab demonstrates this using the Auctus profiler with the openclean-notebook spreadsheet UI: