Data Profiling
openclean comes with pre-configured tools to profile datasets that help to report some actionable metrics. It also provides a fairly easy-to-implement interface for users to create/attach their own data profilers. A user can select the default profiler to get basic statistics (min/max, entropy, distinct values, datatypes, etc) or plug in their own profilers for advanced computations.
We use a sample of NYC open data with completed job codes at various locations in New York City to demonstrate some examples.
import os
path_to_file = os.path.join(os.getcwd(), 'source', 'data')
from openclean.data.load import dataset
ds = dataset(os.path.join(path_to_file, 'job_locations.csv'))
ds.head()
Job # | Borough | Street Name | GIS_NTA_NAME | |
---|---|---|---|---|
0 | 140915936 | MANHATTAN | EAST 93RD STREET | Upper East Side-Carnegie Hill |
1 | 340737929 | BROOKLYN | 16TH AVE | Bensonhurst West |
2 | 440601733 | QUEENS | AUSTIN STREET | Kew Gardens |
3 | 340737901 | BROOKLYN | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
4 | 240275910 | BRONX | CEDAR AVENUE | University Heights-Morris Heights |
Using the openclean profiler
from openclean.profiling.dataset import dataset_profile
profiles = dataset_profile(ds)
print(profiles)
[{'column': 'Job #', 'stats': {'totalValueCount': 10000, 'emptyValueCount': 0, 'datatypes': defaultdict(<class 'collections.Counter'>, {'total': Counter({'int': 10000}), 'distinct': Counter({'int': 9329})}), 'minmaxValues': {'int': {'minimum': 102020858, 'maximum': 540182354}}, 'distinctValueCount': 9329, 'entropy': 13.136358755966901, 'topValues': [('121191138', 7), ('301779396', 6), ('320621993', 6), ('121190549', 6), ('121328660', 5), ('121204428', 5), ('121204963', 5), ('121191325', 5), ('121205481', 5), ('320594120', 5)]}}, {'column': 'Borough', 'stats': {'totalValueCount': 10000, 'emptyValueCount': 0, 'datatypes': defaultdict(<class 'collections.Counter'>, {'total': Counter({'str': 10000}), 'distinct': Counter({'str': 5})}), 'minmaxValues': {'str': {'minimum': 'BRONX', 'maximum': 'STATEN ISLAND'}}, 'distinctValueCount': 5, 'entropy': 2.066625059397576, 'topValues': [('MANHATTAN', 3443), ('BROOKLYN', 3002), ('QUEENS', 2093), ('BRONX', 933), ('STATEN ISLAND', 529)]}}, {'column': 'Street Name', 'stats': {'totalValueCount': 10000, 'emptyValueCount': 0, 'datatypes': defaultdict(<class 'collections.Counter'>, {'total': Counter({'str': 9985, 'date': 15}), 'distinct': Counter({'str': 3678, 'date': 13})}), 'minmaxValues': {'str': {'minimum': '1 AVE', 'maximum': 'ZEREGA AVENUE'}, 'date': {'minimum': datetime.datetime(103, 4, 3, 0, 0), 'maximum': datetime.datetime(2065, 4, 3, 0, 0)}}, 'distinctValueCount': 3691, 'entropy': 10.99080989068581, 'topValues': [('BROADWAY', 231), ('PARK AVENUE', 112), ('FIFTH AVENUE', 73), ('MADISON AVENUE', 66), ('LEXINGTON AVENUE', 65), ('5TH AVENUE', 62), ('AVENUE OF THE AMERICAS', 49), ('WEST END AVENUE', 41), ('3RD AVENUE', 35), ('THIRD AVENUE', 35)]}}, {'column': 'GIS_NTA_NAME', 'stats': {'totalValueCount': 10000, 'emptyValueCount': 29, 'datatypes': defaultdict(<class 'collections.Counter'>, {'total': Counter({'str': 9971}), 'distinct': Counter({'str': 192})}), 'minmaxValues': {'str': {'minimum': 'Allerton-Pelham Gardens', 'maximum': 'park-cemetery-etc-Queens'}}, 'distinctValueCount': 192, 'entropy': 7.040697875315458, 'topValues': [('Midtown-Midtown South', 471), ('Hudson Yards-Chelsea-Flatiron-Union Square', 307), ('Upper East Side-Carnegie Hill', 241), ('Upper West Side', 230), ('Turtle Bay-East Midtown', 229), ('SoHo-TriBeCa-Civic Center-Little Italy', 214), ('West Village', 202), ('Bedford', 180), ('Lincoln Square', 178), ('Battery Park City-Lower Manhattan', 163)]}}]
These profiled results can be accessed directly:
# see all stats
print(profiles.stats())
total empty distinct uniqueness entropy
Job # 10000 0 9329 0.932900 13.136359
Borough 10000 0 5 0.000500 2.066625
Street Name 10000 0 3691 0.369100 10.990810
GIS_NTA_NAME 10000 29 192 0.019256 7.040698
or can be queried:
# query the minimum and maximum in the Borough column
print(profiles.minmax('Borough'))
min max
str BRONX STATEN ISLAND
To see different data types in the column:
# look at all the datatypes in the dataset
print(profiles.types())
date int str
Job # 0 9329 0
Borough 0 0 5
Street Name 13 0 3678
GIS_NTA_NAME 0 0 192
We also realize from this that one good exercise to ensure data quality could be to look at the Street Name column’s values that have been classified as dates.
Visualizing profiled results
The openclean-notebook allows profiled results to be visually seen in the notebook. The following screengrab demonstrates this using the Auctus profiler with the openclean-notebook spreadsheet UI: