Welcome to openclean’s Documentation!
openclean is a Python library for data profiling and data cleaning. The project is motivated by the fact that data preparation is still a major bottleneck for many data science projects. Data preparation requires profiling to gain an understanding of data quality issues, and data manipulation to transform the data into a form that is fit for the intended purpose.
Many tools and techniques have been developed for profiling and cleaning data, but the main issue we see is that they are not accessible through a single, unified framework. Existing tools may be implemented in different programming languages and require significant effort to install and interface with. In other cases, promising data cleaning methods have been published in the scientific literature but no suitable codebase is available for them. We believe that this lack of seamless access to existing work is a major reason why data preparation is so time-consuming.
The goal of openclean is to bring together data cleaning tools in a single environment that is easy and intuitive for a data scientist to use. openclean allows users to compose and execute cleaning pipelines that are built from a variety of different tools. We aim for openclean to be flexible and extensible so that new functionality is easy to integrate. To this end, we define a set of primitives and APIs for the different types of operators (actors) in openclean pipelines.
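To illustrate what composing a cleaning pipeline from small operators can look like, here is a minimal sketch in plain Python. It does not use the openclean API: the names `Operator`, `strip_whitespace`, `drop_empty`, `uppercase`, and `pipeline` are hypothetical and exist only for this example. The sections listed below describe the operators that openclean itself provides.

```python
from functools import reduce
from typing import Callable, Dict, Iterable

# Hypothetical types for this sketch only; not part of openclean.
Row = Dict[str, str]
Operator = Callable[[Iterable[Row]], Iterable[Row]]


def strip_whitespace(column: str) -> Operator:
    """Return an operator that trims surrounding whitespace in a column."""
    def apply(rows: Iterable[Row]) -> Iterable[Row]:
        for row in rows:
            yield {**row, column: row[column].strip()}
    return apply


def drop_empty(column: str) -> Operator:
    """Return an operator that removes rows with an empty value in a column."""
    def apply(rows: Iterable[Row]) -> Iterable[Row]:
        return (row for row in rows if row[column])
    return apply


def uppercase(column: str) -> Operator:
    """Return an operator that converts a column's values to upper case."""
    def apply(rows: Iterable[Row]) -> Iterable[Row]:
        for row in rows:
            yield {**row, column: row[column].upper()}
    return apply


def pipeline(*operators: Operator) -> Operator:
    """Chain operators so that each one consumes the previous one's output."""
    def run(rows: Iterable[Row]) -> Iterable[Row]:
        return reduce(lambda data, op: op(data), operators, rows)
    return run


rows = [{"city": " brooklyn "}, {"city": ""}, {"city": "Queens"}]
clean = pipeline(strip_whitespace("city"), drop_empty("city"), uppercase("city"))
result = list(clean(rows))
print(result)
```

Because every operator shares the same signature in this sketch, new steps can be added to a pipeline without changing the existing ones, which is the kind of extensibility the paragraph above describes.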
- Installation
- Getting Started
- Data Model
- Data Profiling
- Data Transformation
- Data Wrangling and Cleaning
- Data Enrichment
- Data Provenance
- Step by Step Guides
- Downloading master data from Reference Data Repository
- Downloading DOB Job Application Filings from Socrata
- Misspellings in Country Names
- Statistical Outliers in City names
- Misspellings of Brooklyn
- Profiling - DOHMH New York City Restaurant Inspection Results
- Wrangling - DOHMH New York City Restaurant Inspection Results
- Features
- Setting up
- Loading data
- Profiling
- Transformations
- kNN Clustering - DOHMH New York City Restaurant Inspection Results
- Functional Dependency Violations
- Token Signature Outliers for Street Names
- Standardization of Street Names
- User-defined Functions
- Engine - Datastore
- Extensions
- Configuration
- Contributing
- Frequently Asked Questions