Downloading master data from Reference Data Repository

openclean integrates the refdata package to provides easy access to several different reference datasets that are available online for download. Reference datasets are for example a great source for lookup tables and mappings that are used in data cleaning for outlier detection and data standardization.

[1]:
# Setup the environment for this demo. All files will be
# stored in a subfolder data.

import os

from openclean.config import ENV_DATA_DIR
from refdata.config import ENV_BASEDIR

os.environ[ENV_DATA_DIR] = './data/archives'
os.environ[ENV_BASEDIR] = './data/refdata'
[2]:
# Import the amasterdata and reference data modules.

import openclean.data.archive.base as masterdata
import openclean.data.refdata as refdata
[3]:
# Print a listing for all dataset in the default
# reference data repository.

for dataset in refdata.repository():
    print('{} ({})'.format(dataset.name, dataset.identifier))
Company Suffixes (company_suffixes)
Cities in the U.S. (encyclopaedia_britannica:us_cities)
NYC Finance - State Codes (nyc.gov:dof:state_codes)
REST Countries (restcountries.eu)
C1 Street Suffix Abbreviations (usps:street_abbrev)
C2 Secondary Unit Designators (usps:secondary_unit_designators)
States and territories of the U.S. (wikipedia:us_states)

restcountries.eu

Dataset of countries in the world that is available from the restcountries.eu project.

[4]:
# Download the countries dataset.

refdata.download('restcountries.eu')
[5]:
# Print the first rows in the countries dataset.

countries = refdata.load('restcountries.eu').df()

print(countries.head())
             name alpha2Code alpha3Code    capital   region        subregion
0     Afghanistan         AF        AFG      Kabul     Asia    Southern Asia
1   Åland Islands         AX        ALA  Mariehamn   Europe  Northern Europe
2         Albania         AL        ALB     Tirana   Europe  Southern Europe
3         Algeria         DZ        DZA    Algiers   Africa  Northern Africa
4  American Samoa         AS        ASM  Pago Pago  Oceania        Polynesia
[6]:
# Print entry for 'Venezuela'

countries.loc[countries['name'] == 'Venezuela (Bolivarian Republic of)']
[6]:
name alpha2Code alpha3Code capital region subregion
243 Venezuela (Bolivarian Republic of) VE VEN Caracas Americas South America

Modified Mastedata Copies

The user has the option to modify the downloaded data and store the updated dataset version in the local masterdata repository.

[7]:
# Create a local masterdata archive for the countries dataset.

archive = masterdata.create('restcountries', source=countries, primary_key=['alpha3Code'], replace=True)
[8]:
# Rename 'Venezuela (Bolivarian Republic of)' to 'Venezuela'.

from openclean.operator.transform.update import update

countries = update(countries, 'name', {'Venezuela (Bolivarian Republic of)': 'Venezuela'})
countries.loc[countries['name'] == 'Venezuela']
[8]:
name alpha2Code alpha3Code capital region subregion
243 Venezuela VE VEN Caracas Americas South America
[9]:
countries.head()
[9]:
name alpha2Code alpha3Code capital region subregion
0 Afghanistan AF AFG Kabul Asia Southern Asia
1 Åland Islands AX ALA Mariehamn Europe Northern Europe
2 Albania AL ALB Tirana Europe Southern Europe
3 Algeria DZ DZA Algiers Africa Northern Africa
4 American Samoa AS ASM Pago Pago Oceania Polynesia
[10]:
# Store the modified dataset as a new snapshot in the masterdata
# archive.

archive = masterdata.get('restcountries')
archive.commit(countries)
[10]:
<Snapshot (version=1 description='' at=2021-04-27 13:54:01.371209-04:00)>
[11]:
countries = masterdata.get('restcountries').checkout()
countries.loc[countries['name'] == 'Venezuela']
[11]:
name alpha2Code alpha3Code capital region subregion
239 Venezuela VE VEN Caracas Americas South America
[12]:
snapshots = masterdata.get('restcountries').snapshots()
for s in snapshots:
    print(s)
<Snapshot (version=0 description='' at=2021-04-27 13:53:59.129191-04:00)>
<Snapshot (version=1 description='' at=2021-04-27 13:54:01.371209-04:00)>
[13]:
prov = masterdata.get('restcountries').diff(0, 1)
prov.describe()
Schema Changes
==============
Inserted Columns : 0
Deleted Columns  : 0
Moved Columns    : 0
Renamed Columns  : 0

Data Changes
============
Inserted Rows    : 0
Deleted Rows     : 0
Moved Rows       : 0
Updated Rows     : 1
Updated Values   : 1
[14]:
row = prov.rows().update()[0]
old_val, new_val = row.cells[0].values()

print("Country name updated from '{}' to '{}'".format(old_val, new_val))
Country name updated from 'Venezuela (Bolivarian Republic of)' to 'Venezuela'

Encyclopaedia Britannica

Datasets that are extracted from Web pages of the Encyclopaedia Britannica.

[15]:
# Get dataset with US city names. Used the auto_download
# option that will fetch the dataset if it has not been
# downloaded yet.

us_cities = refdata.load(key='encyclopaedia_britannica:us_cities', auto_download=True).df()
us_cities.head()
[15]:
city state
0 Demopolis Alabama
1 Sylacauga Alabama
2 Troy Alabama
3 Dothan Alabama
4 Prichard Alabama

Cleanup

[16]:
# Remove the created data folder.

import shutil

shutil.rmtree('./data')