Downloading master data from Reference Data Repository
openclean integrates the refdata package to provides easy access to several different reference datasets that are available online for download. Reference datasets are for example a great source for lookup tables and mappings that are used in data cleaning for outlier detection and data standardization.
[1]:
# Setup the environment for this demo. All files will be
# stored in a subfolder data.
import os
from openclean.config import ENV_DATA_DIR
from refdata.config import ENV_BASEDIR
os.environ[ENV_DATA_DIR] = './data/archives'
os.environ[ENV_BASEDIR] = './data/refdata'
[2]:
# Import the amasterdata and reference data modules.
import openclean.data.archive.base as masterdata
import openclean.data.refdata as refdata
[3]:
# Print a listing for all dataset in the default
# reference data repository.
for dataset in refdata.repository():
print('{} ({})'.format(dataset.name, dataset.identifier))
Company Suffixes (company_suffixes)
Cities in the U.S. (encyclopaedia_britannica:us_cities)
NYC Finance - State Codes (nyc.gov:dof:state_codes)
REST Countries (restcountries.eu)
C1 Street Suffix Abbreviations (usps:street_abbrev)
C2 Secondary Unit Designators (usps:secondary_unit_designators)
States and territories of the U.S. (wikipedia:us_states)
restcountries.eu
Dataset of countries in the world that is available from the restcountries.eu project.
[4]:
# Download the countries dataset.
refdata.download('restcountries.eu')
[5]:
# Print the first rows in the countries dataset.
countries = refdata.load('restcountries.eu').df()
print(countries.head())
name alpha2Code alpha3Code capital region subregion
0 Afghanistan AF AFG Kabul Asia Southern Asia
1 Åland Islands AX ALA Mariehamn Europe Northern Europe
2 Albania AL ALB Tirana Europe Southern Europe
3 Algeria DZ DZA Algiers Africa Northern Africa
4 American Samoa AS ASM Pago Pago Oceania Polynesia
[6]:
# Print entry for 'Venezuela'
countries.loc[countries['name'] == 'Venezuela (Bolivarian Republic of)']
[6]:
name | alpha2Code | alpha3Code | capital | region | subregion | |
---|---|---|---|---|---|---|
243 | Venezuela (Bolivarian Republic of) | VE | VEN | Caracas | Americas | South America |
Modified Mastedata Copies
The user has the option to modify the downloaded data and store the updated dataset version in the local masterdata repository.
[7]:
# Create a local masterdata archive for the countries dataset.
archive = masterdata.create('restcountries', source=countries, primary_key=['alpha3Code'], replace=True)
[8]:
# Rename 'Venezuela (Bolivarian Republic of)' to 'Venezuela'.
from openclean.operator.transform.update import update
countries = update(countries, 'name', {'Venezuela (Bolivarian Republic of)': 'Venezuela'})
countries.loc[countries['name'] == 'Venezuela']
[8]:
name | alpha2Code | alpha3Code | capital | region | subregion | |
---|---|---|---|---|---|---|
243 | Venezuela | VE | VEN | Caracas | Americas | South America |
[9]:
countries.head()
[9]:
name | alpha2Code | alpha3Code | capital | region | subregion | |
---|---|---|---|---|---|---|
0 | Afghanistan | AF | AFG | Kabul | Asia | Southern Asia |
1 | Åland Islands | AX | ALA | Mariehamn | Europe | Northern Europe |
2 | Albania | AL | ALB | Tirana | Europe | Southern Europe |
3 | Algeria | DZ | DZA | Algiers | Africa | Northern Africa |
4 | American Samoa | AS | ASM | Pago Pago | Oceania | Polynesia |
[10]:
# Store the modified dataset as a new snapshot in the masterdata
# archive.
archive = masterdata.get('restcountries')
archive.commit(countries)
[10]:
<Snapshot (version=1 description='' at=2021-04-27 13:54:01.371209-04:00)>
[11]:
countries = masterdata.get('restcountries').checkout()
countries.loc[countries['name'] == 'Venezuela']
[11]:
name | alpha2Code | alpha3Code | capital | region | subregion | |
---|---|---|---|---|---|---|
239 | Venezuela | VE | VEN | Caracas | Americas | South America |
[12]:
snapshots = masterdata.get('restcountries').snapshots()
for s in snapshots:
print(s)
<Snapshot (version=0 description='' at=2021-04-27 13:53:59.129191-04:00)>
<Snapshot (version=1 description='' at=2021-04-27 13:54:01.371209-04:00)>
[13]:
prov = masterdata.get('restcountries').diff(0, 1)
prov.describe()
Schema Changes
==============
Inserted Columns : 0
Deleted Columns : 0
Moved Columns : 0
Renamed Columns : 0
Data Changes
============
Inserted Rows : 0
Deleted Rows : 0
Moved Rows : 0
Updated Rows : 1
Updated Values : 1
[14]:
row = prov.rows().update()[0]
old_val, new_val = row.cells[0].values()
print("Country name updated from '{}' to '{}'".format(old_val, new_val))
Country name updated from 'Venezuela (Bolivarian Republic of)' to 'Venezuela'
Encyclopaedia Britannica
Datasets that are extracted from Web pages of the Encyclopaedia Britannica.
[15]:
# Get dataset with US city names. Used the auto_download
# option that will fetch the dataset if it has not been
# downloaded yet.
us_cities = refdata.load(key='encyclopaedia_britannica:us_cities', auto_download=True).df()
us_cities.head()
[15]:
city | state | |
---|---|---|
0 | Demopolis | Alabama |
1 | Sylacauga | Alabama |
2 | Troy | Alabama |
3 | Dothan | Alabama |
4 | Prichard | Alabama |
Cleanup
[16]:
# Remove the created data folder.
import shutil
shutil.rmtree('./data')