
openclean is a Python library for data profiling and data cleaning. It is motivated by the fact that data preparation is still a major bottleneck for many data science projects. Data preparation requires profiling to gain an understanding of data quality issues, and data manipulation to transform the data into a form that is fit for the intended purpose.

While a large number of tools and techniques have been developed for profiling and cleaning data, one main issue we see is that they are not accessible in a single, unified framework. Existing tools may be implemented in different programming languages and require significant effort to install and interface with. In other cases, promising data cleaning methods have been published in the scientific literature but there is no suitable codebase available for them. We believe that the lack of seamless access to existing work is a major reason why data preparation is so time-consuming.

The goal of openclean is to bring together data cleaning tools in a single environment that is easy and intuitive for a data scientist to use. openclean allows users to compose and execute cleaning pipelines that are built using a variety of different tools. We aim for openclean to be flexible and extensible to allow easy integration of new functionality. To this end, we define a set of primitives and APIs for the different types of operators (actors) in openclean pipelines.

Features

openclean has many features that make the data wrangling experience straightforward. It shines particularly in these areas:

Data Profiling

openclean comes with a profiler that provides users with actionable metrics about their data’s quality. It allows users to detect possible problems early on by providing various statistical measures of the data, from min-max frequencies to uniqueness and entropy calculations. The interface is easy to implement and can be extended by Python-savvy users to cater to their needs.

Data Cleaning & Wrangling

openclean’s operators have been created specifically to handle data janitorial tasks. They help identify and present statistical anomalies, fix functional dependency violations, locate and update spelling mistakes, and handle missing values gracefully. As openclean is growing fast, so is this list of operators!

Data Enrichment

openclean seamlessly integrates with Socrata and the Reference Data Repository to provide its users with master datasets that can be incorporated into the data cleaning process.

Data Provenance

openclean comes with a mini version control engine that allows users to maintain versions of their datasets and, at any point, commit, checkout or roll back changes. Users can also register custom functions inside the openclean engine and apply them effortlessly across different datasets and notebooks.

openclean is available on PyPI (https://pypi.org/project/openclean-core/)


Standardizing Ethiopian dates and Woreda names

This notebook demonstrates how openclean can solve some major hurdles encountered in the Ethiopian Vaccine deliveries dataset, which shows monthly vaccine deliveries to various Zones and Woredas in Ethiopia between 2017 and 2019 (note: the dataset values have been randomized).

Setting up

[1]:
import pprint
import os, re

pp = pprint.PrettyPrinter(indent=2)

The DB engine allows users to profile and view their data through a user interface. They can also create recipes of operations on samples and apply them lazily over the full dataset once they’re satisfied with the changes.

The engine also lets users maintain provenance of the operations performed on a dataset. Just like a version control system, it has methods to load, commit, and checkout versions of the dataset. To learn more about maintaining provenance in openclean, check out the documentation. Finally, users can not only create versions of their datasets but also register custom functions with the engine and use them across notebooks.

[2]:
# load openclean jupyter widget

from openclean_notebook import DB

db = DB(basedir='.openclean', create=True)
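
To picture the commit/checkout/rollback idea described above, here is a minimal in-memory sketch. It is not openclean’s engine: the `VersionStore` class and its methods are hypothetical names made up for illustration, and it simply keeps a deep copy of the rows for every version.

```python
import copy


class VersionStore:
    """Toy in-memory snapshot store (hypothetical; not openclean's engine)."""

    def __init__(self, rows):
        # Version 0 is the dataset as first loaded.
        self._snapshots = [copy.deepcopy(rows)]

    def commit(self, rows):
        """Store a new snapshot and return its version number."""
        self._snapshots.append(copy.deepcopy(rows))
        return len(self._snapshots) - 1

    def checkout(self, version=None):
        """Return a copy of a snapshot (the latest one by default)."""
        if version is None:
            version = len(self._snapshots) - 1
        return copy.deepcopy(self._snapshots[version])

    def rollback(self, version):
        """Discard every snapshot made after the given version."""
        if not 0 <= version < len(self._snapshots):
            raise ValueError("unknown version")
        del self._snapshots[version + 1:]
        return self.checkout()


store = VersionStore([{"WoredaName": "Fichetown"}])
rows = store.checkout()
rows[0]["WoredaName"] = "Fiche Town"   # edit a working copy
v1 = store.commit(rows)                # record the edit as version 1
```

Because `checkout` always returns a copy, edits only become part of the history once they are committed.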

Loading data

We use the stream operator to show how large datasets can be streamed through openclean transformations without ever needing to fully load them into memory. We also perform a couple of horizontal and vertical slicing operations that are evaluated lazily. The final dataset we will use in this notebook has 10k rows and 10 columns and contains information on 3 administrative levels inside Ethiopia: Region, Zone and Woreda. For succinctness, we only use data from the ‘Oromia’ Region.

[3]:
# Load data

from openclean.pipeline import stream
from openclean.profiling.datatype.convert import DefaultConverter
from openclean.function.eval.base import Eq, Col

vacc = stream(os.path.join('data', 'ethiopia-vacc-randomized.csv'))\
    .typecast(DefaultConverter()) \
    .select(['EthiopianMonth', 'EthMonNum', 'EthYear',
             'RegionName', 'ZoneName', 'WoredaName',
             'DeliveredPrivateClinics', 'DeliveredPublicClinics',
             'DeliveredOther', 'DeliveredTotal'])\
    .where(Eq(Col('RegionName'), 'Oromia'))\
    .to_df()
[4]:
vacc.sample(5, random_state=42)
[4]:
      EthiopianMonth  EthMonNum  EthYear  RegionName  ZoneName          WoredaName  DeliveredPrivateClinics  DeliveredPublicClinics  DeliveredOther  DeliveredTotal
6252  Yekatit         6th        2009     Oromia      South West Shewa  Wonchi      22                       7                       9               38
4684  Hedar           3rd        2011     Oromia      Bale              Sinana      36                       69                      45              150
1731  Teqemt          2nd        2010     Oromia      Bale              Raitu       94                       90                      0               184
4742  Tir             5th        2011     Oromia      West Wellega      Homa        74                       45                      13              132
4521  Hamle           11th       2009     Oromia      West Shewa        Guder Hosp  74                       74                      46              194
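
The lazy, streaming evaluation used when loading the data can be illustrated in plain Python with generators. The sketch below is only an analogy, not how openclean implements `stream`: the helper names are hypothetical, and the inline CSV is a small made-up stand-in for the real file.

```python
import csv
import io

# Small inline stand-in for the CSV file (illustrative rows only).
SAMPLE = io.StringIO(
    "RegionName,ZoneName,WoredaName,DeliveredTotal\n"
    "Oromia,Bale,Sinana,150\n"
    "Afar,Kilbati /Zone2,Abaala,10\n"
    "Oromia,Bale,Raitu,184\n"
)


def stream_rows(handle):
    """Yield one dict per CSV row without reading the whole file."""
    yield from csv.DictReader(handle)


def where(rows, column, value):
    """Horizontal slice: lazily keep rows whose column equals value."""
    return (row for row in rows if row[column] == value)


def select(rows, columns):
    """Vertical slice: lazily keep only the named columns."""
    return ({col: row[col] for col in columns} for row in rows)


# Nothing is read yet; the generators are only chained together here.
pipeline = select(where(stream_rows(SAMPLE), "RegionName", "Oromia"),
                  ["WoredaName", "DeliveredTotal"])

# Rows are pulled through the pipeline one at a time on iteration.
result = list(pipeline)
```

Only the final `list(...)` call materializes data, which plays the role that `.to_df()` plays in the notebook cell.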

Profiling

openclean comes with pre-configured tools that profile datasets and report actionable metrics on data quality, as well as an API that lets users create or plug in their own. By default, we use the auctus profiler. The power of the user interface coupled with the profiler can be seen here.

[5]:
# load the dataset to the widget

db.load_dataset(source=vacc, name='vacc')
[5]:
<openclean.engine.dataset.FullDataset at 0x7fe299597e20>
[6]:
# fix string dates
@db.register.eval('get_numerical_value')
def get_numerical_value(value):
    """Ingest a given string and return its first run of digits.

    e.g.: '12th' -> '12'
    """
    # raw string avoids an invalid escape sequence warning for \d
    return re.findall(r"\d+", value)[0]
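
Note that the function above returns the digits as a string and raises an `IndexError` when the value contains no digits at all. If that matters for your data, a more defensive variant could look like the sketch below; `leading_number` is a hypothetical helper and is not registered with the engine here.

```python
import re


def leading_number(value):
    """Return the first run of digits in value as an int, or None."""
    match = re.search(r"\d+", str(value))
    return int(match.group()) if match else None


converted = [leading_number(v) for v in ["6th", "3rd", "11th", "n/a"]]
```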

[7]:
# The detail and column views show various profiled metrics

db.edit('vacc', n=100)
[8]:
vacc = db.checkout(name='vacc')
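
To give a feel for the kind of column metrics a profiler reports (the feature overview mentions uniqueness and entropy), here is a small standard-library sketch. It does not reproduce the output of the auctus profiler. `profile_column` is a hypothetical helper, and defining uniqueness as the share of distinct values is our assumption.

```python
import math
from collections import Counter


def profile_column(values):
    """Compute a few simple quality metrics for one column of values."""
    if not values:
        raise ValueError("cannot profile an empty column")
    counts = Counter(values)
    total = len(values)
    # Shannon entropy (in bits) of the value distribution.
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in counts.values())
    return {
        "total": total,
        "distinct": len(counts),
        # Assumed definition: share of distinct values among all values.
        "uniqueness": len(counts) / total,
        "entropy": entropy,
        "most_common": counts.most_common(1)[0],
    }


stats = profile_column(["Bale", "Bale", "Arsi", "Arsi"])
```

A column with low entropy and few distinct values is dominated by a handful of values, whereas a uniqueness close to 1 suggests an identifier-like column.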

Transformations

This section performs standardization transformations on the dataset. A few updates that the dataset needs are:

- Recreating Gregorian dates from Ethiopian dates
- Fixing spelling mistakes in Woreda names

Date Conversion

An Ethiopian calendar year spans parts of two Gregorian years because it starts in September. Hence we need both the month and the year to transform a date into its Gregorian counterpart. The dataset lets us assign a month and year to each record, but we can’t calculate the exact Gregorian day because it has no Ethiopian day information. So we assume the 1st of each month, while leaving the user free to modify it based on domain knowledge.

Note: The Ethiopian calendar has 13 months (https://allaboutethio.com/tcalendar.html), which can create uneven time deltas between records: the first 12 months are 30 days long and the 13th has 5 days, or 6 in a leap year. Luckily, this dataset disregards the 13th month.

YYYY-MM-DD (Eth) -> YYYY-MM-DD (Greg)

2009-01-01 (Eth) -> 2016-09-11 (Greg)

2009-13-05 (Eth) -> 2017-09-10 (Greg)

[9]:
# the dataset has ethiopic dates in 3 columns

vacc.sample(5, random_state=42)[['EthYear','EthMonNum','EthiopianMonth']]
[9]:
      EthYear  EthMonNum  EthiopianMonth
6252  2009     6          Yekatit
4684  2011     3          Hedar
1731  2010     2          Teqemt
4742  2011     5          Tir
4521  2009     11         Hamle

We use an external library (ethiopian-date) to convert these dates to Gregorian by constructing an openclean operator. The operator expects two columns and a function to apply to them, and it can be reused on any similar dataset for the same conversion.

[10]:
from ethiopian_date import EthiopianDateConverter
from openclean.function.eval.base import Eval
from openclean.operator.transform.insert import inscol


# the operator should have two input values and return a date
convert_date = lambda x, y : EthiopianDateConverter.to_gregorian(int(x), int(y), 1)


# create a new operator that expects two columns named 'EthYear' and 'EthMonNum'
# and runs the convert_date callable on those columns
date_parser = Eval(columns=['EthYear','EthMonNum'],
     func = convert_date,
     is_unary = False)


# insert a new column in the vacc dataset called 'Converted_Date' at position 0 using the date_parser operator
vacc = inscol(vacc, 'Converted_Date', 0, date_parser)
[11]:
# the Converted_Date column has the converted dates

vacc.sample(10, random_state=42)
[11]:
      Converted_Date  EthiopianMonth  EthMonNum  EthYear  RegionName  ZoneName                  WoredaName                DeliveredPrivateClinics  DeliveredPublicClinics  DeliveredOther  DeliveredTotal
6252  2017-02-08      Yekatit         6          2009     Oromia      South West Shewa          Wonchi                    22                       7                       9               38
4684  2018-11-10      Hedar           3          2011     Oromia      Bale                      Sinana                    36                       69                      45              150
1731  2017-10-11      Teqemt          2          2010     Oromia      Bale                      Raitu                     94                       90                      0               184
4742  2019-01-09      Tir             5          2011     Oromia      West Wellega              Homa                      74                       45                      13              132
4521  2017-07-08      Hamle           11         2009     Oromia      West Shewa                Guder Hosp                74                       74                      46              194
6340  2018-04-09      Miyazia         8          2010     Oromia      Bale                      Seweyna                   7                        83                      2               92
576   2018-01-09      Tir             5          2010     Oromia      West Hararge              Guba Qoricha              77                       95                      15              187
5202  2018-07-08      Hamle           11         2010     Oromia      Lege Dadi Lege Tafo Town  Lege Dadi Lege Tafo Town  59                       11                      84              154
6363  2019-01-09      Tir             5          2011     Oromia      Qeleme Wellega            Dale Sedi                 43                       86                      20              149
439   2019-05-09      Ginbot          9          2011     Oromia      Arsi                      Tena                      86                       42                      53              181
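
The Converted_Date values above can be cross-checked without the external library. The sketch below uses a commonly cited Julian Day Number formula for the Ethiopian calendar. It is our own illustration and not necessarily how ethiopian-date computes its result; the function name and constants are assumptions chosen for this example.

```python
from datetime import date

# Commonly cited Julian Day Number offset of the Ethiopian era (assumption).
ETHIOPIAN_EPOCH_JDN = 1723856
# For the proleptic Gregorian calendar: date.toordinal() == JDN - 1721425.
JDN_TO_ORDINAL = 1721425


def ethiopian_to_gregorian(year, month, day=1):
    """Convert an Ethiopian calendar date to a Gregorian datetime.date."""
    if not (1 <= month <= 13 and 1 <= day <= 30):
        raise ValueError("month must be 1-13 and day 1-30")
    # 12 months of 30 days plus a short 13th month; one leap day every 4 years.
    jdn = (ETHIOPIAN_EPOCH_JDN + 365
           + 365 * (year - 1) + year // 4
           + 30 * month + day - 31)
    return date.fromordinal(jdn - JDN_TO_ORDINAL)


# (EthYear, EthMonNum) -> Converted_Date, taken from the table above.
expected = {
    (2009, 6): date(2017, 2, 8),
    (2011, 3): date(2018, 11, 10),
    (2010, 2): date(2017, 10, 11),
    (2011, 5): date(2019, 1, 9),
    (2009, 11): date(2017, 7, 8),
}
mismatches = {key: ethiopian_to_gregorian(*key)
              for key, greg in expected.items()
              if ethiopian_to_gregorian(*key) != greg}
```

An empty `mismatches` dictionary means the arithmetic agrees with the library for these sample rows.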

Standardizing Spellings

Because this dataset was collected over a period of time and the Zone and Woreda names are in a different language, they are susceptible to spelling mistakes when transliterated into English. We check for such mistakes in this section.

We use official lists of regional names provided by the Ethiopian Government as master data to cross-reference the Woreda names in the vacc data.

[12]:
from openclean.data.load import dataset
from openclean.operator.transform.apply import apply
from openclean.operator.transform.select import select

# load the master data. As per the metadata:
# - admin0 - Country
# - admin1 - State/Region
# - admin2 - Zone
# - admin3 - Woreda
admin_boundaries = dataset(os.path.join('data', 'admin-boundaries.csv'),
                           typecast=DefaultConverter())


# the openclean select method can be used to select specific columns and change their names
admin_boundaries = select(admin_boundaries,
                          ['admin0Name_en', 'admin1Name_en', 'admin2Name_en', 'admin3Name_en'],
                          ['Country', 'Region', 'Zone', 'Woreda'])


# the apply method applies a function over one or more columns
admin_boundaries = apply(admin_boundaries, ['Country','Region','Zone','Woreda'] , str.title)


admin_boundaries.head(10)
[12]:
   Country   Region  Zone                Woreda
0  Ethiopia  Oromia  West Guji           Olanciti Town
1  Ethiopia  Oromia  Horo Gudru Wellega  Gudeya Bila
2  Ethiopia  Somali  Shabelle            Aba-Korow
3  Ethiopia  Afar    Kilbati /Zone2      Abaala
4  Ethiopia  Afar    Kilbati /Zone2      Abaala Town
5  Ethiopia  Oromia  Horo Gudru Wellega  Ababo
6  Ethiopia  Harari  Harari              Abadir
7  Ethiopia  Oromia  Horo Gudru Wellega  Abay Chomen
8  Ethiopia  Oromia  West Guji           Abaya
9  Ethiopia  Oromia  Horo Gudru Wellega  Abe Dongoro
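
One detail worth knowing when normalizing names with `str.title`: it treats every non-letter character as a word boundary, so a name containing an apostrophe gets a capital letter right after it. If that is not wanted, `string.capwords` is one possible alternative. It only splits on whitespace, and it also trims the ends and collapses repeated spaces. The `normalize_name` helper below is hypothetical and is not part of the notebook.

```python
import string


def normalize_name(raw):
    """Capitalize each whitespace-separated word and tidy up spacing."""
    return string.capwords(raw)


# str.title capitalizes the letter that follows the apostrophe.
titled = "abichugna gne'a".title()

# capwords leaves it alone and drops stray whitespace such as a newline.
normalized = normalize_name("abichugna  gne'a\n")
```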

First, let’s check how many values in our dataset are not present in the official list.

[13]:
errors = set(vacc['WoredaName']) - set(admin_boundaries['Woreda'])
print('there are {} errors'.format(len(errors)))
there are 216 errors

Next, we use openclean’s DefaultStringMatcher to find potential matches for these errors and for all other values in our dataset. The matcher uses FuzzySimilarity as its similarity function and the Woreda column of admin_boundaries as the master vocabulary. We then build a mapping from each query to all of its matches.
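
To make the idea of fuzzy matching against a vocabulary concrete, here is a plain-Python sketch based on normalized edit distance. It is an illustration under our own assumptions, not openclean’s implementation: the function names are hypothetical, and treating the threshold as a strict lower bound on the score is a guess.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def best_matches(query, vocabulary, threshold=0.4):
    """Return the (term, score) pairs tied for the top score above threshold."""
    scored = [(term, similarity(query, term)) for term in vocabulary]
    scored = [(term, score) for term, score in scored if score > threshold]
    if not scored:
        return []
    top = max(score for _, score in scored)
    return [(term, score) for term, score in scored if score == top]


vocab = ["Abe Dongoro", "Adet", "Adwa", "Adama"]
close = best_matches("Abe Dengoro", vocab)
tied = best_matches("Adea", vocab)
none = best_matches("Ayira Hospital", vocab)
```

A query can come back with several equally good candidates, or with none at all, which is why the mapping built in the next cell stores a list of matches per query.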

[14]:
from openclean.function.matching.fuzzy import FuzzySimilarity
from openclean.function.matching.base import DefaultStringMatcher
from openclean.data.mapping import Mapping

matcher = DefaultStringMatcher(
    vocabulary=admin_boundaries['Woreda'],
    similarity=FuzzySimilarity(),
    best_matches_only=True,
    no_match_threshold=0.4,
    cache_results=True)

woreda_map = Mapping()
for query in set(vacc['WoredaName']):
    woreda_map.add(query, matcher.find_matches(query))
pp.pprint(woreda_map)
Mapping(<class 'list'>,
        { 'Abay Chomen': [StringMatch(term='Abay Chomen', score=1)],
          'Abaya': [StringMatch(term='Abaya', score=1)],
          'Abe Dengoro': [ StringMatch(term='Abe Dongoro', score=0.9090909090909091)],
          'Abichugna': [StringMatch(term="Abichugna Gne'A", score=0.6)],
          'Abuna Gindeberet': [ StringMatch(term='Abuna Ginde Beret', score=0.9411764705882353)],
          'Adaba': [StringMatch(term='Adaba', score=1)],
          'Adama': [StringMatch(term='Adama', score=1)],
          'Adama Town': [StringMatch(term='Adama Town', score=1)],
          'Adami Tulu Jido Kombolcha': [ StringMatch(term='Adama Tulu Jido Kombolcha', score=0.96)],
          'Adea': [ StringMatch(term='Adet', score=0.75),
                    StringMatch(term='Adwa', score=0.75)],
          'Adea Berga': [StringMatch(term='Adda Berga', score=0.9)],
          'Adola Hospital': [StringMatch(term='Adola Town', score=0.5)],
          'Adola Reda': [StringMatch(term='Adola Town', score=0.6)],
          'Adola Town': [StringMatch(term='Adola Town', score=1)],
          'Aga Wayyu': [StringMatch(term='Aga Wayu', score=0.8888888888888888)],
          'Agarfa': [StringMatch(term='Agarfa', score=1)],
          'Agaro': [StringMatch(term='Amaro', score=0.8)],
          'Agaro Hospital': [StringMatch(term='Agaro Town', score=0.5)],
          'Akaki': [StringMatch(term='Akaki', score=1)],
          'Ale': [StringMatch(term='Ale', score=1)],
          'Aleiltu': [StringMatch(term='Aleltu', score=0.8571428571428572)],
          'Algesachi': [StringMatch(term='Alge Sachi', score=0.9)],
          'Ambo': [StringMatch(term='Afambo', score=0.6666666666666667)],
          'Ambo Hospital': [ StringMatch(term='Ambo Zuria', score=0.5384615384615384)],
          'Ambo Town': [StringMatch(term='Ambo Town', score=1)],
          'Ambo University  Hosp': [],
          'Ameya': [StringMatch(term='Ameya', score=1)],
          'Ameya Hospital': [],
          'Amigna': [StringMatch(term='Amigna', score=1)],
          'Amuru': [StringMatch(term='Amuru', score=1)],
          'Anchar': [StringMatch(term='Anchar', score=1)],
          'Anfilo': [StringMatch(term='Anfilo', score=1)],
          'Anna Sora': [StringMatch(term='Ana Sora', score=0.8888888888888888)],
          'Arena Buluq': [ StringMatch(term='Harena Buluk', score=0.8333333333333334)],
          'Arero': [StringMatch(term='Arero', score=1)],
          'Arsi Negele Rural': [ StringMatch(term='Arsi Negele Town', score=0.7058823529411764)],
          'Arsi Negele Town': [StringMatch(term='Arsi Negele Town', score=1)],
          'Aseko': [StringMatch(term='Aseko', score=1)],
          'Assela Town': [ StringMatch(term='Asela Town', score=0.9090909090909091)],
          'Aweday Town': [StringMatch(term='Aweday Town', score=1)],
          'Ayira': [StringMatch(term='Ayira', score=1)],
          'Ayira Hospital': [],
          'B/Tolyi': [StringMatch(term='Tole', score=0.4285714285714286)],
          'Babile': [StringMatch(term='Berahile', score=0.625)],
          'Babile Woreda': [ StringMatch(term='Babile (Or)', score=0.6923076923076923)],
          'Babo Gembel': [ StringMatch(term='Chabe Gambeltu', score=0.5714285714285714)],
          'Bako Hospital': [ StringMatch(term='Maokomo Special', score=0.4666666666666667)],
          'Bako Tibe': [StringMatch(term='Bako Tibe', score=1)],
          'Bale Gesgara': [ StringMatch(term='Bele Gesgar', score=0.8333333333333334)],
          'Bantu  Hospital': [],
          'Batu': [StringMatch(term='Bati', score=0.75)],
          'Becho': [ StringMatch(term='Bero', score=0.6),
                     StringMatch(term='Decha', score=0.6),
                     StringMatch(term='Gechi', score=0.6),
                     StringMatch(term='Mecha', score=0.6)],
          'Bedele Hospital': [ StringMatch(term='Bedele Town', score=0.5333333333333333),
                               StringMatch(term='Badele Zuria', score=0.5333333333333333)],
          'Bedele Town': [StringMatch(term='Bedele Town', score=1)],
          'Bedele Zuriya': [ StringMatch(term='Badele Zuria', score=0.8461538461538461)],
          'Bedeno': [StringMatch(term='Bedeno', score=1)],
          'Bedesa Town': [ StringMatch(term='Bedele Town', score=0.8181818181818181)],
          'Begi': [StringMatch(term='Begi', score=1)],
          'Begi Hospital': [],
          'Bekoji Town': [StringMatch(term='Bekoji Town', score=1)],
          'Berbere': [StringMatch(term='Berbere', score=1)],
          'Berreh': [StringMatch(term='Bereh', score=0.8333333333333334)],
          'Biebirsa Kojowa': [ StringMatch(term='Birbirsa Kojowa', score=0.9333333333333333)],
          'Bilonopa': [StringMatch(term='Bilo Nopha', score=0.8)],
          'Bishan Guracha Town': [ StringMatch(term='Bishan Guracha', score=0.736842105263158)],
          'Bishoftu Town': [StringMatch(term='Bishoftu Town', score=1)],
          'Bisidimo Hospital': [ StringMatch(term='Bilo Nopha', score=0.4117647058823529)],
          'Boji Cheqorsa': [ StringMatch(term='Boji Chekorsa', score=0.9230769230769231)],
          'Boji Dermeji': [ StringMatch(term='Boji Dirmeji', score=0.9166666666666666)],
          'Boke': [StringMatch(term='Boke', score=1)],
          'Boneya Bushe': [ StringMatch(term='Boneya Boshe', score=0.9166666666666666)],
          'Bora': [ StringMatch(term='Bore', score=0.75),
                    StringMatch(term='Bura', score=0.75)],
          'Bore': [StringMatch(term='Bore', score=1)],
          'Bore Hospital': [ StringMatch(term='Borecha', score=0.46153846153846156),
                             StringMatch(term='Bule Hora', score=0.46153846153846156)],
          'Boricha': [StringMatch(term='Boricha', score=1)],
          'Boset': [StringMatch(term='Boset', score=1)],
          'Bule Hora': [StringMatch(term='Bule Hora', score=1)],
          'Bule Hora Hospital': [ StringMatch(term='Bule Hora Town', score=0.6111111111111112)],
          'Bule Hora Toun': [ StringMatch(term='Bule Hora Town', score=0.9285714285714286)],
          'Burayu Town': [ StringMatch(term='Bure Town', score=0.7272727272727273),
                           StringMatch(term='Durame Town', score=0.7272727272727273)],
          'Bure': [ StringMatch(term='Bura', score=0.75),
                    StringMatch(term='Bule', score=0.75),
                    StringMatch(term='Bore', score=0.75)],
          'Burka Dimtu': [ StringMatch(term='Burqua Dhintu', score=0.6923076923076923)],
          'Chelenko Hospital': [],
          'Chelia': [StringMatch(term='Cheliya', score=0.8571428571428572)],
          'Chewaqa': [StringMatch(term='Chwaka', score=0.7142857142857143)],
          'Chinakesen': [StringMatch(term='Chinaksen', score=0.9)],
          'Chiro Hospital': [ StringMatch(term='Chiro Zuria', score=0.5714285714285714)],
          'Chiro Town': [StringMatch(term='Chiro Town', score=1)],
          'Chiro Zuriya': [ StringMatch(term='Chiro Zuria', score=0.9166666666666666)],
          'Chole': [StringMatch(term='Chole', score=1)],
          'Chomen Guduru': [ StringMatch(term='Choman Guduru', score=0.9230769230769231)],
          'Chora': [StringMatch(term='Chifra', score=0.6666666666666667)],
          'Chora Boter': [ StringMatch(term='Chora (Buno Bedele)', score=0.4736842105263158)],
          'Codi': [StringMatch(term='Cobi', score=0.75)],
          'Dale Sedi': [ StringMatch(term='Dale Sadi', score=0.8888888888888888)],
          'Dale Wabera': [StringMatch(term='Dale Wabera', score=1)],
          'Dambi Dollo': [StringMatch(term='Denbi Dollo Town', score=0.5625)],
          'Dambi Dolo Hospital': [ StringMatch(term='Denbi Dollo Town', score=0.4736842105263158)],
          'Dano': [StringMatch(term='Dano', score=1)],
          'Dapho Hana': [StringMatch(term='Dabo Hana', score=0.8)],
          'Darimu': [StringMatch(term='Darimu', score=1)],
          'Daro Lebu': [StringMatch(term='Daro Lebu', score=1)],
          'Dawe Qachen': [StringMatch(term='Dawe Ketchen', score=0.75)],
          'Dawe Serar': [ StringMatch(term='Dale Wabera', score=0.5454545454545454)],
          'Dawo': [StringMatch(term='Dawo', score=1)],
          'Debre Libanos': [StringMatch(term='Debre Libanos', score=1)],
          'Deder': [StringMatch(term='Deder', score=1)],
          'Deder Hospital': [StringMatch(term='Deder Town', score=0.5)],
          'Deder Town': [StringMatch(term='Deder Town', score=1)],
          'Dedesa': [StringMatch(term='Dedesa', score=1)],
          'Dedo': [StringMatch(term='Dedo', score=1)],
          'Degem': [StringMatch(term='Degem', score=1)],
          'Deksis': [StringMatch(term='Diksis', score=0.8333333333333334)],
          'Delo Mena': [ StringMatch(term='Doyogena', score=0.5555555555555556),
                         StringMatch(term='Melo Gada', score=0.5555555555555556)],
          'Dendi': [StringMatch(term='Dendi', score=1)],
          'Dera': [ StringMatch(term='Gera', score=0.75),
                    StringMatch(term='Wera', score=0.75),
                    StringMatch(term='Dega', score=0.75),
                    StringMatch(term='Dara', score=0.75)],
          'Dera Hospital': [StringMatch(term='Derashe Special', score=0.6)],
          'Dhas': [StringMatch(term='Dhas', score=1)],
          'Dhidesa Hospital': [],
          'Didu': [StringMatch(term='Didu', score=1)],
          'Diga': [StringMatch(term='Diga', score=1)],
          'Digeluna Tijo': [ StringMatch(term='Degeluna Tijo', score=0.9230769230769231)],
          'Dilo': [StringMatch(term='Dilo', score=1)],
          'Dima': [ StringMatch(term='Diga', score=0.75),
                    StringMatch(term='Disa', score=0.75),
                    StringMatch(term='Dita', score=0.75),
                    StringMatch(term='Dama', score=0.75)],
          'Dinsho': [StringMatch(term='Dinsho', score=1)],
          'Dire': [StringMatch(term='Dire', score=1)],
          'Doba': [StringMatch(term='Doba', score=1)],
          'Dodola  Hospital': [StringMatch(term='Dodola Town', score=0.5)],
          'Dodola Rural': [ StringMatch(term='Dodola Town', score=0.5833333333333333)],
          'Dodola Town': [StringMatch(term='Dodola Town', score=1)],
          'Dodota': [StringMatch(term='Dodota', score=1)],
          'Doreni': [StringMatch(term='Dorani', score=0.8333333333333334)],
          'Dubluk': [StringMatch(term='Dubluk', score=1)],
          'Dugda': [StringMatch(term='Dugda', score=1)],
          'Dugda Dawa': [StringMatch(term='Dugda Dawa', score=1)],
          'Dukem Town': [ StringMatch(term='Durame Town', score=0.7272727272727273)],
          'Ebantu': [StringMatch(term='Ibantu', score=0.8333333333333334)],
          'Ejerie': [StringMatch(term='Saesie', score=0.5)],
          'Ejersa Lafo': [StringMatch(term='Ejersa Lafo', score=1)],
          'El Way': [StringMatch(term='Elwaya', score=0.6666666666666667)],
          'Elifata': [StringMatch(term='Ifata', score=0.7142857142857143)],
          'Enkelo Wabe': [ StringMatch(term='Inkolo Wabe', score=0.8181818181818181)],
          'Fedis': [StringMatch(term='Fedis', score=1)],
          'Fentale': [StringMatch(term='Fentale', score=1)],
          'Fiche Hospital': [StringMatch(term='Fiche Town', score=0.5)],
          'Fichetown': [StringMatch(term='Fiche Town', score=0.9)],
          'Gambo Hospital': [StringMatch(term='Ambo Zuria', score=0.5)],
          'Garemuleta Hospital': [],
          'Gasera': [StringMatch(term='Gasera', score=1)],
          'Gawo Qebe': [ StringMatch(term='Gawo Kebe', score=0.8888888888888888)],
          'Gechi': [StringMatch(term='Gechi', score=1)],
          'Gedeb Asasa': [StringMatch(term='Gedeb Asasa', score=1)],
          'Gedo Hospital': [],
          'Gelan Town': [ StringMatch(term='Gedeb Town', score=0.7),
                          StringMatch(term='Asela Town', score=0.7),
                          StringMatch(term='Dejen Town', score=0.7),
                          StringMatch(term='Dila Town', score=0.7),
                          StringMatch(term='Goba Town', score=0.7)],
          'Gelana': [StringMatch(term='Delanta', score=0.7142857142857143)],
          'Gelemso Hospital': [],
          'Gemechis': [StringMatch(term='Gemechis', score=1)],
          'Genji': [ StringMatch(term='Gena', score=0.6),
                     StringMatch(term='Gaji', score=0.6),
                     StringMatch(term='Gechi', score=0.6)],
          'Gera': [StringMatch(term='Gera', score=1)],
          'Gida Ayana': [StringMatch(term='Gida Ayana', score=1)],
          'Gidami': [StringMatch(term='Gidami', score=1)],
          'Gidami Hosp': [StringMatch(term='Gidami', score=0.5454545454545454)],
          'Gimbi': [StringMatch(term='Gimbi', score=1)],
          'Gimbi Adventist Hospital': [],
          'Gimbi Public  Hospital': [],
          'Gimbi Rural': [ StringMatch(term='Gimbi Town', score=0.5454545454545454),
                           StringMatch(term='Gimbichu', score=0.5454545454545454)],
          'Gimbichu': [StringMatch(term='Gimbichu', score=1)],
          'Ginde Beret': [StringMatch(term='Ginde Beret', score=1)],
          'Gindeberet Hospital': [ StringMatch(term='Ginde Beret', score=0.4736842105263158)],
          'Ginir': [StringMatch(term='Ginir', score=1)],
          'Ginir Town': [StringMatch(term='Ginir Town', score=1)],
          'Girar Jarso': [ StringMatch(term='Gerar Jarso', score=0.9090909090909091)],
          'Girawa': [StringMatch(term='Girawa', score=1)],
          'Girja': [StringMatch(term='Girawa', score=0.6666666666666667)],
          'Goba': [ StringMatch(term='Doba', score=0.75),
                    StringMatch(term='Goma', score=0.75),
                    StringMatch(term='Guba', score=0.75)],
          'Goba Town': [StringMatch(term='Goba Town', score=1)],
          'Gobu Seyo': [StringMatch(term='Gobu Seyo', score=1)],
          'Gojo Hospital': [ StringMatch(term='Golo Oda', score=0.46153846153846156)],
          'Gole Oda': [StringMatch(term='Golo Oda', score=0.875)],
          'Gololcha': [StringMatch(term='Golocha', score=0.875)],
          'Gomma': [StringMatch(term='Goma', score=0.8)],
          'Gomole': [StringMatch(term='Gomole', score=1)],
          'Goro': [ StringMatch(term='Horo', score=0.75),
                    StringMatch(term='Soro', score=0.75)],
          'Goro Dola': [ StringMatch(term='Gora Dola', score=0.8888888888888888)],
          'Goro Gutu': [StringMatch(term='Goro Gutu', score=1)],
          'Goro Muti': [StringMatch(term='Goro Muti', score=1)],
          'Guba Qoricha': [ StringMatch(term='Goba Koricha', score=0.8333333333333334)],
          'Guchi': [StringMatch(term='Guchi', score=1)],
          'Guder Hosp': [ StringMatch(term='Gonder Town', score=0.5454545454545454),
                          StringMatch(term='Lude Hitosa', score=0.5454545454545454)],
          'Gudeyabila': [ StringMatch(term='Gudeya Bila', score=0.9090909090909091)],
          'Gudru': [StringMatch(term='Guduru', score=0.8333333333333334)],
          'Guliso': [StringMatch(term='Guliso', score=1)],
          'Guma': [StringMatch(term='Gumay', score=0.8)],
          'Gumi Eldallo': [StringMatch(term='Gumi Idalo', score=0.75)],
          'Guna': [StringMatch(term='Guna', score=1)],
          'Gura Dhamole': [ StringMatch(term='Gura Damole', score=0.9166666666666666)],
          'Gursum': [ StringMatch(term='Gursum (Or)', score=0.5454545454545454),
                      StringMatch(term='Gursum (Sm)', score=0.5454545454545454)],
          'Guto Gida': [StringMatch(term='Guto Gida', score=1)],
          'Hababo Guduru': [ StringMatch(term='Choman Guduru', score=0.6153846153846154)],
          'Habro': [StringMatch(term='Habro', score=1)],
          'Halu': [ StringMatch(term='Kalu', score=0.75),
                    StringMatch(term='Haru', score=0.75)],
          'Hambela Wamena': [StringMatch(term='Hambela Wamena', score=1)],
          'Hanbala': [ StringMatch(term='Hawela', score=0.5714285714285714),
                       StringMatch(term='Abaala', score=0.5714285714285714),
                       StringMatch(term='Hanruka', score=0.5714285714285714),
                       StringMatch(term='Dangila', score=0.5714285714285714)],
          'Haro Limu': [StringMatch(term='Haro Limu', score=1)],
          'Haromaya Hospital': [ StringMatch(term='Haromaya Town', score=0.5882352941176471)],
          'Haromaya Rural': [ StringMatch(term='Haromaya Town', score=0.6428571428571428)],
          'Haromaya Town': [StringMatch(term='Haromaya Town', score=1)],
          'Haru': [StringMatch(term='Haru', score=1)],
          'Hawa Gelan': [StringMatch(term='Hawa Galan', score=0.9)],
          'Hawa Gelan Hosp': [StringMatch(term='Hawa Galan', score=0.6)],
          'Hawi Gudina': [ StringMatch(term='Hawi Gudina\n', score=0.9166666666666666)],
          'Hebal Arsi': [StringMatch(term='Heban Arsi', score=0.9)],
          'Hidabu Abote': [StringMatch(term='Hidabu Abote', score=1)],
          'Hitosa': [StringMatch(term='Hitosa', score=1)],
          'Holeta': [StringMatch(term='Holeta Town', score=0.5454545454545454)],
          'Holeta Town': [StringMatch(term='Holeta Town', score=1)],
          'Homa': [StringMatch(term='Homa', score=1)],
          'Horo': [StringMatch(term='Horo', score=1)],
          'Horro Buluk': [ StringMatch(term='Horo Buluk', score=0.9090909090909091)],
          'Hurumu': [StringMatch(term='Hurumu', score=1)],
          'Ilu': [StringMatch(term='Ilu', score=1)],
          'Ilu Galan': [StringMatch(term='Illu Galan', score=0.9)],
          'Inchini Hospital': [],
          'Jarso': [StringMatch(term='Ararso', score=0.6666666666666667)],
          'Jarte Jardga': [ StringMatch(term='Jarte Jardega', score=0.9230769230769231)],
          'Jeju': [StringMatch(term='Jeju', score=1)],
          'Jeldu': [StringMatch(term='Jeldu', score=1)],
          'Jibat': [StringMatch(term='Jibat', score=1)],
          'Jido': [StringMatch(term='Jida', score=0.75)],
          'Jima Geneti': [ StringMatch(term='Jimma Genete', score=0.8333333333333334)],
          'Jima Rare': [StringMatch(term='Jimma Rare', score=0.9)],
          'Jimma Arjo': [StringMatch(term='Jimma Arjo', score=1)],
          'Jimma Horo': [StringMatch(term='Jimma Horo', score=1)],
          'Jimma Spe Town': [ StringMatch(term='Jimma Town', score=0.7142857142857143)],
          'Kake Hosp': [ StringMatch(term='Kawo Koisha', score=0.4545454545454546),
                         StringMatch(term='Lude Hitosa', score=0.4545454545454546)],
          'Kercha': [StringMatch(term='Kercha', score=1)],
          'Kercha Hospital': [ StringMatch(term='Tercha Zuriya', score=0.5333333333333333)],
          'Kersa': [StringMatch(term='Kercha', score=0.6666666666666667)],
          'Kersa Eh': [ StringMatch(term='Mersa Town', score=0.5),
                        StringMatch(term='Bereh', score=0.5)],
          'Kersana Malima': [StringMatch(term='Kersana Malima', score=1)],
          'Kimbibit': [StringMatch(term='Kimbibit', score=1)],
          'Kiremu': [StringMatch(term='Kiremu', score=1)],
          'Kofele': [StringMatch(term='Kofele', score=1)],
          'Kokosa': [StringMatch(term='Kokosa', score=1)],
          'Kokosa  Hospital': [],
          'Kombolicha': [StringMatch(term='Kombolcha', score=0.9)],
          'Kore': [StringMatch(term='Kore', score=1)],
          'Kumbi': [StringMatch(term='Kumbi', score=1)],
          'Kundala': [ StringMatch(term='Kunneba', score=0.5714285714285714),
                       StringMatch(term='Undulu', score=0.5714285714285714)],
          'Kurfa Chele': [StringMatch(term='Kurfa Chele', score=1)],
          'Kuyu': [StringMatch(term='Kuyu', score=1)],
          'Kuyu Hospital': [],
          'Lalo Asabi': [StringMatch(term='Lalo Asabi', score=1)],
          'Laloqile': [StringMatch(term='Lalo Kile', score=0.7777777777777778)],
          'Lata Sibu': [ StringMatch(term='Leta Sibu', score=0.8888888888888888)],
          'Lege Dadi Lege Tafo Town': [ StringMatch(term='Lege Tafo-Lege Dadi Town', score=0.7083333333333333)],
          'Legehida': [StringMatch(term='Legehida', score=1)],
          'Leqa Dulecha': [ StringMatch(term='Leka Dulecha', score=0.9166666666666666)],
          'Liban Jawi': [StringMatch(term='Liban Jawi', score=1)],
          'Liben': [StringMatch(term='Liben', score=1)],
          'Limu': [ StringMatch(term='Darimu', score=0.5),
                    StringMatch(term='Kiremu', score=0.5)],
          'Limu Hospital': [ StringMatch(term='Limu Kosa', score=0.6153846153846154)],
          'Limu Kosa': [StringMatch(term='Limu Kosa', score=1)],
          'Limu Seka': [StringMatch(term='Limu Seka', score=1)],
          'Limuna Bilbilo': [ StringMatch(term='Limu Bilbilo', score=0.8571428571428572)],
          'Loke Hada Hospital': [],
          'Lomme': [StringMatch(term='Loma', score=0.6)],
          'Ludehetosa': [ StringMatch(term='Lude Hitosa', score=0.8181818181818181)],
          'Mana': [StringMatch(term='Zana', score=0.75)],
          'Mancho': [StringMatch(term='Mancho', score=1)],
          'Matahara': [StringMatch(term='May Kadra', score=0.5555555555555556)],
          'Meda Welabu': [StringMatch(term='Meda Welabu', score=1)],
          'Meiso': [StringMatch(term='Miesso', score=0.6666666666666667)],
          'Meko': [StringMatch(term='Meko', score=1)],
          'Melka Belo': [StringMatch(term='Melka Balo', score=0.9)],
          'Mendi Hospital': [StringMatch(term='Mendi Town', score=0.5)],
          'Mendi Town': [StringMatch(term='Mendi Town', score=1)],
          'Mene Sibu': [ StringMatch(term='Mana Sibu', score=0.7777777777777778)],
          'Merti': [StringMatch(term='Merti', score=1)],
          'Mesela': [StringMatch(term='Mesela', score=1)],
          'Meta': [StringMatch(term='Meta', score=1)],
          'Meta Waliqite': [ StringMatch(term='Meta Walkite', score=0.8461538461538461)],
          'Metarobi': [StringMatch(term='Meta Robi', score=0.8888888888888888)],
          'Metu Rural': [StringMatch(term='Metu Zuria', score=0.7)],
          'Metu Town': [StringMatch(term='Metu Town', score=1)],
          'Meyo': [ StringMatch(term='Meko', score=0.75),
                    StringMatch(term='Miyo', score=0.75)],
          'Meyu Muleke': [StringMatch(term='Meyu Muleke', score=1)],
          'Midakegni': [ StringMatch(term='Mida Kegn', score=0.7777777777777778)],
          'Midega Tole': [StringMatch(term='Midhaga Tola', score=0.75)],
          'Mkelka Soda': [ StringMatch(term='Melka Soda', score=0.9090909090909091)],
          'Modjo Town': [StringMatch(term='Mojo Town', score=0.9)],
          'Mojo': [StringMatch(term='Nejo', score=0.5)],
          'Moyale': [ StringMatch(term='Megale', score=0.6666666666666667),
                      StringMatch(term='Yocale', score=0.6666666666666667)],
          'Moyale Hospital': [ StringMatch(term='Moyale (Or)', score=0.5333333333333333),
                               StringMatch(term='Moyale (Sm)', score=0.5333333333333333)],
          'Muke Turi Hospital': [],
          'Mullo': [ StringMatch(term='Mulo', score=0.8),
                     StringMatch(term='Tullo', score=0.8)],
          'Munesa': [StringMatch(term='Munessa', score=0.8571428571428572)],
          'Negele Hospital': [ StringMatch(term='Negele Town', score=0.5333333333333333)],
          'Negele Town': [StringMatch(term='Negele Town', score=1)],
          'Nejo Hospital': [ StringMatch(term='Nejo Town', score=0.46153846153846156)],
          'Nejo Rural': [StringMatch(term='Nejo Town', score=0.5)],
          'Nejo Town': [StringMatch(term='Nejo Town', score=1)],
          'Nekemte Town': [StringMatch(term='Nekemte Town', score=1)],
          'Nensebo': [StringMatch(term='Nenesebo', score=0.875)],
          'Nole Kaba': [StringMatch(term='Nole Kaba', score=1)],
          'Nono': [StringMatch(term='Nono', score=1)],
          'Nono Benja': [StringMatch(term='Nono Benja', score=1)],
          'Nono Sele': [StringMatch(term='Nono Benja', score=0.6)],
          'Nunu Qumba': [StringMatch(term='Nunu Kumba', score=0.9)],
          'O/Beyam': [StringMatch(term='Omo Beyam', score=0.6666666666666667)],
          'Oda Bultum': [StringMatch(term='Kuni /Oda Bultum', score=0.625)],
          'Odo Shakiso': [StringMatch(term='Odo Shakiso', score=1)],
          'Olanciti Hospital': [ StringMatch(term='Olanciti Town', score=0.5882352941176471)],
          'Omo Nada Hospital': [ StringMatch(term='Omo Nada', score=0.47058823529411764)],
          'Omonada': [StringMatch(term='Omo Nada', score=0.875)],
          'Qiltu Kara': [StringMatch(term='Kiltu Kara', score=0.9)],
          'Raitu': [StringMatch(term='Rayitu', score=0.8333333333333334)],
          'Robe': [StringMatch(term='Robe', score=1)],
          'Robe Town': [StringMatch(term='Robe Town', score=1)],
          'Saba Boru': [StringMatch(term='Saba Boru', score=1)],
          'Sandefa': [StringMatch(term='Sankura', score=0.5714285714285714)],
          'Sasiga': [StringMatch(term='Sasiga', score=1)],
          'Sebeta Awas': [ StringMatch(term='Sebeta Hawas', score=0.9166666666666666)],
          'Sebeta Town': [StringMatch(term='Sebeta Town', score=1)],
          'Sedan Chanka': [StringMatch(term='Sedi Chenka', score=0.75)],
          'Seden Sodo': [StringMatch(term='Seden Sodo', score=1)],
          'Seka Chekorsa': [StringMatch(term='Seka Chekorsa', score=1)],
          'Seka Chhokorsa Hospital': [ StringMatch(term='Seka Chekorsa', score=0.5217391304347826)],
          'Sekoru': [StringMatch(term='Sekoru', score=1)],
          'Seru': [StringMatch(term='Seru', score=1)],
          'Setema': [StringMatch(term='Setema', score=1)],
          'Setema Hospital': [ StringMatch(term='Sebeta Hawas', score=0.4666666666666667)],
          'Seweyna': [StringMatch(term='Seweyna', score=1)],
          'Seyo': [StringMatch(term='Sayo', score=0.75)],
          'Seyo Nole': [ StringMatch(term='Sayo Nole', score=0.8888888888888888)],
          'Shabe': [StringMatch(term='Shala', score=0.6)],
          'Shakiso Town': [StringMatch(term='Shakiso Town', score=1)],
          'Shala': [StringMatch(term='Shala', score=1)],
          'Shambu Hospital': [ StringMatch(term='Shambu Town', score=0.5333333333333333)],
          'Shambu Town': [StringMatch(term='Shambu Town', score=1)],
          'Shashamane Town': [ StringMatch(term='Shashemene Town', score=0.8666666666666667)],
          'Shashemene Rural': [ StringMatch(term='Shashemene Zuria', score=0.8125)],
          'Shenan Kolu': [ StringMatch(term='Shanan Kolu', score=0.9090909090909091)],
          'Shirka': [StringMatch(term='Shirka', score=1)],
          'Sibu Sire': [StringMatch(term='Sibu Sire', score=1)],
          'Sigmo': [StringMatch(term='Sigmo', score=1)],
          'Sinana': [StringMatch(term='Sinana', score=1)],
          'Siraro': [StringMatch(term='Siraro', score=1)],
          'Sire': [StringMatch(term='Sire', score=1)],
          "Sodo Dac'Ha": [ StringMatch(term='Sodo Daci', score=0.7272727272727273)],
          'St.Luke Hospital': [],
          'Sude': [StringMatch(term='Sude', score=1)],
          'Sululta Town': [StringMatch(term='Sululta Town', score=1)],
          'Suro Barguda': [ StringMatch(term='Suro Berguda', score=0.9166666666666666)],
          'Teltele': [StringMatch(term='Teltale', score=0.8571428571428572)],
          'Tena': [StringMatch(term='Tena', score=1)],
          'Tibe Kutaye': [ StringMatch(term='Toke Kutaye', score=0.8181818181818181)],
          'Tikur Enchini': [StringMatch(term='Tikur Enchini', score=1)],
          'Tiro Afeta': [StringMatch(term='Tiro Afeta', score=1)],
          'Tiyo': [StringMatch(term='Tiyo', score=1)],
          'Tole': [StringMatch(term='Tole', score=1)],
          'Tulo': [StringMatch(term='Tullo', score=0.8)],
          'Tulu Bolo Hospital': [],
          'Uraga': [StringMatch(term='Uraga', score=1)],
          'Wachile': [StringMatch(term='Wachile', score=1)],
          'Wadara': [StringMatch(term='Wadera', score=0.8333333333333334)],
          'Walmera': [StringMatch(term='Welmera', score=0.8571428571428572)],
          'Wama Hagelo': [ StringMatch(term='Wama Hagalo', score=0.9090909090909091)],
          'Wayu Tuqa': [ StringMatch(term='Wayu Tuka', score=0.8888888888888888)],
          'Were Jarso': [StringMatch(term='Wara Jarso', score=0.8)],
          'West Harerge\tGumbi Bordede': [ StringMatch(term='Gumbi Bordede', score=0.5)],
          'Woliso Rural': [ StringMatch(term='Woliso Town', score=0.5833333333333333)],
          'Woliso Town': [StringMatch(term='Woliso Town', score=1)],
          'Wonchi': [StringMatch(term='Wenchi', score=0.8333333333333334)],
          'Wondo': [StringMatch(term='Wondo', score=1)],
          'Wuchale': [StringMatch(term='Wuchale', score=1)],
          'Yabelo': [StringMatch(term='Yabelo', score=1)],
          'Yabelo Hospital': [ StringMatch(term='Yabelo Town', score=0.5333333333333333)],
          'Yabelo Rural': [ StringMatch(term='Yabelo Town', score=0.5833333333333333)],
          'Yaya Gulele': [StringMatch(term='Yaya Gulele', score=1)],
          'Yayu': [StringMatch(term='Yayu', score=1)],
          'Yemalogi Wolel': [StringMatch(term='Yama Logi Welel', score=0.8)],
          'Yubdo': [StringMatch(term='Yubdo', score=1)],
          'Zeway Dugda': [ StringMatch(term='Ziway Dugda', score=0.9090909090909091)]})

Looking at the query terms that still did not match anything in the master data, we notice a common pattern: they are all hospital names. Rerunning the matcher with stopwords such as Hosp and Hospital removed may yield better results.

[15]:
woreda_map.unmatched()
[15]:
{'Ambo University  Hosp',
 'Ameya Hospital',
 'Ayira Hospital',
 'Bantu  Hospital',
 'Begi Hospital',
 'Chelenko Hospital',
 'Dhidesa Hospital',
 'Garemuleta Hospital',
 'Gedo Hospital',
 'Gelemso Hospital',
 'Gimbi Adventist Hospital',
 'Gimbi Public  Hospital',
 'Inchini Hospital',
 'Kokosa  Hospital',
 'Kuyu Hospital',
 'Loke Hada Hospital',
 'Muke Turi Hospital',
 'St.Luke Hospital',
 'Tulu Bolo Hospital'}
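
The pattern in the unmatched terms can also be surfaced programmatically. The following is a minimal sketch in plain Python, not part of the openclean API: it counts tokens across a few of the unmatched terms listed above and then strips the most frequent ones. The names `unmatched`, `STOPWORDS`, and `strip_stopwords` are hypothetical and introduced only for illustration. Unlike a chain of `str.replace` calls, the regular expression here is case-insensitive, matches whole words only, and collapses the extra whitespace left behind.

```python
import re
from collections import Counter

# A few of the unmatched query terms from the output above.
unmatched = [
    'Ambo University  Hosp',
    'Gimbi Public  Hospital',
    'Kuyu Hospital',
    'St.Luke Hospital',
]

# Count how often each lower-cased token occurs across the unmatched terms.
token_counts = Counter(
    token.lower() for term in unmatched for token in term.split()
)
print(token_counts.most_common(1))  # [('hospital', 3)]

# Hypothetical stopword list, matched as whole words regardless of case.
STOPWORDS = re.compile(r'\b(?:hospital|hosp|university)\b', flags=re.IGNORECASE)


def strip_stopwords(name):
    """Remove stopword tokens and collapse the whitespace they leave behind."""
    without_stopwords = STOPWORDS.sub(' ', name)
    return ' '.join(without_stopwords.split())


cleaned = [strip_stopwords(term) for term in unmatched]
print(cleaned)  # ['Ambo', 'Gimbi Public', 'Kuyu', 'St.Luke']
```

A helper like `strip_stopwords` should be usable wherever a plain callable on a single string value is expected, although that has not been verified against openclean here.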

We remove the stopwords and check whether this solves the problem.

[16]:
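# Remove the stopwords so that hospital names reduce to the woreda name.
# Order matters: 'Hospital' must be replaced before its prefix 'Hosp'.
vacc = apply(
    vacc,
    'WoredaName',
    lambda x: x.replace('Hospital', '').replace('Hosp', '').replace('University', '').strip()
)

# Rebuild the mapping from the cleaned query terms.
woreda_map = Mapping()
for query in set(vacc['WoredaName']):
    woreda_map.add(query, matcher.find_matches(query))
woreda_map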
[16]:
Mapping(list,
        {'Dodota': [StringMatch(term='Dodota', score=1)],
         'Haromaya Town': [StringMatch(term='Haromaya Town', score=1)],
         'Merti': [StringMatch(term='Merti', score=1)],
         'Shambu Town': [StringMatch(term='Shambu Town', score=1)],
         'Setema': [StringMatch(term='Setema', score=1)],
         'Dapho Hana': [StringMatch(term='Dabo Hana', score=0.8)],
         'Dugda': [StringMatch(term='Dugda', score=1)],
         'Kundala': [StringMatch(term='Kunneba', score=0.5714285714285714),
          StringMatch(term='Undulu', score=0.5714285714285714)],
         'Haromaya': [StringMatch(term='Haro Maya', score=0.8888888888888888)],
         'Deder Town': [StringMatch(term='Deder Town', score=1)],
         'Dale Wabera': [StringMatch(term='Dale Wabera', score=1)],
         'Chelenko': [StringMatch(term='Cheliya', score=0.5),
          StringMatch(term='Chena', score=0.5),
          StringMatch(term='Shenkor', score=0.5),
          StringMatch(term='Chole', score=0.5),
          StringMatch(term='Sheko', score=0.5)],
         "Sodo Dac'Ha": [StringMatch(term='Sodo Daci', score=0.7272727272727273)],
         'Bilonopa': [StringMatch(term='Bilo Nopha', score=0.8)],
         'Muke Turi': [StringMatch(term='Metu Zuria', score=0.5)],
         'Midakegni': [StringMatch(term='Mida Kegn', score=0.7777777777777778)],
         'Bule Hora': [StringMatch(term='Bule Hora', score=1)],
         'Tulu Bolo': [StringMatch(term='Tullo', score=0.5555555555555556),
          StringMatch(term='Tulo (Or)', score=0.5555555555555556)],
         'Mullo': [StringMatch(term='Mulo', score=0.8),
          StringMatch(term='Tullo', score=0.8)],
         'Sebeta Awas': [StringMatch(term='Sebeta Hawas', score=0.9166666666666666)],
         'Elifata': [StringMatch(term='Ifata', score=0.7142857142857143)],
         'Kore': [StringMatch(term='Kore', score=1)],
         'Begi': [StringMatch(term='Begi', score=1)],
         'Shirka': [StringMatch(term='Shirka', score=1)],
         'Bore': [StringMatch(term='Bore', score=1)],
         'Horro Buluk': [StringMatch(term='Horo Buluk', score=0.9090909090909091)],
         'Gojo': [StringMatch(term='Goglo', score=0.6),
          StringMatch(term='Gonje', score=0.6)],
         'Limu Kosa': [StringMatch(term='Limu Kosa', score=1)],
         'Aseko': [StringMatch(term='Aseko', score=1)],
         'Gambo': [StringMatch(term='Garbo', score=0.8),
          StringMatch(term='Gimbo', score=0.8)],
         'Adola': [StringMatch(term='Adola', score=1)],
         'Jimma Horo': [StringMatch(term='Jimma Horo', score=1)],
         'Hebal Arsi': [StringMatch(term='Heban Arsi', score=0.9)],
         'Seru': [StringMatch(term='Seru', score=1)],
         'Gimbi Public': [StringMatch(term='Gimbi Town', score=0.5),
          StringMatch(term='Gimbichu', score=0.5),
          StringMatch(term='Kimbibit', score=0.5)],
         'Gedeb Asasa': [StringMatch(term='Gedeb Asasa', score=1)],
         'Adea Berga': [StringMatch(term='Adda Berga', score=0.9)],
         'Walmera': [StringMatch(term='Welmera', score=0.8571428571428572)],
         'Ambo Town': [StringMatch(term='Ambo Town', score=1)],
         'Ameya': [StringMatch(term='Ameya', score=1)],
         'Jima Geneti': [StringMatch(term='Jimma Genete', score=0.8333333333333334)],
         'Bora': [StringMatch(term='Bore', score=0.75),
          StringMatch(term='Bura', score=0.75)],
         'Chewaqa': [StringMatch(term='Chwaka', score=0.7142857142857143)],
         'Ejerie': [StringMatch(term='Saesie', score=0.5)],
         'Kombolicha': [StringMatch(term='Kombolcha', score=0.9)],
         'Daro Lebu': [StringMatch(term='Daro Lebu', score=1)],
         'Dubluk': [StringMatch(term='Dubluk', score=1)],
         'Amigna': [StringMatch(term='Amigna', score=1)],
         'Berreh': [StringMatch(term='Bereh', score=0.8333333333333334)],
         'Aleiltu': [StringMatch(term='Aleltu', score=0.8571428571428572)],
         'Ejersa Lafo': [StringMatch(term='Ejersa Lafo', score=1)],
         'Algesachi': [StringMatch(term='Alge Sachi', score=0.9)],
         'Ale': [StringMatch(term='Ale', score=1)],
         'Raitu': [StringMatch(term='Rayitu', score=0.8333333333333334)],
         'Codi': [StringMatch(term='Cobi', score=0.75)],
         'Seka Chhokorsa': [StringMatch(term='Seka Chekorsa', score=0.8571428571428572)],
         'Shakiso Town': [StringMatch(term='Shakiso Town', score=1)],
         'Gomma': [StringMatch(term='Goma', score=0.8)],
         'Bedeno': [StringMatch(term='Bedeno', score=1)],
         'Jeju': [StringMatch(term='Jeju', score=1)],
         'Shabe': [StringMatch(term='Shala', score=0.6)],
         'Haromaya Rural': [StringMatch(term='Haromaya Town', score=0.6428571428571428)],
         'Abay Chomen': [StringMatch(term='Abay Chomen', score=1)],
         'Degem': [StringMatch(term='Degem', score=1)],
         'Lomme': [StringMatch(term='Loma', score=0.6)],
         'Limu': [StringMatch(term='Darimu', score=0.5),
          StringMatch(term='Kiremu', score=0.5)],
         'Mesela': [StringMatch(term='Mesela', score=1)],
         'Abuna Gindeberet': [StringMatch(term='Abuna Ginde Beret', score=0.9411764705882353)],
         'Meiso': [StringMatch(term='Miesso', score=0.6666666666666667)],
         'Sedan Chanka': [StringMatch(term='Sedi Chenka', score=0.75)],
         'Tibe Kutaye': [StringMatch(term='Toke Kutaye', score=0.8181818181818181)],
         'Bale Gesgara': [StringMatch(term='Bele Gesgar', score=0.8333333333333334)],
         'Gera': [StringMatch(term='Gera', score=1)],
         'Adami Tulu Jido Kombolcha': [StringMatch(term='Adama Tulu Jido Kombolcha', score=0.96)],
         'Nono': [StringMatch(term='Nono', score=1)],
         'Ludehetosa': [StringMatch(term='Lude Hitosa', score=0.8181818181818181)],
         'Legehida': [StringMatch(term='Legehida', score=1)],
         'Holeta': [StringMatch(term='Holeta Town', score=0.5454545454545454)],
         'Gumi Eldallo': [StringMatch(term='Gumi Idalo', score=0.75)],
         'Yabelo': [StringMatch(term='Yabelo', score=1)],
         'Guliso': [StringMatch(term='Guliso', score=1)],
         'Bako Tibe': [StringMatch(term='Bako Tibe', score=1)],
         'B/Tolyi': [StringMatch(term='Tole', score=0.4285714285714286)],
         'Dinsho': [StringMatch(term='Dinsho', score=1)],
         'Wachile': [StringMatch(term='Wachile', score=1)],
         'Bishan Guracha Town': [StringMatch(term='Bishan Guracha', score=0.736842105263158)],
         'Homa': [StringMatch(term='Homa', score=1)],
         'Bedele Town': [StringMatch(term='Bedele Town', score=1)],
         'Hitosa': [StringMatch(term='Hitosa', score=1)],
         'Sire': [StringMatch(term='Sire', score=1)],
         'Gimbichu': [StringMatch(term='Gimbichu', score=1)],
         'Deksis': [StringMatch(term='Diksis', score=0.8333333333333334)],
         'Sandefa': [StringMatch(term='Sankura', score=0.5714285714285714)],
         'Goro Gutu': [StringMatch(term='Goro Gutu', score=1)],
         'Shashemene Rural': [StringMatch(term='Shashemene Zuria', score=0.8125)],
         'Akaki': [StringMatch(term='Akaki', score=1)],
         'Adama': [StringMatch(term='Adama', score=1)],
         'Chomen Guduru': [StringMatch(term='Choman Guduru', score=0.9230769230769231)],
         'Woliso Rural': [StringMatch(term='Woliso Town', score=0.5833333333333333)],
         'Chole': [StringMatch(term='Chole', score=1)],
         'Jimma Arjo': [StringMatch(term='Jimma Arjo', score=1)],
         'Kiremu': [StringMatch(term='Kiremu', score=1)],
         'Gursum': [StringMatch(term='Gursum (Or)', score=0.5454545454545454),
          StringMatch(term='Gursum (Sm)', score=0.5454545454545454)],
         'Tena': [StringMatch(term='Tena', score=1)],
         'Tiyo': [StringMatch(term='Tiyo', score=1)],
         'Debre Libanos': [StringMatch(term='Debre Libanos', score=1)],
         'Omonada': [StringMatch(term='Omo Nada', score=0.875)],
         'Hanbala': [StringMatch(term='Hawela', score=0.5714285714285714),
          StringMatch(term='Abaala', score=0.5714285714285714),
          StringMatch(term='Hanruka', score=0.5714285714285714),
          StringMatch(term='Dangila', score=0.5714285714285714)],
         'Dano': [StringMatch(term='Dano', score=1)],
         'Meyu Muleke': [StringMatch(term='Meyu Muleke', score=1)],
         'Chiro Town': [StringMatch(term='Chiro Town', score=1)],
         'Jarte Jardga': [StringMatch(term='Jarte Jardega', score=0.9230769230769231)],
         'Olanciti': [StringMatch(term='Olanciti Town', score=0.6153846153846154)],
         'Burayu Town': [StringMatch(term='Bure Town', score=0.7272727272727273),
          StringMatch(term='Durame Town', score=0.7272727272727273)],
         'Were Jarso': [StringMatch(term='Wara Jarso', score=0.8)],
         'Bure': [StringMatch(term='Bura', score=0.75),
          StringMatch(term='Bule', score=0.75),
          StringMatch(term='Bore', score=0.75)],
         'Meta Waliqite': [StringMatch(term='Meta Walkite', score=0.8461538461538461)],
         'Gudru': [StringMatch(term='Guduru', score=0.8333333333333334)],
         'Wuchale': [StringMatch(term='Wuchale', score=1)],
         'Dedesa': [StringMatch(term='Dedesa', score=1)],
         'Delo Mena': [StringMatch(term='Doyogena', score=0.5555555555555556),
          StringMatch(term='Melo Gada', score=0.5555555555555556)],
         'Horo': [StringMatch(term='Horo', score=1)],
         'Kurfa Chele': [StringMatch(term='Kurfa Chele', score=1)],
         'Gololcha': [StringMatch(term='Golocha', score=0.875)],
         'Dawe Qachen': [StringMatch(term='Dawe Ketchen', score=0.75)],
         'Kokosa': [StringMatch(term='Kokosa', score=1)],
         'Abaya': [StringMatch(term='Abaya', score=1)],
         'Nejo Rural': [StringMatch(term='Nejo Town', score=0.5)],
         'Nunu Qumba': [StringMatch(term='Nunu Kumba', score=0.9)],
         'Tulo': [StringMatch(term='Tullo', score=0.8)],
         'Suro Barguda': [StringMatch(term='Suro Berguda', score=0.9166666666666666)],
         'Agaro': [StringMatch(term='Amaro', score=0.8)],
         'Boneya Bushe': [StringMatch(term='Boneya Boshe', score=0.9166666666666666)],
         'Midega Tole': [StringMatch(term='Midhaga Tola', score=0.75)],
         'Robe Town': [StringMatch(term='Robe Town', score=1)],
         'Seyo': [StringMatch(term='Sayo', score=0.75)],
         'Halu': [StringMatch(term='Kalu', score=0.75),
          StringMatch(term='Haru', score=0.75)],
         'St.Luke': [StringMatch(term='Dubluk', score=0.4285714285714286)],
         'Dedo': [StringMatch(term='Dedo', score=1)],
         'Adama Town': [StringMatch(term='Adama Town', score=1)],
         'Nejo': [StringMatch(term='Nejo', score=1)],
         'Hambela Wamena': [StringMatch(term='Hambela Wamena', score=1)],
         'Assela Town': [StringMatch(term='Asela Town', score=0.9090909090909091)],
         'Dambi Dolo': [StringMatch(term='Damboya', score=0.5),
          StringMatch(term='Denbi Dollo Town', score=0.5),
          StringMatch(term='Damot Gale', score=0.5),
          StringMatch(term='Damot Sore', score=0.5),
          StringMatch(term='Gimbi Town', score=0.5)],
         'Goba Town': [StringMatch(term='Goba Town', score=1)],
         'Guto Gida': [StringMatch(term='Guto Gida', score=1)],
         'Guba Qoricha': [StringMatch(term='Goba Koricha', score=0.8333333333333334)],
         'Dima': [StringMatch(term='Diga', score=0.75),
          StringMatch(term='Disa', score=0.75),
          StringMatch(term='Dita', score=0.75),
          StringMatch(term='Dama', score=0.75)],
         'West Harerge\tGumbi Bordede': [StringMatch(term='Gumbi Bordede', score=0.5)],
         'Dawo': [StringMatch(term='Dawo', score=1)],
         'Moyale': [StringMatch(term='Megale', score=0.6666666666666667),
          StringMatch(term='Yocale', score=0.6666666666666667)],
         'Seden Sodo': [StringMatch(term='Seden Sodo', score=1)],
         'Chiro': [StringMatch(term='Chire', score=0.8)],
         'Ilu': [StringMatch(term='Ilu', score=1)],
         'Nono Sele': [StringMatch(term='Nono Benja', score=0.6)],
         'Melka Belo': [StringMatch(term='Melka Balo', score=0.9)],
         'Omo Nada': [StringMatch(term='Omo Nada', score=1)],
         'Mancho': [StringMatch(term='Mancho', score=1)],
         'Laloqile': [StringMatch(term='Lalo Kile', score=0.7777777777777778)],
         'Goro Muti': [StringMatch(term='Goro Muti', score=1)],
         'Aga Wayyu': [StringMatch(term='Aga Wayu', score=0.8888888888888888)],
         'Guma': [StringMatch(term='Gumay', score=0.8)],
         'Kofele': [StringMatch(term='Kofele', score=1)],
         'Modjo Town': [StringMatch(term='Mojo Town', score=0.9)],
         'Yabelo Rural': [StringMatch(term='Yabelo Town', score=0.5833333333333333)],
         'Gemechis': [StringMatch(term='Gemechis', score=1)],
         'Dhas': [StringMatch(term='Dhas', score=1)],
         'Bedesa Town': [StringMatch(term='Bedele Town', score=0.8181818181818181)],
         'Dawe Serar': [StringMatch(term='Dale Wabera', score=0.5454545454545454)],
         'Yaya Gulele': [StringMatch(term='Yaya Gulele', score=1)],
         'Boset': [StringMatch(term='Boset', score=1)],
         'Bako': [StringMatch(term='Babo', score=0.75)],
         'Bishoftu Town': [StringMatch(term='Bishoftu Town', score=1)],
         'Chinakesen': [StringMatch(term='Chinaksen', score=0.9)],
         'Dodola Rural': [StringMatch(term='Dodola Town', score=0.5833333333333333)],
         'Lata Sibu': [StringMatch(term='Leta Sibu', score=0.8888888888888888)],
         'Shenan Kolu': [StringMatch(term='Shanan Kolu', score=0.9090909090909091)],
         'Jeldu': [StringMatch(term='Jeldu', score=1)],
         'Gura Dhamole': [StringMatch(term='Gura Damole', score=0.9166666666666666)],
         'Kercha': [StringMatch(term='Kercha', score=1)],
         'Anfilo': [StringMatch(term='Anfilo', score=1)],
         'Oda Bultum': [StringMatch(term='Kuni /Oda Bultum', score=0.625)],
         'Sekoru': [StringMatch(term='Sekoru', score=1)],
         'Lege Dadi Lege Tafo Town': [StringMatch(term='Lege Tafo-Lege Dadi Town', score=0.7083333333333333)],
         'Gimbi Rural': [StringMatch(term='Gimbi Town', score=0.5454545454545454),
          StringMatch(term='Gimbichu', score=0.5454545454545454)],
         'Guna': [StringMatch(term='Guna', score=1)],
         'Nensebo': [StringMatch(term='Nenesebo', score=0.875)],
         'Sululta Town': [StringMatch(term='Sululta Town', score=1)],
         'Abichugna': [StringMatch(term="Abichugna Gne'A", score=0.6)],
         'Loke Hada': [StringMatch(term='Lege Hida', score=0.6666666666666667)],
         'Mojo': [StringMatch(term='Nejo', score=0.5)],
         'Girja': [StringMatch(term='Girawa', score=0.6666666666666667)],
         'Gelan Town': [StringMatch(term='Gedeb Town', score=0.7),
          StringMatch(term='Asela Town', score=0.7),
          StringMatch(term='Dejen Town', score=0.7),
          StringMatch(term='Dila Town', score=0.7),
          StringMatch(term='Goba Town', score=0.7)],
         'Tole': [StringMatch(term='Tole', score=1)],
         'Robe': [StringMatch(term='Robe', score=1)],
         'Guchi': [StringMatch(term='Guchi', score=1)],
         'Jibat': [StringMatch(term='Jibat', score=1)],
         'Metarobi': [StringMatch(term='Meta Robi', score=0.8888888888888888)],
         'Odo Shakiso': [StringMatch(term='Odo Shakiso', score=1)],
         'O/Beyam': [StringMatch(term='Omo Beyam', score=0.6666666666666667)],
         'Shala': [StringMatch(term='Shala', score=1)],
         'Adaba': [StringMatch(term='Adaba', score=1)],
         'Seweyna': [StringMatch(term='Seweyna', score=1)],
         'Chelia': [StringMatch(term='Cheliya', score=0.8571428571428572)],
         'Gudeyabila': [StringMatch(term='Gudeya Bila', score=0.9090909090909091)],
         'Negele Town': [StringMatch(term='Negele Town', score=1)],
         'Metu Rural': [StringMatch(term='Metu Zuria', score=0.7)],
         'Sinana': [StringMatch(term='Sinana', score=1)],
         'Gida Ayana': [StringMatch(term='Gida Ayana', score=1)],
         'Bedele Zuriya': [StringMatch(term='Badele Zuria', score=0.8461538461538461)],
         'Ambo': [StringMatch(term='Afambo', score=0.6666666666666667)],
         'Gidami': [StringMatch(term='Gidami', score=1)],
         'Arsi Negele Town': [StringMatch(term='Arsi Negele Town', score=1)],
         'Mendi': [StringMatch(term='Dendi', score=0.8)],
         'Arero': [StringMatch(term='Arero', score=1)],
         'Dilo': [StringMatch(term='Dilo', score=1)],
         'Kake': [StringMatch(term='Kache', score=0.6),
          StringMatch(term='Akaki', score=0.6)],
         'Enkelo Wabe': [StringMatch(term='Inkolo Wabe', score=0.8181818181818181)],
         'Adola Town': [StringMatch(term='Adola Town', score=1)],
         'Ayira': [StringMatch(term='Ayira', score=1)],
         'Dale Sedi': [StringMatch(term='Dale Sadi', score=0.8888888888888888)],
         'Chiro Zuriya': [StringMatch(term='Chiro Zuria', score=0.9166666666666666)],
         'Babile': [StringMatch(term='Berahile', score=0.625)],
         'Kersa Eh': [StringMatch(term='Mersa Town', score=0.5),
          StringMatch(term='Bereh', score=0.5)],
         'Gomole': [StringMatch(term='Gomole', score=1)],
         'Sude': [StringMatch(term='Sude', score=1)],
         'Meyo': [StringMatch(term='Meko', score=0.75),
          StringMatch(term='Miyo', score=0.75)],
         'Mendi Town': [StringMatch(term='Mendi Town', score=1)],
         'Meko': [StringMatch(term='Meko', score=1)],
         'Leqa Dulecha': [StringMatch(term='Leka Dulecha', score=0.9166666666666666)],
         'Amuru': [StringMatch(term='Amuru', score=1)],
         'Boke': [StringMatch(term='Boke', score=1)],
         'Nole Kaba': [StringMatch(term='Nole Kaba', score=1)],
         'Yemalogi Wolel': [StringMatch(term='Yama Logi Welel', score=0.8)],
         'Sigmo': [StringMatch(term='Sigmo', score=1)],
         'Guder': [StringMatch(term='Gumer', score=0.8)],
         'Didu': [StringMatch(term='Didu', score=1)],
         'Burka Dimtu': [StringMatch(term='Burqua Dhintu', score=0.6923076923076923)],
         'Wondo': [StringMatch(term='Wondo', score=1)],
         'Bantu': [StringMatch(term='Ibantu', score=0.8333333333333334)],
         'Jido': [StringMatch(term='Jida', score=0.75)],
         'Yayu': [StringMatch(term='Yayu', score=1)],
         'Mkelka Soda': [StringMatch(term='Melka Soda', score=0.9090909090909091)],
         'Ginir': [StringMatch(term='Ginir', score=1)],
         'Adea': [StringMatch(term='Adet', score=0.75),
          StringMatch(term='Adwa', score=0.75)],
         'Babo Gembel': [StringMatch(term='Chabe Gambeltu', score=0.5714285714285714)],
         'Bedele': [StringMatch(term='Bedeno', score=0.6666666666666667),
          StringMatch(term='Bedesa', score=0.6666666666666667)],
         'Kersa': [StringMatch(term='Kercha', score=0.6666666666666667)],
         'Dhidesa': [StringMatch(term='Dedesa', score=0.7142857142857143)],
         'Habro': [StringMatch(term='Habro', score=1)],
         'Ginir Town': [StringMatch(term='Ginir Town', score=1)],
         'Holeta Town': [StringMatch(term='Holeta Town', score=1)],
         'Nekemte Town': [StringMatch(term='Nekemte Town', score=1)],
         'Adola Reda': [StringMatch(term='Adola Town', score=0.6)],
         'Sebeta Town': [StringMatch(term='Sebeta Town', score=1)],
         'Liben': [StringMatch(term='Liben', score=1)],
         'Arsi Negele Rural': [StringMatch(term='Arsi Negele Town', score=0.7058823529411764)],
         'Anchar': [StringMatch(term='Anchar', score=1)],
         'Kersana Malima': [StringMatch(term='Kersana Malima', score=1)],
         'Fentale': [StringMatch(term='Fentale', score=1)],
         'Berbere': [StringMatch(term='Berbere', score=1)],
         'Mene Sibu': [StringMatch(term='Mana Sibu', score=0.7777777777777778)],
         'Siraro': [StringMatch(term='Siraro', score=1)],
         'Genji': [StringMatch(term='Gena', score=0.6),
          StringMatch(term='Gaji', score=0.6),
          StringMatch(term='Gechi', score=0.6)],
         'Inchini': [StringMatch(term='Tikur Enchini', score=0.5384615384615384)],
         'Seyo Nole': [StringMatch(term='Sayo Nole', score=0.8888888888888888)],
         'Gole Oda': [StringMatch(term='Golo Oda', score=0.875)],
         'Qiltu Kara': [StringMatch(term='Kiltu Kara', score=0.9)],
         'Goba': [StringMatch(term='Doba', score=0.75),
          StringMatch(term='Goma', score=0.75),
          StringMatch(term='Guba', score=0.75)],
         'Wonchi': [StringMatch(term='Wenchi', score=0.8333333333333334)],
         'Uraga': [StringMatch(term='Uraga', score=1)],
         'Boricha': [StringMatch(term='Boricha', score=1)],
         'Kimbibit': [StringMatch(term='Kimbibit', score=1)],
         'Gechi': [StringMatch(term='Gechi', score=1)],
         'Gindeberet': [StringMatch(term='Ginde Beret', score=0.9090909090909091)],
         'Woliso Town': [StringMatch(term='Woliso Town', score=1)],
         'Boji Dermeji': [StringMatch(term='Boji Dirmeji', score=0.9166666666666666)],
         'Gobu Seyo': [StringMatch(term='Gobu Seyo', score=1)],
         'Dodola Town': [StringMatch(term='Dodola Town', score=1)],
         'Bekoji Town': [StringMatch(term='Bekoji Town', score=1)],
         'Goro Dola': [StringMatch(term='Gora Dola', score=0.8888888888888888)],
         'Fichetown': [StringMatch(term='Fiche Town', score=0.9)],
         'Sibu Sire': [StringMatch(term='Sibu Sire', score=1)],
         'Becho': [StringMatch(term='Bero', score=0.6),
          StringMatch(term='Decha', score=0.6),
          StringMatch(term='Gechi', score=0.6),
          StringMatch(term='Mecha', score=0.6)],
         'Shashamane Town': [StringMatch(term='Shashemene Town', score=0.8666666666666667)],
         'Dukem Town': [StringMatch(term='Durame Town', score=0.7272727272727273)],
         'Dambi Dollo': [StringMatch(term='Denbi Dollo Town', score=0.5625)],
         'Arena Buluq': [StringMatch(term='Harena Buluk', score=0.8333333333333334)],
         'Anna Sora': [StringMatch(term='Ana Sora', score=0.8888888888888888)],
         'Jimma Spe Town': [StringMatch(term='Jimma Town', score=0.7142857142857143)],
         'Zeway Dugda': [StringMatch(term='Ziway Dugda', score=0.9090909090909091)],
         'Gimbi Adventist': [StringMatch(term='Gimbi Town', score=0.4666666666666667)],
         'Bisidimo': [StringMatch(term='Bilidigilu', score=0.5),
          StringMatch(term='Sigmo', score=0.5)],
         'Kuyu': [StringMatch(term='Kuyu', score=1)],
         'Digeluna Tijo': [StringMatch(term='Degeluna Tijo', score=0.9230769230769231)],
         'Munesa': [StringMatch(term='Munessa', score=0.8571428571428572)],
         'Ebantu': [StringMatch(term='Ibantu', score=0.8333333333333334)],
         'Shambu': [StringMatch(term='Shambu Town', score=0.5454545454545454)],
         'Gelana': [StringMatch(term='Delanta', score=0.7142857142857143)],
         'Goro': [StringMatch(term='Horo', score=0.75),
          StringMatch(term='Soro', score=0.75)],
         'Fiche': [StringMatch(term='Kache', score=0.6)],
         'Boji Cheqorsa': [StringMatch(term='Boji Chekorsa', score=0.9230769230769231)],
         'Batu': [StringMatch(term='Bati', score=0.75)],
         'Tikur Enchini': [StringMatch(term='Tikur Enchini', score=1)],
         'Gimbi': [StringMatch(term='Gimbi', score=1)],
         'Diga': [StringMatch(term='Diga', score=1)],
         'Hababo Guduru': [StringMatch(term='Choman Guduru', score=0.6153846153846154)],
         'Ilu Galan': [StringMatch(term='Illu Galan', score=0.9)],
         'Nono Benja': [StringMatch(term='Nono Benja', score=1)],
         'Dugda Dawa': [StringMatch(term='Dugda Dawa', score=1)],
         'Gasera': [StringMatch(term='Gasera', score=1)],
         'Bule Hora Toun': [StringMatch(term='Bule Hora Town', score=0.9285714285714286)],
         'Dera': [StringMatch(term='Gera', score=0.75),
          StringMatch(term='Wera', score=0.75),
          StringMatch(term='Dega', score=0.75),
          StringMatch(term='Dara', score=0.75)],
         'Kumbi': [StringMatch(term='Kumbi', score=1)],
         'Gelemso': [StringMatch(term='Telemt', score=0.5714285714285714),
          StringMatch(term='Lemmo', score=0.5714285714285714),
          StringMatch(term='Gerese', score=0.5714285714285714),
          StringMatch(term='Guliso', score=0.5714285714285714)],
         'Biebirsa Kojowa': [StringMatch(term='Birbirsa Kojowa', score=0.9333333333333333)],
         'Dodola': [StringMatch(term='Dodola', score=1)],
         'Mana': [StringMatch(term='Zana', score=0.75)],
         'Teltele': [StringMatch(term='Teltale', score=0.8571428571428572)],
         'Hidabu Abote': [StringMatch(term='Hidabu Abote', score=1)],
         'Chora Boter': [StringMatch(term='Chora (Buno Bedele)', score=0.4736842105263158)],
         'Garemuleta': [StringMatch(term='Geraleta', score=0.6)],
         'Doreni': [StringMatch(term='Dorani', score=0.8333333333333334)],
         'Nejo Town': [StringMatch(term='Nejo Town', score=1)],
         'Wadara': [StringMatch(term='Wadera', score=0.8333333333333334)],
         'Yubdo': [StringMatch(term='Yubdo', score=1)],
         'Agarfa': [StringMatch(term='Agarfa', score=1)],
         'Abe Dengoro': [StringMatch(term='Abe Dongoro', score=0.9090909090909091)],
         'Gawo Qebe': [StringMatch(term='Gawo Kebe', score=0.8888888888888888)],
         'Matahara': [StringMatch(term='May Kadra', score=0.5555555555555556)],
         'Fedis': [StringMatch(term='Fedis', score=1)],
         'Lalo Asabi': [StringMatch(term='Lalo Asabi', score=1)],
         'El Way': [StringMatch(term='Elwaya', score=0.6666666666666667)],
         'Dendi': [StringMatch(term='Dendi', score=1)],
         'Haru': [StringMatch(term='Haru', score=1)],
         'Haro Limu': [StringMatch(term='Haro Limu', score=1)],
         'Wayu Tuqa': [StringMatch(term='Wayu Tuka', score=0.8888888888888888)],
         'Chora': [StringMatch(term='Chifra', score=0.6666666666666667)],
         'Babile Woreda': [StringMatch(term='Babile (Or)', score=0.6923076923076923)],
         'Hawa Gelan': [StringMatch(term='Hawa Galan', score=0.9)],
         'Tiro Afeta': [StringMatch(term='Tiro Afeta', score=1)],
         'Jarso': [StringMatch(term='Ararso', score=0.6666666666666667)],
         'Girawa': [StringMatch(term='Girawa', score=1)],
         'Hurumu': [StringMatch(term='Hurumu', score=1)],
         'Dire': [StringMatch(term='Dire', score=1)],
         'Aweday Town': [StringMatch(term='Aweday Town', score=1)],
         'Seka Chekorsa': [StringMatch(term='Seka Chekorsa', score=1)],
         'Sasiga': [StringMatch(term='Sasiga', score=1)],
         'Limuna Bilbilo': [StringMatch(term='Limu Bilbilo', score=0.8571428571428572)],
         'Saba Boru': [StringMatch(term='Saba Boru', score=1)],
         'Jima Rare': [StringMatch(term='Jimma Rare', score=0.9)],
         'Deder': [StringMatch(term='Deder', score=1)],
         'Meda Welabu': [StringMatch(term='Meda Welabu', score=1)],
         'Meta': [StringMatch(term='Meta', score=1)],
         'Darimu': [StringMatch(term='Darimu', score=1)],
         'Metu Town': [StringMatch(term='Metu Town', score=1)],
         'Wama Hagelo': [StringMatch(term='Wama Hagalo', score=0.9090909090909091)],
         'Hawi Gudina': [StringMatch(term='Hawi Gudina\n', score=0.9166666666666666)],
         'Ginde Beret': [StringMatch(term='Ginde Beret', score=1)],
         'Gedo': [StringMatch(term='Dedo', score=0.75)],
         'Limu Seka': [StringMatch(term='Limu Seka', score=1)],
         'Liban Jawi': [StringMatch(term='Liban Jawi', score=1)],
         'Doba': [StringMatch(term='Doba', score=1)],
         'Negele': [StringMatch(term='Egela', score=0.6666666666666667),
          StringMatch(term='Megale', score=0.6666666666666667)],
         'Girar Jarso': [StringMatch(term='Gerar Jarso', score=0.9090909090909091)]})
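
The scores in the output above are consistent with a length-normalized edit-distance similarity, 1 - d(a, b) / max(len(a), len(b)), where d is the Levenshtein distance. The following is a minimal sketch under that assumption. It is not openclean's implementation, and the function names `levenshtein` and `similarity` are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the number of single-character edits that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


print(similarity('Gole Oda', 'Golo Oda'))
print(similarity('Qiltu Kara', 'Kiltu Kara'))
```

Under this assumption, the two calls reproduce the scores 0.875 and 0.9 reported for these pairs in the mapping.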

No value was left unmatched; that is, every query had at least one match.

[17]:
woreda_map.unmatched()
[17]:
set()

To replace values with their closest match, the mapping must be converted into a lookup dictionary. Doing so makes openclean emit a series of warnings: the lookup dictionary is created, but it omits every key that had more than one match.

[18]:
woreda_map.to_lookup()
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Kundala (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Chelenko (5 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Tulu Bolo (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Mullo (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Gojo (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Gambo (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Gimbi Public (3 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Bora (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Limu (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Gursum (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Hanbala (4 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Burayu Town (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Bure (3 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Delo Mena (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Halu (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Dambi Dolo (5 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Dima (4 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Moyale (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Gimbi Rural (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Gelan Town (5 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Kake (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Kersa Eh (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Meyo (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Adea (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Bedele (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Genji (3 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Goba (3 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Becho (4 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Bisidimo (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Goro (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Dera (4 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Gelemso (4 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
/home/heiko/projects/openclean/openclean-core/openclean/data/mapping.py:222: UserWarning: Ignoring key: Negele (2 matches). To include ignored keys, update the map to contain only 1 match per key
  warnings.warn('Ignoring key: {} ({} matches). To include ignored keys, '
[18]:
{'Dodota': 'Dodota',
 'Haromaya Town': 'Haromaya Town',
 'Merti': 'Merti',
 'Shambu Town': 'Shambu Town',
 'Setema': 'Setema',
 'Dapho Hana': 'Dabo Hana',
 'Dugda': 'Dugda',
 'Haromaya': 'Haro Maya',
 'Deder Town': 'Deder Town',
 'Dale Wabera': 'Dale Wabera',
 "Sodo Dac'Ha": 'Sodo Daci',
 'Bilonopa': 'Bilo Nopha',
 'Muke Turi': 'Metu Zuria',
 'Midakegni': 'Mida Kegn',
 'Bule Hora': 'Bule Hora',
 'Sebeta Awas': 'Sebeta Hawas',
 'Elifata': 'Ifata',
 'Kore': 'Kore',
 'Begi': 'Begi',
 'Shirka': 'Shirka',
 'Bore': 'Bore',
 'Horro Buluk': 'Horo Buluk',
 'Limu Kosa': 'Limu Kosa',
 'Aseko': 'Aseko',
 'Adola': 'Adola',
 'Jimma Horo': 'Jimma Horo',
 'Hebal Arsi': 'Heban Arsi',
 'Seru': 'Seru',
 'Gedeb Asasa': 'Gedeb Asasa',
 'Adea Berga': 'Adda Berga',
 'Walmera': 'Welmera',
 'Ambo Town': 'Ambo Town',
 'Ameya': 'Ameya',
 'Jima Geneti': 'Jimma Genete',
 'Chewaqa': 'Chwaka',
 'Ejerie': 'Saesie',
 'Kombolicha': 'Kombolcha',
 'Daro Lebu': 'Daro Lebu',
 'Dubluk': 'Dubluk',
 'Amigna': 'Amigna',
 'Berreh': 'Bereh',
 'Aleiltu': 'Aleltu',
 'Ejersa Lafo': 'Ejersa Lafo',
 'Algesachi': 'Alge Sachi',
 'Ale': 'Ale',
 'Raitu': 'Rayitu',
 'Codi': 'Cobi',
 'Seka Chhokorsa': 'Seka Chekorsa',
 'Shakiso Town': 'Shakiso Town',
 'Gomma': 'Goma',
 'Bedeno': 'Bedeno',
 'Jeju': 'Jeju',
 'Shabe': 'Shala',
 'Haromaya Rural': 'Haromaya Town',
 'Abay Chomen': 'Abay Chomen',
 'Degem': 'Degem',
 'Lomme': 'Loma',
 'Mesela': 'Mesela',
 'Abuna Gindeberet': 'Abuna Ginde Beret',
 'Meiso': 'Miesso',
 'Sedan Chanka': 'Sedi Chenka',
 'Tibe Kutaye': 'Toke Kutaye',
 'Bale Gesgara': 'Bele Gesgar',
 'Gera': 'Gera',
 'Adami Tulu Jido Kombolcha': 'Adama Tulu Jido Kombolcha',
 'Nono': 'Nono',
 'Ludehetosa': 'Lude Hitosa',
 'Legehida': 'Legehida',
 'Holeta': 'Holeta Town',
 'Gumi Eldallo': 'Gumi Idalo',
 'Yabelo': 'Yabelo',
 'Guliso': 'Guliso',
 'Bako Tibe': 'Bako Tibe',
 'B/Tolyi': 'Tole',
 'Dinsho': 'Dinsho',
 'Wachile': 'Wachile',
 'Bishan Guracha Town': 'Bishan Guracha',
 'Homa': 'Homa',
 'Bedele Town': 'Bedele Town',
 'Hitosa': 'Hitosa',
 'Sire': 'Sire',
 'Gimbichu': 'Gimbichu',
 'Deksis': 'Diksis',
 'Sandefa': 'Sankura',
 'Goro Gutu': 'Goro Gutu',
 'Shashemene Rural': 'Shashemene Zuria',
 'Akaki': 'Akaki',
 'Adama': 'Adama',
 'Chomen Guduru': 'Choman Guduru',
 'Woliso Rural': 'Woliso Town',
 'Chole': 'Chole',
 'Jimma Arjo': 'Jimma Arjo',
 'Kiremu': 'Kiremu',
 'Tena': 'Tena',
 'Tiyo': 'Tiyo',
 'Debre Libanos': 'Debre Libanos',
 'Omonada': 'Omo Nada',
 'Dano': 'Dano',
 'Meyu Muleke': 'Meyu Muleke',
 'Chiro Town': 'Chiro Town',
 'Jarte Jardga': 'Jarte Jardega',
 'Olanciti': 'Olanciti Town',
 'Were Jarso': 'Wara Jarso',
 'Meta Waliqite': 'Meta Walkite',
 'Gudru': 'Guduru',
 'Wuchale': 'Wuchale',
 'Dedesa': 'Dedesa',
 'Horo': 'Horo',
 'Kurfa Chele': 'Kurfa Chele',
 'Gololcha': 'Golocha',
 'Dawe Qachen': 'Dawe Ketchen',
 'Kokosa': 'Kokosa',
 'Abaya': 'Abaya',
 'Nejo Rural': 'Nejo Town',
 'Nunu Qumba': 'Nunu Kumba',
 'Tulo': 'Tullo',
 'Suro Barguda': 'Suro Berguda',
 'Agaro': 'Amaro',
 'Boneya Bushe': 'Boneya Boshe',
 'Midega Tole': 'Midhaga Tola',
 'Robe Town': 'Robe Town',
 'Seyo': 'Sayo',
 'St.Luke': 'Dubluk',
 'Dedo': 'Dedo',
 'Adama Town': 'Adama Town',
 'Nejo': 'Nejo',
 'Hambela Wamena': 'Hambela Wamena',
 'Assela Town': 'Asela Town',
 'Goba Town': 'Goba Town',
 'Guto Gida': 'Guto Gida',
 'Guba Qoricha': 'Goba Koricha',
 'West Harerge\tGumbi Bordede': 'Gumbi Bordede',
 'Dawo': 'Dawo',
 'Seden Sodo': 'Seden Sodo',
 'Chiro': 'Chire',
 'Ilu': 'Ilu',
 'Nono Sele': 'Nono Benja',
 'Melka Belo': 'Melka Balo',
 'Omo Nada': 'Omo Nada',
 'Mancho': 'Mancho',
 'Laloqile': 'Lalo Kile',
 'Goro Muti': 'Goro Muti',
 'Aga Wayyu': 'Aga Wayu',
 'Guma': 'Gumay',
 'Kofele': 'Kofele',
 'Modjo Town': 'Mojo Town',
 'Yabelo Rural': 'Yabelo Town',
 'Gemechis': 'Gemechis',
 'Dhas': 'Dhas',
 'Bedesa Town': 'Bedele Town',
 'Dawe Serar': 'Dale Wabera',
 'Yaya Gulele': 'Yaya Gulele',
 'Boset': 'Boset',
 'Bako': 'Babo',
 'Bishoftu Town': 'Bishoftu Town',
 'Chinakesen': 'Chinaksen',
 'Dodola Rural': 'Dodola Town',
 'Lata Sibu': 'Leta Sibu',
 'Shenan Kolu': 'Shanan Kolu',
 'Jeldu': 'Jeldu',
 'Gura Dhamole': 'Gura Damole',
 'Kercha': 'Kercha',
 'Anfilo': 'Anfilo',
 'Oda Bultum': 'Kuni /Oda Bultum',
 'Sekoru': 'Sekoru',
 'Lege Dadi Lege Tafo Town': 'Lege Tafo-Lege Dadi Town',
 'Guna': 'Guna',
 'Nensebo': 'Nenesebo',
 'Sululta Town': 'Sululta Town',
 'Abichugna': "Abichugna Gne'A",
 'Loke Hada': 'Lege Hida',
 'Mojo': 'Nejo',
 'Girja': 'Girawa',
 'Tole': 'Tole',
 'Robe': 'Robe',
 'Guchi': 'Guchi',
 'Jibat': 'Jibat',
 'Metarobi': 'Meta Robi',
 'Odo Shakiso': 'Odo Shakiso',
 'O/Beyam': 'Omo Beyam',
 'Shala': 'Shala',
 'Adaba': 'Adaba',
 'Seweyna': 'Seweyna',
 'Chelia': 'Cheliya',
 'Gudeyabila': 'Gudeya Bila',
 'Negele Town': 'Negele Town',
 'Metu Rural': 'Metu Zuria',
 'Sinana': 'Sinana',
 'Gida Ayana': 'Gida Ayana',
 'Bedele Zuriya': 'Badele Zuria',
 'Ambo': 'Afambo',
 'Gidami': 'Gidami',
 'Arsi Negele Town': 'Arsi Negele Town',
 'Mendi': 'Dendi',
 'Arero': 'Arero',
 'Dilo': 'Dilo',
 'Enkelo Wabe': 'Inkolo Wabe',
 'Adola Town': 'Adola Town',
 'Ayira': 'Ayira',
 'Dale Sedi': 'Dale Sadi',
 'Chiro Zuriya': 'Chiro Zuria',
 'Babile': 'Berahile',
 'Gomole': 'Gomole',
 'Sude': 'Sude',
 'Mendi Town': 'Mendi Town',
 'Meko': 'Meko',
 'Leqa Dulecha': 'Leka Dulecha',
 'Amuru': 'Amuru',
 'Boke': 'Boke',
 'Nole Kaba': 'Nole Kaba',
 'Yemalogi Wolel': 'Yama Logi Welel',
 'Sigmo': 'Sigmo',
 'Guder': 'Gumer',
 'Didu': 'Didu',
 'Burka Dimtu': 'Burqua Dhintu',
 'Wondo': 'Wondo',
 'Bantu': 'Ibantu',
 'Jido': 'Jida',
 'Yayu': 'Yayu',
 'Mkelka Soda': 'Melka Soda',
 'Ginir': 'Ginir',
 'Babo Gembel': 'Chabe Gambeltu',
 'Kersa': 'Kercha',
 'Dhidesa': 'Dedesa',
 'Habro': 'Habro',
 'Ginir Town': 'Ginir Town',
 'Holeta Town': 'Holeta Town',
 'Nekemte Town': 'Nekemte Town',
 'Adola Reda': 'Adola Town',
 'Sebeta Town': 'Sebeta Town',
 'Liben': 'Liben',
 'Arsi Negele Rural': 'Arsi Negele Town',
 'Anchar': 'Anchar',
 'Kersana Malima': 'Kersana Malima',
 'Fentale': 'Fentale',
 'Berbere': 'Berbere',
 'Mene Sibu': 'Mana Sibu',
 'Siraro': 'Siraro',
 'Inchini': 'Tikur Enchini',
 'Seyo Nole': 'Sayo Nole',
 'Gole Oda': 'Golo Oda',
 'Qiltu Kara': 'Kiltu Kara',
 'Wonchi': 'Wenchi',
 'Uraga': 'Uraga',
 'Boricha': 'Boricha',
 'Kimbibit': 'Kimbibit',
 'Gechi': 'Gechi',
 'Gindeberet': 'Ginde Beret',
 'Woliso Town': 'Woliso Town',
 'Boji Dermeji': 'Boji Dirmeji',
 'Gobu Seyo': 'Gobu Seyo',
 'Dodola Town': 'Dodola Town',
 'Bekoji Town': 'Bekoji Town',
 'Goro Dola': 'Gora Dola',
 'Fichetown': 'Fiche Town',
 'Sibu Sire': 'Sibu Sire',
 'Shashamane Town': 'Shashemene Town',
 'Dukem Town': 'Durame Town',
 'Dambi Dollo': 'Denbi Dollo Town',
 'Arena Buluq': 'Harena Buluk',
 'Anna Sora': 'Ana Sora',
 'Jimma Spe Town': 'Jimma Town',
 'Zeway Dugda': 'Ziway Dugda',
 'Gimbi Adventist': 'Gimbi Town',
 'Kuyu': 'Kuyu',
 'Digeluna Tijo': 'Degeluna Tijo',
 'Munesa': 'Munessa',
 'Ebantu': 'Ibantu',
 'Shambu': 'Shambu Town',
 'Gelana': 'Delanta',
 'Fiche': 'Kache',
 'Boji Cheqorsa': 'Boji Chekorsa',
 'Batu': 'Bati',
 'Tikur Enchini': 'Tikur Enchini',
 'Gimbi': 'Gimbi',
 'Diga': 'Diga',
 'Hababo Guduru': 'Choman Guduru',
 'Ilu Galan': 'Illu Galan',
 'Nono Benja': 'Nono Benja',
 'Dugda Dawa': 'Dugda Dawa',
 'Gasera': 'Gasera',
 'Bule Hora Toun': 'Bule Hora Town',
 'Kumbi': 'Kumbi',
 'Biebirsa Kojowa': 'Birbirsa Kojowa',
 'Dodola': 'Dodola',
 'Mana': 'Zana',
 'Teltele': 'Teltale',
 'Hidabu Abote': 'Hidabu Abote',
 'Chora Boter': 'Chora (Buno Bedele)',
 'Garemuleta': 'Geraleta',
 'Doreni': 'Dorani',
 'Nejo Town': 'Nejo Town',
 'Wadara': 'Wadera',
 'Yubdo': 'Yubdo',
 'Agarfa': 'Agarfa',
 'Abe Dengoro': 'Abe Dongoro',
 'Gawo Qebe': 'Gawo Kebe',
 'Matahara': 'May Kadra',
 'Fedis': 'Fedis',
 'Lalo Asabi': 'Lalo Asabi',
 'El Way': 'Elwaya',
 'Dendi': 'Dendi',
 'Haru': 'Haru',
 'Haro Limu': 'Haro Limu',
 'Wayu Tuqa': 'Wayu Tuka',
 'Chora': 'Chifra',
 'Babile Woreda': 'Babile (Or)',
 'Hawa Gelan': 'Hawa Galan',
 'Tiro Afeta': 'Tiro Afeta',
 'Jarso': 'Ararso',
 'Girawa': 'Girawa',
 'Hurumu': 'Hurumu',
 'Dire': 'Dire',
 'Aweday Town': 'Aweday Town',
 'Seka Chekorsa': 'Seka Chekorsa',
 'Sasiga': 'Sasiga',
 'Limuna Bilbilo': 'Limu Bilbilo',
 'Saba Boru': 'Saba Boru',
 'Jima Rare': 'Jimma Rare',
 'Deder': 'Deder',
 'Meda Welabu': 'Meda Welabu',
 'Meta': 'Meta',
 'Darimu': 'Darimu',
 'Metu Town': 'Metu Town',
 'Wama Hagelo': 'Wama Hagalo',
 'Hawi Gudina': 'Hawi Gudina\n',
 'Ginde Beret': 'Ginde Beret',
 'Gedo': 'Dedo',
 'Limu Seka': 'Limu Seka',
 'Liban Jawi': 'Liban Jawi',
 'Doba': 'Doba',
 'Girar Jarso': 'Gerar Jarso'}
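
The behaviour seen above can be pictured with plain dictionaries: keys with exactly one candidate go into the lookup, while keys with several candidates are skipped with a warning. This is a minimal sketch, not openclean's code. `to_lookup_sketch` is a hypothetical helper, and the candidates are plain (term, score) tuples copied from the output above.

```python
import warnings
from typing import Dict, List, Tuple

Candidates = Dict[str, List[Tuple[str, float]]]


def to_lookup_sketch(matches: Candidates) -> Dict[str, str]:
    """Build a key -> term lookup, skipping keys with more than one candidate."""
    lookup = {}
    for key, candidates in matches.items():
        if len(candidates) == 1:
            # Unambiguous: take the term of the only candidate.
            lookup[key] = candidates[0][0]
        elif len(candidates) > 1:
            # Ambiguous: leave the key out and tell the user why.
            warnings.warn(f'Ignoring key: {key} ({len(candidates)} matches)')
    return lookup


matches = {
    'Wonchi': [('Wenchi', 0.8333333333333334)],
    'Goro': [('Horo', 0.75), ('Soro', 0.75)],
}
print(to_lookup_sketch(matches))
```

In this sketch only `'Wonchi'` ends up in the lookup; `'Goro'` is left out until one of its two candidates is chosen.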

This is where having a user with domain knowledge in the loop pays off. For each value that had multiple matches, the user selects the correct option and updates the mapping.

[19]:
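# Collect the keys that have more than one candidate match.
multiple_matches = [
    key for key, count in woreda_map.match_counts().items() if count > 1
]

# Show only the ambiguous entries so the user can pick the correct match.
pp.pprint(woreda_map.filter(multiple_matches))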
Mapping(<class 'list'>,
        { 'Adea': [ StringMatch(term='Adet', score=0.75),
                    StringMatch(term='Adwa', score=0.75)],
          'Becho': [ StringMatch(term='Bero', score=0.6),
                     StringMatch(term='Decha', score=0.6),
                     StringMatch(term='Gechi', score=0.6),
                     StringMatch(term='Mecha', score=0.6)],
          'Bedele': [ StringMatch(term='Bedeno', score=0.6666666666666667),
                      StringMatch(term='Bedesa', score=0.6666666666666667)],
          'Bisidimo': [ StringMatch(term='Bilidigilu', score=0.5),
                        StringMatch(term='Sigmo', score=0.5)],
          'Bora': [ StringMatch(term='Bore', score=0.75),
                    StringMatch(term='Bura', score=0.75)],
          'Burayu Town': [ StringMatch(term='Bure Town', score=0.7272727272727273),
                           StringMatch(term='Durame Town', score=0.7272727272727273)],
          'Bure': [ StringMatch(term='Bura', score=0.75),
                    StringMatch(term='Bule', score=0.75),
                    StringMatch(term='Bore', score=0.75)],
          'Chelenko': [ StringMatch(term='Cheliya', score=0.5),
                        StringMatch(term='Chena', score=0.5),
                        StringMatch(term='Shenkor', score=0.5),
                        StringMatch(term='Chole', score=0.5),
                        StringMatch(term='Sheko', score=0.5)],
          'Dambi Dolo': [ StringMatch(term='Damboya', score=0.5),
                          StringMatch(term='Denbi Dollo Town', score=0.5),
                          StringMatch(term='Damot Gale', score=0.5),
                          StringMatch(term='Damot Sore', score=0.5),
                          StringMatch(term='Gimbi Town', score=0.5)],
          'Delo Mena': [ StringMatch(term='Doyogena', score=0.5555555555555556),
                         StringMatch(term='Melo Gada', score=0.5555555555555556)],
          'Dera': [ StringMatch(term='Gera', score=0.75),
                    StringMatch(term='Wera', score=0.75),
                    StringMatch(term='Dega', score=0.75),
                    StringMatch(term='Dara', score=0.75)],
          'Dima': [ StringMatch(term='Diga', score=0.75),
                    StringMatch(term='Disa', score=0.75),
                    StringMatch(term='Dita', score=0.75),
                    StringMatch(term='Dama', score=0.75)],
          'Gambo': [ StringMatch(term='Garbo', score=0.8),
                     StringMatch(term='Gimbo', score=0.8)],
          'Gelan Town': [ StringMatch(term='Gedeb Town', score=0.7),
                          StringMatch(term='Asela Town', score=0.7),
                          StringMatch(term='Dejen Town', score=0.7),
                          StringMatch(term='Dila Town', score=0.7),
                          StringMatch(term='Goba Town', score=0.7)],
          'Gelemso': [ StringMatch(term='Telemt', score=0.5714285714285714),
                       StringMatch(term='Lemmo', score=0.5714285714285714),
                       StringMatch(term='Gerese', score=0.5714285714285714),
                       StringMatch(term='Guliso', score=0.5714285714285714)],
          'Genji': [ StringMatch(term='Gena', score=0.6),
                     StringMatch(term='Gaji', score=0.6),
                     StringMatch(term='Gechi', score=0.6)],
          'Gimbi Public': [ StringMatch(term='Gimbi Town', score=0.5),
                            StringMatch(term='Gimbichu', score=0.5),
                            StringMatch(term='Kimbibit', score=0.5)],
          'Gimbi Rural': [ StringMatch(term='Gimbi Town', score=0.5454545454545454),
                           StringMatch(term='Gimbichu', score=0.5454545454545454)],
          'Goba': [ StringMatch(term='Doba', score=0.75),
                    StringMatch(term='Goma', score=0.75),
                    StringMatch(term='Guba', score=0.75)],
          'Gojo': [ StringMatch(term='Goglo', score=0.6),
                    StringMatch(term='Gonje', score=0.6)],
          'Goro': [ StringMatch(term='Horo', score=0.75),
                    StringMatch(term='Soro', score=0.75)],
          'Gursum': [ StringMatch(term='Gursum (Or)', score=0.5454545454545454),
                      StringMatch(term='Gursum (Sm)', score=0.5454545454545454)],
          'Halu': [ StringMatch(term='Kalu', score=0.75),
                    StringMatch(term='Haru', score=0.75)],
          'Hanbala': [ StringMatch(term='Hawela', score=0.5714285714285714),
                       StringMatch(term='Abaala', score=0.5714285714285714),
                       StringMatch(term='Hanruka', score=0.5714285714285714),
                       StringMatch(term='Dangila', score=0.5714285714285714)],
          'Kake': [ StringMatch(term='Kache', score=0.6),
                    StringMatch(term='Akaki', score=0.6)],
          'Kersa Eh': [ StringMatch(term='Mersa Town', score=0.5),
                        StringMatch(term='Bereh', score=0.5)],
          'Kundala': [ StringMatch(term='Kunneba', score=0.5714285714285714),
                       StringMatch(term='Undulu', score=0.5714285714285714)],
          'Limu': [ StringMatch(term='Darimu', score=0.5),
                    StringMatch(term='Kiremu', score=0.5)],
          'Meyo': [ StringMatch(term='Meko', score=0.75),
                    StringMatch(term='Miyo', score=0.75)],
          'Moyale': [ StringMatch(term='Megale', score=0.6666666666666667),
                      StringMatch(term='Yocale', score=0.6666666666666667)],
          'Mullo': [ StringMatch(term='Mulo', score=0.8),
                     StringMatch(term='Tullo', score=0.8)],
          'Negele': [ StringMatch(term='Egela', score=0.6666666666666667),
                      StringMatch(term='Megale', score=0.6666666666666667)],
          'Tulu Bolo': [ StringMatch(term='Tullo', score=0.5555555555555556),
                         StringMatch(term='Tulo (Or)', score=0.5555555555555556)]})
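
When fixes are entered by hand, a typo in a chosen term could introduce a value that is not among the candidates. The sketch below checks each choice against the candidates for its key. It uses plain dictionaries and a hypothetical helper, `invalid_choices`; it does not rely on any openclean API.

```python
from typing import Dict, List


def invalid_choices(candidates: Dict[str, List[str]],
                    choices: Dict[str, str]) -> Dict[str, str]:
    """Return the choices whose term is not among the key's candidates."""
    return {
        key: term
        for key, term in choices.items()
        if term not in candidates.get(key, [])
    }


candidates = {
    'Goro': ['Horo', 'Soro'],
    'Mullo': ['Mulo', 'Tullo'],
}
choices = {'Goro': 'Horo', 'Mullo': 'Mullo'}  # second entry is a deliberate typo
print(invalid_choices(candidates, choices))
```

An empty result means every chosen term is one of the suggested candidates.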
[20]:
multi_match_fixes = {
    'Chelenko': 'Cheliya',
    'Becho': 'Bero',
    'Burayu Town': 'Bure Town',
    'Negele': 'Egela',
    'Kundala': 'Kunneba',
    'Gursum': 'Gursum (Or)',
    'Meyo': 'Meko',
    'Limu': 'Darimu',
    'Dambi Dolo': 'Damboya',
    'Gelemso': 'Telemt',
    'Moyale': 'Megale',
    'Delo Mena': 'Doyogena',
    'Hanbala': 'Hawela',
    'Dima': 'Diga',
    'Halu': 'Kalu',
    'Kersa Eh': 'Mersa Town',
    'Bedele': 'Bedeno',
    'Dera': 'Gera',
    'Tulu Bolo': 'Tullo',
    'Gimbi Rural': 'Gimbi Town',
    'Gelan Town': 'Gedeb Town',
    'Kake': 'Kache',
    'Goro': 'Horo',
    'Gambo': 'Garbo',
    'Bora': 'Bore',
    'Goba': 'Doba',
    'Adea': 'Adet',
    'Bure': 'Bura',
    'Mullo': 'Mulo',
    'Gojo': 'Goglo',
    'Gimbi Public': 'Gimbi Town',
    'Bisidimo': 'Bilidigilu',
    'Genji': 'Gena'
}

# Replace each ambiguous match list with the single term chosen above.
woreda_map.update(multi_match_fixes)
[20]:
Mapping(list,
        {'Dodota': [StringMatch(term='Dodota', score=1)],
         'Haromaya Town': [StringMatch(term='Haromaya Town', score=1)],
         'Merti': [StringMatch(term='Merti', score=1)],
         'Shambu Town': [StringMatch(term='Shambu Town', score=1)],
         'Setema': [StringMatch(term='Setema', score=1)],
         'Dapho Hana': [StringMatch(term='Dabo Hana', score=0.8)],
         'Dugda': [StringMatch(term='Dugda', score=1)],
         'Kundala': [ExactMatch(term='Kunneba', score=1.0)],
         'Haromaya': [StringMatch(term='Haro Maya', score=0.8888888888888888)],
         'Deder Town': [StringMatch(term='Deder Town', score=1)],
         'Dale Wabera': [StringMatch(term='Dale Wabera', score=1)],
         'Chelenko': [ExactMatch(term='Cheliya', score=1.0)],
         "Sodo Dac'Ha": [StringMatch(term='Sodo Daci', score=0.7272727272727273)],
         'Bilonopa': [StringMatch(term='Bilo Nopha', score=0.8)],
         'Muke Turi': [StringMatch(term='Metu Zuria', score=0.5)],
         'Midakegni': [StringMatch(term='Mida Kegn', score=0.7777777777777778)],
         'Bule Hora': [StringMatch(term='Bule Hora', score=1)],
         'Tulu Bolo': [ExactMatch(term='Tullo', score=1.0)],
         'Mullo': [ExactMatch(term='Mulo', score=1.0)],
         'Sebeta Awas': [StringMatch(term='Sebeta Hawas', score=0.9166666666666666)],
         'Elifata': [StringMatch(term='Ifata', score=0.7142857142857143)],
         'Kore': [StringMatch(term='Kore', score=1)],
         'Begi': [StringMatch(term='Begi', score=1)],
         'Shirka': [StringMatch(term='Shirka', score=1)],
         'Bore': [StringMatch(term='Bore', score=1)],
         'Horro Buluk': [StringMatch(term='Horo Buluk', score=0.9090909090909091)],
         'Gojo': [ExactMatch(term='Goglo', score=1.0)],
         'Limu Kosa': [StringMatch(term='Limu Kosa', score=1)],
         'Aseko': [StringMatch(term='Aseko', score=1)],
         'Gambo': [ExactMatch(term='Garbo', score=1.0)],
         'Adola': [StringMatch(term='Adola', score=1)],
         'Jimma Horo': [StringMatch(term='Jimma Horo', score=1)],
         'Hebal Arsi': [StringMatch(term='Heban Arsi', score=0.9)],
         'Seru': [StringMatch(term='Seru', score=1)],
         'Gimbi Public': [ExactMatch(term='Gimbi Town', score=1.0)],
         'Gedeb Asasa': [StringMatch(term='Gedeb Asasa', score=1)],
         'Adea Berga': [StringMatch(term='Adda Berga', score=0.9)],
         'Walmera': [StringMatch(term='Welmera', score=0.8571428571428572)],
         'Ambo Town': [StringMatch(term='Ambo Town', score=1)],
         'Ameya': [StringMatch(term='Ameya', score=1)],
         'Jima Geneti': [StringMatch(term='Jimma Genete', score=0.8333333333333334)],
         'Bora': [ExactMatch(term='Bore', score=1.0)],
         'Chewaqa': [StringMatch(term='Chwaka', score=0.7142857142857143)],
         'Ejerie': [StringMatch(term='Saesie', score=0.5)],
         'Kombolicha': [StringMatch(term='Kombolcha', score=0.9)],
         'Daro Lebu': [StringMatch(term='Daro Lebu', score=1)],
         'Dubluk': [StringMatch(term='Dubluk', score=1)],
         'Amigna': [StringMatch(term='Amigna', score=1)],
         'Berreh': [StringMatch(term='Bereh', score=0.8333333333333334)],
         'Aleiltu': [StringMatch(term='Aleltu', score=0.8571428571428572)],
         'Ejersa Lafo': [StringMatch(term='Ejersa Lafo', score=1)],
         'Algesachi': [StringMatch(term='Alge Sachi', score=0.9)],
         'Ale': [StringMatch(term='Ale', score=1)],
         'Raitu': [StringMatch(term='Rayitu', score=0.8333333333333334)],
         'Codi': [StringMatch(term='Cobi', score=0.75)],
         'Seka Chhokorsa': [StringMatch(term='Seka Chekorsa', score=0.8571428571428572)],
         'Shakiso Town': [StringMatch(term='Shakiso Town', score=1)],
         'Gomma': [StringMatch(term='Goma', score=0.8)],
         'Bedeno': [StringMatch(term='Bedeno', score=1)],
         'Jeju': [StringMatch(term='Jeju', score=1)],
         'Shabe': [StringMatch(term='Shala', score=0.6)],
         'Haromaya Rural': [StringMatch(term='Haromaya Town', score=0.6428571428571428)],
         'Abay Chomen': [StringMatch(term='Abay Chomen', score=1)],
         'Degem': [StringMatch(term='Degem', score=1)],
         'Lomme': [StringMatch(term='Loma', score=0.6)],
         'Limu': [ExactMatch(term='Darimu', score=1.0)],
         'Mesela': [StringMatch(term='Mesela', score=1)],
         'Abuna Gindeberet': [StringMatch(term='Abuna Ginde Beret', score=0.9411764705882353)],
         'Meiso': [StringMatch(term='Miesso', score=0.6666666666666667)],
         'Sedan Chanka': [StringMatch(term='Sedi Chenka', score=0.75)],
         'Tibe Kutaye': [StringMatch(term='Toke Kutaye', score=0.8181818181818181)],
         'Bale Gesgara': [StringMatch(term='Bele Gesgar', score=0.8333333333333334)],
         'Gera': [StringMatch(term='Gera', score=1)],
         'Adami Tulu Jido Kombolcha': [StringMatch(term='Adama Tulu Jido Kombolcha', score=0.96)],
         'Nono': [StringMatch(term='Nono', score=1)],
         'Ludehetosa': [StringMatch(term='Lude Hitosa', score=0.8181818181818181)],
         'Legehida': [StringMatch(term='Legehida', score=1)],
         'Holeta': [StringMatch(term='Holeta Town', score=0.5454545454545454)],
         'Gumi Eldallo': [StringMatch(term='Gumi Idalo', score=0.75)],
         'Yabelo': [StringMatch(term='Yabelo', score=1)],
         'Guliso': [StringMatch(term='Guliso', score=1)],
         'Bako Tibe': [StringMatch(term='Bako Tibe', score=1)],
         'B/Tolyi': [StringMatch(term='Tole', score=0.4285714285714286)],
         'Dinsho': [StringMatch(term='Dinsho', score=1)],
         'Wachile': [StringMatch(term='Wachile', score=1)],
         'Bishan Guracha Town': [StringMatch(term='Bishan Guracha', score=0.736842105263158)],
         'Homa': [StringMatch(term='Homa', score=1)],
         'Bedele Town': [StringMatch(term='Bedele Town', score=1)],
         'Hitosa': [StringMatch(term='Hitosa', score=1)],
         'Sire': [StringMatch(term='Sire', score=1)],
         'Gimbichu': [StringMatch(term='Gimbichu', score=1)],
         'Deksis': [StringMatch(term='Diksis', score=0.8333333333333334)],
         'Sandefa': [StringMatch(term='Sankura', score=0.5714285714285714)],
         'Goro Gutu': [StringMatch(term='Goro Gutu', score=1)],
         'Shashemene Rural': [StringMatch(term='Shashemene Zuria', score=0.8125)],
         'Akaki': [StringMatch(term='Akaki', score=1)],
         'Adama': [StringMatch(term='Adama', score=1)],
         'Chomen Guduru': [StringMatch(term='Choman Guduru', score=0.9230769230769231)],
         'Woliso Rural': [StringMatch(term='Woliso Town', score=0.5833333333333333)],
         'Chole': [StringMatch(term='Chole', score=1)],
         'Jimma Arjo': [StringMatch(term='Jimma Arjo', score=1)],
         'Kiremu': [StringMatch(term='Kiremu', score=1)],
         'Gursum': [ExactMatch(term='Gursum (Or)', score=1.0)],
         'Tena': [StringMatch(term='Tena', score=1)],
         'Tiyo': [StringMatch(term='Tiyo', score=1)],
         'Debre Libanos': [StringMatch(term='Debre Libanos', score=1)],
         'Omonada': [StringMatch(term='Omo Nada', score=0.875)],
         'Hanbala': [ExactMatch(term='Hawela', score=1.0)],
         'Dano': [StringMatch(term='Dano', score=1)],
         'Meyu Muleke': [StringMatch(term='Meyu Muleke', score=1)],
         'Chiro Town': [StringMatch(term='Chiro Town', score=1)],
         'Jarte Jardga': [StringMatch(term='Jarte Jardega', score=0.9230769230769231)],
         'Olanciti': [StringMatch(term='Olanciti Town', score=0.6153846153846154)],
         'Burayu Town': [ExactMatch(term='Bure Town', score=1.0)],
         'Were Jarso': [StringMatch(term='Wara Jarso', score=0.8)],
         'Bure': [ExactMatch(term='Bura', score=1.0)],
         'Meta Waliqite': [StringMatch(term='Meta Walkite', score=0.8461538461538461)],
         'Gudru': [StringMatch(term='Guduru', score=0.8333333333333334)],
         'Wuchale': [StringMatch(term='Wuchale', score=1)],
         'Dedesa': [StringMatch(term='Dedesa', score=1)],
         'Delo Mena': [ExactMatch(term='Doyogena', score=1.0)],
         'Horo': [StringMatch(term='Horo', score=1)],
         'Kurfa Chele': [StringMatch(term='Kurfa Chele', score=1)],
         'Gololcha': [StringMatch(term='Golocha', score=0.875)],
         'Dawe Qachen': [StringMatch(term='Dawe Ketchen', score=0.75)],
         'Kokosa': [StringMatch(term='Kokosa', score=1)],
         'Abaya': [StringMatch(term='Abaya', score=1)],
         'Nejo Rural': [StringMatch(term='Nejo Town', score=0.5)],
         'Nunu Qumba': [StringMatch(term='Nunu Kumba', score=0.9)],
         'Tulo': [StringMatch(term='Tullo', score=0.8)],
         'Suro Barguda': [StringMatch(term='Suro Berguda', score=0.9166666666666666)],
         'Agaro': [StringMatch(term='Amaro', score=0.8)],
         'Boneya Bushe': [StringMatch(term='Boneya Boshe', score=0.9166666666666666)],
         'Midega Tole': [StringMatch(term='Midhaga Tola', score=0.75)],
         'Robe Town': [StringMatch(term='Robe Town', score=1)],
         'Seyo': [StringMatch(term='Sayo', score=0.75)],
         'Halu': [ExactMatch(term='Kalu', score=1.0)],
         'St.Luke': [StringMatch(term='Dubluk', score=0.4285714285714286)],
         'Dedo': [StringMatch(term='Dedo', score=1)],
         'Adama Town': [StringMatch(term='Adama Town', score=1)],
         'Nejo': [StringMatch(term='Nejo', score=1)],
         'Hambela Wamena': [StringMatch(term='Hambela Wamena', score=1)],
         'Assela Town': [StringMatch(term='Asela Town', score=0.9090909090909091)],
         'Dambi Dolo': [ExactMatch(term='Damboya', score=1.0)],
         'Goba Town': [StringMatch(term='Goba Town', score=1)],
         'Guto Gida': [StringMatch(term='Guto Gida', score=1)],
         'Guba Qoricha': [StringMatch(term='Goba Koricha', score=0.8333333333333334)],
         'Dima': [ExactMatch(term='Diga', score=1.0)],
         'West Harerge\tGumbi Bordede': [StringMatch(term='Gumbi Bordede', score=0.5)],
         'Dawo': [StringMatch(term='Dawo', score=1)],
         'Moyale': [ExactMatch(term='Megale', score=1.0)],
         'Seden Sodo': [StringMatch(term='Seden Sodo', score=1)],
         'Chiro': [StringMatch(term='Chire', score=0.8)],
         'Ilu': [StringMatch(term='Ilu', score=1)],
         'Nono Sele': [StringMatch(term='Nono Benja', score=0.6)],
         'Melka Belo': [StringMatch(term='Melka Balo', score=0.9)],
         'Omo Nada': [StringMatch(term='Omo Nada', score=1)],
         'Mancho': [StringMatch(term='Mancho', score=1)],
         'Laloqile': [StringMatch(term='Lalo Kile', score=0.7777777777777778)],
         'Goro Muti': [StringMatch(term='Goro Muti', score=1)],
         'Aga Wayyu': [StringMatch(term='Aga Wayu', score=0.8888888888888888)],
         'Guma': [StringMatch(term='Gumay', score=0.8)],
         'Kofele': [StringMatch(term='Kofele', score=1)],
         'Modjo Town': [StringMatch(term='Mojo Town', score=0.9)],
         'Yabelo Rural': [StringMatch(term='Yabelo Town', score=0.5833333333333333)],
         'Gemechis': [StringMatch(term='Gemechis', score=1)],
         'Dhas': [StringMatch(term='Dhas', score=1)],
         'Bedesa Town': [StringMatch(term='Bedele Town', score=0.8181818181818181)],
         'Dawe Serar': [StringMatch(term='Dale Wabera', score=0.5454545454545454)],
         'Yaya Gulele': [StringMatch(term='Yaya Gulele', score=1)],
         'Boset': [StringMatch(term='Boset', score=1)],
         'Bako': [StringMatch(term='Babo', score=0.75)],
         'Bishoftu Town': [StringMatch(term='Bishoftu Town', score=1)],
         'Chinakesen': [StringMatch(term='Chinaksen', score=0.9)],
         'Dodola Rural': [StringMatch(term='Dodola Town', score=0.5833333333333333)],
         'Lata Sibu': [StringMatch(term='Leta Sibu', score=0.8888888888888888)],
         'Shenan Kolu': [StringMatch(term='Shanan Kolu', score=0.9090909090909091)],
         'Jeldu': [StringMatch(term='Jeldu', score=1)],
         'Gura Dhamole': [StringMatch(term='Gura Damole', score=0.9166666666666666)],
         'Kercha': [StringMatch(term='Kercha', score=1)],
         'Anfilo': [StringMatch(term='Anfilo', score=1)],
         'Oda Bultum': [StringMatch(term='Kuni /Oda Bultum', score=0.625)],
         'Sekoru': [StringMatch(term='Sekoru', score=1)],
         'Lege Dadi Lege Tafo Town': [StringMatch(term='Lege Tafo-Lege Dadi Town', score=0.7083333333333333)],
         'Gimbi Rural': [ExactMatch(term='Gimbi Town', score=1.0)],
         'Guna': [StringMatch(term='Guna', score=1)],
         'Nensebo': [StringMatch(term='Nenesebo', score=0.875)],
         'Sululta Town': [StringMatch(term='Sululta Town', score=1)],
         'Abichugna': [StringMatch(term="Abichugna Gne'A", score=0.6)],
         'Loke Hada': [StringMatch(term='Lege Hida', score=0.6666666666666667)],
         'Mojo': [StringMatch(term='Nejo', score=0.5)],
         'Girja': [StringMatch(term='Girawa', score=0.6666666666666667)],
         'Gelan Town': [ExactMatch(term='Gedeb Town', score=1.0)],
         'Tole': [StringMatch(term='Tole', score=1)],
         'Robe': [StringMatch(term='Robe', score=1)],
         'Guchi': [StringMatch(term='Guchi', score=1)],
         'Jibat': [StringMatch(term='Jibat', score=1)],
         'Metarobi': [StringMatch(term='Meta Robi', score=0.8888888888888888)],
         'Odo Shakiso': [StringMatch(term='Odo Shakiso', score=1)],
         'O/Beyam': [StringMatch(term='Omo Beyam', score=0.6666666666666667)],
         'Shala': [StringMatch(term='Shala', score=1)],
         'Adaba': [StringMatch(term='Adaba', score=1)],
         'Seweyna': [StringMatch(term='Seweyna', score=1)],
         'Chelia': [StringMatch(term='Cheliya', score=0.8571428571428572)],
         'Gudeyabila': [StringMatch(term='Gudeya Bila', score=0.9090909090909091)],
         'Negele Town': [StringMatch(term='Negele Town', score=1)],
         'Metu Rural': [StringMatch(term='Metu Zuria', score=0.7)],
         'Sinana': [StringMatch(term='Sinana', score=1)],
         'Gida Ayana': [StringMatch(term='Gida Ayana', score=1)],
         'Bedele Zuriya': [StringMatch(term='Badele Zuria', score=0.8461538461538461)],
         'Ambo': [StringMatch(term='Afambo', score=0.6666666666666667)],
         'Gidami': [StringMatch(term='Gidami', score=1)],
         'Arsi Negele Town': [StringMatch(term='Arsi Negele Town', score=1)],
         'Mendi': [StringMatch(term='Dendi', score=0.8)],
         'Arero': [StringMatch(term='Arero', score=1)],
         'Dilo': [StringMatch(term='Dilo', score=1)],
         'Kake': [ExactMatch(term='Kache', score=1.0)],
         'Enkelo Wabe': [StringMatch(term='Inkolo Wabe', score=0.8181818181818181)],
         'Adola Town': [StringMatch(term='Adola Town', score=1)],
         'Ayira': [StringMatch(term='Ayira', score=1)],
         'Dale Sedi': [StringMatch(term='Dale Sadi', score=0.8888888888888888)],
         'Chiro Zuriya': [StringMatch(term='Chiro Zuria', score=0.9166666666666666)],
         'Babile': [StringMatch(term='Berahile', score=0.625)],
         'Kersa Eh': [ExactMatch(term='Mersa Town', score=1.0)],
         'Gomole': [StringMatch(term='Gomole', score=1)],
         'Sude': [StringMatch(term='Sude', score=1)],
         'Meyo': [ExactMatch(term='Meko', score=1.0)],
         'Mendi Town': [StringMatch(term='Mendi Town', score=1)],
         'Meko': [StringMatch(term='Meko', score=1)],
         'Leqa Dulecha': [StringMatch(term='Leka Dulecha', score=0.9166666666666666)],
         'Amuru': [StringMatch(term='Amuru', score=1)],
         'Boke': [StringMatch(term='Boke', score=1)],
         'Nole Kaba': [StringMatch(term='Nole Kaba', score=1)],
         'Yemalogi Wolel': [StringMatch(term='Yama Logi Welel', score=0.8)],
         'Sigmo': [StringMatch(term='Sigmo', score=1)],
         'Guder': [StringMatch(term='Gumer', score=0.8)],
         'Didu': [StringMatch(term='Didu', score=1)],
         'Burka Dimtu': [StringMatch(term='Burqua Dhintu', score=0.6923076923076923)],
         'Wondo': [StringMatch(term='Wondo', score=1)],
         'Bantu': [StringMatch(term='Ibantu', score=0.8333333333333334)],
         'Jido': [StringMatch(term='Jida', score=0.75)],
         'Yayu': [StringMatch(term='Yayu', score=1)],
         'Mkelka Soda': [StringMatch(term='Melka Soda', score=0.9090909090909091)],
         'Ginir': [StringMatch(term='Ginir', score=1)],
         'Adea': [ExactMatch(term='Adet', score=1.0)],
         'Babo Gembel': [StringMatch(term='Chabe Gambeltu', score=0.5714285714285714)],
         'Bedele': [ExactMatch(term='Bedeno', score=1.0)],
         'Kersa': [StringMatch(term='Kercha', score=0.6666666666666667)],
         'Dhidesa': [StringMatch(term='Dedesa', score=0.7142857142857143)],
         'Habro': [StringMatch(term='Habro', score=1)],
         'Ginir Town': [StringMatch(term='Ginir Town', score=1)],
         'Holeta Town': [StringMatch(term='Holeta Town', score=1)],
         'Nekemte Town': [StringMatch(term='Nekemte Town', score=1)],
         'Adola Reda': [StringMatch(term='Adola Town', score=0.6)],
         'Sebeta Town': [StringMatch(term='Sebeta Town', score=1)],
         'Liben': [StringMatch(term='Liben', score=1)],
         'Arsi Negele Rural': [StringMatch(term='Arsi Negele Town', score=0.7058823529411764)],
         'Anchar': [StringMatch(term='Anchar', score=1)],
         'Kersana Malima': [StringMatch(term='Kersana Malima', score=1)],
         'Fentale': [StringMatch(term='Fentale', score=1)],
         'Berbere': [StringMatch(term='Berbere', score=1)],
         'Mene Sibu': [StringMatch(term='Mana Sibu', score=0.7777777777777778)],
         'Siraro': [StringMatch(term='Siraro', score=1)],
         'Genji': [ExactMatch(term='Gena', score=1.0)],
         'Inchini': [StringMatch(term='Tikur Enchini', score=0.5384615384615384)],
         'Seyo Nole': [StringMatch(term='Sayo Nole', score=0.8888888888888888)],
         'Gole Oda': [StringMatch(term='Golo Oda', score=0.875)],
         'Qiltu Kara': [StringMatch(term='Kiltu Kara', score=0.9)],
         'Goba': [ExactMatch(term='Doba', score=1.0)],
         'Wonchi': [StringMatch(term='Wenchi', score=0.8333333333333334)],
         'Uraga': [StringMatch(term='Uraga', score=1)],
         'Boricha': [StringMatch(term='Boricha', score=1)],
         'Kimbibit': [StringMatch(term='Kimbibit', score=1)],
         'Gechi': [StringMatch(term='Gechi', score=1)],
         'Gindeberet': [StringMatch(term='Ginde Beret', score=0.9090909090909091)],
         'Woliso Town': [StringMatch(term='Woliso Town', score=1)],
         'Boji Dermeji': [StringMatch(term='Boji Dirmeji', score=0.9166666666666666)],
         'Gobu Seyo': [StringMatch(term='Gobu Seyo', score=1)],
         'Dodola Town': [StringMatch(term='Dodola Town', score=1)],
         'Bekoji Town': [StringMatch(term='Bekoji Town', score=1)],
         'Goro Dola': [StringMatch(term='Gora Dola', score=0.8888888888888888)],
         'Fichetown': [StringMatch(term='Fiche Town', score=0.9)],
         'Sibu Sire': [StringMatch(term='Sibu Sire', score=1)],
         'Becho': [ExactMatch(term='Bero', score=1.0)],
         'Shashamane Town': [StringMatch(term='Shashemene Town', score=0.8666666666666667)],
         'Dukem Town': [StringMatch(term='Durame Town', score=0.7272727272727273)],
         'Dambi Dollo': [StringMatch(term='Denbi Dollo Town', score=0.5625)],
         'Arena Buluq': [StringMatch(term='Harena Buluk', score=0.8333333333333334)],
         'Anna Sora': [StringMatch(term='Ana Sora', score=0.8888888888888888)],
         'Jimma Spe Town': [StringMatch(term='Jimma Town', score=0.7142857142857143)],
         'Zeway Dugda': [StringMatch(term='Ziway Dugda', score=0.9090909090909091)],
         'Gimbi Adventist': [StringMatch(term='Gimbi Town', score=0.4666666666666667)],
         'Bisidimo': [ExactMatch(term='Bilidigilu', score=1.0)],
         'Kuyu': [StringMatch(term='Kuyu', score=1)],
         'Digeluna Tijo': [StringMatch(term='Degeluna Tijo', score=0.9230769230769231)],
         'Munesa': [StringMatch(term='Munessa', score=0.8571428571428572)],
         'Ebantu': [StringMatch(term='Ibantu', score=0.8333333333333334)],
         'Shambu': [StringMatch(term='Shambu Town', score=0.5454545454545454)],
         'Gelana': [StringMatch(term='Delanta', score=0.7142857142857143)],
         'Goro': [ExactMatch(term='Horo', score=1.0)],
         'Fiche': [StringMatch(term='Kache', score=0.6)],
         'Boji Cheqorsa': [StringMatch(term='Boji Chekorsa', score=0.9230769230769231)],
         'Batu': [StringMatch(term='Bati', score=0.75)],
         'Tikur Enchini': [StringMatch(term='Tikur Enchini', score=1)],
         'Gimbi': [StringMatch(term='Gimbi', score=1)],
         'Diga': [StringMatch(term='Diga', score=1)],
         'Hababo Guduru': [StringMatch(term='Choman Guduru', score=0.6153846153846154)],
         'Ilu Galan': [StringMatch(term='Illu Galan', score=0.9)],
         'Nono Benja': [StringMatch(term='Nono Benja', score=1)],
         'Dugda Dawa': [StringMatch(term='Dugda Dawa', score=1)],
         'Gasera': [StringMatch(term='Gasera', score=1)],
         'Bule Hora Toun': [StringMatch(term='Bule Hora Town', score=0.9285714285714286)],
         'Dera': [ExactMatch(term='Gera', score=1.0)],
         'Kumbi': [StringMatch(term='Kumbi', score=1)],
         'Gelemso': [ExactMatch(term='Telemt', score=1.0)],
         'Biebirsa Kojowa': [StringMatch(term='Birbirsa Kojowa', score=0.9333333333333333)],
         'Dodola': [StringMatch(term='Dodola', score=1)],
         'Mana': [StringMatch(term='Zana', score=0.75)],
         'Teltele': [StringMatch(term='Teltale', score=0.8571428571428572)],
         'Hidabu Abote': [StringMatch(term='Hidabu Abote', score=1)],
         'Chora Boter': [StringMatch(term='Chora (Buno Bedele)', score=0.4736842105263158)],
         'Garemuleta': [StringMatch(term='Geraleta', score=0.6)],
         'Doreni': [StringMatch(term='Dorani', score=0.8333333333333334)],
         'Nejo Town': [StringMatch(term='Nejo Town', score=1)],
         'Wadara': [StringMatch(term='Wadera', score=0.8333333333333334)],
         'Yubdo': [StringMatch(term='Yubdo', score=1)],
         'Agarfa': [StringMatch(term='Agarfa', score=1)],
         'Abe Dengoro': [StringMatch(term='Abe Dongoro', score=0.9090909090909091)],
         'Gawo Qebe': [StringMatch(term='Gawo Kebe', score=0.8888888888888888)],
         'Matahara': [StringMatch(term='May Kadra', score=0.5555555555555556)],
         'Fedis': [StringMatch(term='Fedis', score=1)],
         'Lalo Asabi': [StringMatch(term='Lalo Asabi', score=1)],
         'El Way': [StringMatch(term='Elwaya', score=0.6666666666666667)],
         'Dendi': [StringMatch(term='Dendi', score=1)],
         'Haru': [StringMatch(term='Haru', score=1)],
         'Haro Limu': [StringMatch(term='Haro Limu', score=1)],
         'Wayu Tuqa': [StringMatch(term='Wayu Tuka', score=0.8888888888888888)],
         'Chora': [StringMatch(term='Chifra', score=0.6666666666666667)],
         'Babile Woreda': [StringMatch(term='Babile (Or)', score=0.6923076923076923)],
         'Hawa Gelan': [StringMatch(term='Hawa Galan', score=0.9)],
         'Tiro Afeta': [StringMatch(term='Tiro Afeta', score=1)],
         'Jarso': [StringMatch(term='Ararso', score=0.6666666666666667)],
         'Girawa': [StringMatch(term='Girawa', score=1)],
         'Hurumu': [StringMatch(term='Hurumu', score=1)],
         'Dire': [StringMatch(term='Dire', score=1)],
         'Aweday Town': [StringMatch(term='Aweday Town', score=1)],
         'Seka Chekorsa': [StringMatch(term='Seka Chekorsa', score=1)],
         'Sasiga': [StringMatch(term='Sasiga', score=1)],
         'Limuna Bilbilo': [StringMatch(term='Limu Bilbilo', score=0.8571428571428572)],
         'Saba Boru': [StringMatch(term='Saba Boru', score=1)],
         'Jima Rare': [StringMatch(term='Jimma Rare', score=0.9)],
         'Deder': [StringMatch(term='Deder', score=1)],
         'Meda Welabu': [StringMatch(term='Meda Welabu', score=1)],
         'Meta': [StringMatch(term='Meta', score=1)],
         'Darimu': [StringMatch(term='Darimu', score=1)],
         'Metu Town': [StringMatch(term='Metu Town', score=1)],
         'Wama Hagelo': [StringMatch(term='Wama Hagalo', score=0.9090909090909091)],
         'Hawi Gudina': [StringMatch(term='Hawi Gudina\n', score=0.9166666666666666)],
         'Ginde Beret': [StringMatch(term='Ginde Beret', score=1)],
         'Gedo': [StringMatch(term='Dedo', score=0.75)],
         'Limu Seka': [StringMatch(term='Limu Seka', score=1)],
         'Liban Jawi': [StringMatch(term='Liban Jawi', score=1)],
         'Doba': [StringMatch(term='Doba', score=1)],
         'Negele': [ExactMatch(term='Egela', score=1.0)],
         'Girar Jarso': [StringMatch(term='Gerar Jarso', score=0.9090909090909091)]})

No warnings this time!
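
To see why the warnings disappear, it can help to picture the mapping as a dictionary of candidate lists. The following is a minimal plain-Python sketch of that idea, not the openclean `Mapping` class itself: the helper names `resolve` and `ambiguous_terms` are hypothetical, and the tuples stand in for the `StringMatch` and `ExactMatch` objects shown in the output above.

```python
# Candidate matches as (term, score) pairs, taken from the output above.
candidates = {
    'Mullo': [('Mulo', 0.8), ('Tullo', 0.8)],    # tie: two equally good matches
    'Meyo': [('Meko', 0.75), ('Miyo', 0.75)],    # tie: two equally good matches
    'Dugda': [('Dugda', 1.0)],                   # unambiguous
}

# Manual decisions for the ambiguous terms (a subset of multi_match_fixes).
manual_fixes = {'Mullo': 'Mulo', 'Meyo': 'Meko'}


def resolve(candidates, manual_fixes):
    """Return a copy in which each manually fixed term has exactly one match."""
    resolved = {}
    for term, matches in candidates.items():
        if term in manual_fixes:
            # A manual choice replaces the candidate list with one definitive match.
            resolved[term] = [(manual_fixes[term], 1.0)]
        else:
            resolved[term] = list(matches)
    return resolved


def ambiguous_terms(mapping):
    """List the terms that still have more than one candidate match."""
    return sorted(term for term, matches in mapping.items() if len(matches) > 1)


before = ambiguous_terms(candidates)
after = ambiguous_terms(resolve(candidates, manual_fixes))
print(before)  # ['Meyo', 'Mullo']
print(after)   # []
```

Once every term has a single match, there is nothing left to warn about, which is the state the updated `woreda_map` is in.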

We can now update the dataset using this mapping and standardize the different variations of Woreda names into official values.

[21]:
from openclean.function.eval.domain import Lookup
from openclean.operator.transform.update import update
from openclean.function.eval.base import Col

# update the vacc dataset's WoredaName column using the Lookup eval function
vacc = update(
    vacc,
    'WoredaName',
    Lookup(columns=['WoredaName'], mapping=woreda_map.to_lookup(), default=Col('WoredaName'))
)
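
Conceptually, this update replaces each value that appears in the mapping and leaves every other value unchanged. The sketch below illustrates that behaviour with plain Python only. It assumes that `to_lookup()` yields a flat dictionary from original value to matched term and that the `default` argument supplies the original column value when no entry exists; `apply_lookup` is a hypothetical helper, not part of openclean.

```python
def apply_lookup(values, lookup):
    """Replace each value by its mapped term; keep unmapped values unchanged."""
    return [lookup.get(value, value) for value in values]


# A few entries taken from the mapping output above.
lookup = {'Walmera': 'Welmera', 'Gomma': 'Goma', 'Dugda': 'Dugda'}

column = ['Walmera', 'Gomma', 'Dugda', 'Not In Mapping']
cleaned = apply_lookup(column, lookup)
print(cleaned)  # ['Welmera', 'Goma', 'Dugda', 'Not In Mapping']
```

Falling back to the original value means that rows the mapping does not cover are never blanked out. They simply remain visible as errors in the check that follows.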

We now check whether any errors are left in the data.

[22]:
# All misspelled WoredaNames should be replaced with the closest values in our master vocabulary

errors = set(vacc['WoredaName']) - set(admin_boundaries['Woreda'])
print('there are {} errors'.format(len(errors)))
print(errors)
there are 0 errors
set()

Next, we apply the mapping of common spelling mistakes that we built from the vaccine dataset to a second dataset on crop production in the Oromia region.

[23]:
# streaming and slicing a different dataset about farming in the Oromia region

crops = stream(os.path.join('data', 'ethiopia-crop-production-2010-11.csv'))\
    .where(Eq(Col('Region'), 'Oromia'))\
    .typecast(DefaultConverter()) \
    .to_df()

crops.sample(5)
[23]:
|     | Crop   | Cluster                                        | Region | DominantZone       | Woreda      | Land cultivated (Ha) | Production (Qt) | Productivity (Qt/ha) |
|-----|--------|------------------------------------------------|--------|--------------------|-------------|----------------------|-----------------|----------------------|
| 109 | Tef    | Oromia Tef Cluster                             | Oromia | South West Shewa   | Tole        | 4516.179612          | 17922.214527    | 3.968446             |
| 126 | Mango  | East Shewa - Arsi Horticulture Cluster         | Oromia | Arsi               | Zeway Dugda | 0.12035              | 68.665683       | 570.548708           |
| 50  | Maize  | East Wellega - Horo - West Shoa Maize Cluster  | Oromia | Horo Gudru Wellega | Jima Rare   | 1518.233234          | 38781.068885    | 25.543552            |
| 136 | Sesame | Oromia Sesame Cluster                          | Oromia | East Wellega       | Diga        | 408.783916           | 2875.041375     | 7.033157             |
| 70  | Maize  | Jimma - Buno Bedele Maize Cluster              | Oromia | Jimma              | Limu Kosa   | 426.384802           | 94259.82324     | 221.067503           |
[24]:
# let's see how many misspelled values exist in the crops dataframe

errors = set(crops.Woreda)  - set(admin_boundaries['Woreda'])
print('there are {} errors'.format(len(errors)))
there are 39 errors
[25]:
# replacing the similarly misspelled ones with their closest spellings from the official list

crops = update(crops, 'Woreda', Lookup(columns=['Woreda'], mapping=woreda_map.to_lookup(), default=Col('Woreda')))
[26]:
# were we able to fix everything?

errors = set(crops.Woreda)  - set(admin_boundaries['Woreda'])
print('there are {} errors'.format(len(errors)))
there are 0 errors

No more errors! A well-prepared mapping that captures the typical mistakes in a dataset is a powerful asset: it can be stored, shared, and reused by data practitioners across different use cases.
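
One simple way to store and reuse such a mapping is to serialize it as JSON. The sketch below assumes the mapping has already been flattened into a plain dictionary; the file name `woreda_lookup.json` and the temporary directory are illustrative choices, not something the notebook prescribes.

```python
import json
import os
import tempfile

# A few entries from the mapping built above (original value -> official name).
lookup = {
    'Walmera': 'Welmera',
    'Zeway Dugda': 'Ziway Dugda',
    'Jima Rare': 'Jimma Rare',
}

# Store the lookup so that it can be shared with other projects.
path = os.path.join(tempfile.mkdtemp(), 'woreda_lookup.json')  # hypothetical location
with open(path, 'w', encoding='utf-8') as f:
    json.dump(lookup, f, ensure_ascii=False, indent=2, sort_keys=True)

# Later, or in another project: load the lookup and reuse it on new data.
with open(path, encoding='utf-8') as f:
    restored = json.load(f)

crop_woredas = ['Zeway Dugda', 'Jima Rare', 'Tole']
standardized = [restored.get(name, name) for name in crop_woredas]
print(standardized)  # ['Ziway Dugda', 'Jimma Rare', 'Tole']
```

A text format such as JSON keeps the mapping readable and easy to review, which matters when manual decisions like the multi-match fixes are part of it.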


We welcome feedback and contributions! Please visit us at the following links for more great examples:

    https://openclean.readthedocs.io/

    https://github.com/VIDA-NYU/openclean-core

    https://vida.engineering.nyu.edu/research/data-curation/