Record Linkages and Fuzzy Matching with Fuzzywuzzy and Recordlinkage

Record linkage is a powerful technique used to merge multiple datasets together, used when values have typos or different spellings. In this chapter, you’ll learn how to link records by calculating the similarity between strings—you’ll then use your new skills to join two restaurant review datasets into one clean master dataset.

The cutoff point¶

In this exercise, and throughout this chapter, you’ll be working with the restaurants DataFrame which has data on various restaurants. Your ultimate goal is to create a restaurant recommendation engine, but you need to first clean your data.

This version of restaurants has been collected from many sources, where the cuisine_type column is riddled with typos, and should contain only italian, american and asian cuisine types. There are so many unique categories that remapping them manually isn’t scalable, and it’s best to use string similarity instead.

Before doing so, you want to establish the cutoff point for the similarity score using the fuzzywuzzy’s process.extract() function by finding the similarity score of the most distant typo of each category.

In [3]:

# !pip3 install --upgrade pip
# !pip3 install fuzzywuzzy
# !pip3 install python-Levenshtein
import pandas as pd
import datetime as dt
import missingno as msno
import matplotlib.pyplot as plt

filename = 'restaurants.csv'
restaurants = pd.read_csv(filename, index_col = 0)

from fuzzywuzzy import process

# Store the unique values of cuisine_type in unique_types
unique_types = restaurants['cuisine_type'].unique()

# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', unique_types, limit = unique_types.shape[0]))

# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', unique_types, limit = unique_types.shape[0]))

# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian', unique_types, limit = unique_types.shape[0]))

[('asian', 100), ('asiane', 91), ('asiann', 91), ('asiian', 91), ('asiaan', 91), ('asianne', 83), ('asiat', 80), ('italiann', 72), ('italiano', 72), ('italianne', 72), ('italian', 67), ('amurican', 62), ('american', 62), ('italiaan', 62), ('italiian', 62), ('itallian', 62), ('americann', 57), ('americano', 57), ('ameerican', 57), ('aamerican', 57), ('ameriican', 57), ('amerrican', 57), ('ammericann', 54), ('ameerrican', 54), ('ammereican', 54), ('america', 50), ('merican', 50), ('murican', 50), ('italien', 50), ('americen', 46), ('americin', 46), ('amerycan', 46), ('itali', 40)]
[('american', 100), ('americann', 94), ('americano', 94), ('ameerican', 94), ('aamerican', 94), ('ameriican', 94), ('amerrican', 94), ('america', 93), ('merican', 93), ('ammericann', 89), ('ameerrican', 89), ('ammereican', 89), ('amurican', 88), ('americen', 88), ('americin', 88), ('amerycan', 88), ('murican', 80), ('asian', 62), ('asiane', 57), ('asiann', 57), ('asiian', 57), ('asiaan', 57), ('italian', 53), ('asianne', 53), ('italiann', 50), ('italiano', 50), ('italiaan', 50), ('italiian', 50), ('itallian', 50), ('italianne', 47), ('asiat', 46), ('itali', 40), ('italien', 40)]
[('italian', 100), ('italiann', 93), ('italiano', 93), ('italiaan', 93), ('italiian', 93), ('itallian', 93), ('italianne', 88), ('italien', 86), ('itali', 83), ('asian', 67), ('asiane', 62), ('asiann', 62), ('asiian', 62), ('asiaan', 62), ('asianne', 57), ('amurican', 53), ('american', 53), ('americann', 50), ('asiat', 50), ('americano', 50), ('ameerican', 50), ('aamerican', 50), ('ameriican', 50), ('amerrican', 50), ('ammericann', 47), ('ameerrican', 47), ('ammereican', 47), ('america', 43), ('merican', 43), ('murican', 43), ('americen', 40), ('americin', 40), ('amerycan', 40)]

Remapping categories II¶

In the last exercise, you determined that the distance cutoff point for remapping typos of ‘american’, ‘asian’, and ‘italian’ cuisine types stored in the cuisine_type column should be 80.

In this exercise, you’re going to put it all together by finding matches with similarity scores equal to or higher than 80 by using fuzywuzzy.process’s extract() function, for each correct cuisine type, and replacing these matches with it. Remember, when comparing a string with an array of strings using process.extract(), the output is a list of tuples where each is formatted like:

(closest match, similarity score, index of match)

In [4]:

# Create a list of matches, comparing 'italian' with the cuisine_type column
# Syntax: dataframe.shape
# Return : Returns tuple of shape (Rows, columns) of dataframe/series
matches = process.extract('italian', restaurants['cuisine_type'], limit = restaurants.shape[0]) #[0] returns the number of rows in the df

# This is also an acceptable statement
# matches = process.extract('italian', restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))


# Inspect the first 5 matches
print(matches[0:5])

[('italian', 100, 11), ('italian', 100, 25), ('italian', 100, 41), ('italian', 100, 47), ('italian', 100, 49)]

In [5]:

# Iterate through the list of matches to italian
for match in matches:
  # Check whether the similarity score is greater than or equal to 80
  if match[1] >= 80:
    # Select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
    restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = 'italian'

In [6]:

# Create list Categories
categories = ['american', 'asian', 'italian']

# Iterate through categories
for cuisine in categories:  
  # Create a list of matches, comparing cuisine with the cuisine_type column
  matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))
  
  # Iterate through the list of matches
  for match in matches:
     # Check whether the similarity score is greater than or equal to 80
    if match[1] >= 80:
      # If it is, select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
        restaurants.loc[restaurants['cuisine_type'] == match[0]] = cuisine

      
# Inspect the final result
print(restaurants['cuisine_type'].unique())

['american' 'asian' 'italian']

To link or not to link?¶

Similar to joins, record linkage is the act of linking data from different sources regarding the same entity. But unlike joins, record linkage does not require exact matches between different pairs of data, and instead can find close matches using string similarity. This is why record linkage is effective when there are no common unique keys between the data sources you can rely upon when linking data sources such as a unique identifier.

In this exercise, you will classify each card whether it is a traditional join problem, or a record linkage one.

Pairs of restaurants¶

In the last lesson, you cleaned the restaurants dataset to make it ready for building a restaurants recommendation engine. You have a new DataFrame named restaurants_new with new restaurants to train your model on, that’s been scraped from a new data source.

You’ve already cleaned the cuisine_type and city columns using the techniques learned throughout the course. However you saw duplicates with typos in restaurants names that require record linkage instead of joins with restaurants.

In this exercise, you will perform the first step in record linkage and generate possible pairs of rows between restaurants and restaurants_new.

In [16]:

#!pip3 install recordlinkage
import recordlinkage

filename = 'restaurants_2.csv'
restaurants = pd.read_csv(filename, index_col = 0)

filename = 'restaurants_new.csv'
restaurants_new = pd.read_csv(filename, index_col = 0)

# Create an indexer and object and find possible pairs
indexer = recordlinkage.Index()

# Block pairing on cuisine_type
indexer.block('cuisine_type')

# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)

print(pairs)

MultiIndex([(  0,  0),
            (  0,  1),
            (  0,  7),
            (  0, 12),
            (  0, 13),
            (  0, 20),
            (  0, 27),
            (  0, 28),
            (  0, 39),
            (  0, 40),
            ...
            (284, 63),
            (284, 66),
            (287, 24),
            (287, 63),
            (287, 66),
            ( 40, 18),
            (281, 18),
            (288, 18),
            (302, 18),
            (308, 18)],
           length=3631)

Similar restaurants¶

In the last exercise, you generated pairs between restaurants and restaurants_new in an effort to cleanly merge both DataFrames using record linkage.

When performing record linkage, there are different types of matching you can perform between different columns of your DataFrames, including exact matches, string similarities, and more.

Now that your pairs have been generated and stored in pairs, you will find exact matches in the city and cuisine_type columns between each pair, and similar strings for each pair in the rest_name column.

In [23]:

# Create a comparison object
comp_cl = recordlinkage.Compare()

# Use the appropriate comp_cl method to find similar strings with a 0.8 similarity threshold in the rest_name column of both DataFrames
# https://recordlinkage.readthedocs.io/en/latest/ref-compare.html#recordlinkage-compare-object

# Create a comparison object
comp_cl = recordlinkage.Compare()

# Find exact matches on city, cuisine_types 
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('cuisine_type', 'cuisine_type', label = 'cuisine_type')

# Find similar matches of rest_name
comp_cl.string('rest_name', 'rest_name', label='name', threshold = 0.8) 

# Class to compare the attributes of candidate record pairs. The Compare class has methods like string, exact and numeric to initialise the comparing of the records. The compute method is used to start the actual comparing.
potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new)

print(potential_matches)

        city  cuisine_type  name
0   0      0             1   0.0
    1      0             1   0.0
    7      0             1   0.0
    12     0             1   0.0
    13     0             1   0.0
...      ...           ...   ...
40  18     0             1   0.0
281 18     0             1   0.0
288 18     0             1   0.0
302 18     0             1   0.0
308 18     0             1   0.0

[3631 rows x 3 columns]

In [26]:

print(restaurants.head())

                   rest_name                rest_addr         city  \
0  arnie morton's of chicago  435 s. la cienega blv .  los angeles   
1         art's delicatessen      12224 ventura blvd.  studio city   
2                  campanile      624 s. la brea ave.  los angeles   
3                      fenix   8358 sunset blvd. west    hollywood   
4         grill on the alley          9560 dayton way  los angeles   

        phone cuisine_type  
0  3102461501     american  
1  8187621221     american  
2  2139381447     american  
3  2138486677     american  
4  3102760615     american

In [27]:

print(restaurants_new.head())

  rest_name               rest_addr         city       phone  cuisine_type
0    kokomo       6333 w. third st.           la  2139330773      american
1    feenix  8358 sunset blvd. west    hollywood  2138486677      american
2   parkway    510 s. arroyo pkwy .     pasadena  8187951001   californian
3      r-23        923 e. third st.  los angeles  2136877178      japanese
4     gumbo       6333 w. third st.           la  2139330358  cajun/creole

In [31]:

# potential_matches[potential_matches.sum(axis = 1) >= n]
# Where n is the minimum number of columns you want matching to ensure a proper duplicate find

potential_matches[potential_matches.sum(axis = 1) >= 3]

Out[31]:

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
}

.dataframe thead th {
text-align: right;
}

		city	cuisine_type	name
0	40	1	1	1.0
1	28	1	1	1.0
2	74	1	1	1.0
3	1	1	1	1.0
4	53	1	1	1.0
8	43	1	1	1.0
9	50	1	1	1.0
13	7	1	1	1.0
14	67	1	1	1.0
17	12	1	1	1.0
20	20	1	1	1.0
21	27	1	1	1.0
5	65	1	1	1.0
7	79	1	1	1.0
12	26	1	1	1.0
18	71	1	1	1.0
6	73	1	1	1.0
10	75	1	1	1.0
11	21	1	1	1.0
16	57	1	1	1.0
19	47	1	1	1.0
15	55	1	1	1.0

Linking them together!¶

In the last lesson, you’ve finished the bulk of the work on your effort to link restaurants and restaurants_new. You’ve generated the different pairs of potentially matching rows, searched for exact matches between the cuisine_type and city columns, but compared for similar strings in the rest_name column. You stored the DataFrame containing the scores in potential_matches.

Now it’s finally time to link both DataFrames. You will do so by first extracting all row indices of restaurants_new that are matching across the columns mentioned above from potential_matches. Then you will subset restaurants_new on these indices, then append the non-duplicate values to restaurants.

In [37]:

# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)

print(matching_indices)

Int64Index([40, 28, 74,  1, 53, 43, 50,  7, 67, 12, 20, 27, 65, 79, 26, 71, 73,
            75, 21, 57, 47, 55],
           dtype='int64')

In [38]:

# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]

print(non_dup)

    rest_name                        rest_addr               city       phone  \
0      kokomo                6333 w. third st.                 la  2139330773   
2     parkway             510 s. arroyo pkwy .           pasadena  8187951001   
3        r-23                 923 e. third st.        los angeles  2136877178   
4       gumbo                6333 w. third st.                 la  2139330358   
5      pink's              709 n. la brea ave.                 la  2139314223   
6    original     875 s. figueroa st. downtown                 la  2136276879   
8    langer's              704 s. alvarado st.                 la  2134838050   
9          mo                7261 melrose ave.                 la  2139355280   
10     killer                 4000 colfax ave.        studio city  8185081570   
11       jody            2011 ocean front walk             venice  3103061995   
13      ocean            3117 ocean park blvd.       santa monica  3104525728   
14    bruno's              3838 centinela ave.          mar vista  3103975703   
15   brighton                9600 brighton way      beverly hills  3102767732   
16  diaghilev        1020 n. san vicente blvd.       w. hollywood  3108541111   
17    rubin's              15322 ventura blvd.             encino  8189056515   
18     ruth's               224 s. beverly dr.      beverly hills  3108598744   
19         la                    1842 e. first    st. boyle hts .  2132652887   
22     ruby's             45 s. fair oaks ave.           pasadena  8187967829   
23    mishima                8474 w. third st.                 la  2137820181   
24       baja                  3345 kimber dr.   westlake village  8054984049   
25    falafel                1059 broxton ave.           westwood  3102084444   
29      asahi              2027 sawtelle blvd.            west la  3104792231   
30    bristol           1570 rosecrans ave. s.           pasadena  3106435229   
31        pho                     642 broadway          chinatown  2136265530   
32      local        30869 thousand oaks blvd.   westlake village  8187067706   
33      joe's          1023 abbot kinney blvd.             venice  3103995811   
34     bistro                45 s. mentor ave.           pasadena  8187952478   
35      jan's               8424 beverly blvd.                 la  2136512866   
36  michael's                   1147 third st.       santa monica  3104510843   
37      ocean            145 n. atlantic blvd.      monterey park  8183082128   
38       cafe   9777 little santa monica blvd.      beverly hills  3108880108   
39       main              10509 w. pico blvd.        rancho park  3104757564   
41       indo         10428 1/2 national blvd.                 la  3108151290   
42       chez                         1716 pch      redondo beach  3105401222   
44        koo            8393 w. beverly blvd.                 la  2136559045   
45      rae's                  2901 pico blvd.       santa monica  3108287937   
46   benita's         1433 third st. promenade       santa monica  3104582889   
48     johnny            4663 long beach blvd.         long beach  3104237327   
49     johnny                7507 melrose ave.                 la  2136513361   
51       cafe                838 lincoln blvd.             venice  3103991955   
52  cassell's                3266 w. sixth st.                 la  2134808668   
54  maxwell's           13329 washington blvd.     marina del rey  3103067829   
56  hollywood               6145 franklin ave.          hollywood  2134677678   
58     duke's                8909 sunset blvd.       w. hollywood  3106523100   
59  johnnie's          4017 s. sepulveda blvd.        culver city  3103976654   
60  bernard's                 515 s. olive st.        los angeles  2136121580   
61      apple              10801 w. pico blvd.            west la  3104753585   
62  belvedere   9882 little santa monica blvd.      beverly hills  3107882306   
63    poquito               2635 w. olive ave.            burbank  8185632252   
64         la  10506 little santa monica blvd.       century city  3104704992   
66         la                        22800 pch             malibu  3104566299   
68  russell's                         1198 pch         seal beach  3105969556   
69     mani's              519 s. fairfax ave.                 la  2139388800   
70       john              10516 w. pico blvd.            west la  3102040692   
72    brent's             19565 parthenia ave.         northridge  8188865679   
76        don              1136 westwood blvd.           westwood  3102091422   
77      feast              1949 westwood blvd.            west la  3104750400   
78   mulberry              17040 ventura blvd.             encino  8189068881   
80    jiraffe            502 santa monica blvd       santa monica  3109176671   
81   martha's        22nd street grill 25 22nd  st. hermosa beach  3103767786   

        cuisine_type  
0           american  
2        californian  
3           japanese  
4       cajun/creole  
5           hot dogs  
6             diners  
8              delis  
9         hamburgers  
10           seafood  
11          hot dogs  
13          american  
14           italian  
15      coffee shops  
16           russian  
17          hot dogs  
18       steakhouses  
19   mexican/tex-mex  
22            diners  
23      noodle shops  
24           mexican  
25    middle eastern  
29      noodle shops  
30       californian  
31        vietnamese  
32       health food  
33  american ( new )  
34       californian  
35      coffee shops  
36       californian  
37           seafood  
38  pacific new wave  
39          american  
41        indonesian  
42          eclectic  
44           chicken  
45            diners  
46         fast food  
48     southern/soul  
49          american  
51          american  
52        hamburgers  
54          american  
56      coffee shops  
58      coffee shops  
59             delis  
60       continental  
61          american  
62  pacific new wave  
63           mexican  
64    french ( new )  
66           mexican  
68        hamburgers  
69          desserts  
70      coffee shops  
72             delis  
76           italian  
77           chinese  
78             pizza  
80       californian  
81          american

In [39]:

# Append non_dup to restaurants
full_restaurants = restaurants.append(non_dup)
print(full_restaurants)

                    rest_name                  rest_addr               city  \
0   arnie morton's of chicago    435 s. la cienega blv .        los angeles   
1          art's delicatessen        12224 ventura blvd.        studio city   
2                   campanile        624 s. la brea ave.        los angeles   
3                       fenix     8358 sunset blvd. west          hollywood   
4          grill on the alley            9560 dayton way        los angeles   
..                        ...                        ...                ...   
76                        don        1136 westwood blvd.           westwood   
77                      feast        1949 westwood blvd.            west la   
78                   mulberry        17040 ventura blvd.             encino   
80                    jiraffe      502 santa monica blvd       santa monica   
81                   martha's  22nd street grill 25 22nd  st. hermosa beach   

         phone cuisine_type  
0   3102461501     american  
1   8187621221     american  
2   2139381447     american  
3   2138486677     american  
4   3102760615     american  
..         ...          ...  
76  3102091422      italian  
77  3104750400      chinese  
78  8189068881        pizza  
80  3109176671  californian  
81  3103767786     american  

[396 rows x 5 columns]

		city	cuisine_type	name
0	40	1	1	1.0
1	28	1	1	1.0
2	74	1	1	1.0
3	1	1	1	1.0
4	53	1	1	1.0
8	43	1	1	1.0
9	50	1	1	1.0
13	7	1	1	1.0
14	67	1	1	1.0
17	12	1	1	1.0
20	20	1	1	1.0
21	27	1	1	1.0
5	65	1	1	1.0
7	79	1	1	1.0
12	26	1	1	1.0
18	71	1	1	1.0
6	73	1	1	1.0
10	75	1	1	1.0
11	21	1	1	1.0
16	57	1	1	1.0
19	47	1	1	1.0
15	55	1	1	1.0

		city	cuisine_type	name
0	40	1	1	1.0
1	28	1	1	1.0
2	74	1	1	1.0
3	1	1	1	1.0
4	53	1	1	1.0
8	43	1	1	1.0
9	50	1	1	1.0
13	7	1	1	1.0
14	67	1	1	1.0
17	12	1	1	1.0
20	20	1	1	1.0
21	27	1	1	1.0
5	65	1	1	1.0
7	79	1	1	1.0
12	26	1	1	1.0
18	71	1	1	1.0
6	73	1	1	1.0
10	75	1	1	1.0
11	21	1	1	1.0
16	57	1	1	1.0
19	47	1	1	1.0
15	55	1	1	1.0

		city	cuisine_type	name
0	40	1	1	1.0
1	28	1	1	1.0
2	74	1	1	1.0
3	1	1	1	1.0
4	53	1	1	1.0
8	43	1	1	1.0
9	50	1	1	1.0
13	7	1	1	1.0
14	67	1	1	1.0
17	12	1	1	1.0
20	20	1	1	1.0
21	27	1	1	1.0
5	65	1	1	1.0
7	79	1	1	1.0
12	26	1	1	1.0
18	71	1	1	1.0
6	73	1	1	1.0
10	75	1	1	1.0
11	21	1	1	1.0
16	57	1	1	1.0
19	47	1	1	1.0
15	55	1	1	1.0