Record linkage is a powerful technique used to merge multiple datasets together, used when values have typos or different spellings. In this chapter, you’ll learn how to link records by calculating the similarity between strings—you’ll then use your new skills to join two restaurant review datasets into one clean master dataset.
The cutoff point¶
In this exercise, and throughout this chapter, you’ll be working with the restaurants DataFrame which has data on various restaurants. Your ultimate goal is to create a restaurant recommendation engine, but you need to first clean your data.
This version of restaurants has been collected from many sources, where the cuisine_type column is riddled with typos, and should contain only italian, american and asian cuisine types. There are so many unique categories that remapping them manually isn’t scalable, and it’s best to use string similarity instead.
Before doing so, you want to establish the cutoff point for the similarity score using the fuzzywuzzy’s process.extract() function by finding the similarity score of the most distant typo of each category.
# !pip3 install --upgrade pip
# !pip3 install fuzzywuzzy
# !pip3 install python-Levenshtein
import pandas as pd
import datetime as dt
import missingno as msno
import matplotlib.pyplot as plt
filename = 'restaurants.csv'
restaurants = pd.read_csv(filename, index_col = 0)
from fuzzywuzzy import process
# Store the unique values of cuisine_type in unique_types
unique_types = restaurants['cuisine_type'].unique()
# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', unique_types, limit = unique_types.shape[0]))
# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', unique_types, limit = unique_types.shape[0]))
# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian', unique_types, limit = unique_types.shape[0]))
[('asian', 100), ('asiane', 91), ('asiann', 91), ('asiian', 91), ('asiaan', 91), ('asianne', 83), ('asiat', 80), ('italiann', 72), ('italiano', 72), ('italianne', 72), ('italian', 67), ('amurican', 62), ('american', 62), ('italiaan', 62), ('italiian', 62), ('itallian', 62), ('americann', 57), ('americano', 57), ('ameerican', 57), ('aamerican', 57), ('ameriican', 57), ('amerrican', 57), ('ammericann', 54), ('ameerrican', 54), ('ammereican', 54), ('america', 50), ('merican', 50), ('murican', 50), ('italien', 50), ('americen', 46), ('americin', 46), ('amerycan', 46), ('itali', 40)]
[('american', 100), ('americann', 94), ('americano', 94), ('ameerican', 94), ('aamerican', 94), ('ameriican', 94), ('amerrican', 94), ('america', 93), ('merican', 93), ('ammericann', 89), ('ameerrican', 89), ('ammereican', 89), ('amurican', 88), ('americen', 88), ('americin', 88), ('amerycan', 88), ('murican', 80), ('asian', 62), ('asiane', 57), ('asiann', 57), ('asiian', 57), ('asiaan', 57), ('italian', 53), ('asianne', 53), ('italiann', 50), ('italiano', 50), ('italiaan', 50), ('italiian', 50), ('itallian', 50), ('italianne', 47), ('asiat', 46), ('itali', 40), ('italien', 40)]
[('italian', 100), ('italiann', 93), ('italiano', 93), ('italiaan', 93), ('italiian', 93), ('itallian', 93), ('italianne', 88), ('italien', 86), ('itali', 83), ('asian', 67), ('asiane', 62), ('asiann', 62), ('asiian', 62), ('asiaan', 62), ('asianne', 57), ('amurican', 53), ('american', 53), ('americann', 50), ('asiat', 50), ('americano', 50), ('ameerican', 50), ('aamerican', 50), ('ameriican', 50), ('amerrican', 50), ('ammericann', 47), ('ameerrican', 47), ('ammereican', 47), ('america', 43), ('merican', 43), ('murican', 43), ('americen', 40), ('americin', 40), ('amerycan', 40)]
Remapping categories II¶
In the last exercise, you determined that the distance cutoff point for remapping typos of ‘american’, ‘asian’, and ‘italian’ cuisine types stored in the cuisine_type column should be 80.
In this exercise, you’re going to put it all together by finding matches with similarity scores equal to or higher than 80 by using fuzywuzzy.process’s extract() function, for each correct cuisine type, and replacing these matches with it. Remember, when comparing a string with an array of strings using process.extract(), the output is a list of tuples where each is formatted like:
(closest match, similarity score, index of match)
# Create a list of matches, comparing 'italian' with the cuisine_type column
# Syntax: dataframe.shape
# Return : Returns tuple of shape (Rows, columns) of dataframe/series
matches = process.extract('italian', restaurants['cuisine_type'], limit = restaurants.shape[0]) #[0] returns the number of rows in the df
# This is also an acceptable statement
# matches = process.extract('italian', restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))
# Inspect the first 5 matches
print(matches[0:5])
[('italian', 100, 11), ('italian', 100, 25), ('italian', 100, 41), ('italian', 100, 47), ('italian', 100, 49)]
# Iterate through the list of matches to italian
for match in matches:
# Check whether the similarity score is greater than or equal to 80
if match[1] >= 80:
# Select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = 'italian'
# Create list Categories
categories = ['american', 'asian', 'italian']
# Iterate through categories
for cuisine in categories:
# Create a list of matches, comparing cuisine with the cuisine_type column
matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))
# Iterate through the list of matches
for match in matches:
# Check whether the similarity score is greater than or equal to 80
if match[1] >= 80:
# If it is, select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
restaurants.loc[restaurants['cuisine_type'] == match[0]] = cuisine
# Inspect the final result
print(restaurants['cuisine_type'].unique())
['american' 'asian' 'italian']
To link or not to link?¶
Similar to joins, record linkage is the act of linking data from different sources regarding the same entity. But unlike joins, record linkage does not require exact matches between different pairs of data, and instead can find close matches using string similarity. This is why record linkage is effective when there are no common unique keys between the data sources you can rely upon when linking data sources such as a unique identifier.
In this exercise, you will classify each card whether it is a traditional join problem, or a record linkage one.
Pairs of restaurants¶
In the last lesson, you cleaned the restaurants dataset to make it ready for building a restaurants recommendation engine. You have a new DataFrame named restaurants_new with new restaurants to train your model on, that’s been scraped from a new data source.
You’ve already cleaned the cuisine_type and city columns using the techniques learned throughout the course. However you saw duplicates with typos in restaurants names that require record linkage instead of joins with restaurants.
In this exercise, you will perform the first step in record linkage and generate possible pairs of rows between restaurants and restaurants_new.
#!pip3 install recordlinkage
import recordlinkage
filename = 'restaurants_2.csv'
restaurants = pd.read_csv(filename, index_col = 0)
filename = 'restaurants_new.csv'
restaurants_new = pd.read_csv(filename, index_col = 0)
# Create an indexer and object and find possible pairs
indexer = recordlinkage.Index()
# Block pairing on cuisine_type
indexer.block('cuisine_type')
# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)
print(pairs)
MultiIndex([( 0, 0),
( 0, 1),
( 0, 7),
( 0, 12),
( 0, 13),
( 0, 20),
( 0, 27),
( 0, 28),
( 0, 39),
( 0, 40),
...
(284, 63),
(284, 66),
(287, 24),
(287, 63),
(287, 66),
( 40, 18),
(281, 18),
(288, 18),
(302, 18),
(308, 18)],
length=3631)
Similar restaurants¶
In the last exercise, you generated pairs between restaurants and restaurants_new in an effort to cleanly merge both DataFrames using record linkage.
When performing record linkage, there are different types of matching you can perform between different columns of your DataFrames, including exact matches, string similarities, and more.
Now that your pairs have been generated and stored in pairs, you will find exact matches in the city and cuisine_type columns between each pair, and similar strings for each pair in the rest_name column.
# Create a comparison object
comp_cl = recordlinkage.Compare()
# Use the appropriate comp_cl method to find similar strings with a 0.8 similarity threshold in the rest_name column of both DataFrames
# https://recordlinkage.readthedocs.io/en/latest/ref-compare.html#recordlinkage-compare-object
# Create a comparison object
comp_cl = recordlinkage.Compare()
# Find exact matches on city, cuisine_types
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('cuisine_type', 'cuisine_type', label = 'cuisine_type')
# Find similar matches of rest_name
comp_cl.string('rest_name', 'rest_name', label='name', threshold = 0.8)
# Class to compare the attributes of candidate record pairs. The Compare class has methods like string, exact and numeric to initialise the comparing of the records. The compute method is used to start the actual comparing.
potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new)
print(potential_matches)
city cuisine_type name
0 0 0 1 0.0
1 0 1 0.0
7 0 1 0.0
12 0 1 0.0
13 0 1 0.0
... ... ... ...
40 18 0 1 0.0
281 18 0 1 0.0
288 18 0 1 0.0
302 18 0 1 0.0
308 18 0 1 0.0
[3631 rows x 3 columns]
print(restaurants.head())
rest_name rest_addr city \
0 arnie morton's of chicago 435 s. la cienega blv . los angeles
1 art's delicatessen 12224 ventura blvd. studio city
2 campanile 624 s. la brea ave. los angeles
3 fenix 8358 sunset blvd. west hollywood
4 grill on the alley 9560 dayton way los angeles
phone cuisine_type
0 3102461501 american
1 8187621221 american
2 2139381447 american
3 2138486677 american
4 3102760615 american
print(restaurants_new.head())
rest_name rest_addr city phone cuisine_type 0 kokomo 6333 w. third st. la 2139330773 american 1 feenix 8358 sunset blvd. west hollywood 2138486677 american 2 parkway 510 s. arroyo pkwy . pasadena 8187951001 californian 3 r-23 923 e. third st. los angeles 2136877178 japanese 4 gumbo 6333 w. third st. la 2139330358 cajun/creole
# potential_matches[potential_matches.sum(axis = 1) >= n]
# Where n is the minimum number of columns you want matching to ensure a proper duplicate find
potential_matches[potential_matches.sum(axis = 1) >= 3]
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| city | cuisine_type | name | ||
|---|---|---|---|---|
| 0 | 40 | 1 | 1 | 1.0 |
| 1 | 28 | 1 | 1 | 1.0 |
| 2 | 74 | 1 | 1 | 1.0 |
| 3 | 1 | 1 | 1 | 1.0 |
| 4 | 53 | 1 | 1 | 1.0 |
| 8 | 43 | 1 | 1 | 1.0 |
| 9 | 50 | 1 | 1 | 1.0 |
| 13 | 7 | 1 | 1 | 1.0 |
| 14 | 67 | 1 | 1 | 1.0 |
| 17 | 12 | 1 | 1 | 1.0 |
| 20 | 20 | 1 | 1 | 1.0 |
| 21 | 27 | 1 | 1 | 1.0 |
| 5 | 65 | 1 | 1 | 1.0 |
| 7 | 79 | 1 | 1 | 1.0 |
| 12 | 26 | 1 | 1 | 1.0 |
| 18 | 71 | 1 | 1 | 1.0 |
| 6 | 73 | 1 | 1 | 1.0 |
| 10 | 75 | 1 | 1 | 1.0 |
| 11 | 21 | 1 | 1 | 1.0 |
| 16 | 57 | 1 | 1 | 1.0 |
| 19 | 47 | 1 | 1 | 1.0 |
| 15 | 55 | 1 | 1 | 1.0 |
Linking them together!¶
In the last lesson, you’ve finished the bulk of the work on your effort to link restaurants and restaurants_new. You’ve generated the different pairs of potentially matching rows, searched for exact matches between the cuisine_type and city columns, but compared for similar strings in the rest_name column. You stored the DataFrame containing the scores in potential_matches.
Now it’s finally time to link both DataFrames. You will do so by first extracting all row indices of restaurants_new that are matching across the columns mentioned above from potential_matches. Then you will subset restaurants_new on these indices, then append the non-duplicate values to restaurants.
# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]
# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)
print(matching_indices)
Int64Index([40, 28, 74, 1, 53, 43, 50, 7, 67, 12, 20, 27, 65, 79, 26, 71, 73,
75, 21, 57, 47, 55],
dtype='int64')
# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]
print(non_dup)
rest_name rest_addr city phone \
0 kokomo 6333 w. third st. la 2139330773
2 parkway 510 s. arroyo pkwy . pasadena 8187951001
3 r-23 923 e. third st. los angeles 2136877178
4 gumbo 6333 w. third st. la 2139330358
5 pink's 709 n. la brea ave. la 2139314223
6 original 875 s. figueroa st. downtown la 2136276879
8 langer's 704 s. alvarado st. la 2134838050
9 mo 7261 melrose ave. la 2139355280
10 killer 4000 colfax ave. studio city 8185081570
11 jody 2011 ocean front walk venice 3103061995
13 ocean 3117 ocean park blvd. santa monica 3104525728
14 bruno's 3838 centinela ave. mar vista 3103975703
15 brighton 9600 brighton way beverly hills 3102767732
16 diaghilev 1020 n. san vicente blvd. w. hollywood 3108541111
17 rubin's 15322 ventura blvd. encino 8189056515
18 ruth's 224 s. beverly dr. beverly hills 3108598744
19 la 1842 e. first st. boyle hts . 2132652887
22 ruby's 45 s. fair oaks ave. pasadena 8187967829
23 mishima 8474 w. third st. la 2137820181
24 baja 3345 kimber dr. westlake village 8054984049
25 falafel 1059 broxton ave. westwood 3102084444
29 asahi 2027 sawtelle blvd. west la 3104792231
30 bristol 1570 rosecrans ave. s. pasadena 3106435229
31 pho 642 broadway chinatown 2136265530
32 local 30869 thousand oaks blvd. westlake village 8187067706
33 joe's 1023 abbot kinney blvd. venice 3103995811
34 bistro 45 s. mentor ave. pasadena 8187952478
35 jan's 8424 beverly blvd. la 2136512866
36 michael's 1147 third st. santa monica 3104510843
37 ocean 145 n. atlantic blvd. monterey park 8183082128
38 cafe 9777 little santa monica blvd. beverly hills 3108880108
39 main 10509 w. pico blvd. rancho park 3104757564
41 indo 10428 1/2 national blvd. la 3108151290
42 chez 1716 pch redondo beach 3105401222
44 koo 8393 w. beverly blvd. la 2136559045
45 rae's 2901 pico blvd. santa monica 3108287937
46 benita's 1433 third st. promenade santa monica 3104582889
48 johnny 4663 long beach blvd. long beach 3104237327
49 johnny 7507 melrose ave. la 2136513361
51 cafe 838 lincoln blvd. venice 3103991955
52 cassell's 3266 w. sixth st. la 2134808668
54 maxwell's 13329 washington blvd. marina del rey 3103067829
56 hollywood 6145 franklin ave. hollywood 2134677678
58 duke's 8909 sunset blvd. w. hollywood 3106523100
59 johnnie's 4017 s. sepulveda blvd. culver city 3103976654
60 bernard's 515 s. olive st. los angeles 2136121580
61 apple 10801 w. pico blvd. west la 3104753585
62 belvedere 9882 little santa monica blvd. beverly hills 3107882306
63 poquito 2635 w. olive ave. burbank 8185632252
64 la 10506 little santa monica blvd. century city 3104704992
66 la 22800 pch malibu 3104566299
68 russell's 1198 pch seal beach 3105969556
69 mani's 519 s. fairfax ave. la 2139388800
70 john 10516 w. pico blvd. west la 3102040692
72 brent's 19565 parthenia ave. northridge 8188865679
76 don 1136 westwood blvd. westwood 3102091422
77 feast 1949 westwood blvd. west la 3104750400
78 mulberry 17040 ventura blvd. encino 8189068881
80 jiraffe 502 santa monica blvd santa monica 3109176671
81 martha's 22nd street grill 25 22nd st. hermosa beach 3103767786
cuisine_type
0 american
2 californian
3 japanese
4 cajun/creole
5 hot dogs
6 diners
8 delis
9 hamburgers
10 seafood
11 hot dogs
13 american
14 italian
15 coffee shops
16 russian
17 hot dogs
18 steakhouses
19 mexican/tex-mex
22 diners
23 noodle shops
24 mexican
25 middle eastern
29 noodle shops
30 californian
31 vietnamese
32 health food
33 american ( new )
34 californian
35 coffee shops
36 californian
37 seafood
38 pacific new wave
39 american
41 indonesian
42 eclectic
44 chicken
45 diners
46 fast food
48 southern/soul
49 american
51 american
52 hamburgers
54 american
56 coffee shops
58 coffee shops
59 delis
60 continental
61 american
62 pacific new wave
63 mexican
64 french ( new )
66 mexican
68 hamburgers
69 desserts
70 coffee shops
72 delis
76 italian
77 chinese
78 pizza
80 californian
81 american
# Append non_dup to restaurants
full_restaurants = restaurants.append(non_dup)
print(full_restaurants)
rest_name rest_addr city \
0 arnie morton's of chicago 435 s. la cienega blv . los angeles
1 art's delicatessen 12224 ventura blvd. studio city
2 campanile 624 s. la brea ave. los angeles
3 fenix 8358 sunset blvd. west hollywood
4 grill on the alley 9560 dayton way los angeles
.. ... ... ...
76 don 1136 westwood blvd. westwood
77 feast 1949 westwood blvd. west la
78 mulberry 17040 ventura blvd. encino
80 jiraffe 502 santa monica blvd santa monica
81 martha's 22nd street grill 25 22nd st. hermosa beach
phone cuisine_type
0 3102461501 american
1 8187621221 american
2 2139381447 american
3 2138486677 american
4 3102760615 american
.. ... ...
76 3102091422 italian
77 3104750400 chinese
78 8189068881 pizza
80 3109176671 californian
81 3103767786 american
[396 rows x 5 columns]