In this chapter, you’ll level-up your data manipulation skills using multi-level indexing. You’ll learn how to reshape DataFrames by rearranging levels of the row indexes to the column axis, or vice versa. You’ll also gain the skills you need to handle missing data generated in the stacking and unstacking processes.
Stack the calls!¶
New week, new project! One of your clients, a telecommunication company, wants to know why its customers are leaving. You will perform an analysis to figure it out. First, you explored the dataset churn and realized some information is missing. The dataset contains data about the total number of calls and the minutes spent on the phone by different customers. However, the state and city they live in are not listed.
You predefined an array with that data. You’d like to add it as an index in your DataFrame.
import pandas as pd
# Predefined list to use as index
new_index = [['California', 'California', 'New York', 'Ohio'],
['Los Angeles', 'San Francisco', 'New York', 'Cleveland']]
# Create a multi-level index using predefined new_index
churn_new = pd.MultiIndex.from_arrays(new_index, names=['state', 'city'])
# Print churn_new
print(churn_new)
MultiIndex([('California', 'Los Angeles'),
('California', 'San Francisco'),
( 'New York', 'New York'),
( 'Ohio', 'Cleveland')],
names=['state', 'city'])
filename = 'churn.csv'
churn = pd.read_csv(filename, index_col=0)
# Assign the new index to the churn index
churn.index = churn_new
# Print churn
print(churn)
Area code total_day_calls total_day_minutes
state city
California Los Angeles 116 204 NaN
San Francisco 408 109 287.0
New York New York 415 84 84.0
Ohio Cleveland 510 67 50.0
# Reshape by stacking churn DataFrame
churn_stack = churn.stack()
# Print churn_stack
print(churn_stack)
state city
California Los Angeles Area code 116.0
total_day_calls 204.0
San Francisco Area code 408.0
total_day_calls 109.0
total_day_minutes 287.0
New York New York Area code 415.0
total_day_calls 84.0
total_day_minutes 84.0
Ohio Cleveland Area code 510.0
total_day_calls 67.0
total_day_minutes 50.0
dtype: float64
Phone directory index¶
After reshaping the dataset, you sent it to your colleagues and asked them to fill in some data. Now, they sent the new churn dataset back and you realized that its shape has changed.
Before you go on, you need to do some reshaping again. The dataset contains a multi-level index in the columns. You’d like to have some columns set as the row index. Also, this time you will only stack some levels. You believe it will help you discover some patterns in the data. The DataFrame churn contains data about state, city, total_day_calls and total_day_minutes during day and night time.
# https://medium.com/@gotashirato/how-to-import-csv-file-with-multi-level-columns-python-basics-and-a-question-b67cbbbc174d
filename = 'churn2.csv'
churn = pd.read_csv(filename, index_col=[1,2], header=[0,1])
# Print churn
print(churn)
print()
state Unnamed: 0_level_0 night \
Unnamed: 0_level_1 total calls total minutes
California Los Angeles 0 116 204
San Francisco 1 109 287
New York New York 2 84 84
Ohio Cleveland 3 67 50
state day
total calls total minutes
California Los Angeles 85 107
San Francisco 90 167
New York New York 75 90
Ohio Cleveland 67 110
churn = churn.iloc[:, 1:]
# Print churn
print(churn)
print()
state night day
total calls total minutes total calls total minutes
California Los Angeles 116 204 85 107
San Francisco 109 287 90 167
New York New York 84 84 75 90
Ohio Cleveland 67 50 67 110
# Reshape by stacking the second level
churn_stack = churn.stack(level=1)
# Print churn_stack
print(churn_stack)
state day night
California Los Angeles total calls 85 116
total minutes 107 204
San Francisco total calls 90 109
total minutes 167 287
New York New York total calls 75 84
total minutes 90 84
Ohio Cleveland total calls 67 67
total minutes 110 50
Text me!¶
You are making progress in your customer’s project. Now, you need to analyze a new dataset to find differences in the messages and gigabytes (GB) of data the customers use during the daytime and nighttime.
To that aim, you will reshape your dataset churn using different levels. The advantage of your new dataset is that the column indices have names. The DataFrame churn contains data about state, city, text messages and total GB during day and night time.
filename = 'churn3.csv'
churn = pd.read_csv(filename, index_col=[0,1], header=[0,1])
print(churn)
time day night
feature text messages total GB text messages total GB
state city
California Los Angeles 20 5 30 10
San Francisco 40 5 100 5
New York New York 50 2 20 9
Ohio Cleveland 100 3 40 6
# Stack churn by the time column level
churn_time = churn.stack(level=0)
# Print churn_time
print(churn_time)
feature text messages total GB
state city time
California Los Angeles day 20 5
night 30 10
San Francisco day 40 5
night 100 5
New York New York day 50 2
night 20 9
Ohio Cleveland day 100 3
night 40 6
# Stack churn by the feature column level
churn_feature = churn.stack(level='feature')
# Print churn_feature
print(churn_feature)
time day night
state city feature
California Los Angeles text messages 20 30
total GB 5 10
San Francisco text messages 40 100
total GB 5 5
New York New York text messages 50 20
total GB 2 9
Ohio Cleveland text messages 100 40
total GB 3 6
International caller¶
You have a new task. You will analyze the pattern of customers on international and domestic calls.
You explore the churn dataset, which contains a multi-level row index. Again, you will reshape the data, as you expect it will help you to do further analysis. The DataFrame churn contains data about minutes, calls, and charge for different times of the day, types of calls, and exited status.
filename = 'churn4.csv'
churn = pd.read_csv(filename).set_index(['time','type','exited'])
print(churn)
minutes calls charge
time type exited
day International churn 184.5 97 31.37
National churn 129.1 137 21.95
night International churn 332.9 67 56.59
National no churn 110.4 103 18.77
eve International no churn 119.3 117 20.28
National no churn 137.1 88 23.31
# Reshape the churn DataFrame by unstacking
churn_unstack = churn.unstack()
# Print churn_unstack
print(churn_unstack)
minutes calls charge
exited churn no churn churn no churn churn no churn
time type
day International 184.5 NaN 97.0 NaN 31.37 NaN
National 129.1 NaN 137.0 NaN 21.95 NaN
eve International NaN 119.3 NaN 117.0 NaN 20.28
National NaN 137.1 NaN 88.0 NaN 23.31
night International 332.9 NaN 67.0 NaN 56.59 NaN
National NaN 110.4 NaN 103.0 NaN 18.77
# Reshape churn by unstacking the first row level
churn_first = churn.unstack(level=0)
# Print churn_zero
print(churn_first)
minutes calls charge \
time day eve night day eve night day
type exited
International churn 184.5 NaN 332.9 97.0 NaN 67.0 31.37
no churn NaN 119.3 NaN NaN 117.0 NaN NaN
National churn 129.1 NaN NaN 137.0 NaN NaN 21.95
no churn NaN 137.1 110.4 NaN 88.0 103.0 NaN
time eve night
type exited
International churn NaN 56.59
no churn 20.28 NaN
National churn NaN NaN
no churn 23.31 18.77
# Reshape churn by unstacking the second row level
churn_second = churn.unstack(level=1)
# Print churn_second
print(churn_second)
minutes calls charge \
type International National International National International
time exited
day churn 184.5 129.1 97.0 137.0 31.37
eve no churn 119.3 137.1 117.0 88.0 20.28
night churn 332.9 NaN 67.0 NaN 56.59
no churn NaN 110.4 NaN 103.0 NaN
type National
time exited
day churn 21.95
eve no churn 23.31
night churn NaN
no churn 18.77
# Sort the index in descending order
churn_time = churn.unstack(level='time').sort_index(ascending=False)
# Print churn_time
print(churn_time)
minutes calls charge \
time day eve night day eve night day
type exited
National no churn NaN 137.1 110.4 NaN 88.0 103.0 NaN
churn 129.1 NaN NaN 137.0 NaN NaN 21.95
International no churn NaN 119.3 NaN NaN 117.0 NaN NaN
churn 184.5 NaN 332.9 97.0 NaN 67.0 31.37
time eve night
type exited
National no churn 23.31 18.77
churn NaN NaN
International no churn 20.28 NaN
churn NaN 56.59
Organizing your voicemail¶
You will perform one final task before moving to a new project. You will reshape the DataFrame churn again. This time, you’ll reorganize a row index as a column index. After that, you will move a column index to a row index. To do this, you will first unstack the DataFrame, and then stack it.
The same churn DataFrame contains data about minutes, calls, and charge for different times of the day, types of calls, and exited status.
# Unstack churn by type level
churn_final = churn.unstack(level='type').stack(level=0)
# Print churn_type
print(churn_final)
type International National
time exited
day churn calls 97.00 137.00
charge 31.37 21.95
minutes 184.50 129.10
eve no churn calls 117.00 88.00
charge 20.28 23.31
minutes 119.30 137.10
night churn calls 67.00 NaN
charge 56.59 NaN
minutes 332.90 NaN
no churn calls NaN 103.00
charge NaN 18.77
minutes NaN 110.40
Swap your SIM card¶
Great job so far! You were able to reshape your dataset in several ways. Now it’s time to go a step further and analyze the data to discover if a customer’s cell phone plan is related to the customer leaving.
You explore the churn dataset and notice that the row levels are not well organized. First, you want to rearrange your row indicesso it’s easier to reshape your DataFrame. The churn DataFrame contains data about minutes, voicemail, and data plans for different years. The data is indexed by state, city, and exited status.
filename = 'churn5.csv'
churn = pd.read_csv(filename, index_col=[0,1,2], header=[0,1])
print(churn)
year 2019 2020 plan minutes voicemail data minutes voicemail data exited state city churn California Los Angeles 0 1 2 1 1 5 no_churn California Los Angeles 0 1 3 1 0 2 churn New York New York 1 0 5 0 1 2 no_churn New York New York 1 0 4 1 0 6
# Switch the first and third row index levels in churn
churn_swap = churn.swaplevel(0, 2)
# Reshape by unstacking the last row level
churn_unstack = churn_swap.unstack()
# Print churn_unstack
print(churn_unstack)
year 2019 2020
plan minutes voicemail data minutes voicemail data
city state exited
Los Angeles California churn 0 1 2 1 1 5
no_churn 0 1 3 1 0 2
New York New York churn 1 0 5 0 1 2
no_churn 1 0 4 1 0 6
Two many calls¶
Your last analysis was successful, but you still have some questions to answer. You are not satisfied with the organization of the data in your DataFrame.
For that reason, you plan on switching and rearranging row and column indices by chaining the stacking and unstacking processes. Also, you would like to rearrange several levels at the same time. The same churn DataFrame contains data about minutes, voicemail, and data plans for different years. The data is indexed by state, city, and exited status.
# Unstack the first and second row level of churn
churn_unstack = churn.unstack(level=[0, 1])
# Print churn_unstack
print(churn_unstack)
year 2019 \ plan minutes voicemail exited churn no_churn churn state California New York California New York California New York city Los Angeles 0.0 NaN 0.0 NaN 1.0 NaN New York NaN 1.0 NaN 1.0 NaN 0.0 year ... 2020 \ plan data ... minutes exited no_churn churn ... no_churn state California New York California New York ... California New York city ... Los Angeles 1.0 NaN 2.0 NaN ... 1.0 NaN New York NaN 0.0 NaN 5.0 ... NaN 1.0 year \ plan voicemail data exited churn no_churn churn state California New York California New York California New York city Los Angeles 1.0 NaN 0.0 NaN 5.0 NaN New York NaN 1.0 NaN 0.0 NaN 2.0 year plan exited no_churn state California New York city Los Angeles 2.0 NaN New York NaN 6.0 [2 rows x 24 columns]
# Unstack the first and second row level of churn
churn_unstack = churn.unstack(level=[0, 1])
# Stack the resulting DataFrame using plan and year
churn_py = churn_unstack.stack(level=[1,0])
# Print churn_py
print(churn_py)
exited churn no_churn
state California New York California New York
city plan year
Los Angeles data 2019 2.0 NaN 3.0 NaN
2020 5.0 NaN 2.0 NaN
minutes 2019 0.0 NaN 0.0 NaN
2020 1.0 NaN 1.0 NaN
voicemail 2019 1.0 NaN 1.0 NaN
2020 1.0 NaN 0.0 NaN
New York data 2019 NaN 5.0 NaN 4.0
2020 NaN 2.0 NaN 6.0
minutes 2019 NaN 1.0 NaN 1.0
2020 NaN 0.0 NaN 1.0
voicemail 2019 NaN 0.0 NaN 0.0
2020 NaN 1.0 NaN 0.0
# Unstack the first and second row level of churn
churn_unstack = churn.unstack(level=[0, 1])
# Stack the resulting DataFrame using plan and year
churn_py = churn_unstack.stack(['plan', 'year'])
# Switch the first and second column levels
churn_switch = churn_py.swaplevel(0, 1, axis=1)
# Print churn_switch
print(churn_switch)
state California New York California New York
exited churn churn no_churn no_churn
city plan year
Los Angeles data 2019 2.0 NaN 3.0 NaN
2020 5.0 NaN 2.0 NaN
minutes 2019 0.0 NaN 0.0 NaN
2020 1.0 NaN 1.0 NaN
voicemail 2019 1.0 NaN 1.0 NaN
2020 1.0 NaN 0.0 NaN
New York data 2019 NaN 5.0 NaN 4.0
2020 NaN 2.0 NaN 6.0
minutes 2019 NaN 1.0 NaN 1.0
2020 NaN 0.0 NaN 1.0
voicemail 2019 NaN 0.0 NaN 0.0
2020 NaN 1.0 NaN 0.0
A missed phone call¶
You finished reshaping your churn dataset in the previous exercises. Now, it is ready to be used. You remember that something caught your attention. You are sure you saw a clear pattern in the data.
Before you fit a classification model, you decide to do something simpler. You want to see what else you can learn from the data. You will reshape your data by unstacking levels, but you know this process will generate missing data that you need to handle.
The churn DataFrame contains different features of customers located in Los Angeles and New York, and is available for you.
filename = 'churn6.csv'
churn = pd.read_csv(filename, index_col=[0,1,2,3])
print(churn)
total_day_calls \
state international_plan voice_mail_plan churn
LA No No False 106.818182
True 100.000000
Yes False 100.000000
NY No No False 90.900000
True 95.000000
Yes False 115.000000
Yes No False 109.000000
True 87.000000
LA Yes No False 78.000000
True 69.000000
NY Yes Yes False 120.000000
LA Yes Yes False 71.000000
total_night_calls
state international_plan voice_mail_plan churn
LA No No False 96.909091
True 119.000000
Yes False 84.250000
NY No No False 100.800000
True 101.500000
Yes False 121.000000
Yes No False 99.000000
True 113.000000
LA Yes No False 90.000000
True 104.000000
NY Yes Yes False 78.000000
LA Yes Yes False 101.000000
# Unstack churn level and fill missing values with zero
churn2 = churn.unstack(level=3, fill_value=0)
# Sort by descending voice mail plan and ascending international plan
churn_sorted = churn2.sort_index(level=[2, 1],
ascending=[False, True])
# Print final DataFrame and observe pattern
print(churn_sorted)
total_day_calls \
churn False True
state international_plan voice_mail_plan
LA No Yes 100.000000 0.0
NY No Yes 115.000000 0.0
LA Yes Yes 71.000000 0.0
NY Yes Yes 120.000000 0.0
LA No No 106.818182 100.0
NY No No 90.900000 95.0
LA Yes No 78.000000 69.0
NY Yes No 109.000000 87.0
total_night_calls
churn False True
state international_plan voice_mail_plan
LA No Yes 84.250000 0.0
NY No Yes 121.000000 0.0
LA Yes Yes 101.000000 0.0
NY Yes Yes 78.000000 0.0
LA No No 96.909091 119.0
NY No No 100.800000 101.5
LA Yes No 90.000000 104.0
NY Yes No 99.000000 113.0
Don’t drop the stack¶
It’s almost time to go home, but first, you need to finish your last task. You have a small dataset containing the total number of calls made by customers.
To perform your analysis, you need to reshape your churn data by stacking different levels. You know this process will generate missing data. You want to check if it is worth keeping the rows that contain all missing values, or if it’s better to drop that information.
filename = 'churn7.csv'
churn = pd.read_csv(filename, index_col=0, header=[0,1])
print(churn)
type total_day_calls total_night_calls scope international international national LA 23 30 NaN NY 8 34 24.0 CA 8 34 24.0
# Stack the level type from churn
churn_stack = churn.stack(level='type')
# Fill the resulting missing values with zero
churn_fill = churn_stack.fillna(0)
# Print churn_fill
print(churn_fill)
scope international national type LA total_day_calls 23 0.0 total_night_calls 30 0.0 NY total_day_calls 8 0.0 total_night_calls 34 24.0 CA total_day_calls 8 0.0 total_night_calls 34 24.0
# Stack the level scope without dropping rows with missing values
churn_stack = churn.stack(level='scope', dropna=False)
# Fill the resulting missing values with zero
churn_fill = churn_stack.fillna(0)
# Print churn_fill
print(churn_fill)
type total_day_calls total_night_calls scope LA international 23.0 30.0 national 0.0 0.0 NY international 8.0 34.0 national 0.0 24.0 CA international 8.0 34.0 national 0.0 24.0