Stacking and Unstacking Dataframes with Pandas

In this chapter, you’ll level-up your data manipulation skills using multi-level indexing. You’ll learn how to reshape DataFrames by rearranging levels of the row indexes to the column axis, or vice versa. You’ll also gain the skills you need to handle missing data generated in the stacking and unstacking processes.

Stack the calls!¶

New week, new project! One of your clients, a telecommunication company, wants to know why its customers are leaving. You will perform an analysis to figure it out. First, you explored the dataset churn and realized some information is missing. The dataset contains data about the total number of calls and the minutes spent on the phone by different customers. However, the state and city they live in are not listed.

You predefined an array with that data. You’d like to add it as an index in your DataFrame.

In [2]:

import pandas as pd
# Predefined list to use as index
new_index = [['California', 'California', 'New York', 'Ohio'], 
             ['Los Angeles', 'San Francisco', 'New York', 'Cleveland']]

# Create a multi-level index using predefined new_index
churn_new = pd.MultiIndex.from_arrays(new_index, names=['state', 'city'])

# Print churn_new
print(churn_new)

MultiIndex([('California',   'Los Angeles'),
            ('California', 'San Francisco'),
            (  'New York',      'New York'),
            (      'Ohio',     'Cleveland')],
           names=['state', 'city'])

In [3]:

filename = 'churn.csv'
churn = pd.read_csv(filename, index_col=0)

# Assign the new index to the churn index
churn.index = churn_new

# Print churn
print(churn)

                          Area code  total_day_calls  total_day_minutes
state      city                                                        
California Los Angeles          116              204                NaN
           San Francisco        408              109              287.0
New York   New York             415               84               84.0
Ohio       Cleveland            510               67               50.0

In [5]:

# Reshape by stacking churn DataFrame
churn_stack = churn.stack()

# Print churn_stack
print(churn_stack)

state       city                            
California  Los Angeles    Area code            116.0
                           total_day_calls      204.0
            San Francisco  Area code            408.0
                           total_day_calls      109.0
                           total_day_minutes    287.0
New York    New York       Area code            415.0
                           total_day_calls       84.0
                           total_day_minutes     84.0
Ohio        Cleveland      Area code            510.0
                           total_day_calls       67.0
                           total_day_minutes     50.0
dtype: float64

Phone directory index¶

After reshaping the dataset, you sent it to your colleagues and asked them to fill in some data. Now, they sent the new churn dataset back and you realized that its shape has changed.

Before you go on, you need to do some reshaping again. The dataset contains a multi-level index in the columns. You’d like to have some columns set as the row index. Also, this time you will only stack some levels. You believe it will help you discover some patterns in the data. The DataFrame churn contains data about state, city, total_day_calls and total_day_minutes during day and night time.

In [28]:

# https://medium.com/@gotashirato/how-to-import-csv-file-with-multi-level-columns-python-basics-and-a-question-b67cbbbc174d
filename = 'churn2.csv'
churn = pd.read_csv(filename, index_col=[1,2], header=[0,1])

# Print churn
print(churn)
print()

state                    Unnamed: 0_level_0       night                \
                         Unnamed: 0_level_1 total calls total minutes   
California Los Angeles                    0         116           204   
           San Francisco                  1         109           287   
New York   New York                       2          84            84   
Ohio       Cleveland                      3          67            50   

state                            day                
                         total calls total minutes  
California Los Angeles            85           107  
           San Francisco          90           167  
New York   New York               75            90  
Ohio       Cleveland              67           110

In [29]:

churn = churn.iloc[:, 1:]

# Print churn
print(churn)
print()

state                          night                       day              
                         total calls total minutes total calls total minutes
California Los Angeles           116           204          85           107
           San Francisco         109           287          90           167
New York   New York               84            84          75            90
Ohio       Cleveland              67            50          67           110

In [30]:

# Reshape by stacking the second level
churn_stack = churn.stack(level=1)

# Print churn_stack
print(churn_stack)

state                                   day  night
California Los Angeles   total calls     85    116
                         total minutes  107    204
           San Francisco total calls     90    109
                         total minutes  167    287
New York   New York      total calls     75     84
                         total minutes   90     84
Ohio       Cleveland     total calls     67     67
                         total minutes  110     50

Text me!¶

You are making progress in your customer’s project. Now, you need to analyze a new dataset to find differences in the messages and gigabytes (GB) of data the customers use during the daytime and nighttime.

To that aim, you will reshape your dataset churn using different levels. The advantage of your new dataset is that the column indices have names. The DataFrame churn contains data about state, city, text messages and total GB during day and night time.

In [34]:

filename = 'churn3.csv'
churn = pd.read_csv(filename, index_col=[0,1], header=[0,1])
print(churn)

time                               day                  night         
feature                  text messages total GB text messages total GB
state      city                                                       
California Los Angeles              20        5            30       10
           San Francisco            40        5           100        5
New York   New York                 50        2            20        9
Ohio       Cleveland               100        3            40        6

In [35]:

# Stack churn by the time column level
churn_time = churn.stack(level=0)

# Print churn_time
print(churn_time)

feature                         text messages  total GB
state      city          time                          
California Los Angeles   day               20         5
                         night             30        10
           San Francisco day               40         5
                         night            100         5
New York   New York      day               50         2
                         night             20         9
Ohio       Cleveland     day              100         3
                         night             40         6

In [36]:

# Stack churn by the feature column level
churn_feature = churn.stack(level='feature')

# Print churn_feature
print(churn_feature)

time                                    day  night
state      city          feature                  
California Los Angeles   text messages   20     30
                         total GB         5     10
           San Francisco text messages   40    100
                         total GB         5      5
New York   New York      text messages   50     20
                         total GB         2      9
Ohio       Cleveland     text messages  100     40
                         total GB         3      6

International caller¶

You have a new task. You will analyze the pattern of customers on international and domestic calls.

You explore the churn dataset, which contains a multi-level row index. Again, you will reshape the data, as you expect it will help you to do further analysis. The DataFrame churn contains data about minutes, calls, and charge for different times of the day, types of calls, and exited status.

In [47]:

filename = 'churn4.csv'
churn = pd.read_csv(filename).set_index(['time','type','exited'])

print(churn)

                              minutes  calls  charge
time  type          exited                          
day   International churn       184.5     97   31.37
      National      churn       129.1    137   21.95
night International churn       332.9     67   56.59
      National      no churn    110.4    103   18.77
eve   International no churn    119.3    117   20.28
      National      no churn    137.1     88   23.31

In [50]:

# Reshape the churn DataFrame by unstacking
churn_unstack = churn.unstack()

# Print churn_unstack
print(churn_unstack)

                    minutes           calls          charge         
exited                churn no churn  churn no churn  churn no churn
time  type                                                          
day   International   184.5      NaN   97.0      NaN  31.37      NaN
      National        129.1      NaN  137.0      NaN  21.95      NaN
eve   International     NaN    119.3    NaN    117.0    NaN    20.28
      National          NaN    137.1    NaN     88.0    NaN    23.31
night International   332.9      NaN   67.0      NaN  56.59      NaN
      National          NaN    110.4    NaN    103.0    NaN    18.77

In [51]:

# Reshape churn by unstacking the first row level
churn_first = churn.unstack(level=0)

# Print churn_zero
print(churn_first)

                       minutes                calls               charge  \
time                       day    eve  night    day    eve  night    day   
type          exited                                                       
International churn      184.5    NaN  332.9   97.0    NaN   67.0  31.37   
              no churn     NaN  119.3    NaN    NaN  117.0    NaN    NaN   
National      churn      129.1    NaN    NaN  137.0    NaN    NaN  21.95   
              no churn     NaN  137.1  110.4    NaN   88.0  103.0    NaN   

                                      
time                      eve  night  
type          exited                  
International churn       NaN  56.59  
              no churn  20.28    NaN  
National      churn       NaN    NaN  
              no churn  23.31  18.77

In [52]:

# Reshape churn by unstacking the second row level
churn_second = churn.unstack(level=1)

# Print churn_second
print(churn_second)

                     minutes                  calls                 charge  \
type           International National International National International   
time  exited                                                                 
day   churn            184.5    129.1          97.0    137.0         31.37   
eve   no churn         119.3    137.1         117.0     88.0         20.28   
night churn            332.9      NaN          67.0      NaN         56.59   
      no churn           NaN    110.4           NaN    103.0           NaN   

                         
type           National  
time  exited             
day   churn       21.95  
eve   no churn    23.31  
night churn         NaN  
      no churn    18.77

In [53]:

# Sort the index in descending order
churn_time = churn.unstack(level='time').sort_index(ascending=False)

# Print churn_time
print(churn_time)

                       minutes                calls               charge  \
time                       day    eve  night    day    eve  night    day   
type          exited                                                       
National      no churn     NaN  137.1  110.4    NaN   88.0  103.0    NaN   
              churn      129.1    NaN    NaN  137.0    NaN    NaN  21.95   
International no churn     NaN  119.3    NaN    NaN  117.0    NaN    NaN   
              churn      184.5    NaN  332.9   97.0    NaN   67.0  31.37   

                                      
time                      eve  night  
type          exited                  
National      no churn  23.31  18.77  
              churn       NaN    NaN  
International no churn  20.28    NaN  
              churn       NaN  56.59

Organizing your voicemail¶

You will perform one final task before moving to a new project. You will reshape the DataFrame churn again. This time, you’ll reorganize a row index as a column index. After that, you will move a column index to a row index. To do this, you will first unstack the DataFrame, and then stack it.

The same churn DataFrame contains data about minutes, calls, and charge for different times of the day, types of calls, and exited status.

In [55]:

# Unstack churn by type level
churn_final = churn.unstack(level='type').stack(level=0)

# Print churn_type
print(churn_final)

type                    International  National
time  exited                                   
day   churn    calls            97.00    137.00
               charge           31.37     21.95
               minutes         184.50    129.10
eve   no churn calls           117.00     88.00
               charge           20.28     23.31
               minutes         119.30    137.10
night churn    calls            67.00       NaN
               charge           56.59       NaN
               minutes         332.90       NaN
      no churn calls              NaN    103.00
               charge             NaN     18.77
               minutes            NaN    110.40

Swap your SIM card¶

Great job so far! You were able to reshape your dataset in several ways. Now it’s time to go a step further and analyze the data to discover if a customer’s cell phone plan is related to the customer leaving.

You explore the churn dataset and notice that the row levels are not well organized. First, you want to rearrange your row indicesso it’s easier to reshape your DataFrame. The churn DataFrame contains data about minutes, voicemail, and data plans for different years. The data is indexed by state, city, and exited status.

In [58]:

filename = 'churn5.csv'
churn = pd.read_csv(filename, index_col=[0,1,2], header=[0,1])
print(churn)

year                               2019                   2020               
plan                            minutes voicemail data minutes voicemail data
exited   state      city                                                     
churn    California Los Angeles       0         1    2       1         1    5
no_churn California Los Angeles       0         1    3       1         0    2
churn    New York   New York          1         0    5       0         1    2
no_churn New York   New York          1         0    4       1         0    6

In [60]:

# Switch the first and third row index levels in churn
churn_swap = churn.swaplevel(0, 2)

# Reshape by unstacking the last row level 
churn_unstack = churn_swap.unstack()

# Print churn_unstack
print(churn_unstack)

year                               2019                   2020               
plan                            minutes voicemail data minutes voicemail data
city        state      exited                                                
Los Angeles California churn          0         1    2       1         1    5
                       no_churn       0         1    3       1         0    2
New York    New York   churn          1         0    5       0         1    2
                       no_churn       1         0    4       1         0    6

Two many calls¶

Your last analysis was successful, but you still have some questions to answer. You are not satisfied with the organization of the data in your DataFrame.

For that reason, you plan on switching and rearranging row and column indices by chaining the stacking and unstacking processes. Also, you would like to rearrange several levels at the same time. The same churn DataFrame contains data about minutes, voicemail, and data plans for different years. The data is indexed by state, city, and exited status.

In [61]:

# Unstack the first and second row level of churn
churn_unstack = churn.unstack(level=[0, 1])

# Print churn_unstack
print(churn_unstack)

year              2019                                                   \
plan           minutes                               voicemail            
exited           churn            no_churn               churn            
state       California New York California New York California New York   
city                                                                      
Los Angeles        0.0      NaN        0.0      NaN        1.0      NaN   
New York           NaN      1.0        NaN      1.0        NaN      0.0   

year                                                 ...       2020           \
plan                                  data           ...    minutes            
exited        no_churn               churn           ...   no_churn            
state       California New York California New York  ... California New York   
city                                                 ...                       
Los Angeles        1.0      NaN        2.0      NaN  ...        1.0      NaN   
New York           NaN      0.0        NaN      5.0  ...        NaN      1.0   

year                                                                     \
plan         voicemail                                    data            
exited           churn            no_churn               churn            
state       California New York California New York California New York   
city                                                                      
Los Angeles        1.0      NaN        0.0      NaN        5.0      NaN   
New York           NaN      1.0        NaN      0.0        NaN      2.0   

year                             
plan                             
exited        no_churn           
state       California New York  
city                             
Los Angeles        2.0      NaN  
New York           NaN      6.0  

[2 rows x 24 columns]

In [62]:

# Unstack the first and second row level of churn
churn_unstack = churn.unstack(level=[0, 1])

# Stack the resulting DataFrame using plan and year
churn_py = churn_unstack.stack(level=[1,0])

# Print churn_py
print(churn_py)

exited                          churn            no_churn         
state                      California New York California New York
city        plan      year                                        
Los Angeles data      2019        2.0      NaN        3.0      NaN
                      2020        5.0      NaN        2.0      NaN
            minutes   2019        0.0      NaN        0.0      NaN
                      2020        1.0      NaN        1.0      NaN
            voicemail 2019        1.0      NaN        1.0      NaN
                      2020        1.0      NaN        0.0      NaN
New York    data      2019        NaN      5.0        NaN      4.0
                      2020        NaN      2.0        NaN      6.0
            minutes   2019        NaN      1.0        NaN      1.0
                      2020        NaN      0.0        NaN      1.0
            voicemail 2019        NaN      0.0        NaN      0.0
                      2020        NaN      1.0        NaN      0.0

In [63]:

# Unstack the first and second row level of churn
churn_unstack = churn.unstack(level=[0, 1])

# Stack the resulting DataFrame using plan and year
churn_py = churn_unstack.stack(['plan', 'year'])

# Switch the first and second column levels
churn_switch = churn_py.swaplevel(0, 1, axis=1)

# Print churn_switch
print(churn_switch)

state                      California New York California New York
exited                          churn    churn   no_churn no_churn
city        plan      year                                        
Los Angeles data      2019        2.0      NaN        3.0      NaN
                      2020        5.0      NaN        2.0      NaN
            minutes   2019        0.0      NaN        0.0      NaN
                      2020        1.0      NaN        1.0      NaN
            voicemail 2019        1.0      NaN        1.0      NaN
                      2020        1.0      NaN        0.0      NaN
New York    data      2019        NaN      5.0        NaN      4.0
                      2020        NaN      2.0        NaN      6.0
            minutes   2019        NaN      1.0        NaN      1.0
                      2020        NaN      0.0        NaN      1.0
            voicemail 2019        NaN      0.0        NaN      0.0
                      2020        NaN      1.0        NaN      0.0

A missed phone call¶

You finished reshaping your churn dataset in the previous exercises. Now, it is ready to be used. You remember that something caught your attention. You are sure you saw a clear pattern in the data.

Before you fit a classification model, you decide to do something simpler. You want to see what else you can learn from the data. You will reshape your data by unstacking levels, but you know this process will generate missing data that you need to handle.

The churn DataFrame contains different features of customers located in Los Angeles and New York, and is available for you.

In [91]:

filename = 'churn6.csv'
churn = pd.read_csv(filename, index_col=[0,1,2,3])

print(churn)

                                                total_day_calls  \
state international_plan voice_mail_plan churn                    
LA    No                 No              False       106.818182   
                                         True        100.000000   
                         Yes             False       100.000000   
NY    No                 No              False        90.900000   
                                         True         95.000000   
                         Yes             False       115.000000   
      Yes                No              False       109.000000   
                                         True         87.000000   
LA    Yes                No              False        78.000000   
                                         True         69.000000   
NY    Yes                Yes             False       120.000000   
LA    Yes                Yes             False        71.000000   

                                                total_night_calls  
state international_plan voice_mail_plan churn                     
LA    No                 No              False          96.909091  
                                         True          119.000000  
                         Yes             False          84.250000  
NY    No                 No              False         100.800000  
                                         True          101.500000  
                         Yes             False         121.000000  
      Yes                No              False          99.000000  
                                         True          113.000000  
LA    Yes                No              False          90.000000  
                                         True          104.000000  
NY    Yes                Yes             False          78.000000  
LA    Yes                Yes             False         101.000000

In [92]:

# Unstack churn level and fill missing values with zero
churn2 = churn.unstack(level=3, fill_value=0)

# Sort by descending voice mail plan and ascending international plan
churn_sorted = churn2.sort_index(level=[2, 1], 
                          ascending=[False, True])

# Print final DataFrame and observe pattern
print(churn_sorted)

                                         total_day_calls         \
churn                                              False   True   
state international_plan voice_mail_plan                          
LA    No                 Yes                  100.000000    0.0   
NY    No                 Yes                  115.000000    0.0   
LA    Yes                Yes                   71.000000    0.0   
NY    Yes                Yes                  120.000000    0.0   
LA    No                 No                   106.818182  100.0   
NY    No                 No                    90.900000   95.0   
LA    Yes                No                    78.000000   69.0   
NY    Yes                No                   109.000000   87.0   

                                         total_night_calls         
churn                                                False   True  
state international_plan voice_mail_plan                           
LA    No                 Yes                     84.250000    0.0  
NY    No                 Yes                    121.000000    0.0  
LA    Yes                Yes                    101.000000    0.0  
NY    Yes                Yes                     78.000000    0.0  
LA    No                 No                      96.909091  119.0  
NY    No                 No                     100.800000  101.5  
LA    Yes                No                      90.000000  104.0  
NY    Yes                No                      99.000000  113.0

Don’t drop the stack¶

It’s almost time to go home, but first, you need to finish your last task. You have a small dataset containing the total number of calls made by customers.

To perform your analysis, you need to reshape your churn data by stacking different levels. You know this process will generate missing data. You want to check if it is worth keeping the rows that contain all missing values, or if it’s better to drop that information.

In [101]:

filename = 'churn7.csv'
churn = pd.read_csv(filename, index_col=0, header=[0,1])

print(churn)

type  total_day_calls total_night_calls         
scope   international     international national
LA                 23                30      NaN
NY                  8                34     24.0
CA                  8                34     24.0

In [102]:

# Stack the level type from churn
churn_stack = churn.stack(level='type')

# Fill the resulting missing values with zero 
churn_fill = churn_stack.fillna(0)

# Print churn_fill
print(churn_fill)

scope                 international  national
   type                                      
LA total_day_calls               23       0.0
   total_night_calls             30       0.0
NY total_day_calls                8       0.0
   total_night_calls             34      24.0
CA total_day_calls                8       0.0
   total_night_calls             34      24.0

In [103]:

# Stack the level scope without dropping rows with missing values
churn_stack = churn.stack(level='scope', dropna=False)

# Fill the resulting missing values with zero
churn_fill = churn_stack.fillna(0)

# Print churn_fill
print(churn_fill)

type              total_day_calls  total_night_calls
   scope                                            
LA international             23.0               30.0
   national                   0.0                0.0
NY international              8.0               34.0
   national                   0.0               24.0
CA international              8.0               34.0
   national                   0.0               24.0