Advanced Dataframe Reshaping using Pandas and Json

You’ll finish by learning how to combine the reshaping process with grouping to produce quick data manipulations. Lastly, you’ll discover how to transform list-like columns and handle complex nested data, such as nested JSON files.

Less fast food, please!¶

Monday again! You will start working on a new project – analyzing the evolution of obesity through the years. You have a dataset called obesity with the percentage of obesity in different countries and years. The data is also disaggregated by biological sex.

Your main goals are to get the mean percentage of obesity by year and sex, and by country and sex. Also, you want to get the difference between years.

You notice that the dataset has multiple indices, so you know you will have to unstack levels to accomplish your goal.

In [3]:

import pandas as pd
filename = 'obesity.csv'
obesity = pd.read_csv(filename, index_col=[0,1,2])
print(obesity)

                               perc_obesity
country   biological_sex year              
Argentina Male           2005          21.5
          Female         2005          24.2
          Male           2015          26.8
          Female         2015          28.5
Japan     Male           2005           2.5
          Female         2005           2.6
          Male           2015           4.6
          Female         2015           3.6
Norway    Male           2005          17.6
          Female         2005          18.6
          Male           2015          23.0
          Female         2015          22.2

In [4]:

# Unstack the first level and calculate the mean of the columns
obesity_general = obesity.unstack(level=0).mean(axis=1)

# Print obesity_general
print(obesity_general)

biological_sex  year
Female          2005    15.133333
                2015    18.100000
Male            2005    13.866667
                2015    18.133333
dtype: float64

In [5]:

# Unstack the second level and calculate the mean of the columns
obesity_mean = obesity.unstack(level=1).mean(axis=1)

# Print obesity_mean
print(obesity_mean)

country    year
Argentina  2005    22.85
           2015    27.65
Japan      2005     2.55
           2015     4.10
Norway     2005    18.10
           2015    22.60
dtype: float64

In [6]:

# Unstack the third level and calculate the difference between columns
obesity_variation = obesity.unstack(level=2).diff(periods=1, axis=1)

# Print obesity_variation
print(obesity_variation)

                         perc_obesity     
year                             2005 2015
country   biological_sex                  
Argentina Female                  NaN  4.3
          Male                    NaN  5.3
Japan     Female                  NaN  1.0
          Male                    NaN  2.1
Norway    Female                  NaN  3.6
          Male                    NaN  5.4

In [10]:

filename = 'obesity2.csv'
obesity = pd.read_csv(filename, index_col=[0,1], header=[0,1])
print(obesity)

year                           1995                   2005            \
                       perc_obesity variation perc_obesity variation   
country biological_sex                                                 
France  Female                 15.3       7.7         18.1       8.2   
        Male                   12.8       7.6         16.9       8.4   
Germany Female                 14.4       4.6         17.2       5.2   
        Male                   14.4       5.1         18.7       5.9   

year                           2015            
                       perc_obesity variation  
country biological_sex                         
France  Female                 20.8      11.3  
        Male                   21.5      11.8  
Germany Female                 20.1       8.4  
        Male                   23.6       9.8

In [11]:

# Stack obesity, get median of columns and unstack again
median_obesity = obesity.stack().median(axis=1).unstack()

# Print median_obesity
print(median_obesity)

                        perc_obesity  variation
country biological_sex                         
France  Female                  18.1        8.2
        Male                    16.9        8.4
Germany Female                  17.2        5.2
        Male                    18.7        5.9

In [13]:

# Stack the first level, get sum, and unstack the second level
obesity_sum = obesity.stack(level=0).sum(axis=1).unstack(level=1)

# Print obesity_max
print(obesity_sum)

biological_sex  Female  Male
country year                
France  1995      23.0  20.4
        2005      26.3  25.3
        2015      32.1  33.3
Germany 1995      19.0  19.5
        2005      22.4  24.6
        2015      28.5  33.4

In [15]:

# Stack country level, group by country and get the mean
obesity_mean = obesity.stack(level=0).groupby(level='country').mean()

# Print obesity_mean
print(obesity_mean)

         perc_obesity  variation
country                         
France      17.566667   9.166667
Germany     18.066667   6.500000

In [30]:

# Stack country level, group by country and get the median 
obesity_median = obesity.stack(level=0)

# Print obesity_median
print(obesity_median)

                             perc_obesity  variation
country biological_sex year                         
France  Female         1995          15.3        7.7
                       2005          18.1        8.2
                       2015          20.8       11.3
        Male           1995          12.8        7.6
                       2005          16.9        8.4
                       2015          21.5       11.8
Germany Female         1995          14.4        4.6
                       2005          17.2        5.2
                       2015          20.1        8.4
        Male           1995          14.4        5.1
                       2005          18.7        5.9
                       2015          23.6        9.8

In [31]:

obesity_median = obesity_median.groupby(level='country').median()
# Print obesity_median
print(obesity_median)

         perc_obesity  variation
country                         
France          17.50       8.30
Germany         17.95       5.55

Merge it all¶

Time to keep working with the obesity project! You will analyze the mean obesity percentage in different countries, but this time, the obesity DataFrame has a new column named bounds. It contains the minimum and maximum values you can find in different parts of the same country.

You notice that these values are given in a list, so you decide that you need to transform that column. You would like to have each element in a new row.

In [64]:

filename = 'obesity3.csv'
obesity0 = pd.read_csv(filename, index_col=[0], header=[0])
print(obesity)

     country  perc_obesity      bounds
0  Argentina          21.5  15.4, 31.5
1    Germany          22.3  16.2, 32.4
2      Japan           2.5    1.1, 3.5
3     Norway          23.0  13.1, 33.0

In [66]:

# Explode the values of bounds to a separate row
obesity = obesity0.assign(bounds=obesity0['bounds'].str.split(','))

# Print obesity_bounds
print(obesity)

     country  perc_obesity         bounds
0  Argentina          21.5  [15.4,  31.5]
1    Germany          22.3  [16.2,  32.4]
2      Japan           2.5    [1.1,  3.5]
3     Norway          23.0  [13.1,  33.0]

In [67]:

# Explode the values of bounds to a separate row
obesity_bounds = obesity['bounds'].explode()

# Print obesity_bounds
print(obesity_bounds)

0     15.4
0     31.5
1     16.2
1     32.4
2      1.1
2      3.5
3     13.1
3     33.0
Name: bounds, dtype: object

In [68]:

# Merge obesity_bounds with country and perc_obesity columns of obesity using the indexes
obesity_final = obesity[['country', 'perc_obesity']].merge(obesity_bounds, 
                                        right_index=True, 
                                        left_index=True)

# Print obesity_final
print(obesity_final)

     country  perc_obesity bounds
0  Argentina          21.5   15.4
0  Argentina          21.5   31.5
1    Germany          22.3   16.2
1    Germany          22.3   32.4
2      Japan           2.5    1.1
2      Japan           2.5    3.5
3     Norway          23.0   13.1
3     Norway          23.0   33.0

Explode the bounds¶

You were able to transform the list-like column successfully, but you are not satisfied with the steps you had to take. You want to find an easier way to get the same reshaped DataFrame.

You remembered what you learned about exploding list-like columns, and you will apply a new strategy. The same DataFrame obesity contains the country, perc_obesity, and the column bounds with the minimum and maximum values you can find in different parts of the same country.

In [69]:

# Modify obesity_explode by resetting the index 
obesity_final.reset_index(drop=True, inplace=True)

# Print obesity_explode
print(obesity_final)

     country  perc_obesity bounds
0  Argentina          21.5   15.4
1  Argentina          21.5   31.5
2    Germany          22.3   16.2
3    Germany          22.3   32.4
4      Japan           2.5    1.1
5      Japan           2.5    3.5
6     Norway          23.0   13.1
7     Norway          23.0   33.0

In [72]:

filename = 'obesity4.csv'
obesity = pd.read_csv(filename, index_col=[0])
print(obesity)

        country  perc_obesity     bounds
0        France          14.5  11.4-25.5
1        Mexico          25.3  16.2-32.4
2         Spain          12.5   8.1-16.5
3  South Africa          11.3   9.1-20.1

In [73]:

# Split the columns bounds using a hyphen as delimiter
obesity_split = obesity['bounds'].str.split('-')

# Print obesity_split
print(obesity_split)

0    [11.4, 25.5]
1    [16.2, 32.4]
2     [8.1, 16.5]
3     [9.1, 20.1]
Name: bounds, dtype: object

In [74]:

# Transform the column bounds in the obesity DataFrame
obesity_split = obesity.assign(bounds=obesity['bounds'].str.split('-')).explode('bounds')

# Print obesity_split
print(obesity_split)

        country  perc_obesity bounds
0        France          14.5   11.4
0        France          14.5   25.5
1        Mexico          25.3   16.2
1        Mexico          25.3   32.4
2         Spain          12.5    8.1
2         Spain          12.5   16.5
3  South Africa          11.3    9.1
3  South Africa          11.3   20.1

In [85]:

# Import the json_normalize function
from pandas import json_normalize

movies = [{'director': 'Woody Allen', 'producer': 'Letty Aronson', 'features': {'title': 'Magic in the Moonlight', 'year': 2014}}, {'director': 'Niki Caro', 'producer': 'Jason Reed', 'features': {'title': 'Mulan', 'year': 2020}}]

# Normalize movies and separate the new columns with an underscore 
movies_norm = json_normalize(movies, sep='_')

print(movies_norm)
print()

# Reshape using director and producer as index, create movies from column starting from features
movies_long = pd.wide_to_long(movies_norm, stubnames='features', 
                      i=['director','producer'], j='movies', 
                      sep='_', suffix='\w+')

# Print movies_long
print(movies_long)

      director       producer          features_title  features_year
0  Woody Allen  Letty Aronson  Magic in the Moonlight           2014
1    Niki Caro     Jason Reed                   Mulan           2020

                                                features
director    producer      movies                        
Woody Allen Letty Aronson title   Magic in the Moonlight
                          year                      2014
Niki Caro   Jason Reed    title                    Mulan
                          year                      2020

In [76]:

# Specify the features column as the list of records 
normalize_movies = json_normalize(movies)

# Print normalize_movies
print(normalize_movies)

      director       producer          features.title  features.year
0  Woody Allen  Letty Aronson  Magic in the Moonlight           2014
1    Niki Caro     Jason Reed                   Mulan           2020

In [82]:

# Specify director and producer to use as metadata for each record 
# normalize_movies = json_normalize(movies, record_path='features', meta=['director','producer'])

# Print normalize_movies
# print(normalize_movies)

In [83]:

import json
names = ['Killdeer', 'Chipping Sparrow', 'Cedar Waxwing']
bird_facts = ['{"Size" : "Large", "Color": "Golden brown", "Behavior": "Runs swiftly along ground", "Habitat": "Rocky areas"}', '{"Size":"Small", "Color": "Gray-white", "Behavior": "Often in flocks", "Habitat": "Open woodlands"}', '{"Size":"Small", "Color": "Gray-brown", "Behavior": "Catch insects over open water", "Habitat": "Parks"}']

In [93]:

# Define birds reading names and bird_facts lists into names and bird_facts columns 
birds = pd.DataFrame(dict(names=names, bird_facts=bird_facts))

# Print birds
print(birds)

              names                                         bird_facts
0          Killdeer  {"Size" : "Large", "Color": "Golden brown", "B...
1  Chipping Sparrow  {"Size":"Small", "Color": "Gray-white", "Behav...
2     Cedar Waxwing  {"Size":"Small", "Color": "Gray-brown", "Behav...

In [94]:

# Apply the function json.loads function to the bird_facts column
data_split0 = birds['bird_facts'].apply(json.loads)

# Print birds
print(data_split0)

0    {'Size': 'Large', 'Color': 'Golden brown', 'Be...
1    {'Size': 'Small', 'Color': 'Gray-white', 'Beha...
2    {'Size': 'Small', 'Color': 'Gray-brown', 'Beha...
Name: bird_facts, dtype: object

In [95]:

data_split = data_split0.apply(pd.Series)
print(data_split)

    Size         Color                       Behavior         Habitat
0  Large  Golden brown      Runs swiftly along ground     Rocky areas
1  Small    Gray-white                Often in flocks  Open woodlands
2  Small    Gray-brown  Catch insects over open water           Parks

In [96]:

# Remove the bird_facts column from birds
birds = birds.drop(columns='bird_facts')

# Concatenate the columns of birds and data_split
birds = pd.concat([birds, data_split], axis=1)

print(birds)

              names   Size         Color                       Behavior  \
0          Killdeer  Large  Golden brown      Runs swiftly along ground   
1  Chipping Sparrow  Small    Gray-white                Often in flocks   
2     Cedar Waxwing  Small    Gray-brown  Catch insects over open water   

          Habitat  
0     Rocky areas  
1  Open woodlands  
2           Parks

In [100]:

import json
names = ['Killdeer', 'Chipping Sparrow', 'Cedar Waxwing']
bird_facts = ['{"Size" : "Large", "Color": "Golden brown", "Behavior": "Runs swiftly along ground", "Habitat": "Rocky areas"}', '{"Size":"Small", "Color": "Gray-white", "Behavior": "Often in flocks", "Habitat": "Open woodlands"}', '{"Size":"Small", "Color": "Gray-brown", "Behavior": "Catch insects over open water", "Habitat": "Parks"}']

# Define birds reading names and bird_facts lists into names and bird_facts columns 
birds = pd.DataFrame(dict(names=names, bird_facts=bird_facts))

# Print birds_facts
print(birds)
print()

# Apply json.loads to the bird_facts column and transform it to a list
birds_facts = birds['bird_facts'].apply(json.loads).to_list()

# Print birds_facts
print(birds_facts)

              names                                         bird_facts
0          Killdeer  {"Size" : "Large", "Color": "Golden brown", "B...
1  Chipping Sparrow  {"Size":"Small", "Color": "Gray-white", "Behav...
2     Cedar Waxwing  {"Size":"Small", "Color": "Gray-brown", "Behav...

[{'Size': 'Large', 'Color': 'Golden brown', 'Behavior': 'Runs swiftly along ground', 'Habitat': 'Rocky areas'}, {'Size': 'Small', 'Color': 'Gray-white', 'Behavior': 'Often in flocks', 'Habitat': 'Open woodlands'}, {'Size': 'Small', 'Color': 'Gray-brown', 'Behavior': 'Catch insects over open water', 'Habitat': 'Parks'}]

In [99]:

# Convert birds_facts into a JSON 
birds_dump = json.dumps(birds_facts)

# Print birds_dump
print(birds_dump)

[{"Size": "Large", "Color": "Golden brown", "Behavior": "Runs swiftly along ground", "Habitat": "Rocky areas"}, {"Size": "Small", "Color": "Gray-white", "Behavior": "Often in flocks", "Habitat": "Open woodlands"}, {"Size": "Small", "Color": "Gray-brown", "Behavior": "Catch insects over open water", "Habitat": "Parks"}]

In [101]:

# Read the JSON birds_dump into a DataFrame 
birds_df = pd.read_json(birds_dump)

# Print birds_df
print(birds_df)

    Size         Color                       Behavior         Habitat
0  Large  Golden brown      Runs swiftly along ground     Rocky areas
1  Small    Gray-white                Often in flocks  Open woodlands
2  Small    Gray-brown  Catch insects over open water           Parks

In [103]:

# Concatenate the 'names' column of birds with birds_df 
birds_final = pd.concat([birds['names'], birds_df], axis=1)

print(birds_final)

              names   Size         Color                       Behavior  \
0          Killdeer  Large  Golden brown      Runs swiftly along ground   
1  Chipping Sparrow  Small    Gray-white                Often in flocks   
2     Cedar Waxwing  Small    Gray-brown  Catch insects over open water   

          Habitat  
0     Rocky areas  
1  Open woodlands  
2           Parks