You’ll finish by learning how to combine the reshaping process with grouping to produce quick data manipulations. Lastly, you’ll discover how to transform list-like columns and handle complex nested data, such as nested JSON files.
Less fast food, please!¶
Monday again! You will start working on a new project – analyzing the evolution of obesity through the years. You have a dataset called obesity with the percentage of obesity in different countries and years. The data is also disaggregated by biological sex.
Your main goals are to get the mean percentage of obesity by year and sex, and by country and sex. Also, you want to get the difference between years.
You notice that the dataset has multiple indices, so you know you will have to unstack levels to accomplish your goal.
import pandas as pd
filename = 'obesity.csv'
obesity = pd.read_csv(filename, index_col=[0,1,2])
print(obesity)
perc_obesity
country biological_sex year
Argentina Male 2005 21.5
Female 2005 24.2
Male 2015 26.8
Female 2015 28.5
Japan Male 2005 2.5
Female 2005 2.6
Male 2015 4.6
Female 2015 3.6
Norway Male 2005 17.6
Female 2005 18.6
Male 2015 23.0
Female 2015 22.2
# Unstack the first level and calculate the mean of the columns
obesity_general = obesity.unstack(level=0).mean(axis=1)
# Print obesity_general
print(obesity_general)
biological_sex year
Female 2005 15.133333
2015 18.100000
Male 2005 13.866667
2015 18.133333
dtype: float64
# Unstack the second level and calculate the mean of the columns
obesity_mean = obesity.unstack(level=1).mean(axis=1)
# Print obesity_mean
print(obesity_mean)
country year
Argentina 2005 22.85
2015 27.65
Japan 2005 2.55
2015 4.10
Norway 2005 18.10
2015 22.60
dtype: float64
# Unstack the third level and calculate the difference between columns
obesity_variation = obesity.unstack(level=2).diff(periods=1, axis=1)
# Print obesity_variation
print(obesity_variation)
perc_obesity
year 2005 2015
country biological_sex
Argentina Female NaN 4.3
Male NaN 5.3
Japan Female NaN 1.0
Male NaN 2.1
Norway Female NaN 3.6
Male NaN 5.4
filename = 'obesity2.csv'
obesity = pd.read_csv(filename, index_col=[0,1], header=[0,1])
print(obesity)
year 1995 2005 \
perc_obesity variation perc_obesity variation
country biological_sex
France Female 15.3 7.7 18.1 8.2
Male 12.8 7.6 16.9 8.4
Germany Female 14.4 4.6 17.2 5.2
Male 14.4 5.1 18.7 5.9
year 2015
perc_obesity variation
country biological_sex
France Female 20.8 11.3
Male 21.5 11.8
Germany Female 20.1 8.4
Male 23.6 9.8
# Stack obesity, get median of columns and unstack again
median_obesity = obesity.stack().median(axis=1).unstack()
# Print median_obesity
print(median_obesity)
perc_obesity variation
country biological_sex
France Female 18.1 8.2
Male 16.9 8.4
Germany Female 17.2 5.2
Male 18.7 5.9
# Stack the first level, get sum, and unstack the second level
obesity_sum = obesity.stack(level=0).sum(axis=1).unstack(level=1)
# Print obesity_max
print(obesity_sum)
biological_sex Female Male
country year
France 1995 23.0 20.4
2005 26.3 25.3
2015 32.1 33.3
Germany 1995 19.0 19.5
2005 22.4 24.6
2015 28.5 33.4
# Stack country level, group by country and get the mean
obesity_mean = obesity.stack(level=0).groupby(level='country').mean()
# Print obesity_mean
print(obesity_mean)
perc_obesity variation country France 17.566667 9.166667 Germany 18.066667 6.500000
# Stack country level, group by country and get the median
obesity_median = obesity.stack(level=0)
# Print obesity_median
print(obesity_median)
perc_obesity variation
country biological_sex year
France Female 1995 15.3 7.7
2005 18.1 8.2
2015 20.8 11.3
Male 1995 12.8 7.6
2005 16.9 8.4
2015 21.5 11.8
Germany Female 1995 14.4 4.6
2005 17.2 5.2
2015 20.1 8.4
Male 1995 14.4 5.1
2005 18.7 5.9
2015 23.6 9.8
obesity_median = obesity_median.groupby(level='country').median()
# Print obesity_median
print(obesity_median)
perc_obesity variation country France 17.50 8.30 Germany 17.95 5.55
Merge it all¶
Time to keep working with the obesity project! You will analyze the mean obesity percentage in different countries, but this time, the obesity DataFrame has a new column named bounds. It contains the minimum and maximum values you can find in different parts of the same country.
You notice that these values are given in a list, so you decide that you need to transform that column. You would like to have each element in a new row.
filename = 'obesity3.csv'
obesity0 = pd.read_csv(filename, index_col=[0], header=[0])
print(obesity)
country perc_obesity bounds 0 Argentina 21.5 15.4, 31.5 1 Germany 22.3 16.2, 32.4 2 Japan 2.5 1.1, 3.5 3 Norway 23.0 13.1, 33.0
# Explode the values of bounds to a separate row
obesity = obesity0.assign(bounds=obesity0['bounds'].str.split(','))
# Print obesity_bounds
print(obesity)
country perc_obesity bounds 0 Argentina 21.5 [15.4, 31.5] 1 Germany 22.3 [16.2, 32.4] 2 Japan 2.5 [1.1, 3.5] 3 Norway 23.0 [13.1, 33.0]
# Explode the values of bounds to a separate row
obesity_bounds = obesity['bounds'].explode()
# Print obesity_bounds
print(obesity_bounds)
0 15.4 0 31.5 1 16.2 1 32.4 2 1.1 2 3.5 3 13.1 3 33.0 Name: bounds, dtype: object
# Merge obesity_bounds with country and perc_obesity columns of obesity using the indexes
obesity_final = obesity[['country', 'perc_obesity']].merge(obesity_bounds,
right_index=True,
left_index=True)
# Print obesity_final
print(obesity_final)
country perc_obesity bounds 0 Argentina 21.5 15.4 0 Argentina 21.5 31.5 1 Germany 22.3 16.2 1 Germany 22.3 32.4 2 Japan 2.5 1.1 2 Japan 2.5 3.5 3 Norway 23.0 13.1 3 Norway 23.0 33.0
Explode the bounds¶
You were able to transform the list-like column successfully, but you are not satisfied with the steps you had to take. You want to find an easier way to get the same reshaped DataFrame.
You remembered what you learned about exploding list-like columns, and you will apply a new strategy. The same DataFrame obesity contains the country, perc_obesity, and the column bounds with the minimum and maximum values you can find in different parts of the same country.
# Modify obesity_explode by resetting the index
obesity_final.reset_index(drop=True, inplace=True)
# Print obesity_explode
print(obesity_final)
country perc_obesity bounds 0 Argentina 21.5 15.4 1 Argentina 21.5 31.5 2 Germany 22.3 16.2 3 Germany 22.3 32.4 4 Japan 2.5 1.1 5 Japan 2.5 3.5 6 Norway 23.0 13.1 7 Norway 23.0 33.0
filename = 'obesity4.csv'
obesity = pd.read_csv(filename, index_col=[0])
print(obesity)
country perc_obesity bounds 0 France 14.5 11.4-25.5 1 Mexico 25.3 16.2-32.4 2 Spain 12.5 8.1-16.5 3 South Africa 11.3 9.1-20.1
# Split the columns bounds using a hyphen as delimiter
obesity_split = obesity['bounds'].str.split('-')
# Print obesity_split
print(obesity_split)
0 [11.4, 25.5] 1 [16.2, 32.4] 2 [8.1, 16.5] 3 [9.1, 20.1] Name: bounds, dtype: object
# Transform the column bounds in the obesity DataFrame
obesity_split = obesity.assign(bounds=obesity['bounds'].str.split('-')).explode('bounds')
# Print obesity_split
print(obesity_split)
country perc_obesity bounds 0 France 14.5 11.4 0 France 14.5 25.5 1 Mexico 25.3 16.2 1 Mexico 25.3 32.4 2 Spain 12.5 8.1 2 Spain 12.5 16.5 3 South Africa 11.3 9.1 3 South Africa 11.3 20.1
# Import the json_normalize function
from pandas import json_normalize
movies = [{'director': 'Woody Allen', 'producer': 'Letty Aronson', 'features': {'title': 'Magic in the Moonlight', 'year': 2014}}, {'director': 'Niki Caro', 'producer': 'Jason Reed', 'features': {'title': 'Mulan', 'year': 2020}}]
# Normalize movies and separate the new columns with an underscore
movies_norm = json_normalize(movies, sep='_')
print(movies_norm)
print()
# Reshape using director and producer as index, create movies from column starting from features
movies_long = pd.wide_to_long(movies_norm, stubnames='features',
i=['director','producer'], j='movies',
sep='_', suffix='\w+')
# Print movies_long
print(movies_long)
director producer features_title features_year
0 Woody Allen Letty Aronson Magic in the Moonlight 2014
1 Niki Caro Jason Reed Mulan 2020
features
director producer movies
Woody Allen Letty Aronson title Magic in the Moonlight
year 2014
Niki Caro Jason Reed title Mulan
year 2020
# Specify the features column as the list of records
normalize_movies = json_normalize(movies)
# Print normalize_movies
print(normalize_movies)
director producer features.title features.year 0 Woody Allen Letty Aronson Magic in the Moonlight 2014 1 Niki Caro Jason Reed Mulan 2020
# Specify director and producer to use as metadata for each record
# normalize_movies = json_normalize(movies, record_path='features', meta=['director','producer'])
# Print normalize_movies
# print(normalize_movies)
import json
names = ['Killdeer', 'Chipping Sparrow', 'Cedar Waxwing']
bird_facts = ['{"Size" : "Large", "Color": "Golden brown", "Behavior": "Runs swiftly along ground", "Habitat": "Rocky areas"}', '{"Size":"Small", "Color": "Gray-white", "Behavior": "Often in flocks", "Habitat": "Open woodlands"}', '{"Size":"Small", "Color": "Gray-brown", "Behavior": "Catch insects over open water", "Habitat": "Parks"}']
# Define birds reading names and bird_facts lists into names and bird_facts columns
birds = pd.DataFrame(dict(names=names, bird_facts=bird_facts))
# Print birds
print(birds)
names bird_facts
0 Killdeer {"Size" : "Large", "Color": "Golden brown", "B...
1 Chipping Sparrow {"Size":"Small", "Color": "Gray-white", "Behav...
2 Cedar Waxwing {"Size":"Small", "Color": "Gray-brown", "Behav...
# Apply the function json.loads function to the bird_facts column
data_split0 = birds['bird_facts'].apply(json.loads)
# Print birds
print(data_split0)
0 {'Size': 'Large', 'Color': 'Golden brown', 'Be...
1 {'Size': 'Small', 'Color': 'Gray-white', 'Beha...
2 {'Size': 'Small', 'Color': 'Gray-brown', 'Beha...
Name: bird_facts, dtype: object
data_split = data_split0.apply(pd.Series)
print(data_split)
Size Color Behavior Habitat 0 Large Golden brown Runs swiftly along ground Rocky areas 1 Small Gray-white Often in flocks Open woodlands 2 Small Gray-brown Catch insects over open water Parks
# Remove the bird_facts column from birds
birds = birds.drop(columns='bird_facts')
# Concatenate the columns of birds and data_split
birds = pd.concat([birds, data_split], axis=1)
print(birds)
names Size Color Behavior \
0 Killdeer Large Golden brown Runs swiftly along ground
1 Chipping Sparrow Small Gray-white Often in flocks
2 Cedar Waxwing Small Gray-brown Catch insects over open water
Habitat
0 Rocky areas
1 Open woodlands
2 Parks
import json
names = ['Killdeer', 'Chipping Sparrow', 'Cedar Waxwing']
bird_facts = ['{"Size" : "Large", "Color": "Golden brown", "Behavior": "Runs swiftly along ground", "Habitat": "Rocky areas"}', '{"Size":"Small", "Color": "Gray-white", "Behavior": "Often in flocks", "Habitat": "Open woodlands"}', '{"Size":"Small", "Color": "Gray-brown", "Behavior": "Catch insects over open water", "Habitat": "Parks"}']
# Define birds reading names and bird_facts lists into names and bird_facts columns
birds = pd.DataFrame(dict(names=names, bird_facts=bird_facts))
# Print birds_facts
print(birds)
print()
# Apply json.loads to the bird_facts column and transform it to a list
birds_facts = birds['bird_facts'].apply(json.loads).to_list()
# Print birds_facts
print(birds_facts)
names bird_facts
0 Killdeer {"Size" : "Large", "Color": "Golden brown", "B...
1 Chipping Sparrow {"Size":"Small", "Color": "Gray-white", "Behav...
2 Cedar Waxwing {"Size":"Small", "Color": "Gray-brown", "Behav...
[{'Size': 'Large', 'Color': 'Golden brown', 'Behavior': 'Runs swiftly along ground', 'Habitat': 'Rocky areas'}, {'Size': 'Small', 'Color': 'Gray-white', 'Behavior': 'Often in flocks', 'Habitat': 'Open woodlands'}, {'Size': 'Small', 'Color': 'Gray-brown', 'Behavior': 'Catch insects over open water', 'Habitat': 'Parks'}]
# Convert birds_facts into a JSON
birds_dump = json.dumps(birds_facts)
# Print birds_dump
print(birds_dump)
[{"Size": "Large", "Color": "Golden brown", "Behavior": "Runs swiftly along ground", "Habitat": "Rocky areas"}, {"Size": "Small", "Color": "Gray-white", "Behavior": "Often in flocks", "Habitat": "Open woodlands"}, {"Size": "Small", "Color": "Gray-brown", "Behavior": "Catch insects over open water", "Habitat": "Parks"}]
# Read the JSON birds_dump into a DataFrame
birds_df = pd.read_json(birds_dump)
# Print birds_df
print(birds_df)
Size Color Behavior Habitat 0 Large Golden brown Runs swiftly along ground Rocky areas 1 Small Gray-white Often in flocks Open woodlands 2 Small Gray-brown Catch insects over open water Parks
# Concatenate the 'names' column of birds with birds_df
birds_final = pd.concat([birds['names'], birds_df], axis=1)
print(birds_final)
names Size Color Behavior \
0 Killdeer Large Golden brown Runs swiftly along ground
1 Chipping Sparrow Small Gray-white Often in flocks
2 Cedar Waxwing Small Gray-brown Catch insects over open water
Habitat
0 Rocky areas
1 Open woodlands
2 Parks