Average Baby Weight from CDC National Survey of Family Growth, 2013-2015

Read, clean, and validate

The first step of almost any data project is to read the data, check for errors and special cases, and prepare the data for analysis. That is what you’ll do in this chapter, working with a dataset from the National Survey of Family Growth (NSFG).

Read the codebook

When you work with datasets like the NSFG, it is important to read the documentation carefully. If you interpret a variable incorrectly, you can generate nonsense results and never realize it. So before you start coding, you’ll need to get familiar with the NSFG codebook, which describes every variable.

Codebooks can be found here: https://www.cdc.gov/nchs/nsfg/nsfg_questionnaires.htm

We are using the 2013-2015 Female Pregnancy Data; the codebook for this file can be found here:
https://www.cdc.gov/nchs/data/nsfg/2013-2015_NSFG_FemPregFile_Codebook-508.pdf

In [139]:

# import
import pandas as pd
import numpy as np

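# Data found at https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NSFG/
# The file is fixed-width; the slicing below relies on each record loading as one string in column 0
nsfg2013_2015 = pd.read_csv('2013_2015_FemPregData.dat', header=None)

# Create the working DataFrame, then slice each field by its codebook position
# (codebook positions are 1-based and inclusive; Python slices are 0-based and end-exclusive)
nsfg = pd.DataFrame()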
nsfg['caseid'] = nsfg2013_2015[0].str[0:5].str.strip().replace('',np.nan) #CASEID(1-5)
nsfg['outcome'] = nsfg2013_2015[0].str[310:311].str.strip().replace('',np.nan) #OUTCOME(311-311)
nsfg['birthwgt_lb1'] = nsfg2013_2015[0].str[45:47].str.strip().replace('',np.nan).astype(float) #BIRTHWGT_LB1(46-47)
nsfg['birthwgt_oz1'] = nsfg2013_2015[0].str[47:49].str.strip().replace('',np.nan).astype(float) #BIRTHWGT_OZ1(48-49)
nsfg['prglngth'] = nsfg2013_2015[0].str[308:310].str.strip().replace('',np.nan).astype(float) #PRGLNGTH(309-310)
nsfg['nbrnaliv'] = nsfg2013_2015[0].str[15:16].str.strip().replace('',np.nan) #NBRNALIV(16-16)
nsfg['agecon'] = nsfg2013_2015[0].str[325:329].str.strip().replace('',np.nan) #AGECON(326-329)
nsfg['agepreg'] = nsfg2013_2015[0].str[317:321].str.strip().replace('',np.nan) #AGEPREG(318-321)
nsfg['hpagelb'] = nsfg2013_2015[0].str[74:76].str.strip().replace('',np.nan) #HPAGELB(75-76)
nsfg['wgt2013_2015'] = nsfg2013_2015[0].str[430:446].str.strip().replace('',np.nan).astype(float) #WGT2013_2015(431-446)

# Preview the first rows
print(nsfg.head())

  caseid outcome  birthwgt_lb1  birthwgt_oz1  prglngth nbrnaliv agecon  \
0  60418       1           5.0           4.0      40.0        1   2000   
1  60418       1           4.0          12.0      36.0        1   2291   
2  60418       1           5.0           4.0      36.0        1   3241   
3  60419       6           NaN           NaN      33.0      NaN   3650   
4  60420       1           8.0          13.0      41.0        1   2191   

  agepreg hpagelb  wgt2013_2015  
0    2075      22   3554.964843  
1    2358      25   3554.964843  
2    3308      52   3554.964843  
3     NaN     NaN   2484.535358  
4    2266      24   2903.782914
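The slices in the cell above translate the codebook's 1-based, inclusive positions into Python's 0-based, end-exclusive slices by hand. As a hedged alternative, the same translation can be sketched with pandas.read_fwf, which reads fixed-width files directly. The field names, positions, and records below are made up for illustration and are not taken from the NSFG codebook.

```python
import io

import pandas as pd

# Hypothetical 1-based, inclusive field positions, written the way a codebook lists them
fields = {'caseid': (1, 5), 'nbrnaliv': (6, 6), 'lb': (7, 8)}

# Convert to the 0-based, end-exclusive (start, stop) pairs that read_fwf expects
colspecs = [(start - 1, stop) for start, stop in fields.values()]

# Two made-up records; blank fields stand for missing values
raw = "604181 5\n60419   \n"

demo = pd.read_fwf(io.StringIO(raw), colspecs=colspecs,
                   names=list(fields), header=None)
print(demo)
```

Blank fields come back as NaN, so no separate strip-and-replace step is needed in this variant.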

In [129]:

# Display the number of rows and columns
print(nsfg.shape)
print()

# Display the names of the columns
print(nsfg.columns)
print()

(9358, 10)

Index(['caseid', 'outcome', 'birthwgt_lb1', 'birthwgt_oz1', 'prglngth',
       'nbrnaliv', 'agecon', 'agepreg', 'hpagelb', 'wgt2013_2015'],
      dtype='object')

In [130]:

# Select column birthwgt_oz1: ounces
ounces = nsfg['birthwgt_oz1']

# Print the first 5 elements of ounces
print(ounces.head())

0     4.0
1    12.0
2     4.0
3     NaN
4    13.0
Name: birthwgt_oz1, dtype: float64

Clean a variable

In the NSFG dataset, the variable 'nbrnaliv' records the number of babies born alive at the end of a pregnancy.

If you use .value_counts() to view the responses, you’ll see that the value 8 appears once. According to the codebook, this value indicates that the respondent refused to answer the question.

In [131]:

# Print the values and their frequencies
print(nsfg['nbrnaliv'].value_counts())

1    6379
2     100
3       5
8       1
Name: nbrnaliv, dtype: int64

In [132]:

# Replace the value 8 (refused) with NaN; the column holds strings, so match '8'
nsfg['nbrnaliv'] = nsfg['nbrnaliv'].replace('8', np.nan)

# Print the values and their frequencies
# Keep only the valid codes 1-3 so blank and refused entries are left out of this assessment
print(nsfg['nbrnaliv'][nsfg['nbrnaliv'].isin(['1', '2', '3'])].value_counts())

1    6379
2     100
3       5
Name: nbrnaliv, dtype: int64
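To check that a special code really has been turned into a missing value, it can help to count NaN explicitly. The sketch below uses a small made-up Series in place of the real column and assumes, as in this notebook, that the codes are stored as strings.

```python
import numpy as np
import pandas as pd

# Made-up responses: '8' is the "refused" code, NaN is a blank field
s = pd.Series(['1', '1', '2', '8', np.nan], name='nbrnaliv')

# Assign the result back rather than relying on inplace=True
cleaned = s.replace('8', np.nan)

# dropna=False keeps NaN in the tally, so the recoded value stays visible
print(cleaned.value_counts(dropna=False))
```

By default .value_counts() drops NaN, which is why the refused response simply disappears from the output above.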

The missingno Library

Missingno is an easy-to-use Python library that provides a set of visualisations for understanding the presence and distribution of missing data in a pandas DataFrame. The visualisations come as a bar plot, a matrix plot, a heatmap, or a dendrogram.

From these plots, we can identify where missing values occur, how extensive the missingness is, and whether missing values in different columns are correlated with each other. Missing values are often seen as contributing no information, but analysed closely, they may reveal an underlying story.

In [133]:

import missingno as msno

print(nsfg.isna().sum())

caseid             0
outcome            0
birthwgt_lb1    2873
birthwgt_oz1    2967
prglngth           0
nbrnaliv        2874
agecon             0
agepreg          249
hpagelb         2873
wgt2013_2015       0
dtype: int64
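In the counts above, birthwgt_lb1, nbrnaliv, and hpagelb are missing in almost the same number of rows, which suggests they tend to be missing together. One way to check that kind of pattern numerically is to correlate the missingness indicators. The following is a minimal sketch in plain pandas on a made-up frame, not the NSFG data.

```python
import numpy as np
import pandas as pd

# Made-up frame: the first two columns are always missing in the same rows
toy = pd.DataFrame({
    'birthwgt_lb1': [5.0, np.nan, 8.0, np.nan],
    'hpagelb': ['22', np.nan, '24', np.nan],
    'agepreg': ['2075', '2358', np.nan, '3308'],
})

# Share of missing values per column
missing_share = toy.isna().mean()

# Correlation between missingness indicators (1 = missing, 0 = present)
nullity_corr = toy.isna().astype(int).corr()

print(missing_share)
print(nullity_corr)
```

A correlation of 1 between two columns means they are missing in exactly the same rows.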

In [134]:

msno.matrix(nsfg)

Out[134]:

<AxesSubplot:>

Compute a variable

For each pregnancy in the NSFG dataset, the variable 'agecon' encodes the respondent’s age at conception, and 'agepreg' the respondent’s age at the end of the pregnancy.

Both variables are recorded as integers with two implicit decimal places, so the value 2575 means that the respondent’s age was 25.75.
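As a small worked example of that encoding, the sketch below converts a few made-up codes to ages in years. It assumes the codes arrive as strings, as they do after the fixed-width slicing earlier in this notebook.

```python
import pandas as pd

# Made-up codes with two implicit decimal places; None stands for a blank field
raw_age = pd.Series(['2575', '2000', None], name='agecon')

# Convert the strings to numbers, then shift the decimal point two places
age_years = pd.to_numeric(raw_age) / 100

print(age_years)  # 2575 becomes 25.75; the blank field becomes NaN
```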

In [135]:

# Subset the dataset to rows where agecon and agepreg are both present
nsfg_nomsno_agepreg = nsfg[['agecon','agepreg']].dropna()

msno.matrix(nsfg_nomsno_agepreg)

Out[135]:

<AxesSubplot:>

In [138]:

# Select the columns, convert the string codes to numbers, and divide by 100
agecon = nsfg_nomsno_agepreg['agecon'].astype(float)/100
agepreg = nsfg_nomsno_agepreg['agepreg'].astype(float)/100

# Compute the difference (pregnancy length in years)
preg_length = agepreg - agecon

# Compute summary statistics
print(preg_length.describe())

count    9109.000000
mean        0.552069
std         0.271479
min         0.000000
25%         0.250000
50%         0.670000
75%         0.750000
max         0.920000
dtype: float64

Make a histogram

Histograms are one of the most useful tools in exploratory data analysis. They quickly give you an overview of the distribution of a variable, that is, what values the variable can have, and how many times each value appears.

As we saw earlier, the NSFG dataset includes a variable 'agecon' that records age at conception for each pregnancy. Here, you’re going to plot a histogram of this variable, using the bins and histtype parameters.

In [111]:

# import
import matplotlib.pyplot as plt

# Plot the histogram
plt.hist(agecon, bins=20, width=1.6)

# Label the axes
plt.xlabel('Age at conception')
plt.ylabel('Number of pregnancies')

# Show the figure
plt.show()

In [113]:

# Adapt code to make an unfilled histogram by setting the parameter histtype to be 'step'

# Plot the histogram
plt.hist(agecon, bins=20, histtype = 'step')

# Label the axes
plt.xlabel('Age at conception')
plt.ylabel('Number of pregnancies')

# Show the figure
plt.show()
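To see what bins=20 does to the data, it can help to compute the counts without drawing anything. The sketch below uses numpy.histogram on randomly generated ages. The data are made up and only illustrate the binning.

```python
import numpy as np

# Made-up ages, standing in for the agecon values
rng = np.random.default_rng(0)
ages = rng.normal(loc=25, scale=5, size=1000)

# Split the observed range into 20 equal-width bins and count the values in each
counts, edges = np.histogram(ages, bins=20)

print(counts)  # one count per bin
print(edges)   # 21 bin edges, from the minimum to the maximum age
```

Every value falls into exactly one bin, so the counts add up to the number of observations.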

Compute birth weight

Now let’s pull together the steps in this chapter to compute the average birth weight for full-term babies.

I’ve provided a function, resample_rows_weighted, that takes the NSFG data and resamples it using the sampling weights in wgt2013_2015. The result is a sample that is representative of the U.S. population.

Then I extract birthwgt_lb1 and birthwgt_oz1, replace special codes with NaN, and compute total birth weight in pounds, birth_weight.

# Resample the data
nsfg = resample_rows_weighted(nsfg, 'wgt2013_2015')

# Clean the weight variables
pounds = nsfg['birthwgt_lb1'].replace([98, 99], np.nan)
ounces = nsfg['birthwgt_oz1'].replace([98, 99], np.nan)

# Compute total birth weight
birth_weight = pounds + ounces/16
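The function resample_rows_weighted is not defined anywhere in this notebook. The following is a minimal sketch of what such a helper could look like, assuming it draws rows with replacement with probability proportional to the sampling weight. The seed parameter and the toy data are additions for illustration, and the original helper may differ.

```python
import pandas as pd


def resample_rows_weighted(df, column, seed=None):
    """Draw len(df) rows with replacement, with probability proportional to df[column].

    Hypothetical stand-in for the helper named in the text.
    """
    return df.sample(n=len(df), replace=True, weights=df[column], random_state=seed)


# Made-up data: the third row carries almost all of the weight
toy = pd.DataFrame({
    'birthwgt_lb1': [5.0, 8.0, 7.0],
    'wgt2013_2015': [1.0, 1.0, 98.0],
})

sample = resample_rows_weighted(toy, 'wgt2013_2015', seed=0)
print(sample)
```

Rows with large weights tend to appear several times in the result, and rows with small weights may not appear at all.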

In [146]:

# Keep only rows with no missing values in the columns used below
nsfg_resample = nsfg[['wgt2013_2015','birthwgt_lb1','birthwgt_oz1', 'prglngth', 'nbrnaliv']].dropna()

# Keep only full-term pregnancies (37 weeks or more)
nsfg_fullterm_resample = nsfg_resample[nsfg_resample['prglngth'] >= 37]

print(nsfg_fullterm_resample.head())

    wgt2013_2015  birthwgt_lb1  birthwgt_oz1  prglngth nbrnaliv
0    3554.964843           5.0           4.0      40.0        1
4    2903.782914           8.0          13.0      41.0        1
9    9682.211381           8.0          10.0      39.0        1
14   2588.500365           6.0           8.0      39.0        1
15   2588.500365           5.0           8.0      37.0        1

In [154]:

# Clean the weight variables
pounds = nsfg_fullterm_resample['birthwgt_lb1'].replace([98, 99], np.nan)
ounces = nsfg_fullterm_resample['birthwgt_oz1'].replace([98, 99], np.nan)

# Compute total birth weight
birth_weight = pounds + ounces/16

# Create a Boolean Series for full-term babies
# (every row already satisfies this after the subset above; kept to mirror the exercise)
full_term = nsfg_fullterm_resample['prglngth'] >= 37

# Select the weights of full-term babies
full_term_weight = birth_weight[full_term]

# Compute the mean weight of full-term babies
print('Full-term mean:\n', full_term_weight.mean())

Full-term mean:
 7.372323879231473
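As a worked example of the arithmetic, a baby recorded as 5 pounds 4 ounces weighs 5 + 4/16 = 5.25 pounds. The sketch below applies the same steps to a few made-up records, including the special codes 98 and 99.

```python
import numpy as np
import pandas as pd

# Made-up records; 98 and 99 are the special codes replaced in the cell above
pounds = pd.Series([5.0, 8.0, 99.0, 7.0])
ounces = pd.Series([4.0, 13.0, 2.0, 98.0])

# Special codes become NaN so they cannot be mistaken for real weights
pounds = pounds.replace([98, 99], np.nan)
ounces = ounces.replace([98, 99], np.nan)

# 16 ounces to the pound; a NaN in either part makes the total NaN
birth_weight = pounds + ounces / 16

print(birth_weight)
print(birth_weight.mean())  # NaN rows are skipped: (5.25 + 8.8125) / 2 = 7.03125
```

Without the replacement step, a coded 99 would be counted as a 99-pound baby and pull the mean upward.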

Filter

In the previous exercise, we computed the mean birth weight for full-term babies; we filtered out preterm babies because their distribution of weight is different. The distribution of weight is also different for multiple births, like twins and triplets. In this exercise, we’ll filter them out too and see what effect it has on the mean.

In [156]:

# Filter single births
single = nsfg_fullterm_resample['nbrnaliv'] == '1'

# Select the birth weights of single full-term babies
single_full_term_weight = birth_weight[full_term & single]

# Select the birth weights of multiple full-term babies
mult_full_term_weight = birth_weight[full_term & ~single]

# Print Averages
print('Single full-term mean:', single_full_term_weight.mean())
print('Multiple full-term mean:', mult_full_term_weight.mean())

Single full-term mean: 7.385643450184502
Multiple full-term mean: 5.768055555555556
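The filtering above combines Boolean Series with & (and) and ~ (not). A minimal sketch of the same pattern on made-up data follows. As in this notebook, nbrnaliv is assumed to be stored as strings with missing values already dropped, so ~single selects multiple births only.

```python
import pandas as pd

# Made-up pregnancies; the second row is preterm, the third is a multiple birth
toy = pd.DataFrame({
    'prglngth': [40, 35, 39, 38],
    'nbrnaliv': ['1', '1', '2', '1'],
    'birth_weight': [7.5, 5.0, 6.0, 8.0],
})

full_term = toy['prglngth'] >= 37
single = toy['nbrnaliv'] == '1'

# & keeps rows where both conditions hold; ~ inverts a condition
single_mean = toy.loc[full_term & single, 'birth_weight'].mean()
multiple_mean = toy.loc[full_term & ~single, 'birth_weight'].mean()

print(single_mean)    # (7.5 + 8.0) / 2 = 7.75
print(multiple_mean)  # 6.0
```

If missing values were still present in nbrnaliv, ~single would include them as well, so the order of the cleaning steps matters.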