Chapter 4 - Numeric data wrangling#

2023 April 7

For text preprocessing, see Chapter 7 “English text preprocessing basics”

Open In Colab

wrangle

The pandas library#

Import the pandas library with the alias pd. This can be a useful shortcut to call its methods!

Use “dot notation” to apply its methods to the gapminder dataset, which is stored in a tabular .csv file.

To import the .csv file, use the .read_csv() pandas method. The only argument for now is the file path to a .csv file.

Learn more about the Gapminder data

import pandas as pd
# import the gapminder dataset
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/spring2023/data/gapminder-FiveYearData.csv
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
print(type(gap))
<class 'pandas.core.frame.DataFrame'>

Pandas methods#

Just a small handful of pandas methods will help you accomplish several key data wrangling tasks. How might the wrangling process look?

  1. First, look at the data

  2. Compute summary statistics

  3. Subset rows

    • first row

    • first three rows

    • rows 10 thru 14

  4. Subset columns:

    • one column

    • multiple columns

  5. Row and column subset

  6. Subset by logical condition(s)

First, look at the data#

print or call the dataset to view its output. Use pandas methods to learn about the data!

print(gap)
          country  year         pop continent  lifeExp   gdpPercap
0     Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1     Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2     Afghanistan  1962  10267083.0      Asia   31.997  853.100710
3     Afghanistan  1967  11537966.0      Asia   34.020  836.197138
4     Afghanistan  1972  13079460.0      Asia   36.088  739.981106
...           ...   ...         ...       ...      ...         ...
1699     Zimbabwe  1987   9216418.0    Africa   62.351  706.157306
1700     Zimbabwe  1992  10704340.0    Africa   60.377  693.420786
1701     Zimbabwe  1997  11404948.0    Africa   46.809  792.449960
1702     Zimbabwe  2002  11926563.0    Africa   39.989  672.038623
1703     Zimbabwe  2007  12311143.0    Africa   43.487  469.709298

[1704 rows x 6 columns]
gap
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314
1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030
2 Afghanistan 1962 10267083.0 Asia 31.997 853.100710
3 Afghanistan 1967 11537966.0 Asia 34.020 836.197138
4 Afghanistan 1972 13079460.0 Asia 36.088 739.981106
... ... ... ... ... ... ...
1699 Zimbabwe 1987 9216418.0 Africa 62.351 706.157306
1700 Zimbabwe 1992 10704340.0 Africa 60.377 693.420786
1701 Zimbabwe 1997 11404948.0 Africa 46.809 792.449960
1702 Zimbabwe 2002 11926563.0 Africa 39.989 672.038623
1703 Zimbabwe 2007 12311143.0 Africa 43.487 469.709298

1704 rows × 6 columns

.columns#

# .columns shows the column names
# this is an attribute instead of a method - note the lack of parentheses ()
gap.columns
Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')

.shape#

# .shape shows the number of rows by columns
gap.shape
(1704, 6)

.info()#

# .info() shows the compact structure of the dataset, including:
# class, dimensions, column names, number of missing values, and types
gap.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   year       1704 non-null   int64  
 2   pop        1704 non-null   float64
 3   continent  1704 non-null   object 
 4   lifeExp    1704 non-null   float64
 5   gdpPercap  1704 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 80.0+ KB

Summary statistics#

Produce summary statistics, including:

  • Count, mean, sd, quartiles, min/max

  • Tabulate frequencies

.describe()#

# produce summary statistics for numeric data
gap.describe()
year pop lifeExp gdpPercap
count 1704.00000 1.704000e+03 1704.000000 1704.000000
mean 1979.50000 2.960121e+07 59.474439 7215.327081
std 17.26533 1.061579e+08 12.917107 9857.454543
min 1952.00000 6.001100e+04 23.599000 241.165876
25% 1965.75000 2.793664e+06 48.198000 1202.060309
50% 1979.50000 7.023596e+06 60.712500 3531.846988
75% 1993.25000 1.958522e+07 70.845500 9325.462346
max 2007.00000 1.318683e+09 82.603000 113523.132900

.mean() and .std()#

# calculate mean and standard deviation of lifeExp
lifeExp_mean = gap["lifeExp"].mean()
lifeExp_sd = gap["lifeExp"].std()
print("Life expectancy mean:", lifeExp_mean)
print("Life expectancy sd:", lifeExp_sd)
Life expectancy mean: 59.47443936619714
Life expectancy sd: 12.917107415241187

.groupby() and .count()#

These two methods are useful for tabulating frequencies by a grouping variable!

import pandas as pd
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
pd.set_option('display.float_format', lambda x: '%.6f' % x)
# count the number of observations grouped by each continent
gap.groupby("continent").mean()["pop"]
continent
Africa      9916003.142628
Americas   24504794.996667
Asia       77038721.974747
Europe     17169764.733333
Oceania     8874672.333333
Name: pop, dtype: float64

Subset rows or columns#

Sampling data is necessary for many reasons, including quick sanity checks.

Slice a data frame by using bracket notation to specify start and end points [start : end]

The [start index is included and the end] index is excluded.

Remember that Python is a zero-indexed language, so starts counting from zero, not one.

Leave the start or end values blank to start from the beginning, or go to the end of a collection.

Row subset: slice just the first row#

gap[:1]
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.000000 Asia 28.801000 779.445314

Row subset: slice first three rows#

gap[:3]
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.000000 Asia 28.801000 779.445314
1 Afghanistan 1957 9240934.000000 Asia 30.332000 820.853030
2 Afghanistan 1962 10267083.000000 Asia 31.997000 853.100710

Row subset: slice rows 10 thru 14#

subset1 = gap[10:15]
subset1
country year pop continent lifeExp gdpPercap
10 Afghanistan 2002 25268405.000000 Asia 42.129000 726.734055
11 Afghanistan 2007 31889923.000000 Asia 43.828000 974.580338
12 Albania 1952 1282697.000000 Europe 55.230000 1601.056136
13 Albania 1957 1476505.000000 Europe 59.280000 1942.284244
14 Albania 1962 1728137.000000 Europe 64.820000 2312.888958

Column subset: one column#

# type the column name as a string in square brackets
gap['lifeExp']
0      28.801000
1      30.332000
2      31.997000
3      34.020000
4      36.088000
          ...   
1699   62.351000
1700   60.377000
1701   46.809000
1702   39.989000
1703   43.487000
Name: lifeExp, Length: 1704, dtype: float64

Column subset: multiple columns#

# note the double sets of brackets
subset2 = gap[['continent', 'lifeExp', 'gdpPercap']]
subset2
continent lifeExp gdpPercap
0 Asia 28.801000 779.445314
1 Asia 30.332000 820.853030
2 Asia 31.997000 853.100710
3 Asia 34.020000 836.197138
4 Asia 36.088000 739.981106
... ... ... ...
1699 Africa 62.351000 706.157306
1700 Africa 60.377000 693.420786
1701 Africa 46.809000 792.449960
1702 Africa 39.989000 672.038623
1703 Africa 43.487000 469.709298

1704 rows × 3 columns

Row and column subset#

# subset more than one column and rows 855 thru 858
subset3 = gap[['continent', 'lifeExp', 'gdpPercap']][855:859]
subset3
continent lifeExp gdpPercap
855 Asia 64.624000 80894.883260
856 Asia 67.712000 109347.867000
857 Asia 69.343000 59265.477140
858 Asia 71.309000 31354.035730
# A column in a pandas data frame Pandas "Series" can be thought of like numpy arrays
# But, beware, they do not function exactly the same!
type(gap["lifeExp"])
pandas.core.series.Series

Subset by logical condition(s)#

# lifeExp is greater than 81
le2 = gap[gap['lifeExp'] > 81]
le2
country year pop continent lifeExp gdpPercap
71 Australia 2007 20434176.000000 Oceania 81.235000 34435.367440
670 Hong Kong China 2002 6762476.000000 Asia 81.495000 30209.015160
671 Hong Kong China 2007 6980412.000000 Asia 82.208000 39724.978670
695 Iceland 2007 301931.000000 Europe 81.757000 36180.789190
802 Japan 2002 127065841.000000 Asia 82.000000 28604.591900
803 Japan 2007 127467972.000000 Asia 82.603000 31656.068060
1487 Switzerland 2007 7554661.000000 Europe 81.701000 37506.419070

logical AND (&)#

All conditions must be satisfied to be included in the subset

# create subset that includes life expectancy greater than 81 AND pop < 500,000.
year2002 = gap[(gap["lifeExp"] > 81) & (gap["pop"] < 500000)]
year2002
country year pop continent lifeExp gdpPercap
695 Iceland 2007 301931.000000 Europe 81.757000 36180.789190

logical OR (|)#

Just one of multiple conditions must be satisfied to be included in the subset

# create a subset that includes country equals Ireland OR life expectancy greater than 82. 
ireland82 = gap[(gap["country"] == "Ireland") | (gap["lifeExp"] > 82)]
ireland82
country year pop continent lifeExp gdpPercap
671 Hong Kong China 2007 6980412.000000 Asia 82.208000 39724.978670
744 Ireland 1952 2952156.000000 Europe 66.910000 5210.280328
745 Ireland 1957 2878220.000000 Europe 68.900000 5599.077872
746 Ireland 1962 2830000.000000 Europe 70.290000 6631.597314
747 Ireland 1967 2900100.000000 Europe 71.080000 7655.568963
748 Ireland 1972 3024400.000000 Europe 71.280000 9530.772896
749 Ireland 1977 3271900.000000 Europe 72.030000 11150.981130
750 Ireland 1982 3480000.000000 Europe 73.100000 12618.321410
751 Ireland 1987 3539900.000000 Europe 74.360000 13872.866520
752 Ireland 1992 3557761.000000 Europe 75.467000 17558.815550
753 Ireland 1997 3667233.000000 Europe 76.122000 24521.947130
754 Ireland 2002 3879155.000000 Europe 77.783000 34077.049390
755 Ireland 2007 4109086.000000 Europe 78.885000 40675.996350
803 Japan 2007 127467972.000000 Asia 82.603000 31656.068060

Exercises#

  1. Load the file “gapminder-FiveYearData.csv” and save it in a variable named gap

  2. Print the column names

  3. Compute the mean for one numeric column

  4. Compute the mean for all numeric columns

  5. Tabulate frequencies for the “continent” column

  6. Compute mean lifeExp and dgpPercap by continent

  7. Create a subset of gap that contains only countries with lifeExp greater than 75 and gdpPercap less than 5000.

# 1.
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
gap
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.000000 Asia 28.801000 779.445314
1 Afghanistan 1957 9240934.000000 Asia 30.332000 820.853030
2 Afghanistan 1962 10267083.000000 Asia 31.997000 853.100710
3 Afghanistan 1967 11537966.000000 Asia 34.020000 836.197138
4 Afghanistan 1972 13079460.000000 Asia 36.088000 739.981106
... ... ... ... ... ... ...
1699 Zimbabwe 1987 9216418.000000 Africa 62.351000 706.157306
1700 Zimbabwe 1992 10704340.000000 Africa 60.377000 693.420786
1701 Zimbabwe 1997 11404948.000000 Africa 46.809000 792.449960
1702 Zimbabwe 2002 11926563.000000 Africa 39.989000 672.038623
1703 Zimbabwe 2007 12311143.000000 Africa 43.487000 469.709298

1704 rows × 6 columns

# 2. 
gap.columnsns
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[26], line 2
      1 # 2. 
----> 2 gap.columnsns

File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'columnsns'
# 3. compute mean for one numeric column
gap["pop"].mean()
29601212.32511736
gap.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   year       1704 non-null   int64  
 2   pop        1704 non-null   float64
 3   continent  1704 non-null   object 
 4   lifeExp    1704 non-null   float64
 5   gdpPercap  1704 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 80.0+ KB
# 4. compute mean for ALL numeric columns
# gap.describe()
gap.mean()
/var/folders/9g/fhnd1v790cj5ccxlv4rcvsy40000gq/T/ipykernel_60936/4035141191.py:3: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  gap.mean()
year            1979.500000
pop         29601212.325117
lifeExp           59.474439
gdpPercap       7215.327081
dtype: float64
sum(gap.groupby("continent").count()["country"] / 12)
142.0
gap.shape
(1704, 6)
142 * 12
1704
# 6.
gap.groupby("continent")[["lifeExp", "gdpPercap"]].mean()
lifeExp gdpPercap
continent
Africa 48.865330 2193.754578
Americas 64.658737 7136.110356
Asia 60.064903 7902.150428
Europe 71.903686 14469.475533
Oceania 74.326208 18621.609223
# Create a subset of gap that contains only countries with lifeExp greater 
# than 75 and gdpPercap less than 5000.
healthy_poor = gap[(gap["lifeExp"] > 75) & (gap["gdpPercap"] < 5000)]
healthy_poor
country year pop continent lifeExp gdpPercap
22 Albania 2002 3508512.000000 Europe 75.651000 4604.211737
gap.iloc[1501]
country              Taiwan
year                   1957
pop         10164215.000000
continent              Asia
lifeExp           62.400000
gdpPercap       1507.861290
Name: 1501, dtype: object
gap.columns
Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')
gap[gap.columns[[2, 3]]]
pop continent
0 8425333.000000 Asia
1 9240934.000000 Asia
2 10267083.000000 Asia
3 11537966.000000 Asia
4 13079460.000000 Asia
... ... ...
1699 9216418.000000 Africa
1700 10704340.000000 Africa
1701 11404948.000000 Africa
1702 11926563.000000 Africa
1703 12311143.000000 Africa

1704 rows × 2 columns

gap.ix[2]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[45], line 1
----> 1 gap.ix[2]

File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'ix'

Data visualization#

After importing, exploring, and subsetting your data, visualization is a common technique to perform next. Read Chapter 5 “Data visualization essentials” to get started.