Chapter 4 - Numeric data wrangling#

2022 August 25

For text preprocessing, see Chapter 7 “English text preprocessing basics”

Open In Colab

wrangle

The pandas library#

Import the pandas library with the alias pd. This is just a shortcut to call its methods!

Use “dot notation” to apply its methods to the gapminder dataset, which is stored in a tabular .csv file.

To import the .csv file, use the .read_csv() pandas method. The only argument for now is the file path to a .csv file.

Learn more about the Gapminder data

import pandas as pd
# import the gapminder dataset
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/gapminder-FiveYearData.csv
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
print(type(gap))
<class 'pandas.core.frame.DataFrame'>

Pandas methods#

Just a small handful of pandas methods will help you accomplish several key data wrangling tasks. How might the wrangling process look?

  1. First, look at the data

  2. Compute summary statistics

  3. Subset rows

    • first row

    • first three rows

    • rows 10 thru 14

  4. Subset columns:

    • one column

    • multiple columns

  5. Row and column subset

  6. Subset by logical condition(s)

First, look at the data#

print or call the dataset to view its output. Use pandas methods to learn about the data!

print(gap)
          country  year         pop continent  lifeExp   gdpPercap
0     Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1     Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2     Afghanistan  1962  10267083.0      Asia   31.997  853.100710
3     Afghanistan  1967  11537966.0      Asia   34.020  836.197138
4     Afghanistan  1972  13079460.0      Asia   36.088  739.981106
...           ...   ...         ...       ...      ...         ...
1699     Zimbabwe  1987   9216418.0    Africa   62.351  706.157306
1700     Zimbabwe  1992  10704340.0    Africa   60.377  693.420786
1701     Zimbabwe  1997  11404948.0    Africa   46.809  792.449960
1702     Zimbabwe  2002  11926563.0    Africa   39.989  672.038623
1703     Zimbabwe  2007  12311143.0    Africa   43.487  469.709298

[1704 rows x 6 columns]
gap
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314
1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030
2 Afghanistan 1962 10267083.0 Asia 31.997 853.100710
3 Afghanistan 1967 11537966.0 Asia 34.020 836.197138
4 Afghanistan 1972 13079460.0 Asia 36.088 739.981106
... ... ... ... ... ... ...
1699 Zimbabwe 1987 9216418.0 Africa 62.351 706.157306
1700 Zimbabwe 1992 10704340.0 Africa 60.377 693.420786
1701 Zimbabwe 1997 11404948.0 Africa 46.809 792.449960
1702 Zimbabwe 2002 11926563.0 Africa 39.989 672.038623
1703 Zimbabwe 2007 12311143.0 Africa 43.487 469.709298

1704 rows × 6 columns

.columns#

# .columns shows the column names
# this is an attribute instead of a method - note the lack of parentheses ()
gap.columns
Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')

.shape#

# .shape shows the number of rows by columns
gap.shape
(1704, 6)

Summary statistics#

Produce summary statistics, including:

  • Count, mean, sd, quartiles, min/max

  • Tabulate frequencies

.describe()#

# produce summary statistics for numeric data
gap.describe()
year pop lifeExp gdpPercap
count 1704.00000 1.704000e+03 1704.000000 1704.000000
mean 1979.50000 2.960121e+07 59.474439 7215.327081
std 17.26533 1.061579e+08 12.917107 9857.454543
min 1952.00000 6.001100e+04 23.599000 241.165876
25% 1965.75000 2.793664e+06 48.198000 1202.060309
50% 1979.50000 7.023596e+06 60.712500 3531.846988
75% 1993.25000 1.958522e+07 70.845500 9325.462346
max 2007.00000 1.318683e+09 82.603000 113523.132900

.mean() and .std()#

# calculate mean and standard deviation of lifeExp
lifeExp_mean = gap["lifeExp"].mean()
lifeExp_sd = gap["lifeExp"].std()
print("Life expectancy mean:", lifeExp_mean)
print("Life expectancy sd:", lifeExp_sd)
Life expectancy mean: 59.474439366197174
Life expectancy sd: 12.917107415241192

.groupby() and .count()#

These two methods are useful for tabulating frequencies by a grouping variable!

# count the number of observations grouped by each continent
gap.groupby("continent").count()["country"]
continent
Africa      624
Americas    300
Asia        396
Europe      360
Oceania      24
Name: country, dtype: int64

Subset rows or columns#

Sampling data is necessary for many reasons, including quick sanity checks.

Slice a data frame by using bracket notation to specify start and end points [start : end]

The [start index is included and the end] index is excluded.

Remember that Python is a zero-indexed language, so starts counting from zero, not one.

Leave the start or end values blank to start from the beginning, or go to the end of a collection.

Row subset: slice just the first row#

gap[:1]
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314

Row subset: slice first three rows#

gap[:3]
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314
1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030
2 Afghanistan 1962 10267083.0 Asia 31.997 853.100710

Row subset: slice rows 10 thru 14#

subset1 = gap[10:15]
subset1
country year pop continent lifeExp gdpPercap
10 Afghanistan 2002 25268405.0 Asia 42.129 726.734055
11 Afghanistan 2007 31889923.0 Asia 43.828 974.580338
12 Albania 1952 1282697.0 Europe 55.230 1601.056136
13 Albania 1957 1476505.0 Europe 59.280 1942.284244
14 Albania 1962 1728137.0 Europe 64.820 2312.888958

Column subset: one column#

# type the column name as a string in square brackets
gap['lifeExp']
0       28.801
1       30.332
2       31.997
3       34.020
4       36.088
         ...  
1699    62.351
1700    60.377
1701    46.809
1702    39.989
1703    43.487
Name: lifeExp, Length: 1704, dtype: float64

Column subset: multiple columns#

# note the double sets of brackets
subset2 = gap[['continent', 'lifeExp', 'gdpPercap']]
subset2
continent lifeExp gdpPercap
0 Asia 28.801 779.445314
1 Asia 30.332 820.853030
2 Asia 31.997 853.100710
3 Asia 34.020 836.197138
4 Asia 36.088 739.981106
... ... ... ...
1699 Africa 62.351 706.157306
1700 Africa 60.377 693.420786
1701 Africa 46.809 792.449960
1702 Africa 39.989 672.038623
1703 Africa 43.487 469.709298

1704 rows × 3 columns

Row and column subset#

# subset more than one column and rows 855 thru 858
subset3 = gap[['continent', 'lifeExp', 'gdpPercap']][855:859]
subset3
continent lifeExp gdpPercap
855 Asia 64.624 80894.88326
856 Asia 67.712 109347.86700
857 Asia 69.343 59265.47714
858 Asia 71.309 31354.03573
# A column in a pandas data frame Pandas "Series" can be thought of like numpy arrays
# But, beware, they do not function exactly the same!
type(gap["lifeExp"])
pandas.core.series.Series

Subset by logical condition(s)#

# lifeExp is greater than 80
le2 = gap[gap['lifeExp'] > 80]
le2
country year pop continent lifeExp gdpPercap
70 Australia 2002 19546792.0 Oceania 80.370 30687.75473
71 Australia 2007 20434176.0 Oceania 81.235 34435.36744
251 Canada 2007 33390141.0 Americas 80.653 36319.23501
539 France 2007 61083916.0 Europe 80.657 30470.01670
670 Hong Kong China 2002 6762476.0 Asia 81.495 30209.01516
671 Hong Kong China 2007 6980412.0 Asia 82.208 39724.97867
694 Iceland 2002 288030.0 Europe 80.500 31163.20196
695 Iceland 2007 301931.0 Europe 81.757 36180.78919
767 Israel 2007 6426679.0 Asia 80.745 25523.27710
778 Italy 2002 57926999.0 Europe 80.240 27968.09817
779 Italy 2007 58147733.0 Europe 80.546 28569.71970
801 Japan 1997 125956499.0 Asia 80.690 28816.58499
802 Japan 2002 127065841.0 Asia 82.000 28604.59190
803 Japan 2007 127467972.0 Asia 82.603 31656.06806
1103 New Zealand 2007 4115771.0 Oceania 80.204 25185.00911
1151 Norway 2007 4627926.0 Europe 80.196 49357.19017
1427 Spain 2007 40448191.0 Europe 80.941 28821.06370
1474 Sweden 2002 8954175.0 Europe 80.040 29341.63093
1475 Sweden 2007 9031088.0 Europe 80.884 33859.74835
1486 Switzerland 2002 7361757.0 Europe 80.620 34480.95771
1487 Switzerland 2007 7554661.0 Europe 81.701 37506.41907

logical AND (&)#

All conditions must be satisfied to be included in the subset

# create subset that includes life expectancy greater than 81 AND pop < 500,000.
year2002 = gap[(gap["lifeExp"] > 81) & (gap["pop"] < 500000)]
year2002
country year pop continent lifeExp gdpPercap
695 Iceland 2007 301931.0 Europe 81.757 36180.78919

logical OR (|)#

Just one of multiple conditions must be satisfied to be included in the subset

# create a subset that includes country equals Ireland OR life expectancy greater than 82. 
ireland82 = gap[(gap["country"] == "Ireland") | (gap["lifeExp"] > 82)]
ireland82
country year pop continent lifeExp gdpPercap
671 Hong Kong China 2007 6980412.0 Asia 82.208 39724.978670
744 Ireland 1952 2952156.0 Europe 66.910 5210.280328
745 Ireland 1957 2878220.0 Europe 68.900 5599.077872
746 Ireland 1962 2830000.0 Europe 70.290 6631.597314
747 Ireland 1967 2900100.0 Europe 71.080 7655.568963
748 Ireland 1972 3024400.0 Europe 71.280 9530.772896
749 Ireland 1977 3271900.0 Europe 72.030 11150.981130
750 Ireland 1982 3480000.0 Europe 73.100 12618.321410
751 Ireland 1987 3539900.0 Europe 74.360 13872.866520
752 Ireland 1992 3557761.0 Europe 75.467 17558.815550
753 Ireland 1997 3667233.0 Europe 76.122 24521.947130
754 Ireland 2002 3879155.0 Europe 77.783 34077.049390
755 Ireland 2007 4109086.0 Europe 78.885 40675.996350
803 Japan 2007 127467972.0 Asia 82.603 31656.068060

Exercises#

  1. Load the file “gapminder-FiveYearData.csv” and save it in a variable named gap

  2. Print the column names

  3. Compute the mean for one numeric column

  4. Compute the mean for all numeric columns

  5. Tabulate frequencies for the “continent” column

  6. Compute mean lifeExp and dgpPercap by continent

  7. Create a subset of gap that contains only countries with lifeExp greater than 75 and gdpPercap less than 5000.

Data visualization#

After importing, exploring, and subsetting your data, visualization is a common technique to perform next. Read Chapter 5 “Data visualization essentials” to get started.