Chapter 4 - Numeric data wrangling
Contents
Chapter 4 - Numeric data wrangling#
2022 August 25
For text preprocessing, see Chapter 7 “English text preprocessing basics”
The pandas library#
Import the pandas library with the alias pd
. This is just a shortcut to call its methods!
Use “dot notation” to apply its methods to the gapminder dataset, which is stored in a tabular .csv file.
To import the .csv file, use the .read_csv()
pandas method. The only argument for now is the file path to a .csv file.
Learn more about the Gapminder data
import pandas as pd
# import the gapminder dataset
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/gapminder-FiveYearData.csv
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
print(type(gap))
<class 'pandas.core.frame.DataFrame'>
Pandas methods#
Just a small handful of pandas methods will help you accomplish several key data wrangling tasks. How might the wrangling process look?
First, look at the data
Compute summary statistics
Subset rows
first row
first three rows
rows 10 thru 14
Subset columns:
one column
multiple columns
Row and column subset
Subset by logical condition(s)
First, look at the data#
print
or call the dataset to view its output. Use pandas methods to learn about the data!
print(gap)
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314
1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030
2 Afghanistan 1962 10267083.0 Asia 31.997 853.100710
3 Afghanistan 1967 11537966.0 Asia 34.020 836.197138
4 Afghanistan 1972 13079460.0 Asia 36.088 739.981106
... ... ... ... ... ... ...
1699 Zimbabwe 1987 9216418.0 Africa 62.351 706.157306
1700 Zimbabwe 1992 10704340.0 Africa 60.377 693.420786
1701 Zimbabwe 1997 11404948.0 Africa 46.809 792.449960
1702 Zimbabwe 2002 11926563.0 Africa 39.989 672.038623
1703 Zimbabwe 2007 12311143.0 Africa 43.487 469.709298
[1704 rows x 6 columns]
gap
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 |
2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.100710 |
3 | Afghanistan | 1967 | 11537966.0 | Asia | 34.020 | 836.197138 |
4 | Afghanistan | 1972 | 13079460.0 | Asia | 36.088 | 739.981106 |
... | ... | ... | ... | ... | ... | ... |
1699 | Zimbabwe | 1987 | 9216418.0 | Africa | 62.351 | 706.157306 |
1700 | Zimbabwe | 1992 | 10704340.0 | Africa | 60.377 | 693.420786 |
1701 | Zimbabwe | 1997 | 11404948.0 | Africa | 46.809 | 792.449960 |
1702 | Zimbabwe | 2002 | 11926563.0 | Africa | 39.989 | 672.038623 |
1703 | Zimbabwe | 2007 | 12311143.0 | Africa | 43.487 | 469.709298 |
1704 rows × 6 columns
.head()
#
# .head() shows the first five rows by default
gap.head()
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 |
2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.100710 |
3 | Afghanistan | 1967 | 11537966.0 | Asia | 34.020 | 836.197138 |
4 | Afghanistan | 1972 | 13079460.0 | Asia | 36.088 | 739.981106 |
.columns
#
# .columns shows the column names
# this is an attribute instead of a method - note the lack of parentheses ()
gap.columns
Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')
.shape
#
# .shape shows the number of rows by columns
gap.shape
(1704, 6)
Summary statistics#
Produce summary statistics, including:
Count, mean, sd, quartiles, min/max
Tabulate frequencies
.describe()
#
# produce summary statistics for numeric data
gap.describe()
year | pop | lifeExp | gdpPercap | |
---|---|---|---|---|
count | 1704.00000 | 1.704000e+03 | 1704.000000 | 1704.000000 |
mean | 1979.50000 | 2.960121e+07 | 59.474439 | 7215.327081 |
std | 17.26533 | 1.061579e+08 | 12.917107 | 9857.454543 |
min | 1952.00000 | 6.001100e+04 | 23.599000 | 241.165876 |
25% | 1965.75000 | 2.793664e+06 | 48.198000 | 1202.060309 |
50% | 1979.50000 | 7.023596e+06 | 60.712500 | 3531.846988 |
75% | 1993.25000 | 1.958522e+07 | 70.845500 | 9325.462346 |
max | 2007.00000 | 1.318683e+09 | 82.603000 | 113523.132900 |
.mean()
and .std()
#
# calculate mean and standard deviation of lifeExp
lifeExp_mean = gap["lifeExp"].mean()
lifeExp_sd = gap["lifeExp"].std()
print("Life expectancy mean:", lifeExp_mean)
print("Life expectancy sd:", lifeExp_sd)
Life expectancy mean: 59.474439366197174
Life expectancy sd: 12.917107415241192
.groupby()
and .count()
#
These two methods are useful for tabulating frequencies by a grouping variable!
# count the number of observations grouped by each continent
gap.groupby("continent").count()["country"]
continent
Africa 624
Americas 300
Asia 396
Europe 360
Oceania 24
Name: country, dtype: int64
Subset rows or columns#
Sampling data is necessary for many reasons, including quick sanity checks.
Slice a data frame by using bracket notation to specify start and end points [start : end]
The [start
index is included and the end]
index is excluded.
Remember that Python is a zero-indexed language, so starts counting from zero, not one.
Leave the start or end values blank to start from the beginning, or go to the end of a collection.
Row subset: slice just the first row#
gap[:1]
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
Row subset: slice first three rows#
gap[:3]
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 |
2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.100710 |
Row subset: slice rows 10 thru 14#
subset1 = gap[10:15]
subset1
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
10 | Afghanistan | 2002 | 25268405.0 | Asia | 42.129 | 726.734055 |
11 | Afghanistan | 2007 | 31889923.0 | Asia | 43.828 | 974.580338 |
12 | Albania | 1952 | 1282697.0 | Europe | 55.230 | 1601.056136 |
13 | Albania | 1957 | 1476505.0 | Europe | 59.280 | 1942.284244 |
14 | Albania | 1962 | 1728137.0 | Europe | 64.820 | 2312.888958 |
Column subset: one column#
# type the column name as a string in square brackets
gap['lifeExp']
0 28.801
1 30.332
2 31.997
3 34.020
4 36.088
...
1699 62.351
1700 60.377
1701 46.809
1702 39.989
1703 43.487
Name: lifeExp, Length: 1704, dtype: float64
Column subset: multiple columns#
# note the double sets of brackets
subset2 = gap[['continent', 'lifeExp', 'gdpPercap']]
subset2
continent | lifeExp | gdpPercap | |
---|---|---|---|
0 | Asia | 28.801 | 779.445314 |
1 | Asia | 30.332 | 820.853030 |
2 | Asia | 31.997 | 853.100710 |
3 | Asia | 34.020 | 836.197138 |
4 | Asia | 36.088 | 739.981106 |
... | ... | ... | ... |
1699 | Africa | 62.351 | 706.157306 |
1700 | Africa | 60.377 | 693.420786 |
1701 | Africa | 46.809 | 792.449960 |
1702 | Africa | 39.989 | 672.038623 |
1703 | Africa | 43.487 | 469.709298 |
1704 rows × 3 columns
Row and column subset#
# subset more than one column and rows 855 thru 858
subset3 = gap[['continent', 'lifeExp', 'gdpPercap']][855:859]
subset3
continent | lifeExp | gdpPercap | |
---|---|---|---|
855 | Asia | 64.624 | 80894.88326 |
856 | Asia | 67.712 | 109347.86700 |
857 | Asia | 69.343 | 59265.47714 |
858 | Asia | 71.309 | 31354.03573 |
# A column in a pandas data frame Pandas "Series" can be thought of like numpy arrays
# But, beware, they do not function exactly the same!
type(gap["lifeExp"])
pandas.core.series.Series
Subset by logical condition(s)#
# lifeExp is greater than 80
le2 = gap[gap['lifeExp'] > 80]
le2
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
70 | Australia | 2002 | 19546792.0 | Oceania | 80.370 | 30687.75473 |
71 | Australia | 2007 | 20434176.0 | Oceania | 81.235 | 34435.36744 |
251 | Canada | 2007 | 33390141.0 | Americas | 80.653 | 36319.23501 |
539 | France | 2007 | 61083916.0 | Europe | 80.657 | 30470.01670 |
670 | Hong Kong China | 2002 | 6762476.0 | Asia | 81.495 | 30209.01516 |
671 | Hong Kong China | 2007 | 6980412.0 | Asia | 82.208 | 39724.97867 |
694 | Iceland | 2002 | 288030.0 | Europe | 80.500 | 31163.20196 |
695 | Iceland | 2007 | 301931.0 | Europe | 81.757 | 36180.78919 |
767 | Israel | 2007 | 6426679.0 | Asia | 80.745 | 25523.27710 |
778 | Italy | 2002 | 57926999.0 | Europe | 80.240 | 27968.09817 |
779 | Italy | 2007 | 58147733.0 | Europe | 80.546 | 28569.71970 |
801 | Japan | 1997 | 125956499.0 | Asia | 80.690 | 28816.58499 |
802 | Japan | 2002 | 127065841.0 | Asia | 82.000 | 28604.59190 |
803 | Japan | 2007 | 127467972.0 | Asia | 82.603 | 31656.06806 |
1103 | New Zealand | 2007 | 4115771.0 | Oceania | 80.204 | 25185.00911 |
1151 | Norway | 2007 | 4627926.0 | Europe | 80.196 | 49357.19017 |
1427 | Spain | 2007 | 40448191.0 | Europe | 80.941 | 28821.06370 |
1474 | Sweden | 2002 | 8954175.0 | Europe | 80.040 | 29341.63093 |
1475 | Sweden | 2007 | 9031088.0 | Europe | 80.884 | 33859.74835 |
1486 | Switzerland | 2002 | 7361757.0 | Europe | 80.620 | 34480.95771 |
1487 | Switzerland | 2007 | 7554661.0 | Europe | 81.701 | 37506.41907 |
logical AND (&
)#
All conditions must be satisfied to be included in the subset
# create subset that includes life expectancy greater than 81 AND pop < 500,000.
year2002 = gap[(gap["lifeExp"] > 81) & (gap["pop"] < 500000)]
year2002
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
695 | Iceland | 2007 | 301931.0 | Europe | 81.757 | 36180.78919 |
logical OR (|
)#
Just one of multiple conditions must be satisfied to be included in the subset
# create a subset that includes country equals Ireland OR life expectancy greater than 82.
ireland82 = gap[(gap["country"] == "Ireland") | (gap["lifeExp"] > 82)]
ireland82
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
671 | Hong Kong China | 2007 | 6980412.0 | Asia | 82.208 | 39724.978670 |
744 | Ireland | 1952 | 2952156.0 | Europe | 66.910 | 5210.280328 |
745 | Ireland | 1957 | 2878220.0 | Europe | 68.900 | 5599.077872 |
746 | Ireland | 1962 | 2830000.0 | Europe | 70.290 | 6631.597314 |
747 | Ireland | 1967 | 2900100.0 | Europe | 71.080 | 7655.568963 |
748 | Ireland | 1972 | 3024400.0 | Europe | 71.280 | 9530.772896 |
749 | Ireland | 1977 | 3271900.0 | Europe | 72.030 | 11150.981130 |
750 | Ireland | 1982 | 3480000.0 | Europe | 73.100 | 12618.321410 |
751 | Ireland | 1987 | 3539900.0 | Europe | 74.360 | 13872.866520 |
752 | Ireland | 1992 | 3557761.0 | Europe | 75.467 | 17558.815550 |
753 | Ireland | 1997 | 3667233.0 | Europe | 76.122 | 24521.947130 |
754 | Ireland | 2002 | 3879155.0 | Europe | 77.783 | 34077.049390 |
755 | Ireland | 2007 | 4109086.0 | Europe | 78.885 | 40675.996350 |
803 | Japan | 2007 | 127467972.0 | Asia | 82.603 | 31656.068060 |
Exercises#
Load the file “gapminder-FiveYearData.csv” and save it in a variable named
gap
Print the column names
Compute the mean for one numeric column
Compute the mean for all numeric columns
Tabulate frequencies for the “continent” column
Compute mean lifeExp and dgpPercap by continent
Create a subset of
gap
that contains only countries with lifeExp greater than 75 and gdpPercap less than 5000.
Data visualization#
After importing, exploring, and subsetting your data, visualization is a common technique to perform next. Read Chapter 5 “Data visualization essentials” to get started.