Chapter 4 - Numeric data wrangling
Contents
Chapter 4 - Numeric data wrangling#
2023 April 7
For text preprocessing, see Chapter 7 “English text preprocessing basics”
The pandas library#
Import the pandas library with the alias pd
. This can be a useful shortcut to call its methods!
Use “dot notation” to apply its methods to the gapminder dataset, which is stored in a tabular .csv file.
To import the .csv file, use the .read_csv()
pandas method. The only argument for now is the file path to a .csv file.
Learn more about the Gapminder data
import pandas as pd
# import the gapminder dataset
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/spring2023/data/gapminder-FiveYearData.csv
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
print(type(gap))
<class 'pandas.core.frame.DataFrame'>
Pandas methods#
Just a small handful of pandas methods will help you accomplish several key data wrangling tasks. How might the wrangling process look?
First, look at the data
Compute summary statistics
Subset rows
first row
first three rows
rows 10 thru 14
Subset columns:
one column
multiple columns
Row and column subset
Subset by logical condition(s)
First, look at the data#
print
or call the dataset to view its output. Use pandas methods to learn about the data!
print(gap)
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314
1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030
2 Afghanistan 1962 10267083.0 Asia 31.997 853.100710
3 Afghanistan 1967 11537966.0 Asia 34.020 836.197138
4 Afghanistan 1972 13079460.0 Asia 36.088 739.981106
... ... ... ... ... ... ...
1699 Zimbabwe 1987 9216418.0 Africa 62.351 706.157306
1700 Zimbabwe 1992 10704340.0 Africa 60.377 693.420786
1701 Zimbabwe 1997 11404948.0 Africa 46.809 792.449960
1702 Zimbabwe 2002 11926563.0 Africa 39.989 672.038623
1703 Zimbabwe 2007 12311143.0 Africa 43.487 469.709298
[1704 rows x 6 columns]
gap
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 |
2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.100710 |
3 | Afghanistan | 1967 | 11537966.0 | Asia | 34.020 | 836.197138 |
4 | Afghanistan | 1972 | 13079460.0 | Asia | 36.088 | 739.981106 |
... | ... | ... | ... | ... | ... | ... |
1699 | Zimbabwe | 1987 | 9216418.0 | Africa | 62.351 | 706.157306 |
1700 | Zimbabwe | 1992 | 10704340.0 | Africa | 60.377 | 693.420786 |
1701 | Zimbabwe | 1997 | 11404948.0 | Africa | 46.809 | 792.449960 |
1702 | Zimbabwe | 2002 | 11926563.0 | Africa | 39.989 | 672.038623 |
1703 | Zimbabwe | 2007 | 12311143.0 | Africa | 43.487 | 469.709298 |
1704 rows × 6 columns
.head()
#
# .head() shows the first five rows by default
gap.head()
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 |
2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.100710 |
3 | Afghanistan | 1967 | 11537966.0 | Asia | 34.020 | 836.197138 |
4 | Afghanistan | 1972 | 13079460.0 | Asia | 36.088 | 739.981106 |
.columns
#
# .columns shows the column names
# this is an attribute instead of a method - note the lack of parentheses ()
gap.columns
Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')
.shape
#
# .shape shows the number of rows by columns
gap.shape
(1704, 6)
.info()
#
# .info() shows the compact structure of the dataset, including:
# class, dimensions, column names, number of missing values, and types
gap.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 1704 non-null object
1 year 1704 non-null int64
2 pop 1704 non-null float64
3 continent 1704 non-null object
4 lifeExp 1704 non-null float64
5 gdpPercap 1704 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 80.0+ KB
Summary statistics#
Produce summary statistics, including:
Count, mean, sd, quartiles, min/max
Tabulate frequencies
.describe()
#
# produce summary statistics for numeric data
gap.describe()
year | pop | lifeExp | gdpPercap | |
---|---|---|---|---|
count | 1704.00000 | 1.704000e+03 | 1704.000000 | 1704.000000 |
mean | 1979.50000 | 2.960121e+07 | 59.474439 | 7215.327081 |
std | 17.26533 | 1.061579e+08 | 12.917107 | 9857.454543 |
min | 1952.00000 | 6.001100e+04 | 23.599000 | 241.165876 |
25% | 1965.75000 | 2.793664e+06 | 48.198000 | 1202.060309 |
50% | 1979.50000 | 7.023596e+06 | 60.712500 | 3531.846988 |
75% | 1993.25000 | 1.958522e+07 | 70.845500 | 9325.462346 |
max | 2007.00000 | 1.318683e+09 | 82.603000 | 113523.132900 |
.mean()
and .std()
#
# calculate mean and standard deviation of lifeExp
lifeExp_mean = gap["lifeExp"].mean()
lifeExp_sd = gap["lifeExp"].std()
print("Life expectancy mean:", lifeExp_mean)
print("Life expectancy sd:", lifeExp_sd)
Life expectancy mean: 59.47443936619714
Life expectancy sd: 12.917107415241187
.groupby()
and .count()
#
These two methods are useful for tabulating frequencies by a grouping variable!
import pandas as pd
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
pd.set_option('display.float_format', lambda x: '%.6f' % x)
# count the number of observations grouped by each continent
gap.groupby("continent").mean()["pop"]
continent
Africa 9916003.142628
Americas 24504794.996667
Asia 77038721.974747
Europe 17169764.733333
Oceania 8874672.333333
Name: pop, dtype: float64
Subset rows or columns#
Sampling data is necessary for many reasons, including quick sanity checks.
Slice a data frame by using bracket notation to specify start and end points [start : end]
The [start
index is included and the end]
index is excluded.
Remember that Python is a zero-indexed language, so starts counting from zero, not one.
Leave the start or end values blank to start from the beginning, or go to the end of a collection.
Row subset: slice just the first row#
gap[:1]
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.000000 | Asia | 28.801000 | 779.445314 |
Row subset: slice first three rows#
gap[:3]
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.000000 | Asia | 28.801000 | 779.445314 |
1 | Afghanistan | 1957 | 9240934.000000 | Asia | 30.332000 | 820.853030 |
2 | Afghanistan | 1962 | 10267083.000000 | Asia | 31.997000 | 853.100710 |
Row subset: slice rows 10 thru 14#
subset1 = gap[10:15]
subset1
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
10 | Afghanistan | 2002 | 25268405.000000 | Asia | 42.129000 | 726.734055 |
11 | Afghanistan | 2007 | 31889923.000000 | Asia | 43.828000 | 974.580338 |
12 | Albania | 1952 | 1282697.000000 | Europe | 55.230000 | 1601.056136 |
13 | Albania | 1957 | 1476505.000000 | Europe | 59.280000 | 1942.284244 |
14 | Albania | 1962 | 1728137.000000 | Europe | 64.820000 | 2312.888958 |
Column subset: one column#
# type the column name as a string in square brackets
gap['lifeExp']
0 28.801000
1 30.332000
2 31.997000
3 34.020000
4 36.088000
...
1699 62.351000
1700 60.377000
1701 46.809000
1702 39.989000
1703 43.487000
Name: lifeExp, Length: 1704, dtype: float64
Column subset: multiple columns#
# note the double sets of brackets
subset2 = gap[['continent', 'lifeExp', 'gdpPercap']]
subset2
continent | lifeExp | gdpPercap | |
---|---|---|---|
0 | Asia | 28.801000 | 779.445314 |
1 | Asia | 30.332000 | 820.853030 |
2 | Asia | 31.997000 | 853.100710 |
3 | Asia | 34.020000 | 836.197138 |
4 | Asia | 36.088000 | 739.981106 |
... | ... | ... | ... |
1699 | Africa | 62.351000 | 706.157306 |
1700 | Africa | 60.377000 | 693.420786 |
1701 | Africa | 46.809000 | 792.449960 |
1702 | Africa | 39.989000 | 672.038623 |
1703 | Africa | 43.487000 | 469.709298 |
1704 rows × 3 columns
Row and column subset#
# subset more than one column and rows 855 thru 858
subset3 = gap[['continent', 'lifeExp', 'gdpPercap']][855:859]
subset3
continent | lifeExp | gdpPercap | |
---|---|---|---|
855 | Asia | 64.624000 | 80894.883260 |
856 | Asia | 67.712000 | 109347.867000 |
857 | Asia | 69.343000 | 59265.477140 |
858 | Asia | 71.309000 | 31354.035730 |
# A column in a pandas data frame Pandas "Series" can be thought of like numpy arrays
# But, beware, they do not function exactly the same!
type(gap["lifeExp"])
pandas.core.series.Series
Subset by logical condition(s)#
# lifeExp is greater than 81
le2 = gap[gap['lifeExp'] > 81]
le2
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
71 | Australia | 2007 | 20434176.000000 | Oceania | 81.235000 | 34435.367440 |
670 | Hong Kong China | 2002 | 6762476.000000 | Asia | 81.495000 | 30209.015160 |
671 | Hong Kong China | 2007 | 6980412.000000 | Asia | 82.208000 | 39724.978670 |
695 | Iceland | 2007 | 301931.000000 | Europe | 81.757000 | 36180.789190 |
802 | Japan | 2002 | 127065841.000000 | Asia | 82.000000 | 28604.591900 |
803 | Japan | 2007 | 127467972.000000 | Asia | 82.603000 | 31656.068060 |
1487 | Switzerland | 2007 | 7554661.000000 | Europe | 81.701000 | 37506.419070 |
logical AND (&
)#
All conditions must be satisfied to be included in the subset
# create subset that includes life expectancy greater than 81 AND pop < 500,000.
year2002 = gap[(gap["lifeExp"] > 81) & (gap["pop"] < 500000)]
year2002
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
695 | Iceland | 2007 | 301931.000000 | Europe | 81.757000 | 36180.789190 |
logical OR (|
)#
Just one of multiple conditions must be satisfied to be included in the subset
# create a subset that includes country equals Ireland OR life expectancy greater than 82.
ireland82 = gap[(gap["country"] == "Ireland") | (gap["lifeExp"] > 82)]
ireland82
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
671 | Hong Kong China | 2007 | 6980412.000000 | Asia | 82.208000 | 39724.978670 |
744 | Ireland | 1952 | 2952156.000000 | Europe | 66.910000 | 5210.280328 |
745 | Ireland | 1957 | 2878220.000000 | Europe | 68.900000 | 5599.077872 |
746 | Ireland | 1962 | 2830000.000000 | Europe | 70.290000 | 6631.597314 |
747 | Ireland | 1967 | 2900100.000000 | Europe | 71.080000 | 7655.568963 |
748 | Ireland | 1972 | 3024400.000000 | Europe | 71.280000 | 9530.772896 |
749 | Ireland | 1977 | 3271900.000000 | Europe | 72.030000 | 11150.981130 |
750 | Ireland | 1982 | 3480000.000000 | Europe | 73.100000 | 12618.321410 |
751 | Ireland | 1987 | 3539900.000000 | Europe | 74.360000 | 13872.866520 |
752 | Ireland | 1992 | 3557761.000000 | Europe | 75.467000 | 17558.815550 |
753 | Ireland | 1997 | 3667233.000000 | Europe | 76.122000 | 24521.947130 |
754 | Ireland | 2002 | 3879155.000000 | Europe | 77.783000 | 34077.049390 |
755 | Ireland | 2007 | 4109086.000000 | Europe | 78.885000 | 40675.996350 |
803 | Japan | 2007 | 127467972.000000 | Asia | 82.603000 | 31656.068060 |
Exercises#
Load the file “gapminder-FiveYearData.csv” and save it in a variable named
gap
Print the column names
Compute the mean for one numeric column
Compute the mean for all numeric columns
Tabulate frequencies for the “continent” column
Compute mean lifeExp and dgpPercap by continent
Create a subset of
gap
that contains only countries with lifeExp greater than 75 and gdpPercap less than 5000.
# 1.
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
gap
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.000000 | Asia | 28.801000 | 779.445314 |
1 | Afghanistan | 1957 | 9240934.000000 | Asia | 30.332000 | 820.853030 |
2 | Afghanistan | 1962 | 10267083.000000 | Asia | 31.997000 | 853.100710 |
3 | Afghanistan | 1967 | 11537966.000000 | Asia | 34.020000 | 836.197138 |
4 | Afghanistan | 1972 | 13079460.000000 | Asia | 36.088000 | 739.981106 |
... | ... | ... | ... | ... | ... | ... |
1699 | Zimbabwe | 1987 | 9216418.000000 | Africa | 62.351000 | 706.157306 |
1700 | Zimbabwe | 1992 | 10704340.000000 | Africa | 60.377000 | 693.420786 |
1701 | Zimbabwe | 1997 | 11404948.000000 | Africa | 46.809000 | 792.449960 |
1702 | Zimbabwe | 2002 | 11926563.000000 | Africa | 39.989000 | 672.038623 |
1703 | Zimbabwe | 2007 | 12311143.000000 | Africa | 43.487000 | 469.709298 |
1704 rows × 6 columns
# 2.
gap.columnsns
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[26], line 2
1 # 2.
----> 2 gap.columnsns
File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py:5575, in NDFrame.__getattr__(self, name)
5568 if (
5569 name not in self._internal_names_set
5570 and name not in self._metadata
5571 and name not in self._accessors
5572 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5573 ):
5574 return self[name]
-> 5575 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'columnsns'
# 3. compute mean for one numeric column
gap["pop"].mean()
29601212.32511736
gap.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 1704 non-null object
1 year 1704 non-null int64
2 pop 1704 non-null float64
3 continent 1704 non-null object
4 lifeExp 1704 non-null float64
5 gdpPercap 1704 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 80.0+ KB
# 4. compute mean for ALL numeric columns
# gap.describe()
gap.mean()
/var/folders/9g/fhnd1v790cj5ccxlv4rcvsy40000gq/T/ipykernel_60936/4035141191.py:3: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
gap.mean()
year 1979.500000
pop 29601212.325117
lifeExp 59.474439
gdpPercap 7215.327081
dtype: float64
sum(gap.groupby("continent").count()["country"] / 12)
142.0
gap.shape
(1704, 6)
142 * 12
1704
# 6.
gap.groupby("continent")[["lifeExp", "gdpPercap"]].mean()
lifeExp | gdpPercap | |
---|---|---|
continent | ||
Africa | 48.865330 | 2193.754578 |
Americas | 64.658737 | 7136.110356 |
Asia | 60.064903 | 7902.150428 |
Europe | 71.903686 | 14469.475533 |
Oceania | 74.326208 | 18621.609223 |
# Create a subset of gap that contains only countries with lifeExp greater
# than 75 and gdpPercap less than 5000.
healthy_poor = gap[(gap["lifeExp"] > 75) & (gap["gdpPercap"] < 5000)]
healthy_poor
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
22 | Albania | 2002 | 3508512.000000 | Europe | 75.651000 | 4604.211737 |
gap.iloc[1501]
country Taiwan
year 1957
pop 10164215.000000
continent Asia
lifeExp 62.400000
gdpPercap 1507.861290
Name: 1501, dtype: object
gap.columns
Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')
gap[gap.columns[[2, 3]]]
pop | continent | |
---|---|---|
0 | 8425333.000000 | Asia |
1 | 9240934.000000 | Asia |
2 | 10267083.000000 | Asia |
3 | 11537966.000000 | Asia |
4 | 13079460.000000 | Asia |
... | ... | ... |
1699 | 9216418.000000 | Africa |
1700 | 10704340.000000 | Africa |
1701 | 11404948.000000 | Africa |
1702 | 11926563.000000 | Africa |
1703 | 12311143.000000 | Africa |
1704 rows × 2 columns
gap.ix[2]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[45], line 1
----> 1 gap.ix[2]
File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py:5575, in NDFrame.__getattr__(self, name)
5568 if (
5569 name not in self._internal_names_set
5570 and name not in self._metadata
5571 and name not in self._accessors
5572 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5573 ):
5574 return self[name]
-> 5575 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'ix'
Data visualization#
After importing, exploring, and subsetting your data, visualization is a common technique to perform next. Read Chapter 5 “Data visualization essentials” to get started.