iv. Numeric data wrangling

For text preprocessing, see the January 19 notebook.

wrangle

The pandas library

Import the pandas library with the alias pd.

Use “dot notation” to apply its methods to the dataset, which is stored in a tabular .csv file.

To import the .csv file, use the .read_csv() pandas method. The only argument is the file path to a .csv file.

Import the data below and investigate it by applying pandas methods to the data frame gap.

Learn more about the Gapminder data

import pandas as pd
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
print(type(gap))
<class 'pandas.core.frame.DataFrame'>

Pandas methods

Just a small handful of pandas methods will help you accomplish several key data wrangling tasks. See the examplea code below.

  • First, look at the data

  • Summary statistics

  • Subset rows:

    • first row

    • first three rows

    • rows 10 thru 14

  • Subset columns:

    • one column

    • multiple columns

  • Row and column subset

  • Subset by logical condition(s)

First, look at the data

Use pandas methods to learn about the data!

.columns

# View column names
# this is an attribute - note the lack of parentheses ()
gap.columns
Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')

.shape

# Show  number of rows by columns
# this is an attribute - note the lack of parentheses ()
gap.shape
(1704, 6)

Summary statistics

Produce summary statistics, including:

  • Count, mean, sd, quartiles, min/max

  • Tabulate frequencies

.describe()

# produce summary statistics for numeric data
gap.describe()
year pop lifeExp gdpPercap
count 1704.00000 1.704000e+03 1704.000000 1704.000000
mean 1979.50000 2.960121e+07 59.474439 7215.327081
std 17.26533 1.061579e+08 12.917107 9857.454543
min 1952.00000 6.001100e+04 23.599000 241.165876
25% 1965.75000 2.793664e+06 48.198000 1202.060309
50% 1979.50000 7.023596e+06 60.712500 3531.846988
75% 1993.25000 1.958522e+07 70.845500 9325.462346
max 2007.00000 1.318683e+09 82.603000 113523.132900

.mean() and .std()

# calculate mean and standard deviation of lifeExp
lifeExp_mean = gap["lifeExp"].mean()
lifeExp_sd = gap["lifeExp"].std()
print(lifeExp_mean)
print(lifeExp_sd)
59.47443936619714
12.917107415241187

.groupby() and .count()

# count the number of countries grouped by each continent
gap.groupby("continent").count()["country"]
continent
Africa      624
Americas    300
Asia        396
Europe      360
Oceania      24
Name: country, dtype: int64

Subset rows or columns

Sampling data is necessary for many reasons, including quick sanity checks.

Slice a data frame by using bracket notation to specify start and end points [start : end]

The [start index is included and the end] index is excluded.

Remember that Python is a zero-indexed language, or starts counting from zero, not one.

Leave the start or end values blank to start from the beginning, or go to the end of a collection.

Row subset: slice just the first row

gap[:1]
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314

Row subset: slice first three rows

gap[:3]
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314
1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030
2 Afghanistan 1962 10267083.0 Asia 31.997 853.100710

Row subset: slice rows 10 thru 14

subset1 = gap[10:15]
subset1
country year pop continent lifeExp gdpPercap
10 Afghanistan 2002 25268405.0 Asia 42.129 726.734055
11 Afghanistan 2007 31889923.0 Asia 43.828 974.580338
12 Albania 1952 1282697.0 Europe 55.230 1601.056136
13 Albania 1957 1476505.0 Europe 59.280 1942.284244
14 Albania 1962 1728137.0 Europe 64.820 2312.888958

Column subset: one column

# type the column name as a string in square brackets
gap['lifeExp']
0       28.801
1       30.332
2       31.997
3       34.020
4       36.088
         ...  
1699    62.351
1700    60.377
1701    46.809
1702    39.989
1703    43.487
Name: lifeExp, Length: 1704, dtype: float64

Column subset: multiple columns

# note the double sets of brackets
subset2 = gap[['continent', 'lifeExp', 'gdpPercap']]
subset2
continent lifeExp gdpPercap
0 Asia 28.801 779.445314
1 Asia 30.332 820.853030
2 Asia 31.997 853.100710
3 Asia 34.020 836.197138
4 Asia 36.088 739.981106
... ... ... ...
1699 Africa 62.351 706.157306
1700 Africa 60.377 693.420786
1701 Africa 46.809 792.449960
1702 Africa 39.989 672.038623
1703 Africa 43.487 469.709298

1704 rows × 3 columns

Row and column subset

# subset more than one column and rows 855 thru 858
subset3 = gap[['continent', 'lifeExp', 'gdpPercap']][855:859]
subset3
continent lifeExp gdpPercap
855 Asia 64.624 80894.88326
856 Asia 67.712 109347.86700
857 Asia 69.343 59265.47714
858 Asia 71.309 31354.03573
type(gap["lifeExp"])
pandas.core.series.Series

Subset by logical condition(s)

# lifeExp is greater than 80
le2 = gap[gap['lifeExp'] > 81]
le2
country year pop continent lifeExp gdpPercap
71 Australia 2007 20434176.0 Oceania 81.235 34435.36744
670 Hong Kong China 2002 6762476.0 Asia 81.495 30209.01516
671 Hong Kong China 2007 6980412.0 Asia 82.208 39724.97867
695 Iceland 2007 301931.0 Europe 81.757 36180.78919
802 Japan 2002 127065841.0 Asia 82.000 28604.59190
803 Japan 2007 127467972.0 Asia 82.603 31656.06806
1487 Switzerland 2007 7554661.0 Europe 81.701 37506.41907
# logical AND (all conditions must be satisfied to be included)

# create subset that includes life expectancy greater than 81 AND pop < 500,000.

year2002 = gap[(gap["lifeExp"] > 81) & (gap["pop"] < 500000)]
year2002
country year pop continent lifeExp gdpPercap
695 Iceland 2007 301931.0 Europe 81.757 36180.78919
# logical OR (one of multiple conditions must be satisfied to be included)

# create subset that includes country equals Ireland OR life expectancy greater than 82. 

ireland82 = gap[(gap["country"] == "Ireland") | (gap["lifeExp"] > 82)]
ireland82
country year pop continent lifeExp gdpPercap
671 Hong Kong China 2007 6980412.0 Asia 82.208 39724.978670
744 Ireland 1952 2952156.0 Europe 66.910 5210.280328
745 Ireland 1957 2878220.0 Europe 68.900 5599.077872
746 Ireland 1962 2830000.0 Europe 70.290 6631.597314
747 Ireland 1967 2900100.0 Europe 71.080 7655.568963
748 Ireland 1972 3024400.0 Europe 71.280 9530.772896
749 Ireland 1977 3271900.0 Europe 72.030 11150.981130
750 Ireland 1982 3480000.0 Europe 73.100 12618.321410
751 Ireland 1987 3539900.0 Europe 74.360 13872.866520
752 Ireland 1992 3557761.0 Europe 75.467 17558.815550
753 Ireland 1997 3667233.0 Europe 76.122 24521.947130
754 Ireland 2002 3879155.0 Europe 77.783 34077.049390
755 Ireland 2007 4109086.0 Europe 78.885 40675.996350
803 Japan 2007 127467972.0 Asia 82.603 31656.068060