iv. Numeric data wrangling
Contents
iv. Numeric data wrangling#
For text preprocessing, see the January 19 notebook.

The pandas library#
Import the pandas library with the alias pd.
Use “dot notation” to apply its methods to the dataset, which is stored in a tabular .csv file.
To import the .csv file, use the .read_csv() pandas method. The only argument is the file path to a .csv file.
Import the data below and investigate it by applying pandas methods to the data frame gap.
Learn more about the Gapminder data
import pandas as pd
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
print(type(gap))
<class 'pandas.core.frame.DataFrame'>
Pandas methods#
Just a small handful of pandas methods will help you accomplish several key data wrangling tasks. See the examplea code below.
First, look at the data
Summary statistics
Subset rows:
first row
first three rows
rows 10 thru 14
Subset columns:
one column
multiple columns
Row and column subset
Subset by logical condition(s)
First, look at the data#
Use pandas methods to learn about the data!
.head()#
# look at first five rows by default
gap.head()
| country | year | pop | continent | lifeExp | gdpPercap | |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
| 1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 |
| 2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.100710 |
| 3 | Afghanistan | 1967 | 11537966.0 | Asia | 34.020 | 836.197138 |
| 4 | Afghanistan | 1972 | 13079460.0 | Asia | 36.088 | 739.981106 |
.columns#
# View column names
# this is an attribute instead of a method - note the lack of parentheses ()
gap.columns
Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')
.shape#
# Show number of rows by columns
# this is an attribute instead of a method - note the lack of parentheses ()
gap.shape
(1704, 6)
Summary statistics#
Produce summary statistics, including:
Count, mean, sd, quartiles, min/max
Tabulate frequencies
.describe()#
# produce summary statistics for numeric data
gap.describe()
| year | pop | lifeExp | gdpPercap | |
|---|---|---|---|---|
| count | 1704.00000 | 1.704000e+03 | 1704.000000 | 1704.000000 |
| mean | 1979.50000 | 2.960121e+07 | 59.474439 | 7215.327081 |
| std | 17.26533 | 1.061579e+08 | 12.917107 | 9857.454543 |
| min | 1952.00000 | 6.001100e+04 | 23.599000 | 241.165876 |
| 25% | 1965.75000 | 2.793664e+06 | 48.198000 | 1202.060309 |
| 50% | 1979.50000 | 7.023596e+06 | 60.712500 | 3531.846988 |
| 75% | 1993.25000 | 1.958522e+07 | 70.845500 | 9325.462346 |
| max | 2007.00000 | 1.318683e+09 | 82.603000 | 113523.132900 |
.mean() and .std()#
# calculate mean and standard deviation of lifeExp
lifeExp_mean = gap["lifeExp"].mean()
lifeExp_sd = gap["lifeExp"].std()
print(lifeExp_mean)
print(lifeExp_sd)
59.474439366197174
12.917107415241192
.groupby() and .count()#
# count the number of countries grouped by each continent
gap.groupby("continent").count()["country"]
continent
Africa 624
Americas 300
Asia 396
Europe 360
Oceania 24
Name: country, dtype: int64
Subset rows or columns#
Sampling data is necessary for many reasons, including quick sanity checks.
Slice a data frame by using bracket notation to specify start and end points [start : end]
The [start index is included and the end] index is excluded.
Remember that Python is a zero-indexed language, so starts counting from zero, not one.
Leave the start or end values blank to start from the beginning, or go to the end of a collection.
Row subset: slice just the first row#
gap[:1]
| country | year | pop | continent | lifeExp | gdpPercap | |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
Row subset: slice first three rows#
gap[:3]
| country | year | pop | continent | lifeExp | gdpPercap | |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
| 1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 |
| 2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.100710 |
Row subset: slice rows 10 thru 14#
subset1 = gap[10:15]
subset1
| country | year | pop | continent | lifeExp | gdpPercap | |
|---|---|---|---|---|---|---|
| 10 | Afghanistan | 2002 | 25268405.0 | Asia | 42.129 | 726.734055 |
| 11 | Afghanistan | 2007 | 31889923.0 | Asia | 43.828 | 974.580338 |
| 12 | Albania | 1952 | 1282697.0 | Europe | 55.230 | 1601.056136 |
| 13 | Albania | 1957 | 1476505.0 | Europe | 59.280 | 1942.284244 |
| 14 | Albania | 1962 | 1728137.0 | Europe | 64.820 | 2312.888958 |
Column subset: one column#
# type the column name as a string in square brackets
gap['lifeExp']
0 28.801
1 30.332
2 31.997
3 34.020
4 36.088
...
1699 62.351
1700 60.377
1701 46.809
1702 39.989
1703 43.487
Name: lifeExp, Length: 1704, dtype: float64
Column subset: multiple columns#
# note the double sets of brackets
subset2 = gap[['continent', 'lifeExp', 'gdpPercap']]
subset2
| continent | lifeExp | gdpPercap | |
|---|---|---|---|
| 0 | Asia | 28.801 | 779.445314 |
| 1 | Asia | 30.332 | 820.853030 |
| 2 | Asia | 31.997 | 853.100710 |
| 3 | Asia | 34.020 | 836.197138 |
| 4 | Asia | 36.088 | 739.981106 |
| ... | ... | ... | ... |
| 1699 | Africa | 62.351 | 706.157306 |
| 1700 | Africa | 60.377 | 693.420786 |
| 1701 | Africa | 46.809 | 792.449960 |
| 1702 | Africa | 39.989 | 672.038623 |
| 1703 | Africa | 43.487 | 469.709298 |
1704 rows × 3 columns
Row and column subset#
# subset more than one column and rows 855 thru 858
subset3 = gap[['continent', 'lifeExp', 'gdpPercap']][855:859]
subset3
| continent | lifeExp | gdpPercap | |
|---|---|---|---|
| 855 | Asia | 64.624 | 80894.88326 |
| 856 | Asia | 67.712 | 109347.86700 |
| 857 | Asia | 69.343 | 59265.47714 |
| 858 | Asia | 71.309 | 31354.03573 |
type(gap["lifeExp"])
pandas.core.series.Series
Subset by logical condition(s)#
# lifeExp is greater than 80
le2 = gap[gap['lifeExp'] > 81]
le2
| country | year | pop | continent | lifeExp | gdpPercap | |
|---|---|---|---|---|---|---|
| 71 | Australia | 2007 | 20434176.0 | Oceania | 81.235 | 34435.36744 |
| 670 | Hong Kong China | 2002 | 6762476.0 | Asia | 81.495 | 30209.01516 |
| 671 | Hong Kong China | 2007 | 6980412.0 | Asia | 82.208 | 39724.97867 |
| 695 | Iceland | 2007 | 301931.0 | Europe | 81.757 | 36180.78919 |
| 802 | Japan | 2002 | 127065841.0 | Asia | 82.000 | 28604.59190 |
| 803 | Japan | 2007 | 127467972.0 | Asia | 82.603 | 31656.06806 |
| 1487 | Switzerland | 2007 | 7554661.0 | Europe | 81.701 | 37506.41907 |
# logical AND (all conditions must be satisfied to be included)
# create subset that includes life expectancy greater than 81 AND pop < 500,000.
year2002 = gap[(gap["lifeExp"] > 81) & (gap["pop"] < 500000)]
year2002
| country | year | pop | continent | lifeExp | gdpPercap | |
|---|---|---|---|---|---|---|
| 695 | Iceland | 2007 | 301931.0 | Europe | 81.757 | 36180.78919 |
# logical OR (one of multiple conditions must be satisfied to be included)
# create subset that includes country equals Ireland OR life expectancy greater than 82.
ireland82 = gap[(gap["country"] == "Ireland") | (gap["lifeExp"] > 82)]
ireland82
| country | year | pop | continent | lifeExp | gdpPercap | |
|---|---|---|---|---|---|---|
| 671 | Hong Kong China | 2007 | 6980412.0 | Asia | 82.208 | 39724.978670 |
| 744 | Ireland | 1952 | 2952156.0 | Europe | 66.910 | 5210.280328 |
| 745 | Ireland | 1957 | 2878220.0 | Europe | 68.900 | 5599.077872 |
| 746 | Ireland | 1962 | 2830000.0 | Europe | 70.290 | 6631.597314 |
| 747 | Ireland | 1967 | 2900100.0 | Europe | 71.080 | 7655.568963 |
| 748 | Ireland | 1972 | 3024400.0 | Europe | 71.280 | 9530.772896 |
| 749 | Ireland | 1977 | 3271900.0 | Europe | 72.030 | 11150.981130 |
| 750 | Ireland | 1982 | 3480000.0 | Europe | 73.100 | 12618.321410 |
| 751 | Ireland | 1987 | 3539900.0 | Europe | 74.360 | 13872.866520 |
| 752 | Ireland | 1992 | 3557761.0 | Europe | 75.467 | 17558.815550 |
| 753 | Ireland | 1997 | 3667233.0 | Europe | 76.122 | 24521.947130 |
| 754 | Ireland | 2002 | 3879155.0 | Europe | 77.783 | 34077.049390 |
| 755 | Ireland | 2007 | 4109086.0 | Europe | 78.885 | 40675.996350 |
| 803 | Japan | 2007 | 127467972.0 | Asia | 82.603 | 31656.068060 |